Simple Linear Regression

    About the Simple Linear Regression application

    Simple Linear Regression is an interactive application that lets you conduct regression analysis.

    The application makes it easy to:

    1. Upload your dataset and download example datasets
    2. Visualise the distributions of, and relationships among, numeric variables
    3. Display the dataset and results in a tabular format
    4. Save the results in a report format
    5. Learn the R code used to generate the results.

    Start the analysis by selecting an example dataset or uploading your own dataset in the left sidebar. This displays the selected dataset, a list of its numeric variables and the Regression tab. The Example data tab at the top of the screen contains detailed information about each example dataset in this application.

    Click the plus sign (+) to open the box with detailed information about the Simple Regression application features.

    Upload and download data

    Uploading your data and downloading example dataset

    • The Simple Linear Regression application lets you upload a .csv file containing the dataset to be displayed. Instructions are given on the Uploading datasets panel.
    • There are a few built-in example datasets which can be used to conduct analysis with numeric variables.

    Regression tab

    Content of the Regression tab

    Click on the Regression tab on the left sidebar to reveal the following options at the top of the main panel:

    Dataset

    • This tab displays all the variables in the selected dataset.
    • Download a selected example dataset as a csv file.

    Descriptive statistics

    • Three sets of descriptive statistics are provided: the first from base R, the second from the psych package and the third from the pastecs package.

    Histograms

    • This tab displays histograms for the selected explanatory and response variables.
    • The number of bins/classes can be selected, and a density curve is drawn over the histogram to describe the distribution more accurately.

    Scatterplot matrix

    • Displays a scatterplot for every pair of numeric variables in the dataset using the car package in R.
    • Displays an enhanced scatterplot with boxplots for both the response and explanatory variables.

    Correlation matrix

    • The correlation for each pair of numeric variables in the dataset is calculated. A second table displays the p-values.
    • Displays a correlogram, a graphical presentation of the correlation matrix, using the corrgram package in R.

    Regression model

    • This tab estimates a simple linear regression model.

    Assessing the regression model

    • This tab performs the validation of a regression model using graphical methods.

    Save results

    To save results in a report:

    1. Enter your name in the textbox (at the top of the sidebar).
    2. Select a document format (PDF, HTML or Word).
    3. Type any comments you may have about the results in the textbox labelled Interpretation.
    4. Press the Download Report button.
    5. Save the report to your disk.

    R codes

    R codes used to generate results:

    1. A list of the basic R functions used in this application is given here.
    2. To learn more about a specific R function and its arguments, type a question mark followed by the function name (e.g. ?lm) in the RStudio console.

    Instructions on how to prepare and upload your datasets

    This application lets you upload a .csv file containing the dataset to be displayed. It is very important to follow the instructions here.

    STEP 1: Check 'Upload your dataset' radio button

    STEP 2: Click 'Browse ...' button

    The application will only accept a 'comma delimited' text file (.csv). The first row in the .csv file should contain the variable names (a 'header'). If a .csv file is uploaded without a header (and this is indicated by unchecking the entry box on the sidebar), the variables to choose from will be listed as V1, V2, V3, etc., depending on their position in the .csv file. It is best to use .csv files that include variable names as a header row.


    If a spreadsheet was used to enter the data as illustrated below, then leave the default selections of radio buttons in the Separator and Quote sections.
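
    The effect of the header setting described above can be reproduced in plain R. This is a minimal sketch, not the application's own upload code: the file contents (columns x and y with made-up values) and the temporary file are hypothetical and exist only for illustration.

```r
# Write a tiny, made-up comma-delimited file to a temporary location
# (hypothetical data, used only to illustrate the header behaviour).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("x,y", "1,2", "3,4"), csv_path)

# With a header row, the first line supplies the variable names.
with_header <- read.csv(csv_path, header = TRUE)
names(with_header)

# Without a header, R names the columns V1, V2, ... by position
# and treats the first line as an ordinary data row.
no_header <- read.csv(csv_path, header = FALSE)
names(no_header)
nrow(no_header)
```

    Reading a file that does have a header row with header = FALSE turns the variable names into a data row, which is one reason a header row is recommended.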

    STEP 3: Enter your data in a spreadsheet

    The best approach would begin by creating a file in a spreadsheet such as this:

    STEP 4: Save it as a .csv file

    STEP 5: Open the .csv file in a text editor

    Open the .csv file in a text editor (e.g. Notepad) and it should look like this:

    Description of example datasets

    This application contains 8 datasets:

    1. Edgar Anderson's Iris Data
    2. 3 Measures of ability: SATV, SATQ, ACT
    3. Survey results of gym visitors
    4. Examination scores of statistics students
    5. Crop and area
    6. Life expectancy
    7. Library dataset
    8. FTP dataset

    Click the plus sign (+) to open the box with detailed information about the dataset.

    Iris dataset

    Edgar Anderson's Iris Data

    This famous (Fisher's or Anderson's) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.

    There are 150 cases and 5 variables. The variables and their domains (for numeric variables) and levels (for categorical variable) are as follows:


    No Name Domain/Levels
    1 Sepal Length Numeric variable: from 4.3cm to 7.9cm.
    2 Sepal Width Numeric variable: from 2cm to 4.4cm.
    3 Petal Length Numeric variable: from 1cm to 6.9cm.
    4 Petal Width Numeric variable: from 0.1cm to 2.5cm.
    5 Species Categorical variable with three categories: Setosa, Versicolor, and Virginica.

    Source

    Fisher, R. A. (1936) The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, Part II, 179–188.

    The data were collected by Anderson, Edgar (1935). The irises of the Gaspe Peninsula, Bulletin of the American Iris Society, 59, 2–5.

    SAPA dataset

    3 Measures of ability: SATV, SATQ, ACT

    Self-reported scores on the SAT Verbal, SAT Quantitative and ACT were collected as part of the Synthetic Aperture Personality Assessment (SAPA) web-based personality assessment project.

    There are 700 cases and 6 variables. The variables and their domains (for numeric variables) and levels (for categorical variables) are as follows:


    No Name Domain/Levels
    1 Gender Categorical variable with two categories: Female / Male
    2 Education Categorical variable: Self-reported education (less than 12 years, high school, some college, at college, college graduate, grad/prof).
    3 Age Numeric variable: Full years at the moment of taking test (from 13 years to 65 years).
    4 ACT Numeric variable: ACT composite scores may range from 1 - 36.
    5 SATV Numeric variable: SAT Verbal scores may range from 200 - 800.
    6 SATQ Numeric variable: SAT Quantitative scores may range from 200 - 800.

    Source

    Revelle, W., Wilt, J., & Rosenthal, A. (2009) Personality and cognition: The personality-cognition link. In Gruszka, A., Matthews, G., & Szymura, B. (Eds.) Handbook of individual differences in cognition: Attention, memory and executive control, Springer.

    Gym survey dataset

    Survey results of the gym visitors

    This dataset provides information on survey results of gym visitors.

    There are 24 observations on 5 variables. The variables and their domains (for numeric variables) and levels (for categorical variable) are as follows:


    No Name Description & Domain/Level
    1 Gender Categorical variable with two categories: Female, Male.
    2 Age Numeric variable: Full years at the moment of survey (from 31 years to 43 years).
    3 Exercise Numeric variable: Exercise hours - average number of sport activities per week (from 1 hour to 8 hours).
    4 Diet Categorical variable: Diet type (Low calorie, Normal, High calorie).
    5 BMI Numeric variable: Body mass index measured at the time of survey (from 16.1kg/m2 to 34.8 kg/m2).

    Source

    Course data.

    Examination scores dataset

    Examination scores results of statistics students

    This dataset provides information on assignments and examination scores of 45 statistics students.

    There are 45 observations on 4 variables. The variables and their domains (for numeric variables) and levels (for categorical variable) are as follows:


    No Name Description & Domain/Level
    1 Gender Categorical variable with two categories: Female, Male.
    2 Overall Numeric variable: Overall course mark (from 42% to 74%).
    3 Assignment1 Numeric variable: Assignment 1 mark (from 49% to 93%).
    4 Assignment2 Numeric variable: Assignment 2 mark (from 42% to 88%).

    Source

    Course data.

    Crop and area dataset

    Survey results of 26 agricultural sections

    This dataset provides information on survey results of agricultural sections.

    There are 26 observations on 2 variables. The variables and their domains are as follows:


    No Name Description & Domain
    1 Area Numeric variable: Area of the section where the crop is cultivated, measured in square meters (from 380 m2 to 1935 m2).
    2 Crop Numeric variable: Crop measured in kilograms (from 64kg to 288kg).

    Source

    Course data.

    Life expectancy dataset

    Life expectancy and alcohol consumption

    This dataset provides information on life expectancy and alcohol consumption per person in 44 countries.

    There are 44 observations on 2 variables. The variables and their domains are as follows:


    No Name Description & Domain/Level
    1 Life Numeric variable: Life expectancy of citizens (from 54 years to 93 years).
    2 Alcohol Numeric variable: Average alcohol consumption per adult citizen per year measured in litres (from 3 litres to 35 litres).

    Source

    Course data.

    Library dataset

    Observations of library books

    This dataset provides information on observations of library books.

    There are 12 observations on 2 variables. The variables and their domains are as follows:


    No Name Description & Domain
    1 Years Numeric variable: Years (from 8 years to 18 years).
    2 Books Numeric variable: Number of books (from 2 to 45).

    Source

    Course data.

    FTP dataset

    Observations of FTP

    This dataset provides information on observations of FTP.

    There are 15 observations on 2 variables. The variables and their domains are as follows:


    No Name Description & Domain
    1 Hours Numeric variable: Number of hours (from 0 hours to 40 hours).
    2 FTP Numeric variable: FTP (from 12 to 98).

    Source

    Course data.

    Display and download dataset

    Display the dataset selected and download the file with the example dataset used.

    Dataset

    Descriptive statistics

    Descriptive (or summary) statistics are used to represent and describe nearly every dataset. They also form the building blocks for much more complicated statistical methods and models.

    Three R packages (base, psych and pastecs) cover almost all descriptive statistics.

    Descriptive statistics (base package)

    
                              

    Descriptive statistics (psych package)

    
                              

    Descriptive statistics (pastecs package)

    
                              

    R codes

    R codes used to generate results

    Descriptive statistics (base package):

    summary(x)
    

    x is a data frame.

    Descriptive statistics (psych package):

    library(psych)
    describe(x, skew=FALSE, ranges=FALSE)
    

    Descriptive statistics grouped by one of the categorical variables (psych package):

    library(psych)
    describeBy(x, group=y, skew=FALSE, ranges=FALSE)
    

    y is a categorical variable.

    Descriptive statistics (pastecs package):

    library(pastecs)
    stat.desc(x)
    

    Histograms

    A histogram is a graphical representation of the distribution of numeric data (i.e. continuous variable).

    Histograms are created by dividing up the range of the data into non-overlapping bins/classes of equal width, and counting the number of observations that fall into them. Each bin/class is then represented by a rectangle with the bin/class as its base, where the height of the rectangle is equal to the number of observations in that bin/class.

    Closely related to the histogram is the kernel density plot, or density plot. It is often a more effective way to view the distribution of a variable than the histogram: histograms are sensitive to the choice of bin/class width, whereas a density plot depends more on the data and less on this arbitrary choice.
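
    The binning and density ideas described above can be sketched in base R. This is an illustrative example only: it uses the built-in iris data (one of the example datasets), and the bin width of 0.5 cm is an arbitrary choice made here, not a setting taken from the application.

```r
# Sepal length from R's built-in iris data.
x <- iris$Sepal.Length

# Divide the range into non-overlapping bins of equal width (0.5 cm,
# an arbitrary choice) and count the observations falling into each bin.
h <- hist(x, breaks = seq(4, 8, by = 0.5), plot = FALSE)
h$counts

# Every observation lands in exactly one bin.
sum(h$counts) == length(x)

# On the density scale the bar areas sum to 1, which is what makes a
# histogram comparable with a kernel density estimate.
sum(h$density * diff(h$breaks))

# Kernel density estimate of the same variable (default bandwidth).
d <- density(x)
```

    Changing the breaks argument changes the counts, which illustrates how sensitive a histogram is to the choice of bins.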

    About a histogram

    1. There is no gap between bars.
    2. The y-axis label is Frequency or Count. When the density plot is shown, then the y-axis label is Density.
    3. Data is grouped into classes with the end points labelled on the x-axis.
    4. There is an explanatory title or caption underneath the graph.

    Things to think about

    1. What is the data range?
    2. How is the data distributed – skewed or symmetric?
    3. Which is the modal (most frequent) class?
    4. Are there any outliers?

    Response variable

    Explanatory variable

    R codes

    R codes used to generate results

    Histogram:

    # freq=FALSE plots densities, so the density curve is on the same scale as the bars
    hist(x, freq=FALSE, breaks = bins, col = 'darkgray', border = 'white', main = "Main title", xlab = "Horizontal axis title, i.e. variable name", ylab="Density")
    lines(density(x), col="blue", lwd = 2)
    

    x is a vector of values for which the histogram is desired. bins is the number of bins/classes.

    Scatterplot matrix

    A scatterplot provides a graphical view of the relationship between two numeric variables.

    About a scatterplot matrix

    1. The scatterplot matrix shows all the pairwise scatterplots of the variables on a single view with multiple scatterplots in a matrix format.
    2. A plot located on the intersection of i-th row and j-th column is a plot of i-th and j-th variables. This means that each row and column is one dimension, and each cell plots a scatterplot of two dimensions.
    3. Optionally, the scatterplot matrix includes lowess and linear best-fit lines; boxplots, densities or histograms on the principal diagonal; and rug plots in the margins of the cells.

    Things to think about

    1. The purpose of the scatterplot is to look at the relationship between the variables and determine if there are any problems/issues with the data or if the scatterplot indicates anything unique or interesting about the data (How is the data dispersed? Are there outliers?).
    2. The scatterplot could show nonlinear relationships between variables. The ability to do this can be enhanced by adding a smooth line such as loess.

    Inputs

    Scatterplot Matrix

    Enhanced scatterplot

    R codes

    R codes used to generate results

    Scatterplot (car package):

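    library(car)  # see ?scatterplotMatrix for the diagonal options accepted by your installed version of car
    scatterplotMatrix(x, var.labels=colnames(x), diagonal=c("density", "boxplot", "histogram", "oned", "qqplot", "none"), main="Scatterplot Matrix")
    

    x is a numeric matrix or data frame.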

    Correlation matrix

    A correlation matrix is used to investigate the association between multiple variables at the same time.

    About a correlation

    1. A correlation coefficient (Pearson's r) measures the strength of linear relationship between two numeric variables.
    2. It takes values between -1 (perfect negative association) and +1 (perfect positive association).
    3. The correlation matrix is symmetric because the correlation between i-th and j-th variables is the same as the correlation between j-th and i-th variables.
    4. The p-value is used to assess the significance of Pearson's correlation; the null hypothesis is that the correlation is zero, against a two-sided alternative.
    5. In the correlogram, correlation coefficients are coloured according to the value.
    6. Positive correlations are displayed in blue and negative correlations in red.
    7. Colour intensity and the size of the circle are proportional to the correlation coefficients.
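
    The properties listed above can be checked with base R alone. The sketch below is an illustration rather than the application's code: it uses the numeric columns of the built-in iris data and base R's cor.test() for a single pair of variables.

```r
# Numeric columns of the built-in iris data.
num_vars <- iris[, 1:4]

# Pearson correlation matrix for every pair of numeric variables.
r <- cor(num_vars, use = "pairwise.complete.obs")
round(r, 2)

# The matrix is symmetric and every coefficient lies between -1 and +1.
isSymmetric(r)
range(r)

# Two-sided test of the null hypothesis that the correlation is zero,
# for one pair of variables.
test <- cor.test(iris$Petal.Length, iris$Petal.Width)
test$estimate
test$p.value
```

    The estimate returned by cor.test() for a pair of variables matches the corresponding entry of the correlation matrix.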

    Things to think about

    1. Pearson's r is a valid measure of correlation if there are no outliers.
    2. Pearson's r is a valid measure of correlation if the relationship between the variables is linear.
    3. The variables must be continuous or discrete and not ordinal.
    4. For the correlogram, the correlation matrix can be reordered according to the values of the correlation coefficient. This is important to identify the hidden structure and pattern in the matrix.

    Correlation matrix

    Correlation matrix - P-values

    Correlation matrix with mean values and standard deviations

    
                              

    Correlogram

    R codes

    R codes used to generate results

    Correlation matrix:

    cor(x, use="pairwise.complete.obs")
    

    x is a matrix or data frame.

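    Calculating p-values (Hmisc package):

    res <- Hmisc::rcorr(as.matrix(x))
    signif(res$P, 2)
    

    x is a numeric matrix or data frame. res is the object returned by rcorr(); its P element holds the p-values.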

    Correlogram (corrgram package):

    library(corrgram)
    corrgram(x, order=TRUE, lower.panel=panel.shade, upper.panel=panel.pie, text.panel=panel.txt, main = "Correlogram")
    

    x is a data frame or correlation matrix.

    Simple linear regression model

    Regression analysis is a way of predicting a response variable from one explanatory variable (simple linear regression model) or several explanatory variables (multiple regression model).

    About a regression

    1. y represents the response variable. x represents the explanatory variable.
    2. The intercept is the value of y at which the regression line crosses the vertical axis of a scatterplot, i.e. the predicted value of y when x is zero.
    3. The slope of the regression line shows the amount of change in y per unit change in x.
    4. R-squared tells how much variation in the response variable is explained by variation in the explanatory variable. In other words, it tells the percentage of variation in the response variable explained by the regression model.

    Things to think about

    1. The slope regression coefficient should have the same sign as the correlation.
    2. R-squared takes values between 0% and 100%.
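
    For a simple linear regression, the quantities above are tied directly to the correlation coefficient. The following sketch checks those relationships numerically; the choice of the built-in iris data and of petal length as the explanatory variable is an assumption made purely for illustration.

```r
# Fit petal width (response) on petal length (explanatory variable).
fit <- lm(Petal.Width ~ Petal.Length, data = iris)
b <- coef(fit)   # b[1] is the intercept, b[2] is the slope

x <- iris$Petal.Length
y <- iris$Petal.Width
r <- cor(x, y)

# The slope equals r multiplied by sd(y) / sd(x), so it has the sign of r.
slope_from_r <- r * sd(y) / sd(x)

# The fitted line passes through the point of means (mean(x), mean(y)).
intercept_from_means <- mean(y) - slope_from_r * mean(x)

# With one explanatory variable, R-squared is the squared correlation.
r_squared <- summary(fit)$r.squared

c(slope = unname(b[2]), slope_from_r = slope_from_r)
c(r_squared = r_squared, r_squared_from_r = r^2)
```

    Because a standard deviation is never negative, the slope can only be negative when the correlation is negative, which is why the two must share the same sign.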

    Regression model

    
                                
    
                              

    R codes

    R codes used to generate results

    Regression model, i.e. fitting a linear model with the lm() command:

    lm(y ~ x, data = mydata)
    

    y is the response variable and x is the explanatory variable; both are numeric columns of mydata. mydata is a data frame containing the variables in the model.

    Assessing the regression model: diagnostics

    The validation of a regression model is performed here using graphical methods.

    About residuals

    1. The normal or unstandardised residuals are measured in the same units as the response variable and so are difficult to interpret across different models. We can look for residuals that stand out as being particularly large. However, we cannot define a universal cut-off point for what constitutes a large residual. To overcome this problem, we use standardised residuals, which are residuals divided by an estimate of their standard deviation.
    2. The first four plots are based on the normal or unstandardised residuals. The diagnostic plots box contains four graphs based on standardised residuals.
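
    The relationship between unstandardised and standardised residuals can be sketched in base R. The example below is an illustration under assumptions: it uses the built-in iris data, and the cut-off of 2 used to flag large residuals is a common rule of thumb chosen here, not a value taken from the application.

```r
# Example model on the built-in iris data.
fit <- lm(Petal.Width ~ Petal.Length, data = iris)

# Unstandardised residuals, in the units of the response variable.
raw <- resid(fit)

# Estimated standard deviation of each residual:
# residual standard error times sqrt(1 - leverage).
s <- summary(fit)$sigma
h <- hatvalues(fit)
std_manual <- raw / (s * sqrt(1 - h))

# This matches the standardised residuals returned by rstandard().
all.equal(std_manual, rstandard(fit))

# Cases with an absolute standardised residual above 2
# (rule-of-thumb threshold, an assumption for this sketch).
large <- which(abs(rstandard(fit)) > 2)
large
```

    Because standardised residuals are unit-free, the same threshold can be applied to models with different response variables.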

    Things to think about

    1. An outlier is a case that differs substantially from the main trend of the data. Outliers can cause the model to be biased because they affect the value of the estimated regression coefficients.
    2. An influential case is any case that significantly alters the value of a regression coefficient whenever it is deleted from an analysis.
    3. In general, influential cases have relatively extreme values on the explanatory variable and somewhat discrepant values on the response variable.
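
    One common way to quantify influence is Cook's distance, which is also the basis of one of R's standard diagnostic plots. The sketch below is a hedged illustration on the built-in iris data: it finds the case with the largest Cook's distance and refits the model without it to see how much the coefficients move.

```r
# Example model on the built-in iris data.
fit <- lm(Petal.Width ~ Petal.Length, data = iris)

# Cook's distance: one value per case, larger means more influential.
cd <- cooks.distance(fit)
head(sort(cd, decreasing = TRUE), 3)

# Refit the model with the most influential case deleted.
top <- which.max(cd)
fit_drop <- lm(Petal.Width ~ Petal.Length, data = iris[-top, ])

# Change in the intercept and slope caused by deleting that case.
coef_change <- coef(fit) - coef(fit_drop)
coef_change

# Leverage measures how extreme the case is on the explanatory variable.
hatvalues(fit)[top]
```

    A case is worth a closer look when deleting it changes the coefficients noticeably, whether or not its residual is large.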

    Residuals histogram

    Residuals boxplot

    Residual plot

    Residuals QQ plot

    Diagnostic plots

    R codes

    R codes used to generate results

    Diagnostic plots provide checks for heteroscedasticity, normality, and influential observations.

    fit <- lm(y ~ x)
    layout(matrix(c(1,2,3,4),2,2)) # optional 4 graphs/page
    plot(fit)