Simple Linear Regression

About the Simple Linear Regression application

Simple Linear Regression is an interactive application that lets you conduct regression analysis.

The Simple Linear Regression application makes it easy to:

Upload your dataset and download example datasets
Visualise distribution and relationships among numeric variables
Display dataset and results in a tabular format
Save the results in a report format
Learn the R code used to generate the results.

Start the analysis by selecting an example dataset or uploading your own dataset in the left sidebar. This displays the selected dataset, a list of numeric variables and a tab: Regression. The Example data tab at the top of the screen contains detailed information about each of the example datasets in this application.

Click the plus sign (+) to open the box with detailed information about the Simple Linear Regression application features.

Upload and download data

Uploading your data and downloading an example dataset

The Simple Linear Regression application permits the user to upload a .csv file that contains the dataset to be displayed. Instructions are given on the Uploading dataset panel.
There are a few built-in example datasets which can be used to conduct analysis with numeric variables.

Regression tab

Content of the Regression tab

Click on the Regression tab on the left sidebar to reveal the following options at the top of the main panel:

Dataset	This tab displays all the variables in the selected dataset. The selected example dataset can be downloaded as a .csv file.
Descriptive statistics	Three sets of descriptive statistics are provided: the first from the base R package, the second from the psych package and the third from the pastecs package.
Histograms	This tab displays histograms for the selected explanatory and response variables. The number of bins/classes can be selected, and a density curve is drawn over the histogram to describe the distribution more accurately.
Scatterplot matrix	Displays a scatterplot for every pair of numeric variables in the dataset using the car package in R. Also displays an enhanced scatterplot with boxplots for both the response and explanatory variables.
Correlation matrix	The correlation for each pair of numeric variables in the dataset is calculated. The second table displays p-values. Also displays a correlogram, a graphical presentation of the data in the correlation matrix, using the corrgram package in R.
Regression model	This tab estimates a simple linear regression model.
Assessing the regression model	This tab performs the validation of a regression model using graphical methods.
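
The last two tabs can be approximated outside the application with base R. The sketch below is an assumption about the general approach, not the application's own code: it fits a simple linear regression with lm() on the built-in iris data and draws the standard diagnostic plots. The choice of Petal.Width as the response and Petal.Length as the explanatory variable is purely illustrative.

```r
# Illustrative choice of variables; any two numeric columns would do.
fit <- lm(Petal.Width ~ Petal.Length, data = iris)

# Estimated intercept and slope, with standard errors and R-squared.
print(summary(fit))

# Graphical checks: residuals vs fitted, normal Q-Q, scale-location, leverage.
pdf(NULL)                 # null device so the sketch also runs non-interactively
par(mfrow = c(2, 2))      # arrange the four diagnostic plots in a 2 x 2 grid
plot(fit)
invisible(dev.off())
```

In an interactive RStudio session the pdf(NULL) and dev.off() lines can be dropped so that the plots appear in the Plots pane.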

Save results

To save results in a report:

Enter your name in the textbox (at the top of the sidebar).

Select a document format (PDF, HTML or Word).

Type any comments you may have about the results in the textbox labelled Interpretation.

Press the Download Report button.

Save the report to your disk.

R codes

R codes used to generate results:

A list of the basic R functions used in this application is given here.

To learn more about a specific R function and its arguments, type a question mark followed by the function name (e.g. ?lm) in the RStudio console.

Instructions on how to prepare and upload your datasets

This application permits the user to upload a .csv file that contains the dataset to be displayed. It is very important to follow the instructions here.

STEP 1: Check 'Upload your dataset' radio button

STEP 2: Click 'Browse ...' button

The application will only accept a 'comma delimited' text file (.csv). The first row in the .csv file should contain the variable names (a 'header'). If a .csv file is uploaded without a header (and this is indicated by unchecking the entry box on the sidebar), the variables to choose from will be listed as V1, V2, V3, etc., depending on their position in the .csv file. It is best to use .csv files that include variable names as a header row.

If a spreadsheet was used to enter the data as illustrated below, then leave the default selections of radio buttons in the Separator and Quote sections.
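
The effect of the header setting can be reproduced in plain R. The following is a minimal sketch, assuming the upload behaves like base R's read.csv(); the file contents and the variable names height and weight are hypothetical, and the separator and quote values shown are the read.csv() defaults (comma and double quote).

```r
# Hypothetical file contents, supplied inline through the `text` argument.
csv_text <- "height,weight\n170,65\n182,80"

# First row treated as variable names (the recommended layout).
with_header <- read.csv(text = csv_text, header = TRUE, sep = ",", quote = "\"")
print(names(with_header))

# First row treated as data: R names the columns V1, V2, ...
no_header <- read.csv(text = csv_text, header = FALSE, sep = ",", quote = "\"")
print(names(no_header))
```

Without a header, the row of names is read as an ordinary data row, so the columns are no longer numeric. This is one reason to prefer files with a header row.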

STEP 3: Enter your data in a spreadsheet

The best approach is to begin by creating a file in a spreadsheet such as this:

STEP 4: Save it as a .csv file

STEP 5: Open the .csv file in a text editor

Open the .csv file in a text editor (e.g. Notepad) and it should look like this:
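
As a hypothetical illustration of such a file, the sketch below writes a small two-variable data frame to a .csv file from R and prints the raw text that a text editor would display. The data values and the variable names height and weight are invented for the example.

```r
# Hypothetical data standing in for a small spreadsheet.
dat <- data.frame(height = c(170, 182), weight = c(65, 80))

# Save as a comma-delimited file with a header row and no row names.
csv_file <- tempfile(fileext = ".csv")
write.csv(dat, csv_file, row.names = FALSE)

# Raw text of the file: one header line, then one line per observation.
raw_lines <- readLines(csv_file)
print(raw_lines)
```

write.csv() wraps the variable names in double quotes, which a spreadsheet export may not do. Both forms are read correctly when the quote character is the double quote.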

Description of example datasets

This application contains 8 datasets:

Edgar Anderson's Iris Data
3 Measures of ability: SATV, SATQ, ACT
Survey results of gym visitors
Examination scores of statistics students
Crop and area
Life expectancy
Library dataset
FTP dataset

Click the plus sign (+) to open the box with detailed information about the dataset.

Iris dataset

Edgar Anderson's Iris Data

This famous (Fisher's or Anderson's) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.

There are 150 cases and 5 variables. The variables and their domains (for numeric variables) and levels (for categorical variable) are as follows:

No	Name	Domain/Levels
1	Sepal Length	Numeric variable: from 4.3cm to 7.9cm.
2	Sepal Width	Numeric variable: from 2cm to 4.4cm.
3	Petal Length	Numeric variable: from 1cm to 6.9cm.
4	Petal Width	Numeric variable: from 0.1cm to 2.5cm.
5	Species	Categorical variable with three categories: Setosa, Versicolor, and Virginica.

Source

Fisher, R. A. (1936) The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, Part II, 179–188.

The data were collected by Anderson, Edgar (1935). The irises of the Gaspe Peninsula, Bulletin of the American Iris Society, 59, 2–5.

SAPA dataset

3 Measures of ability: SATV, SATQ, ACT

Self-reported scores on the SAT Verbal, SAT Quantitative and ACT were collected as part of the Synthetic Aperture Personality Assessment (SAPA) web-based personality assessment project.

There are 700 cases and 6 variables. The variables and their domains (for numeric variables) and levels (for categorical variables) are as follows:

No	Name	Domain/Levels
1	Gender	Categorical variable with two categories: Female, Male.
2	Education	Categorical variable: Self-reported education (less than 12 years, high school, some college, at college, college graduate, grad/prof).
3	Age	Numeric variable: Full years at the moment of taking the test (from 13 years to 65 years).
4	ACT	Numeric variable: ACT composite scores may range from 1 to 36.
5	SATV	Numeric variable: SAT Verbal scores may range from 200 to 800.
6	SATQ	Numeric variable: SAT Quantitative scores may range from 200 to 800.

Source

Revelle, W., Wilt, J., & Rosenthal, A. (2009) Personality and cognition: The personality-cognition link. In Gruszka, A., Matthews, G., & Szymura, B. (Eds.) Handbook of individual differences in cognition: Attention, memory and executive control, Springer.

Gym survey dataset

Survey results of gym visitors

This dataset provides information on survey results of gym visitors.

There are 24 observations on 5 variables. The variables and their domains (for numeric variables) and levels (for categorical variable) are as follows:

No	Name	Description & Domain/Level
1	Gender	Categorical variable with two categories: Female, Male.
2	Age	Numeric variable: Full years at the moment of survey (from 31 years to 43 years).
3	Exercise	Numeric variable: Exercise hours, the average number of hours of sport activities per week (from 1 hour to 8 hours).
4	Diet	Categorical variable: Diet type (Low calorie, Normal, High calorie).
5	BMI	Numeric variable: Body mass index measured at the time of survey (from 16.1 kg/m² to 34.8 kg/m²).

Source

Course data.

Examination scores dataset

Examination scores of statistics students

This dataset provides information on assignments and examination scores of 45 statistics students.

There are 45 observations on 4 variables. The variables and their domains (for numeric variables) and levels (for categorical variable) are as follows:

No	Name	Description & Domain/Level
1	Gender	Categorical variable with two categories: Female, Male.
2	Overall	Numeric variable: Overall course mark (from 42% to 74%).
3	Assignment1	Numeric variable: Assignment 1 mark (from 49% to 93%).
4	Assignment2	Numeric variable: Assignment 2 mark (from 42% to 88%).

Source

Course data.

Crop and area dataset

Survey results of 26 agricultural sections

This dataset provides information on survey results of agricultural sections.

There are 26 observations on 2 variables. The variables and their domains are as follows:

No	Name	Description & Domain
1	Area	Numeric variable: Area of the section where the crop is cultivated, measured in square meters (from 380 m² to 1935 m²).
2	Crop	Numeric variable: Crop measured in kilograms (from 64kg to 288kg).

Source

Course data.

Life expectancy dataset

Life expectancy and alcohol consumption

This dataset provides information on life expectancy and alcohol consumption per person in 44 countries.

There are 44 observations on 2 variables. The variables and their domains are as follows:

No	Name	Description & Domain/Level
1	Life	Numeric variable: Life expectancy of citizens (from 54 years to 93 years).
2	Alcohol	Numeric variable: Average alcohol consumption per adult citizen per year measured in litres (from 3 litres to 35 litres).

Source

Course data.

Library dataset

Observations of library books

This dataset provides information on observations of library books.

There are 12 observations on 2 variables. The variables and their domains are as follows:

No	Name	Description & Domain
1	Years	Numeric variable: Years (from 8 years to 18 years).
2	Books	Numeric variable: Number of books (from 2 to 45).

Source

Course data.

FTP dataset

Observations of FTP

This dataset provides information on observations of FTP.

There are 15 observations on 2 variables. The variables and their domains are as follows:

No	Name	Description & Domain
1	Hours	Numeric variable: Number of hours (from 0 hours to 40 hours).
2	FTP	Numeric variable: FTP (from 12 to 98).

Source

Course data.

Display and download dataset

Display the dataset selected and download the file with the example dataset used.

Dataset

Number of observations to view

Interpretation

Descriptive statistics

Descriptive, or summary, statistics are used to represent and describe nearly every dataset. They also form the building blocks for much more complicated statistical methods and models.

Three R packages (base, psych and pastecs) cover almost all descriptive statistics.

Descriptive statistics (base package)

Descriptive statistics (psych package)

Descriptive statistics (pastecs package)

R codes

R codes used to generate results

Descriptive statistics (base package):

summary(x)

x is a data frame.

Descriptive statistics (psych package):

describe(x, skew=FALSE, ranges=FALSE)

Descriptive statistics grouped by one of the categorical variables (psych package):

describeBy(x, group=y, skew=FALSE, ranges=FALSE)

y is a categorical variable.

Descriptive statistics (pastecs package):

stat.desc(x)
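
The calls above can be tried on the built-in iris data. In this minimal sketch the choice of iris for x and Species for y is an assumption made for illustration. The psych and pastecs calls run only if those packages are installed.

```r
x <- iris[, 1:4]     # the four numeric columns of the built-in iris data
y <- iris$Species    # categorical grouping variable

# Base R: minimum, quartiles, median, mean and maximum for each column.
print(summary(x))

# psych: n, mean, sd and standard error, overall and by group.
if (requireNamespace("psych", quietly = TRUE)) {
  print(psych::describe(x, skew = FALSE, ranges = FALSE))
  print(psych::describeBy(x, group = y, skew = FALSE, ranges = FALSE))
}

# pastecs: a longer table of descriptive statistics per column.
if (requireNamespace("pastecs", quietly = TRUE)) {
  print(pastecs::stat.desc(x))
}
```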

Interpretation

Histograms

A histogram is a graphical representation of the distribution of numeric data (i.e. a continuous variable).

Histograms are created by dividing up the range of the data into non-overlapping bins/classes of equal width, and counting the number of observations that fall into them. Each bin/class is then represented by a rectangle with the bin/class as its base, where the height of the rectangle is equal to the number of observations in that bin/class.

Closely related to the histogram is the kernel density plot, or density plot. It is often a more effective way to view the distribution of a variable than the histogram: histograms are sensitive to the choice of bin/class sizes, whereas a density plot depends more on the data and less on this arbitrary parameter choice.
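
The binning described above can be checked by hand. This sketch is an illustration on the built-in iris data, with hypothetical break points chosen to cover the range of Sepal.Length. It compares the counts computed by hist() with counts obtained by cutting the data into the same equal-width classes, and confirms that on the density scale the bar areas sum to 1.

```r
x <- iris$Sepal.Length              # 150 values between 4.3 and 7.9
breaks <- seq(4, 8, by = 0.5)       # eight classes of equal width 0.5

# Counts per class as computed by hist(), without drawing the plot.
h <- hist(x, breaks = breaks, plot = FALSE)

# The same counts obtained manually: assign each value to a class, then tabulate.
manual <- table(cut(x, breaks = breaks, include.lowest = TRUE))

print(h$counts)
print(as.vector(manual))

# On the density scale, bar height times bar width sums to 1.
total_area <- sum(h$density * diff(h$breaks))
print(total_area)

# A kernel density estimate chooses its smoothing bandwidth from the data.
d <- density(x)
print(d$bw)
```

Changing the by value in seq() changes the shape of the histogram, which is the sensitivity to bin/class size mentioned above.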

About a histogram

There is no gap between bars.
The y-axis label is Frequency or Count. When the density plot is shown, then the y-axis label is Density.
Data is grouped into classes with the end points labelled on the x-axis.
There is an explanatory title or caption underneath the graph.

Things to think about

What is the data range?
How is the data distributed – skewed or symmetric?
Which is the modal (or most frequent) class?
Are there any outliers?

Response variable

Number of bins:

Explanatory variable

Number of bins:

R codes

R codes used to generate results

Histogram:

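# freq = FALSE puts the y-axis on the density scale so that the density curve overlays the bars correctly
hist(x, freq = FALSE, breaks = bins, col = 'darkgray', border = 'white', main = "Main title", xlab = "Horizontal axis title, i.e. variable name", ylab = "Density")
lines(density(x), col = "blue", lwd = 2)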

x is a vector of values for which the histogram is desired.

Interpretation

Scatterplot matrix

A scatterplot provides a graphical view of the relationship between two numeric variables.

About a scatterplot matrix

The scatterplot matrix shows all the pairwise scatterplots of the variables on a single view with multiple scatterplots in a matrix format.
The plot located at the intersection of the i-th row and the j-th column is a plot of the i-th and j-th variables. This means that each row and column is one dimension, and each cell plots a scatterplot of two dimensions.
Optionally, the scatterplot matrix includes lowess and linear best fit lines, boxplots, densities, or histograms on the principal diagonal, as well as rug plots in the margins of the cells.

Things to think about

The purpose of the scatterplot is to look at the relationship between the variables and determine if there are any problems/issues with the data or if the scatterplot indicates anything unique or interesting about the data (How is the data dispersed? Are there outliers?).
The scatterplot could show nonlinear relationships between variables. The ability to do this can be enhanced by adding a smooth line such as loess.

Inputs

Scatterplot Matrix

Enhanced scatterplot

R codes

R codes used to generate results

Scatterplot (car package):

scatterplotMatrix(x, var.labels=colnames(x), diagonal=c("density", "boxplot", "histogram", "oned", "qqplot", "none"), main="Scatterplot Matrix")

x is a numeric matrix or data frame. The diagonal argument lists the available options; supply one of them.
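
For readers without the car package, a comparable plot can be drawn with base R. The sketch below swaps in pairs() from the graphics package for scatterplotMatrix(), so it is an approximation and not the application's own code. The iris data are used as an assumed example.

```r
x <- iris[, 1:4]     # numeric columns only

pdf(NULL)            # null device so the sketch also runs non-interactively
# Cell (i, j) plots variable j on the x-axis against variable i on the y-axis;
# panel.smooth adds a lowess curve to every cell.
pairs(x, panel = panel.smooth, main = "Scatterplot Matrix")
invisible(dev.off())
```

In an interactive session the pdf(NULL) and dev.off() lines can be removed so that the matrix appears in the plotting window.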

Interpretation

Correlation matrix

A correlation matrix is used to investigate the association between multiple variables at the same time.

About a correlation

A correlation coefficient (Pearson's r) measures the strength of linear relationship between two numeric variables.
It takes values between -1 (perfect negative association) and +1 (perfect positive association).
The correlation matrix is symmetric because the correlation between i-th and j-th variables is the same as the correlation between j-th and i-th variables.
The p-value is used to assess the significance of the test of Pearson's correlation; the null hypothesis is that the correlation is zero, against a two-sided alternative.
In the correlogram, correlation coefficients are coloured according to the value.
Positive correlations are displayed in blue and negative correlations in red.
Colour intensity and the size of the circle are proportional to the correlation coefficients.

Things to think about

Pearson's r is a valid measure of correlation if there are no outliers.
Pearson's r is a valid measure of correlation if the relationship between the variables is linear.
The variables must be continuous or discrete and not ordinal.
For the correlogram, the correlation matrix can be reordered according to the values of the correlation coefficient. This is important to identify the hidden structure and pattern in the matrix.

Correlation matrix

Correlation matrix - P-values

Correlation matrix with mean values and standard deviations

Correlogram

R codes

R codes used to generate results

Correlation matrix:

cor(x, use="pairwise.complete.obs")

x is a matrix or data frame.

Calculating p-values (Hmisc package):

res <- rcorr(as.matrix(x))
signif(res$P, 2)

res is the object returned by rcorr(); its P component is the matrix of p-values.
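
The same quantities can be obtained without Hmisc. The following sketch uses base R's cor.test() as a stand-in, applied to each pair of variables in the built-in iris data (an assumed example). It is not the application's own code. Each p-value comes from the two-sided test of zero correlation.

```r
x <- iris[, 1:4]
vars <- colnames(x)

# Pearson correlation matrix.
r <- cor(x, use = "pairwise.complete.obs")

# Matrix of p-values, one cor.test() per pair; the diagonal is left as NA.
p <- matrix(NA_real_, nrow = length(vars), ncol = length(vars),
            dimnames = list(vars, vars))
for (i in seq_along(vars)) {
  for (j in seq_along(vars)) {
    if (i != j) p[i, j] <- cor.test(x[[i]], x[[j]])$p.value
  }
}

print(round(r, 2))
print(signif(p, 2))
```

Both matrices are symmetric, so only the upper or lower triangle needs to be read.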

Correlogram (corrgram package):

corrgram(cor, order=TRUE, lower.panel=panel.shade, upper.panel=panel.pie, text.panel=panel.txt, main = "Correlogram")

cor is a data frame or correlation matrix.

Interpretation