About the Simple Linear Regression application
Simple Linear Regression is an interactive application that lets you conduct regression analysis.
The application makes it easy to:
- Upload your dataset and download example datasets
- Visualise distribution and relationships among numeric variables
- Display dataset and results in a tabular format
- Save the results in a report format
- Learn the R code used to generate the results.
Start the analysis by selecting an example dataset or by uploading your own dataset in the left sidebar. This will display the selected dataset, a list of numeric variables and a tab: Regression. The Example data tab at the top of the screen contains detailed information about each of the example datasets in this application.
Click the plus sign (+) to open the box with detailed information about the Simple Regression application features.
Upload and download data
Uploading your data and downloading example datasets
- The Simple Linear Regression application lets you upload a .csv file containing the dataset to be displayed. Instructions are given on the Uploading datasets panel.
- There are a few built-in example datasets which can be used to conduct analysis with numeric variables.
Regression tab
Content of the Regression tab
Click on the Regression tab on the left sidebar to reveal the following options at the top of the main panel:
- Dataset
- Descriptive statistics
- Histograms
- Scatterplot matrix
- Correlation matrix
- Regression model
- Assessing the regression model
Save results
To save results in a report:
- Enter your name in the textbox (at the top of the sidebar).
- Select a document format (PDF, HTML or Word).
- Type any comments you may have about the results in the textbox labelled Interpretation.
- Press the Download Report button.
- Save the report to your disk.
R codes
R codes used to generate results:
- A list of the basic R functions used in this application is given here.
- To learn more about a specific R function and its arguments, type a question mark followed by the function name (e.g. ?lm) in the RStudio console.
Instructions on how to prepare and upload your datasets
This application lets you upload a .csv file containing the dataset to be displayed. It is very important to follow the instructions here.
STEP 1: Check 'Upload your dataset' radio button
STEP 2: Click 'Browse ...' button
The application will only accept a 'comma delimited' text file (.csv). The first row in the .csv file should contain the variable names (a 'header'). If a .csv file is uploaded without a header (and this is indicated by unchecking the entry box on the sidebar), the variables to choose from will be listed as V1, V2, V3, etc., depending on their position in the .csv file. It is best to use .csv files that include variable names as a header row.
If a spreadsheet was used to enter the data as illustrated below, then leave the default selections of radio buttons in the Separator and Quote sections.
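The effect of the header option can be sketched in R. This is a minimal illustration that assumes the application reads the upload with something like read.csv(); the file contents and the names csv_path, with_header and no_header are hypothetical.

```r
# Hypothetical file contents, written to a temporary file for illustration.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Hours,Score", "2,55", "5,70", "8,88"), csv_path)

# With a header row: the first line supplies the variable names.
# sep = "," and quote = "\"" correspond to a comma-delimited file with
# double-quoted text fields.
with_header <- read.csv(csv_path, header = TRUE, sep = ",", quote = "\"")
names(with_header)

# Without a header: R names the columns V1, V2, ... by position.
# (The header line is skipped here so that only data rows are read.)
no_header <- read.csv(csv_path, header = FALSE, skip = 1)
names(no_header)
```

Reading with a header keeps the variable names meaningful in every later table and plot, which is why a header row is recommended.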
STEP 3: Enter your data in a spreadsheet
The best approach would begin by creating a file in a spreadsheet such as this:
STEP 4: Save it as a .csv file
STEP 5: Open the .csv file in a text editor
Open the .csv file in a text editor (e.g. Notepad) and it should look like this:
Description of example datasets
This application contains 8 datasets:
- Edgar Anderson's Iris Data
- 3 Measures of ability: SATV, SATQ, ACT
- Survey results of gym visitors
- Examination scores of statistics students
- Crop and area
- Life expectancy
- Library dataset
- FTP dataset
Click the plus sign (+) to open the box with detailed information about each dataset.
Iris dataset
Edgar Anderson's Iris Data
This famous (Fisher's or Anderson's) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.
There are 150 cases and 5 variables. The variables and their domains (for numeric variables) and levels (for categorical variable) are as follows:
No | Name | Domain/Levels |
1 | Sepal Length | Numeric variable: from 4.3 cm to 7.9 cm. |
2 | Sepal Width | Numeric variable: from 2 cm to 4.4 cm. |
3 | Petal Length | Numeric variable: from 1 cm to 6.9 cm. |
4 | Petal Width | Numeric variable: from 0.1 cm to 2.5 cm. |
5 | Species | Categorical variable with three categories: Setosa, Versicolor, and Virginica. |
Source
Fisher, R. A. (1936) The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, Part II, 179–188.
The data were collected by Anderson, Edgar (1935). The irises of the Gaspe Peninsula, Bulletin of the American Iris Society, 59, 2–5.
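The figures in the table can be checked against the copy of this dataset that ships with R as iris. This is a small sketch that assumes the application's example file matches the built-in data frame; iris_ranges is an illustrative name.

```r
# iris ships with base R (package 'datasets'), so no download is needed.
data(iris)
dim(iris)              # number of cases and number of variables
table(iris$Species)    # number of flowers per species

# Minimum (row 1) and maximum (row 2) of each numeric variable, in cm.
iris_ranges <- sapply(iris[, 1:4], range)
iris_ranges
```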
SAPA dataset
3 Measures of ability: SATV, SATQ, ACT
Self-reported scores on the SAT Verbal, SAT Quantitative and ACT were collected as part of the Synthetic Aperture Personality Assessment (SAPA) web-based personality assessment project.
There are 700 cases and 6 variables. The variables and their domains (for numeric variables) and levels (for categorical variables) are as follows:
No | Name | Domain/Levels |
1 | Gender | Categorical variable with two categories: Female, Male. |
2 | Education | Categorical variable: Self-reported education (less than 12 years, high school, some college, at college, college graduate, grad/prof). |
3 | Age | Numeric variable: Full years at the moment of taking the test (from 13 years to 65 years). |
4 | ACT | Numeric variable: ACT composite scores may range from 1 to 36. |
5 | SATV | Numeric variable: SAT Verbal scores may range from 200 to 800. |
6 | SATQ | Numeric variable: SAT Quantitative scores may range from 200 to 800. |
Source
Revelle, W., Wilt, J., & Rosenthal, A. (2009) Personality and cognition: The personality-cognition link. In Gruszka, A. & Matthews, Ge. and Szymura, Blazej (Eds.) Handbook of individual differences in cognition: Attention, memory and executive control, Springer.
Gym survey dataset
Survey results of the gym visitors
This dataset provides information on survey results of gym visitors.
There are 24 observations on 5 variables. The variables and their domains (for numeric variables) and levels (for categorical variable) are as follows:
No | Name | Description & Domain/Level |
1 | Gender | Categorical variable with two categories: Female, Male. |
2 | Age | Numeric variable: Full years at the moment of survey (from 31 years to 43 years). |
3 | Exercise | Numeric variable: Exercise hours - average number of sport activities per week (from 1 hour to 8 hours). |
4 | Diet | Categorical variable: Diet type (Low calorie, Normal, High calorie). |
5 | BMI | Numeric variable: Body mass index measured at the time of survey (from 16.1 kg/m² to 34.8 kg/m²). |
Source
Course data.
Examination scores dataset
Examination scores results of statistics students
This dataset provides information on assignments and examination scores of 45 statistics students.
There are 45 observations on 4 variables. The variables and their domains (for numeric variables) and levels (for categorical variable) are as follows:
No | Name | Description & Domain/Level |
1 | Gender | Categorical variable with two categories: Female, Male. |
2 | Overall | Numeric variable: Overall course mark (from 42% to 74%). |
3 | Assignment1 | Numeric variable: Assignment 1 mark (from 49% to 93%). |
4 | Assignment2 | Numeric variable: Assignment 2 mark (from 42% to 88%). |
Source
Course data.
Crop and area dataset
Survey results of 26 agricultural sections
This dataset provides information on survey results of agricultural sections.
There are 26 observations on 2 variables. The variables and their domains are as follows:
No | Name | Description & Domain |
1 | Area | Numeric variable: Area of the section where the crop is cultivated, measured in square meters (from 380 m² to 1935 m²). |
2 | Crop | Numeric variable: Crop measured in kilograms (from 64kg to 288kg). |
Source
Course data.
Life expectancy dataset
Life expectancy and alcohol consumption
This dataset provides information on life expectancy and alcohol consumption per person in 44 countries.
There are 44 observations on 2 variables. The variables and their domains are as follows:
No | Name | Description & Domain/Level |
1 | Life | Numeric variable: Life expectancy of citizens (from 54 years to 93 years). |
2 | Alcohol | Numeric variable: Average alcohol consumption per adult citizen per year measured in litres (from 3 litres to 35 litres). |
Source
Course data.
Library dataset
Observations of library books
This dataset provides information on observations of library books.
There are 12 observations on 2 variables. The variables and their domains are as follows:
No | Name | Description & Domain |
1 | Years | Numeric variable: Years (from 8 years to 18 years). |
2 | Books | Numeric variable: Number of books (from 2 to 45). |
Source
Course data.
FTP dataset
Observations of FTP
This dataset provides information on observations of FTP.
There are 15 observations on 2 variables. The variables and their domains are as follows:
No | Name | Description & Domain |
1 | Hours | Numeric variable: Number of hours (from 0 hours to 40 hours). |
2 | FTP | Numeric variable: FTP (from 12 to 98). |
Source
Course data.
Display and download dataset
Display the dataset selected and download the file with the example dataset used.
Dataset
Descriptive statistics
Descriptive, or summary statistics are used to represent and describe nearly every dataset. They also form the building blocks for much more complicated statistical methods and models.
Three R packages (base, psych and pastecs) cover almost all descriptive statistics.
Descriptive statistics (base package)
Descriptive statistics (psych package)
Descriptive statistics (pastecs package)
R codes
R codes used to generate results
Descriptive statistics (base package):
summary(x)
x is a data frame.
Descriptive statistics (psych package):
describe(x, skew=FALSE, ranges=FALSE)
Descriptive statistics grouped by one of the categorical variables (psych package):
describeBy(x, group=y, skew=FALSE, ranges=FALSE)
y is a categorical variable.
Descriptive statistics (pastecs package):
stat.desc(x)
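As a rough base-R counterpart to the calls above, the following sketch runs without installing psych or pastecs. It uses the built-in iris data as a stand-in for the selected dataset; basic_stats and group_means are illustrative names and are not taken from the application.

```r
x <- iris[, 1:4]     # numeric columns of a data frame
y <- iris$Species    # categorical grouping variable

summary(x)           # min, quartiles, median, mean and max per column

# Mean and standard deviation per column, similar in spirit to describe().
basic_stats <- sapply(x, function(col) c(mean = mean(col), sd = sd(col)))
round(basic_stats, 2)

# Means within each group, similar in spirit to describeBy(x, group = y).
group_means <- aggregate(x, by = list(Species = y), FUN = mean)
group_means
```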
Histograms
A histogram is a graphical representation of the distribution of numeric data (i.e. continuous variable).
Histograms are created by dividing up the range of the data into non-overlapping bins/classes of equal width, and counting the number of observations that fall into them. Each bin/class is then represented by a rectangle with the bin/class as its base, where the height of the rectangle is equal to the number of observations in that bin/class.
Closely related to the histogram is a kernel density plot, or density plot. This plot is often a more effective way to view the distribution of a variable than the histogram, because histograms are sensitive to the choice of bin/class sizes. A density plot depends more on the data and less on this arbitrary parameter choice.
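The binning described above can be inspected directly in R without drawing anything. This is a minimal sketch using the built-in iris data; the break points are chosen only for illustration.

```r
x <- iris$Sepal.Length

# Compute the bins without plotting; breaks are the bin end points.
h <- hist(x, breaks = seq(4, 8, by = 0.5), plot = FALSE)
h$breaks    # end points of the equal-width bins
h$counts    # number of observations per bin (the bar heights)

# Every observation falls into exactly one bin.
sum(h$counts) == length(x)

# The kernel density estimate has its own tuning parameter, the bandwidth,
# which R chooses from the data by default.
d <- density(x)
d$bw
```

Changing the by argument changes the shape of the histogram, which is the sensitivity to bin width mentioned above.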
About a histogram
- There is no gap between bars.
- The y-axis label is Frequency or Count. When the density plot is shown, the y-axis label is Density.
- Data is grouped into classes with the end points labelled on the x-axis.
- There is an explanatory title or caption underneath the graph.
Things to think about
- What is the data range?
- How is the data distributed – skewed or symmetric?
- Which is the modal (or most frequent) class?
- Are there any outliers?
Response variable
Explanatory variable
R codes
R codes used to generate results
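hist(x, freq=TRUE, breaks = bins, col = 'darkgray', border = 'white', main = "Main title", xlab = "Horizontal axis title, i.e. variable name", ylab="Frequency")
To overlay the density curve, draw the histogram on the density scale (freq=FALSE, ylab="Density") before adding the line:
lines(density(x), col="blue", lwd = 2)
x is a vector of values for which the histogram is desired. bins is the value passed to breaks (for example, the number of bins).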
Scatterplot matrix
A scatterplot provides a graphical view of the relationship between two numeric variables.
About a scatterplot matrix
- The scatterplot matrix shows all the pairwise scatterplots of the variables on a single view with multiple scatterplots in a matrix format.
- A plot located on the intersection of i-th row and j-th column is a plot of i-th and j-th variables. This means that each row and column is one dimension, and each cell plots a scatterplot of two dimensions.
- Optionally, the scatterplot matrix includes lowess and linear best-fit lines, boxplots, densities or histograms on the principal diagonal, and rug plots in the margins of the cells.
Things to think about
- The purpose of the scatterplot is to look at the relationship between the variables and determine if there are any problems/issues with the data or if the scatterplot indicates anything unique or interesting about the data (How is the data dispersed? Are there outliers?).
- The scatterplot could show nonlinear relationships between variables. The ability to do this can be enhanced by adding a smooth line such as loess.
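A scatterplot matrix with a smooth line in each cell can also be sketched in base R with pairs(). This is not the car-package call used by the application, only a comparable illustration on the built-in iris data; n_panels is an illustrative name.

```r
x <- iris[, 1:4]

# Draw to a null device so the sketch also runs without a display;
# remove the pdf(NULL) and dev.off() lines to see the plot on screen.
pdf(NULL)
# panel.smooth adds a lowess curve to every off-diagonal scatterplot.
pairs(x, panel = panel.smooth, main = "Scatterplot Matrix")
invisible(dev.off())

# A matrix of p variables contains p * (p - 1) off-diagonal scatterplots.
p <- ncol(x)
n_panels <- p * (p - 1)
n_panels
```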
Inputs
Scatterplot Matrix
Enhanced scatterplot
R codes
R codes used to generate results
Scatterplot (car package):
scatterplotMatrix(x, var.labels=colnames(x), diagonal=c("density", "boxplot", "histogram", "oned", "qqplot", "none"), main="Scatterplot Matrix")
x is a numeric matrix or data frame.
Correlation matrix
A correlation matrix is used to investigate the association between multiple variables at the same time.
About a correlation
- A correlation coefficient (Pearson's r) measures the strength of linear relationship between two numeric variables.
- It takes values between -1 (perfect negative association) and +1 (perfect positive association).
- The correlation matrix is symmetric because the correlation between i-th and j-th variables is the same as the correlation between j-th and i-th variables.
- The P-value gives the significance of the test of Pearson's correlation; the null hypothesis is that the correlation is zero, against a two-sided alternative.
- In the correlogram, correlation coefficients are coloured according to the value.
- Positive correlations are displayed in blue and negative correlations in red.
- Colour intensity and the size of the circle are proportional to the correlation coefficients.
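The properties listed above can be checked with base R alone. This is a small sketch on the built-in iris data; cor.test() is used here for a single pair of variables and is not necessarily the function the application calls.

```r
x <- iris[, 1:4]

# Pearson's r for every pair of variables (the default method).
r <- cor(x, use = "pairwise.complete.obs")
round(r, 2)

isSymmetric(r)    # the correlation of i with j equals that of j with i
range(r)          # all values lie between -1 and +1

# Two-sided test of the null hypothesis that the correlation is zero.
ct <- cor.test(x$Petal.Length, x$Petal.Width)
ct$estimate
ct$p.value
```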
Things to think about
- Pearson's r is a valid measure of correlation if there are no outliers.
- Pearson's r is a valid measure of correlation if the relationship between the variables is linear.
- The variables must be continuous or discrete and not ordinal.
- For the correlogram, the correlation matrix can be reordered according to the values of the correlation coefficient. This is important to identify the hidden structure and pattern in the matrix.
Correlation matrix
Correlation matrix - P-values
Correlation matrix with mean values and standard deviations
Correlogram
R codes
R codes used to generate results
cor(x, use="pairwise.complete.obs")
x is a numeric matrix or data frame.
Calculating P-values (Hmisc package):
res <- rcorr(as.matrix(x))
signif(res$P, 2)
res is the object returned by rcorr(); its element P holds the matrix of P-values.
Correlogram (corrgram package):
corrgram(x, order=TRUE, lower.panel=panel.shade, upper.panel=panel.pie, text.panel=panel.txt, main = "Correlogram")
x is a data frame or correlation matrix.
Simple linear regression model
Regression analysis is a way of predicting a response variable from one explanatory variable (simple linear regression model) or several explanatory variables (multiple regression model).
About a regression
- y represents the response variable. x represents the explanatory variable.
- The intercept is the point where the regression line crosses the vertical axis of a scatterplot, i.e. the predicted value of y when x is zero.
- The slope of the regression line shows the amount of change in y per unit change in x.
- R-squared tells how much variation in the response variable is explained by variation in the explanatory variables. In other words, it tells the percentage of variation in the response variable explained by the regression model.
Things to think about
- The slope regression coefficient should have the same sign as the correlation.
- R-squared takes values between 0% and 100%.
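These quantities can be read off a fitted model in R. The sketch below uses two columns of the built-in iris data as a stand-in for the selected variables; mydata, r_squared and r are illustrative names.

```r
mydata <- data.frame(x = iris$Petal.Length, y = iris$Petal.Width)

fit <- lm(y ~ x, data = mydata)
coef(fit)                              # intercept and slope

r_squared <- summary(fit)$r.squared    # proportion of variation explained
r <- cor(mydata$x, mydata$y)           # Pearson's correlation

# In simple linear regression the slope has the same sign as r,
# and R-squared equals the square of r.
sign(coef(fit)[["x"]]) == sign(r)
all.equal(r_squared, r^2)
```

Multiplying r_squared by 100 gives the percentage form used above.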
Regression model
R codes
R codes used to generate results
Regression model, i.e. fitting a linear model with the lm() command:
lm(y ~ x, data=mydata)
y is the response variable, a numeric vector.
x is the explanatory variable, a numeric vector.
mydata is the data frame containing the variables in the model.
Assessing the regression model: diagnostics
The validation of a regression model is performed here using graphical methods.
About residuals
- The normal or unstandardised residuals are measured in the same units as the response variable and so are difficult to interpret across different models. We can look for residuals that stand out as being particularly large. However, we cannot define a universal cut-off point for what constitutes a large residual. To overcome this problem, we use standardised residuals, which are residuals divided by an estimate of their standard deviation.
- The first four plots are based on the normal or unstandardised residuals. The diagnostic plots box contains four graphs based on standardised residuals.
Things to think about
- An outlier is a case that differs substantially from the main trend of the data. Outliers can cause the model to be biased because they affect the value of the estimated regression coefficients.
- An influential case is any case that significantly alters the value of a regression coefficient whenever it is deleted from an analysis.
- In general, influential cases have relatively extreme values on the explanatory variable and somewhat discrepant values on the response variable.
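The two kinds of residuals and a common measure of influence can be extracted from a fitted model in base R. This is a sketch on the built-in iris data. The threshold of 2 is only a widely used rule of thumb, and Cook's distance is one of several possible influence measures; neither is taken from the application.

```r
fit <- lm(Petal.Width ~ Petal.Length, data = iris)

raw_res <- resid(fit)        # in the units of the response variable
std_res <- rstandard(fit)    # residual divided by its estimated standard deviation

# Cases with large standardised residuals are candidate outliers.
which(abs(std_res) > 2)

# Cook's distance measures how much the fitted values change
# when a single case is deleted from the analysis.
cooks <- cooks.distance(fit)
head(sort(cooks, decreasing = TRUE), 3)
```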
Residuals histogram
Residuals boxplot
Residual plot
Residuals QQ plot
Diagnostic plots
R codes
R codes used to generate results
Diagnostic plots provide checks for heteroscedasticity, normality, and influential observations.
fit <- lm(y ~ x)
layout(matrix(c(1,2,3,4),2,2)) # optional 4 graphs/page
plot(fit)