About the Numeric Data Analysis application
The Numeric Data Analysis application is an interactive tool that allows you to conduct statistical analyses of numeric variables.
The application makes it easy to:
- Upload your data and download example datasets
- Visualise distribution and relationships among numeric data
- Display data and results in a tabular format
- Test hypotheses about normality, mean values and variances
- Save the results in a report format
- Learn the R codes used to generate the results.
Start the analysis by selecting example data or uploading your own data in the left sidebar. This displays the selected dataset, a list of numeric variables and three tabs: Graphs, Tables and Inference. The Example data tab at the top of the screen contains detailed information about each of the example datasets in this application.
Click the plus sign (+) to open the box with detailed information about the Numeric Data Analysis application features.
Upload and download data
Uploading your data and downloading example dataset
- The Numeric Data Analysis application permits the user to upload a .csv file that contains the data to be analysed. Instructions are given on the Uploading data panel.
- There are a few built-in example datasets which can be used to conduct analysis with numeric variables.
- The application permits the user to download each example dataset.
- The application also permits the user to generate a simple random sample from each dataset.
Graphs tab
Content of the Graphs tab
Click on the Graphs tab on the left sidebar to reveal the following options at the top of the main panel:
Tab | Description |
Histogram | Displays a histogram for the selected numeric variable. |
Histograms | Displays a histogram for the selected numeric variable using the lattice package in R. |
Boxplots | Displays a boxplot for the selected numeric variable using the lattice package in R. |
Stemplots | Displays a stemplot and back-to-back stemplots for the selected numeric variables using the aplpack package in R. |
Dotplots | Displays a dotplot for the selected numeric variable using the stripchart function in R. |
Stripcharts | Displays a stripchart for the selected numeric variable using the lattice package in R. |
QQ plots | Displays a normal quantile plot (QQ plot) for the selected numeric variable using the lattice package in R. |
Correlogram | Displays a correlogram, a graphical presentation of the correlation matrix, using the corrgram package in R. |
Scatterplot matrix | Displays a scatterplot for every pair of numeric variables in the dataset using the car package in R. |
Tables tab
Content of the Tables tab
Click on the Tables tab on the left sidebar to reveal the following options at the top of the main panel:
Tab | Description |
Datasets | Displays the selected dataset, and allows you to download it or take a simple random sample from it. |
Descriptive statistics | Displays summary statistics for the selected numeric variables using the base, psych and pastecs packages in R. |
Outliers detection | Displays a table of the cases identified as outliers by the 1.5 IQR rule. |
Correlation matrix | Displays the correlation matrix for the numeric variables, together with the P-values of the correlation tests. |
Inference tab
Content of the Inference tab
Click on the Inference tab on the left sidebar to reveal the following options at the top of the main panel:
Tab | Description |
Normality tests | Performs nine different normality tests. The null hypothesis for these tests is that the variable is normally distributed. |
t-tests | Performs one-sample, two-sample and matched pairs t-tests. |
Bartlett test | Performs Bartlett's test of the null hypothesis that the variances in each of the groups (samples) are the same. |
Levene test | Performs Levene's test of the null hypothesis of homogeneity of variance (homoscedasticity). It is more robust than the Bartlett test to departures from normality. |
ANOVA | Performs ANOVA for the selected numeric variable and at least two categories of a categorical variable (factor). |
Save results
To save results in a report:
- Enter your name in the textbox at the top of the sidebar.
- Select a document format (PDF, HTML or Word).
- Type any comments you may have about the results in the textbox labelled Interpretation.
- Press the Download Report button.
- Save the report on your disk.
R codes
R codes used to generate results:
- A list of the basic R functions used in this application is given here.
- To learn more about a specific R function and its arguments, type a question mark followed by the function name (e.g. ?table) in the RStudio console.
Instructions on how to prepare and upload your data
This application permits the user to upload a .csv file that contains data to be displayed. It is very important to follow the instructions here.
STEP 1: Check the 'Upload your data' radio button
STEP 2: Click the 'Browse ...' button
The application will only accept a 'comma delimited' text file (.csv). The first row in the .csv file should contain the variable names (a 'header'). If a .csv file is uploaded without a header (and this is indicated by unchecking the entry box on the sidebar), the variables to choose from will be listed as V1, V2, V3, etc., depending on their position in the .csv file. It is best to use .csv files that include variable names as a header row.
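As a rough illustration of how R treats the header row, reading such a file directly would look something like this (the file name mydata.csv is hypothetical):
# Read a comma-delimited file whose first row contains the variable names
mydata <- read.csv("mydata.csv", header = TRUE, sep = ",", quote = "\"")
# Without a header row the columns are named V1, V2, V3, ...
mydata <- read.csv("mydata.csv", header = FALSE, sep = ",", quote = "\"")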
If a spreadsheet was used to enter the data as illustrated below, then leave the default selections of radio buttons in the Separator and Quote sections.
STEP 3: Enter your data in a spreadsheet
The best approach would begin by creating a file in a spreadsheet such as this:
STEP 4: Save it as a .csv file
STEP 5: Open the .csv file in a text editor
Open the .csv file in a text editor (e.g. Notepad) and it should look like this:
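For example, a small file with a header row might look like this (illustrative values only):
Name,Height,Weight
Anna,165,58
Ben,180,75
Cara,172,64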
Description of example datasets
This application contains 2 datasets:
- Edgar Anderson's Iris Data
- 3 Measures of ability: SATV, SATQ, ACT
Click the plus sign (+) to open the box with detailed information about the dataset.
Iris dataset
Edgar Anderson's Iris Data
This famous (Fisher's or Anderson's) iris dataset gives the measurements, in centimetres, of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.
There are 150 cases and 5 variables. The variables and their domains (for numeric variables) and levels (for categorical variable) are as follows:
No | Name | Domain/Levels |
1 | Sepal Length | Numeric variable: from 4.3 cm to 7.9 cm. |
2 | Sepal Width | Numeric variable: from 2 cm to 4.4 cm. |
3 | Petal Length | Numeric variable: from 1 cm to 6.9 cm. |
4 | Petal Width | Numeric variable: from 0.1 cm to 2.5 cm. |
5 | Species | Categorical variable with three categories: Setosa, Versicolor, and Virginica. |
Source
Fisher, R. A. (1936) The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, Part II, 179–188.
The data were collected by Anderson, Edgar (1935). The irises of the Gaspé Peninsula, Bulletin of the American Iris Society, 59, 2–5.
SAPA dataset
3 Measures of ability: SATV, SATQ, ACT
Self-reported scores on the SAT Verbal, SAT Quantitative and ACT were collected as part of the Synthetic Aperture Personality Assessment (SAPA) web-based personality assessment project.
There are 700 cases and 6 variables. The variables and their domains (for numeric variables) and levels (for categorical variables) are as follows:
No | Name | Domain/Levels |
1 | Gender | Categorical variable with two categories: Female / Male. |
2 | Education | Categorical variable: self-reported education (less than 12 years, high school, some college, at college, college graduate, grad/prof). |
3 | Age | Numeric variable: full years at the time of taking the test (from 13 to 65 years). |
4 | ACT | Numeric variable: ACT composite scores range from 1 to 36. |
5 | SATV | Numeric variable: SAT Verbal scores range from 200 to 800. |
6 | SATQ | Numeric variable: SAT Quantitative scores range from 200 to 800. |
Source
Revelle, W., Wilt, J., & Rosenthal, A. (2009) Personality and cognition: The personality-cognition link. In Gruszka, A., Matthews, G., & Szymura, B. (Eds.), Handbook of Individual Differences in Cognition: Attention, Memory and Executive Control. Springer.
Histogram
A histogram is a graphical representation of the distribution of numeric data (i.e. continuous variable).
Histograms are created by dividing up the range of the data into non-overlapping bins/classes of equal width, and counting the number of observations that fall into them. Each bin/class is then represented by a rectangle with the bin/class as its base, where the height of the rectangle is equal to the number of observations in that bin/class.
Closely related to the histogram is the kernel density plot, or density plot. A density plot is often a more effective way to view the distribution of a variable, because histograms are sensitive to the choice of bin/class sizes, whereas a density plot depends more on the data and less on this arbitrary parameter choice.
About a histogram
- There is no gap between bars.
- The y-axis label is Frequency or Count.
- Data is grouped into classes with the end points labelled on the x-axis.
- There is an explanatory title or caption underneath the graph.
Things to think about
- What is the data range?
- How is the data distributed – skewed or symmetric?
- Which is the modal (most frequent) class?
- Are there any outliers?
Histogram
Inputs
R codes
R codes used to generate results
hist(x, freq=TRUE, breaks = bins, col = 'darkgray', border = 'white', main = "Main title", xlab = "Horizontal axis title, i.e. variable name", ylab="Frequency")
lines(density(x), col="blue", lwd = 2)
x is a vector of values for which the histogram is desired.
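As a minimal sketch of the density plot mentioned above (not the application's own code), the density curve can be drawn on its own, or overlaid on a histogram drawn on the density scale:
plot(density(x, na.rm = TRUE), main = "Kernel density estimate", xlab = "Variable name")
hist(x, freq = FALSE, col = "darkgray", border = "white")  # density scale, so the curve is comparable
lines(density(x, na.rm = TRUE), col = "blue", lwd = 2)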
Histogram with factors
A histogram is a graphical representation of the distribution of numeric data (i.e. continuous variable).
Histograms are created by dividing up the range of the data into non-overlapping bins/classes of equal width, and counting the number of observations that fall into them. Each bin/class is then represented by a rectangle with the bin/class as its base, where the height of the rectangle is equal to the number of observations in that bin/class. A categorical variable can be used as a factor - this produces a histogram for each value of the categorical variable.
About a histogram
- There is no gap between bars.
- The y-axis label is Frequency or Count.
- Data is grouped into classes with the end points labelled on the x-axis.
- There is an explanatory title or caption underneath the graph.
Things to think about
- What is the data range?
- How is the data distributed – skewed or symmetric?
- Which is the modal (most frequent) class?
- Are there any outliers?
Histogram
R codes
R codes used to generate results
Histogram (lattice package):
histogram(~ x | y, type="count", main="Main title", ylab="Frequency", xlab = "Horizontal axis title, i.e. variable name")
x is a vector of values for which the histogram is desired and y is a factor, i.e. categorical variable.
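A usage sketch with the built-in iris data (assuming the lattice package is loaded; variable choices are illustrative):
library(lattice)
histogram(~ Sepal.Length | Species, data = iris, type = "count",
          main = "Sepal length by species", xlab = "Sepal length (cm)", ylab = "Frequency")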
Boxplot with factors
A boxplot, sometimes called a box and whisker plot, is a graph used to display the distribution of a quantitative variable based on the five number summary. A categorical variable can be used as a factor - this produces a boxplot for each value of the categorical variable, shown side-by-side.
About a boxplot
- The bottom and top of the box are always the first and third quartiles.
- The line inside the box is always the second quartile (the median).
- The ends of the whiskers usually represent the minimum and maximum of all of the values of the variable. Any value not included between the whiskers is plotted as an outlier with a dot.
- A boxplot provides information about the shape of a distribution. If a distribution is symmetric the boxplot shows the median roughly in the middle of the box.
- If the longer part of the box is to the left (or below) the median, i.e. observations are concentrated on the low end of the scale, the distribution is said to be skewed left. If the longer part is to the right (or above) the median, the distribution of the variable is skewed right.
Things to think about
- The boxplot shows the spread of all values of a variable. The range is the distance between the smallest value and the largest value.
- The boxplot also shows another measure of spread, the interquartile range (IQR). The IQR is represented by the length of the box (third quartile minus first quartile).
- Side-by-side boxplots are particularly useful for comparison of distributions between several groups or sets of data.
Boxplot
Inputs
R codes
R codes used to generate results
Box plot or box-whisker plot (lattice package):
bwplot(~ x | y, main="Main title", xlab = "Horizontal axis title, i.e. variable name")
x is a vector of values for which the boxplot is desired and y is a factor, i.e. categorical variable.
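A usage sketch with the built-in iris data (illustrative only):
library(lattice)
bwplot(Sepal.Length ~ Species, data = iris,
       main = "Sepal length by species", ylab = "Sepal length (cm)")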
Stemplot
A stemplot (also called a stem-and-leaf plot) gives a quick picture of the shape of a distribution while including the actual numeric values in the graph.
About a stemplot
- A stemplot displays the sorted data from smallest to largest.
- A stemplot is used to display quantitative data, generally from small datasets (50 or fewer observations).
- A stemplot allows easy identification of the range and highlights extreme values (‘outliers’).
- The stemplot groups the data into ‘bins’ determined by the choice of stems.
Things to think about
- When comparing two related distributions a back-to-back stemplot can be used. The leaves on each side are ordered out from the common stem.
- In some cases, we can decide to double the number of stems in a plot by splitting each stem into two.
- The depths column contains a number with parentheses around it. The frequency of the row/stem containing the median is placed in these parentheses. It accumulates the values from the top and the bottom, but it stops in each direction when it reaches the row containing the middle value (median) of the variable.
Stem and Leaf Plot
Inputs
R codes
R codes used to generate results
Stemplot, also known as stem and leaf plot (aplpack package):
stem.leaf(x)
x is a vector of values for which the stemplot is desired.
Back-to-back stemplots
stem.leaf.backback(x1, x2)
x1 and x2 are vectors of values for which the back-to-back stemplots are desired. They are created using a factor, i.e. categorical variable. This only works when the categorical variable has two categories (e.g. Female, Male).
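A usage sketch showing how x1 and x2 can be created from a factor, assuming the stem.leaf functions come from the aplpack package:
library(aplpack)
x1 <- iris$Sepal.Length[iris$Species == "setosa"]
x2 <- iris$Sepal.Length[iris$Species == "versicolor"]
stem.leaf.backback(x1, x2)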
Dot plot with factors
A dot plot is a graph used to display observations on a fairly simple scale, typically using filled-in circles. If a categorical variable is used as a factor, a dot plot will be displayed for each value of the variable.
About a dot plot
- Dot plots are suitable for small to moderately sized data sets.
- They are useful for highlighting clusters and gaps, as well as outliers.
Things to think about
- Dot plots tend not to be as useful for judging shape as histograms and stemplots.
- Dot plots tend not to present as smooth a picture as histograms.
Dot plot
R codes
R codes used to generate results
Dot plot/Stripchart (stripchart function):
stripchart(x ~ y, method="stack", main="Main title", xlab = "Horizontal axis title, i.e. variable name")
x is a vector of values for which the dot plot is desired and y is a factor, i.e. categorical variable.
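A usage sketch with the built-in iris data; method = "stack" stacks repeated values so the result looks like a dot plot (an assumption, not necessarily the application's exact settings):
stripchart(Sepal.Length ~ Species, data = iris, method = "stack",
           main = "Sepal length by species", xlab = "Sepal length (cm)")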
Stripchart with factors
Stripcharts produce one dimensional scatter plots of the given data. If a categorical variable is used as a factor, a stripchart will be displayed for each value of the variable.
About a stripchart
- Stripcharts are a good alternative to boxplots when the number of observations is small.
- They are useful for highlighting clusters and gaps, as well as outliers.
Things to think about
- It is common to use systematic jittering in a stripchart, i.e. repeated values are offset so that all observations are visible.
Stripchart
R codes
R codes used to generate results
Stripchart (lattice package):
stripplot(~ x | y, jitter.data=TRUE, factor=2, main="Main title", xlab = "Horizontal axis title, i.e. variable name")
x is a vector of values for which the stripchart is desired and y is a factor, i.e. categorical variable.
QQ plot with factors
Normal quantile plots plot quantiles of the data against quantiles of the normal distribution. If a categorical variable is used as a factor, a QQ plot will be displayed for each value of the variable.
About a normal QQ plot
- It plots the Z-score (or normal score) on the x-axis and the variable you are investigating on the y-axis.
- It has a title or caption underneath the graph.
- A common task when analysing continuous numerical variables is to compare them to a theoretical distribution. The most commonly used tool for this job is the theoretical Q-Q plot. For a good fit, a Q-Q plot is roughly linear, with systematic deviations suggesting a lack of fit.
Things to think about
- Is the data normally distributed? If it is, the data will cluster round a straight line.
- Is there skewness shown by deviations in the left or right tails?
- Are there any obvious outliers?
- Granularity may be showing – this indicates several observations with the same value.
QQ Plot
R codes
R codes used to generate results
Normal quantile plot, also known as QQ plot (lattice package):
qqmath(~ x | y, main = "Main title", ylab = "Variable name", xlab = "z-score", distribution = qnorm,
prepanel = prepanel.qqmathline,
panel = function(...){
panel.qqmath(...)
panel.qqmathline(...)
})
x is a vector of values for which the QQ plot is desired and y is a factor, i.e. categorical variable.
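A usage sketch with the built-in iris data (lattice package assumed loaded):
library(lattice)
qqmath(~ Sepal.Length | Species, data = iris, xlab = "z-score", ylab = "Sepal length (cm)",
       prepanel = prepanel.qqmathline,
       panel = function(x, ...) {
         panel.qqmath(x, ...)
         panel.qqmathline(x, ...)  # reference line through the quartiles
       })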
Correlogram
A correlogram is a graphical display of a correlation matrix.
About a correlogram
- In this plot, correlation coefficients are coloured according to the value.
- Positive correlations are displayed in blue and negative correlations in red.
- Colour intensity and the size of the circle are proportional to the correlation coefficients.
Things to think about
- The correlation matrix can be reordered according to the values of the correlation coefficient. This is important to identify the hidden structure and pattern in the matrix.
Correlogram
R codes
R codes used to generate results
Correlogram (corrgram package):
corrgram(cor, order=TRUE, lower.panel=panel.shade, upper.panel=panel.pie, text.panel=panel.txt, main = "Correlogram")
cor is a data frame or correlation matrix.
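A usage sketch with the numeric columns of the built-in iris data (illustrative only):
library(corrgram)
corrgram(iris[, 1:4], order = TRUE, lower.panel = panel.shade,
         upper.panel = panel.pie, text.panel = panel.txt,
         main = "Correlogram of the iris measurements")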
Scatterplot matrix
A scatter plot provides a graphical view of the relationship between two numeric variables.
About a scatterplot matrix
- The scatterplot matrix shows all the pairwise scatterplots of the variables on a single view with multiple scatterplots in a matrix format.
- A plot located on the intersection of i-th row and j-th column is a plot of i-th and j-th variables. This means that each row and column is one dimension, and each cell plots a scatterplot of two dimensions.
- Optionally, the scatterplot matrix includes lowess and linear best-fit lines, as well as boxplots, densities, or histograms on the principal diagonal and rug plots in the margins of the cells.
Things to think about
- The purpose of the scatterplot is to look at the relationship between the variables and determine if there are any problems/issues with the data or if the scatterplot indicates anything unique or interesting about the data (How is the data dispersed? Are there outliers?).
- The scatterplot could show nonlinear relationships between variables. The ability to do this can be enhanced by adding a smooth line such as loess.
Inputs
Scatterplot Matrix
R codes
R codes used to generate results
Scatterplot (car package):
scatterplotMatrix(x, var.labels=colnames(x), diagonal=c("density", "boxplot", "histogram", "oned", "qqplot", "none"), main="Scatterplot Matrix")
x is a data matrix or numeric data frame.
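A usage sketch with the numeric columns of the built-in iris data (the diagonal argument is omitted here because its form differs between versions of the car package):
library(car)
scatterplotMatrix(iris[, 1:4], main = "Scatterplot Matrix")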
Box-Cox transformation
A procedure used to identify an appropriate power transformation that brings the data closer to a 'normal shape'.
About a Box-Cox transformation
- A variable transformation is used to eliminate skewness and other distributional features that complicate analysis. Often the goal is to find a simple transformation that leads to normality.
- The lambda parameter in the Box-Cox transformation indicates the power to which all observations of the variable should be raised.
- Lambda = -1 corresponds to the transformation 1/X; Lambda = -0.5 to 1/sqrt(X); Lambda = 0 to log(X); Lambda = 1 to X (no transformation); and Lambda = 2 to X^2.
Things to think about
- The Box-Cox power transformation is not a guarantee for normality. This is because it actually does not really check for normality; the method checks for the smallest standard deviation.
- The Box-Cox power transformation only works if all the data values are strictly positive. This, however, can usually be achieved easily by adding a constant to all values so that they all become positive before the transformation.
Box-Cox transformation
Inputs
R codes
R codes used to generate results
Box-Cox transformation (MASS package):
boxcox(x~1)
x is a vector of values for which the Box-Cox transformation is required.
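A sketch of how the suggested lambda can be read off the profile returned by boxcox() (variable name illustrative):
library(MASS)
bc <- boxcox(x ~ 1)              # x must contain only positive values
lambda <- bc$x[which.max(bc$y)]  # lambda with the highest log-likelihood
# e.g. lambda near 0 suggests log(x); near 0.5 suggests sqrt(x); near 1 suggests no transformation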
Display and download dataset
This tab displays the selected dataset and allows you to download the file with the example dataset used, as well as take a simple random sample from it.
- To take a simple random sample from a dataset, select the check box labelled Take a simple random sample.
- Set the Sample size to the desired number of observations.
- Change the seed if a new sample is required. When the same seed is used, the same sample will be generated each time the Press to generate a sample button is clicked.
- Select the check box labelled Take a sample with replacement if a sample with replacement is wanted.
- Click the Download sample button to download the generated sample and save it on a hard drive.
- To analyse the sample, upload the saved .csv file and continue using the application.
Dataset
Inputs
R codes
R codes used to generate results
Take a simple random sample:
set.seed(10) # initiate the random number generator
data[sample(1:nrow(data), size=sampleSize, replace=FALSE),]
data is a data frame from which a simple random sample without replacement is taken.
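If the Take a sample with replacement check box is selected, the same call presumably uses replace=TRUE, for example:
data[sample(1:nrow(data), size=sampleSize, replace=TRUE),]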
Descriptive statistics
Descriptive, or summary, statistics are used to represent and describe nearly every dataset. They also form the building blocks for much more complicated statistical methods and models.
Three R packages (base, psych and pastecs) cover almost all descriptive statistics.
Descriptive statistics (base package)
Descriptive statistics for the selected numeric variable and factor (base package)
Descriptive statistics (psych package)
Descriptive statistics (pastecs package)
R codes
R codes used to generate results
Descriptive statistics (base package):
summary(x)
x is a data frame.
Descriptive statistics (psych package):
describe(x, skew=FALSE, ranges=FALSE)
Descriptive statistics grouped by one of the categorical variables (psych package):
describeBy(x, group=y, skew=FALSE, ranges=FALSE)
y is a categorical variable.
Descriptive statistics (pastecs package):
stat.desc(x)
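A usage sketch with the built-in iris data (package loading shown explicitly; column selection is illustrative):
library(psych)
describeBy(iris[, 1:4], group = iris$Species, skew = FALSE, ranges = FALSE)
library(pastecs)
stat.desc(iris[, 1:4])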
Detecting outliers using 1.5 IQR rule
A table is displayed with the index/ID and value of each case identified as an outlier.
Detecting outliers
Inputs
R codes
R codes used to generate results
Detecting outliers using the 1.5 IQR rule:
# Create space to store the outliers and their indices
Outliers <- c()
idxOutliers <- c()
# Compute the lower and upper fences of the 1.5 IQR rule
Upper <- quantile(x, 0.75, na.rm=TRUE) + (IQR(x, na.rm=TRUE) * 1.5)
Lower <- quantile(x, 0.25, na.rm=TRUE) - (IQR(x, na.rm=TRUE) * 1.5)
# Get the indices of values outside the fences
index <- which(x < Lower | x > Upper)
# Store the values of the outliers
Outliers <- c(Outliers, x[index])
# Store the indices of the outliers
idxOutliers <- c(idxOutliers, index)
output <- cbind(idxOutliers, Outliers)
x is a numeric vector.
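A minimal sketch wrapping the same 1.5 IQR rule in a reusable function (find_outliers is a hypothetical helper, not part of the application):
find_outliers <- function(x) {
  upper <- quantile(x, 0.75, na.rm = TRUE) + 1.5 * IQR(x, na.rm = TRUE)  # upper fence
  lower <- quantile(x, 0.25, na.rm = TRUE) - 1.5 * IQR(x, na.rm = TRUE)  # lower fence
  idx <- which(x < lower | x > upper)
  cbind(index = idx, value = x[idx])
}
find_outliers(iris$Sepal.Width)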
Correlation matrix
A correlation matrix is used to investigate the association between multiple variables at the same time.
About a correlation
- A correlation coefficient (Pearson's r) measures the strength of linear relationship between two numeric variables.
- It takes values between -1 (perfect negative association) and +1 (perfect positive association).
- The correlation matrix is symmetric because the correlation between i-th and j-th variables is the same as the correlation between j-th and i-th variables.
- The P-value determines the significance of the test of Pearson's correlation; the null hypothesis is that the correlation is zero, against a two-sided alternative.
Things to think about
- Pearson's r is a valid measure of correlation if there are no outliers.
- Pearson's r is a valid measure of correlation if the relationship between the variables is linear.
- The variables must be continuous or discrete and not ordinal.
Correlation matrix
Correlation matrix - P-values
R codes
R codes used to generate results
cor(x, use="pairwise.complete.obs")
x is a matrix or data frame.
Calculating P-values (Hmisc package):
signif(cor$P, 2)
cor here is the object holding the correlation results (e.g. as returned by the Hmisc rcorr() function); its P component contains the matrix of P-values.
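A minimal sketch of the full calculation, assuming the rcorr() function from the Hmisc package produces the object used above:
library(Hmisc)
cor <- rcorr(as.matrix(x))   # x: numeric data frame or matrix
signif(cor$r, 2)             # correlation coefficients
signif(cor$P, 2)             # P-values for H0: correlation = 0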
Normality tests
The normality tests are supplementary to the graphical assessment of normality (QQ plots).
About normality tests
- The normality assumption should be checked before using parametric statistical tests, such as t-tests.
- It is preferable that normality be assessed both visually and through normality tests, of which the Shapiro-Wilk test is highly recommended.
- The normality tests compare the scores in the sample to a normally distributed set of scores with the same mean and standard deviation; the null hypothesis is that 'sample distribution is normal.' If the result is significant, the distribution is non-normal.
Things to think about
- For small sample sizes, normality tests have little power to reject the null hypothesis and therefore small samples most often pass normality tests.
- For large sample sizes, significant results may be derived even in the case of a small deviation from normality, although this small deviation will not affect the results of a parametric test.
- Kolmogorov-Smirnov test is not recommended when parameters are estimated from the data, regardless of sample size.
Normality tests
Inputs
R codes
R codes used to generate results
Normality tests (fBasics package):
Anderson-Darling normality test
adTest(x)
Cramer-von Mises normality test
cvmTest(x)
Lilliefors (Kolmogorov-Smirnov) normality test
lillieTest(x)
Pearson chi-square normality test
pchiTest(x)
Shapiro-Francia normality test
sfTest(x)
Kolmogorov-Smirnov normality test
ksnormTest(x)
Shapiro-Wilk's test for normality
shapiroTest(x)
Jarque-Bera test for normality
jarqueberaTest(x)
D'Agostino normality test
dagoTest(x)
x is a numeric vector.
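A usage sketch showing how a single test and its P-value can be inspected (fBasics tests return an object whose test slot holds the statistic and P-value; the variable is illustrative):
library(fBasics)
res <- shapiroTest(iris$Sepal.Width)
res                  # prints the test statistic and P-value
res@test$p.value     # P-value only; small values suggest departure from normality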
t-tests
Performs one-sample, two-sample and matched pairs t-tests on numeric variables.
About t-tests
- A one-sample t-test is used to test whether the mean of a population has a value specified in a null hypothesis.
- A two-sample t-test is used to test the null hypothesis that the means of two populations are equal.
- A matched pair t-test is used to test the null hypothesis that the difference between two responses measured on the same statistical unit has a mean value of zero.
Things to think about
- Assumption 1: Each of the two populations being compared should follow a normal distribution. This can be tested using a normality test, such as the Shapiro–Wilk or it can be assessed graphically using a normal quantile plot.
- Assumption 2: If using Student's original definition of the t-test, the two populations being compared should have the same variance. If the sample sizes in the two groups being compared are equal, Student's original t-test is highly robust to the presence of unequal variances. Welch's t-test does not assume equal variances, regardless of whether the sample sizes are similar.
- Assumption 3: The data used to carry out the test should be sampled independently from the two populations being compared.
t-tests
Inputs
R codes
R codes used to generate results
t-tests (t.test):
# One sample t-test
t.test(x, mu=m0) # Ho: mu = m0
x is a numeric vector.
# Two sample t-test
t.test(x, y)
x and y are numeric vectors.
# Match paired t-test
t.test(x, y, paired=TRUE) # Both `x` and `y` must be the same length.
The following options are available:
- To assume equal variances and use a pooled variance estimate: var.equal=TRUE
- To specify the confidence level of the interval: conf.level=0.95
- To specify the alternative hypothesis: alternative=c("two.sided", "less", "greater")
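Illustrative calls combining these options (data and values are hypothetical):
t.test(x, mu = 70)                               # one sample: H0 mu = 70
t.test(x, y, var.equal = TRUE)                   # two sample, pooled variance
t.test(x, y, paired = TRUE, conf.level = 0.95)   # matched pairs, 95% confidence interval
t.test(x, mu = 70, alternative = "greater")      # one-sided alternative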
Bartlett test of homogeneity of variances
Performs Bartlett's test of the null hypothesis that the variances in each of the k groups (samples) are the same.
About the Bartlett test
- The Bartlett test is sensitive to departures from normality. That is, if the samples come from non-normal distributions, then the Bartlett test may simply be testing for non-normality.
Things to think about
- Some statistical tests, for example the analysis of variance, assume that variances are equal across groups or samples. The Bartlett test can be used to verify that assumption.
- The Levene test is an alternative to the Bartlett test that is less sensitive to departures from normality. However, if we have strong evidence that our data do in fact come from a normal, or nearly normal, distribution, then Bartlett's test has better performance.
Bartlett test
Inputs
R codes
R codes used to generate results
Bartlett test of homogeneity of variances (bartlett.test):
bartlett.test(x, y)
x is a numeric vector of data values and y is a vector or factor object giving the group for the corresponding elements of x.
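A usage sketch with the built-in iris data, using the formula interface of bartlett.test (illustrative only):
bartlett.test(Sepal.Length ~ Species, data = iris)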
Levene test
Computes Levene's test for homogeneity of variance across groups.
About the Levene test
- The Levene test is used to assess the equality of variances for a variable calculated for two or more groups. Some common statistical procedures (e.g. ANOVA) assume that variances of the populations from which different samples are drawn are equal. The Levene test assesses this assumption.
- Although the optimal choice of the function used to compute the centre of each group depends on the underlying distribution, the definition based on the median is recommended as the choice that provides good robustness against many types of non-normal data while retaining good power.
Things to think about
- The Levene test is often used before a comparison of means. When the Levene test shows significance, one should switch to more generalised tests that are free from homoscedasticity assumptions (sometimes even non-parametric tests).
Levene test
Inputs
R codes
R codes used to generate results
Levene's test for homogeneity of variance across groups (car package):
leveneTest(x, y, center=c('mean', 'median'))
x is a numeric vector of data values and y is a factor defining groups. center is the name of a function used to compute the centre of each group: mean gives the original Levene test; the default, median, provides a more robust test.
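A usage sketch with the built-in iris data, using the formula interface of leveneTest (illustrative only):
library(car)
leveneTest(Sepal.Length ~ Species, data = iris, center = median)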
ANOVA
ANOVA is a statistical technique that allows us to compare the effects of multiple levels of multiple factors.
About ANOVA
- ANOVA is a generalisation of the two sample t-test when we have more than 2 groups.
- The null hypothesis is that the population mean values for all populations are equal.
Things to think about
- ANOVA assumes normally distributed data within each group and equal variances across groups. Normality can be checked with the normality tests, and equality of variances with the Bartlett or Levene tests.
- There must be few or no outliers in the continuous/discrete data.
- The data must be continuous or discrete and not ordinal.
ANOVA
Inputs
R codes
R codes used to generate results
ANOVA (Analysis of variance) (aov function):
ANOVA <- aov(x ~ y)
summary(ANOVA)
x is a numeric vector and y is a factor, i.e. categorical variable.
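A usage sketch with the built-in iris data (the TukeyHSD() follow-up is an optional extra, not part of the application's output):
fit <- aov(Sepal.Length ~ Species, data = iris)
summary(fit)      # F test of equal group means
TukeyHSD(fit)     # pairwise comparisons if the F test is significant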