About the Numeric Data Analysis application
The Numeric Data Analysis application is an interactive tool that allows you to conduct statistical analyses of numeric variables.
The application makes it easy to:
- Upload your data and download example datasets
- Visualise distribution and relationships among numeric data
- Display data and results in a tabular format
- Test hypotheses about normality, mean values and variances
- Save the results in a report format
- Learn the R codes used to generate the results.
Start the analysis by selecting example data or uploading your own data in the left sidebar. This displays the selected dataset, a list of numeric variables and three tabs: Graphs, Tables and Inference. The Example data tab at the top of the screen contains detailed information about each of the example datasets in this application.
Click the plus sign (+) to open the box with detailed information about the Numeric Data Analysis application features.
Upload and download data
Uploading your data and downloading example dataset
- The Numeric Data Analysis application permits the user to upload a .csv file that contains the data to be analysed. Instructions are given on the Uploading data panel.
- There are a few built-in example datasets which can be used to conduct analysis with numeric variables.
- The application permits the user to download each example dataset.
- The application also permits the user to generate a simple random sample from each dataset.
Graphs tab
Content of the Graphs tab
Click on the Graphs tab on the left sidebar to reveal the following options at the top of the main panel:
Tab | Description |
Histogram | Displays a histogram for the selected numeric variable. |
Histograms | Displays a histogram for the selected numeric variable using the lattice package in R. |
Boxplots | Displays a boxplot for the selected numeric variable using the lattice package in R. |
Stemplots | Displays a stemplot and back-to-back stemplots for the selected numeric variables using the aplpack package in R. |
Dotplots | Displays a dotplot for the selected numeric variable using the stripchart function in R. |
Stripcharts | Displays a stripchart for the selected numeric variable using the lattice package in R. |
QQ plots | Displays a normal quantile plot (QQ plot) for the selected numeric variable using the lattice package in R. |
Correlogram | Displays a correlogram, a graphical presentation of the correlation matrix, using the corrgram package in R. |
Scatterplot matrix | Displays a scatterplot for every pair of numeric variables in the dataset using the car package in R. |
Tables tab
Content of the Tables tab
Click on the Tables tab on the left sidebar to reveal the following options at the top of the main panel:
Tab | Description |
Datasets | Displays the selected dataset, and allows you to download it or take a simple random sample from it. |
Descriptive statistics | Displays summary statistics for the selected numeric variables using the base, psych and pastecs packages in R. |
Outliers detection | Displays a table of the cases identified as outliers by the 1.5 IQR rule. |
Correlation matrix | Displays the correlation matrix for the numeric variables, together with the P-values of the correlation tests. |
Inference tab
Content of the Inference tab
Click on the Inference tab on the left sidebar to reveal the following options at the top of the main panel:
Tab | Description |
Normality tests | Performs nine different normality tests. The null hypothesis for these tests is that the variable is normally distributed. |
t-tests | Performs one-sample, two-sample and matched pairs t-tests. |
Bartlett test | Performs Bartlett's test of the null hypothesis that the variances in each of the groups (samples) are the same. |
Levene test | Performs Levene's test of the null hypothesis of homogeneity of variance (homoscedasticity). It is more robust than the Bartlett test to departures from normality. |
ANOVA | Performs ANOVA for the selected numeric variable and at least two categories of a categorical variable (factor). |
Save results
To save results in a report:
- Enter your name in the textbox at the top of the sidebar.
- Select a document format (PDF, HTML or Word).
- Type any comments you may have about the results in the textbox labelled Interpretation.
- Press the Download Report button.
- Save the report on your disk.
R codes
R codes used to generate results:
- A list of the basic R functions used in this application is given here.
- To learn more about a specific R function and its arguments, type a question mark followed by the function name (e.g. ?table) in the RStudio console.
Instructions on how to prepare and upload your data
This application permits the user to upload a .csv file that contains data to be displayed. It is very important to follow the instructions here.
STEP 1: Check the 'Upload your data' radio button
STEP 2: Click the 'Browse ...' button
The application will only accept a 'comma delimited' text file (.csv). The first row in the .csv file should contain the variable names (a 'header'). If a .csv file is uploaded without a header (and this is indicated by unchecking the entry box on the sidebar), the variables to choose from will be listed as V1, V2, V3, etc., depending on their position in the .csv file. It is best to use .csv files that include variable names as a header row.
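As a rough illustration of how R treats the header row, reading such a file directly would look something like this (the file name mydata.csv is hypothetical):
# Read a comma-delimited file whose first row contains the variable names
mydata <- read.csv("mydata.csv", header = TRUE, sep = ",", quote = "\"")
# Without a header row the columns are named V1, V2, V3, ...
mydata <- read.csv("mydata.csv", header = FALSE, sep = ",", quote = "\"")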
If a spreadsheet was used to enter the data as illustrated below, then leave the default selections of radio buttons in the Separator and Quote sections.
STEP 3: Enter your data in a spreadsheet
The best approach would begin by creating a file in a spreadsheet such as this:
STEP 4: Save it as a .csv file
STEP 5: Open the .csv file in a text editor
Open the .csv file in a text editor (e.g. Notepad) and it should look like this:
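For example, a small file with a header row might look like this (illustrative values only):
Name,Height,Weight
Anna,165,58
Ben,180,75
Cara,172,64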
Description of example datasets
This application contains 2 datasets:
- Edgar Anderson's Iris Data
- 3 Measures of ability: SATV, SATQ, ACT
Click the plus sign (+) to open the box with detailed information about the dataset.
Iris dataset
Edgar Anderson's Iris Data
This famous (Fisher's or Anderson's) iris dataset gives the measurements, in centimetres, of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.
There are 150 cases and 5 variables. The variables and their domains (for numeric variables) and levels (for categorical variable) are as follows:
No | Name | Domain/Levels |
1 | Sepal Length | Numeric variable: from 4.3 cm to 7.9 cm. |
2 | Sepal Width | Numeric variable: from 2 cm to 4.4 cm. |
3 | Petal Length | Numeric variable: from 1 cm to 6.9 cm. |
4 | Petal Width | Numeric variable: from 0.1 cm to 2.5 cm. |
5 | Species | Categorical variable with three categories: Setosa, Versicolor, and Virginica. |
Source
Fisher, R. A. (1936) The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, Part II, 179–188.
The data were collected by Anderson, Edgar (1935). The irises of the Gaspé Peninsula, Bulletin of the American Iris Society, 59, 2–5.
SAPA dataset
3 Measures of ability: SATV, SATQ, ACT
Self-reported scores on the SAT Verbal, SAT Quantitative and ACT were collected as part of the Synthetic Aperture Personality Assessment (SAPA) web-based personality assessment project.
There are 700 cases and 6 variables. The variables and their domains (for numeric variables) and levels (for categorical variables) are as follows:
No | Name | Domain/Levels |
1 | Gender | Categorical variable with two categories: Female / Male. |
2 | Education | Categorical variable: self-reported education (less than 12 years, high school, some college, at college, college graduate, grad/prof). |
3 | Age | Numeric variable: full years at the time of taking the test (from 13 to 65 years). |
4 | ACT | Numeric variable: ACT composite scores range from 1 to 36. |
5 | SATV | Numeric variable: SAT Verbal scores range from 200 to 800. |
6 | SATQ | Numeric variable: SAT Quantitative scores range from 200 to 800. |
Source
Revelle, W., Wilt, J., & Rosenthal, A. (2009) Personality and cognition: The personality-cognition link. In Gruszka, A., Matthews, G., & Szymura, B. (Eds.), Handbook of Individual Differences in Cognition: Attention, Memory and Executive Control. Springer.
Histogram
A histogram is a graphical representation of the distribution of numeric data (i.e. continuous variable).
Histograms are created by dividing up the range of the data into non-overlapping bins/classes of equal width, and counting the number of observations that fall into them. Each bin/class is then represented by a rectangle with the bin/class as its base, where the height of the rectangle is equal to the number of observations in that bin/class.
Closely related to the histogram is the kernel density plot, or density plot. A density plot is often a more effective way to view the distribution of a variable, because histograms are sensitive to the choice of bin/class sizes, whereas a density plot depends more on the data and less on this arbitrary parameter choice.
About a histogram
- There is no gap between bars.
- The y-axis label is Frequency or Count.
- Data is grouped into classes with the end points labelled on the x-axis.
- There is an explanatory title or caption underneath the graph.
Things to think about
- What is the data range?
- How is the data distributed – skewed or symmetric?
- Which is the modal (most frequent) class?
- Are there any outliers?
Histogram
Inputs
R codes
R codes used to generate results
hist(x, freq=TRUE, breaks = bins, col = 'darkgray', border = 'white', main = "Main title", xlab = "Horizontal axis title, i.e. variable name", ylab="Frequency")
lines(density(x), col="blue", lwd = 2)
x is a vector of values for which the histogram is desired.
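As a minimal sketch of the density plot mentioned above (not the application's own code), the density curve can be drawn on its own, or overlaid on a histogram drawn on the density scale:
plot(density(x, na.rm = TRUE), main = "Kernel density estimate", xlab = "Variable name")
hist(x, freq = FALSE, col = "darkgray", border = "white")  # density scale, so the curve is comparable
lines(density(x, na.rm = TRUE), col = "blue", lwd = 2)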
Histogram with factors
A histogram is a graphical representation of the distribution of numeric data (i.e. continuous variable).
Histograms are created by dividing up the range of the data into non-overlapping bins/classes of equal width, and counting the number of observations that fall into them. Each bin/class is then represented by a rectangle with the bin/class as its base, where the height of the rectangle is equal to the number of observations in that bin/class. A categorical variable can be used as a factor - this produces a histogram for each value of the categorical variable.
About a histogram
- There is no gap between bars.
- The y-axis label is Frequency or Count.
- Data is grouped into classes with the end points labelled on the x-axis.
- There is an explanatory title or caption underneath the graph.
Things to think about
- What is the data range?
- How is the data distributed – skewed or symmetric?
- Which is the modal (most frequent) class?
- Are there any outliers?
Histogram
R codes
R codes used to generate results
Histogram (lattice package):
histogram(~ x | y, type="count", main="Main title", ylab="Frequency", xlab = "Horizontal axis title, i.e. variable name")
x is a vector of values for which the histogram is desired and y is a factor, i.e. categorical variable.
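A usage sketch with the built-in iris data (assuming the lattice package is loaded; variable choices are illustrative):
library(lattice)
histogram(~ Sepal.Length | Species, data = iris, type = "count",
          main = "Sepal length by species", xlab = "Sepal length (cm)", ylab = "Frequency")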
Boxplot with factors
A boxplot, sometimes called a box and whisker plot, is a graph used to display the distribution of a quantitative variable based on the five number summary. A categorical variable can be used as a factor - this produces a boxplot for each value of the categorical variable, shown side-by-side.
About a boxplot
- The bottom and top of the box are always the first and third quartiles.
- The line inside the box is always the second quartile (the median).
- The ends of the whiskers usually represent the minimum and maximum of all of the values of the variable. Any value not included between the whiskers is plotted as an outlier with a dot.
- A boxplot provides information about the shape of a distribution. If a distribution is symmetric the boxplot shows the median roughly in the middle of the box.
- If the longer part of the box is to the left (or below) the median, i.e. observations are concentrated on the low end of the scale, the distribution is said to be skewed left. If the longer part is to the right (or above) the median, the distribution of the variable is skewed right.
Things to think about
- The boxplot shows the spread of all values of a variable. The range is the distance between the smallest value and the largest value.
- The boxplot also shows another measure of spread, the interquartile range (IQR). The IQR is represented by the length of the box (third quartile minus first quartile).
- Side-by-side boxplots are particularly useful for comparison of distributions between several groups or sets of data.
Boxplot
Inputs
R codes
R codes used to generate results
Box plot or box-whisker plot (lattice package):
bwplot(~ x | y, main="Main title", xlab = "Horizontal axis title, i.e. variable name")
x is a vector of values for which the boxplot is desired and y is a factor, i.e. categorical variable.
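A usage sketch with the built-in iris data (illustrative only):
library(lattice)
bwplot(Sepal.Length ~ Species, data = iris,
       main = "Sepal length by species", ylab = "Sepal length (cm)")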
Stemplot
A stemplot (also called a stem-and-leaf plot) gives a quick picture of the shape of a distribution while including the actual numeric values in the graph.
About a stemplot
- A stemplot displays the sorted data from smallest to largest.
- A stemplot is used to display quantitative data, generally from small datasets (50 or fewer observations).
- A stemplot allows easy identification of the range and highlights extreme values (‘outliers’).
- The stemplot groups the data into ‘bins’ determined by the choice of stems.
Things to think about
- When comparing two related distributions a back-to-back stemplot can be used. The leaves on each side are ordered out from the common stem.
- In some cases, we can decide to double the number of stems in a plot by splitting each stem into two.
- The depths column contains a number with parentheses around it. The frequency of the row/stem containing the median is placed in these parentheses. It accumulates the values from the top and the bottom, but it stops in each direction when it reaches the row containing the middle value (median) of the variable.
Stem and Leaf Plot
Inputs
R codes
R codes used to generate results
Stemplot, also known as stem and leaf plot (aplpack package):
stem.leaf(x)
x is a vector of values for which the stemplot is desired.
Back-to-back stemplots
stem.leaf.backback(x1, x2)
x1 and x2 are vectors of values for which the back-to-back stemplots are desired. They are created using a factor, i.e. categorical variable. This only works when the categorical variable has two categories (e.g. Female, Male).
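A usage sketch showing how x1 and x2 can be created from a factor, assuming the stem.leaf functions come from the aplpack package:
library(aplpack)
x1 <- iris$Sepal.Length[iris$Species == "setosa"]
x2 <- iris$Sepal.Length[iris$Species == "versicolor"]
stem.leaf.backback(x1, x2)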
Dot plot with factors
A dot plot is a graph used to display observations on a fairly simple scale, typically using filled-in circles. If a categorical variable is used as a factor, a dot plot will be displayed for each value of the variable.
About a dot plot
- Dot plots are suitable for small to moderately sized data sets.
- They are useful for highlighting clusters and gaps, as well as outliers.
Things to think about
- Dot plots tend not to be as useful for judging shape as histograms and stemplots.
- Dot plots tend not to present as smooth a picture as histograms.
Dot plot
R codes
R codes used to generate results
Dot plot/Stripchart (stripchart function):
stripchart(x ~ y, method="stack", main="Main title", xlab = "Horizontal axis title, i.e. variable name")
x is a vector of values for which the dot plot is desired and y is a factor, i.e. categorical variable.
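A usage sketch with the built-in iris data; method = "stack" stacks repeated values so the result looks like a dot plot (an assumption, not necessarily the application's exact settings):
stripchart(Sepal.Length ~ Species, data = iris, method = "stack",
           main = "Sepal length by species", xlab = "Sepal length (cm)")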
Stripchart with factors
Stripcharts produce one dimensional scatter plots of the given data. If a categorical variable is used as a factor, a stripchart will be displayed for each value of the variable.
About a stripchart
- Stripcharts are a good alternative to boxplots when the number of observations is small.
- They are useful for highlighting clusters and gaps, as well as outliers.
Things to think about
- It is common to use systematic jittering in a stripchart, i.e. repeated values are offset so that all observations are visible.
Stripchart
R codes
R codes used to generate results
Stripchart (lattice package):
stripplot(~ x | y, jitter.data=TRUE, factor=2, main="Main title", xlab = "Horizontal axis title, i.e. variable name")
x is a vector of values for which the stripchart is desired and y is a factor, i.e. categorical variable.
QQ plot with factors
Normal quantile plots plot quantiles of the data against quantiles of the normal distribution. If a categorical variable is used as a factor, a QQ plot will be displayed for each value of the variable.
About a normal QQ plot
- It plots the Z-score (or normal score) on the x-axis and the variable you are investigating on the y-axis.
- It has a title or caption underneath the graph.
- A common task when analysing continuous numerical variables is to compare them to a theoretical distribution. The most commonly used tool for this job is the theoretical Q-Q plot. For a good fit, a Q-Q plot is roughly linear, with systematic deviations suggesting a lack of fit.
Things to think about
- Is the data normally distributed? If it is, the data will cluster round a straight line.
- Is there skewness shown by deviations in the left or right tails?
- Are there any obvious outliers?
- Granularity may be showing – this indicates several observations with the same value.
QQ Plot
R codes
R codes used to generate results
Normal quantile plot, also known as QQ plot (lattice package):
qqmath(~ x | y, main = "Main title", ylab = "Variable name", xlab = "z-score", distribution = qnorm,
prepanel = prepanel.qqmathline,
panel = function(...){
panel.qqmath(...)
panel.qqmathline(...)
})
x is a vector of values for which the QQ plot is desired and y is a factor, i.e. categorical variable.
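A usage sketch with the built-in iris data (lattice package assumed loaded):
library(lattice)
qqmath(~ Sepal.Length | Species, data = iris, xlab = "z-score", ylab = "Sepal length (cm)",
       prepanel = prepanel.qqmathline,
       panel = function(x, ...) {
         panel.qqmath(x, ...)
         panel.qqmathline(x, ...)  # reference line through the quartiles
       })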
Correlogram
A correlogram is a graphical display of a correlation matrix.
About a correlogram
- In this plot, correlation coefficients are coloured according to the value.
- Positive correlations are displayed in blue and negative correlations in red.
- Colour intensity and the size of the circle are proportional to the correlation coefficients.
Things to think about
- The correlation matrix can be reordered according to the values of the correlation coefficient. This is important to identify the hidden structure and pattern in the matrix.
Correlogram
R codes
R codes used to generate results
Correlogram (corrgram package):
corrgram(cor, order=TRUE, lower.panel=panel.shade, upper.panel=panel.pie, text.panel=panel.txt, main = "Correlogram")
cor is a data frame or correlation matrix.
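A usage sketch with the numeric columns of the built-in iris data (illustrative only):
library(corrgram)
corrgram(iris[, 1:4], order = TRUE, lower.panel = panel.shade,
         upper.panel = panel.pie, text.panel = panel.txt,
         main = "Correlogram of the iris measurements")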
Scatterplot matrix
A scatter plot provides a graphical view of the relationship between two numeric variables.
About a scatterplot matrix
- The scatterplot matrix shows all the pairwise scatterplots of the variables on a single view with multiple scatterplots in a matrix format.
- A plot located on the intersection of i-th row and j-th column is a plot of i-th and j-th variables. This means that each row and column is one dimension, and each cell plots a scatterplot of two dimensions.
- Optionally, the scatterplot matrix includes lowess and linear best-fit lines, as well as boxplots, densities, or histograms on the principal diagonal and rug plots in the margins of the cells.
Things to think about
- The purpose of the scatterplot is to look at the relationship between the variables and determine if there are any problems/issues with the data or if the scatterplot indicates anything unique or interesting about the data (How is the data dispersed? Are there outliers?).
- The scatterplot could show nonlinear relationships between variables. The ability to do this can be enhanced by adding a smooth line such as loess.
Inputs
Scatterplot Matrix
R codes
R codes used to generate results
Scatterplot (car package):
scatterplotMatrix(x, var.labels=colnames(x), diagonal=c("density", "boxplot", "histogram", "oned", "qqplot", "none"), main="Scatterplot Matrix")
x is a data matrix or numeric data frame.
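A usage sketch with the numeric columns of the built-in iris data (the diagonal argument is omitted here because its form differs between versions of the car package):
library(car)
scatterplotMatrix(iris[, 1:4], main = "Scatterplot Matrix")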
Box-Cox transformation
A procedure used to identify an appropriate power transformation that brings the data closer to a 'normal shape'.
About a Box-Cox transformation
- A variable transformation is used to eliminate skewness and other distributional features that complicate analysis. Often the goal is to find a simple transformation that leads to normality.
- The lambda parameter in the Box-Cox transformation indicates the power to which all observations of the variable should be raised.
- Lambda = -1 corresponds to the transformation 1/X; Lambda = -0.5 to 1/sqrt(X); Lambda = 0 to log(X); Lambda = 1 to X (no transformation); and Lambda = 2 to X^2.
Things to think about
- The Box-Cox power transformation is not a guarantee for normality. This is because it actually does not really check for normality; the method checks for the smallest standard deviation.
- The Box-Cox power transformation only works if all the data values are strictly positive. This, however, can usually be achieved easily by adding a constant to all values so that they all become positive before the transformation.
Box-Cox transformation
Inputs
R codes
R codes used to generate results
Box-Cox transformation (MASS package):
boxcox(x~1)
x is a vector of values for which the Box-Cox transformation is required.
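A sketch of how the suggested lambda can be read off the profile returned by boxcox() (variable name illustrative):
library(MASS)
bc <- boxcox(x ~ 1)              # x must contain only positive values
lambda <- bc$x[which.max(bc$y)]  # lambda with the highest log-likelihood
# e.g. lambda near 0 suggests log(x); near 0.5 suggests sqrt(x); near 1 suggests no transformation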
Display and download dataset
This tab displays the selected dataset and allows you to download the file with the example dataset used, as well as take a simple random sample from it.
- To take a simple random sample from a dataset, select the check box labelled Take a simple random sample.
- Set the Sample size to the desired number of observations.
- Change the seed if a new sample is required. When the same seed is used, the same sample will be generated each time the Press to generate a sample button is clicked.
- Select the check box labelled Take a sample with replacement if a sample with replacement is wanted.
- Click the Download sample button to download the generated sample and save it on a hard drive.
- To analyse the sample, upload the saved .csv file and continue using the application.
Dataset
Inputs
R codes
R codes used to generate results
Take a simple random sample:
set.seed(10) # initiate the random number generator
data[sample(1:nrow(data), size=sampleSize, replace=FALSE),]
data is a data frame from which a simple random sample without replacement is taken.
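If the Take a sample with replacement check box is selected, the same call presumably uses replace=TRUE, for example:
data[sample(1:nrow(data), size=sampleSize, replace=TRUE),]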
Descriptive statistics
Descriptive, or summary, statistics are used to represent and describe nearly every dataset. They also form the building blocks for much more complicated statistical methods and models.
Three R packages (base, psych and pastecs) cover almost all descriptive statistics.
Descriptive statistics (base package)
Descriptive statistics for the selected numeric variable and factor (base package)
Descriptive statistics (psych package)
Descriptive statistics (pastecs package)
R codes
R codes used to generate results
Descriptive statistics (base package):
summary(x)
x is a data frame.
Descriptive statistics (psych package):
describe(x, skew=FALSE, ranges=FALSE)
Descriptive statistics grouped by one of the categorical variables (psych package):
describeBy(x, group=y, skew=FALSE, ranges=FALSE)
y is a categorical variable.
Descriptive statistics (pastecs package):
stat.desc(x)
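A usage sketch with the built-in iris data (package loading shown explicitly; column selection is illustrative):
library(psych)
describeBy(iris[, 1:4], group = iris$Species, skew = FALSE, ranges = FALSE)
library(pastecs)
stat.desc(iris[, 1:4])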
Detecting outliers using 1.5 IQR rule
A table is displayed with the index/ID and value of each case identified as an outlier.
Detecting outliers
Inputs
R codes
R codes used to generate results
Detecting outliers using the 1.5 IQR rule:
# Create space to store the outliers and their indices
Outliers <- c()
idxOutliers <- c()
# Compute the lower and upper fences of the 1.5 IQR rule
Upper <- quantile(x, 0.75, na.rm=TRUE) + (IQR(x, na.rm=TRUE) * 1.5)
Lower <- quantile(x, 0.25, na.rm=TRUE) - (IQR(x, na.rm=TRUE) * 1.5)
# Get the indices of values outside the fences
index <- which(x < Lower | x > Upper)
# Store the values of the outliers
Outliers <- c(Outliers, x[index])
# Store the indices of the outliers
idxOutliers <- c(idxOutliers, index)
output <- cbind(idxOutliers, Outliers)
x is a numeric vector.
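A minimal sketch wrapping the same 1.5 IQR rule in a reusable function (find_outliers is a hypothetical helper, not part of the application):
find_outliers <- function(x) {
  upper <- quantile(x, 0.75, na.rm = TRUE) + 1.5 * IQR(x, na.rm = TRUE)  # upper fence
  lower <- quantile(x, 0.25, na.rm = TRUE) - 1.5 * IQR(x, na.rm = TRUE)  # lower fence
  idx <- which(x < lower | x > upper)
  cbind(index = idx, value = x[idx])
}
find_outliers(iris$Sepal.Width)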
Correlation matrix
A correlation matrix is used to investigate the association between multiple variables at the same time.
About a correlation
- A correlation coefficient (Pearson's r) measures the strength of linear relationship between two numeric variables.
- It takes values between -1 (perfect negative association) and +1 (perfect positive association).
- The correlation matrix is symmetric because the correlation between i-th and j-th variables is the same as the correlation between j-th and i-th variables.
- The P-value determines the significance of the test of Pearson's correlation; the null hypothesis is that the correlation is zero, against a two-sided alternative.
Things to think about
- Pearson's r is a valid measure of correlation if there are no outliers.
- Pearson's r is a valid measure of correlation if the relationship between the variables is linear.
- The variables must be continuous or discrete and not ordinal.
Correlation matrix
Correlation matrix - P-values
R codes
R codes used to generate results
cor(x, use="pairwise.complete.obs")
x is a matrix or data frame.
Calculating P-values (Hmisc package):
signif(cor$P, 2)
cor here is the object holding the correlation results (e.g. as returned by the Hmisc rcorr() function); its P component contains the matrix of P-values.
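A minimal sketch of the full calculation, assuming the rcorr() function from the Hmisc package produces the object used above:
library(Hmisc)
cor <- rcorr(as.matrix(x))   # x: numeric data frame or matrix
signif(cor$r, 2)             # correlation coefficients
signif(cor$P, 2)             # P-values for H0: correlation = 0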
Normality tests
The normality tests are supplementary to the graphical assessment of normality (QQ plots).
About normality tests
- The normality assumption should be checked before using parametric statistical tests, such as t-tests.
- It is preferable that normality be assessed both visually and through normality tests, of which the Shapiro-Wilk test is highly recommended.
- The normality tests compare the scores in the sample to a normally distributed set of scores with the same mean and standard deviation; the null hypothesis is that 'sample distribution is normal.' If the result is significant, the distribution is non-normal.
Things to think about
- For small sample sizes, normality tests have little power to reject the null hypothesis and therefore small samples most often pass normality tests.
- For large sample sizes, significant results may be derived even in the case of a small deviation from normality, although this small deviation will not affect the results of a parametric test.
- Kolmogorov-Smirnov test is not recommended when parameters are estimated from the data, regardless of sample size.
Normality tests
Inputs
R codes
R codes used to generate results
Normality tests (fBasics package):
Anderson-Darling normality test
adTest(x)
Cramer-von Mises normality test
cvmTest(x)
Lilliefors (Kolmogorov-Smirnov) normality test
lillieTest(x)
Pearson chi-square normality test
pchiTest(x)
Shapiro-Francia normality test
sfTest(x)
Kolmogorov-Smirnov normality test
ksnormTest(x)
Shapiro-Wilk's test for normality
shapiroTest(x)
Jarque-Bera test for normality
jarqueberaTest(x)
D'Agostino normality test
dagoTest(x)
x is a numeric vector.
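A usage sketch showing how a single test and its P-value can be inspected (fBasics tests return an object whose test slot holds the statistic and P-value; the variable is illustrative):
library(fBasics)
res <- shapiroTest(iris$Sepal.Width)
res                  # prints the test statistic and P-value
res@test$p.value     # P-value only; small values suggest departure from normality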
t-tests
Performs one-sample, two-sample and matched pairs t-tests on numeric variables.
About t-tests
- A one-sample t-test is used to test whether the mean of a population has a value specified in a null hypothesis.
- A two-sample t-test is used to test the null hypothesis that the means of two populations are equal.
- A matched pair t-test is used to test the null hypothesis that the difference between two responses measured on the same statistical unit has a mean value of zero.
Things to think about
- Assumption 1: Each of the two populations being compared should follow a normal distribution. This can be tested using a normality test, such as the Shapiro–Wilk or it can be assessed graphically using a normal quantile plot.
- Assumption 2: If using Student's original definition of the t-test, the two populations being compared should have the same variance. If the sample sizes in the two groups being compared are equal, Student's original t-test is highly robust to the presence of unequal variances. Welch's t-test does not assume equal variances, regardless of whether the sample sizes are similar.
- Assumption 3: The data used to carry out the test should be sampled independently from the two populations being compared.
t-tests
Inputs
R codes
R codes used to generate results
t-tests (t.test):
# One sample t-test
t.test(x, mu=m0) # Ho: mu = m0
x is a numeric vector.
# Two sample t-test
t.test(x, y)
x and y are numeric vectors.
# Match paired t-test
t.test(x, y, paired=TRUE) # Both `x` and `y` must be the same length.
The following options are available:
- To assume equal variances and use a pooled variance estimate: var.equal=TRUE
- To specify the confidence level of the interval: conf.level=0.95
- To specify the alternative hypothesis: alternative=c("two.sided", "less", "greater")
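Illustrative calls combining these options (data and values are hypothetical):
t.test(x, mu = 70)                               # one sample: H0 mu = 70
t.test(x, y, var.equal = TRUE)                   # two sample, pooled variance
t.test(x, y, paired = TRUE, conf.level = 0.95)   # matched pairs, 95% confidence interval
t.test(x, mu = 70, alternative = "greater")      # one-sided alternative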
Bartlett test of homogeneity of variances
Performs Bartlett's test of the null hypothesis that the variances in each of the k groups (samples) are the same.
About the Bartlett test
- The Bartlett test is sensitive to departures from normality. That is, if the samples come from non-normal distributions, then the Bartlett test may simply be testing for non-normality.
Things to think about
- Some statistical tests, for example the analysis of variance, assume that variances are equal across groups or samples. The Bartlett test can be used to verify that assumption.
- The Levene test is an alternative to the Bartlett test that is less sensitive to departures from normality. However, if we have strong evidence that our data do in fact come from a normal, or nearly normal, distribution, then Bartlett's test has better performance.
Bartlett test
Inputs
R codes
R codes used to generate results
Bartlett test of homogeneity of variances (bartlett.test):
bartlett.test(x, y)
x is a numeric vector of data values and y is a vector or factor object giving the group for the corresponding elements of x.
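A usage sketch with the built-in iris data, using the formula interface of bartlett.test (illustrative only):
bartlett.test(Sepal.Length ~ Species, data = iris)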
Levene test
Computes Levene's test for homogeneity of variance across groups.
About the Levene test
- The Levene test is used to assess the equality of variances for a variable calculated for two or more groups. Some common statistical procedures (e.g. ANOVA) assume that variances of the populations from which different samples are drawn are equal. The Levene test assesses this assumption.
- Although the optimal choice of the function used to compute the centre of each group depends on the underlying distribution, the definition based on the median is recommended as the choice that provides good robustness against many types of non-normal data while retaining good power.
Things to think about
- The Levene test is often used before a comparison of means. When the Levene test shows significance, one should switch to more generalised tests that are free from homoscedasticity assumptions (sometimes even non-parametric tests).
Levene test
Inputs
R codes
R codes used to generate results
Levene's test for homogeneity of variance across groups (car package):
leveneTest(x, y, center=c('mean', 'median'))
x is a numeric vector of data values and y is a factor defining groups. center is the name of a function used to compute the centre of each group: mean gives the original Levene test; the default, median, provides a more robust test.
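A usage sketch with the built-in iris data, using the formula interface of leveneTest (illustrative only):
library(car)
leveneTest(Sepal.Length ~ Species, data = iris, center = median)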
ANOVA
ANOVA is a statistical technique that allows us to compare the effects of multiple levels of multiple factors.
About ANOVA
- ANOVA is a generalisation of the two sample t-test when we have more than 2 groups.
- The null hypothesis is that the population mean values for all populations are equal.
Things to think about
- ANOVA assumes normally distributed data within each group and equal variances across groups. Normality can be checked with the normality tests, and equality of variances with the Bartlett or Levene tests.
- There must be few or no outliers in the continuous/discrete data.
- The data must be continuous or discrete and not ordinal.
ANOVA
Inputs
R codes
R codes used to generate results
ANOVA (Analysis of variance) (aov function):
ANOVA <- aov(x ~ y)
summary(ANOVA)
x is a numeric vector and y is a factor, i.e. categorical variable.
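A usage sketch with the built-in iris data (the TukeyHSD() follow-up is an optional extra, not part of the application's output):
fit <- aov(Sepal.Length ~ Species, data = iris)
summary(fit)      # F test of equal group means
TukeyHSD(fit)     # pairwise comparisons if the F test is significant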