About the Categorical Data Analysis application
The Categorical Data Analysis application is an interactive tool which allows you to conduct statistical analyses with categorical/qualitative variables.
The Categorical Data Analysis application makes it easy to:
- Upload your data and download example datasets
- Visualise frequencies and relationships among categorical data
- Display data and results in a tabular format
- Test hypotheses about independence in contingency tables
- Save the results in a report format
- Learn the R code used to generate results
Start the analysis by selecting example data or by uploading your own data in the left sidebar. This will display the selected dataset, a list of categorical variables and three tabs: Graphs, Tables and Inference. The Example data tab at the top of the screen contains detailed information about each of the example datasets in this application.
Click the plus sign (+) to open the box with detailed information about the Categorical Data Analysis application features.
Upload and download data
Uploading your data and downloading example datasets
- The Categorical Data Analysis application permits the user to upload a .csv file that contains data to be displayed. Instructions are given on the Uploading data panel.
- There are a few built-in example datasets which can be used to conduct analysis with categorical variables.
- This application also permits the user to download each example dataset.
Graphs tab
Content of the Graphs tab
Click on the Graphs tab on the left sidebar to reveal the following options at the top of the main panel:
| Tab | Description |
| --- | --- |
| Bar chart | Displays a bar chart for the selected categorical variable. |
| Dotplot | Displays a dotplot for the selected categorical variable using the lattice package in R. |
| Pie chart | Displays a pie chart for the selected categorical variable. |
| Mosaic plot | Displays a mosaic plot for two categorical variables using the vcd package in R. |
| Association plot | Displays an association plot for two categorical variables using the vcd package in R. |
| Agreement plot | Displays an agreement plot for two categorical variables using the vcd package in R. |
| Correspondence plot | Displays a 2D correspondence analysis plot for two categorical variables using the ca package in R. |
Tables tab
Content of the Tables tab
Click on the Tables tab on the left sidebar to reveal the following options at the top of the main panel:
- Datasets
- Descriptive statistics
- Frequency tables
- Association statistics
- Correspondence analysis
Inference tab
Content of the Inference tab
Click on the Inference tab on the left sidebar to reveal the following options at the top of the main panel:
| Tab | Description |
| --- | --- |
| Chi-squared test | Performs the chi-squared test of independence of the row and column variables in two-way contingency tables. |
| Fisher exact test | Provides an exact test of independence in a two-dimensional contingency table. |
| Mantel-Haenszel test | Provides a test of conditional independence between two nominal variables in each stratum of a three-dimensional contingency table. |
| Log-linear model tests | Contains four tests for log-linear models based on a three-way contingency table. |
Save results
To save results in a report:
- Enter your name in the textbox (at the top of the sidebar).
- Select a document format (PDF, HTML or Word).
- Type any comments you may have about the results in the textbox labelled Interpretation.
- Press the Download Report button.
- Save the report on your disc.
R codes
R codes used to generate results:
- A list of the basic R functions used in this application is given here.
- To learn more about a specific R function and its arguments, type a question mark followed by the function name (e.g. ?table) in the RStudio console.
Instructions on how to prepare and upload your data
This application permits the user to upload a .csv file that contains data to be displayed. It is important to follow the instructions below.
STEP 1: Check 'Upload your data' radio button
STEP 2: Click 'Browse ...' button
The application will only accept a 'comma delimited' text file (.csv). The first row in the .csv file should contain the variable names (a 'header'). If a .csv file is uploaded without a header (and this is indicated by unchecking the entry box on the sidebar), the variables to choose from will be listed as V1, V2, V3, etc., depending on their position in the .csv file. It is best to use .csv files that include variable names as a header row.
If a spreadsheet was used to enter the data as illustrated below, then leave the default selections of radio buttons in the Separator and Quote sections.
STEP 3: Enter your data in a spreadsheet
The best approach would begin by creating a file in a spreadsheet such as this:
STEP 4: Save it as a .csv file
STEP 5: Open the .csv file in a text editor
Open the .csv file in a text editor (e.g. Notepad) and it should look like this:
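As an illustration only (hypothetical values, reusing the variable names from the Hair and eye colour example dataset), the contents of such a .csv file might look like this:
Hair,Eye,Sex
Black,Brown,Male
Blond,Blue,Female
Red,Green,Female
Brown,Hazel,Male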
Description of example datasets
This application contains two example datasets:
- Survival of passengers on the Titanic
- Hair and eye colour of statistics students
Both datasets contain categorical variables only. Please note that some of the graphical methods, tables and tests require two or three categorical variables and will not generate output if fewer are available.
Click the plus sign (+) to open the box with detailed information about the dataset.
Titanic dataset
Survival of passengers on the Titanic
This dataset provides information on the fate of passengers on the fatal maiden voyage of the ocean liner ‘Titanic’, summarised according to economic status (class), sex, age and survival.
There are 2201 observations on 4 variables. The variables and their levels are as follows:
| No | Name | Levels |
| --- | --- | --- |
| 1 | Class | 1st, 2nd, 3rd, Crew |
| 2 | Sex | Male, Female |
| 3 | Age | Child, Adult |
| 4 | Survived | No, Yes |
Source
Dawson, Robert J. MacG. (1995). The ‘unusual episode’ data revisited. Journal of Statistics Education, 3. http://www.amstat.org/publications/jse/v3n3/datasets.dawson.html
The source provides a dataset recording class, sex, age, and survival status for each person on board the Titanic, and is based on data originally collected by the British Board of Trade and reprinted in:
British Board of Trade (1990). Report on the loss of the ‘Titanic’ (S.S.). British Board of Trade Inquiry Report (reprint). Gloucester, UK: Allan Sutton Publishing.
Hair and eye colour dataset
Hair and eye colour of statistics students
This dataset provides information on the hair colour, eye colour and sex of 592 statistics students.
There are 592 observations on 3 variables. The variables and their levels are as follows:
| No | Name | Levels |
| --- | --- | --- |
| 1 | Hair | Black, Brown, Red, Blond |
| 2 | Eye | Brown, Blue, Hazel, Green |
| 3 | Sex | Male, Female |
Source
Snee, R. D. (1974). Graphical display of two-way contingency tables. The American Statistician, 28, 9–12.
Friendly, M. (1992a). Graphical methods for categorical data. SAS User Group International Conference Proceedings, 17, 190-200. http://www.math.yorku.ca/SCS/sugi/sugi17-paper.html
Bar chart
A bar chart is a graphical representation of the distribution of qualitative data (i.e. a categorical variable).
A bar chart is a form of graphical representation of categorical variables. It displays data classified into a number of (usually unordered) categories. Equal-width rectangular bars are constructed over each category, with height equal to the observed frequency of the category. A stacked (component) bar chart shows the component parts as sections of the bar, with lengths in proportion to their relative size.
A grouped (clustered) bar chart shows information about different sub-groups of the main categories. A separate bar represents each of the sub-groups (usually coloured or shaded differently to distinguish between them). In such cases, a legend or key is provided.
About a bar chart
- All bars have the same width.
- Bars can be shown vertically or horizontally.
- The categories are shown on one of the axes.
- The frequency of the data in each category is represented by the height/length of the bar.
- Titles and labels for both axes should be included.
- A legend should be included for stacked and grouped bar charts.
- There is an explanatory title or caption underneath the graph.
Things to think about
- Is the order of categories important? In some cases alphabetical ordering or some other arrangement might be used to produce a more useful graphical display.
- Which is the modal or most frequent category?
Bar chart
Inputs
R codes
R codes used to generate results
Bar chart:
counts <- table(x)
barplot(counts, main="Main title", xlab="Horizontal axis title")
x is a vector or matrix.
Horizontal bar chart:
counts <- table(x)
barplot(counts, horiz = TRUE, main="Main title")
Stacked bar chart:
counts <- table(x, y)
barplot(counts, col=c("blue", "red"), legend=rownames(counts), main="Main title", xlab="Horizontal axis title")
x and y are vectors.
Grouped bar chart:
counts <- table(x, y)
barplot(counts, beside=TRUE, col=c("blue", "red"), legend=rownames(counts), main="Main title", xlab="Horizontal axis title")
x and y are vectors.
Dotplot
A dotplot is a type of graphic display used to compare frequency counts within categories or groups.
Dotplots are an alternative to bar charts or pie charts, and look somewhat like a horizontal bar chart where the bars are replaced by a dot at the frequency associated with each category. Optionally, horizontal lines are also included connecting dots with the vertical axis.
About a dotplot
- Dotplots are less cluttered than bar charts.
- Compared to other types of graphic display, dotplots are used most often to plot frequency counts within a small number of categories, usually with small sets of data.
- Titles and labels for both axes should be included.
- The title for the horizontal axis is Frequency or Count.
- There is an explanatory title or caption underneath the graph.
Things to think about
- Multi-panel side-by-side displays might be used for comparing several contrasting or similar cases. Use the same scale for the horizontal axis across different panels.
- Consider ordering categories by the frequencies represented, for more accurate perception.
Dotplot
R codes
R codes used to generate results
Dotplot in lattice package:
library(lattice) # load the lattice package
myTable <- table(x, y)
dotplot(myTable, groups=FALSE, auto.key=list(lines=TRUE), type=c("p", "h"), xlab="Frequency",
prepanel = function (x, y) {
list(ylim = levels(reorder(y, x)))
},
panel = function(x, y, ...){
panel.dotplot(x, reorder(y,x), ...)
})
x and y are vectors.
Pie chart
A pie chart is a graphical technique for presenting the relative frequencies associated with the observed values of a categorical variable.
The pie chart consists of a circle subdivided into sectors (sometimes called slices) whose sizes are proportional to the quantities they represent. Such displays are popular in the media but have little relevance for serious scientific work, where other graphics are generally far more useful (e.g. bar charts and dotplots).
About a pie chart
- Slices have to be mutually exclusive; by definition, they cannot overlap.
- There are two features that let us read the values on a pie chart: the angle a slice covers (compared to the full circle), and the area of a slice (compared to the entire circle).
- Use of 3D pie charts is not recommended. Firstly, it makes the chart more difficult to read. Secondly, different software may produce different 3D charts. A third problem with 3D charts is that they suggest we know more about the data than we really do. In the case of pie charts, adding unnecessary dimensions or perspective distorts the data.
- A legend should be included, or each slice should have its own label. The number or percentage in each slice can also be shown.
- There is an explanatory title or caption underneath the chart.
Things to think about
- We are not very good at measuring angles, but we recognize 90 and 180 degree angles with very high precision. Slices that cover half or a quarter of the circle will therefore stand out. Others can be compared with some success, but reading actual numbers from a pie chart is next to impossible.
- Do the parts make up a meaningful whole? If not, use a different chart.
- Do we want to compare the parts to each other or the parts to the whole? If the main purpose is to compare between the parts, use a different chart. The main purpose of the pie chart is to show part-whole relationships.
- How many parts do we have? If there are more than five to seven, use a different chart. Pie charts with lots of slices (or slices of very different size) are hard to read.
Pie chart
Inputs
R codes
R codes used to generate results
Pie chart displaying label and percentage for each slice:
myTable <- table(x)
percentlabels <- round(100*myTable/sum(myTable), 1)
labs <- paste(names(myTable), "\n", percentlabels, "%", sep="")
pie(myTable, labels=labs, col=rainbow(length(levels(x))), main="Main title")
x is a vector with observations of a categorical variable.
Mosaic plot
A mosaic plot is a graphical display that allows examination of the relationship among two or more categorical variables.
The mosaic plot is a graphical representation of the two-way frequency table or contingency table. A mosaic plot is divided into rectangles, so that the vertical length of each rectangle represents the proportions of the Y variable at each level of the X variable.
About a mosaic plot
- The displayed variables are categorical or ordinal scales.
- The plot is of at least two variables. There is no upper limit on the number of variables, but too many variables may be confusing in graphic form.
- The surfaces of the rectangular fields that are available for a combination of features are proportional to the number of observations that have this combination of features.
- Independence is shown when the boxes across categories all have the same areas.
- The significance of different frequencies of the various characteristic values cannot be observed visually.
- The colours represent the level of the residual for that cell / combination of levels. The legend is presented at the plot's right. Blue means there are more observations in that cell than would be expected under independence. Red means there are fewer observations than would have been expected. This shows which cells are contributing to the significance of the chi-squared test result.
- For the Hair and eye colour dataset the mosaic plot indicates that there are more blue-eyed blond students than expected under independence and too few brown-eyed blond students.
Things to think about
- Variables that represent an exposure or treatment status should usually represent the first split (i.e. division into rectangles) and variables that represent an outcome should represent the second split.
- The categorical variables should be sorted first. Then each variable is assigned to an axis. Another order of categorical variables will result in a different mosaic plot, i.e., as in all multivariate plots, the order of variables plays a role.
Mosaic plot
R codes
R codes used to generate results
Mosaic plot (vcd package):
library(vcd) # load the vcd package
myTable <- table(x, y)
mosaic(myTable, legend=TRUE, shade=TRUE, las=2, col=TRUE)
x and y are vectors.
Association plot
An association plot indicates deviations from a specified independence model in a possibly high-dimensional contingency table.
About an association plot
- The rectangles for each row in the table are positioned relative to a baseline representing independence shown by a dotted line.
- Cells with an observed frequency greater than the expected frequency (assuming independence) rise above the line and are coloured blue; cells that contain less than the expected frequency fall below it and are shaded red.
- The main purpose of the shading is not to visualize significance but the pattern of deviation from independence.
Things to think about
- Variables that represent an exposure or treatment status should usually represent the first split (i.e. division into rectangles) and variables that represent an outcome should represent the second split.
- The categorical variables should be sorted first. Then each variable is assigned to an axis. A different order of the categorical variables will result in a different association plot, i.e., as in all multivariate plots, the order of variables plays a role.
Association plot
R codes
R codes used to generate results
Association plot (vcd package):
library(vcd) # load the vcd package
myTable <- table(x, y)
assoc(myTable, legend=TRUE, shade=TRUE, col=TRUE)
x and y are vectors.
Agreement plot
An agreement plot provides a simple graphic representation of the strength of agreement in a contingency table, and a measure of strength of agreement with an intuitive interpretation.
Inter-observer agreement is often used as a method of assessing the reliability of a subjective classification or assessment procedure. For example, two (or more) clinical psychologists might classify patients on a scale with categories: normal, mildly impaired, severely impaired.
About an agreement plot
- The agreement chart is constructed as an n x n square, where n is the total sample size.
- Black squares show observed agreement. These are positioned within larger rectangles.
- The large rectangle shows the maximum possible agreement, given the marginal totals.
Things to think about
- Observers' ratings can be strongly associated without strong agreement.
- If observers tend to use the categories with different frequency, this will affect measures of agreement.
Agreement plot
R codes
R codes used to generate results
Agreement plot (vcd package):
library(vcd) # load the vcd package
myTable <- table(x, y)
agreementplot(myTable)
x and y are vectors.
Correspondence analysis plot
A graphical display of the relationship between categorical variables in a type of scatterplot diagram.
Two categorical variables are displayed in the form of a contingency table, i.e. a two-way table. From the contingency table a set of coordinate values representing the row and column categories is derived. A small number of these derived coordinate values (usually two) are then used to display the table graphically.
About a correspondence plot
- The row profile is defined as the counts in a row divided by the total count for that row. If two rows have very similar row profiles, their points in the correspondence analysis plot are close together.
- The coordinates are analogous to those resulting from a principal component analysis of continuous variables.
- They involve a partition of a chi-squared statistic rather than the total variance.
Things to think about
- Column and row profiles are alike because the problem is defined symmetrically.
- The distance between a row point and a column point has no meaning.
- The directions of columns and rows from the origin are meaningful, and the relationships help interpret the plot.
Correspondence plot
R codes
R codes used to generate results
Correspondence analysis plot (ca package):
library(ca) # load the ca package
myTable <- table(x, y)
plot(ca(myTable))
x and y are vectors.
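The row and column profiles mentioned above can also be inspected directly with base R; a minimal sketch (not part of the application's output):
prop.table(myTable, 1) # row profiles: each row divided by its row total
prop.table(myTable, 2) # column profiles: each column divided by its column total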
Display and download dataset
Display the selected dataset and download the file with the example dataset used. You can also take a simple random sample from the dataset.
- To take a simple random sample from a dataset, select the check box labelled Take a simple random sample.
- Set the Sample size to the required number of observations.
- Change the seed if a new sample is required. When the same number is used for a seed, the sample generated will be the same each time.
- Select the check box labelled Take a sample with replacement if a sample with replacement is wanted.
- Click the Download sample button to download a generated sample and save it on a hard drive.
- To analyse a sample, upload the saved .csv file and continue using the application.
Dataset
Inputs
R codes
R codes used to generate results
Take a simple random sample:
set.seed(10) # initiate the random number generator
data[sample(1:nrow(data), size=sampleSize, replace=FALSE),]
data is a data frame from which a simple random sample without replacement is taken; sampleSize is the required sample size.
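A sample with replacement, saved to a .csv file for later upload, could be sketched as follows (the file name mySample.csv is only an example):
set.seed(10) # initiate the random number generator
mySample <- data[sample(1:nrow(data), size=sampleSize, replace=TRUE),]
write.csv(mySample, file="mySample.csv", row.names=FALSE)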
Descriptive statistics
We summarise a categorical variable essentially by counting occurrences to give a frequency. If the categories are numerically coded, the psych package also provides mean values and standard deviations.
Descriptive statistics (base package)
Descriptive statistics (psych package)
R codes
R codes used to generate results
Descriptive statistics (base package):
summary(x)
x is a data frame.
Descriptive statistics (psych package):
library(psych) # load the psych package
describe(x, skew=FALSE, ranges=FALSE)
Descriptive statistics grouped by one of the categorical variables (psych package):
describeBy(x, group=y, skew=FALSE, ranges=FALSE)
y is a categorical variable.
Frequency tables
We summarise a categorical variable by counting frequencies and by converting the counts into proportions and percentages.
The frequency distribution of a categorical variable is a summary of how often the data fall into each of a collection of non-overlapping categories. The relative frequency distribution of a categorical variable is a summary of the proportion of observations in each of those non-overlapping categories.
Frequency table
Cell percentages
Expected frequencies
Expected frequencies (relative)
Marginal frequencies (1st variable)
Marginal frequencies (2nd variable)
Marginal percentages (1st variable)
Marginal percentages (2nd variable)
Row percentages
Column percentages
R codes
R codes used to generate results
Frequency/contingency table:
myTable <- table(x, y) # x will be rows, y will be columns
Expected frequencies (vcd package):
library(vcd) # load the vcd package
independence_table(myTable, frequency = c("absolute"))
independence_table(myTable, frequency = c("relative"))
Marginal frequencies:
margin.table(myTable, 1) # x frequencies (summed over y)
margin.table(myTable, 2) # y frequencies (summed over x)
Tables of proportions:
prop.table(myTable) # cell percentages
prop.table(myTable, 1) # row percentages
prop.table(myTable, 2) # column percentages
Measures of Association
Computes the Pearson chi-squared test, the likelihood ratio chi-squared test, the phi coefficient, the contingency coefficient and Cramer’s V.
About the association statistics
- Pearson's chi-squared test is a statistical test applied to sets of categorical data to evaluate how likely it is that any observed difference between the sets arose by chance.
- The likelihood ratio chi-square builds on the likelihood of the data under the null hypothesis relative to the maximum likelihood. This is the usual statistic for log-linear analyses.
- The phi coefficient is a measure of association for two binary variables. This measure is similar to the Pearson correlation coefficient in its interpretation.
- The contingency coefficient is an adjustment to the phi coefficient, intended to adapt it to tables larger than 2x2.
- Cramer's V is the most popular of the chi-square-based measures of nominal association because it is designed so that the attainable upper limit is always 1.
Things to think about
- The phi coefficient is often used as a measure of association in 2x2 tables formed by true dichotomies.
- The contingency coefficient is always less than 1 and approaches 1 only for large tables. The larger the contingency coefficient, the stronger the association. Some researchers recommend it only for 5x5 tables or larger; for smaller tables it will underestimate the level of association.
- In the case of a 2×2 contingency table Cramér's V is equal to the phi coefficient.
Association statistics
R codes
R codes used to generate results
Association statistics (vcd package):
library(vcd) # load the vcd package
myTable <- table(x, y)
summary(assocstats(myTable))
x and y are vectors.
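For intuition, Cramer's V described above can also be computed directly from the Pearson chi-squared statistic; a minimal sketch (not the application's own code):
chi2 <- chisq.test(myTable, correct=FALSE)$statistic # uncorrected chi-squared statistic
n <- sum(myTable) # total number of observations
V <- sqrt(chi2 / (n * (min(dim(myTable)) - 1))) # Cramer's V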
Correspondence analysis
Correspondence analysis is one of a wide range of alternative ways of handling and representing the relationships between categorical data.
About the correspondence analysis
- The correspondence analysis results provide information which is similar to that produced by principal components analysis or factor analysis.
- The multivariate treatment of the data through multiple categorical variables is an important feature of correspondence analysis.
- It has the advantage of revealing relationships which could not be detected in a series of pairwise comparisons of variables.
Things to think about
- The correspondence analysis is an exploratory technique.
- There are no statistical significance tests that are customarily applied to the results of a correspondence analysis.
- The primary purpose of the technique is to produce a simplified (low-dimensional) representation of the information in a large frequency table (or tables with similar measures of correspondence).
Correspondence analysis
R codes
R codes used to generate results
Correspondence analysis results (ca package):
library(ca) # load the ca package
myTable <- table(x, y)
print(ca(myTable))
x and y are vectors.
Chi-squared test (raw data)
The chi-squared test is used to test the independence of the row and column variables in two-way (contingency) tables.
About the chi-squared test
- The null hypothesis is that there is no association between the row and column classifications. The alternative hypothesis is that there is an association between the row and column classifications.
- The chi-squared test statistic compares the entire set of observed counts with the set of counts expected if there were no association.
- The chi-squared statistic is a measure of how much the observed cell counts in a two-way table diverge from the expected cell counts.
Things to think about
- Large values of the chi-squared statistic provide evidence against the null hypothesis.
- Under the assumption that the null hypothesis is true, the sampling distribution of the test statistic follows the chi-squared distribution.
- The chi-squared test always uses the upper tail of the chi-squared distribution.
- For 2x2 tables, all expected cell counts should be 5 or greater.
- For larger tables, the average expected cell count should be 5 or greater and all expected cell counts should be 1 or greater.
Chi-squared test
R codes
R codes used to generate results
myTable <- table(x, y)
chisq.test(myTable)
x and y are vectors.
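To check the expected cell count conditions listed above, the expected frequencies can be extracted from the fitted test object; a minimal sketch:
myTest <- chisq.test(myTable)
myTest$expected # expected cell counts under independence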
Chi-squared test (aggregated data)
The chi-squared test is used to test the independence of the row and column variables in two-way (contingency) tables.
About the chi-squared test
- The null hypothesis is that there is no association between the row and column classifications. The alternative hypothesis is that there is an association between the row and column classifications.
- The chi-squared test statistic compares the entire set of observed counts with the set of counts expected if there were no association.
- The chi-squared statistic is a measure of how much the observed cell counts in a two-way table diverge from the expected cell counts.
Things to think about
- Large values of the chi-squared statistic provide evidence against the null hypothesis.
- Under the assumption that the null hypothesis is true, the sampling distribution of the test statistic follows the chi-squared distribution.
- The chi-squared test always uses the upper tail of the chi-squared distribution.
- For 2x2 tables, all expected cell counts should be 5 or greater.
- For larger tables, the average expected cell count should be 5 or greater and all expected cell counts should be 1 or greater.
Chi-squared test
Inputs
Expected frequencies
Expected frequencies (relative)
R codes
R codes used to generate results
myTable <- table(x, y)
chisq.test(myTable)
x and y are vectors.
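When the data are already aggregated as counts rather than raw observations, the contingency table can be entered directly; a minimal sketch using hypothetical counts:
myTable <- matrix(c(12, 17, 25, 9), nrow=2,
                  dimnames=list(Row=c("A", "B"), Column=c("Yes", "No")))
chisq.test(myTable)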
Fisher exact test
Fisher's exact test provides an exact test of independence of the row and column variables in two-way (contingency) tables.
About the Fisher's exact test
- The null hypothesis is that there is no association between the row and column classifications. The alternative hypothesis is that there is an association between the row and column classifications.
- The p-value provided by this test is correct no matter what the sample size.
- The p-value for Fisher's exact test can be considerably different from the p-value from the z test, and therefore from the chi-squared test.
Things to think about
- Fisher's exact test is used when the sample size is small, to avoid using an approximation that is known to be unreliable for small samples.
- For 2x2 tables, the null hypothesis of conditional independence is equivalent to the hypothesis that the odds ratio equals one.
Fisher's exact test
R codes
R codes used to generate results
myTable <- table(x, y)
fisher.test(myTable, conf.int = TRUE, conf.level = 0.95, workspace=2e+6, hybrid=TRUE)
x and y are vectors.
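For a 2x2 table the output includes the estimated odds ratio mentioned above; a minimal sketch using hypothetical counts:
myTable <- matrix(c(8, 2, 1, 5), nrow=2) # hypothetical 2x2 table of counts
fisher.test(myTable) # reports the p-value, odds ratio and its confidence interval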
Mantel-Haenszel test
The Mantel-Haenszel chi-squared test is used to test the null hypothesis that two nominal variables are conditionally independent in each stratum.
This test assumes that there is no three-way interaction. Input into this test is a 3-dimensional contingency table, where the last dimension refers to the strata.
Features of the Mantel-Haenszel test
- The null hypothesis is that the relative proportions of one variable are independent of the other variable within the repeats; in other words, there is no consistent difference in proportions in the 2×2 tables.
- Technically, the null hypothesis of the Mantel-Haenszel test is that the odds ratios within each repetition are equal to 1. The odds ratio is equal to 1 when the proportions are the same, and the odds ratio is different from 1 when the proportions are different from each other.
Things to think about
- The most common situation when we use this test is when we have multiple 2×2 tables of independence, and we've done the experiment multiple times or at multiple locations. There are three nominal variables: the two variables of the 2×2 test of independence, and the third nominal variable that identifies the repeats (such as different times, different locations, or different studies).
Mantel-Haenszel test
Inputs
R codes
R codes used to generate results
myTable <- table(x, y, z)
mantelhaen.test(myTable, conf.level = 0.95)
x, y and z are vectors.
Log-linear model tests
For log-linear models based on a three-dimensional contingency table, the following tests are performed: mutual, partial and conditional independence, and no three-way interaction.
About the log-linear model tests
- Log-linear analysis is an extension of the two-way contingency table where the conditional relationship between two or more discrete, categorical variables is analysed by taking the natural logarithm of the cell frequencies within a contingency table.
- Log-linear models are most commonly used to evaluate multi-way contingency tables that involve three or more variables.
- The variables investigated by log-linear models are all treated as “response variables”. In other words, no distinction is made between independent and dependent variables. Therefore, log-linear models only demonstrate association between variables.
Things to think about
- The term log-linear derives from the fact that one can, through logarithmic transformations, restate the problem of analysing multi-way frequency tables in terms that are very similar to ANOVA.
- Specifically, one may think of the multi-way frequency table to reflect various main effects and interaction effects that add together in a linear fashion to bring about the observed table of frequencies.
- The chi-squared statistics of models that are hierarchically related to each other can be directly compared.
- Two models are hierarchically related to each other if one can be produced from the other by either adding terms (variables or interactions) or deleting terms (but not both at the same time).
Log-linear model tests
Inputs
R codes
R codes used to generate results
Loglinear model tests (MASS package):
library(MASS) # load the MASS package
myTable <- xtabs(~A+B+C, data=myData) # Three-way contingency table
loglm(~A+B+C, myTable) # Mutual independence
loglm(~A+B+C+B*C, myTable) # Partial independence
loglm(~A+B+C+A*C+B*C, myTable) # Conditional independence
loglm(~A+B+C+A*B+A*C+B*C, myTable) # No three-way interaction
myData is a data frame containing all categorical variables. A, B and C are vectors.
Test of equal or given proportions
Performs a test of the null hypothesis that the proportions (probabilities of success) in several groups are the same, or that they equal certain given values.
About the test of equal or given proportions
- Only groups with finite numbers of successes and failures are used.
- When entering data in the Input box, counts of successes and failures must be non-negative and hence not greater than the corresponding numbers of trials, which must be positive.
- All finite counts should be integers.
Things to think about
- We may use the chi-squared test of independence to test for equality of proportions between populations.
- In the case of small samples, use Yates' continuity correction.
Proportions tests
Inputs
R codes
R codes used to generate results
Test of equal or given proportions:
prop.test(x, n, p, alternative=c("two.sided", "less", "greater"), conf.level=0.95, correct=TRUE)
x is a vector of counts of successes, n is a vector of counts of trials, p is a vector of probabilities of success. If p is given and there are more than 2 groups, the null tested is that the underlying probabilities of success are those given by p. The alternative is always two.sided, the returned confidence interval is NULL, and continuity correction is never used.
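A minimal usage sketch, assuming hypothetical counts of 40 successes in 90 trials for the first group and 30 successes in 70 trials for the second:
prop.test(x=c(40, 30), n=c(90, 70), alternative="two.sided", conf.level=0.95, correct=TRUE)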
Binomial test
Performs an exact test of a simple null hypothesis about the probability of success in a Bernoulli experiment.
About a binomial test
- It is assumed that the variable of interest is considered to be dichotomous in nature where the two values are mutually exclusive and mutually exhaustive in all cases being considered.
- The sample size is much smaller than the population size.
- The sample is representative for the target population.
- Assumption of independent and identically distributed variables is met.
Things to think about
- This test can also be used to test hypotheses about the median of a population.
- It is a nonparametric analogue of the one-sample t-test and may come in handy when the population of interest is not normally distributed and the sample size is small (e.g. less than 30).
Binomial test for proportions
Inputs
R codes
R codes used to generate results
binom.test(x, n, p = 0.5, alternative = c("two.sided", "less", "greater"), conf.level = 0.95)
x is a number of successes, n is a number of trials, p is a hypothesised probability of success.
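A minimal usage sketch, assuming a hypothetical 7 successes out of 20 trials and a hypothesised probability of success of 0.5:
binom.test(x=7, n=20, p=0.5, alternative="two.sided", conf.level=0.95)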