About the Categorical Data Analysis application
The Categorical Data Analysis application is an interactive tool which allows you to conduct statistical analyses with categorical/qualitative variables.
The Categorical Data Analysis application makes it easy to:
- Upload your data and download example datasets
- Visualise frequencies and relationships among categorical data
- Display data and results in a tabular format
- Test hypotheses about independence in contingency tables
- Save the results in a report format
- Learn the R code used to generate results
Start the analysis by selecting example data or by uploading your own data in the left sidebar. This will display the selected dataset, a list of categorical variables and three tabs: Graphs, Tables and Inference. The Example data tab at the top of the screen contains detailed information about each of the example datasets in this application.
Click the plus sign (+) to open the box with detailed information about the Categorical Data Analysis application features.
Upload and download data
Uploading your data and downloading example datasets
- The Categorical Data Analysis application permits the user to upload a .csv file that contains data to be displayed. Instructions are given on the Uploading data panel.
- There are a few built-in example datasets which can be used to conduct analysis with categorical variables.
- This application also permits the user to download each example dataset.
Graphs tab
Content of the Graphs tab
Click on the Graphs tab on the left sidebar to reveal the following options at the top of the main panel:
| Tab | Description |
| --- | --- |
| Bar chart | Displays a bar chart for the selected categorical variable. |
| Dotplot | Displays a dotplot for the selected categorical variable using the lattice package in R. |
| Pie chart | Displays a pie chart for the selected categorical variable. |
| Mosaic plot | Displays a mosaic plot for two categorical variables using the vcd package in R. |
| Association plot | Displays an association plot for two categorical variables using the vcd package in R. |
| Agreement plot | Displays an agreement plot for two categorical variables using the vcd package in R. |
| Correspondence plot | Displays a 2D correspondence analysis plot for two categorical variables using the ca package in R. |
Tables tab
Content of the Tables tab
Click on the Tables tab on the left sidebar to reveal the following options at the top of the main panel:
- Datasets
- Descriptive statistics
- Frequency tables
- Association statistics
- Correspondence analysis
Inference tab
Content of the Inference tab
Click on the Inference tab on the left sidebar to reveal the following options at the top of the main panel:
| Tab | Description |
| --- | --- |
| Chi-squared test | Performs the chi-squared test of independence of the row and column variables in two-way contingency tables. |
| Fisher exact test | Provides an exact test of independence in a two-dimensional contingency table. |
| Mantel-Haenszel test | Provides a test of conditional independence between two nominal variables in each stratum of a three-dimensional contingency table. |
| Log-linear model tests | Contains four tests for log-linear models based on a three-way contingency table. |
Save results
To save results in a report:
- Enter your name in the textbox (at the top of the sidebar).
- Select a document format (PDF, HTML or Word).
- Type any comments you may have about the results in the textbox labelled Interpretation.
- Press the Download Report button.
- Save the report on your disc.
R codes
R codes used to generate results:
- A list of the basic R functions used in this application is given here.
- To learn more about a specific R function and its arguments, type a question mark followed by the function name (e.g. ?table) in the RStudio console.
Instructions on how to prepare and upload your data
This application permits the user to upload a .csv file that contains data to be displayed. It is important to follow the instructions below.
STEP 1: Check 'Upload your data' radio button
STEP 2: Click 'Browse ...' button
The application will only accept a 'comma delimited' text file (.csv). The first row in the .csv file should contain the variable names (a 'header'). If a .csv file is uploaded without a header (and this is indicated by unchecking the entry box on the sidebar), the variables to choose from will be listed as V1, V2, V3, etc., depending on their position in the .csv file. It is best to use .csv files that include variable names as a header row.
If a spreadsheet was used to enter the data as illustrated below, then leave the default selections of radio buttons in the Separator and Quote sections.
STEP 3: Enter your data in a spreadsheet
The best approach would begin by creating a file in a spreadsheet such as this:
STEP 4: Save it as a .csv file
STEP 5: Open the .csv file in a text editor
Open the .csv file in a text editor (e.g. Notepad) and it should look like this:
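As an illustration only (hypothetical values, reusing the variable names from the Hair and eye colour example dataset), the contents of such a .csv file might look like this:
Hair,Eye,Sex
Black,Brown,Male
Blond,Blue,Female
Red,Green,Female
Brown,Hazel,Male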
Description of example datasets
This application contains two example datasets:
- Survival of passengers on the Titanic
- Hair and eye colour of statistics students
Both datasets contain categorical variables only. Please note that some of the graphical methods, tables and tests require two or three categorical variables and will not generate output if fewer are available.
Click the plus sign (+) to open the box with detailed information about the dataset.
Titanic dataset
Survival of passengers on the Titanic
This dataset provides information on the fate of passengers on the fatal maiden voyage of the ocean liner ‘Titanic’, summarised according to economic status (class), sex, age and survival.
There are 2201 observations on 4 variables. The variables and their levels are as follows:
| No | Name | Levels |
| --- | --- | --- |
| 1 | Class | 1st, 2nd, 3rd, Crew |
| 2 | Sex | Male, Female |
| 3 | Age | Child, Adult |
| 4 | Survived | No, Yes |
Source
Dawson, Robert J. MacG. (1995). The ‘unusual episode’ data revisited. Journal of Statistics Education, 3. http://www.amstat.org/publications/jse/v3n3/datasets.dawson.html
The source provides a dataset recording class, sex, age, and survival status for each person on board the Titanic, and is based on data originally collected by the British Board of Trade and reprinted in:
British Board of Trade (1990). Report on the loss of the ‘Titanic’ (S.S.). British Board of Trade Inquiry Report (reprint). Gloucester, UK: Allan Sutton Publishing.
Hair and eye colour dataset
Hair and eye colour of statistics students
This dataset provides information on the hair colour, eye colour and sex of 592 statistics students.
There are 592 observations on 3 variables. The variables and their levels are as follows:
| No | Name | Levels |
| --- | --- | --- |
| 1 | Hair | Black, Brown, Red, Blond |
| 2 | Eye | Brown, Blue, Hazel, Green |
| 3 | Sex | Male, Female |
Source
Snee, R. D. (1974). Graphical display of two-way contingency tables. The American Statistician, 28, 9–12.
Friendly, M. (1992a). Graphical methods for categorical data. SAS User Group International Conference Proceedings, 17, 190-200. http://www.math.yorku.ca/SCS/sugi/sugi17-paper.html
Bar chart
A bar chart is a graphical representation of the distribution of qualitative data (i.e. a categorical variable).
A bar chart is a form of graphical representation of categorical variables. It displays data classified into a number of (usually unordered) categories. Equal-width rectangular bars are constructed over each category, with height equal to the observed frequency of the category. A stacked (component) bar chart shows the component parts as sections of the bar, with lengths in proportion to their relative size.
A grouped (clustered) bar chart shows information about different sub-groups of the main categories. A separate bar represents each of the sub-groups (usually coloured or shaded differently to distinguish between them). In such cases, a legend or key is provided.
About a bar chart
- All bars have the same width.
- Bars can be shown vertically or horizontally.
- The categories are shown on one of the axes.
- The frequency of the data in each category is represented by the height/length of the bar.
- Titles and labels for both axes should be included.
- A legend should be included for stacked and grouped bar charts.
- There is an explanatory title or caption underneath the graph.
Things to think about
- Is the order of categories important? In some cases alphabetical ordering or some other arrangement might be used to produce a more useful graphical display.
- Which is the modal or most frequent category?
Bar chart
Inputs
R codes
R codes used to generate results
Bar chart:
counts <- table(x)
barplot(counts, main="Main title", xlab="Horizontal axis title")
x is a vector or matrix.
Horizontal bar chart:
counts <- table(x)
barplot(counts, horiz = TRUE, main="Main title")
Stacked bar chart:
counts <- table(x, y)
barplot(counts, col=c("blue", "red"), legend=rownames(counts), main="Main title", xlab="Horizontal axis title")
x and y are vectors.
Grouped bar chart:
counts <- table(x, y)
barplot(counts, beside=TRUE, col=c("blue", "red"), legend=rownames(counts), main="Main title", xlab="Horizontal axis title")
x and y are vectors.
Dotplot
A dotplot is a type of graphic display used to compare frequency counts within categories or groups.
Dotplots are an alternative to bar charts or pie charts, and look somewhat like a horizontal bar chart where the bars are replaced by a dot at the frequency associated with each category. Optionally, horizontal lines are also included connecting dots with the vertical axis.
About a dotplot
- Dotplots are less cluttered than bar charts.
- Compared to other types of graphic display, dotplots are used most often to plot frequency counts within a small number of categories, usually with small sets of data.
- Titles and labels for both axes should be included.
- The title for the horizontal axis is Frequency or Count.
- There is an explanatory title or caption underneath the graph.
Things to think about
- Multi-panel side-by-side displays might be used for comparing several contrasting or similar cases. Use the same scale for the horizontal axis across different panels.
- Consider ordering categories by the frequencies represented, for more accurate perception.
Dotplot
R codes
R codes used to generate results
Dotplot in lattice package:
library(lattice) # load the lattice package
myTable <- table(x, y)
dotplot(myTable, groups=FALSE, auto.key=list(lines=TRUE), type=c("p", "h"), xlab="Frequency",
prepanel = function (x, y) {
list(ylim = levels(reorder(y, x)))
},
panel = function(x, y, ...){
panel.dotplot(x, reorder(y,x), ...)
})
x and y are vectors.
Pie chart
A pie chart is a graphical technique for presenting the relative frequencies associated with the observed values of a categorical variable.
The pie chart consists of a circle subdivided into sectors (sometimes called slices) whose sizes are proportional to the quantities they represent. Such displays are popular in the media but have little relevance for serious scientific work, where other graphics are generally far more useful (e.g. bar charts and dotplots).
About a pie chart
- Slices have to be mutually exclusive; by definition, they cannot overlap.
- There are two features that let us read the values on a pie chart: the angle a slice covers (compared to the full circle), and the area of a slice (compared to the entire circle).
- Use of 3D pie charts is not recommended. Firstly, it makes the chart more difficult to read. Secondly, different software may produce different 3D charts. A third problem with 3D charts is that they suggest we know more about the data than we really do. In the case of pie charts, adding unnecessary dimensions or perspective distorts the data.
- A legend should be included, or each slice should have its own label. The number or percentage in each slice can also be shown.
- There is an explanatory title or caption underneath the chart.
Things to think about
- We are not very good at measuring angles, but we recognize 90 and 180 degree angles with very high precision. Slices that cover half or a quarter of the circle will therefore stand out. Others can be compared with some success, but reading actual numbers from a pie chart is next to impossible.
- Do the parts make up a meaningful whole? If not, use a different chart.
- Do we want to compare the parts to each other or the parts to the whole? If the main purpose is to compare between the parts, use a different chart. The main purpose of the pie chart is to show part-whole relationships.
- How many parts do we have? If there are more than five to seven, use a different chart. Pie charts with lots of slices (or slices of very different size) are hard to read.
Pie chart
Inputs
R codes
R codes used to generate results
Pie chart displaying label and percentage for each slice:
myTable <- table(x)
percentlabels <- round(100*myTable/sum(myTable), 1)
labs <- paste(names(myTable), "\n", percentlabels, "%", sep="")
pie(myTable, labels=labs, col=rainbow(length(levels(x))), main="Main title")
x is a vector with observations of a categorical variable.
Mosaic plot
A mosaic plot is a graphical display that allows examination of the relationship among two or more categorical variables.
The mosaic plot is a graphical representation of the two-way frequency table or contingency table. A mosaic plot is divided into rectangles, so that the vertical length of each rectangle represents the proportions of the Y variable at each level of the X variable.
About a mosaic plot
- The displayed variables are categorical or ordinal scales.
- The plot is of at least two variables. There is no upper limit on the number of variables, but too many variables may be confusing in graphic form.
- The surfaces of the rectangular fields that are available for a combination of features are proportional to the number of observations that have this combination of features.
- Independence is shown when the boxes across categories all have the same areas.
- The significance of different frequencies of the various characteristic values cannot be observed visually.
- The colours represent the level of the residual for that cell / combination of levels. The legend is presented at the plot's right. Blue means there are more observations in that cell than would be expected under independence. Red means there are fewer observations than would have been expected. This shows which cells are contributing to the significance of the chi-squared test result.
- For the Hair and eye colour dataset the mosaic plot indicates that there are more blue-eyed blond students than expected under independence and too few brown-eyed blond students.
Things to think about
- Variables that represent an exposure or treatment status should usually represent the first split (i.e. division into rectangles) and variables that represent an outcome should represent the second split.
- The categorical variables should be sorted first. Then each variable is assigned to an axis. Another order of categorical variables will result in a different mosaic plot, i.e., as in all multivariate plots, the order of variables plays a role.
Mosaic plot
R codes
R codes used to generate results
Mosaic plot (vcd package):
library(vcd) # load the vcd package
myTable <- table(x, y)
mosaic(myTable, legend=TRUE, shade=TRUE, las=2, col=TRUE)
x and y are vectors.
Association plot
An association plot indicates deviations from a specified independence model in a possibly high-dimensional contingency table.
About an association plot
- The rectangles for each row in the table are positioned relative to a baseline representing independence shown by a dotted line.
- Cells with an observed frequency greater than the expected frequency (assuming independence) rise above the line and are coloured blue; cells that contain less than the expected frequency fall below it and are shaded red.
- The main purpose of the shading is not to visualize significance but the pattern of deviation from independence.
Things to think about
- Variables that represent an exposure or treatment status should usually represent the first split (i.e. division into rectangles) and variables that represent an outcome should represent the second split.
- The categorical variables should be sorted first. Then each variable is assigned to an axis. A different order of the categorical variables will result in a different association plot, i.e., as in all multivariate plots, the order of variables plays a role.
Association plot
R codes
R codes used to generate results
Association plot (vcd package):
library(vcd) # load the vcd package
myTable <- table(x, y)
assoc(myTable, legend=TRUE, shade=TRUE, col=TRUE)
x and y are vectors.
Agreement plot
An agreement plot provides a simple graphic representation of the strength of agreement in a contingency table, and a measure of strength of agreement with an intuitive interpretation.
Inter-observer agreement is often used as a method of assessing the reliability of a subjective classification or assessment procedure. For example, two (or more) clinical psychologists might classify patients on a scale with categories: normal, mildly impaired, severely impaired.
About an agreement plot
- The agreement chart is constructed as an n x n square, where n is the total sample size.
- Black squares show observed agreement. These are positioned within larger rectangles.
- The large rectangle shows the maximum possible agreement, given the marginal totals.
Things to think about
- Observers' ratings can be strongly associated without strong agreement.
- If observers tend to use the categories with different frequency, this will affect measures of agreement.
Agreement plot
R codes
R codes used to generate results
Agreement plot (vcd package):
library(vcd) # load the vcd package
myTable <- table(x, y)
agreementplot(myTable)
x and y are vectors.
Correspondence analysis plot
A graphical display of the relationship between categorical variables in a type of scatterplot diagram.
Two categorical variables are displayed in the form of a contingency table, i.e. a two-way table. From the contingency table a set of coordinate values representing the row and column categories is derived. A small number of these derived coordinate values (usually two) are then used to display the table graphically.
About a correspondence plot
- The row profile is defined as the counts in a row divided by the total count for that row. If two rows have very similar row profiles, their points in the correspondence analysis plot are close together.
- The coordinates are analogous to those resulting from a principal component analysis of continuous variables.
- They involve a partition of a chi-squared statistic rather than the total variance.
Things to think about
- Column and row profiles are alike because the problem is defined symmetrically.
- The distance between a row point and a column point has no meaning.
- The directions of columns and rows from the origin are meaningful, and the relationships help interpret the plot.
Correspondence plot
R codes
R codes used to generate results
Correspondence analysis plot (ca package):
library(ca) # load the ca package
myTable <- table(x, y)
plot(ca(myTable))
x and y are vectors.
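The row and column profiles mentioned above can also be inspected directly with base R; a minimal sketch (not part of the application's output):
prop.table(myTable, 1) # row profiles: each row divided by its row total
prop.table(myTable, 2) # column profiles: each column divided by its column total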
Display and download dataset
Display the selected dataset and download the file with the example dataset used. You can also take a simple random sample from the dataset.
- To take a simple random sample from a dataset, select the check box labelled Take a simple random sample.
- Set the Sample size to the required number of observations.
- Change the seed if a new sample is required. When the same number is used for a seed, the sample generated will be the same each time.
- Select the check box labelled Take a sample with replacement if a sample with replacement is wanted.
- Click the Download sample button to download a generated sample and save it on a hard drive.
- To analyse a sample, upload the saved .csv file and continue using the application.
Dataset
Inputs
R codes
R codes used to generate results
Take a simple random sample:
set.seed(10) # initiate the random number generator
data[sample(1:nrow(data), size=sampleSize, replace=FALSE),]
data is a data frame from which a simple random sample without replacement is taken; sampleSize is the required sample size.
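A sample with replacement, saved to a .csv file for later upload, could be sketched as follows (the file name mySample.csv is only an example):
set.seed(10) # initiate the random number generator
mySample <- data[sample(1:nrow(data), size=sampleSize, replace=TRUE),]
write.csv(mySample, file="mySample.csv", row.names=FALSE)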
Descriptive statistics
We summarise a categorical variable essentially by counting occurrences to give a frequency. If the categories are numerically coded, the psych package also provides mean values and standard deviations.
Descriptive statistics (base package)
Descriptive statistics (psych package)
R codes
R codes used to generate results
Descriptive statistics (base package):
summary(x)
x is a data frame.
Descriptive statistics (psych package):
library(psych) # load the psych package
describe(x, skew=FALSE, ranges=FALSE)
Descriptive statistics grouped by one of the categorical variables (psych package):
describeBy(x, group=y, skew=FALSE, ranges=FALSE)
y is a categorical variable.
Frequency tables
We summarise a categorical variable by counting frequencies and by converting the counts into proportions and percentages.
The frequency distribution of a categorical variable is a summary of how often the data fall into each of a collection of non-overlapping categories. The relative frequency distribution of a categorical variable is a summary of the proportion of observations in each of those non-overlapping categories.
Frequency table
Cell percentages
Expected frequencies
Expected frequencies (relative)
Marginal frequencies (1st variable)
Marginal frequencies (2nd variable)
Marginal percentages (1st variable)
Marginal percentages (2nd variable)
Row percentages
Column percentages
R codes
R codes used to generate results
Frequency/contingency table:
myTable <- table(x, y) # x will be rows, y will be columns
Expected frequencies (vcd package):
library(vcd) # load the vcd package
independence_table(myTable, frequency = c("absolute"))
independence_table(myTable, frequency = c("relative"))
Marginal frequencies:
margin.table(myTable, 1) # x frequencies (summed over y)
margin.table(myTable, 2) # y frequencies (summed over x)
Tables of proportions:
prop.table(myTable) # cell percentages
prop.table(myTable, 1) # row percentages
prop.table(myTable, 2) # column percentages
Measures of Association
Computes the Pearson chi-squared test, the likelihood ratio chi-squared test, the phi coefficient, the contingency coefficient and Cramer’s V.
About the association statistics
- Pearson's chi-squared test is a statistical test applied to sets of categorical data to evaluate how likely it is that any observed difference between the sets arose by chance.
- The likelihood ratio chi-square builds on the likelihood of the data under the null hypothesis relative to the maximum likelihood. This is the usual statistic for log-linear analyses.
- The phi coefficient is a measure of association for two binary variables. This measure is similar to the Pearson correlation coefficient in its interpretation.
- The contingency coefficient is an adjustment to the phi coefficient, intended to adapt it to tables larger than 2x2.
- Cramer's V is the most popular of the chi-square-based measures of nominal association because it is designed so that the attainable upper limit is always 1.
Things to think about
- The phi coefficient is often used as a measure of association in 2x2 tables formed by true dichotomies.
- The contingency coefficient is always less than 1 and approaches 1 only for large tables. The larger the contingency coefficient, the stronger the association. Some researchers recommend it only for 5x5 tables or larger; for smaller tables it will underestimate the level of association.
- In the case of a 2×2 contingency table Cramér's V is equal to the phi coefficient.
Association statistics
R codes
R codes used to generate results
Association statistics (vcd package):
library(vcd) # load the vcd package
myTable <- table(x, y)
summary(assocstats(myTable))
x and y are vectors.
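For intuition, Cramer's V described above can also be computed directly from the Pearson chi-squared statistic; a minimal sketch (not the application's own code):
chi2 <- chisq.test(myTable, correct=FALSE)$statistic # uncorrected chi-squared statistic
n <- sum(myTable) # total number of observations
V <- sqrt(chi2 / (n * (min(dim(myTable)) - 1))) # Cramer's V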
Correspondence analysis
Correspondence analysis is one of a wide range of alternative ways of handling and representing the relationships between categorical data.
About the correspondence analysis
- The correspondence analysis results provide information which is similar to that produced by principal components analysis or factor analysis.
- The multivariate treatment of the data through multiple categorical variables is an important feature of correspondence analysis.
- It has the advantage of revealing relationships which could not be detected in a series of pairwise comparisons of variables.
Things to think about
- The correspondence analysis is an exploratory technique.
- There are no statistical significance tests that are customarily applied to the results of a correspondence analysis.
- The primary purpose of the technique is to produce a simplified (low-dimensional) representation of the information in a large frequency table (or tables with similar measures of correspondence).
Correspondence analysis
R codes
R codes used to generate results
Correspondence analysis results (ca package):
library(ca) # load the ca package
myTable <- table(x, y)
print(ca(myTable))
x and y are vectors.
Chi-squared test (raw data)
The chi-squared test is used to test the independence of the row and column variables in two-way (contingency) tables.
About the chi-squared test
- The null hypothesis is that there is no association between the row and column classifications. The alternative hypothesis is that there is an association between the row and column classifications.
- The chi-squared test statistic compares the entire set of observed counts with the set of counts expected if there were no association.
- The chi-squared statistic is a measure of how much the observed cell counts in a two-way table diverge from the expected cell counts.
Things to think about
- Large values of the chi-squared statistic provide evidence against the null hypothesis.
- Under the assumption that the null hypothesis is true, the sampling distribution of the test statistic follows the chi-squared distribution.
- The chi-squared test always uses the upper tail of the chi-squared distribution.
- For 2x2 tables, all expected cell counts should be 5 or greater.
- For larger tables, the average expected cell count should be 5 or greater and all expected cell counts should be 1 or greater.
Chi-squared test
R codes
R codes used to generate results
myTable <- table(x, y)
chisq.test(myTable)
x and y are vectors.
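To check the expected cell count conditions listed above, the expected frequencies can be extracted from the fitted test object; a minimal sketch:
myTest <- chisq.test(myTable)
myTest$expected # expected cell counts under independence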
Chi-squared test (aggregated data)
The chi-squared test is used to test the independence of the row and column variables in two-way (contingency) tables.
About the chi-squared test
- The null hypothesis is that there is no association between the row and column classifications. The alternative hypothesis is that there is an association between the row and column classifications.
- The chi-squared test statistic compares the entire set of observed counts with the set of counts expected if there were no association.
- The chi-squared statistic is a measure of how much the observed cell counts in a two-way table diverge from the expected cell counts.
Things to think about
- Large values of the chi-squared statistic provide evidence against the null hypothesis.
- Under the assumption that the null hypothesis is true, the sampling distribution of the test statistic follows the chi-squared distribution.
- The chi-squared test always uses the upper tail of the chi-squared distribution.
- For 2x2 tables, all expected cell counts should be 5 or greater.
- For larger tables, the average expected cell count should be 5 or greater and all expected cell counts should be 1 or greater.
Chi-squared test
Inputs
Expected frequencies
Expected frequencies (relative)
R codes
R codes used to generate results
myTable <- table(x, y)
chisq.test(myTable)
x and y are vectors.
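When the data are already aggregated as counts rather than raw observations, the contingency table can be entered directly; a minimal sketch using hypothetical counts:
myTable <- matrix(c(12, 17, 25, 9), nrow=2,
                  dimnames=list(Row=c("A", "B"), Column=c("Yes", "No")))
chisq.test(myTable)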
Fisher exact test
Fisher's exact test provides an exact test of independence of the row and column variables in two-way (contingency) tables.
About the Fisher's exact test
- The null hypothesis is that there is no association between the row and column classifications. The alternative hypothesis is that there is an association between the row and column classifications.
- The p-value provided by this test is correct no matter what the sample size.
- The p-value for Fisher's exact test can be considerably different from the p-value from the z test, and therefore from the chi-squared test.
Things to think about
- Fisher's exact test is used when the sample size is small, to avoid using an approximation that is known to be unreliable for small samples.
- For 2x2 tables, the null hypothesis of conditional independence is equivalent to the hypothesis that the odds ratio equals one.
Fisher's exact test
R codes
R codes used to generate results
myTable <- table(x, y)
fisher.test(myTable, conf.int = TRUE, conf.level = 0.95, workspace=2e+6, hybrid=TRUE)
x and y are vectors.
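For a 2x2 table the output includes the estimated odds ratio mentioned above; a minimal sketch using hypothetical counts:
myTable <- matrix(c(8, 2, 1, 5), nrow=2) # hypothetical 2x2 table of counts
fisher.test(myTable) # reports the p-value, odds ratio and its confidence interval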
Mantel-Haenszel test
The Mantel-Haenszel chi-squared test is used to test the null hypothesis that two nominal variables are conditionally independent in each stratum.
This test assumes that there is no three-way interaction. Input into this test is a 3-dimensional contingency table, where the last dimension refers to the strata.
Features of the Mantel-Haenszel test
- The null hypothesis is that the relative proportions of one variable are independent of the other variable within the repeats; in other words, there is no consistent difference in proportions in the 2×2 tables.
- Technically, the null hypothesis of the Mantel-Haenszel test is that the odds ratios within each repetition are equal to 1. The odds ratio is equal to 1 when the proportions are the same, and the odds ratio is different from 1 when the proportions are different from each other.
Things to think about
- The most common situation when we use this test is when we have multiple 2×2 tables of independence, and we've done the experiment multiple times or at multiple locations. There are three nominal variables: the two variables of the 2×2 test of independence, and the third nominal variable that identifies the repeats (such as different times, different locations, or different studies).
Mantel-Haenszel test
Inputs
R codes
R codes used to generate results
myTable <- table(x, y, z)
mantelhaen.test(myTable, conf.level = 0.95)
x, y and z are vectors.
Log-linear model tests
For log-linear models based on a three-dimensional contingency table, the following tests are performed: mutual, partial and conditional independence, and no three-way interaction.
About the log-linear model tests
- Log-linear analysis is an extension of the two-way contingency table where the conditional relationship between two or more discrete, categorical variables is analysed by taking the natural logarithm of the cell frequencies within a contingency table.
- Log-linear models are most commonly used to evaluate multi-way contingency tables that involve three or more variables.
- The variables investigated by log-linear models are all treated as “response variables”. In other words, no distinction is made between independent and dependent variables. Therefore, log-linear models only demonstrate association between variables.
Things to think about
- The term log-linear derives from the fact that one can, through logarithmic transformations, restate the problem of analysing multi-way frequency tables in terms that are very similar to ANOVA.
- Specifically, one may think of the multi-way frequency table to reflect various main effects and interaction effects that add together in a linear fashion to bring about the observed table of frequencies.
- The chi-squared statistics of models that are hierarchically related to each other can be directly compared.
- Two models are hierarchically related to each other if one can be produced from the other by either adding terms (variables or interactions) or deleting terms (but not both at the same time).
Log-linear model tests
Inputs
R codes
R codes used to generate results
Loglinear model tests (MASS package):
library(MASS) # load the MASS package
myTable <- xtabs(~A+B+C, data=myData) # Three-way contingency table
loglm(~A+B+C, myTable) # Mutual independence
loglm(~A+B+C+B*C, myTable) # Partial independence
loglm(~A+B+C+A*C+B*C, myTable) # Conditional independence
loglm(~A+B+C+A*B+A*C+B*C, myTable) # No three-way interaction
myData is a data frame containing all categorical variables. A, B and C are vectors.
Test of equal or given proportions
Performs a test of the null hypothesis that the proportions (probabilities of success) in several groups are the same, or that they equal certain given values.
About the test of equal or given proportions
- Only groups with finite numbers of successes and failures are used.
- When entering data in the Input box, counts of successes and failures must be non-negative and hence not greater than the corresponding numbers of trials, which must be positive.
- All finite counts should be integers.
Things to think about
- We may use the chi-squared test of independence to test for equality of proportions between populations.
- In the case of small samples, use Yates' continuity correction.
Proportions tests
Inputs
R codes
R codes used to generate results
Test of equal or given proportions:
prop.test(x, n, p, alternative=c("two.sided", "less", "greater"), conf.level=0.95, correct=TRUE)
x is a vector of counts of successes, n is a vector of counts of trials, p is a vector of probabilities of success. If p is given and there are more than 2 groups, the null tested is that the underlying probabilities of success are those given by p. The alternative is always two.sided, the returned confidence interval is NULL, and continuity correction is never used.
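A minimal usage sketch, assuming hypothetical counts of 40 successes in 90 trials for the first group and 30 successes in 70 trials for the second:
prop.test(x=c(40, 30), n=c(90, 70), alternative="two.sided", conf.level=0.95, correct=TRUE)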
Binomial test
Performs an exact test of a simple null hypothesis about the probability of success in a Bernoulli experiment.
About a binomial test
- It is assumed that the variable of interest is considered to be dichotomous in nature where the two values are mutually exclusive and mutually exhaustive in all cases being considered.
- The sample size is much smaller than the population size.
- The sample is representative for the target population.
- Assumption of independent and identically distributed variables is met.
Things to think about
- This test can also be used to test hypotheses about the median of a population.
- It is a nonparametric analogue of the one-sample t-test and may come in handy when the population of interest is not normally distributed and the sample size is small (e.g. less than 30).
Binomial test for proportions
Inputs
R codes
R codes used to generate results
binom.test(x, n, p = 0.5, alternative = c("two.sided", "less", "greater"), conf.level = 0.95)
x is a number of successes, n is a number of trials, p is a hypothesised probability of success.
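A minimal usage sketch, assuming a hypothetical 7 successes out of 20 trials and a hypothesised probability of success of 0.5:
binom.test(x=7, n=20, p=0.5, alternative="two.sided", conf.level=0.95)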