###-----------------------------------------------------------------------------------###
### BRM Session 8 - Hypothesis testing: Chi-square test for two categorical variables ###
###-----------------------------------------------------------------------------------###
# Dennis Abel and Lukas Birkenmaier
# Intro and setup ---------------------------------------------------------
# Based on:
# Field, A., Miles, J. and Field, Z. (2012): Discovering Statistics using R, ch. 18.
# https://statsandr.com/blog/chi-square-test-of-independence-in-r/
# Load required packages
#install.packages("gmodels")
library(tidyverse) # Many functions including ggplot
library(gtsummary) # For a nice cross-table
library(gmodels) # For CrossTable(), which runs the chi-square test
library(vcd) # For mosaic-plot
# Working directory
setwd("[insert your working directory here]")
# Use read_rds-function to load rds data
oddjob <- read_rds("oddjob.rds")
# Research design ---------------------------------------------------------
# Our aim is to identify the relationship between status groups (blue, silver, gold)
# and the flight purpose (business or leisure). Understanding the relationship between
# these two variables could give us insights into the flight behavior and needs of
# our status groups which we could use to tailor our services more
# specifically for each status group. The following questions should be answered:
# 1. Does the flight purpose differ by status groups?
# In order to address this question, we need the following variables:
# Flight purpose: flight_purpose
ggplot(oddjob, aes(x = flight_purpose)) +
  geom_bar()
# Status groups: status
ggplot(oddjob, aes(x = status)) +
  geom_bar()
# Formulate hypothesis
# 1. H0: The flight purpose is independent of the status groups
# 1. H1: The flight purpose is dependent on the status groups
# Choose significance level
# Use a significance level (alpha) of 0.05, which means that we accept at most a
# 5% chance of mistakenly rejecting a true null hypothesis (a Type I error)
# Select appropriate test
# We start by defining the testing situation, which is to compare the frequencies
# of the flight purpose across blue, silver and gold status members. Both variables
# are categorical. Unlike the numerical variables we explored before, categorical
# variables have no meaningful mean, so we cannot compare means here. Instead,
# we will analyse frequencies. We already learned in our descriptive sessions to draw
# a contingency table (also called cross-tabulation) of two categorical variables.
# We can use this table to show the frequencies of each combination of categories.
# Cross-tabulation of two categorical variables
tbl_cross(oddjob, row = status, col = flight_purpose, percent = "cell")
# Let's also check out both variables in a barplot
# Absolute values
ggplot(oddjob, aes(x = status, fill = flight_purpose)) +
  geom_bar()
# Proportions
ggplot(oddjob, aes(x = status, fill = flight_purpose)) +
  geom_bar(position = "fill")
# Check assumptions -------------------------------------------------------
# Due to the nature of the data (two categorical variables), the chi-square test
# does not rely on assumptions such as having continuous normally distributed data.
# Categorical data cannot be distributed normally. However, the test relies on two
# assumptions:
# 1. Assumption of independence
# What we know about this sample is that it is a random subset and we also know
# that these are independent observations.
# 2. The expected frequency in each cell should be greater than 5. This could be
# problematic in very small samples. In our case, given the large sample size, we
# expect all cells to meet this threshold. We will compute the expected
# frequencies later on to verify that this assumption has not been violated.
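# Sketch: how expected frequencies are computed under independence.
# NOTE: the counts below are made up purely for illustration; they are NOT the
# oddjob data, and the object names toy_tab / toy_expected are hypothetical.
toy_tab <- matrix(c(30, 70,
                    45, 35,
                    40, 20),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(status = c("Blue", "Silver", "Gold"),
                                  flight_purpose = c("Business", "Leisure")))
# Expected count per cell = row total * column total / grand total
toy_expected <- outer(rowSums(toy_tab), colSums(toy_tab)) / sum(toy_tab)
toy_expected
# Rule-of-thumb check: is the smallest expected count at least 5?
min(toy_expected) >= 5
# For our data, the same check could be run on the real table, e.g.:
# min(chisq.test(table(oddjob$status, oddjob$flight_purpose))$expected) >= 5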
# Calculate test statistic ------------------------------------------------
# If we want to check whether there is a relationship between these two variables,
# we can use Pearson's chi-square test. This is a simple test which compares the
# frequencies you observe to the frequencies you would expect in those categories
# if the two variables were independent.
CrossTable(oddjob$status, oddjob$flight_purpose, fisher = TRUE, chisq = TRUE,
           expected = TRUE, sresid = TRUE, digits = 2, prop.c = FALSE, prop.t = FALSE,
           prop.chisq = FALSE, format = "SPSS")
# First of all: our assumption of expected frequencies above 5 has been met. The minimum
# expected frequency is 70.5.
# The chi-square test is highly significant, indicating that there is an association
# between the two variables (i.e. we reject H0).
# The standardized residuals reported for each cell give us more information for each
# individual combination of values. Similar to regression, the residual is the error
# between what the model predicts (the expected frequency) and the data actually
# observed (the observed frequency). To standardize, we simply divide by the square
# root of the expected frequency.
# STANDARDIZED RESIDUAL = (OBSERVEDij - MODELij) / sqrt(MODELij)
# Given that the chi-square statistic is the sum of the squared standardized
# residuals, the individual residuals for each cell give us a good indication of
# the contribution of each combination.
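# Sketch: standardized residuals computed by hand, to show how they relate to the
# chi-square statistic. NOTE: the counts are made up for illustration (NOT the
# oddjob data); toy_obs, toy_exp and toy_res are hypothetical names.
toy_obs <- matrix(c(30, 70,
                    45, 35,
                    40, 20),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(status = c("Blue", "Silver", "Gold"),
                                  flight_purpose = c("Business", "Leisure")))
toy_exp <- outer(rowSums(toy_obs), colSums(toy_obs)) / sum(toy_obs)
# (observed - expected) / sqrt(expected), one value per cell
toy_res <- (toy_obs - toy_exp) / sqrt(toy_exp)
toy_res
# Summing the SQUARED residuals reproduces Pearson's chi-square statistic
sum(toy_res^2)
chisq.test(toy_obs)$statistic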
# Furthermore, a standardized residual behaves approximately like a z-score. This is
# useful because the value itself tells us the significance: if the value lies outside
# +/- 1.96 then it is significant at p < .05 (outside +/- 2.58 = p < .01; outside
# +/- 3.29 = p < .001). The sign of the residual tells us the direction of the effect.
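# Where these cut-offs come from: they are the two-sided critical values of the
# standard normal distribution. A quick check (alpha_levels and z_crit are
# hypothetical names used only for this illustration):
alpha_levels <- c(0.05, 0.01, 0.001)
z_crit <- qnorm(1 - alpha_levels / 2)
round(z_crit, 2) # 1.96 2.58 3.29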
# Visual representation of residuals with mosaic plot (=fancy barplot)
mosaic(~ flight_purpose + status,
       direction = c("v", "h"),
       data = oddjob,
       shade = TRUE
)
# You can report the results like this: There is a significant association between
# the status groups and the flight purposes of customers of Oddjob Airways with
# chi-square(2) = 191.79, p < .001.
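# Sketch: where the degrees of freedom and the p-value in this report come from.
# The statistic 191.79 is taken from the result above; dof, crit_value and
# p_value are hypothetical names used only for this illustration.
dof <- (3 - 1) * (2 - 1) # (number of status groups - 1) * (number of purposes - 1)
crit_value <- qchisq(1 - 0.05, df = dof) # critical value at alpha = 0.05
crit_value
p_value <- pchisq(191.79, df = dof, lower.tail = FALSE) # upper-tail probability
p_value < 0.001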
# In our case, all six cells have highly significant standardized residuals: blue
# members travel for leisure more often than expected under independence, whereas
# silver and gold members take business trips more often than expected.