Chapter 12 Association in binary data

12.1 2×2 tables

The smallest possible contingency table is 2×2. A 2×2 table summarizes the joint distribution of two binary variables. It is the basis for many measures of association and classification performance.

Example

Three hundred and eighty-one students of both genders were asked the following questions: “Did you watch videos for more than half an hour yesterday?” and “Do you have siblings?”

The results are presented in the following table:

                         Watched videos   Didn’t watch videos
Has siblings                   270                 54
Doesn’t have siblings           49                  8

12.1.1 Phi coefficient

The phi coefficient measures the strength and direction of association between two binary variables.

We consider the following contingency table:

          Y = 0   Y = 1   Total
X = 0       a       b      a+b
X = 1       c       d      c+d
Total      a+c     b+d      n

where \(n = a + b + c + d\).

The \(\phi\) coefficient for a \(2 \times 2\) contingency table is defined as:

\[\phi = \frac{ad - bc}{\sqrt{(a+b)(c+d)(a+c)(b+d)}} \tag{12.1}\]

The phi coefficient is confined to the range \([-1, 1]\).

Technically, it is equivalent to the Pearson correlation between two binary variables coded as 0/1 (see exercise 12.1).

The phi coefficient is also closely related to the chi-squared (\(\chi^2\)) statistic and Cramér’s V:

\[ \phi^2 = \chi^2/n \tag{12.2} \]

\[ |\phi| = V \tag{12.3} \]

Here, \(n\) denotes the total number of observations (sample size), and \(V\) is Cramér’s V for a 2×2 table (see formula (11.3)).

Example

Three hundred and eighty-two students of both genders were asked whether they drank more than 2 liters of carbonated drinks (cola, Sprite) last week. The results are as follows:

        Female   Male
No        151     106
Yes        39      86

The phi coefficient (the correlation between being male and drinking soda) is:

\[\phi = \frac{151\cdot86-39\cdot106}{\sqrt{(151+39)(106+86)(151+106)(39+86)}} \approx 0.259\]
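The calculation above can be reproduced with a few lines of code. The following is a minimal sketch in plain Python; the function names `phi_coefficient` and `chi_squared` are illustrative and not part of the original text. It computes \(\phi\) from formula (12.1) and, assuming the usual Pearson chi-squared statistic without continuity correction, checks relation (12.2) numerically.

```python
import math

def phi_coefficient(a, b, c, d):
    """Phi for the 2x2 table [[a, b], [c, d]], following formula (12.1)."""
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denom

def chi_squared(a, b, c, d):
    """Pearson chi-squared (no continuity correction) from observed and expected counts."""
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    observed = [[a, b], [c, d]]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    return stat

# Soda example: rows are No / Yes, columns are Female / Male
a, b, c, d = 151, 106, 39, 86
n = a + b + c + d

phi = phi_coefficient(a, b, c, d)
chi2 = chi_squared(a, b, c, d)

print(round(phi, 3))                 # 0.259
print(math.isclose(phi ** 2, chi2 / n))  # checks formula (12.2)
```

Swapping the two rows (or the two columns) flips the sign of \(\phi\) but leaves \(\chi^2\) unchanged, which is why (12.2) involves \(\phi^2\).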

12.1.2 Odds ratio

The odds ratio (OR) is a simple way to compare the likelihood (measured via odds) of an event between two groups.

Before defining the odds ratio, one has to define odds. Odds compare how many times the event occurred to how many times it did not occur within the same group. Consider a binary outcome (Event / No event) and one group A. There were \(a\) observations of the Event and \(b\) observations of No event in the group.

           Event   No event
Group A      a        b

The odds of the event are then defined as:

\[\text{odds} = \frac{\text{number of events}}{\text{number of non-events}} = \frac{a}{b}. \tag{12.4}\]

  • Odds = 1 → the event and non-event occur equally often

  • Odds > 1 → the event occurs more often than the non-event

  • Odds < 1 → the event occurs less often than the non-event

The odds ratio compares odds in two groups, say A and B:

\[ \text{OR} = \frac{\text{odds in group A}}{\text{odds in group B}}. \tag{12.5} \]

  • OR = 1 → no difference between groups

  • OR > 1 → the event is more likely (in terms of odds) in group A

  • OR < 1 → the event is less likely in group A

OR is very popular in medical sciences (e.g. case–control studies) and in statistical modeling, especially logistic regression.

Example: Titanic survival by gender

  • Women: ~73.4% survived

  • Men: ~20.5% survived

2×2 table:

                  Female   Male
Survived            359     352
Didn’t survive      130    1366

Odds of survival:

  • Women: \(359 / 130 \approx 2.76\)

  • Men: \(352 / 1366 \approx 0.26\)

Odds ratio (women vs men):

\[\text{OR} \approx \frac{2.76}{0.26} \approx 11\]

The odds of surviving the Titanic disaster were about 11 times higher for women than for men.
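The same numbers can be obtained programmatically. The sketch below uses two small helper functions, `odds` and `odds_ratio`, whose names are chosen here for illustration only. Because it works with unrounded odds, it gives an odds ratio of about 10.7, which is consistent with the rounded value of about 11 reported above.

```python
def odds(events, non_events):
    """Odds of the event within one group, as in formula (12.4)."""
    return events / non_events

def odds_ratio(a_events, a_non_events, b_events, b_non_events):
    """Odds in group A divided by odds in group B, as in formula (12.5)."""
    return odds(a_events, a_non_events) / odds(b_events, b_non_events)

# Titanic table: survived / didn't survive, by gender
odds_women = odds(359, 130)
odds_men = odds(352, 1366)
or_women_vs_men = odds_ratio(359, 130, 352, 1366)

print(round(odds_women, 2))       # 2.76
print(round(odds_men, 2))         # 0.26
print(round(or_women_vs_men, 1))  # 10.7
```

Reversing the comparison (men vs women) gives the reciprocal, roughly 0.09, so the direction of the comparison should always be stated together with the value.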

12.1.3 Confusion matrix

A confusion matrix is a special version of a 2×2 table used to evaluate the performance of a binary classification model. A binary classification model is an algorithm that assigns an object to one of two classes. One of these classes is usually referred to as the positive class (denoted by the symbol “+” or the value 1), while the other is the negative class (denoted by the symbol “−” or the value 0).

Example

In March 1884, Sergeant J. P. Finley began publishing tornado forecasts for 18 regions of the United States. The forecasts were issued twice a day. After three months, he presented the results in the following table.

                         Observed: Tornado   Observed: No tornado   Total
Forecast: Tornado               28                    72              100
Forecast: No tornado            23                  2680             2703
Total                           51                  2752             2803

In the confusion matrix, each of the four cells has a name.

  • True Positives (TP) are observations where the model correctly predicts the positive class. There are 28 TPs in Sergeant Finley’s table.

  • True Negatives (TN) are observations where the model correctly predicts the negative class. There are 2680 TNs in Sergeant Finley’s table.

  • False Positives (FP) – the model mistakenly predicts the positive class when the true class is negative. There are 72 FPs in the table.

  • False Negatives (FN) – the model mistakenly predicts the negative class when the true class is positive. There are 23 FNs in the table.

The simplest measure of classification performance is accuracy. It is defined as the proportion of observations for which the class has been predicted correctly out of all observations:

\[ \text{Accuracy} = \frac{TP+TN}{TP+TN+FP+FN}. \tag{12.6}\]

The accuracy in the case of Sergeant Finley’s table is

\[\frac{28+2680}{2803} \approx 0.966\]

Accuracy is an intuitive measure, but it is not always adequate. Notice that the accuracy would have been even higher, \(2752/2803 \approx 0.982\), if Sergeant Finley had predicted “no tornado” every time!

Instead of accuracy, two measures are often introduced: sensitivity, defined as the proportion of actual positive cases that are correctly predicted, and specificity, defined as the proportion of actual negative cases that are correctly predicted:

\[ \text{Sensitivity} = \frac{TP}{TP+FN}. \tag{12.7}\]

\[ \text{Specificity} = \frac{TN}{TN+FP}. \tag{12.8}\]

In the example, the sensitivity is:

\[ \text{Sensitivity} = \frac{28}{51} \approx 0.549\]

and specificity:

\[ \text{Specificity} = \frac{2680}{2752} \approx 0.974.\]
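The three measures can be computed together from the four cells of the confusion matrix. The following sketch (the function name `classification_metrics` is an illustrative choice) applies formulas (12.6)–(12.8) to Sergeant Finley’s table and to a hypothetical forecaster who always predicts “no tornado”, for whom all 51 tornadoes become false negatives and all 2752 non-tornado cases become true negatives.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, sensitivity and specificity from the four confusion-matrix cells."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Sergeant Finley's forecasts
finley = classification_metrics(tp=28, fp=72, fn=23, tn=2680)

# Hypothetical baseline: "no tornado" is forecast every time
always_no = classification_metrics(tp=0, fp=0, fn=51, tn=2752)

for name, metrics in [("Finley", finley), ("Always 'no tornado'", always_no)]:
    print(name, {key: round(value, 3) for key, value in metrics.items()})
```

The baseline reaches a higher accuracy than Finley’s forecasts, but its sensitivity is 0: it never detects a single tornado. This is the reason sensitivity and specificity are reported alongside, or instead of, accuracy.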

12.2 Binary and quantitative variables

When one variable is binary and the other quantitative, commonly used measures of association include the point-biserial correlation, Cohen’s d, and AUC / Somers’ D. AUC and Somers’ D can also be applied when assessing association between binary and ordinal variables (see Section 12.3).

12.2.1 Point-biserial correlation

The point-biserial correlation is mathematically equivalent to Pearson’s correlation when the binary variable is coded as 0 and 1.

An alternative formula is given by:

\[r_{pb} = \frac{ \bar{x}_1- \bar{x}_0}{s_x} \sqrt{ \frac{n_1 n_0}{n(n-1)}} \tag{12.9}\]

In this formula, \(\bar{x}_1\) is the average of the quantitative variable for observations where the binary variable is 1, and \(\bar{x}_0\) is the average where it is 0. The term \(s_x\) is the sample standard deviation of the quantitative variable, computed from all \(n\) observations (with \(n-1\) in the denominator). The values \(n_1\) and \(n_0\) are the numbers of observations in each group, and \(n = n_1 + n_0\) is the total number of observations.
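The equivalence between formula (12.9) and Pearson’s correlation can be checked numerically. The sketch below uses a small made-up data set (the values of `x` and `y` are purely hypothetical) and assumes that \(s_x\) is the sample standard deviation with \(n-1\) in the denominator, which is what `statistics.stdev` computes.

```python
import math
import statistics

def point_biserial(x, y):
    """Point-biserial correlation via formula (12.9); y must be coded 0/1."""
    x1 = [xi for xi, yi in zip(x, y) if yi == 1]
    x0 = [xi for xi, yi in zip(x, y) if yi == 0]
    n1, n0, n = len(x1), len(x0), len(x)
    s_x = statistics.stdev(x)  # sample SD of all observations (n - 1 denominator)
    mean_diff = statistics.mean(x1) - statistics.mean(x0)
    return mean_diff / s_x * math.sqrt(n1 * n0 / (n * (n - 1)))

def pearson(x, y):
    """Pearson correlation computed directly from sums of squares and products."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: a quantitative variable x and a 0/1 variable y
x = [4.1, 5.0, 3.8, 6.2, 7.1, 5.9, 6.5, 4.4]
y = [0, 0, 0, 1, 1, 1, 1, 0]

r_pb = point_biserial(x, y)
r = pearson(x, y)
print(math.isclose(r_pb, r))  # True: both approaches agree
```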

12.2.2 Cohen’s d

Cohen’s d tells us how large the difference is between the means of two groups, measured in units of standard deviation.

It is calculated as

\[d = \frac{\bar{x}_1 - \bar{x}_0}{s_p} \tag{12.10}\]

\[ s_p=\sqrt{\frac{(n_1-1)s_1^2+(n_0-1)s_0^2}{n_1+n_0-2} } \tag{12.11}\]

where:

  • \(\bar{x}_1\) is the mean of group 1,
  • \(\bar{x}_0\) is the mean of group 0,
  • \(s_p\) is the pooled standard deviation of the two groups,
  • \(s_1\) is the standard deviation of group 1,
  • \(s_0\) is the standard deviation of group 0,
  • \(n_1\) is the number of observations in group 1,
  • \(n_0\) is the number of observations in group 0.

Cohen’s d is often referred to as the effect size when the means in two groups are compared.

Typical guidelines for interpretation of the effect size:

  • \(d \approx 0.2\) → small effect
  • \(d \approx 0.5\) → medium effect
  • \(d \approx 0.8\) → large effect

These values are, of course, only guidelines. What counts as a small or large effect depends on the research area and context. Therefore, Cohen’s d should be interpreted by comparing it with results from similar studies and by considering its practical importance, not just its absolute size.

Hedges’ g is very similar to Cohen’s d, but it includes a small correction for sample size:

\[ g = d \left( 1 - \frac{3}{4(n_1 + n_0) - 9} \right) \tag{12.12}\]
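Formulas (12.10)–(12.12) translate directly into code. The following is a minimal sketch with hypothetical data; the function names `cohens_d` and `hedges_g` and the two small groups are illustrative assumptions, not taken from the text.

```python
import math
import statistics

def cohens_d(group1, group0):
    """Cohen's d with the pooled standard deviation, formulas (12.10)-(12.11)."""
    n1, n0 = len(group1), len(group0)
    var1 = statistics.variance(group1)  # sample variance s_1^2
    var0 = statistics.variance(group0)  # sample variance s_0^2
    s_p = math.sqrt(((n1 - 1) * var1 + (n0 - 1) * var0) / (n1 + n0 - 2))
    return (statistics.mean(group1) - statistics.mean(group0)) / s_p

def hedges_g(group1, group0):
    """Hedges' g: Cohen's d times the small-sample correction, formula (12.12)."""
    n_total = len(group1) + len(group0)
    return cohens_d(group1, group0) * (1 - 3 / (4 * n_total - 9))

# Hypothetical measurements in two groups
group1 = [6.2, 7.1, 5.9, 6.5]
group0 = [4.1, 5.0, 3.8, 4.4]

d = cohens_d(group1, group0)
g = hedges_g(group1, group0)
print(round(d, 2), round(g, 2))
```

With only eight observations the correction factor is \(1 - 3/23 \approx 0.87\), so \(g\) is noticeably smaller than \(d\); for large samples the two values are practically identical.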

12.3 Binary and ordinal variables

When one variable (\(X\), the “score”) is at least ordinal and the other (\(Y\)) is binary, association is assessed using measures based on ranks and concordance.

12.3.1 AUC

Although AUC formally refers to the area under the receiver operating characteristic (ROC) curve, we do not need to introduce the ROC concept here.

What matters is its probabilistic interpretation: AUC is the probability that a randomly chosen observation with \(Y = 1\) receives a higher value of the ordinal variable \(X\) (a higher “score”) than a randomly chosen observation with \(Y = 0\).

When ties are possible, this interpretation is adjusted as follows: AUC equals the probability that the observation with \(Y=1\) has a higher score than the observation with \(Y=0\), plus one half of the probability that the two observations receive the same score:

\[\text{AUC} = \Pr(X_1 > X_0) + \frac{1}{2}\Pr(X_1 = X_0) \tag{12.13}\]

where \(X_1\) and \(X_0\) denote the values of the ordinal variable for randomly selected observations with \(Y=1\) and \(Y=0\), respectively.

In practice, AUC can be calculated from the numbers of concordant and tied pairs:

\[\text{AUC} = \frac{C + \tfrac{1}{2} T}{n_0 n_1} \tag{12.14}\]

where:

  • \(C\) – number of pairs where the observation with \(Y=1\) has a higher score (higher value of the ordinal variable \(X\)) than the observation with \(Y=0\),
  • \(T\) – number of pairs where the observation with \(Y=1\) has the same score as the observation with \(Y=0\),
  • \(n_1\) – number of observations with \(Y=1\),
  • \(n_0\) – number of observations with \(Y=0\),
  • \(n_0 n_1\) – total number of pairs consisting of one observation with \(Y=0\) and one with \(Y=1\).

When there is no association between \(X\) and \(Y\), AUC equals, or is close to, 0.5.
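Formula (12.14) can be implemented by comparing every observation with \(Y=1\) against every observation with \(Y=0\). The sketch below does exactly that on a small hypothetical data set; the function name `auc_from_pairs` and the values of `scores` and `labels` are illustrative. The double loop is fine for small samples, although rank-based methods would be preferable for large ones.

```python
def auc_from_pairs(scores, labels):
    """AUC via formula (12.14): (C + T/2) / (n0 * n1), with labels coded 0/1."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # C: pairs where the Y=1 observation has the higher score
    concordant = sum(1 for p in pos for q in neg if p > q)
    # T: pairs where both observations have the same score
    tied = sum(1 for p in pos for q in neg if p == q)
    return (concordant + 0.5 * tied) / (len(pos) * len(neg))

# Hypothetical ordinal scores (1-5) and binary outcomes
scores = [1, 2, 2, 3, 3, 4, 5, 5]
labels = [0, 0, 1, 0, 1, 1, 0, 1]

auc = auc_from_pairs(scores, labels)
print(auc)  # 0.65625, from C = 9, T = 3 and 16 pairs
```

Here 9 of the 16 pairs are concordant and 3 are tied, so AUC \(= (9 + 1.5)/16 \approx 0.66\), indicating a moderate tendency for observations with \(Y=1\) to have higher scores.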

12.3.2 Somers’ D

12.5 Exercises

Exercise 12.1 Students of both sexes were asked the question: “Did you watch videos for more than half an hour yesterday?”

Using the collected responses, compute the φ (phi) coefficient measuring the association between sex and video-watching behavior.

Check whether both methods

  • the one based on the contingency table (formula (12.1)), and
  • the one based on variables transformed into 0/1 form (i.e., recode the variables as 0/1 and compute Pearson’s correlation)

return the same result.