A Formulas
A.1 Number of class intervals
- Square root rule:
\[k=\sqrt{n} \tag{A.1}\]
- Sturges’ rule:
\[k=1+\log_2(n) \tag{A.2}\]
- Freedman-Diaconis rule:
\[\text{Bin width}=\frac{2\cdot IQR}{\sqrt[3]{n}} \tag{A.3}\]
- Scott’s rule:
\[\text{Bin width}=\frac{3.49\cdot s}{\sqrt[3]{n}} \tag{A.4}\]
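The rules above return either a number of intervals or a bin width. A minimal Python sketch of how they might be applied is given here. The function names, the rounding up to the next integer, and the quartile convention (the default of `statistics.quantiles`) are assumptions of this sketch, not part of the formulas.

```python
import math
import statistics

def sqrt_rule(n):
    """Number of intervals from the square root rule (A.1), rounded up."""
    return math.ceil(math.sqrt(n))

def sturges_rule(n):
    """Number of intervals from Sturges' rule (A.2), rounded up."""
    return math.ceil(1 + math.log2(n))

def freedman_diaconis_width(data):
    """Bin width from the Freedman-Diaconis rule (A.3)."""
    q1, _, q3 = statistics.quantiles(data, n=4)  # assumed quartile convention
    return 2 * (q3 - q1) / len(data) ** (1 / 3)

def bins_from_width(data, width):
    """Convert a bin width into a number of intervals covering the data range."""
    return math.ceil((max(data) - min(data)) / width)

data = list(range(1, 101))  # illustrative data: 1, 2, ..., 100
k_sqrt = sqrt_rule(len(data))
k_sturges = sturges_rule(len(data))
k_fd = bins_from_width(data, freedman_diaconis_width(data))
```

A width from Scott's rule (A.4) can be passed to the same `bins_from_width` helper.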
A.2 Central tendency measures
- Arithmetic mean:
\[ \overline{x} = \frac{\sum_{i=1}^n x_i}{n} \tag{A.5} \]
- Weighted arithmetic mean (here and in the other weighted means, the weights are normalized to sum to 1):
\[\overline{x}_{\text{weighted}} =\sum_{i=1}^n x_iw_i \tag{A.6}\]
- Harmonic mean:
\[ H = \frac{n}{\sum_{i=1}^n\frac{1}{x_i}} \tag{A.7}\]
- Weighted harmonic mean:
\[ H_{\text{weighted}} = \frac{1}{\sum_{i=1}^n\frac{w_i}{x_i}} \tag{A.8}\]
- Geometric mean:
\[ G = \left(x_1\cdot x_2\cdot \cdots \cdot x_n\right)^{1/n} = \left(\prod_{i=1}^n x_i\right)^{1/n} \tag{A.9}\]
\[ G = \exp \left(\frac {1}{n}\sum \limits _{i=1}^{n}\ln x_{i}\right) \tag{A.10}\]
- Weighted geometric mean:
\[ G_{\text{weighted}} = \exp \left(\sum \limits _{i=1}^{n}w_i\ln x_{i}\right) \tag{A.11}\]
- Median approximated from a grouped frequency distribution:
\[ Me = l_M + \left(\frac{n}{2}-n_{M-}\right)\frac{h_M}{n_M} \tag{A.12}\]
- Mode interpolated from a grouped frequency distribution with equal intervals:
\[ Mo = l_m + \frac{n_m - n_{m-1}}{(n_m - n_{m-1}) + (n_m - n_{m+1})} \cdot h \tag{A.13}\]
- Mode interpolated from a grouped frequency distribution where intervals do not have equal widths:
\[ Mo = l_m + \frac{d_m - d_{m-1}}{(d_m - d_{m-1}) + (d_m - d_{m+1})} \cdot h_m \tag{A.14}\]
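The weighted means and the grouped-data formulas can be sketched in Python as follows. The sketch makes several assumptions that the formulas do not state: weights are normalized inside the function, the symbols in (A.12) are read as the lower bound, width and count of the median interval plus the cumulative count below it, the densities in (A.14) are taken to be count divided by width, and a missing neighbour of the modal interval is treated as having density zero. All function names are illustrative.

```python
import math

def weighted_means(x, w):
    """Weighted arithmetic, harmonic and geometric means (A.6, A.8, A.11)."""
    total = sum(w)
    w = [wi / total for wi in w]  # normalise so the weights sum to 1
    arithmetic = sum(xi * wi for xi, wi in zip(x, w))
    harmonic = 1 / sum(wi / xi for xi, wi in zip(x, w))
    geometric = math.exp(sum(wi * math.log(xi) for xi, wi in zip(x, w)))
    return arithmetic, harmonic, geometric

def grouped_median(lower, width, count):
    """Median approximated from a grouped frequency distribution (A.12)."""
    n = sum(count)
    below = 0  # cumulative count below the current interval
    for lo, wd, c in zip(lower, width, count):
        if below + c >= n / 2:  # first interval reaching half of the data
            return lo + (n / 2 - below) * wd / c
        below += c

def grouped_mode(lower, width, count):
    """Mode interpolated from grouped data (A.14; same as A.13 for equal widths)."""
    d = [c / wd for c, wd in zip(count, width)]  # assumed: density = count / width
    m = max(range(len(d)), key=d.__getitem__)    # index of the modal interval
    d_prev = d[m - 1] if m > 0 else 0.0
    d_next = d[m + 1] if m < len(d) - 1 else 0.0
    return lower[m] + (d[m] - d_prev) / ((d[m] - d_prev) + (d[m] - d_next)) * width[m]

a, h, g = weighted_means([1, 2, 4], [1, 1, 1])
me = grouped_median([0, 10, 20, 30], [10, 10, 10, 10], [5, 15, 20, 10])
mo = grouped_mode([0, 10, 20, 30], [10, 10, 10, 10], [5, 15, 20, 10])
```

With equal weights the three means reduce to (A.5), (A.7) and (A.9), and for positive data the harmonic mean does not exceed the geometric mean, which does not exceed the arithmetic mean.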
A.3 Dispersion measures
- Standard deviation:
\[ \widehat{\sigma}_x = \sqrt{\frac{\sum_{i=1}^n \left(x_i-\overline{x}\right)^2}{n}} \tag{A.15} \]
\[ s_x = \sqrt{\frac{\sum_{i=1}^n \left(x_i-\overline{x}\right)^2}{n-1}} \tag{A.16} \]
- Variance:
\[ \widehat{\sigma}^2_x = \frac{\sum_{i=1}^n \left(x_i-\overline{x}\right)^2}{n} \tag{A.17} \]
\[ s^2_x = \frac{\sum_{i=1}^n \left(x_i-\overline{x}\right)^2}{n-1} \tag{A.18} \]
- Coefficient of variation:
\[ V_x = \frac{s_x}{\overline{x}} \tag{A.19} \]
- Mean absolute deviation:
\[ MAD_x = \frac{\sum_{i=1}^n |x_i-\overline{x}|}{n} \tag{A.20} \]
- Interquartile range:
\[ IQR = Q_3 - Q_1 \tag{A.21} \]
- Interquartile deviation:
\[Q = IQR/2 \tag{A.22}\]
- The positional coefficient of variation:
\[V = Q/Me \tag{A.23}\]
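As an illustration, the dispersion measures can be computed together in one pass. This is a sketch under assumptions: the function name and dictionary keys are made up for the example, and the quartiles use the `inclusive` method of `statistics.quantiles`, which is only one of several common conventions.

```python
import math
import statistics

def dispersion(x):
    """Dispersion measures (A.15)-(A.23) for a list of numbers."""
    n = len(x)
    mean = sum(x) / n
    ss = sum((xi - mean) ** 2 for xi in x)   # sum of squared deviations
    sd_sample = math.sqrt(ss / (n - 1))      # (A.16)
    # Quartiles and median; the interpolation method is an assumption.
    q1, me, q3 = statistics.quantiles(x, n=4, method="inclusive")
    iqr = q3 - q1                            # (A.21)
    return {
        "sd_pop": math.sqrt(ss / n),                   # (A.15)
        "sd_sample": sd_sample,
        "cv": sd_sample / mean,                        # (A.19)
        "mad": sum(abs(xi - mean) for xi in x) / n,    # (A.20)
        "iqr": iqr,
        "q": iqr / 2,                                  # (A.22)
        "v_pos": (iqr / 2) / me,                       # (A.23)
    }

d = dispersion([2, 4, 4, 4, 5, 5, 7, 9])
```

For this small data set the mean is 5 and the sum of squared deviations is 32, so the two standard deviations differ only through the divisor (8 versus 7).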
A.4 Data standardization (z-score)
\[ z = \frac{x - \text{mean}}{\text{standard deviation}} \tag{A.24} \]
A.5 Shape
- Skewness:
\[ g_{1} = \frac{1}{n}\sum_{i=1}^n\left(\frac{x_i-\bar{x}}{\widehat{\sigma}_x}\right)^3 \tag{A.25} \]
\[ G_{1} = \frac{\sqrt{n(n-1)}}{n-2}g_{1} \tag{A.26} \]
- Pearson median skewness:
\[ \frac{3\cdot(\text{mean} - \text{median})}{\text{standard deviation}} \tag{A.27}\]
- Bowley’s measure of skewness:
\[ \frac{\text{quartile 1} + \text{quartile 3}- 2\cdot\text{median}}{\text{quartile 3} - \text{quartile 1}} \tag{A.28}\]
- Kelly’s measure of skewness:
\[ \frac{\text{decile 1} + \text{decile 9}- 2\cdot\text{median}}{\text{decile 9} - \text{decile 1}} \tag{A.29}\]
- Kurtosis:
\[ g_{2} = \frac{1}{n}\sum_{i=1}^n\left(\frac{x_i-\bar{x}}{\widehat{\sigma}_x}\right)^4-3 \tag{A.30} \]
\[ G_{2} = \frac{n-1}{(n-2)(n-3)}\left[(n+1)g_{2}+6\right] \tag{A.31} \]
\[ b_{2} = \frac{1}{n}\sum_{i=1}^n\left(\frac{x_i-\bar{x}}{s_x}\right)^4-3 \tag{A.32} \]
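The moment-based skewness and kurtosis variants differ only in how they are scaled, which a short Python sketch makes visible. The function name and the use of central moments `m2`, `m3`, `m4` as intermediate quantities are choices of this sketch.

```python
import math

def shape(x):
    """Moment-based skewness and excess kurtosis (A.25, A.26, A.30-A.32)."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((xi - mean) ** 2 for xi in x) / n  # squared population SD
    m3 = sum((xi - mean) ** 3 for xi in x) / n
    m4 = sum((xi - mean) ** 4 for xi in x) / n
    g1 = m3 / m2 ** 1.5                                       # (A.25)
    g2 = m4 / m2 ** 2 - 3                                     # (A.30)
    s2 = m2 * n / (n - 1)                                     # squared sample SD
    return {
        "g1": g1,
        "G1": math.sqrt(n * (n - 1)) / (n - 2) * g1,          # (A.26)
        "g2": g2,
        "G2": (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6),  # (A.31)
        "b2": m4 / s2 ** 2 - 3,                               # (A.32)
    }

sh = shape([2, 4, 4, 4, 5, 5, 7, 9])
```

Dividing by the population standard deviation (A.30) or the sample standard deviation (A.32) changes the result noticeably in small samples, so it is worth stating which variant is reported.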
A.6 Gini coefficient
\[ G = \frac{\sum_{i=1}^{n}(2i-n-1)x_{(i)}}{n^{2}\,\overline{x}} \tag{A.33} \]
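In (A.33) the index i runs over the observations sorted in ascending order, which is what the notation with the index in parentheses denotes. A minimal Python sketch, with an illustrative function name:

```python
def gini(x):
    """Gini coefficient from (A.33); values are sorted in ascending order first."""
    xs = sorted(x)
    n = len(xs)
    mean = sum(xs) / n
    weighted = sum((2 * i - n - 1) * xi for i, xi in enumerate(xs, start=1))
    return weighted / (n ** 2 * mean)

equal = gini([5, 5, 5, 5])          # everyone holds the same amount
concentrated = gini([0, 0, 0, 1])   # one unit holds everything
```

With perfect equality the coefficient is 0. When one of n units holds everything, this formula gives (n - 1)/n rather than 1.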
A.7 Covariance
\[s_{xy} = \frac{\sum_{i=1}^n \left(x_i-\bar{x}\right)\left(y_i-\bar{y}\right)}{n-1} \tag{A.34}\]
\[ \widehat{\sigma}_{xy} = \frac{\sum_{i=1}^n \left(x_i-\bar{x}\right)\left(y_i-\bar{y}\right)}{n} \tag{A.35} \]
A.8 Pearson correlation coefficient
\[r(X,Y) = \frac{\sum_i{(x_i-\bar{x})(y_i-\bar{y})}}{\sqrt{\sum_i(x_i-\bar{x})^2\sum_i(y_i-\bar{y})^2}} \tag{A.36}\]
\[r(X,Y) = \frac{1}{n}\sum_{i=1}^n z_{x_i} z_{y_i} = \frac{1}{n}\sum_{i=1}^n\left(\frac{x_i-\bar{x}}{\widehat{\sigma}_x}\right)\left(\frac{y_i-\bar{y}}{\widehat{\sigma}_y}\right) \tag{A.37} \]
\[r_{xy} = \frac{s_{xy}}{s_x s_y} \tag{A.38} \]
\[r_{xy} = \frac{\widehat{\sigma}_{xy}}{\widehat{\sigma}_x \widehat{\sigma}_y} \tag{A.39} \]
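The expressions (A.36) to (A.39) are algebraically equivalent as long as the covariance and the standard deviations use the same divisor. The following Python sketch computes three of them side by side; the function name and the example data are illustrative.

```python
import math

def pearson_variants(x, y):
    """Pearson correlation computed via (A.36), (A.37) and (A.38)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)

    r_sums = sxy / math.sqrt(sxx * syy)                       # (A.36)

    # Mean product of z-scores, using population SDs (divisor n), (A.24) and (A.37)
    sd_x, sd_y = math.sqrt(sxx / n), math.sqrt(syy / n)
    r_z = sum(((xi - mx) / sd_x) * ((yi - my) / sd_y)
              for xi, yi in zip(x, y)) / n

    # Sample covariance (A.34) over sample SDs (divisor n - 1), (A.38)
    s_x, s_y = math.sqrt(sxx / (n - 1)), math.sqrt(syy / (n - 1))
    r_cov = (sxy / (n - 1)) / (s_x * s_y)
    return r_sums, r_z, r_cov

r_sums, r_z, r_cov = pearson_variants([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
```

Mixing divisors, for example a sample covariance with population standard deviations, would scale the result by n/(n - 1) and no longer give the correlation coefficient.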
A.9 Spearman’s correlation coefficient
\[r_S(X, Y) = r\left(\text{Rank}(X), \text{Rank}(Y)\right) \tag{A.40} \]
\[ r_S = 1 - \frac{6\sum_{i=1}^n d_i^2}{n(n^2-1)} \tag{A.41}\]
\[d_i = \text{Rank}(x_i) - \text{Rank}(y_i)\]
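The definition (A.40) applies in general, while the shortcut (A.41) gives the same value only when there are no tied ranks. A Python sketch of both, assuming that tied values receive the average of their ranks (a common convention that the formulas above do not spell out); the helper names are illustrative.

```python
import math

def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

def spearman(x, y):
    """Pearson correlation of the ranks, as in (A.40)."""
    return pearson(average_ranks(x), average_ranks(y))

def spearman_shortcut(x, y):
    """Shortcut (A.41); matches (A.40) only when there are no ties."""
    n = len(x)
    d2 = sum((rx - ry) ** 2
             for rx, ry in zip(average_ranks(x), average_ranks(y)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

x = [10, 20, 30, 40, 50]
y = [1, 3, 2, 5, 4]
rs_def, rs_short = spearman(x, y), spearman_shortcut(x, y)
```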
A.10 Kendall’s tau
\[\tau_A = \frac{\text{number of concordant pairs} - \text{number of discordant pairs}}{ \text{number of pairs} } \tag{A.42} \]
\[\tau_B = \frac{\text{number of concordant pairs} - \text{number of discordant pairs}}{ \sqrt{(N_0-N_1)(N_0-N_2)}} \tag{A.43} \]
\[N_0 = \text{number of concordant pairs} + \text{number of discordant pairs} + \text{number of ties} = \frac{n(n-1)}{2}\]
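Both versions of tau can be obtained by classifying every pair of observations. The sketch below is a direct pair-counting implementation in Python. It assumes the usual reading of the symbols left undefined above, namely that N_1 is the number of pairs tied on X and N_2 the number of pairs tied on Y; the function name is illustrative.

```python
import math
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a (A.42) and tau-b (A.43) by direct pair counting."""
    concordant = discordant = tied_x = tied_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            tied_x += 1      # pair tied on X (assumed meaning of N_1)
        if dy == 0:
            tied_y += 1      # pair tied on Y (assumed meaning of N_2)
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    n = len(x)
    n0 = n * (n - 1) / 2     # total number of pairs
    tau_a = (concordant - discordant) / n0
    tau_b = (concordant - discordant) / math.sqrt((n0 - tied_x) * (n0 - tied_y))
    return tau_a, tau_b

tau_a, tau_b = kendall_tau([1, 2, 3, 4, 5], [1, 3, 2, 5, 4])
tau_a_ties, tau_b_ties = kendall_tau([1, 1, 2, 3], [1, 2, 2, 3])
```

Without ties the two coefficients coincide. With ties, tau-b has a smaller denominator and is therefore larger in absolute value than tau-a. The double loop over pairs takes time proportional to n squared, which is fine for small samples.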
A.11 Simple regression
- The slope of the SD line:
\[\text{slope of SD line} = \pm \frac{s_y}{s_x} \tag{A.44} \]
- Fitted line equation:
\[\widehat{y}_i = \widehat{\beta}_0 + \widehat{\beta}_1 x_i, \tag{A.45}\]
- Fitted line slope:
\[\widehat{\beta}_1 = r_{xy} \frac{s_y}{s_x} \tag{A.46}\]
- Fitted line intercept:
\[\widehat{\beta}_0 = \bar{y} - \widehat{\beta}_1 \bar{x}. \tag{A.47}\]
- Residuals:
\[e_i = y_i - \widehat{y}_i \tag{A.48}\]
- R-squared:
\[R^2=1-\frac{\text{SS}_{res}}{\text{SS}_{tot}}, \tag{A.49}\]
\[\text{SS}_{res} = \sum_i{e_i^2} \tag{A.50}\]
\[\text{SS}_{tot} = \sum_i{(y_i-\bar{y})^2} \tag{A.51}\]
\[R^2 = r_{xy}^2 \tag{A.52}\]
- Residual standard deviation:
\[ \text{RSD} = \sqrt{\frac{1}{n-2}\sum_{i=1}^n\left(y_i-\widehat{y}_i\right)^2} \tag{A.53}\]
\[ \text{RSD} = \left(s_y \sqrt{1 - R^2} \right)\sqrt{\frac{n-1}{n-2}} \tag{A.54}\]
\[ \text{RSD} \approx s_y \sqrt{1 - R^2} \tag{A.55}\]
- Prediction using a log–log linear model:
\[\widehat{\log(y_i)} = \widehat{\beta}_0 + \widehat{\beta}_1\log(x_i) \tag{A.56}\]
\[\widehat{y}_p = \exp(\widehat{\beta}_0 + \widehat{\beta}_1\log(x_p)) \tag{A.57}\]
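The simple regression formulas chain together naturally: the slope follows from the correlation and the two standard deviations, the intercept from the means, and the remaining quantities from the residuals. A Python sketch under these formulas, with an illustrative function name and example data:

```python
import math

def simple_regression(x, y):
    """Least-squares line and fit measures following (A.46)-(A.54)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)          # SS_tot, (A.51)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    s_x, s_y = math.sqrt(sxx / (n - 1)), math.sqrt(syy / (n - 1))
    r = sxy / math.sqrt(sxx * syy)

    b1 = r * s_y / s_x                             # slope, (A.46)
    b0 = my - b1 * mx                              # intercept, (A.47)
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]  # (A.48)
    ss_res = sum(e ** 2 for e in residuals)        # (A.50)
    r2 = 1 - ss_res / syy                          # (A.49)
    return {
        "b0": b0, "b1": b1, "r": r, "r2": r2,
        "rsd": math.sqrt(ss_res / (n - 2)),        # (A.53)
        "rsd_alt": s_y * math.sqrt(1 - r2) * math.sqrt((n - 1) / (n - 2)),  # (A.54)
    }

fit = simple_regression([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
```

On this example the fitted line is approximately y = 2.2 + 0.6x, R-squared equals the squared correlation as in (A.52), and the two expressions for the residual standard deviation agree.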
A.12 Multiple regression
- Fitted equation:
\[\widehat{y}_i = \widehat{\beta}_0 + \widehat{\beta}_1 x_{i1} + \widehat{\beta}_2 x_{i2} + \cdots + \widehat{\beta}_k x_{ik}, \tag{A.58}\]
- Fitted coefficients (matrix formula):
\[\widehat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top \mathbf{y} \tag{A.59}\]
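The matrix formula (A.59) is equivalent to solving the normal equations, in which the cross-product matrix of X times the coefficient vector equals the cross-product of X and y. The sketch below does this in plain Python with Gaussian elimination. It is illustrative only: the function name is made up, a column of ones is added for the intercept, and numerical libraries typically use more stable decompositions than this direct approach.

```python
def fit_ols(rows, y):
    """Solve the normal equations (X'X) beta = X'y behind (A.59).

    rows: one list of predictor values per observation (no intercept column).
    Returns [beta_0, beta_1, ..., beta_k].
    """
    X = [[1.0] + list(r) for r in rows]      # prepend the intercept column
    p = len(X[0])
    xtx = [[sum(xi[a] * xi[b] for xi in X) for b in range(p)] for a in range(p)]
    xty = [sum(xi[a] * yi for xi, yi in zip(X, y)) for a in range(p)]

    # Gaussian elimination with partial pivoting on the augmented matrix.
    M = [row + [rhs] for row, rhs in zip(xtx, xty)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, p):
            f = M[r][col] / M[col][col]
            for c in range(col, p + 1):
                M[r][c] -= f * M[col][c]

    # Back substitution.
    beta = [0.0] * p
    for r in range(p - 1, -1, -1):
        tail = sum(M[r][c] * beta[c] for c in range(r + 1, p))
        beta[r] = (M[r][p] - tail) / M[r][r]
    return beta

# Data generated exactly from y = 1 + 2*x1 + 3*x2, so the fit should recover 1, 2, 3.
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
y = [1, 3, 4, 6, 8]
beta = fit_ols(rows, y)
```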
A.13 Association in categorical data
- Chi-squared statistic:
\[\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \tag{A.60} \]
\[E_{ij} = \frac{(\text{row total}_i)(\text{column total}_j)}{n} \tag{A.61} \]
- Cramér’s V:
\[V = \sqrt{ \frac{\chi^2}{n \cdot \min(r - 1, c - 1)} } \tag{A.62} \]
- Eta squared:
\[\eta^2 = \frac{\text{SSB}}{\text{SST}} \tag{A.63} \]
\[\text{SST} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y})^2 \tag{A.64}\]
\[\text{SSB} = \sum_{i=1}^{k} n_i (\bar{y}_i - \bar{y})^2 \tag{A.65} \]
- Eta correlation ratio:
\[\eta = \sqrt{\eta^2} = \sqrt{ \frac{\text{SSB}}{\text{SST}}} \tag{A.66}\]
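A short Python sketch of the measures in this section follows. It assumes the contingency table is given as a list of rows of observed counts and the numeric variable as one list of values per category; the function names are illustrative.

```python
import math

def chi_squared(table):
    """Chi-squared statistic (A.60) with expected counts from (A.61)."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    chi2 = 0.0
    for row, rt in zip(table, row_tot):
        for observed, ct in zip(row, col_tot):
            expected = rt * ct / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, n

def cramers_v(table):
    """Cramér's V (A.62) for an r x c table of counts."""
    chi2, n = chi_squared(table)
    r, c = len(table), len(table[0])
    return math.sqrt(chi2 / (n * min(r - 1, c - 1)))

def eta_squared(groups):
    """Eta squared (A.63): between-group share of the total sum of squares."""
    values = [v for g in groups for v in g]
    grand = sum(values) / len(values)
    sst = sum((v - grand) ** 2 for v in values)                          # (A.64)
    ssb = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)   # (A.65)
    return ssb / sst

chi2, n = chi_squared([[10, 20], [20, 10]])
v = cramers_v([[10, 20], [20, 10]])
eta2 = eta_squared([[1, 2, 3], [4, 5, 6]])
```

For a 2×2 table Cramér's V equals the absolute value of the phi coefficient.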
A.14 Association measures for binary data
- 2×2 table:
| | Y = 0 | Y = 1 | Total |
|---|---|---|---|
| X = 0 | a | b | a+b |
| X = 1 | c | d | c+d |
| Total | a+c | b+d | n |
- Phi coefficient:
\[\phi = \frac{ad - bc}{\sqrt{(a+b)(c+d)(a+c)(b+d)}} \tag{A.67}\]
- Odds and odds ratio:
\[\text{odds} = \frac{\text{number of events}}{\text{number of non-events}} = \frac{a}{b} \tag{A.68} \]
\[ \text{OR} = \frac{\text{odds in group A}}{\text{odds in group B}} \tag{A.69} \]
- Confusion matrix:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | True Positives, TP | False Negatives, FN |
| Actual Negative | False Positives, FP | True Negatives, TN |
- Confusion matrix measures:
\[ \text{Accuracy} = \frac{TP+TN}{TP+TN+FP+FN}. \tag{A.70}\]
\[ \text{Sensitivity} = \frac{TP}{TP+FN}. \tag{A.71}\]
\[ \text{Specificity} = \frac{TN}{TN+FP}. \tag{A.72}\]
- Point-biserial correlation:
\[r_{pb} = \frac{ \bar{x}_1- \bar{x}_0}{s_x} \sqrt{ \frac{n_1 n_0}{n(n-1)}} \tag{A.73}\]
- Cohen’s d:
\[d = \frac{\bar{x}_1 - \bar{x}_0}{s_p} \tag{A.74}\]
\[ s_p=\sqrt{\frac{(n_1-1)s_1^2+(n_0-1)s_0^2}{n_1+n_0-2} } \tag{A.75}\]
- Hedges’ g:
\[ g = d \left( 1 - \frac{3}{4(n_1 + n_0) - 9} \right) \tag{A.76}\]
- AUC:
\[\text{AUC} = \Pr(x_1 > x_0) + \frac{1}{2}\Pr(x_1 = x_0) \tag{A.77}\]
\[\text{AUC} = \frac{C + \tfrac{1}{2} T}{n_0 n_1} \tag{A.78}\]
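The effect-size measures for a numeric variable split by a binary group can be sketched in Python as follows. The sketch assumes that group 1 and group 0 are passed as two separate lists and reads C in (A.78) as the number of cross-group pairs in which the group 1 value is larger and T as the number of tied pairs, which is the reading suggested by (A.77). Function names are illustrative.

```python
import math
from statistics import mean, variance  # variance uses the n - 1 divisor

def cohens_d(x1, x0):
    """Cohen's d (A.74) with the pooled standard deviation (A.75)."""
    n1, n0 = len(x1), len(x0)
    pooled = math.sqrt(((n1 - 1) * variance(x1) + (n0 - 1) * variance(x0))
                       / (n1 + n0 - 2))
    return (mean(x1) - mean(x0)) / pooled

def hedges_g(x1, x0):
    """Hedges' g (A.76): Cohen's d with a small-sample correction."""
    return cohens_d(x1, x0) * (1 - 3 / (4 * (len(x1) + len(x0)) - 9))

def auc(x1, x0):
    """AUC by counting cross-group pairs (A.78)."""
    larger = sum(a > b for a in x1 for b in x0)   # assumed meaning of C
    ties = sum(a == b for a in x1 for b in x0)    # assumed meaning of T
    return (larger + ties / 2) / (len(x1) * len(x0))

x1, x0 = [3, 4, 5], [1, 2, 3]
d, g, area = cohens_d(x1, x0), hedges_g(x1, x0), auc(x1, x0)
```

The correction factor in Hedges' g is always below 1, so g is slightly smaller than d in absolute value, with the difference vanishing as the groups grow.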
A.15 Time series
- Absolute change:
\[\Delta x = x_t - x_0 \tag{A.79}\]
- Relative change:
\[g = \frac{x_t - x_0}{x_0} \tag{A.80}\]
\[g(\%) = \frac{x_t - x_0}{x_0} \cdot 100\% \tag{A.81}\]
- Fixed-base index:
\[I^{FB}_t = \frac{x_t}{x_0} \cdot 100 \tag{A.82}\]
- Chain index:
\[I^{CH}_t = \frac{x_t}{x_{t-1}} \cdot 100 \tag{A.83}\]
\[I^{FB}_t = \prod_{i=1}^{t}\left(\frac{I^{CH}_i}{100}\right) \cdot 100 \tag{A.84}\]
- Compound annual growth rate (CAGR):
\[\text{CAGR} = \left( \frac{x_n}{x_0} \right)^{\frac{1}{n}} - 1 \tag{A.85}\]
\[\text{CAGR}(\%) = \left[ \left( \frac{x_n}{x_0} \right)^{\frac{1}{n}} - 1 \right] \cdot 100\% \tag{A.86}\]
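The index and growth-rate formulas can be checked against each other on a short series. The Python sketch below assumes the series is a list of values at equally spaced periods, with the first element as the base; the function names are illustrative.

```python
def fixed_base_indices(x):
    """Fixed-base indices (A.82) relative to the first observation."""
    return [xt / x[0] * 100 for xt in x]

def chain_indices(x):
    """Chain indices (A.83); the first period has no predecessor and is skipped."""
    return [x[t] / x[t - 1] * 100 for t in range(1, len(x))]

def fixed_base_from_chain(chain):
    """Rebuild fixed-base indices from chain indices, as in (A.84)."""
    out = [100.0]
    for c in chain:
        out.append(out[-1] * c / 100)
    return out

def cagr(x):
    """Compound growth rate per period (A.85) over n = len(x) - 1 periods."""
    n = len(x) - 1
    return (x[-1] / x[0]) ** (1 / n) - 1

series = [100, 110, 121]
fb = fixed_base_indices(series)
ch = chain_indices(series)
rebuilt = fixed_base_from_chain(ch)
rate = cagr(series)
```

For this series each period grows by 10%, so the chain indices are both about 110, the fixed-base indices are about 100, 110 and 121, and the compound growth rate is about 0.10.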