A Formulas

A.1 Number of class intervals

  • Square root rule:

\[k=\sqrt{n} \tag{A.1}\]

  • Sturges’ rule:

\[k=1+\log_2(n) \tag{A.2}\]

  • Freedman-Diaconis rule:

\[\text{Bin width}=\frac{2\cdot IQR}{\sqrt[3]{n}} \tag{A.3}\]

  • Scott’s rule:

\[\text{Bin width}=\frac{3.49\cdot s}{\sqrt[3]{n}} \tag{A.4}\]
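The first three rules can be sketched in a few lines of Python. The function names and the sample values (n = 64, IQR = 10, data range = 38) are made up for illustration, and rounding the bin count up to the next integer is an assumption rather than part of the formulas.

```python
import math

def sqrt_rule(n):
    """Number of class intervals by the square root rule (A.1)."""
    return math.sqrt(n)

def sturges_rule(n):
    """Number of class intervals by Sturges' rule (A.2)."""
    return 1 + math.log2(n)

def fd_bin_width(iqr, n):
    """Bin width by the Freedman-Diaconis rule (A.3)."""
    return 2 * iqr / n ** (1 / 3)

def width_to_bins(data_range, width):
    """Convert a bin width into a bin count (rounding up is an assumption)."""
    return math.ceil(data_range / width)

n = 64
print(sqrt_rule(n))                            # 8.0
print(sturges_rule(n))                         # 7.0
print(round(fd_bin_width(10, n), 3))           # 5.0
print(width_to_bins(38, fd_bin_width(10, n)))  # 8
```

The rules that return a bin width need the data range to produce a bin count, which is why the last helper takes both.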

A.2 Central tendency measures

  • Arithmetic mean:

\[ \overline{x} = \frac{\sum_{i=1}^n x_i}{n} \tag{A.5} \]

  • Weighted arithmetic mean (weights normalized so that \(\sum_{i=1}^n w_i = 1\)):

\[\overline{x}_{\text{weighted}} =\sum_{i=1}^n x_iw_i \tag{A.6}\]

  • Harmonic mean:

\[ H = \frac{n}{\sum_{i=1}^n\frac{1}{x_i}} \tag{A.7}\]

  • Weighted harmonic mean:

\[ H_{\text{weighted}} = \frac{1}{\sum_{i=1}^n\frac{w_i}{x_i}} \tag{A.8}\]

  • Geometric mean:

\[ G = \left(x_1\cdot x_2\cdot \cdots \cdot x_n\right)^{1/n} = \left(\prod_{i=1}^n x_i\right)^{1/n} \tag{A.9}\]

\[ G = \exp \left(\frac {1}{n}\sum \limits _{i=1}^{n}\ln x_{i}\right) \tag{A.10}\]

  • Weighted geometric mean:

\[ G_{\text{weighted}} = \exp \left(\sum \limits _{i=1}^{n}w_i\ln x_{i}\right) \tag{A.11}\]

  • Median approximated from a grouped frequency distribution:

\[ Me = l_M + \left(\frac{n}{2}-n_{M-}\right)\frac{h_M}{n_M} \tag{A.12}\]

  • Mode interpolated from a grouped frequency distribution with equal intervals:

\[ Mo = l_m + \frac{n_m - n_{m-1}}{(n_m - n_{m-1}) + (n_m - n_{m+1})} \cdot h \tag{A.13}\]

  • Mode interpolated from a grouped frequency distribution where intervals do not have equal widths:

\[ Mo = l_m + \frac{d_m - d_{m-1}}{(d_m - d_{m-1}) + (d_m - d_{m+1})} \cdot h_m \tag{A.14}\]
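As a rough illustration, the unweighted means (A.5), (A.7), (A.10) and the grouped-data median and mode (A.12), (A.13) might be coded as below. The function names and the small data set are hypothetical. The mode helper assumes equal interval widths and a single modal class, and treats a missing neighbouring class as having frequency 0.

```python
import math

def arithmetic_mean(x):
    return sum(x) / len(x)                        # (A.5)

def harmonic_mean(x):
    return len(x) / sum(1 / v for v in x)         # (A.7), needs nonzero values

def geometric_mean(x):
    # Log form (A.10); requires all values to be positive.
    return math.exp(sum(math.log(v) for v in x) / len(x))

def grouped_median(lower, widths, counts):
    """(A.12): lower limits l, widths h and frequencies of each class."""
    n = sum(counts)
    cum = 0                                       # cumulative frequency before the class
    for l, h, f in zip(lower, widths, counts):
        if cum + f >= n / 2:                      # first class reaching n/2
            return l + (n / 2 - cum) * h / f
        cum += f

def grouped_mode(lower, h, counts):
    """(A.13): equal-width classes, single modal class assumed."""
    m = counts.index(max(counts))
    prev_f = counts[m - 1] if m > 0 else 0
    next_f = counts[m + 1] if m < len(counts) - 1 else 0
    d1, d2 = counts[m] - prev_f, counts[m] - next_f
    return lower[m] + d1 / (d1 + d2) * h

x = [1, 2, 4]
print(round(arithmetic_mean(x), 4))   # 2.3333
print(round(harmonic_mean(x), 4))     # 1.7143
print(round(geometric_mean(x), 4))    # 2.0

# Classes [0,10), [10,20), [20,30) with frequencies 3, 12, 5
print(round(grouped_median([0, 10, 20], [10, 10, 10], [3, 12, 5]), 3))  # 15.833
print(grouped_mode([0, 10, 20], 10, [3, 12, 5]))                        # 15.625
```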

A.3 Dispersion measures

  • Standard deviation:

\[ \widehat{\sigma}_x = \sqrt{\frac{\sum_{i=1}^n \left(x_i-\overline{x}\right)^2}{n}} \tag{A.15} \]

\[ s_x = \sqrt{\frac{\sum_{i=1}^n \left(x_i-\overline{x}\right)^2}{n-1}} \tag{A.16} \]

  • Variance:

\[ \widehat{\sigma}^2_x = \frac{\sum_{i=1}^n \left(x_i-\overline{x}\right)^2}{n} \tag{A.17} \]

\[ s^2_x = \frac{\sum_{i=1}^n \left(x_i-\overline{x}\right)^2}{n-1} \tag{A.18} \]

  • Coefficient of variation:

\[ V_x = \frac{s_x}{\overline{x}} \tag{A.19} \]

  • Mean absolute deviation:

\[ MAD_x = \frac{\sum_{i=1}^n |x_i-\overline{x}|}{n} \tag{A.20} \]

  • Interquartile range:

\[ IQR = Q_3 - Q_1 \tag{A.21} \]

  • Interquartile deviation:

\[Q = IQR/2 \tag{A.22}\]

  • The positional coefficient of variation:

\[V = Q/Me \tag{A.23}\]
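A minimal sketch of the dispersion measures using Python's `statistics` module. The data are made up, and the quartiles depend on the convention used: `statistics.quantiles` defaults to the "exclusive" method, which is an assumption here and may differ from the convention used elsewhere in the text.

```python
import statistics

x = [2, 4, 4, 4, 5, 5, 7, 9]          # illustrative data
mean = statistics.fmean(x)

sigma_hat = statistics.pstdev(x)      # (A.15), divisor n
s = statistics.stdev(x)               # (A.16), divisor n - 1
cv = s / mean                         # (A.19)
mad = sum(abs(v - mean) for v in x) / len(x)   # (A.20)

q1, me, q3 = statistics.quantiles(x, n=4)      # quartiles, "exclusive" method
iqr = q3 - q1                         # (A.21)
q = iqr / 2                           # (A.22)
v_pos = q / me                        # (A.23)

print(sigma_hat)            # 2.0
print(round(s, 4))          # 2.1381
print(round(cv, 4))         # 0.4276
print(mad)                  # 1.5
print(iqr, q)               # 2.5 1.25
print(round(v_pos, 4))      # 0.2778
```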

A.4 Data standardization (z-score)

\[ z = \frac{x - \text{mean}}{\text{standard deviation}} \tag{A.24} \]
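Formula (A.24) does not say which standard deviation to use, so the sketch below exposes that choice as a hypothetical `sample` flag. The function name and the data are illustrative only.

```python
import statistics

def z_scores(x, sample=True):
    """Standardize x; sample=True divides by s_x, False by the n-divisor version."""
    mean = statistics.fmean(x)
    sd = statistics.stdev(x) if sample else statistics.pstdev(x)
    return [(v - mean) / sd for v in x]

z = z_scores([2, 4, 4, 4, 5, 5, 7, 9], sample=False)
print(z)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```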

A.5 Shape

  • Skewness:

\[ g_{1} = \frac{1}{n}\sum_{i=1}^n\left(\frac{x_i-\bar{x}}{\widehat{\sigma}_x}\right)^3 \tag{A.25} \]

\[ G_{1} = \frac{\sqrt{n(n-1)}}{n-2}g_{1} \tag{A.26} \]

  • Pearson median skewness:

\[ \frac{3\cdot(\text{mean} - \text{median})}{\text{standard deviation}} \tag{A.27}\]

  • Bowley’s measure of skewness:

\[ \frac{\text{quartile 1} + \text{quartile 3}- 2\cdot\text{median}}{\text{quartile 3} - \text{quartile 1}} \tag{A.28}\]

  • Kelly’s measure of skewness:

\[ \frac{\text{decile 1} + \text{decile 9}- 2\cdot\text{median}}{\text{decile 9} - \text{decile 1}} \tag{A.29}\]

  • Kurtosis:

\[ g_{2} = \frac{1}{n}\sum_{i=1}^n\left(\frac{x_i-\bar{x}}{\widehat{\sigma}_x}\right)^4-3 \tag{A.30} \]

\[ G_{2} = \frac{n-1}{(n-2)(n-3)}\left[(n+1)g_{2}+6\right] \tag{A.31} \]

\[ b_{2} = \frac{1}{n}\sum_{i=1}^n\left(\frac{x_i-\bar{x}}{s_x}\right)^4-3 \tag{A.32} \]
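The moment-based shape measures (A.25), (A.26) and (A.30) to (A.32) can be checked numerically with a short sketch like the following. The function name and the data are illustrative, and the sample must contain at least four observations for the correction in (A.31) to be defined.

```python
import math

def shape_measures(x):
    n = len(x)
    mean = sum(x) / n
    ss = sum((v - mean) ** 2 for v in x)
    sigma_hat = math.sqrt(ss / n)          # divisor n
    s = math.sqrt(ss / (n - 1))            # divisor n - 1
    g1 = sum(((v - mean) / sigma_hat) ** 3 for v in x) / n        # (A.25)
    G1 = math.sqrt(n * (n - 1)) / (n - 2) * g1                    # (A.26)
    g2 = sum(((v - mean) / sigma_hat) ** 4 for v in x) / n - 3    # (A.30)
    G2 = (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6)       # (A.31)
    b2 = sum(((v - mean) / s) ** 4 for v in x) / n - 3            # (A.32)
    return {"g1": g1, "G1": G1, "g2": g2, "G2": G2, "b2": b2}

res = shape_measures([1, 2, 3, 4, 10])
print(round(res["g1"], 4))   # 1.1384
print(round(res["G1"], 4))   # 1.6971
print(round(res["g2"], 4))   # -0.212
print(round(res["G2"], 4))   # 3.152
print(round(res["b2"], 4))   # -1.2157
```

With only five observations the three kurtosis variants differ markedly, which is the practical reason for listing them separately.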

A.6 Gini coefficient

\[ G = \frac{\sum_{i=1}^{n}(2i-n-1)x_{(i)}}{n^{2}\,\overline{x}} \tag{A.33} \]
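A small sketch of (A.33), where x_(i) denotes the i-th smallest value, so the data are sorted first. The function name and inputs are illustrative.

```python
def gini(x):
    """Gini coefficient from (A.33); x_(i) is the i-th order statistic."""
    xs = sorted(x)
    n = len(xs)
    mean = sum(xs) / n
    weighted = sum((2 * i - n - 1) * v for i, v in enumerate(xs, start=1))
    return weighted / (n ** 2 * mean)

print(gini([1, 2, 3, 4]))   # 0.25
print(gini([5, 5, 5, 5]))   # 0.0 (perfect equality)
```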

A.7 Covariance

\[s_{xy} = \frac{\sum_{i=1}^n \left(x_i-\bar{x}\right)\left(y_i-\bar{y}\right)}{n-1} \tag{A.34}\]

\[ \widehat{\sigma}_{xy} = \frac{\sum_{i=1}^n \left(x_i-\bar{x}\right)\left(y_i-\bar{y}\right)}{n} \tag{A.35} \]

A.8 Pearson correlation coefficient

\[r(X,Y) = \frac{\sum_i{(x_i-\bar{x})(y_i-\bar{y})}}{\sqrt{\sum_i(x_i-\bar{x})^2\sum_i(y_i-\bar{y})^2}} \tag{A.36}\]

\[r(X,Y) = \frac{1}{n}\sum_{i=1}^n z_{x_i} z_{y_i} = \frac{1}{n}\sum_{i=1}^n\left(\frac{x_i-\bar{x}}{\widehat{\sigma}_x}\right)\left(\frac{y_i-\bar{y}}{\widehat{\sigma}_y}\right) \tag{A.37} \]

\[r_{xy} = \frac{s_{xy}}{s_x s_y} \tag{A.38} \]

\[r_{xy} = \frac{\widehat{\sigma}_{xy}}{\widehat{\sigma}_x \widehat{\sigma}_y} \tag{A.39} \]
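The sketch below computes both covariances (A.34), (A.35) on made-up data and confirms that (A.36), (A.38) and (A.39) give the same correlation, since the divisors n or n - 1 cancel. Variable names are illustrative.

```python
import math

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
mx, my = sum(x) / n, sum(y) / n

sp = sum((a - mx) * (b - my) for a, b in zip(x, y))   # sum of cross-products
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)

s_xy = sp / (n - 1)        # (A.34)
sigma_xy = sp / n          # (A.35)

r_a36 = sp / math.sqrt(sxx * syy)
r_a38 = s_xy / (math.sqrt(sxx / (n - 1)) * math.sqrt(syy / (n - 1)))
r_a39 = sigma_xy / (math.sqrt(sxx / n) * math.sqrt(syy / n))

print(s_xy, sigma_xy)                                     # 1.5 1.2
print(round(r_a36, 4), round(r_a38, 4), round(r_a39, 4))  # 0.7746 0.7746 0.7746
```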

A.9 Spearman’s correlation coefficient

\[r_s(X, Y) = r\left(\text{Rank}(X), \text{Rank}(Y)\right) \tag{A.40} \]

\[ r_s = 1 - \frac{6\sum_{i=1}^n d_i^2}{n(n^2-1)} \tag{A.41}\]

\[d_i = \text{Rank}(x_i) - \text{Rank}(y_i)\]
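A minimal sketch of the rank-difference formula (A.41). The helper names are hypothetical, and the ranking function assumes there are no tied values; with ties, average ranks and the definition (A.40) would be the safer route.

```python
def ranks(values):
    """Ranks 1..n in ascending order; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(x, y):
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))  # sum of d_i^2
    return 1 - 6 * d2 / (n * (n ** 2 - 1))                      # (A.41)

x = [10, 20, 30, 40, 50]
y = [1, 3, 2, 5, 4]
print(ranks(y))                  # [1, 3, 2, 5, 4]
print(round(spearman(x, y), 4))  # 0.8
```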

A.10 Kendall’s tau

\[\tau_A = \frac{\text{number of concordant pairs} - \text{number of discordant pairs}}{\text{number of pairs}} \tag{A.42} \]

\[\tau_B = \frac{\text{number of concordant pairs} - \text{number of discordant pairs}}{ \sqrt{(N_0-N_1)(N_0-N_2)}} \tag{A.43} \]

\[N_0 = \text{number of concordant pairs} + \text{number of discordant pairs} + \text{number of ties} = \frac{n(n-1)}{2}\]

\[N_1 = \text{number of pairs tied in } X, \qquad N_2 = \text{number of pairs tied in } Y\]
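Both versions of Kendall's tau can be obtained by brute-force pair counting, as sketched below. Here N_1 and N_2 are taken to be the numbers of pairs tied on X and on Y respectively, which is the usual convention and is stated as an assumption. The function name and the data are illustrative.

```python
import math
from itertools import combinations

def kendall_tau(x, y):
    conc = disc = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x2 - x1, y2 - y1
        if dx == 0:
            ties_x += 1          # pair tied on X (counts toward N_1)
        if dy == 0:
            ties_y += 1          # pair tied on Y (counts toward N_2)
        if dx * dy > 0:
            conc += 1            # concordant pair
        elif dx * dy < 0:
            disc += 1            # discordant pair
    n = len(x)
    n0 = n * (n - 1) // 2        # total number of pairs
    tau_a = (conc - disc) / n0                                        # (A.42)
    tau_b = (conc - disc) / math.sqrt((n0 - ties_x) * (n0 - ties_y))  # (A.43)
    return tau_a, tau_b

tau_a, tau_b = kendall_tau([1, 2, 2, 3], [1, 2, 3, 3])
print(round(tau_a, 4), round(tau_b, 4))   # 0.6667 0.8
```

The double loop is quadratic in n, which is acceptable for small samples such as this one.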

A.11 Simple regression

  • The slope of the SD line:

\[\text{slope of SD line} = \pm \frac{s_y}{s_x} \tag{A.44} \]

  • Fitted line equation:

\[\widehat{y}_i = \widehat{\beta}_0 + \widehat{\beta}_1 x_i \tag{A.45}\]

  • Fitted line slope:

\[\widehat{\beta}_1 = r_{xy} \frac{s_y}{s_x} \tag{A.46}\]

  • Fitted line intercept:

\[\widehat{\beta}_0 = \bar{y} - \widehat{\beta}_1 \bar{x}. \tag{A.47}\]

  • Residuals:

\[e_i = y_i - \widehat{y}_i \tag{A.48}\]

  • R-squared:

\[R^2=1-\frac{\text{SS}_{res}}{\text{SS}_{tot}}, \tag{A.49}\]

\[\text{SS}_{res} = \sum_i{e_i^2} \tag{A.50}\]

\[\text{SS}_{tot} = \sum_i{(y_i-\bar{y})^2} \tag{A.51}\]

\[R^2 = r_{xy}^2 \tag{A.52}\]

  • Residual standard deviation:

\[ \text{RSD} = \sqrt{\frac{1}{n-2}\sum_{i=1}^n\left(y_i-\widehat{y}_i\right)^2} \tag{A.53}\]

\[ \text{RSD} = \left(s_y \sqrt{1 - R^2} \right)\sqrt{\frac{n-1}{n-2}} \tag{A.54}\]

\[ \text{RSD} \approx s_y \sqrt{1 - R^2} \tag{A.55}\]

  • Prediction using a log–log linear model:

\[\widehat{\log(y_i)} = \widehat{\beta}_0 + \widehat{\beta}_1\log(x_i) \tag{A.56}\]

\[\widehat{y}_p = \exp(\widehat{\beta}_0 + \widehat{\beta}_1\log(x_p)) \tag{A.57}\]
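The simple-regression formulas can be verified together on a toy data set. The sketch below uses made-up data and illustrative variable names; it fits the line via (A.46) and (A.47), then checks that R-squared from (A.49) equals the squared correlation (A.52) and that (A.53) and (A.54) agree.

```python
import math

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)          # this is SS_tot (A.51)
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))

s_x, s_y = math.sqrt(sxx / (n - 1)), math.sqrt(syy / (n - 1))
r = sxy / math.sqrt(sxx * syy)

b1 = r * s_y / s_x                            # slope (A.46)
b0 = my - b1 * mx                             # intercept (A.47)
fitted = [b0 + b1 * a for a in x]             # (A.45)
resid = [b - f for b, f in zip(y, fitted)]    # (A.48)

ss_res = sum(e ** 2 for e in resid)           # (A.50)
r2 = 1 - ss_res / syy                         # (A.49)
rsd = math.sqrt(ss_res / (n - 2))             # (A.53)
rsd_alt = s_y * math.sqrt(1 - r2) * math.sqrt((n - 1) / (n - 2))  # (A.54)

print(round(b1, 4), round(b0, 4))         # 0.6 2.2
print(round(r2, 4), round(r ** 2, 4))     # 0.6 0.6
print(round(rsd, 4), round(rsd_alt, 4))   # 0.8944 0.8944
```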

A.12 Multiple regression

  • Fitted equation:

\[\widehat{y}_i = \widehat{\beta}_0 + \widehat{\beta}_1 x_{i1} + \widehat{\beta}_2 x_{i2} + \cdots + \widehat{\beta}_k x_{ik} \tag{A.58}\]

  • Fitted coefficients (matrix formula):

\[\widehat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top \mathbf{y} \tag{A.59}\]
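To make the matrix formula (A.59) concrete without a linear algebra library, the sketch below builds the normal equations and solves them by Gauss-Jordan elimination instead of forming the inverse explicitly. The helper names `solve` and `ols` are hypothetical, the data are constructed so that y = 1 + 2 x_1 + 3 x_2 exactly, and the code assumes the design matrix has full column rank.

```python
def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) beta = X'y."""
    X = [[1.0] + list(r) for r in rows]              # prepend intercept column
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)

rows = [(0, 0), (1, 0), (0, 1), (1, 1)]
y = [1, 3, 4, 6]                                     # y = 1 + 2*x1 + 3*x2
beta = ols(rows, y)
print([round(b, 6) for b in beta])                   # [1.0, 2.0, 3.0]
```

For real data, a numerically stable solver from an established library would normally be preferred over hand-written elimination.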

A.13 Association in categorical data

  • Chi-squared statistic:

\[\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \tag{A.60} \]

\[E_{ij} = \frac{(\text{row total}_i)(\text{column total}_j)}{n} \tag{A.61} \]

  • Cramér’s V:

\[V = \sqrt{ \frac{\chi^2}{n \cdot \min(r - 1, c - 1)} } \tag{A.62} \]

  • Eta squared:

\[\eta^2 = \frac{\text{SSB}}{\text{SST}} \tag{A.63} \]

\[\text{SST} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y})^2 \tag{A.64}\]

\[\text{SSB} = \sum_{i=1}^{k} n_i (\bar{y}_i - \bar{y})^2 \tag{A.65} \]

  • Eta correlation ratio:

\[\eta = \sqrt{\eta^2} = \sqrt{ \frac{\text{SSB}}{\text{SST}}} \tag{A.66}\]
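The following sketch evaluates (A.60) to (A.63) on small made-up inputs. Function names are illustrative; the contingency table is passed as a list of rows of observed counts, and the groups for eta squared as a list of lists of values.

```python
import math

def chi_squared(table):
    """Chi-squared statistic (A.60) with expected counts from (A.61)."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, n

def cramers_v(table):
    chi2, n = chi_squared(table)
    r, c = len(table), len(table[0])
    return math.sqrt(chi2 / (n * min(r - 1, c - 1)))   # (A.62)

def eta_squared(groups):
    all_y = [v for g in groups for v in g]
    grand = sum(all_y) / len(all_y)
    sst = sum((v - grand) ** 2 for v in all_y)                           # (A.64)
    ssb = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)   # (A.65)
    return ssb / sst                                                     # (A.63)

table = [[10, 20], [20, 10]]
print(round(chi_squared(table)[0], 4))                 # 6.6667
print(round(cramers_v(table), 4))                      # 0.3333
print(round(eta_squared([[1, 2, 3], [4, 5, 6]]), 4))   # 0.7714
```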

A.14 Association measures for binary data

  • 2×2 table:

\[\begin{array}{l|cc|c}
 & Y = 0 & Y = 1 & \text{Total} \\ \hline
X = 0 & a & b & a+b \\
X = 1 & c & d & c+d \\ \hline
\text{Total} & a+c & b+d & n
\end{array}\]

  • Phi coefficient:

\[\phi = \frac{ad - bc}{\sqrt{(a+b)(c+d)(a+c)(b+d)}} \tag{A.67}\]

  • Odds and odds ratio:

\[\text{odds} = \frac{\text{number of events}}{\text{number of non-events}} = \frac{a}{b} \tag{A.68} \]

\[ \text{OR} = \frac{\text{odds in group A}}{\text{odds in group B}} \tag{A.69} \]

  • Confusion matrix:
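
\[\begin{array}{l|cc}
 & \text{Predicted Positive} & \text{Predicted Negative} \\ \hline
\text{Actual Positive} & \text{True Positives, } TP & \text{False Negatives, } FN \\
\text{Actual Negative} & \text{False Positives, } FP & \text{True Negatives, } TN
\end{array}\]
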
  • Confusion matrix measures:

\[ \text{Accuracy} = \frac{TP+TN}{TP+TN+FP+FN}. \tag{A.70}\]

\[ \text{Sensitivity} = \frac{TP}{TP+FN}. \tag{A.71}\]

\[ \text{Specificity} = \frac{TN}{TN+FP}. \tag{A.72}\]

  • Point-biserial correlation:

\[r_{pb} = \frac{ \bar{x}_1- \bar{x}_0}{s_x} \sqrt{ \frac{n_1 n_0}{n(n-1)}} \tag{A.73}\]

  • Cohen’s d:

\[d = \frac{\bar{x}_1 - \bar{x}_0}{s_p} \tag{A.74}\]

\[ s_p=\sqrt{\frac{(n_1-1)s_1^2+(n_0-1)s_0^2}{n_1+n_0-2} } \tag{A.75}\]

  • Hedges’ g:

\[ g = d \left( 1 - \frac{3}{4(n_1 + n_0) - 9} \right) \tag{A.76}\]

  • AUC:

\[\text{AUC} = \Pr(x_1 > x_0) + \frac{1}{2}\Pr(x_1 = x_0) \tag{A.77}\]

\[\text{AUC} = \frac{C + \tfrac{1}{2} T}{n_0 n_1} \tag{A.78}\]
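Several of the binary-data measures are sketched below on made-up numbers. The function names are hypothetical. In the AUC helper, C is read as the number of (x_1, x_0) pairs with x_1 > x_0 and T as the number of tied pairs, an interpretation inferred from (A.77) rather than stated in (A.78).

```python
import math
import statistics

def phi(a, b, c, d):
    """Phi coefficient (A.67) from the cells of a 2x2 table."""
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

def confusion_measures(tp, fn, fp, tn):
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),   # (A.70)
        "sensitivity": tp / (tp + fn),                 # (A.71)
        "specificity": tn / (tn + fp),                 # (A.72)
    }

def cohens_d(x1, x0):
    n1, n0 = len(x1), len(x0)
    pooled_var = ((n1 - 1) * statistics.variance(x1)
                  + (n0 - 1) * statistics.variance(x0)) / (n1 + n0 - 2)
    return (statistics.fmean(x1) - statistics.fmean(x0)) / math.sqrt(pooled_var)

def hedges_g(x1, x0):
    return cohens_d(x1, x0) * (1 - 3 / (4 * (len(x1) + len(x0)) - 9))   # (A.76)

def auc(x1, x0):
    c = sum(a > b for a in x1 for b in x0)     # pairs with x1 > x0
    t = sum(a == b for a in x1 for b in x0)    # tied pairs
    return (c + 0.5 * t) / (len(x0) * len(x1))                          # (A.78)

print(round(phi(10, 20, 20, 10), 4))             # -0.3333
print(confusion_measures(40, 10, 5, 45))         # accuracy 0.85, sensitivity 0.8, specificity 0.9
print(round(cohens_d([4, 5, 6], [1, 2, 3]), 4))  # 3.0
print(round(hedges_g([4, 5, 6], [1, 2, 3]), 4))  # 2.4
print(round(auc([3, 4, 5], [1, 3, 2]), 4))       # 0.9444
```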

A.15 Time series

  • Absolute change:

\[\Delta x = x_t - x_0 \tag{A.79}\]

  • Relative change:

\[g = \frac{x_t - x_0}{x_0} \tag{A.80}\]

\[g(\%) = \frac{x_t - x_0}{x_0} \cdot 100\% \tag{A.81}\]

  • Fixed-base index:

\[I^{FB}_t = \frac{x_t}{x_0} \cdot 100 \tag{A.82}\]

  • Chain index:

\[I^{CH}_t = \frac{x_t}{x_{t-1}} \cdot 100 \tag{A.83}\]

\[I^{FB}_t = \prod_{i=1}^{t}\left(\frac{I^{CH}_i}{100}\right) \cdot 100 \tag{A.84}\]

  • CAGR:

\[\text{CAGR} = \left( \frac{x_n}{x_0} \right)^{\frac{1}{n}} - 1 \tag{A.85}\]

\[\text{CAGR}(\%) = \left[ \left( \frac{x_n}{x_0} \right)^{\frac{1}{n}} - 1 \right] \cdot 100\% \tag{A.86}\]
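A short sketch of the time-series measures on an invented three-period series that grows by 10% per period. Variable names are illustrative, and the number of periods for the CAGR is taken as the number of observations minus one.

```python
x = [100, 110, 121]                 # x_0, x_1, x_2 (made-up values)

abs_change = x[-1] - x[0]                                    # (A.79)
rel_change = (x[-1] - x[0]) / x[0]                           # (A.80)
fixed_base = [v / x[0] * 100 for v in x]                     # (A.82)
chain = [x[t] / x[t - 1] * 100 for t in range(1, len(x))]    # (A.83)

# Rebuild the last fixed-base index from the chain indices, as in (A.84)
rebuilt = 100.0
for c in chain:
    rebuilt *= c / 100

periods = len(x) - 1
cagr = (x[-1] / x[0]) ** (1 / periods) - 1                   # (A.85)

print(abs_change, rel_change)               # 21 0.21
print([round(v, 2) for v in fixed_base])    # [100.0, 110.0, 121.0]
print([round(v, 2) for v in chain])         # [110.0, 110.0]
print(round(rebuilt, 2))                    # 121.0
print(round(cagr, 4))                       # 0.1
```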

A.16 Aggregate price indices

  • Laspeyres price index:

\[I_L = \frac{\sum p_t q_0}{\sum p_0 q_0}\cdot 100 \tag{A.87}\]

  • Paasche price index:

\[I_P = \frac{\sum p_t q_t}{\sum p_0 q_t}\cdot 100 \tag{A.88}\]

  • Fisher price index:

\[I_F = \sqrt{I_L \cdot I_P} \tag{A.89}\]
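The three aggregate indices can be illustrated with a two-good basket. The sketch reads (A.87) as the Laspeyres index (base-period quantities), (A.88) as the Paasche index (current-period quantities) and (A.89) as the Fisher index. Prices, quantities and the helper name `basket_value` are all made up.

```python
import math

p0, pt = [10, 20], [12, 22]     # base-period and current-period prices
q0, qt = [5, 3], [4, 4]         # base-period and current-period quantities

def basket_value(prices, quantities):
    """Sum of price times quantity over all goods."""
    return sum(p * q for p, q in zip(prices, quantities))

laspeyres = basket_value(pt, q0) / basket_value(p0, q0) * 100   # (A.87)
paasche = basket_value(pt, qt) / basket_value(p0, qt) * 100     # (A.88)
fisher = math.sqrt(laspeyres * paasche)                         # (A.89)

print(round(laspeyres, 2))   # 114.55
print(round(paasche, 2))     # 113.33
print(round(fisher, 2))      # 113.94
```

As a geometric mean, the Fisher index always lies between the other two.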