## 4.3 Summary Statistics

The good news is that these descriptive statistics give us a manageable and meaningful summary of the underlying phenomenon. That’s what this chapter is about. The bad news is that any simplification invites abuse. Descriptive statistics can be like online dating profiles: technically accurate and yet pretty darn misleading.-Charles Wheelan

• It is simple to calculate basic summary statistics in R, most of the functions are named according to what they do.

• For instance $$\mathtt{mean}$$ calculates the mean of a single variable, $$\mathtt{sd}$$ calculates the standard deviation.

Table: Basic statistics functions in R
 Statistics R-Function Arithmetic mean $mean\left(x\right)$ Geometric mean $exp\left(mean\left(log\left(x\right)\right)\right)$ median $median\left(x\right)$ Range $range\left(x\right)$ variance $var\left(x\right)$ standard deviation $sd\left(x\right)$ Interquantile Range $IQR\left(x\right)$ Other quantiles $quantile\left(x\right)$ Skewness $skewness\left(x\right)$ Kurtosis $kurtosis\left(x\right)$
• The following example demonstrates how to calculate the statistics measures in table-1 for Dow Jones prices in data file data_fin.csv
# change the working directory to the folder containing data_fin.csv or provide
# the full path with the filename
data_stocks = read.csv("data/data_fin.csv")  #import data
head(data_stocks)
        Date     DJI   AXP   MMM   ATT    BA   CAT CISCO    DD   XOM    GE
1  3/01/2000 11357.5 45.82 47.19 47.19 40.12 24.31 54.05 65.00 39.09 49.95
2  4/01/2000 10997.9 44.09 45.31 44.25 40.12 24.00 51.00 65.00 38.41 48.06
3  5/01/2000 11122.7 42.96 46.62 44.94 42.62 24.56 51.19 67.75 40.50 47.70
4  6/01/2000 11253.3 43.78 50.62 43.75 43.06 25.81 50.00 71.50 42.59 48.51
5  7/01/2000 11522.6 44.42 51.47 44.12 44.12 26.66 52.94 71.62 42.31 50.28
6 10/01/2000 11572.2 45.04 51.12 44.75 43.69 25.78 54.91 70.00 41.88 50.37
GS    HD    IBM  INTC   JNJ   JPM   MRK   MCD  MSFT   NKE
1 88.31 65.50 115.56 43.47 46.09 48.69 64.04 39.62 58.34 12.03
2 82.38 61.50 112.06 41.47 44.41 47.27 61.61 38.81 56.31 11.38
3 78.88 61.44 116.00 41.81 44.88 46.98 64.22 39.44 56.91 12.03
4 82.25 60.00 114.62 39.38 46.28 47.65 64.75 38.88 55.00 11.97
5 82.56 62.81 113.31 41.00 47.88 48.52 70.97 39.75 55.72 11.97
6 84.38 63.19 118.44 42.88 47.03 47.69 68.89 40.06 56.12 12.17
DJI = data_stocks$DJI DJI = na.omit(DJI) #remove NAs as it will affect the calculations # Arithmetic mean mean(DJI) [1] 11098.12 # Geometric mean exp(mean(log(DJI))) [1] 10953.39 # median median(DJI) [1] 10748.8 # variance & standard deviation var(DJI) [1] 3280347 sd(DJI) [1] 1811.173 # interquantile range and few quantiles IQR(DJI) [1] 2276.25 quantile(DJI)  0% 25% 50% 75% 100% 6547.10 10063.25 10748.80 12339.50 16576.66  # skewness and kurtosis skewness and kurtosis functions are not available in R # core library but in library e1071 (there are other packages which have # functions for skewness and kurtosis try ??kurtosis or search for the function # on RSearch. library(e1071) skewness(DJI) [1] 0.4777828 kurtosis(DJI) [1] 0.08404185 • The function $$\mathtt{summary}$$ in R provides some basic summary viz., minimum value, maximum value, median value and quartiles for one variable or a dataset. The function $$\mathtt{summary}$$ can be used as follows # summary of one column/variable in a dataframe summary(DJI)  Min. 1st Qu. Median Mean 3rd Qu. Max. 6547 10063 10749 11098 12340 16577  # summary of whole dataset excluding the time column summary(data_stocks[, c(2:21)])  DJI AXP MMM ATT Min. : 6547 Min. :10.26 Min. : 39.50 Min. :19.34 1st Qu.:10063 1st Qu.:38.31 1st Qu.: 62.55 1st Qu.:25.54 Median :10749 Median :47.49 Median : 77.67 Median :29.65 Mean :11098 Mean :46.77 Mean : 75.78 Mean :31.77 3rd Qu.:12340 3rd Qu.:54.44 3rd Qu.: 85.55 3rd Qu.:37.22 Max. :16577 Max. :90.73 Max. :140.25 Max. :58.50 NA's :27 NA's :12 NA's :12 NA's :12 BA CAT CISCO DD Min. : 25.06 Min. : 14.91 Min. : 8.60 Min. :16.14 1st Qu.: 44.00 1st Qu.: 28.64 1st Qu.:17.68 1st Qu.:41.13 Median : 63.56 Median : 57.10 Median :20.43 Median :44.53 Mean : 62.95 Mean : 56.01 Mean :23.41 Mean :44.37 3rd Qu.: 74.90 3rd Qu.: 79.36 3rd Qu.:24.17 3rd Qu.:48.92 Max. :138.36 Max. :116.20 Max. :80.06 Max. :71.62 NA's :12 NA's :12 NA's :13 NA's :12 XOM GE GS HD Min. : 30.27 Min. : 6.66 Min. : 52.0 Min. :18.00 1st Qu.: 42.46 1st Qu.:20.00 1st Qu.: 92.2 1st Qu.:31.00 Median : 64.77 Median :30.33 Median :116.2 Median :37.37 Mean : 63.20 Mean :29.56 Mean :126.9 Mean :40.20 3rd Qu.: 81.62 3rd Qu.:36.03 3rd Qu.:159.5 3rd Qu.:46.23 Max. :101.51 Max. :59.94 Max. :247.9 Max. :82.34 NA's :12 NA's :12 NA's :12 NA's :12 IBM INTC JNJ JPM Min. : 55.07 Min. :12.08 Min. :33.69 Min. :15.45 1st Qu.: 87.82 1st Qu.:20.17 1st Qu.:55.27 1st Qu.:35.66 Median :106.48 Median :22.76 Median :61.30 Median :40.20 Mean :118.83 Mean :25.21 Mean :61.07 Mean :40.36 3rd Qu.:130.00 3rd Qu.:26.77 3rd Qu.:65.20 3rd Qu.:45.71 Max. :215.80 Max. :74.88 Max. :95.63 Max. :65.70 NA's :12 NA's :13 NA's :12 NA's :12 MRK MCD MSFT NKE Min. :20.99 Min. : 12.38 Min. :15.15 Min. : 6.64 1st Qu.:34.53 1st Qu.: 29.19 1st Qu.:25.67 1st Qu.:14.56 Median :43.63 Median : 43.78 Median :27.59 Median :23.25 Mean :44.65 Mean : 51.07 Mean :28.38 Mean :28.01 3rd Qu.:51.55 3rd Qu.: 70.36 3rd Qu.:30.19 3rd Qu.:36.90 Max. :89.85 Max. :103.59 Max. :58.34 Max. :79.86 NA's :13 NA's :12 NA's :13 NA's :12  ### 4.3.1 Example-Descriptive Statistics of Stock Returns • In this example we will use R to calculate descriptive statistics for the returns of 10 stocks in the data file $$\mathtt{data\_fin.csv}$$. • We will first import the dataset into R using the $$\mathtt{read.csv}$$ function. data_cs1 = read.csv("data/data_fin.csv") head(data_cs1) #check the imported data  Date DJI AXP MMM ATT BA CAT CISCO DD XOM GE 1 3/01/2000 11357.5 45.82 47.19 47.19 40.12 24.31 54.05 65.00 39.09 49.95 2 4/01/2000 10997.9 44.09 45.31 44.25 40.12 24.00 51.00 65.00 38.41 48.06 3 5/01/2000 11122.7 42.96 46.62 44.94 42.62 24.56 51.19 67.75 40.50 47.70 4 6/01/2000 11253.3 43.78 50.62 43.75 43.06 25.81 50.00 71.50 42.59 48.51 5 7/01/2000 11522.6 44.42 51.47 44.12 44.12 26.66 52.94 71.62 42.31 50.28 6 10/01/2000 11572.2 45.04 51.12 44.75 43.69 25.78 54.91 70.00 41.88 50.37 GS HD IBM INTC JNJ JPM MRK MCD MSFT NKE 1 88.31 65.50 115.56 43.47 46.09 48.69 64.04 39.62 58.34 12.03 2 82.38 61.50 112.06 41.47 44.41 47.27 61.61 38.81 56.31 11.38 3 78.88 61.44 116.00 41.81 44.88 46.98 64.22 39.44 56.91 12.03 4 82.25 60.00 114.62 39.38 46.28 47.65 64.75 38.88 55.00 11.97 5 82.56 62.81 113.31 41.00 47.88 48.52 70.97 39.75 55.72 11.97 6 84.38 63.19 118.44 42.88 47.03 47.69 68.89 40.06 56.12 12.17 • Apply function to with dates as character and then after converting dates to Date class. # selecting first 10 price series including the data column data_cs1.1 = data_cs1[, c(1:11)] # data cleaning-remove NAs data_cs1.1 = na.omit(data_cs1.1) colnames(data_cs1.1) # see the columns present in the data  [1] "Date" "DJI" "AXP" "MMM" "ATT" "BA" "CAT" "CISCO" "DD" [10] "XOM" "GE"  summary(data_cs1.1) #notice the Date variable  Date DJI AXP MMM Length:3523 Min. : 6547 Min. :10.26 Min. : 39.50 Class :character 1st Qu.:10063 1st Qu.:38.38 1st Qu.: 62.55 Mode :character Median :10749 Median :47.60 Median : 77.67 Mean :11098 Mean :46.83 Mean : 75.80 3rd Qu.:12340 3rd Qu.:54.50 3rd Qu.: 85.61 Max. :16577 Max. :90.73 Max. :140.25 ATT BA CAT CISCO Min. :19.34 Min. : 25.06 Min. : 14.91 Min. : 8.60 1st Qu.:25.54 1st Qu.: 44.02 1st Qu.: 28.48 1st Qu.:17.68 Median :29.76 Median : 63.61 Median : 57.11 Median :20.39 Mean :31.79 Mean : 62.99 Mean : 56.03 Mean :23.42 3rd Qu.:37.23 3rd Qu.: 74.95 3rd Qu.: 79.50 3rd Qu.:24.18 Max. :58.50 Max. :138.36 Max. :116.20 Max. :80.06 DD XOM GE Min. :16.14 Min. : 30.27 Min. : 6.66 1st Qu.:41.17 1st Qu.: 42.41 1st Qu.:20.04 Median :44.58 Median : 64.70 Median :30.37 Mean :44.43 Mean : 63.18 Mean :29.63 3rd Qu.:48.93 3rd Qu.: 81.70 3rd Qu.:36.05 Max. :71.62 Max. :101.51 Max. :59.94  # check class of dates which will be factor ( treated as factor by default)\t class(data_cs1.1$Date)
[1] "character"
# convert dates to class Date
data_cs1.1$Date = as.Date(data_cs1.1$Date, format = "%d/%m/%Y")
class(data_cs1.1$Date) [1] "Date" summary(data_cs1.1) #notice the Date variable  Date DJI AXP MMM Min. :2000-01-03 Min. : 6547 Min. :10.26 Min. : 39.50 1st Qu.:2003-07-08 1st Qu.:10063 1st Qu.:38.38 1st Qu.: 62.55 Median :2007-01-05 Median :10749 Median :47.60 Median : 77.67 Mean :2007-01-03 Mean :11098 Mean :46.83 Mean : 75.80 3rd Qu.:2010-07-06 3rd Qu.:12340 3rd Qu.:54.50 3rd Qu.: 85.61 Max. :2014-01-03 Max. :16577 Max. :90.73 Max. :140.25 ATT BA CAT CISCO Min. :19.34 Min. : 25.06 Min. : 14.91 Min. : 8.60 1st Qu.:25.54 1st Qu.: 44.02 1st Qu.: 28.48 1st Qu.:17.68 Median :29.76 Median : 63.61 Median : 57.11 Median :20.39 Mean :31.79 Mean : 62.99 Mean : 56.03 Mean :23.42 3rd Qu.:37.23 3rd Qu.: 74.95 3rd Qu.: 79.50 3rd Qu.:24.18 Max. :58.50 Max. :138.36 Max. :116.20 Max. :80.06 DD XOM GE Min. :16.14 Min. : 30.27 Min. : 6.66 1st Qu.:41.17 1st Qu.: 42.41 1st Qu.:20.04 Median :44.58 Median : 64.70 Median :30.37 Mean :44.43 Mean : 63.18 Mean :29.63 3rd Qu.:48.93 3rd Qu.: 81.70 3rd Qu.:36.05 Max. :71.62 Max. :101.51 Max. :59.94  • Convert prices to returns d2 = as.data.frame(sapply(data_cs1.1[2:11], function(x) diff(log(x)) * 100)) #note it will be one less # create a different dataframe with returns data_stocks_ret = as.data.frame(cbind(Date = data_cs1.1$Date[2:length(data_cs1.1\$Date)],
d2), stringsAsFactors = FALSE, row.names = NULL)
# visual inspection
head(data_stocks_ret)
        Date        DJI        AXP        MMM        ATT         BA       CAT
1 2000-01-04 -3.2173973 -3.8487678 -4.0654247 -6.4326634  0.0000000 -1.283396
2 2000-01-05  1.1283720 -2.5963549  2.8501875  1.5472895  6.0448664  2.306527
3 2000-01-06  1.1673354  1.8907642  8.2317122 -2.6836654  1.0270865  4.964291
4 2000-01-07  2.3648905  1.4512726  1.6652359  0.8421582  2.4318702  3.240230
5 2000-01-10  0.4295346  1.3861165 -0.6823304  1.4178250 -0.9793951 -3.356532
6 2000-01-11 -0.5293883  0.9061837 -1.7165263 -1.4178250 -1.8713726 -1.563754
CISCO         DD        XOM         GE
1 -5.8083911  0.0000000 -1.7548837 -3.8572275
2  0.3718568  4.1437190  5.2984132 -0.7518832
3 -2.3521195  5.3872990  5.0317510  1.6838564
4  5.7136191  0.1676915 -0.6596019  3.5837421
5  3.6536284 -2.2879123 -1.0215079  0.1788376
6 -3.0697677 -1.8018506  1.1868167  0.2379537

#### 4.3.1.1 Using the $$\mathtt{describe}$$ function

• The package psych comes with a function called $$\mathtt{describe}$$ which generated the descriptive statistics for all the data vectors (columns) in a data frame, matrix or a vector.
library(psych)  #load the required package
args(describe)  #arguments for describe function
function (x, na.rm = TRUE, interp = FALSE, skew = TRUE, ranges = TRUE,
trim = 0.1, type = 3, check = TRUE, fast = NULL, quant = NULL,
IQR = FALSE, omit = FALSE, data = NULL)
NULL
# use describe to calculate descriptive stats for data_cs1.1r
desc1 = describe(data_stocks_ret[, 2:11])  #note we dont pass the date column
# check the output
head(desc1)
    vars    n  mean   sd median trimmed  mad    min   max range  skew kurtosis
DJI    1 3522  0.01 1.23   0.04    0.03 0.82  -8.20 10.51 18.71 -0.06     7.71
AXP    2 3522  0.02 2.89   0.02    0.03 1.55 -19.35 18.77 38.12 -0.01     9.14
MMM    3 3522  0.03 1.55   0.03    0.03 1.10  -9.38 10.39 19.78  0.06     4.87
ATT    4 3522 -0.01 1.80   0.03    0.01 1.22 -13.54 15.08 28.62  0.02     6.26
BA     5 3522  0.03 2.01   0.05    0.06 1.57 -19.39 14.38 33.77 -0.26     5.39
CAT    6 3522  0.04 2.14   0.04    0.05 1.65 -15.69 13.73 29.42 -0.08     4.08
se
DJI 0.02
AXP 0.05
MMM 0.03
ATT 0.03
BA  0.03
CAT 0.04
# the above output is in long format, we can transpose it get column format
desc1.t = t(desc1)
head(desc1.t)
                 DJI          AXP          MMM           ATT           BA
vars    1.000000e+00 2.000000e+00 3.000000e+00  4.000000e+00 5.000000e+00
n       3.522000e+03 3.522000e+03 3.522000e+03  3.522000e+03 3.522000e+03
mean    1.055257e-02 1.908563e-02 3.056011e-02 -8.647491e-03 3.499777e-02
sd      1.226702e+00 2.892586e+00 1.551706e+00  1.799180e+00 2.013123e+00
median  4.442671e-02 1.723604e-02 3.233787e-02  3.018428e-02 5.279680e-02
trimmed 2.597511e-02 3.152635e-02 3.241946e-02  9.134423e-03 5.567097e-02
CAT         CISCO            DD          XOM            GE
vars    6.000000e+00  7.000000e+00  8.000000e+00 9.000000e+00  1.000000e+01
n       3.522000e+03  3.522000e+03  3.522000e+03 3.522000e+03  3.522000e+03
mean    3.710732e-02 -2.554732e-02 -5.379787e-04 2.653014e-02 -1.696661e-02
sd      2.143040e+00  2.744068e+00  1.881993e+00 1.629181e+00  2.057353e+00
median  4.291300e-02  3.800312e-02  0.000000e+00 5.437650e-02  0.000000e+00
trimmed 4.678016e-02 -1.339579e-04  1.469270e-03 4.883990e-02  9.841473e-05
• The descriptive statistics generated above gives mean, median, standard deviation, trimmed mean(trimmed), median, mad (median absolute deviation from the median), minimum (min), maximum (max), skewness (skew), kurtosis and standard error (se) .

• This can easily be transferred to a CSV file or a text file. The following single line of code transfers the descriptive statistics to a CSV file which then can be imported into a word or latex file as required.

• The pastecs package provides the function $$\mathtt{stat.desc}$$ which generated descriptive statistics for a data frame, matrix or a timeseries. Skewness and Kurtosis are not calculated by default in $$\mathtt{stat.desc}$$ but the argument $$\mathtt{norm}$$ can be set to TRUE to get these measures along with their standard errors.

require(pastecs)  # note library and require can both be used to include a package
# detach the package pastecs its useful to avoid any conflicts (e.g psych and
# Hmisc have 'describe' function with two different behaviours
detach("package:psych", unload = TRUE)
# use stat.desc in with default arguments
desc2 = stat.desc(data_stocks_ret[, 2:11], norm = TRUE)
desc2  #note skewness/kurtosis
                       DJI           AXP           MMM           ATT
nbr.val       3.522000e+03  3.522000e+03  3.522000e+03  3.522000e+03
nbr.null      2.000000e+00  2.400000e+01  3.000000e+01  5.000000e+01
nbr.na        0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00
min          -8.200737e+00 -1.935233e+01 -9.383688e+00 -1.353821e+01
max           1.050812e+01  1.877116e+01  1.039309e+01  1.508318e+01
range         1.870886e+01  3.812349e+01  1.977678e+01  2.862139e+01
sum           3.716616e+01  6.721959e+01  1.076327e+02 -3.045646e+01
median        4.442671e-02  1.723604e-02  3.233787e-02  3.018428e-02
mean          1.055257e-02  1.908563e-02  3.056011e-02 -8.647491e-03
SE.mean       2.067018e-02  4.874068e-02  2.614656e-02  3.031656e-02
CI.mean.0.95  4.052675e-02  9.556282e-02  5.126395e-02  5.943979e-02
var           1.504798e+00  8.367052e+00  2.407790e+00  3.237048e+00
std.dev       1.226702e+00  2.892586e+00  1.551706e+00  1.799180e+00
coef.var      1.162467e+02  1.515583e+02  5.077551e+01 -2.080580e+02
skewness     -5.829983e-02 -6.689750e-03  5.927112e-02  1.620418e-02
skew.2SE     -7.065472e-01 -8.107441e-02  7.183185e-01  1.963817e-01
kurtosis      7.714304e+00  9.141053e+00  4.865294e+00  6.257155e+00
kurt.2SE      4.675883e+01  5.540681e+01  2.949008e+01  3.792659e+01
normtest.W    9.187712e-01  8.496717e-01  9.384591e-01  9.298653e-01
normtest.p    5.671566e-40  1.020339e-49  6.131053e-36  8.237017e-38
BA           CAT         CISCO            DD
nbr.val       3.522000e+03  3.522000e+03  3.522000e+03  3.522000e+03
nbr.null      1.500000e+01  2.400000e+01  5.000000e+01  2.900000e+01
nbr.na        0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00
min          -1.938931e+01 -1.568589e+01 -1.768649e+01 -1.202802e+01
max           1.437774e+01  1.373497e+01  2.182386e+01  1.085590e+01
range         3.376704e+01  2.942086e+01  3.951034e+01  2.288392e+01
sum           1.232621e+02  1.306920e+02 -8.997766e+01 -1.894761e+00
median        5.279680e-02  4.291300e-02  3.800312e-02  0.000000e+00
mean          3.499777e-02  3.710732e-02 -2.554732e-02 -5.379787e-04
SE.mean       3.392155e-02  3.611067e-02  4.623812e-02  3.171197e-02
CI.mean.0.95  6.650788e-02  7.079996e-02  9.065621e-02  6.217570e-02
var           4.052665e+00  4.592620e+00  7.529907e+00  3.541897e+00
std.dev       2.013123e+00  2.143040e+00  2.744068e+00  1.881993e+00
coef.var      5.752148e+01  5.775248e+01 -1.074112e+02 -3.498266e+03
skewness     -2.605203e-01 -8.369997e-02  1.547891e-01 -1.523353e-01
skew.2SE     -3.157298e+00 -1.014377e+00  1.875920e+00 -1.846181e+00
kurtosis      5.392404e+00  4.077013e+00  7.329406e+00  5.061290e+00
kurt.2SE      3.268507e+01  2.471206e+01  4.442585e+01  3.067808e+01
normtest.W    9.550329e-01  9.583075e-01  9.104004e-01  9.381393e-01
normtest.p    1.196739e-31  1.172506e-30  1.882675e-41  5.179385e-36
XOM            GE
nbr.val       3.522000e+03  3.522000e+03
nbr.null      2.900000e+01  6.300000e+01
nbr.na        0.000000e+00  0.000000e+00
min          -1.502710e+01 -1.368410e+01
max           1.586307e+01  1.798444e+01
range         3.089017e+01  3.166854e+01
sum           9.343915e+01 -5.975640e+01
median        5.437650e-02  0.000000e+00
mean          2.653014e-02 -1.696661e-02
SE.mean       2.745205e-02  3.466683e-02
CI.mean.0.95  5.382353e-02  6.796911e-02
var           2.654232e+00  4.232702e+00
std.dev       1.629181e+00  2.057353e+00
coef.var      6.140870e+01 -1.212589e+02
skewness      4.651513e-02  1.102593e-02
skew.2SE      5.637262e-01  1.336254e-01
kurtosis      1.043194e+01  7.781017e+00
kurt.2SE      6.323129e+01  4.716320e+01
normtest.W    9.160764e-01  9.046828e-01
normtest.p    1.839459e-40  2.126284e-42