Topic 4 Data Exploration

The good news is that these descriptive statistics give us a manageable and meaningful summary of the underlying phenomenon. That’s what this chapter is about. The bad news is that any simplification invites abuse. Descriptive statistics can be like online dating profiles: technically accurate and yet pretty darn misleading. -Charles Wheelan

4.1 Data Preprocessing

We will now discuss some methods for data manipulation: cleaning a dataset, combining datasets, and extracting a variable from a data frame.

4.1.1 Extracting Data

Data frames are the most commonly used data structures in R, as they offer great flexibility in the way they handle data.

  • Let's look at some methods for extracting data from a data frame. We will use the example dataset \(\mathtt{us\_stocks.csv}\).

  • Let's import it using \(\mathtt{read.csv}\).

data_stocks = read.csv(file = "data/us_stocks.csv", header = TRUE)
head(data_stocks)
       Date  MSFT    IBM  AAPL   MCD    PG GOOG
1 2/01/2002 33.52 121.50 11.65 26.49 40.00   NA
2 3/01/2002 34.62 123.66 11.79 26.79 39.62   NA
3 4/01/2002 34.45 125.60 11.84 26.99 39.22   NA
4 7/01/2002 34.28 124.05 11.45 27.20 38.78   NA
5 8/01/2002 34.69 124.70 11.30 27.36 38.88   NA
6 9/01/2002 34.36 124.49 10.82 26.88 38.60   NA
  • The functions \(\mathtt{names}\) and \(\mathtt{colnames}\) are used to access the names of the columns (or variables) in the dataset, as shown below.

  • The function \(\mathtt{row.names}\) can be used to access the row names (if any) of a dataset; a quick check follows the output below.

names(data_stocks)
[1] "Date" "MSFT" "IBM"  "AAPL" "MCD"  "PG"   "GOOG"
colnames(data_stocks)
[1] "Date" "MSFT" "IBM"  "AAPL" "MCD"  "PG"   "GOOG"
  • A specific data variable can be accessed using its name or index (column number) in the data frame.

  • To select a column, use the \(\mathtt{\$}\) symbol followed by the column name, or put the name in square brackets, as shown in the examples below.

msft_prices1 = data_stocks$MSFT  #the data is returned as a vector
head(msft_prices1)
[1] 33.52 34.62 34.45 34.28 34.69 34.36
msft_prices2 = data_stocks[["MSFT"]]  #the data is returned as a vector
head(msft_prices2)
[1] 33.52 34.62 34.45 34.28 34.69 34.36
# the following returns data as a data frame
msft_prices3 = data_stocks["MSFT"]  #can also be used to access multiple columns
head(msft_prices3)
   MSFT
1 33.52
2 34.62
3 34.45
4 34.28
5 34.69
6 34.36
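  • Passing a vector of names (or column numbers) in single brackets returns several columns at once as a data frame, for example:
msft_aapl = data_stocks[c("MSFT", "AAPL")]  #two columns, still a data frame
head(msft_aapl)
   MSFT  AAPL
1 33.52 11.65
2 34.62 11.79
3 34.45 11.84
4 34.28 11.45
5 34.69 11.30
6 34.36 10.82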
  • These data columns can also be accessed like a matrix, using a matrix index.

  • This method can return a complete row, a complete column or just an element from the dataset.

# MSFT is in the second column and leaving the row index blank returns all the
# rows for the particular column

msft_prices4 = data_stocks[, 2]

head(msft_prices4)
[1] 33.52 34.62 34.45 34.28 34.69 34.36
# all the elements in row 4
data_stocks[4, ]
       Date  MSFT    IBM  AAPL  MCD    PG GOOG
4 7/01/2002 34.28 124.05 11.45 27.2 38.78   NA
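  • A single element can be extracted by supplying both a row and a column index; for example, the MSFT price in row 4 matches the value shown above:
data_stocks[4, 2]  #row 4, column 2 (MSFT)
[1] 34.28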

4.1.2 Combining Data Frames

  • It is often necessary to combine two data frames during data processing.

  • This can be done by stacking them row by row or combining them by columns using \(\mathtt{rbind}\) and \(\mathtt{cbind}\) respectively.

  • When using \(\mathtt{cbind}\), the objects being combined must have the same number of rows; likewise, with \(\mathtt{rbind}\) they must have the same number of columns. Let's see an example.

# First create a vector having the returns for msft
msft_ret = 100 * diff(log(data_stocks$MSFT))
# combine the vector with the data
data_stocks_r = cbind(data_stocks, MSFT_RET = msft_ret)  #this will generate an error message 
Error in data.frame(..., check.names = FALSE): arguments imply differing number of rows: 2784, 2783
# different length
length(msft_ret)
[1] 2783
length(data_stocks$MSFT)
[1] 2784
# add one more value to vector msft_ret
msft_ret = c(0, msft_ret)
# check the length
length(msft_ret)
[1] 2784
# let's combine now (it should work)
data_stocks_r = cbind(data_stocks, MSFT_RET = msft_ret)
head(data_stocks_r)  #shows one more column added to the data
       Date  MSFT    IBM  AAPL   MCD    PG GOOG   MSFT_RET
1 2/01/2002 33.52 121.50 11.65 26.49 40.00   NA  0.0000000
2 3/01/2002 34.62 123.66 11.79 26.79 39.62   NA  3.2289274
3 4/01/2002 34.45 125.60 11.84 26.99 39.22   NA -0.4922552
4 7/01/2002 34.28 124.05 11.45 27.20 38.78   NA -0.4946904
5 8/01/2002 34.69 124.70 11.30 27.36 38.88   NA  1.1889367
6 9/01/2002 34.36 124.49 10.82 26.88 38.60   NA -0.9558364
  • The following example stacks rows from two data frames using \(\mathtt{rbind}\).
# create two dataframes from data_stocks
data_r1 = data_stocks[1:10, ]  #first 10 rows
data_r2 = data_stocks[2775:2784, ]  #last 10 rows
data_stocks_rbind = rbind(data_r1, data_r2)
print(data_stocks_rbind)
           Date  MSFT    IBM   AAPL   MCD    PG   GOOG
1     2/01/2002 33.52 121.50  11.65 26.49 40.00     NA
2     3/01/2002 34.62 123.66  11.79 26.79 39.62     NA
3     4/01/2002 34.45 125.60  11.84 26.99 39.22     NA
4     7/01/2002 34.28 124.05  11.45 27.20 38.78     NA
5     8/01/2002 34.69 124.70  11.30 27.36 38.88     NA
6     9/01/2002 34.36 124.49  10.82 26.88 38.60     NA
7    10/01/2002 34.64 122.14  10.62 26.81 38.46     NA
8    11/01/2002 34.30 120.31  10.52 26.34 38.60     NA
9    14/01/2002 34.24 118.05  10.58 26.02 39.35     NA
10   15/01/2002 34.78 118.85  10.85 26.20 39.82     NA
2775 17/12/2012 27.10 193.62 518.83 89.91 69.93 720.78
2776 18/12/2012 27.56 195.69 533.90 90.52 69.97 721.07
2777 19/12/2012 27.31 195.08 526.31 89.71 69.34 720.11
2778 20/12/2012 27.68 194.77 521.73 90.04 69.82 722.36
2779 21/12/2012 27.45 193.42 519.33 90.18 68.72 715.63
2780 24/12/2012 27.06 192.40 520.17 89.29 68.52 709.50
2781 26/12/2012 26.86 191.95 513.00 88.74 68.00 708.87
2782 27/12/2012 26.96 192.71 515.06 88.72 67.97 706.29
2783 28/12/2012 26.55 189.83 509.59 87.58 67.15 700.01
2784 31/12/2012 26.71 191.55 532.17 88.21 67.89 707.38

4.1.3 Subsetting and Logical Data Selection

  • Suppose we want to extract data with particular characteristics, such as values within a certain range.

  • This can be accomplished using logical statements in bracket notation.

  • The following example illustrates this. See \(\mathtt{help(">")}\) for more comparison operators.

# select all rows with Apple prices above 100
data_aaplgr100 = data_stocks[data_stocks$AAPL > 100, ]
head(data_aaplgr100)
          Date  MSFT    IBM   AAPL   MCD    PG   GOOG
1342 2/05/2007 30.61 102.22 100.39 50.02 62.37 465.78
1343 3/05/2007 30.97 102.80 100.40 49.91 62.00 473.23
1344 4/05/2007 30.56 102.96 100.81 49.92 62.41 471.12
1345 7/05/2007 30.71 103.16 103.92 49.50 62.18 467.27
1346 8/05/2007 30.75 103.29 105.06 49.32 61.75 466.81
1347 9/05/2007 30.78 104.38 106.88 49.84 62.01 469.25
min(data_aaplgr100$AAPL)  #check if the prices are above 100
[1] NA
# this gives NA as the minimum, which indicates the data frame contains NAs;
# let's remove them from data_aaplgr100 using the na.omit function
data_aaplgr100 = na.omit(data_aaplgr100)
# now check the minimum again
min(data_aaplgr100$AAPL)
[1] 100.06
  • The \(\mathtt{na.omit}\) function used in the example above removes all rows containing missing values (NAs) from a dataset.
head(data_stocks)  #notice NAs in GOOG
       Date  MSFT    IBM  AAPL   MCD    PG GOOG
1 2/01/2002 33.52 121.50 11.65 26.49 40.00   NA
2 3/01/2002 34.62 123.66 11.79 26.79 39.62   NA
3 4/01/2002 34.45 125.60 11.84 26.99 39.22   NA
4 7/01/2002 34.28 124.05 11.45 27.20 38.78   NA
5 8/01/2002 34.69 124.70 11.30 27.36 38.88   NA
6 9/01/2002 34.36 124.49 10.82 26.88 38.60   NA
data_stocks_googlena = data_stocks[!is.na(data_stocks$GOOG), ]
head(data_stocks_googlena)  #after removing NAs
          Date  MSFT   IBM  AAPL   MCD    PG   GOOG
663 19/08/2004 27.12 84.89 15.36 26.60 54.48 100.34
664 20/08/2004 27.20 85.25 15.40 27.07 54.85 108.31
665 23/08/2004 27.24 84.65 15.54 26.64 54.75 109.40
666 24/08/2004 27.24 84.71 15.98 26.87 54.95 104.87
667 25/08/2004 27.55 85.07 16.52 26.95 55.30 106.00
668 26/08/2004 27.44 84.69 17.33 27.10 55.70 107.91
# the above can still leave NAs in other columns; use na.omit to remove all
# rows with missing data
data_stocks_naomit = na.omit(data_stocks)
  • In data preprocessing, one might have to select data within a range.

  • The following example selects data where MSFT prices lie between 20 and 30.

  • \(\mathtt{\&}\) is a logical (AND) operator in R; see \(\mathtt{help("\&")}\) for more details and other logical operators.

data_msft = data_stocks_naomit[data_stocks_naomit$MSFT <= 30 & data_stocks_naomit$MSFT >
    20, ]
min(data_msft$MSFT)  #check 
[1] 20.06
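  • The OR operator \(\mathtt{|}\) works in the same way; a minimal sketch (output not shown) keeping rows where MSFT is either below 20 or above 30:
# rows where MSFT falls outside the 20-30 band
data_msft_out = data_stocks_naomit[data_stocks_naomit$MSFT < 20 | data_stocks_naomit$MSFT >
    30, ]
range(data_msft_out$MSFT)  #check the selected values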
  • These selections can also be performed using the function \(\mathtt{subset}\).

  • The following example uses the \(\mathtt{subset}\) function to select rows with AAPL > 100. The arguments of the function are also shown in the example.

args(subset.data.frame)
function (x, subset, select, drop = FALSE, ...) 
NULL
aaplgr100 = subset(data_stocks_naomit, AAPL > 100)
head(aaplgr100)
          Date  MSFT    IBM   AAPL   MCD    PG   GOOG
1342 2/05/2007 30.61 102.22 100.39 50.02 62.37 465.78
1343 3/05/2007 30.97 102.80 100.40 49.91 62.00 473.23
1344 4/05/2007 30.56 102.96 100.81 49.92 62.41 471.12
1345 7/05/2007 30.71 103.16 103.92 49.50 62.18 467.27
1346 8/05/2007 30.75 103.29 105.06 49.32 61.75 466.81
1347 9/05/2007 30.78 104.38 106.88 49.84 62.01 469.25
min(aaplgr100$AAPL)
[1] 100.06

4.2 Data Transformation from Wide to Long (or vice versa)

  • Sometimes it is necessary to transform data from wide format to long format, which is often required when working with the ggplot2 package (discussed in the graphics section).
  • The R package tidyr provides two functions, pivot_longer() and pivot_wider(), to transform data into long or wide format.
  • Let's convert the stocks data to the long format.
library(tidyr)

FinData_long = pivot_longer(data = data_stocks, cols = -Date, names_to = "Stock",
    values_to = "Price")
head(FinData_long)
# A tibble: 6 × 3
  Date      Stock Price
  <chr>     <chr> <dbl>
1 2/01/2002 MSFT   33.5
2 2/01/2002 IBM   122. 
3 2/01/2002 AAPL   11.6
4 2/01/2002 MCD    26.5
5 2/01/2002 PG     40  
6 2/01/2002 GOOG   NA  
  • The reverse operation can be performed using pivot_wider().
FinData_wide = pivot_wider(FinData_long, names_from = Stock, values_from = Price)
head(FinData_wide)
# A tibble: 6 × 7
  Date       MSFT   IBM  AAPL   MCD    PG  GOOG
  <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2/01/2002  33.5  122.  11.6  26.5  40      NA
2 3/01/2002  34.6  124.  11.8  26.8  39.6    NA
3 4/01/2002  34.4  126.  11.8  27.0  39.2    NA
4 7/01/2002  34.3  124.  11.4  27.2  38.8    NA
5 8/01/2002  34.7  125.  11.3  27.4  38.9    NA
6 9/01/2002  34.4  124.  10.8  26.9  38.6    NA
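  • A quick check of the round trip is to compare dimensions; both the original data and the re-widened data have 2784 rows and 7 columns:
dim(data_stocks)
[1] 2784    7
dim(FinData_wide)
[1] 2784    7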

4.3 Summary Statistics

  • It is simple to calculate basic summary statistics in R; most of the functions are named according to what they do.

  • For instance, \(\mathtt{mean}\) calculates the mean of a single variable and \(\mathtt{sd}\) calculates the standard deviation.

Table: Basic statistics functions in R

  Statistic               R function
  Arithmetic mean         mean(x)
  Geometric mean          exp(mean(log(x)))
  Median                  median(x)
  Range                   range(x)
  Variance                var(x)
  Standard deviation      sd(x)
  Interquartile range     IQR(x)
  Other quantiles         quantile(x)
  Skewness                skewness(x)
  Kurtosis                kurtosis(x)
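  • The geometric mean entry uses the identity \(\bigl(\prod_{i=1}^{n} x_i\bigr)^{1/n} = \exp\bigl(\tfrac{1}{n}\sum_{i=1}^{n}\ln x_i\bigr)\), which is exactly what \(\mathtt{exp(mean(log(x)))}\) computes; note it is only defined for strictly positive \(x\).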
  • The following example demonstrates how to calculate the statistical measures in the table above for the Dow Jones prices in the data file \(\mathtt{data\_fin.csv}\).
# change the working directory to the folder containing data_fin.csv or provide
# the full path with the filename
data_stocks = read.csv("data/data_fin.csv")  #import data
head(data_stocks)
        Date     DJI   AXP   MMM   ATT    BA   CAT CISCO    DD   XOM    GE
1  3/01/2000 11357.5 45.82 47.19 47.19 40.12 24.31 54.05 65.00 39.09 49.95
2  4/01/2000 10997.9 44.09 45.31 44.25 40.12 24.00 51.00 65.00 38.41 48.06
3  5/01/2000 11122.7 42.96 46.62 44.94 42.62 24.56 51.19 67.75 40.50 47.70
4  6/01/2000 11253.3 43.78 50.62 43.75 43.06 25.81 50.00 71.50 42.59 48.51
5  7/01/2000 11522.6 44.42 51.47 44.12 44.12 26.66 52.94 71.62 42.31 50.28
6 10/01/2000 11572.2 45.04 51.12 44.75 43.69 25.78 54.91 70.00 41.88 50.37
     GS    HD    IBM  INTC   JNJ   JPM   MRK   MCD  MSFT   NKE
1 88.31 65.50 115.56 43.47 46.09 48.69 64.04 39.62 58.34 12.03
2 82.38 61.50 112.06 41.47 44.41 47.27 61.61 38.81 56.31 11.38
3 78.88 61.44 116.00 41.81 44.88 46.98 64.22 39.44 56.91 12.03
4 82.25 60.00 114.62 39.38 46.28 47.65 64.75 38.88 55.00 11.97
5 82.56 62.81 113.31 41.00 47.88 48.52 70.97 39.75 55.72 11.97
6 84.38 63.19 118.44 42.88 47.03 47.69 68.89 40.06 56.12 12.17
DJI = data_stocks$DJI
DJI = na.omit(DJI)  #remove NAs as it will affect the calculations
# Arithmetic mean
mean(DJI)
[1] 11098.12
# Geometric mean
exp(mean(log(DJI)))
[1] 10953.39
# median
median(DJI)
[1] 10748.8
# variance & standard deviation
var(DJI)
[1] 3280347
sd(DJI)
[1] 1811.173
# interquartile range and a few quantiles
IQR(DJI)
[1] 2276.25
quantile(DJI)
      0%      25%      50%      75%     100% 
 6547.10 10063.25 10748.80 12339.50 16576.66 
# skewness and kurtosis: these functions are not available in base R but are
# provided by the e1071 package (other packages also have skewness and
# kurtosis functions; try ??kurtosis to search for them)
library(e1071)
skewness(DJI)
[1] 0.4777828
kurtosis(DJI)
[1] 0.08404185
  • The function \(\mathtt{summary}\) in R provides a basic summary, viz. the minimum, maximum, mean, median and quartiles, for a single variable or a whole dataset. The function can be used as follows.
# summary of one column/variable in a dataframe
summary(DJI)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   6547   10063   10749   11098   12340   16577 
# summary of the whole dataset excluding the Date column
summary(data_stocks[, c(2:21)])
      DJI             AXP             MMM              ATT       
 Min.   : 6547   Min.   :10.26   Min.   : 39.50   Min.   :19.34  
 1st Qu.:10063   1st Qu.:38.31   1st Qu.: 62.55   1st Qu.:25.54  
 Median :10749   Median :47.49   Median : 77.67   Median :29.65  
 Mean   :11098   Mean   :46.77   Mean   : 75.78   Mean   :31.77  
 3rd Qu.:12340   3rd Qu.:54.44   3rd Qu.: 85.55   3rd Qu.:37.22  
 Max.   :16577   Max.   :90.73   Max.   :140.25   Max.   :58.50  
 NA's   :27      NA's   :12      NA's   :12       NA's   :12     
       BA              CAT             CISCO             DD       
 Min.   : 25.06   Min.   : 14.91   Min.   : 8.60   Min.   :16.14  
 1st Qu.: 44.00   1st Qu.: 28.64   1st Qu.:17.68   1st Qu.:41.13  
 Median : 63.56   Median : 57.10   Median :20.43   Median :44.53  
 Mean   : 62.95   Mean   : 56.01   Mean   :23.41   Mean   :44.37  
 3rd Qu.: 74.90   3rd Qu.: 79.36   3rd Qu.:24.17   3rd Qu.:48.92  
 Max.   :138.36   Max.   :116.20   Max.   :80.06   Max.   :71.62  
 NA's   :12       NA's   :12       NA's   :13      NA's   :12     
      XOM               GE              GS              HD       
 Min.   : 30.27   Min.   : 6.66   Min.   : 52.0   Min.   :18.00  
 1st Qu.: 42.46   1st Qu.:20.00   1st Qu.: 92.2   1st Qu.:31.00  
 Median : 64.77   Median :30.33   Median :116.2   Median :37.37  
 Mean   : 63.20   Mean   :29.56   Mean   :126.9   Mean   :40.20  
 3rd Qu.: 81.62   3rd Qu.:36.03   3rd Qu.:159.5   3rd Qu.:46.23  
 Max.   :101.51   Max.   :59.94   Max.   :247.9   Max.   :82.34  
 NA's   :12       NA's   :12      NA's   :12      NA's   :12     
      IBM              INTC            JNJ             JPM       
 Min.   : 55.07   Min.   :12.08   Min.   :33.69   Min.   :15.45  
 1st Qu.: 87.82   1st Qu.:20.17   1st Qu.:55.27   1st Qu.:35.66  
 Median :106.48   Median :22.76   Median :61.30   Median :40.20  
 Mean   :118.83   Mean   :25.21   Mean   :61.07   Mean   :40.36  
 3rd Qu.:130.00   3rd Qu.:26.77   3rd Qu.:65.20   3rd Qu.:45.71  
 Max.   :215.80   Max.   :74.88   Max.   :95.63   Max.   :65.70  
 NA's   :12       NA's   :13      NA's   :12      NA's   :12     
      MRK             MCD              MSFT            NKE       
 Min.   :20.99   Min.   : 12.38   Min.   :15.15   Min.   : 6.64  
 1st Qu.:34.53   1st Qu.: 29.19   1st Qu.:25.67   1st Qu.:14.56  
 Median :43.63   Median : 43.78   Median :27.59   Median :23.25  
 Mean   :44.65   Mean   : 51.07   Mean   :28.38   Mean   :28.01  
 3rd Qu.:51.55   3rd Qu.: 70.36   3rd Qu.:30.19   3rd Qu.:36.90  
 Max.   :89.85   Max.   :103.59   Max.   :58.34   Max.   :79.86  
 NA's   :13      NA's   :12       NA's   :13      NA's   :12     

4.3.1 Example: Descriptive Statistics of Stock Returns

  • In this example we will use R to calculate descriptive statistics for the returns of 10 stocks in the data file \(\mathtt{data\_fin.csv}\).

  • We will first import the dataset into R using the \(\mathtt{read.csv}\) function.

data_cs1 = read.csv("data/data_fin.csv")
head(data_cs1)  #check the imported data
        Date     DJI   AXP   MMM   ATT    BA   CAT CISCO    DD   XOM    GE
1  3/01/2000 11357.5 45.82 47.19 47.19 40.12 24.31 54.05 65.00 39.09 49.95
2  4/01/2000 10997.9 44.09 45.31 44.25 40.12 24.00 51.00 65.00 38.41 48.06
3  5/01/2000 11122.7 42.96 46.62 44.94 42.62 24.56 51.19 67.75 40.50 47.70
4  6/01/2000 11253.3 43.78 50.62 43.75 43.06 25.81 50.00 71.50 42.59 48.51
5  7/01/2000 11522.6 44.42 51.47 44.12 44.12 26.66 52.94 71.62 42.31 50.28
6 10/01/2000 11572.2 45.04 51.12 44.75 43.69 25.78 54.91 70.00 41.88 50.37
     GS    HD    IBM  INTC   JNJ   JPM   MRK   MCD  MSFT   NKE
1 88.31 65.50 115.56 43.47 46.09 48.69 64.04 39.62 58.34 12.03
2 82.38 61.50 112.06 41.47 44.41 47.27 61.61 38.81 56.31 11.38
3 78.88 61.44 116.00 41.81 44.88 46.98 64.22 39.44 56.91 12.03
4 82.25 60.00 114.62 39.38 46.28 47.65 64.75 38.88 55.00 11.97
5 82.56 62.81 113.31 41.00 47.88 48.52 70.97 39.75 55.72 11.97
6 84.38 63.19 118.44 42.88 47.03 47.69 68.89 40.06 56.12 12.17
  • We apply the \(\mathtt{summary}\) function first with the dates stored as character, and then again after converting them to the Date class.
# selecting the first 10 price series, including the Date column
data_cs1.1 = data_cs1[, c(1:11)]
# data cleaning-remove NAs
data_cs1.1 = na.omit(data_cs1.1)
colnames(data_cs1.1)  # see the columns present in the data
 [1] "Date"  "DJI"   "AXP"   "MMM"   "ATT"   "BA"    "CAT"   "CISCO" "DD"   
[10] "XOM"   "GE"   
summary(data_cs1.1)  #notice the Date variable
     Date                DJI             AXP             MMM        
 Length:3523        Min.   : 6547   Min.   :10.26   Min.   : 39.50  
 Class :character   1st Qu.:10063   1st Qu.:38.38   1st Qu.: 62.55  
 Mode  :character   Median :10749   Median :47.60   Median : 77.67  
                    Mean   :11098   Mean   :46.83   Mean   : 75.80  
                    3rd Qu.:12340   3rd Qu.:54.50   3rd Qu.: 85.61  
                    Max.   :16577   Max.   :90.73   Max.   :140.25  
      ATT              BA              CAT             CISCO      
 Min.   :19.34   Min.   : 25.06   Min.   : 14.91   Min.   : 8.60  
 1st Qu.:25.54   1st Qu.: 44.02   1st Qu.: 28.48   1st Qu.:17.68  
 Median :29.76   Median : 63.61   Median : 57.11   Median :20.39  
 Mean   :31.79   Mean   : 62.99   Mean   : 56.03   Mean   :23.42  
 3rd Qu.:37.23   3rd Qu.: 74.95   3rd Qu.: 79.50   3rd Qu.:24.18  
 Max.   :58.50   Max.   :138.36   Max.   :116.20   Max.   :80.06  
       DD             XOM               GE       
 Min.   :16.14   Min.   : 30.27   Min.   : 6.66  
 1st Qu.:41.17   1st Qu.: 42.41   1st Qu.:20.04  
 Median :44.58   Median : 64.70   Median :30.37  
 Mean   :44.43   Mean   : 63.18   Mean   :29.63  
 3rd Qu.:48.93   3rd Qu.: 81.70   3rd Qu.:36.05  
 Max.   :71.62   Max.   :101.51   Max.   :59.94  
# check the class of the dates; they are imported as character by default
class(data_cs1.1$Date)
[1] "character"
# convert dates to class Date
data_cs1.1$Date = as.Date(data_cs1.1$Date, format = "%d/%m/%Y")
class(data_cs1.1$Date)
[1] "Date"
summary(data_cs1.1)  #notice the Date variable
      Date                 DJI             AXP             MMM        
 Min.   :2000-01-03   Min.   : 6547   Min.   :10.26   Min.   : 39.50  
 1st Qu.:2003-07-08   1st Qu.:10063   1st Qu.:38.38   1st Qu.: 62.55  
 Median :2007-01-05   Median :10749   Median :47.60   Median : 77.67  
 Mean   :2007-01-03   Mean   :11098   Mean   :46.83   Mean   : 75.80  
 3rd Qu.:2010-07-06   3rd Qu.:12340   3rd Qu.:54.50   3rd Qu.: 85.61  
 Max.   :2014-01-03   Max.   :16577   Max.   :90.73   Max.   :140.25  
      ATT              BA              CAT             CISCO      
 Min.   :19.34   Min.   : 25.06   Min.   : 14.91   Min.   : 8.60  
 1st Qu.:25.54   1st Qu.: 44.02   1st Qu.: 28.48   1st Qu.:17.68  
 Median :29.76   Median : 63.61   Median : 57.11   Median :20.39  
 Mean   :31.79   Mean   : 62.99   Mean   : 56.03   Mean   :23.42  
 3rd Qu.:37.23   3rd Qu.: 74.95   3rd Qu.: 79.50   3rd Qu.:24.18  
 Max.   :58.50   Max.   :138.36   Max.   :116.20   Max.   :80.06  
       DD             XOM               GE       
 Min.   :16.14   Min.   : 30.27   Min.   : 6.66  
 1st Qu.:41.17   1st Qu.: 42.41   1st Qu.:20.04  
 Median :44.58   Median : 64.70   Median :30.37  
 Mean   :44.43   Mean   : 63.18   Mean   :29.63  
 3rd Qu.:48.93   3rd Qu.: 81.70   3rd Qu.:36.05  
 Max.   :71.62   Max.   :101.51   Max.   :59.94  
  • Convert prices to returns
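  • The code below computes continuously compounded (log) returns in percent, \(r_t = 100\,[\ln(P_t) - \ln(P_{t-1})]\); this is what \(\mathtt{diff(log(x)) * 100}\) implements, and the resulting return series is one observation shorter than the price series.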
d2 = as.data.frame(sapply(data_cs1.1[2:11], function(x) diff(log(x)) * 100))  #note it will be one less
# create a different dataframe with returns
data_stocks_ret = as.data.frame(cbind(Date = data_cs1.1$Date[2:length(data_cs1.1$Date)],
    d2), stringsAsFactors = FALSE, row.names = NULL)
# visual inspection
head(data_stocks_ret)
        Date        DJI        AXP        MMM        ATT         BA       CAT
1 2000-01-04 -3.2173973 -3.8487678 -4.0654247 -6.4326634  0.0000000 -1.283396
2 2000-01-05  1.1283720 -2.5963549  2.8501875  1.5472895  6.0448664  2.306527
3 2000-01-06  1.1673354  1.8907642  8.2317122 -2.6836654  1.0270865  4.964291
4 2000-01-07  2.3648905  1.4512726  1.6652359  0.8421582  2.4318702  3.240230
5 2000-01-10  0.4295346  1.3861165 -0.6823304  1.4178250 -0.9793951 -3.356532
6 2000-01-11 -0.5293883  0.9061837 -1.7165263 -1.4178250 -1.8713726 -1.563754
       CISCO         DD        XOM         GE
1 -5.8083911  0.0000000 -1.7548837 -3.8572275
2  0.3718568  4.1437190  5.2984132 -0.7518832
3 -2.3521195  5.3872990  5.0317510  1.6838564
4  5.7136191  0.1676915 -0.6596019  3.5837421
5  3.6536284 -2.2879123 -1.0215079  0.1788376
6 -3.0697677 -1.8018506  1.1868167  0.2379537

4.3.1.1 Using the \(\mathtt{describe}\) function

  • The package psych comes with a function called \(\mathtt{describe}\), which generates descriptive statistics for all the data vectors (columns) in a data frame, matrix or vector.
library(psych)  #load the required package
args(describe)  #arguments for describe function
function (x, na.rm = TRUE, interp = FALSE, skew = TRUE, ranges = TRUE, 
    trim = 0.1, type = 3, check = TRUE, fast = NULL, quant = NULL, 
    IQR = FALSE, omit = FALSE, data = NULL) 
NULL
# use describe to calculate descriptive stats for the returns data
desc1 = describe(data_stocks_ret[, 2:11])  #note we don't pass the Date column
# check the output
head(desc1)
    vars    n  mean   sd median trimmed  mad    min   max range  skew kurtosis
DJI    1 3522  0.01 1.23   0.04    0.03 0.82  -8.20 10.51 18.71 -0.06     7.71
AXP    2 3522  0.02 2.89   0.02    0.03 1.55 -19.35 18.77 38.12 -0.01     9.14
MMM    3 3522  0.03 1.55   0.03    0.03 1.10  -9.38 10.39 19.78  0.06     4.87
ATT    4 3522 -0.01 1.80   0.03    0.01 1.22 -13.54 15.08 28.62  0.02     6.26
BA     5 3522  0.03 2.01   0.05    0.06 1.57 -19.39 14.38 33.77 -0.26     5.39
CAT    6 3522  0.04 2.14   0.04    0.05 1.65 -15.69 13.73 29.42 -0.08     4.08
      se
DJI 0.02
AXP 0.05
MMM 0.03
ATT 0.03
BA  0.03
CAT 0.04
# the above output has one row per variable; we can transpose it to put the
# variables in columns
desc1.t = t(desc1)
head(desc1.t)
                 DJI          AXP          MMM           ATT           BA
vars    1.000000e+00 2.000000e+00 3.000000e+00  4.000000e+00 5.000000e+00
n       3.522000e+03 3.522000e+03 3.522000e+03  3.522000e+03 3.522000e+03
mean    1.055257e-02 1.908563e-02 3.056011e-02 -8.647491e-03 3.499777e-02
sd      1.226702e+00 2.892586e+00 1.551706e+00  1.799180e+00 2.013123e+00
median  4.442671e-02 1.723604e-02 3.233787e-02  3.018428e-02 5.279680e-02
trimmed 2.597511e-02 3.152635e-02 3.241946e-02  9.134423e-03 5.567097e-02
                 CAT         CISCO            DD          XOM            GE
vars    6.000000e+00  7.000000e+00  8.000000e+00 9.000000e+00  1.000000e+01
n       3.522000e+03  3.522000e+03  3.522000e+03 3.522000e+03  3.522000e+03
mean    3.710732e-02 -2.554732e-02 -5.379787e-04 2.653014e-02 -1.696661e-02
sd      2.143040e+00  2.744068e+00  1.881993e+00 1.629181e+00  2.057353e+00
median  4.291300e-02  3.800312e-02  0.000000e+00 5.437650e-02  0.000000e+00
trimmed 4.678016e-02 -1.339579e-04  1.469270e-03 4.883990e-02  9.841473e-05
  • The descriptive statistics generated above give the mean, standard deviation (sd), median, trimmed mean (trimmed), median absolute deviation (mad), minimum (min), maximum (max), range, skewness (skew), kurtosis and standard error (se).

  • This table can easily be written to a CSV or text file. A single line of code, as sketched below, transfers the descriptive statistics to a CSV file, which can then be imported into a Word or LaTeX document as required.
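For example (the file name \(\mathtt{desc\_stats.csv}\) is just a placeholder):
# export the descriptive statistics table to a CSV file in the working directory
write.csv(desc1, file = "desc_stats.csv")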

  • The pastecs package provides the function \(\mathtt{stat.desc}\), which generates descriptive statistics for a data frame, matrix or time series. Skewness and kurtosis are not calculated by default in \(\mathtt{stat.desc}\), but the argument \(\mathtt{norm}\) can be set to TRUE to obtain these measures (along with normality-test statistics).

require(pastecs)  # note library and require can both be used to include a package
# detach the package psych; this is useful to avoid conflicts (e.g. psych and
# Hmisc both have a 'describe' function with different behaviours)
detach("package:psych", unload = TRUE)
# use stat.desc with norm = TRUE to obtain skewness, kurtosis and normality tests
desc2 = stat.desc(data_stocks_ret[, 2:11], norm = TRUE)
desc2  #note skewness/kurtosis
                       DJI           AXP           MMM           ATT
nbr.val       3.522000e+03  3.522000e+03  3.522000e+03  3.522000e+03
nbr.null      2.000000e+00  2.400000e+01  3.000000e+01  5.000000e+01
nbr.na        0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00
min          -8.200737e+00 -1.935233e+01 -9.383688e+00 -1.353821e+01
max           1.050812e+01  1.877116e+01  1.039309e+01  1.508318e+01
range         1.870886e+01  3.812349e+01  1.977678e+01  2.862139e+01
sum           3.716616e+01  6.721959e+01  1.076327e+02 -3.045646e+01
median        4.442671e-02  1.723604e-02  3.233787e-02  3.018428e-02
mean          1.055257e-02  1.908563e-02  3.056011e-02 -8.647491e-03
SE.mean       2.067018e-02  4.874068e-02  2.614656e-02  3.031656e-02
CI.mean.0.95  4.052675e-02  9.556282e-02  5.126395e-02  5.943979e-02
var           1.504798e+00  8.367052e+00  2.407790e+00  3.237048e+00
std.dev       1.226702e+00  2.892586e+00  1.551706e+00  1.799180e+00
coef.var      1.162467e+02  1.515583e+02  5.077551e+01 -2.080580e+02
skewness     -5.829983e-02 -6.689750e-03  5.927112e-02  1.620418e-02
skew.2SE     -7.065472e-01 -8.107441e-02  7.183185e-01  1.963817e-01
kurtosis      7.714304e+00  9.141053e+00  4.865294e+00  6.257155e+00
kurt.2SE      4.675883e+01  5.540681e+01  2.949008e+01  3.792659e+01
normtest.W    9.187712e-01  8.496717e-01  9.384591e-01  9.298653e-01
normtest.p    5.671566e-40  1.020339e-49  6.131053e-36  8.237017e-38
                        BA           CAT         CISCO            DD
nbr.val       3.522000e+03  3.522000e+03  3.522000e+03  3.522000e+03
nbr.null      1.500000e+01  2.400000e+01  5.000000e+01  2.900000e+01
nbr.na        0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00
min          -1.938931e+01 -1.568589e+01 -1.768649e+01 -1.202802e+01
max           1.437774e+01  1.373497e+01  2.182386e+01  1.085590e+01
range         3.376704e+01  2.942086e+01  3.951034e+01  2.288392e+01
sum           1.232621e+02  1.306920e+02 -8.997766e+01 -1.894761e+00
median        5.279680e-02  4.291300e-02  3.800312e-02  0.000000e+00
mean          3.499777e-02  3.710732e-02 -2.554732e-02 -5.379787e-04
SE.mean       3.392155e-02  3.611067e-02  4.623812e-02  3.171197e-02
CI.mean.0.95  6.650788e-02  7.079996e-02  9.065621e-02  6.217570e-02
var           4.052665e+00  4.592620e+00  7.529907e+00  3.541897e+00
std.dev       2.013123e+00  2.143040e+00  2.744068e+00  1.881993e+00
coef.var      5.752148e+01  5.775248e+01 -1.074112e+02 -3.498266e+03
skewness     -2.605203e-01 -8.369997e-02  1.547891e-01 -1.523353e-01
skew.2SE     -3.157298e+00 -1.014377e+00  1.875920e+00 -1.846181e+00
kurtosis      5.392404e+00  4.077013e+00  7.329406e+00  5.061290e+00
kurt.2SE      3.268507e+01  2.471206e+01  4.442585e+01  3.067808e+01
normtest.W    9.550329e-01  9.583075e-01  9.104004e-01  9.381393e-01
normtest.p    1.196739e-31  1.172506e-30  1.882675e-41  5.179385e-36
                       XOM            GE
nbr.val       3.522000e+03  3.522000e+03
nbr.null      2.900000e+01  6.300000e+01
nbr.na        0.000000e+00  0.000000e+00
min          -1.502710e+01 -1.368410e+01
max           1.586307e+01  1.798444e+01
range         3.089017e+01  3.166854e+01
sum           9.343915e+01 -5.975640e+01
median        5.437650e-02  0.000000e+00
mean          2.653014e-02 -1.696661e-02
SE.mean       2.745205e-02  3.466683e-02
CI.mean.0.95  5.382353e-02  6.796911e-02
var           2.654232e+00  4.232702e+00
std.dev       1.629181e+00  2.057353e+00
coef.var      6.140870e+01 -1.212589e+02
skewness      4.651513e-02  1.102593e-02
skew.2SE      5.637262e-01  1.336254e-01
kurtosis      1.043194e+01  7.781017e+00
kurt.2SE      6.323129e+01  4.716320e+01
normtest.W    9.160764e-01  9.046828e-01
normtest.p    1.839459e-40  2.126284e-42
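  • The scientific notation above can be hard to read; a convenient option is to round the table before inspecting or exporting it, for example:
round(desc2, 4)  #same statistics rounded to 4 decimal places (output not shown)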