15.1 Import Data and Pre-processing

The data description is available at https://onlinecourses.science.psu.edu/stat508/book/export/html/796

  • 20 predictor variables (the data frame has 21 columns: these 20 plus the response, Creditability)

    • Status of existing checking account
    • Duration in month
    • Credit history
    • Purpose
    • Credit amount
    • Savings account/bonds
    • Present employment since
    • Installment rate in percentage of disposable income
    • Personal status and sex
    • Other debtors / guarantors
    • Present residence since
    • Property
    • Age in years
    • Other installment plans
    • Housing
    • Number of existing credits at this bank
    • Job
    • Number of people being liable to provide maintenance for
    • Telephone
    • Foreign worker

data_cr = read.csv("data/german_credit.csv")
# preliminary analysis: descriptive and visual
str(data_cr)
'data.frame':   1000 obs. of  21 variables:
 $ Creditability                    : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Account.Balance                  : int  1 1 2 1 1 1 1 1 4 2 ...
 $ Duration.of.Credit..month.       : int  18 9 12 12 12 10 8 6 18 24 ...
 $ Payment.Status.of.Previous.Credit: int  4 4 2 4 4 4 4 4 4 2 ...
 $ Purpose                          : int  2 0 9 0 0 0 0 0 3 3 ...
 $ Credit.Amount                    : int  1049 2799 841 2122 2171 2241 3398 1361 1098 3758 ...
 $ Value.Savings.Stocks             : int  1 1 2 1 1 1 1 1 1 3 ...
 $ Length.of.current.employment     : int  2 3 4 3 3 2 4 2 1 1 ...
 $ Instalment.per.cent              : int  4 2 2 3 4 1 1 2 4 1 ...
 $ Sex...Marital.Status             : int  2 3 2 3 3 3 3 3 2 2 ...
 $ Guarantors                       : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Duration.in.Current.address      : int  4 2 4 2 4 3 4 4 4 4 ...
 $ Most.valuable.available.asset    : int  2 1 1 1 2 1 1 1 3 4 ...
 $ Age..years.                      : int  21 36 23 39 38 48 39 40 65 23 ...
 $ Concurrent.Credits               : int  3 3 3 3 1 3 3 3 3 3 ...
 $ Type.of.apartment                : int  1 1 1 1 2 1 2 2 2 1 ...
 $ No.of.Credits.at.this.Bank       : int  1 2 1 2 2 2 2 1 2 1 ...
 $ Occupation                       : int  3 3 2 2 2 2 2 2 1 1 ...
 $ No.of.dependents                 : int  1 2 1 2 1 2 1 2 1 1 ...
 $ Telephone                        : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Foreign.Worker                   : int  1 1 1 2 2 2 2 2 1 1 ...
# remove rows that contain NA values
data_cr = na.omit(data_cr)
# quick summary
summary(data_cr)
 Creditability Account.Balance Duration.of.Credit..month.
 Min.   :0.0   Min.   :1.000   Min.   : 4.0              
 1st Qu.:0.0   1st Qu.:1.000   1st Qu.:12.0              
 Median :1.0   Median :2.000   Median :18.0              
 Mean   :0.7   Mean   :2.577   Mean   :20.9              
 3rd Qu.:1.0   3rd Qu.:4.000   3rd Qu.:24.0              
 Max.   :1.0   Max.   :4.000   Max.   :72.0              
 Payment.Status.of.Previous.Credit    Purpose       Credit.Amount  
 Min.   :0.000                     Min.   : 0.000   Min.   :  250  
 1st Qu.:2.000                     1st Qu.: 1.000   1st Qu.: 1366  
 Median :2.000                     Median : 2.000   Median : 2320  
 Mean   :2.545                     Mean   : 2.828   Mean   : 3271  
 3rd Qu.:4.000                     3rd Qu.: 3.000   3rd Qu.: 3972  
 Max.   :4.000                     Max.   :10.000   Max.   :18424  
 Value.Savings.Stocks Length.of.current.employment Instalment.per.cent
 Min.   :1.000        Min.   :1.000                Min.   :1.000      
 1st Qu.:1.000        1st Qu.:3.000                1st Qu.:2.000      
 Median :1.000        Median :3.000                Median :3.000      
 Mean   :2.105        Mean   :3.384                Mean   :2.973      
 3rd Qu.:3.000        3rd Qu.:5.000                3rd Qu.:4.000      
 Max.   :5.000        Max.   :5.000                Max.   :4.000      
 Sex...Marital.Status   Guarantors    Duration.in.Current.address
 Min.   :1.000        Min.   :1.000   Min.   :1.000              
 1st Qu.:2.000        1st Qu.:1.000   1st Qu.:2.000              
 Median :3.000        Median :1.000   Median :3.000              
 Mean   :2.682        Mean   :1.145   Mean   :2.845              
 3rd Qu.:3.000        3rd Qu.:1.000   3rd Qu.:4.000              
 Max.   :4.000        Max.   :3.000   Max.   :4.000              
 Most.valuable.available.asset  Age..years.    Concurrent.Credits
 Min.   :1.000                 Min.   :19.00   Min.   :1.000     
 1st Qu.:1.000                 1st Qu.:27.00   1st Qu.:3.000     
 Median :2.000                 Median :33.00   Median :3.000     
 Mean   :2.358                 Mean   :35.54   Mean   :2.675     
 3rd Qu.:3.000                 3rd Qu.:42.00   3rd Qu.:3.000     
 Max.   :4.000                 Max.   :75.00   Max.   :3.000     
 Type.of.apartment No.of.Credits.at.this.Bank   Occupation    No.of.dependents
 Min.   :1.000     Min.   :1.000              Min.   :1.000   Min.   :1.000   
 1st Qu.:2.000     1st Qu.:1.000              1st Qu.:3.000   1st Qu.:1.000   
 Median :2.000     Median :1.000              Median :3.000   Median :1.000   
 Mean   :1.928     Mean   :1.407              Mean   :2.904   Mean   :1.155   
 3rd Qu.:2.000     3rd Qu.:2.000              3rd Qu.:3.000   3rd Qu.:1.000   
 Max.   :3.000     Max.   :4.000              Max.   :4.000   Max.   :2.000   
   Telephone     Foreign.Worker 
 Min.   :1.000   Min.   :1.000  
 1st Qu.:1.000   1st Qu.:1.000  
 Median :1.000   Median :1.000  
 Mean   :1.404   Mean   :1.037  
 3rd Qu.:2.000   3rd Qu.:1.000  
 Max.   :2.000   Max.   :2.000  
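Since `na.omit` silently drops every row that has at least one missing value, it can help to count the missing values per column and compare row counts before and after. The following is a minimal sketch on a small toy data frame (`toy` and its values are illustrative assumptions, not rows from `german_credit.csv`); the same calls should apply to `data_cr`.

```r
# Toy data frame standing in for data_cr (illustrative values only)
toy <- data.frame(
  Creditability = c(1, 0, 1, NA),
  Credit.Amount = c(1049, 2799, NA, 2122)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Compare the number of rows before and after na.omit()
rows_before <- nrow(toy)
toy_complete <- na.omit(toy)
rows_after <- nrow(toy_complete)
cat("rows dropped:", rows_before - rows_after, "\n")
```

If `colSums(is.na(data_cr))` is zero for every column, `na.omit` leaves the data unchanged.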
  • All columns were read in as integers, but most of them are categorical codes. Let’s convert those to factors and keep only the genuinely numeric columns as numeric
sapply(data_cr, class)
                    Creditability                   Account.Balance 
                        "integer"                         "integer" 
       Duration.of.Credit..month. Payment.Status.of.Previous.Credit 
                        "integer"                         "integer" 
                          Purpose                     Credit.Amount 
                        "integer"                         "integer" 
             Value.Savings.Stocks      Length.of.current.employment 
                        "integer"                         "integer" 
              Instalment.per.cent              Sex...Marital.Status 
                        "integer"                         "integer" 
                       Guarantors       Duration.in.Current.address 
                        "integer"                         "integer" 
    Most.valuable.available.asset                       Age..years. 
                        "integer"                         "integer" 
               Concurrent.Credits                 Type.of.apartment 
                        "integer"                         "integer" 
       No.of.Credits.at.this.Bank                        Occupation 
                        "integer"                         "integer" 
                 No.of.dependents                         Telephone 
                        "integer"                         "integer" 
                   Foreign.Worker 
                        "integer" 
# Keep Duration.of.Credit..month. (column 3) and Credit.Amount (column 6) as
# numeric and convert the rest to factors (this includes Age..years., column 14)
id = c(1, 2, 4, 5, 7:21)
data_cr[id] = lapply(data_cr[id], as.factor)
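Positional indices such as `c(1, 2, 4, 5, 7:21)` break silently if the column order changes. As a hedged alternative, the same conversion can be written by column name. The sketch below uses a small toy data frame (`toy`, `numeric_cols`, and `factor_cols` are names introduced here for illustration, and the values are not taken from the file), then checks the resulting classes.

```r
# Toy data frame with a few of the column names from data_cr
# (illustrative values only)
toy <- data.frame(
  Creditability = c(1L, 0L, 1L),
  Duration.of.Credit..month. = c(18L, 9L, 12L),
  Purpose = c(2L, 0L, 9L),
  Credit.Amount = c(1049L, 2799L, 841L)
)

# Columns to keep numeric; every other column becomes a factor
numeric_cols <- c("Duration.of.Credit..month.", "Credit.Amount")
factor_cols <- setdiff(names(toy), numeric_cols)

toy[factor_cols] <- lapply(toy[factor_cols], as.factor)

# Check the resulting class of each column
col_classes <- sapply(toy, class)
print(col_classes)
```

With this form, adding `"Age..years."` to `numeric_cols` would be enough to keep age numeric, should that be preferred.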