1 Introduction


Objectives: The goal of this kernel is to analyze home credit default via a full data science framework.

In this version, I will only go through some basic EDA and look into the dataset. As I dive into the dataset, I will add more content to this kernel.

If you have any questions, please leave a comment, and if you like the kernel, please give me an upvote~ Thanks!


2 Basic Set up



2.1 Load Packages


if (!require("pacman")) install.packages("pacman")
pacman::p_load(tidyverse, skimr, GGally, plotly, viridis, caret, DT, data.table)

2.2 Load Dataset


train <-fread('application_train.csv', stringsAsFactors = FALSE, showProgress=F,
              data.table = F, na.strings=c("NA","NaN","?", ""))
test <-fread('application_test.csv', stringsAsFactors = FALSE, showProgress=F,
             data.table = F, na.strings=c("NA","NaN","?", ""))
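
The na.strings argument above decides which raw tokens are read as missing values. As a minimal sketch of that effect, the snippet below parses a tiny made-up CSV with the same token list. It uses base read.csv instead of fread only to stay dependency-free, and the column names and values are invented for illustration.

```r
# Toy CSV text standing in for application_train.csv (all values are made up).
csv_text <- "id,income,job
1,100,?
2,NaN,clerk
3,,"

# Same missing-value tokens as in the fread() call, applied with base read.csv.
toy <- read.csv(text = csv_text, stringsAsFactors = FALSE,
                na.strings = c("NA", "NaN", "?", ""))

# Count the values that were parsed as NA in each column.
na_counts <- colSums(is.na(toy))
print(na_counts)
```

Checking the NA counts right after loading is a cheap way to confirm that placeholder tokens were not silently kept as ordinary strings.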

3 Glimpse of Data



3.1 First Glimpse via DT


Let’s take a random sample of 1000 observations and have a very brief look at the data.

train[sample(1:nrow(train), size = 1000),] %>% 
  datatable(filter = 'top', options = list(
    pageLength = 15, autoWidth = T
  ))
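
Because the rows are drawn at random, the table shown will differ from run to run. If a repeatable sample is wanted, one option is to fix the seed first. The sketch below uses a small made-up data frame in place of train; the seed value and the names toy, n_show and toy_sample are arbitrary choices for this illustration.

```r
set.seed(42)  # arbitrary seed, chosen only for this illustration

# Toy stand-in for `train`; the real table is far larger.
toy <- data.frame(id = 1:50, x = rnorm(50))

# Cap the sample size so the code also works when fewer rows exist than requested.
n_show <- min(10, nrow(toy))

# Draw row indices without replacement, then subset the rows.
idx <- sample(seq_len(nrow(toy)), size = n_show)
toy_sample <- toy[idx, ]
```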

3.2 Second Glimpse via skim


skim gives you an overview of the dataset: the number of observations, the number of columns, the range of each variable, the number of missing and unique values, a small histogram, etc. It can serve as a one-stop tool for a first check of the data.

As it shows here, there are 122 variables and 307511 observations in the training set.

train %>% skim() %>% kable()
## Skim summary statistics  
##  n obs: 307511    
##  n variables: 122    
## 
## Variable type: character
## 
## variable                     missing   complete   n        min   max   empty   n_unique 
## ---------------------------  --------  ---------  -------  ----  ----  ------  ---------
## CODE_GENDER                  0         307511     307511   1     3     0       3        
## EMERGENCYSTATE_MODE          145755    161756     307511   2     3     0       2        
## FLAG_OWN_CAR                 0         307511     307511   1     1     0       2        
## FLAG_OWN_REALTY              0         307511     307511   1     1     0       2        
## FONDKAPREMONT_MODE           210295    97216      307511   13    21    0       4        
## HOUSETYPE_MODE               154297    153214     307511   14    16    0       3        
## NAME_CONTRACT_TYPE           0         307511     307511   10    15    0       2        
## NAME_EDUCATION_TYPE          0         307511     307511   15    29    0       5        
## NAME_FAMILY_STATUS           0         307511     307511   5     20    0       6        
## NAME_HOUSING_TYPE            0         307511     307511   12    19    0       6        
## NAME_INCOME_TYPE             0         307511     307511   7     20    0       8        
## NAME_TYPE_SUITE              1292      306219     307511   6     15    0       7        
## OCCUPATION_TYPE              96391     211120     307511   7     21    0       18       
## ORGANIZATION_TYPE            0         307511     307511   3     22    0       58       
## WALLSMATERIAL_MODE           156341    151170     307511   5     12    0       7        
## WEEKDAY_APPR_PROCESS_START   0         307511     307511   6     9     0       7        
## 
## Variable type: integer
## 
## variable                      missing   complete   n        mean        sd          p0       p25        p50      p75        p100     hist     
## ----------------------------  --------  ---------  -------  ----------  ----------  -------  ---------  -------  ---------  -------  ---------
## CNT_CHILDREN                  0         307511     307511   0.42        0.72        0        0          0        1          19       ▇▁▁▁▁▁▁▁
## DAYS_BIRTH                    0         307511     307511   -16037      4363.99     -25229   -19682     -15750   -12413     -7489    ▃▆▆▆▇▇▇▃
## DAYS_EMPLOYED                 0         307511     307511   63815.05    141275.77   -17912   -2760      -1213    -289       365243   ▇▁▁▁▁▁▁▂
## DAYS_ID_PUBLISH               0         307511     307511   -2994.2     1509.45     -7197    -4299      -3254    -1720      0        ▁▁▅▇▅▅▅▃
## FLAG_CONT_MOBILE              0         307511     307511   1           0.043       0        1          1        1          1        ▁▁▁▁▁▁▁▇
## FLAG_DOCUMENT_10              0         307511     307511   2.3e-05     0.0048      0        0          0        0          1        ▇▁▁▁▁▁▁▁
## FLAG_DOCUMENT_11              0         307511     307511   0.0039      0.062       0        0          0        0          1        ▇▁▁▁▁▁▁▁
## FLAG_DOCUMENT_12              0         307511     307511   6.5e-06     0.0026      0        0          0        0          1        ▇▁▁▁▁▁▁▁
## FLAG_DOCUMENT_13              0         307511     307511   0.0035      0.059       0        0          0        0          1        ▇▁▁▁▁▁▁▁
## FLAG_DOCUMENT_14              0         307511     307511   0.0029      0.054       0        0          0        0          1        ▇▁▁▁▁▁▁▁
## FLAG_DOCUMENT_15              0         307511     307511   0.0012      0.035       0        0          0        0          1        ▇▁▁▁▁▁▁▁
## FLAG_DOCUMENT_16              0         307511     307511   0.0099      0.099       0        0          0        0          1        ▇▁▁▁▁▁▁▁
## FLAG_DOCUMENT_17              0         307511     307511   0.00027     0.016       0        0          0        0          1        ▇▁▁▁▁▁▁▁
## FLAG_DOCUMENT_18              0         307511     307511   0.0081      0.09        0        0          0        0          1        ▇▁▁▁▁▁▁▁
## FLAG_DOCUMENT_19              0         307511     307511   6e-04       0.024       0        0          0        0          1        ▇▁▁▁▁▁▁▁
## FLAG_DOCUMENT_2               0         307511     307511   4.2e-05     0.0065      0        0          0        0          1        ▇▁▁▁▁▁▁▁
## FLAG_DOCUMENT_20              0         307511     307511   0.00051     0.023       0        0          0        0          1        ▇▁▁▁▁▁▁▁
## FLAG_DOCUMENT_21              0         307511     307511   0.00033     0.018       0        0          0        0          1        ▇▁▁▁▁▁▁▁
## FLAG_DOCUMENT_3               0         307511     307511   0.71        0.45        0        0          1        1          1        ▃▁▁▁▁▁▁▇
## FLAG_DOCUMENT_4               0         307511     307511   8.1e-05     0.009       0        0          0        0          1        ▇▁▁▁▁▁▁▁
## FLAG_DOCUMENT_5               0         307511     307511   0.015       0.12        0        0          0        0          1        ▇▁▁▁▁▁▁▁
## FLAG_DOCUMENT_6               0         307511     307511   0.088       0.28        0        0          0        0          1        ▇▁▁▁▁▁▁▁
## FLAG_DOCUMENT_7               0         307511     307511   0.00019     0.014       0        0          0        0          1        ▇▁▁▁▁▁▁▁
## FLAG_DOCUMENT_8               0         307511     307511   0.081       0.27        0        0          0        0          1        ▇▁▁▁▁▁▁▁
## FLAG_DOCUMENT_9               0         307511     307511   0.0039      0.062       0        0          0        0          1        ▇▁▁▁▁▁▁▁
## FLAG_EMAIL                    0         307511     307511   0.057       0.23        0        0          0        0          1        ▇▁▁▁▁▁▁▁
## FLAG_EMP_PHONE                0         307511     307511   0.82        0.38        0        1          1        1          1        ▂▁▁▁▁▁▁▇
## FLAG_MOBIL                    0         307511     307511   1           0.0018      0        1          1        1          1        ▁▁▁▁▁▁▁▇
## FLAG_PHONE                    0         307511     307511   0.28        0.45        0        0          0        1          1        ▇▁▁▁▁▁▁▃
## FLAG_WORK_PHONE               0         307511     307511   0.2         0.4         0        0          0        0          1        ▇▁▁▁▁▁▁▂
## HOUR_APPR_PROCESS_START       0         307511     307511   12.06       3.27        0        10         12       14         23       ▁▁▂▇▇▅▁▁
## LIVE_CITY_NOT_WORK_CITY       0         307511     307511   0.18        0.38        0        0          0        0          1        ▇▁▁▁▁▁▁▂
## LIVE_REGION_NOT_WORK_REGION   0         307511     307511   0.041       0.2         0        0          0        0          1        ▇▁▁▁▁▁▁▁
## REG_CITY_NOT_LIVE_CITY        0         307511     307511   0.078       0.27        0        0          0        0          1        ▇▁▁▁▁▁▁▁
## REG_CITY_NOT_WORK_CITY        0         307511     307511   0.23        0.42        0        0          0        0          1        ▇▁▁▁▁▁▁▂
## REG_REGION_NOT_LIVE_REGION    0         307511     307511   0.015       0.12        0        0          0        0          1        ▇▁▁▁▁▁▁▁
## REG_REGION_NOT_WORK_REGION    0         307511     307511   0.051       0.22        0        0          0        0          1        ▇▁▁▁▁▁▁▁
## REGION_RATING_CLIENT          0         307511     307511   2.05        0.51        1        2          2        2          3        ▁▁▁▇▁▁▁▂
## REGION_RATING_CLIENT_W_CITY   0         307511     307511   2.03        0.5         1        2          2        2          3        ▁▁▁▇▁▁▁▂
## SK_ID_CURR                    0         307511     307511   278180.52   1e+05       1e+05    189145.5   278202   367142.5   456255   ▇▇▇▇▇▇▇▇
## TARGET                        0         307511     307511   0.081       0.27        0        0          0        0          1        ▇▁▁▁▁▁▁▁
## 
## Variable type: numeric
## 
## variable                       missing   complete   n        mean        sd          p0        p25       p50      p75      p100       hist     
## -----------------------------  --------  ---------  -------  ----------  ----------  --------  --------  -------  -------  ---------  ---------
## AMT_ANNUITY                    12        307499     307511   27108.57    14493.74    1615.5    16524     24903    34596    258025.5   ▇▃▁▁▁▁▁▁
## AMT_CREDIT                     0         307511     307511   6e+05       4e+05       45000     270000    513531   808650   4e+06      ▇▅▂▁▁▁▁▁
## AMT_GOODS_PRICE                278       307233     307511   538396.21   369446.46   40500     238500    450000   679500   4e+06      ▇▃▁▁▁▁▁▁
## AMT_INCOME_TOTAL               0         307511     307511   168797.92   237123.15   25650     112500    147150   2e+05    1.2e+08    ▇▁▁▁▁▁▁▁
## AMT_REQ_CREDIT_BUREAU_DAY      41519     265992     307511   0.007       0.11        0         0         0        0        9          ▇▁▁▁▁▁▁▁
## AMT_REQ_CREDIT_BUREAU_HOUR     41519     265992     307511   0.0064      0.084       0         0         0        0        4          ▇▁▁▁▁▁▁▁
## AMT_REQ_CREDIT_BUREAU_MON      41519     265992     307511   0.27        0.92        0         0         0        0        27         ▇▁▁▁▁▁▁▁
## AMT_REQ_CREDIT_BUREAU_QRT      41519     265992     307511   0.27        0.79        0         0         0        0        261        ▇▁▁▁▁▁▁▁
## AMT_REQ_CREDIT_BUREAU_WEEK     41519     265992     307511   0.034       0.2         0         0         0        0        8          ▇▁▁▁▁▁▁▁
## AMT_REQ_CREDIT_BUREAU_YEAR     41519     265992     307511   1.9         1.87        0         0         1        3        25         ▇▂▁▁▁▁▁▁
## APARTMENTS_AVG                 156061    151450     307511   0.12        0.11        0         0.058     0.088    0.15     1          ▇▂▁▁▁▁▁▁
## APARTMENTS_MEDI                156061    151450     307511   0.12        0.11        0         0.058     0.086    0.15     1          ▇▂▁▁▁▁▁▁
## APARTMENTS_MODE                156061    151450     307511   0.11        0.11        0         0.052     0.084    0.14     1          ▇▂▁▁▁▁▁▁
## BASEMENTAREA_AVG               179943    127568     307511   0.088       0.082       0         0.044     0.076    0.11     1          ▇▂▁▁▁▁▁▁
## BASEMENTAREA_MEDI              179943    127568     307511   0.088       0.082       0         0.044     0.076    0.11     1          ▇▂▁▁▁▁▁▁
## BASEMENTAREA_MODE              179943    127568     307511   0.088       0.084       0         0.041     0.075    0.11     1          ▇▂▁▁▁▁▁▁
## CNT_FAM_MEMBERS                2         307509     307511   2.15        0.91        1         2         2        3        20         ▇▁▁▁▁▁▁▁
## COMMONAREA_AVG                 214865    92646      307511   0.045       0.076       0         0.0078    0.021    0.051    1          ▇▁▁▁▁▁▁▁
## COMMONAREA_MEDI                214865    92646      307511   0.045       0.076       0         0.0079    0.021    0.051    1          ▇▁▁▁▁▁▁▁
## COMMONAREA_MODE                214865    92646      307511   0.043       0.074       0         0.0072    0.019    0.049    1          ▇▁▁▁▁▁▁▁
## DAYS_LAST_PHONE_CHANGE         1         307510     307511   -962.86     826.81      -4292     -1570     -757     -274     0          ▁▁▁▂▃▃▅▇
## DAYS_REGISTRATION              0         307511     307511   -4986.12    3522.89     -24672    -7479.5   -4504    -2010    0          ▁▁▁▁▂▅▇▇
## DEF_30_CNT_SOCIAL_CIRCLE       1021      306490     307511   0.14        0.45        0         0         0        0        34         ▇▁▁▁▁▁▁▁
## DEF_60_CNT_SOCIAL_CIRCLE       1021      306490     307511   0.1         0.36        0         0         0        0        24         ▇▁▁▁▁▁▁▁
## ELEVATORS_AVG                  163891    143620     307511   0.079       0.13        0         0         0        0.12     1          ▇▂▁▁▁▁▁▁
## ELEVATORS_MEDI                 163891    143620     307511   0.078       0.13        0         0         0        0.12     1          ▇▂▁▁▁▁▁▁
## ELEVATORS_MODE                 163891    143620     307511   0.074       0.13        0         0         0        0.12     1          ▇▂▁▁▁▁▁▁
## ENTRANCES_AVG                  154828    152683     307511   0.15        0.1         0         0.069     0.14     0.21     1          ▇▇▂▁▁▁▁▁
## ENTRANCES_MEDI                 154828    152683     307511   0.15        0.1         0         0.069     0.14     0.21     1          ▇▇▂▁▁▁▁▁
## ENTRANCES_MODE                 154828    152683     307511   0.15        0.1         0         0.069     0.14     0.21     1          ▇▇▂▁▁▁▁▁
## EXT_SOURCE_1                   173378    134133     307511   0.5         0.21        0.015     0.33      0.51     0.68     0.96       ▂▅▇▇▇▇▆▂
## EXT_SOURCE_2                   660       306851     307511   0.51        0.19        8.2e-08   0.39      0.57     0.66     0.85       ▁▂▂▃▅▇▇▂
## EXT_SOURCE_3                   60965     246546     307511   0.51        0.19        0.00053   0.37      0.54     0.67     0.9        ▁▂▅▆▇▇▇▂
## FLOORSMAX_AVG                  153020    154491     307511   0.23        0.14        0         0.17      0.17     0.33     1          ▃▇▅▁▁▁▁▁
## FLOORSMAX_MEDI                 153020    154491     307511   0.23        0.15        0         0.17      0.17     0.33     1          ▃▇▅▁▁▁▁▁
## FLOORSMAX_MODE                 153020    154491     307511   0.22        0.14        0         0.17      0.17     0.33     1          ▃▇▅▁▁▁▁▁
## FLOORSMIN_AVG                  208642    98869      307511   0.23        0.16        0         0.083     0.21     0.38     1          ▆▇▅▂▁▁▁▁
## FLOORSMIN_MEDI                 208642    98869      307511   0.23        0.16        0         0.083     0.21     0.38     1          ▆▇▅▂▁▁▁▁
## FLOORSMIN_MODE                 208642    98869      307511   0.23        0.16        0         0.083     0.21     0.38     1          ▆▇▅▁▁▁▁▁
## LANDAREA_AVG                   182590    124921     307511   0.066       0.081       0         0.019     0.048    0.086    1          ▇▁▁▁▁▁▁▁
## LANDAREA_MEDI                  182590    124921     307511   0.067       0.082       0         0.019     0.049    0.087    1          ▇▁▁▁▁▁▁▁
## LANDAREA_MODE                  182590    124921     307511   0.065       0.082       0         0.017     0.046    0.084    1          ▇▁▁▁▁▁▁▁
## LIVINGAPARTMENTS_AVG           210199    97312      307511   0.1         0.093       0         0.05      0.076    0.12     1          ▇▂▁▁▁▁▁▁
## LIVINGAPARTMENTS_MEDI          210199    97312      307511   0.1         0.094       0         0.051     0.076    0.12     1          ▇▂▁▁▁▁▁▁
## LIVINGAPARTMENTS_MODE          210199    97312      307511   0.11        0.098       0         0.054     0.077    0.13     1          ▇▂▁▁▁▁▁▁
## LIVINGAREA_AVG                 154350    153161     307511   0.11        0.11        0         0.045     0.074    0.13     1          ▇▂▁▁▁▁▁▁
## LIVINGAREA_MEDI                154350    153161     307511   0.11        0.11        0         0.046     0.075    0.13     1          ▇▂▁▁▁▁▁▁
## LIVINGAREA_MODE                154350    153161     307511   0.11        0.11        0         0.043     0.073    0.13     1          ▇▂▁▁▁▁▁▁
## NONLIVINGAPARTMENTS_AVG        213514    93997      307511   0.0088      0.048       0         0         0        0.0039   1          ▇▁▁▁▁▁▁▁
## NONLIVINGAPARTMENTS_MEDI       213514    93997      307511   0.0087      0.047       0         0         0        0.0039   1          ▇▁▁▁▁▁▁▁
## NONLIVINGAPARTMENTS_MODE       213514    93997      307511   0.0081      0.046       0         0         0        0.0039   1          ▇▁▁▁▁▁▁▁
## NONLIVINGAREA_AVG              169682    137829     307511   0.028       0.07        0         0         0.0036   0.028    1          ▇▁▁▁▁▁▁▁
## NONLIVINGAREA_MEDI             169682    137829     307511   0.028       0.07        0         0         0.0031   0.027    1          ▇▁▁▁▁▁▁▁
## NONLIVINGAREA_MODE             169682    137829     307511   0.027       0.07        0         0         0.0011   0.023    1          ▇▁▁▁▁▁▁▁
## OBS_30_CNT_SOCIAL_CIRCLE       1021      306490     307511   1.42        2.4         0         0         0        2        348        ▇▁▁▁▁▁▁▁
## OBS_60_CNT_SOCIAL_CIRCLE       1021      306490     307511   1.41        2.38        0         0         0        2        344        ▇▁▁▁▁▁▁▁
## OWN_CAR_AGE                    202929    104582     307511   12.06       11.94       0         5         9        15       91         ▇▅▁▁▁▁▁▁
## REGION_POPULATION_RELATIVE     0         307511     307511   0.021       0.014       0.00029   0.01      0.019    0.029    0.073      ▆▆▇▅▁▁▁▁
## TOTALAREA_MODE                 148431    159080     307511   0.1         0.11        0         0.041     0.069    0.13     1          ▇▂▁▁▁▁▁▁
## YEARS_BEGINEXPLUATATION_AVG    150007    157504     307511   0.98        0.059       0         0.98      0.98     0.99     1          ▁▁▁▁▁▁▁▇
## YEARS_BEGINEXPLUATATION_MEDI   150007    157504     307511   0.98        0.06        0         0.98      0.98     0.99     1          ▁▁▁▁▁▁▁▇
## YEARS_BEGINEXPLUATATION_MODE   150007    157504     307511   0.98        0.065       0         0.98      0.98     0.99     1          ▁▁▁▁▁▁▁▇
## YEARS_BUILD_AVG                204488    103023     307511   0.75        0.11        0         0.69      0.76     0.82     1          ▁▁▁▁▁▇▇▂
## YEARS_BUILD_MEDI               204488    103023     307511   0.76        0.11        0         0.69      0.76     0.83     1          ▁▁▁▁▁▇▇▂
## YEARS_BUILD_MODE               204488    103023     307511   0.76        0.11        0         0.7       0.76     0.82     1          ▁▁▁▁▁▇▇▂
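
The missing column in the summary shows that many of the housing variables are mostly empty; COMMONAREA_AVG, for example, is missing in 214865 of 307511 rows, roughly 70%. As a minimal sketch, the same share can be computed for every column with base R. The data frame below is a made-up stand-in for train, and the 0.5 cut-off is an arbitrary value for illustration, not a recommendation.

```r
# Toy stand-in for `train` with a few missing values (all values are made up).
toy <- data.frame(
  a = c(1, NA, 3, NA),
  b = c("x", "y", NA, "z"),
  c = 1:4,
  stringsAsFactors = FALSE
)

# Share of missing values per column, sorted from most to least missing.
missing_share <- sort(colMeans(is.na(toy)), decreasing = TRUE)
print(missing_share)

# Columns whose missing share reaches an arbitrary threshold of 50%.
mostly_missing <- names(missing_share)[missing_share >= 0.5]
```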

4 Variables

4.1 TARGET

If you don’t know which variable you are predicting, you can use this code to find out.

setdiff(names(train), names(test))
## [1] "TARGET"

TARGET is a binary variable and it is imbalanced: about 92% of the observations have the value 0 and only about 8% have the value 1.

train %>% count(TARGET) %>% kable()
TARGET   n
-------  -------
0        282686
1        24825
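
The degree of imbalance can be worked out directly from the two counts above. The snippet below is a minimal sketch in base R; the names target_counts, target_share and imbalance_ratio are introduced here for illustration only.

```r
# Class counts copied from the table above.
target_counts <- c(`0` = 282686, `1` = 24825)

# Share of each class in the training set, as a fraction of all rows.
target_share <- prop.table(target_counts)
print(round(100 * target_share, 1))

# Number of non-default cases (0) per default case (1).
imbalance_ratio <- target_counts[["0"]] / target_counts[["1"]]
print(round(imbalance_ratio, 1))
```

With a minority class this small, plain accuracy is a weak yardstick, since always predicting 0 would already be right for most rows.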

Next, we will look at the distribution of TARGET with Plotly.

train %>% 
  count(TARGET) %>% 
  plot_ly(labels = ~TARGET, values = ~n, type = 'pie') %>%
  layout(title = 'Target Variable Distribution',
         xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
         yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))

4.2 Gender

train %>% 
  count(CODE_GENDER) %>% 
  plot_ly(labels = ~CODE_GENDER , values = ~n) %>%
  add_pie(hole = 0.6) %>%
  layout(title = "Gender Distribution",  showlegend = F,
         xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
         yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))

4.3 Education

train %>% 
  count(NAME_EDUCATION_TYPE) %>% 
  plot_ly(labels = ~NAME_EDUCATION_TYPE , values = ~n) %>%
  add_pie(hole = 0.6) %>%
  layout(title = "Education Distribution",  showlegend = F,
         xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
         yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))

5 Preprocess

There is a lot of preprocessing still to be done, but for now let’s just separate the character and numeric variables.

chr <- train[,sapply(train,is.character)]
num <- train[,sapply(train,is.numeric)]
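
One caveat with this pattern: when only a single column matches, subsetting a data frame with [ , ] returns a plain vector unless drop = FALSE is given. The sketch below shows a slightly more defensive variant on a made-up data frame; vapply is used in place of sapply so the result is always a logical vector, and all names here are illustrative.

```r
# Toy stand-in for `train` (all values are made up).
toy <- data.frame(
  gender = c("F", "M", "F"),
  income = c(100, 200, 150),
  children = c(0L, 1L, 2L),
  stringsAsFactors = FALSE
)

# Type flags per column; vapply guarantees one logical value per column.
is_chr <- vapply(toy, is.character, logical(1))
is_num <- vapply(toy, is.numeric, logical(1))

# drop = FALSE keeps the result a data frame even if only one column matches.
chr_toy <- toy[, is_chr, drop = FALSE]
num_toy <- toy[, is_num, drop = FALSE]
```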

6 GGally

# graph <- list()
# 
# for (i in 1:21){
#   
# graph[[i]] <- num %>% na.omit() %>% 
#     select(TARGET,((i-1)*5+1):((i-1)*5+5)) %>% 
#     mutate(TARGET = factor(TARGET)) %>% 
#     ggpairs(aes(col = TARGET, alpha=.4))
#   
# print(graph[[i]])
# }
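
The commented-out loop above walks through the numeric columns five at a time with hard-coded index arithmetic. As a rough sketch of the same idea, the groups can be derived from the column names instead, so the last, shorter group is handled automatically. chunk_columns is a hypothetical helper and the column names are made up.

```r
# Hypothetical helper: split a vector of column names into groups of `size`.
chunk_columns <- function(cols, size = 5) {
  split(cols, ceiling(seq_along(cols) / size))
}

# Made-up column names standing in for the numeric feature columns.
feature_cols <- paste0("V", 1:12)

# Groups of at most five columns; the last group holds the remainder.
chunks <- chunk_columns(feature_cols, size = 5)
print(lengths(chunks))
```

Each element of chunks could then be passed to select() together with TARGET before calling ggpairs().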

7 Conclusion


To be continued.

If you have any questions, please leave a comment, and if you like the kernel, please give it an upvote~ Thanks!


8 References


Home Credit Default Risk : EDA