Objectives: The goal of this kernel is to analyze Home Credit default via a full data science framework.
In this version, I will only go through some basic EDA and look into the dataset. As I dive into the dataset, I will add more content to this kernel.
If you have any questions, please leave a comment, and if you like the kernel, please give me an upvote~ Thanks!
if (!require("pacman")) install.packages("pacman")
pacman::p_load(tidyverse, skimr, GGally, plotly, viridis, caret, DT, data.table)
train <- fread('application_train.csv', stringsAsFactors = FALSE, showProgress = FALSE,
               data.table = FALSE, na.strings = c("NA", "NaN", "?", ""))
test <- fread('application_test.csv', stringsAsFactors = FALSE, showProgress = FALSE,
              data.table = FALSE, na.strings = c("NA", "NaN", "?", ""))
Let’s take 1000 observations as a sample and have a very brief look at the data.
train[sample(1:nrow(train), size = 1000), ] %>%
  datatable(filter = 'top', options = list(
    pageLength = 15, autoWidth = TRUE
  ))
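One thing to keep in mind about the sampling step above: `sample()` draws different rows on every run, so the preview table changes each time the kernel is rendered. Here is a minimal sketch of how the draw could be made repeatable. The toy data frame `toy` and the seed value are assumptions for illustration only and are not part of the kernel.

```r
# Toy stand-in for `train` (hypothetical; the real data has 307511 rows).
toy <- data.frame(id = 1:100, x = (1:100) * 2)

# Fixing the seed makes sample() return the same row indices on every run.
set.seed(42)  # arbitrary seed, chosen only for illustration
idx_a <- sample(seq_len(nrow(toy)), size = 10)

set.seed(42)
idx_b <- sample(seq_len(nrow(toy)), size = 10)

# Same seed, same rows.
identical(idx_a, idx_b)

# Subset the rows the same way the kernel does.
sample_rows <- toy[idx_a, ]
dim(sample_rows)
```

In the kernel itself, this would amount to calling `set.seed()` once before the `sample()` call.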
skim gives you an overview of the dataset: the number of observations, the number of columns, the range of each variable, the number of missing and unique values, a histogram, and so on. It can serve as a one-stop tool for checking out a dataset.
As the output shows, there are 122 variables and 307511 observations in the training set.
train %>% skim() %>% kable()
## Skim summary statistics
## n obs: 307511
## n variables: 122
##
## Variable type: character
##
## variable missing complete n min max empty n_unique
## --------------------------- -------- --------- ------- ---- ---- ------ ---------
## CODE_GENDER 0 307511 307511 1 3 0 3
## EMERGENCYSTATE_MODE 145755 161756 307511 2 3 0 2
## FLAG_OWN_CAR 0 307511 307511 1 1 0 2
## FLAG_OWN_REALTY 0 307511 307511 1 1 0 2
## FONDKAPREMONT_MODE 210295 97216 307511 13 21 0 4
## HOUSETYPE_MODE 154297 153214 307511 14 16 0 3
## NAME_CONTRACT_TYPE 0 307511 307511 10 15 0 2
## NAME_EDUCATION_TYPE 0 307511 307511 15 29 0 5
## NAME_FAMILY_STATUS 0 307511 307511 5 20 0 6
## NAME_HOUSING_TYPE 0 307511 307511 12 19 0 6
## NAME_INCOME_TYPE 0 307511 307511 7 20 0 8
## NAME_TYPE_SUITE 1292 306219 307511 6 15 0 7
## OCCUPATION_TYPE 96391 211120 307511 7 21 0 18
## ORGANIZATION_TYPE 0 307511 307511 3 22 0 58
## WALLSMATERIAL_MODE 156341 151170 307511 5 12 0 7
## WEEKDAY_APPR_PROCESS_START 0 307511 307511 6 9 0 7
##
## Variable type: integer
##
## variable missing complete n mean sd p0 p25 p50 p75 p100 hist
## ---------------------------- -------- --------- ------- ---------- ---------- ------- --------- ------- --------- ------- ---------
## CNT_CHILDREN 0 307511 307511 0.42 0.72 0 0 0 1 19 ▇▁▁▁▁▁▁▁
## DAYS_BIRTH 0 307511 307511 -16037 4363.99 -25229 -19682 -15750 -12413 -7489 ▃▆▆▆▇▇▇▃
## DAYS_EMPLOYED 0 307511 307511 63815.05 141275.77 -17912 -2760 -1213 -289 365243 ▇▁▁▁▁▁▁▂
## DAYS_ID_PUBLISH 0 307511 307511 -2994.2 1509.45 -7197 -4299 -3254 -1720 0 ▁▁▅▇▅▅▅▃
## FLAG_CONT_MOBILE 0 307511 307511 1 0.043 0 1 1 1 1 ▁▁▁▁▁▁▁▇
## FLAG_DOCUMENT_10 0 307511 307511 2.3e-05 0.0048 0 0 0 0 1 ▇▁▁▁▁▁▁▁
## FLAG_DOCUMENT_11 0 307511 307511 0.0039 0.062 0 0 0 0 1 ▇▁▁▁▁▁▁▁
## FLAG_DOCUMENT_12 0 307511 307511 6.5e-06 0.0026 0 0 0 0 1 ▇▁▁▁▁▁▁▁
## FLAG_DOCUMENT_13 0 307511 307511 0.0035 0.059 0 0 0 0 1 ▇▁▁▁▁▁▁▁
## FLAG_DOCUMENT_14 0 307511 307511 0.0029 0.054 0 0 0 0 1 ▇▁▁▁▁▁▁▁
## FLAG_DOCUMENT_15 0 307511 307511 0.0012 0.035 0 0 0 0 1 ▇▁▁▁▁▁▁▁
## FLAG_DOCUMENT_16 0 307511 307511 0.0099 0.099 0 0 0 0 1 ▇▁▁▁▁▁▁▁
## FLAG_DOCUMENT_17 0 307511 307511 0.00027 0.016 0 0 0 0 1 ▇▁▁▁▁▁▁▁
## FLAG_DOCUMENT_18 0 307511 307511 0.0081 0.09 0 0 0 0 1 ▇▁▁▁▁▁▁▁
## FLAG_DOCUMENT_19 0 307511 307511 6e-04 0.024 0 0 0 0 1 ▇▁▁▁▁▁▁▁
## FLAG_DOCUMENT_2 0 307511 307511 4.2e-05 0.0065 0 0 0 0 1 ▇▁▁▁▁▁▁▁
## FLAG_DOCUMENT_20 0 307511 307511 0.00051 0.023 0 0 0 0 1 ▇▁▁▁▁▁▁▁
## FLAG_DOCUMENT_21 0 307511 307511 0.00033 0.018 0 0 0 0 1 ▇▁▁▁▁▁▁▁
## FLAG_DOCUMENT_3 0 307511 307511 0.71 0.45 0 0 1 1 1 ▃▁▁▁▁▁▁▇
## FLAG_DOCUMENT_4 0 307511 307511 8.1e-05 0.009 0 0 0 0 1 ▇▁▁▁▁▁▁▁
## FLAG_DOCUMENT_5 0 307511 307511 0.015 0.12 0 0 0 0 1 ▇▁▁▁▁▁▁▁
## FLAG_DOCUMENT_6 0 307511 307511 0.088 0.28 0 0 0 0 1 ▇▁▁▁▁▁▁▁
## FLAG_DOCUMENT_7 0 307511 307511 0.00019 0.014 0 0 0 0 1 ▇▁▁▁▁▁▁▁
## FLAG_DOCUMENT_8 0 307511 307511 0.081 0.27 0 0 0 0 1 ▇▁▁▁▁▁▁▁
## FLAG_DOCUMENT_9 0 307511 307511 0.0039 0.062 0 0 0 0 1 ▇▁▁▁▁▁▁▁
## FLAG_EMAIL 0 307511 307511 0.057 0.23 0 0 0 0 1 ▇▁▁▁▁▁▁▁
## FLAG_EMP_PHONE 0 307511 307511 0.82 0.38 0 1 1 1 1 ▂▁▁▁▁▁▁▇
## FLAG_MOBIL 0 307511 307511 1 0.0018 0 1 1 1 1 ▁▁▁▁▁▁▁▇
## FLAG_PHONE 0 307511 307511 0.28 0.45 0 0 0 1 1 ▇▁▁▁▁▁▁▃
## FLAG_WORK_PHONE 0 307511 307511 0.2 0.4 0 0 0 0 1 ▇▁▁▁▁▁▁▂
## HOUR_APPR_PROCESS_START 0 307511 307511 12.06 3.27 0 10 12 14 23 ▁▁▂▇▇▅▁▁
## LIVE_CITY_NOT_WORK_CITY 0 307511 307511 0.18 0.38 0 0 0 0 1 ▇▁▁▁▁▁▁▂
## LIVE_REGION_NOT_WORK_REGION 0 307511 307511 0.041 0.2 0 0 0 0 1 ▇▁▁▁▁▁▁▁
## REG_CITY_NOT_LIVE_CITY 0 307511 307511 0.078 0.27 0 0 0 0 1 ▇▁▁▁▁▁▁▁
## REG_CITY_NOT_WORK_CITY 0 307511 307511 0.23 0.42 0 0 0 0 1 ▇▁▁▁▁▁▁▂
## REG_REGION_NOT_LIVE_REGION 0 307511 307511 0.015 0.12 0 0 0 0 1 ▇▁▁▁▁▁▁▁
## REG_REGION_NOT_WORK_REGION 0 307511 307511 0.051 0.22 0 0 0 0 1 ▇▁▁▁▁▁▁▁
## REGION_RATING_CLIENT 0 307511 307511 2.05 0.51 1 2 2 2 3 ▁▁▁▇▁▁▁▂
## REGION_RATING_CLIENT_W_CITY 0 307511 307511 2.03 0.5 1 2 2 2 3 ▁▁▁▇▁▁▁▂
## SK_ID_CURR 0 307511 307511 278180.52 1e+05 1e+05 189145.5 278202 367142.5 456255 ▇▇▇▇▇▇▇▇
## TARGET 0 307511 307511 0.081 0.27 0 0 0 0 1 ▇▁▁▁▁▁▁▁
##
## Variable type: numeric
##
## variable missing complete n mean sd p0 p25 p50 p75 p100 hist
## ----------------------------- -------- --------- ------- ---------- ---------- -------- -------- ------- ------- --------- ---------
## AMT_ANNUITY 12 307499 307511 27108.57 14493.74 1615.5 16524 24903 34596 258025.5 ▇▃▁▁▁▁▁▁
## AMT_CREDIT 0 307511 307511 6e+05 4e+05 45000 270000 513531 808650 4e+06 ▇▅▂▁▁▁▁▁
## AMT_GOODS_PRICE 278 307233 307511 538396.21 369446.46 40500 238500 450000 679500 4e+06 ▇▃▁▁▁▁▁▁
## AMT_INCOME_TOTAL 0 307511 307511 168797.92 237123.15 25650 112500 147150 2e+05 1.2e+08 ▇▁▁▁▁▁▁▁
## AMT_REQ_CREDIT_BUREAU_DAY 41519 265992 307511 0.007 0.11 0 0 0 0 9 ▇▁▁▁▁▁▁▁
## AMT_REQ_CREDIT_BUREAU_HOUR 41519 265992 307511 0.0064 0.084 0 0 0 0 4 ▇▁▁▁▁▁▁▁
## AMT_REQ_CREDIT_BUREAU_MON 41519 265992 307511 0.27 0.92 0 0 0 0 27 ▇▁▁▁▁▁▁▁
## AMT_REQ_CREDIT_BUREAU_QRT 41519 265992 307511 0.27 0.79 0 0 0 0 261 ▇▁▁▁▁▁▁▁
## AMT_REQ_CREDIT_BUREAU_WEEK 41519 265992 307511 0.034 0.2 0 0 0 0 8 ▇▁▁▁▁▁▁▁
## AMT_REQ_CREDIT_BUREAU_YEAR 41519 265992 307511 1.9 1.87 0 0 1 3 25 ▇▂▁▁▁▁▁▁
## APARTMENTS_AVG 156061 151450 307511 0.12 0.11 0 0.058 0.088 0.15 1 ▇▂▁▁▁▁▁▁
## APARTMENTS_MEDI 156061 151450 307511 0.12 0.11 0 0.058 0.086 0.15 1 ▇▂▁▁▁▁▁▁
## APARTMENTS_MODE 156061 151450 307511 0.11 0.11 0 0.052 0.084 0.14 1 ▇▂▁▁▁▁▁▁
## BASEMENTAREA_AVG 179943 127568 307511 0.088 0.082 0 0.044 0.076 0.11 1 ▇▂▁▁▁▁▁▁
## BASEMENTAREA_MEDI 179943 127568 307511 0.088 0.082 0 0.044 0.076 0.11 1 ▇▂▁▁▁▁▁▁
## BASEMENTAREA_MODE 179943 127568 307511 0.088 0.084 0 0.041 0.075 0.11 1 ▇▂▁▁▁▁▁▁
## CNT_FAM_MEMBERS 2 307509 307511 2.15 0.91 1 2 2 3 20 ▇▁▁▁▁▁▁▁
## COMMONAREA_AVG 214865 92646 307511 0.045 0.076 0 0.0078 0.021 0.051 1 ▇▁▁▁▁▁▁▁
## COMMONAREA_MEDI 214865 92646 307511 0.045 0.076 0 0.0079 0.021 0.051 1 ▇▁▁▁▁▁▁▁
## COMMONAREA_MODE 214865 92646 307511 0.043 0.074 0 0.0072 0.019 0.049 1 ▇▁▁▁▁▁▁▁
## DAYS_LAST_PHONE_CHANGE 1 307510 307511 -962.86 826.81 -4292 -1570 -757 -274 0 ▁▁▁▂▃▃▅▇
## DAYS_REGISTRATION 0 307511 307511 -4986.12 3522.89 -24672 -7479.5 -4504 -2010 0 ▁▁▁▁▂▅▇▇
## DEF_30_CNT_SOCIAL_CIRCLE 1021 306490 307511 0.14 0.45 0 0 0 0 34 ▇▁▁▁▁▁▁▁
## DEF_60_CNT_SOCIAL_CIRCLE 1021 306490 307511 0.1 0.36 0 0 0 0 24 ▇▁▁▁▁▁▁▁
## ELEVATORS_AVG 163891 143620 307511 0.079 0.13 0 0 0 0.12 1 ▇▂▁▁▁▁▁▁
## ELEVATORS_MEDI 163891 143620 307511 0.078 0.13 0 0 0 0.12 1 ▇▂▁▁▁▁▁▁
## ELEVATORS_MODE 163891 143620 307511 0.074 0.13 0 0 0 0.12 1 ▇▂▁▁▁▁▁▁
## ENTRANCES_AVG 154828 152683 307511 0.15 0.1 0 0.069 0.14 0.21 1 ▇▇▂▁▁▁▁▁
## ENTRANCES_MEDI 154828 152683 307511 0.15 0.1 0 0.069 0.14 0.21 1 ▇▇▂▁▁▁▁▁
## ENTRANCES_MODE 154828 152683 307511 0.15 0.1 0 0.069 0.14 0.21 1 ▇▇▂▁▁▁▁▁
## EXT_SOURCE_1 173378 134133 307511 0.5 0.21 0.015 0.33 0.51 0.68 0.96 ▂▅▇▇▇▇▆▂
## EXT_SOURCE_2 660 306851 307511 0.51 0.19 8.2e-08 0.39 0.57 0.66 0.85 ▁▂▂▃▅▇▇▂
## EXT_SOURCE_3 60965 246546 307511 0.51 0.19 0.00053 0.37 0.54 0.67 0.9 ▁▂▅▆▇▇▇▂
## FLOORSMAX_AVG 153020 154491 307511 0.23 0.14 0 0.17 0.17 0.33 1 ▃▇▅▁▁▁▁▁
## FLOORSMAX_MEDI 153020 154491 307511 0.23 0.15 0 0.17 0.17 0.33 1 ▃▇▅▁▁▁▁▁
## FLOORSMAX_MODE 153020 154491 307511 0.22 0.14 0 0.17 0.17 0.33 1 ▃▇▅▁▁▁▁▁
## FLOORSMIN_AVG 208642 98869 307511 0.23 0.16 0 0.083 0.21 0.38 1 ▆▇▅▂▁▁▁▁
## FLOORSMIN_MEDI 208642 98869 307511 0.23 0.16 0 0.083 0.21 0.38 1 ▆▇▅▂▁▁▁▁
## FLOORSMIN_MODE 208642 98869 307511 0.23 0.16 0 0.083 0.21 0.38 1 ▆▇▅▁▁▁▁▁
## LANDAREA_AVG 182590 124921 307511 0.066 0.081 0 0.019 0.048 0.086 1 ▇▁▁▁▁▁▁▁
## LANDAREA_MEDI 182590 124921 307511 0.067 0.082 0 0.019 0.049 0.087 1 ▇▁▁▁▁▁▁▁
## LANDAREA_MODE 182590 124921 307511 0.065 0.082 0 0.017 0.046 0.084 1 ▇▁▁▁▁▁▁▁
## LIVINGAPARTMENTS_AVG 210199 97312 307511 0.1 0.093 0 0.05 0.076 0.12 1 ▇▂▁▁▁▁▁▁
## LIVINGAPARTMENTS_MEDI 210199 97312 307511 0.1 0.094 0 0.051 0.076 0.12 1 ▇▂▁▁▁▁▁▁
## LIVINGAPARTMENTS_MODE 210199 97312 307511 0.11 0.098 0 0.054 0.077 0.13 1 ▇▂▁▁▁▁▁▁
## LIVINGAREA_AVG 154350 153161 307511 0.11 0.11 0 0.045 0.074 0.13 1 ▇▂▁▁▁▁▁▁
## LIVINGAREA_MEDI 154350 153161 307511 0.11 0.11 0 0.046 0.075 0.13 1 ▇▂▁▁▁▁▁▁
## LIVINGAREA_MODE 154350 153161 307511 0.11 0.11 0 0.043 0.073 0.13 1 ▇▂▁▁▁▁▁▁
## NONLIVINGAPARTMENTS_AVG 213514 93997 307511 0.0088 0.048 0 0 0 0.0039 1 ▇▁▁▁▁▁▁▁
## NONLIVINGAPARTMENTS_MEDI 213514 93997 307511 0.0087 0.047 0 0 0 0.0039 1 ▇▁▁▁▁▁▁▁
## NONLIVINGAPARTMENTS_MODE 213514 93997 307511 0.0081 0.046 0 0 0 0.0039 1 ▇▁▁▁▁▁▁▁
## NONLIVINGAREA_AVG 169682 137829 307511 0.028 0.07 0 0 0.0036 0.028 1 ▇▁▁▁▁▁▁▁
## NONLIVINGAREA_MEDI 169682 137829 307511 0.028 0.07 0 0 0.0031 0.027 1 ▇▁▁▁▁▁▁▁
## NONLIVINGAREA_MODE 169682 137829 307511 0.027 0.07 0 0 0.0011 0.023 1 ▇▁▁▁▁▁▁▁
## OBS_30_CNT_SOCIAL_CIRCLE 1021 306490 307511 1.42 2.4 0 0 0 2 348 ▇▁▁▁▁▁▁▁
## OBS_60_CNT_SOCIAL_CIRCLE 1021 306490 307511 1.41 2.38 0 0 0 2 344 ▇▁▁▁▁▁▁▁
## OWN_CAR_AGE 202929 104582 307511 12.06 11.94 0 5 9 15 91 ▇▅▁▁▁▁▁▁
## REGION_POPULATION_RELATIVE 0 307511 307511 0.021 0.014 0.00029 0.01 0.019 0.029 0.073 ▆▆▇▅▁▁▁▁
## TOTALAREA_MODE 148431 159080 307511 0.1 0.11 0 0.041 0.069 0.13 1 ▇▂▁▁▁▁▁▁
## YEARS_BEGINEXPLUATATION_AVG 150007 157504 307511 0.98 0.059 0 0.98 0.98 0.99 1 ▁▁▁▁▁▁▁▇
## YEARS_BEGINEXPLUATATION_MEDI 150007 157504 307511 0.98 0.06 0 0.98 0.98 0.99 1 ▁▁▁▁▁▁▁▇
## YEARS_BEGINEXPLUATATION_MODE 150007 157504 307511 0.98 0.065 0 0.98 0.98 0.99 1 ▁▁▁▁▁▁▁▇
## YEARS_BUILD_AVG 204488 103023 307511 0.75 0.11 0 0.69 0.76 0.82 1 ▁▁▁▁▁▇▇▂
## YEARS_BUILD_MEDI 204488 103023 307511 0.76 0.11 0 0.69 0.76 0.83 1 ▁▁▁▁▁▇▇▂
## YEARS_BUILD_MODE 204488 103023 307511 0.76 0.11 0 0.7 0.76 0.82 1 ▁▁▁▁▁▇▇▂
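The `missing` column in the skim output above is easier to compare across variables as a percentage. For example, COMMONAREA_AVG is missing in 214865 of 307511 rows, roughly 70%. Below is a minimal base R sketch of that calculation. It runs on a made-up data frame called `toy`, which is an assumption for illustration only; the same two lines should work on `train`.

```r
# Toy data frame with some missing values (hypothetical, for illustration only).
toy <- data.frame(
  a = c(1, NA, 3, NA),
  b = c("x", "y", NA, "z"),
  c = c(10, 20, 30, 40),
  stringsAsFactors = FALSE
)

# Number of missing values per column, comparable to skim's `missing` column.
n_missing <- colSums(is.na(toy))

# Percentage of missing values per column, sorted from most to least missing.
pct_missing <- sort(round(100 * n_missing / nrow(toy), 1), decreasing = TRUE)
pct_missing
```

Sorting by the percentage puts the sparsest columns first, which helps when deciding which variables need imputation or may be dropped later.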
If you don’t know which variable you are predicting, you can use this code to find out.
setdiff(names(train), names(test))
## [1] "TARGET"
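The trick above works because the test set has the same columns as the training set minus the label. Here is a small self-contained sketch of the same idea on two made-up data frames (`toy_train` and `toy_test` are hypothetical names used only for this example), plus the reverse check, which I would expect to be empty when the test set has no extra columns.

```r
# Two toy data frames (hypothetical); only the training one carries the label.
toy_train <- data.frame(id = 1:3, income = c(10, 20, 30), TARGET = c(0, 1, 0))
toy_test  <- data.frame(id = 4:5, income = c(15, 25))

# Columns in the training set that the test set lacks: the variable to predict.
target_col <- setdiff(names(toy_train), names(toy_test))
target_col

# Reverse check: columns that appear only in the test set.
extra_in_test <- setdiff(names(toy_test), names(toy_train))
length(extra_in_test)
```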
TARGET is a binary variable and it is imbalanced: about 92% of its values are 0.
train %>% count(TARGET) %>% kable()
| TARGET | n |
|---|---|
| 0 | 282686 |
| 1 | 24825 |
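To put a number on the imbalance, here is a small sketch that uses only the two counts from the table above. The variable names are my own and the counts are typed in by hand, so it runs without the dataset.

```r
# Class counts as reported in the table above.
target_counts <- c(`0` = 282686, `1` = 24825)

# Share of each class, as a proportion of all observations.
target_share <- prop.table(target_counts)
round(100 * target_share, 2)

# How many observations of class 0 there are for each observation of class 1.
imbalance_ratio <- target_counts[["0"]] / target_counts[["1"]]
round(imbalance_ratio, 1)
```

With a split this uneven, plain accuracy is a poor yardstick, since always predicting 0 would already be right about 92% of the time.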
Then, we will have a look at the distribution of TARGET via Plotly.
train %>%
  count(TARGET) %>%
  plot_ly(labels = ~TARGET, values = ~n, type = 'pie') %>%
  layout(title = 'Target Variable Distribution',
         xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
         yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
train %>%
  count(CODE_GENDER) %>%
  plot_ly(labels = ~CODE_GENDER, values = ~n) %>%
  add_pie(hole = 0.6) %>%
  layout(title = "Gender Distribution", showlegend = FALSE,
         xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
         yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
train %>%
  count(NAME_EDUCATION_TYPE) %>%
  plot_ly(labels = ~NAME_EDUCATION_TYPE, values = ~n) %>%
  add_pie(hole = 0.6) %>%
  layout(title = "Education Distribution", showlegend = FALSE,
         xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
         yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
A lot of preprocessing still needs to be done, but for now let’s just separate the character and numeric variables.
chr <- train[,sapply(train,is.character)]
num <- train[,sapply(train,is.numeric)]
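The two lines above build a logical mask over the columns and use it to subset the data frame. Here is a minimal sketch of the same pattern on a made-up data frame (`toy` and its columns are hypothetical). It swaps in `vapply()`, which checks that every result is a single logical value, and adds `drop = FALSE` so the result stays a data frame even if only one column matches; both are my own variations, not what the kernel uses.

```r
# Toy data frame with one character and two numeric columns (hypothetical).
toy <- data.frame(
  gender = c("F", "M", "F"),
  income = c(100, 200, 150),
  kids   = c(0L, 2L, 1L),
  stringsAsFactors = FALSE
)

# Logical masks with one entry per column.
is_chr <- vapply(toy, is.character, logical(1))
is_num <- vapply(toy, is.numeric, logical(1))

# drop = FALSE keeps a data frame even when a single column is selected.
chr <- toy[, is_chr, drop = FALSE]
num <- toy[, is_num, drop = FALSE]

names(chr)
names(num)
```

Integer columns count as numeric here, which matches the skim output where the integer and numeric variables together make up the non-character columns.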
# graph <- list()
#
# for (i in 1:21){
#
# graph[[i]] <- num %>% na.omit() %>%
# select(TARGET,((i-1)*5+1):((i-1)*5+5)) %>%
# mutate(TARGET = factor(TARGET)) %>%
# ggpairs(aes(col = TARGET, alpha=.4))
#
# print(graph[[i]])
# }
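The commented-out loop above walks through the numeric columns five at a time using index arithmetic, with the number of blocks hard-coded as 21. A possible alternative, sketched below on plain column indices, is to let `split()` build the blocks so that the last, shorter block is not skipped. The values of `n_cols` and `chunk_size` are assumptions chosen only to keep the example small.

```r
# Hypothetical sizes, chosen only for illustration.
n_cols <- 12
chunk_size <- 5

# Group the column indices 1..n_cols into consecutive blocks of chunk_size.
chunks <- split(seq_len(n_cols), ceiling(seq_len(n_cols) / chunk_size))
lengths(chunks)

# The second block matches the index formula used in the loop for i = 2.
i <- 2
manual <- ((i - 1) * chunk_size + 1):((i - 1) * chunk_size + chunk_size)
all.equal(manual, chunks[[2]])
```

Each element of `chunks` could then be passed to `select()` in place of the hand-written range.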
To be Continued.
If you have any questions, please leave a comment, and if you like the kernel, please give it an upvote~ Thanks!