First of all, I want to give credit to Pranav Pandya, one of the best kernel writers on Kaggle. I learned a lot from his kernels, and here I am applying what I learned to this dataset.
I will try my best to make this kernel as easy to understand as possible. If you have any questions, please leave me a comment, and if you like the kernel, please give it an upvote~ Thank you so much and enjoy the show!
p_load() from the pacman package installs any (available) package that has not been installed yet, and then loads it.
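For example, the setup chunk can be as simple as the sketch below (the exact package list is my guess, based on the functions used later in this kernel):

# install pacman itself once, if needed: install.packages("pacman")
# p_load() installs each package if it is missing, then loads it
pacman::p_load(dplyr, skimr, caret, plotly, knitr) # add lightgbm for the modeling section

skim(iris) # produces the summary below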
## Skim summary statistics
## n obs: 150
## n variables: 5
##
## Variable type: factor
## variable missing complete n n_unique top_counts
## Species 0 150 150 3 set: 50, ver: 50, vir: 50, NA: 0
## ordered
## FALSE
##
## Variable type: numeric
## variable missing complete n mean sd p0 p25 p50 p75 p100
## Petal.Length 0 150 150 3.76 1.77 1 1.6 4.35 5.1 6.9
## Petal.Width 0 150 150 1.2 0.76 0.1 0.3 1.3 1.8 2.5
## Sepal.Length 0 150 150 5.84 0.83 4.3 5.1 5.8 6.4 7.9
## Sepal.Width 0 150 150 3.06 0.44 2 2.8 3 3.3 4.4
## hist
## ▇▁▁▂▅▅▃▁
## ▇▁▁▅▃▃▂▂
## ▂▇▅▇▆▅▂▂
## ▁▂▅▇▃▂▁▁
We have four different variables: Petal.Length, Petal.Width, Sepal.Length, and Sepal.Width. However, a 3D plot can only show three of them at a time, so let's look at all four to choose the best three.
Sepal.Width has the least variance (sd = 0.44 in the summary above), so I am going to use the other three variables.
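As a quick sanity check, a one-liner (a small sketch) confirms this from the data itself:

# variance of each numeric column, smallest first
sapply(iris[, 1:4], var) %>% sort()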
From the graphs below, we need to identify which variables can help us separate the species.
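Here is a minimal sketch of the kind of 3D scatter I mean, assuming the plotly package (my choice of variables and styling, not necessarily the original chart):

# 3D scatter of the three highest-variance variables, colored by species
plot_ly(iris,
        x = ~Petal.Length, y = ~Petal.Width, z = ~Sepal.Length,
        color = ~Species,
        type = "scatter3d", mode = "markers")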
# LightGBM expects zero-based integer labels, so recode Species to 0/1/2
iris$Species <- iris$Species %>% as.factor() %>% as.numeric() - 1

# 70/30 train/test split with caret (add set.seed() first for a reproducible split)
inTrain <- createDataPartition(iris$Species, p = .7, list = F)
train <- iris[inTrain, ]
test <- iris[-inTrain, ]
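A quick, optional check (a small sketch) that the split keeps all three classes in both sets:

# class counts in each split; counts vary from run to run without a seed
table(train$Species)
table(test$Species)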
# data_train <- lgb.Dataset(data = data.matrix(train[, 1:4]), label = train[, 5])
#
# # "auc" is a binary metric; multi_logloss matches the multiclass objective
# params <- list(objective = "multiclass",
#                metric = "multi_logloss",
#                num_class = 3,
#                min_data = 1,
#                learning_rate = .1)
#
# model <- lgb.train(params, data_train, 100)
#
# # predict() returns one probability per class per row, row by row;
# # reshape to an nrow(test) x 3 matrix and pick the most likely class
# result <- predict(model, data.matrix(test[, 1:4]))
# prob <- matrix(result, ncol = 3, byrow = TRUE)
# pred <- max.col(prob) - 1 # back to the 0/1/2 encoding of Species
#
# lgb.importance(model, percentage = TRUE) %>% kable()
#
# tree_imp <- lgb.importance(model, percentage = TRUE)
# lgb.plot.importance(tree_imp, measure = "Gain")
#
# pred_table <- table(pred, test$Species)
# confusionMatrix(pred_table)
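If the block above is run, a single headline accuracy number can also be read straight off the confusion table (a small sketch using the pred_table object from above):

# accuracy = proportion of test rows where the predicted class equals the true class
# sum(diag(pred_table)) / sum(pred_table)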
I hope you enjoyed the kernel, and don't forget to upvote~ Thanks a lot!