
First of all, I want to give credit to Pranav Pandya, one of the best kernel writers on Kaggle. I learned a lot from his kernel, and here I am applying what I learned to this dataset.

I will try my best to make this kernel as easy to understand as possible.

Load Packages & Dataset

p_load from the pacman package loads each package you list and, if a package has not been installed yet, installs it first (provided it is available).
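A minimal sketch of that pattern follows. The package list is an assumption inferred from functions used later in this kernel (`%>%`, `createDataPartition`, `kable`); the original call is not shown, and lightgbm is left out because the LightGBM code below is commented out.

```r
# Install pacman itself once if it is missing (CRAN mirror chosen for illustration).
if (!requireNamespace("pacman", quietly = TRUE)) {
  install.packages("pacman", repos = "https://cloud.r-project.org")
}

# Assumed package list: p_load installs any that are missing, then attaches them all.
pacman::p_load(dplyr, skimr, caret, knitr)

# iris ships with base R (datasets package), so no file needs to be read.
data(iris)
print(dim(iris))
```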

Exploratory Data Analysis

Data Table


## Skim summary statistics
##  n obs: 150 
##  n variables: 5 
## Variable type: factor 
##  variable missing complete   n n_unique                       top_counts ordered
##   Species       0      150 150        3 set: 50, ver: 50, vir: 50, NA: 0   FALSE
## Variable type: numeric 
##      variable missing complete   n mean   sd  p0 p25  p50 p75 p100     hist
##  Petal.Length       0      150 150 3.76 1.77 1   1.6 4.35 5.1  6.9 ▇▁▁▂▅▅▃▁
##   Petal.Width       0      150 150 1.2  0.76 0.1 0.3 1.3  1.8  2.5 ▇▁▁▅▃▃▂▂
##  Sepal.Length       0      150 150 5.84 0.83 4.3 5.1 5.8  6.4  7.9 ▂▇▅▇▆▅▂▂
##   Sepal.Width       0      150 150 3.06 0.44 2   2.8 3    3.3  4.4 ▁▂▅▇▃▂▁▁




We have four different variables: Petal Length, Petal Width, Sepal Length, and Sepal Width. A 3D plot, however, can display only three of them. To choose the best three, let’s have a look at all the variables.

Sepal Width has the least variance (standard deviation 0.44 in the summary above), so I am going to use the other three variables.
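The variance comparison can be checked directly in base R. This is only a small sketch on the built-in iris data, not code from the original kernel.

```r
# Sample variance of each of the four numeric columns of the built-in iris data.
variances <- sapply(iris[, 1:4], var)

# Sort ascending so the least variable measurement comes first.
print(round(sort(variances), 2))

# Name of the column with the smallest variance.
least_variable <- names(which.min(variances))
print(least_variable)
```

Low variance on its own does not prove that a variable separates the species poorly, which is why the per-variable distributions are still worth a look.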

Each Variable Distribution

From the graphs below, we need to identify which variable can help us separate the species.
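The original plotting code is not shown, so here is a hedged base R sketch of one way to compare each variable across species: a boxplot per measurement, plus the per-species range of Petal Length as a numeric companion. It assumes `Species` is still a factor at this point.

```r
measurements <- c("Sepal.Length", "Petal.Length", "Sepal.Width", "Petal.Width")

# One boxplot panel per measurement, grouped by species.
old_par <- par(mfrow = c(2, 2))
for (v in measurements) {
  boxplot(iris[[v]] ~ iris$Species, main = v, xlab = "Species", ylab = v)
}
par(old_par)

# Min (row 1) and max (row 2) of Petal.Length within each species.
petal_ranges <- sapply(split(iris$Petal.Length, iris$Species), range)
print(petal_ranges)
```

A variable whose per-species ranges barely overlap is a good candidate for separating the species.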

Sepal Length

Petal Length

Sepal Width

Petal Width

3D Interactive Plot
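A 3D scatter plot of the three chosen variables could be drawn as sketched below. The use of plotly is an assumption, since the original plotting code is not shown, and the marker size is arbitrary.

```r
library(plotly)

# Three measurements on the axes (Sepal.Width is left out), colored by species.
fig <- plot_ly(
  data = iris,
  x = ~Petal.Length, y = ~Petal.Width, z = ~Sepal.Length,
  color = ~Species,
  type = "scatter3d", mode = "markers",
  marker = list(size = 4)
)
fig
```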


# LightGBM expects multiclass labels to be integers starting at 0
iris$Species <- iris$Species %>% as.factor() %>% as.numeric() - 1
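To see what that conversion does, the sketch below applies the same arithmetic to a fresh copy of the species column and prints the resulting mapping. The names `species`, `codes`, and `label_map` are illustrative only.

```r
# Use a fresh copy so the original factor levels are visible.
species <- datasets::iris$Species
codes <- as.numeric(species) - 1   # factor codes 1, 2, 3 become 0, 1, 2

# Mapping from species name to zero-based class label.
label_map <- setNames(seq_along(levels(species)) - 1, levels(species))
print(label_map)

# Each label should still cover 50 observations.
print(table(codes))
```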


Train/Test Split

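# Species is numeric at this point; wrapping it in factor() makes createDataPartition
# sample within each class instead of within quantile bins of the numeric label
inTrain <- createDataPartition(factor(iris$Species), p = 0.7, list = FALSE)

train <- iris[inTrain, ]
test <- iris[-inTrain, ]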
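A quick way to confirm that a split like this keeps the species balanced is to tabulate the labels on both sides. The sketch below repeats the split on a fresh copy of the data; the seed and the `*_check` names are arbitrary choices for illustration.

```r
library(caret)

set.seed(1)  # arbitrary seed, only so this sketch is repeatable
labels <- factor(datasets::iris$Species)

# Stratified 70/30 split: sampling is done within each species.
idx <- createDataPartition(labels, p = 0.7, list = FALSE)
train_check <- datasets::iris[idx, ]
test_check  <- datasets::iris[-idx, ]

# Each species should appear in roughly a 70/30 ratio across the two sets.
print(table(train_check$Species))
print(table(test_check$Species))
```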

# data_train <- lgb.Dataset(data = data.matrix(train[, 1:4]), label = train[, 5])


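# "auc" applies to binary classification, so a multiclass metric is used here;
# min_data and learning_rate are passed through params
# params <- list(objective = "multiclass", metric = "multi_logloss", num_class = 3,
#                min_data = 1, learning_rate = .1)
# model <- lgb.train(params,
#                    data_train,
#                    nrounds = 100)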

# result <- predict(model, data.matrix(test[, 1:4]))

# result is assumed to be a flat vector holding 3 class probabilities per test row
# pred_list <- list()
# for (i in 1:nrow(test)){
#     row_max <- max(result[(i-1)*3+1], result[(i-1)*3+2], result[(i-1)*3+3])
#     pred_list[i] <- if_else(row_max == result[(i-1)*3+3], 3, if_else(row_max == result[(i-1)*3+2], 2, 1))
# }
# pred <- pred_list %>% as.numeric() - 1
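The loop above picks, for every test row, the class with the highest predicted probability. The same idea can be written without a loop, as sketched here on a made-up probability vector. The row-after-row layout is an assumption carried over from the indexing in the loop; some lightgbm versions may return a matrix instead, in which case the reshape step is not needed.

```r
# Toy stand-in for a flat prediction vector: 3 class probabilities per row,
# stored row after row (values are made up for illustration).
result_toy <- c(0.7, 0.2, 0.1,
                0.1, 0.3, 0.6,
                0.2, 0.5, 0.3)

# Reshape to one row per observation and one column per class.
prob <- matrix(result_toy, ncol = 3, byrow = TRUE)

# Column index of the largest probability, shifted to 0-based labels.
# ties.method = "last" prefers the higher class on ties, like the loop does.
pred_toy <- max.col(prob, ties.method = "last") - 1
print(pred_toy)
```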

LightGBM Feature Importance Plot

# lgb.importance(model, percentage = TRUE) %>% kable()
# tree_imp <- lgb.importance(model, percentage = TRUE)
# lgb.plot.importance(tree_imp, measure = "Gain")

LightGBM Confusion Matrix

# pred_table <- table(pred, test$Species)
# confusionMatrix(pred_table)
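Since the model code is commented out, no real predictions are available here. The base R sketch below uses made-up labels to show how a confusion table like `pred_table` is built and how accuracy follows from its diagonal; `truth_demo` and `pred_demo` are hypothetical stand-ins for `test$Species` and `pred`.

```r
# Made-up 0-based labels for illustration only.
truth_demo <- c(0, 0, 1, 1, 2, 2, 2)
pred_demo  <- c(0, 0, 1, 2, 2, 2, 1)

# Fix the levels so the table stays square even if a class is never predicted.
class_levels <- 0:2
conf_tab <- table(Predicted = factor(pred_demo, levels = class_levels),
                  Actual    = factor(truth_demo, levels = class_levels))
print(conf_tab)

# Accuracy is the share of observations on the diagonal.
accuracy <- sum(diag(conf_tab)) / sum(conf_tab)
print(round(accuracy, 3))
```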

