First of all, I want to give credit to Pranav Pandya, one of the best kernel writers on Kaggle. I learned a lot from his kernels, and here I am applying what I learned to this dataset.
I will try my best to make this kernel as easy to understand as possible. If you have any questions, please leave me a comment, and if you like the kernel, please give it an upvote~ Thank you so much and enjoy the show!
p_load() from the pacman package installs any (available) package that has not been installed yet, and then loads it.
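For example, the setup chunk can be as simple as the sketch below (the exact package list is my guess, based on the functions used later in this kernel):

# install pacman itself once, if needed: install.packages("pacman")
# p_load() installs each package if it is missing, then loads it
pacman::p_load(dplyr, skimr, caret, plotly, knitr) # add lightgbm for the modeling section

skim(iris) # produces the summary below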
## Skim summary statistics
## n obs: 150
## n variables: 5
##
## Variable type: factor
## variable missing complete n n_unique top_counts
## Species 0 150 150 3 set: 50, ver: 50, vir: 50, NA: 0
## ordered
## FALSE
##
## Variable type: numeric
## variable missing complete n mean sd p0 p25 p50 p75 p100
## Petal.Length 0 150 150 3.76 1.77 1 1.6 4.35 5.1 6.9
## Petal.Width 0 150 150 1.2 0.76 0.1 0.3 1.3 1.8 2.5
## Sepal.Length 0 150 150 5.84 0.83 4.3 5.1 5.8 6.4 7.9
## Sepal.Width 0 150 150 3.06 0.44 2 2.8 3 3.3 4.4
## hist
## ▇▁▁▂▅▅▃▁
## ▇▁▁▅▃▃▂▂
## ▂▇▅▇▆▅▂▂
## ▁▂▅▇▃▂▁▁
We have four different variables: Petal.Length, Petal.Width, Sepal.Length, and Sepal.Width. However, a 3D plot can only show three of them at a time, so let's look at all four to choose the best three.
Sepal.Width has the least variance (sd = 0.44 in the summary above), so I am going to use the other three variables.
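As a quick sanity check, a one-liner (a small sketch) confirms this from the data itself:

# variance of each numeric column, smallest first
sapply(iris[, 1:4], var) %>% sort()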
From the graphs below, we need to identify which variables can help us separate the species.
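Here is a minimal sketch of the kind of 3D scatter I mean, assuming the plotly package (my choice of variables and styling, not necessarily the original chart):

# 3D scatter of the three highest-variance variables, colored by species
plot_ly(iris,
        x = ~Petal.Length, y = ~Petal.Width, z = ~Sepal.Length,
        color = ~Species,
        type = "scatter3d", mode = "markers")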
# LightGBM expects zero-based integer labels, so recode Species to 0/1/2
iris$Species <- iris$Species %>% as.factor() %>% as.numeric() - 1

# 70/30 train/test split with caret (add set.seed() first for a reproducible split)
inTrain <- createDataPartition(iris$Species, p = .7, list = F)
train <- iris[inTrain, ]
test <- iris[-inTrain, ]
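A quick, optional check (a small sketch) that the split keeps all three classes in both sets:

# class counts in each split; counts vary from run to run without a seed
table(train$Species)
table(test$Species)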
# data_train <- lgb.Dataset(data = data.matrix(train[, 1:4]), label = train[, 5])
#
# # "auc" is a binary metric; multi_logloss matches the multiclass objective
# params <- list(objective = "multiclass",
#                metric = "multi_logloss",
#                num_class = 3,
#                min_data = 1,
#                learning_rate = .1)
#
# model <- lgb.train(params, data_train, 100)
#
# # predict() returns one probability per class per row, row by row;
# # reshape to an nrow(test) x 3 matrix and pick the most likely class
# result <- predict(model, data.matrix(test[, 1:4]))
# prob <- matrix(result, ncol = 3, byrow = TRUE)
# pred <- max.col(prob) - 1 # back to the 0/1/2 encoding of Species
#
# lgb.importance(model, percentage = TRUE) %>% kable()
#
# tree_imp <- lgb.importance(model, percentage = TRUE)
# lgb.plot.importance(tree_imp, measure = "Gain")
#
# pred_table <- table(pred, test$Species)
# confusionMatrix(pred_table)
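If the block above is run, a single headline accuracy number can also be read straight off the confusion table (a small sketch using the pred_table object from above):

# accuracy = proportion of test rows where the predicted class equals the true class
# sum(diag(pred_table)) / sum(pred_table)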
I hope you enjoyed the kernel, and don't forget to upvote~ Thanks a lot!