Requirement:
Download the data, load it into your favorite statistical programing software or database. Report the number of rows and columns that you’ve loaded.
Visualize trip distance by time of day in any way you see fit, any observations?
What are the most popular pickup locations on weekdays vs weekend?
I want to know where I can most easily get a cab. Recommend a pick up spot where i can find a cab given my lat long.
This time, I would use R as the programing software for the analysis. The analysis would be based on a standard data science framework and the questions above; however, I would extend the scope of the analysis to identify any unique insight as well as provide detailed explanation of my code.
The scope of data is limited to Green taxi data for February 2016. There is also data dictionary available from the website, which explain the variable. The Data is available via NYC Trip Data.
if (!require("pacman")) install.packages("pacman")
pacman::p_load(tidyverse, DT, lubridate, leaflet, leaflet.extras, maps, data.table, ggthemes, rebus, clue, skimr, plotly)
# Initially use read.csv then write the file so that going forward I can use fread
data <- read.csv("input/green_tripdata_2016-02v2.csv", stringsAsFactors = F)
The first question can be answered by looking at the structure of the dataset. The dataset has 1510722 observations(rows) and 22 variables(columns).
data %>% head(100) %>% select(lpep_pickup_datetime,Pickup_longitude,Pickup_latitude,Trip_distance) %>% datatable(filter = 'top', options = list(
pageLength = 15, autoWidth = TRUE
))
data %>% glimpse() # 1510722 obs. of 21 variables
## Observations: 1,510,722
## Variables: 22
## $ X <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1...
## $ VendorID <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ lpep_pickup_datetime <chr> "2016-02-01 00:00:01", "2016-02-01 00:01...
## $ Lpep_dropoff_datetime <chr> "2016-02-01 00:10:06", "2016-02-01 00:20...
## $ Store_and_fwd_flag <chr> "N", "N", "N", "N", "N", "N", "N", "N", ...
## $ RateCodeID <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ Pickup_longitude <dbl> -73.93902, -73.89149, -73.98378, -73.807...
## $ Pickup_latitude <dbl> 40.80521, 40.74665, 40.67613, 40.70037, ...
## $ Dropoff_longitude <dbl> -73.97253, -73.89088, -73.95698, -73.831...
## $ Dropoff_latitude <dbl> 40.78588, 40.74390, 40.71833, 40.70598, ...
## $ Passenger_count <int> 1, 1, 1, 1, 5, 1, 5, 3, 1, 1, 1, 1, 1, 6...
## $ Trip_distance <dbl> 2.86, 3.35, 4.70, 2.11, 0.98, 6.00, 1.46...
## $ Fare_amount <dbl> 10.5, 13.0, 17.5, 8.0, 5.0, 22.5, 7.0, 1...
## $ Extra <dbl> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, ...
## $ MTA_tax <dbl> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, ...
## $ Tip_amount <dbl> 0.00, 0.00, 3.76, 0.00, 0.00, 4.76, 0.00...
## $ Tolls_amount <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Ehail_fee <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ improvement_surcharge <dbl> 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, ...
## $ Total_amount <dbl> 11.80, 14.30, 22.56, 9.30, 6.30, 28.56, ...
## $ Payment_type <int> 2, 2, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 2...
## $ Trip_type <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
data %>% skim() %>% kable()
## Skim summary statistics
## n obs: 1510722
## n variables: 22
##
## Variable type: character
##
## variable missing complete n min max empty n_unique
## ---------------------- -------- --------- -------- ---- ---- ------ ---------
## Lpep_dropoff_datetime 0 1510722 1510722 19 19 0 1065647
## lpep_pickup_datetime 0 1510722 1510722 19 19 0 1067848
## Store_and_fwd_flag 0 1510722 1510722 1 1 0 2
##
## Variable type: integer
##
## variable missing complete n mean sd p0 p25 p50 p75 p100 hist
## ---------------- -------- --------- -------- --------- ---------- --- ---------- --------- ----------- -------- ---------
## Passenger_count 0 1510722 1510722 1.35 1.02 0 1 1 1 9 <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
## Payment_type 0 1510722 1510722 1.51 0.53 1 1 1 2 5 <U+2587><U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
## RateCodeID 0 1510722 1510722 1.09 0.61 1 1 1 1 99 <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
## Trip_type 2 1510720 1510722 1.02 0.14 1 1 1 1 2 <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
## VendorID 0 1510722 1510722 1.79 0.41 1 2 2 2 2 <U+2582><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2587>
## X 0 1510722 1510722 755361.5 436108.02 1 377681.25 755361.5 1133041.75 1510722 <U+2587><U+2587><U+2587><U+2587><U+2587><U+2587><U+2587><U+2587>
##
## Variable type: logical
##
## variable missing complete n mean count
## ---------- -------- --------- -------- ----- --------
## Ehail_fee 1510722 0 1510722 NaN 1510722
##
## Variable type: numeric
##
## variable missing complete n mean sd p0 p25 p50 p75 p100 hist
## ---------------------- -------- --------- -------- ------- ------ -------- ------- ------- ------- -------- ---------
## Dropoff_latitude 0 1510722 1510722 40.68 1.61 0 40.7 40.75 40.79 42.32 <U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2587>
## Dropoff_longitude 0 1510722 1510722 -73.83 2.97 -115.33 -73.97 -73.95 -73.91 0 <U+2581><U+2581><U+2587><U+2581><U+2581><U+2581><U+2581><U+2581>
## Extra 0 1510722 1510722 0.35 0.38 -4.5 0 0.5 0.5 4.5 <U+2581><U+2581><U+2581><U+2587><U+2587><U+2581><U+2581><U+2581>
## Fare_amount 0 1510722 1510722 11.75 9.4 -400 6 9 14.5 933.5 <U+2581><U+2581><U+2587><U+2581><U+2581><U+2581><U+2581><U+2581>
## improvement_surcharge 0 1510722 1510722 0.29 0.051 -0.3 0.3 0.3 0.3 1.77 <U+2581><U+2581><U+2587><U+2581><U+2581><U+2581><U+2581><U+2581>
## MTA_tax 0 1510722 1510722 0.49 0.088 -0.5 0.5 0.5 0.5 3 <U+2581><U+2581><U+2587><U+2581><U+2581><U+2581><U+2581><U+2581>
## Pickup_latitude 0 1510722 1510722 40.68 1.68 0 40.69 40.75 40.8 42.32 <U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2587>
## Pickup_longitude 0 1510722 1510722 -73.82 3.1 -115.28 -73.96 -73.95 -73.92 0 <U+2581><U+2581><U+2587><U+2581><U+2581><U+2581><U+2581><U+2581>
## Tip_amount 0 1510722 1510722 1.23 2.43 -10 0 0 2 400 <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
## Tolls_amount 0 1510722 1510722 0.1 1.36 -12.5 0 0 0 902.17 <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
## Total_amount 0 1510722 1510722 14.2 10.88 -400 7.8 11 17.16 1121.16 <U+2581><U+2581><U+2587><U+2581><U+2581><U+2581><U+2581><U+2581>
## Trip_distance 0 1510722 1510722 2.73 2.84 0 1 1.8 3.44 235.5 <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
data %>% summary()
## X VendorID lpep_pickup_datetime
## Min. : 1 Min. :1.000 Length:1510722
## 1st Qu.: 377681 1st Qu.:2.000 Class :character
## Median : 755362 Median :2.000 Mode :character
## Mean : 755362 Mean :1.785
## 3rd Qu.:1133042 3rd Qu.:2.000
## Max. :1510722 Max. :2.000
##
## Lpep_dropoff_datetime Store_and_fwd_flag RateCodeID
## Length:1510722 Length:1510722 Min. : 1.000
## Class :character Class :character 1st Qu.: 1.000
## Mode :character Mode :character Median : 1.000
## Mean : 1.091
## 3rd Qu.: 1.000
## Max. :99.000
##
## Pickup_longitude Pickup_latitude Dropoff_longitude Dropoff_latitude
## Min. :-115.28 Min. : 0.00 Min. :-115.33 Min. : 0.00
## 1st Qu.: -73.96 1st Qu.:40.69 1st Qu.: -73.97 1st Qu.:40.70
## Median : -73.95 Median :40.75 Median : -73.95 Median :40.75
## Mean : -73.82 Mean :40.68 Mean : -73.83 Mean :40.68
## 3rd Qu.: -73.92 3rd Qu.:40.80 3rd Qu.: -73.91 3rd Qu.:40.79
## Max. : 0.00 Max. :42.32 Max. : 0.00 Max. :42.32
##
## Passenger_count Trip_distance Fare_amount Extra
## Min. :0.000 Min. : 0.000 Min. :-400.00 Min. :-4.5000
## 1st Qu.:1.000 1st Qu.: 1.000 1st Qu.: 6.00 1st Qu.: 0.0000
## Median :1.000 Median : 1.800 Median : 9.00 Median : 0.5000
## Mean :1.352 Mean : 2.733 Mean : 11.75 Mean : 0.3498
## 3rd Qu.:1.000 3rd Qu.: 3.440 3rd Qu.: 14.50 3rd Qu.: 0.5000
## Max. :9.000 Max. :235.500 Max. : 933.50 Max. : 4.5000
##
## MTA_tax Tip_amount Tolls_amount Ehail_fee
## Min. :-0.500 Min. :-10.000 Min. :-12.5000 Mode:logical
## 1st Qu.: 0.500 1st Qu.: 0.000 1st Qu.: 0.0000 NA's:1510722
## Median : 0.500 Median : 0.000 Median : 0.0000
## Mean : 0.487 Mean : 1.226 Mean : 0.1017
## 3rd Qu.: 0.500 3rd Qu.: 2.000 3rd Qu.: 0.0000
## Max. : 3.000 Max. :400.000 Max. :902.1700
##
## improvement_surcharge Total_amount Payment_type Trip_type
## Min. :-0.3000 Min. :-400.00 Min. :1.000 Min. :1.000
## 1st Qu.: 0.3000 1st Qu.: 7.80 1st Qu.:1.000 1st Qu.:1.000
## Median : 0.3000 Median : 11.00 Median :1.000 Median :1.000
## Mean : 0.2922 Mean : 14.21 Mean :1.509 Mean :1.021
## 3rd Qu.: 0.3000 3rd Qu.: 17.16 3rd Qu.:2.000 3rd Qu.:1.000
## Max. : 1.7700 Max. :1121.16 Max. :5.000 Max. :2.000
## NA's :2
Since the trip payment is not in the scale of this analysis, I took out these variables for shorter running time.
data[,which(str_detect(names(data),"amount|fee|Extra|fee|Pay|tax|ID|charge"))] <- NULL
Looking at the summary result, I got the map below. It is very interesting to see that all the pick up location are outside of the core area of New York City. By doing a little research, I found out that the green taxi are only allowed to pick up passengers (street hails or calls) in outer boroughs (excluding John F. Kennedy International Airport and LaGuardia Airport unless arranged in advance) and in Manhattan above East 96th and West 110th Streets. That explains the pattern we see here.
set.seed(0)
data %>%
sample_n(size=10000) %>%
leaflet() %>%
addProviderTiles(providers$Esri.NatGeoWorldMap, group = "color map") %>%
addProviderTiles(providers$CartoDB.Positron, group = "Light map") %>%
addProviderTiles(providers$Stamen.Toner, group = "white map") %>%
addCircleMarkers(~Pickup_longitude, ~Pickup_latitude, radius = 1,
color = "firebrick", fillOpacity = 0.001, group = 'PickUp') %>%
addCircleMarkers(~Dropoff_longitude, ~Dropoff_latitude, radius = 1,
color = "steelblue", fillOpacity = 0.001, group = 'DropOff') %>%
addLayersControl(
baseGroups = c("Color map", "Light map", "white map"),
overlayGroups = c("PickUp", "DropOff"),
options = layersControlOptions(collapsed = T)
) %>%
addSearchOSM() %>%
# %>% addReverseSearchGoogle()
addSearchFeatures(
targetGroups = c("PickUp", "DropOff"))
set.seed(0)
data %>%
sample_n(size=10000) %>%
leaflet() %>%
addProviderTiles(providers$Esri.NatGeoWorldMap, group = "color map") %>%
addProviderTiles(providers$CartoDB.Positron, group = "Light map") %>%
addProviderTiles(providers$Stamen.Toner, group = "white map") %>%
addCircleMarkers(~Pickup_longitude, ~Pickup_latitude, radius = 1,
color = "firebrick", fillOpacity = 0.001, group = 'PickUp',
clusterOptions = markerClusterOptions()) %>%
addCircleMarkers(~Dropoff_longitude, ~Dropoff_latitude, radius = 1,
color = "steelblue", fillOpacity = 0.001, group = 'DropOff',
clusterOptions = markerClusterOptions()) %>%
addLayersControl(
baseGroups = c("Color map", "Light map", "white map"),
overlayGroups = c("PickUp", "DropOff"),
options = layersControlOptions(collapsed = T)
) %>%
addSearchOSM()
# %>%
# addReverseSearchOSM()
I converted pick_datetime to time data and created variables such as pickup_hour, pickup_weekday, pickup_weekend, etc.
pickup_hour has value from 1 to 24, denoting 24 hours a day.
pickup_weekday has value from Monday to Friday and is categorized as factor.
pickup_weekend has value Weekday and Weekend.
data <- data %>% mutate(lpep_pickup_datetime = ymd_hms(lpep_pickup_datetime),
pickup_hour=hour(lpep_pickup_datetime)+1,
pickup_weekday=as.factor(weekdays(lpep_pickup_datetime)),
pickup_weekend=if_else(pickup_weekday=='Saturday'|pickup_weekday=='Sunday','Weekend','Weekday'),
Lpep_dropoff_datetime = ymd_hms( Lpep_dropoff_datetime),
dropoff_hour=hour(lpep_pickup_datetime)+1,
dropoff_weekday=as.factor(weekdays(lpep_pickup_datetime)))
First of all, let’s have an overview on the distance data. As the graph below, the data is skewed by the distance outliers.
data %>%
ggplot(aes(Trip_distance)) +
geom_histogram(fill = "firebrick", bins = 150) +
theme_economist() +
labs(title='Green Taxi Case Study',subtitle='Initial Historgram of Distance',caption="source: NYC Green Taxi Data",
x="Travel Distance")
After processing the data, I got an almost normal distribution with some more few distance trips.
data %>%
ggplot(aes(Trip_distance)) +
geom_histogram(fill = "firebrick", bins = 150) +
scale_x_log10() +
theme_economist() +
labs(title='Green Taxi Case Study',subtitle='Processed Historgram of Distance',caption="source: NYC Green Taxi Data",
x="Travel Distance (log)")
From an initial look at the trip distance by time of day graph, the median trip distance is longer at 6 AM and 7 AM than any other time during the day. The difference between the two hours and the rest of the day is quite significant.
ggplotly(data %>% group_by(pickup_hour) %>% summarise(avg_trip_distance=median(Trip_distance)) %>%
ggplot(aes(pickup_hour, avg_trip_distance, fill = avg_trip_distance)) + geom_col() +
geom_label(aes(label=round(avg_trip_distance,1)), size=3.5, alpha=.7) +
# coord_flip() +
scale_x_continuous(breaks=seq(1,24,1)) +
theme_economist() +
theme(legend.position = 'none') +
labs(title='Median Trip Distance',subtitle='All Data Included (Weekday and Weekdend)',caption="source: Green Taxi Data",
y="Average Trip Distance", x="Time of Day (Pickup)"))
Same as the observation from the full dataset, 6 AM and 7 AM are still the two hour that people take taxi for the longest median trip distance. However, in this case, the median trip distance is slightly longer at 6 AM. At that day of the weekday, my assumption is most of the taxi usage is caused by the commute to go to work. Since people usually start work at 8 AM, one of the potential explanation is that people who live further from their work place tend to leave a bit earlier than the people who live close by. Moreover, it might also cause by that more people take taxi to work at 6 AM compared to 7 AM. Therefore, the median trip distance is slightly longer at 6 AM than 7 AM.
ggplotly(data %>% filter(pickup_weekend=='Weekday') %>% group_by(pickup_hour) %>% summarise(avg_trip_distance=median(Trip_distance)) %>%
ggplot(aes(pickup_hour, avg_trip_distance, fill = avg_trip_distance)) + geom_col() +
geom_label(aes(label=round(avg_trip_distance,1)), size=3.5, alpha=.7) +
# coord_flip() +
scale_x_continuous(breaks=seq(1,24,1)) +
theme_economist() +
theme(legend.position = 'none') +
labs(title='Median Trip Distance',subtitle='Weekday',caption="source: Green Taxi Data",
y="Average Trip Distance", x="Time of Day (Pickup)"))
For the weekend, 7 AM and 8 AM turned out to be the two hours that have the longest trip distance. Additionally, the difference between the two hours and the rest of the day is not as significant as for the weekdays. My speculation is that people tend to sleep in, since they don’t have to go to work in the early morning. Therefore, the longest median trip distance is postponed one hour as well as the difference is not as significant as the weekdays.
ggplotly(data %>% filter(pickup_weekend=='Weekend') %>% group_by(pickup_hour) %>% summarise(avg_trip_distance=median(Trip_distance)) %>%
ggplot(aes(pickup_hour, avg_trip_distance, fill = avg_trip_distance)) + geom_col() +
geom_label(aes(label=round(avg_trip_distance,1)), size=3.5, alpha=.7) +
# coord_flip() +
scale_x_continuous(breaks=seq(1,24,1)) +
theme_economist() +
theme(legend.position = 'none') +
labs(title='Median Trip Distance',subtitle='Weekend',caption="source: Green Taxi Data",
y="Average Trip Distance", x="Time of Day (Pickup)"))
ggplotly(data %>%
group_by(pickup_hour, pickup_weekend) %>%
summarise(avg_trip_distance=median(Trip_distance)) %>%
ggplot(aes(pickup_hour, avg_trip_distance, color = pickup_weekend)) +
geom_smooth(method = "loess", span = 1/2, se=F) +
geom_point(size = 4) +
labs(x = "Time of Day (Pickup)", y = "Average Trip Distance") +
scale_x_continuous(breaks=seq(1,24,1)) +
theme_economist() +
scale_color_discrete("Weekday vs. Weekend"))
Rather than directing calculating the top 5 pick up locations, I preprocessed the data a little bit. The logic is that if I directly use the longitude and latitude data, the same pick up spot with slightly different coordinates would be treated as different pick up locations and that would definitely deviate from the actual result. Therefore, I round the longitude and latitude to the 3 decimals from which the coordinates with slightly different number would be treated as one spot. I also used a green cab icon to denote the pick up spots. The graph is interactive and can be zoom in and out. If you place the mouse on the green cab icon, it would show how many pick ups at the location based on the dataset.
round_num <- 3
Weekday_Top5 <- data %>% filter(pickup_weekend=='Weekday') %>%
group_by(lng=round(Pickup_longitude,round_num),lat=round(Pickup_latitude,round_num)) %>%
count() %>% arrange(desc(n)) %>% head(5)
Weekend_Top5 <- data %>% filter(pickup_weekend=='Weekend') %>%
group_by(lng=round(Pickup_longitude,round_num),lat=round(Pickup_latitude,round_num)) %>%
count() %>% arrange(desc(n)) %>% head(5)
greentaxi <- makeIcon(
iconUrl = "https://i.imgur.com/6rw618Q.png",
iconWidth = 38, iconHeight = 35,
iconAnchorX = 19, iconAnchorY = 39
)
There are the top 5 pick up locations during weekdays.
71st Ave and Queens Blvd. (13,987 pick ups in Feb 2016)
E 125th St and Park Ave. (13,235 pick ups in Feb 2016)
Broad Way and Roosevelt Ave. (12,566 pick ups in Feb 2016)
Madison Ave and E 101st St. (7,198 pick ups in Feb 2016)
Bedford Ave and N 7th St. (6,105 pick ups in Feb 2016)
Weekday_Top5 %>%
leaflet() %>%
addProviderTiles(providers$Esri.NatGeoWorldMap, group = "color map") %>%
addProviderTiles(providers$CartoDB.Positron, group = "Light map") %>%
addProviderTiles(providers$Stamen.Toner, group = "white map") %>%
addScaleBar() %>%
addProviderTiles(providers$Esri.NatGeoWorldMap) %>%
addCircleMarkers(~lng, ~lat, radius = 1,
color = "firebrick", fillOpacity = 0.001) %>%
addMarkers(~lng, ~lat, icon = greentaxi, label = ~as.character(paste("Number of Pick ups:",Weekday_Top5$n))) %>%
addLayersControl(
baseGroups = c("Color map", "Light map","white map"),
options = layersControlOptions(collapsed = FALSE)
)
Broad Way and Roosevelt Ave. (6,465 pick ups in Feb 2016)
71st Ave and Queens Blvd. (5,249 pick ups in Feb 2016)
E 125th St and Park Ave. (4,788 pick ups in Feb 2016)
Wythe Ave and N 11th St. (4,507 pick ups in Feb 2016)
Bedford Ave and N 7th St. (2,768 pick ups in Feb 2016)
Weekend_Top5 %>%
leaflet() %>%
addProviderTiles(providers$Esri.NatGeoWorldMap, group = "color map") %>%
addProviderTiles(providers$CartoDB.Positron, group = "Light map") %>%
addProviderTiles(providers$Stamen.Toner, group = "white map") %>%
addScaleBar() %>%
addProviderTiles(providers$Esri.NatGeoWorldMap) %>%
addCircleMarkers(~lng, ~lat, radius = 1,
color = "firebrick", fillOpacity = 0.001) %>%
addMarkers(~lng, ~lat, icon = greentaxi, label = ~as.character(paste("Number of Pick ups:",Weekend_Top5$n))) %>%
addLayersControl(
baseGroups = c("Color map", "Light map", "white map"),
options = layersControlOptions(collapsed = FALSE)
)
To recommend a pick up spot, I leverage the power of unsupervised learning by using a simple Kmeans model to group the pick up spots into 50 groups. Each of the pick up locations
According to the dictionary, there are two types of trip - street-hail and dispatch. For this question, we should only focus on the street-hail and exclude the dispatches.
data_coord <- data %>% filter(Trip_type==1) %>% select(Pickup_longitude, Pickup_latitude)
data1 <- data %>% filter(Trip_type==1)
I used kmeans model to classify the coordinates into 50 groups.
set.seed(0)
data_kmeans <- data_coord %>% kmeans(50,nstart=20)
data1$cluster <- data_kmeans$cluster
pal <- colorNumeric(
palette = "Blues",
domain = data$cluster)
I sampled 10,000 observations and put them on the map.
So far, I answered the first three questions. To answer the last question, I would leverage the power of shiny app and make an interactive graph with the input option for longitude and latitude. Then, I would use the kmeans model to predict which cluster the input location would be in and focus on the pickup points within that cluster. Final, I would pick top 20 pick up points to recommend and the coordinate of the closest pick up spot among the Top 20.
Please found these result from the Shiny app below.
set.seed(0)
data1 %>% sample_n(size=10000) %>%
leaflet() %>%
addProviderTiles(providers$Esri.NatGeoWorldMap, group = "color map") %>%
addProviderTiles(providers$CartoDB.Positron, group = "Light map") %>%
addProviderTiles(providers$Stamen.Toner, group = "white map") %>%
addScaleBar() %>%
addCircleMarkers(~Pickup_longitude, ~Pickup_latitude, radius = 1,
color = ~pal(cluster), fillOpacity = 0.001) %>%
addLayersControl(
baseGroups = c("Color map", "Light map", "white map"),
options = layersControlOptions(collapsed = FALSE)
)
I set up the input options for longitude and latitude with sliders. Once that data is input, the program would make a prediction, for which cluster it belongs to, based on the input and kmeans model. Then, it would give 20 recommended pick up spots within the cluster as well as the closest pick up spot among the Top 20.
Please be awared that the graph below is just the screenshot of the actual interactive graph, since Shiny app is not available on Kaggle at the moment.