Objectives:
Download the data, and load it in R Studio & Pycharm and provide initial overview information.
Visualize the location of the car accidents.
Find out the insight from the dataset (i.e. Location/ Time of Day).
Find out the potential car accident area given the current location.
This time, I would leverage the power of R and Python to perform the analysis and present the result via both Rmarkdown (R) and jupiter notebook (python). The analysis would be based on a standard data science framework and answer the questions above; however, I would extend the scope of the analysis to identify any unique insight as well as provide detailed explanation of my code.
if (!require("pacman")) install.packages("pacman")
pacman::p_load(tidyverse, DT, lubridate, leaflet, leaflet.extras, maps, data.table, ggthemes, rebus, clue, skimr, plotly)
data <- read.csv("input/NYPD_Motor_Vehicle_Collisions.csv", stringsAsFactors = F) %>% select(DATE, TIME, LATITUDE, LONGITUDE)
The first question can be answered by looking at the structure of the dataset. The dataset has 1,089,265 observations(rows) and 4 variables(columns).
data %>%
head(200) %>%
datatable(filter = 'top', options = list(
pageLength = 15, autoWidth = TRUE
))
data %>%
glimpse()
## Observations: 1,089,265
## Variables: 4
## $ DATE <chr> "08/04/2017", "08/04/2017", "08/04/2017", "08/04/201...
## $ TIME <chr> "0:00", "0:00", "0:00", "0:00", "0:00", "0:00", "0:0...
## $ LATITUDE <dbl> 40.66689, 40.71995, 40.71867, 40.75468, 40.81831, 40...
## $ LONGITUDE <dbl> -73.79041, -74.00859, -73.96350, -73.97581, -73.8360...
data %>%
skim() %>%
kable()
## Skim summary statistics
## n obs: 1089265
## n variables: 4
##
## Variable type: character
##
## variable missing complete n min max empty n_unique
## --------- -------- --------- -------- ---- ---- ------ ---------
## DATE 0 1089265 1089265 10 10 0 1861
## TIME 0 1089265 1089265 4 5 0 1440
##
## Variable type: numeric
##
## variable missing complete n mean sd p0 p25 p50 p75 p100 hist
## ---------- -------- --------- -------- ------- ----- -------- ------- ------- ------- ------ ---------
## LATITUDE 207066 882199 1089265 40.72 0.3 0 40.67 40.72 40.77 41.13 <U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2587>
## LONGITUDE 207066 882199 1089265 -73.92 0.99 -201.36 -73.98 -73.93 -73.87 0 <U+2581><U+2581><U+2581><U+2581><U+2581><U+2587><U+2581><U+2581>
data %>% summary()
## DATE TIME LATITUDE LONGITUDE
## Length:1089265 Length:1089265 Min. : 0.00 Min. :-201.36
## Class :character Class :character 1st Qu.:40.67 1st Qu.: -73.98
## Mode :character Mode :character Median :40.72 Median : -73.93
## Mean :40.72 Mean : -73.92
## 3rd Qu.:40.77 3rd Qu.: -73.87
## Max. :41.13 Max. : 0.00
## NA's :207066 NA's :207066
As the data range section shows, some data entries for latitude and longitude are out of the scale and need to be corrected or removed.
data <- data %>% filter(LATITUDE>0, LONGITUDE<-72, LONGITUDE>-75)
Looking at the summary result, I got the map below, which takes 5,000 examples.
set.seed(0)
data %>%
sample_n(size=5000) %>%
leaflet() %>%
addProviderTiles(providers$HikeBike.HikeBike, group = "color map") %>%
addProviderTiles(providers$CartoDB.Positron, group = "Light map") %>%
addCircleMarkers(~LONGITUDE, ~LATITUDE, radius = 1,
color = "firebrick", fillOpacity = 0.001) %>%
# addCircleMarkers(~Dropoff_longitude, ~Dropoff_latitude, radius = 1,
# color = "steelblue", fillOpacity = 0.001, group = 'DropOff') %>%
addLayersControl(
baseGroups = c("Color map", "Light map"),
# overlayGroups = c("PickUp", "DropOff"),
options = layersControlOptions(collapsed = T)
) %>%
addSearchOSM()
# %>%
# addReverseSearchGoogle()
# addSearchFeatures(
# targetGroups = c("PickUp", "DropOff"))
set.seed(0)
data %>%
sample_n(size=5000) %>%
leaflet() %>%
addProviderTiles(providers$HikeBike.HikeBike, group = "color map") %>%
addProviderTiles(providers$CartoDB.Positron, group = "Light map") %>%
addCircleMarkers(~LONGITUDE, ~LATITUDE, radius = 1,
color = "firebrick", fillOpacity = 0.001,
clusterOptions = markerClusterOptions()) %>%
# addCircleMarkers(~Dropoff_longitude, ~Dropoff_latitude, radius = 1,
# color = "steelblue", fillOpacity = 0.001, group = 'DropOff') %>%
addLayersControl(
baseGroups = c("Color map", "Light map"),
# overlayGroups = c("PickUp", "DropOff"),
options = layersControlOptions(collapsed = T)
) %>%
addSearchOSM()
I converted datetime to time series data and created variables such as hour, weekday, weekend, etc.
Hour has value from 1 to 24, denoting 24 hours a day.
Weekday has value from Monday to Friday and is categorized as factor.
Weekend has value Weekday and Weekend.
data <- data %>%
mutate(dateTime = mdy_hm(paste(DATE, TIME, sep = ' ')),
weekday=as.factor(weekdays(dateTime)),
weekend=if_else(weekday=='Saturday'|weekday=='Sunday','Weekend','Weekday'),
hour = hour(dateTime)+1)
From an initial look at the number of accident by time of day graph, most of the accidents happened during the day with the peak ocurring around hour ending 17~18. The difference between the 8 and 9 is quite significant.
ggplotly(data %>% group_by(hour) %>% summarise(num_accident=n()) %>%
ggplot(aes(hour, num_accident, fill = num_accident)) + geom_col() +
geom_label(aes(label=round(num_accident,1)), size=3.5, alpha=.7) +
# coord_flip() +
scale_x_continuous(breaks=seq(1,24,1)) +
theme_economist() +
theme(legend.position = 'none') +
labs(title='Number of Accidents (Weekday and Weekdend)',subtitle='All Data Included (Weekday and Weekdend)',caption="source: Kaggle Open Source Data",
y="Number of Accidents", x="Time of Day"))
Same as the observation from the full dataset, a slightly higher peak in the morning, which is presumably caused by the rush hours.
ggplotly(data %>% filter(weekend=='Weekday') %>% group_by(hour) %>% summarise(num_accident=n()) %>%
ggplot(aes(hour, num_accident, fill = num_accident)) + geom_col() +
geom_label(aes(label=round(num_accident,1)), size=3.5, alpha=.7) +
# coord_flip() +
scale_x_continuous(breaks=seq(1,24,1)) +
theme_economist() +
theme(legend.position = 'none') +
labs(title='Number of Accidents (Weekday)',
y="Number of Accidents", x="Time of Day"))
For the weekend, the pattern changed and the peak is ocurring between hour ending 15 to 17.
ggplotly(data %>% filter(weekend=='Weekend') %>% group_by(hour) %>% summarise(num_accident=n()) %>%
ggplot(aes(hour, num_accident, fill = num_accident)) + geom_col() +
geom_label(aes(label=round(num_accident,1)), size=3.5, alpha=.7) +
# coord_flip() +
scale_x_continuous(breaks=seq(1,24,1)) +
theme_economist() +
theme(legend.position = 'none') +
labs(title='Number of Accidents (Weekend)',
y="Number of Accidents", x="Time of Day"))
ggplotly(data %>%
group_by(hour, weekend) %>%
summarise(num_accident=n()) %>%
ggplot(aes(hour, num_accident, color = weekend)) +
geom_smooth(method = "loess", span = 1/2, se=F) +
geom_point(size = 4) +
labs(x = "Time of Day", y = "Number of Accidents") +
scale_x_continuous(breaks=seq(1,24,1)) +
theme_economist() +
scale_color_discrete("Weekday vs. Weekend"))
Rather than directing calculating the top 5 Accidents locations, I preprocessed the data a little bit. The logic is that if I directly use the longitude and latitude data, the same pick up spot with slightly different coordinates would be treated as different pick up locations and that would definitely deviate from the actual result. Therefore, I round the longitude and latitude to the 3 decimals from which the coordinates with slightly different number would be treated as one spot. I also used a green cab icon to denote the accident spots. The graph is interactive and can be zoom in and out. If you place the mouse on the green cab icon, it would show how many accidents at the location based on the dataset.
round_num <- 3
Weekday_Top5 <- data %>% filter(weekend=='Weekday') %>%
group_by(lng=round(LONGITUDE,round_num),lat=round(LATITUDE,round_num)) %>%
count() %>% arrange(desc(n)) %>% head(5)
Weekend_Top5 <- data %>% filter(weekend=='Weekend') %>%
group_by(lng=round(LONGITUDE,round_num),lat=round(LATITUDE,round_num)) %>%
count() %>% arrange(desc(n)) %>% head(5)
greentaxi <- makeIcon(
iconUrl = "https://i.imgur.com/UVfnHVr.png",
iconWidth = 38, iconHeight = 35,
iconAnchorX = 19, iconAnchorY = 39
)
Weekday_Top5 %>%
leaflet() %>%
addProviderTiles(providers$HikeBike.HikeBike, group = "color map") %>%
addProviderTiles(providers$CartoDB.Positron, group = "Light map") %>%
# addProviderTiles(providers$Stamen.Toner, group = "white map") %>%
addScaleBar() %>%
addProviderTiles(providers$Esri.NatGeoWorldMap) %>%
addCircleMarkers(~lng, ~lat, radius = 1,
color = "firebrick", fillOpacity = 0.001) %>%
addMarkers(~lng, ~lat, icon = greentaxi, label = ~as.character(paste("Number of Accidents:",Weekday_Top5$n))) %>%
addLayersControl(
baseGroups = c("Color map", "Light map"),
options = layersControlOptions(collapsed = FALSE)
)
Weekend_Top5 %>%
leaflet() %>%
addProviderTiles(providers$HikeBike.HikeBike, group = "color map") %>%
addProviderTiles(providers$CartoDB.Positron, group = "Light map") %>%
# addProviderTiles(providers$Stamen.Toner, group = "white map") %>%
addScaleBar() %>%
addProviderTiles(providers$Esri.NatGeoWorldMap) %>%
addCircleMarkers(~lng, ~lat, radius = 1,
color = "firebrick", fillOpacity = 0.001) %>%
addMarkers(~lng, ~lat, icon = greentaxi, label = ~as.character(paste("Number of Accidents:",Weekend_Top5$n))) %>%
addLayersControl(
baseGroups = c("Color map", "Light map"),
options = layersControlOptions(collapsed = FALSE)
)
To recommend a potential accident spot, I leverage the power of unsupervised learning by using a simple Kmeans model to group the accident spots into 50 groups. Each of the accident locations
data_coord <- data %>% select(LONGITUDE, LATITUDE)
data1 <- data
I used kmeans model to classify the coordinates into 50 groups.
set.seed(0)
# data_kmeans <- data_coord %>% kmeans(50,nstart=20)
# save(data_kmeans, file = "input/data_kmeans.rda")
load("input/data_kmeans.rda")
data1$cluster <- data_kmeans$cluster
pal <- colorNumeric(
palette = "Blues",
domain = data$cluster)
For the last objective, I would leverage the power of shiny app and make an interactive graph with the input option for longitude and latitude. Then, I would use the kmeans model to predict which cluster the input location would be in and focus on the accident points within that cluster. Final, I would pick top 20 accident points to recommend and the coordinate of the closest accident spot among the Top 20.
Please found these result from the Shiny app below.
set.seed(0)
data1 %>% sample_n(size=10000) %>%
leaflet() %>%
addProviderTiles(providers$HikeBike.HikeBike, group = "color map") %>%
addProviderTiles(providers$CartoDB.Positron, group = "Light map") %>%
# addProviderTiles(providers$Stamen.Toner, group = "white map") %>%
addScaleBar() %>%
addCircleMarkers(~LONGITUDE, ~LATITUDE, radius = 1,
color = ~pal(cluster), fillOpacity = 0.001) %>%
addLayersControl(
baseGroups = c("Color map", "Light map"),
options = layersControlOptions(collapsed = FALSE)
)
I set up the input options for longitude and latitude with sliders. Once that data is input, the program would make a prediction, for which cluster it belongs to, based on the input and kmeans model. Then, it would give 20 recommended accident spots within the cluster as well as the closest accident spot among the Top 20.
Please be awared that the graph below is just the screenshot of the actual interactive app, since Shiny app is only available on the website host or showing via Rmarkdown.