1 Step 1: Define the Problem

Requirement:

Download the data, load it into your favorite statistical programing software or database. Report the number of rows and columns that you’ve loaded.
Visualize trip distance by time of day in any way you see fit, any observations?
What are the most popular pickup locations on weekdays vs weekend?
I want to know where I can most easily get a cab. Recommend a pick up spot where i can find a cab given my lat long.

This time, I would use R as the programing software for the analysis. The analysis would be based on a standard data science framework and the questions above; however, I would extend the scope of the analysis to identify any unique insight as well as provide detailed explanation of my code.

2 Step 2: Gather the Data

The scope of data is limited to Green taxi data for February 2016. There is also data dictionary available from the website, which explain the variable. The Data is available via NYC Trip Data.

3 Step 3: Preprocess the Data

3.1 Dependencies

3.1.1 Required libraries

if (!require("pacman")) install.packages("pacman") 
pacman::p_load(tidyverse, DT, lubridate, leaflet, leaflet.extras, maps, data.table, ggthemes, rebus, clue, skimr, plotly)

3.1.2 Required Dataset

# Initially use read.csv then write the file so that going forward I can use fread
data <- read.csv("input/green_tripdata_2016-02v2.csv", stringsAsFactors = F)

3.2 First Glimpse

The first question can be answered by looking at the structure of the dataset. The dataset has 1510722 observations(rows) and 22 variables(columns).

3.2.1 First 20 rows with selected columns

data %>% head(100) %>% select(lpep_pickup_datetime,Pickup_longitude,Pickup_latitude,Trip_distance) %>% datatable(filter = 'top', options = list(
  pageLength = 15, autoWidth = TRUE
))

3.2.2 Structure

data %>% glimpse() # 1510722 obs. of  21 variables

## Observations: 1,510,722
## Variables: 22
## $ X                     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1...
## $ VendorID              <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ lpep_pickup_datetime  <chr> "2016-02-01 00:00:01", "2016-02-01 00:01...
## $ Lpep_dropoff_datetime <chr> "2016-02-01 00:10:06", "2016-02-01 00:20...
## $ Store_and_fwd_flag    <chr> "N", "N", "N", "N", "N", "N", "N", "N", ...
## $ RateCodeID            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ Pickup_longitude      <dbl> -73.93902, -73.89149, -73.98378, -73.807...
## $ Pickup_latitude       <dbl> 40.80521, 40.74665, 40.67613, 40.70037, ...
## $ Dropoff_longitude     <dbl> -73.97253, -73.89088, -73.95698, -73.831...
## $ Dropoff_latitude      <dbl> 40.78588, 40.74390, 40.71833, 40.70598, ...
## $ Passenger_count       <int> 1, 1, 1, 1, 5, 1, 5, 3, 1, 1, 1, 1, 1, 6...
## $ Trip_distance         <dbl> 2.86, 3.35, 4.70, 2.11, 0.98, 6.00, 1.46...
## $ Fare_amount           <dbl> 10.5, 13.0, 17.5, 8.0, 5.0, 22.5, 7.0, 1...
## $ Extra                 <dbl> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, ...
## $ MTA_tax               <dbl> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, ...
## $ Tip_amount            <dbl> 0.00, 0.00, 3.76, 0.00, 0.00, 4.76, 0.00...
## $ Tolls_amount          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Ehail_fee             <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ improvement_surcharge <dbl> 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, ...
## $ Total_amount          <dbl> 11.80, 14.30, 22.56, 9.30, 6.30, 28.56, ...
## $ Payment_type          <int> 2, 2, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 2...
## $ Trip_type             <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...

3.2.3 Skim

data %>% skim() %>% kable()

## Skim summary statistics  
##  n obs: 1510722    
##  n variables: 22    
## 
## Variable type: character
## 
## variable                missing   complete   n         min   max   empty   n_unique 
## ----------------------  --------  ---------  --------  ----  ----  ------  ---------
## Lpep_dropoff_datetime   0         1510722    1510722   19    19    0       1065647  
## lpep_pickup_datetime    0         1510722    1510722   19    19    0       1067848  
## Store_and_fwd_flag      0         1510722    1510722   1     1     0       2        
## 
## Variable type: integer
## 
## variable          missing   complete   n         mean       sd          p0   p25         p50        p75          p100      hist     
## ----------------  --------  ---------  --------  ---------  ----------  ---  ----------  ---------  -----------  --------  ---------
## Passenger_count   0         1510722    1510722   1.35       1.02        0    1           1          1            9         <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581> 
## Payment_type      0         1510722    1510722   1.51       0.53        1    1           1          2            5         <U+2587><U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581> 
## RateCodeID        0         1510722    1510722   1.09       0.61        1    1           1          1            99        <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581> 
## Trip_type         2         1510720    1510722   1.02       0.14        1    1           1          1            2         <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581> 
## VendorID          0         1510722    1510722   1.79       0.41        1    2           2          2            2         <U+2582><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2587> 
## X                 0         1510722    1510722   755361.5   436108.02   1    377681.25   755361.5   1133041.75   1510722   <U+2587><U+2587><U+2587><U+2587><U+2587><U+2587><U+2587><U+2587> 
## 
## Variable type: logical
## 
## variable    missing   complete   n         mean   count   
## ----------  --------  ---------  --------  -----  --------
## Ehail_fee   1510722   0          1510722   NaN    1510722 
## 
## Variable type: numeric
## 
## variable                missing   complete   n         mean     sd      p0        p25      p50      p75      p100      hist     
## ----------------------  --------  ---------  --------  -------  ------  --------  -------  -------  -------  --------  ---------
## Dropoff_latitude        0         1510722    1510722   40.68    1.61    0         40.7     40.75    40.79    42.32     <U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2587> 
## Dropoff_longitude       0         1510722    1510722   -73.83   2.97    -115.33   -73.97   -73.95   -73.91   0         <U+2581><U+2581><U+2587><U+2581><U+2581><U+2581><U+2581><U+2581> 
## Extra                   0         1510722    1510722   0.35     0.38    -4.5      0        0.5      0.5      4.5       <U+2581><U+2581><U+2581><U+2587><U+2587><U+2581><U+2581><U+2581> 
## Fare_amount             0         1510722    1510722   11.75    9.4     -400      6        9        14.5     933.5     <U+2581><U+2581><U+2587><U+2581><U+2581><U+2581><U+2581><U+2581> 
## improvement_surcharge   0         1510722    1510722   0.29     0.051   -0.3      0.3      0.3      0.3      1.77      <U+2581><U+2581><U+2587><U+2581><U+2581><U+2581><U+2581><U+2581> 
## MTA_tax                 0         1510722    1510722   0.49     0.088   -0.5      0.5      0.5      0.5      3         <U+2581><U+2581><U+2587><U+2581><U+2581><U+2581><U+2581><U+2581> 
## Pickup_latitude         0         1510722    1510722   40.68    1.68    0         40.69    40.75    40.8     42.32     <U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2587> 
## Pickup_longitude        0         1510722    1510722   -73.82   3.1     -115.28   -73.96   -73.95   -73.92   0         <U+2581><U+2581><U+2587><U+2581><U+2581><U+2581><U+2581><U+2581> 
## Tip_amount              0         1510722    1510722   1.23     2.43    -10       0        0        2        400       <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581> 
## Tolls_amount            0         1510722    1510722   0.1      1.36    -12.5     0        0        0        902.17    <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581> 
## Total_amount            0         1510722    1510722   14.2     10.88   -400      7.8      11       17.16    1121.16   <U+2581><U+2581><U+2587><U+2581><U+2581><U+2581><U+2581><U+2581> 
## Trip_distance           0         1510722    1510722   2.73     2.84    0         1        1.8      3.44     235.5     <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>

3.2.4 Summary

data %>% summary()

##        X              VendorID     lpep_pickup_datetime
##  Min.   :      1   Min.   :1.000   Length:1510722      
##  1st Qu.: 377681   1st Qu.:2.000   Class :character    
##  Median : 755362   Median :2.000   Mode  :character    
##  Mean   : 755362   Mean   :1.785                       
##  3rd Qu.:1133042   3rd Qu.:2.000                       
##  Max.   :1510722   Max.   :2.000                       
##                                                        
##  Lpep_dropoff_datetime Store_and_fwd_flag   RateCodeID    
##  Length:1510722        Length:1510722     Min.   : 1.000  
##  Class :character      Class :character   1st Qu.: 1.000  
##  Mode  :character      Mode  :character   Median : 1.000  
##                                           Mean   : 1.091  
##                                           3rd Qu.: 1.000  
##                                           Max.   :99.000  
##                                                           
##  Pickup_longitude  Pickup_latitude Dropoff_longitude Dropoff_latitude
##  Min.   :-115.28   Min.   : 0.00   Min.   :-115.33   Min.   : 0.00   
##  1st Qu.: -73.96   1st Qu.:40.69   1st Qu.: -73.97   1st Qu.:40.70   
##  Median : -73.95   Median :40.75   Median : -73.95   Median :40.75   
##  Mean   : -73.82   Mean   :40.68   Mean   : -73.83   Mean   :40.68   
##  3rd Qu.: -73.92   3rd Qu.:40.80   3rd Qu.: -73.91   3rd Qu.:40.79   
##  Max.   :   0.00   Max.   :42.32   Max.   :   0.00   Max.   :42.32   
##                                                                      
##  Passenger_count Trip_distance      Fare_amount          Extra        
##  Min.   :0.000   Min.   :  0.000   Min.   :-400.00   Min.   :-4.5000  
##  1st Qu.:1.000   1st Qu.:  1.000   1st Qu.:   6.00   1st Qu.: 0.0000  
##  Median :1.000   Median :  1.800   Median :   9.00   Median : 0.5000  
##  Mean   :1.352   Mean   :  2.733   Mean   :  11.75   Mean   : 0.3498  
##  3rd Qu.:1.000   3rd Qu.:  3.440   3rd Qu.:  14.50   3rd Qu.: 0.5000  
##  Max.   :9.000   Max.   :235.500   Max.   : 933.50   Max.   : 4.5000  
##                                                                       
##     MTA_tax         Tip_amount       Tolls_amount      Ehail_fee     
##  Min.   :-0.500   Min.   :-10.000   Min.   :-12.5000   Mode:logical  
##  1st Qu.: 0.500   1st Qu.:  0.000   1st Qu.:  0.0000   NA's:1510722  
##  Median : 0.500   Median :  0.000   Median :  0.0000                 
##  Mean   : 0.487   Mean   :  1.226   Mean   :  0.1017                 
##  3rd Qu.: 0.500   3rd Qu.:  2.000   3rd Qu.:  0.0000                 
##  Max.   : 3.000   Max.   :400.000   Max.   :902.1700                 
##                                                                      
##  improvement_surcharge  Total_amount      Payment_type     Trip_type    
##  Min.   :-0.3000       Min.   :-400.00   Min.   :1.000   Min.   :1.000  
##  1st Qu.: 0.3000       1st Qu.:   7.80   1st Qu.:1.000   1st Qu.:1.000  
##  Median : 0.3000       Median :  11.00   Median :1.000   Median :1.000  
##  Mean   : 0.2922       Mean   :  14.21   Mean   :1.509   Mean   :1.021  
##  3rd Qu.: 0.3000       3rd Qu.:  17.16   3rd Qu.:2.000   3rd Qu.:1.000  
##  Max.   : 1.7700       Max.   :1121.16   Max.   :5.000   Max.   :2.000  
##                                                          NA's   :2

3.3 Data Cleaning: Correcting, Completing, Creating, and Converting

3.3.1 Correcting & Completing

Since the trip payment is not in the scale of this analysis, I took out these variables for shorter running time.

data[,which(str_detect(names(data),"amount|fee|Extra|fee|Pay|tax|ID|charge"))] <- NULL

4 Interactive Map

Looking at the summary result, I got the map below. It is very interesting to see that all the pick up location are outside of the core area of New York City. By doing a little research, I found out that the green taxi are only allowed to pick up passengers (street hails or calls) in outer boroughs (excluding John F. Kennedy International Airport and LaGuardia Airport unless arranged in advance) and in Manhattan above East 96th and West 110th Streets. That explains the pattern we see here.

set.seed(0)
data %>% 
  sample_n(size=10000) %>% 
  
  leaflet() %>% 
  addProviderTiles(providers$Esri.NatGeoWorldMap, group = "color map") %>%
  addProviderTiles(providers$CartoDB.Positron, group = "Light map") %>%
  addProviderTiles(providers$Stamen.Toner, group = "white map") %>% 
  addCircleMarkers(~Pickup_longitude, ~Pickup_latitude, radius = 1,
                   color = "firebrick", fillOpacity = 0.001, group = 'PickUp') %>%
  addCircleMarkers(~Dropoff_longitude, ~Dropoff_latitude, radius = 1,
                   color = "steelblue", fillOpacity = 0.001, group = 'DropOff') %>%
  addLayersControl(
    baseGroups = c("Color map", "Light map", "white map"),
    overlayGroups = c("PickUp", "DropOff"),
    options = layersControlOptions(collapsed = T)
  ) %>% 
  addSearchOSM()  %>% 
# %>% addReverseSearchGoogle() 
  addSearchFeatures(
     targetGroups = c("PickUp", "DropOff"))

5 Interactive Map with Clustering

set.seed(0)
data %>% 
  sample_n(size=10000) %>% 
  
  leaflet() %>% 
  addProviderTiles(providers$Esri.NatGeoWorldMap, group = "color map") %>%
  addProviderTiles(providers$CartoDB.Positron, group = "Light map") %>%
  addProviderTiles(providers$Stamen.Toner, group = "white map") %>% 
  addCircleMarkers(~Pickup_longitude, ~Pickup_latitude, radius = 1,
                   color = "firebrick", fillOpacity = 0.001, group = 'PickUp',
                   clusterOptions = markerClusterOptions()) %>%
  addCircleMarkers(~Dropoff_longitude, ~Dropoff_latitude, radius = 1,
                   color = "steelblue", fillOpacity = 0.001, group = 'DropOff',
                   clusterOptions = markerClusterOptions()) %>%
  addLayersControl(
    baseGroups = c("Color map", "Light map", "white map"),
    overlayGroups = c("PickUp", "DropOff"),
    options = layersControlOptions(collapsed = T)
  ) %>% 
  addSearchOSM()

# %>% 
  # addReverseSearchOSM()

5.0.1 Creating, and Converting

I converted pick_datetime to time data and created variables such as pickup_hour, pickup_weekday, pickup_weekend, etc.

pickup_hour has value from 1 to 24, denoting 24 hours a day.

pickup_weekday has value from Monday to Friday and is categorized as factor.

pickup_weekend has value Weekday and Weekend.

data <- data %>% mutate(lpep_pickup_datetime = ymd_hms(lpep_pickup_datetime),
                pickup_hour=hour(lpep_pickup_datetime)+1,
                pickup_weekday=as.factor(weekdays(lpep_pickup_datetime)),
                pickup_weekend=if_else(pickup_weekday=='Saturday'|pickup_weekday=='Sunday','Weekend','Weekday'),
                Lpep_dropoff_datetime = ymd_hms( Lpep_dropoff_datetime),
                dropoff_hour=hour(lpep_pickup_datetime)+1,
                dropoff_weekday=as.factor(weekdays(lpep_pickup_datetime)))

6 Step 4 Perform Exploratory Data Analysis (EDA)

6.1 The Distribution of Distance

6.1.1 Raw

First of all, let’s have an overview on the distance data. As the graph below, the data is skewed by the distance outliers.

data %>%
  ggplot(aes(Trip_distance)) +
  geom_histogram(fill = "firebrick", bins = 150) +
  theme_economist() +
  labs(title='Green Taxi Case Study',subtitle='Initial Historgram of Distance',caption="source: NYC Green Taxi Data",
       x="Travel Distance")

6.1.2 Processed

After processing the data, I got an almost normal distribution with some more few distance trips.

data %>%
  ggplot(aes(Trip_distance)) +
  geom_histogram(fill = "firebrick", bins = 150) +
  scale_x_log10() +
  theme_economist() +
  labs(title='Green Taxi Case Study',subtitle='Processed Historgram of Distance',caption="source: NYC Green Taxi Data",
       x="Travel Distance (log)")

6.2 Visualize Trip Distance by Time of Day

6.2.1 Median Trip Distance by Time of Day for both Weekday and Weekend

From an initial look at the trip distance by time of day graph, the median trip distance is longer at 6 AM and 7 AM than any other time during the day. The difference between the two hours and the rest of the day is quite significant.

ggplotly(data %>% group_by(pickup_hour) %>% summarise(avg_trip_distance=median(Trip_distance)) %>% 
  ggplot(aes(pickup_hour, avg_trip_distance, fill = avg_trip_distance)) + geom_col() +
  geom_label(aes(label=round(avg_trip_distance,1)), size=3.5, alpha=.7) +
  # coord_flip() +
  scale_x_continuous(breaks=seq(1,24,1)) +
  theme_economist() +
  theme(legend.position = 'none') +
  labs(title='Median Trip Distance',subtitle='All Data Included (Weekday and Weekdend)',caption="source: Green Taxi Data",
       y="Average Trip Distance", x="Time of Day (Pickup)"))

6.2.2 Median Trip Distance by Time of Day for Weekday

Same as the observation from the full dataset, 6 AM and 7 AM are still the two hour that people take taxi for the longest median trip distance. However, in this case, the median trip distance is slightly longer at 6 AM. At that day of the weekday, my assumption is most of the taxi usage is caused by the commute to go to work. Since people usually start work at 8 AM, one of the potential explanation is that people who live further from their work place tend to leave a bit earlier than the people who live close by. Moreover, it might also cause by that more people take taxi to work at 6 AM compared to 7 AM. Therefore, the median trip distance is slightly longer at 6 AM than 7 AM.

ggplotly(data %>% filter(pickup_weekend=='Weekday') %>% group_by(pickup_hour) %>% summarise(avg_trip_distance=median(Trip_distance)) %>% 
  ggplot(aes(pickup_hour, avg_trip_distance, fill = avg_trip_distance)) + geom_col() +
  geom_label(aes(label=round(avg_trip_distance,1)), size=3.5, alpha=.7) +
  # coord_flip() +
  scale_x_continuous(breaks=seq(1,24,1)) +
  theme_economist() +
  theme(legend.position = 'none') +
  labs(title='Median Trip Distance',subtitle='Weekday',caption="source: Green Taxi Data",
       y="Average Trip Distance", x="Time of Day (Pickup)"))

6.2.3 Median Trip Distance by Time of Day for Weekend

For the weekend, 7 AM and 8 AM turned out to be the two hours that have the longest trip distance. Additionally, the difference between the two hours and the rest of the day is not as significant as for the weekdays. My speculation is that people tend to sleep in, since they don’t have to go to work in the early morning. Therefore, the longest median trip distance is postponed one hour as well as the difference is not as significant as the weekdays.

ggplotly(data %>% filter(pickup_weekend=='Weekend') %>% group_by(pickup_hour) %>% summarise(avg_trip_distance=median(Trip_distance)) %>% 
  ggplot(aes(pickup_hour, avg_trip_distance, fill = avg_trip_distance)) + geom_col() +
  geom_label(aes(label=round(avg_trip_distance,1)), size=3.5, alpha=.7) +
  # coord_flip() +
  scale_x_continuous(breaks=seq(1,24,1)) +
  theme_economist() +
  theme(legend.position = 'none') +
  labs(title='Median Trip Distance',subtitle='Weekend',caption="source: Green Taxi Data",
       y="Average Trip Distance", x="Time of Day (Pickup)"))

6.2.4 Combined Weekday and Weekend Median Trip Distance

ggplotly(data %>%
  group_by(pickup_hour, pickup_weekend) %>%
  summarise(avg_trip_distance=median(Trip_distance)) %>%
  ggplot(aes(pickup_hour, avg_trip_distance, color = pickup_weekend)) +
  geom_smooth(method = "loess", span = 1/2, se=F) +
  geom_point(size = 4) +
  labs(x = "Time of Day (Pickup)", y = "Average Trip Distance") +
  scale_x_continuous(breaks=seq(1,24,1)) +
  theme_economist() +
  scale_color_discrete("Weekday vs. Weekend"))

6.3 Top 5 Pick Up Locations on Weekdays and Weekend

6.3.1 Basic Set up

Rather than directing calculating the top 5 pick up locations, I preprocessed the data a little bit. The logic is that if I directly use the longitude and latitude data, the same pick up spot with slightly different coordinates would be treated as different pick up locations and that would definitely deviate from the actual result. Therefore, I round the longitude and latitude to the 3 decimals from which the coordinates with slightly different number would be treated as one spot. I also used a green cab icon to denote the pick up spots. The graph is interactive and can be zoom in and out. If you place the mouse on the green cab icon, it would show how many pick ups at the location based on the dataset.

round_num <- 3

Weekday_Top5 <- data %>% filter(pickup_weekend=='Weekday') %>% 
  group_by(lng=round(Pickup_longitude,round_num),lat=round(Pickup_latitude,round_num)) %>% 
  count() %>% arrange(desc(n)) %>% head(5)


Weekend_Top5 <- data %>% filter(pickup_weekend=='Weekend') %>% 
  group_by(lng=round(Pickup_longitude,round_num),lat=round(Pickup_latitude,round_num)) %>% 
  count() %>% arrange(desc(n)) %>% head(5)

greentaxi <- makeIcon(
  iconUrl = "https://i.imgur.com/6rw618Q.png",
  iconWidth = 38, iconHeight = 35,
  iconAnchorX = 19, iconAnchorY = 39
)

6.3.2 Weekday Top 5 Pick up locations

There are the top 5 pick up locations during weekdays.

71st Ave and Queens Blvd. (13,987 pick ups in Feb 2016)
E 125th St and Park Ave. (13,235 pick ups in Feb 2016)
Broad Way and Roosevelt Ave. (12,566 pick ups in Feb 2016)
Madison Ave and E 101st St. (7,198 pick ups in Feb 2016)
Bedford Ave and N 7th St. (6,105 pick ups in Feb 2016)

Weekday_Top5 %>%
  leaflet() %>% 
  addProviderTiles(providers$Esri.NatGeoWorldMap, group = "color map") %>%
  addProviderTiles(providers$CartoDB.Positron, group = "Light map") %>%
  addProviderTiles(providers$Stamen.Toner, group = "white map") %>% 
  addScaleBar() %>%
  addProviderTiles(providers$Esri.NatGeoWorldMap) %>%
  addCircleMarkers(~lng, ~lat, radius = 1,
                   color = "firebrick", fillOpacity = 0.001) %>%
  addMarkers(~lng, ~lat, icon = greentaxi, label = ~as.character(paste("Number of Pick ups:",Weekday_Top5$n))) %>%
  addLayersControl(
    baseGroups = c("Color map", "Light map","white map"),
    options = layersControlOptions(collapsed = FALSE)
  )

6.3.3 Weekend Top 5 Pick locations

Broad Way and Roosevelt Ave. (6,465 pick ups in Feb 2016)
71st Ave and Queens Blvd. (5,249 pick ups in Feb 2016)
E 125th St and Park Ave. (4,788 pick ups in Feb 2016)
Wythe Ave and N 11th St. (4,507 pick ups in Feb 2016)
Bedford Ave and N 7th St. (2,768 pick ups in Feb 2016)

Weekend_Top5 %>%
  leaflet() %>% 
  addProviderTiles(providers$Esri.NatGeoWorldMap, group = "color map") %>% 
  addProviderTiles(providers$CartoDB.Positron, group = "Light map") %>%
  addProviderTiles(providers$Stamen.Toner, group = "white map") %>% 
  addScaleBar() %>% 
  addProviderTiles(providers$Esri.NatGeoWorldMap) %>%
  addCircleMarkers(~lng, ~lat, radius = 1,
                   color = "firebrick", fillOpacity = 0.001) %>%
  addMarkers(~lng, ~lat, icon = greentaxi, label = ~as.character(paste("Number of Pick ups:",Weekend_Top5$n))) %>%
  addLayersControl(
    baseGroups = c("Color map", "Light map", "white map"),
    options = layersControlOptions(collapsed = FALSE)
  )

7 Step 5 Modelling

To recommend a pick up spot, I leverage the power of unsupervised learning by using a simple Kmeans model to group the pick up spots into 50 groups. Each of the pick up locations

7.1 Recommend to Find Pick Up Spot

7.1.1 Preprocess the data

According to the dictionary, there are two types of trip - street-hail and dispatch. For this question, we should only focus on the street-hail and exclude the dispatches.

data_coord <- data %>% filter(Trip_type==1) %>% select(Pickup_longitude, Pickup_latitude)
data1 <- data %>% filter(Trip_type==1)

I used kmeans model to classify the coordinates into 50 groups.

set.seed(0)
data_kmeans <- data_coord %>% kmeans(50,nstart=20)

data1$cluster <- data_kmeans$cluster

pal <- colorNumeric(
  palette = "Blues",
  domain = data$cluster)

I sampled 10,000 observations and put them on the map.

So far, I answered the first three questions. To answer the last question, I would leverage the power of shiny app and make an interactive graph with the input option for longitude and latitude. Then, I would use the kmeans model to predict which cluster the input location would be in and focus on the pickup points within that cluster. Final, I would pick top 20 pick up points to recommend and the coordinate of the closest pick up spot among the Top 20.

Please found these result from the Shiny app below.

set.seed(0)
data1 %>% sample_n(size=10000) %>% 
  leaflet() %>% 
  addProviderTiles(providers$Esri.NatGeoWorldMap, group = "color map") %>%
  addProviderTiles(providers$CartoDB.Positron, group = "Light map") %>%
  addProviderTiles(providers$Stamen.Toner, group = "white map") %>% 
  addScaleBar() %>%
  addCircleMarkers(~Pickup_longitude, ~Pickup_latitude, radius = 1,
                   color = ~pal(cluster), fillOpacity = 0.001) %>%
  addLayersControl(
    baseGroups = c("Color map", "Light map", "white map"),
    options = layersControlOptions(collapsed = FALSE)
  )

8 Shiny App

8.1 The final mission to Answer Question 4

I set up the input options for longitude and latitude with sliders. Once that data is input, the program would make a prediction, for which cluster it belongs to, based on the input and kmeans model. Then, it would give 20 recommended pick up spots within the cluster as well as the closest pick up spot among the Top 20.

Please be awared that the graph below is just the screenshot of the actual interactive graph, since Shiny app is not available on Kaggle at the moment.

Taxi Interac Map + Unsupervised Learning

Chengran Ouyang

2018-06-26