NYC Car Accident Analysis

Chengran Ouyang

2019-04-24

Step 1: Define the Problem

Requirement:

  1. Download the data, and load it in Pycharm and provide initial overview information.

  2. Visualize the location of the car accidents.

  3. Find out the insight from the dataset (i.e. Location/ Time of Day).

  4. Take Weather Data into Consideration.

This time, I would leverage the power of R and Python to perform the analysis and present the result via both Rmarkdown (R) and jupiter notebook (python). The analysis would be based on a standard data science framework and answer the questions above; however, I would extend the scope of the analysis to identify any unique insight as well as provide detailed explanation of my code.

Step 2: Gather the Data

The dataset is acquired from Kaggle open datasource.The Data is available via NYC hourly car accidents 2013-2016.

Step 3: Preprocess the Data

3.1 Dependencies

3.1.1 Required Libraries

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
sns.set()
from IPython.display import display
from IPython.display import HTML
import matplotlib.pyplot as plt
import folium
from folium.plugins import MarkerCluster
from folium.plugins import FastMarkerCluster
from branca.colormap import LinearColormap
import os

3.1.2 Load Datasets

In [2]:
colli = pd.read_csv("input/NYPD_Motor_Vehicle_Collisions.csv")
C:\Users\Owen\AppData\Local\Programs\Python\Python37-32\lib\site-packages\IPython\core\interactiveshell.py:2705: DtypeWarning: Columns (3) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

3.2 First Glimpse

3.2.1 First and Last 5 Rows

In [3]:
display(colli.head(5))
display(colli.tail(5))
DATE TIME BOROUGH ZIP CODE LATITUDE LONGITUDE LOCATION ON STREET NAME CROSS STREET NAME OFF STREET NAME ... CONTRIBUTING FACTOR VEHICLE 2 CONTRIBUTING FACTOR VEHICLE 3 CONTRIBUTING FACTOR VEHICLE 4 CONTRIBUTING FACTOR VEHICLE 5 UNIQUE KEY VEHICLE TYPE CODE 1 VEHICLE TYPE CODE 2 VEHICLE TYPE CODE 3 VEHICLE TYPE CODE 4 VEHICLE TYPE CODE 5
0 08/04/2017 0:00 QUEENS 11436 40.666885 -73.790405 (40.666885, -73.790405) NORTH CONDUIT AVENUE 149 STREET NaN ... Unspecified NaN NaN NaN 3725017 PASSENGER VEHICLE PASSENGER VEHICLE NaN NaN NaN
1 08/04/2017 0:00 NaN NaN 40.719950 -74.008590 (40.71995, -74.00859) HUDSON STREET NaN NaN ... Unsafe Lane Changing NaN NaN NaN 3725047 PICK-UP TRUCK SPORT UTILITY / STATION WAGON NaN NaN NaN
2 08/04/2017 0:00 NaN NaN 40.718666 -73.963500 (40.718666, -73.9635) KENT AVENUE NaN NaN ... Following Too Closely NaN NaN NaN 3725533 SPORT UTILITY / STATION WAGON PASSENGER VEHICLE NaN NaN NaN
3 08/04/2017 0:00 NaN NaN 40.754677 -73.975815 (40.754677, -73.975815) PARK AVENUE NaN NaN ... Unspecified NaN NaN NaN 3724870 SPORT UTILITY / STATION WAGON TAXI NaN NaN NaN
4 08/04/2017 0:00 NaN NaN 40.818314 -73.836000 (40.818314, -73.836) NaN NaN NaN ... Unspecified NaN NaN NaN 3725662 PASSENGER VEHICLE PASSENGER VEHICLE NaN NaN NaN

5 rows × 29 columns

DATE TIME BOROUGH ZIP CODE LATITUDE LONGITUDE LOCATION ON STREET NAME CROSS STREET NAME OFF STREET NAME ... CONTRIBUTING FACTOR VEHICLE 2 CONTRIBUTING FACTOR VEHICLE 3 CONTRIBUTING FACTOR VEHICLE 4 CONTRIBUTING FACTOR VEHICLE 5 UNIQUE KEY VEHICLE TYPE CODE 1 VEHICLE TYPE CODE 2 VEHICLE TYPE CODE 3 VEHICLE TYPE CODE 4 VEHICLE TYPE CODE 5
1089260 07/01/2012 9:45 NaN NaN NaN NaN NaN NaN NaN NaN ... Unspecified NaN NaN NaN 2896726 PASSENGER VEHICLE UNKNOWN NaN NaN NaN
1089261 07/01/2012 9:50 MANHATTAN 10019 40.766779 -73.996762 (40.7667789, -73.9967618) WEST SIDE HIGHWAY WEST 50 STREET NaN ... Unspecified NaN NaN NaN 37635 VAN SPORT UTILITY / STATION WAGON NaN NaN NaN
1089262 07/01/2012 9:50 QUEENS 11001 40.736245 -73.702966 (40.7362448, -73.7029656) 265 STREET 85 AVENUE NaN ... Unspecified NaN NaN NaN 219395 PASSENGER VEHICLE SPORT UTILITY / STATION WAGON NaN NaN NaN
1089263 07/01/2012 9:57 MANHATTAN 10065 40.765242 -73.957868 (40.7652424, -73.9578679) 1 AVENUE EAST 68 STREET NaN ... Other Vehicular NaN NaN NaN 44907 PASSENGER VEHICLE TAXI NaN NaN NaN
1089264 07/01/2012 9:59 BRONX 10452 40.835397 -73.920305 (40.835397, -73.920305) EAST 167 STREET GERARD AVENUE NaN ... Unspecified NaN NaN NaN 85154 PASSENGER VEHICLE SPORT UTILITY / STATION WAGON NaN NaN NaN

5 rows × 29 columns

3.2.2 Structure

In [4]:
print('Number of rows: ', colli.shape[0])
print('Number of columns: ', colli.shape[1])
Number of rows:  1089265
Number of columns:  29

3.2.3 Data Range

In [5]:
colli.describe()
Out[5]:
LATITUDE LONGITUDE NUMBER OF PERSONS INJURED NUMBER OF PERSONS KILLED NUMBER OF PEDESTRIANS INJURED NUMBER OF PEDESTRIANS KILLED NUMBER OF CYCLIST INJURED NUMBER OF CYCLIST KILLED NUMBER OF MOTORIST INJURED NUMBER OF MOTORIST KILLED UNIQUE KEY
count 882199.000000 882199.000000 1.089265e+06 1.089265e+06 1.089265e+06 1.089265e+06 1.089265e+06 1.089265e+06 1.089265e+06 1.089265e+06 1.089265e+06
mean 40.721128 -73.924296 2.555733e-01 1.198056e-03 5.247300e-02 6.646684e-04 2.047068e-02 7.803427e-05 1.861668e-01 4.535168e-04 2.200695e+06
std 0.301039 0.988562 6.565478e-01 3.632685e-02 2.408481e-01 2.602076e-02 1.449423e-01 8.833360e-03 6.263905e-01 2.366021e-02 1.519165e+06
min 0.000000 -201.359990 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 2.200000e+01
25% 40.668960 -73.979025 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 2.742420e+05
50% 40.723305 -73.933153 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 3.180753e+06
75% 40.766351 -73.869289 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 3.453075e+06
max 41.126150 0.000000 4.300000e+01 5.000000e+00 2.800000e+01 2.000000e+00 4.000000e+00 1.000000e+00 4.300000e+01 5.000000e+00 3.726256e+06

3.2.4 Data Type

In [6]:
colli.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1089265 entries, 0 to 1089264
Data columns (total 29 columns):
DATE                             1089265 non-null object
TIME                             1089265 non-null object
BOROUGH                          792241 non-null object
ZIP CODE                         792171 non-null object
LATITUDE                         882199 non-null float64
LONGITUDE                        882199 non-null float64
LOCATION                         882199 non-null object
ON STREET NAME                   892128 non-null object
CROSS STREET NAME                854727 non-null object
OFF STREET NAME                  160587 non-null object
NUMBER OF PERSONS INJURED        1089265 non-null int64
NUMBER OF PERSONS KILLED         1089265 non-null int64
NUMBER OF PEDESTRIANS INJURED    1089265 non-null int64
NUMBER OF PEDESTRIANS KILLED     1089265 non-null int64
NUMBER OF CYCLIST INJURED        1089265 non-null int64
NUMBER OF CYCLIST KILLED         1089265 non-null int64
NUMBER OF MOTORIST INJURED       1089265 non-null int64
NUMBER OF MOTORIST KILLED        1089265 non-null int64
CONTRIBUTING FACTOR VEHICLE 1    1084674 non-null object
CONTRIBUTING FACTOR VEHICLE 2    947827 non-null object
CONTRIBUTING FACTOR VEHICLE 3    71026 non-null object
CONTRIBUTING FACTOR VEHICLE 4    15318 non-null object
CONTRIBUTING FACTOR VEHICLE 5    3759 non-null object
UNIQUE KEY                       1089265 non-null int64
VEHICLE TYPE CODE 1              1081881 non-null object
VEHICLE TYPE CODE 2              922402 non-null object
VEHICLE TYPE CODE 3              69043 non-null object
VEHICLE TYPE CODE 4              14800 non-null object
VEHICLE TYPE CODE 5              3650 non-null object
dtypes: float64(2), int64(9), object(18)
memory usage: 166.2+ MB

3.3 Data Cleaning: Correcting, Completing, Creating, and Converting

As the data range section shows, some data entries for latitude and longitude are out of the scale and need to be corrected or removed.

In [7]:
rev_colli = colli[(colli['LATITUDE']>0)&(colli['LONGITUDE']>-75)&(colli['LONGITUDE']<-72)]

By removing the incorrect lat and lon information, I am examing the data again, which shows a more accurate scale.

In [8]:
display(rev_colli.describe())
LATITUDE LONGITUDE NUMBER OF PERSONS INJURED NUMBER OF PERSONS KILLED NUMBER OF PEDESTRIANS INJURED NUMBER OF PEDESTRIANS KILLED NUMBER OF CYCLIST INJURED NUMBER OF CYCLIST KILLED NUMBER OF MOTORIST INJURED NUMBER OF MOTORIST KILLED UNIQUE KEY
count 882110.000000 882110.000000 882110.000000 882110.000000 882110.000000 882110.000000 882110.000000 882110.000000 882110.000000 882110.000000 8.821100e+05
mean 40.723205 -73.923021 0.255677 0.001156 0.055269 0.000641 0.021104 0.000077 0.181391 0.000438 2.120917e+06
std 0.077699 0.086314 0.653579 0.035836 0.245425 0.025479 0.145954 0.008780 0.617081 0.023517 1.540086e+06
min 40.498949 -74.742000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 2.200000e+01
25% 40.668964 -73.979025 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 2.508022e+05
50% 40.723305 -73.933153 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 3.159300e+06
75% 40.766351 -73.869294 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 3.421072e+06
max 41.126150 -73.700584 43.000000 5.000000 28.000000 2.000000 4.000000 1.000000 43.000000 5.000000 3.726256e+06

Step 4 Interactive Map

In [9]:
rev1_colli = rev_colli.sample(2000)

lat = rev1_colli['LATITUDE'].tolist()
lon = rev1_colli['LONGITUDE'].tolist()
locations = list(zip(lat, lon))

map1 = folium.Map(location=[40.723205, -73.923021], zoom_start=11.5)
#FastMarkerCluster(data=locations).add_to(map1)
#                 icon = folium.Icon(icon='car'))
marker_cluster = MarkerCluster().add_to(map1)

for point in range(len(locations)):
    folium.Marker(
        location=locations[point],
        #popup='Add popup text here.',
        icon=folium.Icon(color='red', icon='remove-sign'),
    ).add_to(marker_cluster)

map1.save("map1.html")
map1
Out[9]:

Step 5 Sever Car Accidents

In [10]:
display(rev_colli[rev_colli['NUMBER OF MOTORIST INJURED']==43].iloc[:,0:10])
display(rev_colli[rev_colli['NUMBER OF MOTORIST INJURED']==43].iloc[:,10:20])
display(rev_colli[rev_colli['NUMBER OF MOTORIST INJURED']==43].iloc[:,20:29])
DATE TIME BOROUGH ZIP CODE LATITUDE LONGITUDE LOCATION ON STREET NAME CROSS STREET NAME OFF STREET NAME
850220 09/09/2013 16:38 BROOKLYN 11213 40.679035 -73.938508 (40.6790346, -73.9385077) HERKIMER STREET ALBANY AVENUE NaN
NUMBER OF PERSONS INJURED NUMBER OF PERSONS KILLED NUMBER OF PEDESTRIANS INJURED NUMBER OF PEDESTRIANS KILLED NUMBER OF CYCLIST INJURED NUMBER OF CYCLIST KILLED NUMBER OF MOTORIST INJURED NUMBER OF MOTORIST KILLED CONTRIBUTING FACTOR VEHICLE 1 CONTRIBUTING FACTOR VEHICLE 2
850220 43 0 0 0 0 0 43 0 Unspecified Unspecified
CONTRIBUTING FACTOR VEHICLE 3 CONTRIBUTING FACTOR VEHICLE 4 CONTRIBUTING FACTOR VEHICLE 5 UNIQUE KEY VEHICLE TYPE CODE 1 VEHICLE TYPE CODE 2 VEHICLE TYPE CODE 3 VEHICLE TYPE CODE 4 VEHICLE TYPE CODE 5
850220 NaN NaN NaN 183929 PASSENGER VEHICLE BUS NaN NaN NaN
In [11]:
display(rev_colli[rev_colli['NUMBER OF PERSONS KILLED']==5].iloc[:,0:10])
display(rev_colli[rev_colli['NUMBER OF PERSONS KILLED']==5].iloc[:,10:20])
display(rev_colli[rev_colli['NUMBER OF PERSONS KILLED']==5].iloc[:,20:29])
DATE TIME BOROUGH ZIP CODE LATITUDE LONGITUDE LOCATION ON STREET NAME CROSS STREET NAME OFF STREET NAME
1077587 07/22/2012 3:20 QUEENS 11418 40.697204 -73.813334 (40.6972045, -73.8133341) 94 AVENUE VAN WYCK EXPRESSWAY NaN
NUMBER OF PERSONS INJURED NUMBER OF PERSONS KILLED NUMBER OF PEDESTRIANS INJURED NUMBER OF PEDESTRIANS KILLED NUMBER OF CYCLIST INJURED NUMBER OF CYCLIST KILLED NUMBER OF MOTORIST INJURED NUMBER OF MOTORIST KILLED CONTRIBUTING FACTOR VEHICLE 1 CONTRIBUTING FACTOR VEHICLE 2
1077587 3 5 0 0 0 0 3 5 Traffic Control Disregarded NaN
CONTRIBUTING FACTOR VEHICLE 3 CONTRIBUTING FACTOR VEHICLE 4 CONTRIBUTING FACTOR VEHICLE 5 UNIQUE KEY VEHICLE TYPE CODE 1 VEHICLE TYPE CODE 2 VEHICLE TYPE CODE 3 VEHICLE TYPE CODE 4 VEHICLE TYPE CODE 5
1077587 NaN NaN NaN 208863 SPORT UTILITY / STATION WAGON NaN NaN NaN NaN
In [12]:
display(rev_colli[rev_colli['NUMBER OF PEDESTRIANS INJURED']==28].iloc[:,0:10])
display(rev_colli[rev_colli['NUMBER OF PEDESTRIANS INJURED']==28].iloc[:,10:20])
display(rev_colli[rev_colli['NUMBER OF PEDESTRIANS INJURED']==28].iloc[:,20:29])
DATE TIME BOROUGH ZIP CODE LATITUDE LONGITUDE LOCATION ON STREET NAME CROSS STREET NAME OFF STREET NAME
51357 05/18/2017 11:54 MANHATTAN 10036 40.756035 -73.98695 (40.756035, -73.98695) 7 AVENUE WEST 42 STREET NaN
NUMBER OF PERSONS INJURED NUMBER OF PERSONS KILLED NUMBER OF PEDESTRIANS INJURED NUMBER OF PEDESTRIANS KILLED NUMBER OF CYCLIST INJURED NUMBER OF CYCLIST KILLED NUMBER OF MOTORIST INJURED NUMBER OF MOTORIST KILLED CONTRIBUTING FACTOR VEHICLE 1 CONTRIBUTING FACTOR VEHICLE 2
51357 28 1 28 1 0 0 0 0 Drugs (Illegal) NaN
CONTRIBUTING FACTOR VEHICLE 3 CONTRIBUTING FACTOR VEHICLE 4 CONTRIBUTING FACTOR VEHICLE 5 UNIQUE KEY VEHICLE TYPE CODE 1 VEHICLE TYPE CODE 2 VEHICLE TYPE CODE 3 VEHICLE TYPE CODE 4 VEHICLE TYPE CODE 5
51357 NaN NaN NaN 3673559 PASSENGER VEHICLE NaN NaN NaN NaN
In [13]:
pd.read_csv('input/weather_data_nyc_centralpark_2016.csv')
Out[13]:
date maximum temerature minimum temperature average temperature precipitation snow fall snow depth
0 1-1-2016 42 34 38.0 0.00 0.0 0
1 2-1-2016 40 32 36.0 0.00 0.0 0
2 3-1-2016 45 35 40.0 0.00 0.0 0
3 4-1-2016 36 14 25.0 0.00 0.0 0
4 5-1-2016 29 11 20.0 0.00 0.0 0
5 6-1-2016 41 25 33.0 0.00 0.0 0
6 7-1-2016 46 31 38.5 0.00 0.0 0
7 8-1-2016 46 31 38.5 0.00 0.0 0
8 9-1-2016 47 40 43.5 T 0.0 0
9 10-1-2016 59 40 49.5 1.80 0.0 0
10 11-1-2016 40 26 33.0 0.00 0.0 0
11 12-1-2016 44 25 34.5 0.00 T 0
12 13-1-2016 30 22 26.0 0.00 0.0 0
13 14-1-2016 38 22 30.0 0.00 T 0
14 15-1-2016 51 34 42.5 T 0.0 0
15 16-1-2016 52 42 47.0 0.24 0.0 0
16 17-1-2016 42 30 36.0 0.05 0.4 0
17 18-1-2016 31 18 24.5 T T T
18 19-1-2016 28 16 22.0 0.00 0.0 T
19 20-1-2016 37 27 32.0 0.00 0.0 T
20 21-1-2016 36 26 31.0 0.00 0.0 0
21 22-1-2016 30 21 25.5 0.01 0.2 0
22 23-1-2016 27 24 25.5 2.31 27.3 6
23 24-1-2016 35 20 27.5 T T 22
24 25-1-2016 39 28 33.5 0.00 0.0 19
25 26-1-2016 48 38 43.0 0.00 0.0 17
26 27-1-2016 47 34 40.5 T 0.0 9
27 28-1-2016 42 32 37.0 0.00 0.0 6
28 29-1-2016 41 30 35.5 0.00 0.0 6
29 30-1-2016 39 28 33.5 0.00 0.0 6
... ... ... ... ... ... ... ...
336 2-12-2016 51 40 45.5 0 0 0
337 3-12-2016 47 41 44.0 T 0 0
338 4-12-2016 47 39 43.0 0 0 0
339 5-12-2016 49 38 43.5 0.19 0 0
340 6-12-2016 46 37 41.5 0.35 0 0
341 7-12-2016 46 40 43.0 0.09 0 0
342 8-12-2016 45 35 40.0 0 0 0
343 9-12-2016 39 29 34.0 0 0 0
344 10-12-2016 35 28 31.5 0 0 0
345 11-12-2016 35 28 31.5 0.03 0.4 0
346 12-12-2016 46 34 40.0 0.5 0 0
347 13-12-2016 43 35 39.0 0 0 0
348 14-12-2016 42 34 38.0 0 0 0
349 15-12-2016 34 19 26.5 0 T 0
350 16-12-2016 27 17 22.0 0 0 0
351 17-12-2016 39 24 31.5 0.73 2.8 2
352 18-12-2016 58 31 44.5 0.04 0 1
353 19-12-2016 31 23 27.0 0 0 0
354 20-12-2016 33 20 26.5 0 0 0
355 21-12-2016 40 30 35.0 0 0 0
356 22-12-2016 49 37 43.0 0 0 0
357 23-12-2016 47 38 42.5 0 0 0
358 24-12-2016 47 38 42.5 0.47 0 0
359 25-12-2016 50 36 43.0 0 0 0
360 26-12-2016 50 33 41.5 0.02 0 0
361 27-12-2016 60 40 50.0 0 0 0
362 28-12-2016 40 34 37.0 0 0 0
363 29-12-2016 46 33 39.5 0.39 0 0
364 30-12-2016 40 33 36.5 0.01 T 0
365 31-12-2016 44 31 37.5 0 0 0

366 rows × 7 columns