import folium
import pandas
import numpy as np
from folium.plugins import HeatMap
# Pull data from online csv file
arrest_table = pandas.read_csv("http://www.hcbravo.org/IntroDataSci/misc/BPD_Arrests.csv")
# tidy up data to make EDA and visualization easier
# (race and sex appear swapped in the source csv, so swap them back)
arrest_table["race_new"] = arrest_table["sex"]
arrest_table["sex_new"] = arrest_table["race"]
arrest_table["race"] = arrest_table["race_new"]
arrest_table["sex"] = arrest_table["sex_new"]
arrest_table = arrest_table.drop(columns=['race_new', 'sex_new'])
arrest_table = arrest_table[pandas.notnull(arrest_table["Location 1"])]
# split the "(lat, long)" strings into numeric lat/long columns
latlong = arrest_table["Location 1"].str.split(",", expand=True)
arrest_table["lat"] = latlong[0].str.replace("(", "", regex=False).astype(float)
arrest_table["long"] = latlong[1].str.replace(")", "", regex=False).astype(float)
arrest_table.head()
# iterate over the arrest table to count arrests per district
m = dict()
for i in arrest_table.iterrows():
    element = i[1]
    if element[11] not in m:    # column 11 is the district
        m[element[11]] = 1
    else:
        m[element[11]] += 1
# take a random sample of 500 crimes for the heat map
random_sample = arrest_table.sample(n=500)
# functionally scale the district counts down to the sample size
sample_size = 150
m = {k: v//1000 for k, v in m.items()}  # truncate each count by three places
count = sum(m.values())
m = {k: int((v/count)*sample_size) for k, v in m.items()}  # rescale so counts sum to ~sample_size
# create 9 different dataframes for all the different districts
districts = []
for k in m:
    d = arrest_table.loc[arrest_table['district'] == k]
    districts.append(d)
# this is one of the 9 dataframes for the southern district
districts[0].head()
# populate the 9 samples, drawing each district's rows from its own dataframe
sample = []
for d in range(len(districts)-1):
    temp = districts[d].head(1)
    name = temp['district'].values[0]
    num = m[name]
    sample.append(districts[d].sample(n=num))
# gather crime statistics on the different districts to display in our markers
crime_stats = {}
for d in range(len(districts)):
    for i in districts[d].iterrows():
        elements = i[1]
        n = elements[11]      # district
        crime = elements[7]   # offense
        if n not in crime_stats:
            crime_stats[n] = {}
        if crime not in crime_stats[n]:
            crime_stats[n][crime] = 0
        crime_stats[n][crime] += 1
# Select the top 5 offenses per district and create an HTML representation of our findings
top_five = {}
for d in crime_stats.keys():
    curr_crime = crime_stats[d]
    total_dist = sum(curr_crime.values())
    top_five[d] = [
        k + ': ' + str(curr_crime[k]) + " <b>(" + str(int(curr_crime[k]/total_dist*100)) + "%)</b>"
        for k in sorted(curr_crime, key=curr_crime.get, reverse=True)
    ][:5]
map_osm = folium.Map(location=[39.29, -76.61], zoom_start=12, tiles='stamentoner')
sex_legend = {'M': 'black','F':"#EBD7FA"}
# Gender overlay
for s in sample:
    for i in s.iterrows():
        element = i[1]
        coords = [element[15], element[16]]
        age = element[1]
        sex = element[2]
        race = element[3]
        arrest_date = element[4]
        description = "<b>Age:</b> %s<br><b>Sex:</b> %s<br><b>Race:</b> %s<br><b>Arrest Date:</b> %s" % (age, sex, race, arrest_date)
        folium.CircleMarker(
            location=coords,
            radius=8,
            popup=description,
            color='black',
            fill=True,
            fill_color=sex_legend[sex],
            fill_opacity=.65,
            weight=.9).add_to(map_osm)
# District overlay
colors = {0:'blue', 1:'red', 2:'green', 3:'purple', 4:'darkpurple', 5:'black', 6:'lightgray', 7:'orange', 8:'lightgreen'}
for d in range(len(districts)-1):
    temp = districts[d].head(1)
    name = temp['district'].values[0]
    description = "<b>" + name + "</b><br>" + "<br>".join(top_five[name])
    coords = [np.average(districts[d]['lat']), np.average(districts[d]['long'])]
    folium.Marker(coords,
                  popup=description,
                  icon=folium.Icon(color=colors[d])).add_to(map_osm)
# Heatmap overlay
data = []
for i in random_sample.iterrows():
    element = i[1]
    coords = [element[15], element[16]]
    data.append(coords)
HeatMap(data).add_to(map_osm)
map_osm
I realized very early on, through brute force and multiple Piazza posts, that I could not blindly use all of the data made available to us in the arrest_table dataframe; it was far too much for my computer to process. To fix this bottleneck I employed multiple techniques that we discussed in Professor Amol's and Professor John's lectures. The solution I adopted pulls random samples of crime data from the 9 different districts found in Baltimore City. I amortized the cost of the computations by iterating over the whole dataset, counting the number of occurrences of each of the 9 districts, and then scaling each count down by a factor of 1000 using integer division, which truncates the count by 3 places (in essence, a count of 7615 yields $7615//1000 = 7$). Now that I had more manageable numbers to work with, I scaled them again by my sample size of 150, so that the counts summed over all the districts come to 150. After this I made 9 distinct dataframes, one per district, and took n random instances of each district from the original dataframe, where n is the count I made after scaling to my sample size. This leaves us with a much more usable amount of data, which makes visualization far easier.
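The two-step scaling described above can be sketched on a toy count dictionary (the district names and counts here are made up for illustration):

```python
# Hypothetical per-district arrest counts.
counts = {"Southern": 7615, "Northern": 5240, "Eastern": 2145}

# Step 1: integer-divide by 1000 to truncate each count by three places.
scaled = {k: v // 1000 for k, v in counts.items()}  # {'Southern': 7, 'Northern': 5, 'Eastern': 2}

# Step 2: rescale so the truncated counts sum to (roughly) the sample size.
# Note int() truncates, so the final sum can fall slightly short of sample_size.
sample_size = 150
total = sum(scaled.values())
scaled = {k: int((v / total) * sample_size) for k, v in scaled.items()}

print(scaled)
print(sum(scaled.values()))
```

Because of the truncation in step 2, the per-district counts here come out to 75, 53, and 21, summing to 149 rather than exactly 150.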
For the Folium map I incorporated 3 features into the map interface: a heat map to visualize where most of the crime happens, district markers positioned at the average of each district's latitudes and longitudes (along with statistics gathered for that district), and random instances of crime drawn from my 9 distinct random dataframes. I built the heat map by studying multiple examples of this folium plugin. I first drew a large enough sample from the original dataframe, in my case 500 rows: more than 500 made my heat map too dense, yielding large red zones of high crime activity while also driving computation time too high, and fewer than 500 was not enough information, yielding a lot of discontinuous blue clusters. I then overlaid this data on our original map. I had to change the default tile that folium provides, because its colors supersede the heat map and render it useless. The tile I use is called "stamentoner", which makes the map black and white and lets any color overlay stand out quite nicely; I also like it because it makes smaller details in the map easier to identify. I overlaid the district data by taking the average of all of each district's coordinates. This method worked quite nicely, as it preserves the general cardinal locations of the different districts (the districts are labeled according to their locations). I then marked the district centers I computed with the native marker and colored them differently to make the interface look as if it discretized the zones. I also added information gathered from our dataframe that tallies the top 5 offenses of each specific district and computes their percentages; glancing over our data, it seems the Baltimore area's top crime offense is Narcotics, with the police code 87.
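The per-district top-5 tally and percentage formatting can be sketched on a made-up tally (the offense labels and counts below are hypothetical, not drawn from the real data):

```python
# Hypothetical offense tally for one district.
curr_crime = {"87-Narcotics": 40, "4E-Common Assault": 25, "79-Other": 10}
total_dist = sum(curr_crime.values())  # 75

# Sort offenses by count (descending), format each as
# "<offense>: <count> (<percent>%)" with the percent bolded for the HTML popup,
# and keep at most the top 5.
top_five = [
    k + ": " + str(curr_crime[k])
    + " <b>(" + str(int(curr_crime[k] / total_dist * 100)) + "%)</b>"
    for k in sorted(curr_crime, key=curr_crime.get, reverse=True)
][:5]

print(top_five[0])
```

With these numbers the leading entry renders as `87-Narcotics: 40 <b>(53%)</b>`, since `int()` truncates the 53.3% share.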
It was interesting to make this because the only text the marker popup takes is HTML. Finally, the gender overlay simply discretizes our data from the 9 dataframes into crimes by males (black markers) and females (white markers). I also give more insight into each individual by providing non-identifying personal information: the race, the age, and the arrest date.
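A minimal sketch of the color legend and popup HTML behind the gender overlay (the arrest record values here are made up for illustration):

```python
# Sex-to-color legend for the circle markers: black for males,
# a near-white pale purple for females.
sex_legend = {'M': 'black', 'F': "#EBD7FA"}

# A made-up arrest record standing in for one row of the sample.
age, sex, race, arrest_date = 27, 'M', 'B', '01/15/2014'

# folium popups render raw HTML, so the fields are joined with <br> tags
# and the labels are bolded with <b>.
description = ("<b>Age:</b> %s<br><b>Sex:</b> %s<br><b>Race:</b> %s<br>"
               "<b>Arrest Date:</b> %s" % (age, sex, race, arrest_date))

print(sex_legend[sex])
print(description)
```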