For this tutorial, we landed on a dataset from the World Bank which is publicly available and updated annually: http://databank.worldbank.org/data/reports.aspx?Code=NY.GDP.MKTP.KD.ZG&id=1ff4a498&report_name=Popular-Indicators&populartype=series&ispopular=y
This dataset collects the popular indicators for geographic regions around the world, broken down by country. There are countless indicators present within this dataset, but we chose a few to examine more closely, including total country population, mortality rates, life expectancy at birth, and the poverty headcount ratio. The data is organized chronologically, with records from 2000 through 2015. We wanted to look for potential relationships between these indicators in order to draw some conclusions, or at least generalizations, about them.
You will need Python 3.5 or higher and the following libraries (e.g. pip install <library name>):
Anaconda is a great way to manage these dependencies: it keeps your libraries up to date without clobbering previously installed versions.
Here we import the necessary Python libraries:
import math
import folium
import numpy as np
import pandas as pd
import plotly.graph_objs as go
import matplotlib.pyplot as plt
import seaborn as sns
from geopy.geocoders import Nominatim
import sklearn.metrics
from sklearn import linear_model
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
We used pandas' built-in read_csv method to parse the CSV (comma-separated values) file. Loading the data into a dataframe makes it much easier to work with, even though the values start out raw and untidy.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
Here, we can see this “raw” and “untidy” dataframe by looking at its first few rows.
#load our data into dataframe
pop_indicators = pd.read_csv("Indicators.csv")
#print first 5 rows of our data in a human-readable form.
pop_indicators.head()
Here we can see that our data is split across different indicators, one of which is the total population of each country. The data is parameterized by year, ranging from 2000 to 2015; this is very convenient for machine learning, since we can predict future values using supervised learning methods such as decision trees or even linear regression. Before we do that, we must massage our data into an easier form for exploratory data analysis and, eventually, machine learning.
Some characteristics of our data upon exploring the dataset are:
These are all characteristics we have to keep in mind when we tidy our data.
As mentioned earlier, the initial dataframe is raw, uncleaned, and rather difficult to digest. Here we tidy it up a bit by doing a few different things to the dataframe.
Here is a list of what we did:
# Grabs the names of all columns within the dataframe
col_names = pop_indicators.columns.values
# Uses all columns after the 4th - rest are not necessary
col_names = col_names[4:]
# Replaces all '..' placeholders with NaN
pop_indicators = pop_indicators.replace({r'\.\.': 'NaN'}, regex=True)
# Renames the year columns to just the 4-digit year
for c in col_names:
    pop_indicators[c[:4]] = pop_indicators[c]
    pop_indicators = pop_indicators.drop(c, axis=1)
# Dropping unnecessary columns
pop_indicators = pop_indicators.drop("Series Code", axis=1)
pop_indicators = pop_indicators.drop("Country Code", axis=1)
col_names = pop_indicators.columns.values
col_names = col_names[2:]
# Cast the year columns to floats so the 'NaN' strings become real NaNs
for c in col_names:
    pop_indicators[c] = pop_indicators[c].astype(float)
# As you can see below, the data now appears much cleaner than before
pop_indicators.head()
# Initialize array of top dataframes
top_dfs = []
# The names of indicators that will be observed
top_indicators = ['Population, total', 'Mortality rate, under-5 (per 1,000 live births)',
'Immunization, measles (% of children ages 12-23 months)', 'GDP (current US$)',
'Life expectancy at birth, total (years)', 'Income share held by lowest 20%',
'Poverty headcount ratio at national poverty lines (% of population)',
'School enrollment, secondary (% gross)', 'Inflation, consumer prices (annual %)',
'Market capitalization of listed domestic companies (% of GDP)']
indicator_map = {}
# Map each indicator name to its position for easy lookups later
for i in range(len(top_indicators)):
    indicator_map[top_indicators[i]] = i
# Split the data into a separate dataframe for each popular indicator
for i in top_indicators:
    top_dfs.append(pop_indicators.loc[pop_indicators["Series Name"] == i].reset_index())
i = 0
for data_frame in top_dfs:
    # Fill each year's missing values with that year's mean across all countries
    for col in col_names:
        top_dfs[i][col] = data_frame[col].fillna(data_frame[col].mean())
    i += 1
# For example, this is the dataframe that correlates with Mortality Rates
top_dfs[1].head()
For filling in missing data, we noticed some interesting problems. There are different decisions to be made about data that is nonexistent: we can drop rows with missing values entirely, base the missing values on other known data via hot-deck imputation, or ignore the missingness entirely. For our purposes, we chose to fill a country's missing value for a given year with the mean of that year's values across all countries. An alternative is to average over the data present within the same row (i.e. the same country, but different years); however, we ran into complications with this approach because some countries have no data to report at all for certain indicators. Keep in mind that there are a variety of ways to deal with missing data such as this; a few of the alternatives are sketched below.
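To make those trade-offs concrete, here is a minimal sketch of the three options. This is illustrative only: the loop above already applies the per-year mean fill to top_dfs, so we rebuild one un-imputed indicator frame from pop_indicators (which was never imputed) just to demonstrate.
# Illustrative only: rebuild an un-imputed frame for one indicator
df = pop_indicators.loc[pop_indicators["Series Name"] == top_indicators[1]].reset_index()
year_cols = [str(y) for y in range(2000, 2016)]
# Option 1: drop any country that is missing a value in any year
dropped = df.dropna(subset=year_cols)
# Option 2: fill a country's missing years from that country's own row mean
row_filled = df.copy()
row_filled[year_cols] = df[year_cols].apply(lambda r: r.fillna(r.mean()), axis=1)
# Option 3 (our choice, already applied to top_dfs above): fill each missing
# value with that year's mean across all countries
col_filled = df.copy()
col_filled[year_cols] = df[year_cols].fillna(df[year_cols].mean())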
In this section, we explore visualizations of the different indicators using our cleaner dataframes. The goal is to view the overall trend of each indicator over the 16 years of data available to us. Because there are over 217 different countries present in the full dataframe, we have to take a sample of countries to see trends within them specifically. Later on, we can explore countries through the lens of different geographic regions, but this is intended to be a very general plot of ten random countries from the dataframe, for the sake of observing any obvious trends over time and making tentative conclusions about the general well-being of the world as a whole based on the sample. It may be difficult to make generalizations based on only ten countries, but adding more countries to the plot would simply lead to messy visualizations that are not easily interpreted.
For this section we decided to make use of our large repository of countries and pull a random sample of countries from our dataframe. We made ten plots, one for each of the popular indicators in our array of dataframes. We can use these visualizations to form a hypothesis about our data, since these plots can be good indicators of correlations to investigate later. We will eventually state a hypothesis and use statistics, machine learning, and hypothesis testing to decide whether to reject the null hypothesis. For some countries the trends are not always obvious, because we imputed some of their data using the mean of all values for that particular year; this works well if the data is roughly uniformly distributed, but that is not always the case, so from time to time we may see kinks that linear regression handles poorly, which is why we may turn to more sophisticated learning algorithms later. We plot our data over 16 years, 2000-2015.
# Get a random sample of countries using the first dataframe of the dataframe array
s = top_dfs[0].sample(10)
sample_map = {}
# Get their indices and country names, respectively
s_indices = s['index'].values
s_values = s['Country Name'].values
# Maps country name to respective index for later use
for i in range(len(s_values)):
    sample_map[s_values[i]] = s_indices[i]
i = 0
# Loops through each indicator stored in the array of dataframes and plots the sample of countries over 2000-2015
for data_frame in top_dfs:
    indicator = top_indicators[i]
    figure, axel = plt.subplots(figsize=(15, 10))
    for country in s_values:
        c = data_frame.groupby("Country Name").get_group(country)
        years = []
        vals = []
        for year in range(2000, 2016):
            years.append(year)
            vals.append(c[str(year)][sample_map[country]])
        plt.plot(years, vals, label=country)
    # Create respective plot for indicator collected
    axel.set_ylabel(indicator)
    axel.set_xlabel("Year")
    axel.set_title("%s over time" % indicator)
    handles, labels = axel.get_legend_handles_labels()
    axel.legend(handles, labels)
    plt.show()
    i += 1
As noted earlier, it can be difficult to draw general conclusions based on this random sample. However, here are some general conclusions that can be inferred, based on our graphs:
With these factors under consideration, we can generally say that the overall standard of living across the planet is on a slight uptrend. In recent years, countries have been putting resources toward better healthcare for their citizens, corporate policies that improve market capitalization, and efforts to lower poverty levels. While this may not accurately reflect every country across the globe, it does suggest a hypothesis that we can later examine more critically with a statistical experiment.
For the map visualization, we aimed to add some basic statistics from the most up-to-date information (i.e. the 2015 data) for the ten popular indicators we chose initially. To locate the latitude and longitude of individual countries, we used geopy, a Python library that geocodes (and reverse-geocodes) place names into coordinates that we can overlay on a map. Geocoding works better with more information, and in our case we only pass in the country name, which is inherently vague, so geopy may return wrong coordinates or fail to find some countries at all; rest assured that the statistics we gather are correct. To display the information, we used Folium's native Marker with an HTML representation of our data.
map_osm = folium.Map(location=[39.29, -76.61], zoom_start=2)
# initialize our geopy object for finding coordinates
# (newer versions of geopy require a user_agent string; any descriptive name works)
geolocator = Nominatim(user_agent="world-bank-indicators-tutorial")
data = []
# choose a random sample from our dfs
random_sample = top_dfs[0].sample(n=25).reset_index()
for _, element in random_sample.iterrows():
    country = element['Country Name']
    # load location name into geopy object
    location = geolocator.geocode(country)
    # if geopy doesn't find a match we continue our loop
    if location is None:
        continue
    # get coordinates to overlay on our map
    coords = [location.latitude, location.longitude]
    stats = []
    i = 0
    for d in top_dfs:
        # loads statistical information into the popup parameter in HTML format
        indicator = "<b>" + top_indicators[i] + "</b>"
        info = str(d.loc[d['Country Name'] == country]['2015'].values[0])
        stats.append(indicator + ": " + info)
        i += 1
    description = "<b>" + country + "</b><br/><br/>" + "<br/>".join(stats)
    folium.Marker(coords,
                  popup=description,
                  icon=folium.Icon(color="red")).add_to(map_osm)
map_osm
Now we want to drill down on the information that we have. We attempted to select the most well-off country overall (on the basis of wealth, i.e. GDP) from each continent, with the goal of comparing the well-being of each continent as a whole. Here are the countries we selected from the continents:
For these plots we used plotly, which is also a fairly new Python library; it is great for plotting data, and we feel it makes the results easier to digest and is more intuitive to use than Matplotlib. We also chose it to show that there is more than one way to approach a problem, and that Python gives just about anyone the tools to make a useful contribution and help someone else out. We will be performing hypothesis testing on these seven countries later on, and will run machine learning on them to predict what 2016 may hold (although the World Bank will be publishing its findings soon!). We locate the different countries using pandas' loc function with a boolean filter.
# locate the case study countries
s = top_dfs[0].loc[(top_dfs[0]['Country Name'] == 'United States') | (top_dfs[0]['Country Name'] == 'Brazil') |
                   (top_dfs[0]['Country Name'] == 'China') | (top_dfs[0]['Country Name'] == 'Australia') |
                   (top_dfs[0]['Country Name'] == 'France') | (top_dfs[0]['Country Name'] == 'Nigeria') |
                   (top_dfs[0]['Country Name'] == 'United Arab Emirates')]
sample_map = {}
s_indices = s['index'].values
s_values = s['Country Name'].values
for i in range(len(s_values)):
    sample_map[s_values[i]] = s_indices[i]
i = 0
for data_frame in top_dfs:
    indicator = top_indicators[i]
    trace = []
    for country in s_values:
        # group countries using pandas groupby function
        c = data_frame.groupby("Country Name").get_group(country)
        years = []
        vals = []
        # append values for our plot
        for year in range(2000, 2016):
            years.append(year)
            vals.append(c[str(year)][sample_map[country]])
        # append a trace of our plot
        trace.append(go.Scatter(x=years, y=vals, name=country))
    # adds layout parameters to the plot
    layout = go.Layout(title=indicator)
    data = trace
    fig = go.Figure(data=data, layout=layout)
    iplot(fig)
    i += 1
As you can see from our data, we can draw some observations about:
Machine learning is an awesome tool used by data scientists, data engineers, and statisticians to make forecasts about the future. We will use it here, through Scikit-learn, to see how close our predicted values come to our actual 2015 data. We will try K-Nearest Neighbours, Decision Trees, and Random Forests to see which is the best predictor for our dataset (going in, we suspected a regression-style approach like linear regression would suit this data best).
This part took the longest due to all the intricacies of the different learning algorithms. We decided to abstract our code a little bit and define functions that predict our values in the background, so we only need to pass in a dataframe to compute the values we need. We will then do some visualization in order to figure out which learning algorithm works best on our data. We have already imported the requisite pieces of Scikit-learn. The setup for the different learning algorithms is essentially the same: we start by dropping all the unnecessary columns from the dataframe, in essence all the columns that don't hold numerical values. We then cast the remaining values to integers (this is required by many of the learning algorithms found in the Scikit-learn package). Next we set our X training data and our y training data: X is the feature matrix spanning the years 2000-2014, and y is the vector of 2015 values that we will be predicting. We also set the X test data and y test data equal to the X and y training data, respectively. We then use the classifier's predict function to run the algorithm, and finally we return the accuracy score from Scikit-learn to see how good a predictor it is!
def knn_ML(df):
    # initialize our KNN object with necessary parameters
    knnclf = KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                                  metric_params=None, n_jobs=1, n_neighbors=10, p=2,
                                  weights='uniform')
    # assign training data and testing data that we pass into our fit function
    trainX = testX = df.drop(['Series Name', 'Country Name', 'index', "2015"], axis=1).astype(int)
    trainY = testY = df['2015'].astype(int)
    knnclf.fit(trainX, trainY)
    # gather an accuracy measure and return
    knnprediction = knnclf.predict(testX)
    knnfinal = accuracy_score(testY, knnprediction)
    return knnfinal

def decision_ML(df):
    # initialize our Decision Tree object with necessary parameters
    treeclf = DecisionTreeClassifier(max_depth=5)
    # assign training data and testing data that we pass into our fit function
    trainX = testX = df.drop(['Series Name', 'Country Name', 'index', "2015"], axis=1).astype(int)
    trainY = testY = df['2015'].astype(int)
    treeclf.fit(trainX, trainY)
    # gather an accuracy measure and return
    treeprediction = treeclf.predict(testX)
    treefinal = accuracy_score(testY, treeprediction)
    return treefinal

def RF_ML(df):
    # initialize our Random Forest object with necessary parameters
    rfclf = RandomForestClassifier(n_estimators=10)
    # assign training data and testing data that we pass into our fit function
    trainX = testX = df.drop(['Series Name', 'Country Name', 'index', "2015"], axis=1).astype(int)
    trainY = testY = df['2015'].astype(int)
    rfclf.fit(trainX, trainY)
    # gather an accuracy measure and return
    rfprediction = rfclf.predict(testX)
    rffinal = accuracy_score(testY, rfprediction)
    return rffinal
Now we will put our machine learning functions to use and apply them to each of the popular indicators in our top_dfs array. We will visualize the strength of our predictive models using bar graphs bounded between zero and one, where the score is the fraction of 2015 values predicted exactly. Since we have encapsulated most of our code into functions, the accuracy test is straightforward. We insert each finding into a dictionary for the current plot, and also append it to a NumPy array that we will use later to find the best algorithm for our data. We then display the results in the bar graphs below.
# initialize our dictionary that we will use to average predictions later
mean_map = {
    'Random Forest': np.array([]),
    'Decision Tree': np.array([]),
    'K-Nearest Neighbours': np.array([])
}
for i in range(len(top_dfs)):
    # Initialize the plot
    f, ax = plt.subplots(figsize=(5, 5))
    predict_map = {}
    # use the algorithms defined above and load their scores into our map for the plots
    predict_map['Random Forest'] = RF_ML(top_dfs[i])
    predict_map['Decision Tree'] = decision_ML(top_dfs[i])
    predict_map['K-Nearest Neighbours'] = knn_ML(top_dfs[i])
    # append our findings to the mean map above
    mean_map['Random Forest'] = np.append(mean_map['Random Forest'], predict_map['Random Forest'])
    mean_map['Decision Tree'] = np.append(mean_map['Decision Tree'], predict_map['Decision Tree'])
    mean_map['K-Nearest Neighbours'] = np.append(mean_map['K-Nearest Neighbours'], predict_map['K-Nearest Neighbours'])
    # create our dataframe to use for the plot
    predict_df = pd.DataFrame.from_dict(data=predict_map, orient='index')
    predict_df['model'] = predict_df.index
    predict_df['prediction'] = predict_df[predict_df.columns.values[0]]
    # create the plot using the dataframe
    sns.barplot(data=predict_df, x='model', y='prediction', ax=ax)
    ax.set_title(top_indicators[i])
    ax.set_ylabel('Prediction Score')
    ax.set_xlabel('Learning Algorithm')
    plt.tight_layout()
    plt.show()
In order to choose which of these learning algorithms is best, we need some metric to score them, such as the mean, median, or mode of the scores. The mode is not ideal here: with only ten popular-indicator scores per algorithm, there are very few recurring values, so the mode tells us little even though it is normally a robust descriptive statistic. We feel the most appropriate measure is the mean, because our scores are all normalized to the same range (0 to 1) and the mean summarizes them well under these circumstances; a quick comparison with the median is sketched below.
https://ncss-wpengine.netdna-ssl.com/wp-content/themes/ncss/pdf/Procedures/NCSS/Descriptive_Statistics.pdf
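As a quick sanity check before committing to the mean, we can compare it against the median for each algorithm (a minimal sketch, assuming mean_map still holds the raw per-indicator score arrays populated in the loop above):
# Compare the mean and median of each algorithm's ten per-indicator scores
# (mean_map still maps each model name to its NumPy array of scores here)
for model, scores in mean_map.items():
    print("%s: mean = %.3f, median = %.3f" % (model, scores.mean(), np.median(scores)))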
Using the mean, we made a bar plot similar to the ones above, utilizing our mean map, which maps each algorithm to a NumPy array whose native mean function we can use. As we can see, Random Forest is the best learning algorithm to use on our data, with accuracy converging to 1 (the highest our predictors can get). The decision tree method is second best with a score of about 0.6; we think this is because a random forest is a generalization of decision trees, combining a large number of specifically-built trees to reduce the generalization error of a single tree. Finally, K-Nearest Neighbours performed the worst, although it put up a respectable fight with a score of about 0.5. This is probably because KNN is predicated on distance metrics, and if we look at the ranges of some of the indicators we can see that some are very spread out, which can impede clustering.
# now we find the average of our predictions
mean_map['Random Forest'] = mean_map['Random Forest'].mean()
mean_map['Decision Tree'] = mean_map['Decision Tree'].mean()
mean_map['K-Nearest Neighbours'] = mean_map['K-Nearest Neighbours'].mean()
# Initialize the plot
f, ax = plt.subplots(figsize=(5,5))
# load information into a dataframe for plotting
mean_df = pd.DataFrame.from_dict(data=mean_map, orient='index')
mean_df['model'] = mean_df.index
mean_df['prediction'] = mean_df[mean_df.columns.values[0]]
# use dataframe to plot findings
sns.barplot(data=mean_df, x='model', y='prediction', ax=ax)
ax.set_title("Mean of Indicators")
ax.set_ylabel('Prediction Score')
ax.set_xlabel('Learning Algorithm')
plt.show()
Based on all the data we have just explored, we would like to take a deeper look at some potential relationships between the indicators we selected. For example, one avenue worth additional exploration is whether a country's total population has any correlation with its total GDP per year. The purpose of these comparisons is ultimately to develop a deeper understanding of some of these indicators and to make generalizations about how the factors relate to one another. After choosing our desired indicators, we locate the corresponding dataframes in our list of per-indicator dataframes, loop through each year of data, add the points to our sample, and fit a linear regression using Scikit-learn's built-in regression methods. We simply call the '.fit' function with the two variables we are comparing and then inspect the fitted coefficient. This coefficient provides useful insight about the direction and strength of the relationship, which we can later confirm visually.
First, we plot the data points we are observing against each other on a scatter plot. Then we can use the '.predict' function from the same Scikit-learn library to generate the regression line and see any overarching trend in our data. This is very useful for visually confirming what the coefficient conveys, and can either support or cast doubt on whether a correlation exists between the two indicators we selected.
http://scikit-learn.org/stable/supervised_learning.html#supervised-learning
First, we want to test whether there is a correlation between the income share held by the lowest 20% of a population and the proportion of the population that sits at or below the national poverty line.
Our hypothesis is: as the income share held by the lowest 20% of the population increases, the proportion of individuals at or below national poverty levels decreases. We expect a negative correlation.
Now that we have a hypothesis, we need to validate or invalidate it using the data we collected. As we mentioned earlier, this is a simple procedure that can be leveraged with Scikit-learn’s built in regression functions. Here is the general procedure we followed:
# Initialize linear regression data object
regData = linear_model.LinearRegression()
# Pull the indices of the indicators with the dictionary we created earlier
income_idx = indicator_map['Income share held by lowest 20%']
pov_idx = indicator_map['Poverty headcount ratio at national poverty lines (% of population)']
# Create references for each of the corresponding dataframes for income share and poverty headcount
income_df = top_dfs[income_idx]
poverty_df = top_dfs[pov_idx]
# Start off with the first year of the period
x = income_df['2000']
y = poverty_df['2000']
# Append all other years
for year in range(2001, 2016):
    x = x.append(income_df[str(year)])
    y = y.append(poverty_df[str(year)])
regData.fit(x.values.reshape(-1,1), y)
print('Coefficients: \n', 1/regData.coef_)
As we can see above, the printed coefficient is approximately -0.28. This suggests a negative relationship that is rather weak, but noticeable. Let's go ahead and run through the steps required to visually confirm our discovery and perform an analysis.
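As an aside, the value printed above is derived from the fitted slope rather than the Pearson correlation coefficient itself; if you want the latter, NumPy can compute it directly. A minimal sketch, assuming x and y still hold the stacked yearly series built above:
# Pearson correlation coefficient between the two stacked series
# (assumes x and y are the appended pandas Series from the cell above)
r = np.corrcoef(x.values.astype(float), y.values.astype(float))[0, 1]
print("Pearson r:", r)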
We start off by initializing a scatter plot through the standard procedure outlined here: https://pythonspot.com/matplotlib-scatterplot/
Afterwards, we re-initialize our variables, since they were manipulated while calculating the coefficients. Lastly, we generate a scatter plot and overlay the predicted regression line by using 'regData.predict' from Scikit-learn's library. Below, you can see the output of our code.
# Initialize the plot
figure, axel = plt.subplots(figsize=(15, 10))
# Create Scatterplot
plt.scatter(x,y)
# Sets the x and y-axis labels
plt.ylabel('Poverty Headcount Ratio (%)')
plt.xlabel('Income Share Held By Lowest 20%')
# Re-initializes the series
x = income_df['2000']
y = poverty_df['2000']
for year in range(2001, 2016):
    x = x.append(income_df[str(year)])
    y = y.append(poverty_df[str(year)])
plt.plot(x.values.reshape(-1,1), regData.predict(x.values.reshape(-1,1)), color='gray', linewidth=3)
plt.show()
It can be visually confirmed through both our regression line and scatter plot that the correlation is negative. However, as the scatterplot shows, the correlation is not the strongest. We want to aim for a number closest to -1 or 1, which corresponds to a perfect correlation. In this case, there is certainly proof of some correlation, but not all that much. In terms of our actual hypothesis, it shows that there is SOME, not a lot, of correlation between the income share held by the lowest 20% of a population and proportion of a population that sits at or below national poverty lines. The relationship makes contextual sense, because the graph essentially says that, as the income share held by the lowest 20% of the population goes up, the poverty headcount ratio within their country goes down. This was the anticipated relationship that we expected going into this.
We realized that we selected ten different indicators, and our hypothesis above only tested two. So we wanted to follow up and test a couple different indicators for the sake of better understanding our data!
Our second hypothesis aims to test the relationship between total life expectancy at birth, in years, and the mortality rate for children under 5 (per 1,000 live births).
For our hypothesis, we predict that higher life expectancies at birth correspond with lower mortality rates for children under 5 in different countries. Once again, we are expecting a negative correlation.
Let’s repeat the same exact steps outlined above, only now with our revised features!
regData = linear_model.LinearRegression()
life_exp_idx = indicator_map['Life expectancy at birth, total (years)']
mort_idx = indicator_map['Mortality rate, under-5 (per 1,000 live births)']
life_exp_df = top_dfs[life_exp_idx]
mort_df = top_dfs[mort_idx]
figure, axel = plt.subplots(figsize=(15, 10))
x = life_exp_df['2000']
y = mort_df['2000']
for year in range(2001, 2016):
    x = x.append(life_exp_df[str(year)])
    y = y.append(mort_df[str(year)])
regData.fit(x.values.reshape(-1,1), y)
print('Coefficients: \n', 1/regData.coef_)
Once again, the printed coefficient is approximately -0.24, which indicates a slight negative relationship. Let's follow the same steps as above to visualize this prediction.
figure, axel = plt.subplots(figsize=(15, 10))
plt.ylabel('Mortality rate, under-5 (per 1,000 live births)')
plt.xlabel('Life expectancy at birth')
x = life_exp_df['2000']
y = mort_df['2000']
for year in range(2001, 2016):
    x = x.append(life_exp_df[str(year)])
    y = y.append(mort_df[str(year)])
plt.scatter(x,y)
plt.plot(x.values.reshape(-1,1), regData.predict(x.values.reshape(-1,1)), color='gray', linewidth=3)
plt.show()
Our correlation coefficient and plots once again indicate the presence of some negative correlation. We were surprised that the correlation is weaker than the previous one, given how the scatterplot turned out visually. It is only slightly weaker than the one before, but our data is still indicative of a correlation. This also makes contextual sense and follows our hypothesis: countries with higher life expectancies at birth typically have lower mortality rates for children under 5. This is most likely due to better healthcare infrastructure and higher standards of living in some countries compared to others.
Data science is an awesome way for computer scientists, statisticians, and mathematicians to make forecasts about the future. It is a continuous process in which you make more and more headway into the problem as you go. As you can see, we have successfully traversed the whole data science pipeline:
These steps are not mutually exclusive, and it is possible that one step sends you back to a previous step that needs to be revisited: we definitely had to step back to tidying our data when we hit machine learning and hypothesis testing!
We can see from our machine learning results that if we wanted to predict the 2016 or even 2017 popular indicators, our go-to learning algorithm would be Random Forest, with a very promising score; if we wanted to put accuracy on the back burner and reduce computation, we could use a decision tree, since, as we saw, a random forest is essentially an ensemble of decision trees. There are still a multitude of learning algorithms out there, and we only scratched the surface of a few supervised ones, which form just one subset of machine learning. Some other classes of algorithms are unsupervised learning, reinforcement learning, and deep learning. We challenge you to predict the 2016 values when the World Bank publishes them and see how accurate you can get; a rough starting point is sketched below!
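If you want to take up that challenge, here is a minimal sketch of one possible starting point. It reuses the classification-style Random Forest setup from above on a top_dfs frame; the helper name predict_next_year and the sliding-window trick are our own illustration, not something prescribed by the data or the analysis above.
def predict_next_year(df):
    # Hypothetical helper: train a Random Forest on the 2000-2014 columns to
    # predict 2015 (as above), then feed it the shifted 2001-2015 window to
    # produce a rough guess for 2016.
    years = [str(y) for y in range(2000, 2016)]
    X_train = df[years[:-1]].astype(int).values   # features: 2000-2014
    y_train = df[years[-1]].astype(int).values    # target:   2015
    clf = RandomForestClassifier(n_estimators=10)
    clf.fit(X_train, y_train)
    X_next = df[years[1:]].astype(int).values     # shifted window: 2001-2015
    return pd.Series(clf.predict(X_next), index=df['Country Name'].values)

# For example, a (very rough) guess at 2016 total population per country:
predict_next_year(top_dfs[0]).head()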
Alternatively, we saw from hypothesis testing that we can check for correlation between two seemingly related factors with a basic regression test. By leveraging these tests, we can gauge the strength of potential correlations from the coefficients and visually confirm them using scatterplots and predicted regression lines. The goal of these tests is ultimately to help you understand your data better and make inferences about situations beyond the data presented to you, effectively converting raw numbers into real insight about a more meaningful problem.
Ultimately, Data Science has been proven as a wonderful discipline, and CMSC 320 has done an incredible job exposing us to some of the basics! We hope you enjoyed following our tutorial!