Working with Different Data Structures
Data Understanding
The data sources for this analysis will be pulled from two separate files.
world_cup_2018.json
- Source: This dataset comes from football.db, a “free and open public domain football database & schema for use in any (programming) language”
country_populations.csv
- Source: This dataset comes from a curated collection by DataHub.io, originally sourced from the World Bank
Here, we use the json
package to load the data from the world_cup_file
into a dictionary (which we’ll use for analysis) called world_cup_data
and we’ll use the csv
package to to load the data from population_file
into a list of dictionaries called population_data
# Men's World Cup Data
world_cup_data = json.load(world_cup_file)
# Here's we're loading the data, and saving the data into an object, world_cup_data
# Close the file now that we're done reading from it
world_cup_file.close()
# Population Data
population_data = list(csv.DictReader(population_file))
# Same as with the world_cup_data
# Close the file now that we're done reading from it
population_file.close()
Working on the Men’s World Cup Data
What is the structure of the world_cup_data
?
# This line of code will return the keys of the dictionary, which allows us to see
# how the data is structured
world_cup_data.keys()
This indicates the World Cup data is a dictionary, with two keys, ‘name’ and ‘rounds’
Extracting the Rounds
First, we want to extract the rounds of the World Cup matches, so we can determine the number of wins from those matches.
# Create a variable called "rounds" from the "rounds" key.
rounds = world_cup_data["rounds"]
# This prints the type of data of "rounds" (a list), the length of the "rounds" variable (20 rounds), and prints the three ...
print("type(rounds):", type(rounds))
print("len(rounds):", len(rounds))
print("rounds[3]:")
# This line just prints the 4th entry, so we can we that everything worked
rounds[3]
The rounds
variable we have created is a list of dictionaries, with each dictionary of inside of rounds
containing a name (such as ‘Matchday 4’) as well as the list of matches that occurred on that day (along with match details)
Extracting the Matches
# Create a variable, "matches," that is an empty list
matches = []
# This is a looping function
for round_ in rounds:
# Extract the list of matches for this round
round_matches = round_["matches"]
# Add them to the overall list of matches, which is the variable we created at the beginning
matches.extend(round_matches)
# View the first entry
matches[0]
Extracting the Teams
We know that each match has two teams, team1
and team2
, so we’re going to create a list of all unique team names by looping over every match in matches
and adding the “name” values associated with both team1
and team2
# Create a variable, "teams_set", that is an empty set
teams_set = set()
for match in matches:
# Add team1 value to teams_set
teams_set.add(match["team1"]["name"])
# Add team2 value to teams_set
teams_set.add(match["team2"]["name"])
# Create a "teams" variable that is a sorted list based on the teams_set
teams = sorted(list(teams_set))
# View the list
print(teams)
Now that we have our teams
variable that is comprised of all the participating teams at the 2018 Men’s World Cup, we move on to connecting each team name with its performance at the World Cup, as measured by the count of total wins.
Men’s World Cup Performance
In this section, we start by initializing a dictionary called combined_data
containing:
- Keys: the strings from teams
- Values: each value the same, a dictionary containing the key ‘wins’ with the associated value 0.
# This uses dictionary comprehension, creating a dictionary (noted by {}) with the team as the key and the value set to number of wins - for each team in the teams list
combined_data = {team: {"wins": 0} for team in teams}
To add wins to the dictionary, we’re going to write a function that we can call on the match
dictionary.
# This first chunk is defining the function, find_winner()
def find_winner(match):
"""
Given a dictionary containing information about a match,
return the name of the winner (or None in the case of a tie)
"""
score_1 = match["score1"]
score_2 = match["score2"]
if score_1 > score_2:
return match["team1"]["name"]
elif score_2 > score_1:
return match["team2"]["name"]
else:
return None
# ^ this else: return None is not actually necessary since Python returns
# None if nothing is returned
Once this function is defined, we can loop over every match in matches
to find the winner and add +1 to associated count in combined_data
for match in matches:
# Get the name of the winner
winner = find_winner(match) # the function we wrote above
# Only proceed to the next step if there was a winner
if winner:
# Add 1 to the associated count of wins
combined_data[winner]["wins"] += 1
# Visually inspect the output to ensure the wins are different for different countries
combined_data
Analysis of Wins
While we could try to understand all 32 of those numbers just by scanning through them, let’s use some descriptive statistics and data visualizations instead!
Statistical Summary of Wins
The code below calculates the mean, median, and standard deviation of the number of wins. If it doesn’t work, that is an indication that something went wrong with the creation of the combined_data
variable, and you might want to look at the solution branch and fix your code before proceeding.
import numpy as np
# numpy is a python package that allows us to do statistical analysis
wins = [val["wins"] for val in combined_data.values()]
print("Mean number of wins:", np.mean(wins))
print("Median number of wins:", np.median(wins))
print("Standard deviation of number of wins:", np.std(wins))
Visualization
import matplotlib.pyplot as plt
# matplotlib is a python package that allows us to create visualizations
# Set up figure and axes
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(12, 7))
fig.set_tight_layout(True)
# Histogram of Wins and Frequencies
ax1.hist(x=wins, bins=range(8), align="left", color="green")
ax1.set_xticks(range(7))
ax1.set_xlabel("Wins in 2018 World Cup")
ax1.set_ylabel("Frequency")
ax1.set_title("Distribution of Wins")
# Horizontal Bar Graph of Wins by Country
ax2.barh(teams[::-1], wins[::-1], color="green")
ax2.set_xlabel("Wins in 2018 World Cup")
ax2.set_title("Wins by Country");
Working on the World Population Data
We need to filter the population_data
dataset to include only those populations from 2018 and only those countries listed in the teams
dataset we completed above.
First, we need to normalize the country names to ensure country names match.
def normalize_location(country_name):
"""
Given a country name, return the name that the
country uses when playing in the FIFA World Cup
"""
name_sub_dict = {
"Russian Federation": "Russia",
"Egypt, Arab Rep.": "Egypt",
"Iran, Islamic Rep.": "Iran",
"Korea, Rep.": "South Korea",
"United Kingdom": "England"
}
# The .get method returns the corresponding value from
# the dict if present, otherwise returns country_name
return name_sub_dict.get(country_name, country_name)
Then we can create our filtered population dataset.
population_data_filtered = []
for record in population_data:
# Get normalized country name
normalized_name = normalize_location(record["Country Name"])
# Add record to population_data_filtered if relevant
if (normalized_name in teams) and (record["Year"] == "2018"):
# Replace the country name in the record
record["Country Name"] = normalized_name
# Append to list
population_data_filtered.append(record)
Unfortunately, the population values from the population_data
file are entered as objects/characters, not integers, so we need to change the values’ datatype.
for record in population_data_filtered:
# Convert the population value from str to int
record["Value"] = int(record["Value"])
Now, we can loop over population_data_filtered
and add information about population to each country in combined_data
!
for record in population_data_filtered:
# Extract the country name from the record
country = record["Country Name"]
# Extract the population value from the record
population = record["Value"]
# Add this information to combined_data
combined_data[country]["population"] = population
Statistical Analysis of Population
populations = [val["population"] for val in combined_data.values()]
print("Mean population:", np.mean(populations))
print("Median population:", np.median(populations))
print("Standard deviation of population:", np.std(populations))
Mean population: 51687460.84 Median population: 34864542.5 Standard deviation of population: 55195121.61
Visualizations of Population
# Set up figure and axes
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(12, 7))
fig.set_tight_layout(True)
# Histogram of Populations and Frequencies
ax1.hist(x=populations, color="blue")
ax1.set_xlabel("2018 Population")
ax1.set_ylabel("Frequency")
ax1.set_title("Distribution of Population")
# Horizontal Bar Graph of Population by Country
ax2.barh(teams[::-1], populations[::-1], color="blue")
ax2.set_xlabel("2018 Population")
ax2.set_title("Population by Country");
Analysis of Performance vs. Population
Statistical Measure - Correlation
np.corrcoef(wins, populations)[0][1]
Correlation coefficient: 0.0759
This returns a correlation a very weak positive correlation between population and performance.
Data Visualization of Correlation
# Set up figure
fig, ax = plt.subplots(figsize=(8, 5))
# Basic scatter plot
ax.scatter(
x=populations,
y=wins,
color="gray", alpha=0.5, s=100
)
ax.set_xlabel("2018 Population")
ax.set_ylabel("2018 World Cup Wins")
ax.set_title("Population vs. World Cup Wins")
# Add annotations for specific points of interest
highlighted_points = {
"Belgium": 2, # Numbers are the index of that
"Brazil": 3, # country in populations & wins
"France": 10,
"Nigeria": 17
}
for country, index in highlighted_points.items():
# Get x and y position of data point
x = populations[index]
y = wins[index]
# Move each point slightly down and to the left
# (numbers were chosen by manually tweaking)
xtext = x - (1.25e6 * len(country))
ytext = y - 0.5
# Annotate with relevant arguments
ax.annotate(
text=country,
xy=(x, y),
xytext=(xtext, ytext)
)
Final Analysis
What is the relationship between the population of a country and their performance in the 2018 FIFA World Cup?
Logic would dictate that the greater a country’s population, the more likely they are to have a better soccer team because they have a larger player pool than countries with smaller populations. The more people, the more possible players.
However, sheer numbers of people doesn’t equate to succes in sports - in this case, the men’s World Cup. Yes, Brazil makes a good case, but what about China, India, or the United States? And why would such a small country like Belgium be so good at soccer if its population is comparatively so much smaller?
I’d want to evaluate available government funding of soccer in these countries. How much state support do the national soccer federations receive? Is there a correlation with country wealth as well? Maybe measured by GDP per capita or income per capita.
Attribution: The Flatiron School repo