Web Scraping Advanced Football Statistics

Lately I’ve been debating the role of luck in football. Or maybe it’s not pure luck, but skill? Do teams win their leagues purely on skill? Or is the importance of luck quite big? Who is lucky and who is not? Did that team deserve relegation? And many, many more questions.

But as I am a data guy, I thought: let’s get the data and find out. Although, how do you measure luck? How do you measure skill? There is no single metric for it like in the FIFA or PES computer games. We have to look at the general picture, at long-term data with multiple variables, taking into account the context of every single game played. Sometimes a team completely dominates the opponent but just doesn’t have enough luck to score the winning goal in the last minutes, the game ends as a draw and both teams get 1 point, while it was clear that the first team deserved the victory. One team was lucky, the other wasn’t. Yes, in this situation it is luck: one team did everything, created enough dangerous moments, but didn’t score. It happens. And that’s why we love football. Because anything can happen here.

Although you cannot measure luck directly, you can get an understanding of how a team played based on a relatively new metric in football: xG, or expected goals.

xG is a statistical measure of the quality of chances created and conceded.

You can find data with this metric at understat.com. This is the website I am about to scrape.

Understanding the data

So what the heck is xG and why is it important? The answer can be found on the understat.com home page.

Expected goals (xG) is the new revolutionary football metric, which allows you to evaluate team and player performance.

In a low-scoring game such as football, final match score does not provide a clear picture of performance.

This is why more and more sports analytics turn to the advanced models like xG, which is a statistical measure of the quality of chances created and conceded.

Our goal was to create the most precise method for shot quality evaluation.

For this case, we trained neural network prediction algorithms with the large dataset (>100,000 shots, over 10 parameters for each).


The researchers trained a neural network on situations that led to goals, and now it gives us an estimate of how many real chances a team had during a match. You can have 25 shots in a game, but if they are all from long distance, from a low angle, or too weak (in short, low-quality shots), they won’t lead to a goal. Meanwhile some “experts” who didn’t watch the game will say that the team dominated and created tons of chances, blah-blah-blah. The quality of those chances matters. And that’s where the xG metric becomes very handy. With this metric you can see that Messi creates goals in conditions where it’s very hard to score, or that a goalkeeper makes a save where it should have been a goal. All these things add up, and we see champions that have skilled players and some luck, and we see losers that might have good players but don’t have enough luck. My intent with this project is to understand and present these numbers to show the role of luck in today’s football.

Let’s begin

We start by importing libraries that will be used in this project:

  • numpy – fundamental package for scientific computing with Python
  • pandas – library providing high-performance, easy-to-use data structures and data analysis tools
  • requests – the only Non-GMO HTTP library for Python, safe for human consumption (love this line from the official docs :D)
  • BeautifulSoup – a Python library for pulling data out of HTML and XML files.
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import requests
from bs4 import BeautifulSoup

Website research and structure of data

In any web scraping project, the first thing you have to do is research the web page you want to scrape and understand how it works. That’s fundamental. So we start there.

On the home page we can notice that the site has data for 6 European Leagues:

understat.com header menu
  • La Liga
  • EPL
  • BundesLiga
  • Serie A
  • Ligue 1
  • RFPL

We also see that the collected data starts from season 2014/2015. Another thing we notice is the structure of the URL: 'https://understat.com/league' + '/name_of_the_league' + '/year_start_of_the_season'.

understat.com menu

So we create variables with this data to be able to select any season or any league.

# create urls for all seasons of all leagues
base_url = 'https://understat.com/league'
leagues = ['La_liga', 'EPL', 'Bundesliga', 'Serie_A', 'Ligue_1', 'RFPL']
seasons = ['2014', '2015', '2016', '2017', '2018']

The next step is to understand where the data is located on the web page. For this we open Developer Tools in Chrome, go to the “Network” tab, find the file with the data (in this case 2018) and check the “Response” tab. This is what we get after running requests.get(URL).

After going through the content of the web page we find that the data is stored under a “script” tag, JSON encoded. So we need to find that tag, get the JSON from it and convert it into a Python-readable data structure.

# Starting with latest data for Spanish league, because I'm a Barcelona fan
url = base_url+'/'+leagues[0]+'/'+seasons[4]
res = requests.get(url)
soup = BeautifulSoup(res.content, "lxml")

# Based on the structure of the webpage, I found that data is in the JSON variable, under 'script' tags
scripts = soup.find_all('script')

Working with JSON

We find that the data we are interested in is stored in the teamsData variable. After creating a soup of HTML tags it becomes just a string, so we find that text and extract the JSON from it.

import json

string_with_json_obj = ''

# Find data for teams
for el in scripts:
  if 'teamsData' in el.text:
    string_with_json_obj = el.text.strip()
# print(string_with_json_obj)

# strip unnecessary symbols and get only JSON data
ind_start = string_with_json_obj.index("('")+2
ind_end = string_with_json_obj.index("')")
json_data = string_with_json_obj[ind_start:ind_end]

json_data = json_data.encode('utf8').decode('unicode_escape')
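To see why that unicode_escape decode is needed: understat embeds the JSON with hex escapes such as \x7B (a commenter below shows var teamsData = JSON.parse('\x7B\x2…)). Here is a tiny demo with a made-up string in that escaped form:

```python
# understat stores the JSON with hex escapes such as \x7B (which is '{');
# decoding with 'unicode_escape' turns them back into real characters.
# The string below is a made-up example of that escaped form.
raw = r'\x7B\x22id\x22\x3A\x20\x22138\x22\x7D'
decoded = raw.encode('utf8').decode('unicode_escape')
print(decoded)  # {"id": "138"}
```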

Once we have gotten our JSON and cleaned it up we can convert it into Python dictionary and check how it looks (commented print statement).

Understanding data with Python

When we start to research the data we understand that this is a dictionary of dictionaries with 3 keys: id, title and history. The first layer of the dictionary uses the ids as keys too.

Also from this we understand that history has data regarding every single match the team played in its own league (League Cup or Champions League games are not included).

We can gather team names by going over the first-layer dictionary.

# Convert the JSON string into a Python dictionary
data = json.loads(json_data)

# Get teams and their relevant ids and put them into separate dictionary
teams = {}
for id in data.keys():
  teams[id] = data[id]['title']

The history is an array of dictionaries where keys are names of metrics (read: column names) and values are values, however tautological that sounds :D.

The column names repeat over and over, so we add them to a separate list. We also check what the sample values look like.

# EDA to get a feeling of how the JSON is structured
# Column names are all the same, so we just use the first team's first record
first_id = list(data.keys())[0]
columns = list(data[first_id]['history'][0].keys())
# Check a sample of values per column
values = list(data[first_id]['history'][0].values())

After outputting a few print statements we find that Sevilla has id=138, so we get all the data for this team to be able to reproduce the same steps for all teams in the league.

sevilla_data = []
for row in data['138']['history']:
  sevilla_data.append(list(row.values()))
df = pd.DataFrame(sevilla_data, columns=columns)

For the sake of keeping this article clean I won’t add the content of the created DataFrame, but at the end you will find links to IPython notebooks on GitHub and Kaggle with all the code and outputs. Here are just samples for context.

So, voilà, congrats! We have the data for all matches of Sevilla in the 2018/2019 season of La Liga! Now we want to get that data for all Spanish teams. Let’s loop through those bytes, baby!

# Getting data for all teams
dataframes = {}
for id, team in teams.items():
  teams_data = []
  for row in data[id]['history']:
    teams_data.append(list(row.values()))
  df = pd.DataFrame(teams_data, columns=columns)
  dataframes[team] = df
  print('Added data for {}.'.format(team))

After this code finishes running we have a dictionary of DataFrames, where the key is the name of the team and the value is the DataFrame with all games of that team.
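As a quick sanity check, a lookup in that dictionary is all you need. The DataFrame below is a hypothetical two-game miniature, since the real scraped frames are too big to show here:

```python
import pandas as pd

# hypothetical miniature of the dataframes dict built above
dataframes = {
    'Sevilla': pd.DataFrame({'h_a': ['h', 'a'], 'xG': [1.5, 0.7], 'scored': [2, 0]}),
}

# key is the team name, value is the per-game DataFrame
sevilla = dataframes['Sevilla']
print(len(sevilla))  # 2 games stored
```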

Manipulations to make data as in the original source

When we look at the content of a DataFrame we notice that metrics such as PPDA and OPPDA (ppda and ppda_allowed) are represented as total amounts of attacking/defensive actions, but in the original table they are shown as coefficients. Let’s fix that!

for team, df in dataframes.items():
  dataframes[team]['ppda_coef'] = dataframes[team]['ppda'].apply(lambda x: x['att']/x['def'] if x['def'] != 0 else 0)
  dataframes[team]['oppda_coef'] = dataframes[team]['ppda_allowed'].apply(lambda x: x['att']/x['def'] if x['def'] != 0 else 0)
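For context, each cell of the ppda column holds a small dict of attacking and defensive actions, so the lambdas above just divide its two fields (the numbers below are made up):

```python
# a single 'ppda' cell looks roughly like this (hypothetical values)
ppda_cell = {'att': 120, 'def': 15}

# same formula as the lambda above: attacking actions per defensive action,
# guarding against division by zero
coef = ppda_cell['att'] / ppda_cell['def'] if ppda_cell['def'] != 0 else 0
print(coef)  # 8.0
```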

Now we have all our numbers, but for every single game. What we need are the totals for each team. Let’s find out which columns we have to sum up. For this we go back to the original table at understat.com, and we find that all metrics should be summed up, and only PPDA and OPPDA are averaged in the end.

cols_to_sum = ['xG', 'xGA', 'npxG', 'npxGA', 'deep', 'deep_allowed', 'scored', 'missed', 'xpts', 'wins', 'draws', 'loses', 'pts', 'npxGD']
cols_to_mean = ['ppda_coef', 'oppda_coef']

We are ready to calculate our totals and means. For this we loop through the dictionary of DataFrames and call the .sum() and .mean() DataFrame methods, which return Series, so we wrap the results in pd.DataFrame and call .transpose(). We put these new DataFrames into a list and then concatenate them into a new DataFrame full_stat.

frames = []
for team, df in dataframes.items():
  sum_data = pd.DataFrame(df[cols_to_sum].sum()).transpose()
  mean_data = pd.DataFrame(df[cols_to_mean].mean()).transpose()
  final_df = sum_data.join(mean_data)
  final_df['team'] = team
  final_df['matches'] = len(df)
  frames.append(final_df)
full_stat = pd.concat(frames)

Next we reorder columns for better readability, sort rows based on points, reset index and add column ‘position’.

full_stat = full_stat[['team', 'matches', 'wins', 'draws', 'loses', 'scored', 'missed', 'pts', 'xG', 'npxG', 'xGA', 'npxGA', 'npxGD', 'ppda_coef', 'oppda_coef', 'deep', 'deep_allowed', 'xpts']]
full_stat.sort_values('pts', ascending=False, inplace=True)
full_stat.reset_index(inplace=True, drop=True)
full_stat['position'] = range(1,len(full_stat)+1)

Also in the original table we have the differences between the expected metrics and the real ones. Let’s add those too.

full_stat['xG_diff'] = full_stat['xG'] - full_stat['scored']
full_stat['xGA_diff'] = full_stat['xGA'] - full_stat['missed']
full_stat['xpts_diff'] = full_stat['xpts'] - full_stat['pts']

Converting floats to integers where appropriate.

cols_to_int = ['wins', 'draws', 'loses', 'scored', 'missed', 'pts', 'deep', 'deep_allowed']
full_stat[cols_to_int] = full_stat[cols_to_int].astype(int)

Prettifying output and final view of a DataFrame

col_order = ['position','team', 'matches', 'wins', 'draws', 'loses', 'scored', 'missed', 'pts', 'xG', 'xG_diff', 'npxG', 'xGA', 'xGA_diff', 'npxGA', 'npxGD', 'ppda_coef', 'oppda_coef', 'deep', 'deep_allowed', 'xpts', 'xpts_diff']
full_stat = full_stat[col_order]
full_stat.columns = ['#', 'team', 'M', 'W', 'D', 'L', 'G', 'GA', 'PTS', 'xG', 'xG_diff', 'NPxG', 'xGA', 'xGA_diff', 'NPxGA', 'NPxGD', 'PPDA', 'OPPDA', 'DC', 'ODC', 'xPTS', 'xPTS_diff']
pd.options.display.float_format = '{:,.2f}'.format
Printscreen of Collaboratory output

Original table:

Printscreen from understat.com

Now that we have our numbers for one season of one league, we can replicate the code and put it into a loop to get all the data for all seasons of all leagues. I won’t put that code here, but I will leave a link to the entire scraping solution on GitHub and Kaggle.
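As a rough sketch of that outer loop, using only the variables defined earlier: build the list of pages first, then repeat the requests/BeautifulSoup/JSON steps shown above once per URL.

```python
base_url = 'https://understat.com/league'
leagues = ['La_liga', 'EPL', 'Bundesliga', 'Serie_A', 'Ligue_1', 'RFPL']
seasons = ['2014', '2015', '2016', '2017', '2018']

# every league/season page on understat; scraping each one repeats
# the per-season steps from this article
urls = [base_url + '/' + league + '/' + season
        for league in leagues
        for season in seasons]

print(len(urls))  # 6 leagues x 5 seasons = 30 pages
```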

Final dataset

After looping through all leagues and all seasons, plus a few manipulation steps to make the data exportable, I got a CSV file with the scraped numbers. The dataset is available here.


Hope you found this useful and got some valuable info. If you’ve reached this point, I just want to say thank you for reading and for giving your time, energy and attention to my 5 cents, and to wish you a lot of love and happiness. You’re awesome!

Thanks for the featured photo by Michael Lee on Unsplash


10 thoughts on “Web Scraping Advanced Football Statistics”

  1. Hi, thanks for this article!

    I just have a question about this part :
    "teams = {}
    for id in data.keys():
      teams[id] = data[id]['title']"

    Where was the "data" (data.keys) defined before? I get an error for this part in my notebook…

    1. Hi,

      data appears when you read the json into it with this line of code:
      data = json.loads(json_data)

      I missed it in the article. Please, check my notebook at Kaggle for working code here

  2. Hi Sergi!

    Your article is awesome! it is really helpful! I love it!!!

    I have a question for you!

    I would like to edit the code for getting the same information (matches, xg, xga), but only for home games and away games. I do not want to have the overall table. I want to have two different tables: home and away

    What do I have to edit in your code in order to only have home games stats and away games stats?

    I am a really beginner, so if you can explain it step by step would be great

    I really hope you can help me out with this! I would really appreciate it

    Thank you

  3. Hello Sergi!

    Thanks for the article and the dataset!

    My plan is to make an analysis taking into consideration the home and away results.

    I have 2 questions for you:

    – which changes need to be made in the code in order to get 2 tables, one for home and other for away games?

    – also, i was looking at the code and this error came up:

    ind_start = string_with_json_obj.index("('")+2
    ValueError: substring not found

    Can you help me solve it?

    I still have little experience in web scraping, so your help would be much appreciated.

    Keep up with your good work!

    Thanks in advance,

    1. Thank you Francisco for kind feedback!

      In order to achieve that you have to separate the data by column ‘h_a’ before summing everything up. If you want to do that on your own you have to stop before the paragraph “Manipulations to make data as in the original source”. In the dataframe you get in that step there will be all raw data and “home/away” column (‘h_a’).

      Here you can find my Kaggle notebook without summing up the data. It contains all the data per every game – the output you get there can be just splitted by column ‘h_a’ manually in Excel or just add an additional line in the code and export 2 CSVs.

      Also, if you don’t want to play too much with scraping, here is the dataset I maintain https://www.kaggle.com/slehkyi/extended-football-stats-for-european-leagues-xg – it has both summary and game records. Updating twice a year.

      Hope it helps! If you still have questions you can reach me on social or by email 🙂 all info in the footer 🙂

  4. Hi Sergi,

    Following the steps you describe, when I reach this part of the code

    import json

    string_with_json_obj = ''

    for el in scripts:
      if 'teamsData' in el.text:
        string_with_json_obj = el.text.strip()

    ind_start = string_with_json_obj.index("('")+2
    ind_end = string_with_json_obj.index("')")
    json_data = string_with_json_obj[ind_start:ind_end]

    json_data = json_data.encode('utf8').decode('unicode_escape')

    I get an error: substring not found; could you tell me how to solve it?

    1. Hi 🙂

      Most likely you didn’t download the data in the previous step. To check this, add a couple of prints to understand where you don’t have data.

      For example, here you can check whether the scripts variable has any data:
      for el in scripts:
        # here you can see if there are any elements in scripts
        if 'teamsData' in el.text:
          string_with_json_obj = el.text.strip()
          # here, for example, you can check whether string_with_json_obj has any data

      And this way you can validate other things, with simple print() calls.

      Hope that helps 😉

      1. Hello Sergi,

        I’m sorry to bother you, but I’m a beginner and I have about the same problem as the questioner above me.
        I have data in the scripts variable ( var teamsData = JSON.parse('\x7B\x2……………), but if I try to use .text, it doesn’t return anything. Has anything changed or am I just missing something?
        I’ve tried it on another, simpler page and it works there, but here it looks like .text (or get_text()) has stopped working.
        Do you know where the problem might be?
        Thanks a lot


        1. Hey 🙂

          You have to use .text on the pile of data that is in the scripts tag, while looping through each tag. If you have already extracted that text, your data is of type “string”, so you have to deal with it as a regular string.

          Also, if it doesn’t return anything maybe you didn’t catch any data… I just ran my notebook in Kaggle and it gets all the numbers.

          Try to debug your code: print content of any variable you introduce or change, even if you are sure about the output. Print all the data from scripts and manually check if there is a string ‘teamsData’ and check the type of that data, then print only ‘teamsData’ and its type and so on. Pretty sure you will find what’s wrong.

          Hope that helps 🙂 if not – find me on social or shoot me an email and we will discuss it more in depth.

          Cheers and have a great day 🙂
