ScraperFC modules

Capology

class ScraperFC.Capology.Capology

Bases: object

close()

Closes and quits the Selenium WebDriver instance.

scrape_payrolls(year, league, currency)

Scrapes team payrolls for the given league season.

Parameters:
  • year (int) – Calendar year that the season ends in (e.g. 2023 for the 2022/23 season)

  • league (str) – League. Look in shared_functions.py for the available leagues for each module.

  • currency (str) – The currency for the returned salaries. Options are “eur” for Euro, “gbp” for British Pound, and “USD” for US Dollar

Returns:

The payrolls of all teams in the given league season

Return type:

Pandas DataFrame

scrape_salaries(year, league, currency)

Scrapes player salaries for the given league season.

Parameters:
  • year (int) – Calendar year that the season ends in (e.g. 2023 for the 2022/23 season)

  • league (str) – League. Look in shared_functions.py for the available leagues for each module.

  • currency (str) – The currency for the returned salaries. Options are “eur” for Euro, “gbp” for British Pound, and “USD” for US Dollar

Returns:

The salaries of all players in the given league season

Return type:

Pandas DataFrame
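
A minimal usage sketch for the Capology module is shown below; the “EPL” league key is an assumption, so check shared_functions.py for the keys this module actually accepts.

from ScraperFC.Capology import Capology

capology = Capology()
try:
    # Team payrolls for the 2022/23 season, in British pounds ("EPL" is an assumed league key)
    payrolls = capology.scrape_payrolls(year=2023, league="EPL", currency="gbp")
    # Player salaries for the same season, in euros
    salaries = capology.scrape_salaries(year=2023, league="EPL", currency="eur")
finally:
    capology.close()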

ClubElo

class ScraperFC.ClubElo.ClubElo

Bases: object

scrape_team_on_date(team, date)

Scrapes a team’s ELO score on a given date.

Parameters:
  • team (str) – To get the appropriate team name, go to clubelo.com and find the team you’re looking for. Copy and paste the team’s name as it appears in the URL.

  • date (str) – Must be formatted as YYYY-MM-DD

Returns:

  • elo (int) – ELO score of the given team on the given date

  • -1 (int) – -1 if the team has no score on the given date
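
A hedged sketch of looking up a single rating; “ManCity” is an assumed team name and must match the name as it appears in the clubelo.com URL.

from ScraperFC.ClubElo import ClubElo

clubelo = ClubElo()
# "ManCity" is an assumed team name; copy the name from the clubelo.com URL.
elo = clubelo.scrape_team_on_date("ManCity", "2023-05-28")
if elo == -1:
    print("No ELO score available for that team and date")
else:
    print(f"ELO on 2023-05-28: {elo}")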

FBRef

class ScraperFC.FBRef.FBRef

Bases: object

ScraperFC module for FBRef

close()

Closes and quits the Selenium WebDriver instance.

Scrapes the FBRef scouting reports for a player.

Parameters:

player_link (str) – URL to an FBRef player page

Returns:

  • cleaned_complete_report (Pandas DataFrame) – Complete report with a MultiIndex of stats categories and statistics. Columns for per90 and percentile values.

  • player_name (str)

  • player_pos (str)

  • minutes (int)

get(url)

Custom get function just for the FBRef module.

Calls .get() from the Selenium WebDriver and then waits in order to avoid a Too Many Requests HTTPError from FBRef.

Parameters:

url (str) – The URL to get

Return type:

None

Gets all match links for the chosen league season.

Parameters:
  • year (int) – Calendar year that the season ends in (e.g. 2023 for the 2022/23 season)

  • league (str) – League. Look in shared_functions.py for the available leagues for each module.

Returns:

FBRef links to all matches for the chosen league season

Return type:

list

Returns the URL for the chosen league season.

Parameters:
  • year (int) – Calendar year that the season ends in (e.g. 2023 for the 2022/23 season)

  • league (str) – League. Look in shared_functions.py for the available leagues for each module.

Returns:

URL to the FBRef page of the chosen league season

Return type:

str

requests_get(url)

Custom requests.get function for the FBRef module

Calls requests.get() until the status code is 200.

Parameters:

url (str) – The URL to get

Returns:

The response

Return type:

requests.Response

scrape_all_stats(year, league, normalize=False)

Scrapes all stat categories

Runs scrape_stats() for each stat category and dumps the returned tuples of DataFrames into a dict.

Parameters:
  • year (int) – Calendar year that the season ends in (e.g. 2023 for the 2022/23 season)

  • league (str) – League. Look in shared_functions.py for the available leagues for each module.

  • normalize (bool) – OPTIONAL, default is False. If True, will normalize all stats to Per90.

Returns:

Keys are stat category names, values are tuples of 3 dataframes, (squad_stats, opponent_stats, player_stats)

Return type:

dict
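
A sketch of iterating over the dict returned by scrape_all_stats(); the “EPL” league key is an assumption.

from ScraperFC.FBRef import FBRef

fbref = FBRef()
try:
    # "EPL" is an assumed league key; see shared_functions.py for valid keys.
    all_stats = fbref.scrape_all_stats(year=2023, league="EPL", normalize=True)
finally:
    fbref.close()

# Each value is a (squad_stats, opponent_stats, player_stats) tuple of DataFrames.
for category, (squad, opponent, players) in all_stats.items():
    print(category, players.shape)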

scrape_complete_scouting_reports(year, league, goalkeepers=False)

Scrapes the FBRef scouting reports for all players in the chosen league season.

Parameters:
  • year (int) – Calendar year that the season ends in (e.g. 2023 for the 2022/23 season)

  • league (str) – League. Look in shared_functions.py for the available leagues for each module.

  • goalkeepers (bool) – OPTIONAL, default is False. If True, will scrape reports for only goalkeepers. If False, will scrape reports for only outfield players.

Returns:

  • per90 (Pandas DataFrame) – DataFrame of reports with Per90 stats.

  • percentiles (Pandas DataFrame) – DataFrame of reports with stats percentiles (versus other players in the top 5 leagues)

scrape_league_table(year, league)

Scrapes the league table of the chosen league season

Parameters:
  • year (int) – Calendar year that the season ends in (e.g. 2023 for the 2022/23 season)

  • league (str) – League. Look in shared_functions.py for the available leagues for each module.

Returns:

  • Pandas DataFrame – DataFrame may be empty if the league has no tables. Otherwise, the league table.

  • tuple – If the league has multiple tables (e.g. Champions League, Liga MX, MLS) then a tuple of DataFrames will be returned.

scrape_match(link)

Scrapes an FBRef match page.

Parameters:

link (str) – URL to the FBRef match page

Returns:

DataFrame containing most parts of the match page if they’re available (e.g. formations, lineups, scores, player stats, etc.). The fields that are available vary by competition and year.

Return type:

Pandas DataFrame
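
A sketch of scraping a single match page; the match URL below is a hypothetical placeholder.

from ScraperFC.FBRef import FBRef

fbref = FBRef()
try:
    # Hypothetical placeholder; substitute any real FBRef match page URL,
    # e.g. one of the links returned by the match-link getter documented above.
    match_link = "https://fbref.com/en/matches/..."
    match = fbref.scrape_match(match_link)
finally:
    fbref.close()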

scrape_matches(year, league, save=False)

Scrapes all matches of the chosen league season.

Works by gathering all of the match URLs from the homepage of the chosen league season on FBRef and then calling scrape_match() on each one.

Parameters:
  • year (int) – Calendar year that the season ends in (e.g. 2023 for the 2022/23 season)

  • league (str) – League. Look in shared_functions.py for the available leagues for each module.

  • save (bool) – OPTIONAL, default is False. If True, will save the returned DataFrame to a CSV file.

Returns:

  • Pandas DataFrame – If save is False, will return the Pandas DataFrame with the stats.

  • filename (str) – If save is True, will return the filename the CSV was saved to.

scrape_stats(year, league, stat_category, normalize=False)

Scrapes a single stats category

Adds team and player ID columns to the stats tables

Parameters:
  • year (int) – Calendar year that the season ends in (e.g. 2023 for the 2022/23 season)

  • league (str) – League. Look in shared_functions.py for the available leagues for each module.

  • stat_category (str) – The stat category to scrape.

  • normalize (bool) – OPTIONAL, default is False. If True, will normalize all stats to Per90.

Returns:

tuple of 3 Pandas DataFrames, (squad_stats, opponent_stats, player_stats).

Return type:

tuple
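
A sketch of unpacking the tuple returned by scrape_stats(); the “standard” category name and the “EPL” league key are assumptions.

from ScraperFC.FBRef import FBRef

fbref = FBRef()
try:
    # "standard" is an assumed stat category name; check the FBRef module for the valid options.
    squad, opponent, players = fbref.scrape_stats(
        year=2023, league="EPL", stat_category="standard"
    )
finally:
    fbref.close()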

FiveThirtyEight

class ScraperFC.FiveThirtyEight.FiveThirtyEight

Bases: object

close()

Closes and quits the Selenium WebDriver instance.

scrape_matches(year, league, save=False)

Scrapes matches for the given league season

Parameters:
  • year (int) – Calendar year that the season ends in (e.g. 2023 for the 2022/23 season)

  • league (str) – League. Look in shared_functions.py for the available leagues for each module.

  • save (bool) – OPTIONAL, default is False. If True, output will be saved to a CSV file.

Returns:

  • Pandas DataFrame – If save=False, FiveThirtyEight stats for all matches of the given league season

  • filename (str) – If save=True, filename of the CSV that the stats were saved to

up_season(string)

Increments a calendar-year string by one year.

Parameters:

string (str) – String of a calendar year (e.g. “2022”)

Returns:

Incremented calendar year

Return type:

str
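
A minimal sketch, assuming “EPL” is a valid league key for the FiveThirtyEight module:

from ScraperFC.FiveThirtyEight import FiveThirtyEight

fte = FiveThirtyEight()
try:
    # "EPL" is an assumed league key; see shared_functions.py for valid keys.
    matches = fte.scrape_matches(year=2023, league="EPL", save=False)
finally:
    fte.close()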

SofaScore

Transfermarkt

class ScraperFC.Transfermarkt.Transfermarkt

Bases: object

close()

Closes and quits the Selenium WebDriver instance.

Gathers all Transfermarkt club URLs for the chosen league season.

Parameters:
  • year (int) – Calendar year that the season ends in (e.g. 2023 for the 2022/23 season)

  • league (str) – League. Look in shared_functions.py for the available leagues for each module.

Returns:

List of the club URLs

Return type:

list

Gathers all Transfermarkt player URLs for the chosen league season.

Parameters:
  • year (int) – Calendar year that the season ends in (e.g. 2023 for the 2022/23 season)

  • league (str) – League. Look in shared_functions.py for the available leagues for each module.

Returns:

List of the player URLs

Return type:

list

get_players(year, league)

Gathers all player info for the chosen league season.

Parameters:
  • year (int) – Calendar year that the season ends in (e.g. 2023 for the 2022/23 season)

  • league (str) – League. Look in shared_functions.py for the available leagues for each module.

Returns:

Each row is a player and contains some of the information from their Transfermarkt player profile.

Return type:

Pandas DataFrame
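
A minimal sketch of pulling the player table, assuming “EPL” is a valid league key for the Transfermarkt module:

from ScraperFC.Transfermarkt import Transfermarkt

tm = Transfermarkt()
try:
    # "EPL" is an assumed league key; see shared_functions.py for valid keys.
    players = tm.get_players(year=2023, league="EPL")
finally:
    tm.close()
print(players.head())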

class ScraperFC.Transfermarkt.TransfermarktPlayer(url)

Bases: object

Class to represent Transfermarkt player profiles.

Initialize with the URL to a player’s Transfermarkt profile page.

Understat

class ScraperFC.Understat.Understat

Bases: object

close()

Closes and quits the Selenium WebDriver instance.

Gets all of the match links for the chosen league season

Parameters:
  • year (int) – Calendar year that the season ends in (e.g. 2023 for the 2022/23 season)

  • league (str) – League. Look in shared_functions.py for the available leagues for each module.

Returns:

List of match links of the chosen league season

Return type:

list

Gets URL of the chosen league season.

Parameters:
  • year (int) – Calendar year that the season ends in (e.g. 2023 for the 2022/23 season)

  • league (str) – League. Look in shared_functions.py for the available leagues for each module.

Returns:

URL to the Understat page of the chosen league season.

Return type:

str

Gets all of the team links for the chosen league season

Parameters:
  • year (int) – Calendar year that the season ends in (e.g. 2023 for the 2022/23 season)

  • league (str) – League. Look in shared_functions.py for the available leagues for each module.

Returns:

List of team URLs from the chosen season.

Return type:

list

remove_diff(string)

Removes the plus/minus from some stats like xG.

Parameters:

string (str) – The string to remove the difference from

Returns:

String passed in as arg with the difference removed

Return type:

str

scrape_attack_speeds(year, league)

Scrapes the attack speeds for each team in the year and league

Parameters:
  • year (int) – Calendar year that the season ends in (e.g. 2023 for the 2022/23 season)

  • league (str) – League. Look in shared_functions.py for the available leagues for each module.

Returns:

DataFrame containing the attack speeds of each team

Return type:

Pandas DataFrame

scrape_formations(year, league)

Scrapes the stats for each team in the year and league, broken down by formation used by the team.

Parameters:
  • year (int) – Calendar year that the season ends in (e.g. 2023 for the 2022/23 season)

  • league (str) – League. Look in shared_functions.py for the available leagues for each module.

Returns:

Keys are team names. Values are dicts keyed by the formations used, with that formation’s stats as the values.

Return type:

dict

scrape_game_states(year, league)

Scrapes the game states for each team in the year and league

Parameters:
  • year (int) – Calendar year that the season ends in (e.g. 2023 for the 2022/23 season)

  • league (str) – League. Look in shared_functions.py for the available leagues for each module.

Returns:

DataFrame containing the game states

Return type:

Pandas DataFrame

scrape_home_away_tables(year, league, normalize=False)

Scrapes the home and away league tables for the chosen league season.

Parameters:
  • year (int) – Calendar year that the season ends in (e.g. 2023 for the 2022/23 season)

  • league (str) – League. Look in shared_functions.py for the available leagues for each module.

  • normalize (bool) – OPTIONAL, default False. If True, normalizes stats to per90

Returns:

  • home (Pandas DataFrame) – Home league table

  • away (Pandas DataFrame) – Away league table

scrape_league_table(year, league, normalize=False)

Scrapes the league table for the chosen league season.

Parameters:
  • year (int) – Calendar year that the season ends in (e.g. 2023 for the 2022/23 season)

  • league (str) – League. Look in shared_functions.py for the available leagues for each module.

  • normalize (bool) – OPTIONAL, default False. If True, normalizes stats to per90

Returns:

The league table of the chosen league season.

Return type:

Pandas DataFrame
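
A sketch of scraping the overall and home/away tables, assuming “EPL” is a valid league key for the Understat module:

from ScraperFC.Understat import Understat

understat = Understat()
try:
    # "EPL" is an assumed league key; see shared_functions.py for valid keys.
    table = understat.scrape_league_table(year=2023, league="EPL", normalize=False)
    home, away = understat.scrape_home_away_tables(year=2023, league="EPL")
finally:
    understat.close()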

scrape_match(link)

Scrapes a single match from Understat.

Parameters:

link (str) – URL to the match

Returns:

match – The match stats

Return type:

Pandas DataFrame

scrape_matches(year, league, save=False)

Scrapes all of the matches from the chosen league season.

Gathers all match links from the chosen league season and then calls scrape_match() on each one.

Parameters:
  • year (int) – Calendar year that the season ends in (e.g. 2023 for the 2022/23 season)

  • league (str) – League. Look in shared_functions.py for the available leagues for each module.

  • save (bool) – OPTIONAL, default False. If True, saves the DataFrame of match stats to a CSV.

Returns:

  • matches (Pandas DataFrame) – If save=False

  • filename (str) – If save=True, the filename the DataFrame was saved to

scrape_shot_results(year, league)

Scrapes the shot results for each team in the year and league

Parameters:
  • year (int) – Calendar year that the season ends in (e.g. 2023 for the 2022/23 season)

  • league (str) – League. Look in shared_functions.py for the available leagues for each module.

Returns:

DataFrame containing the shot results data

Return type:

Pandas DataFrame

scrape_shot_xy(year, league, save=False, format='json')

Scrapes the info for every shot in the league and year.

Parameters:
  • year (int) – Calendar year that the season ends in (e.g. 2023 for the 2022/23 season)

  • league (str) – League. Look in shared_functions.py for the available leagues for each module.

  • save (bool) – OPTIONAL, default is False. If True, shot XY’s will be saved to a JSON file.

  • format (str) – OPTIONAL, default is “json”. Format of the output. Options are “json” and “dataframe”

Returns:

Dict if save=False and format=”json”. Pandas DataFrame if save=False and format=”dataframe”. Str (the filename) if save=True; the filetype is determined by the format argument.

Return type:

dict, Pandas DataFrame, or str
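
A sketch showing both output formats, assuming “EPL” is a valid league key:

from ScraperFC.Understat import Understat

understat = Understat()
try:
    # Shot locations as a DataFrame in memory
    shots_df = understat.scrape_shot_xy(year=2023, league="EPL", format="dataframe")
    # Or saved to a JSON file; the filename is returned instead
    filename = understat.scrape_shot_xy(year=2023, league="EPL", save=True, format="json")
finally:
    understat.close()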

scrape_shot_zones(year, league)

Scrapes the shot zones for each team in the year and league

Parameters:
  • year (int) – Calendar year that the season ends in (e.g. 2023 for the 2022/23 season)

  • league (str) – League. Look in shared_functions.py for the available leagues for each module.

Returns:

DataFrame containing the shot zones data

Return type:

Pandas DataFrame

scrape_situations(year, league)

Scrapes the situations leading to shots for each team in the chosen league season.

Parameters:
  • year (int) – Calendar year that the season ends in (e.g. 2023 for the 2022/23 season)

  • league (str) – League. Look in shared_functions.py for the available leagues for each module.

Returns:

DataFrame containing the situations

Return type:

Pandas DataFrame

scrape_timing(year, league)

Scrapes the timing of goals for each team in the year and league

Parameters:
  • year (int) – Calendar year that the season ends in (e.g. 2023 for the 2022/23 season)

  • league (str) – League. Look in shared_functions.py for the available leagues for each module.

Returns:

DataFrame containing the timing stats

Return type:

Pandas DataFrame

unhide_stats(columns)

Understat doesn’t display all stats by default.

This function uses the stats currently shown in the table columns to unhide stats that aren’t being displayed.

Parameters:

columns (Pandas DataFrame.columns) – The columns currently shown in the table being scraped

Return type:

None

shared_functions

exception ScraperFC.shared_functions.InvalidCurrencyException

Bases: Exception

Raised when an invalid currency is used with the Capology module.

exception ScraperFC.shared_functions.InvalidLeagueException(league, source, source_comp_info)

Bases: Exception

Raised when an invalid league is found

exception ScraperFC.shared_functions.InvalidSourceException(source, source_comp_info)

Bases: Exception

Raised when an invalid source is found

exception ScraperFC.shared_functions.InvalidYearException(year, league, source, source_comp_info)

Bases: Exception

Raised when an invalid year is found

exception ScraperFC.shared_functions.NoMatchLinksException(fixtures_url, year, league)

Bases: Exception

Raised when no match links are found

exception ScraperFC.shared_functions.UnavailableSeasonException(year, league, source)

Bases: Exception

Raised when a given year and league is unavailable from a source.

ScraperFC.shared_functions.get_proxy()

Gets a proxy address.

Can be used to initialize a Selenium WebDriver to change the address of the browser. Adapted from https://stackoverflow.com/questions/59409418/how-to-rotate-selenium-webrowser-ip-address. Randomly chooses one proxy.

Returns:

proxy – In the form <IP address>:<port>

Return type:

str
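
A sketch of wiring the returned proxy into a Chrome WebDriver; this uses standard Selenium options and is not specific to ScraperFC.

from selenium import webdriver
from ScraperFC.shared_functions import get_proxy

proxy = get_proxy()  # e.g. "123.45.67.89:8080"
options = webdriver.ChromeOptions()
options.add_argument(f"--proxy-server={proxy}")
driver = webdriver.Chrome(options=options)
# ... use the driver ...
driver.quit()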

ScraperFC.shared_functions.get_source_comp_info(year, league, source)

Checks to make sure that the given league season is a valid season for the scraper and returns the source_comp_info dict for use in the modules.

Parameters:
  • year (int) – Calendar year that the season ends in (e.g. 2023 for the 2022/23 season)

  • league (str) – League. Look in shared_functions.py for the available leagues for each module.

  • source (str) – The scraper to be checked (e.g. “FBRef”, “Transfermarkt”, etc.). These are the ScraperFC modules.

Returns:

source_comp_info – Dict containing all of the competition info for all of the sources. Used for different purposes in each module.

Return type:

dict
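
A sketch of validating a league season before scraping; invalid combinations presumably raise the exceptions documented above (e.g. InvalidLeagueException, InvalidYearException), and the “EPL” league key is an assumption.

from ScraperFC.shared_functions import get_source_comp_info

# Returns the competition-info dict if the combination is valid; otherwise
# one of the shared_functions exceptions is raised.
source_comp_info = get_source_comp_info(year=2023, league="EPL", source="FBRef")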

ScraperFC.shared_functions.xpath_soup(element)

Generate xpath from BeautifulSoup4 element.

I shamelessly stole this from https://gist.github.com/ergoithz/6cf043e3fdedd1b94fcf.

Parameters:

element (bs4.element.Tag or bs4.element.NavigableString) – BeautifulSoup4 element.

Returns:

xpath as string

Return type:

str

Usage:

>>> import bs4
>>> html = (
...     '<html><head><title>title</title></head>'
...     '<body><p>p <i>1</i></p><p>p <i>2</i></p></body></html>'
... )
>>> soup = bs4.BeautifulSoup(html, 'html.parser')
>>> xpath_soup(soup.html.body.p.i)
'/html/body/p[1]/i'

>>> import bs4
>>> xml = '<doc><elm/><elm/></doc>'
>>> soup = bs4.BeautifulSoup(xml, 'lxml-xml')
>>> xpath_soup(soup.doc.elm.next_sibling)
'/doc/elm[2]'