Predicting The FIFA World Cup 2022 With a Simple Model using Python


And the winner is…

Image via Shutterstock under license to Frank Andrade (edited with Canva)

Many people (including me) call soccer “the unpredictable game” because a soccer match has many factors that can change the final score.

That’s true … to some extent.

It’s hard to predict the final score or the winner of a match, but that’s not the case when it comes to predicting the winner of a competition. Over the past 5 years, Bayern Munich has won every Bundesliga title, while Manchester City has won 4 Premier League titles.

Coincidence? I don’t think so.

In fact, in the middle of the 20–21 season, I created a model to predict the winner of the Premier League, La Liga, Serie A, and Bundesliga, and it successfully predicted the winner of all of them.

That prediction wasn’t so hard to make since 19 matches had already been played at that point. Now I’m running the same model to predict the World Cup 2022.

Here’s how I predicted the World Cup using Python (for more details about the code, check my 1-hour video tutorial).

How are we going to predict the matches?

There are different ways to make predictions. I could build a fancy machine learning model and feed it multiple variables, but after reading some papers I decided to give the Poisson distribution a chance.

Why? Well, let’s take a look at the definition of the Poisson distribution.

The Poisson distribution is a discrete probability distribution that describes the number of events occurring in a fixed time interval or region of opportunity.

If we think of a goal as an event that might happen in the 90 minutes of a soccer match, we can calculate the probability of the number of goals that could be scored in a match by Team A and Team B.

But that’s not enough. We still need to meet the assumptions of the Poisson distribution.

1. The number of events can be counted (a match can have 1, 2, 3 or more goals)
2. The occurrence of events is independent (the occurrence of one goal should not affect the probability of another goal)
3. The rate at which events occur is constant (the probability of a goal occurring in a certain time interval should be exactly the same for every other time interval of the same length)
4. Two events cannot occur at exactly the same instant in time (two goals can’t occur at the same time)

Certainly assumptions 1 and 4 are met, but 2 and 3 are only partly true. That said, let’s assume that assumptions 2 and 3 always hold.

When I predicted the winners of the top European leagues, I plotted the histogram of the number of goals in every match over the past 5 years for the top 4 leagues.

Histogram of the number of goals in the 4 leagues

If you have a look at the fit curve of any league, it looks like the Poisson distribution.

Now we can say that it’s possible to use the Poisson distribution to calculate the probability of the number of goals that could be scored in a match.
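A comparison like the one behind that histogram can be sketched in a few lines. The goal counts below are made up for illustration (they are not the league data), and poisson_pmf is a small plain-Python helper written for this sketch; scipy.stats.poisson.pmf computes the same quantity.

```python
import math
from collections import Counter

def poisson_pmf(x, lam):
    """Poisson probability of exactly x events at average rate lam."""
    return lam ** x * math.exp(-lam) / math.factorial(x)

# Hypothetical goals-per-match sample, for illustration only (not real league data)
goals_per_match = [0, 1, 1, 2, 2, 2, 3, 3, 4, 1, 2, 0, 3, 5, 2, 1]

# The sample mean is used as the Poisson rate
lam = sum(goals_per_match) / len(goals_per_match)
counts = Counter(goals_per_match)

rows = []
for k in range(max(goals_per_match) + 1):
    observed = counts[k] / len(goals_per_match)  # share of matches with k goals
    expected = poisson_pmf(k, lam)               # share predicted by Poisson
    rows.append((k, observed, expected))
    print(f"{k} goals: observed {observed:.3f}, Poisson {expected:.3f}")
```

If the observed shares track the Poisson column closely, the distribution is a reasonable choice for modeling goals.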

Here’s the formula of the Poisson distribution.

$$P(x) = \frac{\lambda^{x} e^{-\lambda}}{x!}$$

To make the predictions I considered:

lambda: average number of goals in 90 minutes (Team A and Team B)
x: number of goals in a match that could be scored by Team A and Team B
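As a quick illustration of how lambda and x work together, here is a minimal sketch. The value lam = 1.5 is a hypothetical average, and poisson_pmf is a plain-Python helper written for this sketch; scipy.stats.poisson.pmf returns the same values.

```python
import math

def poisson_pmf(x, lam):
    """Probability of exactly x goals when the average is lam goals per match."""
    return lam ** x * math.exp(-lam) / math.factorial(x)

# Hypothetical lambda: a team that averages 1.5 goals in 90 minutes
lam = 1.5
for goals in range(5):
    print(f"P({goals} goals) = {poisson_pmf(goals, lam):.4f}")
```

With an average of 1.5, scoring exactly 1 goal is the most likely outcome, and the probabilities shrink quickly as x grows.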

To calculate lambda, we need the average goals scored/conceded by each national team. This leads us to the next point.

Goals scored/conceded by every national team

After collecting data from all the World Cup matches played from 1930 to 2018, I could calculate the average goals scored and conceded by each national team.
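The averaging step can be sketched as follows. The match list and team names are hypothetical, and the result is kept in a plain dictionary to stay dependency-free, whereas the code in this article reads the same two values from a DataFrame called df_team_strength.

```python
from collections import defaultdict

# Hypothetical results: (home team, away team, home goals, away goals)
matches = [
    ("Team A", "Team B", 2, 1),
    ("Team B", "Team C", 0, 0),
    ("Team C", "Team A", 1, 3),
]

# Running totals per team
totals = defaultdict(lambda: {"scored": 0, "conceded": 0, "played": 0})
for home, away, home_goals, away_goals in matches:
    totals[home]["scored"] += home_goals
    totals[home]["conceded"] += away_goals
    totals[home]["played"] += 1
    totals[away]["scored"] += away_goals
    totals[away]["conceded"] += home_goals
    totals[away]["played"] += 1

# Average goals scored and conceded per match for every team
team_strength = {
    team: {
        "GoalsScored": t["scored"] / t["played"],
        "GoalsConceded": t["conceded"] / t["played"],
    }
    for team, t in totals.items()
}
print(team_strength["Team A"])
```

Every match contributes to both teams, which is why no home/away split is needed here.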

In the prediction I made for the top 4 European leagues, I considered the home/away factor, but since in the World Cup almost all teams play in a neutral stadium, I didn’t consider that factor for this analysis.

Once I had the goals scored/conceded by every national team, I created a function that predicted the number of points each team would get in the group stage.

Below is the code I used to predict the number of points each national team would get in the group stage. It looks intimidating, but it’s just the things I’ve mentioned so far translated into code.

from scipy.stats import poisson

# df_team_strength is a DataFrame indexed by team name with the columns
# 'GoalsScored' and 'GoalsConceded' (average goals per match)
def predict_points(home, away):
    if home in df_team_strength.index and away in df_team_strength.index:
        # lambda for each team: its goals scored * the opponent's goals conceded
        lamb_home = df_team_strength.at[home,'GoalsScored'] * df_team_strength.at[away,'GoalsConceded']
        lamb_away = df_team_strength.at[away,'GoalsScored'] * df_team_strength.at[home,'GoalsConceded']
        prob_home, prob_away, prob_draw = 0, 0, 0
        for x in range(0, 11): # number of goals home team
            for y in range(0, 11): # number of goals away team
                p = poisson.pmf(x, lamb_home) * poisson.pmf(y, lamb_away)
                if x == y:
                    prob_draw += p
                elif x > y:
                    prob_home += p
                else:
                    prob_away += p

        points_home = 3 * prob_home + prob_draw
        points_away = 3 * prob_away + prob_draw
        return (points_home, points_away)
    else:
        # at least one team has no data
        return (0, 0)

In plain English, predict_points calculates how many points the home and away teams would get. To do so, I calculated lambda for each team using the formula average_goals_scored * average_goals_conceded, where the goals conceded are those of the opponent.

Then I simulated all the possible scores of a match from 0–0 to 10–10 (that last score is just the limit of my range of goals). Once I have lambda and x, I use the formula of the Poisson distribution to calculate p.
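To see why stopping at 10 goals is harmless, here is a minimal check of how much probability that limit leaves out. The value lam = 2.0 is a hypothetical lambda, and poisson_pmf is a plain-Python helper written for this sketch.

```python
import math

def poisson_pmf(x, lam):
    """Poisson probability of exactly x goals at average rate lam."""
    return lam ** x * math.exp(-lam) / math.factorial(x)

lam = 2.0  # hypothetical lambda for one team

# Probability covered by the scores 0 to 10, and what falls outside that range
covered = sum(poisson_pmf(goals, lam) for goals in range(0, 11))
missing = 1 - covered
print(f"probability of more than 10 goals: {missing:.2e}")
```

For lambdas in the range of typical soccer scores the missing probability is tiny, so the 0 to 10 grid captures practically every realistic result.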

The variables prob_home, prob_draw, and prob_away accumulate the value of p if, say, the match ends in 1–0 (home wins), 1–1 (draw), or 0–1 (away wins) respectively. Finally, the points are calculated with the formula below.

points_home = 3 * prob_home + prob_draw
points_away = 3 * prob_away + prob_draw
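The whole calculation can be checked on a single match with a minimal sketch. The lambdas 1.5 and 1.0 are hypothetical, and outcome_probabilities and poisson_pmf are helper names made up for this example, not part of the original code.

```python
import math

def poisson_pmf(x, lam):
    """Poisson probability of exactly x goals at average rate lam."""
    return lam ** x * math.exp(-lam) / math.factorial(x)

def outcome_probabilities(lamb_home, lamb_away, max_goals=10):
    """Add up the probability of every score from 0-0 to max_goals-max_goals."""
    prob_home = prob_draw = prob_away = 0.0
    for x in range(max_goals + 1):       # goals by the home team
        for y in range(max_goals + 1):   # goals by the away team
            p = poisson_pmf(x, lamb_home) * poisson_pmf(y, lamb_away)
            if x == y:
                prob_draw += p
            elif x > y:
                prob_home += p
            else:
                prob_away += p
    return prob_home, prob_draw, prob_away

# Hypothetical lambdas for the home and away team
prob_home, prob_draw, prob_away = outcome_probabilities(1.5, 1.0)
points_home = 3 * prob_home + prob_draw
points_away = 3 * prob_away + prob_draw
print(round(points_home, 2), round(points_away, 2))
```

Because a win hands out 3 points in total and a draw only 2, the two values always add up to roughly 3 minus the draw probability.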

If we use predict_points to predict the match England vs United States, we’ll get this.

>>> predict_points('England', 'United States')
(2.2356147635326007, 0.5922397535606193)

This means that England would get about 2.24 points, while the USA would get 0.59. I get decimals because I’m using probabilities.

If we apply this predict_points function to all the matches in the group stage, we’ll get the 1st and 2nd place of each group, and thus the matchups in the knockouts.
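Building a group table from those expected points could look like the sketch below. The four teams and their strengths are hypothetical, and this predict_points is a simplified stand-in that reads from a dictionary instead of df_team_strength.

```python
import math
from itertools import combinations

def poisson_pmf(x, lam):
    """Poisson probability of exactly x goals at average rate lam."""
    return lam ** x * math.exp(-lam) / math.factorial(x)

# Hypothetical strengths: (average goals scored, average goals conceded)
strength = {
    "Team A": (2.0, 0.8),
    "Team B": (1.5, 1.0),
    "Team C": (1.0, 1.4),
    "Team D": (0.8, 1.8),
}

def predict_points(home, away):
    """Expected points of both teams, same logic as in the article."""
    lamb_home = strength[home][0] * strength[away][1]
    lamb_away = strength[away][0] * strength[home][1]
    prob_home = prob_draw = prob_away = 0.0
    for x in range(11):
        for y in range(11):
            p = poisson_pmf(x, lamb_home) * poisson_pmf(y, lamb_away)
            if x == y:
                prob_draw += p
            elif x > y:
                prob_home += p
            else:
                prob_away += p
    return 3 * prob_home + prob_draw, 3 * prob_away + prob_draw

# Every team plays every other team in the group once
table = {team: 0.0 for team in strength}
for home, away in combinations(strength, 2):
    points_home, points_away = predict_points(home, away)
    table[home] += points_home
    table[away] += points_away

# Sort by expected points; the top two go to the knockouts
standings = sorted(table, key=table.get, reverse=True)
print(standings[:2])
```

The first two entries of standings are the teams that would advance from this made-up group.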

For the knockouts, I don’t need to predict the points, but the winner of each bracket. This is why I created a new get_winner function based on the previous predict_points function.

def get_winner(df_fixture_updated):
    for index, row in df_fixture_updated.iterrows():
        home, away = row['home'], row['away']
        points_home, points_away = predict_points(home, away)
        # the team with more expected points advances (a tie goes to the away team)
        if points_home > points_away:
            winner = home
        else:
            winner = away
        df_fixture_updated.loc[index, 'winner'] = winner
    return df_fixture_updated

To put it simply, if points_home is greater than points_away the winner is the home team; otherwise, the winner is the away team.

Thanks to the get_winner function, I can get the results of the previous brackets.
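Feeding the winners of one round into the next can be sketched like this. The ratings, team names, and toy_predict_points are hypothetical stand-ins (a simple ratio, not the Poisson model), used only so the example runs on its own; play_round applies the same comparison rule as get_winner.

```python
# Hypothetical ratings standing in for the Poisson-based expected points
ratings = {"Team A": 2.0, "Team B": 1.5, "Team C": 1.0, "Team D": 0.8}

def toy_predict_points(home, away):
    """Split 3 points in proportion to the ratings (illustration only)."""
    total = ratings[home] + ratings[away]
    return 3 * ratings[home] / total, 3 * ratings[away] / total

def play_round(fixtures, predict_points):
    """Return the winner of every fixture, using the same rule as get_winner."""
    winners = []
    for home, away in fixtures:
        points_home, points_away = predict_points(home, away)
        winners.append(home if points_home > points_away else away)
    return winners

def pair_up(teams):
    """Pair consecutive winners to build the fixtures of the next round."""
    return list(zip(teams[0::2], teams[1::2]))

semi_finals = [("Team A", "Team D"), ("Team C", "Team B")]
finalists = play_round(semi_finals, toy_predict_points)
final = pair_up(finalists)
champion = play_round(final, toy_predict_points)[0]
print(finalists, champion)
```

Repeating play_round and pair_up until one team is left mirrors calling get_winner once per knockout round.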

If I use get_winner again I can predict the winner of the World Cup. Here’s the final result!!

By running the function one more time, I get that the winner is …

Brazil!

That’s it! That’s how I predicted the World Cup 2022 using Python and the Poisson distribution. To see the complete code, check my GitHub. You can also check my Medium list to see all the articles related to this Python project.