Predicting the FIFA 2018 World Cup finals with Data Analytics


Using statistics to predict the outcomes of sporting matches has always been popular with everyone from fans to betting punters. Machine learning lets us examine large sets of data and identify patterns.

With the FIFA 2018 World Cup now into the Round of 16 after some surprises in the pool games, we can try to predict the outcome of the finals and see who is most likely to win.

Football Data

Sports analytics has historically been most active in baseball, as the sheer number of metrics collected shows (just look at how many are listed here!). Baseball is simpler to analyse and record given the nature of the game: it is a striking/fielding game made up of isolated pitch and bat events. Football, however, is classed as an invasion game, which is much more difficult to analyse because of the chaotic nature of attack versus defence. The amount of data published for football has been fairly limited, typically including things like team line-ups, goals scored and penalties given.

Data Sets

Fortunately, data sets have been made available:

Team Ratings

There are two primary sources of team ratings: FIFA's official rating and the Elo rating. The Elo rating system was created by Arpad Elo, a Hungarian-American professor, to rank chess players (and players of other zero-sum games). It has since been adapted to football by incorporating additional data such as goal margin and match importance. FIFA uses a modified version of the Elo rating system to rank women’s international teams. The men’s international teams use a different rating system, although FIFA announced on the 10th of June that it is switching to an Elo-based system. These ratings are more of a ranking device than a predictor: they order teams by their strength relative to one another.
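To make the Elo idea concrete, here is a minimal sketch of the classic expected-score and update formulas. It is not FIFA's or any football site's exact method: the 400-point scale is the conventional chess value, and the function names and the default weight `k` are assumptions made for illustration. Football variants typically scale `k` by match importance and goal margin.

```python
def expected_score(rating_a, rating_b):
    """Expected score for team A against team B (1 = win, 0.5 = draw, 0 = loss)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_ratings(rating_a, rating_b, score_a, k=20.0):
    """Return both teams' new ratings after one match.

    score_a is A's actual result (1, 0.5 or 0). k is an illustrative weight;
    a football variant would raise it for important matches or big margins.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # Zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


# Two evenly rated sides are each expected to score 0.5.
even = expected_score(1500, 1500)
# A side rated 400 points higher is expected to score about 0.91.
favourite = expected_score(1900, 1500)
# An even match won by A moves k/2 points from B to A.
new_a, new_b = update_ratings(1500, 1500, 1.0)
```

Because the update is zero-sum, the total number of rating points stays constant, which is why Elo works as a ranking of relative strength rather than an absolute measure.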

Exploring the data

We loaded some of these data sets into Power BI to explore them. Click the full-screen arrows to maximize the report and navigate the data.

We can then feed some of this data into a model to see what the results of the upcoming finals might be.

Predicting the Finals

A really good article appeared in the Economist a few weeks ago, highlighting the difficulty of predicting football outcomes. Most models created before the World Cup started predicted that Germany would be the overall winner. Unfortunately, chaos intervened and Germany was knocked out in the pool rounds (the first time in 80 years!). Likewise, no model could take into account things like the Spanish team's manager being fired early on, or the match fitness of Egypt's star player Mo Salah.

With the pool games over, though, we know who has made it through to the Round of 16.

Teams making it to Round of 16
| Team | Place |
| --- | --- |
| Uruguay | Group A winner |
| Russia | Group A runner-up |
| Spain | Group B winner |
| Portugal | Group B runner-up |
| France | Group C winner |
| Denmark | Group C runner-up |
| Croatia | Group D winner |
| Argentina | Group D runner-up |
| Brazil | Group E winner |
| Switzerland | Group E runner-up |
| Sweden | Group F winner |
| Mexico | Group F runner-up |
| Belgium | Group G winner |
| England | Group G runner-up |
| Colombia | Group H winner |
| Japan | Group H runner-up |


Using open source Python libraries, we can train a simple logistic regression model on the prior data, encompassing past performance, FIFA rating data and performance to date at this year's World Cup. This gives us each team's probability of winning its Round of 16 match, with which we can simulate the rest of the bracket and determine who might be the winner. We can then present the results using Microsoft Power BI.
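As a rough illustration of the modelling step, the sketch below fits a one-feature logistic regression by gradient descent using only the standard library. It is not the model behind this article: the training data is synthetic, the single feature (a scaled rating difference) and all function names are assumptions, and in practice a library such as scikit-learn would do the fitting on real match data with more features.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.5, epochs=1000):
    """Fit P(win) = sigmoid(w * x) by batch gradient descent on the log-loss.

    There is deliberately no intercept, so swapping the two teams
    (x -> -x) turns the probability p into 1 - p.
    """
    w = 0.0
    n = len(xs)
    for _ in range(epochs):
        grad = sum((sigmoid(w * x) - y) * x for x, y in zip(xs, ys)) / n
        w -= lr * grad
    return w


def win_probability(w, rating_diff):
    """Probability that the first team wins, given the fitted weight."""
    return sigmoid(w * rating_diff)


# SYNTHETIC data for illustration only: x is a made-up scaled rating
# difference, y is 1 if the first team won. No real matches are used.
rng = random.Random(0)
TRUE_W = 1.5  # hypothetical "true" effect used to generate the toy data
xs = [rng.uniform(-2.0, 2.0) for _ in range(400)]
ys = [1 if rng.random() < sigmoid(TRUE_W * x) else 0 for x in xs]

w = fit_logistic(xs, ys)
p_stronger = win_probability(w, 1.0)  # first team one unit stronger
```

Dropping the intercept is a design choice for neutral-venue knockout games, where the order in which the two teams are listed should not matter.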

Based on the simulated bracket, Brazil is set to win.
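The bracket simulation mentioned above can be sketched as a small Monte Carlo loop. The pairings follow from the table of qualifiers (each group winner meets the runner-up of its paired group), listed in bracket order. The function names are hypothetical, and the `coin_flip` probability function is a placeholder standing in for a trained model, so this sketch does not reproduce the prediction above.

```python
import random
from collections import Counter

# Round of 16 pairings in bracket order: winners of adjacent
# matches meet in the next round.
ROUND_OF_16 = [
    ("Uruguay", "Portugal"), ("France", "Argentina"),
    ("Brazil", "Mexico"), ("Belgium", "Japan"),
    ("Spain", "Russia"), ("Croatia", "Denmark"),
    ("Sweden", "Switzerland"), ("Colombia", "England"),
]


def play_knockout(teams, win_prob, rng):
    """Play single-elimination rounds until one team remains."""
    while len(teams) > 1:
        next_round = []
        for a, b in zip(teams[0::2], teams[1::2]):
            # win_prob(a, b) is the probability that a beats b.
            next_round.append(a if rng.random() < win_prob(a, b) else b)
        teams = next_round
    return teams[0]


def title_odds(bracket, win_prob, n_sims=10000, seed=0):
    """Estimate each team's chance of winning the tournament."""
    rng = random.Random(seed)
    flat = [team for pair in bracket for team in pair]
    wins = Counter(play_knockout(flat, win_prob, rng) for _ in range(n_sims))
    return {team: wins[team] / n_sims for team in flat}


def coin_flip(team_a, team_b):
    """Placeholder model: every match is a 50/50 toss-up."""
    return 0.5


odds = title_odds(ROUND_OF_16, coin_flip)
```

Swapping `coin_flip` for a function that returns model-based probabilities for each pair of teams turns the same loop into an estimate of each team's chance of lifting the trophy.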

Further Information

This is just one example of using data to explore real-world problems. Feel free to contact us if you would like to discuss how we can help you explore your data.

Further reading on football prediction

If it is of interest, the following is some good reading on football prediction: