"If football were a formal language, it would be the most popular in the world..."
In this work, I combine two of my biggest passions, football and data science, and try to model the language we love and use every day: the language of football. Leveraging state-of-the-art algorithms and StatsBomb data, I use the model to describe actions and players and to understand their place in the football semantic space. This is a first step in ongoing research into modeling this world. The whole process was pure fun, and I hope you’ll enjoy reading it.
It starts with football event data
made openly available by StatsBomb
Each event is a row within a DataFrame
> 3M Events
~ 900 Matches
Events are interpreted
& encoded into words
Action type: < Pass >
Location: (3, 4)
Action attributes: diagonal | on-ground | medium length
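The encoding step above can be sketched roughly as follows. This is a minimal illustration and not the project's actual encoder: the 5 x 5 location grid, the direction and length thresholds, the token format, and the event field names (`type`, `location`, `end_location`, `height`) are all assumptions made for the example. Only the 120 x 80 pitch coordinate system comes from StatsBomb's data specification.

```python
import math

# Assumptions (not from the source): a 5 x 5 location grid and the
# direction/length thresholds below. The 120 x 80 pitch is StatsBomb's
# coordinate system.
PITCH_LENGTH, PITCH_WIDTH = 120.0, 80.0
GRID_X, GRID_Y = 5, 5


def location_bin(x, y):
    """Map pitch coordinates to a 1-based (column, row) grid cell."""
    col = min(int(x / PITCH_LENGTH * GRID_X), GRID_X - 1) + 1
    row = min(int(y / PITCH_WIDTH * GRID_Y), GRID_Y - 1) + 1
    return col, row


def pass_attributes(start, end, height):
    """Describe a pass by direction, height and length (hypothetical buckets)."""
    dx, dy = end[0] - start[0], end[1] - start[1]
    length = math.hypot(dx, dy)
    angle = abs(math.degrees(math.atan2(dy, dx)))  # 0 = straight forward
    if angle < 30:
        direction = "forward"
    elif angle < 60:
        direction = "diagonal"
    elif angle <= 120:
        direction = "lateral"
    else:
        direction = "backward"
    size = "short" if length < 15 else "medium" if length < 30 else "long"
    return [direction, height, size]


def encode_event(event):
    """Turn one event (a dict standing in for a DataFrame row) into a word."""
    col, row = location_bin(*event["location"])
    attrs = pass_attributes(event["location"], event["end_location"], event["height"])
    return f"<{event['type']}>({col},{row})" + "|".join(attrs)


example = {
    "type": "pass",
    "location": (60.0, 50.0),
    "end_location": (75.0, 65.0),
    "height": "on-ground",
}
print(encode_event(example))  # <pass>(3,4)diagonal|on-ground|medium
```

Binning locations and attributes keeps the vocabulary small enough for similar actions to share a word, which is what lets a word-embedding model learn from their contexts.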
Words are encoded & fed into models
Powerful Natural Language Processing Models
Action2Vec
Player2Vec
powered by Gensim
Pass to back
Cross to box
Shot saved
Leo Messi
Erling Haaland
Virgil Van Dijk
Player-matches
aggregation
Transforming each action to a 32-dim vector, based on football semantics
Transforming each player to a 32-dim vector, based on his actions during games
Evaluate, interpret, explain
Understand players through player similarities & equations
Selected player: Andres Iniesta
Most similar player: Arthur Melo
Equations:
- Iniesta + outbox scoring + inbox scoring ~ Kevin De Bruyne
- Iniesta + inbox scoring ~ Eden Hazard
- Iniesta + outbox scoring - dribbling ~ Toni Kroos
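Equations of this kind can be read as vector arithmetic followed by a nearest-neighbor search. The sketch below shows the idea with invented 3-dimensional toy vectors and placeholder names; the real embeddings are 32-dimensional, and how a "skill" vector such as inbox scoring is actually derived in the project is not shown here, so `skills` is purely hypothetical.

```python
import math


def combine(base, plus=(), minus=()):
    """Add and subtract vectors element-wise: base + sum(plus) - sum(minus)."""
    result = list(base)
    for vec in plus:
        result = [r + v for r, v in zip(result, vec)]
    for vec in minus:
        result = [r - v for r, v in zip(result, vec)]
    return result


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(query, candidates, exclude=()):
    """Return the candidate name whose vector is closest to the query."""
    scored = {
        name: cosine(query, vec)
        for name, vec in candidates.items()
        if name not in exclude
    }
    return max(scored, key=scored.get)


# Toy vectors, invented for illustration only.
players = {
    "playmaker": [1.0, 0.0, 0.2],
    "scoring_playmaker": [1.0, 1.0, 0.2],
    "defender": [-1.0, 0.0, 1.0],
}
skills = {"inbox_scoring": [0.0, 1.0, 0.0]}  # hypothetical skill vector

query = combine(players["playmaker"], plus=[skills["inbox_scoring"]])
print(most_similar(query, players, exclude={"playmaker"}))  # scoring_playmaker
```

Excluding the starting player from the candidates avoids the trivial result where the nearest neighbor of a slightly shifted vector is the player it started from.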
Use player variations to better understand the dimensions of the representation vector
Action analogies as a tool to investigate model semantics
Illustrative analogy plot for learning pass direction. B1, B2 and B3 are the actions that best fit the analogy equation A - A’ + B’ = ?. Solid lines represent A or B, while dashed lines represent A’ or B’. Green is used for A and A’, red for B and B’. Arrow length represents the pass distance (short/medium/long). Here, A’ is the same pass as A but in the opposite direction (left). B’ is the same as A’, taken from one position behind. B1 and B2 are mirrored passes with variations in height and length, while B3 is exactly the mirrored pass.
Combine players with actions to generate endless local variations of players
Explore the Player2Vec embeddings space
Present &
Use out-of-the-box
Stunning UI
powered by
Streamlit & Plotly
Interact
Interactive Charts
A Gensim Word2Vec model which allows embedding the semantics of the football language in a 32-dimensional space.
Action2Vec
UMAP projections of the complete Action2Vec vocabulary.
UMAP projections of all players within all matches in the StatsBomb open dataset.
A Gensim Doc2Vec model that produces an embedding of each player within a single match in a 32-dimensional space, based on the actions performed by the player.
PlayerMatch2Vec
Player2Vec is the core of this project. It is, in fact, the average of a player's PlayerMatch2Vec representations.
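The averaging step can be sketched in a few lines. The example assumes a plain, unweighted mean over a player's matches and a dictionary keyed by hypothetical `(player, match_id)` pairs; whether the project weights matches (for example by minutes played) is not stated, so treat this as an illustration only.

```python
from collections import defaultdict


def average_player_vectors(player_match_vectors):
    """Average each player's per-match vectors into one vector per player.

    player_match_vectors maps (player, match_id) -> list of floats.
    """
    grouped = defaultdict(list)
    for (player, _match_id), vector in player_match_vectors.items():
        grouped[player].append(vector)
    return {
        player: [sum(dim) / len(vectors) for dim in zip(*vectors)]
        for player, vectors in grouped.items()
    }


# Toy 2-dim vectors for readability (the real ones are 32-dim).
toy = {
    ("player_a", 1): [1.0, 2.0],
    ("player_a", 2): [3.0, 4.0],
    ("player_b", 1): [0.0, 1.0],
}
print(average_player_vectors(toy))
# {'player_a': [2.0, 3.0], 'player_b': [0.0, 1.0]}
```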
Player2Vec
Plotly interactive UMAP projection of Player2Vec, where each player’s matches are averaged into a single vector. Players are colored by position.