Game for Goalr? The Dynamics of Predicting the World Cup
With the 2014 FIFA World Cup having kicked off last week, billions of fans around the world are turning their attention to host country Brazil to root for their favorite teams.
It goes without saying that soccer (or, if you prefer, football) fans are loud; you only need to remember the last World Cup's infamous vuvuzelas for a demonstration.
But fans aren't only loud in stadiums. They also make sure their voices are heard across social media. And although you may assume that these fans are just blowing their vuvuzelas into the social abyss, if you listen closely, you'll find a treasure trove of data, possibly including an answer to the most vital question of all: Who will win?
As soccer fans and Yahoo Labs experts with access to Tumblr data, we wanted to find out whether we could use that exclusive insight to sift through a flurry of posts and predict a World Cup winner.
Combing through 188.9 million Tumblr blogs comprising 83.1 billion posts to find World Cup-related content was not easy. To begin with, we used two main criteria to define which content was relevant: posts with hashtags referencing #WorldCup, #World Cup, #Copa do mundo (or other variants outlined in our technical report), and posts with hashtags referencing #soccer, #football, #futbol, and so on.
These criteria alone, however, proved too broad. So once we had isolated the #WorldCup-related posts, we checked the bodies of those posts for mentions of country names. We then followed the same procedure for the #soccer-related posts.
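The two-step filter described above can be sketched roughly as follows. This is a minimal illustration, not our production pipeline: the post schema (a dict with `tags` and `body`), the tag sets, and the short country list are all assumptions made for the example, and the full hashtag variants are the ones listed in the technical report.

```python
import re
from collections import Counter

# Hypothetical tag sets; the complete variant lists are in the technical report.
WORLD_CUP_TAGS = {"worldcup", "world cup", "copa do mundo"}
SOCCER_TAGS = {"soccer", "football", "futbol"}

# Illustrative subset of country names, not the full tournament field.
COUNTRIES = ["Brazil", "Germany", "Argentina", "USA"]


def categories(tags):
    """Return the hashtag categories ("worldcup", "soccer") a post falls into."""
    normalized = {t.lstrip("#").strip().lower() for t in tags}
    found = set()
    if normalized & WORLD_CUP_TAGS:
        found.add("worldcup")
    if normalized & SOCCER_TAGS:
        found.add("soccer")
    return found


def count_country_mentions(posts):
    """Count the posts that mention each country, per hashtag category."""
    counts = {"worldcup": Counter(), "soccer": Counter()}
    for post in posts:
        for category in categories(post["tags"]):
            for country in COUNTRIES:
                # Whole-word, case-insensitive match against the post body.
                pattern = rf"\b{re.escape(country)}\b"
                if re.search(pattern, post["body"], re.IGNORECASE):
                    counts[category][country] += 1
    return counts
```

A post with no relevant hashtag is ignored entirely, even if its body names a country, which is what keeps the country counts tied to World Cup or soccer conversations.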
For Team USA, for instance, we counted only mentions in #soccer posts to avoid confusion with American football. For Team Brazil, we discounted a percentage of posts, because the country is hosting the event and therefore receives additional mentions; that percentage was based on an editorial evaluation of a sample of posts.
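These two special cases might look something like the sketch below. It reflects one reading of the rules: a soccer-category post counts for Team USA only if it carries the literal #soccer tag, and the host's mention count is scaled down by a fixed fraction. The function names are hypothetical, and `HOST_DISCOUNT` is a placeholder, since the actual percentage is not given here.

```python
# Placeholder value only; the real percentage came from an editorial review.
HOST_DISCOUNT = 0.25


def usa_post_counts(tags):
    """For Team USA, accept a post only if it carries the #soccer tag itself."""
    normalized = {t.lstrip("#").strip().lower() for t in tags}
    return "soccer" in normalized


def discount_host(mentions, host="Brazil", discount=HOST_DISCOUNT):
    """Return a copy of the mention counts with the host's count reduced."""
    adjusted = dict(mentions)
    if host in adjusted:
        adjusted[host] = adjusted[host] * (1.0 - discount)
    return adjusted
```

For example, `discount_host({"Brazil": 100, "Germany": 80})` leaves Germany untouched and scales Brazil by the placeholder fraction.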
To get even more representative results, we also checked the bodies of posts in both hashtag categories for mentions of any national team player, according to FIFA's official list of players for each nation.
Once the filtering was complete, we were left with 27.3 million relevant posts from February through May. The fun (read: science-y) part came next.
To figure out how the countries would stack up against each other, we first needed to assign a strength to each team.
These intrinsic values were then compared for each matchup to produce a representative game score.
Specifically, when two teams are set to play each other, we predict the number of goals scored by each team using a Poisson distribution with four differently weighted parameters, learned by maximum likelihood estimation on prior games (qualifiers, friendlies, etc.). The four parameters are:
1. Team mentions in #WorldCup-related posts
2. Team mentions in #soccer-related posts
3. The average number of player mentions per team
4. The standard deviation of player mentions per team
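To make the idea concrete, here is a minimal sketch of a Poisson goal model fit by maximum likelihood. The exact functional form is not spelled out above, so several things here are assumptions: a log-linear rate driven by the difference between the two teams' feature vectors, an intercept term, features scaled to comparable ranges, and plain gradient ascent as the optimizer. All function names are hypothetical.

```python
import math


def expected_goals(weights, bias, team, opponent):
    """Poisson rate for `team` against `opponent` (assumed log-linear form)."""
    eta = bias + sum(w * (a - b) for w, a, b in zip(weights, team, opponent))
    return math.exp(eta)


def log_likelihood(weights, bias, games):
    """Poisson log-likelihood; each game is (team, opponent, goals_scored)."""
    total = 0.0
    for team, opponent, goals in games:
        lam = expected_goals(weights, bias, team, opponent)
        total += goals * math.log(lam) - lam - math.lgamma(goals + 1)
    return total


def fit(games, n_features=4, learning_rate=0.01, steps=2000):
    """Maximize the log-likelihood with plain gradient ascent."""
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for team, opponent, goals in games:
            # Gradient of the Poisson log-likelihood: (observed - expected) * x.
            residual = goals - expected_goals(weights, bias, team, opponent)
            grad_b += residual
            for j in range(n_features):
                grad_w[j] += residual * (team[j] - opponent[j])
        bias += learning_rate * grad_b / len(games)
        weights = [w + learning_rate * g / len(games)
                   for w, g in zip(weights, grad_w)]
    return weights, bias
```

Each team is a list of the four features above, and each prior game contributes one observation per side: the goals a team scored against a given opponent. After fitting, `expected_goals` gives the predicted scoring rate for either side of a new matchup.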
Finally, with a statistical model predicting the outcome of each successive matchup from the 27.3 million relevant posts, we had a complete bracket and a winner: Team Brazil.
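For readers curious how per-match goal rates turn into a bracket, the sketch below shows one simple way to do it. It is a simplification under stated assumptions: it treats the tournament as pure single elimination (ignoring the group stage), treats the two teams' goal counts as independent Poisson variables, and always advances the likelier winner. The `rate` callback and the function names are hypothetical stand-ins for whatever model supplies the expected goals.

```python
import math


def poisson_pmf(k, lam):
    """Probability of exactly k goals under a Poisson rate lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def win_probability(lam_a, lam_b, max_goals=10):
    """P(A scores strictly more than B), truncated at max_goals per side."""
    return sum(
        poisson_pmf(a, lam_a) * poisson_pmf(b, lam_b)
        for a in range(max_goals + 1)
        for b in range(a)
    )


def play_bracket(teams, rate):
    """Advance the likelier winner of each pairing until one team remains.

    `teams` is an ordered list whose length is a power of two, and
    `rate(a, b)` returns the expected goals for team `a` against team `b`.
    """
    while len(teams) > 1:
        next_round = []
        for a, b in zip(teams[0::2], teams[1::2]):
            p_a = win_probability(rate(a, b), rate(b, a))
            p_b = win_probability(rate(b, a), rate(a, b))
            next_round.append(a if p_a >= p_b else b)
        teams = next_round
    return teams[0]
```

Draws are simply left out of both win probabilities here; a fuller version would need a rule for extra time and penalties.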