Foxes and Hedgehogs in Sports Prediction
Posted 3rd September 2014
Writing at Scoreboard Journalism last year, Simon Gleave, Head of Analysis at Infostrada Sports, compared the performance of a number of sports journalists to that of a number of computer models in predicting the outcome of the 2013/14 English Premier League. Last month Simon revisited those predictions to see how they had performed and posted the illuminating results on Twitter. Whilst both sports journalists and computer models were consistently better (on average) than either random guessing or simply copying the finishing positions from the previous season, the model predictions were superior to those of the journalists. The question this article attempts to address is why.
For the original research, computer models came with points predictions whereas sports journalists were only asked to predict rankings. Strikingly, the majority of the models make fairly conservative estimates of the finishing points totals: the average across all models for top and bottom was 79 and 33 points respectively, compared to actual averages of 83 and 29 points for top and bottom across the last 18 seasons of the Premier League (the 20-club era). Indeed, one model ranged from only 38 to 71 points. Might this offer a clue as to the way computer models arrive at their predictions?
Since the sports journalists were not asked to predict points totals, I have chosen to focus on the comparison of ranking predictions between models and journalists only. As Simon has done in his tweet, a useful way of analysing how each forecaster has performed is to compare their predictions to the actual finishing positions of the teams and then calculate the amount of error within each sample of 20 predictions. Those familiar with statistics will recognise this as the standard deviation of the prediction errors (referred to here as the standard error), in other words a measure of how widely the errors vary. It is calculated by subtracting the predicted rank from the actual finishing rank for each team, and then taking the standard deviation of those differences. If every prediction made was correct, the standard error would be zero. Conversely, if a forecast simply predicted a mid-table finish (10th) for all teams, the standard error would be about 6. So how did the computer models and journalists fare?
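The calculation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the method Simon actually used: the function name rank_error, the placeholder team names and the choice of the sample (n-1) standard deviation are all assumptions made here, not details taken from the original analysis.

```python
from statistics import stdev

def rank_error(predicted, actual):
    """Standard deviation of (actual rank - predicted rank) across teams.

    Both arguments map a team name to a rank. The sample (n-1) form of the
    standard deviation is an assumption; the article does not say which
    form was used.
    """
    diffs = [actual[team] - predicted[team] for team in actual]
    return stdev(diffs)

# Hypothetical 20-team table: placeholder names, finishing ranks 1 to 20.
actual = {f"Team {i}": i for i in range(1, 21)}

perfect = dict(actual)                        # every rank predicted correctly
mid_table = {team: 10 for team in actual}     # 10th predicted for everyone

print(rank_error(perfect, actual))            # 0.0
print(round(rank_error(mid_table, actual), 2))  # 5.92, i.e. "about 6"
```

With the population form (statistics.pstdev) the mid-table figure would be slightly lower, at about 5.77, which still rounds to the "about 6" quoted above.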
As is clear from Simon's graphic, both groups did OK when compared to either mid-table guessing or previous-season copying. However, considerable variation exists within both groups. The best models achieved standard errors of just 3.5 whereas the poorest ones could only manage about 4.9. This compares to errors of 3.5 and 5.3 for the best and poorest journalists. These figures, however, don't tell the whole story. Whilst the best journalist (Joe Prince-Wright of NBC) was about as good as the best model, the average model performance (4.08) was considerably better than that of the average journalist (4.37). With only 13 forecasts from models and 14 from journalists one might not expect that difference to show any statistical significance, and yet using a one-tailed two-sample t-test, the difference between these averages is very nearly statistically significant at the 95% confidence level (p-value = 0.057). To put it another way, if there were really no difference between the two groups, a gap at least this large could be expected to occur purely by chance on just over 5% of occasions for the same sample size of forecasters.
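For readers who want to see roughly how such a test works, here is a hedged sketch using only the Python standard library. It works from the rounded summary figures quoted in this article (the group means above, and the standard deviations of 0.43 and 0.50 given further on) rather than the individual forecasters' errors, so it should not be expected to reproduce the p-value of 0.057 exactly. The function names, the pooled equal-variance form of the test and the Simpson's rule integration are all assumptions made for illustration.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def upper_tail(t, df, steps=2000):
    """P(T > t), found by integrating the density from 0 to |t| (Simpson's rule)."""
    b = abs(t)
    h = b / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    tail = 0.5 - total * h / 3
    return tail if t >= 0 else 1 - tail

def pooled_t_test(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """One-tailed two-sample t-test (equal variances assumed), H1: mean_b > mean_a."""
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / df
    se = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t_stat = (mean_b - mean_a) / se
    return t_stat, df, upper_tail(t_stat, df)

# Rounded summary figures from the article: 13 models, 14 journalists.
t_stat, df, p = pooled_t_test(4.08, 0.43, 13, 4.37, 0.50, 14)
print(f"t = {t_stat:.2f}, df = {df}, one-tailed p = {p:.3f}")
```

A version of the test that does not assume equal variances (Welch's t-test) would give a slightly different result, and the article does not say which form was used.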
Of course statistics, if used irresponsibly, can be used to support most arguments. But let's suppose for the time being that the average computer model is capable of outperforming the average sports journalist. The interesting question is why. To get an idea of what at least part of the answer might involve, it's worth taking another look at the spread of the errors for models and journalists alike. A casual glance at Simon's graphic makes it fairly clear that a wider spread of error exists for the journalists compared to the models. In fact the standard deviation of individual standard errors is 0.43 for the models, whilst it is 0.50 for the journalists. Furthermore, much of the variance in the models' errors arises because of the 3 poorest performers; the rest are fairly closely grouped. Indeed, remove those 3 models and the standard deviation across models' errors drops to 0.20. Do the same for the journalists and it's still as high as 0.40. So why the bigger spread in forecasting accuracy for journalists compared to computer models? Perhaps it boils down to the way each thinks.
The ancient Greek poet Archilochus once wrote that "the fox knows many things, but the hedgehog knows one big thing". Pinnacle Sports have conveniently summarised some of the typical traits that Archilochus probably had in mind when making such a distinction, and which Philip Tetlock identified in his masterpiece Expert Political Judgment: How Good Is It? How Can We Know?. Specifically, when it comes to forecasting uncertain futures, foxes are multidisciplinary, adaptable, self-critical, tolerant of complexity (and uncertainty) and empirical. Hedgehogs, by comparison, are specialised, stubborn, (over)confident and ideological. There are many examples of hedgehogs in the political sphere. One only has to switch on a news channel to hear "experts" debate solutions to issues like Israel-Palestine and Syria-Iraq, with a typical theme of "if we just do this simple thing then that will ensure that that happens".
Of course, a hedgehog mentality is probably the default human behaviour, and it exists for evolutionary reasons. The ancient fable The Fox and the Cat is essentially an analogous version of this fox-hedgehog dichotomy, and addresses the difference between resourceful expediency and a master stratagem. In the basic story a cat and a fox discuss how many tricks and dodges they have. The fox boasts that he has many; the cat confesses to having only one. When hunters arrive with their dogs, the cat quickly climbs a tree, but the fox is caught by the hounds. The moral: spend too much time overanalysing situations and you're liable to end up as lunch. In the days when our ancestors foraged a living on the steppes of East Africa, having the ability to make bold and rapid decisions when faced with environmental threats clearly had evolutionary payoffs. Those humans that could, survived; those less able, died out. Even if, as a hedgehog, you got it wrong, so long as you lived to fight another day, that was the main thing.
Today, of course, human existence is much more benign, but essentially we are still hardwired to exhibit many hedgehog-like traits, including overconfidence and stubbornness in our decision-making. Arguably, however, a hedgehog mentality is less suited to the terrain of forecasting uncertainty that exists in sports prediction (and many other arenas for that matter). Whilst foxes might consider many possible points of view, and make use of many sources of information in deliberating their opinions, hedgehogs will be more likely to make bold and singular statements about what they think will happen, and be more entrenched in those positions. One might be inclined to say that whilst foxes sit on the fence (recall the smaller spread in points forecasts compared to what actually happens), hedgehogs go out on a limb. Perhaps that is true in part, but such an explanation on its own cannot account for the generally superior forecasting ability that foxes appear to have over hedgehogs, as documented by Philip Tetlock.
So what about our Premiership forecasters? Here I wish to make the case that computer models, by their very nature and design, and possibly also by the nature of their designers, are more likely to operate like foxes, considering a wide variety of quantitative data and being more inclined to draw conclusions unbiased by singular and dominant points of view. Sports journalists, by contrast, are more likely to act like hedgehogs. They are, after all, human beings (well, most of them anyway), and they like to make bold predictions about what they think will happen, presumably to garner a reputation. When they get it right, they will look spectacularly profound (although how can we tell whether they just got lucky?), but when they get it wrong they look spectacularly stupid (again, can we be sure they have not just been unlucky?).
To put some meat on these bones, consider, for example, David James' (BT Sport) prediction that Everton would finish the 2013/14 season in 16th place, one place above relegation. Clearly he must have had his reasons for believing this to be likely. It was a very bold prediction indeed; all the other journalists placed Everton between 4th and 9th. In the event it turned out to be wrong (Everton finished 5th), and in large part accounted for David James proving to be the poorest forecaster of all, only marginally better than simply copying the finishing positions from the previous season. Perhaps the reason, then, that journalists showed a wider variation of success at forecasting the English Premier League rankings is that they are, on average, thinking more like hedgehogs than foxes: making bold predictions which, when wrong, lead to poorer overall scores.
This is not to say that all computer models are thinking like foxes and all sports journalists are thinking like hedgehogs. Despite the differences highlighted, there are general commonalities. Models and journalists alike were pretty successful in predicting that Manchester United would not win the league (indeed not a single journalist predicted it would happen, compared to 2 models). Similarly, they all spectacularly failed to predict Crystal Palace's 11th place finish, with the majority having forecast last place and none better than 18th. And despite the variations, models and journalists, on average, didn't disagree by more than 2 ranking places for any team.
Overall, however, the reality, as Philip Tetlock found in his 20 years of research into political forecasting, is that foxes and hedgehogs alike are not very good, on average, at forecasting uncertain futures. They are better than random perhaps, but probably not by enough to justify the large sums of money spent on hiring these journalists to offer opinions about sporting outcomes in the first place, or indeed on modellers spending time predicting the evolution of betting markets more generally, at least not if those predictions are meant to offer something better than the market with a view to finding profitable trading opportunities. Whilst I haven't been able to investigate whether any of these predictions could have been used to make a profit, it is interesting to observe that the Pinnacle Sports market, with its built-in profit margin, was only bettered by 7 of the other 26 forecasters, and only significantly so by 3 of them. In truth, whilst some forecasters will be more fox-like, and some more hedgehog-like, the majority, on average, will be foxhogs, a mixture of the two, and few will be capable of being consistently right in the long term. Uncertainty is an uncertain business and if everyone could make money out of it, it wouldn't be uncertain any more.
