Predicting the future is hard, but it's also incredibly important.
Let's say someone starts making predictions about important events. How much should you believe them when they say the world will end tomorrow? What about when they say there's a 70% chance the world will end in 50 years?
Wait, what does "70%" even mean in this situation? How can you have 70% of an apocalypse?
In this situation the predictor is making a prediction with a certain confidence. Rather than just saying "it's likely", they've chosen a number to represent how confident they are in that statement.
People make predictions every day, but most don't choose a specific number to assign to their confidence. This would be wildly impractical for most things! If you're driving and a car in front of you slows down, you make a prediction about what it's going to do next. If they turn on their turn signal, you can make that prediction with even more confidence. You usually don't need to state which outcomes you're anticipating, which one you think is most likely, or how much confidence you'd place on each, but you're already doing it!
Explicit predictions are most useful when trying to communicate about important, uncertain events. When you hear the morning news say there's a 70% chance of rain today, they've given you a useful data point! You can use that information to make decisions: Should I take an umbrella? Should I wear a jacket? Probably!
Why should I care about a specific confidence number? Just say "probably" like everyone else!
Predictions, quantified or not, are ultimately only useful as tools that you can use to make decisions. If a prediction is not particularly relevant to a decision you're making, or it won't affect you much either way, then "probably" is fine! If someone tells you they will "probably" be home in twenty minutes, that's usually enough information for any decision you need to make.
On the other hand, predictions that would affect something significant in your life or require you to make a bigger decision should probably be taken more seriously.
These are the sorts of situations where it's helpful to have quantified predictions.
If these predictions are so important, how do we know who to trust? Just because someone is confident in themselves doesn't mean I should be confident in them.
The best way to measure how good a person is at predicting is to look at how often they were right in the past. If our Nostradamus was wrong about every prediction they've made so far, we should probably ignore them. If they have been right every time, we should probably take them seriously.
To grade simple predictions, we can put all of the YES predictions in one bucket, and all of the NO predictions in another. We'll count how many times those predictions came true - ideally everything in the NO bucket will resolve NO, and everything in the YES bucket will resolve YES.
Bucket | Resolved NO | Resolved YES | Average Resolution |
---|---|---|---|
NO Bucket | 15 | 3 | 3 / 18 = 16.7% |
YES Bucket | 7 | 10 | 10 / 17 = 58.8% |
Well, it looks like our Nostradamus was decently accurate whenever they predicted NO - those events only happened about 17% of the time. But their YES predictions weren't so good - those only happened about 59% of the time, barely better than a coin flip. It seems like this predictor isn't very well-calibrated.
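To make that grading procedure concrete, here's a minimal Python sketch of the same bucket-counting. The data is invented for illustration, not any real track record:

```python
# Minimal sketch: grade simple YES/NO predictions by bucket.
# Each entry is (what they predicted, what actually happened).
track_record = [
    ("NO", "NO"), ("NO", "YES"), ("YES", "YES"), ("YES", "NO"),
    # ...the rest of the predictor's track record goes here...
]

buckets = {"NO": [], "YES": []}
for predicted, resolved in track_record:
    # Record a 1 when the event happened and a 0 when it didn't.
    buckets[predicted].append(1 if resolved == "YES" else 0)

for name, outcomes in buckets.items():
    average = sum(outcomes) / len(outcomes)
    print(f"{name} bucket: {len(outcomes)} predictions, "
          f"average resolution {average:.1%}")
```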
Anyways, we're more interested in forecasters who don't just say yes or no. We're looking at people who assign some sort of probability to their statement. In the example at the top of the page, our doomsayer claimed a 70% chance that the world would end within a specific timeframe. How would we judge that after the fact? (Assuming the world did not end, that is.)
Instead of two buckets (YES and NO), let's break their predictions up into eleven buckets - 0%, 10%, 20%, and so on to 100%. If our Nostradamus said there's a 0% chance that the sky will fall and a 70% chance there will be a snowy Christmas this year, then we can sort those into the right buckets and then evaluate each one.
Bucket | Resolved NO | Resolved YES | Average Resolution |
---|---|---|---|
0% Bucket | 10 | 1 | 9.1% |
10% Bucket | 15 | 2 | 11.7% |
20% Bucket | 18 | 7 | 28.0% |
30% Bucket | 15 | 7 | 31.8% |
40% Bucket | 20 | 14 | 41.2% |
50% Bucket | 18 | 19 | 51.4% |
60% Bucket | 14 | 21 | 60.0% |
70% Bucket | 7 | 14 | 66.7% |
80% Bucket | 7 | 17 | 70.8% |
90% Bucket | 3 | 13 | 81.3% |
100% Bucket | 0 | 9 | 100.0% |
This looks a lot better! Now that we have more granularity, we can differentiate between things like "unlikely", "probably not", and "definitely not". When this predictor said something had a 10% chance to occur, it actually happened only 11.7% of the time. And when they gave something a 60% chance, it actually happened 60% of the time! It seems like this predictor is much better calibrated.
If a predictor is calibrated it means that, on average, predictions they make with X% confidence occur X% of the time.
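As a sketch of how you might compute those bucket averages yourself, here's one way to do it in Python. The numbers below are invented placeholders for a real list of (confidence, outcome) pairs:

```python
# Sketch: sort probabilistic predictions into 0%-100% buckets and see
# how often each bucket actually resolved YES. Outcome 1 = YES, 0 = NO.
predictions = [
    (0.00, 0), (0.10, 0), (0.10, 1), (0.70, 1), (0.70, 0), (0.90, 1),
    # ...(stated confidence, outcome) pairs from the track record...
]

buckets = {}  # bucket value -> list of outcomes
for confidence, outcome in predictions:
    bucket = round(confidence, 1)  # nearest 10% bucket
    buckets.setdefault(bucket, []).append(outcome)

for bucket in sorted(buckets):
    outcomes = buckets[bucket]
    average = sum(outcomes) / len(outcomes)
    print(f"{bucket:.0%} bucket: {len(outcomes)} predictions, "
          f"average resolution {average:.1%}")
```

A perfectly calibrated predictor would see each bucket's average resolution land right at the bucket's own value.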
Let's plot these on a chart for convenience. Across the bottom we'll have a list of all our buckets - 0% to 100%. Along the side we'll have a percentage - how often those predicted events came true. If our predictor is well-calibrated, these points should line up along a straight diagonal from the bottom-left to the top-right. We'll call this a calibration plot, but it's also known as a reliability diagram.
This is very good! Now we can see visually where our predictor is calibrated or where they're over- or under-confident. If our forecaster keeps making predictions like this, we could expect them to be well-calibrated in most cases - especially when they make predictions between 30% and 70%.
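If you want to draw one of these plots yourself, here's a rough sketch using matplotlib, with the observed frequencies taken from the table above:

```python
# Sketch of a calibration plot (reliability diagram) using matplotlib.
import matplotlib.pyplot as plt

confidence_buckets = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
observed_frequency = [0.091, 0.117, 0.280, 0.318, 0.412, 0.514,
                      0.600, 0.667, 0.708, 0.813, 1.000]

# The dashed diagonal marks perfect calibration:
# X% predictions come true X% of the time.
plt.plot([0, 1], [0, 1], linestyle="--", color="gray", label="Perfect calibration")
plt.scatter(confidence_buckets, observed_frequency, label="Our predictor")
plt.xlabel("Stated confidence")
plt.ylabel("How often it actually happened")
plt.legend()
plt.show()
```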
Those charts are nice and all, but it still doesn't tell me how seriously I should take this person.
Good point! Calibration plots can tell you plenty, but they're hard to compare and they don't give you a single numeric score. For that, let's look into accuracy scoring. Accuracy is an intuitive measure but it has some important caveats.
A predictor is more accurate when their predictions are closer to the resolved outcome.
We have a few ways to calculate accuracy, but let's focus on the most popular one: Brier scores.
For each prediction, we take the "distance" between the prediction and the outcome: if we predict 10% and it resolves NO, the distance is 0.1, but if we predict 10% and the answer is YES, the distance is 0.9. We always want this number to be low! Once we have these distances, we square each one. This has the effect of "forgiving" small errors while punishing larger ones.
After we have done this for all predictions, we take the average of these scores. This gives us the Brier score for the prediction set.
Prediction | Resolution | "Distance" | Score |
---|---|---|---|
10% | NO (0) | 0.10 | 0.0100 |
35% | NO (0) | 0.35 | 0.1225 |
42% | YES (1) | 0.58 | 0.3364
60% | NO (0) | 0.60 | 0.3600 |
75% | YES (1) | 0.25 | 0.0625 |
95% | YES (1) | 0.05 | 0.0025 |
Average Brier Score | | | 0.1490
The most important thing to note here is that smaller is better! This score is actually measuring the amount of error in our predictions, so we want it to be as low as possible. In fact, an ideal score in this system is 0, while the worst possible score is 1.
If you were to guess "50%" on every question, your Brier score would be 0.25. Superforecasters tend to fall around 0.15 while aggregated prediction markets generally fall between 0.10 and 0.20.
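For reference, here's a small Python sketch of the same calculation, using the six predictions from the worked example above:

```python
# Sketch: Brier score for the worked example above.
# Each pair is (predicted probability, outcome), where outcome 1 = YES, 0 = NO.
predictions = [
    (0.10, 0), (0.35, 0), (0.42, 1),
    (0.60, 0), (0.75, 1), (0.95, 1),
]

# Square each prediction's distance from the outcome, then average.
brier_score = sum((p - outcome) ** 2 for p, outcome in predictions) / len(predictions)
print(f"Brier score: {brier_score:.4f}")  # prints 0.1490, matching the table
```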
So how is accuracy different than calibration here?
Calibration is about how good you are at quantifying your own confidence, not always about how close you are to the truth. If you make a lot of predictions that are incorrect, but properly quantify your confidence in those predictions, you can be better calibrated than someone who makes accurate but over- or under-confident predictions.
If a forecaster gives you their calibration and their accuracy, you should look at both but weigh their accuracy more than their calibration. Calibration is good, but it doesn't mean you know the future.
It seems like these statistics are pretty easy to game. What's stopping you from predicting 100% on a bunch of certain things, like "will the sun come up tomorrow"?
Ultimately, nothing is preventing that! It's very important to check what sorts of predictions someone is making to ensure that they're relevant to you. It's especially important when looking at user-generated content on prediction market sites, where extremely easy questions can be added for profit or calibration manipulation.
This is especially relevant when comparing between different predictors or platforms. Just because someone has a lower Brier score does not mean that they are inherently better! The only way you can directly compare is if the corpus of questions is the same for all participants.
What are these prediction markets? How can they be so accurate?
Prediction markets are based on a simple concept: If you're confident about something, you can place a bet on it. If someone else disagrees with you, agree on terms and whoever wins takes the money. By aggregating the odds of these trades, you can gain insight into the "wisdom of the crowds".
Imagine a stock exchange, but instead of trading shares, you trade on the likelihood of future events. Each prediction market offers contracts tied to specific events, like elections, economic indicators, or scientific breakthroughs. You can buy or sell these contracts based on your belief about the outcome - if you are very confident about something, or you have specialized information, you can make a lot of money from a market.
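As a toy illustration of that incentive (the numbers here are invented): suppose a YES share pays out $1 if the event happens, the market currently prices it at $0.60, but your own research says the true chance is 75%.

```python
# Toy example with invented numbers: is this contract worth buying?
payout_if_yes = 1.00     # a YES share pays $1 if the event happens
market_price = 0.60      # the market's implied probability is 60%
your_estimate = 0.75     # you believe the true probability is 75%

expected_value = your_estimate * payout_if_yes    # $0.75 per share, by your estimate
expected_profit = expected_value - market_price   # $0.15 per share
print(f"Expected profit per share: ${expected_profit:.2f}")
```

If enough traders reason this way, their buying pushes the price up toward 75 cents, which is exactly how the market's probability gets updated.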
Markets give participants a financial incentive to be correct, encouraging researchers and skilled forecasters to spend time investigating events. Individuals with insider information or niche skills can profit by trading, which also updates the market's probability. Prediction markets have outperformed polls and revealed insider information, making them a useful tool for information gathering or profit.
Everyone who participates in a prediction market increases its accuracy in some way:

- An expert in the field, who understands the situation and historical context, bets NO when the probability exceeds the base rate. They make money off of the optimists if they're correct and move the probability towards the proper base rate.
- When pundits or specialists make public claims, their followers place bets. Savvy bettors will follow multiple specialists to avoid bias and bet even more. The market probability moves towards the specialists' consensus, distilling discourse down into a single number.
- Someone who has specific information about the subject places a large bet in order to get a huge payout based on their insider knowledge. Other traders may join them or bet against them, but everyone gains information and is alerted to a potential upset.
- Someone thinks that a market platform has a severe bias against a specific political party. They bet on their preferred party across many markets on the platform, which wins them money if they're correct. Betting in multiple markets both reduces their risk and reduces the market bias at the same time.
- A prediction market has high liquidity, but traders haven't found a consensus. Someone decides to conduct original research through polls, experimentation, or some other means, then places a large bet in the direction their research indicates. They then reveal their research, letting everyone update on this new information, and sell their shares at a profit.
- Someone who has a lot of money and likes to gamble puts down large bets at random across a platform. This shifts the probability away from the expert consensus but increases the liquidity of each market. Smart users notice this and arbitrage their positions into profit, rewarding quick responses and correcting the price at the same time.
- A news journalist links to specific markets as proof or evidence to back up their claims, or cites them as public opinion. If their readers agree, they can subscribe to the market to be informed first of any changes. If they disagree, they can log in and bet against it.
There are many prediction market platforms where you can either place bets or just gather information. The platforms that we track are:
While prediction markets have existed in various capacities for decades, their use in the U.S. is currently limited by the CFTC. Modern platforms either submit questions for approval to the CFTC, use reputation or "play-money" currencies, restrict usage to non-U.S. residents, or utilize cryptocurrencies. Additionally, sites will often focus on a particular niche or community in order to increase trading volume and activity on individual questions.