Brier.fyi launched in 2025 to help people compare and evaluate prediction markets fairly. Our mission is to help the public make informed decisions about which markets to trust and how to interpret forecasting data. We believe in transparency, honest comparisons, and providing meaningful context.
The idea for this site started in July 2023 when we set out to create a calibration plot for Manifold, as they weren't publishing their own accuracy statistics at the time. What began as "Calibration City" eventually grew to include other platforms like Kalshi, Metaculus, and Polymarket. Thanks to grants from the Manifold Community Fund and EA Community Choice programs, we were able to expand our work. We added more markets, created new filters and charts, tracked accuracy metrics, and built guides for newcomers. However, we weren't comfortable sharing hard numerical accuracy scores or directly comparing platforms, because the platforms themselves were fundamentally different from each other.
While our data was solid, it wasn't conveying the insights people expected. Calibration is very useful, but it can't ever tell the whole story. In January 2025, we initiated a complete overhaul. We started fresh, linking similar markets across platforms and finding directly comparable questions. We built a new website, less focused on experimentation and more focused on showing valuable results. We wanted to have something that actually answered questions like “How accurate are prediction markets?” and “Which platform is more accurate on the topics I care about?”
We're still growing - adding new platforms, curating interesting questions, and building new features. You can find all our source code in the Themis project on GitHub, complete with our open issues and roadmap. All our data is available through our PostgREST API. We welcome collaboration and encourage others to use our data, with the hope that you'll share your findings with us.
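If you'd like a starting point for exploring the data, here is a minimal sketch of querying a PostgREST endpoint with Python. The base URL, table name, and column names below are placeholders for illustration, not our documented schema - check the Themis repository or the API root for the real endpoints.

```python
import requests

# Placeholder values: swap in the real PostgREST root and table/column names.
BASE_URL = "https://example.org/api"

# PostgREST filters are passed as query parameters, e.g. platform=eq.kalshi
resp = requests.get(
    f"{BASE_URL}/markets",
    params={"platform": "eq.kalshi", "select": "id,title,resolution", "limit": 10},
    timeout=30,
)
resp.raise_for_status()
for market in resp.json():
    print(market)
```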
Broadly, we support all binary and multiple-choice markets on all supported platforms. There are a few asterisks around this, however.
For platforms like Kalshi and Polymarket where all markets are binary, the process is straightforward. Simple yes/no questions are extracted as-is, with the probability based on the price of the YES side. On these platforms, question groups are constructed out of binary markets (usually in the form of "Will team X win game Y?" or "Will metric X be greater than Y at time Z?"), so those are extracted the same way.
Manifold and Metaculus have a number of different market types, which they use for question groups and continuous spreads.
We used to assume that the implied probability of a market before the first trade was 50%, since that is how the probability is often shown on each platform's frontend. However, this caused problems if a significant amount of time elapsed between market creation and the first trade, or if there were no trades whatsoever. Now, we consider the market to have no probability until the first trade. We also ignore any market that has zero trades for the same reason.
For traditional market sites, the implied probability is equivalent to the price of one YES share (where payout would be $1 if it resolves YES). For sites that aggregate predictions in other ways, we follow the probability that they display most prominently. For Metaculus this is the community prediction, which is exposed as recency_weighted in the API.
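As an illustration, a probability extraction step along these lines might look like the sketch below. The field names ("yes_price", "recency_weighted", "probability") are assumptions for the example, not the exact API schemas.

```python
# Hypothetical sketch of reading an implied probability from each platform's
# API response. Field names are illustrative, not the real schemas.

def implied_probability(platform: str, raw: dict) -> float:
    if platform in ("kalshi", "polymarket"):
        # Price of one YES share, assumed here to be on a 0-1 scale.
        return float(raw["yes_price"])
    if platform == "metaculus":
        # The community prediction, exposed as recency_weighted in the API.
        return float(raw["recency_weighted"])
    if platform == "manifold":
        return float(raw["probability"])
    raise ValueError(f"unsupported platform: {platform}")
```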
We do not evaluate non-market items from any platform, such as bounties, posts, or non-forecasting polls. Our downloader runs nightly, and notifies us of any new or unrecognized market types so we can implement them as quickly as possible.
When we first started working on the Calibration City site, we realized how different each prediction market platform was. The apples-to-oranges problem has reared its head many times and we were certainly not the first to realize this.
Our goal with the matching process is to find equivalent markets across platforms, usually by targeting one of the following situations.
After downloading items from each platform's API, we use a couple of techniques to try to find these markets. We generate embeddings to find similar markets, then refine matches using tags, keywords, duration overlap, and other heuristics.
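As a rough sketch of this idea (not our production pipeline), candidate matches could be generated like this. The embed() function is a placeholder for any sentence-embedding model, and the similarity and overlap cutoffs are made-up values.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def overlap_fraction(open_a, close_a, open_b, close_b) -> float:
    # Fraction of the shorter market's lifetime that overlaps with the other.
    overlap = (min(close_a, close_b) - max(open_a, open_b)).total_seconds()
    shortest = min(close_a - open_a, close_b - open_b).total_seconds()
    return max(overlap, 0.0) / shortest

def candidate_matches(market, others, embed, sim_cutoff=0.90, overlap_cutoff=0.5):
    """Return markets similar enough to be reviewed by a human."""
    query = embed(market["title"])
    candidates = []
    for other in others:
        if cosine_similarity(query, embed(other["title"])) < sim_cutoff:
            continue
        if overlap_fraction(market["open"], market["close"],
                            other["open"], other["close"]) < overlap_cutoff:
            continue
        candidates.append(other)
    return candidates
```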
The final decision as to whether two markets are "equivalent" can be surprisingly difficult. For instance, two markets might resolve on December 31st versus January 1st, use two different news sources, or have any number of other slight variations that make them not 100% equivalent. To stay in the spirit of the concept, we allow grouping markets as long as these differences wouldn't have more than a 1% chance of changing the resolution.
All matches are picked and approved by real people. We do our best, but there may be some mistakes. Contact us if you think there's an issue with a market link, or if you have a suggestion for additional market links.
Currently we have two main types of scores: absolute and relative scores.
All absolute scores are calculated from a criterion probability and scored using a scoring rule. The criterion probability is what we refer to as the market's "prediction": it can be the probability at a specific point in time, such as the market midpoint or 30 days before resolution, or an aggregate such as the time-weighted average probability.
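For example, the time-weighted average could be computed with something like the following sketch. It assumes a market's probability history is a list of (timestamp, probability) points, where each probability holds until the next point or the market's close.

```python
from datetime import datetime

def time_weighted_average(history: list[tuple[datetime, float]], close: datetime) -> float:
    """Average probability weighted by how long each value was in effect."""
    weighted_sum = 0.0
    total_seconds = 0.0
    # Pair each history point with the start of the next one (or the close).
    for (start, prob), (end, _) in zip(history, history[1:] + [(close, 0.0)]):
        duration = (end - start).total_seconds()
        weighted_sum += prob * duration
        total_seconds += duration
    return weighted_sum / total_seconds
```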
The Brier score is fairly intuitive, with better scores closer to zero and worse scores closer to one. Random guesses tend towards a score of 0.25, with superforecasters around 0.10. With a market's criterion probability $p$ and resolution $o$ (1 for YES, 0 for NO), we can calculate the Brier score with:

$$\text{Brier} = (p - o)^2$$
The logarithmic score is another strictly proper scoring rule, but with better scores closer to zero and worse scores closer to negative infinity. Predictions far from the correct resolution are punished extremely hard under this rule. With a market's criterion probability $p$ and resolution $o$, we can calculate the logarithmic score with:

$$\text{Log} = o \cdot \ln(p) + (1 - o) \cdot \ln(1 - p)$$
The spherical score is a third strictly proper scoring rule. Better scores tend towards one and the worst possible score is zero, but the vast majority of predictions fall very high on this scale (between 0.99 and 1.0), so differentiation is difficult. With a market's criterion probability $p$ and resolution $o$, we can calculate the spherical score with:

$$\text{Spherical} = \frac{o \cdot p + (1 - o)(1 - p)}{\sqrt{p^2 + (1 - p)^2}}$$
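Putting the three rules together, a small Python sketch for a binary market (with prob as the criterion probability of YES and outcome as 1 for YES, 0 for NO) might look like:

```python
import math

def brier_score(prob: float, outcome: float) -> float:
    return (prob - outcome) ** 2

def log_score(prob: float, outcome: float) -> float:
    # Log of the probability assigned to the outcome that actually happened.
    p_outcome = prob if outcome == 1.0 else 1.0 - prob
    return math.log(p_outcome)

def spherical_score(prob: float, outcome: float) -> float:
    p_outcome = prob if outcome == 1.0 else 1.0 - prob
    return p_outcome / math.sqrt(prob ** 2 + (1.0 - prob) ** 2)

# For example, a 90% prediction on a market that resolves YES:
# brier_score(0.9, 1.0) == 0.01
# log_score(0.9, 1.0) is approximately -0.105
# spherical_score(0.9, 1.0) is approximately 0.994
```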
Relative scores are calculated based on the performance of the market relative to other markets. They provide a measure of how well the market has performed compared to its peers. These are only present for markets that are linked in a question, since they are scored against the other markets in that question.
We calculate a relative score with each scoring rule, which you can find on the individual question pages. The overall process is exactly the same for each; the only difference is the scoring rule used. Each rule keeps its direction in relative form, so a lower relative Brier score is better while a higher relative logarithmic score is better.
The process to calculate relative scores for a group of markets starts by determining the scoring period. We choose to score groups for the duration where at least two markets are open. In some situations, we also override the start or end dates so that the scoring period does not include days where the outcome was already known.
For each day in the scoring period, we calculate each market's score using a scoring rule (Brier, log, etc.) and, from those, calculate the median score. We then subtract the median from each market's daily score and save it as the daily relative score.
Finally, we sum all of the daily relative scores for each market and divide that by the total number of days in the scoring period. Note that this is not a simple average! For markets that were not open for the entire scoring period, their sum is being divided by more days than they had values for. This means that a market that otherwise performed the same as another but was open for less time will have a relative score closer to zero. Also note that relative scores can be both positive and negative, since this is the difference from a median score.
One way to represent this score for each market would be:

$$\text{Relative score} = \frac{1}{n} \sum_{d} \left( s_d - m_d \right)$$

Where $s_d$ is the market's score on day $d$, $m_d$ is the median score on day $d$, $n$ is the number of days in the scoring period, and the sum runs over the days the market was open.
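A sketch of that procedure in Python might look like the following, where daily_scores maps each market to its daily scores (already computed with one of the absolute scoring rules, indexed by day 0 to num_days - 1) and num_days is the length of the scoring period.

```python
import statistics

def relative_scores(daily_scores: dict[str, dict[int, float]], num_days: int) -> dict[str, float]:
    # Median score across all markets open on each day of the scoring period.
    daily_medians = {
        day: statistics.median(
            scores[day] for scores in daily_scores.values() if day in scores
        )
        for day in range(num_days)
    }
    # Sum each market's daily difference from the median, then divide by the
    # full scoring period - not just the days the market was open.
    return {
        market: sum(score - daily_medians[day] for day, score in scores.items()) / num_days
        for market, scores in daily_scores.items()
    }
```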
You can learn more about relative scores at the following sources:
The letter grades are intended to be an easy-to-read, intuitive representation of how well the market has performed on a specific axis. Each score (e.g. Brier score at market midpoint, spherical score one month before close, etc.) has a corresponding letter grade, which is determined by comparing the score to a set of predefined thresholds.
The thresholds for absolute scores are:
Grade | Brier Score | Logarithmic Score | Spherical Score | Probability Margin |
---|---|---|---|---|
S | 0.0000 to 0.0001 | 0.0000 to -0.0101 | 1.0000 to 0.9999 | 0.0000 to 0.0100 |
A+ | 0.0001 to 0.0009 | -0.0101 to -0.0305 | 0.9999 to 0.9995 | 0.0100 to 0.0300 |
A | 0.0009 to 0.0018 | -0.0305 to -0.0434 | 0.9995 to 0.9990 | 0.0300 to 0.0424 |
A- | 0.0018 to 0.0022 | -0.0434 to -0.0480 | 0.9990 to 0.9988 | 0.0424 to 0.0469 |
B+ | 0.0022 to 0.0030 | -0.0480 to -0.0563 | 0.9988 to 0.9983 | 0.0469 to 0.0548 |
B | 0.0030 to 0.0045 | -0.0563 to -0.0694 | 0.9983 to 0.9974 | 0.0548 to 0.0671 |
B- | 0.0045 to 0.0055 | -0.0694 to -0.0771 | 0.9974 to 0.9968 | 0.0671 to 0.0742 |
C+ | 0.0055 to 0.0075 | -0.0771 to -0.0906 | 0.9968 to 0.9955 | 0.0742 to 0.0866 |
C | 0.0075 to 0.0150 | -0.0906 to -0.1306 | 0.9955 to 0.9904 | 0.0866 to 0.1225 |
C- | 0.0150 to 0.0250 | -0.1306 to -0.1721 | 0.9904 to 0.9828 | 0.1225 to 0.1581 |
D+ | 0.0250 to 0.0500 | -0.1721 to -0.2531 | 0.9828 to 0.9609 | 0.1581 to 0.2236 |
D | 0.0500 to 0.1100 | -0.2531 to -0.4030 | 0.9609 to 0.8958 | 0.2236 to 0.3317 |
D- | 0.1100 to 0.2500 | -0.4030 to -0.6931 | 0.8958 to 0.7071 | 0.3317 to 0.5000 |
F | 0.2500 to 1.0000 | -0.6931 to -3.4028235e+38 | 0.7071 to 0.0000 | 0.5000 to 1.0000 |
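For instance, mapping a Brier score to its letter grade is just a lookup against the upper bound of each row in the table above. A small sketch:

```python
# Upper bound of each grade's Brier range, taken from the table above.
BRIER_GRADE_CUTOFFS = [
    ("S", 0.0001), ("A+", 0.0009), ("A", 0.0018), ("A-", 0.0022),
    ("B+", 0.0030), ("B", 0.0045), ("B-", 0.0055), ("C+", 0.0075),
    ("C", 0.0150), ("C-", 0.0250), ("D+", 0.0500), ("D", 0.1100),
    ("D-", 0.2500), ("F", 1.0000),
]

def brier_grade(score: float) -> str:
    for grade, upper_bound in BRIER_GRADE_CUTOFFS:
        if score <= upper_bound:
            return grade
    return "F"
```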
The thresholds for relative scores are a little different. The relative scoring algorithm we use results in a lot of scores very close to zero, with a sharp dropoff and a roughly symmetrical curve on either side. We calculate our grade cutoffs so that C+ is centered at zero, with widths based on the deviations of the scores.
The thresholds for relative scores can be found on GitHub for now while we continue to add questions and tweak the grades in response.
wasabipesto, lead developer
Hi, I'm wasabipesto. You can find me at wasabipesto.com, or on GitHub at github.com/wasabipesto. I don't have a twitter, don't look for me there. If you want to contact me directly, you can email me at contact@wasabipesto.com.
If you find this site useful, please share it with others! We believe prediction markets are valuable, but only if their accuracy is verified and understood. We believe prediction markets should be evaluated rigorously and publicly, showing both their strengths and weaknesses, in order to earn credibility.
Please also share your feedback about the site with us. Currently our focus is on improving the site, making it more intuitive to use while also adding new features that give valuable insights. We're specifically interested in:
You can use the form below to submit this or any other feedback.
If you want to support the site financially, you can donate via GitHub Sponsors. For $5 per month you can have your name listed on the site as a supporter.
Having an issue with the site? See a bug or a typo? Found a set of markets that aren't covered here? Do you need help accessing the data for a research project? Here's how you can contact us:
And finally, here's a handy form for anything else:
Astral Codex Ten: Prediction Market FAQ
Scott Alexander gives a summary of what prediction markets are, their fundamental qualities, and common objections. It's excellent and super easy to read - if you read anything from this list, it should be this one.
The obligatory Wikipedia page on the topic. It has a good overview and timeline of the recent history, but not much else.
Prediction Markets are not Polls
A common refrain to prediction markets is that they're "just polls of random people on the internet". Isaac King puts together a great rebuttal with examples as to why this is not the case.
Prediction Markets: When Do They Work?
In an older post (from 2018), Zvi discusses some situations where prediction markets thrive, and some where they don't. There are many more markets today, but I believe the basis of this post still holds.
First Sigma: What can we learn from scoring different election forecasts?
Jack compares head-to-head performance between Metaculus, 538, Manifold, Polymarket, EBO, and PredictIt on the US 2022 midterm elections. Metaculus and 538 took the lead but with a small sample size.
EA Forum: Predictive Performance on Metaculus vs. Manifold Markets
A direct comparison of 64 binary markets mirrored between Manifold and Metaculus. Metaculus had a better score on 75% of the questions.
JHK Forecasts: Forecast Database
Jack Kersting (a different Jack) assembled an impressive list of US election forecasts across dozens of predictors from 2016 to 2024. While this isn't comparing prediction markets, it's still a good example of prediction metrics.
Many platforms will score themselves and publish their results, or users will create dashboards similar to this site using API or blockchain data. We took a lot of inspiration from these sites when creating our standardized scoring format and our charts.
The Wikipedia page on scoring rules. It was a great starting point for our research because it covers many different score types.
Cultivate Labs: Relative Brier Scores
This post from Cultivate Labs describing their relative scoring system is really the basis for this site. It describes the method of creating relative scores based on the daily mean score, with the added twist of penalizing forecasters that start predicting later.
Eigil Fjeldgren Rischel: Against calibration
A good summation of calibration versus accuracy, mainly that calibration can be applied more broadly but is less meaningful. This is exactly why we switched focus on this site away from calibration!
Prediction market accuracy in the long run
A comparison of the performance of the Iowa Electronic Markets (a small academic platform) against contemporary polls. Between 1988 and 2004, the markets outperformed polls 74% of the time.
How manipulable are prediction markets?
A team attempts to manipulate 817 random markets on Manifold in early 2024 and finds that the manipulations were somewhat reversed after 7 days, and more so after 30 days.
Metaforecast was created by Nuño Sempere (and now maintained by QURI) to be a search engine for prediction markets from over a dozen platforms. One search bar to find open predictions from basically anywhere.
Saul Munn: Prediction Market Map
If you're interested in exploring everything there is to know about prediction markets, Saul keeps a categorized list of resources on Notion.