Forecast Evaluation

Forecast evaluation measures how well probabilistic forecasts match outcomes using calibration diagnostics and proper scoring rules like Brier score and log loss.
Background

Forecast evaluation is the process of measuring how good a set of probabilistic forecasts is once outcomes are known. In prediction markets, it’s commonly used to test whether market-implied probabilities were trustworthy and informative over time.

Rather than asking only “was the market right?”, forecast evaluation asks:

  • Were probabilities accurate on average?
  • Were they calibrated (did 70% events happen ~70% of the time)?
  • Did the market become more informative as resolution approached?

Forecast evaluation turns raw probability data into evidence about forecasting quality. Teams use it to:

  • compare forecasting performance across topics, venues, or time periods
  • detect overconfidence and systematic bias
  • validate whether probabilities are usable for decision-making, research, or risk management

Forecast evaluation typically combines:

  1. Proper scoring rules (single-number accuracy measures)
    • Brier score for binary outcomes: (p - y)^2, where y ∈ {0, 1}
    • Log loss (cross-entropy): -[y ln(p) + (1 - y) ln(1 - p)]
  2. Calibration diagnostics (do stated probabilities match frequencies?)
    • calibration / reliability plots
    • bucketed observed-vs-predicted comparisons (e.g., 0.1-wide probability bins)
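The two scoring rules above can be sketched in a few lines. This is a minimal illustration using made-up forecast/outcome pairs; the `eps` clipping in the log-loss helper is a common guard against taking log(0), not something the article prescribes:

```python
import math

def brier_score(p: float, y: int) -> float:
    """Squared error between forecast probability p and binary outcome y."""
    return (p - y) ** 2

def log_loss(p: float, y: int, eps: float = 1e-15) -> float:
    """Cross-entropy for one binary forecast; eps guards against log(0)."""
    p = min(max(p, eps), 1 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Average scores over a set of resolved forecasts (lower is better for both).
forecasts = [(0.9, 1), (0.7, 1), (0.3, 0), (0.8, 0)]
mean_brier = sum(brier_score(p, y) for p, y in forecasts) / len(forecasts)
mean_ll = sum(log_loss(p, y) for p, y in forecasts) / len(forecasts)
```

Both rules are proper: a forecaster minimizes their expected score by reporting their true belief. Log loss punishes confident misses much more harshly than Brier score does.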

Scoring rules summarize “how good” the probabilities were; calibration tools show where they were strong or weak.
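A calibration table like the bucketed comparison described above can be built with plain Python. This is a sketch, not a reference implementation; the dict keys (`bin`, `n`, `mean_predicted`, `observed_frequency`) are illustrative names:

```python
def calibration_table(forecasts, n_bins=10):
    """Bucket (probability, outcome) pairs into n_bins fixed-width bins and
    compare mean predicted probability with observed frequency per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in forecasts:
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the top bin
        bins[idx].append((p, y))
    rows = []
    for i, items in enumerate(bins):
        if not items:
            continue  # skip empty bins rather than divide by zero
        rows.append({
            "bin": (i / n_bins, (i + 1) / n_bins),
            "n": len(items),
            "mean_predicted": sum(p for p, _ in items) / len(items),
            "observed_frequency": sum(y for _, y in items) / len(items),
        })
    return rows
```

In a well-calibrated cohort, `mean_predicted` and `observed_frequency` track each other across bins; a reliability plot is just this table drawn as a curve.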

Because probabilities evolve, it’s common to evaluate forecasts at consistent timestamps such as:

  • a fixed horizon (e.g., T-30d, T-7d, T-24h)
  • a standardized “close” snapshot (e.g., last price before resolution)

This helps separate early signal from late consensus and highlights whether a market converged smoothly or only moved at the end.
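Picking a consistent snapshot from a probability history can be sketched as a last-observation-before-cutoff lookup. The data shape here (a time-sorted list of `(timestamp, probability)` pairs) is an assumption for illustration:

```python
from datetime import datetime, timedelta

def snapshot_at_horizon(history, resolution_time, horizon):
    """Return the last observed probability at or before
    resolution_time - horizon; history is time-sorted (timestamp, prob)."""
    cutoff = resolution_time - horizon
    prob = None
    for ts, p in history:
        if ts <= cutoff:
            prob = p
        else:
            break
    return prob  # None if the market had no data that early

history = [
    (datetime(2024, 5, 1), 0.40),
    (datetime(2024, 5, 20), 0.55),
    (datetime(2024, 5, 30), 0.80),
]
p_t7 = snapshot_at_horizon(history, datetime(2024, 5, 31), timedelta(days=7))
# last observation on or before May 24 -> 0.55
```

Returning `None` when no early data exists matters: silently falling back to the first available price would mix horizons across markets and bias the comparison.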

Common pitfalls to watch for:

  • Outcome leakage: accidentally using prices after the outcome became known.
  • Survivorship bias: evaluating only high-volume markets or only cleanly resolved events.
  • Class imbalance: simple hit-rate can look good on rare events; scoring rules are usually more informative.
  • Timestamp mismatch: using a probability snapshot that doesn’t reflect what was knowable at that time.
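The class-imbalance pitfall is easy to demonstrate with synthetic numbers. In this hypothetical cohort of 100 markets with 5 rare YES outcomes, two forecasters get an identical hit-rate at a 0.5 threshold, yet the Brier score cleanly separates the one with real signal:

```python
def hit_rate(forecasts, threshold=0.5):
    """Fraction of events where the thresholded forecast matched the outcome."""
    return sum((p > threshold) == bool(y) for p, y in forecasts) / len(forecasts)

def mean_brier(forecasts):
    return sum((p - y) ** 2 for p, y in forecasts) / len(forecasts)

# 100 markets, 5 of which resolved YES (a rare event).
a = [(0.01, 1)] * 5 + [(0.01, 0)] * 95   # forecaster A: ignores the signal
b = [(0.40, 1)] * 5 + [(0.01, 0)] * 95   # forecaster B: real signal, still < 0.5

# Both score a 95% hit-rate (every forecast thresholds to "no"),
# but B's mean Brier score is far lower than A's.
```

This is why the section recommends scoring rules over simple hit-rates when outcomes are imbalanced.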

A research team evaluates 500 resolved binary markets. They compute Brier score and log loss at T-7d and at the final pre-resolution probability. The results show strong late accuracy but weaker early calibration, suggesting the market is most useful close to resolution and needs better early information aggregation.

If you’re evaluating prediction-market forecasts programmatically, FinFeedAPI’s Prediction Market API can provide time-stamped probability histories and resolution outcomes—key inputs for computing scoring rules, building calibration curves, and comparing performance across market cohorts.

Get your free API key now and start building in seconds!