How We Build Win Probability Models

Why independent probabilities matter

Most sports analytics services derive "fair value" by devigging sportsbook lines: they strip the bookmaker margin from Pinnacle, FanDuel, and DraftKings odds and average the results. This gives you a consensus probability that tracks the market by construction. It's useful for sports betting (finding +EV against soft books), but it cannot find prediction market mispricings, because the output already agrees with the market.
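For readers unfamiliar with devigging, here is a minimal sketch. It assumes American odds on a two-way market and the simple proportional (multiplicative) normalization; other devig methods exist, and the function names are illustrative, not taken from any particular service.

```python
def american_to_implied(odds: int) -> float:
    """Convert American odds to the implied probability (bookmaker margin included)."""
    if odds < 0:
        return -odds / (-odds + 100.0)
    return 100.0 / (odds + 100.0)


def devig_two_way(home_odds: int, away_odds: int) -> tuple[float, float]:
    """Rescale the two implied probabilities so they sum to 1 (proportional devig)."""
    p_home = american_to_implied(home_odds)
    p_away = american_to_implied(away_odds)
    total = p_home + p_away  # exceeds 1 when the book charges a margin
    return p_home / total, p_away / total


def consensus_home_prob(books: list[tuple[int, int]]) -> float:
    """Average the devigged home probability across (home_odds, away_odds) pairs."""
    fair = [devig_two_way(home, away)[0] for home, away in books]
    return sum(fair) / len(fair)
```

Because every input is a market price, the consensus cannot disagree with the market in any systematic way, which is the limitation described above.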

ZenHodl models are trained on game state only: score differential, seconds remaining, period, Elo ratings, and sport-specific features. No odds, no lines, no market prices are used as inputs. This makes our output genuinely independent from the market — when our fair probability diverges from the Polymarket ask price, that divergence is a real signal, not noise.

The tradeoff: our models have a worse Brier score than market-derived models (they're less accurate in an absolute sense). But they produce +10.9c/trade net profit across 1,552 backtested trades — because independence is what creates trading value.

	Market-Derived	ZenHodl (Independent)
Inputs	Sportsbook odds/lines	Score, time, Elo only
Output	Tracks market by construction	Genuinely independent
Brier Score	Better	Worse
Trading Value	Zero (agrees with market)	+10.9c/trade net
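The Brier score in the comparison above is the mean squared difference between the predicted probability and the 0/1 outcome, so lower is better. A minimal sketch (the function name is illustrative):

```python
def brier_score(probs: list[float], outcomes: list[int]) -> float:
    """Mean squared error between predicted probabilities and binary outcomes."""
    if not probs or len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must be non-empty and the same length")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)
```

A model that always predicts 0.5 scores 0.25, and a perfect model scores 0. The score measures accuracy only; it says nothing about whether the predictions differ from the price you can actually trade at.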

Data pipeline

We scrape ESPN's play-by-play API for every game across 7 sports. Each game produces ~400 snapshots — one per score change or significant event.

Dataset size: 60,702 games, 25.6M rows, 7 sports, seasons 2020–26.

Sports: NBA, NCAAMB, NCAAWB, CFB, NFL, NHL, MLB

Data is stored as Apache Parquet files. One row = one game state (score, period, clock, ESPN WP, outcome label).

Module 1 of our course teaches you to build this exact scraper.

Feature engineering

Each game state is featurized with 13–16 variables. We deliberately keep the feature set small — overfitting to noise destroys trading value.

Feature	Sports	Description
score_diff	All	home_score − away_score
seconds_remaining	All	Total game seconds left
period	All	Current period/half/inning
time_fraction	All	Fraction of game elapsed (0→1)
elo_diff	All	Home Elo − Away Elo
pregame_wp	All	ESPN pre-game win probability (fixed prior)
score_diff_x_tf	All	Lead × time elapsed (interaction)
score_diff_sq	All	Lead² (quadratic, captures blowouts)
is_home_batting	MLB	1 if home team is batting
down, distance	CFB/NFL	Football situation
yard_line	CFB/NFL	Field position
possession_home	CFB/NFL	1 if home has the ball
pace features	NBA	total_score, ortg_diff, drtg_diff
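The shared features in the table can be derived from a single game-state row roughly as follows. This is a sketch under assumptions: the GameState fields and the featurize function are hypothetical names, time_fraction is clamped to [0, 1], and score_diff_sq is taken as a plain square (which drops the sign of the lead).

```python
from dataclasses import dataclass


@dataclass
class GameState:
    home_score: int
    away_score: int
    seconds_remaining: float
    total_seconds: float  # regulation length, e.g. 2880 for a 48-minute NBA game
    period: int
    home_elo: float
    away_elo: float
    pregame_wp: float     # fixed pre-game prior, not updated during the game


def featurize(s: GameState) -> dict[str, float]:
    """Build the sport-agnostic features listed in the table above."""
    score_diff = s.home_score - s.away_score
    elapsed = 1.0 - s.seconds_remaining / s.total_seconds
    time_fraction = min(1.0, max(0.0, elapsed))  # clamp to [0, 1]
    return {
        "score_diff": score_diff,
        "seconds_remaining": s.seconds_remaining,
        "period": s.period,
        "time_fraction": time_fraction,
        "elo_diff": s.home_elo - s.away_elo,
        "pregame_wp": s.pregame_wp,
        "score_diff_x_tf": score_diff * time_fraction,  # a lead matters more late
        "score_diff_sq": score_diff ** 2,               # assumed unsigned square
    }
```

The interaction term lets a linear model treat an 8-point lead at halftime differently from the same lead in the final minutes without adding many parameters.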

Model architecture

We use sport-specific models — no one-size-fits-all approach.

Basketball & Football (NBA, NCAAMB, NCAAWB, CFB, NFL)

Logistic regression with isotonic spline calibration. We tested XGBoost, random forests, and neural nets. The key finding: simpler models produce better trading value because they don't overfit to noise in the training data. XGBoost achieves a better Brier score but worse c/trade. We select models by trading value, not accuracy.

NCAAMB is the exception: there, XGBoost+Isotonic outperforms LR+Spline. The extra complexity is justified by the much larger sample (854 trades vs 30 for NFL).

Soccer

Poisson model: predicts goal rates for home and away teams, adjusted by Elo. Generates P(home win), P(draw), P(away win) from the Poisson distribution. Calibrated on 211 matches across EPL, La Liga, Ligue 1.
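The step from goal rates to three-way probabilities can be sketched as below. It assumes the home and away goal counts are independent Poisson variables and truncates the score grid at max_goals; how the rates are adjusted by Elo is not shown here, so they are plain inputs. Function names are illustrative.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def match_probs(lam_home: float, lam_away: float,
                max_goals: int = 15) -> tuple[float, float, float]:
    """Return (P(home win), P(draw), P(away win)) from expected goal rates."""
    home_pmf = [poisson_pmf(k, lam_home) for k in range(max_goals + 1)]
    away_pmf = [poisson_pmf(k, lam_away) for k in range(max_goals + 1)]
    home = draw = away = 0.0
    for h, ph in enumerate(home_pmf):
        for a, pa in enumerate(away_pmf):
            p = ph * pa  # independence assumption
            if h > a:
                home += p
            elif h == a:
                draw += p
            else:
                away += p
    total = home + draw + away  # slightly below 1 because of truncation
    return home / total, draw / total, away / total
```

For example, match_probs(1.6, 1.1) returns three probabilities that sum to 1, with the home side favored because its goal rate is higher.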

Esports (CS2, LoL)

Binomial model: Elo-based game win probability fed into a binomial series model (BO1/BO3/BO5). 258 CS2 teams (HLTV rankings), 255 LoL teams (lolesports API).
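A minimal sketch of the Elo-to-series calculation follows. It assumes the standard Elo logistic curve with a 400-point scale and a constant, independent win probability for every game in the series; both are simplifications, and the function names are illustrative.

```python
import math


def elo_win_prob(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score for team A in a single game."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def series_win_prob(p_game: float, best_of: int) -> float:
    """P(team wins a best-of-N series) given a per-game win probability.

    Equivalent to winning a majority of N games if all N were played.
    """
    if best_of < 1 or best_of % 2 == 0:
        raise ValueError("best_of must be a positive odd number")
    wins_needed = best_of // 2 + 1
    return sum(
        math.comb(best_of, k) * p_game ** k * (1.0 - p_game) ** (best_of - k)
        for k in range(wins_needed, best_of + 1)
    )
```

Longer series amplify the favorite's edge: a 60% per-game favorite wins a BO3 about 64.8% of the time, and a BO5 more often still.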

Spread/Total

Regression + CDF: predicts remaining margin/total, then P(cover) = Φ((predicted − line) / σ). Time-bucketed residual standard deviation: wide σ early (conservative), narrow σ late (confident).
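The cover probability formula can be sketched with the standard library alone. The σ values in the bucket table below are made-up placeholders for illustration, not fitted residuals, and the function names are hypothetical.

```python
import math

# Hypothetical residual standard deviations by fraction of game elapsed.
# Each pair is (upper bound of time_fraction bucket, sigma in points).
SIGMA_BY_BUCKET = [(0.25, 12.0), (0.50, 10.0), (0.75, 7.5), (1.00, 4.0)]


def sigma_for(time_fraction: float) -> float:
    """Look up the residual sigma for the bucket containing time_fraction."""
    for upper, sigma in SIGMA_BY_BUCKET:
        if time_fraction <= upper:
            return sigma
    return SIGMA_BY_BUCKET[-1][1]


def normal_cdf(z: float) -> float:
    """Standard normal CDF, Phi(z), via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def p_cover(predicted: float, line: float, sigma: float) -> float:
    """P(cover) = Phi((predicted - line) / sigma)."""
    return normal_cdf((predicted - line) / sigma)
```

With the same 3-point gap between prediction and line, a wide early-game σ keeps the probability close to 0.5, while a narrow late-game σ moves it much further from 0.5.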

Backtesting methodology

Our backtests are designed to avoid the common mistakes that inflate results.

Time-split validation: Train on season N−1, test on season N. No future data leaks.
Real market prices: NBA and NCAAMB tested against actual Polymarket ask prices from live snapshots. CFB/NFL use ESPN WP as a market proxy + 0.5c half-spread.
Deduplication: One entry per (game_id, side, score, period). No duplicate signals from the same game state.
Subsampling: Every 5th row per game simulates realistic 25-second polling intervals.
Hold to settlement: All trades held to game end. No exit timing or stop-loss assumptions.
Net of fees: Results include Polymarket taker fee (2c) and estimated slippage (1c).
Score diff filter: Minimum 3-point lead required. Tight games have too much noise.
Period filter: Enter in periods 2–3 only. First-period data is too noisy; late-game has diminishing returns.
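The entry rules above can be sketched as a single pass over time-ordered rows. The row keys (game_id, side, score, period, score_diff) and the order in which the filters are applied are assumptions made for illustration; prices are in dollars per share, so the 2c fee and 1c slippage appear as 0.02 and 0.01.

```python
TAKER_FEE = 0.02  # Polymarket taker fee, per the rules above
SLIPPAGE = 0.01   # estimated slippage, per the rules above


def select_entries(rows: list[dict], min_lead: int = 3,
                   periods: tuple[int, ...] = (2, 3),
                   every_nth: int = 5) -> list[dict]:
    """Apply subsampling, period and lead filters, then deduplicate."""
    seen: set[tuple] = set()
    rows_seen_per_game: dict[str, int] = {}
    entries = []
    for row in rows:  # rows are assumed to be ordered by time within each game
        idx = rows_seen_per_game.get(row["game_id"], 0)
        rows_seen_per_game[row["game_id"]] = idx + 1
        if idx % every_nth != 0:             # keep every 5th row per game
            continue
        if row["period"] not in periods:     # periods 2-3 only
            continue
        if abs(row["score_diff"]) < min_lead:  # skip tight games
            continue
        key = (row["game_id"], row["side"], row["score"], row["period"])
        if key in seen:                      # one entry per game state
            continue
        seen.add(key)
        entries.append(row)
    return entries


def net_pnl(ask: float, won: bool) -> float:
    """Hold-to-settlement P&L per share: payout minus ask, fee, and slippage."""
    payout = 1.0 if won else 0.0
    return payout - ask - TAKER_FEE - SLIPPAGE
```

Buying at a 0.60 ask nets about 0.37 per share on a win and loses about 0.63 on a loss, so costs shift the break-even probability from 0.60 to roughly 0.63.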

What we tried that doesn't work

We believe in showing failures alongside successes.

Sell-the-top / fade spikes: Negative EV. Information-driven price moves don't mean-revert.
Compression sniping: Negative EV on historical data, and exposed to adverse selection in live trading.
Martingale / double-down on dips: Catastrophic — doubles position into adverse moves.
NBA taker (pre-v3): Lost 1.3c/trade because the NBA market was too efficient. Now +4.9c/trade with pace features, but still our weakest sport.
Spread/Total mean-reversion exits: Spreads reprice permanently from score changes. Fix: hold to settlement instead.
ESPN live WP as model input: Tested and rejected — creates circular dependency with market. ESPN pre-game WP works as a fixed prior.
Conformal prediction intervals: Tested and rejected — added complexity without improving trading value.
