
99% of Your Backtested Edge Doesn't Exist: Execution Quality on Prediction Markets

2026-03-30 · execution, microstructure, backtest, kalshi, polymarket, advanced

Your backtest is lying to you.

Not about the model. Not about the features. About something far more fundamental: whether the trade you think you're making can actually be executed.

We ran a controlled experiment on live Kalshi sports markets — same strategy, same signals, two different assumptions about execution. The results changed how we build everything.

The Experiment That Changed Everything

We identified 237 mean-reversion opportunities on SPREAD and TOTAL markets over a 3-day live observation window (Dec 27-29, 2025). Same signals. Two backtests:

Naive Backtest (assumes constant liquidity):
- 237 opportunities detected
- 2,772 cents total profit
- 32% win rate

Realistic Backtest (actual execution constraints):
- 10 executable trades
- 31 cents total profit
- 90% win rate

99% of the theoretical edge vanished. Not because the model was wrong — but because 227 of those 237 "opportunities" couldn't actually be traded.

Zero of 81 parameter combinations we tested produced positive risk-adjusted returns once execution costs were included. The math was brutal:

Component          Impact
Theoretical PnL    +2,772c
Spread costs       -3,335c
Fees               -474c
Net result         -1,680c

Execution costs exceeded theoretical edge by 160%.
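
The gap between the two backtests can be sketched as a single gating step over the same signal list. This is a minimal, hypothetical illustration; the `Signal` fields, thresholds, and cost model are assumptions for the example, not Kalshi's API or our production logic.

```python
from dataclasses import dataclass

@dataclass
class Signal:
    edge_cents: float        # theoretical edge at the snapshot price
    spread_cents: float      # actual bid/ask spread at signal time
    depth_contracts: int     # contracts available at the touch
    fee_cents: float         # round-trip exchange fees

def naive_pnl(signals):
    """Assumes every signal fills at the snapshot price, cost-free."""
    return sum(s.edge_cents for s in signals)

def realistic_pnl(signals, max_spread=3.0, min_depth=100):
    """Only counts signals that clear spread/depth gates, net of costs."""
    total = 0.0
    for s in signals:
        if s.spread_cents > max_spread or s.depth_contracts < min_depth:
            continue  # untradable: this is where most of the "edge" disappears
        total += s.edge_cents - s.spread_cents - s.fee_cents
    return total
```

Run both functions over the same signals and the naive number is always an upper bound; the realistic number is what survives the gate, net of spread and fees.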

The Three Things That Kill Your Trades

When we dissected the 227 failed opportunities, three failure modes dominated:

1. Spread Too Wide (82% of failures)

The orderbook looks tight in your historical data because you're sampling snapshots. Between those snapshots, spreads widen — sometimes dramatically. A trade that appears to have a 5-cent edge actually has a 12-cent spread cost to enter and exit.

2. Depth Too Low (14% of failures)

The price is right, but there's nobody on the other side. You see a bid at 52 cents, but there are only 8 contracts there. Your 100-contract order moves the market 6 cents against you before it fills.

3. Entry Not Sustained (4% of failures)

The signal was real, the spread was tight, the depth was there — and then a score change repriced the entire market before your exit. The opportunity existed for 3 seconds. Your polling interval is 5 seconds.
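
The three failure modes translate directly into a triage function. A sketch with illustrative names; the 3-cent and 100-contract cutoffs are assumed thresholds, and the percentages in the comments are the sample shares reported above.

```python
def failure_mode(spread_cents, depth_contracts, signal_lifetime_s,
                 poll_interval_s, max_spread=3.0, min_depth=100):
    """Classify why a detected opportunity could not be executed."""
    if spread_cents > max_spread:
        return "spread_too_wide"      # ~82% of failures in our sample
    if depth_contracts < min_depth:
        return "depth_too_low"        # ~14%
    if signal_lifetime_s < poll_interval_s:
        return "entry_not_sustained"  # ~4%: gone before the next poll
    return "executable"
```

Tagging every rejected signal this way is what let us see that spread, not depth or timing, dominates the failure count.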

Quote Freezes: The Hidden Information Blackout

This was the finding that surprised us most.

When market makers stop updating their quotes, information doesn't stop flowing. It accumulates. And when quotes finally update, they gap.

Across 7,889 observed quote-freeze events, post-freeze gaps reached as large as 93 cents.

Your bot sees a stale orderbook and thinks it's safe. It isn't. A 93-cent gap on a contract priced at 50 cents would imply a move to 143; since prices cap at 100, the contract simply pins the bound. Either way, it's a complete repricing.

70.5% of freezes occur within 30 seconds of a score change. This isn't random — it's a systematic information blackout during the exact moments when markets move most.
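
One way to treat freezes as blackouts rather than safe windows is a simple gate on quote age and score-change proximity. This is a hypothetical sketch; the 45-second and 30-second thresholds reuse numbers from this analysis, but the function name and inputs are ours.

```python
def is_blackout(now_s, last_quote_update_s, score_change_times_s,
                max_quote_age_s=45, score_cooldown_s=30):
    """True when the book should be treated as uninformative."""
    if now_s - last_quote_update_s > max_quote_age_s:
        return True   # stale book: information is accumulating, not absent
    for t in score_change_times_s:
        if 0 <= now_s - t <= score_cooldown_s:
            return True  # most freezes cluster in this post-score window
    return False
```

The key design choice: a stale quote is a reason to stand down, never a reason to lean on the last printed price.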

Score Changes Break Markets

When a score changes in a live game, market structure collapses. We measured this precisely:

Metric                                          Value
Score events that cause a "tradability cliff"   54%
NBA cliff rate                                  56%
NFL cliff rate                                  34%
Median recovery time                            29 seconds
P95 recovery time                               85 seconds

A "tradability cliff" means the spread widens beyond 3 cents OR depth drops below 100 contracts — the market becomes untradable.

The asymmetry between sports matters enormously. NFL moneyline markets recover instantly (0 seconds post-freeze). NBA spread markets need 177 seconds before the cost environment stabilizes. Trading the wrong sport at the wrong time means paying a tax you can't see in backtests.
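
Those recovery numbers suggest a per-regime cooldown after every score change. A sketch under the measurements above; the dictionary keys, default value, and function name are illustrative.

```python
# Measured recovery times per (sport, market type), in seconds.
RECOVERY_S = {
    ("NFL", "MONEYLINE"): 0,    # recovers instantly post-freeze
    ("NBA", "SPREAD"): 177,     # costs stay elevated for ~3 minutes
}

def ok_to_trade(sport, market, seconds_since_score, default_recovery_s=29):
    """Gate entries until the regime's post-score recovery window passes.

    Unmeasured regimes fall back to the median recovery time (29s).
    """
    cooldown = RECOVERY_S.get((sport, market), default_recovery_s)
    return seconds_since_score >= cooldown
```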

95% of Game Time Is Untradable

This is the single most important number in this entire analysis.

When we applied strict tradability filters — spread under 2 cents, depth over 5,000 contracts per side, quotes less than 45 seconds old, market not halted — only 5.2% of live game time qualified.

Your backtest assumes 100% tradability. Reality delivers 5%.

This explains the entire funnel: most "opportunities" have zero execution probability. They exist in your historical data because the snapshot captured a momentary price, not a tradable market.
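
The strict tradability filter reads naturally as a single predicate. The thresholds are the ones just listed; the field names are illustrative, not Kalshi's API.

```python
def is_tradable(spread_cents, depth_per_side, quote_age_s, halted):
    """Strict tradability: the conditions only ~5.2% of live game time met."""
    return (not halted
            and spread_cents < 2          # spread under 2 cents
            and depth_per_side > 5000     # depth over 5,000 contracts per side
            and quote_age_s < 45)         # quotes less than 45 seconds old
```

Run this over every snapshot in your backtest and weight PnL by the pass rate; that one change closes most of the gap between naive and realistic results.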

What Actually Works

After all this bad news, here's the good news: there IS a tradable universe. It's just much smaller and more specific than most people assume.

NFL SPREAD in the 10-20 cent spread band emerged as the most reliable regime:
- 334 observed attempts
- 79% fill rate
- Positive PnL after stress-testing with a +1 cent fee buffer
- Best fill rate of any regime tested

NFL SPREAD in the 20-50 cent spread band had the best risk/reward:
- 113 attempts
- 71% fill rate
- Highest per-trade PnL of any regime

What didn't work:
- Moneyline markets: only 12 events detected — too sparse to build a strategy around
- NBA markets: negative in 12+ months of backtesting; 0% of snapshots met minimum tradability conditions
- Anything with spreads over 50 cents: fill rates collapse

The 65% Rule

The most counterintuitive finding: you need to skip 60-65% of apparent opportunities.

When we combined strict entry filters (right sport, right spread band, sufficient depth) with execution gating (no active freezes, no recent score changes, sufficient quote freshness), the pass rate dropped to 35-40% of all detected signals.

But the trades that survived both filters had:
- 70-85% fill rates
- Positive expected value after all costs
- 20-30x better performance than naive "buy every dip" approaches

The competitive advantage isn't finding more signals. It's having the discipline — and the data — to reject the bad ones.
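
Combining entry filters with execution gating looks roughly like this two-stage check. The NFL SPREAD 10-50 cent band comes from the regimes above; everything else (field names, exact checks) is illustrative.

```python
def passes_entry(sport, market, band_cents):
    # "Right sport, right spread band": NFL SPREAD in the 10-50c band.
    return sport == "NFL" and market == "SPREAD" and 10 <= band_cents <= 50

def passes_execution(active_freeze, seconds_since_score, quote_age_s,
                     score_cooldown_s=30, max_quote_age_s=45):
    # "No active freezes, no recent score changes, sufficient freshness."
    return (not active_freeze
            and seconds_since_score > score_cooldown_s
            and quote_age_s < max_quote_age_s)

def should_trade(sig):
    """Both stages must pass; in our data only 35-40% of signals do."""
    return (passes_entry(sig["sport"], sig["market"], sig["band_cents"])
            and passes_execution(sig["active_freeze"],
                                 sig["seconds_since_score"],
                                 sig["quote_age_s"]))
```

Note that rejection is the default path: every branch exists to say no, and the trade only happens when nothing objects.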

What This Means For Your Strategy

If you're building prediction market bots, audit these assumptions:

Are you modeling quote freezes? If not, you're ignoring 58-second information blackouts; 70% of freezes land within 30 seconds of a score change.

Are you assuming continuous liquidity? Only 5% of game time is actually tradable. Your backtest assumes 100%.

Did you include spread costs? Average entry/exit spread cost was 3.3 cents — before fees. Many "edges" are smaller than this.

Did you test on live-only data? Pregame regimes are roughly 4x better than live for execution quality. If your backtest mixes both, it's optimistic.

If you answered "no" to any of these, your backtest is likely overstating returns by 50-200%.

The Real Competitive Advantage

The prediction market gold rush doesn't reward the best model. It rewards the best execution.

Every quant can build a model that works on historical data. The ones who actually make money are the ones who:

  1. Measure execution quality — trade what you can actually fill, not what your backtest imagines
  2. Respect quote freezes — they're information blackouts, not safe windows
  3. Pick the right regime — NFL SPREAD at 10-50 cent spreads is real; NBA moneyline is a microstructure trap
  4. Say no most of the time — the opportunity cost of waiting beats the slippage cost of rushing

The edge isn't in the signal. It's in knowing which signals you can actually capture.


This analysis is based on 13,568 market events observed on live Kalshi markets with ~30-second polling resolution. Full methodology details are available on our methodology page.

Want to learn how to build execution-aware bots from scratch? Our course covers model building through deployment — including the execution pitfalls most tutorials skip.
