Every profitable-looking backtest has the same problem: it probably is not real. The gap between backtested returns and live returns in sports betting is enormous, and it almost always goes in one direction — your backtest is too optimistic.
This guide shows you how to build a backtest that does not lie to you, and how to spot the biases that inflate results.
The Minimum Viable Backtest
A sports betting backtest needs four components:
```python
import pandas as pd
import numpy as np


def backtest_strategy(
    predictions: pd.DataFrame,
    edge_threshold: float = 0.08,
    kelly_fraction: float = 0.25,
    bankroll: float = 1000.0,
) -> pd.DataFrame:
    """
    Backtest a betting strategy.

    predictions DataFrame must have columns:
    - model_prob: your model's fair probability
    - market_price: what the market is asking
    - outcome: 1 if the bet would have won, 0 otherwise
    - timestamp: when the signal occurred
    """
    trades = []
    current_bankroll = bankroll
    for _, row in predictions.iterrows():
        edge = row["model_prob"] - row["market_price"]
        if edge < edge_threshold:
            continue
        # Kelly criterion for a binary contract bought at market_price:
        # f* = (bp - q) / b with b = (1 - price) / price, which simplifies
        # to edge / (1 - price)
        kelly = edge / (1 - row["market_price"])
        bet_size = current_bankroll * kelly_fraction * max(kelly, 0)
        bet_size = min(bet_size, current_bankroll * 0.05)  # Max 5% per trade
        if bet_size < 1.0:
            continue
        # Resolve the trade: a winning contract pays (1 - price) / price per dollar
        if row["outcome"] == 1:
            pnl = bet_size * (1 - row["market_price"]) / row["market_price"]
        else:
            pnl = -bet_size
        current_bankroll += pnl
        trades.append({
            "timestamp": row["timestamp"],
            "model_prob": row["model_prob"],
            "market_price": row["market_price"],
            "edge": edge,
            "bet_size": bet_size,
            "pnl": pnl,
            "bankroll": current_bankroll,
        })
    return pd.DataFrame(trades)
```
This is the skeleton. By itself, it tells you nothing useful — because the inputs determine everything. The model probabilities, the market prices, and how you constructed the dataset are where the biases hide.
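How the predictions frame gets built matters as much as the backtest itself. Here is a minimal sketch, assuming a two-way market quoted in decimal odds (the column names and numbers are hypothetical): convert each side's odds to an implied probability, then strip the vig by normalizing so the two sides sum to 1 before comparing against your model.

```python
import pandas as pd


def implied_price(decimal_odds: pd.Series) -> pd.Series:
    """Convert decimal odds to an implied probability (the market 'price')."""
    return 1.0 / decimal_odds


def remove_vig(price_a: pd.Series, price_b: pd.Series) -> tuple:
    """Normalize a two-way market so the implied probabilities sum to 1."""
    overround = price_a + price_b
    return price_a / overround, price_b / overround


# Hypothetical raw feed: decimal odds for both sides of a two-way market
raw = pd.DataFrame({
    "timestamp": pd.to_datetime(["2025-01-04 18:00", "2025-01-04 19:30"]),
    "odds_home": [1.80, 2.40],
    "odds_away": [2.10, 1.65],
    "model_prob": [0.62, 0.45],  # from your model, not the market
    "outcome": [1, 0],
})

p_home, p_away = remove_vig(
    implied_price(raw["odds_home"]), implied_price(raw["odds_away"])
)
predictions = pd.DataFrame({
    "timestamp": raw["timestamp"],
    "model_prob": raw["model_prob"],
    "market_price": p_home,  # betting the home side here
    "outcome": raw["outcome"],
})
print(predictions)
```

If your venue quotes prices directly in cents (like a prediction market), the conversion step disappears, but the vig question does not: a price that already includes the spread is not a fair probability.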
Bias #1: Look-Ahead Bias
Look-ahead bias is using information from the future to make decisions in the past. It is the most common and most destructive bias in sports backtesting.
How it happens:
- Training your model on the full dataset, then backtesting on the same dataset
- Using closing line prices instead of the price available when your signal fired
- Computing Elo ratings with future games included
- Using injury reports that were released after your hypothetical trade time
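The feature-engineering version of this bias is easy to demonstrate. Below is a sketch with a hypothetical rolling win-rate feature: without `shift(1)`, the feature computed for each game includes that game's own result, which is exactly the future information your model must not see.

```python
import pandas as pd

games = pd.DataFrame({
    "date": pd.date_range("2025-01-01", periods=6, freq="D"),
    "team": ["A"] * 6,
    "won": [1, 0, 1, 1, 0, 1],
})

# LEAKY: the rolling mean for each game includes that game's own result
games["win_rate_leaky"] = games.groupby("team")["won"].transform(
    lambda s: s.rolling(3, min_periods=1).mean()
)

# SAFE: shift(1) first, so the feature only sees strictly earlier games
games["win_rate_safe"] = games.groupby("team")["won"].transform(
    lambda s: s.shift(1).rolling(3, min_periods=1).mean()
)

print(games[["date", "won", "win_rate_leaky", "win_rate_safe"]])
```

The tell: the leaky column "knows" the first game's result before it was played, while the safe column is NaN until there is actual history.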
How to prevent it:
```python
def walk_forward_backtest(data, train_window_days=365):
    """
    Walk-forward validation: train only on past data, test on the next day.
    Eliminates look-ahead bias by construction.

    Assumes train_model() and a `features` column list are defined elsewhere.
    """
    results = []
    dates = sorted(data["date"].unique())
    for test_date in dates:
        # Training data: only games BEFORE the test date
        cutoff = test_date - pd.Timedelta(days=1)
        train_start = cutoff - pd.Timedelta(days=train_window_days)
        train_data = data[(data["date"] >= train_start) & (data["date"] <= cutoff)]
        if len(train_data) < 100:
            continue
        # Train model on past data only
        model = train_model(train_data)
        # Predict on today's games (copy to avoid mutating a view)
        test_data = data[data["date"] == test_date].copy()
        test_data["model_prob"] = model.predict_proba(test_data[features])[:, 1]
        results.append(test_data)
    return pd.concat(results)
```
Walk-forward validation is non-negotiable. If your backtest does not strictly separate past and future, the results are meaningless.
Bias #2: Survivorship Bias
Survivorship bias means your dataset only includes games or markets that still exist — not the ones that were canceled, delisted, or resolved ambiguously.
How it happens in sports:
- Excluding postponed or canceled games (which you would have bet on had they not been canceled)
- Dropping games with missing data (the missingness may be correlated with unusual situations)
- Only including sports or leagues where your model performs well
How to prevent it:
```python
def check_survivorship_bias(raw_data, backtest_data):
    """Check how many games from the raw data made it into the backtest."""
    raw_count = len(raw_data)
    backtest_count = len(backtest_data)
    drop_rate = 1 - backtest_count / raw_count
    print(f"Raw games: {raw_count}")
    print(f"Backtest games: {backtest_count}")
    print(f"Drop rate: {drop_rate:.1%}")
    if drop_rate > 0.05:
        print("WARNING: >5% of games dropped. Investigate why.")
        # Check if dropped games are random or systematic
        dropped = raw_data[~raw_data.index.isin(backtest_data.index)]
        print("Dropped game status distribution:")
        print(dropped["status"].value_counts())
```
A drop rate above 5% is a red flag. Investigate every dropped game. If they are disproportionately unusual games (overtime, weather delays, controversial outcomes), your backtest is biased toward "normal" games where your model works best.
Bias #3: Execution Assumptions
This is the bias that kills the most strategies. Your backtest assumes you can execute at the market price you observe. In reality, you cannot.
The execution gap includes:
- Spread cost: The difference between bid and ask. You buy at the ask, sell at the bid. A 3-cent spread costs you 3 cents per round trip.
- Slippage: Your order moves the price. Large orders fill at worse prices than the quoted best ask.
- Latency: The price you see is seconds old. By the time your order arrives, the price may have changed.
- Fill rate: Not every order fills. If you only count filled orders, you are ignoring the opportunity cost of missed trades.
```python
def add_execution_costs(
    trades: pd.DataFrame,
    spread_cost: float = 0.03,
    fee_rate: float = 0.02,
    fill_probability: float = 0.70,
) -> pd.DataFrame:
    """Adjust backtest results for realistic execution costs."""
    # Simulate partial fills
    trades["filled"] = np.random.random(len(trades)) < fill_probability
    filled_trades = trades[trades["filled"]].copy()
    # Adjust entry price for spread: you buy at the ask, half a spread worse
    filled_trades["effective_price"] = (
        filled_trades["market_price"] + spread_cost / 2
    )
    # Recalculate PnL with the worse entry price and fees
    for i, row in filled_trades.iterrows():
        # Record how much edge survives the costs
        filled_trades.loc[i, "edge_after_costs"] = (
            row["model_prob"] - row["effective_price"] - fee_rate
        )
        if row["outcome"] == 1:
            gross = row["bet_size"] * (1 - row["effective_price"]) / row["effective_price"]
            filled_trades.loc[i, "pnl"] = gross * (1 - fee_rate)
        else:
            filled_trades.loc[i, "pnl"] = -row["bet_size"]
    return filled_trades
```
Our execution quality research found that 99% of theoretical edge vanished when real execution constraints were applied. If your backtest does not model execution costs, multiply your expected returns by 0.1 for a rough reality check.
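A quick sanity check falls out of these numbers. Using the same hypothetical cost parameters as above (a 3-cent spread, a 2% fee) and treating the fee as a flat per-contract cost, which is an approximation, the raw model edge must clear half the spread plus the fee before a trade has any positive expectancy:

```python
def breakeven_edge(spread_cost: float = 0.03, fee_rate: float = 0.02) -> float:
    """Minimum raw model edge before a trade has positive expectancy,
    assuming you cross half the spread and pay a flat fee per contract."""
    return spread_cost / 2 + fee_rate


# With a 3-cent spread and a 2% fee, any edge under 3.5 cents is noise
print(f"breakeven edge: {breakeven_edge():.3f}")
```

This is one reason an 8-cent edge threshold can be reasonable even when the model's edges look much larger on paper.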
Bias #4: Overfitting the Edge Threshold
You tried edge thresholds of 5, 6, 7, 8, 9, and 10 cents, and found that 8 cents works best. But by trying 6 values, you have implicitly optimized on your test set. The "best" threshold is partially luck.
How to prevent it:
```python
def robust_threshold_test(predictions, thresholds, n_bootstrap=500):
    """Bootstrap test to check whether the threshold selection is robust."""
    results = {}
    for threshold in thresholds:
        bootstrap_pnls = []
        for _ in range(n_bootstrap):
            # Resample with replacement
            sample = predictions.sample(frac=1.0, replace=True)
            trades = backtest_strategy(sample, edge_threshold=threshold)
            if len(trades) > 0:
                bootstrap_pnls.append(trades["pnl"].sum())
            else:
                bootstrap_pnls.append(0)
        results[threshold] = {
            "mean_pnl": np.mean(bootstrap_pnls),
            "std_pnl": np.std(bootstrap_pnls),
            "pct_positive": np.mean(np.array(bootstrap_pnls) > 0),
        }
        print(
            f"Threshold {threshold:.2f}: "
            f"mean={results[threshold]['mean_pnl']:.1f} "
            f"std={results[threshold]['std_pnl']:.1f} "
            f"P(profit)={results[threshold]['pct_positive']:.0%}"
        )
    return results
```
If the best threshold has a P(profit) below 85% in the bootstrap, it is not robust. A strategy that only works at exactly one threshold setting is likely overfit.
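The other defense against implicit optimization is chronological: choose the threshold on early data only and report performance on data that never touched the selection. Here is a self-contained sketch of that split using synthetic trades with a small built-in edge; every number and name here is illustrative, not a real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 2000
# Synthetic signals: a true ~2-cent edge plus model noise (illustrative only)
market_price = rng.uniform(0.3, 0.7, n)
true_prob = np.clip(market_price + 0.02, 0.01, 0.99)
model_prob = np.clip(true_prob + rng.normal(0, 0.05, n), 0.01, 0.99)
outcome = (rng.random(n) < true_prob).astype(int)

df = pd.DataFrame({
    "model_prob": model_prob,
    "market_price": market_price,
    "outcome": outcome,
})
df["edge"] = df["model_prob"] - df["market_price"]
# Flat $1 stake PnL per trade on a binary contract
df["pnl"] = np.where(
    df["outcome"] == 1, (1 - df["market_price"]) / df["market_price"], -1.0
)

half = n // 2
select, holdout = df.iloc[:half], df.iloc[half:]

# Pick the threshold on the (chronologically earlier) selection half only
thresholds = [0.05, 0.06, 0.07, 0.08, 0.09, 0.10]
best = max(thresholds, key=lambda t: select.loc[select["edge"] >= t, "pnl"].sum())

# Report it on the untouched holdout half
oos_pnl = holdout.loc[holdout["edge"] >= best, "pnl"].sum()
print(f"chosen threshold: {best}, out-of-sample PnL: {oos_pnl:.1f}")
```

If the chosen threshold's holdout PnL collapses relative to its selection-half PnL, the "best" threshold was fit to noise.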
Bias #5: Ignoring Regime Changes
A model trained on 2022-2024 NBA data may not work on 2025-2026 data. Markets evolve. Market makers get smarter. Liquidity changes. Your edge decays.
```python
def check_regime_stability(trades: pd.DataFrame):
    """Check whether strategy performance is stable across time."""
    trades = trades.copy()
    trades["quarter"] = pd.to_datetime(trades["timestamp"]).dt.to_period("Q")
    quarterly = trades.groupby("quarter").agg(
        n_trades=("pnl", "count"),
        total_pnl=("pnl", "sum"),
        win_rate=("pnl", lambda x: (x > 0).mean()),
        avg_edge=("edge", "mean"),
    )
    print(quarterly.to_string())
    # Flag quarters that diverge sharply from the average
    # (only meaningful when the mean quarterly PnL is positive)
    mean_pnl = quarterly["total_pnl"].mean()
    for q, row in quarterly.iterrows():
        if row["total_pnl"] < mean_pnl * 0.3:
            print(f"WARNING: {q} significantly underperformed — possible regime change")
```
If more than one quarter is deeply negative, the strategy may not be robust. Look at what changed: new market makers, different sports seasons, liquidity shifts.
Putting It All Together
An honest backtest pipeline looks like this:
- Walk-forward split: Train only on past data. No exceptions.
- Generate predictions: Model outputs probability for each game.
- Apply edge filter: Only trade when model disagrees with market by enough.
- Add execution costs: Spread, fees, slippage, partial fills.
- Bootstrap the results: Check if profitability is robust to resampling.
- Check regime stability: Verify performance is consistent across time.
- Compare to baseline: Is your strategy better than random betting at the same edge threshold?
```python
# Full pipeline
walk_forward_preds = walk_forward_backtest(data)
raw_trades = backtest_strategy(walk_forward_preds, edge_threshold=0.08)
realistic_trades = add_execution_costs(raw_trades)

print(f"Trades: {len(raw_trades)} raw -> {len(realistic_trades)} filled")
print(f"Naive PnL: ${raw_trades['pnl'].sum():.2f}")
print(f"Realistic PnL: ${realistic_trades['pnl'].sum():.2f}")
print(f"Shrinkage: {1 - realistic_trades['pnl'].sum() / raw_trades['pnl'].sum():.0%}")
```
If the shrinkage from naive to realistic is above 80%, your edge is probably not real. If it is 50-80%, there may be something there but execution is eating most of it. Below 50% shrinkage is a genuinely promising strategy.
The Honest Truth
Most sports betting backtests are worthless — not because people are dishonest, but because the biases are subtle and always point in the same direction: making your strategy look better than it is.
The backtest that survives all five bias checks above is rare. But when you find one, you have real evidence — not hope — that a strategy works.
Want to see a backtest that survived? Our results page shows real trades from a system built with these principles. The ZenHodl course teaches you to build and validate your own.