Every profitable-looking backtest has the same problem: it probably is not real. The gap between backtested returns and live returns in sports betting is enormous, and it almost always goes in one direction — your backtest is too optimistic.
This guide shows you how to build a backtest that does not lie to you, and how to spot the biases that inflate results.
The Minimum Viable Backtest
A sports betting backtest needs four components:
```python
import pandas as pd
import numpy as np


def backtest_strategy(
    predictions: pd.DataFrame,
    edge_threshold: float = 0.08,
    kelly_fraction: float = 0.25,
    bankroll: float = 1000.0,
) -> pd.DataFrame:
    """
    Backtest a betting strategy.

    predictions DataFrame must have columns:
    - model_prob: your model's fair probability
    - market_price: what the market is asking
    - outcome: 1 if the bet would have won, 0 otherwise
    - timestamp: when the signal occurred
    """
    trades = []
    current_bankroll = bankroll
    for _, row in predictions.iterrows():
        edge = row["model_prob"] - row["market_price"]
        if edge < edge_threshold:
            continue
        # Kelly criterion for a binary contract bought at market_price:
        # f* = (bp - q) / b with b = (1 - price) / price, which simplifies
        # to edge / (1 - price)
        kelly = edge / (1 - row["market_price"])
        bet_size = current_bankroll * kelly_fraction * max(kelly, 0)
        bet_size = min(bet_size, current_bankroll * 0.05)  # Max 5% per trade
        if bet_size < 1.0:
            continue
        # Resolve the trade: a winning contract pays (1 - price) / price per dollar
        if row["outcome"] == 1:
            pnl = bet_size * (1 - row["market_price"]) / row["market_price"]
        else:
            pnl = -bet_size
        current_bankroll += pnl
        trades.append({
            "timestamp": row["timestamp"],
            "model_prob": row["model_prob"],
            "market_price": row["market_price"],
            "edge": edge,
            "bet_size": bet_size,
            "pnl": pnl,
            "bankroll": current_bankroll,
        })
    return pd.DataFrame(trades)
```
This is the skeleton. By itself, it tells you nothing useful — because the inputs determine everything. The model probabilities, the market prices, and how you constructed the dataset are where the biases hide.
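How the predictions frame gets built matters as much as the backtest itself. Here is a minimal sketch, assuming a two-way market quoted in decimal odds (the column names and numbers are hypothetical): convert each side's odds to an implied probability, then strip the vig by normalizing so the two sides sum to 1 before comparing against your model.

```python
import pandas as pd


def implied_price(decimal_odds: pd.Series) -> pd.Series:
    """Convert decimal odds to an implied probability (the market 'price')."""
    return 1.0 / decimal_odds


def remove_vig(price_a: pd.Series, price_b: pd.Series) -> tuple:
    """Normalize a two-way market so the implied probabilities sum to 1."""
    overround = price_a + price_b
    return price_a / overround, price_b / overround


# Hypothetical raw feed: decimal odds for both sides of a two-way market
raw = pd.DataFrame({
    "timestamp": pd.to_datetime(["2025-01-04 18:00", "2025-01-04 19:30"]),
    "odds_home": [1.80, 2.40],
    "odds_away": [2.10, 1.65],
    "model_prob": [0.62, 0.45],  # from your model, not the market
    "outcome": [1, 0],
})

p_home, p_away = remove_vig(
    implied_price(raw["odds_home"]), implied_price(raw["odds_away"])
)
predictions = pd.DataFrame({
    "timestamp": raw["timestamp"],
    "model_prob": raw["model_prob"],
    "market_price": p_home,  # betting the home side here
    "outcome": raw["outcome"],
})
print(predictions)
```

If your venue quotes prices directly in cents (like a prediction market), the conversion step disappears, but the vig question does not: a price that already includes the spread is not a fair probability.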
Bias #1: Look-Ahead Bias
Look-ahead bias is using information from the future to make decisions in the past. It is the most common and most destructive bias in sports backtesting.
How it happens:
- Training your model on the full dataset, then backtesting on the same dataset
- Using closing line prices instead of the price available when your signal fired
- Computing Elo ratings with future games included
- Using injury reports that were released after your hypothetical trade time
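The feature-engineering version of this bias is easy to demonstrate. Below is a sketch with a hypothetical rolling win-rate feature: without `shift(1)`, the feature computed for each game includes that game's own result, which is exactly the future information your model must not see.

```python
import pandas as pd

games = pd.DataFrame({
    "date": pd.date_range("2025-01-01", periods=6, freq="D"),
    "team": ["A"] * 6,
    "won": [1, 0, 1, 1, 0, 1],
})

# LEAKY: the rolling mean for each game includes that game's own result
games["win_rate_leaky"] = games.groupby("team")["won"].transform(
    lambda s: s.rolling(3, min_periods=1).mean()
)

# SAFE: shift(1) first, so the feature only sees strictly earlier games
games["win_rate_safe"] = games.groupby("team")["won"].transform(
    lambda s: s.shift(1).rolling(3, min_periods=1).mean()
)

print(games[["date", "won", "win_rate_leaky", "win_rate_safe"]])
```

The tell: the leaky column "knows" the first game's result before it was played, while the safe column is NaN until there is actual history.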
How to prevent it:
```python
def walk_forward_backtest(data, train_window_days=365):
    """
    Walk-forward validation: train only on past data, test on the next day.
    Eliminates look-ahead bias by construction.

    Assumes train_model() and a `features` column list are defined elsewhere.
    """
    results = []
    dates = sorted(data["date"].unique())
    for test_date in dates:
        # Training data: only games BEFORE the test date
        cutoff = test_date - pd.Timedelta(days=1)
        train_start = cutoff - pd.Timedelta(days=train_window_days)
        train_data = data[(data["date"] >= train_start) & (data["date"] <= cutoff)]
        if len(train_data) < 100:
            continue
        # Train model on past data only
        model = train_model(train_data)
        # Predict on today's games (copy to avoid mutating a view)
        test_data = data[data["date"] == test_date].copy()
        test_data["model_prob"] = model.predict_proba(test_data[features])[:, 1]
        results.append(test_data)
    return pd.concat(results)
```
Walk-forward validation is non-negotiable. If your backtest does not strictly separate past and future, the results are meaningless.
Bias #2: Survivorship Bias
Survivorship bias means your dataset only includes games or markets that still exist — not the ones that were canceled, delisted, or resolved ambiguously.
How it happens in sports:
- Excluding postponed or canceled games (which you would have bet on had they not been canceled)
- Dropping games with missing data (the missingness may be correlated with unusual situations)
- Only including sports or leagues where your model performs well
How to prevent it:
```python
def check_survivorship_bias(raw_data, backtest_data):
    """Check how many games from the raw data made it into the backtest."""
    raw_count = len(raw_data)
    backtest_count = len(backtest_data)
    drop_rate = 1 - backtest_count / raw_count
    print(f"Raw games: {raw_count}")
    print(f"Backtest games: {backtest_count}")
    print(f"Drop rate: {drop_rate:.1%}")
    if drop_rate > 0.05:
        print("WARNING: >5% of games dropped. Investigate why.")
        # Check if dropped games are random or systematic
        dropped = raw_data[~raw_data.index.isin(backtest_data.index)]
        print("Dropped game status distribution:")
        print(dropped["status"].value_counts())
```
A drop rate above 5% is a red flag. Investigate every dropped game. If they are disproportionately unusual games (overtime, weather delays, controversial outcomes), your backtest is biased toward "normal" games where your model works best.
Bias #3: Execution Assumptions
This is the bias that kills the most strategies. Your backtest assumes you can execute at the market price you observe. In reality, you cannot.
The execution gap includes:
- Spread cost: The difference between bid and ask. You buy at the ask, sell at the bid. A 3-cent spread costs you 3 cents per round trip.
- Slippage: Your order moves the price. Large orders fill at worse prices than the quoted best ask.
- Latency: The price you see is seconds old. By the time your order arrives, the price may have changed.
- Fill rate: Not every order fills. If you only count filled orders, you are ignoring the opportunity cost of missed trades.
```python
def add_execution_costs(
    trades: pd.DataFrame,
    spread_cost: float = 0.03,
    fee_rate: float = 0.02,
    fill_probability: float = 0.70,
) -> pd.DataFrame:
    """Adjust backtest results for realistic execution costs."""
    # Simulate partial fills
    trades["filled"] = np.random.random(len(trades)) < fill_probability
    filled_trades = trades[trades["filled"]].copy()
    # Adjust entry price for spread: you buy at the ask, half a spread worse
    filled_trades["effective_price"] = (
        filled_trades["market_price"] + spread_cost / 2
    )
    # Recalculate PnL with the worse entry price and fees
    for i, row in filled_trades.iterrows():
        # Record how much edge survives the costs
        filled_trades.loc[i, "edge_after_costs"] = (
            row["model_prob"] - row["effective_price"] - fee_rate
        )
        if row["outcome"] == 1:
            gross = row["bet_size"] * (1 - row["effective_price"]) / row["effective_price"]
            filled_trades.loc[i, "pnl"] = gross * (1 - fee_rate)
        else:
            filled_trades.loc[i, "pnl"] = -row["bet_size"]
    return filled_trades
```
Our execution quality research found that 99% of theoretical edge vanished when real execution constraints were applied. If your backtest does not model execution costs, multiply your expected returns by 0.1 for a rough reality check.
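A quick sanity check falls out of these numbers. Using the same hypothetical cost parameters as above (a 3-cent spread, a 2% fee) and treating the fee as a flat per-contract cost, which is an approximation, the raw model edge must clear half the spread plus the fee before a trade has any positive expectancy:

```python
def breakeven_edge(spread_cost: float = 0.03, fee_rate: float = 0.02) -> float:
    """Minimum raw model edge before a trade has positive expectancy,
    assuming you cross half the spread and pay a flat fee per contract."""
    return spread_cost / 2 + fee_rate


# With a 3-cent spread and a 2% fee, any edge under 3.5 cents is noise
print(f"breakeven edge: {breakeven_edge():.3f}")
```

This is one reason an 8-cent edge threshold can be reasonable even when the model's edges look much larger on paper.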
Bias #4: Overfitting the Edge Threshold
You tried edge thresholds of 5, 6, 7, 8, 9, and 10 cents, and found that 8 cents works best. But by trying 6 values, you have implicitly optimized on your test set. The "best" threshold is partially luck.
How to prevent it:
```python
def robust_threshold_test(predictions, thresholds, n_bootstrap=500):
    """Bootstrap test to check whether the threshold selection is robust."""
    results = {}
    for threshold in thresholds:
        bootstrap_pnls = []
        for _ in range(n_bootstrap):
            # Resample with replacement
            sample = predictions.sample(frac=1.0, replace=True)
            trades = backtest_strategy(sample, edge_threshold=threshold)
            if len(trades) > 0:
                bootstrap_pnls.append(trades["pnl"].sum())
            else:
                bootstrap_pnls.append(0)
        results[threshold] = {
            "mean_pnl": np.mean(bootstrap_pnls),
            "std_pnl": np.std(bootstrap_pnls),
            "pct_positive": np.mean(np.array(bootstrap_pnls) > 0),
        }
        print(
            f"Threshold {threshold:.2f}: "
            f"mean={results[threshold]['mean_pnl']:.1f} "
            f"std={results[threshold]['std_pnl']:.1f} "
            f"P(profit)={results[threshold]['pct_positive']:.0%}"
        )
    return results
```
If the best threshold has a P(profit) below 85% in the bootstrap, it is not robust. A strategy that only works at exactly one threshold setting is likely overfit.
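The other defense against implicit optimization is chronological: choose the threshold on early data only and report performance on data that never touched the selection. Here is a self-contained sketch of that split using synthetic trades with a small built-in edge; every number and name here is illustrative, not a real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 2000
# Synthetic signals: a true ~2-cent edge plus model noise (illustrative only)
market_price = rng.uniform(0.3, 0.7, n)
true_prob = np.clip(market_price + 0.02, 0.01, 0.99)
model_prob = np.clip(true_prob + rng.normal(0, 0.05, n), 0.01, 0.99)
outcome = (rng.random(n) < true_prob).astype(int)

df = pd.DataFrame({
    "model_prob": model_prob,
    "market_price": market_price,
    "outcome": outcome,
})
df["edge"] = df["model_prob"] - df["market_price"]
# Flat $1 stake PnL per trade on a binary contract
df["pnl"] = np.where(
    df["outcome"] == 1, (1 - df["market_price"]) / df["market_price"], -1.0
)

half = n // 2
select, holdout = df.iloc[:half], df.iloc[half:]

# Pick the threshold on the (chronologically earlier) selection half only
thresholds = [0.05, 0.06, 0.07, 0.08, 0.09, 0.10]
best = max(thresholds, key=lambda t: select.loc[select["edge"] >= t, "pnl"].sum())

# Report it on the untouched holdout half
oos_pnl = holdout.loc[holdout["edge"] >= best, "pnl"].sum()
print(f"chosen threshold: {best}, out-of-sample PnL: {oos_pnl:.1f}")
```

If the chosen threshold's holdout PnL collapses relative to its selection-half PnL, the "best" threshold was fit to noise.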
Bias #5: Ignoring Regime Changes
A model trained on 2022-2024 NBA data may not work on 2025-2026 data. Markets evolve. Market makers get smarter. Liquidity changes. Your edge decays.
```python
def check_regime_stability(trades: pd.DataFrame):
    """Check whether strategy performance is stable across time."""
    trades = trades.copy()
    trades["quarter"] = pd.to_datetime(trades["timestamp"]).dt.to_period("Q")
    quarterly = trades.groupby("quarter").agg(
        n_trades=("pnl", "count"),
        total_pnl=("pnl", "sum"),
        win_rate=("pnl", lambda x: (x > 0).mean()),
        avg_edge=("edge", "mean"),
    )
    print(quarterly.to_string())
    # Flag quarters that diverge sharply from the average
    # (only meaningful when the mean quarterly PnL is positive)
    mean_pnl = quarterly["total_pnl"].mean()
    for q, row in quarterly.iterrows():
        if row["total_pnl"] < mean_pnl * 0.3:
            print(f"WARNING: {q} significantly underperformed — possible regime change")
```
If more than one quarter is deeply negative, the strategy may not be robust. Look at what changed: new market makers, different sports seasons, liquidity shifts.
Putting It All Together
An honest backtest pipeline looks like this:
- Walk-forward split: Train only on past data. No exceptions.
- Generate predictions: Model outputs probability for each game.
- Apply edge filter: Only trade when model disagrees with market by enough.
- Add execution costs: Spread, fees, slippage, partial fills.
- Bootstrap the results: Check if profitability is robust to resampling.
- Check regime stability: Verify performance is consistent across time.
- Compare to baseline: Is your strategy better than random betting at the same edge threshold?
```python
# Full pipeline
walk_forward_preds = walk_forward_backtest(data)
raw_trades = backtest_strategy(walk_forward_preds, edge_threshold=0.08)
realistic_trades = add_execution_costs(raw_trades)

print(f"Trades: {len(raw_trades)} raw -> {len(realistic_trades)} filled")
print(f"Naive PnL: ${raw_trades['pnl'].sum():.2f}")
print(f"Realistic PnL: ${realistic_trades['pnl'].sum():.2f}")
print(f"Shrinkage: {1 - realistic_trades['pnl'].sum() / raw_trades['pnl'].sum():.0%}")
```
If the shrinkage from naive to realistic is above 80%, your edge is probably not real. If it is 50-80%, there may be something there but execution is eating most of it. Below 50% shrinkage is a genuinely promising strategy.
The Honest Truth
Most sports betting backtests are worthless — not because people are dishonest, but because the biases are subtle and always point in the same direction: making your strategy look better than it is.
The backtest that survives all five bias checks above is rare. But when you find one, you have real evidence — not hope — that a strategy works.
Want to see a backtest that survived? Our results page shows real trades from a system built with these principles. The ZenHodl course teaches you to build and validate your own.