Your model predicts 70% win probability. The team wins 58% of the time when you say 70%. That gap is a calibration problem, and it will silently destroy your profitability.
Accuracy measures how often you are right. Calibration measures whether your probabilities mean what they say. For sports betting and prediction markets, calibration is the one that determines whether you make money.
Why Calibration Beats Accuracy
Imagine two models evaluating a game where the market prices the win contract at 65 cents:
- Model A says 78%. Well-calibrated — when it says 78%, the team wins ~78% of the time.
- Model B says 78%. Poorly calibrated — when it says 78%, the team actually wins ~66% of the time.
Both models "agree" the contract is underpriced. Both trigger a buy signal. But Model A has a real 13-cent edge. Model B has a 1-cent edge that doesn't even cover fees.
The danger is that Model B looks great in standard ML evaluation. It might even have a better AUC than Model A. But its probabilities are inflated, so every edge it reports is a mirage.
Measuring Calibration: Brier Score
The Brier score measures how close your predicted probabilities are to actual outcomes:
```python
import numpy as np

def brier_score(y_true, y_prob):
    """Lower is better. 0 = perfect, 0.25 = random coin flip."""
    return np.mean((y_prob - y_true) ** 2)

# Example: model predictions vs actual outcomes
predictions = np.array([0.78, 0.55, 0.90, 0.30, 0.62])
outcomes = np.array([1, 0, 1, 0, 1])

print(f"Brier score: {brier_score(outcomes, predictions):.4f}")
```
The Brier score captures both discrimination (can you separate winners from losers?) and calibration (do your probabilities match reality?). For betting, a win-probability model with a Brier score around 0.20 is solid. Below 0.18 is excellent.
Measuring Calibration: ECE
Expected Calibration Error (ECE) directly measures the calibration gap. It bins your predictions and compares predicted probability to observed frequency in each bin:
```python
def expected_calibration_error(y_true, y_prob, n_bins=10):
    """ECE: average gap between predicted probability and actual frequency."""
    bin_edges = np.linspace(0, 1, n_bins + 1)
    ece = 0.0
    for i in range(n_bins):
        # Close the last bin on the right so predictions of exactly 1.0 are counted
        upper = (y_prob <= bin_edges[i + 1]) if i == n_bins - 1 else (y_prob < bin_edges[i + 1])
        mask = (y_prob >= bin_edges[i]) & upper
        if mask.sum() == 0:
            continue
        bin_confidence = y_prob[mask].mean()
        bin_accuracy = y_true[mask].mean()
        ece += mask.sum() * abs(bin_accuracy - bin_confidence)
    return ece / len(y_true)

ece = expected_calibration_error(outcomes, predictions)
print(f"ECE: {ece:.4f}")
```
An ECE below 0.03 means your probabilities are within 3 percentage points of reality on average. That's good enough for profitable trading. Above 0.05, you are systematically over- or under-confident, and your edge calculations will be wrong.
Visualizing Calibration
A reliability diagram plots predicted probability vs observed frequency. A perfectly calibrated model falls on the diagonal:
```python
import matplotlib.pyplot as plt

def plot_calibration(y_true, y_prob, n_bins=10):
    bin_edges = np.linspace(0, 1, n_bins + 1)
    bin_centers = []
    bin_freqs = []
    for i in range(n_bins):
        mask = (y_prob >= bin_edges[i]) & (y_prob < bin_edges[i + 1])
        if mask.sum() < 5:  # too few samples to estimate a frequency
            continue
        bin_centers.append(y_prob[mask].mean())
        bin_freqs.append(y_true[mask].mean())
    plt.figure(figsize=(6, 6))
    plt.plot([0, 1], [0, 1], "k--", label="Perfect calibration")
    plt.plot(bin_centers, bin_freqs, "o-", label="Model")
    plt.xlabel("Predicted probability")
    plt.ylabel("Observed frequency")
    plt.title("Calibration Plot")
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()
```
If your curve bows above the diagonal, you are under-confident — your 60% predictions actually win 70% of the time. If it bows below, you are over-confident. Both are fixable.
Fixing Calibration: Isotonic Regression
Isotonic regression is a non-parametric method that learns a monotonic mapping from raw model output to calibrated probabilities. It's the go-to method for post-hoc calibration:
```python
from sklearn.isotonic import IsotonicRegression

# Split data: train the calibrator on a validation set, NOT the training set
raw_probs_val = model.predict_proba(X_val)[:, 1]
y_val = y_validation

calibrator = IsotonicRegression(out_of_bounds="clip")
calibrator.fit(raw_probs_val, y_val)

# Apply to new predictions
raw_probs_test = model.predict_proba(X_test)[:, 1]
calibrated_probs = calibrator.predict(raw_probs_test)
```
Critical detail: fit the calibrator on a held-out validation set, never on the training data. If you calibrate on training data, you are overfitting the calibration curve and the test-time probabilities will still be wrong.
Platt Scaling: The Parametric Alternative
For smaller datasets, Platt scaling fits a logistic curve instead of a free-form mapping:
```python
from sklearn.linear_model import LogisticRegression

# Large C effectively disables sklearn's default L2 regularization,
# which would otherwise shrink the fitted sigmoid
platt = LogisticRegression(C=1e6)
platt.fit(raw_probs_val.reshape(-1, 1), y_val)

calibrated_probs = platt.predict_proba(raw_probs_test.reshape(-1, 1))[:, 1]
```
Platt scaling assumes the miscalibration follows a sigmoid shape. This works well for SVMs and neural networks but can be too rigid for tree-based models. For sports betting, isotonic regression is usually the better choice because the miscalibration pattern is often non-sigmoid.
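The difference is easy to see on synthetic data with a deliberately wiggly, non-sigmoid distortion. Everything below is illustrative, not a real model: we generate true win probabilities, simulate outcomes from them, warp the probabilities, and then let each method try to undo the warp.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
true_p = rng.uniform(0.05, 0.95, size=50_000)             # true win probabilities
y = (rng.uniform(size=true_p.size) < true_p).astype(int)  # simulated outcomes
# Warp the probabilities with a monotone but non-sigmoid distortion
raw = np.clip(true_p + 0.15 * np.sin(6 * true_p), 0.001, 0.999)

iso = IsotonicRegression(out_of_bounds="clip").fit(raw, y)
platt = LogisticRegression(C=1e6, max_iter=1000).fit(raw.reshape(-1, 1), y)

iso_mae = np.mean(np.abs(iso.predict(raw) - true_p))
platt_mae = np.mean(np.abs(platt.predict_proba(raw.reshape(-1, 1))[:, 1] - true_p))
print(f"Isotonic MAE vs truth: {iso_mae:.3f}")
print(f"Platt MAE vs truth:    {platt_mae:.3f}")
```

The sigmoid cannot bend to follow the sine-shaped distortion, so its recovered probabilities stay further from the truth than isotonic regression's free-form step function.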
Calibration in Practice: Our Pipeline
At ZenHodl, calibration is a first-class step in model development:
- Train the WP model on training data (seasons 2020-2024)
- Generate raw probabilities on a validation set (early 2025)
- Fit an isotonic calibrator on the validation set
- Apply the calibrator to live predictions
- Monitor ECE weekly — if it drifts above 0.04, retrain
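Sketched as code, the first four steps look roughly like this. The names are illustrative, not our actual internals:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression

def build_calibrated_predictor(model, X_val, y_val):
    """Wrap a fitted model with an isotonic calibrator fit on held-out data."""
    raw_val = model.predict_proba(X_val)[:, 1]
    calibrator = IsotonicRegression(out_of_bounds="clip").fit(raw_val, y_val)

    def predict(X):
        # Raw model output -> calibrated probability, ready for edge calculations
        return calibrator.predict(model.predict_proba(X)[:, 1])

    return predict

# Toy usage with a stand-in model (our real WP model is not shown here)
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 3))
y = (rng.uniform(size=600) < 0.5).astype(int)
wp_model = LogisticRegression().fit(X[:300], y[:300])
predict = build_calibrated_predictor(wp_model, X[300:450], y[300:450])
calibrated = predict(X[450:])
```

The weekly monitoring step then just computes ECE on recent live predictions and triggers a calibrator refit when it crosses the threshold.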
On average, the calibrator adds ~0.5 cents of edge per trade. That sounds small, but across thousands of trades it compounds into meaningful profit.
Common Mistakes
Calibrating on training data. This is the most common error. The model is already fit to the training data, so the calibration curve learns nothing useful. Always use a separate validation set.
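A quick, hypothetical demonstration of why this fails: an unconstrained decision tree memorizes its training set, so its training-set probabilities are all exactly 0 or 1 and the calibrator has nothing to correct.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(4000, 5))
p_true = 1 / (1 + np.exp(-X[:, 0]))               # outcome driven by one feature
y = (rng.uniform(size=4000) < p_true).astype(int)
X_tr, X_va = X[:2000], X[2000:]
y_tr, y_va = y[:2000], y[2000:]

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
p_tr = tree.predict_proba(X_tr)[:, 1]             # memorized: every value is 0.0 or 1.0
p_va = tree.predict_proba(X_va)[:, 1]

# A calibrator fit on training data sees a "perfect" model...
bad = IsotonicRegression(out_of_bounds="clip").fit(p_tr, y_tr)
# ...so it learns the identity map and changes nothing at test time
print(np.allclose(bad.predict(p_va), p_va))  # True
```

The held-out probabilities come out of the "calibrator" untouched, still as overconfident as the raw tree.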
Ignoring domain shift. A calibrator trained on 2023 NBA data may not transfer to 2026 March Madness. Different sports, different seasons, and different market conditions all require recalibration.
Over-binning ECE. With too many bins, each bin has too few samples and the ECE estimate is noisy. Ten bins is a good default for datasets under 50,000 predictions.
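To see the effect, feed a perfectly calibrated set of predictions through ECE at two bin counts. The compact helper below is illustrative; with only 300 predictions, 50 bins leave roughly 6 samples each:

```python
import numpy as np

def ece(y, p, n_bins):
    edges = np.linspace(0, 1, n_bins + 1)
    total = 0.0
    for i in range(n_bins):
        hi = (p <= edges[i + 1]) if i == n_bins - 1 else (p < edges[i + 1])
        m = (p >= edges[i]) & hi
        if m.any():
            total += m.sum() * abs(y[m].mean() - p[m].mean())
    return total / len(y)

rng = np.random.default_rng(7)
p = rng.uniform(0.2, 0.8, size=300)            # a perfectly calibrated model:
y = (rng.uniform(size=300) < p).astype(int)    # outcomes drawn from its own probabilities

print(f"ECE, 10 bins: {ece(y, p, 10):.3f}")
print(f"ECE, 50 bins: {ece(y, p, 50):.3f}")    # larger, purely from binning noise
```

The model here is calibrated by construction, so any measured ECE is estimation noise, and it grows as the bins get thinner.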
Confusing calibration with discrimination. A calibrated model is not necessarily a good model. A model that predicts 50% for every game is perfectly calibrated — and completely useless. You need both discrimination (separate winners from losers) and calibration (probabilities match reality).
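The base-rate model is worth checking explicitly. Assuming a ~50% base rate, the constant 50% predictor has essentially zero calibration gap yet an AUC of exactly 0.5:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=10_000)   # outcomes with a ~50% base rate
p = np.full(y.shape, 0.5)             # predict 50% for every game

print(f"Calibration gap: {abs(y.mean() - p.mean()):.4f}")  # near zero: well calibrated
print(f"AUC: {roc_auc_score(y, p):.2f}")                   # 0.50: no discrimination at all
```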
The Bottom Line
In sports betting, your probabilities are your prices. If you say 78% and the market says 65%, you are buying at 65 cents something you value at 78 cents. If your 78% is actually worth 66%, you just overpaid.
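In expected-value terms, for a binary contract that pays $1, the arithmetic from the opening example works out like this:

```python
def expected_profit(true_prob, price):
    """EV of buying one $1 binary contract at `price`; simplifies to true_prob - price."""
    return true_prob * (1 - price) - (1 - true_prob) * price

print(f"{expected_profit(0.78, 0.65):+.2f}")  # +0.13: the edge you think you have
print(f"{expected_profit(0.66, 0.65):+.2f}")  # +0.01: the edge you actually have
```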
Calibration is what makes your probabilities trustworthy. Without it, every edge you compute is built on a lie.
Part of the ZenHodl blog. We write about sports analytics, prediction markets, and building trading bots with Python.