
Build an Elo Rating System from Scratch in Python

2026-03-26 · elo · python · tutorial · beginner · model

Elo ratings are the most underrated feature in sports prediction. One number per team, updated after every game, capturing relative strength without box scores or advanced stats. This tutorial builds a complete Elo system from scratch in Python.

By the end, you will have working code that tracks team ratings across full seasons and produces pre-game win probabilities that rival sportsbook opening lines.

The Math

Elo is built on one equation. Given two teams with ratings $R_A$ and $R_B$, the expected win probability for team A is:

$$E_A = \frac{1}{1 + 10^{(R_B - R_A) / 400}}$$

After the game, ratings update based on the surprise of the result:

$$R_A' = R_A + K \times (S_A - E_A)$$

Where $S_A$ is 1 for a win, 0 for a loss (0.5 for a draw, in sports that have them), and $K$ controls how much a single game moves the rating.
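A quick worked example, with ratings and K chosen purely for illustration: team A rated 1600 faces team B rated 1500, with K = 20.

```python
# Team A (1600) vs. team B (1500), K = 20 -- illustrative numbers.
exp_a = 1.0 / (1.0 + 10 ** ((1500 - 1600) / 400))  # ~0.640

# If A wins (S_A = 1), the favorite gains only a little:
a_after_win = 1600 + 20 * (1.0 - exp_a)    # ~1607.2

# If A loses (S_A = 0), the upset costs more:
a_after_loss = 1600 + 20 * (0.0 - exp_a)   # ~1587.2
```

The size of the swing scales with how surprising the result was: the favorite risks more than it stands to gain.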

Step 1: Core Functions

def expected_win_prob(rating_a: float, rating_b: float) -> float:
    """Probability that team A beats team B based on Elo ratings."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_ratings(
    rating_a: float,
    rating_b: float,
    a_won: bool,
    k: float = 20.0,
) -> tuple[float, float]:
    """Return updated (rating_a, rating_b) after a game."""
    exp_a = expected_win_prob(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1 - score_a) - (1 - exp_a))
    return new_a, new_b

The system is zero-sum: every rating point the winner gains comes directly out of the loser's rating. Since every team starts at 1500, the league-wide average stays pinned at 1500 no matter how many games you process.
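A one-line sanity check of the zero-sum property (the update function is repeated here, in its algebraically equivalent one-delta form, so the snippet runs on its own):

```python
def update_ratings(rating_a, rating_b, a_won, k=20.0):
    exp_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - exp_a)
    return rating_a + delta, rating_b - delta

new_a, new_b = update_ratings(1550.0, 1450.0, a_won=True)
# Winner's gain exactly equals loser's loss, so the sum is preserved.
assert abs((new_a + new_b) - (1550.0 + 1450.0)) < 1e-9
```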

Step 2: Initialize Ratings

Every team starts at 1500. Use a dictionary for fast lookup:

from collections import defaultdict

ratings = defaultdict(lambda: 1500.0)

With defaultdict, any new team automatically gets 1500. This is useful for college sports where new teams appear mid-dataset.
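For example (team names invented):

```python
from collections import defaultdict

ratings = defaultdict(lambda: 1500.0)
ratings["Duke"] += 12.5              # an existing team updates normally
first_look = ratings["App State"]    # first lookup silently creates the entry

print(first_look)  # 1500.0
```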

Step 3: Process a Season of Games

Assume you have a list of game results as tuples: (home_team, away_team, home_won).

def process_season(games, ratings, k=20.0):
    """Update ratings for a full season of games. Modifies ratings in place."""
    for home, away, home_won in games:
        new_home, new_away = update_ratings(ratings[home], ratings[away], home_won, k)
        ratings[home] = new_home
        ratings[away] = new_away
    return ratings

Order matters. Games must be sorted chronologically because each game depends on the current ratings, which depend on all previous games.
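If your raw rows carry a date (a hypothetical shape shown below), one sort puts them in order before they reach process_season:

```python
from datetime import date

# Hypothetical raw rows: (game_date, home, away, home_won), arriving unsorted.
raw_games = [
    (date(2024, 1, 15), "BOS", "LAL", True),
    (date(2024, 1, 2),  "LAL", "GSW", False),
    (date(2024, 1, 9),  "GSW", "BOS", True),
]

# Sort by date, then drop it to match the (home, away, home_won) shape.
games = [(home, away, won) for _, home, away, won in sorted(raw_games)]
```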

Step 4: Add Home Court Advantage

Home teams win roughly 58% of NBA games. To reflect this, add a bonus to the home team's rating when computing expected score — but do not permanently alter their stored rating:

def update_ratings_with_hca(
    rating_home: float,
    rating_away: float,
    home_won: bool,
    k: float = 20.0,
    home_advantage: float = 70.0,
) -> tuple[float, float]:
    """Elo update with home court advantage."""
    exp_home = expected_win_prob(rating_home + home_advantage, rating_away)
    score_home = 1.0 if home_won else 0.0
    new_home = rating_home + k * (score_home - exp_home)
    new_away = rating_away + k * ((1 - score_home) - (1 - exp_home))
    return new_home, new_away

A home_advantage of 70 Elo points gives equally rated teams roughly a 60% home win probability (about 56 points would match the 58% baseline exactly). Tune this per sport: NBA ~70, college basketball ~100, NFL ~50.
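You can read off the implied home edge for any candidate value straight from the expectation formula (repeated here so the snippet stands alone):

```python
def expected_win_prob(rating_a, rating_b):
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

# Implied home win probability between equally rated (1500) teams:
# 50 -> ~0.571, 70 -> ~0.599, 100 -> ~0.640
for hca in (50, 70, 100):
    print(hca, round(expected_win_prob(1500 + hca, 1500), 3))
```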

Step 5: Season Resets

Rosters change between seasons. A team that went 70-12 last year may have lost its best player. Without a reset, stale ratings poison future predictions:

def apply_season_reset(ratings, regression_factor=0.75):
    """Regress all ratings toward 1500 between seasons."""
    for team in ratings:
        ratings[team] = 1500 + regression_factor * (ratings[team] - 1500)
    return ratings

A regression factor of 0.75 means a 1700-rated team drops to 1650 after the offseason. This balances two competing truths: teams that were good tend to stay good, but rarely as good as their peak rating suggests once the roster turns over.
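A quick check of the reset behavior (function repeated so the snippet runs standalone; team labels invented):

```python
def apply_season_reset(ratings, regression_factor=0.75):
    for team in ratings:
        ratings[team] = 1500 + regression_factor * (ratings[team] - 1500)
    return ratings

ratings = {"Contender": 1700.0, "Rebuilder": 1350.0, "Average": 1500.0}
apply_season_reset(ratings)
# 1700 -> 1650, 1350 -> 1387.5, and a 1500 team is unmoved.
```

Note that the regression is symmetric: below-average teams are pulled up toward 1500 by the same proportion that above-average teams are pulled down.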

Step 6: Putting It Together

Here is the full pipeline processing multiple seasons:

def build_elo_ratings(seasons_data, k=20.0, home_adv=70.0, regression=0.75):
    """
    Build Elo ratings across multiple seasons.

    seasons_data: dict mapping season_id -> list of (home, away, home_won)
    Returns final ratings dict and a log of every game's pre-game prediction.
    """
    ratings = defaultdict(lambda: 1500.0)
    prediction_log = []

    for season_id in sorted(seasons_data.keys()):
        games = seasons_data[season_id]

        for home, away, home_won in games:
            # Record pre-game prediction
            pre_game_prob = expected_win_prob(
                ratings[home] + home_adv, ratings[away]
            )
            prediction_log.append({
                "season": season_id,
                "home": home,
                "away": away,
                "home_win_prob": pre_game_prob,
                "home_won": home_won,
            })

            # Update ratings
            new_home, new_away = update_ratings_with_hca(
                ratings[home], ratings[away], home_won, k, home_adv
            )
            ratings[home] = new_home
            ratings[away] = new_away

        # Reset between seasons
        apply_season_reset(ratings, regression)

    return dict(ratings), prediction_log

The prediction_log is essential for evaluation. It records what the model predicted before seeing the result, so you can measure calibration and accuracy on out-of-sample data.
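A toy end-to-end run, with team names and results invented. The pipeline is restated compactly here (same math as the functions above) so the snippet executes standalone:

```python
from collections import defaultdict

def expected_win_prob(rating_a, rating_b):
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def build_elo_ratings(seasons_data, k=20.0, home_adv=70.0, regression=0.75):
    ratings = defaultdict(lambda: 1500.0)
    prediction_log = []
    for season_id in sorted(seasons_data):
        for home, away, home_won in seasons_data[season_id]:
            # Record the pre-game prediction before touching the ratings.
            p = expected_win_prob(ratings[home] + home_adv, ratings[away])
            prediction_log.append({"season": season_id, "home": home,
                                   "away": away, "home_win_prob": p,
                                   "home_won": home_won})
            delta = k * ((1.0 if home_won else 0.0) - p)
            ratings[home] += delta
            ratings[away] -= delta
        for team in ratings:  # season reset
            ratings[team] = 1500 + regression * (ratings[team] - 1500)
    return dict(ratings), prediction_log

# Two invented mini-seasons of (home, away, home_won) tuples.
seasons_data = {
    "2023-24": [("BOS", "LAL", True), ("LAL", "GSW", False)],
    "2024-25": [("GSW", "BOS", False)],
}
final_ratings, log = build_elo_ratings(seasons_data)
```

The first logged probability is ~0.599: both teams still sit at 1500, so only the home bonus separates them.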

Step 7: Evaluate the System

Use the prediction log to compute the Brier score (mean squared error of the probabilities) and straight-up accuracy:

import numpy as np

def evaluate_predictions(prediction_log):
    probs = np.array([g["home_win_prob"] for g in prediction_log])
    outcomes = np.array([float(g["home_won"]) for g in prediction_log])

    brier = np.mean((probs - outcomes) ** 2)
    accuracy = np.mean((probs > 0.5) == outcomes)

    print(f"Games evaluated: {len(probs)}")
    print(f"Brier score:     {brier:.4f}")
    print(f"Accuracy:        {accuracy:.3%}")
    return brier, accuracy

On NBA data, a well-tuned Elo system should produce a Brier score around 0.22-0.24 and accuracy around 65-67%. That is competitive with much more complex models for pre-game predictions.
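The Brier score compresses calibration and sharpness into one number. To see calibration directly, bucket the predictions and compare predicted versus realized win rates. A plot-free sketch (the fixed-width binning here is one simple choice, not the only one):

```python
import numpy as np

def calibration_table(prediction_log, n_bins=10):
    """Group predictions into probability buckets; compare predicted vs. actual."""
    probs = np.array([g["home_win_prob"] for g in prediction_log])
    outcomes = np.array([float(g["home_won"]) for g in prediction_log])
    bins = np.clip((probs * n_bins).astype(int), 0, n_bins - 1)
    table = []
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            table.append((int(mask.sum()), probs[mask].mean(), outcomes[mask].mean()))
    return table  # (games, mean predicted prob, actual win rate) per bucket
```

A well-calibrated model shows the second and third columns tracking each other; large gaps in the 0.6 to 0.8 buckets are the first thing to investigate.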

Step 8: Tune K-Factor

K is the single most important hyperparameter. Too high and ratings oscillate wildly. Too low and they take forever to adjust to real changes in team quality:

def tune_k_factor(seasons_data, k_values, home_adv=70.0, regression=0.75):
    """Grid search over K values, return Brier scores."""
    results = {}
    for k in k_values:
        _, log = build_elo_ratings(seasons_data, k, home_adv, regression)
        probs = np.array([g["home_win_prob"] for g in log])
        outcomes = np.array([float(g["home_won"]) for g in log])
        results[k] = np.mean((probs - outcomes) ** 2)
        print(f"K={k:5.1f}  Brier={results[k]:.4f}")
    return results

# Example: test K from 8 to 40
k_results = tune_k_factor(seasons_data, np.arange(8, 42, 2))
best_k = min(k_results, key=k_results.get)
print(f"Best K: {best_k}")

Typical optimal values: NBA K=20, NCAAMB K=28-32, NFL K=24-28. Sports with fewer games per season need higher K values so ratings can adjust in time.

Using Elo for Live Betting

Pre-game Elo gives you a baseline probability. For live in-game prediction, Elo difference becomes a feature in a larger model alongside score differential, time remaining, and period:

features = {
    "elo_diff": ratings[home] - ratings[away],  # Pre-game strength gap
    "score_diff": home_score - away_score,        # Current game state
    "time_remaining": seconds_left,               # How much can change
    "period": current_period,                     # Game phase
}

In our backtests, adding elo_diff as a feature to the in-game WP model improved trading profit by 2-3 cents per trade. Elo captures information — strength of schedule, recent form, league standing — that the score alone cannot.
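As a sketch of what "a feature in a larger model" means, here is a minimal logistic regression fit in plain NumPy on synthetic data. Everything here is an illustrative assumption: the data-generating process, the coefficients, and the feature scaling; in practice you would train on your real game log.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4000

# Synthetic game snapshots: standardized features (invented scaling).
X = np.column_stack([
    np.ones(n),                 # intercept
    rng.normal(0.0, 1.0, n),    # elo_diff / 100 (pre-game strength gap)
    rng.normal(0.0, 1.0, n),    # score_diff / 8 (current margin)
])
true_w = np.array([0.0, 0.5, 1.0])  # invented ground-truth weights
y = (rng.random(n) < 1 / (1 + np.exp(-X @ true_w))).astype(float)

# Plain gradient descent on the logistic loss.
w = np.zeros(3)
for _ in range(3000):
    p = 1 / (1 + np.exp(-X @ w))
    w -= 0.5 * (X.T @ (p - y)) / n
# w[1], the weight on elo_diff, recovers a value near 0.5.
```

The point of the sketch: the fitted weight on elo_diff is nonzero even in the presence of score_diff, which is exactly the "extra information beyond the scoreboard" claim above.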

Common Mistakes

Not sorting games chronologically. If games are out of order, ratings update on future information. This leaks data and inflates your evaluation metrics.

Forgetting season resets. Without regression, a team's rating from three seasons ago still affects predictions today. Rosters turn over. Regress toward the mean.

Using the same K for all sports. An NFL season is 17 games. An NBA season is 82. The same K value cannot work for both. Always tune per sport.

Evaluating on training data. The prediction log naturally avoids this because each prediction is made before the game. But if you tune K on the same dataset you evaluate on, you are overfitting the hyperparameter.
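A minimal guard is a chronological season split (season IDs below are invented): tune K only on the earlier seasons, then report the Brier score on the held-out ones. One subtlety: because Elo carries ratings forward, you may still process all seasons in order during evaluation, but score only the held-out games.

```python
def split_seasons(seasons_data, n_test=2):
    """Hold out the most recent n_test seasons for final evaluation."""
    ordered = sorted(seasons_data)
    return (
        {s: seasons_data[s] for s in ordered[:-n_test]},
        {s: seasons_data[s] for s in ordered[-n_test:]},
    )

demo = {"2021-22": [], "2022-23": [], "2023-24": [], "2024-25": []}
train, test = split_seasons(demo)
# train covers 2021-22 and 2022-23; test covers 2023-24 and 2024-25
```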


This is the foundation of Module 2 in the ZenHodl course. The full module adds margin-of-victory adjustments, conference-aware resets for college sports, and integration with the ESPN data pipeline from Module 1. Module 1 is free — start there.
