Building an Options Backtesting Platform from Scratch

A deep dive into the architecture, pricing engine, and strategy framework behind a custom-built options backtesting platform — from Black-Scholes pricing and SABR-inspired volatility surfaces to multi-variable signal aggregation and Kelly Criterion position sizing.

34 min read
By Altus Labs

This article walks through the architecture and key design decisions behind the Options Backtester — a platform we built to test derivatives strategies against historical data with realistic pricing, Greeks, and risk management. The goal was to solve a specific problem: existing tools either cost too much or only let you backtest predefined structures. We needed something that decoupled signal generation from instrument selection entirely.

Key ideas in brief

  • Existing options backtesting tools force you into predefined structures (buy a condor, sell a butterfly). This platform separates when to trade from what to trade, letting you pair any technical strategy with any options instrument.
  • The pricing engine uses Black-Scholes with a SABR-inspired volatility surface — not flat vol — so backtest results reflect how options are actually priced across strikes and expiries.
  • Position sizing uses a constrained Kelly Criterion with half-Kelly support and minimum quantity preservation to handle the practical realities of contract-level sizing.
  • The multi-variable backtesting engine enables cross-asset signal generation: generate entry signals from VIX or sector ETFs, execute options trades on a completely different underlying.

Why Build This

The options backtesting space has a specific gap. Tools like OptionStack, ORATS, and TastyTrade's backtester let you test predefined options structures — iron condors at 16 delta, 45 DTE strangles, mechanical credit spreads. They're structure-first: you pick the spread, set the deltas and expiry, and the tool tells you how it performed historically.

That's useful if you're running a premium-selling operation. But it doesn't answer the question we were actually asking: given a technical signal we've designed — a momentum inflection, a volatility squeeze, a multi-timeframe trend confirmation — what is the optimal options structure to express that view?

The same bullish signal might warrant a 40-delta call spread in low IV, a bull put spread in high IV, or a deep ITM call when you want pure delta exposure. The signal is one decision. The instrument is a separate decision. Every existing tool we evaluated conflated the two.

So we built a platform where strategies and instruments are independently authored Python modules. A strategy generates entry signals. An instrument constructs the option legs. The backtesting engine pairs them, prices everything through a proper volatility surface, sizes via Kelly, and tracks P&L with realistic commissions and slippage. Change the strategy, keep the instrument. Change the instrument, keep the strategy. Test every permutation.


System Architecture

High-Level Component Architecture

Backtest Execution Flow

This is the path a single backtest takes from user input to results. The critical design decision is the separation between signal generation (step 3) and position construction (step 5) — the strategy runs first over the full dataset, producing a list of entry dates, and only then does the engine construct options positions at those dates.


The Pricing Engine

The pricing layer is where domain knowledge matters most. Getting this wrong means your backtest prices contracts that either can't exist on an exchange or are systematically mispriced relative to how the market actually works.

Black-Scholes with Greeks

The platform uses Black-Scholes European pricing via py_vollib, calculating the full Greek suite — delta, gamma, theta, vega, rho — for each contract leg. Greeks are then aggregated across all legs of a multi-leg position to give portfolio-level risk exposure.

This aggregation is important for spreads. A bull call spread's net delta is the difference between the long and short leg deltas — substantially lower than a naked call. The backtest needs to reflect this because it affects how the position responds to underlying moves, and therefore when stop losses and take profits trigger.
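The effect is easy to verify numerically. The platform prices through py_vollib; the sketch below computes Black-Scholes delta with only the standard library (the `norm_cdf`/`bs_delta` helpers are illustrative, not the platform's API) and shows a bull call spread's net delta falling well below the naked call's:

```python
from math import log, sqrt, erf

def norm_cdf(x: float) -> float:
    """Standard normal CDF via the error function (no scipy needed)."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def bs_delta(option_type: str, S: float, K: float, T: float,
             r: float, sigma: float) -> float:
    """Black-Scholes delta for a European option ('c' or 'p')."""
    d1 = (log(S / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * sqrt(T))
    return norm_cdf(d1) if option_type == 'c' else norm_cdf(d1) - 1.0

# Bull call spread: long the 100 call, short the 105 call
S, T, r, sigma = 100.0, 60 / 365, 0.04, 0.25
long_leg  = bs_delta('c', S, 100.0, T, r, sigma)   # quantity +1
short_leg = bs_delta('c', S, 105.0, T, r, sigma)   # quantity -1
net_delta = long_leg - short_leg                   # well below the naked call's delta
```

With these inputs the naked call carries roughly 0.55 delta while the spread nets out to a fraction of that, which is exactly why stop and take-profit triggers fire differently for spreads.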

def price_option_strategy(self, options: List[OptionContract],
                          current_date: str, spot_price: float) -> Dict:
    """Price a multi-leg options strategy with aggregated Greeks"""
    total_price = 0.0
    total_greeks = {'delta': 0, 'gamma': 0, 'theta': 0, 'vega': 0, 'rho': 0}

    for option in options:
        # Price each leg through Black-Scholes; base_vol and risk_free_rate
        # come from engine state (their definitions are elided in this excerpt)
        T = self.vol_engine.time_to_expiry(current_date, option.expiry_date)
        sigma = self.vol_engine.adjust_volatility_for_skew(
            base_vol, option.strike, spot_price, option.option_type
        )
        )
        price = self.vol_engine.black_scholes_price(
            spot_price, option.strike, T, risk_free_rate, sigma, option.option_type
        )
        greeks = self.vol_engine.calculate_greeks(
            spot_price, option.strike, T, risk_free_rate, sigma, option.option_type
        )

        # Aggregate: quantity handles long (+1) vs short (-1)
        total_price += price * option.quantity
        for greek in total_greeks:
            total_greeks[greek] += greeks[greek] * option.quantity

    return {'total_price': total_price, 'total_greeks': total_greeks}

Strike Rounding to Exchange Conventions

Options exchanges don't allow arbitrary strike prices. US equity options follow a tiered increment schedule based on price level. If your backtester prices a $152.37 strike, you're testing a contract that doesn't exist.

def _round_strike(self, strike: float) -> float:
    """Round strike to nearest valid exchange increment"""
    if strike < 25:
        return round(strike * 2) / 2      # $0.50 increments
    elif strike < 200:
        return round(strike)               # $1.00 increments
    elif strike < 500:
        return round(strike / 2.5) * 2.5   # $2.50 increments
    else:
        return round(strike / 5) * 5       # $5.00 increments

This matters more than it seems. A strategy backtesting 25-delta put spreads on a $180 stock will select different strikes than one that ignores rounding. The protection profile changes, the cost changes, the max loss changes. Getting this detail wrong contaminates every downstream metric.

Expiry Date Alignment

US equity options expire on Fridays. The platform's find_friday_expiry() function takes a target DTE and finds the nearest Friday within a configurable tolerance window, ensuring every backtested contract aligns to a real expiration cycle.

def find_friday_expiry(target_date, target_days_to_expiry, tolerance_days=3):
    """Find the Friday expiry closest to the target DTE"""
    target_expiry = target_date + timedelta(days=target_days_to_expiry)

    # Search within tolerance window for nearest Friday
    for offset in range(tolerance_days + 1):
        for delta in [offset, -offset]:
            candidate = target_expiry + timedelta(days=delta)
            if candidate.weekday() == 4:  # Friday
                return candidate.strftime('%Y-%m-%d')

    # Fallback: find next Friday
    days_until_friday = (4 - target_expiry.weekday()) % 7
    return (target_expiry + timedelta(days=days_until_friday)).strftime('%Y-%m-%d')

Volatility Surface

Flat volatility is the single biggest source of error in naive options backtesting. In reality, out-of-the-money puts trade at significantly higher implied volatility than at-the-money options (the volatility skew), and near-term options trade at different vol levels than longer-dated ones (the term structure). Ignoring this misprices protective puts by 15-30% and makes credit spreads look systematically more attractive than they actually are.

SABR-Inspired Parametrisation

The platform constructs a volatility surface across 7 standard expiries (7d to 365d) and 25 strike points spanning 70%-130% of spot price. Each expiry slice is parametrised with four values inspired by the SABR stochastic volatility model:

| Parameter | Role | Default | Bounds |
|-----------|------|---------|--------|
| Alpha | Base volatility level | 30d realised vol | [0.05, 1.0] |
| Beta | Skew strength | 0.5 | [0.0, 2.0] |
| Rho | Spot-vol correlation | -0.1 | [-0.5, 0.5] |
| Nu | Volatility of volatility | 0.5 | [0.1, 1.0] |

The negative rho reflects the empirical observation that equity volatility rises as prices fall (the leverage effect). The term structure is modelled via per-expiry multipliers — shorter expiries carry higher base volatility, while longer expiries exhibit stronger skew:

# Term structure: shorter expiries = higher vol, longer expiries = stronger skew
time_factor = expiry / 365.0
base_multiplier = 1.0 + 0.1 * (1.0 - time_factor)   # Near-term vol premium
skew_strength  = 1.0 + 0.2 * time_factor              # Longer-dated skew steepens

Implied Volatility Calculation

For a given strike, expiry, and option type, the surface produces an implied volatility by:

  1. Finding the closest standard expiry
  2. Calculating moneyness (strike / spot)
  3. Applying term-structure-adjusted base volatility
  4. Adding skew based on moneyness and option type (calls and puts have asymmetric skew treatment)
  5. Incorporating the correlation effect
  6. Clamping to [5%, 100%] bounds

def calculate_implied_volatility(self, strike, expiry_days, option_type='c'):
    # Snap to the nearest of the standard expiries, then pull that slice's
    # parameters (attribute and key names here are illustrative)
    closest_expiry = min(self.standard_expiries, key=lambda e: abs(e - expiry_days))
    moneyness = strike / self.current_price
    skew_params = self.skew_parameters[closest_expiry]
    term_params = self.term_structure[closest_expiry]

    base_vol = skew_params['alpha'] * term_params['base_multiplier']
    skew_strength = term_params['skew_strength']
    beta, rho = skew_params['beta'], skew_params['rho']

    # Asymmetric skew: OTM puts get more vol than OTM calls
    if option_type == 'c':
        if moneyness < 1.0:  # ITM call
            vol_adj = skew_strength * beta * (1.0 - moneyness)
        else:                # OTM call
            vol_adj = skew_strength * beta * (moneyness - 1.0) * 0.5
    else:  # put
        if moneyness < 1.0:  # OTM put (more expensive — skew)
            vol_adj = skew_strength * beta * (1.0 - moneyness) * 0.5
        else:                # ITM put
            vol_adj = skew_strength * beta * (moneyness - 1.0)

    correlation_effect = rho * vol_adj * 0.1
    implied_vol = max(0.05, min(1.0, base_vol + vol_adj + correlation_effect))
    return implied_vol

Fitting to Market Data

When real options market data is available, the surface parameters are fitted via L-BFGS-B optimisation, minimising the sum of squared errors between model and market implied volatilities across all observed strikes and expiries:

minimise: sum( (model_IV - market_IV)^2 )
subject to: alpha in [0.05, 1.0], beta in [0.0, 2.0], rho in [-0.5, 0.5], nu in [0.1, 1.0]

This ensures the backtest uses a volatility surface that reflects actual market conditions when available, rather than relying solely on synthetic parameters.
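A minimal sketch of that fit, assuming scipy is available and substituting a toy smile model for the platform's full SABR-inspired parametrisation (nu is omitted for brevity, and the observations are made up):

```python
import numpy as np
from scipy.optimize import minimize

# Toy market observations: (moneyness, market implied vol)
observed = [(0.85, 0.31), (0.95, 0.25), (1.00, 0.22), (1.05, 0.23), (1.15, 0.27)]

def model_iv(params, moneyness):
    """Simplified smile: level + symmetric skew + rho tilt (stand-in for SABR)."""
    alpha, beta, rho = params
    return alpha + beta * abs(moneyness - 1.0) + rho * (moneyness - 1.0)

def sse(params):
    """Sum of squared IV errors across all observed strikes."""
    return sum((model_iv(params, m) - iv) ** 2 for m, iv in observed)

bounds = [(0.05, 1.0), (0.0, 2.0), (-0.5, 0.5)]   # alpha, beta, rho
result = minimize(sse, x0=np.array([0.2, 0.5, -0.1]),
                  method='L-BFGS-B', bounds=bounds)
alpha_fit, beta_fit, rho_fit = result.x
```

The bounded optimisation keeps each parameter inside its economically sensible range even when the market data is noisy.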


Strategy Framework

The Interface Contract

Every strategy is a standalone Python module implementing a single function:

def run_strategy(data: pd.DataFrame) -> List[Dict]:
    """
    Receives: DataFrame with OHLCV columns and a 200-day lookback buffer
              for indicator warm-up.

    Returns:  List of trade signal dicts, each containing:
              - 'entry_date' (or 'buy_date')
              - 'exit_date' (or 'sell_date')
              - 'entry_price', 'exit_price'
              - 'return'
              - Any additional indicator values at entry/exit
    """

This interface is deliberately minimal. The strategy knows nothing about options, instruments, or position sizing. It operates purely on underlying price data and produces dates when conditions are met. The backtesting engine handles everything else — constructing the options position, pricing it, sizing it, and managing exits via stop loss, take profit, and max duration parameters.

This separation means a strategy can be tested with any instrument. The same momentum signal can drive a call spread, a put spread, a straddle, or a naked option — without modifying a single line of strategy code.
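As a concrete toy example of the contract, here is a moving-average cross with a fixed five-bar exit. The signal logic is invented for illustration; only the function signature and the returned field names follow the interface above:

```python
import pandas as pd

def run_strategy(data: pd.DataFrame) -> list:
    """Toy strategy: enter when Close crosses above its 5-day SMA, exit 5 bars later."""
    data = data.copy()
    data['SMA_5'] = data['Close'].rolling(5).mean()
    trades = []
    for i in range(5, len(data) - 5):
        prev, row = data.iloc[i - 1], data.iloc[i]
        if prev['Close'] <= prev['SMA_5'] and row['Close'] > row['SMA_5']:
            exit_row = data.iloc[i + 5]
            trades.append({
                'entry_date': str(data.index[i]),
                'exit_date': str(data.index[i + 5]),
                'entry_price': float(row['Close']),
                'exit_price': float(exit_row['Close']),
                'return': float(exit_row['Close'] / row['Close'] - 1),
            })
    return trades
```

Note the strategy never mentions options: it emits dates and prices, and the engine decides what to trade on those dates.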

Example: Momentum Entry with Volatility Exit

This strategy enters when MACD histogram is negative but turning positive (momentum inflection), RSI is neutral (40-60), and the 30-day EMA is above the 120-day SMA (confirmed uptrend). It exits on the first of: 30 days elapsed, 5% profit, or 5% drawdown from peak.

def run_strategy(data):
    # Trend filter
    data['EMA_30']  = data['Close'].ewm(span=30).mean()
    data['SMA_120'] = data['Close'].rolling(window=120).mean()

    # Momentum: MACD(12, 26, 9)
    data['MACD_line']      = data['Close'].ewm(span=12).mean() - data['Close'].ewm(span=26).mean()
    data['MACD_signal']    = data['MACD_line'].ewm(span=9).mean()
    data['MACD_histogram'] = data['MACD_line'] - data['MACD_signal']
    data['MACD_hist_trend'] = data['MACD_histogram'].diff()

    # Mean reversion filter: RSI(14)
    delta = data['Close'].diff()
    gain  = delta.where(delta > 0, 0).rolling(14).mean()
    loss  = (-delta.where(delta < 0, 0)).rolling(14).mean()
    data['RSI'] = 100 - (100 / (1 + gain / loss))

    trades = []
    position = None

    for i in range(120, len(data)):
        row = data.iloc[i]

        if position is None:
            macd_inflection = row['MACD_histogram'] < 0 and row['MACD_hist_trend'] > 0
            rsi_neutral     = 40 <= row['RSI'] <= 60
            trend_confirmed = row['EMA_30'] > row['SMA_120']

            if macd_inflection and rsi_neutral and trend_confirmed:
                position = 'long'
                # ... record entry

        elif position == 'long':
            # Exit: 30 days | 5% profit | 5% drawdown from peak
            # ... check conditions, record trade

    return trades

The entry logic combines three independent filters: a momentum inflection (MACD turning), a mean-reversion guard (RSI not extended), and a trend filter (EMA above SMA). This layered approach reduces false signals while remaining explainable — each condition has a clear market rationale.

Example: Multi-Timeframe Volatility Squeeze (DoubleDip)

A more sophisticated strategy operating across daily and weekly timeframes simultaneously. The Keltner Channel / Bollinger Band width ratio is a measure of volatility compression — when the ratio exceeds 0.8 on both timeframes, the market is coiling for a directional move:

def run_strategy(data):
    # === Daily KC/BB Ratio ===
    data['EMA_20']   = data['Close'].ewm(span=20).mean()
    data['ATR_20']   = true_range(data).rolling(20).mean()        # true_range(): helper elided
    data['KC_width'] = 2 * 1.5 * data['ATR_20']                    # Keltner: 1.5 ATR
    data['BB_width'] = 2 * 2 * data['Close'].rolling(20).std()     # Bollinger: 2 sigma
    data['KCBB_daily'] = data['KC_width'] / data['BB_width']

    # === Weekly KC/BB Ratio ===
    weekly = data.resample('W-FRI').agg(
        {'Open': 'first', 'High': 'max', 'Low': 'min', 'Close': 'last'}
    )
    # ... same KC/BB calculation on weekly bars
    data['KCBB_weekly'] = weekly_ratio.reindex(data.index, method='ffill')

    # Entry (checked inside the daily bar loop): both squeeze ratios > 0.8
    if row['KCBB_daily'] > 0.8 and row['KCBB_weekly'] > 0.8:
        # ... enter position

The multi-timeframe confirmation significantly reduces false signals. A daily squeeze alone triggers frequently. Requiring the weekly timeframe to confirm means the underlying is compressed on both scales — a stronger structural setup.

Strategy Library

The platform includes ~20 strategy modules spanning several categories:

| Category | Examples | Signal Logic |
|----------|----------|--------------|
| Trend crossover | EMA/SMA crossovers, price vs. moving average | Entry on cross above, exit on cross below |
| Momentum inflection | MACD histogram turning, RSI range filters | Enter at momentum reversal points |
| Volatility-aware | RV-based stop losses, IV/RV exit triggers | Incorporate realised vol into exit logic |
| Multi-timeframe | DoubleDip KC/BB ratio (daily + weekly) | Require confirmation across timeframes |
| Benchmark | Always-on (constant exposure) | Generate entry signals every day |

Each strategy is a standalone .py file. Adding a new strategy means writing one function and dropping it in the /strategies/ folder — no registration, no configuration, no framework boilerplate. The engine discovers strategies by scanning the directory at runtime.
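That discovery step can be sketched with importlib (the function name `discover_strategies` is illustrative, not the platform's actual entry point):

```python
import importlib.util
from pathlib import Path

def discover_strategies(folder: str = 'strategies') -> dict:
    """Load every .py module in the folder that exposes run_strategy()."""
    strategies = {}
    for path in sorted(Path(folder).glob('*.py')):
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        if hasattr(module, 'run_strategy'):
            strategies[path.stem] = module.run_strategy
    return strategies
```

Dropping a new file into the folder is the entire registration process; anything without a `run_strategy` function is simply ignored.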


Instrument Construction

The OptionContract Model

Every options position is built from OptionContract instances — the atomic unit of the system:

class OptionContract:
    def __init__(self, option_type: str, strike: float,
                 expiry_date: str, quantity: int = 1):
        self.option_type = option_type   # 'c' for call, 'p' for put
        self.strike = self._round_strike(strike)  # Snapped to exchange increment
        self.expiry_date = expiry_date   # Always a Friday
        self.quantity = quantity          # +1 long, -1 short

Quantity sign encodes direction. A bull put spread is [OptionContract('p', lower_strike, expiry, +1), OptionContract('p', higher_strike, expiry, -1)] — long the cheaper put, short the more expensive one. This convention means the pricing engine doesn't need special-case logic for credit vs. debit spreads; the sign propagates naturally through pricing and P&L.
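The sign convention is easiest to see in a payoff-at-expiry sketch (a deliberate simplification: the engine prices through Black-Scholes over the life of the trade, not intrinsic value at expiry):

```python
def payoff_at_expiry(legs, spot: float) -> float:
    """Intrinsic value of a multi-leg position; quantity sign encodes direction."""
    total = 0.0
    for option_type, strike, quantity in legs:
        if option_type == 'c':
            intrinsic = max(spot - strike, 0.0)
        else:
            intrinsic = max(strike - spot, 0.0)
        total += intrinsic * quantity
    return total

# Bull put spread: long the 95 put, short the 100 put
bull_put = [('p', 95.0, +1), ('p', 100.0, -1)]
```

At a $105 spot both puts expire worthless (payoff 0); at $90 or below the position shows its maximum $5-per-share loss, capped by the long wing. No credit/debit special-casing is needed anywhere.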

Instrument Templates

Like strategies, instruments are standalone Python modules implementing create_instrument():

def create_instrument(current_price, current_date, delta_spread=10, days_to_expiry=60):
    """Create a 40-delta bull put spread"""
    expiry = find_friday_expiry(current_date, days_to_expiry, tolerance_days=5)

    # Strike spacing driven by delta_spread parameter
    spacing = current_price * (delta_spread / 100.0) * 0.5
    spacing = max(spacing, max(current_price * 0.01, 0.50))  # Minimum width

    sell_strike = current_price            # Short put at/near the money
    buy_strike  = current_price - spacing  # Long put below (protection)

    return [
        OptionContract('p', buy_strike,  expiry, +1),   # Long put (protection)
        OptionContract('p', sell_strike, expiry, -1),   # Short put (premium)
    ]

The delta_spread parameter controls strike spacing — a wider spread means more directional exposure, more premium collected, but higher max loss. This is the parameter the optimisation engine sweeps to find the best risk/reward for a given strategy.

Available Instruments

| Instrument | Structure | Legs | DTE |
|------------|-----------|------|-----|
| 30-delta call single | Naked long call | +1C at ~30 delta | 60d |
| 90-delta call single | Deep ITM long call | +1C at ~90 delta | 90d |
| 40-delta call spread | Bull call vertical | +1C / -1C | 14d / 60d |
| 40-delta put spread | Bull put vertical | +1P / -1P | 14d / 60d |
| Bear call spread | Credit call vertical | -1C / +1C | 60d |
| Long ATM straddle | Long vol | +1C / +1P at ATM | 30d |
| Short ATM straddle | Short vol | -1C / -1P at ATM | 30d |
| 10-delta strangle | Wide short vol | -1C / -1P at 10 delta | 30d |
| Short straddle butterfly | Capped short vol | -1C / -1P / +1C (wing) / +1P (wing) | 60d |

The key insight is combinatorial: 20 strategies × 15 instruments = 300 potential pairings. The optimisation engine tests each combination across parameter grids of delta spread, stop loss, take profit, and max duration — finding the optimal strategy-instrument pairing for a given underlying and time period.


Position Sizing: Kelly Criterion

The Core Formula

The Kelly Criterion determines the optimal fraction of capital to risk per trade:

f* = (b * p - q) / b

where:
  b = win odds (take_profit / stop_loss)
  p = win probability
  q = loss probability (1 - p)
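A worked example under assumed inputs, with a take profit twice the stop loss and a 60% win probability:

```python
def kelly_fraction(win_prob: float, take_profit: float, stop_loss: float) -> float:
    """Raw Kelly fraction: f* = (b*p - q) / b, with b = TP / SL."""
    b = take_profit / stop_loss
    q = 1.0 - win_prob
    return (b * win_prob - q) / b

f_star = kelly_fraction(0.60, take_profit=0.50, stop_loss=0.25)  # b = 2, f* = 0.40
```

Risking 40% of capital per trade is exactly the kind of number that motivates the constraints below.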

Practical Constraints

Raw Kelly is famously aggressive — it optimises for maximum geometric growth rate but tolerates enormous drawdowns along the way. The platform applies three constraints that reflect how Kelly is actually used on trading desks:

def get_position_size(self, capital, win_prob, tp_pct, sl_pct,
                      use_half_kelly=False, max_fraction=0.25, min_fraction=0.01):

    # 1. Calculate raw Kelly
    win_odds = tp_pct / sl_pct
    raw_kelly = (win_odds * win_prob - (1 - win_prob)) / win_odds

    # 2. Half-Kelly option (standard practice — reduces variance by ~50%)
    if use_half_kelly:
        raw_kelly *= 0.5

    # 3. Hard constraints: never risk more than 25% or less than 1%
    constrained = max(min_fraction, min(max_fraction, raw_kelly))

    return capital * constrained

Why these constraints exist:

  • 25% maximum: Full Kelly assumes an infinite time horizon and perfect edge estimation. In practice, your win probability estimate is noisy. Capping at 25% prevents a single bad estimate from oversizing catastrophically.
  • Half-Kelly: Reduces the variance of returns by approximately 50% while only sacrificing about 25% of the geometric growth rate. Most professional desks use fractional Kelly for this reason.
  • 1% minimum: Prevents the engine from producing zero-size positions when Kelly suggests minimal allocation, ensuring every signal is actually tested.

Zero-Quantity Prevention

When Kelly sizing is combined with contract-level rounding, a subtle problem emerges. If Kelly says "risk 0.3% of capital" and a single contract costs more than that amount, the quantity rounds to zero. The engine preserves direction:

scaled_qty = round(original_qty * multiplier)
if original_qty != 0 and scaled_qty == 0:
    scaled_qty = 1 if original_qty > 0 else -1  # Preserve direction

This matters for multi-leg positions. If a butterfly's wing rounds to zero contracts but the body doesn't, the position profile is completely different from what was intended. The minimum-quantity rule ensures every leg maintains its structural role.


Multi-Variable Backtesting

This is the most architecturally distinctive feature. Standard backtesting generates signals and executes trades on the same asset. Multi-variable backtesting decouples signal generation from trade execution entirely.

Signal-Execution Decoupling

The system supports up to three signal strategies, each running on a different asset. Signals are combined with AND logic — all strategies must agree on the same date for an entry to trigger. This enables setups like:

  • Signal from VIX (volatility regime) + Signal from SPY (trend) → Execute options on QQQ
  • Signal from sector ETF (rotation) + Signal from underlying (momentum) → Execute options on individual stock

Each signal strategy uses the same run_strategy() interface. The multi-variable engine runs each strategy against its respective signal asset, extracts the entry dates, computes the intersection, and passes those dates to the standard position execution pipeline.

Architecture

class SignalStrategy:
    """Wraps a strategy function with its signal asset"""
    def __init__(self, strategy_func, signal_asset_data):
        self.signals = strategy_func(signal_asset_data)
        self.entry_dates = [s['entry_date'] for s in self.signals]

class SignalAggregator:
    """Combines signals from multiple strategies using AND logic"""
    def aggregate(self, strategies: List[SignalStrategy]) -> List[str]:
        if not strategies:
            return []
        # Intersection of all entry date sets
        common_dates = set(strategies[0].entry_dates)
        for strategy in strategies[1:]:
            common_dates &= set(strategy.entry_dates)
        return sorted(common_dates)

Exit logic remains position-based (stop loss, take profit, max duration) rather than signal-based — once you're in a position, exit is governed by P&L and time, not by the original signal assets. This reflects how most systematic options strategies are actually managed.


Cascading Optimisation

Most backtesting tools offer a flat grid search — test every combination, rank results, done. The problem is that a flat search either covers too little of the parameter space (too coarse) or takes too long (too fine). This platform uses a three-phase cascading approach: broad search, then targeted refinement, then temporal validation.

Phase 1: Broad Parameter Sweep

Phase 1 runs a full backtest for every combination in a parameter grid:

  • Take profit: [50%, 70%, 100%]
  • Stop loss: [20%, 50%, 100%]
  • Max duration: [10, 20, 30 days]
  • Delta spread: [10, 20 delta between legs]
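
Phase 1's sweep over these lists is a plain Cartesian product, sketched here with itertools:

```python
from itertools import product

take_profits  = [0.50, 0.70, 1.00]
stop_losses   = [0.20, 0.50, 1.00]
max_durations = [10, 20, 30]
delta_spreads = [10, 20]

# 3 * 3 * 3 * 2 = 54 combinations per strategy-instrument-symbol triplet
grid = list(product(take_profits, stop_losses, max_durations, delta_spreads))
```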

That's 54 combinations per strategy-instrument-symbol triplet. Each combination runs a complete backtest and produces a Composite Factor (CF) score:

CF = options_win_rate × total_reward_to_risk

where:
  options_win_rate = winning_trades / total_trades
  total_reward_to_risk = sum(all winning %) / abs(sum(all losing %))

CF is deliberately multiplicative rather than additive. A strategy with 80% win rate but a 0.5 reward-to-risk scores lower than one with 60% win rate and 2.0 reward-to-risk. This penalises the common trap of high win rate / low payoff strategies — the ones that look good until a single tail event wipes out months of premium collection.
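The numbers from that comparison, in code:

```python
def composite_factor(win_rate: float, reward_to_risk: float) -> float:
    """CF = win rate * total reward-to-risk (multiplicative on purpose)."""
    return win_rate * reward_to_risk

premium_seller = composite_factor(0.80, 0.5)   # high win rate, poor payoff
trend_rider    = composite_factor(0.60, 2.0)   # lower win rate, strong payoff
```

The premium seller scores 0.40 against the trend rider's 1.20, even though its win rate looks far better in isolation.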

Results are visualised as heatmaps across four dimensions: CF by TP vs SL, CF by Duration vs Delta, Win Rate by TP vs SL, and Reward-to-Risk by TP vs SL. The heatmaps reveal parameter interaction effects — for instance, that a 20% stop loss only works with short durations, or that wider delta spreads require higher take profits to compensate for the additional risk.

Phase 2: Cascading Refinement

Phase 2 automatically extracts the top 20% of Phase 1 results and narrows the parameter ranges around those winners:

top_results = phase1_results.head(int(len(results) * 0.2))

# Extract winning parameter ranges and expand slightly
tp_min, tp_max = top_results['take_profit'].min(), top_results['take_profit'].max()
midpoint = (tp_min + tp_max) / 2
refined_tp = [tp_min - 0.2, tp_min, midpoint, tp_max, tp_max + 0.2]

# Same for SL, duration, delta — adding midpoints where gaps > threshold

This creates a finer-grained grid centred on the promising region of parameter space. If Phase 1 found that TP between 50-70% and SL between 20-50% dominated, Phase 2 tests [30%, 50%, 60%, 70%, 90%] for TP and [20%, 35%, 50%] for SL — zooming in without wasting compute on regions that already proved unpromising.

The cascading approach finds near-optimal parameters in a fraction of the time a brute-force fine grid would take, while still exploring the neighbourhood of the best results for local optima.

Phase 3: Temporal Robustness (Walk-Forward)

A strategy optimised on 2014-2024 data might just be overfitted to that specific decade. Phase 3 tests whether the parameters generalise across time.

Standard robustness test: Split the data 70/30 into train and test periods. Optimise on training data, then evaluate on unseen test data. Calculate performance degradation:

degradation = (train_CF - test_CF) / train_CF

A degradation under 20% suggests the parameters are genuinely capturing a structural edge. Over 50% suggests overfitting.

Walk-forward analysis: Roll a 3-year optimisation window forward in 1-year steps, testing on the subsequent year each time:

Window 1: Optimise 2014-2016, Test 2017
Window 2: Optimise 2015-2017, Test 2018
Window 3: Optimise 2016-2018, Test 2019
Window 4: Optimise 2017-2019, Test 2020
Window 5: Optimise 2018-2020, Test 2021

This produces a sequence of out-of-sample results. If performance is consistent across windows, the strategy has a durable edge. If it spikes in one window and collapses in others, it's regime-dependent.
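The window schedule above can be generated mechanically (the helper name is illustrative):

```python
def walk_forward_windows(start_year: int, last_test_year: int,
                         train_years: int = 3, step: int = 1):
    """Return (train_start, train_end, test_year) tuples for rolling optimisation."""
    windows = []
    year = start_year
    while year + train_years <= last_test_year:
        windows.append((year, year + train_years - 1, year + train_years))
        year += step
    return windows

windows = walk_forward_windows(2014, 2021)
# [(2014, 2016, 2017), (2015, 2017, 2018), ..., (2018, 2020, 2021)]
```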


Alpha Validation

This is the platform's most statistically rigorous feature. The question it answers: does this strategy have genuine alpha, or did it just get lucky with entry timing?

Alpha validation combines three independent tests into a single robustness score.

Randomised Entry Testing

The most revealing test. For each trade in a strategy's history, the entry date is randomly shifted within a configurable window (e.g., ±20 trading days). The option spread is recalibrated to the new entry price — preserving the percentage distance from spot that the original spread had — and the trade is re-simulated from the new date.

# Shift entry date randomly within range
random_days = np.random.randint(-random_days_range, random_days_range + 1)
adjusted_date = find_nearest_trading_day(original_date + random_days)

# Recalibrate spread to new spot price (preserve % structure)
new_spread = recalibrate_bull_put(adjusted_date, new_spot, original_trade)

# Re-simulate trade from new entry
result = simulate_trade(new_spread, adjusted_date)

This is run for 20 iterations, each with a different random seed. If the strategy's win rate and average return are similar across randomised entries, the edge comes from the spread structure and market regime, not from entry timing. If performance collapses when entries are shifted by a few days, the strategy is fragile — it depends on hitting exact inflection points, which is not reliably repeatable in live trading.

Benchmark Comparison

Each trade's return is compared against two benchmarks over the same holding period:

  • SPY buy-and-hold — did the options strategy outperform the market?
  • Symbol buy-and-hold — did the options structure add value beyond just owning the underlying?

If a bull put spread on AAPL underperforms AAPL buy-and-hold over the same period, the spread structure is destroying value, not creating it.

Signal Attribution

Signal attribution decomposes where the strategy's returns come from:

# For each trade:
underlying_moved_as_expected = (exit_spot - entry_spot) > 0  # for bullish strategies
options_profitable = trade_pnl > 0

# Aggregated:
strategy_accuracy = correct_direction_trades / total_trades
win_rate_when_correct = options_wins_when_direction_right / direction_right_trades
win_rate_when_wrong = options_wins_when_direction_wrong / direction_wrong_trades

# How much does signal quality drive outcomes?
strategy_contribution = strategy_accuracy × win_rate_when_correct / combined_win_rate

This answers a subtle question: does the strategy make money because it picks direction correctly (signal quality), or because the spread structure profits even when direction is wrong (structural edge from theta/IV)? A bull put spread can profit when the underlying goes sideways or slightly down — so a strategy with poor directional accuracy might still work if the spread is wide enough. Attribution separates these effects.

Robustness Score

The three tests are combined into a single robustness score on a 0-1 scale:

win_rate_diff = abs(original_win_rate - mean(randomised_win_rates))
return_diff   = abs(original_avg_return - mean(randomised_avg_returns))

robustness_score = (max(0, 1 - win_rate_diff * 2) + max(0, 1 - return_diff * 0.1)) / 2

The interpretation scale:

  • 0.8+: Highly robust — edge survives entry timing randomisation
  • 0.6-0.8: Robust — minor degradation with randomised entries
  • 0.4-0.6: Moderately robust — performance is partially timing-dependent
  • 0.2-0.4: Weakly robust — significant dependence on entry timing
  • Below 0.2: Not robust — strategy is likely overfitted to specific entry points
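
Plugging in assumed numbers, an original 65% win rate against a 60% randomised mean, and a 4% original average return against a 3% randomised mean (returns expressed in percent, matching the 0.1 scaling):

```python
def robustness_score(orig_win_rate, rand_win_rates, orig_avg_return, rand_avg_returns):
    """Combine win-rate and return stability into a 0-1 score (per the formula above)."""
    win_rate_diff = abs(orig_win_rate - sum(rand_win_rates) / len(rand_win_rates))
    return_diff   = abs(orig_avg_return - sum(rand_avg_returns) / len(rand_avg_returns))
    return (max(0.0, 1 - win_rate_diff * 2) + max(0.0, 1 - return_diff * 0.1)) / 2

score = robustness_score(0.65, [0.58, 0.62], 4.0, [2.5, 3.5])
```

This lands at 0.9, comfortably in the "highly robust" band: small degradations on both axes.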

Strategy Comparison

The comparison engine tests multiple strategies against multiple symbols simultaneously, producing a performance matrix:

# Test 1-3 strategies × 1-15 symbols in a single run
for strategy in strategies:
    for symbol in symbols:
        result = run_backtest(strategy, symbol, instrument, params)
        matrix[f"{symbol}_{strategy}"] = aggregate_metrics(result)

Each cell in the matrix contains the full metric suite (CF, win rate, Sharpe, Sortino, max drawdown, reward-to-risk). This reveals which strategies work on which assets — a momentum strategy might excel on tech stocks but fail on utilities, while a mean-reversion strategy shows the opposite pattern.

Portfolio-level aggregation converts trade-to-trade returns into daily equivalents for comparable risk metrics:

daily_equivalent = (1 + trade_return) ^ (1 / days_between_trades) - 1
sharpe = (annualised_return - risk_free) / annualised_volatility

The comparison engine is what makes the combinatorial strategy-instrument architecture practically useful. Without it, testing 20 strategies × 15 instruments × 15 symbols manually would take weeks. The matrix produces ranked results across all combinations in a single run.


Performance Metrics

The platform computes standard quantitative metrics for each backtest:

Metric               | Calculation                        | Purpose
Total Return         | (final - initial) / initial        | Absolute performance
Win Rate             | winners / total trades             | Signal accuracy
Profit Factor        | gross profit / gross loss          | Edge quality
Sharpe Ratio         | (mean return - Rf) / std(returns)  | Risk-adjusted return
Sortino Ratio        | (mean return - Rf) / downside std  | Penalises downside only
Calmar Ratio         | annual return / max drawdown       | Return per unit of tail risk
Max Drawdown         | peak-to-trough decline             | Worst-case loss
Average Winner/Loser | mean return of winning/losing trades | Payoff asymmetry
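Two of the table's trade-level metrics can be sketched in a few lines (a hypothetical helper, not the platform's actual code):

```python
def trade_metrics(returns):
    """Win rate and profit factor from a list of per-trade returns (fractions)."""
    winners = [r for r in returns if r > 0]
    losers = [r for r in returns if r < 0]
    win_rate = len(winners) / len(returns)
    gross_profit = sum(winners)
    gross_loss = abs(sum(losers))
    # A strategy with no losing trades has an undefined (infinite) profit factor.
    profit_factor = gross_profit / gross_loss if gross_loss else float("inf")
    return {"win_rate": win_rate, "profit_factor": profit_factor}
```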

Drawdown tracking handles long and short positions differently:

  • Long positions: Peak = highest value, drawdown = (peak - current) / peak
  • Short positions: Peak = lowest value, drawdown = (current - peak) / |peak|
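The two cases above can be folded into one running-peak scan — a sketch under the same definitions, with a `short` flag standing in for however the platform distinguishes position direction:

```python
def max_drawdown(values, short=False):
    """Worst peak-to-trough drawdown over a series of position values.

    For long positions the peak is the running maximum; for short positions
    the 'peak' is the running minimum (the position is most profitable when
    the value is lowest), so drawdown measures the rebound off that low.
    """
    peak = values[0]
    worst = 0.0
    for v in values:
        if short:
            peak = min(peak, v)
            dd = (v - peak) / abs(peak)
        else:
            peak = max(peak, v)
            dd = (peak - v) / peak
        worst = max(worst, dd)
    return worst
```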

Strategy Deep Dive: Live Trade Reconstruction

The backtester answers "what would have happened?" The deep dive tool answers the harder question: "what did happen, and does the backtest agree?"

Upload a CSV or Excel file of actual trades executed in the market. The engine parses spread notation in multiple formats — 150/145P 7Nov25, 320.0/315.0P 2025-11-14, 639/629P 31/10/2025 — and reconstructs each trade:

  1. Parse spread notation into individual legs (strikes, option type, expiry)
  2. Download historical data for the underlying at the actual trade dates
  3. Calculate technical indicators at entry — RSI, Stochastic, MACD, Bollinger Bands, Keltner Channels, ATR — to understand what the market looked like when the trade was entered
  4. Simulate option pricing day by day from entry to exit using the volatility surface
  5. Track daily P&L and compare against the actual P&L from the broker
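Step 1 — parsing the spread notation — can be sketched with a regex for one of the formats cited above (`150/145P 7Nov25`). This handles only that variant; the engine accepts several date formats, and the field names here are illustrative:

```python
import re
from datetime import datetime

# Matches notation like "150/145P 7Nov25" or "320.0/315.0C 14Nov25".
SPREAD_RE = re.compile(
    r"^(\d+(?:\.\d+)?)/(\d+(?:\.\d+)?)([CP])\s+(\d{1,2}[A-Za-z]{3}\d{2})$"
)

def parse_spread(text):
    """Parse one spread string into strikes, option type, and expiry date."""
    m = SPREAD_RE.match(text.strip())
    if not m:
        raise ValueError(f"Unrecognised spread notation: {text!r}")
    strike_a, strike_b, opt_type, expiry = m.groups()
    return {
        "strikes": (float(strike_a), float(strike_b)),
        "type": "put" if opt_type == "P" else "call",
        "expiry": datetime.strptime(expiry, "%d%b%y").date(),
    }
```

Whether the first strike is the long or short leg depends on the structure (a bull put spread sells the higher strike), so leg assignment happens downstream of parsing.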

The divergence between simulated and actual P&L isolates model error from execution quality. If the simulation says +3% but the broker shows +1.5%, the difference is attributable to some combination of slippage, fill quality, and IV model inaccuracy. Tracking this over dozens of trades reveals whether the backtester's assumptions are systematically optimistic or conservative — and by how much.

This tool feeds directly into the alpha validation pipeline: the live trades become the "original" results that are then subjected to randomised entry testing, benchmark comparison, and signal attribution.


Limitations and Future Work

Naming limitations is not hedging — it's acknowledging the assumptions behind the model and being explicit about where they matter.

Pricing model: Black-Scholes assumes European-style exercise. US equity options are American-style, meaning early exercise is theoretically possible. In practice, early exercise of equity options is rare enough (except around ex-dividend dates) that BS remains the standard pricing model for backtesting. A binomial tree or finite-difference pricer would handle early exercise but at significantly higher computational cost.

Volatility surface: The SABR-inspired surface is synthetic. When not fitted to market data, it produces reasonable skew and term structure but won't capture intraday regime shifts, earnings-driven vol spikes, or event-specific skew deformations. Production systems would consume a live IV feed from an options data vendor.

Dividends: Not modelled. For high-dividend stocks or long-dated options, this introduces pricing error. The impact is minimal for index options and short-dated trades.

Slippage model: Percentage-based rather than order-book-based. Real slippage depends on bid-ask width, order size relative to displayed liquidity, and time of day. The percentage model is conservative but not structurally accurate.

Execution: Single-threaded with a global progress dictionary. Adequate for the research use case but would need concurrency work for multi-user deployment.


Tech Stack

Layer           | Technology
Language        | Python 3.x
Web framework   | Flask (HTTP API + template rendering)
Options pricing | py_vollib (Black-Scholes + Greeks)
Optimisation    | SciPy (L-BFGS-B for volatility surface fitting)
Data processing | NumPy, Pandas
Market data     | yfinance, Polygon.io
Visualisation   | Chart.js (interactive equity curves, P&L charts)
Storage         | Local CSV files (pricing data, results), SQLite (test results DB)
Exports         | CSV, Excel (via openpyxl)

Reproducing This

The platform is not open-sourced, but the architecture is reproducible with the information above. Someone with basic options knowledge and a coding agent could reconstruct the core pipeline:

  1. Data layer: yfinance for OHLCV data, local CSV cache
  2. Volatility surface: Implement the SABR-inspired parametrisation described above
  3. Pricing engine: py_vollib for Black-Scholes + Greeks, with strike rounding
  4. Strategy framework: The run_strategy() interface contract, one strategy file per module
  5. Instrument framework: The create_instrument() interface, one instrument file per module
  6. Backtesting engine: Iterate through days, price positions at each bar, check exit conditions
  7. Kelly sizing: The constrained formula with half-Kelly and zero-quantity prevention

The novel contribution isn't any single component — it's the decoupling of strategy from instrument, the combinatorial testing this enables, and the multi-variable signal aggregation across assets. These architectural decisions are what make the platform useful for derivatives research rather than just another equity backtester with options bolted on.


Disclaimer: Altus Labs is not authorised or regulated by the Financial Conduct Authority (FCA). Altus Labs is a research publication and this content is provided for informational and educational purposes only. It does not constitute investment advice, a financial promotion, or an invitation to engage in investment activity. See our full disclaimer for more information.