Building a Research-Grade Equity Backtesting Platform

A deep dive into the architecture and validation framework behind a custom equity backtesting platform — from hit-rate analysis and Kaplan-Meier survival curves to elastic-net macro factor modeling and walk-forward cross-validation.

25 min read
By Altus Labs
Technology Research · Backtesting · Statistical Validation · Macro Factors · Survival Analysis · Walk-Forward · Software Architecture

This article walks through the architecture and key design decisions behind our equity backtesting platform — a system we built to test trading strategies against historical data and, more importantly, to statistically validate whether the results represent genuine skill or noise. The platform doesn't just tell you how a strategy performed. It tells you whether you should believe the results.

Key ideas in brief

  • Most backtesting tools produce an equity curve and a Sharpe ratio. This platform adds four independent validation layers: hit-rate analysis, skill testing (AUC/ROC + Brier scores), randomised entry testing (1000 iterations), and Kaplan-Meier survival analysis.
  • A macro factor modeling module uses elastic-net logistic regression with walk-forward time-series cross-validation to test whether macro variables (VIX, yields, DXY) have predictive power for stock returns — with ablation studies to isolate which factors actually matter.
  • The strategy interface is deliberately minimal: one function, one DataFrame in, one list of trades out. Everything else — equity simulation, metrics, statistical tests — is handled by the platform.

Why Build This

Every equity backtesting tool produces the same output: an equity curve, a Sharpe ratio, a win rate. The problem isn't generating these numbers — it's knowing whether to trust them.

A strategy with a 2.0 Sharpe over 10 years of data could be genuinely capturing a market inefficiency. Or it could be overfitted to the specific sequence of events in that period. Or its edge could come entirely from three outlier trades that happened to catch pandemic-era moves. Or its entry timing might be irrelevant — random entries with the same holding period might produce similar results.

Existing tools don't answer these questions. They show you the curve and leave the interpretation to you. We needed a platform that systematically tests whether a strategy's results are distinguishable from luck across multiple independent methodologies:

  • Hit-rate analysis: Do entries actually reach profit targets more often than loss targets within a fixed lookforward window?
  • Skill testing: Does the strategy's entry condition scoring predict trade outcomes better than random (measured by AUC and Brier score)?
  • Randomised entry testing: If we shift every entry date randomly within ±10 days and re-run 1000 times, does the strategy still outperform?
  • Survival analysis: How quickly do trades reach profit vs. loss targets? Is the time-to-profit significantly shorter than time-to-loss?

These four tests, applied together, make it substantially harder for a spurious result to survive validation. A strategy that passes all four has a meaningfully higher probability of representing genuine edge.


System Architecture

Backtest Execution Flow


Strategy Framework

The Interface Contract

Every strategy is a Python function implementing a single entry point:

from typing import Dict, List

import pandas as pd

def run_strategy(data: pd.DataFrame) -> List[Dict]:
    """
    Receives: DataFrame with Date, Open, High, Low, Close, Volume columns
              and a 365-day lookback buffer for indicator warm-up.

    Returns:  List of trade dicts, each containing:
              - 'buy_date', 'buy_price'
              - 'sell_date', 'sell_price'
              - 'return' (decimal, e.g. 0.05 for 5%)
              - 'days_held'
              - 'exit_reason' (optional)
              - Entry condition values for skill testing (optional):
                'entry_rsi', 'entry_macd_hist', 'entry_kc_bb_ratio'
    """

The optional entry condition fields are what make the deep testing module powerful. When a strategy records why it entered — the RSI value, the MACD histogram level, the KC/BB ratio — the skill testing module can evaluate whether higher-confidence entries produce better outcomes. Strategies that don't include these fields still work for basic backtesting but skip the entry quality analysis.
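For concreteness, a toy strategy satisfying this contract might look like the following sketch (the moving-average crossover logic is purely illustrative, not one of the platform's built-in strategies):

```python
from typing import Dict, List

import pandas as pd


def run_strategy(data: pd.DataFrame) -> List[Dict]:
    """Toy crossover strategy: buy when the 20-day SMA crosses above
    the 50-day SMA, sell on the reverse cross. Illustrative only."""
    close = data['Close']
    fast = close.rolling(20).mean()
    slow = close.rolling(50).mean()
    trades, entry_idx = [], None
    for i in range(50, len(data)):
        crossed_up = fast.iloc[i] > slow.iloc[i] and fast.iloc[i - 1] <= slow.iloc[i - 1]
        if entry_idx is None and crossed_up:
            entry_idx = i
        elif entry_idx is not None and fast.iloc[i] < slow.iloc[i]:
            trades.append({
                'buy_date': data['Date'].iloc[entry_idx],
                'buy_price': float(close.iloc[entry_idx]),
                'sell_date': data['Date'].iloc[i],
                'sell_price': float(close.iloc[i]),
                'return': float(close.iloc[i] / close.iloc[entry_idx]) - 1,
                'days_held': i - entry_idx,
                'exit_reason': 'cross_down',
            })
            entry_idx = None
    # A position still open at the end of the data is simply dropped.
    return trades
```

Everything beyond this function — equity simulation, metrics, statistical testing — is the platform's job.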

Equity Curve Simulation

The backtesting engine simulates a $100,000 portfolio through multiplicative compounding:

equity = 100_000
for trade in trades:
    equity *= (trade['sell_price'] / trade['buy_price'])

# Daily equity tracking (while holding):
daily_equity = equity * (current_close / entry_price)

Each backtest also runs two benchmark comparisons automatically:

  • Buy & Hold: What if you just held the underlying for the entire period?
  • SPY: What if you held the market index instead?

These benchmarks contextualise the strategy's performance. A 15% annual return sounds good until you learn SPY did 20% over the same period.
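The compounding loop and the buy-and-hold benchmark reduce to a few lines; a minimal sketch (function names are ours, not the platform's API):

```python
def strategy_equity(trades, start=100_000):
    """Multiplicative compounding of closed trades, as in the engine above."""
    equity = start
    for t in trades:
        equity *= t['sell_price'] / t['buy_price']
    return equity


def buy_and_hold_equity(closes, start=100_000):
    """Benchmark: hold the underlying (or SPY) for the entire period."""
    return start * closes[-1] / closes[0]
```

Comparing the two outputs on the same period is exactly the contextualisation step described above.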


Deep Testing: Statistical Validation

This is the core differentiator. Deep testing takes a strategy's trade list and subjects it to four independent statistical tests. The goal is to answer one question: is this strategy distinguishable from a random entry strategy?

Hit-Rate Analysis

For every trade entry, the engine looks forward N days (default 30) and asks two questions:

  1. Did the price reach the profit target (default +5%) at any point during the window?
  2. Did the price reach the loss target (default -5%) at any point during the window?

profit_target_price = entry_price * (1 + profit_target_pct / 100)
loss_target_price   = entry_price * (1 + loss_target_pct / 100)  # negative pct

hit_profit = max(highs[entry : entry + 30]) >= profit_target_price
hit_loss   = min(lows[entry : entry + 30])  <= loss_target_price

This is fundamentally different from just looking at whether the trade was profitable. A trade might have been closed at +2% by the strategy's exit logic, but the hit-rate analysis reveals that the underlying actually reached +5% within the lookforward window — suggesting the strategy is leaving money on the table with premature exits.

The engine also calculates price deltas at fixed intervals — 5-day, 10-day, and 20-day changes from each entry — and the realised volatility over each trade's holding period. These distributions reveal whether the strategy's entries tend to precede moves of unusual magnitude.
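A sketch of that delta calculation, assuming closes is a pandas Series of daily closes and entry_positions are integer row positions (both names hypothetical):

```python
import pandas as pd


def entry_deltas(closes: pd.Series, entry_positions, horizons=(5, 10, 20)):
    """Percent change from each entry at fixed horizons ahead."""
    rows = []
    for pos in entry_positions:
        row = {'entry': pos}
        for h in horizons:
            if pos + h < len(closes):
                row[f'delta_{h}d'] = closes.iloc[pos + h] / closes.iloc[pos] - 1
            else:
                row[f'delta_{h}d'] = None  # window extends past available data
        rows.append(row)
    return pd.DataFrame(rows)
```

The resulting distributions per horizon are what reveal whether entries precede moves of unusual magnitude.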

Skill Tests

Skill testing evaluates whether the strategy's entry signals contain genuine predictive information. Two complementary metrics are used:

AUC/ROC (Area Under the Curve): Measures how well the strategy's entry condition score discriminates between profitable and unprofitable trades. An AUC of 0.5 means the entry score is no better than random; 0.7+ suggests meaningful signal quality; 0.9+ is exceptional.

Brier Score: Measures probability calibration — how well the predicted confidence of a trade matches its actual outcome frequency. A Brier score of 0.25 means the predictions are no better than always predicting 50/50. Lower is better.

The entry condition scoring uses strategy-specific ranking functions:

# Built-in ranking factors for entry quality scoring:
ranking_factors = {
    'macd_rsi':       MACD_histogram + MACD_trend + RSI,
    'kcbb_ratio':     KC_width / BB_width,
    'ema_sma_ratio':  EMA_30 / SMA_120,
    'rsi_value':      100 - RSI,           # prefer oversold
    'rv_ratio':       RV_30d_SMA / RV_current,  # prefer vol compression
    'ema_slope_kcbb': abs(EMA_slope) + KCBB_ratio,
}

These factors let the platform rank trades by entry quality and test whether higher-ranked entries produce better returns. If they do, the strategy is capturing real signal. If returns are uncorrelated with entry quality, the strategy may be profiting from market drift rather than signal.

Bootstrap analysis with configurable iterations provides confidence intervals on all skill metrics.
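Both metrics, plus a percentile-bootstrap confidence interval, map directly onto scikit-learn; a sketch with hypothetical argument names:

```python
import numpy as np
from sklearn.metrics import brier_score_loss, roc_auc_score


def skill_metrics(entry_scores, outcomes, n_boot=1000, seed=0):
    """AUC, Brier score, and a bootstrap CI for entry scores
    (higher = more confident) vs binary trade outcomes (1 = winner)."""
    scores = np.asarray(entry_scores, dtype=float)
    y = np.asarray(outcomes)
    # Map scores into [0, 1] so they can be treated as probabilities for Brier.
    probs = (scores - scores.min()) / (scores.max() - scores.min() + 1e-12)
    auc = roc_auc_score(y, scores)
    brier = brier_score_loss(y, probs)
    rng = np.random.default_rng(seed)
    boot_aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), len(y))
        if len(set(y[idx])) < 2:      # resample must contain both classes
            continue
        boot_aucs.append(roc_auc_score(y[idx], scores[idx]))
    lo, hi = np.percentile(boot_aucs, [2.5, 97.5])
    return {'auc': auc, 'brier': brier, 'auc_ci': (lo, hi)}
```

The min-max rescaling of scores into pseudo-probabilities is our simplification; a calibration step (e.g. Platt scaling) would be more principled before computing Brier.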

Randomised Entry Testing

The most brutal test. Every entry date is randomly shifted within a configurable window (default ±10 days), and the strategy is re-run. This is repeated 1000 times, producing a distribution of outcomes under randomised entry timing.

import random
from datetime import timedelta

randomised_returns = []
for iteration in range(1000):
    randomised_trades = []
    for trade in original_trades:
        shift = random.randint(-window, window)
        shifted_date = trade['buy_date'] + timedelta(days=shift)
        new_entry = find_nearest_trading_day(shifted_date)
        new_trade = simulate_trade_from(new_entry, trade['holding_period'])
        randomised_trades.append(new_trade)

    randomised_returns.append(mean_return(randomised_trades))

# p-value: fraction of randomised iterations that beat the original
p_value = sum(r >= original_return for r in randomised_returns) / 1000

The output is a p-value. If p < 0.05, the strategy's entry timing produces significantly better results than random entries with the same holding period — the timing contains real information. If p > 0.05, the strategy's returns are statistically indistinguishable from random entry timing, which means the edge (if any) comes from the holding period and market drift, not from the entry signal.

Survival Analysis

Kaplan-Meier survival curves visualise how quickly trades reach their profit vs. loss targets. This answers a question that aggregate statistics miss: even if the hit rates for profit and loss are identical, is the median time-to-profit shorter than the median time-to-loss?

A strategy where trades reach +5% in a median of 8 days but take 18 days to reach -5% has a meaningful time-asymmetry — it captures gains faster than it accumulates losses. This asymmetry isn't visible in win rate or Sharpe ratio but is a real edge in live trading because it reduces capital lockup and increases trade frequency.

The platform uses the lifelines library for Kaplan-Meier estimation, producing cumulative probability curves for both profit and loss target events.
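The platform delegates this to lifelines, but the estimator itself is simple enough to sketch directly. In this toy version (names ours), durations are days until the target was hit and events flags whether it was hit at all before the lookforward window closed (censoring):

```python
import numpy as np


def kaplan_meier(durations, events):
    """Minimal Kaplan-Meier estimator: survival probability at each
    distinct event time. events = 1 if the target was hit, 0 if the
    observation was censored. The platform itself uses
    lifelines.KaplanMeierFitter, which also handles plotting and CIs."""
    durations = np.asarray(durations, dtype=float)
    events = np.asarray(events)
    times, survival = [], []
    s = 1.0
    for t in np.unique(durations[events == 1]):
        at_risk = np.sum(durations >= t)              # still "surviving" at t
        hit = np.sum((durations == t) & (events == 1))
        s *= 1 - hit / at_risk                        # KM product-limit update
        times.append(t)
        survival.append(s)
    return np.array(times), np.array(survival)
```

Running this separately on time-to-profit and time-to-loss data gives the two curves whose median crossing points expose the time-asymmetry described above.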

Test Parameters

All deep testing parameters are user-configurable:

Parameter                 Default    Purpose
Profit target             +5%        Price level considered a "hit" for profit
Loss target               -5%        Price level considered a "hit" for loss
Lookforward window        30 days    How far ahead to check for target hits
Randomisation window      ±10 days   Range for entry date shifting
Randomisation iterations  1000       Number of random entry simulations
Confidence level          95%        Statistical significance threshold
Minimum trades            10         Minimum sample size for valid testing

Macro Factor Modeling

The macro testing module answers a different question from deep testing. Deep testing asks is this strategy's edge real? Macro testing asks can macro variables predict whether a stock will reach its profit target?

Feature Engineering

The module takes macro factor time series — VIX, VVIX, TNX (10-year yield), DXY (dollar index), MOVE (bond volatility), FX rates — and expands each into four features:

for factor in macro_factors:
    features[f'{factor}_level']     = factor_data[factor]
    features[f'{factor}_delta_1d']  = factor_data[factor].diff(1)
    features[f'{factor}_delta_5d']  = factor_data[factor].diff(5)
    features[f'{factor}_zscore_20d'] = (
        (factor_data[factor] - factor_data[factor].rolling(20).mean())
        / factor_data[factor].rolling(20).std()
    )

# All features shifted +1 day to prevent look-ahead bias
features = features.shift(1)

The target variable is binary: 1 if the stock reaches +5% within 30 days, 0 otherwise. This transforms a continuous return prediction problem into a classification problem — which is better suited to the question traders actually care about (will this trade hit my target?) rather than the question academics typically ask (what will the return be?).

The +1 day shift is critical. Without it, today's VIX level would be used to predict today's forward return — but in live trading you'd be making the decision at close, and the return starts from tomorrow. The shift ensures all features are genuinely available at decision time.

Elastic-Net Logistic Regression

The model uses elastic-net regularisation — a combination of L1 (lasso) and L2 (ridge) penalties:

Loss = -y·log(p) - (1-y)·log(1-p) + λ·(0.5·‖w‖₁ + 0.25·‖w‖₂²)

where:
  p = 1 / (1 + exp(-w·x))
  l1_ratio = 0.5 (balanced L1/L2)
  λ = regularisation strength (the inverse of scikit-learn's C)

L1 drives irrelevant factor coefficients to exactly zero (feature selection). L2 prevents overfitting when factors are correlated (which macro factors always are — VIX and MOVE are highly correlated, for example). The combination yields a sparse, interpretable model: the surviving non-zero coefficients are the factors that genuinely matter, and their signs indicate direction (positive coefficient = factor increase predicts higher probability of hitting +5%).
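In scikit-learn terms, this configuration corresponds to roughly the following (the pipeline and the parameter values other than l1_ratio=0.5 are our assumptions; standardising features first matters because the penalty is scale-sensitive):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Elastic-net logistic regression: penalty='elasticnet' requires the
# 'saga' solver; l1_ratio=0.5 balances L1 (sparsity) and L2 (shrinkage).
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty='elasticnet', solver='saga',
                       l1_ratio=0.5, C=1.0, max_iter=5000),
)
```

After fitting, `model[-1].coef_` exposes the coefficients; exact zeros are the factors L1 has discarded, and the signs of the survivors give the direction of each factor's influence.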

Walk-Forward Cross-Validation

Standard k-fold cross-validation is invalid for time series because it allows future data to inform past predictions. The platform uses strict walk-forward validation:

# Initial training window: 3 years (756 trading days)
# Test window: 1 quarter (63 trading days)
# Step: 1 quarter forward

Fold 1: Train 2014-01 to 2016-12, Test 2017-Q1
Fold 2: Train 2014-01 to 2017-03, Test 2017-Q2
Fold 3: Train 2014-01 to 2017-06, Test 2017-Q3
...

Each fold produces AUC, Brier score, precision, recall, and F1. The pattern across folds reveals whether the model's predictive power is stable or decaying — a common sign of regime change.
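A fold generator matching the schedule above might look like this sketch (function name ours):

```python
def walk_forward_folds(n_samples, initial_train=756, test_size=63):
    """Yield (train_idx, test_idx) ranges for expanding-window walk-forward
    validation: training always starts at sample 0 and grows by one test
    window (one quarter) per fold, so no test sample ever precedes training."""
    start = initial_train
    while start + test_size <= n_samples:
        yield range(0, start), range(start, start + test_size)
        start += test_size
```

Each yielded pair feeds one fit/evaluate cycle; collecting the per-fold AUC and Brier scores gives the stability-over-time picture described above.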

Ablation Study

After the walk-forward validation, the platform automatically runs an ablation study: systematically removing each factor and measuring the performance drop.

for factor in macro_factors:
    # Remove this factor's features (level, delta_1d, delta_5d, zscore_20d)
    reduced_features = all_features.drop(columns=factor_columns)

    # Re-run walk-forward validation without this factor
    reduced_metrics = walk_forward_cv(reduced_features, target)

    # Measure degradation
    auc_drop = full_model_auc - reduced_metrics['auc']
    brier_increase = reduced_metrics['brier'] - full_model_brier

    importance[factor] = {'auc_drop': auc_drop, 'brier_increase': brier_increase}

A large AUC drop when a factor is removed means that factor is load-bearing — the model genuinely relies on it. A near-zero drop means the factor is redundant (its information is already captured by other factors) or irrelevant. This prevents the common mistake of including VIX, VVIX, and MOVE in the same model and concluding all three matter — the ablation study reveals which one is actually doing the work.

Regime Analysis

The module detects market regimes (bull/bear based on momentum, high/low vol based on VIX quintiles) and evaluates whether the model's predictive power varies across regimes. A model that works in low-vol bull markets but fails during stress periods is useful to know about before deploying capital.


Multi-Asset Testing

1-to-N Backtesting

Run a single strategy across multiple symbols in one request (up to 10 symbols per run). The platform supports theme-based symbol selection — grouping by sector, market cap, or custom watchlists — and produces a comparative metrics table across all symbols.

This reveals whether a strategy generalises across assets or is specific to one stock's idiosyncratic dynamics. A momentum strategy that works on AAPL, MSFT, NVDA, and GOOGL but fails on XOM, JNJ, and PG is capturing tech-sector momentum, not a universal pattern.

Relative Performance

Ratio-based backtesting between two symbols. The engine constructs a price ratio series (Stock A / Stock B), applies the strategy to the ratio, and evaluates whether the strategy can time the relative outperformance:

ratio_series = (stock_a['Close'] / stock_b['Close']).to_frame('Close')
trades = run_strategy(ratio_series)  # entry when the ratio is undervalued

This is a pairs trading framework: long the underperformer, short the outperformer, and profit from mean reversion in the ratio. The backtest tracks whether the strategy's entries coincide with ratio inflection points.

Strategy Comparison

Run up to 3 strategies simultaneously on the same symbol and date range. The platform produces overlaid equity curves, comparative metrics tables, and identifies which strategy performs best under different market conditions. This is the simplest way to answer "is my new strategy actually better than my old one, or did I just test it on a better period?"


Visual Analytics

The platform generates a suite of charts designed to expose variable-level impact on trade outcomes. The philosophy is that aggregate metrics (Sharpe, win rate) tell you what happened but not why. The visual analytics layer answers why by decomposing results across every factor the strategy uses.

Entry Factor Scatter Plots

The most revealing visualisation in the platform. For each entry condition the strategy records, the engine produces a scatter plot with trade return on the X-axis and factor value on the Y-axis. Each point is a single trade, colour-coded by outcome — darker for winners, brighter for losers. When multiple symbols are tested via 1-to-N backtesting, each stock gets its own colour, producing a multi-symbol overlay that reveals whether factor behaviour is consistent across assets.

Seven factor dimensions are plotted simultaneously:

  • KC/BB Ratio: Keltner Channel width / Bollinger Band width (volatility compression). Useful pattern: winners cluster at high ratios (above 0.8), losers spread across all values.
  • RV exit - RV entry: change in realised volatility during the trade. Useful pattern: winners cluster where vol decreased (entered during high vol, exited during low vol).
  • RV entry: 30-day annualised realised volatility at entry. Useful pattern: visible regime bands; strategies may only work in specific vol environments.
  • MACD Histogram: momentum strength at entry. Useful pattern: winners cluster at moderately negative values (momentum about to turn).
  • MACD Trend: day-over-day change in MACD histogram. Useful pattern: winners cluster at positive trend (histogram improving), signalling a momentum inflection.
  • RSI Value: 14-period Relative Strength Index at entry. Useful pattern: winners concentrated in the 40-60 range (not extended), losers at the extremes.
  • RSI Trend: day-over-day change in RSI. Useful pattern: winners at positive RSI trend (strength improving from the neutral zone).

The value of these plots is in the absence of pattern as much as its presence. If winners and losers are uniformly mixed across a factor's range, that factor isn't discriminating — the strategy's entry condition for that variable isn't adding value. This is faster and more intuitive than running regressions, because a human eye immediately spots clustering, banding, and outlier-driven results that statistical summaries can mask.

ROC Curves

Each deep test produces a Receiver Operating Characteristic curve plotting the true positive rate against the false positive rate across all possible entry score thresholds. The diagonal represents a random classifier (AUC = 0.5). The strategy's curve bows toward the top-left corner in proportion to its discriminative ability.

The AUC value is displayed in the chart title. An AUC of 0.65 means the strategy's entry scoring correctly ranks a random winner above a random loser 65% of the time — modest but non-trivial. Below 0.55 suggests the entry scoring is essentially noise.

Equity and Drawdown Curves

Standard but well-executed. Multiple strategies are overlaid on a single time-axis chart for direct comparison. Drawdown charts use filled area (red, 10% opacity) beneath the drawdown line to make the severity of drawdown periods visually proportional to their depth and duration.

Both chart types support time-axis zoom, legend toggling per strategy, and hover tooltips showing exact values at each date.

Macro Factor Visualisation

The ablation study renders as a colour-coded table where rows with large AUC drops (important factors) are highlighted in red, and rows with near-zero drops (irrelevant factors) are green. This immediately surfaces which factors are load-bearing in the model.

A ranked badge display shows the top 3 most important factors in red, orange, and purple — a quick reference for which macro variables deserve monitoring attention.

Coefficient analysis splits features into positive and negative contributors, showing both the direction and magnitude of each factor's influence on the prediction. This answers the question: "if VIX rises by 1 z-score, does that make a +5% hit more or less likely, and by how much?"

Volatility Decomposition Charts

Three time-series charts for index volatility analysis:

  1. Index realised volatility — 30-day rolling vol with filled area, showing volatility regime changes over time
  2. Realised dispersion — constituent variance minus index variance, identifying periods where stock-picking has structural advantage
  3. Constituent volatility overlay — top 10 constituents plotted simultaneously, revealing which names drive index vol and which are idiosyncratically quiet

Performance Metrics

Metric         Formula                                          What it reveals
CAGR           (final / initial)^(1/years) - 1                  Annualised compounding rate
Sharpe Ratio   (mean daily return / std) * sqrt(252)            Return per unit of total risk
Sortino Ratio  (mean daily return / downside std) * sqrt(252)   Return per unit of downside risk only
Calmar Ratio   CAGR / abs(max drawdown)                         Return per unit of worst-case loss
Max Drawdown   min((equity - peak) / peak)                      Worst peak-to-trough decline
VaR (95%)      5th percentile of daily returns                  Worst expected daily loss at 95% confidence
CVaR (95%)     mean(returns where return < VaR)                 Expected loss in the worst 5% of days
Skewness       Third moment of returns                          Positive = right tail heavier (desirable)
Kurtosis       Fourth moment of returns                         Higher = fatter tails (more extreme events)

VaR and CVaR are worth highlighting. Most retail backtesting tools stop at Sharpe and max drawdown. VaR tells you the daily loss you should expect 1 in 20 days. CVaR (Conditional Value at Risk, also called Expected Shortfall) tells you the average loss on those worst days — it captures tail risk that VaR alone misses. A strategy with good Sharpe but poor CVaR is one bad week away from a drawdown the Sharpe ratio didn't warn you about.
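Historical VaR and CVaR reduce to a quantile and a conditional mean over daily returns; a sketch:

```python
import numpy as np


def var_cvar(daily_returns, level=0.95):
    """Historical VaR and CVaR at the given confidence level.
    VaR is the (1 - level) quantile of daily returns; CVaR is the
    mean of the returns at or below that quantile (Expected Shortfall)."""
    r = np.asarray(daily_returns, dtype=float)
    var = np.percentile(r, (1 - level) * 100)
    cvar = r[r <= var].mean()
    return var, cvar
```

Note this is the purely historical (non-parametric) variant; a parametric VaR would instead fit a distribution to the returns.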


Volatility Testing

The volatility module analyses index constituent behaviour for sector indices (SMH, XLE, XLK). For each index, it calculates:

  • Constituent-level realised volatility: 30-day rolling vol for each stock in the index
  • Realised dispersion: Average constituent variance minus index-level variance — this measures how much the constituents are moving independently vs. together
  • Regime detection: Classifying volatility environments for conditional analysis

High dispersion means individual stocks are diverging from the index — a favourable environment for stock-picking. Low dispersion means stocks are moving in lockstep — a difficult environment for any strategy that relies on idiosyncratic moves.
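The dispersion measure can be sketched as follows, assuming per-stock and index daily returns are already aligned (names ours):

```python
import numpy as np
import pandas as pd


def realised_dispersion(constituent_returns: pd.DataFrame,
                        index_returns: pd.Series, window=30):
    """Average constituent variance minus index variance over a rolling
    window. High values mean constituents are moving independently of
    one another, the favourable regime for stock-picking noted above."""
    constituent_var = constituent_returns.rolling(window).var().mean(axis=1)
    index_var = index_returns.rolling(window).var()
    return constituent_var - index_var
```

Because diversification makes index variance at most the average constituent variance, the measure is non-negative in expectation and spikes when correlations break down.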


Limitations

Strategy execution: Strategies run via exec() with restricted globals. This is adequate for a research tool but wouldn't be appropriate for a multi-tenant production system where untrusted code execution is a security concern.

Data storage: CSV files on the local filesystem. No database, no concurrent access control. Works well for single-user research but would need a proper data layer for team use.

Position sizing: The equity curve simulation is a simple multiplicative model (equity × sell/buy ratio). There's no explicit contract sizing, margin modeling, or partial fills. The focus is on signal validation rather than execution simulation.

Macro factor model: Elastic-net logistic regression is interpretable but linear. Non-linear interactions between factors (e.g., the combination of high VIX and rising yields being predictive when neither alone is) would require tree-based models or neural networks at the cost of interpretability.

Survival analysis: Kaplan-Meier is non-parametric — it makes no assumptions about the distribution of time-to-event. This is conservative but means we can't extrapolate beyond the observed data range.


Tech Stack

Layer            Technology
Frontend         React 19, Chart.js 4.5, chroma-js
Backend          FastAPI (ASGI), Python 3.x
Data processing  Pandas, NumPy
ML / Statistics  scikit-learn (Elastic-Net, metrics), SciPy, lifelines (Kaplan-Meier)
Market data      yfinance (optional download scripts)
Storage          Local CSV (OHLCV data), JSON (test results), Python files (strategies)
Deployment       React dev server (port 3000) + FastAPI (port 8000), CORS-enabled

Reproducing This

The platform is not open-sourced, but the architecture is reproducible:

  1. Backtesting engine: run_strategy() interface → equity curve via multiplicative compounding → standard metrics (Sharpe, Sortino, Calmar, VaR, CVaR)
  2. Hit-rate analysis: For each entry, check if highs/lows within N-day window breach profit/loss targets
  3. Skill testing: Use scikit-learn's roc_auc_score and brier_score_loss on strategy entry condition scores vs. trade outcomes
  4. Randomised entry testing: Shift entries ±N days, re-simulate 1000 times, compute p-value
  5. Survival analysis: lifelines.KaplanMeierFitter on time-to-target-hit data
  6. Macro factor modeling: Elastic-net logistic regression with walk-forward CV, +1 day feature shift, automatic ablation study
  7. Volatility testing: 30-day rolling constituent vol, realised dispersion = avg constituent variance - index variance

The novel contribution is the four-layer validation framework. Any single test can be fooled. Hit rates can look good by chance. Skill tests can overfit. Randomised entries might miss timing-dependent edges. Survival curves can be skewed by outliers. But a strategy that passes all four has been stress-tested from independent angles — making it substantially more likely to represent a genuine, deployable edge.


Disclaimer: Altus Labs is not authorised or regulated by the Financial Conduct Authority (FCA). Altus Labs is a research publication and this content is provided for informational and educational purposes only. It does not constitute investment advice, a financial promotion, or an invitation to engage in investment activity. See our full disclaimer for more information.