Guide

Out-of-sample testing explained: the only backtest number that matters


If you only ever check one number before trading a strategy, make it this one. Out-of-sample testing is the closest thing backtesting has to an honesty check — it's the only standard technique that actually simulates not knowing the future.

Quick answer

Out-of-sample testing means building and tuning your strategy on one part of your historical data, then testing it, unchanged, on a separate part it never saw during development. If the strategy performs well on data it wasn't fitted to, that's real evidence of an edge. If it doesn't, the in-sample result was likely overfitting.

Why in-sample results tell you almost nothing on their own

When you build a strategy, you naturally tune it against the data in front of you — adjusting parameters, adding filters, trying variations until the equity curve looks good. That process, called in-sample development, will always eventually produce something that looks profitable on that specific data. The question out-of-sample testing answers is simple but essential: does the thing you built actually work, or does it just fit the past?
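To see why tuning will always find something, here is a toy sketch using only the Python standard library. It is an illustration under stated assumptions, not a real backtest: the "market" is pure random noise, each "candidate strategy" is a random long/short position per day standing in for a parameter variation, and the function name and defaults are made up for this example.

```python
import random
import statistics


def best_of_many_on_noise(n_candidates=200, n_days=500, oos_fraction=0.3, seed=0):
    """Pick the best in-sample candidate on pure noise and report how it did.

    Returns (in_sample_mean_return, out_of_sample_mean_return) for the
    candidate with the highest in-sample mean daily return.
    """
    rng = random.Random(seed)
    # Pure-noise daily returns: by construction there is no edge to find.
    market = [rng.gauss(0.0, 0.01) for _ in range(n_days)]
    # Hold back the most recent oos_fraction of days.
    cut = n_days - round(n_days * oos_fraction)

    best = None
    for _ in range(n_candidates):
        # A "strategy" here is just a random +1/-1 position each day.
        positions = [rng.choice((-1, 1)) for _ in range(n_days)]
        pnl = [p * r for p, r in zip(positions, market)]
        in_sample_mean = statistics.mean(pnl[:cut])
        out_of_sample_mean = statistics.mean(pnl[cut:])
        # Selection uses the in-sample number only, as tuning does.
        if best is None or in_sample_mean > best[0]:
            best = (in_sample_mean, out_of_sample_mean)
    return best


in_sample_mean, out_of_sample_mean = best_of_many_on_noise()
print(f"best in-sample mean daily return: {in_sample_mean:.5f}")
print(f"same candidate out-of-sample:     {out_of_sample_mean:.5f}")
```

The winner's in-sample number looks like an edge only because it was selected from many tries. Its out-of-sample number was not part of the selection, so it is just another draw from noise.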

How to set up an out-of-sample split

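Divide your historical data into two chronological pieces. (A three-way split adds a validation slice between them for comparing candidate versions, but the two below are the essential ones.)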

In-sample: the earlier portion of your history. Build, tune, and optimise your strategy here — as much as you like.
Out-of-sample: the later portion, held back and never touched during development. Run your finished strategy on it exactly once, unchanged.

A common split reserves the most recent 20-30% of history as out-of-sample, but the exact ratio matters less than the discipline of never tuning on it. The moment you adjust a parameter because the out-of-sample result looked disappointing, that data stops being out-of-sample — you've just folded it into development.
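The split itself can be sketched in a few lines. This is a minimal example assuming your history is already a list (or similar sequence) sorted oldest-first; the function name and the 30% default are illustrative choices, not a standard.

```python
from typing import Sequence, Tuple, TypeVar

T = TypeVar("T")


def chronological_split(
    rows: Sequence[T], oos_fraction: float = 0.3
) -> Tuple[Sequence[T], Sequence[T]]:
    """Split time-ordered rows into (in_sample, out_of_sample).

    rows must be sorted oldest-first. The most recent oos_fraction
    of them is held back as out-of-sample; nothing is shuffled.
    """
    if not 0.0 < oos_fraction < 1.0:
        raise ValueError("oos_fraction must be strictly between 0 and 1")
    n_oos = round(len(rows) * oos_fraction)
    cut = len(rows) - n_oos
    if cut <= 0 or n_oos <= 0:
        raise ValueError("not enough rows to form both pieces")
    return rows[:cut], rows[cut:]


# Example with ten placeholder bars, numbered oldest to newest.
in_sample, out_of_sample = chronological_split(list(range(10)), oos_fraction=0.3)
```

Decide on `oos_fraction` once, before running anything, and leave it alone afterwards. The code cannot enforce that part.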

What a good out-of-sample result looks like

The out-of-sample result doesn't need to match the in-sample result. Some degradation is normal and expected, because in-sample performance is always somewhat inflated by the tuning process. What you want to see is that the edge is still positive and still believable on unseen data: a Sharpe ratio that has shrunk rather than vanished, a sensible number of trades, and no collapse into losses.

A large gap — strong in-sample, weak or negative out-of-sample — is the clearest single sign of overfitting there is.
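One way to put a number on that gap is to compare the Sharpe ratio on each side of the split. The sketch below assumes per-period (for example daily) returns, a risk-free rate of zero, and 252 periods per year. The `sharpe_retention` helper is a hypothetical name for this guide, and there is no universally agreed threshold for what counts as "too much" degradation.

```python
import math
import statistics
from typing import Sequence


def sharpe_ratio(returns: Sequence[float], periods_per_year: int = 252) -> float:
    """Annualised Sharpe ratio of per-period returns (risk-free rate taken as 0)."""
    if len(returns) < 2:
        raise ValueError("need at least two returns")
    spread = statistics.stdev(returns)  # sample standard deviation
    if spread == 0:
        raise ValueError("returns have zero variance")
    return statistics.mean(returns) / spread * math.sqrt(periods_per_year)


def sharpe_retention(
    in_sample: Sequence[float],
    out_of_sample: Sequence[float],
    periods_per_year: int = 252,
) -> float:
    """Fraction of the in-sample Sharpe ratio that survives out-of-sample.

    Near 1.0 means little degradation; near 0 or negative means the
    in-sample edge did not carry over.
    """
    is_sharpe = sharpe_ratio(in_sample, periods_per_year)
    if is_sharpe <= 0:
        raise ValueError("in-sample Sharpe is not positive; nothing to retain")
    return sharpe_ratio(out_of_sample, periods_per_year) / is_sharpe
```

Read the ratio alongside the trade count. A healthy-looking figure built on a handful of out-of-sample trades is weak evidence.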

Common ways people quietly cheat their own out-of-sample test

Peeking, then "just checking"

Looking at the out-of-sample result, tweaking a parameter, and testing again turns your out-of-sample data into in-sample data by the second attempt. The whole test only works if the held-back data is touched exactly once.

Choosing the split after seeing the results

Picking where the out-of-sample period starts based on which split makes the strategy look best is the same problem in a different disguise. Fix the split before you look at results from either side.

Walk-forward without discipline

Rolling the in-sample/out-of-sample window forward through time (walk-forward testing) is a legitimate and often better approach — but only if each out-of-sample window is still genuinely untouched before it's tested.
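The rolling windows can be generated mechanically. This is a minimal sketch of one common variant (fixed-size training window, stepping forward by one test window each time); the function name and parameters are assumptions for illustration, and other schemes such as an expanding training window are equally valid.

```python
from typing import Iterator, Tuple


def walk_forward_windows(
    n_rows: int, train_size: int, test_size: int
) -> Iterator[Tuple[range, range]]:
    """Yield (train_indices, test_indices) pairs rolling forward in time.

    Each test window starts exactly where its training window ends,
    and successive test windows do not overlap each other.
    """
    if train_size <= 0 or test_size <= 0:
        raise ValueError("train_size and test_size must be positive")
    start = 0
    while start + train_size + test_size <= n_rows:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        # Step forward by one test window so each row is tested at most once.
        start += test_size


for train, test in walk_forward_windows(n_rows=10, train_size=4, test_size=2):
    print(f"tune on rows {train.start}-{train.stop - 1}, "
          f"test once on rows {test.start}-{test.stop - 1}")
```

Rows from an earlier test window do become training data for later folds. That is fine, because they were untouched at the moment they were tested. What breaks the method is going back and re-tuning after seeing any fold's test result.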

Out-of-sample testing doesn't replace the other checks

A strategy can pass an out-of-sample test and still fail elsewhere: it may ignore trading costs, be the best of many parameter searches, or rest on too few trades. Treat out-of-sample performance as necessary, not sufficient: the single most important checkpoint, among several that all matter.

Or have it enforced automatically, every run

The Honest Backtest Engine headlines out-of-sample results by default, with the split fixed at 70/30 by data length — not editable, so it can't be gamed after the fact. You see the honest number every time, without having to trust your own discipline to hold the line.


Frequently asked questions

What is out-of-sample testing in trading?

It means testing a strategy on historical data it was never tuned or optimised against, to check whether its edge holds up on data it hasn't seen. It's the closest backtesting comes to simulating genuine future performance.

How much data should be out-of-sample?

A common approach reserves the most recent 20-30% of history as out-of-sample. The exact ratio matters less than strictly never tuning the strategy on that portion before testing it.

What if my strategy performs worse out-of-sample?

Some degradation from in-sample to out-of-sample is normal. A large gap, or a result that turns negative, is the clearest sign the in-sample performance was overfitting rather than a real edge.

Is walk-forward testing better than a single out-of-sample split?

It can be, since it tests the strategy across multiple time periods rather than one. It only works correctly if each out-of-sample window remains genuinely untouched before being tested, which is the same discipline a single split requires.