What Is Point-in-Time Correctness? Why No-Look-Ahead Makes or Breaks a Backtest
By TickDistill — order-flow microstructure signals. Educational content, not financial advice.
The short answer
Point-in-time correctness is the guarantee that every computation at time t uses only data from strictly before t. Violating this constraint — accidentally or structurally — is called look-ahead bias, and it is the most common way order-flow research produces backtest results that cannot be reproduced in live trading. TickDistill treats point-in-time correctness as a hard engineering invariant: every baseline, every normalization, every mask is causal by construction, and every backtest result can be independently reproduced from the same inputs.
What does "point-in-time correct" mean, exactly?
Point-in-time correctness means that the value of any signal emitted at timestamp t is a deterministic function of data with timestamps t' < t only. No observation from t' ≥ t enters the computation — not the current bucket, not a future bucket, not in the normalization denominator, not in the exclusion mask calibration.
The strict inequality matters. Including the current observation (t' ≤ t instead of t' < t) is still a form of contamination: the baseline that normalizes a measurement must not include that same measurement, or the z-score becomes self-referential.
Why look-ahead bias is so easy to introduce by accident
Look-ahead bias does not require deliberate cheating. It emerges from common implementation shortcuts:
| Source | How it happens |
|---|---|
| In-sample normalization | Computing the mean/std over the entire history, then using it to normalize each historical point |
| Rolling window off-by-one | Calling pandas Series.rolling(N).mean() without a prior shift(1): the default window includes the current row |
| Global volatility estimate | Using the full-period σ as the denominator for a z-score computed at each past point |
| Classifier training | Training a trade-side classifier on the same period you backtest the signal |
| Mask calibration | Identifying "noisy" windows after the fact and masking them retroactively |
Each of these makes a past computation depend on future information. The backtest looks cleaner than it is; live performance does not benefit from knowledge of the future.
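The rolling-window off-by-one is the easiest of these to see in code. The sketch below uses a made-up five-point series and a window of N = 3 (both chosen only for illustration) to compare the default pandas window with a shifted one.

```python
import pandas as pd

# Toy series: the last value is an outlier we want to score against its past.
x = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
N = 3  # illustrative window length

# Default trailing window: the row at position i is part of its own window.
leaky = x.rolling(N).mean()

# Shift by one first, so the window at position i covers i-N .. i-1 only.
causal = x.shift(1).rolling(N).mean()

print(leaky.iloc[4])   # mean of 3, 4, 100: the outlier inflates its own baseline
print(causal.iloc[4])  # mean of 2, 3, 4 = 3.0: the outlier is not in its baseline
```

The shifted version also returns NaN for the first N rows, which is the honest answer: there is not yet enough past data to form the baseline.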
The causal baseline: t' < t strictly
A causal baseline is a rolling statistic — mean, standard deviation, or exponentially weighted equivalent — computed at each point using only the observations that were available at that point in history.
The public z-score formula is:
z_t = ( x_t − μ_t ) / σ_t
where μ_t and σ_t are estimated from { x_{t'} : t' < t } exclusively. This is standard practice for normalizing order-flow quantities against a causal baseline (Easley, López de Prado, O'Hara 2012).
Two choices of baseline are common in practice:
| Baseline type | Formula | Property |
|---|---|---|
| Rolling window of N observations | μ_t = (1/N) Σ_{i=t-N}^{t-1} x_i | Equal weight, sharp cutoff |
| Exponentially weighted (EWM) | μ_t = (1−λ) Σ_{k=0}^{∞} λ^k x_{t-1-k} | Smooth decay, infinite memory |
The decay parameter λ corresponds to a half-life h via λ = exp(−ln2/h). A longer half-life makes the baseline more stable across regime changes; a shorter half-life makes it more adaptive. The calibration of this parameter is proprietary — what matters for correctness is that whichever estimator is used, it uses t' < t only. TickDistill uses a causal EWM baseline, and the current observation never enters the estimate that normalizes it.
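A minimal sketch of a causal EWM z-score follows. It is not TickDistill's implementation: the class name, the choice to seed the mean with the first observation, and the incremental variance recursion are assumptions made for illustration. The one property it is meant to show is the ordering, score first and update afterwards, which keeps x_t out of its own baseline.

```python
import math

class CausalEwmBaseline:
    """Hypothetical sketch: EWM mean/std where scoring precedes updating."""

    def __init__(self, half_life: float):
        # Decay from half-life, as in the text: lambda = exp(-ln2 / h).
        self.lam = math.exp(-math.log(2.0) / half_life)
        self.mean = None   # no estimate until the first observation arrives
        self.var = 0.0

    def score(self, x: float):
        """z-score of x against the baseline built from earlier values only."""
        if self.mean is None or self.var <= 0.0:
            return None    # baseline not defined yet
        return (x - self.mean) / math.sqrt(self.var)

    def update(self, x: float) -> None:
        """Fold x into the baseline; it affects scores from the next step on."""
        if self.mean is None:
            self.mean = x  # assumption: seed the mean with the first value
            return
        alpha = 1.0 - self.lam
        delta = x - self.mean
        self.mean += alpha * delta
        # Standard incremental exponentially weighted variance.
        self.var = self.lam * (self.var + alpha * delta * delta)

def causal_zscores(xs, half_life):
    baseline = CausalEwmBaseline(half_life)
    out = []
    for x in xs:
        out.append(baseline.score(x))  # uses t' < t only
        baseline.update(x)             # x_t becomes visible at t+1
    return out
```

Swapping the two lines inside the loop would turn this into the t' ≤ t variant that the strict inequality rules out.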
Mechanical windows: why some events must be excluded from the baseline
Even a perfectly causal baseline can be distorted by recurring mechanical events — moments when volume or imbalance is large for structural reasons rather than informational ones.
A clear example is the perpetual futures funding settlement at 00:00, 08:00, and 16:00 UTC (public exchange schedule, Binance and most major venues). At these moments, a funding payment causes predictable positioning activity that is unrelated to informed order flow. Including funding spikes in the baseline causes the baseline σ to inflate, which then suppresses the z-score of genuine order-flow events in surrounding windows.
The solution is an exclusion mask: data within a mechanical window is excluded from updating the baseline. The mask is applied causally — it defines which observations are allowed to enter the rolling statistic. Observations inside the mask are not deleted; the signal may still be computed over them, but the baseline parameters are not updated from them.
μ_t = EWM over { x_{t'} : t' < t AND t' ∉ mask }
σ_t = EWM-std over the same filtered set
Which windows to mask, and at what granularity, is a calibration decision that depends on the instrument, the signal, and the empirical effect of the mechanical event on the signal's distribution. The general principle — exclude mechanical events from the normalization baseline — is textbook practice; the specific calendar is proprietary.
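As an illustration of the masking principle, the sketch below freezes an EWM mean while a timestamp falls near a funding settlement. The funding hours come from the public schedule mentioned above; the five-minute half-width and the function names are hypothetical and are not the proprietary calendar.

```python
from datetime import datetime, timedelta, timezone

FUNDING_HOURS_UTC = (0, 8, 16)          # public schedule cited in the text
MASK_HALF_WIDTH = timedelta(minutes=5)  # hypothetical width, illustration only

def in_funding_mask(ts: datetime) -> bool:
    """True if a timezone-aware ts is within MASK_HALF_WIDTH of a settlement."""
    ts = ts.astimezone(timezone.utc)
    day = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    settlements = [day + timedelta(hours=h) for h in FUNDING_HOURS_UTC]
    settlements.append(day + timedelta(days=1))  # so 23:58 matches next 00:00
    return any(abs(ts - s) <= MASK_HALF_WIDTH for s in settlements)

def masked_mean_updates(stream, lam: float):
    """Yield (ts, x, baseline mean before x); masked points never update it."""
    mean = None
    for ts, x in stream:
        yield ts, x, mean    # the signal can still be computed inside the mask
        if in_funding_mask(ts):
            continue         # baseline state stays frozen for masked points
        mean = x if mean is None else lam * mean + (1.0 - lam) * x
```

Because the mask depends only on the timestamp and a fixed schedule, deciding whether a point is masked never requires looking at the data itself.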
Warm-up periods: when a causal baseline is not yet reliable
A rolling or exponentially weighted estimator requires a minimum number of observations before its estimates are stable. Emitting z-scores before the warm-up completes produces values with high estimation error, which corrupt any downstream comparison.
TickDistill enforces two distinct warm-up criteria before emitting any signal value:
- Signal window warm-up. A signal that is itself a rolling statistic (e.g., VPIN, a moving imbalance) requires its own window to be filled before it produces a meaningful value.
- Baseline warm-up. The causal baseline (μ_t, σ_t) requires a sufficient number of non-masked observations before its estimates stabilize. For an EWM baseline with half-life h, stability is reached after approximately 5h observations, the point at which the weight of the initialization has decayed to roughly 3% (2^−5 ≈ 3.1%).
No signal point is emitted until both criteria are satisfied. A missing warm-up is equivalent to a form of look-ahead: the estimator behaves as if it has more historical information than it does.
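The two warm-up criteria can be sketched as a small gate. The class and parameter names are hypothetical, and counting masked observations toward the signal window (but not toward the baseline) is an assumption made for this example. The helper function reproduces the arithmetic behind the 5h rule: with λ = exp(−ln2/h), the initialization weight after 5h observations is λ^(5h) = 2^(−5) = 1/32.

```python
import math

def initialization_weight(n_obs: int, half_life: float) -> float:
    """Weight still carried by the initial state after n_obs EWM updates."""
    lam = math.exp(-math.log(2.0) / half_life)
    return lam ** n_obs

class WarmupGate:
    """Hypothetical gate: emit only once both warm-up criteria are met."""

    def __init__(self, signal_window: int, half_life: float, multiples: float = 5.0):
        self.signal_window = signal_window
        # Baseline needs about `multiples` half-lives of non-masked data.
        self.baseline_needed = math.ceil(multiples * half_life)
        self.seen = 0       # all observations (fills the signal window)
        self.unmasked = 0   # only observations allowed into the baseline

    def observe(self, masked: bool) -> None:
        self.seen += 1
        if not masked:
            self.unmasked += 1

    def ready(self) -> bool:
        return (self.seen >= self.signal_window
                and self.unmasked >= self.baseline_needed)

print(initialization_weight(50, 10.0))  # about 1/32 for h = 10, n = 5h
```

The same gate applies unchanged to live and historical streams, since it only counts what has already been observed.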
Anti-look-ahead: the test that verifies the guarantee (Test 5)
The claim of point-in-time correctness is verifiable. The test is direct: compute signal values over a stream, then modify trades at timestamps t' > t, and confirm that the signal value at t is identical.
Formally, for any t and any perturbation of { x_{t'} : t' > t }:
signal(t | history up to t) = signal(t | history up to t, perturbed future)
If this equality fails, the computation has a look-ahead dependency. This test is mandatory in TickDistill's test suite and covers every path: the signal window, the baseline estimator, the mask exclusion, and the BVC price-change estimator σ_dP (which uses its own causal window over past price differences between sub-bars, never the full sample).
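A minimal version of this perturbation test is sketched below. The two signal functions are hypothetical stand-ins (an expanding-window causal z-score and a full-sample z-score), not TickDistill signals; they exist only to show that the check passes for a causal computation and fails for a leaky one.

```python
import random
import statistics

def causal_z(xs):
    """z_t from the mean/std of xs[:t] only (expanding window)."""
    out = []
    for t, x in enumerate(xs):
        past = xs[:t]
        if len(past) < 2 or statistics.pstdev(past) == 0.0:
            out.append(None)  # not enough history yet
        else:
            out.append((x - statistics.fmean(past)) / statistics.pstdev(past))
    return out

def leaky_z(xs):
    """Full-sample normalization: every point sees the whole history."""
    mu, sd = statistics.fmean(xs), statistics.pstdev(xs)
    return [(x - mu) / sd for x in xs]

def has_lookahead(signal_fn, xs, t, seed=0):
    """Perturb every observation after index t; compare values up to t."""
    rng = random.Random(seed)
    perturbed = list(xs)
    for i in range(t + 1, len(xs)):
        perturbed[i] = xs[i] + rng.uniform(1.0, 10.0)
    # Exact equality on purpose: the text requires identical values.
    return signal_fn(xs)[: t + 1] != signal_fn(perturbed)[: t + 1]

xs = [1.0, 2.0, 4.0, 3.0, 5.0, 2.5, 6.0]
print(has_lookahead(causal_z, xs, t=3))  # False: future edits are invisible
print(has_lookahead(leaky_z, xs, t=3))   # True: full-sample stats leak
```

In practice the same harness would be run over many values of t and several perturbation seeds, since a single passing case does not prove the absence of a dependency.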
Reproducibility: why point-in-time correctness enables version-pinned backtests
Point-in-time correctness is a prerequisite for reproducibility. A backtest result from a point-in-time-correct pipeline is a deterministic function of four inputs: (signal, params, range, version) — because each signal is itself a pure parametric function f(primitive, params). Given the same four inputs, the same output must emerge, regardless of when the query runs.
This enables two capabilities:
- Permalink/content-hash. Every backtest result can be identified by a hash of its inputs. The result is shareable and reproducible indefinitely.
- Version pinning. When a signal formula is updated (v1 → v2), backtest queries pinned to v1 continue to reproduce the v1 result exactly. Code and data definitions are frozen together.
Neither capability is possible if the computation is contaminated by look-ahead, because future data would make the output depend on when the query runs, not only on the declared inputs. See also What makes a backtest reproducible? Permalinks and version pinning.
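One way to derive such an identifier is to hash a canonical serialization of the four inputs. The sketch below assumes JSON-serializable parameters and uses SHA-256; the function name and field layout are illustrative and not a description of TickDistill's permalink format.

```python
import hashlib
import json

def backtest_key(signal: str, params: dict, date_range: tuple, version: str) -> str:
    """Content hash of the declared inputs (signal, params, range, version)."""
    payload = {
        "signal": signal,
        "params": params,
        "range": list(date_range),
        "version": version,
    }
    # sort_keys and fixed separators make the serialization order-independent.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

k1 = backtest_key("vpin", {"buckets": 50, "half_life": 20}, ("2024-01-01", "2024-03-31"), "v1")
k2 = backtest_key("vpin", {"half_life": 20, "buckets": 50}, ("2024-01-01", "2024-03-31"), "v1")
k3 = backtest_key("vpin", {"buckets": 50, "half_life": 20}, ("2024-01-01", "2024-03-31"), "v2")
print(k1 == k2)  # True: parameter order does not matter
print(k1 == k3)  # False: a new version yields a new identifier
```

The key is only meaningful because the computation is causal: if the output depended on data outside the declared range, two runs with the same key could disagree.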
How this connects to sigma-normalization and signal quality
Sigma-normalization — expressing a signal in units of standard deviations from its own rolling baseline — is only honest if the baseline is causal. An in-sample standard deviation is not a yardstick; it is a measurement taken with a ruler that was calibrated using the answer.
The practical consequence is that live signals and historically backfilled signals use the same code path: the causal baseline estimator, the same mask, the same warm-up logic. There is no separate "backtest mode" that uses full-sample statistics. The backtest is the same computation run over historical data. See Why Order-Flow Signals Should Be Measured in Standard Deviations.
How the pipeline enforces these guarantees
Three architectural properties enforce point-in-time correctness end-to-end:
- Single-pass streaming. Each day of data is processed in order, one observation at a time. The state at time t is built from the stream up to t; no random access to future records is possible. See Single-pass streaming ETL and discard.
- Immutable daily partitions. Processed outputs are stored as immutable Parquet partitions. Reprocessing a day overwrites its slice cleanly and produces the identical result (idempotence). This is verified by the QA gate. See How We Validate Market Data Before It Becomes a Signal.
- Causal baseline module. The baseline estimator is a shared module used by every signal processor. It carries its own state forward per stream and enforces the t' < t constraint at the interface level, so individual signal implementations cannot accidentally access current or future baseline values.
FAQ
What is look-ahead bias, in one sentence?
Look-ahead bias is the use of information from time t' ≥ t when computing a signal value for time t, causing backtest results to reflect knowledge the strategy could not have had.
Why does an in-sample standard deviation cause look-ahead bias?
An in-sample standard deviation is computed over the entire historical period. Using it to normalize a point in the middle of that period means the denominator includes observations that occurred after that point — information the model would not have had in real time.
What is an exclusion mask and why does it not create look-ahead bias?
An exclusion mask is a set of timestamp intervals whose observations are not allowed to update the rolling baseline. The mask must itself be defined causally — based on a public, fixed event schedule (like exchange funding times), not identified from the data after the fact. A mask derived by examining data is a form of look-ahead; a mask derived from a published schedule is not.
Does warm-up affect live trading or only backtests?
Both. In a live deployment, a signal cannot emit values until its baseline has accumulated the required number of non-masked observations. In a historical backtest, the same warm-up logic applies: signal points are absent from the first segment of history until both the signal window and the baseline window are filled.
How can I verify that a signal is point-in-time correct?
The direct test: compute signal values on a stream, modify observations after a target time t, and confirm the signal value at t is unchanged. Any dependency on future data will cause the value to change. This test must cover the signal formula, the baseline estimator, the mask, and any classifier sub-component that has its own rolling estimate.
TickDistill sells clean, computed order-flow inputs — not trading advice or guaranteed alpha. Backtests are illustrative and not a promise of future results.









