Everything NOVOSKY Is Still Getting Wrong

The Bot Works. Here's Everything That's Still Broken.

NOVOSKY is live. It places real trades on a real account at RoboForex, using real money, 24/7.

And we still have a list of known problems we haven't solved yet.

That's normal for any serious trading system. The dangerous systems are the ones where the builders think everything is working perfectly — because they haven't looked closely enough.

This article is our honest accounting. Every open problem, every ongoing struggle, every "we're working on it" — documented in public, because that's how we stay honest with ourselves and with anyone watching.

The Current Status

Market Regime Detection: Open
Overfitting Management: In Progress
Execution Reality Gap: Open
Broker Data Delays: In Progress
Circuit Breaker Calibration: In Progress
Single-Instrument Risk: Open
Feature Stability Over Time: Open
Static OOS Split: Open
Cent Account Precision: Handled
Timezone OOS Bug: Handled

Let's go through each one.

Problem #1: Market Regime Blindness

This is probably the most fundamental open problem.

Bitcoin moves in distinct regimes — trending up, trending down, and ranging (choppy sideways). A model that works beautifully in a trending regime might be completely useless in a ranging one. And vice versa.

[Figure: Market Regime, Model Performance vs Price Behavior]

Here's the core issue: NOVOSKY doesn't explicitly know which regime it's in.

The ADX regime filter provides a rough check — if ADX is too low, the market is probably ranging and trades get skipped. But ADX is a lagging indicator. It takes time to register a regime change after it's already happened.

The model was trained on data that included all three regimes. But it has no explicit "regime classifier" telling it which mode to be in right now.

What we're building toward: a dedicated LightGBM regime classifier that runs as a separate pre-filter. Before any signal fires, the regime classifier labels the current state — and the signal model only activates in regimes where it historically performed well.

This is non-trivial. Regime boundaries are blurry and contested even for experienced human traders. But it's the most important structural improvement on the roadmap.
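The pre-filter idea can be sketched as a simple gate in front of the signal model. This is a minimal illustration, not NOVOSKY's code: `classify_regime`, `signal_model`, the regime labels, and the `ALLOWED_REGIMES` set are all hypothetical stand-ins for the planned regime classifier and the existing signal model.

```python
from typing import Callable, Optional, Sequence

# Hypothetical: regimes in which the signal model historically performed well.
ALLOWED_REGIMES = frozenset({"trend_up", "trend_down"})


def gated_signal(
    features: Sequence[float],
    classify_regime: Callable[[Sequence[float]], str],
    signal_model: Callable[[Sequence[float]], str],
    allowed: frozenset = ALLOWED_REGIMES,
) -> Optional[str]:
    """Return the model's signal only when the current regime is allowed."""
    regime = classify_regime(features)
    if regime not in allowed:
        return None  # skip the trade; the signal model is never consulted
    return signal_model(features)
```

The design point is that the classifier runs first and can veto the trade entirely, so the signal model never has to learn when it should stay silent.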

Problem #2: Overfitting Is Never Truly Solved

Every time we retrain, overfitting is a risk. And it never fully goes away.

Our 224-day OOS window provides strong protection — but it's not a guarantee. A model can pass OOS validation and still be overfit to the specific conditions of that OOS period. If the market changes significantly after that period, performance can degrade.

This is called model drift — and it's real.

We monitor it by tracking rolling live performance against OOS predictions. When they diverge significantly, it's a retraining signal. But there's always a lag between "the model has started drifting" and "we noticed it in the data."

Formally, drift magnitude is quantified as a z-score against the OOS baseline:

$z_{drift} = \frac{\bar{r}_{live} - \mu_{OOS}}{\sigma_{OOS} / \sqrt{n}}$

where $\bar{r}_{live}$ is the mean return over the past 30 live trades, $\mu_{OOS}$ and $\sigma_{OOS}$ are the mean and standard deviation from the OOS window, and $n$ is the rolling sample size (here $n = 30$). The denominator is the standard error of the live mean. When $|z_{drift}| > 2$, it triggers a retraining review.
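A minimal sketch of the drift check, assuming the z-score divides by the standard error of the mean, $\sigma_{OOS} / \sqrt{n}$. The function names and the example numbers are illustrative, not taken from the NOVOSKY codebase.

```python
import math
from statistics import mean
from typing import Sequence


def drift_z(live_returns: Sequence[float], mu_oos: float, sigma_oos: float) -> float:
    """z-score of the rolling live mean return against the OOS baseline."""
    n = len(live_returns)
    if n == 0 or sigma_oos <= 0:
        raise ValueError("need at least one live trade and a positive OOS sigma")
    standard_error = sigma_oos / math.sqrt(n)
    return (mean(live_returns) - mu_oos) / standard_error


def needs_retraining_review(z: float, threshold: float = 2.0) -> bool:
    """True when |z| exceeds the review threshold described in the text."""
    return abs(z) > threshold
```

For example, 30 live trades averaging 0.0 against an OOS mean of 0.5 and standard deviation of 1.0 give a z-score of roughly -2.74, which would trigger a review.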

The uncomfortable truth

Every model we deploy is a bet that the near future will resemble the recent past. That bet is usually right for weeks or months. It's occasionally catastrophically wrong when markets undergo structural regime shifts — like the 2022 crypto crash, or a sudden macro correlation event.

What we're doing: retraining cycles happen regularly, not just when performance drops. Proactive retraining on recent data is the best defense against drift.

Problem #3: The Execution Reality Gap

Backtests live in a perfect world. Live trading does not.

In a backtest, the model signals BUY and the price is instantly filled at exactly the close of that candle. In real life:

Spread widens during high-volatility candles — sometimes 3–5× the typical spread
Slippage occurs on fast-moving markets — your order fills at a worse price than you requested
Re-quotes happen — the broker's requote system rejects your price and offers a worse one
Latency matters — even 200ms of API round-trip means the market moved before your order hit

Each of these individually is small. Combined, over hundreds of trades, they create a consistent performance drag that makes live results worse than backtest results.

The effective fill price at execution is:

$P_{fill} = P_{signal} + \delta_{spread} + \delta_{slip} + \delta_{latency}$

where $P_{signal}$ is the model's target entry price and each $\delta$ term adds cost in the direction of the trade. Over $N$ trades with average drag $\bar{\delta}$, the total erosion is $N \cdot \bar{\delta}$: small per trade, significant at scale.
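The arithmetic can be made concrete with a short sketch. It assumes costs push the fill up for a BUY and down for a SELL, which is one reading of "in the direction of the trade"; the function names and sample numbers are illustrative only.

```python
def fill_price(p_signal: float, spread: float, slip: float,
               latency: float, side: str = "BUY") -> float:
    """Effective fill: costs raise the price for a BUY and lower it for a SELL."""
    cost = spread + slip + latency
    return p_signal + cost if side == "BUY" else p_signal - cost


def total_erosion(n_trades: int, avg_drag: float) -> float:
    """Cumulative drag over n_trades, in the same price units as avg_drag."""
    return n_trades * avg_drag
```

With an average drag of 1.5 price units per trade, 200 trades give up 300 units in total, even though no single fill looks alarming.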

[Figure: Backtest vs Live, The Performance Gap]

We've partially addressed this by:

Using ATR-based dynamic stop-losses that account for current market volatility
Applying a spread filter — if the spread exceeds a threshold, no entry fires
Measuring broker latency at startup (broker_lat_ms) and logging it each heartbeat

But we haven't solved execution drag. We've managed it. There's a difference.
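Two of the mitigations listed above, the spread filter and the ATR-based stop, reduce to a few lines each. This is a hedged sketch: `max_spread` and `atr_mult` are hypothetical parameters, and the real thresholds are not stated in this article.

```python
def entry_allowed(spread: float, max_spread: float) -> bool:
    """Spread filter: block the entry when the spread exceeds the threshold."""
    return spread <= max_spread


def atr_stop(entry: float, atr: float, side: str, atr_mult: float = 1.5) -> float:
    """Stop-loss placed atr_mult * ATR away from entry, against the trade."""
    distance = atr_mult * atr
    return entry - distance if side == "BUY" else entry + distance
```

Because the stop distance scales with ATR, it widens automatically in volatile conditions, which is exactly when spread and slippage are worst.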

Problem #4: Broker Data Delays

RoboForex can delay closing deal history by hours after a trade closes.

This matters because NOVOSKY logs every trade to Supabase immediately after a position closes. When the MT5 deal history isn't available yet, the bot estimates the outcome based on available data and stores that estimate.

The reconciliation system then runs in the background, retrying every 60 seconds for up to 4 hours, until it finds the real closing deal data and overwrites the estimate.

How Reconciliation Works

After any estimated trade record, a background thread fires. It polls the MT5 deal history API every 60 seconds. When the real deal appears, it overwrites the Supabase record with actual broker data — correct profit, correct timestamps, correct fill price. The website dashboard always shows the most recent available data, whether that's an estimate or the reconciled truth.
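The polling loop described above can be sketched as follows. `fetch_deal` and `save_record` are hypothetical callables standing in for the MT5 deal-history lookup and the Supabase update; neither is a real API call from either library, and this is not the bot's actual implementation.

```python
import time
from typing import Callable, Optional


def reconcile_trade(
    ticket: int,
    fetch_deal: Callable[[int], Optional[dict]],
    save_record: Callable[[int, dict], None],
    interval_s: float = 60.0,
    max_wait_s: float = 4 * 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the broker's closing deal appears, then overwrite the estimate.

    Returns True if the record was reconciled, False if the time budget ran out.
    """
    attempts = int(max_wait_s // interval_s)
    for _ in range(attempts):
        deal = fetch_deal(ticket)
        if deal is not None:
            save_record(ticket, deal)  # replace the estimate with broker data
            return True
        sleep(interval_s)
    return False
```

To keep the trading loop unblocked, a function like this would typically be started with `threading.Thread(target=reconcile_trade, args=(...), daemon=True).start()`. Passing `sleep` in as a parameter makes the loop testable without waiting.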

The delay is most common when trades are closed via the MT5 mobile app (manual close) rather than by the bot's own logic. Manual closes bypass the bot's normal exit flow, so the bot reconstructs the outcome from the deal history retroactively.

This works. But it creates a window where the dashboard shows estimated data. If you close the trade manually at 2pm and check the dashboard at 2:05pm, the profit shown might be an estimate, not the broker's confirmed figure.

What we want: a webhook from MT5 that fires when deals are confirmed. RoboForex doesn't expose this. We're stuck with polling until the API improves.

Problem #5: Circuit Breaker Calibration

The circuit breaker halts trading after 7 consecutive losses in the same local trading day (WIB timezone).

Seven. Why seven? Honest answer: because it felt right during testing.

That's not a satisfying answer. But calibrating circuit breakers is genuinely difficult:

Too tight (e.g., 3 consecutive losses): The system halts constantly during normal drawdown sequences that would have recovered. You miss good trades.
Too loose (e.g., 15 consecutive losses): By the time the breaker fires, significant damage has been done. The protection is too late.

The "right" number depends on the model's expected losing streak distribution, which changes as the model evolves. A number calibrated for a 58% win rate model isn't necessarily right for a 55% win rate model.
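For reference, the rule being calibrated (a halt after a set number of consecutive losses within one WIB day) might look like the sketch below. The class and method names are hypothetical, and two behaviors are assumptions rather than facts from the bot: the streak resets at WIB midnight, and any non-losing trade resets it.

```python
from datetime import datetime, timedelta, timezone

WIB = timezone(timedelta(hours=7))  # Western Indonesia Time, UTC+7, no DST


class CircuitBreaker:
    def __init__(self, max_losses: int = 7):
        self.max_losses = max_losses
        self.streak = 0
        self.day = None

    def record(self, closed_at: datetime, profit: float) -> bool:
        """Register a closed trade; return True if trading should halt.

        closed_at must be timezone-aware so the WIB conversion is unambiguous.
        """
        day = closed_at.astimezone(WIB).date()
        if day != self.day:  # assumption: a new local day resets the streak
            self.day, self.streak = day, 0
        # assumption: a win or breakeven trade resets the streak
        self.streak = self.streak + 1 if profit < 0 else 0
        return self.streak >= self.max_losses
```

Keeping the threshold as a constructor argument is what makes a later parameter sweep straightforward.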

At win rate $p$, and assuming trades are independent, the probability that a given run of $k$ trades are all losses is:

$P(k) = (1 - p)^{k}$

At $p = 0.58$, seven straight losses has probability $(0.42)^{7} \approx 0.0023$, or about 0.23%. That is rare for any single run, but across hundreds of trades it will happen. The open question is whether it signals genuine model failure or is simply an unlucky variance event.
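To see how "rare for one run" turns into "expected eventually", the sketch below computes the chance of at least one losing streak of length $k$ somewhere in a sequence of trades. It assumes independent trades with a constant win rate and ignores the same-day condition of the real breaker, so treat it as a rough upper-level intuition rather than a calibration tool.

```python
def p_all_lose(p_win: float, k: int) -> float:
    """Probability that k given consecutive trades are all losses."""
    return (1.0 - p_win) ** k


def p_streak_within(p_win: float, k: int, n_trades: int) -> float:
    """P(at least one run of k consecutive losses within n_trades trades)."""
    q = 1.0 - p_win
    # state[j]: probability of a current losing run of length j, with no
    # run of length k seen so far.
    state = [1.0] + [0.0] * (k - 1)
    for _ in range(n_trades):
        nxt = [0.0] * k
        nxt[0] = sum(state) * p_win       # a win resets the run
        for j in range(k - 1):
            nxt[j + 1] = state[j] * q     # a loss extends the run
        # state[k - 1] * q is dropped: that path has reached a k-loss streak
        state = nxt
    return 1.0 - sum(state)
```

Comparing `p_streak_within(0.58, 7, n)` for a few values of `n` shows how quickly the cumulative probability grows with trade count, which is the core of the calibration problem.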

What we're building: a statistical analysis of the model's historical consecutive loss distribution, plus an Optuna sweep specifically targeting circuit breaker parameters against the OOS window. The goal is to find the threshold where the breaker fires on genuine model failure — not normal variance.

Problem #6: The Single-Instrument Problem

NOVOSKY currently only trades BTCUSD.

One instrument. One market. One source of edge — and one source of risk.

In a bad BTC week — macro shock, exchange news, sudden correlation event — the system has no diversification. There's nowhere to hide.

Multi-instrument support (XAUUSD, EURUSD, other crypto) has been architecturally planned but not built. The model is trained specifically on BTC's microstructure. An XAUUSD model would need its own training pipeline, features tuned for gold's behavior (more news-driven, different session dynamics), and a separate validation window.

This is not a small task. It's an entire parallel development track.

Current status: architecture supports it (the broker abstraction layer exists), but the training pipeline and model files don't exist for additional instruments yet.

Problem #7: Feature Stability Over Time

The 59 features that mattered in 2022–2024 might not matter the same way in 2026–2028.

Market microstructure evolves. Correlations shift. The relationship between RSI and next-candle direction in 2022 might look completely different in 2026 because the participant base changed — more institutional traders, different ETF flows, different leverage dynamics.

This is feature drift, and it's related to but distinct from model drift.

We currently track SHAP values over time as a proxy — if a feature's average contribution collapses, something changed. But we don't have a systematic feature stability monitor that flags drift before it affects performance.

What we're building: rolling SHAP analysis over 30-day windows, comparing against the baseline distribution at training time. Features that drift significantly outside the training-time distribution get flagged for investigation before they drag down model performance.
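One crude way to express that comparison is sketched below. It assumes the mean absolute SHAP value per feature has already been computed per day elsewhere; the function name, the input layout (feature name mapped to a list of daily values), and the `z_threshold` default are all hypothetical.

```python
from statistics import mean, stdev
from typing import Dict, List, Sequence


def flag_drifting_features(
    baseline: Dict[str, Sequence[float]],
    recent: Dict[str, Sequence[float]],
    z_threshold: float = 3.0,
) -> List[str]:
    """Flag features whose recent mean |SHAP| sits far outside the baseline.

    baseline: daily mean |SHAP| values per feature at training time.
    recent:   the same statistic over the rolling window (e.g. 30 days).
    """
    flagged = []
    for name, base_vals in baseline.items():
        if name not in recent or len(base_vals) < 2:
            continue
        mu, sd = mean(base_vals), stdev(base_vals)
        if sd == 0:
            continue  # no baseline variation to compare against
        z = (mean(recent[name]) - mu) / sd
        if abs(z) > z_threshold:
            flagged.append(name)
    return flagged
```

A production monitor would likely compare full distributions rather than means, but even this simple check separates a collapsed feature from ordinary day-to-day noise.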

Problem #8: Static OOS Split

The 224-day OOS window is a fixed historical period.

That's strong protection against in-sample overfitting. But it has a structural weakness: the OOS period ends at a fixed date. As time passes, that OOS period gets older — and older OOS data is less representative of current market conditions.

Ideally, you'd use walk-forward validation — a rolling series of train/test splits where each test window is adjacent to its corresponding training window. This tests whether the model generalizes across time, not just across a single split.

Walk-Forward vs Static OOS

Static OOS: train on 2022–2025, test on the last 224 days. One evaluation.

Walk-Forward: train on 2022, test on 2023 Q1. Train on 2022 through 2023 Q1, test on 2023 Q2. Repeat across multiple windows, with each test window starting where its training data ends. This produces a distribution of performance metrics instead of a single number, which is much more robust.
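The split logic itself is simple; the cost is in the training runs. Below is a minimal sketch of an expanding-window walk-forward splitter over row indices. The function name and parameters are illustrative, and an expanding (rather than sliding) training window is an assumption based on the example above.

```python
from typing import Iterator, Tuple


def walk_forward_splits(
    n_samples: int, initial_train: int, test_size: int
) -> Iterator[Tuple[range, range]]:
    """Yield (train, test) index ranges with an expanding training window.

    Each test window starts exactly where its training window ends, so no
    test row is ever seen during the corresponding training run.
    """
    start = initial_train
    while start + test_size <= n_samples:
        yield range(0, start), range(start, start + test_size)
        start += test_size
```

With 10 samples, an initial training size of 4, and a test size of 2, this yields three splits: train on rows 0 to 3 and test on 4 to 5, then train on 0 to 5 and test on 6 to 7, then train on 0 to 7 and test on 8 to 9.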

The reason we don't have walk-forward yet: it requires significant compute and pipeline restructuring. Each window needs a full training run (Azure ML training job, ~40 minutes). For 20 windows, that's 800 minutes of training per cycle.

It's on the roadmap. It's just not cheap or easy.

What's Actually Been Fixed

Not everything is an open problem. A few things that were broken and are now handled:

The cent account precision bug. All MT5 values (balance, profit, deal amounts) are in USC (cent account units), not USD. Early versions of the codebase mixed these up, producing incorrect risk calculations and wrong Supabase records. Now there's a strict rule: raw USC is always stored, with helpers for display conversion. No more mixing.
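The "store raw USC, convert only for display" rule can be illustrated with two small helpers. The names here are hypothetical, not the codebase's actual helpers; the only fact relied on is that 100 US cents make one dollar.

```python
from decimal import Decimal

USC_PER_USD = Decimal(100)  # 100 US cents per dollar


def usc_to_usd(usc) -> Decimal:
    """Convert a raw cent-account value to USD for display only."""
    return Decimal(str(usc)) / USC_PER_USD


def format_usd(usc) -> str:
    """Human-readable USD string; the stored value stays in USC."""
    return f"${usc_to_usd(usc):.2f}"
```

Using `Decimal` keeps display values exact, and keeping conversion in one place means a raw USC value can never be mistaken for dollars in a risk calculation.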

The timezone OOS crash. A bug where plain date strings passed as OOS window boundaries caused a timezone-naive vs UTC comparison crash in the backtester. Fixed by detecting tzinfo and calling tz_localize("UTC") if missing. Doesn't sound exciting, but it silently produced wrong results before being caught.

Reconciliation data integrity. The bot previously stored estimated trade data permanently when broker history wasn't available. Now the _reconcile_trade_bg() background thread retries for up to 4 hours and overwrites with real broker data when it arrives.

The Meta-Problem: We Can't Know What We Don't Know

Every problem listed above is one we've found.

The scarier category is the problems we haven't found yet — unknown failure modes that only appear under specific market conditions we haven't seen since going live.

Every live trading system has these. The answer isn't to delay launch until everything is perfect (you'd never launch). The answer is to stay small while you learn, maintain conservative position sizing, log everything, and treat every unexpected behavior as a research opportunity rather than an emergency.

NOVOSKY runs on a cent account for exactly this reason. The goal is proving the edge is real before scaling the capital. Every problem found at small size costs less to learn from.

The Actual Goal

A system that fails gracefully and informatively is better than a system that works mysteriously. NOVOSKY's extensive logging, circuit breakers, and reconciliation system aren't just safety nets — they're how we learn what's actually happening in live markets vs what the models predict.

The problems listed here aren't signs the system is broken. They're evidence that we're paying attention.