Mock Pineapple · Step 4 · Maintenance Loop
The System Maintains Itself — Drift-Triggered Re-Tuning
Re-tuning on a calendar is wasteful. Re-tuning on every single-week MAPE spike is over-eager. The right answer is drift-triggered re-tuning decided per pair and per model, because the rules are different for every pair. This is the unsexy MLOps work that's actually load-bearing.
−71% · compute vs periodic re-tune
2 weeks · consecutive trigger required
+62% · SGD/SARIMA improvement
−150% · JPY/SARIMA hurt by re-tune
Walk-forward MAPE — the source of truth
Drift detection requires a stable measurement. Single-window backtests are too noisy — a model that randomly hit a calm month looks great; a model that hit a rate-decision week looks terrible. Walk-forward across 10+ cutoff dates averages out the noise. This is the only number the drift detector trusts.
# run_weekly_retrain.py — Step 1
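import numpy as np
import pandas as pd

def walk_forward_mape(pair, model, n_cutoffs=10):
    # `data` (dated history for this pair) and sample_cutoffs() are assumed to be defined elsewhere in the script
    cutoffs = sample_cutoffs(start='2023-01-01', end='today', n=n_cutoffs)
    mapes = []
    for cutoff in cutoffs:
        train_df = data[data.date < cutoff]
        test_df = data[(data.date >= cutoff) & (data.date < cutoff + pd.Timedelta(days=30))]
        model.fit(train_df)
        forecast = model.predict(horizon=30)
        # mean absolute percentage error over the 30-day test window
        mape = np.mean(np.abs(forecast - test_df.actual) / np.abs(test_df.actual))
        mapes.append(mape)
    return float(np.median(mapes))  # median, not mean: robust to a single bad cutoff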
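The listing above calls sample_cutoffs() without showing it. A minimal sketch of what such a helper could look like, assuming evenly spaced cutoffs that each leave a full 30-day test window before the end date. The body and the horizon_days parameter are assumptions, not the project's actual helper, which may sample randomly or on week boundaries.

```python
from datetime import date, timedelta

def sample_cutoffs(start, end, n, horizon_days=30):
    """Return n evenly spaced cutoff dates between start and (end - horizon_days).

    Hypothetical stand-in for the helper used in walk_forward_mape().
    """
    first = date.fromisoformat(start)
    end_date = date.today() if end == 'today' else date.fromisoformat(end)
    # The last cutoff must still leave a full test window after it.
    last = end_date - timedelta(days=horizon_days)
    if n < 1 or last < first:
        raise ValueError("need n >= 1 and a date range longer than the test horizon")
    if n == 1:
        return [first]
    step = (last - first).days / (n - 1)
    return [first + timedelta(days=round(i * step)) for i in range(n)]

cutoffs = sample_cutoffs('2023-01-01', '2024-01-01', n=10)
```

Evenly spaced cutoffs make the result reproducible from week to week, which matters when the same number is compared against a stored baseline.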
Why median, not mean: one bad cutoff (say, a regime shift week) shouldn't make the system re-tune a model that's been stable for 9 other cutoffs. Median is robust; mean is trigger-happy.
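A small numeric illustration of that point. The MAPE values are made up for the example (nine calm cutoffs and one that landed on a regime-shift week) and are not results from the project.

```python
from statistics import mean, median

# Illustrative values only: nine calm cutoffs plus one bad one.
baseline_mape = 0.0030
cutoff_mapes = [0.0030] * 9 + [0.0300]

THRESHOLD = 1.5  # drift ratio used by the detector

mean_ratio = mean(cutoff_mapes) / baseline_mape
median_ratio = median(cutoff_mapes) / baseline_mape

print(f"mean ratio:   {mean_ratio:.2f}  breach={mean_ratio >= THRESHOLD}")
print(f"median ratio: {median_ratio:.2f}  breach={median_ratio >= THRESHOLD}")
```

With these numbers the mean-based ratio crosses the 1.5× threshold on the strength of one cutoff, while the median-based ratio stays at the baseline.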
Drift detection — 1.5× threshold, 2 consecutive weeks
A single-week MAPE spike isn't drift, it's noise. Drift is persistent. The detector compares this week's walk-forward MAPE to the baseline MAPE recorded at the last tune; if the ratio is at least 1.5× for two consecutive weeks, it triggers an Optuna re-tune. Baselines live in tuned_hyperparameters.json; thresholds in config.json.
# source.monitoring.drift_retune
from collections import defaultdict

# weekly breach flags, kept per (model, pair) so one pair's spike never counts toward another's
history = defaultdict(list)

def check_drift(pair, model):
    current = walk_forward_mape(pair, model)
    baseline = tuned_hyperparameters[(model, pair)]['mape']
    ratio = current / baseline
    flags = history[(model, pair)]
    flags.append(ratio >= 1.5)  # one flag per weekly run
    if len(flags) >= 2 and all(flags[-2:]):
        # Persistent drift: re-tune
        log_warning(f"DRIFT DETECTED: {model}/{pair} — "
                    f"MAPE {current:.4f} vs baseline {baseline:.4f} "
                    f"({ratio:.1f}x) for 2 consecutive weeks → RE-TUNE TRIGGERED")
        return True
    return False
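
def retune(pair, model):
    # Optuna, 30 trials: narrower than the initial 100
    study = optuna.create_study(direction='minimize')
    # make_objective() is a placeholder for the project's walk-forward MAPE objective
    study.optimize(make_objective(model, pair, walk_forward=True), n_trials=30)
    best = {'params': study.best_params, 'mape': study.best_value}  # assumed entry layout
    tuned_hyperparameters[(model, pair)] = best
    save(tuned_hyperparameters)  # atomic JSON write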
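The two-consecutive-weeks rule can be exercised on its own, without the models. Below is a self-contained sketch that keeps a breach streak per (model, pair). The class name, the method names and the weekly MAPE values are all illustrative assumptions, not the contents of drift_retune.py.

```python
from collections import defaultdict

class DriftDetector:
    """Flags drift when MAPE / baseline >= threshold for `patience` runs in a row."""

    def __init__(self, threshold=1.5, patience=2):
        self.threshold = threshold
        self.patience = patience
        self._streak = defaultdict(int)  # (model, pair) -> consecutive breaches

    def update(self, model, pair, current_mape, baseline_mape):
        key = (model, pair)
        if current_mape / baseline_mape >= self.threshold:
            self._streak[key] += 1
        else:
            self._streak[key] = 0  # one clean week resets the count
        return self._streak[key] >= self.patience

detector = DriftDetector()
baseline = 0.0028  # illustrative baseline
weekly = [0.0029, 0.0050, 0.0027, 0.0047, 0.0220]  # made-up weekly MAPEs
fired = [detector.update('sarima', 'SGD', m, baseline) for m in weekly]
print(fired)
```

The isolated spike in week 2 does not fire because week 3 is clean; only the back-to-back breaches in weeks 4 and 5 do. In a real loop the baseline is replaced after a re-tune, so the ratio drops back and the streak resets on the next run.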
Per-pair, per-model rules — what 6 years of simulation taught
Before deploying drift-triggered re-tuning, I simulated 6 years of weekly tunes against the no-tune baseline, per (model, pair). The result was strikingly pair-specific — the same drift detector that helps SGD/SARIMA actively hurts JPY/SARIMA. The production rules below are encoded directly in drift_retune.py.
SARIMA / JPY
RULE — NEVER RE-TUNE
Re-tuning SARIMA on JPY worsens MAPE by 150% in the 6-year simulation. JPY's autoregressive structure is unusually stable: once tuned, it stays good. Each re-tune only adds noise from finite-sample Optuna trials.
Encoded as: SKIP_RETUNE = {('sarima', 'JPY')} hardcoded skip set.
SARIMA / SGD
RULE — DRIFT-TRIGGERED
SGD/SARIMA shows a 62% MAPE improvement with drift triggers versus the no-retune baseline. Drift triggers fired 3 times in 6 years; periodic re-tuning would have fired 7 times, for the same outcome with more compute.
SGD's monetary policy regime shifts more than JPY's, so the model legitimately needs occasional updates.
LightGBM / JPY
RULE — DRIFT-TRIGGERED
LightGBM/JPY shows a 76% improvement with only 2 re-tunes in 6 years. Tree models are more sensitive to feature distribution shifts than autoregressive models, and the macro feature set evolves with the world (VIX regime, yield-curve shape).
The drift trigger catches it without re-tuning every Monday.
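The three rules above reduce to one gate: a drifted combination is re-tuned unless it is in the skip set. A minimal sketch of that gate follows. SKIP_RETUNE comes from the text; plan_retunes() and the example flags are hypothetical names and values introduced for illustration.

```python
SKIP_RETUNE = {('sarima', 'JPY')}  # hardcoded skip set from the rule above

def plan_retunes(drift_flags, skip=SKIP_RETUNE):
    """Split drifted (model, pair) keys into a re-tune queue and a skipped list.

    drift_flags: dict mapping (model, pair) -> bool from the drift check.
    """
    queue, skipped = [], []
    for key, drifted in sorted(drift_flags.items()):
        if not drifted:
            continue
        (skipped if key in skip else queue).append(key)
    return queue, skipped

# Example flags, invented for the illustration.
flags = {
    ('sarima', 'JPY'): True,     # drifted, but the rule says never re-tune
    ('sarima', 'SGD'): True,     # drifted, re-tune
    ('lightgbm', 'JPY'): False,  # stable, nothing to do
}
queue, skipped = plan_retunes(flags)
print("re-tune:", queue)
print("skipped:", skipped)
```

Returning the skipped keys as well keeps the skip visible in the log instead of silently dropping a detected drift.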
−71% compute vs periodic re-tuning
The weekly retrain script trains all 24 model-pair combinations (3 models × 8 pairs) every Monday — that part is non-negotiable, it's how we measure drift. But re-tuning hyperparameters via Optuna is the expensive step: 30 trials × ~80 seconds each = ~40 minutes of compute per re-tune. Doing this for every pair every week is wasteful.
| Strategy | Re-tunes per year | Compute | Best vs no-retune |
| --- | --- | --- | --- |
| Periodic (every Monday) | 52 × 24 = 1,248 | ~830 hrs/year | marginal |
| Drift-triggered (1.5× for 2 consecutive weeks) | ~10/year (in production) | ~6.6 hrs/year (−71%) | +62% / +76% on the right pairs |
| Per-pair rules (no JPY/SARIMA) | ~7/year | ~4.7 hrs/year (−80%) | strictly better; never hits the regression cases |
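The hours in the Compute column follow from the re-tune counts and the ~40 minutes per re-tune quoted above. A short sketch of that arithmetic, using only figures stated in the text; the function and dictionary names are illustrative.

```python
MINUTES_PER_RETUNE = 30 * 80 / 60   # 30 trials x ~80 s each = 40 min
N_COMBOS = 3 * 8                    # 3 models x 8 pairs = 24

def hours_per_year(retunes_per_year):
    """Optuna compute per year for a given number of re-tunes."""
    return retunes_per_year * MINUTES_PER_RETUNE / 60

strategies = {
    'periodic': 52 * N_COMBOS,   # every combination, every Monday
    'drift-triggered': 10,       # ~10/year, from the table
    'per-pair rules': 7,         # ~7/year, from the table
}
for name, n in strategies.items():
    print(f"{name:16s} {n:5d} re-tunes  {hours_per_year(n):7.1f} hrs/year")
```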
The principle: "stationary" isn't a property of FX markets — it's a property of this pair × this model × this feature set. The drift detector and per-pair rules together encode that, so I don't have to remember it.
Mar 30 incident — drift caught SGD before I did
The most recent successful retrain logged a real drift event. This is what the system was built to do.
2026-03-30 · 10:28 CDT · WEEKLY RETRAIN W13
SGD models drifted 7.9× and 1.8× — system auto-recovered overnight
DRIFT DETECTED: sarima/SGD — MAPE 0.0220 vs baseline 0.0028 (7.9x) for 2 consecutive weeks → RE-TUNE TRIGGERED
DRIFT DETECTED: lightgbm/SGD — MAPE 0.0041 vs baseline 0.0023 (1.8x) for 2 consecutive weeks → RE-TUNE TRIGGERED
Both models were re-tuned in the same retrain cycle (~80 seconds each, 30 Optuna trials, walk-forward validated). New baselines written to tuned_hyperparameters.json. Dashboards regenerated. Total human time involved: zero. I noticed it the next morning when I looked at the regen log.
This is what "MLOps" should mean: the system flags its own failures, fixes what it can, and only escalates when the rules don't apply.
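For reviewing the regen log after the fact, the DRIFT DETECTED lines shown above are regular enough to parse. A sketch under the assumption that every such line follows exactly the format of the two logged examples; the regex and the function name are illustrative and not part of the project.

```python
import re

# Matches the log format shown in the incident above.
DRIFT_RE = re.compile(
    r"DRIFT DETECTED: (?P<model>\w+)/(?P<pair>\w+) — "
    r"MAPE (?P<current>[\d.]+) vs baseline (?P<baseline>[\d.]+) "
    r"\((?P<ratio>[\d.]+)x\)"
)

def parse_drift_line(line):
    """Return the fields of a DRIFT DETECTED line, or None if it does not match."""
    m = DRIFT_RE.search(line)
    if m is None:
        return None
    return {
        'model': m['model'],
        'pair': m['pair'],
        'current': float(m['current']),
        'baseline': float(m['baseline']),
        'ratio': float(m['ratio']),
    }

line = ("DRIFT DETECTED: sarima/SGD — MAPE 0.0220 vs baseline 0.0028 "
        "(7.9x) for 2 consecutive weeks → RE-TUNE TRIGGERED")
event = parse_drift_line(line)
print(event)
```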
Full weekly retrain — five sequential steps
# Mondays 9:30 AM via LaunchAgent — com.mockpineapple.weekly-retrain
bash projects/mock_pineapple/run_pipeline.sh --retrain
STEP 1 · walk-forward MAPE per (model × pair) over 10 cutoffs
    24 fits × ~1-2s each ≈ 30s
STEP 2 · drift_retune.py
    for each (model, pair):
        check_drift() → returns True/False
        if True and (model, pair) ∉ SKIP_RETUNE: queue_retune()
STEP 3 · Optuna re-tune queued items (sequential, ~40-60s each)
    Mar 30 example: 3 items queued, 2 actually re-tuned (SARIMA/SGD, LightGBM/SGD)
    save tuned_hyperparameters.json (atomic write)
STEP 4 · refit production models with new hyperparameters
    write data/trained_models/YYYY-WXX/manifest.json
    sarima.pkl, prophet.pkl, lgbm.pkl per (pair, model)
STEP 5 · regenerate dashboards
    dashboard/generate_analyst_dashboard.py
    → mape_dashboard.html (model quality, ensemble weights)
    → analyst_report.html (7-finding narrative)
    → bucketed_mape.html (per-horizon bucket detail)
Total wall time: ~25-40 min (depends on how many re-tunes triggered)
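Step 3 saves tuned_hyperparameters.json with an atomic write. One common way to get that guarantee is to write to a temporary file in the same directory and rename it over the target. The sketch below assumes that approach; save_atomic() and the "model/pair" string keys are illustrative choices (JSON object keys must be strings), not the project's actual save() function or file layout.

```python
import json
import os
import tempfile

def save_atomic(tuned, path):
    """Write `tuned` to `path` so readers never see a half-written file."""
    # JSON keys must be strings, so (model, pair) tuples are joined here.
    payload = {f"{model}/{pair}": entry for (model, pair), entry in tuned.items()}
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives next to the target so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix='.tmp')
    try:
        with os.fdopen(fd, 'w') as f:
            json.dump(payload, f, indent=2, sort_keys=True)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)  # atomic swap of old file for new
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise

# Usage, written to a throwaway directory.
with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, 'tuned_hyperparameters.json')
    save_atomic({('sarima', 'SGD'): {'mape': 0.0028}}, target)
    with open(target) as f:
        loaded = json.load(f)
print(loaded)
```

If the process dies mid-write, the previous file is still intact, so the next Monday's drift check reads a valid baseline rather than a truncated one.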
What this loop taught me
PROCESS LESSONS
Three rules that came out of this work
1. Re-tune is not a default, it's a triggered action. Default re-tuning destroys stable models. Treating "Monday morning" as a re-tune signal is a bug, not a feature.
2. Per-pair rules beat global rules. The same drift threshold (1.5×) is correct for SGD and wrong for JPY/SARIMA. Encoding pair-specific behavior in SKIP_RETUNE took 30 minutes and prevented a class of regressions I'd otherwise have to babysit.
3. Dashboard regeneration is part of the loop. If the system re-tunes overnight but the dashboard still shows last week's MAPE, the human (me) doesn't trust the numbers. Regenerating analyst_report.html every Monday closes the loop.
What the maintenance loop delivers
71% less compute than periodic re-tuning. Per-pair rules prevent the regression cases (JPY/SARIMA gets worse with re-tuning, so the system never tries). Dashboards regenerate automatically so the numbers Diane sees are always current. The Mar 30 incident shows the loop working in production: SGD drift detected, two models re-tuned, no human in the loop. This is what "MLOps on a laptop" looks like.