When Fusion Helps and When It Breaks: View-Aligned Robustness in Same-Source Financial Imaging
Rui Ma
TL;DR
This work investigates same-source, two-view financial imaging for next-day direction prediction on Shanghai Gold Exchange (SGE) gold spot data. It constructs aligned OHLCV and indicator-image views and evaluates them under leakage-resistant time-block splits, with MCC as the primary metric. It reveals a non-monotonic data-noise trade-off driven by a post-hoc minimum-movement filter on $|r_{t+1}|$, which governs when predictive signal emerges and how robust the models are. Late fusion with dual encoders provides the dominant clean-performance gains in stabilized label regimes, while early fusion can incur negative transfer under high label noise; cross-view consistency regularization yields secondary, backbone-dependent effects. Adversarial robustness tests with $\ell_\infty$ perturbations show severe vulnerability at small budgets. Robustness is strongly view-dependent: late fusion helps against view-constrained attacks, but joint perturbations remain challenging. These findings underscore the importance of explicit evaluation design, view-aligned threat modeling, and diagnostics for reliably assessing fusion benefits and robustness in financial-imaging pipelines.
Abstract
We study same-source multi-view learning and adversarial robustness for next-day direction prediction with financial image representations. On Shanghai Gold Exchange (SGE) spot gold data (2005–2025), we construct two window-aligned views from each rolling window: an OHLCV-rendered price/volume chart and a technical-indicator matrix. To ensure reliable evaluation, we adopt leakage-resistant time-block splits with an embargo and report the Matthews correlation coefficient (MCC). Results depend strongly on the label-noise regime. We apply an ex-post minimum-movement filter that discards samples whose realized next-day absolute return falls below a threshold $\tau$, which defines evaluation subsets with reduced near-zero label ambiguity. The filter induces a non-monotonic data-noise trade-off: it can reveal predictive signal, but variance eventually increases as the sample size shrinks. It is used for offline benchmark construction, not as an inference-time decision rule. In the stabilized subsets, the effect of fusion is regime-dependent: early fusion by channel stacking can exhibit negative transfer, whereas late fusion with dual encoders and a fusion head provides the dominant clean-performance gains; cross-view consistency regularization has secondary, backbone-dependent effects. We further evaluate test-time $\ell_\infty$ perturbations using FGSM and PGD under two threat scenarios: view-constrained attacks that perturb a single view, and joint attacks that perturb both. We observe severe vulnerability at very small perturbation budgets, with strong asymmetry between views. Late fusion consistently improves robustness under view-constrained attacks, but joint attacks remain challenging and can still cause substantial worst-case degradation.
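As a rough illustration of the evaluation setup described above, the following sketch shows how an ex-post minimum-movement filter and the MCC could be computed. It is not the authors' implementation: the function names `min_movement_subset` and `mcc`, the use of simple (rather than log) returns, the 0/1 label encoding, and the example threshold are all assumptions made for illustration.

```python
import math

def min_movement_subset(closes, tau):
    """Keep day t only if the realized next-day |return| is at least tau.

    Assumption: simple returns r_{t+1} = close[t+1] / close[t] - 1.
    Returns a list of (t, label) pairs, label 1 for up and 0 for down.
    Because it uses the realized next-day return, this is only valid
    for offline benchmark construction, not at inference time.
    """
    kept = []
    for t in range(len(closes) - 1):
        r_next = closes[t + 1] / closes[t] - 1.0
        if abs(r_next) >= tau:
            kept.append((t, 1 if r_next > 0 else 0))
    return kept

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for a, b in pairs if a == 1 and b == 1)
    tn = sum(1 for a, b in pairs if a == 0 and b == 0)
    fp = sum(1 for a, b in pairs if a == 0 and b == 1)
    fn = sum(1 for a, b in pairs if a == 1 and b == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0 when a row or column of the confusion matrix is empty.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

# Toy closing prices (hypothetical values, not SGE data).
closes = [100.0, 100.05, 101.0, 100.0, 100.02]
subset = min_movement_subset(closes, tau=0.005)
labels = [label for _, label in subset]
```

Raising `tau` removes more near-zero moves, which reduces label ambiguity but also shrinks the evaluation subset; this is the data-noise trade-off the abstract refers to.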
