OccamVTS: Distilling Vision Models to 1% Parameters for Time Series Forecasting

Sisuo Lyu, Siru Zhong, Weilin Ruan, Qingxiang Liu, Qingsong Wen, Hui Xiong, Yuxuan Liang

TL;DR

This paper addresses the overparameterization of large vision models (LVMs) when they are applied to time series forecasting (TSF). It introduces OccamVTS, a cross-modal knowledge distillation framework that distills only the essential $1\%$ of predictive information from off-the-shelf vision backbones into lightweight temporal models, using pyramid-style feature alignment and two distillation pathways (correlation distillation and feature distillation). The approach uses a texture-focused visual augmentation pipeline to convert time series into 2D texture-like representations, and an asymmetric teacher–student setup in which the teacher is frozen and only the small student is trained. Across long-term, short-term, few-shot, and zero-shot forecasting experiments on multiple benchmarks, OccamVTS achieves state-of-the-art results with up to a $99\%$ reduction in vision-encoder parameters, with the largest gains in data-scarce regimes. The work highlights the value of selective cross-modal transfer and offers a practical path to efficient, robust TSF when data and compute are limited.

Abstract

Time series forecasting is fundamental to diverse applications, and recent approaches leverage large vision models (LVMs) to capture temporal patterns through visual representations. We reveal that while vision models enhance forecasting performance, 99% of their parameters are unnecessary for time series tasks. Through cross-modal analysis, we find that time series align with low-level textural features but not with high-level semantics, and that these semantic features can impair forecasting accuracy. We propose OccamVTS, a knowledge distillation framework that extracts only the essential 1% of predictive information from LVMs into lightweight networks. Using pre-trained LVMs as privileged teachers, OccamVTS employs pyramid-style feature alignment combined with correlation and feature distillation to transfer beneficial patterns while filtering out semantic noise. Counterintuitively, this aggressive parameter reduction improves accuracy by eliminating overfitting to irrelevant visual features while preserving essential temporal patterns. Extensive experiments across multiple benchmark datasets demonstrate that OccamVTS consistently achieves state-of-the-art performance with only 1% of the original parameters, particularly excelling in few-shot and zero-shot scenarios.
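
To make the two distillation signals named in the abstract more concrete, here is a minimal sketch in plain Python. This summary does not give the paper's actual loss definitions, so everything below is an assumption for illustration: the function names, the linear projection `weight`, the use of cosine similarity for the correlation term, and the weights `alpha` and `beta` are all hypothetical. The feature term pulls projected student features toward the frozen teacher's features; the correlation term matches the pairwise similarity structure across samples in a batch.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)

def similarity_matrix(feats):
    """Pairwise cosine similarities between the samples of one batch."""
    return [[cosine(u, v) for v in feats] for u in feats]

def mse(a, b):
    """Mean squared error between two equally shaped nested lists."""
    flat_a = [x for row in a for x in row]
    flat_b = [x for row in b for x in row]
    return sum((x - y) ** 2 for x, y in zip(flat_a, flat_b)) / len(flat_a)

def project(feats, weight):
    """Hypothetical linear map from student width to teacher width.

    weight has shape (student_dim, teacher_dim).
    """
    out_dim = len(weight[0])
    return [
        [sum(f[i] * weight[i][j] for i in range(len(f))) for j in range(out_dim)]
        for f in feats
    ]

def distillation_loss(student_feats, teacher_feats, weight, alpha=1.0, beta=1.0):
    """Assumed combination: alpha * feature term + beta * correlation term.

    teacher_feats are treated as constants, mirroring a frozen teacher.
    """
    feature_term = mse(project(student_feats, weight), teacher_feats)
    correlation_term = mse(
        similarity_matrix(student_feats), similarity_matrix(teacher_feats)
    )
    return alpha * feature_term + beta * correlation_term

# Toy batch: 3 samples, student width 2, teacher width 4.
student = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
teacher = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
weight = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]

loss = distillation_loss(student, teacher, weight)
print(loss)
```

In this toy batch the projected student features coincide with the teacher's and both share the same similarity structure, so the loss is zero; any mismatch in either term raises it. In a real training loop the student and the projection would be updated by gradient descent on this loss together with the forecasting loss, while the teacher stays fixed.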

Paper Structure

This paper contains 39 sections, 44 equations, 12 figures, 15 tables.

Figures (12)

  • Figure 1: Modality visualization of images (ImageNet) and time series (ECL, Weather, Electricity, ETT) via the MAE encoder. (a)-(d): Original image samples extracted from the corresponding boxes in the t-SNE plot.
  • Figure 2: Overview of the OccamVTS framework.
  • Figure 3: Model Efficiency Comparison, MAE vs Inference Time vs Parameters.
  • Figure 4: Ablation Experiment on Four Datasets.
  • Figure 5: Effect of Different Training Data on Four Datasets.
  • ...and 7 more figures