Learning Time in Static Classifiers

Xi Ding, Lei Wang, Piotr Koniusz, Yongsheng Gao

TL;DR

The paper addresses a limitation of static classifiers, which ignore temporal dynamics, by introducing Support-Exemplar-Query (SEQ) learning, a loss-driven framework that imparts temporal reasoning without architectural changes. The method builds temporally coherent feature trajectories from smooth augmentations and aligns predictions to class-specific temporal prototypes via differentiable Soft-DTW, so that a lightweight fully connected classifier on frozen features learns time-evolving semantics. It combines three objectives (temporal prototype alignment, semantic supervision, and temporal smoothness) into a single trainable loss, improving fine-grained image recognition and giving precise, frame-level anomaly detection in video. The approach offers a modular, data-efficient bridge between static and temporal learning, reducing architectural complexity while keeping predictions temporally consistent.

Abstract

Real-world visual data rarely presents as isolated, static instances. Instead, it often evolves gradually over time through variations in pose, lighting, object state, or scene context. However, conventional classifiers are typically trained under the assumption of temporal independence, limiting their ability to capture such dynamics. We propose a simple yet effective framework that equips standard feedforward classifiers with temporal reasoning, all without modifying model architectures or introducing recurrent modules. At the heart of our approach is a novel Support-Exemplar-Query (SEQ) learning paradigm, which structures training data into temporally coherent trajectories. These trajectories enable the model to learn class-specific temporal prototypes and align prediction sequences via a differentiable soft-DTW loss. A multi-term objective further promotes semantic consistency and temporal smoothness. By interpreting input sequences as evolving feature trajectories, our method introduces a strong temporal inductive bias through loss design alone. This proves highly effective in both static and temporal tasks: it enhances performance on fine-grained and ultra-fine-grained image classification, and delivers precise, temporally consistent predictions in video anomaly detection. Despite its simplicity, our approach bridges static and temporal learning in a modular and data-efficient manner, requiring only a simple classifier on top of pre-extracted features.
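To make the alignment idea in the abstract concrete, the following is a minimal NumPy sketch of a soft-DTW value between a query prediction trajectory and a class exemplar, plus a simple smoothness penalty. It is an illustration under stated assumptions, not the authors' implementation: the function names, the squared-Euclidean frame cost, the element-wise mean used to form the exemplar from a support set, and the squared-difference smoothness term are all choices made here for clarity.

```python
import numpy as np


def soft_min(values, gamma):
    """Smooth minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    v = -np.asarray(values, dtype=float) / gamma
    m = np.max(v)
    return -gamma * (m + np.log(np.sum(np.exp(v - m))))


def soft_dtw(x, y, gamma=1.0):
    """Soft-DTW value between trajectories x (T1, D) and y (T2, D).

    Assumption: squared Euclidean distance as the per-frame cost.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    t1, t2 = len(x), len(y)
    # Pairwise frame-to-frame costs, shape (T1, T2).
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    # Accumulated-cost table with an infinite border and r[0, 0] = 0.
    r = np.full((t1 + 1, t2 + 1), np.inf)
    r[0, 0] = 0.0
    for i in range(1, t1 + 1):
        for j in range(1, t2 + 1):
            prev = [r[i - 1, j], r[i, j - 1], r[i - 1, j - 1]]
            r[i, j] = cost[i - 1, j - 1] + soft_min(prev, gamma)
    return r[t1, t2]


def mean_exemplar(support):
    """Hypothetical exemplar: element-wise mean of equal-length support trajectories."""
    return np.mean(np.asarray(support, dtype=float), axis=0)


def temporal_smoothness(traj):
    """Mean squared change between consecutive steps of a trajectory (T, D)."""
    diffs = np.diff(np.asarray(traj, dtype=float), axis=0)
    return float(np.mean(np.sum(diffs ** 2, axis=1)))


# Usage: align a time-warped query against an exemplar built from two supports.
support = [np.array([[0.0], [1.0], [2.0]]), np.array([[0.0], [1.0], [2.0]])]
exemplar = mean_exemplar(support)
query = np.array([[0.0], [0.0], [1.0], [2.0]])  # same path, slower start
alignment = soft_dtw(query, exemplar, gamma=0.01)
smoothness = temporal_smoothness(query)
```

As `gamma` shrinks, the soft minimum approaches the hard minimum and the value approaches classic DTW, so the warped query above scores close to zero against the exemplar. In a training setting the same recursion would be written in an autodiff framework so that gradients flow back to the classifier.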

Paper Structure

This paper contains 20 sections, 23 equations, 11 figures, 3 tables.

Figures (11)

  • Figure 1: Overview of our framework. (a) Temporally smooth sequences are generated via time-indexed transformations $\mathcal{A}_t$ (or sourced from natural videos) and processed by a frozen, image-pretrained vision transformer to extract frame-wise features. A lightweight temporal classifier is then trained to produce feature trajectories. (b) These trajectories are optimized using a multi-term objective with the Support-Exemplar-Query (SEQ) learning framework (see Fig. 3) to (i) align with class-specific prototype trajectories that capture typical temporal patterns (violet block), (ii) achieve accurate classification through semantic supervision (vivid green block), and (iii) ensure smooth and consistent temporal evolution (gray brown block).
  • Figure 2: Examples from Flowers-102, SoyAging, Stanford Dogs, and Cars show how augmentations create temporal variations from one image. The first column shows originals (green); others apply augmentations by color: flip (red), zoom (blue), rotation (purple), color jitter (orange), shear (brown), translation (pink), blur (gray), and cutout (cyan), enriching the feature space with varied appearances.
  • Figure 3: Support-Exemplar-Query (SEQ) models class-consistent temporal dynamics by constructing a support set of sequences to form a class-specific exemplar that captures typical prediction trajectories over time. A query sequence is then aligned against this exemplar to enforce temporal consistency and reveal deviations from expected class behavior.
  • Figure 4: Evaluation of key hyperparameters.
  • Figure 5: Visualization of selected FC weight regions shows a clear comparison between the baseline (left) and our temporal modeling (right). Temporal modeling yields stronger, more distinct patterns, enhancing feature discrimination. Even on the ultra-fine-grained SoyAging, our approach produces clearer, more structured weights, demonstrating the advantages of temporal supervision in feature learning.
  • ...and 6 more figures

Theorems & Definitions (3)

  • Remark 1
  • Remark 2
  • Remark 3