Frame2Freq: Spectral Adapters for Fine-Grained Video Understanding

Thinesh Thiyakesan Ponbagavathi, Constantin Seibold, Alina Roitberg

TL;DR

Across five fine-grained activity recognition datasets, Frame2Freq outperforms prior PEFT methods and even surpasses fully fine-tuned models on four of them, providing encouraging evidence that frequency analysis methods are a powerful tool for modeling temporal dynamics in image-to-video transfer.

Abstract

Adapting image-pretrained backbones to video typically relies on time-domain adapters tuned to a single temporal scale. Our experiments show that these modules pick up static image cues and very fast flicker changes, while overlooking medium-speed motion. Capturing dynamics across multiple time-scales is, however, crucial for fine-grained temporal analysis (e.g., opening vs. closing a bottle). To address this, we introduce Frame2Freq -- a family of frequency-aware adapters that perform spectral encoding during image-to-video adaptation of pretrained Vision Foundation Models (VFMs), improving fine-grained action recognition. Frame2Freq applies a Fast Fourier Transform (FFT) along the temporal axis and learns frequency-band-specific embeddings that adaptively highlight the most discriminative frequency ranges. Across five fine-grained activity recognition datasets, Frame2Freq outperforms prior PEFT methods and even surpasses fully fine-tuned models on four of them. These results provide encouraging evidence that frequency analysis methods are a powerful tool for modeling temporal dynamics in image-to-video transfer. Code is available at https://github.com/th-nesh/Frame2Freq.
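To make the idea of an FFT along time with band-specific weighting concrete, here is a minimal NumPy sketch. It is not the authors' adapter: the function name `spectral_band_reweight`, the per-bin gain vector `band_gains`, and the choice of 16 frames with 8-dimensional embeddings are all assumptions made for illustration. In a trained adapter the gains would presumably be learned, whereas here they are fixed by hand.

```python
import numpy as np

def spectral_band_reweight(frames, band_gains):
    """Reweight temporal frequency bands of per-frame embeddings.

    frames:     (T, D) array, one D-dimensional embedding per frame.
    band_gains: (T // 2 + 1,) real gains, one per temporal frequency bin
                (hypothetical stand-in for learned band-specific weights).
    """
    spectrum = np.fft.rfft(frames, axis=0)        # (T//2 + 1, D), complex
    weighted = spectrum * band_gains[:, None]     # scale each frequency bin
    return np.fft.irfft(weighted, n=frames.shape[0], axis=0)

rng = np.random.default_rng(0)
T, D = 16, 8                                      # assumed clip length and width
frames = rng.standard_normal((T, D))
num_bins = T // 2 + 1                             # 9 bins for 16 frames

# Gains of one leave the embeddings unchanged.
identity = spectral_band_reweight(frames, np.ones(num_bins))

# Keeping only bin 0 collapses every frame to the temporal mean (static content).
dc_gains = np.zeros(num_bins)
dc_gains[0] = 1.0
dc_only = spectral_band_reweight(frames, dc_gains)
```

Under these assumptions, bin 0 carries the static appearance shared by all frames, and higher bins carry progressively faster temporal change, so raising the gains of the middle bins emphasizes medium-speed motion.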

Paper Structure

This paper contains 23 sections, 6 equations, 7 figures, 9 tables, and 1 algorithm.

Figures (7)

  • Figure 1: Examples from Diving48 illustrating how different somersault counts induce distinct temporal frequency signatures. Spectral magnitude maps on the right are obtained by applying a Fast Fourier Transform temporally to the frame embeddings of a frozen VFM. Slow dives produce low-frequency spectra, while faster periodic and rapid motions shift energy toward mid and high frequencies. This motivates Frame2Freq -- our image-to-video PEFT approach, leveraging Fast Fourier Transform (FFT) along time to improve fine-grained video understanding.
  • Figure 2: Overview of Frame2Freq. Left: Frozen VFMs are adapted to video by inserting lightweight Frame2Freq adapters between transformer blocks, enriching spatial embeddings with frequency-aware temporal cues. Right: Unlike the ST-Adapter, our spectral variants use FFT-based branches: Frame2Freq-ST for single-scale spectral cues and Frame2Freq-MS for multi-scale motion patterns.
  • Figure 3: Class-wise mean spectra from Diving48 reveal that dive complexity boosts frequency energy, while posture (tuck vs. pike) shapes directional components. Fine-grained distinctions emerge more cleanly in the frequency domain, motivating frequency-aware temporal modeling.
  • Figure 4: Normalized frequency discriminability curves $D(f)$ for our main baseline ST-Adapter (blue) and our Frame2Freq adapter (orange) on four datasets, computed with the Frequency Discriminability Analysis described in the spectral analysis section. Each curve shows how much class-separating power is carried by each temporal frequency band (0-8). Standard temporal adapters concentrate discriminability in low or very high frequencies and tend to underuse the mid bands, whereas Frame2Freq shifts discriminability toward the most informative bands for each dataset, which is especially useful for recognition of fine-grained actions.
  • Figure 5: SSv2 symmetric actions. Spectral maps (right) reveal clear directional frequency differences between moving something down and moving something up, despite nearly identical RGB frames.
  • ...and 2 more figures
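The captions above describe two ideas that a toy example can illustrate: slow and fast motions occupy different temporal frequency bins, and a per-band curve can summarize how well each bin separates classes. The sketch below assumes 16-frame clips, which yield the 9 real-FFT bins (0-8) mentioned for Figure 4. The helper `band_discriminability` uses a Fisher-style ratio of between-class to within-class variance. That ratio is an assumption chosen for illustration and is not the paper's Frequency Discriminability Analysis, whose definition is not given in this summary.

```python
import numpy as np

T = 16                      # assumed number of frames per clip
t = np.arange(T)

def temporal_spectrum(signal):
    """Magnitude of the real FFT along the last axis, after removing the mean."""
    centered = signal - signal.mean(axis=-1, keepdims=True)
    return np.abs(np.fft.rfft(centered, axis=-1))

# A slow and a fast periodic signal peak in different frequency bins.
slow = np.sin(2 * np.pi * 1 * t / T)
fast = np.sin(2 * np.pi * 6 * t / T)
slow_peak = int(np.argmax(temporal_spectrum(slow)))
fast_peak = int(np.argmax(temporal_spectrum(fast)))

def band_discriminability(spectra, labels):
    """Hypothetical Fisher-style score per bin, normalized to sum to one.

    spectra: (N, F) spectral magnitudes; labels: (N,) integer class ids.
    """
    overall = spectra.mean(axis=0)
    between = np.zeros(spectra.shape[1])
    within = np.zeros(spectra.shape[1])
    for c in np.unique(labels):
        group = spectra[labels == c]
        between += len(group) * (group.mean(axis=0) - overall) ** 2
        within += ((group - group.mean(axis=0)) ** 2).sum(axis=0)
    score = between / (within + 1e-8)
    return score / score.sum()

# Toy dataset: class 0 oscillates at bin 2, class 1 at bin 5, with random phase.
rng = np.random.default_rng(0)
n_per_class = 40
labels = np.repeat([0, 1], n_per_class)
freqs = np.where(labels == 0, 2, 5)[:, None]
phases = rng.uniform(0, 2 * np.pi, size=(2 * n_per_class, 1))
clips = np.sin(2 * np.pi * freqs * t / T + phases)
clips += 0.1 * rng.standard_normal(clips.shape)

curve = band_discriminability(temporal_spectrum(clips), labels)
```

In this toy setting the curve concentrates its mass on the two bins where the classes differ, which is the kind of behavior a curve like $D(f)$ is meant to expose.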