Cross-Modal Few-Shot Learning with Second-Order Neural Ordinary Differential Equations

Yi Zhang, Chun-Wun Cheng, Junyi He, Zhihai He, Carola-Bibiane Schönlieb, Yuyan Chen, Angelica I Aviles-Rivero

TL;DR

SONO introduces a cross-modal few-shot learning framework that leverages Second-Order Neural ODEs to refine visual features while a text-initialized cross-modal classifier maintains efficient learning. A text-based augmentation strategy, Text-as-Image Augmentation, enriches training by exploiting CLIP's image-text alignment, and the classifier is initialized with text embeddings from class prompts to avoid repeated text-encoder passes. Empirical results across 11 datasets and domain-shift settings demonstrate state-of-the-art performance with strong robustness and competitive efficiency, including notable gains on challenging datasets like ImageNet-A. The work shows that higher-order dynamical systems can enhance feature expressiveness for vision-language tasks, offering a practical and scalable approach for few-shot and domain-generalized cross-modal learning.
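The two text-side ideas in this summary, initializing the classifier from class-prompt embeddings and treating text embeddings as extra training samples, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: random vectors stand in for CLIP embeddings, the names `text_emb`, `image_emb`, and `predict` are hypothetical, and appending text embeddings to the training set is only one plausible reading of Text-as-Image Augmentation, not necessarily the authors' exact procedure.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

rng = np.random.default_rng(0)
K, N, D = 5, 4, 32  # toy sizes: classes, shots per class, embedding dimension

# Stand-ins for CLIP outputs (assumption: real embeddings would come from
# the frozen CLIP text and image encoders, computed once and cached).
text_emb = l2_normalize(rng.normal(size=(K, D)))          # one prompt per class
labels = np.repeat(np.arange(K), N)                       # N shots per class
image_emb = l2_normalize(text_emb[labels] + 0.3 * rng.normal(size=(K * N, D)))

# Classifier initialization: each class weight starts at its text embedding,
# so the text encoder is not needed again during training.
W = text_emb.copy()

# Text-as-image augmentation (assumed form): treat each class's text
# embedding as one more labeled sample in the shared embedding space.
train_x = np.concatenate([image_emb, text_emb], axis=0)
train_y = np.concatenate([labels, np.arange(K)])

def predict(weights, feats, scale=100.0):
    """Cosine-similarity logits followed by argmax over classes."""
    logits = scale * feats @ weights.T
    return logits.argmax(axis=1)
```

Before any training, `predict(W, image_emb)` behaves like a zero-shot classifier on the toy features; fine-tuning `W` on `train_x` and `train_y` would then adapt it to the few-shot data.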

Abstract

We introduce SONO, a novel method leveraging Second-Order Neural Ordinary Differential Equations (Second-Order NODEs) to enhance cross-modal few-shot learning. By employing a simple yet effective architecture consisting of a Second-Order NODEs model paired with a cross-modal classifier, SONO addresses the significant challenge of overfitting, which is common in few-shot scenarios due to limited training examples. Our second-order approach can approximate a broader class of functions, enhancing the model's expressive power and feature generalization capabilities. We initialize our cross-modal classifier with text embeddings derived from class-relevant prompts, streamlining training efficiency by avoiding the need for frequent text encoder processing. Additionally, we utilize text-based image augmentation, exploiting CLIP's robust image-text correlation to enrich training data significantly. Extensive experiments across multiple datasets demonstrate that SONO outperforms existing state-of-the-art methods in few-shot learning performance.
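To make the second-order idea concrete: a second-order ODE $x'' = f(x, x')$ can be rewritten as the coupled first-order system $x' = v$, $v' = f(x, v)$ and integrated step by step. The sketch below does this for a batch of feature vectors with explicit Euler steps. It is an illustration under stated assumptions, not the paper's architecture: the two-layer tanh network, its random weights, the zero initial velocity, and the names `make_dynamics` and `integrate_second_order` are all hypothetical.

```python
import numpy as np

def make_dynamics(dim, hidden, rng):
    """Build a small random MLP f(x, v) that returns an acceleration (assumed form)."""
    W1 = rng.normal(scale=0.1, size=(2 * dim, hidden))
    W2 = rng.normal(scale=0.1, size=(hidden, dim))

    def f(x, v):
        h = np.tanh(np.concatenate([x, v], axis=-1) @ W1)
        return h @ W2

    return f

def integrate_second_order(f, x0, v0, t1=1.0, steps=10):
    """Integrate x'' = f(x, x') from t=0 to t=t1 as the system x' = v, v' = f(x, v)."""
    x, v = x0.copy(), v0.copy()
    h = t1 / steps
    for _ in range(steps):
        a = f(x, v)
        # Explicit Euler: both updates use the state from the start of the step.
        x, v = x + h * v, v + h * a
    return x, v

rng = np.random.default_rng(0)
feats = rng.normal(size=(3, 8))            # toy batch of visual features
f = make_dynamics(dim=8, hidden=16, rng=rng)
refined, velocity = integrate_second_order(f, feats, np.zeros_like(feats))
```

The refined features keep the input shape, so they can be passed directly to a downstream classifier. In a trainable version the weights of `f` would be learned by backpropagating through the solver.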

Paper Structure

This paper contains 24 sections, 5 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Comparison between (a) Zero-shot CLIP (Radford et al., 2021), (b) CoOp (Zhou et al., 2022), (c) Tip-Adapter-F (Zhang et al., 2022), and (d) our proposed SONO, where $S^{(a)}_\theta$ represents the Second-Order NODE model.
  • Figure 2: An overview of our method for $K$-class $N$-shot classification. Subfigure (a) illustrates the text-as-image data augmentation process. Subfigure (b) presents the overall architecture of our proposed SONO, consisting of a Second-Order NODEs model $S^{(a)}_{\theta}$ and a cross-modal classifier, which is initialized with text embeddings derived from prompts containing class labels.
  • Figure 3: Few-shot classification accuracy (1-/2-/4-/8-/16-shot) on 11 benchmark datasets. The top-left panel shows the accuracy averaged over the 11 datasets.
  • Figure 4: Ablation results on various ODE solvers: Fourth-Order Runge-Kutta (RK4), Euler, Explicit Adams-Bashforth (AB), and Implicit Adams-Bashforth-Moulton (ABM) methods.
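The solver ablation in Figure 4 contrasts methods of different accuracy orders. As a rough illustration of why the choice can matter, the sketch below integrates a toy second-order ODE, $x'' = -x$ with $x(0) = 1$ and $x'(0) = 0$ (exact solution $\cos t$), using explicit Euler and classical RK4 at the same step size. The toy problem and the function names are assumptions chosen for illustration; the Adams-Bashforth and Adams-Bashforth-Moulton variants are omitted, and nothing here reproduces the paper's results.

```python
import math

def deriv(state):
    """Right-hand side of x'' = -x written as the system x' = v, v' = -x."""
    x, v = state
    return (v, -x)

def euler_step(state, h):
    """One explicit Euler step (first-order accurate)."""
    dx, dv = deriv(state)
    return (state[0] + h * dx, state[1] + h * dv)

def rk4_step(state, h):
    """One classical fourth-order Runge-Kutta step."""
    def shifted(s, k, c):
        return (s[0] + c * k[0], s[1] + c * k[1])

    k1 = deriv(state)
    k2 = deriv(shifted(state, k1, h / 2))
    k3 = deriv(shifted(state, k2, h / 2))
    k4 = deriv(shifted(state, k3, h))
    return (
        state[0] + h / 6 * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]),
        state[1] + h / 6 * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]),
    )

def solve(step, t1=1.0, steps=10):
    """Integrate from t=0 to t=t1 starting at x=1, v=0."""
    state = (1.0, 0.0)
    h = t1 / steps
    for _ in range(steps):
        state = step(state, h)
    return state

# Absolute error in x(1) against the exact solution cos(1).
euler_err = abs(solve(euler_step)[0] - math.cos(1.0))
rk4_err = abs(solve(rk4_step)[0] - math.cos(1.0))
```

With the same ten steps, RK4 lands much closer to the exact value than Euler on this problem, at the cost of four derivative evaluations per step instead of one.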