NODE-Adapter: Neural Ordinary Differential Equations for Better Vision-Language Reasoning

Yi Zhang, Chun-Wun Cheng, Ke Yu, Zhihai He, Carola-Bibiane Schönlieb, Angelica I. Aviles-Rivero

TL;DR

NODE-Adapter introduces Neural Ordinary Differential Equations to refine cross-modal prototypes for vision-language reasoning in few-shot settings. It first constructs a cross-modal prototype by adaptively blending CLIP-based textual and visual prototypes, then optimizes this prototype as a continuous-time gradient flow via a Neural ODE, guided by a gradient-flow estimation module. The approach achieves state-of-the-art results in few-shot classification, domain generalization, and HOI visual reasoning while remaining parameter-efficient and computationally lean due to adjoint-based backpropagation and an efficient ODE solver. This continuous-depth refinement enables more accurate prototypes under data scarcity, improving downstream decision boundaries without heavy retraining of the entire VLM. The work underscores the potential of continuous-time dynamics for robust cross-modal adaptation and paves the way for extending Neural ODEs to broader vision-language tasks.

Abstract

In this paper, we consider the problem of prototype-based vision-language reasoning. We observe that existing methods encounter three major challenges: 1) escalating resource demands and prolonged training times, 2) excessive learnable parameters, and 3) fine-tuning based on only a single modality. These challenges hinder their capability to adapt Vision-Language Models (VLMs) to downstream tasks. Motivated by this observation, we propose a novel method called NODE-Adapter, which utilizes Neural Ordinary Differential Equations for better vision-language reasoning. To fully leverage both visual and textual modalities and estimate class prototypes more effectively and accurately, we divide our method into two stages: cross-modal prototype construction and cross-modal prototype optimization using Neural ODEs. Specifically, we exploit a VLM to encode hand-crafted prompts into textual features and few-shot support images into visual features. Then, we estimate the textual prototype and visual prototype by averaging the textual features and visual features, respectively, and adaptively combine the two to construct the cross-modal prototype. To alleviate the prototype bias, we then model the prototype optimization process as an initial value problem with Neural ODEs to estimate the continuous gradient flow. Our extensive experimental results, which cover few-shot classification, domain generalization, and visual reasoning on human-object interaction, demonstrate that the proposed method significantly outperforms existing state-of-the-art approaches.
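
The first stage described above (average the features per modality, then blend them adaptively) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the sigmoid-gated convex combination, the array shapes, and the names `cross_modal_prototypes`, `classify`, and `u` as a vector of blending logits are all hypothetical, and random arrays stand in for CLIP features.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products act as cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def cross_modal_prototypes(text_feats, image_feats, u):
    """Blend per-class textual and visual prototypes.

    text_feats:  (C, P, D) features of P hand-crafted prompts per class
    image_feats: (C, K, D) features of K few-shot support images per class
    u:           (D,) blending logits (assumed parameterization, learnable in practice)
    """
    p_t = l2_normalize(text_feats.mean(axis=1))   # textual prototype per class
    p_v = l2_normalize(image_feats.mean(axis=1))  # visual prototype per class
    gate = 1.0 / (1.0 + np.exp(-u))               # sigmoid keeps weights in (0, 1)
    return l2_normalize(gate * p_t + (1.0 - gate) * p_v)

def classify(query_feats, prototypes):
    """Assign each query to the prototype with the highest cosine similarity."""
    logits = l2_normalize(query_feats) @ prototypes.T
    return logits.argmax(axis=1)

# Random stand-ins for encoder outputs: 3 classes, 4 prompts, 16 shots, 8 dims.
rng = np.random.default_rng(0)
C, P, K, D = 3, 4, 16, 8
text_feats = rng.normal(size=(C, P, D))
image_feats = rng.normal(size=(C, K, D))
u = np.zeros(D)  # zero logits give an equal blend of both modalities

protos = cross_modal_prototypes(text_feats, image_feats, u)
preds = classify(protos, protos)  # each prototype should match its own class
```

In this sketch only `u` would be trained, which is one way to keep the number of learnable parameters small; the resulting prototypes then serve as the initial value for the second stage.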

Paper Structure (35 sections, 13 equations, 7 figures, 7 tables)

This paper contains 35 sections, 13 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Classes exhibit distinct visual and textual feature spaces. Images from different classes may share similar visual features but differ in textual features; conversely, images from the same class may show diverse visual features. Our goal is to leverage both modalities to enhance performance in few-shot classification scenarios.
  • Figure 2: Illustration of prototype rectification. $\mathbf{q}_i$ and $\mathbf{q}_k$ are query samples of classes $i$ and $k$, respectively. For classes $i$, $j$, and $k$, $\{\mathbf{p}_{v,i},\mathbf{p}_{v,j},\mathbf{p}_{v,k}\}$ are the visual prototypes, $\{\mathbf{p}_{t,i},\mathbf{p}_{t,j},\mathbf{p}_{t,k}\}$ are the textual prototypes, and $\{\mathbf{p}_i,\mathbf{p}_j,\mathbf{p}_k\}$ are the cross-modal prototypes. (a) Initially, the two query samples are misclassified. (b) The cross-modal prototypes correct the classification of $\mathbf{q}_i$. (c) The cross-modal prototypes are further rectified by the Neural ODE. Hence, in (d), both query samples are correctly classified at time $T$.
  • Figure 3: An overview of our NODE-Adapter. We first leverage the powerful aligning capability of CLIP to obtain the primitive textual and visual class prototypes. To exploit both modalities, we utilize a learnable vector $\mathbf{u}$ to conditionally combine the prototypes into the initial value for the ordinary differential equation. Then, we apply Neural ODEs to estimate the gradient and solve the initial value problem with an ODE solver; the solution serves as the optimal prototype used to form the final prediction.
  • Figure 4: Structure of our Neural ODEs. With a gradient estimator and a weight generator, $f_\theta$ can adaptively capture the prototype dynamics to perform accurate rectifications.
  • Figure 5: Classification performance comparison on few-shot learning (on ResNet-50), i.e., 1-/2-/4-/8-/16-shot, on 11 benchmark datasets. The top-left is the averaged accuracy over the 11 datasets.
  • ...and 2 more figures
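
The rectification step in the captions of Figures 2 to 4 treats the prototype as the state of an initial value problem, $\mathrm{d}\mathbf{p}/\mathrm{d}t = f_\theta(\mathbf{p}, t)$ with $\mathbf{p}(0)$ set to the cross-modal prototype, integrated up to time $T$. The sketch below illustrates only that idea, under loud assumptions: the hand-written field `gradient_flow` (the negative gradient of a squared distance to a support mean) is a stand-in for the learned $f_\theta$ with its gradient estimator and weight generator, and the fixed-step RK4 routine `rk4_solve` is a hypothetical substitute for whichever ODE solver the paper uses.

```python
import numpy as np

def gradient_flow(p, t, support_mean):
    """Stand-in for the learned f_theta: negative gradient of 0.5 * ||p - m||^2.

    The paper learns this field; a closed form is used here only so the
    result can be checked against the analytic solution.
    """
    return -(p - support_mean)

def rk4_solve(f, p0, t0, t1, steps, *args):
    """Integrate dp/dt = f(p, t, *args) from t0 to t1 with classical RK4."""
    p = np.array(p0, dtype=float)
    h = (t1 - t0) / steps
    t = t0
    for _ in range(steps):
        k1 = f(p, t, *args)
        k2 = f(p + 0.5 * h * k1, t + 0.5 * h, *args)
        k3 = f(p + 0.5 * h * k2, t + 0.5 * h, *args)
        k4 = f(p + h * k3, t + h, *args)
        p = p + (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
        t += h
    return p

# Two toy 2-D prototypes (initial values) and the targets they drift toward.
p0 = np.array([[1.0, 0.0], [0.0, 1.0]])
m = np.array([[0.5, 0.5], [-0.5, 0.5]])
T = 1.0

pT = rk4_solve(gradient_flow, p0, 0.0, T, 20, m)

# For this linear field the exact solution is m + (p0 - m) * exp(-T).
expected = m + (p0 - m) * np.exp(-T)
```

With a learned field, the same loop would be driven by a small network, and gradients with respect to its parameters could be obtained with the adjoint method rather than by storing every intermediate solver state.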