VT-FSL: Bridging Vision and Text with LLMs for Few-Shot Learning

Wenhao Li, Qiangchang Wang, Xianjing Meng, Zhibin Wu, Yilong Yin

TL;DR

VT-FSL tackles semantic grounding in $N$-way $K$-shot learning. Cross-modal Iterative Prompting (CIP) conditions an LLM on class names and support images to produce visually grounded class descriptions in a single structured pass, and these descriptions drive the synthesis of semantically consistent images that augment the support data. Cross-modal Geometric Alignment (CGA) then jointly aligns the textual, support, and synthetic visual embeddings in a reproducing kernel Hilbert space (RKHS) by minimizing the volume $\mathrm{Vol}_{\mathcal{H}}$ of the parallelotope they span, computed from the kernel Gram matrix. The approach achieves state-of-the-art results across ten benchmarks spanning standard, cross-domain, and fine-grained FSL, with ablations confirming the complementary benefits of textual/visual prompts and nonlinear kernelized alignment. The results illustrate the practical value of combining LLM-driven semantics with geometry-aware multimodal fusion for data-efficient generalization in vision tasks.

Abstract

Few-shot learning (FSL) aims to recognize novel concepts from only a few labeled support samples. Recent studies enhance support features by incorporating additional semantic information or designing complex semantic fusion modules. However, they still suffer from hallucinating semantics that contradict the visual evidence due to the lack of grounding in actual instances, resulting in noisy guidance and costly corrections. To address these issues, we propose a novel framework, bridging Vision and Text with LLMs for Few-Shot Learning (VT-FSL), which constructs precise cross-modal prompts conditioned on Large Language Models (LLMs) and support images, seamlessly integrating them through a geometry-aware alignment. It mainly consists of Cross-modal Iterative Prompting (CIP) and Cross-modal Geometric Alignment (CGA). Specifically, the CIP conditions an LLM on both class names and support images to generate precise class descriptions iteratively in a single structured reasoning pass. These descriptions not only enrich the semantic understanding of novel classes but also enable the zero-shot synthesis of semantically consistent images. The descriptions and synthetic images act respectively as complementary textual and visual prompts, providing high-level class semantics and low-level intra-class diversity to compensate for limited support data. Furthermore, the CGA jointly aligns the fused textual, support, and synthetic visual representations by minimizing the kernelized volume of the 3-dimensional parallelotope they span. It captures global and nonlinear relationships among all representations, enabling structured and consistent multimodal integration. The proposed VT-FSL method establishes new state-of-the-art performance across ten diverse benchmarks, including standard, cross-domain, and fine-grained few-shot learning scenarios. Code is available at https://github.com/peacelwh/VT-FSL.
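The volume that CGA minimizes can be illustrated with a small numerical sketch. It relies on the standard identity that the volume of a parallelotope equals the square root of the determinant of its Gram matrix, here applied to kernel evaluations. The Gaussian kernel, the `gamma` bandwidth, and the function names `rbf_kernel` and `kernel_volume` are assumptions made for illustration; the abstract does not specify which kernel or loss form VT-FSL uses.

```python
import numpy as np


def rbf_kernel(a, b, gamma=1.0):
    """Gaussian (RBF) kernel between two 1-D vectors.

    Assumption: the kernel choice and gamma are illustrative only.
    """
    diff = a - b
    return float(np.exp(-gamma * np.dot(diff, diff)))


def kernel_volume(z_t, z_s, z_v, kernel=rbf_kernel):
    """Volume of the 3-dimensional parallelotope spanned by three embeddings
    in the kernel feature space, computed as sqrt(det(G)) where
    G[i, j] = kernel(z_i, z_j) is the 3x3 Gram matrix.
    """
    vectors = [z_t, z_s, z_v]
    gram = np.array([[kernel(a, b) for b in vectors] for a in vectors])
    det = np.linalg.det(gram)
    # Clamp tiny negative determinants caused by floating-point round-off.
    return float(np.sqrt(max(det, 0.0)))


rng = np.random.default_rng(0)
z_s = rng.normal(size=8)  # stand-in for a support embedding

# Poorly aligned case: text and synthetic-image embeddings are unrelated.
vol_far = kernel_volume(rng.normal(size=8), z_s, rng.normal(size=8))

# Well aligned case: both embeddings sit close to the support embedding.
z_t_near = z_s + 0.01 * rng.normal(size=8)
z_v_near = z_s + 0.01 * rng.normal(size=8)
vol_near = kernel_volume(z_t_near, z_s, z_v_near)

print(f"volume (far apart): {vol_far:.4f}")
print(f"volume (aligned):   {vol_near:.4f}")
```

In this sketch the volume shrinks toward zero as the three embeddings approach one another, which matches the intuition that a smaller volume indicates better alignment. Because a single determinant couples all three embeddings, the quantity reflects their joint configuration rather than only pairwise similarities.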

Paper Structure

This paper contains 40 sections, 29 equations, 10 figures, 11 tables.

Figures (10)

  • Figure 1: Illustration of the VT-FSL intuition. Left: The generated text and synthetic images provide high-level class semantics and low-level sample diversity. Right: By minimizing the volume of the $3$-dimensional parallelotope spanned by all embeddings, they lie closer, indicating better alignment.
  • Figure 2: Overview of the proposed VT-FSL framework. First, given both class names and support images, CIP guides an LLM to generate precise descriptions via four structured stages. Semantically consistent synthetic images are then generated from these descriptions to expand the limited data. A feature extractor consisting of multiple Transformer blocks with shared weights encodes them into features $Z_v$. Next, the textual features $Z_t$ encoded by CLIP are injected into the support features $z_s$ via a two-layer MLP, yielding the enhanced support embeddings $Z_s$. Finally, $Z_s$, $Z_t$, and $Z_v$ are jointly aligned through CGA, enabling global and nonlinear cross-modal interactions.
  • Figure 3: Illustration of Cross-modal Iterative Prompting (CIP). Given the class name and 5-shot support samples, CIP exploits the LLM through four structured reasoning stages: Strategy, Perception, Refinement, and Conclusion, to generate class-specific, precise class descriptions.
  • Figure 4: Illustration of visual synthetic images in the 1-shot task.
  • Figure 5: Comparison of textual semantics from the class name, the definition used by SemFew [zhang2024simple], and ours.
  • ...and 5 more figures