PAND: Prompt-Aware Neighborhood Distillation for Lightweight Fine-Grained Visual Classification

Qiuming Luo, Yuebing Li, Feng Li, Chang Kong

TL;DR

Distilling Vision-Language Models (VLMs) for Fine-Grained Visual Classification (FGVC) is hampered by fixed prompts and global alignment. PAND is a two-stage approach: Stage-PSC (Prompt-Aware Semantic Calibration) first learns task-adaptive semantic anchors, and Stage-NSD (Neighborhood-Aware Structural Distillation) then transfers the teacher's local neighborhood structure to the student, so that lightweight models inherit fine-grained discrimination. The framework extends neighborhood-based distillation to the vision-language setting and reports state-of-the-art results on four FGVC benchmarks, notably improving ResNet-18 on CUB-200 by 3.4% over VL2Lite. By decoupling semantic calibration from structural transfer, the work targets practical deployment of VLM capabilities on resource-constrained devices.

Abstract

Distilling knowledge from large Vision-Language Models (VLMs) into lightweight networks is crucial yet challenging in Fine-Grained Visual Classification (FGVC), due to the reliance on fixed prompts and global alignment. To address this, we propose PAND (Prompt-Aware Neighborhood Distillation), a two-stage framework that decouples semantic calibration from structural transfer. First, we incorporate Prompt-Aware Semantic Calibration to generate adaptive semantic anchors. Second, we introduce a neighborhood-aware structural distillation strategy to constrain the student's local decision structure. PAND consistently outperforms state-of-the-art methods on four FGVC benchmarks. Notably, our ResNet-18 student achieves 76.09% accuracy on CUB-200, surpassing the strong baseline VL2Lite by 3.4%. Code is available at https://github.com/LLLVTA/PAND.

Paper Structure

This paper contains 19 sections, 9 equations, 3 figures, and 2 tables.

Figures (3)

  • Figure 1: The overall framework of PAND. The training is decoupled into two stages. Stage-PSC: We learn task-specific context tokens to generate calibrated text features (semantic anchors) while keeping the VLM encoders frozen. Stage-NSD: Using the learned text features as a fixed classifier for the teacher, we train the lightweight student. The student is supervised by the VL2Lite base loss [jang2025] and our proposed Neighborhood-Aware Structural Distillation, which aligns the local decision structures of the student with those of the teacher.
  • Figure 2: Sensitivity analysis of the NSD weight $\lambda_{NSD}$ on CUB-200 with ResNet-18.
  • Figure 3: t-SNE visualization of feature distributions. (a) MobileNet-V2 on FGVC-Aircraft. (b) ResNet-18 on CUB-200. Each subplot compares w/o KD, VL2Lite, and our method. Different colors indicate different categories.
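
The Figure 1 caption describes Stage-NSD as aligning the student's local decision structure with the teacher's. The paper's actual loss is not reproduced on this page, so the following is only a minimal sketch of one generic way such a neighborhood-preserving term could look, not PAND's formulation. The function names, the choice of cosine similarity, the top-k neighbor selection, the temperature `tau`, and the KL divergence are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    # Cosine similarity between two feature vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-12)

def softmax(xs, tau):
    # Temperature-scaled softmax, shifted by the max for numerical stability.
    m = max(xs)
    exps = [math.exp((x - m) / tau) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def neighborhood_kl(teacher, student, k=2, tau=0.1):
    # Hypothetical neighborhood loss (an assumption, not the paper's equation):
    # for each sample, take its k nearest neighbors in TEACHER space, then
    # compare the teacher's and student's similarity distributions over
    # those same neighbors with KL(teacher || student).
    n = len(teacher)
    total = 0.0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        t_sim = {j: cosine(teacher[i], teacher[j]) for j in others}
        nbrs = sorted(others, key=lambda j: t_sim[j], reverse=True)[:k]
        p = softmax([t_sim[j] for j in nbrs], tau)
        q = softmax([cosine(student[i], student[j]) for j in nbrs], tau)
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return total / n

# Toy batch: samples 0 and 1 are close, as are samples 2 and 3.
teacher = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
aligned_student = [row[:] for row in teacher]
scrambled_student = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]

loss_aligned = neighborhood_kl(teacher, aligned_student)
loss_scrambled = neighborhood_kl(teacher, scrambled_student)
```

Because similarities are computed separately within each feature space, the teacher and student embeddings in this sketch need not share a dimension. A student that reproduces the teacher's neighborhoods yields zero loss, while one that rearranges them is penalized.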