Self-Empowering VLMs: Achieving Hierarchical Consistency via Self-Elicited Knowledge Distillation

Wei Yang, Yiran Zhu, Zilin Li, Xunjia Zhang, Hongtao Wang

TL;DR

This work finds that vision-language models struggle with hierarchical VQA primarily because they fail to maintain cross-level state, not because they lack taxonomic knowledge. It introduces Self-Elicited Knowledge Distillation (SEKD), in which the same VLM serves as both a conditioned-step teacher and a single-pass student: the student distills the teacher's hard labels, soft distributions, and decoder hidden states to learn dependency-aware, multi-step reasoning. SEKD achieves substantial gains in hierarchical consistency (HCA) and related metrics, generalizes to unseen taxonomies (e.g., Food-101), and transfers to non-hierarchical reasoning benchmarks, while preserving efficiency, requiring no external supervision, and avoiding catastrophic forgetting. In practice, SEKD lets compact VLMs internalize structured reasoning, offering a scalable route to path-consistent VQA across diverse taxonomies and datasets.

Abstract

Vision-language models (VLMs) possess rich knowledge but often fail on hierarchical understanding tasks, where the goal is to predict a coarse-to-fine taxonomy path that remains consistent across all levels. We compare three inference paradigms for hierarchical VQA and find that stepwise reasoning, when conditioned on prior answers, significantly outperforms single-pass prompting. Further analysis indicates that the main limitation of current VLMs is their inability to maintain cross-level state, rather than a lack of taxonomic knowledge. Motivated by this diagnosis, we propose Self-Elicited Knowledge Distillation (SEKD), which requires no human labels or external tools: the same VLM is prompted to reason step by step and act as a teacher by exposing its hard labels, soft distributions, and decoder hidden states, while a single-pass student distills these signals. The student VLM remains efficient while approaching the accuracy of its multi-step teacher. It improves in-domain path consistency (HCA) by up to +29.50 percentage points, raises zero-shot HCA on an unseen taxonomy from 4.15% to 42.26%, and yields gains on challenging mathematical benchmarks. Because all supervision is self-elicited, SEKD scales to new taxonomies and datasets without annotation cost, providing a practical route to imbue compact VLMs with dependency-aware multi-step reasoning.
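To make the notion of path consistency concrete, the following is a minimal sketch of how a path-level metric such as HCA might be computed next to per-level accuracy. It assumes that a sample counts toward HCA only when every level of the predicted coarse-to-fine path matches the ground truth; this page does not state the exact formula, so that definition, the function names, and the toy taxonomy labels are all illustrative assumptions.

```python
from typing import List, Sequence


def hierarchical_consistent_accuracy(
    pred_paths: Sequence[Sequence[str]],
    gold_paths: Sequence[Sequence[str]],
) -> float:
    """Fraction of samples whose full predicted path matches the gold path.

    Assumed definition: one wrong level anywhere makes the whole path wrong.
    """
    if len(pred_paths) != len(gold_paths):
        raise ValueError("pred_paths and gold_paths must have the same length")
    if not gold_paths:
        return 0.0
    hits = sum(1 for p, g in zip(pred_paths, gold_paths) if list(p) == list(g))
    return hits / len(gold_paths)


def per_level_accuracy(
    pred_paths: Sequence[Sequence[str]],
    gold_paths: Sequence[Sequence[str]],
) -> List[float]:
    """Accuracy at each taxonomy level, scored independently of other levels."""
    if len(pred_paths) != len(gold_paths):
        raise ValueError("pred_paths and gold_paths must have the same length")
    if not gold_paths:
        return []
    depth = len(gold_paths[0])
    scores = []
    for level in range(depth):
        correct = sum(
            1 for p, g in zip(pred_paths, gold_paths) if p[level] == g[level]
        )
        scores.append(correct / len(gold_paths))
    return scores


# Toy example (hypothetical labels): the second prediction is right at the
# coarsest and finest levels but wrong in the middle, so its path is
# inconsistent even though two of its three levels are correct.
gold = [
    ["Animalia", "Aves", "Corvidae"],
    ["Animalia", "Mammalia", "Felidae"],
]
pred = [
    ["Animalia", "Aves", "Corvidae"],
    ["Animalia", "Aves", "Felidae"],
]

hca = hierarchical_consistent_accuracy(pred, gold)
levels = per_level_accuracy(pred, gold)
```

In this toy case the per-level accuracies stay high while the path-level score drops, which is the kind of gap between level-wise knowledge and cross-level consistency that the abstract describes.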

Paper Structure

This paper contains 46 sections, 15 equations, 6 figures, 18 tables.

Figures (6)

  • Figure 1: Overview of our hierarchical VQA formulation and Self-Elicited Knowledge Distillation (SEKD) framework.
  • Figure 2: Depth-wise per-level accuracy and conditional accuracy.
  • Figure 3: SEKD training architecture: a stepwise teacher VLM distills its intermediate hierarchical states into a single-pass student.
  • Figure 4: Illustration of the conditioned-step (teacher) and joint (student) prompts used in our hierarchical VQA tasks.
  • Figure 5: Failure-repair case study across three paradigms on iNat-Animal. Each row shows the ground-truth taxonomy and the paths from Joint Invocation, Independent Levels, and Conditioned Steps; green checks and red crosses mark correct and incorrect nodes.
  • ...and 1 more figures