Self-Empowering VLMs: Achieving Hierarchical Consistency via Self-Elicited Knowledge Distillation
Wei Yang, Yiran Zhu, Zilin Li, Xunjia Zhang, Hongtao Wang
TL;DR
This work finds that vision-language models struggle with hierarchical VQA mainly because they fail to maintain cross-level state, not because they lack taxonomic knowledge. It introduces Self-Elicited Knowledge Distillation (SEKD), in which the same VLM serves as both a stepwise teacher, conditioned on its own prior answers, and a single-pass student that distills the teacher's hard labels, soft distributions, and decoder hidden states to acquire dependency-aware, multi-step reasoning. SEKD substantially improves hierarchical consistency (HCA) and related metrics, generalizes to unseen taxonomies (e.g., Food-101), and transfers to non-hierarchical reasoning benchmarks, while preserving single-pass efficiency, requiring no external supervision, and avoiding catastrophic forgetting. In practice, SEKD lets compact VLMs internalize structured reasoning, offering a scalable path to path-consistent VQA across diverse taxonomies and datasets.
Abstract
Vision-language models (VLMs) possess rich knowledge but often fail on hierarchical understanding tasks, where the goal is to predict a coarse-to-fine taxonomy path that remains consistent across all levels. We compare three inference paradigms for hierarchical VQA and find that stepwise reasoning, when conditioned on prior answers, significantly outperforms single-pass prompting. Further analysis indicates that the main limitation of current VLMs is their inability to maintain cross-level state, rather than a lack of taxonomic knowledge. Motivated by this diagnosis, we propose Self-Elicited Knowledge Distillation (SEKD), which requires no human labels or external tools: the same VLM is prompted to reason step by step and act as a teacher by exposing its hard labels, soft distributions, and decoder hidden states, while a single-pass student distills these signals. The student VLM remains efficient while approaching the accuracy of its multi-step teacher. It improves in-domain path consistency (HCA) by up to +29.50 percentage points, raises zero-shot HCA on an unseen taxonomy from 4.15% to 42.26%, and yields gains on challenging mathematical benchmarks. Because all supervision is self-elicited, SEKD scales to new taxonomies and datasets without annotation cost, providing a practical route to imbue compact VLMs with dependency-aware multi-step reasoning.
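To make the three distillation signals and the path-consistency metric concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the hard-label term is a cross-entropy against the teacher's argmax, the soft term is a KL divergence from teacher to student, the hidden-state term is a mean squared error, and that HCA counts a sample as correct only when every level of the predicted path matches the reference. The function names, the per-level formulation, and the weights `w_hard`, `w_soft`, and `w_hidden` are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def sekd_level_loss(student_logits, teacher_logits,
                    student_hidden, teacher_hidden,
                    w_hard=1.0, w_soft=1.0, w_hidden=1.0):
    """Hypothetical per-level loss combining the three teacher signals.

    The weights and the exact form of each term are assumptions made
    for illustration; the source only names the three signals.
    """
    log_s = log_softmax(student_logits)
    log_t = log_softmax(teacher_logits)

    # Hard label: the teacher's most likely answer, used as a pseudo-label.
    hard = max(range(len(log_t)), key=log_t.__getitem__)
    ce = -log_s[hard]

    # Soft distribution: KL(teacher || student) over the answer options.
    kl = sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_t, log_s))

    # Decoder hidden state: mean squared error between the two vectors.
    mse = sum((a - b) ** 2
              for a, b in zip(student_hidden, teacher_hidden)) / len(teacher_hidden)

    return w_hard * ce + w_soft * kl + w_hidden * mse


def hca(pred_paths, gold_paths):
    """Fraction of samples whose whole coarse-to-fine path is correct
    (assumed definition of hierarchical consistency)."""
    hits = sum(1 for p, g in zip(pred_paths, gold_paths) if list(p) == list(g))
    return hits / len(gold_paths)
```

Under this assumed definition, a prediction that gets the fine-grained class right but the coarse class wrong contributes nothing to HCA, which is why a model can hold the relevant taxonomic knowledge and still score poorly when it fails to carry state across levels.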
