Revisiting Cross-Architecture Distillation: Adaptive Dual-Teacher Transfer for Lightweight Video Models

Ying Peng, Hongsen Ye, Changxin Huang, Xiping Hu, Jian Chen, Runhao Zeng

TL;DR

This work tackles cross-architecture knowledge distillation for video action recognition, where ViTs offer strong global modeling at high computational cost and lightweight CNNs lag in accuracy. It introduces a Dual-Teacher KD framework in which a heterogeneous ViT teacher and a homogeneous CNN teacher jointly guide a lightweight CNN student. The framework combines Discrepancy-Aware Teacher Weighting (DATW) with Structure Discrepancy-Aware Distillation (SDD), which consists of a Structure Discrepancy Branch (SDB) and Relational KD (RKD). The target logits are fused adaptively per sample as $\mathbf{z}_{\text{target}}(x) = \omega_{\text{CNN}}(x) \mathbf{z}_{T_\text{CNN}}(x) + \omega_{\text{ViT}}(x) \mathbf{z}_{T_\text{ViT}}(x)$. Through the training-only SDB, the student learns the residual between the ViT and CNN teacher features, $f_{\text{res}} = \phi_{\text{ViT}}(f_{T_\text{ViT}}) - \phi_{\text{CNN}}(f_{T_\text{CNN}})$, while RKD further preserves relational structure. Across HMDB51, EPIC-KITCHENS-100, and Kinetics-400, the method consistently surpasses state-of-the-art distillation methods, with a maximum accuracy gain of 5.95% on HMDB51, and in some cases exceeds the performance of both teachers while incurring no extra inference cost. These results indicate that combining both teacher types and focusing on architecture-relevant residuals yields robust, practical improvements for efficient video recognition.
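The per-sample fusion of teacher logits can be illustrated with a small sketch. The fusion formula itself comes from the summary above; everything else is an assumption made only for illustration, since the exact definitions are not given here: confidence is taken as the teacher's maximum softmax probability, discrepancy as the KL divergence from the teacher's distribution to the student's, and the two weights are obtained by a softmax over the sum of both scores. The function names are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def teacher_weights(z_cnn, z_vit, z_student):
    """Return (w_cnn, w_vit) for one sample.

    ASSUMPTION: confidence = max softmax probability of the teacher,
    discrepancy = KL(teacher || student), and the weights are a softmax
    over (confidence + discrepancy). The paper's actual scoring may differ.
    """
    p_student = softmax(z_student)
    scores = []
    for z_teacher in (z_cnn, z_vit):
        p_teacher = softmax(z_teacher)
        confidence = max(p_teacher)
        discrepancy = kl_divergence(p_teacher, p_student)
        scores.append(confidence + discrepancy)
    w_cnn, w_vit = softmax(scores)
    return w_cnn, w_vit


def fuse_logits(z_cnn, z_vit, z_student):
    """z_target = w_cnn * z_cnn + w_vit * z_vit, computed elementwise."""
    w_cnn, w_vit = teacher_weights(z_cnn, z_vit, z_student)
    return [w_cnn * a + w_vit * b for a, b in zip(z_cnn, z_vit)]


# Toy logits for a 3-class problem (made-up numbers).
z_cnn = [2.0, 0.5, -1.0]
z_vit = [0.2, 3.0, -0.5]
z_student = [1.0, 1.0, 0.0]
print(teacher_weights(z_cnn, z_vit, z_student))
print(fuse_logits(z_cnn, z_vit, z_student))
```

Because the assumed weights are non-negative and sum to one, each fused logit lies between the two teachers' logits for that class. Other normalizations would also fit the fusion formula.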

Abstract

Vision Transformers (ViTs) have achieved strong performance in video action recognition, but their high computational cost limits their practicality. Lightweight CNNs are more efficient but suffer from accuracy gaps. Cross-Architecture Knowledge Distillation (CAKD) addresses this by transferring knowledge from ViTs to CNNs, yet existing methods often struggle with architectural mismatch and overlook the value of stronger homogeneous CNN teachers. To tackle these challenges, we propose a Dual-Teacher Knowledge Distillation framework that leverages both a heterogeneous ViT teacher and a homogeneous CNN teacher to collaboratively guide a lightweight CNN student. We introduce two key components: (1) Discrepancy-Aware Teacher Weighting, which dynamically fuses the predictions from ViT and CNN teachers by assigning adaptive weights based on teacher confidence and prediction discrepancy with the student, enabling more informative and effective supervision; and (2) a Structure Discrepancy-Aware Distillation strategy, where the student learns the residual features between ViT and CNN teachers via a lightweight auxiliary branch, focusing on transferable architectural differences without mimicking all of ViT's high-dimensional patterns. Extensive experiments on benchmarks including HMDB51, EPIC-KITCHENS-100, and Kinetics-400 demonstrate that our method consistently outperforms state-of-the-art distillation approaches, achieving notable performance improvements with a maximum accuracy gain of 5.95% on HMDB51.

Paper Structure

This paper contains 32 sections, 14 equations, 2 figures, 7 tables.

Figures (2)

  • Figure 1: Motivation for our dual-teacher framework. A preliminary study shows that distillation from a weaker but structurally aligned CNN teacher can sometimes lead to better student performance than from a stronger, heterogeneous ViT teacher. This finding motivates the integration of both teacher types and highlights the critical role of architectural compatibility in cross-architecture knowledge transfer.
  • Figure 2: An overview of our dual-teacher distillation framework. Given a lightweight CNN student $S_{\text{CNN}}$, a ViT teacher $T_\text{ViT}$, and a CNN teacher $T_\text{CNN}$, we distill complementary knowledge via three components: 1) The Discrepancy-Aware Teacher Weighting mechanism measures teacher confidence $\mathcal{C}_k(x)$ and prediction discrepancy $\mathcal{D}_k(x)$ for each sample and integrates them into adaptive weights $\omega_k(x)$ for combining the teacher logits. This enables the student to prioritize informative and reliable supervision on a per-sample basis. 2) The Structure Discrepancy Branch predicts the feature residual $f_{T_\text{ViT}} - f_{T_\text{CNN}}$ using a Non-local module, enabling the student to capture ViT-specific global context cues. At inference, only $S_{\text{CNN}}$ is used, so the branch introduces no extra computational overhead. 3) Relational Knowledge Distillation transfers architecture-agnostic structural knowledge.
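
The residual target that the Structure Discrepancy Branch learns, $f_{\text{res}} = \phi_{\text{ViT}}(f_{T_\text{ViT}}) - \phi_{\text{CNN}}(f_{T_\text{CNN}})$, can be sketched on toy vectors. This is a minimal illustration under stated assumptions, not the authors' implementation: the projections $\phi$ are modeled as plain linear maps into a shared dimension, the branch output is compared to the target with a mean squared error (the loss actually used is not stated here), and the Non-local module is omitted. All function names are hypothetical.

```python
def linear(matrix, vec):
    """Apply a linear map given as a list of rows to a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def residual_target(f_vit, f_cnn, phi_vit, phi_cnn):
    """f_res = phi_vit(f_vit) - phi_cnn(f_cnn).

    ASSUMPTION: both projections are linear and map into the same
    dimension so that the subtraction is well defined.
    """
    projected_vit = linear(phi_vit, f_vit)
    projected_cnn = linear(phi_cnn, f_cnn)
    return [a - b for a, b in zip(projected_vit, projected_cnn)]


def mse(pred, target):
    """Mean squared error; ASSUMED here as the branch's training loss."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


# Toy features: a 4-dim ViT feature and a 3-dim CNN feature,
# both projected to 2 dims by selecting their first two coordinates.
f_vit = [1.0, 2.0, 3.0, 4.0]
f_cnn = [0.5, 0.5, 0.5]
phi_vit = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
phi_cnn = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

f_res = residual_target(f_vit, f_cnn, phi_vit, phi_cnn)
print(f_res)                    # [0.5, 1.5]
print(mse([0.0, 0.0], f_res))   # 1.25
```

In training, the branch's prediction would take the place of the placeholder `[0.0, 0.0]`. Since the branch is used only during training, dropping it afterwards leaves the student's inference cost unchanged, consistent with the caption.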