Efficient Temporal Sentence Grounding in Videos with Multi-Teacher Knowledge Distillation

Renjie Liang, Yiming Yang, Hui Lu, Li Li

TL;DR

This paper tackles the efficiency bottleneck in Temporal Sentence Grounding in Videos by introducing EMTM, a multi-teacher knowledge distillation framework. EMTM unifies heterogeneous teacher outputs into a single 1D span-based distribution and uses a Knowledge Aggregation Unit to adaptively fuse multiple teachers, while a Shared Encoder Strategy enables shallow layers to benefit from teacher knowledge. The approach achieves superior efficiency with dramatically reduced FLOPs and parameters while maintaining or improving accuracy across Charades-STA, ActivityNet, and TACoS, and is complemented by extensive ablations and qualitative analyses. The work suggests a practical path toward end-to-end, fast TSGV in real-world scenarios, with potential extensions to joint feature extraction and end-to-end training.
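To make the idea of a unified 1D span-based distribution concrete, the sketch below collapses a 2D proposal score map into separate start and end distributions. This is only one plausible reduction and is not taken from the paper: the function name `map_to_span_distributions`, the `score_map` layout (row = start clip, column = end clip), and the max-then-softmax rule are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def map_to_span_distributions(score_map):
    """Collapse a 2D proposal map into 1D start and end distributions.

    score_map[i][j] is a hypothetical teacher score for the span that
    starts at clip i and ends at clip j; only entries with j >= i are used.
    """
    n = len(score_map)
    # Best score of any valid span starting at clip i.
    start_logits = [max(score_map[i][j] for j in range(i, n)) for i in range(n)]
    # Best score of any valid span ending at clip j.
    end_logits = [max(score_map[i][j] for i in range(j + 1)) for j in range(n)]
    return softmax(start_logits), softmax(end_logits)


# Toy 3-clip example: the strongest proposal spans clips 0..1.
score_map = [
    [0.1, 2.0, 0.3],
    [0.0, 0.5, 0.2],
    [0.0, 0.0, 0.1],
]
p_start, p_end = map_to_span_distributions(score_map)
```

Once every teacher emits the same pair of 1D distributions, their outputs can be compared and combined position by position, whatever their internal architecture.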

Abstract

Temporal Sentence Grounding in Videos (TSGV) aims to detect the timestamps of the event described by a natural language query in an untrimmed video. This paper addresses the challenge of making TSGV models computationally efficient while maintaining high performance. Most existing approaches rely on elaborate architectures that improve accuracy through extra layers and losses, which makes them slow and heavy. Although some works have noticed this problem, they only speed up the feature fusion layers, so the gain is largely lost in the rest of the heavy network. To tackle this problem, we propose a novel efficient multi-teacher model (EMTM) based on knowledge distillation to transfer diverse knowledge from both heterogeneous and isomorphic networks. Specifically, we first unify the different outputs of the heterogeneous models into a single form. Next, a Knowledge Aggregation Unit (KAU) is built to acquire high-quality integrated soft labels from multiple teachers; it leverages multi-scale video and global query information to adaptively determine the weights of the different teachers. A Shared Encoder strategy is then proposed to solve the problem that the student's shallow layers hardly benefit from the teachers: an isomorphic teacher is trained collaboratively with the student to align their hidden states. Extensive experiments on three popular TSGV benchmarks demonstrate that our method is both effective and efficient without bells and whistles.
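As a rough illustration of how a student can learn from soft labels, the sketch below combines a hard-label cross-entropy term with a KL-divergence term toward a teacher-provided distribution over clip positions. It follows the generic knowledge-distillation recipe rather than the paper's exact objective; `student_loss`, `alpha`, and `temperature` are hypothetical names and values chosen for this example.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature (> 0)."""
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def student_loss(student_logits, soft_label, gt_index, alpha=0.5, temperature=1.0):
    """Hard-label cross-entropy plus a distillation term toward the soft label.

    alpha balances the two terms; both alpha and temperature are
    illustrative hyperparameters, not values reported in the paper.
    """
    probs = softmax(student_logits)
    ce = -math.log(probs[gt_index] + 1e-12)
    kd = kl_divergence(soft_label, softmax(student_logits, temperature))
    return (1.0 - alpha) * ce + alpha * kd


# Soft label peaked at clip 1; the ground-truth boundary is also clip 1.
soft_label = softmax([0.0, 3.0, 0.0])
loss_good = student_loss([0.0, 3.0, 0.0], soft_label, gt_index=1)
loss_bad = student_loss([3.0, 0.0, 0.0], soft_label, gt_index=1)
```

In a TSGV setting such a loss would typically be applied twice, once to the start-boundary distribution and once to the end-boundary distribution.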

Paper Structure (27 sections, 15 equations, 6 figures, 6 tables)

Figures (6)

  • Figure 1: (a) The previous FVMR and CCA methods expedite only the Fusion Layer, whereas our EMTM accelerates the Encoder, Fusion Layer, and Predictor. (b) FLOPs versus accuracy for state-of-the-art TSGV approaches on Charades-STA and ActivityNet. We report R1@0.7 for the two datasets. Our proposed EMTM achieves the best accuracy-speed balance among all the competitors.
  • Figure 2: An overview of the proposed framework. EMTM mainly consists of three components: the student model, the shared encoder, and the KAU. The shared encoder is used to align the hidden states of the student and the isomorphic teacher. The teacher model outputs are unified into 1D probability distributions, as shown on the right. The KAU then adaptively determines the importance weights of the different teachers for a specific instance, based on both the teacher and student representations. During the inference stage, only the student model is used for fast TSGV.
  • Figure 3: Illustration of Knowledge Aggregation Unit, which exploits the multi-scale information from various teachers to generate higher-quality knowledge. The final ensemble probability distribution $\widetilde{P}$ is obtained by the weighted sum from all individual branches.
  • Figure 4: Effect of the Number of Teacher Models on Charades-STA. In detail, we adopt EAMAT, EAMAT & BAN-APR, EAMAT & BAN-APR & SeqPAN, which correspond to one teacher, two teachers and three teachers respectively.
  • Figure 5: Effect of different degrees of lightweight by adjusting the hidden dimension $d$.
  • ...and 1 more figure
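The weighted sum described in the Figure 3 caption can be sketched as follows. This is a minimal illustration, assuming each teacher already outputs a 1D distribution of the same length and that per-teacher scores are normalized with a softmax. In the paper the weights come from the KAU, which uses multi-scale video and global query information; here `teacher_scores` is only a hypothetical stand-in for whatever that module computes.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_teachers(teacher_dists, teacher_scores):
    """Fuse per-teacher distributions into one ensemble distribution.

    teacher_dists:  list of distributions, one per teacher, equal length.
    teacher_scores: one unnormalized importance score per teacher
                    (hypothetical placeholder for the KAU output).
    """
    weights = softmax(teacher_scores)
    length = len(teacher_dists[0])
    fused = [
        sum(w * dist[t] for w, dist in zip(weights, teacher_dists))
        for t in range(length)
    ]
    return fused, weights


# Two teachers that disagree on the most likely clip, weighted equally.
teacher_dists = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
fused, weights = aggregate_teachers(teacher_dists, [0.0, 0.0])
```

Because the weights sum to one and every input is a valid distribution, the fused output is itself a valid distribution and can serve directly as a soft label for the student.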