DANTE-AD: Dual-Vision Attention Network for Long-Term Audio Description

Adrienne Deganutti, Simon Hadfield, Andrew Gilbert

TL;DR

DANTE-AD tackles the challenge of long-form audio description by introducing a dual-vision Transformer that fuses frame-level and scene-level visual embeddings through sequential cross-attention, enabling coherent narratives over extended videos. The frame branch leverages EVA-CLIP with Q-Former for fine-grained content while the scene branch uses Side4Video to capture global temporal dynamics; a frozen LLaMA2-7B decodes the fused representations. Training employs HowTo-AD pretraining with a 2-epoch CMD-AD fine-tuning setup and offline embedding precomputation to enable efficient single-GPU training. Empirical results on CMD-AD show improvements in traditional NLP metrics (CIDEr) and LLM-based evaluations (LLM-AD-Eval), underscoring gains in narrative depth and contextual grounding for automated audio description. This work advances accessible video storytelling by bridging frame-level detail and long-term narrative structure, with potential extensions to additional modalities and adaptive attention strategies.

Abstract

Audio Description is a narrated commentary designed to aid vision-impaired audiences in perceiving key visual elements in a video. While short-form video understanding has advanced rapidly, maintaining coherent long-term visual storytelling remains an unsolved problem. Existing methods rely solely on frame-level embeddings, effectively describing object-based content but lacking contextual information across scenes. We introduce DANTE-AD, an enhanced video description model leveraging a dual-vision Transformer-based architecture to address this gap. DANTE-AD sequentially fuses both frame- and scene-level embeddings to improve long-term contextual understanding. We propose a novel, state-of-the-art method for sequential cross-attention to achieve contextual grounding for fine-grained audio description generation. Evaluated on a broad range of key scenes from well-known movie clips, DANTE-AD outperforms existing methods across traditional NLP metrics and LLM-based evaluations.

Paper Structure

This paper contains 19 sections, 9 equations, 10 figures, 3 tables.

Figures (10)

  • Figure 1: Our DANTE-AD method extracts frame- and scene-level visual information fused via a sequential cross-attention module for both frame and scene context-aware AD over extended video sequences.
  • Figure 2: Overview of our audio description generation pipeline. The system features two primary branches: a frame-level visual branch (blue) and a scene-level visual branch (red). Ground-truth references are embedded and processed auto-regressively using a causal attention mask. Sequential fusion integrates the visual embeddings within the Dual-Vision Attention Network (purple). The fused representation is fed to our LLaMA language model and decoded into a natural language AD prediction.
  • Figure 3: We propose a sequential fusion method within the Dual-Vision Attention Network to integrate frame- and scene-level embeddings. Ground-truth word embeddings are processed using a causal self-attention mask.
  • Figure 4: Text caption length distribution of our generated descriptions (blue) compared to AutoAD-III [han2024autoad3] (grey).
  • Figure 5: Illustration of the ordering of frame-level ($F_{t-1}$) and scene-level ($S_{t-1}$) visual embeddings during sequential cross-attention within our dual-vision attention network.
  • ...and 5 more figures
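
The sequential fusion described in the captions of Figures 3 and 5 can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes single-head attention with no learned projections, layer normalisation, or feed-forward layers, and it assumes the word embeddings attend to the frame-level embeddings first and the scene-level embeddings second; the captions above do not state the actual ordering. The names sequential_fusion, attention, and causal_mask, and all array sizes, are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask=None):
    # Scaled dot-product attention; positions where mask is False are blocked.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)
    return softmax(scores) @ v

def causal_mask(n):
    # Lower-triangular mask: token i may only attend to tokens 0..i.
    return np.tril(np.ones((n, n), dtype=bool))

def sequential_fusion(words, frames, scenes):
    # Step 1: causal self-attention over the word embeddings.
    h = words + attention(words, words, words, causal_mask(len(words)))
    # Step 2 (assumed order): cross-attend to frame-level embeddings.
    h = h + attention(h, frames, frames)
    # Step 3 (assumed order): cross-attend to scene-level embeddings.
    h = h + attention(h, scenes, scenes)
    return h

rng = np.random.default_rng(0)
d = 16                               # hypothetical shared embedding width
words = rng.normal(size=(5, d))      # 5 word tokens
frames = rng.normal(size=(32, d))    # 32 frame-level tokens
scenes = rng.normal(size=(8, d))     # 8 scene-level tokens

fused = sequential_fusion(words, frames, scenes)
print(fused.shape)  # (5, 16)
```

The fused output keeps one row per word token, so it could be passed on to a language model in place of the original word embeddings. Because the only interaction between word positions is the causally masked self-attention, changing a later word leaves the fused rows of earlier words unchanged.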