MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning

Jie Lei, Liwei Wang, Yelong Shen, Dong Yu, Tamara L. Berg, Mohit Bansal

TL;DR

MART introduces a memory-augmented recurrent transformer for video paragraph captioning that aims to improve cross-sentence coherence and reduce redundancy. By maintaining a highly summarized memory state that is updated across video segments, MART lets the decoder generate context-aware sentences within a unified encoder-decoder framework, surpassing prior Transformer and LSTM-based approaches. Experiments on ActivityNet Captions and YouCookII show that MART achieves better coherence and lower repetition while maintaining relevance, and human evaluators favor the quality of its paragraphs. The work demonstrates the practical value of memory-based recurrence in multimodal sequence generation and provides a reusable framework for memory-augmented Transformers.

Abstract

Generating multi-sentence descriptions for videos is one of the most challenging captioning tasks because it requires not only visual relevance but also discourse-based coherence across the sentences in the paragraph. Towards this goal, we propose a new approach called Memory-Augmented Recurrent Transformer (MART), which uses a memory module to augment the transformer architecture. The memory module generates a highly summarized memory state from the video segments and the sentence history to better predict the next sentence (w.r.t. coreference and repetition), thus encouraging coherent paragraph generation. Extensive experiments, human evaluations, and qualitative analyses on two popular datasets, ActivityNet Captions and YouCookII, show that MART generates more coherent and less repetitive paragraph captions than baseline methods, while maintaining relevance to the input video events. All code is open-source and available at: https://github.com/jayleicn/recurrent-transformer

Paper Structure

This paper contains 30 sections, 2 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Vanilla transformer video captioning model (Zhou et al., 2018). PE denotes Positional Encoding, TE denotes token Type Embedding.
  • Figure 2: Left: Our proposed Memory-Augmented Recurrent Transformer (MART) for video paragraph captioning. Right: Transformer-XL (Dai et al., 2019) model for video paragraph captioning. Relative PE denotes Relative Positional Encoding (Dai et al., 2019). SG($\cdot$) denotes stop-gradient, $\odot$ denotes Hadamard product.
  • Figure 3: Qualitative examples. Red/bold indicates pronoun errors (inappropriate use of pronouns), blue/italic indicates repetitive patterns, underline indicates content errors. Compared to baselines, our model generates more coherent, less repetitive paragraphs while maintaining relevance.
  • Figure 4: Nearest neighbors retrieved using memory states. The top row shows the query; the 3 rows below it are the top-3 nearest neighbors.
  • Figure 5: Additional qualitative examples. Red/bold indicates pronoun errors (inappropriate use of pronouns or person mentions), blue/italic indicates repetitive patterns, underline indicates content errors. Compared to baselines, our model generates more coherent, less repetitive paragraphs while maintaining relevance.
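
The retrieval described in the Figure 4 caption (a query memory state and its top-3 nearest neighbors) can be illustrated with a minimal sketch. This is not the authors' code. It assumes each memory state has already been pooled or flattened into a single vector and that similarity is measured with cosine similarity; the page states neither. The names `cosine_similarity`, `top_k_neighbors`, `memory_states`, and `query_state`, and the toy vectors, are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_neighbors(
    query: Sequence[float],
    candidates: Sequence[Sequence[float]],
    k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (index, similarity) pairs for the k candidates most similar to the query."""
    scored = [(i, cosine_similarity(query, c)) for i, c in enumerate(candidates)]
    # Highest similarity first; Python's sort is stable, so ties keep index order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy stand-ins for pooled memory states (assumption: one vector per video segment).
memory_states = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 0.0, 1.0],
]
query_state = [1.0, 0.05, 0.0]

neighbors = top_k_neighbors(query_state, memory_states, k=3)
for index, score in neighbors:
    print(f"segment {index}: similarity {score:.3f}")
```

Cosine similarity is used here only because it ignores vector magnitude; Euclidean distance or another metric would slot into `top_k_neighbors` the same way.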