Describing Multimedia Content using Attention-based Encoder–Decoder Networks
Kyunghyun Cho, Aaron Courville, Yoshua Bengio
TL;DR
The paper addresses structured output learning in which both the input and the output are richly structured and related to each other, and proposes an attention-based encoder–decoder framework. It combines recurrent and convolutional building blocks with soft attention to compute a context vector c^t focused on the relevant parts of the input for each output token, enabling end-to-end training across tasks such as neural machine translation, image captioning, video description, and speech recognition. It also covers hard attention with variational training, location-aware variants for sequential data, and extensions to parsing, pointer networks, and memory networks, and reports that attention improves performance and yields interpretable attention maps. The result is a unifying methodology for cross-modal description and related structured prediction tasks.
Abstract
Whereas deep neural networks were first mostly used for classification tasks, they are rapidly expanding in the realm of structured output problems, where the observed target is composed of multiple random variables that have a rich joint distribution, given the input. We focus in this paper on the case where the input also has a rich structure and the input and output structures are somehow related. We describe systems that learn to attend to different places in the input, for each element of the output, for a variety of tasks: machine translation, image caption generation, video clip description and speech recognition. All these systems are based on a shared set of building blocks: gated recurrent neural networks and convolutional neural networks, along with trained attention mechanisms. We report on experimental results with these systems, showing impressively good performance and the advantage of the attention mechanism.
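To make the idea of attending to different places in the input for each output element concrete, the following is a minimal sketch of one soft-attention step in plain Python. It is an illustration under stated assumptions, not the systems' actual implementation: the names `soft_attention`, `annotations`, and `query` are hypothetical, and a dot-product score stands in for whatever learned scoring function a real system would train.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_attention(annotations, query):
    """Compute attention weights and a context vector for one output step.

    annotations: list of input annotation vectors (one per input location)
    query: vector summarizing the current decoder state

    Assumption: relevance is scored with a plain dot product for brevity;
    a trained system would typically learn this scoring function.
    """
    # Score each input location against the current decoder state.
    scores = [sum(h_d * q_d for h_d, q_d in zip(h, query)) for h in annotations]
    # Normalize scores into weights that sum to 1.
    weights = softmax(scores)
    # The context vector is the weighted average of the annotations.
    dim = len(annotations[0])
    context = [
        sum(w * h[d] for w, h in zip(weights, annotations)) for d in range(dim)
    ]
    return weights, context


# Toy input: three input locations with 2-dimensional annotations.
annotations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]
weights, context = soft_attention(annotations, query)
print(weights)
print(context)
```

Because the weights are recomputed from a new query at every output step, the context vector shifts its focus across the input as decoding proceeds. The weights themselves are what an attention map visualizes.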
