Implicit and Explicit Commonsense for Multi-sentence Video Captioning

Shih-Han Chou; James J. Little; Leonid Sigal

TL;DR

This work tackles the lack of world commonsense in multi-sentence video captioning with a framework that fuses explicit knowledge from knowledge bases, obtained via COMET (ATOMIC/ConceptNet), with implicit knowledge from a large language model (GPT-Neo) and CLIP-based signals. It presents a Transformer-based captioning model with a snippet-memory mechanism, an LLM-based decoding variant (a BLIP-2 Q-former bridging to a frozen LLM), and a new instruction-generation task built on the ALFRED dataset. The paper reports improvements of up to 57% in METEOR and 8.5% in CIDEr on ALFRED instruction generation and state-of-the-art performance on ActivityNet Captions, supported by ablations and a user study on the value of incorporating commonsense. Overall, the results indicate that explicit and implicit commonsense priors improve the quality, coherence, and discourse structure of multi-sentence video descriptions and generated instructions.

Abstract

Existing dense or paragraph video captioning approaches rely on holistic representations of videos, possibly coupled with learned object/action representations, to condition hierarchical language decoders. However, they fundamentally lack the commonsense knowledge of the world required to reason about the progression of events, causality, and even the function of certain objects within a scene. To address this limitation, we propose a novel Transformer-based video captioning model that takes into account both implicit (visuo-lingual and purely linguistic) and explicit (knowledge-base) commonsense knowledge. We show that these forms of knowledge, in isolation and in combination, enhance the quality of the produced captions. Further, inspired by imitation learning, we propose a new task of instruction generation, where the goal is to produce a set of linguistic instructions from a video demonstration of the task being performed. We formalize the task using the ALFRED dataset [54], generated in the AI2-THOR environment. While instruction generation is conceptually similar to paragraph captioning, it differs in that it exhibits stronger object persistence, as well as spatially-aware and causal sentence structure. We show that our commonsense-enhanced approach produces significant improvements on this task (up to 57% in METEOR and 8.5% in CIDEr), as well as a state-of-the-art result on more traditional video captioning on the ActivityNet Captions dataset [29].

Paper Structure (13 sections, 8 equations, 5 figures, 2 tables)

Introduction
Related Works
Approach
Commonsense Knowledge Generation
Video Captioning Transformer-based Model
Captioning Model
Loss Function
Instruction Generation
Experiments
Results on Instruction Generation
Results on Dense Video Captioning
User Study
Conclusion

Figures (5)

Figure 1: Commonsense-enhanced Multi-sentence Video Captioning. Illustration of the proposed multi-sentence video captioning that leverages implicit (sentence completion) and explicit (knowledge-base) forms of commonsense knowledge. For each video segment (snippet), we first predict the duration as well as the actions and objects. The captioning model takes these predictions along with inferred commonsense priors and hidden features, obtained from the previous snippet, to generate the caption for the current video segment.
Figure 2: Overview of the Proposed Transformer-based Video Captioning Model. The snippet selector takes the video features, commonsense knowledge features, and hidden features as inputs and predicts the duration of the snippet. The video features are then multiplied by the predicted snippet to form the snippet features. The Action-Object predictor uses the snippet features, commonsense knowledge features, and hidden features to predict the actions and objects that appear in the snippet. In the captioning model, the concatenation of the snippet features, commonsense knowledge features, hidden features, and predicted actions and objects is fed into the Transformer encoder-decoder to generate the sentence, which then passes through the implicit and explicit knowledge module to predict the commonsense knowledge for the next snippet.
Figure 3: Details of the Transformer-based Captioning Model. The Transformer encoder takes the snippet features, commonsense knowledge features, hidden features, and action-object predictions as inputs. The Snippet Memory uses "add" and "erase" operations to dynamically update the visual attention for the Transformer decoder at each decoding step. Finally, a linear layer outputs the generated sentence.
Figure 4: Details of the LLM-based Captioning Model. The LLM-based captioning model comprises a Transformer encoder, a Q-former, and a frozen LLM decoder. The Transformer encoder takes the visual features, action features, commonsense knowledge features, and hidden features as inputs, and a fully connected layer linearly projects its output to the same dimension as the Q-former. The Q-former is a trainable module that bridges the visual and language components. Finally, the frozen LLM decoder generates the instructions.
Figure 5: Qualitative Results. We show two qualitative examples from each dataset. Illustrated are captions generated by our model; for comparison, we also show results from a baseline variant without (w/o) commonsense knowledge. The generated explicit and implicit knowledge is also provided for one intermediate sentence. As shown in the figure, explicit knowledge provides relations such as AtLocation and xNeed that the model can and does leverage. Implicit knowledge includes "grab" and "for more soup", which correspond to the subsequent actions "pick up" and "rinse it off" in our generated and ground-truth captions. Red text indicates incorrect predictions. One can also see that commonsense in our model leads to less repetition across sentences, higher accuracy, and often richer descriptions.
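
The per-snippet data flow described in the Figure 2 caption can be sketched roughly as follows. This is an illustrative outline only, not the authors' implementation: the function name `caption_video`, the four callables, the fixed `num_snippets` stopping rule, the per-frame weighting used to form snippet features, and the choice to have the captioner return the updated hidden state are all assumptions made for the sketch.

```python
from typing import Any, Callable, List, Sequence

Vector = List[float]


def caption_video(
    video_features: Sequence[Vector],        # one feature vector per frame
    select_snippet: Callable[..., Sequence[float]],
    predict_actions_objects: Callable[..., Any],
    generate_caption: Callable[..., Any],
    infer_commonsense: Callable[[str], Any],
    initial_knowledge: Any,
    initial_hidden: Any,
    num_snippets: int,                       # assumption: fixed number of snippets
) -> List[str]:
    """Hypothetical loop mirroring the Figure 2 caption, one sentence per snippet."""
    knowledge, hidden = initial_knowledge, initial_hidden
    captions: List[str] = []
    for _ in range(num_snippets):
        # 1. The snippet selector predicts which part of the video to describe
        #    (assumed here to be one weight per frame).
        weights = select_snippet(video_features, knowledge, hidden)
        # 2. Video features are multiplied by the predicted snippet.
        snippet = [[w * x for x in frame]
                   for w, frame in zip(weights, video_features)]
        # 3. Actions and objects are predicted for the snippet.
        labels = predict_actions_objects(snippet, knowledge, hidden)
        # 4. The captioner produces a sentence (and, by assumption,
        #    the hidden features passed on to the next snippet).
        sentence, hidden = generate_caption(snippet, knowledge, hidden, labels)
        captions.append(sentence)
        # 5. The knowledge module infers commonsense for the next snippet
        #    from the sentence just generated.
        knowledge = infer_commonsense(sentence)
    return captions
```

In this sketch, each component is passed in as a plain callable so the control flow can be read independently of any particular network architecture; the real model's components, tensor shapes, and stopping criterion are not specified in the text above.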
