Implicit and Explicit Commonsense for Multi-sentence Video Captioning
Shih-Han Chou, James J. Little, Leonid Sigal
TL;DR
This work addresses the lack of world commonsense in multi-sentence video captioning with a framework that fuses explicit knowledge from knowledge bases (COMET with ATOMIC/ConceptNet) and implicit knowledge from large language models (GPT-Neo) and CLIP-based signals. It presents a Transformer-based captioning model with a snippet-memory mechanism, an LLM-based decoding pathway (a BLIP-2 Q-Former bridging to a frozen LLM), and a new instruction-generation task built on the ALFRED dataset. The paper reports improvements of up to 57% in METEOR and 8.5% in CIDEr on the ALFRED instruction-generation task and state-of-the-art performance on ActivityNet Captions, supported by ablations and a user study on the value of incorporating commonsense. Overall, the results indicate that explicit and implicit commonsense priors improve the quality, coherence, and discourse structure of multi-sentence video descriptions and generated instructions.
Abstract
Existing dense or paragraph video captioning approaches rely on holistic representations of videos, possibly coupled with learned object/action representations, to condition hierarchical language decoders. However, they fundamentally lack the commonsense knowledge of the world required to reason about the progression of events, causality, and even the function of certain objects within a scene. To address this limitation, we propose a novel Transformer-based video captioning model that takes into account both implicit (visuo-lingual and purely linguistic) and explicit (knowledge-base) commonsense knowledge. We show that these forms of knowledge, in isolation and in combination, enhance the quality of the produced captions. Further, inspired by imitation learning, we propose a new task of instruction generation, where the goal is to produce a set of linguistic instructions from a video demonstration of the task being performed. We formalize the task using the ALFRED dataset [54], generated using the AI2-THOR environment. While instruction generation is conceptually similar to paragraph captioning, it differs in that it exhibits stronger object persistence, as well as spatially-aware and causal sentence structure. We show that our commonsense-knowledge-enhanced approach produces significant improvements on this task (up to 57% in METEOR and 8.5% in CIDEr), as well as state-of-the-art results on more traditional video captioning on the ActivityNet Captions dataset [29].
