Look, Remember and Reason: Grounded reasoning in videos with language models

Apratim Bhattacharyya, Sunny Panchal, Mingu Lee, Reza Pourreza, Pulkit Madan, Roland Memisevic

TL;DR

The paper addresses grounded reasoning in videos by training language models to use fine-grained, low-level visual cues. It proposes Look, Remember, Reason (LRR), a framework that pairs a two-stream video encoder with a cross-attention LM backbone and is trained on surrogate tasks (object recognition, re-identification, tracking) so that reasoning is grounded in object motion and interactions. The model reports state-of-the-art results on the ACRE, Something-Else, CATER, and STAR datasets, surpassing task-specific baselines by a large margin. Grounding of this kind is intended to make multimodal reasoning in dynamic scenes more reliable, and may be a step toward generalist video understanding.

Abstract

Multi-modal language models (LM) have recently shown promising performance in high-level reasoning tasks on videos. However, existing methods still fall short in tasks like causal or compositional spatiotemporal reasoning over actions, in which model predictions need to be grounded in fine-grained low-level details, such as object motions and object interactions. In this work, we propose training an LM end-to-end on low-level surrogate tasks, including object detection, re-identification, and tracking, to endow the model with the required low-level visual capabilities. We show that a two-stream video encoder with spatiotemporal attention is effective at capturing the required static and motion-based cues in the video. By leveraging the LM's ability to perform the low-level surrogate tasks, we can cast reasoning in videos as the three-step process of Look, Remember, Reason wherein visual information is extracted using low-level visual skills step-by-step and then integrated to arrive at a final answer. We demonstrate the effectiveness of our framework on diverse visual reasoning tasks from the ACRE, CATER, Something-Else and STAR datasets. Our approach is trainable end-to-end and surpasses state-of-the-art task-specific methods across these tasks by a large margin.
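
The cross-attention idea mentioned in the TL;DR, where text-side hidden states read from visual tokens, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes single-head scaled dot-product attention with a residual connection, and the function name, weight names, and tensor shapes are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_states, visual_tokens, w_q, w_k, w_v):
    """Single-head cross-attention: text states query the visual tokens.

    text_states:   (T, d) hidden states on the language side (hypothetical shape)
    visual_tokens: (N, d) tokens from a video encoder (hypothetical shape)
    """
    q = text_states @ w_q                      # queries from text, (T, d)
    k = visual_tokens @ w_k                    # keys from video, (N, d)
    v = visual_tokens @ w_v                    # values from video, (N, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (T, N) scaled similarities
    weights = softmax(scores, axis=-1)         # each text position sums to 1 over video tokens
    # Residual connection: visual evidence is added to the text states.
    return text_states + weights @ v, weights

rng = np.random.default_rng(0)
d = 8
text_states = rng.normal(size=(4, d))          # 4 text positions
visual_tokens = rng.normal(size=(10, d))       # 10 visual tokens
w_q, w_k, w_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
out, weights = cross_attention(text_states, visual_tokens, w_q, w_k, w_v)
```

Here `out` has the same shape as `text_states`, so a layer of this kind could in principle be interleaved between self-attention layers without changing the width of the hidden states.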

Paper Structure (19 sections, 6 equations, 8 figures, 7 tables)

Figures (8)

  • Figure 1: Our Look, Remember, Reason (LRR) model 'looks' at the video frames to extract relevant low-level information, e.g., object motion and interactions, supervised with surrogate tasks like object tracking only during training. It 'remembers' the information from intermediate steps and 'reasons' using the aggregated information.
  • Figure 2: The architecture of our LRR model, highlighting the use of interleaved top-down cross-attention layers in between self-attention layers higher up in the hierarchy.
  • Figure 3: Example solutions to surrogate tasks generated by our LRR model on ACRE. Re-identified objects across context trials are underlined in the same color.
  • Figure 4: Example solutions to surrogate task tracking generated by our LRR model on Something-Else. Bounding boxes belonging to the same track are highlighted using the same color.
  • Figure 5: Example answers to the tracking surrogate task generated by our LRR model on CATER. Our LRR model is prompted with the "<track>" special token to solve the tracking surrogate task at randomly selected time-steps during training. Object tracks are over the $6 \times 6$ grid on the surface and are highlighted in color.
  • ...and 3 more figures
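
The Figure 5 caption describes object tracks over a $6 \times 6$ grid on the surface. As a rough illustration of what such a discretization could look like, the sketch below maps a continuous position to a grid cell and a flat cell index. The coordinate range (`extent=3.0`, i.e. positions in $[-3, 3]$), the row-major indexing, and the function name `grid_cell` are assumptions made for this example and are not taken from the paper.

```python
def grid_cell(x, y, extent=3.0, cells=6):
    """Map a surface position to (row, col, flat_index) on a cells x cells grid.

    Assumes the surface spans [-extent, extent] on both axes (hypothetical range).
    """
    if not (-extent <= x <= extent and -extent <= y <= extent):
        raise ValueError("position lies outside the assumed surface range")
    size = 2 * extent / cells                        # side length of one cell
    # Clamp so that points on the upper boundary fall in the last cell.
    col = min(int((x + extent) / size), cells - 1)
    row = min(int((y + extent) / size), cells - 1)
    return row, col, row * cells + col               # row-major flat index

# A track is then a sequence of cell indices, one per time-step.
positions = [(-2.5, 1.2), (0.0, 0.0), (3.0, 3.0)]
track = [grid_cell(x, y)[2] for x, y in positions]
```

With a discretization like this, a track can be written out as a short sequence of integers, which is one plausible way to express tracking answers as text tokens.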