While recognizing actions, LMMs struggle to detect core interaction events

Daniel Harari, Michael Sidorov, Liel David, Chen Shterental, Abrham Kahsay Gebreselasie, Muhammad Haris Khan

TL;DR

This work interrogates whether large multi-modal models truly ground their understanding in visual input when identifying core interaction events in video. The authors introduce the Contact-Release Interaction Dataset, extending SSv2 with over 20K annotated events across more than 10K videos and providing frame- and coordinate-level ground truth. Through prompting experiments across zero-, one-, and two-shot in-context learning regimes, with Grounding and Reasoning manipulations, the study finds that state-of-the-art LMMs can name objects and classify actions yet fail to accurately locate the exact frames or spatial regions where contact and release events occur. The results highlight a gap between high-level action recognition and low-level perceptual grounding, suggesting that current LMMs lack robust visual grounding for dynamic interactions and motivating future work on integrative perceptual-language grounding. The dataset and findings offer a benchmark to drive improvements in visual dynamic understanding for foundational models.

Abstract

Large multi-modal models (LMMs) show increasing performance in realistic visual tasks for images and, more recently, for videos. For example, given a video sequence, such models are able to describe in detail objects, the surroundings and dynamic actions. In this study, we explored the extent to which these models ground their semantic understanding in the actual visual input. Specifically, given sequences of hands interacting with objects, we asked models when and where the interaction begins or ends. For this purpose, we introduce a first-of-its-kind, large-scale dataset with more than 20K annotated interactions on videos from the Something-Something-V2 dataset. A total of 250 AMTurk human annotators labeled core interaction events, particularly when and where objects and agents become attached ('contact') or detached ('release'). We asked two LMMs (Qwen-2.5VL and GPT-4o) to locate these events in short videos, each with a single event. The results show that although the models can reliably name the target objects, identify the action and provide coherent reasoning, they consistently fail to identify the frame where the interaction begins or ends and cannot localize the event within the scene. Our findings suggest that in struggling to pinpoint the moment and location of physical contact that defines the interaction, the models lack the perceptual grounding required for deeper understanding of dynamic scenes.

Paper Structure

This paper contains 28 sections, 3 equations, 7 figures, 7 tables, 1 algorithm.

Figures (7)

  • Figure 1: Collecting human annotations for interactions using the Amazon Mechanical Turk platform. Human subjects were asked to annotate core interaction events in videos from the SSv2 dataset (Goyal et al., 2017). Shown here are example annotations for 'contact' and 'release' events, where the target object (white candle) comes in contact with a hand (left) and a surface (middle), or is detached from the hand (right). The annotations include the event type, the kind of agent-object pair and the spatiotemporal location of the event (frame and image coordinates).
  • Figure 2: A schematic flow chart of the experiments under the different In-Context-Learning (ICL) regimes (i.e., ZS, OS, TS) and modulating conditions. The blocks represent different components of intermediate procedures. Each row represents an experiment using a particular ICL regime and condition (the experiment flow is directed left to right). The CNTX block indicates an introductory prompt about the agent. The EXMP block represents a prompt of an example, including the task instruction, an input video and the correct response for this example. The RSN block indicates a prompt instructing the model to include in the response a step-by-step description of the reasoning behind the predicted answer. The GRND block represents a prompt instructing the model to describe the content of the input video and the instructing prompt. In this block, the model provides an intermediate response, prior to the main task. The TST block indicates the prompt of the main test task, including the instruction and test video (see the Experiments section for more details).
  • Figure 3: Mean accuracy vs. detection error tolerance. A detection counts as correct when the predicted frame falls within the allowed error tolerance of the true frame; an error tolerance of zero means the exact true frame was predicted. Results of Qwen-2.5VL-72B (a) and GPT-4o (b) are shown for the different ICL regimes under the "with reasoning" condition. Note that the length of all videos in the experimental dataset is 10 frames.
  • Figure 4: Example predictions of the model Qwen-2.5VL-72B. The model provides the presented chain-of-thought under the "WITH" Reasoning condition. (a) A false prediction. (b) A correct prediction. The examples show that the reasoning seems logical and realistic, but the relation to the actual video frames is often very loose. Orange and green boxes mark the true frame. A red box marks a false prediction.
  • Figure 5: Example false predictions of the model Qwen-2.5VL-72B. The model provides the presented chain-of-thought under the "WITH" Reasoning condition. The examples show that the reasoning text seems logical and realistic, but the relation to the actual video frames is often very loose. A red box marks a false prediction, while the orange box marks the true frame.
  • ...and 2 more figures
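
The tolerance-based accuracy described in the Figure 3 caption can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function names, the 0-based frame indexing, and the sample predictions are all assumptions made for the example. The only details taken from the caption are that a prediction counts as correct when it lies within the allowed error tolerance of the true frame, and that each video has 10 frames (so the largest possible error is 9).

```python
from typing import List, Sequence


def accuracy_at_tolerance(
    predicted: Sequence[int], truth: Sequence[int], tolerance: int
) -> float:
    """Fraction of videos whose predicted frame is within `tolerance`
    frames of the annotated frame (tolerance 0 = exact match)."""
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must have the same length")
    if not truth:
        raise ValueError("at least one video is required")
    hits = sum(1 for p, t in zip(predicted, truth) if abs(p - t) <= tolerance)
    return hits / len(truth)


def accuracy_curve(
    predicted: Sequence[int], truth: Sequence[int], num_frames: int = 10
) -> List[float]:
    """Accuracy for every tolerance from 0 up to num_frames - 1,
    i.e. the kind of curve plotted against error tolerance."""
    return [
        accuracy_at_tolerance(predicted, truth, k) for k in range(num_frames)
    ]


# Hypothetical frame indices for four 10-frame videos (not real results).
predicted_frames = [3, 5, 7, 0]
true_frames = [3, 4, 9, 6]

curve = accuracy_curve(predicted_frames, true_frames)
print(curve)
```

With these made-up inputs the absolute errors are 0, 1, 2 and 6 frames, so the curve rises from 0.25 at tolerance 0 to 1.0 at tolerance 6. On such short clips a wide tolerance is easy to satisfy, which is why the low-tolerance end of the curve is the informative part.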