G$^{2}$TR: Generalized Grounded Temporal Reasoning for Robot Instruction Following by Combining Large Pre-trained Models

Riya Arora, Niveditha Narendranath, Aman Tambi, Sandeep S. Zachariah, Souvik Chakraborty, Rohan Paul

TL;DR

The paper tackles robot instruction following when commands reference past human–object interactions, a setting that requires both temporal grounding and grounding of objects in the current scene. It introduces G$^{2}$TR, a three-stage, factorized pipeline that leverages pre-trained video-language and grounding models to locate the relevant interaction interval, ground the object within that interval, and propagate the grounded object's location to the present scene via semantic tracking. Key contributions include the Temporal-Parser, Event Localizer, Target Detector, and semantic-tracking cascade, a real-world dataset of 155 video–instruction pairs, and demonstrated zero-shot open-set grounding with $70.10\%$ accuracy, outperforming baselines by about $26.62\%$. This approach advances robust, open-set grounded temporal reasoning for robot instruction following, enabling more reliable manipulation in dynamically changing, cluttered environments. The work also discusses practical limitations, such as processing length constraints and single-object grounding, and identifies extensions to longer videos and multi-object scenarios as directions for future work.

Abstract

Consider the scenario where a human cleans a table and a robot observing the scene is instructed with the task "Remove the cloth using which I wiped the table". Instruction following with temporal reasoning requires the robot to identify the relevant past object interaction, ground the object of interest in the present scene, and execute the task according to the human's instruction. Directly grounding utterances that reference past interactions to objects is challenging due to the multi-hop nature of such references and the large space of object groundings in a video stream observing the robot's workspace. Our key insight is to factor the temporal reasoning task into (i) estimating the video interval associated with the event reference, (ii) performing spatial reasoning over the interaction frames to infer the intended object, and (iii) semantically tracking the object's location up to the current scene to enable future robot interactions. Our approach leverages existing large pre-trained models (which possess inherent generalization capabilities) and combines them appropriately for temporal grounding tasks. Evaluation on a video-language corpus acquired with a robot manipulator, featuring rich temporal interactions in spatially complex scenes, yields an average accuracy of 70.10%. The dataset, code, and videos are available at https://reail-iitdelhi.github.io/temporalreasoning.github.io/ .
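
The three-way factorization described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `ground_instruction`, `Grounding`, and the three callables are hypothetical, and the toy stand-ins at the bottom replace the pre-trained models (video-LLM, visual-QA model, tracker) that the paper combines. The sketch only shows how the output of each stage feeds the next.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels


@dataclass
class Grounding:
    interval: Tuple[int, int]  # inclusive frame indices of the referenced event
    box_at_event: Box          # object box at the end of that interval
    box_now: Box               # object box in the most recent frame


def ground_instruction(
    frames: Sequence[object],
    instruction: str,
    localize_event: Callable[[Sequence[object], str], Tuple[int, int]],
    detect_target: Callable[[Sequence[object], str], Box],
    track: Callable[[Sequence[object], Box], Box],
) -> Grounding:
    """Chain the three stages; every callable is a hypothetical stand-in."""
    # Stage (i): estimate the video interval the instruction refers to.
    start, end = localize_event(frames, instruction)
    if not (0 <= start <= end < len(frames)):
        raise ValueError(f"invalid interval ({start}, {end})")
    # Stage (ii): infer the intended object within the interaction frames only.
    box_at_event = detect_target(frames[start:end + 1], instruction)
    # Stage (iii): propagate the object's location to the current frame.
    box_now = track(frames[end:], box_at_event)
    return Grounding((start, end), box_at_event, box_now)


# Toy stand-ins so the sketch runs without any model (assumptions, not real APIs).
def toy_localize(frames, instruction):
    return (2, 4)


def toy_detect(interval_frames, instruction):
    return (10, 10, 50, 50)


def toy_track(frames, box):
    # Pretend the object drifts 5 pixels to the right per frame.
    dx = 5 * (len(frames) - 1)
    x0, y0, x1, y1 = box
    return (x0 + dx, y0, x1 + dx, y1)


result = ground_instruction(
    list(range(8)),
    "Remove the cloth using which I wiped the table",
    toy_localize,
    toy_detect,
    toy_track,
)
```

Passing the stages in as callables mirrors the factorized design: each stage can be swapped for a different pre-trained model without touching the other two.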

Paper Structure

This paper contains 14 sections, 8 equations, 9 figures, 1 table.

Figures (9)

  • Figure 1: Method Overview. We propose a method for following instructions involving reasoning over past interactions to determine future robot actions. We combine large pre-trained models (possessing inherent generalization capacity) for (i) temporally localising events (via video-LLM), (ii) spatially grounding intended object in interaction frames (via visual-QA aided with visual prompts) and (iii) propagating knowledge of its location (via semantic segmentation) for future robot interaction.
  • Figure 2: Pipeline Overview. Event Localization processes the input video based on a temporal query. Using the output timestamp, a frame interval is generated. The Target Detector identifies the precise object through visual prompting, and the Tracker follows it using bounding boxes. The robot then executes the action.
  • Figure 3: Grounding Propagation. Semantic tracking via a vision foundation model is used to propagate the location of the grounded object from the end of the language-referred interaction to the current world state. Tracking failure (when the object goes out of view) triggers language re-prompting to estimate where the object is and to track the containing object until the current world state.
  • Figure 4: Evaluation Corpus Bifurcations. The evaluation corpus is bifurcated for analysis as per the complexity of temporal, linguistic and spatial reasoning required for grounding. Figure shows representative examples.
  • Figure 5: Alternate Approach - Direct Temporal Visual Grounding
  • ...and 4 more figures
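
The fallback behaviour described in the Figure 3 caption (track the object, and on tracking failure switch to tracking its container) can be sketched as follows. This is an illustrative reading of the caption, not the paper's code: `propagate_grounding`, `track_step`, and `find_container` are hypothetical names, and the choice to query for the container on the last frame where the object was still tracked is an assumption.

```python
from typing import Callable, Optional, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels


def propagate_grounding(
    frames: Sequence[object],
    start_box: Box,
    track_step: Callable[[object, object, Box], Optional[Box]],
    find_container: Callable[[object, Box], Box],
) -> Tuple[Box, bool]:
    """Propagate a box from frames[0] to frames[-1].

    track_step(prev_frame, frame, box) returns the new box, or None if lost.
    find_container(frame, last_box) stands in for the language re-prompting
    step and returns the box of the object that now contains the target.
    Returns the final box and whether it refers to the container.
    """
    box = start_box
    on_container = False
    for prev, cur in zip(frames, frames[1:]):
        new_box = track_step(prev, cur, box)
        if new_box is None and not on_container:
            # Target went out of view: ask where it is and follow the container.
            box = find_container(prev, box)
            on_container = True
            new_box = track_step(prev, cur, box)
        if new_box is None:
            raise RuntimeError("tracking lost and no fallback available")
        box = new_box
    return box, on_container


# Toy stand-ins (assumptions): the object disappears from frame 3 onwards.
OBJ: Box = (0, 0, 10, 10)
CONTAINER: Box = (0, 0, 100, 100)


def toy_track_step(prev, cur, box):
    if box == OBJ and cur >= 3:
        return None
    return box


def toy_find_container(frame, last_box):
    return CONTAINER


visible = propagate_grounding([0, 1, 2], OBJ, toy_track_step, toy_find_container)
hidden = propagate_grounding([0, 1, 2, 3, 4], OBJ, toy_track_step, toy_find_container)
```

In the toy run, `visible` keeps the object's own box, while `hidden` ends on the container's box with the flag set, which is the signal a downstream planner would need to know that the target is inside another object.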