G$^{2}$TR: Generalized Grounded Temporal Reasoning for Robot Instruction Following by Combining Large Pre-trained Models
Riya Arora, Niveditha Narendranath, Aman Tambi, Sandeep S. Zachariah, Souvik Chakraborty, Rohan Paul
TL;DR
The paper tackles robot instruction following when commands reference past human–object interactions, a setting that requires both localizing the referenced event in time and grounding the object in the current scene. It introduces G$^{2}$TR, a three-stage, factorized pipeline that uses pre-trained video-language and grounding models to locate the relevant interaction interval, ground the object within that interval, and propagate the grounded object's location to the present scene via semantic tracking. Key contributions include the Temporal-Parser, Event Localizer, Target Detector, and semantic-tracking cascade; a real-world dataset of 155 video–instruction pairs; and demonstrated zero-shot, open-set grounding with $70.10\%$ accuracy, outperforming baselines by $26.62\%$. The approach advances open-set grounded temporal reasoning for robot instruction following, supporting more reliable manipulation in dynamically changing, cluttered environments. The work also discusses practical limitations, namely a limit on the length of video that can be processed and grounding restricted to a single object, and frames these as directions for extending to longer videos and multi-object scenarios.
Abstract
Consider the scenario where a human cleans a table and a robot observing the scene is instructed with the task "Remove the cloth using which I wiped the table". Instruction following with temporal reasoning requires the robot to identify the relevant past object interaction, ground the object of interest in the present scene, and execute the task according to the human's instruction. Directly grounding utterances that reference past interactions to objects is challenging due to the multi-hop nature of such references and the large space of object groundings in a video stream observing the robot's workspace. Our key insight is to factor the temporal reasoning task as (i) estimating the video interval associated with the event reference, (ii) performing spatial reasoning over the interaction frames to infer the intended object, and (iii) semantically tracking the object's location up to the current scene to enable future robot interactions. Our approach leverages existing large pre-trained models (which possess inherent generalization capabilities) and combines them appropriately for temporal grounding tasks. Evaluation on a video-language corpus acquired with a robot manipulator, featuring rich temporal interactions in spatially complex scenes, yields an average accuracy of 70.10%. The dataset, code, and videos are available at https://reail-iitdelhi.github.io/temporalreasoning.github.io/.
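The three-way factorization described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names ground_instruction, Interval, toy_localize, toy_detect, and toy_track, as well as the dict-based frame format, are hypothetical, and the toy stubs merely stand in for the pre-trained video-language, grounding, and tracking models.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels
Frame = dict  # hypothetical frame format: {"event": str | None, "objects": {label: Box}}


@dataclass
class Interval:
    start: int  # index of the first frame of the referenced interaction
    end: int    # index of the last frame (inclusive)


def ground_instruction(
    frames: Sequence[Frame],
    instruction: str,
    localize_event: Callable[[Sequence[Frame], str], Interval],
    detect_target: Callable[[Sequence[Frame], str], Tuple[int, Box]],
    track_step: Callable[[Frame, Box, Frame], Box],
) -> Box:
    """Return the target object's box in the most recent frame."""
    # (i) Estimate the video interval that the instruction refers to.
    interval = localize_event(frames, instruction)
    clip = frames[interval.start : interval.end + 1]
    # (ii) Ground the intended object using only the interaction frames.
    offset, box = detect_target(clip, instruction)
    index = interval.start + offset
    # (iii) Propagate the box frame by frame up to the current scene.
    for nxt in range(index + 1, len(frames)):
        box = track_step(frames[index], box, frames[nxt])
        index = nxt
    return box


# Toy stand-ins for the pre-trained models (assumptions for this sketch only).
def toy_localize(frames, instruction):
    hits = [i for i, f in enumerate(frames)
            if f["event"] and f["event"] in instruction]
    return Interval(hits[0], hits[-1])


def toy_detect(clip, instruction):
    for offset, frame in enumerate(clip):
        for label, box in frame["objects"].items():
            if label in instruction:
                return offset, box
    raise ValueError("no object in the clip matches the instruction")


def toy_track(prev_frame, prev_box, next_frame):
    # Identify the object by its previous box, then look it up in the next frame.
    label = next(l for l, b in prev_frame["objects"].items() if b == prev_box)
    return next_frame["objects"][label]


frames = [
    {"event": None, "objects": {"cloth": (0, 0, 10, 10), "cup": (50, 50, 60, 60)}},
    {"event": "wiped", "objects": {"cloth": (20, 20, 30, 30), "cup": (50, 50, 60, 60)}},
    {"event": "wiped", "objects": {"cloth": (25, 20, 35, 30), "cup": (50, 50, 60, 60)}},
    {"event": None, "objects": {"cloth": (70, 5, 80, 15), "cup": (50, 50, 60, 60)}},
]
instruction = "Remove the cloth using which I wiped the table"
current_box = ground_instruction(
    frames, instruction, toy_localize, toy_detect, toy_track
)
print(current_box)
```

Passing the three stages in as callables mirrors the factorized design: each stage can be swapped for a different pre-trained model without touching the others, and the detector only ever sees the frames inside the localized interval rather than the whole video.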
