Towards Unified Interactive Visual Grounding in The Wild

Jie Xu; Hanbo Zhang; Qingyi Si; Yifeng Li; Xuguang Lan; Tao Kong

TL;DR

This work presents TiO, a unified end-to-end transformer for interactive visual grounding in open-world human-robot interaction. By unifying the roles of Questioner, Oracle, and Guesser and training on a broad multi-task dataset, TiO achieves state-of-the-art results on InViG and GuessWhat?! and demonstrates strong generalization in diverse HRI scenarios. The approach is validated through comprehensive experiments, including 150 challenging human-robot interaction cases and real-robot desktop and mobile platforms, illustrating robust disambiguation and grounding with natural language. The work highlights the practical impact of integrated visual-language reasoning for robust, flexible interactive grounding in real-world robotic applications.

Abstract

Interactive visual grounding in Human-Robot Interaction (HRI) is challenging yet practical due to the inevitable ambiguity of natural language. It requires robots to disambiguate user input through active information gathering. Previous approaches often rely on predefined templates to ask disambiguation questions, which degrades performance in realistic interactive scenarios. In this paper, we propose TiO, an end-to-end system for interactive visual grounding in human-robot interaction. Benefiting from a unified formulation of visual dialogue and grounding, our method can be trained jointly on extensive public data and generalizes better to diverse and challenging open-world scenarios. In the experiments, we validate TiO on the GuessWhat?! and InViG benchmarks, setting a new state of the art by a clear margin. Moreover, we conduct HRI experiments on 150 carefully selected challenging scenes as well as on real-robot platforms. Results show that our method generalizes to diverse visual and language inputs with a high success rate. Code and demos are available at https://github.com/jxu124/TiO.
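
The disambiguation loop described above can be illustrated with a toy sketch. This is not the TiO model: a small rule-based function stands in for the transformer, objects are reduced to two hand-written attributes, and every name (`Candidate`, `unified_model`, `ground`) is hypothetical. It only shows the idea of one entry point that plays Questioner, Oracle, or Guesser depending on the instruction it receives.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A detected object; the attributes stand in for visual features."""
    name: str
    color: str
    position: str


def unified_model(instruction, candidates, history, question=None, target=None):
    """Toy stand-in for one model that switches roles by instruction."""
    # Keep candidates consistent with every (attribute, value, answer) so far.
    alive = [c for c in candidates
             if all((getattr(c, attr) == value) == answer
                    for attr, value, answer in history)]
    if instruction == "questioner":
        # Ask about the first attribute that still splits the candidates.
        for attr in ("color", "position"):
            values = sorted({getattr(c, attr) for c in alive})
            if len(values) > 1:
                return attr, values[0]
        return None  # no informative question is left
    if instruction == "oracle":
        # Simulated user: answer truthfully about the intended target.
        attr, value = question
        return getattr(target, attr) == value
    if instruction == "guesser":
        return alive
    raise ValueError(f"unknown instruction: {instruction}")


def ground(candidates, target, max_rounds=5):
    """Ask questions until at most one candidate remains, then guess."""
    history = []
    for _ in range(max_rounds):
        if len(unified_model("guesser", candidates, history)) <= 1:
            break
        question = unified_model("questioner", candidates, history)
        if question is None:
            break
        answer = unified_model("oracle", candidates, history,
                               question=question, target=target)
        history.append((*question, answer))
    remaining = unified_model("guesser", candidates, history)
    return (remaining[0] if remaining else None), history


# Hypothetical scene with three cups; the user wants the blue one on the right.
scene = [
    Candidate("cup", "red", "left"),
    Candidate("cup", "blue", "left"),
    Candidate("cup", "blue", "right"),
]
guess, dialog = ground(scene, target=scene[2])
```

In this sketch the Oracle role simulates the human user, which is also how such a loop could be exercised without a person in it; in a real interaction the answers would come from the user instead.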

Paper Structure (28 sections, 1 equation, 8 figures, 6 tables)

Introduction
Related Works
Interactive Visual Grounding
End-to-End and Unified HRI
Preliminaries
Method
TiO Network
Backbone
Vision Embedding
Text Input/Output
Training
Datasets
Unified Multi-Task Formulation
Interactive Grasping System
Experiments
...and 13 more sections

Figures (8)

Figure 1: TiO in the wild (top row) and in realistic interactive robot manipulation tasks (bottom row). Each image, along with the corresponding round of the dialog, shows that TiO can ask informative questions based on the previous dialog history and complex observations, while maintaining explainable internal states to evaluate the grounded candidates (green box) of the target.
Figure 2: Overview of TiO. Left: the TiO network, a visual-language model that interacts with humans through natural language to disambiguate their instructions. It unifies the Questioner, Oracle, and Guesser in a single transformer driven by different instructions. Right: TiO deployed on interactive manipulation robots. In our interactive manipulation system, TiO interacts with the human user to disambiguate the target and provides the target object's bounding box, which is then converted into a segmentation map using Segment Anything (Kirillov et al., 2023). Contact GraspNet (Sundermeyer et al., 2021) finally generates the best grasp based on the projected point clouds.
Figure 3: Qualitative results of different interactive visual grounding methods on our 3 HRI evaluation sets. Left: scene understanding. Middle: human understanding. Right: language understanding. The green box denotes the target object intended by the human user, and the red box denotes the prediction after interaction.
Figure 4: Examples of our evaluation benchmark for HRI experiments. Top row: Scene Understanding. Middle row: Human Understanding. Bottom row: Language Understanding.
Figure 5: Interactive visual grounding success rate of HRI on 3 evaluation sets. Our approach achieves the highest performance on the more challenging interactive scenarios.
...and 3 more figures
