Language-guided Active Sensing of Confined, Cluttered Environments via Object Rearrangement Planning

Weihan Chen, Hanwen Ren, Ahmed H. Qureshi

TL;DR

This work addresses dense perception in confined, cluttered environments by letting a robot follow natural language instructions to observe a region of interest (ROI) through object rearrangement. It introduces a unified pipeline that grounds the instruction to generate an ROI, predicts ROI information gain with ROI-ScoreNet, plans informative viewpoints using a Gaussian Mixture Model–based MPC (GMM-MPC), and relocates view-blocking objects to maximize ROI coverage, measured by $\phi(S_o, S_{ROI})$. The approach outperforms baselines in simulation in terms of success rate and efficiency, and it transfers from simulation to real cabinet-like settings, handling prompts such as 'Show me to the left of [object]' or 'Show me behind [object]'. This advances practical dense sensing in tight spaces and lays groundwork for end-to-end learning that jointly optimizes viewpoint selection and object manipulation for rapid, language-guided perception.
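The summary names the coverage measure $\phi(S_o, S_{ROI})$ without defining it. As a rough illustration only, the sketch below assumes that $S_o$ and $S_{ROI}$ are sets of voxel indices and that coverage is the fraction of ROI voxels that have been observed. The function name `roi_coverage`, the voxel representation, and the ratio form are all assumptions made for this example; the paper's actual definition may differ.

```python
from typing import Set, Tuple

# Hypothetical representation: a voxel is an integer (x, y, z) grid index.
Voxel = Tuple[int, int, int]


def roi_coverage(observed: Set[Voxel], roi: Set[Voxel]) -> float:
    """Assumed form of phi(S_o, S_ROI): share of ROI voxels already observed."""
    if not roi:
        # An empty ROI has nothing to cover; return 0.0 by convention here.
        return 0.0
    # Only observed voxels that lie inside the ROI count toward coverage.
    return len(observed & roi) / len(roi)


# Toy example: four ROI voxels, two of them observed, plus one observed
# voxel outside the ROI that does not contribute.
roi = {(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)}
observed = {(0, 0, 0), (1, 0, 0), (9, 9, 9)}
coverage = roi_coverage(observed, roi)
```

Under this assumed definition, relocating a view-blocking object can only add voxels to the observed set, so coverage never decreases as the robot rearranges the scene.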

Abstract

Language-guided active sensing is a robotics subtask in which a robot with an onboard sensor interacts efficiently with the environment via object manipulation to maximize perceptual information while following given language instructions. These tasks appear in various practical robotics applications, such as household service, search and rescue, and environment monitoring. Despite these many applications, existing works do not account for language instructions and have mainly focused on surface sensing, i.e., perceiving the environment from the outside without rearranging it for dense sensing. Therefore, in this paper, we introduce the first language-guided active sensing approach that allows users to observe specific parts of the environment via object manipulation. Our method spatially associates the environment with language instructions, determines the best camera viewpoints for perception, and then iteratively selects and relocates the best view-blocking objects to provide dense perception of the region of interest. We evaluate our method against different baseline algorithms in simulation and also demonstrate it in real-world confined, cabinet-like settings with multiple unknown objects. Our results show that the proposed method performs better across different metrics and successfully generalizes to complex real-world scenarios.

Paper Structure

This paper contains 14 sections, 4 equations, 3 figures, 1 table, 2 algorithms.

Figures (3)

  • Figure 1: The language-guided active sensing pipeline, illustrated with an example user prompt. The language linking module processes the user prompt and the initial scene to select the region of interest. Given the region of interest, the MPC-GMM viewpoint selection algorithm uses ScoreNet and a Gaussian mixture to select the optimal viewpoint for the current environment setup. After the robot moves to that viewpoint, the view-blocking objects are removed sequentially according to the blocking score formulation. Finally, the modified scene view is shown to the user. If the desired coverage is not reached, the system moves on to the next viewpoint.
  • Figure 2: Real-world demonstration of our pipeline. The prompt given to the system is "Show me behind the purple cylinder". Starting from the top-left panel, the robot first chooses a viewpoint roughly in front of the object of interest. The view-blocking object selection algorithm then identifies two objects to remove. Once both objects are removed, the desired coverage is reached and the scene is restored.
  • Figure 3: Performance of our method against various baselines under different region densities. Our method outperforms the baselines at all densities in terms of success rate and efficiency.
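
The control flow described in the Figure 1 caption (visit a viewpoint, remove view-blocking objects one at a time by blocking score, then move to the next viewpoint if coverage is still too low) can be sketched as a simple greedy loop. This is a toy illustration, not the authors' implementation: the learned components (ScoreNet, the Gaussian-mixture viewpoint selection) are replaced by a fixed, pre-ranked list of viewpoints, the blocking score is assumed to be the number of still-unseen ROI voxels an object occludes, and every name below (`View`, `visible`, `blocking_scores`, `sense_roi`) is hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple

Voxel = Tuple[int, int, int]
# Hypothetical view model: for one viewpoint, map each ROI voxel in the field
# of view to the name of the object occluding it, or None if it is visible.
View = Dict[Voxel, Optional[str]]


def visible(view: View, removed: Set[str]) -> Set[Voxel]:
    """Voxels seen from this viewpoint once the given objects are gone."""
    return {v for v, blocker in view.items()
            if blocker is None or blocker in removed}


def blocking_scores(view: View, removed: Set[str],
                    observed: Set[Voxel]) -> Dict[str, int]:
    """Assumed blocking score: unseen ROI voxels each remaining object hides."""
    scores: Dict[str, int] = {}
    for voxel, blocker in view.items():
        if blocker is not None and blocker not in removed and voxel not in observed:
            scores[blocker] = scores.get(blocker, 0) + 1
    return scores


def sense_roi(views: List[View], roi: Set[Voxel],
              target: float) -> Tuple[float, List[str]]:
    """Greedy loop over pre-ranked viewpoints; returns coverage and removals."""
    observed: Set[Voxel] = set()
    removed: Set[str] = set()
    order: List[str] = []
    coverage = 0.0
    for view in views:  # stand-in for learned viewpoint selection
        while True:
            observed |= visible(view, removed) & roi
            coverage = len(observed) / len(roi) if roi else 0.0
            scores = blocking_scores(view, removed, observed)
            if coverage >= target or not scores:
                break
            # Remove the object with the highest score (ties: alphabetical).
            worst = max(sorted(scores), key=scores.get)
            removed.add(worst)
            order.append(worst)
        if coverage >= target:
            break  # desired coverage reached; no further viewpoints needed
    return coverage, order


roi = {(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)}
view = {(0, 0, 0): None, (1, 0, 0): "box", (2, 0, 0): "box", (3, 0, 0): "can"}
result = sense_roi([view], roi, target=0.75)
```

In this toy scene, a single removal of `box` is enough to reach the 0.75 target, so `can` is left in place. The sketch treats relocation as outright removal and omits the scene restoration step mentioned in the Figure 2 caption.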