Language-guided Active Sensing of Confined, Cluttered Environments via Object Rearrangement Planning
Weihan Chen, Hanwen Ren, Ahmed H. Qureshi
TL;DR
This work addresses dense perception in confined, cluttered environments by letting a robot follow natural language instructions to observe a region of interest (ROI) through object rearrangement. It introduces a unified pipeline that links language grounding to ROI generation, predicts ROI information gain with ROI-ScoreNet, plans informative viewpoints using a Gaussian Mixture Model-based MPC (GMM-MPC), and relocates view-blocking objects to maximize ROI coverage, measured by $\phi(S_o, S_{ROI})$. The approach outperforms baselines in simulation and is validated for sim-to-real transfer in real cabinet-like settings, using prompts that ask the robot to show the region 'to the left of' or 'behind' certain objects. This advances practical dense sensing in tight spaces and lays the groundwork for end-to-end learning that jointly optimizes viewpoint selection and object manipulation for rapid, language-guided perception.
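To make the coverage measure $\phi(S_o, S_{ROI})$ concrete, the following is a minimal sketch under stated assumptions. The summary above does not define $\phi$, so the code assumes that $S_o$ is the set of observed cells, $S_{ROI}$ is the set of cells in the region of interest, and $\phi$ is the fraction of ROI cells that have been observed. The function name `roi_coverage` and the voxel-tuple representation are hypothetical and chosen only for illustration.

```python
from typing import Hashable, Set


def roi_coverage(observed: Set[Hashable], roi: Set[Hashable]) -> float:
    """Assumed form of phi(S_o, S_ROI): fraction of ROI cells that were observed.

    This is an illustrative definition, not necessarily the one used in the paper.
    """
    if not roi:
        # An empty ROI has nothing to cover; report zero coverage by convention.
        return 0.0
    return len(observed & roi) / len(roi)


# Toy example: cells are integer (x, y, z) voxel indices.
roi = {(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0)}
observed = {(0, 0, 0), (1, 0, 0), (5, 5, 5)}  # one observed cell lies outside the ROI

coverage = roi_coverage(observed, roi)
print(coverage)
```

Under this assumed definition, cells observed outside the ROI do not raise the score, so the measure rewards only views that reveal the requested region.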
Abstract
Language-guided active sensing is a robotics subtask in which a robot with an onboard sensor follows language instructions and interacts efficiently with the environment via object manipulation to maximize perceptual information. Such tasks appear in various practical robotics applications, such as household service, search and rescue, and environment monitoring. Despite these many applications, existing works do not account for language instructions and have mainly focused on surface sensing, i.e., perceiving the environment from the outside without rearranging it for dense sensing. Therefore, in this paper, we introduce the first language-guided active sensing approach that allows users to observe specific parts of the environment via object manipulation. Our method spatially associates the environment with language instructions, determines the best camera viewpoints for perception, and then iteratively selects and relocates the best view-blocking objects to provide dense perception of the region of interest. We evaluate our method against different baseline algorithms in simulation and also demonstrate it in real-world confined, cabinet-like settings with multiple unknown objects. Our results show that the proposed method performs better across different metrics and generalizes successfully to complex real-world scenarios.
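The iterative step of selecting and relocating view-blocking objects can be illustrated with a toy greedy loop. This is a sketch under strong assumptions, not the method described in the paper: the viewpoint is fixed, each object is modeled by the set of ROI cells it hides from that viewpoint, relocating an object simply removes its occlusion, and the "best" object is the one whose removal reveals the most ROI cells. All names (`visible_cells`, `greedy_relocation`, the object labels) are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Cell = Tuple[int, int, int]


def visible_cells(roi: Set[Cell], occluders: Dict[str, Set[Cell]]) -> Set[Cell]:
    """ROI cells not hidden by any remaining object (fixed viewpoint assumed)."""
    hidden = set().union(*occluders.values())
    return roi - hidden


def greedy_relocation(
    roi: Set[Cell], occluders: Dict[str, Set[Cell]], max_moves: int
) -> Tuple[List[str], float]:
    """Repeatedly relocate the object whose removal reveals the most ROI cells."""
    remaining = dict(occluders)
    order: List[str] = []
    for _ in range(max_moves):
        current = len(visible_cells(roi, remaining))
        best_name, best_gain = None, 0
        for name in sorted(remaining):  # sorted for deterministic tie-breaking
            trial = {k: v for k, v in remaining.items() if k != name}
            gain = len(visible_cells(roi, trial)) - current
            if gain > best_gain:
                best_name, best_gain = name, gain
        if best_name is None:
            break  # no single relocation improves coverage
        order.append(best_name)
        del remaining[best_name]
    coverage = len(visible_cells(roi, remaining)) / len(roi) if roi else 0.0
    return order, coverage


# Toy scene: a six-cell ROI along the x axis and three objects.
roi = {(x, 0, 0) for x in range(6)}
occluders = {
    "box": {(0, 0, 0), (1, 0, 0), (2, 0, 0)},
    "can": {(2, 0, 0), (3, 0, 0)},
    "cup": {(9, 9, 9)},  # does not block any ROI cell
}
order, coverage = greedy_relocation(roi, occluders, max_moves=2)
print(order, coverage)
```

One limitation of this simplification is that a one-step greedy rule stalls when two objects hide the same cells, since removing either one alone yields no gain; a planner that scores sequences of relocations would avoid this.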
