
MSGField: A Unified Scene Representation Integrating Motion, Semantics, and Geometry for Robotic Manipulation

Yu Sheng, Runfeng Lin, Lidian Wang, Quecheng Qiu, YanYong Zhang, Yu Zhang, Bei Hua, Jianmin Ji

Abstract

Combining accurate geometry with rich semantics has proven highly effective for language-guided robotic manipulation. Existing methods for dynamic scenes either fail to update in real time or rely on additional depth sensors for simple scene editing, limiting their applicability in the real world. In this paper, we introduce MSGField, a representation that uses a collection of 2D Gaussians for high-quality reconstruction, further enhanced with attributes to encode semantic and motion information. Specifically, we represent the motion field compactly by decomposing each primitive's motion into a combination of a limited set of motion bases. Leveraging the differentiable real-time rendering of Gaussian splatting, we can quickly optimize object motion, even for complex non-rigid motions, with image supervision from only two camera views. Additionally, we design a pipeline that utilizes object priors to efficiently obtain well-defined semantics. On our challenging dataset, which includes flexible and extremely small objects, our method achieves a success rate of 79.2% in static and 63.3% in dynamic environments for language-guided manipulation. For specified object grasping, we achieve a success rate of 90%, on par with point cloud-based methods. Code and dataset will be released at: https://shengyu724.github.io/MSGField.github.io.
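
To make the motion-basis idea in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes each motion basis is a rigid transform (a rotation matrix plus a translation) and that each primitive blends the bases linearly with its own weights, in the style of linear blend skinning; the abstract does not state the actual parameterization. The function name `blend_motion` and all array names are hypothetical.

```python
import numpy as np

def blend_motion(points, weights, rotations, translations):
    """Move primitive centres by a weighted mix of K basis transforms.

    Assumed shapes (illustrative only, not taken from the paper):
      points:       (N, 3) primitive centres
      weights:      (N, K) per-primitive coefficients, rows sum to 1
      rotations:    (K, 3, 3) rotation matrix of each motion basis
      translations: (K, 3) translation of each motion basis
    """
    # Apply every basis transform to every point -> (N, K, 3).
    moved = np.einsum("kij,nj->nki", rotations, points) + translations[None, :, :]
    # Blend the K candidate positions of each point with its weights -> (N, 3).
    return np.einsum("nk,nki->ni", weights, moved)

# Two bases: "stay still" and "shift 1 unit along x".
rotations = np.stack([np.eye(3), np.eye(3)])
translations = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
points = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
weights = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
new_points = blend_motion(points, weights, rotations, translations)
```

With K much smaller than N, the motion of the whole scene is described by K transforms plus an N-by-K weight matrix instead of one free transform per primitive, which is what makes the representation compact. Because every step is a differentiable array operation, the same structure could in principle be optimized against rendered images, as the abstract describes.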

Paper Structure

This paper contains 20 sections, 9 equations, 6 figures, 4 tables, 1 algorithm.

Figures (6)

  • Figure 1: a) We propose MSGField, a unified representation that enables robots to manipulate objects in real-world environments based on human instructions. b) The device we use for robotic manipulation in the real world.
  • Figure 2: The framework of MSGField. The geometry field is captured by surface reconstruction from 2D Gaussian Splatting. In the semantic field, each primitive is assigned a label, which links to an object feature extracted from CLIP. For the motion field, we represent scene motion with motion bases, where each primitive's motion is a combination of these bases. Given human instructions, objects are segmented via text queries, and their motion is tracked. A grasp detector then identifies executable manipulations.
  • Figure 3: a) Projection of the 2D Gaussian onto the image. b) Determining whether a Gaussian primitive falls within the mask. The gray primitive is inside. c) The gray primitives are the ones occluding the object.
  • Figure 4: a) MSGField produces accurate meshes and can generate robust grasp poses, even for a 5 mm cable tie. b) MSGField effectively optimizes non-rigid motion, recovering a toy's transition from sitting to standing.
  • Figure 5: Visualization of certain scenes from the dataset. We strongly recommend watching the supplementary video for more detailed information.
  • ...and 1 more figure
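
The semantic field described in the Figure 2 caption (one label per primitive, one feature per object, segmentation by text query) can be illustrated with a small sketch. This is a hedged illustration under stated assumptions, not the paper's pipeline: it assumes the object features and the text feature are already embedded in a shared space (for example by CLIP encoders, which are not called here), and it selects the single best-matching object by cosine similarity. The function name `select_primitives` and the toy vectors are hypothetical.

```python
import numpy as np

def select_primitives(labels, object_features, text_feature):
    """Pick the object whose feature best matches the text query.

    Assumed shapes (illustrative only):
      labels:          (N,) integer object label of each primitive
      object_features: (M, D) one embedding per object
      text_feature:    (D,) embedding of the text query
    Returns the best object index and a boolean mask over the N primitives.
    """
    # Normalise so that the dot product equals cosine similarity.
    obj = object_features / np.linalg.norm(object_features, axis=1, keepdims=True)
    txt = text_feature / np.linalg.norm(text_feature)
    scores = obj @ txt                      # (M,) similarity per object
    best = int(np.argmax(scores))
    return best, labels == best

# Toy 2-D embeddings standing in for real CLIP features.
object_features = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
labels = np.array([0, 1, 0, 2])
best, mask = select_primitives(labels, object_features, np.array([0.9, 0.1]))
```

Storing one feature per object rather than per primitive means a query compares against M object embeddings instead of N primitive embeddings, and the resulting mask directly identifies which primitives to track or pass to a grasp detector.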