Bridging Simulation and Reality: Cross-Domain Transfer with Semantic 2D Gaussian Splatting

Jian Tang, Pu Pang, Haowen Sun, Chengzhong Ma, Xingyu Chen, Hua Huang, Xuguang Lan

TL;DR

This work tackles the persistent sim-to-real gap in robotic manipulation by introducing Semantic 2D Gaussian Splatting (S2GS), a representation that yields object-centric, domain-invariant spatial features from multi-view data. S2GS combines 3D/2D Gaussian splatting with hierarchical semantic extraction, semantic feature rendering, and dynamic scene updating to provide clean inputs for a diffusion-based policy, significantly improving transfer from ManiSkill simulation to real UR5 robots. The approach demonstrates robust cross-domain performance, achieving high success rates in real-world manipulation tasks and outperforming RGB-based baselines and 3D Gaussian methods in both appearance fidelity and transfer reliability. The work offers a practical, real-time, and editable representation that reduces engineering effort for sim-to-real transfer and highlights future work to incorporate additional domain-invariant cues such as surface normals.

Abstract

Cross-domain transfer in robotic manipulation remains a longstanding challenge due to the significant domain gap between simulated and real-world environments. Existing methods such as domain randomization, domain adaptation, and sim-real calibration often require extensive tuning or fail to generalize to unseen scenarios. To address this issue, we observe that if the policy is trained in simulation on domain-invariant features, and the same features can be extracted and supplied as policy input during real-world deployment, the domain gap can be effectively bridged, leading to significantly improved policy generalization. Accordingly, we propose Semantic 2D Gaussian Splatting (S2GS), a novel representation method that extracts object-centric, domain-invariant spatial features. S2GS constructs multi-view 2D semantic fields and projects them into a unified 3D space via feature-level Gaussian splatting. A semantic filtering mechanism removes irrelevant background content, ensuring clean and consistent inputs for policy learning. To evaluate the effectiveness of S2GS, we adopt Diffusion Policy as the downstream learning algorithm and conduct experiments in the ManiSkill simulation environment, followed by real-world deployment. Results demonstrate that S2GS significantly improves sim-to-real transferability, maintaining high and stable task performance in real-world scenarios.
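
The semantic filtering step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the filter is computed, so everything here is an assumption made for illustration: each 3D point (for example, a Gaussian center) is taken to carry a semantic feature vector, task relevance is scored by cosine similarity to a query embedding, and points below a fixed threshold are dropped as background. The function name `semantic_filter`, the `threshold` parameter, and the toy data are hypothetical and not taken from the paper.

```python
import numpy as np

def semantic_filter(features, positions, query, threshold=0.5):
    """Keep points whose feature is cosine-similar to the query embedding.

    Hypothetical helper: an assumed stand-in for a semantic filter,
    not the paper's actual mechanism.
    """
    # Normalize per-point features and the query to unit length.
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    q = query / (np.linalg.norm(query) + 1e-8)
    # Cosine similarity of every point feature with the query.
    sim = f @ q
    # Points at or above the (assumed) threshold count as task-relevant.
    mask = sim >= threshold
    return positions[mask], mask

# Toy data: four points with 2-D features; the first two resemble the query.
features = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
positions = np.array([[0.0, 0.0, 0.0],
                      [0.1, 0.0, 0.0],
                      [1.0, 1.0, 1.0],
                      [2.0, 2.0, 2.0]])
query = np.array([1.0, 0.0])

kept, mask = semantic_filter(features, positions, query, threshold=0.8)
print(mask)        # boolean mask over the four points
print(kept.shape)  # shape of the retained positions
```

In this toy setup only the retained positions would be passed to the downstream policy, which is the sense in which the filtered input is compact and free of background content.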

Paper Structure

This paper contains 21 sections, 17 equations, 10 figures, 3 tables.

Figures (10)

  • Figure 1: Sim-to-real domain transfer with domain-invariant features. Although a significant domain gap exists between simulation and the real world, both share a common space of domain-invariant features. Training the policy in simulation on these features, and extracting the same type of features from real-world observations as the policy input, can mitigate the domain gap and improve sim-to-real generalization.
  • Figure 2: S2GS Overview. S2GS aims to extract domain-invariant spatial features to support robust cross-domain policy transfer. In the initial stage, S2GS extracts hierarchical semantic features from multi-view images and optimizes the semantic 2D Gaussian Splatting field and the decoder. During execution, a semantic retrieval module queries and retains task-relevant objects while removing background distractions. The resulting domain-invariant spatial features serve as compact and clean inputs for downstream diffusion policy learning. After manipulation, S2GS supports dynamic scene updating to keep the scene representation accurate in real time, satisfying the requirements of online robotic control.
  • Figure 3: Dynamic scene updating process. Our method tracks object motion during manipulation tasks by optimizing SE(3) transformations, keeping the scene representation accurate in real time.
  • Figure 4: Real-world results. Our method achieves high success rates in real-world manipulation tasks, demonstrating the effectiveness of our S2GS representation.
  • Figure 5: Three tasks in simulation: PickCube, PushCube, and StackCube.
  • ...and 5 more figures