Beyond Viewpoint Generalization: What Multi-View Demonstrations Offer and How to Synthesize Them for Robot Manipulation?

Boyang Cai, Qiwei Liang, Jiawei Li, Shihang Weng, Zhaoxin Zhang, Tao Lin, Xiangyu Chen, Wenjie Zhang, Jiaqi Mao, Weisheng Xu, Bin Yang, Jiaming Liang, Junhao Cai, Renjing Xu

Abstract

Do multi-view demonstrations truly improve robot manipulation, or merely enhance cross-view robustness? We present a systematic study quantifying the performance gains, scaling behavior, and underlying mechanisms of multi-view data for robot manipulation. Controlled experiments show that, under both fixed and randomized backgrounds, multi-view demonstrations consistently improve single-view policy success rates and generalization. Performance varies non-monotonically with view coverage, revealing effective regimes rather than a simple "more is better" trend. Notably, multi-view data breaks the scaling limitation of single-view datasets and continues to raise the performance ceiling after single-view training saturates. Mechanistic analysis shows that multi-view learning promotes manipulation-relevant visual representations, better aligns the action head with the learned feature distribution, and reduces overfitting. Motivated by the importance of multi-view data, its scarcity in large-scale robotic datasets, and the difficulty of collecting additional viewpoints in real-world settings, we propose RoboNVS, a geometry-aware self-supervised framework that synthesizes novel-view videos from monocular inputs. The generated data consistently improves downstream policies in both simulation and real-world environments.
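
As a point of reference for what "geometry-aware" novel-view synthesis from monocular input typically involves, the sketch below warps source-view pixels into a hypothetical novel camera pose using an estimated depth map and pinhole reprojection. It is a minimal illustration under assumed inputs (shared intrinsics K, a relative pose T_src_to_tgt, and a precomputed depth map); the function and variable names are placeholders, and this is not the RoboNVS implementation itself.

import numpy as np

def reproject_to_novel_view(depth, K, T_src_to_tgt):
    """Map every source-view pixel to its location in a novel target view.

    depth:        (H, W) per-pixel depth in the source view (e.g. from a
                  monocular depth estimator), in metres.
    K:            (3, 3) camera intrinsics, assumed shared by both views.
    T_src_to_tgt: (4, 4) rigid transform from source to target camera frame.
    Returns an (H, W, 2) array of target-view pixel coordinates.
    """
    H, W = depth.shape

    # Pixel grid in homogeneous coordinates, shape (3, H*W).
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T

    # Back-project to 3D points in the source camera frame.
    rays = np.linalg.inv(K) @ pix
    pts_src = rays * depth.reshape(1, -1)

    # Rigidly transform the points into the target camera frame.
    pts_h = np.vstack([pts_src, np.ones((1, pts_src.shape[1]))])
    pts_tgt = (T_src_to_tgt @ pts_h)[:3]

    # Perspective projection into the target image plane. Points with
    # near-zero or negative depth are clamped here and would need
    # explicit masking in a real pipeline.
    proj = K @ pts_tgt
    uv_tgt = (proj[:2] / np.clip(proj[2:], 1e-6, None)).T.reshape(H, W, 2)
    return uv_tgt

In a full synthesis pipeline, these target-view coordinates would drive image warping (e.g. bilinear sampling of the source frame), with disocclusions filled by an inpainting or generative model; how RoboNVS realizes these steps is not covered by this summary.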

Paper Structure

This paper contains 39 sections, 7 equations, 21 figures, 14 tables.

Figures (21)

  • Figure 1: Experimental setup overview. Camera layout, example multi-view observations, and the Clean/Random environment modes.
  • Figure 2: Comparison between single-view and multi-view training under clean and random settings. Multi-view training improves performance across tasks, with stronger gains under the random setting.
  • Figure 3: Scaling behavior under increasing viewpoint diversity. Both VA and VLA exhibit performance gains under moderate multi-view augmentation, while excessive view expansion does not yield monotonic improvement.
  • Figure 4: Scaling behavior of multi-view demonstrations under fixed 0$^\circ$ evaluation. Each subplot shows success rate versus demonstrations per added view ($10\!\rightarrow\!50$) for one task. The black dashed line indicates the single-view baseline trained on the original view.
  • Figure 5: Scaling laws of single-view vs. multi-view training on two manipulation tasks. Multi-view data continues to improve performance when single-view training saturates, achieving comparable or higher performance with substantially less scene diversity.
  • ...and 16 more figures