AnyCamVLA: Zero-Shot Camera Adaptation for Viewpoint Robust Vision-Language-Action Models

Hyeongjun Heo, Seungyeon Woo, Sang Min Kim, Junho Kim, Junho Lee, Yonghyeon Lee, Young Min Kim

TL;DR

This paper proposes a zero-shot camera adaptation framework that requires no additional demonstration data, policy fine-tuning, or architectural modification. It uses a recent feed-forward novel view synthesis model to render high-quality target-view images, handling changes in both extrinsic and intrinsic camera parameters.

Abstract

Despite remarkable progress in Vision-Language-Action models (VLAs) for robot manipulation, these large pre-trained models require fine-tuning to be deployed in specific environments. The fine-tuned models are highly sensitive to camera viewpoint changes, which occur frequently in unstructured environments. In this paper, we propose a zero-shot camera adaptation framework that requires no additional demonstration data, policy fine-tuning, or architectural modification. Our key idea is to virtually adjust test-time camera observations to match the training camera configuration in real time. To do so, we use a recent feed-forward novel view synthesis model that outputs high-quality target-view images and handles both extrinsic and intrinsic parameters. This plug-and-play approach preserves the pre-trained capabilities of VLAs and applies to any RGB-based policy. Through extensive experiments on the LIBERO benchmark, our method consistently outperforms baselines that use data augmentation for policy fine-tuning or additional 3D-aware features for visual input. We further validate that our approach consistently enhances viewpoint robustness in real-world robotic manipulation scenarios, including settings with varying camera extrinsics, intrinsics, and freely moving handheld cameras.
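The plug-and-play idea in the abstract can be sketched as a thin wrapper around a frozen policy. This is a minimal illustration, not the authors' implementation: `CameraConfig`, `make_view_adapted_policy`, and the `synthesize` and `policy` call signatures are all hypothetical names chosen here, and the stand-in functions at the bottom only mark where a real novel view synthesis model and a real VLA would be plugged in.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence

Image = Any  # placeholder type; a real system would use an RGB array


@dataclass(frozen=True)
class CameraConfig:
    """Hypothetical container for one camera's parameters."""
    intrinsics: Sequence[Sequence[float]]  # 3x3 pinhole matrix
    extrinsics: Sequence[Sequence[float]]  # 4x4 camera pose


# Assumed signatures (not from the paper):
#   synthesize(image, source_cam, target_cam) -> image seen from target_cam
#   policy(image, instruction) -> action
Synthesizer = Callable[[Image, CameraConfig, CameraConfig], Image]
Policy = Callable[[Image, str], Sequence[float]]


def make_view_adapted_policy(
    policy: Policy, synthesize: Synthesizer, train_cam: CameraConfig
) -> Callable[[Image, CameraConfig, str], Sequence[float]]:
    """Wrap a frozen policy so it always sees the training viewpoint."""

    def adapted(image: Image, test_cam: CameraConfig, instruction: str):
        # Re-render the test-time observation from the training camera.
        train_view = synthesize(image, test_cam, train_cam)
        # The policy itself is untouched: no fine-tuning, no new inputs.
        return policy(train_view, instruction)

    return adapted


# Stand-ins so the sketch runs end to end.
def identity_synthesizer(image, source_cam, target_cam):
    return image  # a real model would render a novel view here


def dummy_policy(image, instruction):
    return [0.0] * 7  # placeholder action; the dimension is an assumption
```

Because the wrapper only touches the image passed to the policy, the same pattern would apply to any RGB-based policy, which is the property the abstract emphasizes.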

Paper Structure (19 sections, 3 equations, 6 figures, 3 tables)


Figures (6)

  • Figure 1: We present a zero-shot camera adaptation framework for VLAs that operates in real-time under various camera configuration changes from those used during training. At each control step, images from a test-time camera are synthesized into the training camera viewpoint at 30 Hz, while the frozen VLA policy runs at 10 Hz. Without any policy fine-tuning, our framework enables robust task execution across diverse setups—including different extrinsics, intrinsics, and freely moving handheld cameras such as an iPhone, ZED, and RealSense.
  • Figure 2: (a) Three perturbation levels on the agent and wrist cameras. For the agent camera, we set the intersection point of the workspace surface and the camera $z$-axis as the center of spherical coordinates and apply perturbations to $(r, \theta, \phi)$ of the camera pose. For the wrist camera, we perturb the $x$ and $y$ coordinates and the pitch of the camera pose in the wrist camera frame. (b) Visualization of agent and wrist cameras in the viewpoint-augmented LIBERO dataset.
  • Figure 3: Success rate on LIBERO-Long with agent view variations by fine-tuning steps. (Left) Average task success rate on all three unseen views. (Right) Average task success rate on the original view.
  • Figure 4: Qualitative results of LVSM on simulation and real images. Fine-tuning reduces the domain gap for input views in simulation.
  • Figure 5: Experimental setup for real-world experiments.
  • ...and 1 more figure
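The agent-camera perturbation described in the Figure 2 caption can be sketched as follows. This is a rough illustration under stated assumptions: the caption gives only the spherical center and the perturbed quantities $(r, \theta, \phi)$, so the angle convention used here ($\theta$ as the polar angle from the world $z$-axis, $\phi$ as the azimuth) and the choice to re-aim the camera at the center after perturbation are assumptions, and all function names are hypothetical.

```python
import math


def to_spherical(position, center):
    """Cartesian camera position -> (r, theta, phi) about `center`."""
    x, y, z = (p - c for p, c in zip(position, center))
    r = math.sqrt(x * x + y * y + z * z)
    theta = math.acos(z / r)  # polar angle from +z (assumed convention)
    phi = math.atan2(y, x)    # azimuth in the x-y plane
    return r, theta, phi


def from_spherical(r, theta, phi, center):
    """(r, theta, phi) about `center` -> Cartesian position."""
    cx, cy, cz = center
    return (
        cx + r * math.sin(theta) * math.cos(phi),
        cy + r * math.sin(theta) * math.sin(phi),
        cz + r * math.cos(theta),
    )


def perturb_camera_position(position, center, d_r=0.0, d_theta=0.0, d_phi=0.0):
    """Offset the camera position in spherical coordinates about `center`."""
    r, theta, phi = to_spherical(position, center)
    return from_spherical(r + d_r, theta + d_theta, phi + d_phi, center)


def look_at_direction(position, center):
    """Unit viewing direction that keeps the camera aimed at `center`
    (an assumption; the caption does not state how orientation is set)."""
    d = [c - p for p, c in zip(position, center)]
    norm = math.sqrt(sum(v * v for v in d))
    return tuple(v / norm for v in d)


# Example: move the camera 0.2 further out and rotate 15 degrees in azimuth.
new_pos = perturb_camera_position(
    (1.0, 0.0, 1.0), (0.0, 0.0, 0.0), d_r=0.2, d_phi=math.radians(15)
)
new_dir = look_at_direction(new_pos, (0.0, 0.0, 0.0))
```

The perturbation magnitudes in the example are arbitrary; the three perturbation levels used in the paper are not given in this excerpt.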