OG-VLA: Orthographic Image Generation for 3D-Aware Vision-Language Action Model

Ishika Singh, Ankit Goyal, Stan Birchfield, Dieter Fox, Animesh Garg, Valts Blukis

TL;DR

OG-VLA addresses the challenge of mapping natural language and RGB-D observations to precise robot actions by unprojecting scenes into a unified 3D point cloud, rendering canonical orthographic views, and synthesizing actions through a vision backbone, an LLM, and a diffusion-based image generator. By combining 3D-aware reasoning with language and vision priors, it achieves strong generalization to unseen objects, scenes, and instructions while maintaining accurate end-effector poses. It demonstrates state-of-the-art generalization on Arnold and robust performance under Colosseum perturbations, with real-world adaptation feasible in as few as 3–5 demonstrations. The work highlights the value of canonical-view representations and multimodal prompting for robust, data-efficient robotic manipulation, and outlines clear directions for scaling, long-horizon tasks, and faster inference.

Abstract

We introduce OG-VLA, a novel architecture and learning framework that combines the generalization strengths of Vision Language Action models (VLAs) with the robustness of 3D-aware policies. We address the challenge of mapping natural language instructions and one or more RGB-D observations to quasi-static robot actions. 3D-aware robot policies achieve state-of-the-art performance on precise robot manipulation tasks, but struggle with generalization to unseen instructions, scenes, and objects. On the other hand, VLAs excel at generalizing across instructions and scenes, but can be sensitive to camera and robot pose variations. We leverage prior knowledge embedded in language and vision foundation models to improve generalization of 3D-aware keyframe policies. OG-VLA unprojects input observations from diverse views into a point cloud, which is then rendered from canonical orthographic views, ensuring input view invariance and consistency between input and output spaces. These canonical views are processed with a vision backbone, a Large Language Model (LLM), and an image diffusion model to generate images that encode the next position and orientation of the end-effector on the input scene. Evaluations on the Arnold and Colosseum benchmarks demonstrate state-of-the-art generalization to unseen environments, with over 40% relative improvements while maintaining robust performance in seen settings. We also show real-world adaptation in 3 to 5 demonstrations along with strong generalization. Videos and resources are available at https://og-vla.github.io/
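
The unproject-then-render step described in the abstract can be illustrated with a minimal NumPy sketch. It assumes a pinhole camera with intrinsics `K`, a 4x4 `cam_to_world` pose, and an axis-aligned workspace `bounds`. The function names, the nearest-point z-buffer rule, and the image resolution are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def unproject_depth(depth, rgb, K, cam_to_world):
    """Lift one RGB-D image to a colored world-frame point cloud (pinhole model)."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]                  # pixel row (v) and column (u) indices
    z = depth.reshape(-1).astype(float)
    valid = z > 0                              # treat non-positive depth as missing
    x = (u.reshape(-1) - K[0, 2]) * z / K[0, 0]
    y = (v.reshape(-1) - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)[valid]
    pts_world = (cam_to_world @ pts_cam.T).T[:, :3]
    return pts_world, rgb.reshape(-1, 3)[valid]


def render_orthographic(points, colors, depth_axis, bounds, res=64):
    """Render a point cloud orthographically, looking along -depth_axis.

    bounds is a (3, 2) array of workspace [min, max] per world axis. Among points
    that land on the same pixel, the one with the largest coordinate along
    depth_axis (assumed nearest to the virtual camera) is kept.
    """
    image = np.zeros((res * res, 3), dtype=colors.dtype)
    lo, hi = bounds[:, 0], bounds[:, 1]
    inside = np.all((points >= lo) & (points < hi), axis=1)
    p, c = points[inside], colors[inside]
    if len(p) == 0:
        return image.reshape(res, res, 3)
    a0, a1 = [a for a in range(3) if a != depth_axis]
    # Orthographic projection: pixel coordinates are an affine function of two
    # world axes, with no perspective division.
    px = np.minimum(((p[:, a0] - lo[a0]) * res / (hi[a0] - lo[a0])).astype(int), res - 1)
    py = np.minimum(((p[:, a1] - lo[a1]) * res / (hi[a1] - lo[a1])).astype(int), res - 1)
    flat = py * res + px
    order = np.lexsort((p[:, depth_axis], flat))   # sort by pixel, then far to near
    flat_sorted = flat[order]
    nearest = np.r_[flat_sorted[1:] != flat_sorted[:-1], True]  # last entry per pixel
    image[flat_sorted[nearest]] = c[order][nearest]
    return image.reshape(res, res, 3)
```

Under these assumptions, clouds from all input cameras would be concatenated and `render_orthographic` called once per canonical view, so the rendered images depend on the workspace frame rather than on where the input cameras were placed. Mirroring for opposite-facing views (for example left versus right) is omitted here.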

Paper Structure

This paper contains 33 sections, 2 equations, 6 figures, 9 tables.

Figures (6)

  • Figure 1: OG-VLA illustration. OG-VLA represents robot end-effector keyframes with easy-to-decode annotations on orthographic images in a set of canonical views. This output encoding enables action prediction via image generation, and using canonical views achieves invariance to input camera poses. The red hotspot is the predicted end-effector position in each image. In this example, the hotspot is indicating the 3D point for approaching the drawer handle to open it. The yellow, blue and green hotspots work in tandem with the red hotspot to encode the three axes of end-effector orientation. The color of the hotspot on the top-left encodes the gripper open/close state. The system is robust to distractors and changing lighting conditions.
  • Figure 2: Model Overview. The input to our system is a task instruction and multiple RGB-D views of the scene. We build a point cloud from the input views and re-project it to orthographic projections from orthonormal views. The orthonormal views are fed into a Visual Encoder to derive a set of CLS and patch embeddings. CLS embeddings are projected into the LLM latent space and concatenated with a tokenized prompt that queries the next end-effector state and specifies the output format. The LLM outputs image token embeddings to condition the ImageGenerator, which are projected to the ImageGenerator's input latent space, and then concatenated with skip-connected visual features. The ImageGenerator generates heatmaps, one per orthographic view, indicating the next end-effector pose. We decode the heatmaps by interpreting them as probabilities, and inferring the most likely 3D position across all views and one rotation angle per view.
  • Figure 3: Evaluation on the COLOSSEUM benchmark. Task-averaged success rate shows that OG-VLA outperforms all baselines on the hardest generalization test set (all perturbations).
  • Figure 4: Qualitative example. Showing generalization of OG-VLA to unseen scenarios.
  • Figure 5: Example gripper position and rotation outputs from OG-VLA for eight different tasks. The rows are the different views: front, top, left, and right. For each task, the two columns are two timesteps required to solve the task. The red Gaussian is the predicted position. The yellow, blue, and green Gaussians are the predicted rotation angles about the x-, z-, and y-axes, respectively. The blue dot is our model's output gripper position, back-projected to each view. The dots on the rotation Gaussians show the pixel extracted for computing the rotation angle relative to the horizontal right axis in each view.
  • ...and 1 more figure
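
The decoding described in the Figure 2 and Figure 5 captions (heatmaps read as probabilities, one 3D position across views, one angle per view) can be sketched as follows. This is a hedged illustration, not the paper's decoder: it assumes three axis-aligned views keyed by their depth axis, scores each voxel by the sum of log-probabilities of its projections, and reads the rotation angle from the peak of a rotation heatmap relative to the position pixel. The function names and view layout are hypothetical.

```python
import numpy as np


def decode_position(heatmaps, res):
    """Return the voxel (x, y, z) whose projections are jointly most likely.

    heatmaps maps a depth axis (0, 1, or 2) to a (res, res) non-negative array
    indexed [row, col], where col follows the lower-numbered remaining world
    axis and row the higher-numbered one. This layout is an assumption.
    """
    log_score = np.zeros((res, res, res))            # indexed [x, y, z]
    for depth_axis, hm in heatmaps.items():
        prob = hm / hm.sum()                         # normalize to a distribution
        logp = np.log(prob + 1e-12)                  # small constant avoids log(0)
        # Transpose to [col axis, row axis], then broadcast along the depth axis.
        log_score += np.expand_dims(logp.T, axis=depth_axis)
    idx = np.unravel_index(np.argmax(log_score), log_score.shape)
    return tuple(int(i) for i in idx)


def decode_rotation_angle(position_px, rotation_heatmap):
    """Angle in radians of the rotation hotspot around the position pixel,
    measured counter-clockwise from the image's horizontal right axis."""
    row, col = np.unravel_index(np.argmax(rotation_heatmap), rotation_heatmap.shape)
    d_col = col - position_px[1]
    d_row = position_px[0] - row     # image rows grow downward, so flip the sign
    return float(np.arctan2(d_row, d_col))
```

Summing log-probabilities is one simple way to combine views: a voxel scores highly only if every view agrees on it. Other fusion rules are possible.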