NORA-1.5: A Vision-Language-Action Model Trained using World Model- and Action-based Preference Rewards

Chia-Yu Hung, Navonil Majumder, Haoyuan Deng, Liu Renhang, Yankang Ang, Amir Zadeh, Chuan Li, Dorien Herremans, Ziwei Wang, Soujanya Poria

TL;DR

NORA-1.5 advances vision-language-action modeling by coupling a strong VLA backbone (NORA) with a flow-matching action expert and augmenting it with reward-driven post-training. Lightweight world-model-based and action-heuristic rewards are used to generate pairwise preferences, and the model is then trained on these preferences via Direct Preference Optimization (DPO) to improve trajectory planning and robustness across simulation and real-world embodiments. The approach yields state-of-the-art results on the SimplerEnv and LIBERO benchmarks and demonstrates reliable transfer to a real robot (Galaxea A1), with DPO-driven refinements particularly helping with unseen objects and distractors. Overall, reward-guided post-training provides a scalable, data-efficient path to more dependable embodied agents suitable for real-world deployment.

Abstract

Vision-language-action (VLA) models have recently shown promising performance on a variety of embodied tasks, yet they still fall short in reliability and generalization, especially when deployed across different embodiments or real-world environments. In this work, we introduce NORA-1.5, a VLA model built from the pre-trained NORA backbone by adding to it a flow-matching-based action expert. This architectural enhancement alone yields substantial performance gains, enabling NORA-1.5 to outperform NORA and several state-of-the-art VLA models across both simulated and real-world benchmarks. To further improve robustness and task success, we develop a set of reward models for post-training VLA policies. Our rewards combine (i) an action-conditioned world model (WM) that evaluates whether generated actions lead toward the desired goal, and (ii) a deviation-from-ground-truth heuristic that distinguishes good actions from poor ones. Using these reward signals, we construct preference datasets and adapt NORA-1.5 to target embodiments through direct preference optimization (DPO). Extensive evaluations show that reward-driven post-training consistently improves performance in both simulation and real-robot settings, demonstrating significant VLA model-reliability gains through simple yet effective reward models. Our findings highlight NORA-1.5 and reward-guided post-training as a viable path toward more dependable embodied agents suitable for real-world deployment.
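
The reward-to-preference step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The exact reward definitions are not given in this section, so the sketch assumes a goal reward equal to the negative Euclidean distance between a world-model-predicted state embedding and a goal embedding, and an action reward equal to the negative L1 deviation from the ground-truth action. The function names, the `weight` parameter, and the use of the standard DPO loss on scalar log-likelihoods are all hypothetical choices made for illustration. How likelihood terms are obtained for a flow-matching action expert is not covered here, so they are treated as given numbers.

```python
import math
from typing import List, Sequence, Tuple


def action_reward(candidate: Sequence[float], ground_truth: Sequence[float]) -> float:
    """Assumed heuristic: negative L1 deviation from the ground-truth action."""
    return -sum(abs(c - g) for c, g in zip(candidate, ground_truth))


def goal_reward(predicted_state: Sequence[float], goal_state: Sequence[float]) -> float:
    """Assumed WM reward: negative Euclidean distance between the state
    embedding a world model predicts for an action and the goal embedding."""
    return -math.sqrt(sum((p - g) ** 2 for p, g in zip(predicted_state, goal_state)))


def build_preference_pair(
    candidates: List[List[float]],
    predicted_states: List[List[float]],
    ground_truth: Sequence[float],
    goal_state: Sequence[float],
    weight: float = 0.5,  # hypothetical mixing weight between the two rewards
) -> Tuple[List[float], List[float]]:
    """Score each sampled action and return (chosen, rejected)."""
    scores = [
        goal_reward(state, goal_state) + weight * action_reward(action, ground_truth)
        for action, state in zip(candidates, predicted_states)
    ]
    best = max(range(len(scores)), key=scores.__getitem__)
    worst = min(range(len(scores)), key=scores.__getitem__)
    return candidates[best], candidates[worst]


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin))."""
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(m)) equals log(1 + exp(-m)).
    return math.log1p(math.exp(-margin))


if __name__ == "__main__":
    chosen, rejected = build_preference_pair(
        candidates=[[0.0, 0.0], [1.0, 1.0]],
        predicted_states=[[0.0], [2.0]],
        ground_truth=[0.0, 0.0],
        goal_state=[0.0],
    )
    print("chosen:", chosen, "rejected:", rejected)
    print("loss:", dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

In this sketch the policy is rewarded for raising the likelihood of the chosen action relative to a frozen reference model and lowering that of the rejected one. The loss falls as that margin grows.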

Paper Structure

This paper contains 39 sections, 5 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Training pipeline of NORA-1.5: a VLA model is first pre-trained through imitation learning, and a preference dataset of actions is then created for preference optimization. WM stands for the WM-guided goal-based reward (\ref{eq:goal-re}) and GTA stands for the reward based on the ground-truth action (\ref{eq:act-re}).
  • Figure 2: Comparing FAST+ with flow-matching.
  • Figure 3: Effect of DPO post-training on real-robot gripper trajectories for the Galaxea A1 arm. Compared to the non-DPO baseline (a), the DPO-trained NORA-1.5 (b) executes smoother trajectories with fewer strokes, aligning with the reduced number of action chunks and improved grasp success reported in \ref{tab:real_robot_results2}.
  • Figure 4: Examples of NORA-1.5 executing evaluation tasks in SimplerEnv: (a) pick up coke and move an object near another object, and (b) open and close a drawer.
  • Figure 5: Examples of NORA-1.5 executing evaluation tasks with Galaxea A1 robotic arm in the real world.