FLOWER: Democratizing Generalist Robot Policies with Efficient Vision-Language-Action Flow Policies

Moritz Reuss, Hongyi Zhou, Marcel Rühle, Ömer Erdinç Yağmurlu, Fabian Otto, Rudolf Lioutikov

TL;DR

FLOWER tackles the computational and memory barriers of generalist Vision-Language-Action (VLA) policies by introducing intermediate-modality fusion and action-specific Global-AdaLN conditioning, enabling a compact 950M-parameter VLA pretrained in about 200 H100 GPU-hours. The FLOWER architecture uses a Flow Transformer trained with Rectified Flow for efficient, multimodal action generation, achieving state-of-the-art or competitive results across 190 tasks in 10 benchmarks and generalizing well in real-world experiments. Key contributions include a principled fusion strategy, parameter-efficient conditioning, and an open-source, low-resource pretraining pipeline that broadens access to generalist robotic policies. Together, these lower the barriers to deployment and support cross-embodiment manipulation across diverse tasks and settings.

Abstract

Developing efficient Vision-Language-Action (VLA) policies is crucial for practical robotics deployment, yet current approaches face prohibitive computational costs and resource requirements. Existing diffusion-based VLA policies require multi-billion-parameter models and massive datasets to achieve strong performance. We tackle this efficiency challenge with two contributions: intermediate-modality fusion, which reallocates capacity to the diffusion head by pruning up to $50\%$ of LLM layers, and action-specific Global-AdaLN conditioning, which cuts parameters by $20\%$ through modular adaptation. We integrate these advances into a novel 950M-parameter VLA called FLOWER. Pretrained in just 200 H100 GPU hours, FLOWER delivers performance competitive with larger VLAs across $190$ tasks spanning ten simulation and real-world benchmarks and demonstrates robustness across diverse robotic embodiments. In addition, FLOWER achieves a new SoTA of 4.53 on the CALVIN ABC benchmark. Demos, code, and pretrained weights are available at https://intuitive-robots.github.io/flower_vla/.
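
To make the flow-based action generation mentioned above more concrete, the following is a minimal sketch of rectified-flow sampling with Euler integration. It is not the FLOWER implementation: `toy_velocity` is a hypothetical stand-in for the learned velocity network, the time convention (noise at t = 0, action at t = 1), the step count, and the 7-dimensional action are all assumptions made for illustration.

```python
import random

def toy_velocity(x, t, target):
    """Hypothetical stand-in for a learned velocity network.

    For a single target action, the velocity along the straight path
    x_t = (1 - t) * noise + t * target equals (target - x_t) / (1 - t).
    """
    return [(a - xi) / (1.0 - t) for xi, a in zip(x, target)]

def sample_action(velocity_fn, dim, num_steps=4, seed=0):
    """Integrate dx/dt = v(x, t) from t = 0 (noise) to t = 1 with Euler steps."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so the toy field is well defined
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Assumed 7-dimensional action (e.g. end-effector delta plus gripper command).
target = [0.1, -0.2, 0.3, 0.0, 0.0, 0.0, 1.0]
action = sample_action(lambda x, t: toy_velocity(x, t, target), dim=len(target))
```

Because the toy field follows a straight path, a handful of Euler steps already lands on the target; a learned velocity field is only approximately straight, so the number of steps trades accuracy against inference cost.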

Paper Structure

This paper contains 29 sections, 4 equations, 9 figures, 16 tables.

Figures (9)

  • Figure 1: Intermediate fusion for efficient VLA policies. Our fusion strategy (top-right) prunes VLM layers while increasing Flow Transformer capacity in parameter-constrained settings. This approach informs FLOWER, a novel 950M-parameter VLA that achieves competitive performance across 10 benchmarks using only $1\%$ of the pretraining compute of models like OpenVLA (Kim et al., 2024), while maintaining a small memory footprint across diverse embodiments and action spaces (bottom).
  • Figure 2: FLOWER architecture. A fine-tuned VLM processes multimodal inputs and passes intermediate features to a Flow Transformer via cross-attention. The model predicts velocity fields using action-space-specific Global AdaLN-Zero conditioning with embodiment and temporal metadata.
  • Figure 3: Comparison of standard DiT blocks and our proposed Global AdaLN with layer-specific LoRA adapters.
  • Figure 4: Simulation environments used to test FLOWER. From left to right: CALVIN (Mees et al., 2022), LIBERO (Liu et al., 2024), SIMPLER (Li et al., 2024) with the Bridge and Google Robot variants, and the Aloha Simulation Benchmark (Zhao et al., 2023). Also shown: the real-world multi-task kitchen setup and generalization experiments with cluttered scenes, different lighting, and novel objects.
  • Figure 5: Simulation results for FLOWER. We report average results on various benchmarks against relevant baselines. For brevity, we show only the most relevant baselines; detailed results for each benchmark are provided in the benchmark appendix. C refers to CALVIN and L refers to LIBERO. SGOL refers to average results for LIBERO Spatial, Goal, Object, and Long.
  • ...and 4 more figures
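
The contrast in Figure 3 can be illustrated with a rough parameter count. This is a sketch under stated assumptions, not the paper's accounting: it assumes a standard DiT block owns one linear modulation map producing six vectors (shift, scale, and gate for attention and MLP), and that the Global AdaLN variant shares a single such map across all layers while each layer adds a rank-`rank` low-rank adapter. The width, depth, and rank below are hypothetical values, not FLOWER's configuration.

```python
def adaln_params(d):
    """Parameters of one linear modulation map d -> 6d, including bias."""
    return d * 6 * d + 6 * d

def standard_dit_modulation_params(d, num_layers):
    """Each block owns its own modulation map (assumed standard DiT layout)."""
    return num_layers * adaln_params(d)

def global_adaln_lora_params(d, num_layers, rank):
    """One shared modulation map plus a low-rank adapter (A: d x r, B: r x 6d) per layer."""
    shared_map = adaln_params(d)
    adapters = num_layers * (d * rank + rank * 6 * d)
    return shared_map + adapters

# Hypothetical sizes chosen only for illustration.
d, num_layers, rank = 1024, 18, 16
standard = standard_dit_modulation_params(d, num_layers)
shared = global_adaln_lora_params(d, num_layers, rank)
print(f"per-layer AdaLN: {standard:,} modulation parameters")
print(f"global AdaLN + LoRA: {shared:,} modulation parameters")
```

Under these assumptions the shared variant grows with depth only through the small adapters, which is the intuition behind sharing the modulation; the actual savings depend on the model's real width, depth, adapter rank, and number of action spaces.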