FLOWER: Democratizing Generalist Robot Policies with Efficient Vision-Language-Action Flow Policies

Moritz Reuss; Hongyi Zhou; Marcel Rühle; Ömer Erdinç Yağmurlu; Fabian Otto; Rudolf Lioutikov

TL;DR

FLOWER tackles the computational and memory barriers of generalist Vision-Language-Action policies by introducing intermediate-modality fusion and action-specific Global-AdaLN conditioning, enabling a compact 950M-parameter VLA trained in ~200 GPU-hours. The core architecture uses a Flow Transformer with Rectified Flow for efficient, multimodal action generation, achieving state-of-the-art or competitive results across 190 tasks in 10 benchmarks and demonstrating strong real-world generalization. Key contributions include a principled fusion strategy, parameter-efficient conditioning, and an open-source, low-resource pretraining pipeline that broadens access to generalist robotic policies. The work lowers barriers to deployment, enabling robust, cross-embodiment manipulation across diverse tasks and settings.

Abstract

Developing efficient Vision-Language-Action (VLA) policies is crucial for practical robotics deployment, yet current approaches face prohibitive computational costs and resource requirements. Existing diffusion-based VLA policies require multi-billion-parameter models and massive datasets to achieve strong performance. We tackle this efficiency challenge with two contributions: intermediate-modality fusion, which reallocates capacity to the diffusion head by pruning up to $50\%$ of LLM layers, and action-specific Global-AdaLN conditioning, which cuts parameters by $20\%$ through modular adaptation. We integrate these advances into a novel 950M-parameter VLA called FLOWER. Pretrained in just 200 H100 GPU hours, FLOWER delivers performance competitive with larger VLAs across $190$ tasks spanning ten simulation and real-world benchmarks and demonstrates robustness across diverse robotic embodiments. In addition, FLOWER achieves a new SoTA of 4.53 on the CALVIN ABC benchmark. Demos, code, and pretrained weights are available at https://intuitive-robots.github.io/flower_vla/.
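
The TL;DR mentions Rectified Flow as the action generator. As a rough illustration of the general technique only (not FLOWER's actual implementation), the sketch below shows the straight-line interpolation, the velocity regression target, and fixed-step Euler sampling. It assumes the common convention that noise sits at t=0 and the action at t=1. The names `interpolate`, `velocity_target`, `euler_sample`, `oracle_velocity`, and `TARGET` are hypothetical, and the oracle velocity stands in for a trained network.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to action x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def velocity_target(x0, x1):
    """Training target for the velocity model: constant along the straight path."""
    return [b - a for a, b in zip(x0, x1)]


def euler_sample(velocity_fn, x0, num_steps=4):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt  # t stays strictly below 1 inside the loop
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical stand-in for a learned model: the exact velocity of a
# straight path that ends at a fixed 3-dimensional toy action.
TARGET = [0.5, -0.2, 0.1]


def oracle_velocity(x, t):
    # On a straight path, x1 - x0 equals (x1 - x_t) / (1 - t).
    return [(b - a) / (1.0 - t) for a, b in zip(x, TARGET)]


if __name__ == "__main__":
    start = [1.0, 2.0, -1.0]  # stands in for a Gaussian noise sample
    print(euler_sample(oracle_velocity, start, num_steps=4))
```

Because the path is a straight line, the ideal velocity is constant, which is why a handful of Euler steps can suffice in this toy setting. A learned velocity field is only approximately straight, so the number of steps becomes a speed versus accuracy trade-off.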
