FRAPPE: Infusing World Modeling into Generalist Policies via Multiple Future Representation Alignment

Han Zhao, Jingbo Wang, Wenxuan Song, Shuai Chen, Yang Liu, Yan Wang, Haoang Li, Donglin Wang

TL;DR

By significantly improving fine-tuning efficiency and reducing dependence on action-annotated data, FRAPPE provides a scalable and data-efficient pathway to enhance world-awareness in generalist robotic policies.

Abstract

Enabling VLA models to predict environmental dynamics, known as world modeling, has been recognized as essential for improving robotic reasoning and generalization. However, current approaches face two main issues: (1) the training objective forces models to over-emphasize pixel-level reconstruction, which constrains semantic learning and generalization; and (2) reliance on predicted future observations during inference often leads to error accumulation. To address these challenges, we introduce Future Representation Alignment via Parallel Progressive Expansion (FRAPPE). Our method adopts a two-stage fine-tuning strategy: in the mid-training phase, the model learns to predict the latent representations of future observations; in the post-training phase, we expand the computational workload in parallel and align the representation simultaneously with multiple different visual foundation models. By significantly improving fine-tuning efficiency and reducing dependence on action-annotated data, FRAPPE provides a scalable and data-efficient pathway to enhance world-awareness in generalist robotic policies. Experiments on the RoboTwin benchmark and real-world tasks demonstrate that FRAPPE outperforms state-of-the-art approaches and shows strong generalization in long-horizon and unseen scenarios.
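The idea of aligning predicted future representations with several visual foundation models at once can be illustrated with a minimal sketch. The abstract does not specify the loss, so this example assumes a cosine-distance objective averaged over teachers; the function names and the teacher keys (`teacher_a`, `teacher_b`) are hypothetical and are not taken from the paper.

```python
import math
from typing import Dict, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def alignment_loss(predicted: Dict[str, Vector], targets: Dict[str, Vector]) -> float:
    """Average (1 - cosine similarity) over all teachers.

    `predicted[name]` is the policy's predicted future representation for one
    teacher; `targets[name]` is that teacher's embedding of the true future
    observation. Cosine distance is an assumption made for this sketch only.
    """
    if predicted.keys() != targets.keys():
        raise ValueError("predicted and targets must cover the same teachers")
    losses = [1.0 - cosine_similarity(predicted[name], targets[name]) for name in targets]
    return sum(losses) / len(losses)


# Hypothetical two-teacher example with toy 2-D embeddings.
pred = {"teacher_a": [1.0, 0.0], "teacher_b": [0.0, 2.0]}
tgt = {"teacher_a": [2.0, 0.0], "teacher_b": [3.0, 0.0]}
loss = alignment_loss(pred, tgt)
```

Because the target lives in representation space rather than pixel space, a loss of this kind does not require reconstructing future images, which is the motivation the abstract gives for moving away from pixel-level objectives.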

Paper Structure (27 sections, 8 equations, 6 figures, 6 tables)

Figures (6)

  • Figure 1: We demonstrate that FRAPPE significantly outperforms the state-of-the-art models in both simulated and real-world complex scenarios, and it can effectively leverage data from different levels of the training data pyramid.
  • Figure 2: Overview of training and inference. During the training phase, the model progressively learns to align with the representation spaces of multiple visual foundation models simultaneously. The model is trained through a two-stage process that extends it to parallel processing of multiple input streams while aligning diverse visual representations. Parallel inference is likewise used during the inference stage.
  • Figure 3: Experiments at different parameter scales on 4 tasks from RoboTwin 2.0. The 130M backbone model, fine-tuned under FRAPPE using either LoRA or full-parameter fine-tuning, consistently outperforms the naively fine-tuned RDT-130M across all tasks and remains competitive with the naively fine-tuned RDT-1B. The improvement is especially significant on the Stack Bowls Two-Hard task.
  • Figure 4: Real-world experiment results in seen and unseen scenarios. We evaluate FRAPPE and prior SOTA VLAs on 4 representative tasks, each with different axes of generalization. Seen means the settings were included in the training data, while Unseen refers to new task settings that the model did not encounter during training.
  • Figure 5: Long-horizon performance. Each data point represents the success rate of completing all subtasks up to and including the corresponding one.
  • ...and 1 more figure
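
The cumulative metric described in the Figure 5 caption can be sketched as follows. This is an illustrative reading of the caption, not the authors' evaluation code: it assumes each trial is summarized by the number of consecutive subtasks completed before the first failure, and the function name and toy trial data are hypothetical.

```python
from typing import List


def cumulative_success_rates(completed_counts: List[int], num_subtasks: int) -> List[float]:
    """Success rate of completing up to and including each subtask.

    `completed_counts[i]` is the number of consecutive subtasks trial i
    completed (0 to `num_subtasks`). Entry k-1 of the result is the fraction
    of trials that completed at least k subtasks.
    """
    if not completed_counts:
        raise ValueError("at least one trial is required")
    total = len(completed_counts)
    return [
        sum(1 for count in completed_counts if count >= k) / total
        for k in range(1, num_subtasks + 1)
    ]


# Four hypothetical trials of a three-subtask task.
rates = cumulative_success_rates([3, 1, 0, 2], num_subtasks=3)
```

Under this definition the curve is non-increasing by construction, since a trial that completes subtask k has also completed every earlier subtask.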