Self-Supervised Bootstrapping of Action-Predictive Embodied Reasoning

Milan Ganai; Katie Luo; Jonas Frey; Clark Barrett; Marco Pavone

TL;DR

R&B-EnCoRe learns action-predictive embodied reasoning by treating reasoning as a latent variable $Z$ that links context $C$ to action $A$, and by grounding internet-scale priors through self-supervised, importance-weighted inference (the IWAE bound $\mathcal{L}_K$ and sampling-importance-resampling, SIR). The framework warmstarts with diverse reasoning primitives combined via Reasoning Dropout, jointly trains a prior $p(Z,A|C)$ and a posterior $q(Z|C,A)$, and refines traces with SIR to emphasize strategies that maximize the information benefit $\Delta \mathcal{I}_R$. Across manipulation, legged navigation, and autonomous driving benchmarks, it yields concise, action-predictive traces and substantial performance gains while reducing test-time latency. The approach requires no external rewards or verifiers and enables scalable grounding of multimodal priors in physical execution.
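
For reference, the $K$-sample importance-weighted bound named above can be written in this summary's notation as follows. This is a sketch that assumes the standard IWAE construction, with $z_1, \dots, z_K$ drawn independently from the posterior; the paper's exact definition of $\mathcal{L}_K$ may differ in detail.

$$
\mathcal{L}_K = \mathbb{E}_{z_1,\dots,z_K \sim q(Z \mid C, A)}\left[\log \frac{1}{K}\sum_{k=1}^{K} w_k\right] \le \log p(A \mid C), \qquad w_k = \frac{p(z_k, A \mid C)}{q(z_k \mid C, A)}
$$

Under the same assumption, SIR draws a reasoning trace $z_k$ from the $K$ candidates with probability $w_k / \sum_{j=1}^{K} w_j$, so traces that make the observed action more likely under the prior are resampled more often.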

Abstract

Embodied Chain-of-Thought (CoT) reasoning has significantly enhanced Vision-Language-Action (VLA) models, yet current methods rely on rigid templates to specify reasoning primitives (e.g., objects in the scene, high-level plans, structural affordances). These templates can force policies to process irrelevant information that distracts from critical action-prediction signals. This creates a bottleneck: without successful policies, we cannot verify reasoning quality; without quality reasoning, we cannot build robust policies. We introduce R&B-EnCoRe, which enables models to bootstrap embodied reasoning from internet-scale knowledge through self-supervised refinement. By treating reasoning as a latent variable within importance-weighted variational inference, models can generate and distill a refined reasoning training dataset of embodiment-specific strategies without external rewards, verifiers, or human annotation. We validate R&B-EnCoRe across manipulation (Franka Panda in simulation, WidowX in hardware), legged navigation (bipedal, wheeled, bicycle, quadruped), and autonomous driving embodiments using VLA architectures with 1B, 4B, 7B, and 30B parameters. Our approach achieves a 28% gain in manipulation success, a 101% improvement in navigation scores, and a 21% reduction in the collision-rate metric over models that indiscriminately reason about all available primitives. R&B-EnCoRe enables models to distill reasoning that is predictive of successful control, bypassing manual annotation engineering while grounding internet-scale knowledge in physical execution.
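
The refinement step described in the abstract can be illustrated with a minimal sampling-importance-resampling sketch. It assumes that the log-probabilities of each candidate reasoning trace under the prior, $\log p(z, a \mid c)$, and under the posterior, $\log q(z \mid c, a)$, have already been computed by the model. The function names, trace strings, and numeric values are hypothetical and are not taken from the paper.

```python
import math
import random


def normalized_weights(log_p_joint, log_q):
    """Self-normalized importance weights w_k / sum_j w_j.

    log_p_joint[k] is assumed to hold log p(z_k, a | c) and
    log_q[k] is assumed to hold log q(z_k | c, a).
    """
    log_w = [lp - lq for lp, lq in zip(log_p_joint, log_q)]
    m = max(log_w)  # subtract the max before exponentiating for stability
    unnorm = [math.exp(lw - m) for lw in log_w]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def sir_resample(candidates, log_p_joint, log_q, n_draws=1, rng=None):
    """Resample candidate traces in proportion to their importance weights."""
    rng = rng or random.Random()
    weights = normalized_weights(log_p_joint, log_q)
    return rng.choices(candidates, weights=weights, k=n_draws)


if __name__ == "__main__":
    # Hypothetical candidate traces built from different primitive subsets.
    traces = ["plan only", "plan + visible objects", "all primitives"]
    log_p = [-5.0, -2.0, -9.0]  # illustrative values, not from the paper
    log_q = [-1.0, -1.0, -1.0]
    print(normalized_weights(log_p, log_q))
    print(sir_resample(traces, log_p, log_q, n_draws=4, rng=random.Random(0)))
```

Working in log space and subtracting the maximum log-weight avoids overflow when sequence log-probabilities are large in magnitude, which is typical for long reasoning traces.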

Paper Structure (55 sections, 3 theorems, 35 equations, 26 figures, 7 tables, 2 algorithms)

Introduction
Related Works
Preliminaries on Variational Inference
Latent Variable Models and the Variational Autoencoders
Importance Weighted Autoencoders (IWAE)
R&B-EnCoRe Framework
Warmstarting Strategy Hypotheses via Reasoning Dropout
Jointly Training Prior and Posterior
Refining and Bootstrapping via Importance Sampling
Experiments
LIBERO-90 Franka Panda Manipulation
Bridge WidowX Hardware
Legged Robots Navigation
Autonomous Vehicles
Discussion and Conclusion
...and 40 more sections

Key Result

Proposition 1

Under the paper's training and sampling (refinement) algorithms, the expected log-ratio of importance weights equals the information benefit $\Delta \mathcal{I}_R$. The proof is given in the paper's appendix.

Figures (26)

Figure 1: We generate diverse embodied reasoning primitives and refine them based on action-prediction information benefit. We bootstrap policy performance by retraining on these self-refined, high-quality reasoning traces, discovering embodiment-specific reasoning distributions that reveal effective strategies, significantly improving VLA task success while producing more efficient CoT traces.
Figure 2: Top: Probabilistic Graphical Model relating the Task Context ($C$), Reasoning ($Z$), and Action ($A$). The latent reasoning $Z$ is induced from a set of primitives $\mathcal{R}$ (e.g., subtask reasoning, move reasoning). Bottom: An example reasoning trace on the Bridge setup.
Figure 3: Overview of R&B-EnCoRe. (a) We generate diverse reasoning primitives (e.g., Plan, Visible Objects) and combine them via dropout to warmstart a model that captures the prior and posterior distributions. (b) We sample candidates from the posterior and apply importance weighting to filter for reasoning that maximizes action-prediction power. These refined, high-quality reasoning traces are used to bootstrap the final VLA.
Figure 4: Distributions over reasoning primitives obtained when R&B-EnCoRe refines the diverse warmstart reasoning data. (a) For manipulation, the distribution differs between the Franka Panda in simulation and the WidowX hardware on real-world data, notably for the Visible Object, Move Explain, and Subtask Explain primitives. (b) The four legged-locomotion embodiments we investigate use the reasoning types at similar frequencies, with structural affordances being critical. (c) For autonomous vehicles, reasoning focuses on goals and constraints.
Figure 5: Visible Objects generated in LIBERO-90 by R&B-EnCoRe's model and a model producing a full list. The latter model attends to task-irrelevant objects like plate and bowl, while our model emits reasoning focused on task-critical objects.
...and 21 more figures

Theorems & Definitions (4)

Proposition: Importance Weight Ratios Estimate Information Benefit
Proposition: Importance Weights Capture Information Gain
Proof
Proposition: Categorical Resampling Bound (adapted from Cremer et al., 2017)