BPP: Long-Context Robot Imitation Learning by Focusing on Key History Frames

Max Sobol Mark, Jacky Liang, Maria Attarian, Chuyuan Fu, Debidatta Dwibedi, Dhruv Shah, Aviral Kumar

TL;DR

This work tackles the challenge of long-horizon, history-dependent imitation learning in robotics by identifying history coverage as the core bottleneck. It introduces Big Picture Policies (BPP), which compress histories into a small set of semantically meaningful keyframes detected by vision-language models, thereby reducing distribution shift between training and deployment. Across four real-world manipulation tasks and three simulations, BPP outperforms memoryless and prior history-conditioned baselines by up to 70% in real-world success, demonstrating improved data efficiency and robust long-horizon tracking. Limitations include dependence on VLM latency and detection accuracy, suggesting future directions toward automatic keyframe generation and event-based learning extensions.

Abstract

Many robot tasks require attending to the history of past observations. For example, finding an item in a room requires remembering which places have already been searched. However, the best-performing robot policies typically condition only on the current observation, limiting their applicability to such tasks. Naively conditioning on past observations often fails due to spurious correlations: policies latch onto incidental features of training histories that do not generalize to out-of-distribution trajectories upon deployment. We analyze why policies latch onto these spurious correlations and find that this problem stems from limited coverage over the space of possible histories during training, which grows exponentially with horizon. Existing regularization techniques provide inconsistent benefits across tasks, as they do not fundamentally address this coverage problem. Motivated by these findings, we propose Big Picture Policies (BPP), an approach that conditions on a minimal set of meaningful keyframes detected by a vision-language model. By projecting diverse rollouts onto a compact set of task-relevant events, BPP substantially reduces distribution shift between training and deployment, without sacrificing expressivity. We evaluate BPP on four challenging real-world manipulation tasks and three simulation tasks, all requiring history conditioning. BPP achieves 70% higher success rates than the best comparison on real-world evaluations. Videos are available at https://bigpicturepolicies.github.io/
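The coverage argument in the abstract can be illustrated with a toy calculation. This is a rough sketch under assumptions that are not from the paper: each timestep is taken to have a fixed number of distinguishable outcomes (`num_branches`), and each demonstration is taken to cover at most one distinct history. The function name and all parameters are hypothetical.

```python
def history_coverage(num_branches: int, horizon: int, num_demos: int) -> float:
    """Upper bound on the fraction of distinct histories seen in training.

    Toy model (an assumption, not the paper's analysis): each step has
    num_branches distinguishable outcomes, so there are
    num_branches ** horizon possible histories, and each demonstration
    covers at most one of them.
    """
    total_histories = num_branches ** horizon
    return min(1.0, num_demos / total_histories)


# With a fixed demonstration budget, the bound shrinks rapidly with horizon.
for horizon in (5, 10, 20):
    bound = history_coverage(num_branches=3, horizon=horizon, num_demos=200)
    print(f"horizon={horizon:2d}  coverage upper bound={bound:.2e}")
```

Under this toy model, a dataset that covers most short histories covers a vanishing fraction of long ones, which is the regime in which a policy can fit incidental features of the few histories it has seen.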

Paper Structure (32 sections, 2 equations, 14 figures, 5 tables)

Figures (14)

  • Figure 2: Benchmark tasks. We evaluate on 4 real-world tasks (A--D) and 3 simulation tasks (E--G). All tasks require history conditioning for success.
  • Figure 3: Predicting longer action chunks significantly decreases spurious correlations in naïve history-conditioned policies. Left: We compare how well the features learned by the policy predict the history state (in $\mathrm{Fixed\text{-}Password}$, this is the number of buttons pressed so far). Both policies trained with shorter (10) and longer (50) action chunks learn features that predict history state well on in-distribution expert trajectories. However, policies trained with shorter action chunks exhibit a much larger generalization gap to out-of-distribution policy rollouts (7.2$\times$ vs. 2.9$\times$ decrease), indicating greater reliance on spurious correlations. Right: These representational differences translate into significant performance gaps. Shorter action chunks drastically hurt naïve history-conditioned policy performance, while having little effect on oracle policies. Notably, training with action chunk 50 but executing with chunk 10 still significantly outperforms training with chunk 10, confirming that longer chunk prediction improves feature learning for history conditioning.
  • Figure 4: History state prediction regularization hurts history understanding. We evaluate whether regularizing the history encoder to predict ground-truth state information (number of buttons pressed so far in $\mathrm{Fixed\text{-}Password}$) improves generalization. While this auxiliary task improves accuracy on unseen expert trajectories, it leads to worse performance on out-of-distribution rollouts, indicating it increases reliance on spurious correlations. This regularization also degrades success rate: $55.5\% \pm 3.3\%$ to $19.0\% \pm 3.3\%$.
  • Figure 5: BPP is more robust to the distribution shift between histories appearing in the training data and policy rollouts. We evaluate the accuracy of state prediction given the observation history, averaged across the fixed-password-entering and ingredient-insertion tasks. Both Naïve History Conditioning and PTP suffer large performance degradations when tested on policy rollouts. Our approach, BPP, which we discuss next, does not suffer from this drop.
  • Figure 6: BPP system architecture. We condition a standard diffusion transformer policy architecture on a small set of history keyframes. These keyframes are defined by simple, task-specific criteria and are detected using a VLM. We also mask recent histories to account for detection latency.
  • ...and 9 more figures
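
The keyframe conditioning and latency masking described in the Figure 6 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `select_keyframes`, the parameters `latency_steps` and `max_keyframes`, and the precomputed boolean flags (standing in for the output of a VLM detector) are all assumptions made for this example.

```python
from typing import List, Sequence


def select_keyframes(
    is_keyframe: Sequence[bool],
    current_step: int,
    latency_steps: int,
    max_keyframes: int,
) -> List[int]:
    """Return indices of history frames a policy could condition on.

    is_keyframe[t] is a hypothetical per-frame flag standing in for a VLM
    detector's decision. Frames newer than current_step - latency_steps are
    masked out, mimicking detections that have not arrived yet.
    """
    if latency_steps < 0 or max_keyframes < 0:
        raise ValueError("latency_steps and max_keyframes must be non-negative")
    # Last index whose detection result is assumed to be available.
    last_visible = min(current_step - latency_steps, len(is_keyframe) - 1)
    visible = [t for t in range(last_visible + 1) if is_keyframe[t]]
    # Keep only the most recent detections so the context stays small.
    return visible[-max_keyframes:] if max_keyframes > 0 else []


flags = [False, True, False, False, True, False, True, False]
print(select_keyframes(flags, current_step=7, latency_steps=2, max_keyframes=2))
```

In this sketch, the detection at index 6 is hidden when `latency_steps=2` because it falls inside the masked recent window. Applying the same mask when building training inputs would keep training and deployment histories consistent, though whether and how the paper does this is not stated in the caption.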