Fast ECoT: Efficient Embodied Chain-of-Thought via Thoughts Reuse

Zhekai Duan, Yuan Zhang, Shikai Geng, Gaowen Liu, Joschka Boedecker, Chris Xiaoxuan Lu

TL;DR

This work tackles the latency bottleneck of Embodied Chain-of-Thought (ECoT) reasoning in vision-language-action (VLA) policies. It introduces Fast ECoT, which caches high-level reasoning across timesteps, generates modular reasoning steps in parallel, and uses an asynchronous scheduler to decouple reasoning from action decoding. The method is model-agnostic, requires no training or architectural changes, and integrates into existing VLA pipelines. Results on LIBERO simulations and real-world robot tasks show substantial latency reductions while maintaining or improving task success and reasoning faithfulness, with the asynchronous variants offering the best speed-accuracy trade-offs. Overall, Fast ECoT makes ECoT-driven policies more viable for real-time deployment by balancing interpretability, efficiency, and performance.

Abstract

Embodied Chain-of-Thought (ECoT) reasoning enhances vision-language-action (VLA) models by improving performance and interpretability through intermediate reasoning steps. However, its sequential autoregressive token generation introduces significant inference latency, limiting real-time deployment. We propose Fast ECoT, an inference-time acceleration method that exploits the structured and repetitive nature of ECoT to (1) cache and reuse high-level reasoning across timesteps and (2) parallelise the generation of modular reasoning steps. Additionally, we introduce an asynchronous scheduler that decouples reasoning from action decoding, further boosting responsiveness. Fast ECoT requires no model changes or additional training and integrates easily into existing VLA pipelines. Experiments in both simulation (LIBERO) and real-world robot tasks show up to a 7.5% reduction in latency with comparable or improved task success rate and reasoning faithfulness, bringing ECoT policies closer to practical real-time deployment.
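The caching idea in point (1) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the stage names, the `ThoughtCache` class, the fixed `refresh_every` interval, and the `fake_generate` stand-in for the VLA model are all hypothetical, chosen only to show how high-level thoughts might be reused while low-level thoughts are regenerated at every step.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical stage names; the source only says ECoT has high-level and
# low-level reasoning steps.
HIGH_LEVEL: List[str] = ["task", "plan"]
LOW_LEVEL: List[str] = ["subtask", "move"]


@dataclass
class ThoughtCache:
    """Reuses high-level thoughts across timesteps (illustrative only)."""

    refresh_every: int = 5  # assumed refresh policy, not taken from the paper
    thoughts: Dict[str, str] = field(default_factory=dict)

    def step(self, t: int, generate: Callable[[str, int], str]) -> List[str]:
        """Update thoughts for timestep t; return the stages regenerated."""
        regenerated: List[str] = []
        for stage in HIGH_LEVEL + LOW_LEVEL:
            missing = stage not in self.thoughts
            due = t % self.refresh_every == 0
            # Low-level stages are always regenerated; high-level stages are
            # regenerated only when missing or due for a refresh.
            if stage in LOW_LEVEL or missing or due:
                self.thoughts[stage] = generate(stage, t)
                regenerated.append(stage)
        return regenerated


def fake_generate(stage: str, t: int) -> str:
    """Stand-in for an expensive autoregressive decode of one stage."""
    return f"{stage}@t={t}"


cache = ThoughtCache(refresh_every=5)
log = {t: cache.step(t, fake_generate) for t in range(7)}
for t, stages in log.items():
    print(t, stages)
```

In this toy run only timesteps 0 and 5 regenerate all four stages; every other timestep decodes just the two low-level ones, which is where the latency saving would come from under these assumptions.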

Paper Structure

This paper contains 16 sections, 9 figures, 4 tables, 2 algorithms.

Figures (9)

  • Figure 1: ECoT (Zawalski et al., 2024) reasoning autoregressively generates high-level (green) and low-level (purple) reasoning steps to enhance VLA performance.
  • Figure 2: Statistics illustrating the pattern of ECoT reasoning steps under LIBERO Goal (Liu et al., 2023) and Bridge V2 (Walke et al., 2023).
  • Figure 3: Comparison between ECoT (left) and our proposed Fast ECoT (right). Both decompose reasoning into fixed stages (e.g., task, plan, object grounding), but ECoT generates these sequentially at every step, while Fast ECoT enables parallel generation and reuses cached higher-level reasoning from previous timesteps as context. The dotted lines coloured in green/magenta represent token copying.
  • Figure 4: Illustrative comparison of static vs. continuous batching for reasoning generation. Left: Static batching pads to the longest sequence, processing 4×11=44 tokens. Right: Continuous batching processes only actual tokens (3, 6, 8, 9), adding up to 26 tokens, which reduces padding and improves efficiency.
  • Figure 5: Robot rollout frames at successive timesteps (top row) alongside the corresponding reasoning (bottom row). Between frames, a large part of the reasoning remains unchanged (green). Across the sampled timesteps (t=1, 10, 24, 31, 33, 42), the Subtask updates intermittently, while the low-level Move command adapts continuously as the robot picks up the banana and places it on the plate.
  • ...and 4 more figures
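
The token counts in the Figure 4 caption can be checked with a short Python sketch. The function names and the `pad_to` parameter are hypothetical. The caption's static-batching total of 44 corresponds to a padded width of 11 tokens, which is taken here as an assumption about the figure; padding only to the longest listed length (9) would give 36 instead.

```python
from typing import Optional, Sequence


def static_batch_tokens(lengths: Sequence[int], pad_to: Optional[int] = None) -> int:
    """Tokens processed when every sequence is padded to a common width."""
    width = max(lengths) if pad_to is None else pad_to
    if width < max(lengths):
        raise ValueError("pad_to must be at least the longest sequence length")
    return len(lengths) * width


def continuous_batch_tokens(lengths: Sequence[int]) -> int:
    """Tokens processed when only real (unpadded) tokens are computed."""
    return sum(lengths)


lengths = [3, 6, 8, 9]  # per-sequence token counts from the caption

static_total = static_batch_tokens(lengths, pad_to=11)  # assumed padded width
continuous_total = continuous_batch_tokens(lengths)

print(static_total)      # 44
print(continuous_total)  # 26
print(static_total - continuous_total)  # 18 padding tokens avoided
```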