Explore-Execute Chain: Towards an Efficient Structured Reasoning Paradigm

Kaisen Yang, Lixuan He, Rushi Shah, Kaicheng Yang, Qinwei Ma, Dianbo Liu, Alex Lamb

TL;DR

This paper introduces the Explore-Execute Chain (E2C), a structured reasoning framework that decouples high-level exploration (planning) from deterministic execution (calculation) to improve efficiency and interpretability in LLMs. It employs a two-stage training pipeline (SFT followed by RL) and a causal data-construction method to enforce faithful plan adherence, complemented by Exploration-Focused SFT (EF-SFT) for data-efficient domain adaptation. At test time, the approach reaches $58.1\%$ accuracy on AIME'2024 with under $10\%$ of the decoding tokens required by comparable methods (e.g., Forest-of-Thought). For cross-domain adaptation, EF-SFT yields up to $14.5\%$ higher accuracy than standard SFT on medical benchmarks while fine-tuning on only $3.5\%$ of the tokens, and robustness is maintained through the reward design and plan-execution discipline. Collectively, E2C improves reasoning efficiency, generalization, and transparency; code and models are publicly available.

Abstract

Chain-of-Thought (CoT) and its variants have markedly advanced the reasoning abilities of Large Language Models (LLMs), yet their monolithic and auto-regressive architecture inherently conflates high-level strategic planning with low-level step-by-step execution, leading to computational inefficiency, limited exploration of reasoning paths, and reduced interpretability. To overcome these issues, we propose the Explore-Execute Chain ($E^2C$), a structured reasoning framework that decouples reasoning into two distinct phases: an exploratory phase that stochastically generates succinct high-level plans, followed by an execution phase that deterministically carries out the chosen plan. Our approach incorporates a two-stage training methodology, which combines Supervised Fine-Tuning (SFT) - augmented by a novel data generation algorithm enforcing strict plan adherence - with a subsequent Reinforcement Learning (RL) stage that capitalizes on the informativeness of exploration and reinforces the determinism of execution. This decomposition enables an efficient test-time scaling strategy: on AIME'2024, $E^2C$ Test Time Scaling reaches 58.1% accuracy using <10% of the decoding tokens required by comparable methods (e.g., Forest-of-Thought), sharply cutting self-consistency overhead. For cross-domain adaptation, our Exploration-Focused SFT (EF-SFT) fine-tunes with only 3.5% of the tokens used by standard SFT yet yields up to 14.5% higher accuracy than standard SFT on medical benchmarks, delivering state-of-the-art performance, strong generalization, and greater interpretability by separating planning from execution. The code and pre-trained models for the project are available at: https://github.com/yks23/Explore-Execute-Chain.git
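
The explore-then-execute test-time scaling described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`normalize`, `select_by_majority`, `explore_then_execute`) are hypothetical, the `explore` and `execute` callables stand in for model calls, and plain string normalization is used as a crude substitute for the semantic clustering or LLM-based selection that the paper mentions.

```python
import itertools
from collections import Counter
from typing import Callable, List


def normalize(plan: str) -> str:
    """Crude stand-in for semantic clustering: lowercase and collapse whitespace."""
    return " ".join(plan.lower().split())


def select_by_majority(plans: List[str]) -> str:
    """Return the first plan that belongs to the largest group of equivalent plans."""
    keys = [normalize(p) for p in plans]
    top_key, _ = Counter(keys).most_common(1)[0]
    return plans[keys.index(top_key)]


def explore_then_execute(
    question: str,
    explore: Callable[[str], str],       # assumed: samples one short plan
    execute: Callable[[str, str], str],  # assumed: runs one plan to an answer
    num_plans: int = 8,
) -> str:
    """Sample several short plans, keep one, and pay for a single execution."""
    plans = [explore(question) for _ in range(num_plans)]
    chosen = select_by_majority(plans)
    return execute(question, chosen)


if __name__ == "__main__":
    # Toy stand-ins for the model: plans cycle through three canned strings.
    canned = itertools.cycle(
        ["Use modular arithmetic", "use  modular arithmetic", "Brute force"]
    )
    answer = explore_then_execute(
        "toy question",
        explore=lambda q: next(canned),
        execute=lambda q, plan: f"executed: {plan}",
        num_plans=6,
    )
    print(answer)
```

The design point the sketch tries to capture is that only the short exploration stage is sampled repeatedly, while the long execution stage runs once on the selected plan.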

Paper Structure

This paper contains 49 sections, 15 equations, 3 figures, 7 tables, 3 algorithms.

Figures (3)

  • Figure 1: Our proposed Explore-Execute Chain (E2C) method decomposes reasoning chains into a short, high-level exploratory plan followed by a long, detailed execution (left). After optimizing these special reasoning chains using RL, it is possible to synthesize a large number of plans, use the model to pick the best plan, and then execute this plan (middle). This unlocks dramatically improved overall token efficiency on the challenging AIME'2024 benchmark (right).
  • Figure 2: Overview of E2C method. The approach begins with E2C-SFT to achieve a paradigm shift, followed by a two-stage E2C-RL process that leverages the decomposition advantage of the new paradigm to boost performance. The resulting E2C-LLM can be efficiently adapted to new domains via EF-SFT. The exploration stage's high informativeness enables effective test-time scaling, implementable through semantic clustering or LLM selection.
  • Figure 3: A comparison of training dynamics on the AIME'24 benchmark. The application of our token-weighting coefficient $\lambda_{i,t}$ (b) facilitates faster entropy reduction and superior performance improvement compared to the baseline without it (a).
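
A back-of-the-envelope token count suggests why sampling many plans and executing one, as in Figure 1, can be cheap. The symbols here are illustrative assumptions, not notation from the paper: $k$ is the number of samples, $p$ the average plan length in tokens, and $e$ the average execution length in tokens. The cost of selecting among plans is ignored.

$$
C_{\text{full chains}} \approx k\,(p + e), \qquad
C_{\text{explore, then execute once}} \approx k\,p + e, \qquad
\frac{k\,p + e}{k\,(p + e)} \;\xrightarrow{\;k \to \infty\;}\; \frac{p}{p + e}.
$$

Under these assumptions, the saving grows as plans become short relative to executions ($p \ll e$) and as more candidates are sampled.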