MentorCollab: Selective Large-to-Small Inference-Time Guidance for Efficient Reasoning

Haojin Wang, Yike Wang, Shangbin Feng, Hannaneh Hajishirzi, Yulia Tsvetkov

TL;DR

MentorCollab tackles the cost and inefficiency of large reasoning models through inference-time collaboration in which a small language model is guided by a larger mentor only at selected points. The framework uses a three-stage pipeline (Decision, Consultation, and Verification) with either a prompt-based or a lightweight MLP verifier that decides whether to adopt short mentor-provided lookahead segments, keeping the generator in control. Empirically, it improves performance in 12 of 15 generator-mentor settings across math, general knowledge, and commonsense domains, with average gains of 3.0% and up to 8.0%, while only about 18% of tokens are generated by the mentor. The results suggest that concise, targeted, and verified mentor guidance can recover much of large-model reasoning ability with substantially reduced inference overhead.

Abstract

Large reasoning models (LRMs) achieve strong performance by producing long chains of thought, but they are expensive at inference time and often generate redundant reasoning. Small language models (SLMs) are far more efficient, yet struggle on multi-step reasoning tasks. A natural idea is to let a large model guide a small one at inference time as a mentor, yet existing collaboration methods often promote imitation, resulting in verbose reasoning without consistent error correction. We propose MentorCollab, an inference-time collaboration method in which an LRM selectively and sparsely guides an SLM, rather than taking over generation. At randomly sampled token positions, we probe for divergences between the two models and use a lightweight verifier to decide whether the SLM should follow a short lookahead segment from its mentor or continue on its own. Across 15 SLM-LRM pairs and 3 domains (math reasoning, general knowledge, and commonsense reasoning), our method improves performance in 12 settings, with average gains of 3.0% and up to 8.0%, while only 18.4% of tokens on average are generated by the expensive mentor model. We find that short segments and selective probing are sufficient for effective collaboration. Our results show that selective inference-time guidance restores large-model reasoning ability without substantial inference overhead.
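
The probing procedure described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration, not the authors' implementation: the function names (`lookahead`, `mentor_collab`), the parameters `probe_rate` and `segment_len`, greedy decoding, and the toy models and toy verifier, which stand in for real language models and for the paper's prompt-based or MLP verifier.

```python
import random
from typing import Callable, List, Tuple

Token = str
EOS = "<eos>"
# A "model" here is any callable mapping a token context to the next token
# (hypothetical interface; real LMs would return logits or sampled tokens).
NextToken = Callable[[List[Token]], Token]
# A verifier returns True when the mentor's segment should be adopted.
Verifier = Callable[[List[Token], List[Token], List[Token]], bool]


def lookahead(model: NextToken, ctx: List[Token], k: int) -> List[Token]:
    """Greedily roll `model` forward by up to k tokens, stopping at EOS."""
    seg: List[Token] = []
    for _ in range(k):
        tok = model(ctx + seg)
        seg.append(tok)
        if tok == EOS:
            break
    return seg


def mentor_collab(generator: NextToken, mentor: NextToken, verifier: Verifier,
                  prompt: List[Token], max_tokens: int = 64,
                  probe_rate: float = 0.25, segment_len: int = 4,
                  seed: int = 0) -> Tuple[List[Token], int]:
    """Return (generated tokens, number of tokens taken from the mentor)."""
    rng = random.Random(seed)
    out: List[Token] = []
    mentor_tokens = 0
    while len(out) < max_tokens and (not out or out[-1] != EOS):
        ctx = prompt + out
        gen_tok = generator(ctx)
        # Decision: probe only at randomly sampled positions, and consult
        # the mentor further only if its next token disagrees.
        if rng.random() < probe_rate and mentor(ctx) != gen_tok:
            k = min(segment_len, max_tokens - len(out))
            # Consultation: both models propose a short future segment.
            gen_seg = lookahead(generator, ctx, k)
            men_seg = lookahead(mentor, ctx, k)
            # Verification: the verifier picks which segment to follow.
            if verifier(ctx, gen_seg, men_seg):
                out.extend(men_seg)
                mentor_tokens += len(men_seg)
            else:
                out.extend(gen_seg)
        else:
            out.append(gen_tok)
    return out, mentor_tokens


# Toy stand-ins: the generator answers "2 + 2 =" wrongly, the mentor correctly.
def toy_generator(ctx: List[Token]) -> Token:
    return "5" if len(ctx) == 4 else EOS


def toy_mentor(ctx: List[Token]) -> Token:
    return "4" if len(ctx) == 4 else EOS


def toy_verifier(ctx: List[Token], gen_seg: List[Token],
                 men_seg: List[Token]) -> bool:
    # Toy rule only: accept the mentor segment if it contains the right answer.
    return "4" in men_seg


if __name__ == "__main__":
    prompt = ["2", "+", "2", "="]
    print(mentor_collab(toy_generator, toy_mentor, toy_verifier,
                        prompt, probe_rate=1.0))
```

In this sketch the mentor is queried only at probed positions, and its tokens enter the output only when the verifier accepts its segment, so the generator remains in control of the rest of the sequence.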

Paper Structure

This paper contains 35 sections, 6 equations, 9 figures, and 5 tables.

Figures (9)

  • Figure 1: Example of MentorCollab on a query from the MATH dataset (Hendrycks et al., 2021). MentorCollab proposes reasoning segments from the mentor LRM and verifies them before adoption, enabling the small model to correct its initial mistake while avoiding the redundant reasoning seen in prior collaboration methods. Tokens generated by the mentor LRM are highlighted in red in the figure.
  • Figure 2: Overview of MentorCollab. At randomly sampled token positions, if the generator token and the mentor token disagree, we prompt both models to each produce a short future segment. A verifier then selects which segment to follow, and generation proceeds from the selected continuation.
  • Figure 3: Incorporated mentor token count and performance comparison across methods on Qwen3-8B-Base with different mentor LRMs. While incorporating a comparable or even smaller number of mentor-generated tokens in the final output than Nudging (Fei et al., 2025) and CoSD (Wang et al., 2025), MentorCollab consistently achieves higher performance, effectively guiding the generator toward more accurate generations.
  • Figure 4: MentorCollab performance with different decision proportions. We report the performance from multiple generator models with multiple mentor LRMs.
  • Figure 5: Effect of the verifier on MentorCollab. We report the accuracy gains of MentorCollab over directly injecting reasoning segments from Qwen3-32B into various generator models on MATH.
  • ...and 4 more figures