SLM-MUX: Orchestrating Small Language Models for Reasoning

Chenyu Wang, Zishen Wan, Hao Kang, Emma Chen, Zhiqiang Xie, Tushar Krishna, Vijay Janapa Reddi, Yilun Du

TL;DR

The paper tackles the challenge of leveraging multiple small language models (SLMs) to achieve higher reasoning accuracy than any single SLM by proposing SLM-MUX, a static, confidence-based orchestration approach that avoids inter-model dialogue. It demonstrates that prior discussion-based orchestration methods, effective on frontier LLMs, can degrade SLM performance due to groupthink; SLM-MUX instead selects outputs based on per-model consistency and, when needed, validation accuracy as a tie-breaker. To maximize effectiveness, the authors introduce a model selection search to identify complementary model subsets and compute-scaling strategies to trade off accuracy and compute, achieving up to $13.4\%$ gains on MATH, $8.8\%$ on GPQA, and $7.0\%$ on GSM8K, with two SLMs sometimes surpassing a $72$B-parameter frontier model on certain benchmarks. The work provides both theoretical analysis and empirical validation, showing that intelligently orchestrating smaller, cheaper models can approach or exceed the performance of larger models while offering practical efficiency benefits. The results suggest a promising paradigm for scalable AI systems built from ensembles of SLMs, with clear directions for adaptive selection and richer confidence metrics in future work.

Abstract

With the rapid development of language models, the number of small language models (SLMs) has grown significantly. Although they do not achieve state-of-the-art accuracy, they are more efficient and often excel at specific tasks. This raises a natural question: can multiple SLMs be orchestrated into a system where each contributes effectively, achieving higher accuracy than any individual model? Existing orchestration methods have primarily targeted frontier models (e.g., GPT-4) and perform suboptimally when applied to SLMs. To address this gap, we propose a three-stage approach for orchestrating SLMs. First, we introduce SLM-MUX, a multi-model architecture that effectively coordinates multiple SLMs. Building on this, we develop two optimization strategies: (i) a model selection search that identifies the most complementary SLMs from a given pool, and (ii) test-time scaling tailored to SLM-MUX. Compared to existing orchestration methods, our approach achieves improvements of up to 13.4% on MATH, 8.8% on GPQA, and 7.0% on GSM8K. With just two SLMs, SLM-MUX outperforms Qwen 2.5 72B on GPQA and GSM8K, and matches its performance on MATH. We further provide theoretical analyses to substantiate the advantages of our method. In summary, we demonstrate that SLMs can be effectively orchestrated into more accurate and efficient systems through the proposed approach.

Paper Structure

This paper contains 28 sections, 4 equations, 15 figures, 7 tables, 1 algorithm.

Figures (15)

  • Figure 1: Head-to-Head Comparison of SLM-MUX with Other Methods. SLM-MUX outperforms existing methods such as Self-Consistency (SC) (Wang et al., 2023), Mixture-of-Agents (MoA) (Wang et al., 2024), LLM-Debate (Du et al., 2023), Multi-Agent Verification (MAV) (Lifshitz et al., 2025), and Agent Forest (Li et al., 2024). Results are reported on the MATH dataset with SLMs.
  • Figure 2: Comparing SLM-MUX (Ours) with Existing LLM Orchestration Methods. (a) Mixture-of-Agents, (b) LLM-Debate, (c) Multi-Agent Verification, (d) SLM-MUX (Ours).
  • Figure 3: Illustration of SLM-MUX Workflow. (1) Each SLM first independently generates multiple outputs for the same question. (2) The most frequent answer from each SLM is selected, and its frequency in the answer pool is used as the confidence score. (3) The answers with the highest confidence score are selected. (4) If multiple answers share the same confidence score, the tie is broken by selecting the answer from the SLM with the highest accuracy on the validation set.
  • Figure 4: Comparison of Model Choices. Accuracy on 7 subjects for two model selection settings on MATH dataset. Subjects are denoted as: A = Prealgebra, B = Algebra, C = Intermediate Algebra, D = Number Theory, E = Counting & Probability, F = Geometry, G = Precalculus.
  • Figure 5: Comparison of discussion-based orchestration when invoking SLMs and LLMs. We compare three orchestration methods (Mixture-of-Agents, LLM-Debate, and Verification) using (a) SLMs (Llama 3.1 8B, Mistral 8$\times$7B, Gemma 2 27B) and (b) frontier LLMs (DeepSeek V3, Gemini 2.0 Flash, GPT-4o) on the MATH and GPQA datasets. The baseline (Single-Model Max) reflects the best performance of any individual model. An orchestration method is considered successful if it surpasses Single-Model Max.
  • ...and 10 more figures
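
The four-step workflow in the Figure 3 caption can be illustrated with a short Python sketch. This is a minimal reading of the caption, not the authors' implementation: the function name slm_mux_select, the argument names, and the model names in the usage example are hypothetical, and it assumes that "frequency in the answer pool" means the share of a model's own samples that agree with its most frequent answer.

```python
from collections import Counter
from typing import Dict, List, Tuple


def slm_mux_select(
    samples: Dict[str, List[str]],
    val_accuracy: Dict[str, float],
) -> Tuple[str, str, float]:
    """Pick one final answer from several models' sampled answers.

    samples: model name -> final answers from repeated sampling of that model
             (hypothetical input format).
    val_accuracy: model name -> accuracy on a held-out validation set.
    Returns (chosen model, chosen answer, confidence).
    """
    candidates = []
    for model, answers in samples.items():
        if not answers:
            continue  # a model with no samples cannot be scored
        # Steps 1-2: take the model's most frequent answer; its share of the
        # model's own samples serves as the confidence score (assumption).
        top_answer, count = Counter(answers).most_common(1)[0]
        confidence = count / len(answers)
        candidates.append(
            (confidence, val_accuracy.get(model, 0.0), model, top_answer)
        )
    if not candidates:
        raise ValueError("no model produced any answers")
    # Steps 3-4: highest confidence wins; validation accuracy breaks ties.
    confidence, _, model, answer = max(candidates, key=lambda c: (c[0], c[1]))
    return model, answer, confidence


if __name__ == "__main__":
    picked = slm_mux_select(
        {"model_a": ["42", "42", "41"], "model_b": ["40", "40", "40"]},
        {"model_a": 0.6, "model_b": 0.5},
    )
    print(picked)  # ('model_b', '40', 1.0)
```

In the usage example, model_b is chosen even though its validation accuracy is lower, because all three of its samples agree; validation accuracy only matters when two models are equally self-consistent.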