CodeSteer: Symbolic-Augmented Language Models via Code/Text Guidance

Yongchao Chen, Yilun Hao, Yueying Liu, Yang Zhang, Chuchu Fan

TL;DR

CodeSteer addresses the challenge of steering LLMs between textual reasoning and symbolic code-based computation. It introduces SymBench, a 37-task symbolic benchmark, and a two-stage fine-tuning pipeline (SFT then DPO) for a small CodeSteerLLM that guides larger TaskLLMs. The framework employs Symbolic and Self-answer Checkers to enhance code quality and answer correctness, yielding strong gains over baselines and across unseen models. The results demonstrate that integrating symbolic computing with multi-turn guidance is a scalable path to robust symbolic reasoning in LLMs with broad cross-model generalization.

Abstract

Existing methods fail to effectively steer Large Language Models (LLMs) between textual reasoning and code generation, leaving symbolic computing capabilities underutilized. We introduce CodeSteer, an effective method for guiding LLM code/text generation. We construct a comprehensive benchmark, SymBench, comprising 37 symbolic tasks with adjustable complexity, and also synthesize datasets of 12k multi-turn guidance/generation trajectories and 5.5k guidance comparison pairs. We fine-tune the Llama-3-8B model with newly designed multi-turn supervised fine-tuning (SFT) and direct preference optimization (DPO). The resulting model, CodeSteerLLM, augmented with the proposed symbolic and self-answer checkers, effectively guides the code/text generation of larger models. Augmenting GPT-4o with CodeSteer raises its average performance score from 53.3 to 86.4, outperforming even the best existing LLMs, OpenAI o1 (82.7), o1-preview (74.8), and DeepSeek R1 (76.8), across all 37 tasks (28 seen, 9 unseen). Although trained for GPT-4o, CodeSteer generalizes to other models, providing an average performance boost of 41.8 on Claude, Mistral, and GPT-3.5. CodeSteer-guided LLMs fully harness symbolic computing to maintain strong performance on highly complex tasks. Models, datasets, and code are available at https://github.com/yongchao98/CodeSteer-v1.0 and https://huggingface.co/yongchao98.
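The multi-turn guidance described above can be pictured as a simple loop: the guide model reviews the answers so far, issues guidance for the next turn, and stops when it judges an answer ready. The following is a minimal sketch under stated assumptions, not the authors' implementation. The names `steer`, `Guide`, `Task`, and the `max_turns` cap are hypothetical, and the symbolic and self-answer checkers are not modeled.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces (assumptions, not taken from the CodeSteer codebase):
# a guide maps (question, history) to (guidance text, finalize flag);
# a task model maps (question, guidance) to an answer.
Guide = Callable[[str, List[Tuple[str, str]]], Tuple[str, bool]]
Task = Callable[[str, str], str]


def steer(question: str, guide: Guide, task: Task, max_turns: int = 5) -> str:
    """Run a guide/task loop until the guide finalizes or the turn cap is hit."""
    history: List[Tuple[str, str]] = []  # (guidance, answer) from earlier turns
    answer = ""
    for _ in range(max_turns):  # max_turns is an assumed safety cap
        guidance, finalize = guide(question, history)
        if finalize and history:
            break  # the guide deems the latest answer ready
        answer = task(question, guidance)
        history.append((guidance, answer))
    return answer
```

In this sketch the guide can only finalize after at least one answer exists, so the loop always returns something the task model produced.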

Paper Structure

This paper contains 29 sections, 2 equations, 12 figures, 10 tables.

Figures (12)

  • Figure 1: Examples and performance of CodeSteer on guiding LLM code/text generation to integrate symbolic computing. At each interaction with TaskLLM, it reviews current and previous answers, then provides guidance for the next turn. CodeSteer returns final answers when it deems them ready. With CodeSteer, GPT-4o outperforms OpenAI Code Interpreter, o1, and o1-preview models.
  • Figure 2: Schematic of multi-turn DPO data sampling: blue squares represent intermediate (non-final) turns, and brown ovals mark finalizing turns. Guidance responses from the same parent node in CodeSteerLLM are compared to generate the DPO data.
  • Figure 3: Normalized score distribution of CodeSteer+GPT-4o and o1 in 37 SymBench tasks.
  • Figure 4: Method performance across four representative tasks as task complexity, controlled by value scales, increases from left to right along the x-axis. C.S. and Inter. denote CodeSteer and Interpreter.
  • Figure 5: Score vs. token and runtime costs for each method, highlighting CodeSteer, R1, o1, and o1-preview in red. We display CodeSteer results separately for inferences using single or four H100 GPUs. Specific values are in the score-cost table for each method.
  • ...and 7 more figures
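
The Figure 2 caption says that guidance responses sharing a parent node are compared to produce DPO data. A minimal sketch of that pairing step follows. It assumes each sampled guidance carries a numeric quality score; the `Node` type, the `score` field, and the rule of skipping ties are hypothetical choices for illustration, since the caption does not specify how the comparison is made.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, NamedTuple, Tuple


class Node(NamedTuple):
    parent: str    # id of the shared parent turn
    guidance: str  # guidance text sampled at this node
    score: float   # hypothetical quality score (assumed, not from the source)


def sibling_pairs(nodes: List[Node]) -> List[Tuple[str, str, str]]:
    """Return (parent, chosen, rejected) triples built from sibling nodes."""
    by_parent: Dict[str, List[Node]] = defaultdict(list)
    for node in nodes:
        by_parent[node.parent].append(node)

    pairs: List[Tuple[str, str, str]] = []
    for parent, siblings in by_parent.items():
        for a, b in combinations(siblings, 2):
            if a.score == b.score:
                continue  # equal scores carry no preference signal
            chosen, rejected = (a, b) if a.score > b.score else (b, a)
            pairs.append((parent, chosen.guidance, rejected.guidance))
    return pairs
```

Only siblings are compared here, so each preference pair shares the same conversation prefix, which is what a preference loss over (chosen, rejected) responses to one prompt requires.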