Neural-Symbolic Collaborative Distillation: Advancing Small Language Models for Complex Reasoning Tasks

Huanxuan Liao; Shizhu He; Yao Xu; Yuanzhe Zhang; Kang Liu; Jun Zhao

TL;DR

The paper addresses the difficulty of equipping small language models with complex reasoning by decoupling general reasoning ability from specialized knowledge. It introduces NesyCD, a neural-symbolic collaborative distillation framework in which general abilities are learned through neural distillation from LLM teachers, while sparse specialized knowledge is stored in a symbolic knowledge base (KB) generated from the student's errors. The method uses retrieval-augmented distillation to bring relevant KB content into both training and inference, complemented by auxiliary tasks such as Answer Prediction and Direct CoT that reinforce reasoning. On in-domain and out-of-domain benchmarks, NesyCD significantly improves small models, with some configurations surpassing GPT-3.5-turbo and approaching much larger models, which supports the practical viability of neural-symbolic knowledge integration for efficient complex reasoning.

Abstract

In this paper, we propose Neural-Symbolic Collaborative Distillation (NesyCD), a novel knowledge distillation method for learning the complex reasoning abilities of Large Language Models (LLMs, e.g., >13B). We argue that complex reasoning tasks are difficult for Small Language Models (SLMs, e.g., ≤7B) because these tasks demand not only general cognitive abilities but also specialized knowledge, which is often sparse and difficult for neural-based SLMs to capture effectively. NesyCD therefore distills the general capabilities and the specialized knowledge of LLMs in different ways. On the one hand, we distill only general abilities from teacher LLMs into the student SLMs, which are parameterized neural networks. On the other hand, for the specialized abilities and uncommon knowledge of a complex reasoning task, we employ a symbolic knowledge distillation approach to obtain the specialized knowledge and store it in a symbolic knowledge base (KB). By decoupling general and specialized capabilities, NesyCD achieves superior performance cost-effectively, using smaller models and combining parameterized neural networks with a symbolic KB. Moreover, the specialized KB generalizes well and can be understood and manipulated by humans. Our experiments show that NesyCD significantly boosts SLMs' complex reasoning performance on in-domain (BBH, GSM8K) and out-of-domain (AGIEval, ARC) datasets. Notably, our approach enabled LLaMA3-8B and Qwen2-7B to surpass GPT-3.5-turbo and come close to matching LLaMA3-70B, despite the latter having nine times more parameters. Our code will be available at https://github.com/Xnhyacinth/NesyCD.
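To make the idea of a human-readable symbolic KB concrete, the following is a minimal sketch of how specialized knowledge might be stored as plain-text entries and retrieved for a new question. Everything here is an assumption made for illustration: the `KBEntry` schema, the `retrieve` and `build_prompt` helpers, the prompt wording, and the use of token-overlap (Jaccard) similarity as a stand-in retriever. The abstract does not specify how NesyCD retrieves from its KB.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KBEntry:
    """One human-readable piece of specialized knowledge (hypothetical schema)."""
    question: str   # question the student originally answered incorrectly
    knowledge: str  # teacher-written note describing the missing knowledge


def _tokens(text: str) -> set[str]:
    # Lowercased whitespace tokens; a deliberately simple stand-in tokenizer.
    return set(text.lower().split())


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity in [0, 1]; assumed here, not taken from the paper."""
    ta, tb = _tokens(a), _tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def retrieve(kb: list[KBEntry], query: str, k: int = 1) -> list[KBEntry]:
    """Return the k entries whose stored question is most similar to the query."""
    ranked = sorted(kb, key=lambda e: jaccard(e.question, query), reverse=True)
    return ranked[:k]


def build_prompt(kb: list[KBEntry], query: str, k: int = 1) -> str:
    """Prepend retrieved knowledge to the question (illustrative prompt format)."""
    notes = "\n".join(f"- {e.knowledge}" for e in retrieve(kb, query, k))
    return f"Relevant knowledge:\n{notes}\n\nQuestion: {query}\nAnswer:"


kb = [
    KBEntry("how many apples remain after eating two",
            "Subtract the eaten amount from the total."),
    KBEntry("which planet is closest to the sun",
            "Mercury is the innermost planet."),
]
prompt = build_prompt(kb, "how many apples remain after selling three")
```

Because the entries are ordinary text, a person can read, correct, or extend them without retraining the student model, which is the property the abstract attributes to the specialized KB.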

Paper Structure (34 sections, 7 equations, 4 figures, 10 tables)

Introduction
Related Work
CoT Distillation from LLMs
Knowledge-Augmented LMs
Learning from Errors
Methods
General Distillation
Demonstration Collection
Symbolic Knowledge Distillation
Symbolic KB Augmented Neural Distillation
Experiments
Datasets
Baselines
Implementations
Main Results
...and 19 more sections

Figures (4)

Figure 1: CoT distillation aims to train SLMs with the generated rationales obtained from LLMs, which is often limited by the SLMs' capabilities and frequently struggles to handle hard questions. The proposed NesyCD addresses this by decoupling the general and specialized knowledge of LLMs through SLMs' error analysis. It employs neural-based SLMs to model general knowledge while utilizing a symbolic specialized knowledge base (KB) to store specific knowledge. By adaptively utilizing the KB, NesyCD enhances the SLM's ability to handle complex reasoning tasks.
Figure 2: Overview of NesyCD. 1) General Distillation: fine-tune the student $\mathcal{S}_{P}$ to generate the rationales obtained from the teacher $\mathcal{T}_{G}$ together with the answers. 2) Demonstration Collection: evaluate $\mathcal{S}_{P}$ and collect the cases it answers correctly and the cases it gets wrong. 3) Symbolic Knowledge Distillation: the teacher $\mathcal{T}_{T}$ analyzes the errors and generates the specialized KB. 4) Symbolic KB Augmented Neural Distillation: use multi-task learning to fine-tune $\mathcal{S}_{E}$ so that it can effectively use the retrieved specialized knowledge.
Figure 3: Efficiency on training data and model size. The backbone model for the data size variation is Qwen2-1.5B.
Figure 4: Performance variation trend on $\Delta_{\text{threshold}}$. The results are reported by ID-Avg and OOD-Avg which respectively denote average accuracy on ID and OOD datasets.
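
The Demonstration Collection stage named in the Figure 2 caption (evaluate the student, then separate correct cases from error cases) can be sketched as follows. This is an illustrative sketch only: the `Case` fields, the `normalize` rule, and exact-match scoring are assumptions, since the captions do not say how the paper compares student answers with reference answers.

```python
from dataclasses import dataclass


@dataclass
class Case:
    """One evaluated example (hypothetical field names)."""
    question: str
    gold_answer: str
    student_rationale: str
    student_answer: str


def normalize(answer: str) -> str:
    # Assumed normalization: trim whitespace, lowercase, drop trailing periods.
    return answer.strip().lower().rstrip(".")


def collect_demonstrations(cases: list[Case]) -> tuple[list[Case], list[Case]]:
    """Split evaluated cases into (correct, errors) by exact match after normalization."""
    correct: list[Case] = []
    errors: list[Case] = []
    for case in cases:
        if normalize(case.student_answer) == normalize(case.gold_answer):
            correct.append(case)
        else:
            errors.append(case)
    return correct, errors


cases = [
    Case("What is 6 * 7?", "42", "6 times 7 is 42.", "42."),
    Case("What is 3 + 5?", "8", "3 plus 5 is 7.", "7"),
]
correct, errors = collect_demonstrations(cases)
```

In this sketch, the `errors` list is what a teacher model would analyze in the next stage to write specialized KB entries.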
