Symbol-LLM: Towards Foundational Symbol-centric Interface For Large Language Models
Fangzhi Xu, Zhiyong Wu, Qiushi Sun, Siyu Ren, Fei Yuan, Shuai Yuan, Qika Lin, Yu Qiao, Jun Liu
TL;DR
This work addresses the limitation that large language models interact almost entirely through natural language (NL) by introducing Symbol-LLM, a symbol-centric foundation model trained on 34 text-to-symbol tasks covering approximately 20 symbolic families. A two-stage tuning framework, with an Injection Stage that learns symbolic knowledge and an Infusion Stage that balances it with general NL data, mitigates forgetting and preserves NL capabilities, yielding the Symbol-LLM Base and Symbol-LLM Instruct variants. Empirical results show substantial improvements on symbol-related tasks, competitive NL performance, and strong results on math reasoning that is delegated to symbolic solvers, including out-of-distribution (OOD) extrapolation. The paper also introduces Symbol-evol to diversify symbolic definitions and analyzes Alignment and Uniformity to understand the embedding structure. The models and data are released as open source to foster further development of symbol-centric LLMs.
Abstract
Although Large Language Models (LLMs) demonstrate remarkable ability in processing and generating human-like text, they are limited in comprehending and expressing world knowledge that extends beyond the boundaries of natural language (e.g., chemical molecular formulas). Injecting a collection of symbolic data directly into the training of LLMs can be problematic, as it disregards the synergies among different symbolic families and overlooks the need for a balanced mixture of natural and symbolic data. In this work, we tackle these challenges from both the data and framework perspectives and introduce the Symbol-LLM series of models. First, we curate a data collection consisting of 34 tasks and incorporating approximately 20 distinct symbolic families, intending to capture the interrelations and foster synergies between symbols. Then, a two-stage tuning framework injects symbolic knowledge without loss of general NL ability. Extensive experiments on both symbol- and NL-centric tasks demonstrate the balanced and superior performance of the Symbol-LLM series of models. The project page is https://xufangzhi.github.io/symbol-llm-page/.
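The two-stage data schedule described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the first stage trains on symbolic data only and the second stage mixes a sampled subset of that symbolic data with general NL data. The function name `build_stage_datasets` and the parameter `symbolic_fraction` are hypothetical, and the 0.5 default is a placeholder rather than a value from the paper.

```python
import random


def build_stage_datasets(symbolic, general_nl, symbolic_fraction=0.5, seed=0):
    """Return (stage1, stage2) training lists for a two-stage schedule.

    Assumption: stage 1 uses only symbolic examples; stage 2 mixes a
    random subset of the symbolic examples with all general NL examples.
    The mixing ratio is a hypothetical knob, not a value from the paper.
    """
    rng = random.Random(seed)

    # Stage 1: symbolic examples only, shuffled.
    stage1 = list(symbolic)
    rng.shuffle(stage1)

    # Stage 2: replay a sampled subset of symbolic data next to NL data.
    k = int(len(symbolic) * symbolic_fraction)
    stage2 = rng.sample(list(symbolic), k) + list(general_nl)
    rng.shuffle(stage2)

    return stage1, stage2


# Toy data: each example is a (kind, id) pair standing in for a real sample.
symbolic_data = [("symbolic", i) for i in range(10)]
nl_data = [("nl", i) for i in range(6)]
stage1, stage2 = build_stage_datasets(symbolic_data, nl_data)
```

Replaying part of the symbolic data in the second stage is one simple way to keep symbolic knowledge in the mix while general NL data is introduced; the right ratio would have to be tuned empirically.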
