Table of Contents
Fetching ...

NeuroProlog: Multi-Task Fine-Tuning for Neurosymbolic Mathematical Reasoning via the Cocktail Effect

Pratibha Zunjare, Michael Hsiao

TL;DR

A multi-task Cocktail training strategy that jointly optimizes three synergistic objectives in a unified symbolic representation space that enables positive transfer, where symbolic grounding in formula translation directly improves compositional reasoning capabilities.

Abstract

Large Language Models (LLMs) achieve strong performance on natural language tasks but remain unreliable in mathematical reasoning, frequently generating fluent yet logically inconsistent solutions. We present \textbf{NeuroProlog}, a neurosymbolic framework that ensures verifiable reasoning by compiling math word problems into executable Prolog programs with formal verification guarantees. We propose a multi-task Cocktail training strategy that jointly optimizes three synergistic objectives in a unified symbolic representation space: (i) mathematical formula-to-rule translation (KB), (ii) natural language-to-program synthesis (SOLVE), and (iii) program-answer alignment. This joint supervision enables positive transfer, where symbolic grounding in formula translation directly improves compositional reasoning capabilities. At inference, we introduce an execution-guided decoding pipeline with fine-grained error taxonomy that enables iterative program repair and quantifies model self-debugging capacity. Comprehensive evaluation on GSM8K across four model scales (3B--32B parameters) demonstrates consistent improvements: cocktail training achieves significant accuracy gains of +5.23\% (Qwen-32B, $p < 0.01$), +3.43\% (GPT-OSS-20B, $p < 0.01$), and +5.54\% (Llama-3B, $p < 0.05$) over single-task baselines. Systematic error analysis reveals scale-dependent learning dynamics: at 32B scale, cocktail training transforms unfixable type errors (12\% repair rate) into correctable domain errors (96\% repair rate), achieving 92.7\% overall correction; at 8B scale, the same training eliminates syntactic errors but introduces semantic failures, revealing a critical capacity threshold for type-safe symbolic reasoning.

NeuroProlog: Multi-Task Fine-Tuning for Neurosymbolic Mathematical Reasoning via the Cocktail Effect

TL;DR

A multi-task Cocktail training strategy that jointly optimizes three synergistic objectives in a unified symbolic representation space that enables positive transfer, where symbolic grounding in formula translation directly improves compositional reasoning capabilities.

Abstract

Large Language Models (LLMs) achieve strong performance on natural language tasks but remain unreliable in mathematical reasoning, frequently generating fluent yet logically inconsistent solutions. We present \textbf{NeuroProlog}, a neurosymbolic framework that ensures verifiable reasoning by compiling math word problems into executable Prolog programs with formal verification guarantees. We propose a multi-task Cocktail training strategy that jointly optimizes three synergistic objectives in a unified symbolic representation space: (i) mathematical formula-to-rule translation (KB), (ii) natural language-to-program synthesis (SOLVE), and (iii) program-answer alignment. This joint supervision enables positive transfer, where symbolic grounding in formula translation directly improves compositional reasoning capabilities. At inference, we introduce an execution-guided decoding pipeline with fine-grained error taxonomy that enables iterative program repair and quantifies model self-debugging capacity. Comprehensive evaluation on GSM8K across four model scales (3B--32B parameters) demonstrates consistent improvements: cocktail training achieves significant accuracy gains of +5.23\% (Qwen-32B, ), +3.43\% (GPT-OSS-20B, ), and +5.54\% (Llama-3B, ) over single-task baselines. Systematic error analysis reveals scale-dependent learning dynamics: at 32B scale, cocktail training transforms unfixable type errors (12\% repair rate) into correctable domain errors (96\% repair rate), achieving 92.7\% overall correction; at 8B scale, the same training eliminates syntactic errors but introduces semantic failures, revealing a critical capacity threshold for type-safe symbolic reasoning.
Paper Structure (61 sections, 4 equations, 9 figures, 9 tables, 1 algorithm)

This paper contains 61 sections, 4 equations, 9 figures, 9 tables, 1 algorithm.

Figures (9)

  • Figure 1: Comparison of traditional LLM fine-tuning with NeuroProlog approach
  • Figure 2: NeuroProlog Pipeline. Execution-guided decoding with error feedback
  • Figure 3: GSM8K accuracy across four models and three configurations. Annotations show cocktail FT improvement over the base model. Qwen3-8B is the only model where fine-tuning decreases accuracy, revealing a generation--correction trade-off.
  • Figure 4: First-try success vs. correction rate across all configurations
  • Figure 5: Final error distribution after $k{=}3$ correction attempts, by Prolog error type
  • ...and 4 more figures