Optimizing Language Model's Reasoning Abilities with Weak Supervision

Yongqi Tong; Sizhe Wang; Dawei Li; Yifan Wang; Simeng Han; Zi Lin; Chengsong Huang; Jiaxin Huang; Jingbo Shang


TL;DR

This work tackles the scalability challenge of improving LLM reasoning by proposing a weak-to-strong learning framework called self-reinforcement, which iteratively refines reasoning with minimal human annotation. It introduces PuzzleBen, a large weakly-supervised benchmark (25,147 labeled questions with rationales and 10,000 unlabeled questions) spanning brainteasers, puzzles, riddles, parajumbles, and critical reasoning. The method combines base supervised fine-tuning, self-filtering of unlabeled data, and differential performance optimization to progressively outperform baselines, demonstrated by significant gains on PuzzleBen (e.g., from 10.38 to 37.82) and ablation evidence for self-filtering. The approach reduces reliance on extensive human rationales while maintaining strong reasoning improvements, offering a practical direction for future LLM reasoning under limited supervision.

Abstract

While Large Language Models (LLMs) have demonstrated proficiency in handling complex queries, much of the past work has depended on datasets extensively annotated by human experts. However, this reliance on fully-supervised annotations poses scalability challenges, particularly as models and data requirements grow. To mitigate this, we explore the potential of enhancing LLMs' reasoning abilities with minimal human supervision. In this work, we introduce self-reinforcement, which begins with Supervised Fine-Tuning (SFT) of the model using a small collection of annotated questions. It then iteratively improves LLMs by learning from the differences between the responses of the SFT and unfinetuned models on unlabeled questions. Our approach offers an efficient alternative that does not rely heavily on extensive human-annotated explanations. However, current reasoning benchmarks typically only include golden-reference answers or rationales. Therefore, we present PuzzleBen, a weakly supervised benchmark that comprises 25,147 complex questions, answers, and human-generated rationales across various domains, such as brainteasers, puzzles, riddles, parajumbles, and critical reasoning tasks. A unique aspect of our dataset is the inclusion of 10,000 unannotated questions, enabling us to explore how less supervised data can be used to boost LLMs' inference capabilities. Our experiments underscore the significance of PuzzleBen, as well as the effectiveness of our methodology as a promising direction for future work. Our dataset and code will be published soon on Anonymity Link.
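The loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Model` type, the `keep` filter, the `update` step, and the choice to reuse the previous strong model as the next weak model are all assumptions made for illustration, and the toy models at the end are stand-ins for real LLM calls.

```python
from typing import Callable, Dict, List

# Hypothetical type: a "model" maps a question to a response string.
Model = Callable[[str], str]


def build_pairs(
    questions: List[str],
    strong: Model,
    weak: Model,
    keep: Callable[[str, str], bool],
) -> List[Dict[str, str]]:
    """Pair strong-model (chosen) and weak-model (rejected) responses."""
    pairs = []
    for q in questions:
        chosen, rejected = strong(q), weak(q)
        # Stand-in for self-filtering: drop responses the filter rejects
        # and pairs that carry no difference to learn from.
        if keep(q, chosen) and chosen != rejected:
            pairs.append({"prompt": q, "chosen": chosen, "rejected": rejected})
    return pairs


def self_reinforce(
    questions: List[str],
    sft_model: Model,
    base_model: Model,
    keep: Callable[[str, str], bool],
    update: Callable[[Model, List[Dict[str, str]]], Model],
    rounds: int = 2,
) -> Model:
    """Assumed iteration scheme: the updated model becomes the new strong
    model and the previous strong model becomes the new weak model."""
    strong, weak = sft_model, base_model
    for _ in range(rounds):
        pairs = build_pairs(questions, strong, weak, keep)
        strong, weak = update(strong, pairs), strong
    return strong


# Toy stand-ins (assumptions, not real LLMs).
base = lambda q: "unsure"
sft = lambda q: f"answer to {q}"
keep_nonempty = lambda q, r: len(r) > 0
pairs = build_pairs(["q1", "q2"], sft, base, keep_nonempty)
```

In a real setting, `update` would be a training step on the collected pairs, and `keep` would encode whatever quality check self-filtering applies to unlabeled questions.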


Paper Structure (32 sections, 6 equations, 4 figures, 8 tables)


Introduction
Related Work
LLMs' Reasonings
Reinforcement Learning
Self-training and Self-improvement
Weak-to-strong Learning and Generalizations
Weakly-supervised Learning
Our Methodology: Self-Reinforcement
Step 1: Base Modeling
Step 2: Self-Filtering
Step 3: Self-Reinforcement
Iterative Self-Reinforcement
Data Collection for PuzzleBen
Brainteasers
Riddles
...and 17 more sections

Figures (4)

Figure 1: Overview of our method, self-reinforcement, and the detailed implementation of self-filtering within it. This is an iterative weak-to-strong learning framework that aims to improve LLMs' reasoning under weak supervision. Blue content indicates responses from strong models, while red content comes from weaker models.
Figure 2: Question examples from PuzzleBen. The detailed texts are attached in Table \ref{tab: dataset_example}.
Figure 3: Average length of questions and rationales in PuzzleBen and other existing benchmarks.
Figure 4: Accuracy of Llama2-13b across interval-based difficulty score ranges on the subset of PuzzleBen. The difficulty ratings represent the average of all user-assigned scores ranging from 1 to 4, with each category containing an equal number of items.
