Interactive Learning for LLM Reasoning
Hehai Lin, Shilei Cao, Sudong Wang, Haotian Wu, Minzhi Li, Linyi Yang, Juepeng Zheng, Chengwei Qin
TL;DR
This work introduces ILR, a co-learning framework that uses multi-agent interaction during training to improve each LLM's independent problem-solving ability. ILR combines two components. Dynamic Interaction selects cooperation or competition using IRT-based difficulty estimation, then lets the models exchange information through Idea3 (Idea Sharing, Idea Analysis, and Idea Fusion). Perception Calibration trains the models with GRPO, folding one LLM's reward distribution characteristics into another's reward function to make the interaction more cohesive. Across five math benchmarks and one coding benchmark, ILR consistently improves independent reasoning, with gains of up to 5% over the strongest baseline. The findings also show that Idea3 enhances the robustness of stronger LLMs during multi-agent inference, and that dynamically chosen interaction types yield better learning than purely cooperative or purely competitive regimes. Because each model learns to solve problems on its own, the MAS does not need to be re-executed at inference.
Abstract
Existing multi-agent learning approaches have developed interactive training environments that explicitly promote collaboration among multiple Large Language Models (LLMs), thereby constructing stronger multi-agent systems (MAS). At inference time, however, they must re-execute the MAS to obtain final solutions. This diverges from human cognition, in which individuals strengthen their reasoning through interaction with others and later resolve questions independently. To investigate whether multi-agent interaction can enhance LLMs' independent problem-solving ability, we introduce ILR, a novel co-learning framework for MAS that integrates two key components: Dynamic Interaction and Perception Calibration. Dynamic Interaction first adaptively selects either a cooperative or a competitive strategy depending on question difficulty and model ability. The LLMs then exchange information through Idea3 (Idea Sharing, Idea Analysis, and Idea Fusion), an interaction paradigm designed to mimic human discussion, before each derives its own final answer. In Perception Calibration, ILR employs Group Relative Policy Optimization (GRPO) to train the LLMs while integrating one LLM's reward distribution characteristics into another's reward function, thereby enhancing the cohesion of multi-agent interactions. We validate ILR on three LLMs from two model families of varying scales, evaluating performance on five mathematical benchmarks and one coding benchmark. Experimental results show that ILR consistently outperforms single-agent learning, yielding an improvement of up to 5% over the strongest baseline. We further find that Idea3 enhances the robustness of stronger LLMs during multi-agent inference, and that dynamically selected interaction types boost multi-agent learning compared with purely cooperative or purely competitive strategies.
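To make the two mechanisms named above more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a one-parameter (Rasch) IRT model for the probability that a model solves a question, and it assumes a simple rule (cooperate when that probability is low, compete otherwise); the abstract does not specify either choice. The function names, the `threshold` parameter, and the use of population standard deviation are all hypothetical. The last function shows the standard group-relative advantage used in GRPO, where each sampled response's reward is normalized by the mean and standard deviation of its group.

```python
import math
from statistics import mean, pstdev


def solve_probability(ability: float, difficulty: float) -> float:
    """Rasch (1PL) IRT: P(correct) = sigmoid(ability - difficulty)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def choose_interaction(ability: float, difficulty: float,
                       threshold: float = 0.5) -> str:
    """Pick an interaction type for one question.

    Assumption (not stated in the abstract): a question the model is
    unlikely to solve alone triggers cooperation, otherwise competition.
    """
    p = solve_probability(ability, difficulty)
    return "cooperate" if p < threshold else "compete"


def group_relative_advantages(rewards: list[float],
                              eps: float = 1e-8) -> list[float]:
    """GRPO-style advantages: normalize each reward within its group.

    Population standard deviation is an illustrative choice; eps guards
    against division by zero when all rewards in the group are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # A hard question relative to ability, then an easy one.
    print(choose_interaction(ability=0.0, difficulty=2.0))
    print(choose_interaction(ability=2.0, difficulty=0.0))
    # Rewards for four sampled responses to the same question.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Under these assumptions, correct responses in a mixed group receive positive advantages and incorrect ones negative advantages, while a group with identical rewards yields zero advantage for every response. How ILR folds one model's reward distribution into another's reward function is not described at this level of detail in the abstract, so it is left out of the sketch.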
