Unilaw-R1: A Large Language Model for Legal Reasoning with Reinforcement Learning and Iterative Inference

Hua Cai, Shuang Zhao, Liang Zhang, Xuli Shen, Qing Xu, Weilin Shen, Zihao Wen, Tianke Ban

TL;DR

Unilaw-R1 is a 7B legal reasoning LLM trained via a two-stage SFT+GRPO RL pipeline and enhanced by an explicit iterative Assessor-Reviser inference loop. The approach builds a high-quality legal CoT dataset (Unilaw-R1-Data) and a dedicated evaluation benchmark (Unilaw-R1-Eval), achieving strong results on LawBench and LexEval and competitive scores against larger models. Key innovations include a legal validity reward, a GRPO objective, and multi-agent iterative refinement to improve reasoning transparency and accuracy. The work demonstrates that compact, domain-tuned LLMs can deliver robust legal reasoning with cost-efficient deployment, benefiting legal AI applications and risk management.

Abstract

Reasoning-focused large language models (LLMs) are rapidly evolving across various domains, yet their capabilities in handling complex legal problems remain underexplored. In this paper, we introduce Unilaw-R1, a large language model tailored for legal reasoning. With a lightweight 7-billion parameter scale, Unilaw-R1 significantly reduces deployment cost while effectively tackling three core challenges in the legal domain: insufficient legal knowledge, unreliable reasoning logic, and weak business generalization. To address these issues, we first construct Unilaw-R1-Data, a high-quality dataset containing 17K distilled and screened chain-of-thought (CoT) samples. Based on this, we adopt a two-stage training strategy combining Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), which significantly boosts performance on complex legal reasoning tasks and supports interpretable decision-making in legal AI applications. To assess legal reasoning ability, we also introduce Unilaw-R1-Eval, a dedicated benchmark designed to evaluate models across single- and multi-choice legal tasks. Unilaw-R1 demonstrates strong results on authoritative benchmarks, outperforming all models of similar scale and achieving performance on par with the much larger DeepSeek-R1-Distill-Qwen-32B (54.9%). Following domain-specific training, it also shows significant gains on LawBench and LexEval, exceeding Qwen-2.5-7B-Instruct (46.6%) by an average margin of 6.6%.

Paper Structure

This paper contains 36 sections, 9 equations, 12 figures, 6 tables.

Figures (12)

  • Figure 1: The pipeline for constructing Unilaw-R1. The diagram depicts the two-stage construction framework of Unilaw-R1: Data Generation (using DeepSeek-R1 for reasoning to generate CoT data, followed by quality filtering with DeepSeek-V3) and Model Training (including SFT pretraining and GRPO optimization for Unilaw-R1).
  • Figure 2: The pipeline of Data Construction (Stage 1): (1) Data Distillation, (2) Data Filtering, including Answer Check and Reasoning Selection, Chain Rewriting, and Explanation Generation. "Reasoning" represents the reasoning output, while "Model Response" refers to the evaluation process of the judgment model.
  • Figure 3: Iterative inference pipeline, consisting of four main stages: sampling, reviewing, refinement, and final answer selection. The reviewing and refinement stages involve a multi-agent setup, with separate Assessor and Revisor agents.
  • Figure 4: Comparison of convergence behavior of Unilaw-R1-SFT under different combinations of reinforcement learning reward functions on the Unilaw-R1-Eval benchmark.
  • Figure 5: The prompt of data distillation that we used for DeepSeek-R1.
  • ...and 7 more figures
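
The four stages named in the Figure 3 caption (sampling, reviewing, refinement, final answer selection) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function `iterative_inference`, its parameters, the toy agents, and the use of a majority vote for final answer selection are all assumptions made here for illustration, since the caption does not specify how the agents are prompted or how the final answer is chosen.

```python
from collections import Counter
from typing import Callable, List


def iterative_inference(
    question: str,
    sample: Callable[[str], str],            # hypothetical: draws one candidate answer
    assess: Callable[[str, str], str],       # hypothetical Assessor: returns a critique
    revise: Callable[[str, str, str], str],  # hypothetical Revisor: returns a new answer
    n_samples: int = 3,
    n_rounds: int = 2,
) -> str:
    # Stage 1 (sampling): draw several candidate answers.
    candidates: List[str] = [sample(question) for _ in range(n_samples)]

    for _ in range(n_rounds):
        refined = []
        for answer in candidates:
            # Stage 2 (reviewing): the Assessor critiques the candidate.
            critique = assess(question, answer)
            # Stage 3 (refinement): the Revisor rewrites it using the critique.
            refined.append(revise(question, answer, critique))
        candidates = refined

    # Stage 4 (final answer selection): majority vote, an assumption of this sketch.
    return Counter(candidates).most_common(1)[0][0]


# Toy stand-ins for the model calls, so the sketch runs without any LLM.
def toy_assess(question: str, answer: str) -> str:
    return "ok" if answer == "B" else "reconsider option B"


def toy_revise(question: str, answer: str, critique: str) -> str:
    return answer if critique == "ok" else "B"


_drafts = iter(["A", "B", "B"])
final = iterative_inference(
    "Which option applies?", lambda q: next(_drafts), toy_assess, toy_revise
)
print(final)
```

In a real system the three callables would wrap model calls; keeping them as plain function arguments makes the control flow of the loop easy to test in isolation.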