Reformulating Domain Adaptation of Large Language Models as Adapt-Retrieve-Revise: A Case Study on Chinese Legal Domain

Zhen Wan, Yating Zhang, Yexiang Wang, Fei Cheng, Sadao Kurohashi

TL;DR

This paper tackles the challenge of zero-shot generation by large language models in specialized domains, specifically Chinese law, where hallucinations are prevalent. It introduces adapt-retrieve-revise, a three-stage framework that domain-adapts a compact 7B LLM, drafts answers, retrieves evidence from a knowledge base using the draft, and leverages GPT-4 to revise the final answer. The approach yields substantial gains over direct GPT-4 prompts and retrieval baselines across four Chinese legal tasks, with notable improvements when evidence is incorporated and revised. The work highlights a cost-efficient path to domain competence by combining domain-specific continual learning with evidence-based revision, and it discusses the potential for generalization to other specialized domains.

Abstract

While large language models (LLMs) like GPT-4 have recently demonstrated astonishing zero-shot capabilities in general domain tasks, they often generate content with hallucinations in specific domains such as Chinese law, hindering their application in these areas. This is typically due to the absence of training data that encompasses such a specific domain, preventing GPT-4 from acquiring in-domain knowledge. A pressing challenge is that it's not plausible to continue training LLMs of such scale on in-domain data. This paper introduces a simple and effective domain adaptation framework for GPT-4 by reformulating generation as an adapt-retrieve-revise process. The initial step is to adapt an affordable 7B LLM to the target domain by continuing learning on in-domain data. When solving a task, we leverage the adapted LLM to generate a draft answer given a task query. Then, the draft answer will be used to retrieve supporting evidence candidates from an external in-domain knowledge base. Finally, the draft answer and retrieved evidence are concatenated into a whole prompt to let GPT-4 assess the evidence and revise the draft answer to generate the final answer. Our proposal combines the advantages of the efficiency of adapting a smaller 7B model with the evidence-assessing capability of GPT-4 and effectively prevents GPT-4 from generating hallucinatory content. In the zero-shot setting of four Chinese legal tasks, our method improves accuracy by 33.3% compared to the direct generation by GPT-4. When compared to two stronger retrieval-based baselines, our method outperforms them by 15.4% and 23.9%. Our code will be released
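
The draft, retrieve, and revise steps described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `draft_model` and `reviser` are hypothetical callables standing in for the adapted 7B LLM and GPT-4, the token-overlap scorer is a toy substitute for whatever retriever the paper actually uses, and the prompt wording is invented for the example.

```python
from collections import Counter
from typing import Callable, List


def overlap_score(text: str, passage: str) -> int:
    """Count shared whitespace-separated tokens (toy similarity, an assumption)."""
    a = Counter(text.lower().split())
    b = Counter(passage.lower().split())
    return sum((a & b).values())


def retrieve(draft: str, knowledge_base: List[str], k: int = 2) -> List[str]:
    """Rank knowledge-base passages by similarity to the DRAFT answer, not the query."""
    ranked = sorted(knowledge_base, key=lambda p: overlap_score(draft, p), reverse=True)
    return ranked[:k]


def build_revision_prompt(query: str, draft: str, evidence: List[str]) -> str:
    """Concatenate query, draft, and evidence into one prompt (hypothetical wording)."""
    evidence_block = "\n".join(f"- {e}" for e in evidence)
    return (
        f"Question: {query}\n"
        f"Draft answer: {draft}\n"
        f"Evidence candidates:\n{evidence_block}\n"
        "Assess the evidence and revise the draft answer."
    )


def adapt_retrieve_revise(
    query: str,
    draft_model: Callable[[str], str],  # stand-in for the adapted 7B LLM
    reviser: Callable[[str], str],      # stand-in for GPT-4
    knowledge_base: List[str],
    k: int = 2,
) -> str:
    draft = draft_model(query)                       # step 1: draft with the adapted model
    evidence = retrieve(draft, knowledge_base, k)    # step 2: retrieve using the draft
    prompt = build_revision_prompt(query, draft, evidence)
    return reviser(prompt)                           # step 3: revise with the large model
```

The key design point the sketch preserves is that retrieval is driven by the draft answer rather than by the original query; any real retriever or LLM client can be swapped in for the stand-ins.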

Paper Structure (30 sections, 7 figures, 4 tables)

Figures (7)

  • Figure 1: Left: A real translated example of Chinese LegalQA. The square brackets and subscripts are added for clarity and do not actually exist in the ground-truth answer or the generation. Right: Models' F1 scores on the LegalQA dataset.
  • Figure 2: Examples of hallucinations of various models. Red denotes the content containing hallucinations. The ground-truth answer refers to the left case in Figure 1.
  • Figure 3: Overview of our proposed method. The example and prompt are translated from Chinese to English for demonstration purposes.
  • Figure 4: Comparison of retrieval recalls on the LegalQA dataset.
  • Figure 5: Comparison of the performance of the draft answers of the 7B legal LLM and our proposed adapt-retrieve-revise model when different contents are used for retrieval.
  • ...and 2 more figures