DALDALL: Data Augmentation for Lexical and Semantic Diverse in Legal Domain by leveraging LLM-Persona

Janghyeok Choi; Jaewon Lee; Sungzoon Cho

Abstract

Data scarcity remains a persistent challenge in low-resource domains. Existing data augmentation methods leverage the generative capabilities of large language models (LLMs) to produce large volumes of synthetic data, but they often prioritize quantity over quality and lack domain-specific strategies. In this work, we introduce DALDALL, a persona-based data augmentation framework tailored for legal information retrieval (IR). Our method employs domain-specific professional personas--such as attorneys, prosecutors, and judges--to generate synthetic queries that exhibit substantially greater lexical and semantic diversity than vanilla prompting approaches. Experiments on the CLERC and COLIEE benchmarks demonstrate that persona-based augmentation improves lexical diversity, as measured by Self-BLEU scores, while preserving semantic fidelity to the original queries. Furthermore, dense retrievers fine-tuned on persona-augmented data consistently achieve recall that is competitive with or superior to that of retrievers trained on the original data or on generic augmentations. These findings establish persona-based prompting as an effective strategy for generating high-quality training data in specialized, low-resource domains.
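
For readers unfamiliar with Self-BLEU, the following is a minimal sketch of how such a lexical-diversity score can be computed over a set of augmented queries. It is an illustration under stated assumptions, not the authors' evaluation code: the function names (`ngrams`, `bleu`, `self_bleu`), whitespace tokenization, the n-gram order `max_n=2`, and add-one smoothing are all hypothetical choices made here for brevity. Each query is scored against the remaining queries as references, and the scores are averaged; a lower value indicates more lexically diverse output.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hypothesis, references, max_n=2):
    """Simplified BLEU (assumption: add-one smoothing, uniform n-gram weights)."""
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hypothesis, n)
        # Clip each n-gram count by its maximum count in any single reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        matched = sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        log_precisions.append(math.log((matched + 1) / (total + 1)))
    # Brevity penalty against the reference length closest to the hypothesis.
    hyp_len = len(hypothesis)
    closest = min((len(r) for r in references),
                  key=lambda ref_len: (abs(ref_len - hyp_len), ref_len))
    bp = 1.0 if hyp_len >= closest else math.exp(1 - closest / max(hyp_len, 1))
    return bp * math.exp(sum(log_precisions) / max_n)


def self_bleu(texts, max_n=2):
    """Average BLEU of each text against all others (lower = more diverse)."""
    tokenized = [t.lower().split() for t in texts]  # assumption: whitespace tokens
    scores = [
        bleu(hyp, tokenized[:i] + tokenized[i + 1:], max_n)
        for i, hyp in enumerate(tokenized)
    ]
    return sum(scores) / len(scores)


# Hypothetical example inputs, not taken from CLERC or COLIEE.
near_duplicates = ["the court denied the motion"] * 3
varied = [
    "the court denied the motion",
    "defendant appeals a sentencing enhancement",
    "whether hearsay exceptions apply here",
]
print(self_bleu(near_duplicates), self_bleu(varied))
```

In practice, an established implementation such as NLTK's BLEU with a proper tokenizer would be a more robust choice than this hand-rolled version.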

Paper Structure (56 sections, 2 equations, 7 figures, 9 tables)

Introduction
Related Works
Legal Information Retrieval
LLM-based Data Augmentation and Persona-based Prompting
LLM-Based Data Augmentation
Persona-Based Prompting
Datasets for Legal Information Retrieval
COLIEE.
CLERC.
Method & Approach
DALDALL: Persona-Based Data Augmentation
Prompt Design
Stage 1: Essential Extraction.
Stage 2: Augmentation.
Generation Strategy.
...and 41 more sections

Figures (7)

Figure 1: The DALDALL framework for persona-based data augmentation. Given a legal case statement, the Persona method (top) generates synthetic queries from multiple professional perspectives (attorney, prosecutor, judge, law professor), while the Vanilla method (bottom) does not incorporate persona information. Compared to Vanilla, Persona augmentation produces outputs with 20% greater lexical diversity (Table \ref{tab:token-length-diversity}).
Figure 2: Triplet composition for fine-tuning on the COLIEE dataset. We chunk the query and its positive documents, then find the k best semantic pairs using the base model. If $n$ is lower than $i$, all query chunks are used for the triplet.
Figure 3: Distribution of augmented query lengths on COLIEE for 3, 5, 7, and 10 personas. Persona-based prompting yields a broader token distribution than vanilla prompting regardless of persona count.
Figure 4: Effect of persona count on augmentation diversity (COLIEE). Top: Self-BLEU scores (lower = more lexically diverse). Middle: cosine similarity with original queries. Bottom: intra-augmentation cosine similarity (lower = more semantically diverse). Increasing persona count reduces lexical diversity but improves semantic diversity. Five personas offer a balanced trade-off.
Figure 5: Extract Essentials prompt template.
...and 2 more figures
