Is Micro Domain-Adaptive Pre-Training Effective for Real-World Operations? Multi-Step Evaluation Reveals Potential and Bottlenecks

Masaya Tsunokake; Yuta Koreeda; Terufumi Morishita; Koichi Nagatsuka; Hikaru Tomonari; Yasuhiro Sogawa

TL;DR

The paper investigates whether micro domain-adaptive pre-training (mDAPT) can support generative tasks in real-world operations by decomposing answering into elicitation, reasoning, and composing. A multi-step oracle evaluation framework is used to pinpoint bottlenecks, revealing that mDAPT primarily improves elicitation while reasoning and composing remain bottlenecks. The findings suggest that achieving usable real-world performance requires both resolving elicitation and enhancing reasoning capability, potentially by combining mDAPT with stronger base models or targeted reasoning training. The study provides a framework and empirical evidence for diagnosing LLM deployment challenges in proprietary domains, with practical implications for enterprise AI adoption and future research directions in domain-specific reasoning and memory retention.

Abstract

When applying LLMs to real-world enterprise operations, LLMs need to handle proprietary knowledge in small domains of specific operations ($\textbf{micro domains}$). A previous study shows micro domain-adaptive pre-training ($\textbf{mDAPT}$) with fewer documents is effective, similarly to DAPT in larger domains. However, it evaluates mDAPT only on multiple-choice questions; thus, its effectiveness for generative tasks in real-world operations remains unknown. We aim to reveal the potential and bottlenecks of mDAPT for generative tasks. To this end, we disentangle the answering process into three subtasks and evaluate the performance of each subtask: (1) $\textbf{eliciting}$ facts relevant to questions from an LLM's own knowledge, (2) $\textbf{reasoning}$ over the facts to obtain conclusions, and (3) $\textbf{composing}$ long-form answers based on the conclusions. We verified mDAPT on proprietary IT product knowledge for real-world questions in IT technical support operations. As a result, mDAPT resolved the elicitation task that the base model struggled with but did not resolve other subtasks. This clarifies mDAPT's effectiveness in the knowledge aspect and its bottlenecks in other aspects. Further analysis empirically shows that resolving the elicitation and reasoning tasks ensures sufficient performance (over 90%), emphasizing the need to enhance reasoning capability.
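
The three-subtask decomposition can be illustrated with a minimal sketch of how prompts might differ across the oracle settings named in the paper outline (no-oracle, oracle elicitation, oracle reasoning). Everything here is an assumption for illustration: the names `OracleSetting`, `Example`, and `build_prompt`, the prompt wording, and the choice to include the oracle facts alongside the oracle conclusion in the oracle reasoning setting are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class OracleSetting(Enum):
    """Hypothetical labels for the three evaluation settings."""
    NO_ORACLE = "no_oracle"                    # model elicits, reasons, and composes
    ORACLE_ELICITATION = "oracle_elicitation"  # ideal facts are given in the prompt
    ORACLE_REASONING = "oracle_reasoning"      # ideal facts and conclusion are given


@dataclass
class Example:
    question: str
    oracle_facts: List[str]   # ideal output of the elicitation subtask
    oracle_conclusion: str    # ideal output of the reasoning subtask


def build_prompt(example: Example, setting: OracleSetting) -> str:
    """Assemble a prompt, inserting oracle results according to the setting."""
    parts = [f"Question: {example.question}"]
    if setting in (OracleSetting.ORACLE_ELICITATION, OracleSetting.ORACLE_REASONING):
        facts = "\n".join(f"- {fact}" for fact in example.oracle_facts)
        parts.append(f"Relevant facts:\n{facts}")
    if setting is OracleSetting.ORACLE_REASONING:
        # Assumption: the conclusion is added on top of the facts (cumulative oracle).
        parts.append(f"Conclusion: {example.oracle_conclusion}")
    parts.append("Answer:")
    return "\n\n".join(parts)
```

Under this sketch, the same question is answered three times, and only the amount of oracle information in the prompt changes, so any difference in answer quality can be attributed to the subtask whose result was supplied.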

Paper Structure (35 sections, 9 figures, 5 tables)

Introduction
Background
Micro Domain
Micro Domain-Adaptive Pre-Training
Evaluation Framework
Overview
Multi-step Oracle Evaluation
Multiple Oracle Settings
Oracle Reasoning Setting
Oracle Elicitation Setting
No-Oracle Setting
LLM-as-a-judge for Each Setting
Knowledge Evaluation
Experiment
Experimental Setting
...and 20 more sections

Figures (9)

Figure 1: (a) Subtasks in an answering process (b) Results by our evaluation framework. To identify bottleneck tasks, our framework observes performance changes after inserting ideal results of each task (oracle result) into prompts. Although the base model struggles with the elicitation task, mDAPT resolves this difficulty, showing mDAPT's effectiveness. However, the mDAPT model still struggles with the reasoning and composing tasks.
Figure 2: Facts written in JP1 manuals.
Figure 3: Our evaluation framework consists of multi-step oracle evaluation and knowledge evaluation. In multi-step oracle evaluation, we observe how an LLM's overall performance changes after oracle results for subtasks are inserted. If performance improves, the LLM could not solve the corresponding task by itself. In knowledge evaluation, we evaluate an LLM's memorization and elicitation capabilities using trained texts relevant to the questions.
Figure 4: Evaluation results. The main results on the far left show that memorization and elicitation capabilities improve as the number of CPT epochs increases. At the third epoch, the ASR of the mDAPT model in the no-oracle setting reaches the base model's ASR in the oracle elicitation setting, indicating that mDAPT resolves the elicitation task.
Figure 5: Prompt used for synthesizing JP1-QA. Chunks extracted from JP1 manuals are inserted into {chunk}. We use the Japanese prompt in our experiments.
...and 4 more figures
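
The bottleneck logic described in the captions for Figures 1 and 3 (a performance gain after inserting an oracle result means the model could not solve that subtask on its own) can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `diagnose`, the setting keys, the `min_gain` threshold, scores scaled to [0, 1], and the rule that a remaining gap in the oracle reasoning setting points to composing are all hypothetical and not the paper's procedure.

```python
from typing import Dict, List

# Hypothetical mapping: which subtask is "unlocked" when moving between settings.
SUBTASK_FOR_STEP = {
    ("no_oracle", "oracle_elicitation"): "elicitation",
    ("oracle_elicitation", "oracle_reasoning"): "reasoning",
}


def diagnose(scores: Dict[str, float], min_gain: float = 0.05) -> List[str]:
    """Return subtasks flagged as bottlenecks from per-setting scores in [0, 1].

    A subtask is flagged when supplying its oracle result raises the overall
    score by at least `min_gain` (an assumed, arbitrary threshold).
    """
    bottlenecks = []
    for (before, after), subtask in SUBTASK_FOR_STEP.items():
        if scores[after] - scores[before] >= min_gain:
            bottlenecks.append(subtask)
    # Assumption: with facts and conclusion both given, any remaining gap to a
    # perfect score is attributed to composing the long-form answer.
    if 1.0 - scores["oracle_reasoning"] >= min_gain:
        bottlenecks.append("composing")
    return bottlenecks
```

A call takes one score per setting, for example a dictionary keyed by `no_oracle`, `oracle_elicitation`, and `oracle_reasoning`; the threshold would need tuning against the variance of whichever judge produces the scores.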
