From Raw Corpora to Domain Benchmarks: Automated Evaluation of LLM Domain Expertise

Nitin Sharma, Thomas Wolfers, Çağatay Yıldız

TL;DR

We present a deterministic pipeline that transforms raw domain corpora into completion-style benchmarks without relying on other LLMs or costly human annotation, enabling scalable, domain-specific, LLM-independent, and unbiased evaluation of both base and chat models.

Abstract

Accurate domain-specific benchmarking of LLMs is essential, especially in domains with direct implications for humans, such as law, healthcare, and education. However, existing benchmarks are documented to be contaminated and are based on multiple-choice questions, which suffer from inherent biases. To measure domain-specific knowledge in LLMs, we present a deterministic pipeline that transforms raw domain corpora into completion-style benchmarks without relying on other LLMs or costly human annotation. Our approach first extracts domain-specific keywords and related target vocabulary from an input corpus. It then constructs prompt-target pairs in which domain-specific words serve as prediction targets. By measuring LLMs' ability to complete these prompts, we provide a direct assessment of domain knowledge at low computational cost. Our pipeline avoids benchmark contamination, enables automated updates with new domain data, and facilitates fair comparisons between base and instruction-tuned (chat) models. We validate our approach by showing that model performance on our benchmark correlates significantly with performance on an expert-curated benchmark. We then demonstrate how our benchmark provides insights into knowledge acquisition in domain-adaptive, continual, and general pretraining. Finally, we examine the effects of instruction fine-tuning by comparing base and chat models within our unified evaluation framework. In conclusion, our pipeline enables scalable, domain-specific, LLM-independent, and unbiased evaluation of both base and chat models.
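
The two steps described in the abstract (keyword extraction, then prompt-target construction) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the stopword list, the `extract_keywords` and `build_pairs` helpers, and the toy corpus are all hypothetical, and the sketch makes the simplifying assumption that the extracted keywords themselves act as prediction targets, with the sentence prefix before each keyword acting as the prompt.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"the", "a", "an", "of", "in", "to", "and", "is", "are",
             "for", "on", "with", "by", "that"}

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())

def extract_keywords(sentences, top_k=3):
    """Rank non-stopword tokens by raw term frequency (TF) over the corpus."""
    counts = Counter(
        tok for s in sentences for tok in tokenize(s) if tok not in STOPWORDS
    )
    return [word for word, _ in counts.most_common(top_k)]

def build_pairs(sentences, keywords):
    """Cut each sentence just before a keyword: prefix = prompt, keyword = target."""
    pairs = []
    for s in sentences:
        words = s.split()
        for i, w in enumerate(words):
            bare = w.strip(".,;:")
            # Skip position 0: an empty prompt gives the model nothing to complete.
            if i > 0 and bare.lower() in keywords:
                pairs.append((" ".join(words[:i]), bare))
    return pairs

# Toy corpus, invented for illustration only.
corpus = [
    "The transformer uses attention to weigh tokens.",
    "Self attention lets the transformer model long contexts.",
    "Attention scores are normalized with softmax.",
]
keywords = extract_keywords(corpus, top_k=2)
pairs = build_pairs(corpus, set(keywords))
for prompt, target in pairs:
    print(f"{prompt!r} -> {target!r}")
```

Because every step is a count or a string operation, rerunning the sketch on the same corpus always yields the same pairs, which is the sense in which such a pipeline is deterministic and free of LLM involvement.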

Paper Structure

This paper contains 59 sections, 3 equations, 24 figures, 10 tables.

Figures (24)

  • Figure 1: Issues with existing domain-specific benchmarks: Perplexity aggregates predictions over all tokens (including domain-irrelevant ones); performance on multiple-choice questions depends on the order of options; many benchmarks are already incorporated in the training sets of LLMs; and manual creation is simply too expensive.
  • Figure 2: A conceptual overview of our proposed pipeline for generating a completion-based benchmark from a raw domain corpus. Our pipeline extracts and refines keywords from the input corpus. It then matches each sentence with relevant keywords to focus evaluation on domain-relevant content. For every keyword, we construct a target vocabulary by collecting domain-specific phrases from the matched sentences; these phrases will serve as prediction targets. The domain expertise is quantified by how well the models complete these targets from given prompts.
  • Figure 3: Validation of benchmarking approaches. Left: MCQ benchmarks show significant sensitivity to option ordering. Right: Completion-based validation across six base models showing the ranks computed on (a) manual expert benchmark, (b) Claude-generated benchmark, and (c) TF-based pipeline. Correlation analysis reveals (d) r=0.91, p=0.012 between Claude and expert benchmarks, and (e) r=0.99, p<0.001 between TF-based and expert benchmarks, validating that Claude-generated benchmarks reliably capture expert-level domain knowledge patterns.
  • Figure 4: Validation of our pipeline through domain adaptation, where we adapt Llama-2-7B to seven domains separately. Top row: The x-axes show the adapted domains ordered by proximity to CS.AI and the y-axes represent evaluation metrics. Prediction ranks on (a) the reference benchmark generated by Claude Sonnet 4, and (b-c) TF- and TF-IDF-based benchmarks produced by our pipeline, along with baseline metrics (d) perplexity and (e) last-layer attribution rate. Bottom row: Correlation analysis between each metric and the Claude benchmark. Claude, TF, and TF-IDF ranks follow the expected pattern, where models adapted to domains similar to CS.AI achieve better ranks than those trained on unrelated domains. We observe strong, significant correlations between ranks on the Claude-generated reference dataset and those from our TF-based methods (r=0.97, r=0.89), while perplexity and attribution rate show weaker correlations.
  • Figure 5: Prediction ranks and probabilities of OLMo-2 pretraining checkpoints on the Physics and Society domain (Physics (Soc-Ph)). The left figure displays the results across the first 50 checkpoints, with the last point representing the final model for reference. The right figure displays equally spaced checkpoints from the entire pretraining. The consistent patterns demonstrate that our benchmark can be used in place of the average perplexity metric to monitor knowledge accumulation for a domain of interest.
  • ...and 19 more figures
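
The captions above report prediction ranks and Pearson correlations (r) between benchmarks. The following Python sketch shows one way such quantities could be computed. It assumes that a prediction rank is the position of the target token when the model's next-token probabilities are sorted in descending order, which the captions do not spell out. The helper names and all numbers are hypothetical and are not results from the paper.

```python
import math

def prediction_rank(next_token_probs, target):
    """Rank of the target: 1 + number of tokens with strictly higher probability."""
    p = next_token_probs[target]
    return 1 + sum(1 for q in next_token_probs.values() if q > p)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

# Toy next-token distribution for one prompt (invented numbers).
toy_probs = {"attention": 0.5, "memory": 0.3, "banana": 0.2}
rank_attention = prediction_rank(toy_probs, "attention")
rank_banana = prediction_rank(toy_probs, "banana")

# Hypothetical mean ranks of three models on two benchmarks (lower is better).
ranks_benchmark_a = [1.0, 2.0, 3.0]
ranks_benchmark_b = [2.0, 4.0, 6.0]
r = pearson(ranks_benchmark_a, ranks_benchmark_b)
print(rank_attention, rank_banana, round(r, 3))
```

A lower rank means the model places the domain-specific target nearer the top of its predictions. A correlation close to 1 between two benchmarks means they order the models similarly.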