Evaluating LLMs for Targeted Concept Simplification for Domain-Specific Texts
Sumit Asthana, Hannah Rashkin, Elizabeth Clark, Fantine Huot, Mirella Lapata
TL;DR
This work tackles the challenge of helping skilled adult readers understand domain-specific texts by introducing targeted concept simplification and the WikiDomains dataset, which pairs each of 22k definitions with a difficult concept it contains. It benchmarks multiple LLMs and a dictionary baseline across rewriting strategies (e.g., explain, simplify) and evaluates them via human judgments and automated metrics, finding that readers prefer contextual explanations to lexical substitutions. The results reveal that no single model excels across all dimensions and that automated metrics correlate poorly with human evaluations, underscoring the need for personalized, context-aware tools and better evaluation methodologies. The study highlights the potential of context-rich explanations to improve understanding while outlining practical considerations and future directions for evaluating domain-text comprehension support.
Abstract
One useful application of NLP models is to support people in reading complex text from unfamiliar domains (e.g., scientific articles). Simplifying the entire text makes it understandable but sometimes removes important details. In contrast, helping adult readers understand difficult concepts in context can enhance their vocabulary and knowledge. In a preliminary human study, we first identify that lack of context and unfamiliarity with difficult concepts are major reasons for adult readers' difficulty with domain-specific text. We then introduce "targeted concept simplification," a simplification task for rewriting text to help readers comprehend text containing unfamiliar concepts. We also introduce WikiDomains, a new dataset of 22k definitions from 13 academic domains, each paired with a difficult concept within the definition. We benchmark the performance of open-source and commercial LLMs and a simple dictionary baseline on this task across human judgments of ease of understanding and meaning preservation. Interestingly, our human judges preferred explanations of the difficult concept over simplification of the concept phrase. Further, no single model achieved superior performance across all quality dimensions, and automated metrics show low correlations with human evaluations of concept simplification ($\sim0.2$), opening up rich avenues for research on personalized human reading comprehension support.
