Leveraging Large Language Models for Automated Definition Extraction with TaxoMatic: A Case Study on Media Bias
Timo Spinde, Luyang Lin, Smi Hinterreiter, Isao Echizen
TL;DR
TaxoMatic is an LLM-driven framework for automated definition extraction from scholarly literature, evaluated in the media bias domain. The approach uses a three-stage workflow (relevance classification, definition extraction, and evaluation) and builds a ground-truth dataset from 2,398 relevancy-rated articles and 123 definitions sourced from 113 papers. Claude-3-sonnet leads in relevance classification, while Chain-of-Thought and Role prompting yield the strongest extraction performance, revealing both the promise and the limitations of current LLMs for formalizing definitions in contested domains. The work contributes a scalable methodology, a public dataset, and insights to guide future expansion to additional domains and more robust taxonomy-building efforts.
Abstract
This paper introduces TaxoMatic, a framework that leverages large language models to automate definition extraction from academic literature. Focusing on the media bias domain, the framework encompasses data collection, LLM-based relevance classification, and extraction of conceptual definitions. Evaluated on a dataset of 2,398 manually rated articles, the study demonstrates the framework's effectiveness, with Claude-3-sonnet achieving the best results in both relevance classification and definition extraction. Future directions include expanding datasets and applying TaxoMatic to additional domains.
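
The staged workflow summarized above (relevance classification, then definition extraction, then evaluation against ground truth) can be illustrated with a minimal sketch. This is not the authors' implementation: the `LLM` callable, the prompt wording, the `NONE` convention, and the precision/recall scoring over article titles are all assumptions made for illustration, and the summary does not state which metrics the paper actually reports.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Set, Tuple

# Hypothetical interface: any function that maps a prompt to a model reply.
# A real setup would wrap a model client; none is assumed here.
LLM = Callable[[str], str]


@dataclass
class Article:
    title: str
    text: str


def is_relevant(article: Article, llm: LLM) -> bool:
    """Stage 1: ask the model whether the article concerns media bias."""
    prompt = (
        "Does the following article discuss media bias? Answer yes or no.\n\n"
        f"Title: {article.title}\n\n{article.text}"
    )
    return llm(prompt).strip().lower().startswith("yes")


def extract_definition(article: Article, llm: LLM) -> Optional[str]:
    """Stage 2: role plus step-by-step prompt (illustrative wording only)."""
    prompt = (
        "You are a media bias researcher. Think step by step, then put the "
        "definition of media bias given in the text on the last line, "
        "or NONE if the text gives no definition.\n\n"
        f"{article.text}"
    )
    reply = llm(prompt).strip()
    last_line = reply.splitlines()[-1].strip() if reply else ""
    if not last_line or last_line.upper() == "NONE":
        return None
    return last_line


def run_pipeline(articles: List[Article], llm: LLM) -> Dict[str, str]:
    """Run stages 1 and 2; return extracted definitions keyed by title."""
    definitions: Dict[str, str] = {}
    for article in articles:
        if not is_relevant(article, llm):
            continue
        definition = extract_definition(article, llm)
        if definition is not None:
            definitions[article.title] = definition
    return definitions


def precision_recall(predicted: Set[str], gold: Set[str]) -> Tuple[float, float]:
    """Stage 3 (assumed metric): compare predicted and gold article titles."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall
```

Passing the model as a plain callable keeps the sketch independent of any specific provider and makes it possible to exercise the control flow with a deterministic stub before spending on real model calls.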
