All Entities are Not Created Equal: Examining the Long Tail for Ultra-Fine Entity Typing
Advait Deshmukh, Ashwin Umadi, Dananjay Srinivas, Maria Leonor Pacheco
TL;DR
This paper examines ultra-fine entity typing (UFET) in a long-tail regime, where rare entities are underrepresented in pre-training data. It proposes a simple, practical proxy for pre-training frequency based on Google search hits, and shows that PLM-derived probabilities correlate strongly with this proxy across multiple models, which supports the proxy as a reflection of pre-training exposure. In a comparative benchmark of seven models (both PLM-only and knowledge-infused) on UFET and OntoNotes, the study finds that PLMs struggle on infrequent entities, while knowledge-infused approaches are more robust to frequency shifts; one example is LITE, which leverages label dependencies via an NLI framework. The findings argue for integrating external resources and structured knowledge into UFET systems to improve long-tail performance.
Abstract
Due to their capacity to acquire world knowledge from large corpora, pre-trained language models (PLMs) are extensively used in ultra-fine entity typing tasks where the space of labels is extremely large. In this work, we explore the limitations of the knowledge acquired by PLMs by proposing a novel heuristic to approximate the pre-training distribution of entities when the pre-training data is unknown. Then, we systematically demonstrate that entity-typing approaches that rely solely on the parametric knowledge of PLMs struggle significantly with entities at the long tail of the pre-training distribution, and that knowledge-infused approaches can account for some of these shortcomings. Our findings suggest that we need to go beyond PLMs to produce solutions that perform well for infrequent entities.
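As a rough illustration of the validation step summarized in the TL;DR, the sketch below computes a rank correlation between a search-hit proxy and PLM-derived scores for a handful of entities. It is a minimal sketch under stated assumptions, not the authors' procedure: the choice of Spearman correlation, the function names, and every number in it are hypothetical and appear nowhere in the paper.

```python
import math
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Placeholder values invented for illustration only (NOT from the paper):
# one search-hit count and one PLM log-probability per entity mention.
search_hits = [1_200_000, 45_000, 300, 8_000_000, 12]
plm_log_probs = [-3.1, -5.4, -9.8, -2.0, -11.5]

rho = spearman(search_hits, plm_log_probs)
print(f"Spearman rho between hit counts and PLM scores: {rho:.3f}")
```

A rank-based measure is used here because hit counts span several orders of magnitude, so only their ordering is compared; whether that matches the paper's own analysis is an assumption.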
