Insert or Attach: Taxonomy Completion via Box Embedding
Wei Xue, Yongliang Shen, Wenqi Ren, Jietian Guo, Shiliang Pu, Weiming Lu
TL;DR
TaxBox tackles taxonomy completion by supporting both insertion and attachment of new concepts within a box-embedding space. It combines a structurally enhanced box decoder, two specialized scorers (one for insertion, one for attachment), and three learning objectives (a box constraint loss, a dynamic ranking loss, and a classification-type objective) to capture fine-grained hierarchical relations without relying on pseudo-leaves. Experiments on four real-world datasets show consistent improvements over strong baselines, supporting the use of box geometry for asymmetric is-a relations and multiple-parent scenarios. This matters for knowledge graphs and downstream tasks that need taxonomy integration free of the bias introduced by pseudo-leaves.
Abstract
Taxonomy completion, enriching existing taxonomies by inserting new concepts as parents or attaching them as children, has gained significant interest. Previous approaches embed concepts as vectors in Euclidean space, which makes it difficult to model the asymmetric relations in a taxonomy. In addition, they introduce pseudo-leaves to convert attachment cases into insertion cases, which biases network learning because the numerous pseudo-leaves dominate training. To address both issues, our framework, TaxBox, leverages box containment and center closeness to design two specialized geometric scorers within the box embedding space. These scorers are tailored for insertion and attachment operations and can effectively capture intrinsic relationships between concepts by optimizing a granular box constraint loss. We employ a dynamic ranking loss mechanism to balance the scores from these scorers, allowing adaptive adjustment of insertion and attachment scores. Experiments on four real-world datasets show that TaxBox significantly outperforms previous methods, with average performance boosts of 6.7%, 34.9%, and 51.4% in MRR, Hit@1, and Prec@1, respectively.
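To make the two geometric signals named in the abstract concrete, the following is a minimal sketch of box containment and center closeness for axis-aligned boxes. It is an illustration under stated assumptions, not the TaxBox scorers themselves: the `Box` type, the volume-ratio definition of `containment`, and the Euclidean `center_distance` are all hypothetical choices made here, and the paper's actual scoring functions and losses may differ.

```python
from dataclasses import dataclass
from math import dist, prod
from typing import List, Sequence


@dataclass(frozen=True)
class Box:
    """Axis-aligned box given by its lower and upper corners (hypothetical type)."""
    lower: Sequence[float]
    upper: Sequence[float]

    def volume(self) -> float:
        # Clamp negative side lengths to zero so an empty box has zero volume.
        return prod(max(u - l, 0.0) for l, u in zip(self.lower, self.upper))

    def center(self) -> List[float]:
        return [(l + u) / 2 for l, u in zip(self.lower, self.upper)]


def intersection(a: Box, b: Box) -> Box:
    """Overlap of two boxes; it may be empty (zero volume)."""
    lower = [max(x, y) for x, y in zip(a.lower, b.lower)]
    upper = [min(x, y) for x, y in zip(a.upper, b.upper)]
    return Box(lower, upper)


def containment(child: Box, parent: Box) -> float:
    """Fraction of the child's volume lying inside the parent (assumed definition)."""
    vol = child.volume()
    return intersection(child, parent).volume() / vol if vol > 0 else 0.0


def center_distance(a: Box, b: Box) -> float:
    """Euclidean distance between box centers; smaller means closer."""
    return dist(a.center(), b.center())


# Toy 2-D example: the "dog" box sits entirely inside the "animal" box.
animal = Box((0.0, 0.0), (4.0, 4.0))
dog = Box((1.0, 1.0), (2.0, 2.0))

print(containment(dog, animal))   # dog is-a animal
print(containment(animal, dog))   # the reverse direction scores much lower
print(center_distance(dog, animal))
```

The containment score is asymmetric by construction: swapping the arguments changes the result, which is the property that makes boxes a natural fit for is-a relations, whereas a distance between two vectors is the same in both directions.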
