AutoML-guided Fusion of Entity and LLM-based Representations for Document Classification

Boshko Koloski, Senja Pollak, Roberto Navigli, Blaž Škrlj

TL;DR

This work addresses robust document classification by enriching LLM-based embeddings with knowledge-graph (KG) grounded signals. It introduces BabelFusion, which fuses text embeddings $X_{txt} \in \mathbb{R}^{d}$ with KG embeddings $X_{kg} \in \mathbb{R}^{c}$ into $X_{concat} \in \mathbb{R}^{d+c}$, reduces this to $X_{final} \in \mathbb{R}^{k}$ via truncated SVD, and trains a classifier with AutoML (TPOT) on $X_{final}$. Evaluated on six datasets spanning sentiment and news genres, BabelFusion performs on par with or better than high-dimensional baselines, with notable gains for Angle and mxbai embeddings and indications that low-dimensional representations can match or exceed their high-dimensional counterparts. The results highlight practical benefits in speed and resource use, while qualitative analyses and dimensionality studies point to future work on scaling knowledge graphs, token-level knowledge integration, and advanced disambiguation strategies.
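The fusion and reduction steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `fuse_and_reduce`, the random placeholder embeddings, and the toy sizes `n`, `d`, `c`, `k` are all hypothetical, and the truncated SVD is computed through a plain thin SVD rather than whichever solver the paper used. The AutoML (TPOT) classifier that would be trained on $X_{final}$ is omitted.

```python
import numpy as np


def fuse_and_reduce(X_txt, X_kg, k):
    """Concatenate text and KG embeddings, then project to k dimensions.

    X_txt: (n, d) text embeddings; X_kg: (n, c) KG embeddings.
    Requires k <= min(n, d + c).
    """
    # X_concat has shape (n, d + c)
    X_concat = np.hstack([X_txt, X_kg])
    # Thin SVD: X_concat = U @ diag(S) @ Vt
    _, _, Vt = np.linalg.svd(X_concat, full_matrices=False)
    # Project onto the top-k right singular vectors (no centering,
    # which is what distinguishes truncated SVD from PCA)
    return X_concat @ Vt[:k].T


# Placeholder data only; real inputs would be LLM and KG embeddings.
rng = np.random.default_rng(0)
n, d, c, k = 32, 16, 8, 4
X_txt = rng.normal(size=(n, d))
X_kg = rng.normal(size=(n, c))

X_final = fuse_and_reduce(X_txt, X_kg, k)
print(X_final.shape)  # (32, 4)
```

The resulting `X_final` is the low-dimensional matrix on which a downstream classifier would be fitted. For large corpora, a randomized or sparse solver would likely be preferable to the full thin SVD used here.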

Abstract

Large semantic knowledge bases are grounded in factual knowledge. However, recent approaches to dense text representations (i.e., embeddings) do not efficiently exploit these resources. Dense and robust representations of documents are essential for effectively solving downstream classification and retrieval tasks. This work demonstrates that injecting embedded information from knowledge bases can augment the performance of contemporary Large Language Model (LLM)-based representations for the task of text classification. Further, by considering automated machine learning (AutoML) with the fused representation space, we demonstrate it is possible to improve classification accuracy even if we use low-dimensional projections of the original representation space obtained via efficient matrix factorization. This result shows that significantly faster classifiers can be achieved with minimal or no loss in predictive performance, as demonstrated using five strong LLM baselines on six diverse real-life datasets. The code is freely available at https://github.com/bkolosk1/bablfusion.git.

Paper Structure

This paper contains 18 sections, 3 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Schema of the proposed approach.
  • Figure 2: Babelfy disambiguation of the sentence "Germany is hosting the euro cup." The retrieved entities are then matched to the WikiData5m sub-graph, and their respective embeddings are retrieved.
  • Figure 3: Projecting at different dimensions. The x-axis is log-scaled for better portrayal of results.
  • Figure 4: Aggregated results for each embedding across dimensions.
  • Figure 5: Aggregated results for each dataset across dimensions.
  • ...and 2 more figures