Finding Needles in Emb(a)dding Haystacks: Legal Document Retrieval via Bagging and SVR Ensembles

Kevin Bönisch, Alexander Mehler

TL;DR

We address legal information retrieval (LIR) in German by reframing retrieval as a needle-in-a-haystack problem and building an ensemble of Support Vector Regressors (SVR) trained on concatenated query–passage embeddings. The approach uses bagging to partition the embedding space into overlapping subspaces and trains $s$ SVRs, each over the $k$ nearest passages per query; a final voting ensemble reaches a recall of $0.849$, surpassing the baselines ($0.803$ and $0.829$) without training or fine-tuning any deep learning models. Embeddings are derived from Longformer-based encoders, and experiments on GerDaLIR demonstrate strong recall and competitive precision, aided by GPU-accelerated training on CUDA/RAPIDS. The work motivates further refinement of German-domain encoding models, larger initial retrieval radii, and richer multi-embedding spaces to enhance both transparency and retrieval performance.

Abstract

We introduce a retrieval approach leveraging Support Vector Regression (SVR) ensembles, bootstrap aggregation (bagging), and embedding spaces on the German Dataset for Legal Information Retrieval (GerDaLIR). By conceptualizing the retrieval task in terms of multiple binary needle-in-a-haystack subtasks, we show improved recall over the baselines (0.849 > 0.803 | 0.829) using our voting ensemble, suggesting promising initial results, without training or fine-tuning any deep learning models. Our approach holds potential for further enhancement, particularly through refining the encoding models and optimizing hyperparameters.

Paper Structure

This paper contains 11 sections, 4 equations, 4 figures, and 1 table.

Figures (4)

  • Figure 1: Modeling the flow of the needle-in-a-haystack training, we begin by partitioning the document space into several subsets (bagging), with each subset being assigned a separate SVR model for training. For each query in a subset, we identify the top $k$ nearest passages through their vector spaces and concatenate their embeddings into a single feature embedding. Consequently, each query is associated with $k - 1$ negative labels and one positive label. The SVR model is trained to find this single positive label within the haystack. This process is repeated for each subset and query. During prediction, each model in the group predicts a match for its respective subset. If only one model recognizes a positive match, the corresponding section is marked as relevant and output.
  • Figure 2: Model Accuracy
  • Figure 3: Classification Report
  • Figure 1a: t-SNE plots of embedding spaces from different encoder models, generated for an exemplary sample of 10,000 collection passages. Each grey point represents a document passage in the collection. The red lines indicate the shortest distance from placed queries to their labelled relevant passages. Four queries were placed in each collection space, showing different distances to their relevant passages. While none of the relevant passages were the closest to their respective queries, it can be observed that in many cases the relevant passage is in close proximity. In particular, the longformer_base embedding space seems to capture the context best, as the entire collection forms a U-shaped cluster, and the distances from query to passage remain consistently the smallest, observable by the very short and barely visible red lines.
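
The training flow described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses scikit-learn's CPU `SVR` as a stand-in for the GPU-accelerated training the paper reports, draws bootstrap samples over queries as one possible reading of the bagging step, and treats relevance as a 0/1 regression target. All function names, the `threshold` of 0.5, the `min_votes` rule, and the step that forces the relevant passage into the training haystack are hypothetical choices made for the example.

```python
import numpy as np
from sklearn.svm import SVR


def top_k_indices(query_vec, passages, k):
    """Indices of the k passages with the highest cosine similarity to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    p = passages / np.linalg.norm(passages, axis=1, keepdims=True)
    return np.argsort(-(p @ q))[:k]


def build_haystack(query_vec, passages, relevant_idx, k):
    """Concatenate the query with each of its k nearest passages.

    Returns k feature rows and k labels: 1.0 for the needle, 0.0 otherwise.
    """
    idx = top_k_indices(query_vec, passages, k)
    if relevant_idx not in idx:
        # Assumption: make sure the needle is present during training.
        idx[-1] = relevant_idx
    features = np.hstack([np.tile(query_vec, (k, 1)), passages[idx]])
    labels = (idx == relevant_idx).astype(float)
    return features, labels


def train_ensemble(queries, passages, relevant, s, k, seed=0):
    """Train s SVRs, each on a bootstrap sample of the queries (bagging)."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(s):
        # Sampling with replacement yields overlapping training subsets.
        sample = rng.choice(len(queries), size=len(queries), replace=True)
        parts = [build_haystack(queries[i], passages, relevant[i], k) for i in sample]
        X = np.vstack([features for features, _ in parts])
        y = np.concatenate([labels for _, labels in parts])
        models.append(SVR(kernel="rbf").fit(X, y))
    return models


def predict_relevant(models, query_vec, passages, k, threshold=0.5, min_votes=1):
    """Let every model score the k candidates; keep those with enough votes."""
    idx = top_k_indices(query_vec, passages, k)
    X = np.hstack([np.tile(query_vec, (k, 1)), passages[idx]])
    votes = np.zeros(k, dtype=int)
    for model in models:
        votes += (model.predict(X) >= threshold).astype(int)
    return idx[votes >= min_votes], votes
```

With `min_votes=1`, a candidate passage is returned as soon as a single model flags it, which is one way to read the voting rule in the caption; raising `min_votes` trades recall for precision. Here `queries` and `passages` are assumed to be NumPy arrays of precomputed embeddings with the same dimensionality, and `relevant[i]` is the row index of the passage labelled relevant for query `i`.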