Finding Needles in Emb(a)dding Haystacks: Legal Document Retrieval via Bagging and SVR Ensembles
Kevin Bönisch, Alexander Mehler
TL;DR
We address legal information retrieval (LIR) in German by reframing retrieval as a needle-in-a-haystack problem and building an ensemble of Support Vector Regressors (SVRs) trained on concatenated query–passage embeddings. The approach uses bagging to partition the embedding space into overlapping subspaces and trains $s$ SVRs over the $k$ nearest passages per query. A final voting ensemble reaches a recall of $0.849$, surpassing the baselines ($0.803$ and $0.829$) without training or fine-tuning any deep learning models. Embeddings are derived from Longformer-based encoders, and experiments on GerDaLIR show strong recall and competitive precision, with training accelerated on GPUs via CUDA/RAPIDS. The work motivates further refinement of German-domain encoding models, larger initial retrieval radii, and richer multi-embedding spaces to improve both transparency and retrieval performance.
Abstract
We introduce a retrieval approach that combines Support Vector Regression (SVR) ensembles, bootstrap aggregation (bagging), and embedding spaces, and evaluate it on the German Dataset for Legal Information Retrieval (GerDaLIR). By conceptualizing the retrieval task as multiple binary needle-in-a-haystack subtasks, our voting ensemble improves recall over the baselines (0.849 versus 0.803 and 0.829) without training or fine-tuning any deep learning models, which we regard as a promising initial result. The approach leaves room for further improvement, particularly through refining the encoding models and optimizing hyperparameters.
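The pipeline summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes scikit-learn's `SVR` in place of the GPU-accelerated RAPIDS version, random toy vectors in place of Longformer embeddings, plain bootstrap resampling of training pairs as the bagging step, and a hypothetical score threshold of 0.5 for casting a vote. The function names `pair_features`, `fit_bagged_svrs`, and `vote` are illustrative only.

```python
import numpy as np
from sklearn.svm import SVR


def pair_features(query_vec, passage_vecs):
    """Concatenate one query embedding with each candidate passage embedding."""
    q = np.repeat(query_vec[None, :], len(passage_vecs), axis=0)
    return np.hstack([q, passage_vecs])


def fit_bagged_svrs(X, y, s=5, seed=0):
    """Train s SVRs, each on a bootstrap sample drawn with replacement."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(s):
        idx = rng.integers(0, len(X), size=len(X))
        models.append(SVR(kernel="rbf").fit(X[idx], y[idx]))
    return models


def vote(models, X, threshold=0.5):
    """Count, per candidate, how many SVRs score it above the threshold."""
    scores = np.stack([m.predict(X) for m in models])  # shape (s, n_candidates)
    return (scores > threshold).sum(axis=0)


# Toy data (assumption): each query has k candidates, exactly one relevant.
rng = np.random.default_rng(42)
dim, n_queries, k = 8, 30, 10
X_parts, y_parts = [], []
for _ in range(n_queries):
    query = rng.normal(size=dim)
    passages = rng.normal(size=(k, dim))
    passages[0] = query + 0.1 * rng.normal(size=dim)  # the "needle"
    labels = np.zeros(k)
    labels[0] = 1.0  # binary target: relevant (1) or not (0)
    X_parts.append(pair_features(query, passages))
    y_parts.append(labels)

X, y = np.vstack(X_parts), np.concatenate(y_parts)
models = fit_bagged_svrs(X, y, s=5)
votes = vote(models, X[:k])  # votes for the k candidates of the first query
```

Candidates can then be ranked by their vote count, so that a passage is returned when enough of the $s$ regressors agree on it. How the votes are aggregated and which threshold is used are design choices that this sketch fixes arbitrarily.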
