Working Notes on Late Interaction Dynamics: Analyzing Targeted Behaviors of Late Interaction Models

Antoine Edy, Max Conti, Quentin Macé

Abstract

While Late Interaction models exhibit strong retrieval performance, many of their underlying dynamics remain understudied, potentially hiding performance bottlenecks. In this work, we focus on two topics in Late Interaction retrieval: a length bias that arises when using multi-vector scoring, and the similarity distribution beyond the best scores pooled by the MaxSim operator. We analyze these behaviors for state-of-the-art models on the NanoBEIR benchmark. Results show that while the theoretical length bias of causal Late Interaction models holds in practice, bi-directional models can also suffer from it in extreme cases. We also note that no significant similarity trend lies beyond the top-1 document token, validating that the MaxSim operator efficiently exploits the token-level similarity scores.
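For readers unfamiliar with the scoring rule the abstract refers to, the MaxSim operator scores a query–document pair by taking, for each query token embedding, the maximum similarity over all document token embeddings, then summing over query tokens. A minimal sketch with NumPy, assuming L2-normalized embeddings so that dot products are cosine similarities (the array names are illustrative, not from the paper):

```python
import numpy as np

def maxsim_score(query_embs: np.ndarray, doc_embs: np.ndarray) -> float:
    """Late Interaction (MaxSim) score.

    query_embs: (n_query_tokens, dim), assumed L2-normalized
    doc_embs:   (n_doc_tokens, dim),   assumed L2-normalized
    """
    # Token-level similarity matrix: (n_query_tokens, n_doc_tokens)
    sims = query_embs @ doc_embs.T
    # For each query token, keep only its best-matching document token,
    # then sum over query tokens.
    return float(sims.max(axis=1).sum())
```

Because only the per-row maximum survives the pooling, similarities beyond the top-1 document token are discarded; this is the part of the score distribution the paper examines.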

Paper Structure

This paper contains 12 sections, 1 equation, 6 figures, and 1 table.

Figures (6)

  • Figure 1: Mean length comparison between the retrieved false positive chunks, the relevant ground-truth documents, and the global corpus average. Queries are grouped into quantiles on the x-axis by the average token length of their corresponding relevant chunks.
  • Figure 2: Expected decrease in retrieval performance (nDCG@10) when a chunk of a specific length is added to the corpus. Chunks are categorized into equal-sized quantile bins by token length on the x-axis. The solid line plots the average nDCG penalty incurred by the presence of a chunk from that bin, evaluated against a random baseline (dashed line) and its 90% confidence interval (shaded area).
  • Figure 3: ColBERT-Zero document token similarities on failed queries. While some datasets exhibit interesting results (e.g., NanoArguAna, right), no clear tendency emerges for positive documents across NanoBEIR (left).
  • Figure 4: Chunk Length Distribution across the merged NanoBEIR corpus.
  • Figure 5: Absolute occurrences of irrelevant chunks ranked above the highest-ranked true positive passage. Bin limits are defined to contain an equal number of chunks. The dashed line plots the no-bias expected baseline, bounded by a 90% variance interval.
  • ...and 1 more figure