Calibrated Fusion for Heterogeneous Graph-Vector Retrieval in Multi-Hop QA

Andre Bacellar

Abstract

Graph-augmented retrieval combines dense similarity with graph-based relevance signals such as Personalized PageRank (PPR), but these scores follow different distributions and are not directly comparable. We study this as a score calibration problem for heterogeneous retrieval fusion in multi-hop question answering. Our method, PhaseGraph, maps vector and graph scores to a common unit-free scale using percentile-rank normalization (a probability integral transform, PIT) before fusion, enabling stable combination without discarding magnitude information. Across MuSiQue and 2WikiMultiHopQA, calibrated fusion improves held-out last-hop retrieval on HippoRAG2-style benchmarks: LastHop@5 increases from 75.1% to 76.5% on MuSiQue (8W/1L, i.e., eight per-query wins against one loss; p=0.039) and from 51.7% to 53.6% on 2WikiMultiHopQA (11W/2L, p=0.023), both on independent held-out test splits. A theory-driven ablation shows that percentile-based calibration is directionally more robust than min-max normalization on both tune and test splits (1W/6L for the min-max variant vs. the percentile baseline, p=0.125), while Boltzmann weighting performs comparably to linear fusion after calibration (0W/3L, p=0.25). These results suggest that score commensuration itself is the robust design choice, while the exact post-calibration operator appears to matter less on these benchmarks.
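
The implementation is not reproduced here; as a minimal sketch of the calibration step described above (not the authors' code: the function names and the blend weight are illustrative assumptions), percentile-rank calibration followed by linear fusion can be written as:

```python
import numpy as np
from scipy.stats import rankdata

def pit_calibrate(scores: np.ndarray) -> np.ndarray:
    """Empirical percentile-rank (PIT) normalization.

    Each raw score is replaced by its average rank divided by n,
    mapping any input distribution onto an approximately uniform
    [0, 1] scale so vector and PPR scores become commensurable.
    """
    return rankdata(scores, method="average") / len(scores)

def fused_scores(vector_sims, ppr_scores, blend: float = 0.5) -> np.ndarray:
    """Linear fusion of calibrated scores for one query's candidates.

    `blend` weights the graph signal; 0.5 is an illustrative default,
    not a value reported in the paper.
    """
    v = pit_calibrate(np.asarray(vector_sims, dtype=float))
    g = pit_calibrate(np.asarray(ppr_scores, dtype=float))
    return (1.0 - blend) * v + blend * g

# LastHop@5-style retrieval: rank candidates by fused score, keep the top 5.
# top5 = np.argsort(-fused_scores(vector_sims, ppr_scores))[:5]
```

The min-max ablation would replace pit_calibrate with (s - s.min()) / (s.max() - s.min()); as Figure 3 illustrates, this preserves the power-law spike in the PPR scores instead of producing uniform marginals, which is the hypothesized source of its weaker test-split behavior.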

Figures (6)

  • Figure 1: Per-query last-hop outcomes: PhaseGraph vs. vector-only on the 2Wiki HippoRAG2 test split ($n=491$). Left: all 491 queries; right: zoomed win/loss view. The 11:2 asymmetry gives $p=0.022$ (exact McNemar; reproduced in the sketch after this list).
  • Figure 2: Raw score distributions from 100 sample queries (2Wiki HippoRAG2). Vector cosine similarity and PPR scores occupy incomparable scales; direct fusion would be dominated by the vector scores.
  • Figure 3: Effect of normalization. Top row: PIT maps both distributions to approximately uniform $[0,1]$, making them commensurable. Bottom row: min-max normalization preserves the power-law spike in PPR scores, producing unequal marginals. Dashed line: expected uniform count per bin.
  • Figure 4: Theory ablation: LastHop@5 on 2Wiki HippoRAG2 for the baseline vs. normalization and fusion ablations. Bars show tune ($n=509$, lighter) and held-out test ($n=491$, solid) splits. Annotations: W/L vs. baseline and $\Delta$LastHop on the test bars.
  • Figure 5: Ising parameter sweep. Sharp threshold at blend $= 0.25$ across all $(J, T)$ pairs. Green: 27 wins; gold: 26; red gradient: $\leq 25$.
  • ...and 1 more figure
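
The W/L counts quoted in the abstract and annotated in Figures 1 and 4 are assessed with an exact McNemar test on discordant query pairs. As a sketch (assuming the exact binomial form of the test; the helper below is ours, not from the paper), the reported p-values can be reproduced:

```python
from math import comb

def mcnemar_exact(wins: int, losses: int) -> float:
    """Two-sided exact McNemar test on discordant query pairs.

    Under the null, each query on which the two systems disagree is
    a fair coin flip, so the p-value is the two-sided binomial tail
    over the wins + losses discordant queries.
    """
    n = wins + losses
    k = min(wins, losses)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2.0 * tail)

print(mcnemar_exact(8, 1))   # 0.0391 -> MuSiQue main result (p=0.039)
print(mcnemar_exact(11, 2))  # 0.0225 -> 2Wiki main result (Figure 1)
print(mcnemar_exact(1, 6))   # 0.1250 -> min-max ablation (p=0.125)
print(mcnemar_exact(0, 3))   # 0.2500 -> Boltzmann ablation (p=0.25)
```

The exact value for the 11:2 split is 0.0225, consistent with the 0.022 in Figure 1 and the 0.023 in the abstract up to rounding.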