Efficiency-Effectiveness Tradeoff of Probabilistic Structured Queries for Cross-Language Information Retrieval

Eugene Yang, Suraj Nair, Dawn Lawrie, James Mayfield, Douglas W. Oard, Kevin Duh

TL;DR

This work revisits Probabilistic Structured Queries for cross-language information retrieval by implementing an indexing-time PSQ-HMM and conducting a multi-criteria pruning study to map efficiency–effectiveness frontiers on modern CLIR collections. It demonstrates that PMF and Top-k pruning yield Pareto-optimal tradeoffs, often outperforming strong neural baselines in overall effectiveness while maintaining smaller index sizes suitable for real-time retrieval. CDF pruning, by contrast, tends to be less favorable for Pareto optimization. The results inform practical design choices for sparse CLIR and have implications for integrating PSQ into neural CLIR cascades.

Abstract

Probabilistic Structured Queries (PSQ) is a cross-language information retrieval (CLIR) method that uses translation probabilities statistically derived from aligned corpora. PSQ is a strong baseline for efficient CLIR using sparse indexing. It is, therefore, useful as the first stage in a cascaded neural CLIR system whose second stage is more effective but too inefficient to be used on its own to search a large text collection. In this reproducibility study, we revisit PSQ by introducing an efficient Python implementation. Unconstrained use of all translation probabilities that can be estimated from aligned parallel text would, in the limit, assign a weight to every vocabulary term, precluding use of an inverted index to serve queries efficiently. Thus, PSQ's effectiveness and efficiency both depend on how translation probabilities are pruned. This paper presents experiments over a range of modern CLIR test collections to demonstrate that achieving Pareto-optimal PSQ effectiveness-efficiency tradeoffs benefits from multi-criteria pruning, which has not been fully explored in prior work. Our Python PSQ implementation is available on GitHub (https://github.com/hltcoe/PSQ), and unpruned translation tables are available on Hugging Face Models (https://huggingface.co/hltcoe/psq_translation_tables).
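
To make the three pruning criteria named in the TL;DR (PMF, CDF, and Top-k) concrete, the following is a minimal sketch of how they could be combined on the translation distribution of a single term. It is not taken from the hltcoe/PSQ repository: the function name `prune_translations`, its parameters, the exact threshold semantics, and the renormalization step are all assumptions made for illustration, and the toy probabilities are invented.

```python
from typing import Dict, Optional


def prune_translations(
    probs: Dict[str, float],
    pmf_min: float = 0.0,          # hypothetical: drop translations with p < pmf_min
    cdf_max: float = 1.0,          # hypothetical: stop once cumulative mass reaches cdf_max
    top_k: Optional[int] = None,   # hypothetical: keep at most top_k translations
    renormalize: bool = True,      # assumption: rescale the survivors to sum to 1
) -> Dict[str, float]:
    """Apply PMF, CDF, and Top-k pruning jointly to one term's translations."""
    # Visit translations from most to least probable.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)

    kept: Dict[str, float] = {}
    cumulative = 0.0
    for term, p in ranked:
        if top_k is not None and len(kept) >= top_k:
            break  # Top-k criterion
        if p < pmf_min:
            break  # PMF criterion: every remaining entry is smaller still
        if cdf_max < 1.0 and cumulative >= cdf_max:
            break  # CDF criterion: enough probability mass already covered
        kept[term] = p
        cumulative += p

    if renormalize and kept:
        total = sum(kept.values())
        kept = {term: p / total for term, p in kept.items()}
    return kept


# Toy distribution (invented values, not from the released translation tables).
toy = {"house": 0.60, "home": 0.25, "household": 0.10, "housing": 0.05}

by_top_k = prune_translations(toy, top_k=2)
by_pmf = prune_translations(toy, pmf_min=0.10)
by_cdf = prune_translations(toy, cdf_max=0.80)
combined = prune_translations(toy, pmf_min=0.10, cdf_max=0.80, top_k=3)
```

Under these assumed semantics, each criterion can only shorten the list of surviving translations, so tightening any one of them shrinks the resulting sparse index; the multi-criteria study described in the abstract explores how such settings interact.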

Paper Structure (18 sections, 2 equations, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Indexing-time PSQ Retrieval Pipeline
  • Figure 2: Relative R@100 score to the unpruned index of each collection. The x-axis of the PMF threshold graph is in log scale, and the unpruned index (0) is marked at the far left.
  • Figure 3: R@100 and index size (GB) on NeuCLIR Russian. Blue and brown represent R@100 and index size, respectively; darker indicates larger values. The ideal is light brown (small index) and dark blue (high R@100).
  • Figure 4: Pareto graphs. Stars indicate Pareto-optimal runs, i.e., those on the Pareto frontier.
  • Figure 5: Comparison of Pareto Frontier between using and not using CDF thresholds.
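
The notion of a Pareto-optimal run used in the captions for Figures 4 and 5 can be sketched as follows. This is an illustrative assumption about how such a frontier might be computed, taking index size (smaller is better) and R@100 (larger is better) as the two criteria; the function name `pareto_frontier`, the run labels, and the numbers are all hypothetical and are not results from the paper.

```python
from typing import List, Tuple

Run = Tuple[str, float, float]  # (label, index size in GB, R@100)


def pareto_frontier(runs: List[Run]) -> List[str]:
    """Return labels of runs that no other run beats on both size and score."""
    frontier: List[str] = []
    best_score = float("-inf")
    # Sort by size ascending; among equal sizes, put the best score first.
    for label, size, score in sorted(runs, key=lambda r: (r[1], -r[2])):
        # A run survives only if it scores higher than every run that is
        # no larger than it. Exact duplicates keep only their first copy.
        if score > best_score:
            frontier.append(label)
            best_score = score
    return frontier


# Invented example runs (not measurements from the paper).
toy_runs = [
    ("a", 1.0, 0.50),
    ("b", 2.0, 0.70),
    ("c", 2.5, 0.65),  # dominated by "b": larger index, lower R@100
    ("d", 4.0, 0.72),
    ("e", 4.0, 0.60),  # dominated by "d": same size, lower R@100
]
frontier = pareto_frontier(toy_runs)
```

Here `frontier` contains the labels `a`, `b`, and `d`, which correspond to the runs that would be marked with stars in a plot like Figure 4.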