BIRCO: A Benchmark of Information Retrieval Tasks with Complex Objectives

Xiaoyue Wang; Jianyou Wang; Weili Cao; Kaicheng Wang; Ramamohan Paturi; Leon Bergen

TL;DR

BIRCO is a benchmark for IR tasks with complex, multi-faceted objectives. It evaluates how well systems retrieve documents when user goals extend beyond semantic similarity. It compiles five diverse datasets with paragraph-length queries and modest candidate pools, and applies a decontamination protocol to mitigate pretraining data leakage. A modular framework examines ranking vs. scoring, chain-of-thought reasoning, task decomposition, and task objective awareness. GPT-4-based methods outperform baselines, but no approach succeeds across all tasks. The study highlights substantial challenges in multi-objective retrieval, the impact of hard negatives, and the need for more capable retrieval protocols and cost-efficient LLM usage in real-world IR systems.

Abstract

We present the Benchmark of Information Retrieval (IR) tasks with Complex Objectives (BIRCO). BIRCO evaluates the ability of IR systems to retrieve documents given multi-faceted user objectives. The benchmark's complexity and compact size make it suitable for evaluating large language model (LLM)-based information retrieval systems. We present a modular framework for investigating factors that may influence LLM performance on retrieval tasks, and identify a simple baseline model which matches or outperforms existing approaches and more complex alternatives. No approach achieves satisfactory performance on all benchmark tasks, suggesting that stronger models and new retrieval protocols are necessary to address complex user needs.

Paper Structure (50 sections, 24 figures, 9 tables)

Introduction
Related Work
A Benchmark for LLM-based IR Systems
Dataset Descriptions
BIRCO Evaluation Methodology
Data Decontamination
Measuring contamination
Reducing contamination
Query Complexity
Task Complexity
Domain Diversity
Hard Negatives in BIRCO
Candidate Pool Construction
A Framework for LLM-based Retrieval
Ranking vs. Scoring
...and 35 more sections

Figures (24)

Figure 1: BIRCO contains 5 IR tasks with complex objectives
Figure 2: An example query from DORIS-MAE with multiple facets.
Figure 3: Jaccard similarity between pairs of datasets from BIRCO
Figure 4: Passage embeddings for a randomly selected query from ArguAna.
Figure 5: A framework for integrating LLMs into retrieval tasks
...and 19 more figures
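
Figure 3 reports Jaccard similarity between pairs of BIRCO datasets. As a reminder of what that measure computes, here is a minimal sketch. It assumes the similarity is taken over sets of word tokens, which this listing does not confirm; the `vocabulary` helper, its whitespace tokenization, and the toy texts are all hypothetical and are not BIRCO data.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Return |A ∩ B| / |A ∪ B|, defined here as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def vocabulary(texts: list[str]) -> set[str]:
    # Hypothetical tokenization: lowercase and split on whitespace.
    return {token for text in texts for token in text.lower().split()}


# Toy stand-ins for two datasets (assumption: not taken from BIRCO).
dataset_a = ["retrieve papers on graph neural networks", "neural ranking models"]
dataset_b = ["find counterarguments to the claim", "neural ranking models"]

sim = jaccard(vocabulary(dataset_a), vocabulary(dataset_b))
print(f"Jaccard similarity: {sim:.3f}")
```

A value near 0 means the two vocabularies barely overlap, and a value of 1 means they are identical. Low pairwise values would be consistent with the domain diversity the benchmark aims for.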
