BIRCO: A Benchmark of Information Retrieval Tasks with Complex Objectives
Xiaoyue Wang, Jianyou Wang, Weili Cao, Kaicheng Wang, Ramamohan Paturi, Leon Bergen
TL;DR
BIRCO is a benchmark for IR tasks with complex, multi-faceted objectives. It evaluates how well systems retrieve documents when user goals extend beyond semantic similarity. It compiles five diverse datasets with paragraph-length queries and modest candidate pools, and applies a decontamination protocol to mitigate pretraining data leakage. A modular framework examines four factors: ranking versus scoring, chain-of-thought reasoning, task decomposition, and task objective awareness. GPT-4-based methods outperform the baselines, but no approach succeeds across all tasks. The study highlights the substantial challenges of multi-objective retrieval, the impact of hard negatives, and the need for more capable retrieval protocols and cost-efficient LLM usage in real-world IR systems.
Abstract
We present the Benchmark of Information Retrieval (IR) tasks with Complex Objectives (BIRCO). BIRCO evaluates the ability of IR systems to retrieve documents given multi-faceted user objectives. The benchmark's complexity and compact size make it suitable for evaluating large language model (LLM)-based information retrieval systems. We present a modular framework for investigating factors that may influence LLM performance on retrieval tasks, and identify a simple baseline model which matches or outperforms existing approaches and more complex alternatives. No approach achieves satisfactory performance on all benchmark tasks, suggesting that stronger models and new retrieval protocols are necessary to address complex user needs.
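One of the factors named in the TL;DR, ranking versus scoring, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`overlap`, `retrieve_by_scoring`, `retrieve_by_ranking`, `toy_rank_fn`) are hypothetical, and a toy token-overlap measure stands in for the LLM call. The sketch only shows the difference in interface: scoring judges each candidate in isolation and sorts afterwards, while ranking sees the whole candidate pool and returns an ordering directly.

```python
from typing import Callable, List, Sequence


def overlap(query: str, doc: str) -> float:
    """Toy relevance: fraction of query tokens found in doc (stand-in for an LLM)."""
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0


def retrieve_by_scoring(
    query: str,
    docs: Sequence[str],
    score_fn: Callable[[str, str], float],
) -> List[int]:
    """Scoring: one independent call per document, then sort by score (descending)."""
    scores = [score_fn(query, d) for d in docs]
    return sorted(range(len(docs)), key=lambda i: -scores[i])


def retrieve_by_ranking(
    query: str,
    docs: Sequence[str],
    rank_fn: Callable[[str, Sequence[str]], Sequence[int]],
) -> List[int]:
    """Ranking: one call that sees all candidates and returns an ordering of indices."""
    order = list(rank_fn(query, docs))
    if sorted(order) != list(range(len(docs))):
        raise ValueError("rank_fn must return a permutation of document indices")
    return order


def toy_rank_fn(query: str, docs: Sequence[str]) -> List[int]:
    """Hypothetical listwise ranker; a real one would prompt a model with all docs."""
    return sorted(range(len(docs)), key=lambda i: -overlap(query, docs[i]))


if __name__ == "__main__":
    query = "find papers refuting claim"
    docs = [
        "papers supporting claim",
        "cooking recipes",
        "find papers refuting the claim",
    ]
    print(retrieve_by_scoring(query, docs, overlap))
    print(retrieve_by_ranking(query, docs, toy_rank_fn))
```

With the toy stand-in the two strategies agree by construction. With a real model they can diverge, because a listwise ranker can compare candidates against each other while a pointwise scorer cannot.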
