OSCAR: Optimization-Steered Agentic Planning for Composed Image Retrieval

Teng Wang, Rong Shan, Jianghao Lin, Junjie Wu, Tianyi Xu, Jianping Zhang, Wenteng Chen, Changwang Zhang, Zhaoxiang Wang, Weinan Zhang, Jun Wang

TL;DR

OSCAR reframes agentic composed image retrieval (CIR) as principled trajectory optimization, replacing heuristic search with a two-stage mixed-integer program (MIP) that yields optimal tool-call trajectories and set-theoretic compositions. An offline phase constructs a Golden Library of demonstrations that guide a VLM planner during online inference, enabling efficient, single-pass CIR with robust generalization from only 10% of the training data. Empirically, OSCAR achieves state-of-the-art results on CIRCO, CIRR, FashionIQ, and a private industrial benchmark, while maintaining strong performance across diverse VLM backbones. This optimization-guided framework offers a scalable, reusable approach to complex multimodal reasoning in retrieval tasks.

Abstract

Composed image retrieval (CIR) requires complex reasoning over heterogeneous visual and textual constraints. Existing approaches largely fall into two paradigms: unified embedding retrieval, which suffers from single-model myopia, and heuristic agentic retrieval, which is limited by suboptimal, trial-and-error orchestration. To this end, we propose OSCAR, an optimization-steered agentic planning framework for composed image retrieval. We are the first to reformulate agentic CIR from a heuristic search process into a principled trajectory optimization problem. Instead of relying on heuristic trial-and-error exploration, OSCAR employs a novel offline-online paradigm. In the offline phase, we model CIR via atomic retrieval selection and composition as a two-stage mixed-integer programming problem, mathematically deriving optimal trajectories that maximize ground-truth coverage for training samples via rigorous boolean set operations. These trajectories are then stored in a golden library and serve as in-context demonstrations that steer the VLM planner at online inference time. Extensive experiments on three public benchmarks and a private industrial benchmark show that OSCAR consistently outperforms state-of-the-art baselines. Notably, it achieves superior performance using only 10% of the training data, demonstrating strong generalization of planning logic rather than dataset-specific memorization.
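To make the offline idea concrete, here is a toy sketch of choosing atomic retrievals and composing them with boolean set operations so that the result covers the ground truth. It is not the paper's two-stage MIP: it swaps in brute-force enumeration over plans with at most two tools. The tool names, the image ids, and the tie-breaking rule (prefer smaller result sets) are all assumptions made for illustration.

```python
from itertools import combinations

# Boolean set operations used to compose two atomic retrieval results.
OPS = {
    "union": lambda a, b: a | b,
    "intersection": lambda a, b: a & b,
    "difference": lambda a, b: a - b,
}

def score(candidate, ground_truth):
    # Primary objective: number of ground-truth images covered.
    # Secondary (assumed) objective: fewer retrieved images overall.
    return (len(candidate & ground_truth), -len(candidate))

def best_plan(atomic_results, ground_truth):
    """Enumerate one- and two-tool plans and return (score, plan, result)."""
    best = None
    # Plans that call a single tool.
    for name, result in atomic_results.items():
        cand = (score(result, ground_truth), (name,), result)
        if best is None or cand[0] > best[0]:
            best = cand
    # Plans that combine two tools with one set operation (both orders).
    for a, b in combinations(sorted(atomic_results), 2):
        for op_name, op in OPS.items():
            for left, right in ((a, b), (b, a)):
                result = op(atomic_results[left], atomic_results[right])
                cand = (score(result, ground_truth), (left, op_name, right), result)
                if cand[0] > best[0]:
                    best = cand
    return best

# Hypothetical tool outputs (sets of image ids) for one training sample.
atomic_results = {
    "text_search": {1, 2, 3, 4},
    "image_search": {3, 4, 5, 6},
    "attribute_filter": {4, 7},
}
ground_truth = {3, 4}

plan_score, plan, retrieved = best_plan(atomic_results, ground_truth)
print(plan, retrieved)
```

In this toy instance, intersecting the two search results covers both ground-truth images with no extras, so it beats every single-tool plan. A plan found this way for a training sample is the kind of trajectory that could be stored as a demonstration. Exhaustive search only works at this scale, which is one reason to use an integer-programming formulation for larger tool sets.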

Paper Structure (38 sections, 20 equations, 4 figures, 9 tables)

This paper contains 38 sections, 20 equations, 4 figures, 9 tables.

Figures (4)

  • Figure 1: Illustration of the limitations of existing image retrieval methods: (a) single-model myopia of unified embedding retrieval, and (b) suboptimal orchestration of heuristic agentic retrieval.
  • Figure 2: The overall framework of our proposed OSCAR.
  • Figure 3: Performance comparison w.r.t. different numbers of demonstrations (i.e., the number of shots) for inference-time steering. "m" and "R" denote mAP and Recall, respectively. The red dashed line denotes the zero-shot performance of OSCAR (i.e., mAP@50 on CIRCO, Recall@50 on CIRR and FashionIQ).
  • Figure 4: Case studies on the FashionIQ (left) and CIRR (right) datasets. The left part shows a correct tool-call trajectory that ranks the ground-truth image first. The right part illustrates the effectiveness of the golden library, which helps the agent avoid its earlier wrong tool calls and retrieve the ground-truth image.