PhotoBench: Beyond Visual Matching Towards Personalized Intent-Driven Photo Retrieval

Tianyi Xu, Rong Shan, Junjie Wu, Jiadeng Huang, Teng Wang, Jiachen Zhu, Wenteng Chen, Minxin Tu, Quantao Dou, Zhaoxiang Wang, Changwang Zhang, Weinan Zhang, Jun Wang, Jianghao Lin

TL;DR

PhotoBench is the first benchmark constructed from authentic personal albums, designed to shift the paradigm from visual matching to personalized, multi-source, intent-driven reasoning. Its results indicate that the next frontier in personal multimodal retrieval lies beyond unified embeddings and requires robust agentic reasoning systems capable of precise constraint satisfaction and multi-source fusion.

Abstract

Personal photo albums are not merely collections of static images but living, ecological archives defined by temporal continuity, social entanglement, and rich metadata, which makes personalized photo retrieval non-trivial. However, existing retrieval benchmarks rely heavily on context-isolated web snapshots and fail to capture the multi-source reasoning required to resolve authentic, intent-driven user queries. To bridge this gap, we introduce PhotoBench, the first benchmark constructed from authentic personal albums. It is designed to shift the paradigm from visual matching to personalized, multi-source, intent-driven reasoning. Based on a rigorous multi-source profiling framework, which integrates visual semantics, spatial-temporal metadata, social identity, and temporal events for each image, we synthesize complex intent-driven queries rooted in users' life trajectories. Extensive evaluation on PhotoBench exposes two critical limitations: the modality gap, where unified embedding models collapse on non-visual constraints, and the source fusion paradox, where agentic systems orchestrate tools poorly. These findings indicate that the next frontier in personal multimodal retrieval lies beyond unified embeddings, necessitating robust agentic reasoning systems capable of precise constraint satisfaction and multi-source fusion. Our PhotoBench is available.

Paper Structure (56 sections, 10 equations, 7 figures, 11 tables)

Figures (7)

  • Figure 1: The illustration of the dataset construction pipeline for PhotoBench.
  • Figure 2: The query distribution w.r.t. ground-truth count (left) and source-aware taxonomy (right) in PhotoBench.
  • Figure 3: Linguistic comparison with traditional benchmarks. We evaluate four dimensions: Avg Query Length (total tokens), Noun Density (proportion of entities), Avg Syntactic Depth (grammatical complexity), and Lexical Diversity (MTLD, measuring vocabulary richness independent of text length). PhotoBench queries are structurally streamlined (lower length and depth) yet lexically diverse, reflecting search-style rather than narrative-style language.
  • Figure 4: Temporal distribution of photos across albums.
  • Figure 5: Supplementary query statistics. (a) Distribution of labels across semantic dimensions. (b) Distribution of Cognitive labels per query (0 indicates a pure Fact-based query).
  • ...and 2 more figures