PhotoAgent: A Robotic Photographer with Spatial and Aesthetic Understanding

Lirong Che; Zhenfeng Gan; Yanbo Chen; Junbo Tan; Xueqian Wang

Abstract

Embodied agents for creative tasks such as photography must bridge the semantic gap between high-level language commands and geometric control. We introduce PhotoAgent, an agent that bridges this gap by integrating Large Multimodal Model (LMM) reasoning with a novel control paradigm. PhotoAgent first translates subjective aesthetic goals into solvable geometric constraints via LMM-driven chain-of-thought (CoT) reasoning, allowing an analytical solver to compute a high-quality initial viewpoint. This initial pose is then iteratively refined through visual reflection within a photorealistic internal world model built with 3D Gaussian Splatting (3DGS). This "mental simulation" replaces slow and costly physical trial-and-error, enabling rapid convergence to aesthetically superior results. Evaluations confirm that PhotoAgent excels in spatial reasoning and achieves superior final image quality.
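
The refinement stage described above (start from a solver-provided pose, render candidate views in an internal model, critique them, and move toward better views) can be illustrated with a minimal sketch. This is not the authors' implementation: the `render` and `critique` callables are hypothetical stand-ins for 3DGS rendering and LMM-based critique, the `(x, y, yaw)` pose format is an assumption, and the greedy coordinate search is a simple substitute for whatever update rule PhotoAgent actually uses.

```python
from typing import Callable, Tuple

# Hypothetical camera pose: (x, y, yaw). The paper's pose format may differ.
Pose = Tuple[float, float, float]


def refine_pose(
    initial: Pose,
    render: Callable[[Pose], object],     # stand-in for rendering a view from the 3DGS model
    critique: Callable[[object], float],  # stand-in for an LMM aesthetic score (higher is better)
    step: float = 0.2,
    max_iters: int = 50,
    min_step: float = 1e-3,
) -> Tuple[Pose, float]:
    """Greedy coordinate search over poses, evaluated entirely in simulation."""
    best = initial
    best_score = critique(render(best))
    for _ in range(max_iters):
        improved = False
        for axis in range(3):
            for delta in (-step, step):
                cand = list(best)
                cand[axis] += delta
                cand_pose = (cand[0], cand[1], cand[2])
                score = critique(render(cand_pose))
                if score > best_score:
                    best, best_score, improved = cand_pose, score, True
        if not improved:
            # No neighbouring view is better: shrink the step, or stop once it is tiny.
            step /= 2
            if step < min_step:
                break
    return best, best_score


# Toy usage: the "image" is just the pose, and the critic prefers poses near a target.
TARGET: Pose = (1.0, -0.5, 0.3)


def toy_render(pose: Pose) -> Pose:
    return pose


def toy_critique(image: Pose) -> float:
    return -sum((p - t) ** 2 for p, t in zip(image, TARGET))


final_pose, final_score = refine_pose((0.0, 0.0, 0.0), toy_render, toy_critique)
```

In this toy setting no physical motion is needed until the search has finished, which mirrors the abstract's point that simulated evaluation replaces real-world trial-and-error.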

Paper Structure (22 sections, 9 equations, 7 figures, 3 tables, 1 algorithm)

Introduction
Related Work
Robotic Photography
Reasoning Architectures for Embodied Agents
World Models for Visual Foresight
Method
System Overview
Intention Parsing
Geometric Solving
Reflective Reasoning
Experiments
Spatial Reasoning Evaluation
Experimental Setup
Baseline
Comparative Experiments
...and 7 more sections

Figures (7)

Figure 1: Overview of PhotoAgent and its capabilities. (a) illustrates the inefficiency of real-world trial-and-error, while PhotoAgent leverages internal simulation to achieve one-shot success. (b) highlights three key capabilities.
Figure 2: Overall cognitive architecture of PhotoAgent.
Figure 3: Intention Parsing workflow demonstration. "Take a photo of the toys with visual tension, like the reference."
Figure 4: Closed-loop reflective reasoning. Starting from an internal 3DGS “imagined world”, the agent renders candidate views, critiques them via the LMM, and issues optimized motion commands.
Figure 5: Performance of our method and the baselines on three tasks of different levels.
...and 2 more figures
