PhotoBot: Reference-Guided Interactive Photography via Natural Language
Oliver Limoyo, Jimmy Li, Dmitriy Rivkin, Jonathan Kelly, Gregory Dudek
TL;DR
PhotoBot addresses the problem of taking aesthetically pleasing, reference-guided photographs through natural-language interaction. It proposes a two-module pipeline: a Reference Suggestion component uses a VLM, an object detector, and an LLM to retrieve relevant reference images from a curated gallery, and a Camera View Adjustment component aligns the current view with the chosen reference by passing semantic keypoint correspondences to a PnP solver. The method uses DINO-ViT-based semantic keypoints, Best-Buddies correspondences, and MAGSAC++ for robust pose estimation, and refines the camera pose iteratively toward the target composition. User studies and experiments indicate that PhotoBot can outperform a No PhotoBot baseline in aesthetics and prompt fidelity, is robust to threshold variation, and generalizes to non-photographic references such as paintings. The work points to practical uses for autonomous, language-grounded photography and suggests future extensions to more capable embodiments and language-driven posing feedback.
Abstract
We introduce PhotoBot, a framework for fully automated photo acquisition based on an interplay between high-level human language guidance and a robot photographer. We propose to communicate photography suggestions to the user via reference images that are selected from a curated gallery. We leverage a visual language model (VLM) and an object detector to characterize the reference images via textual descriptions and then use a large language model (LLM) to retrieve relevant reference images based on a user's language query through text-based reasoning. To match the reference image to the observed scene, we exploit pre-trained features from a vision transformer capable of capturing semantic similarity across marked appearance variations. Using these features, we compute suggested pose adjustments for an RGB-D camera by solving a perspective-n-point (PnP) problem. We demonstrate our approach using a manipulator equipped with a wrist camera. Our user studies show that photos taken by PhotoBot are often more aesthetically pleasing than those taken by users themselves, as measured by human feedback. We also show that PhotoBot can generalize to other reference sources such as paintings.
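The correspondence step between the reference image and the observed scene can be illustrated with a minimal sketch of Best-Buddies (mutual nearest-neighbour) matching. This is not the authors' implementation: the function name `best_buddies` is hypothetical, the inputs are toy vectors standing in for per-patch vision-transformer descriptors, and cosine similarity is assumed as the matching score. In the described pipeline, the matched pairs would then be combined with depth from the RGB-D camera and passed to a robust PnP solver, which is not shown here.

```python
import numpy as np


def best_buddies(ref_desc, cur_desc):
    """Return index pairs (i, j) where reference descriptor i and current
    descriptor j are each other's nearest neighbour (hypothetical helper)."""
    # L2-normalise rows so the dot product equals cosine similarity.
    ref = ref_desc / np.linalg.norm(ref_desc, axis=1, keepdims=True)
    cur = cur_desc / np.linalg.norm(cur_desc, axis=1, keepdims=True)

    sim = ref @ cur.T                    # shape: (n_ref, n_cur)
    nn_ref_to_cur = sim.argmax(axis=1)   # best current match for each reference
    nn_cur_to_ref = sim.argmax(axis=0)   # best reference match for each current

    # Keep only pairs that agree in both directions.
    ref_idx = np.arange(sim.shape[0])
    mutual = nn_cur_to_ref[nn_ref_to_cur] == ref_idx
    return np.stack([ref_idx[mutual], nn_ref_to_cur[mutual]], axis=1)


# Toy descriptors (assumed values, for illustration only).
ref_desc = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
cur_desc = np.array([[0.0, 2.0], [3.0, 0.1]])
pairs = best_buddies(ref_desc, cur_desc)
print(pairs.tolist())
```

Requiring agreement in both directions discards one-sided matches, such as the third reference descriptor above, which prefers a current descriptor that is already better explained by another reference point. This reduces the number of outliers the downstream pose estimator has to reject.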
