
PhotoBot: Reference-Guided Interactive Photography via Natural Language

Oliver Limoyo, Jimmy Li, Dmitriy Rivkin, Jonathan Kelly, Gregory Dudek

TL;DR

PhotoBot addresses the problem of taking aesthetically pleasing, reference-guided photographs through natural-language interaction. It proposes a two-module pipeline: a Reference Suggestion module uses a VLM, an object detector, and an LLM to retrieve relevant reference images from a curated gallery, and a Camera View Adjustment module aligns the current view with the chosen reference using semantic keypoint correspondences passed to a PnP solver. The method combines DINO-ViT-based semantic keypoints, Best-Buddies correspondences, and MAGSAC++ for robust pose estimation, which allows the camera pose to be refined iteratively toward the target composition. User studies and experiments indicate that PhotoBot can outperform a No PhotoBot baseline in aesthetics and prompt fidelity, is robust to threshold variation, and generalizes to non-photographic references such as paintings. The authors discuss implications for autonomous, language-grounded photography and suggest future extensions to more capable embodiments and language-driven posing feedback.

Abstract

We introduce PhotoBot, a framework for fully automated photo acquisition based on an interplay between high-level human language guidance and a robot photographer. We propose to communicate photography suggestions to the user via reference images that are selected from a curated gallery. We leverage a visual language model (VLM) and an object detector to characterize the reference images via textual descriptions and then use a large language model (LLM) to retrieve relevant reference images based on a user's language query through text-based reasoning. To correspond the reference image and the observed scene, we exploit pre-trained features from a vision transformer capable of capturing semantic similarity across marked appearance variations. Using these features, we compute suggested pose adjustments for an RGB-D camera by solving a perspective-n-point (PnP) problem. We demonstrate our approach using a manipulator equipped with a wrist camera. Our user studies show that photos taken by PhotoBot are often more aesthetically pleasing than those taken by users themselves, as measured by human feedback. We also show that PhotoBot can generalize to other reference sources such as paintings.
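The TL;DR mentions Best-Buddies correspondences between semantic features of the reference image and the observed scene. As a rough illustration only, the sketch below keeps a pair of descriptors when each is the other's nearest neighbour under cosine similarity. The function name `best_buddies`, the use of plain NumPy arrays in place of DINO-ViT features, and the absence of any thresholding are assumptions made for this example, not details taken from the paper.

```python
import numpy as np


def best_buddies(desc_a: np.ndarray, desc_b: np.ndarray) -> list[tuple[int, int]]:
    """Return (i, j) pairs where desc_a[i] and desc_b[j] are mutual nearest neighbours.

    Hypothetical helper: descriptors are assumed to be non-zero row vectors,
    e.g. per-patch features from a vision transformer.
    """
    # Normalise rows so that the dot product equals cosine similarity.
    a = desc_a / np.linalg.norm(desc_a, axis=1, keepdims=True)
    b = desc_b / np.linalg.norm(desc_b, axis=1, keepdims=True)
    sim = a @ b.T  # shape: (len(desc_a), len(desc_b))

    nn_a_to_b = sim.argmax(axis=1)  # best match in b for every row of a
    nn_b_to_a = sim.argmax(axis=0)  # best match in a for every row of b

    # Keep a pair only when the choice is mutual.
    return [(i, int(j)) for i, j in enumerate(nn_a_to_b) if nn_b_to_a[j] == i]


if __name__ == "__main__":
    reference = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    scene = np.array([[0.0, 2.0], [3.0, 0.0]])
    print(best_buddies(reference, scene))
```

In a pipeline like the one described, the surviving pairs would supply the 2D-3D correspondences for the PnP step, with a robust estimator rejecting the remaining outliers.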

Paper Structure

This paper contains 14 sections, 1 equation, and 10 figures.

Figures (10)

  • Figure 1: PhotoBot provides a reference photograph suggestion based on an observation of the scene and a user's input language query (upper left). The user strikes a pose matching that of the person in the reference photo (upper right) and PhotoBot adjusts its camera accordingly to faithfully capture the layout and composition of the reference image (lower left). The lower-right panel shows an unretouched photograph produced by PhotoBot.
  • Figure 2: PhotoBot system diagram. The two main modules are shown: Reference Suggestion and Camera View Adjustment. Given the observed scene and a user query, PhotoBot suggests a reference image to the user and adjusts the camera to take a photo with a similar layout and composition to the reference image.
  • Figure 3: We convert a gallery of curated reference images into a text-based representation using a combination of readily available metadata, an object detector, and a VLM. A text-based gallery enables an LLM to search, match, and suggest reference images based on a language query from a user and a list of detected objects in the current scene.
  • Figure 4: Examples of user queries and objects detected in the scene and the resulting reference image suggested by the LLM. We explicitly query the LLM to explain its choice of image suggestion. We can also query for an image suggestion without any information from the observed scene, as shown in the fourth row.
  • Figure 5: Sample photos of users evoking various emotions. The user prompts, from top to bottom, are surprised, confident, guilty, confident, happy, and confident. Columns, from left to right, are: user's own creative posing; user mimicking the suggested reference using a static camera; photo taken by our PhotoBot system; and reference image suggested by PhotoBot. The checkered background indicates cropping. The black background indicates padding of the reference image to facilitate the PnP solution. PhotoBot automatically crops the photos it takes to match the image template.
  • ...and 5 more figures
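
The captions for Figures 3 and 4 describe a text-based gallery that an LLM searches using the user's query and the objects detected in the scene. The sketch below shows one way such a gallery and prompt could be assembled. The `GalleryEntry` fields, the text layout, and the prompt wording are all hypothetical; the paper's actual representation and prompt are not given in this summary, and no LLM is called here.

```python
from dataclasses import dataclass, field


@dataclass
class GalleryEntry:
    """Hypothetical text record for one curated reference image."""
    image_id: str
    caption: str                      # e.g. a description produced by a VLM
    objects: list[str]                # e.g. labels from an object detector
    metadata: dict[str, str] = field(default_factory=dict)

    def to_text(self) -> str:
        objs = ", ".join(self.objects)
        meta = ", ".join(f"{k}: {v}" for k, v in sorted(self.metadata.items()))
        return f"[{self.image_id}] caption: {self.caption} | objects: {objs} | metadata: {meta}"


def build_prompt(gallery: list[GalleryEntry], user_query: str, scene_objects: list[str]) -> str:
    """Assemble a plain-text prompt; the wording is an assumption for illustration."""
    gallery_text = "\n".join(entry.to_text() for entry in gallery)
    # An empty scene list mirrors querying without scene information (Figure 4).
    scene_text = ", ".join(scene_objects) or "none"
    return (
        "Gallery:\n" + gallery_text + "\n\n"
        f"User query: {user_query}\n"
        f"Objects detected in the scene: {scene_text}\n"
        "Suggest the best-matching image id and explain your choice."
    )


if __name__ == "__main__":
    gallery = [
        GalleryEntry("img_001", "a person laughing with a cup of coffee", ["person", "cup"]),
        GalleryEntry("img_002", "a person standing with arms crossed", ["person"], {"mood": "confident"}),
    ]
    print(build_prompt(gallery, "confident", ["person", "chair"]))
```

Keeping the gallery as text means retrieval needs no image embeddings at query time; the resulting string would be sent to whichever LLM the system uses.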