PhotoFramer: Multi-modal Image Composition Instruction

Zhiyuan You, Ke Wang, He Zhang, Xin Cai, Jinjin Gu, Tianfan Xue, Chao Dong, Zhoutong Zhang

TL;DR

PhotoFramer guides casual photographers toward better composition by delivering both actionable textual guidance and an illustrative exemplar image within a single multi-modal framework. It organizes guidance into a hierarchy of view-change, zoom-in, and shift tasks, and builds a large-scale dataset of <poor image, good image, text guidance> triplets to train a Bagel-based model that jointly understands and generates text and images. The dataset draws on cropping datasets for the shift and zoom-in tasks, a degradation-based pipeline for the view-change task, and a text–vision annotation model that produces the textual guidance. In experiments, PhotoFramer outperforms open-source editors and is competitive with GPT-4o, with strong text–image alignment and supporting ablations. The work is a practical step toward in-camera composition assistants that make expert photographic priors accessible to everyday users; code, model weights, and datasets are released to support further research.

Abstract

Composition matters during the photo-taking process, yet many casual users struggle to frame well-composed images. To provide composition guidance, we introduce PhotoFramer, a multi-modal composition instruction framework. Given a poorly composed image, PhotoFramer first describes how to improve the composition in natural language and then generates a well-composed example image. To train such a model, we curate a large-scale dataset. Inspired by how humans take photos, we organize composition guidance into a hierarchy of sub-tasks: shift, zoom-in, and view-change. Shift and zoom-in data are sampled from existing cropping datasets, while view-change data are obtained via a two-stage pipeline. First, we sample pairs with varying viewpoints from multi-view datasets and train a degradation model to transform well-composed photos into poorly composed ones. Second, we apply this degradation model to expert-taken photos to synthesize poor images, which form the training pairs. Using this dataset, we fine-tune a model that jointly processes and generates both text and images, enabling actionable textual guidance with illustrative examples. Extensive experiments demonstrate that textual instructions effectively steer image composition, and coupling them with exemplars yields consistent improvements over exemplar-only baselines. PhotoFramer offers a practical step toward composition assistants that make expert photographic priors accessible to everyday users. Code, model weights, and datasets are released at https://zhiyuanyou.github.io/photoframer.

Paper Structure

This paper contains 22 sections, 9 equations, 20 figures, 8 tables.

Figures (20)

  • Figure 1: We propose PhotoFramer, a model designed for composition instruction during photo capturing. Given a poorly composed image, PhotoFramer first describes how to improve the composition in natural language and then generates an example image that follows the described suggestions. The photo-taker can follow the textual guidance and the example image to capture a better-composed photo.
  • Figure 2: Task paradigm and data example. Given a poorly composed image, our PhotoFramer is required to generate a textual guidance (describing how to improve the composition) together with an example image (depicting what a well-composed image looks like). Motivated by three key photography factors (vantage point, focal choice, and subject placement), our PhotoFramer comprises three tasks: (a) Shift: adjust the framing to place the subject properly and remove border distractions; (b) Zoom-in: select a tighter crop (simulating a longer focal length) that yields a stronger composition; (c) View-change: choose a new vantage point or camera pose to reframe the scene.
  • Figure 3: Dataset construction for the shift and zoom-in tasks. For the shift task, given an image from the cropping dataset, we sample its crops to form a <poor,good> image pair. A random rotation is applied to the poor crop. For the zoom-in task, the original image and a well-composed crop form an <original,good> pair. To ensure sufficient resolution, we apply $4\times$ super resolution to the original image using HYPIR.
  • Figure 4: Qualitative results of our composition assessment model, illustrating the thinking process and final assessment output.
  • Figure 5: Dataset construction for the view-change task. (a) Leveraging our composition assessment model, we sample <poor,good> image pairs from multi-view datasets. We then train a composition degradation model that generates poor-composition images from good ones. (b) We apply this degradation model to expert-taken good photos to synthesize pseudo poor-composition images, forming the final pairs. We do not rely solely on multi-view datasets, as most of their good images are not sufficiently well-composed.
  • ...and 15 more figures
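
The pair construction described in the Figure 3 caption can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the released code: the `Crop` type, the per-crop `score` field, the rule of taking the top-scored crop as "good" and a random other crop as "poor", the rotation range `max_angle_deg`, and the choice to scale crop coordinates by the super-resolution factor. Only the overall shape (a <poor,good> crop pair with a random rotation for shift, and an <original,good> pair with a 4x upscaled original for zoom-in) follows the caption.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Crop:
    """A crop box in source-image pixels with a hypothetical composition score."""
    x0: int
    y0: int
    x1: int
    y1: int
    score: float  # assumption: higher means better composed


def sample_shift_pair(crops, rng, max_angle_deg=10.0):
    """Return (poor_crop, good_crop, rotation_deg) for the shift task.

    Assumed rule: the top-scored crop is "good", a random other crop is
    "poor", and the rotation to apply to the poor crop is drawn uniformly.
    """
    if len(crops) < 2:
        raise ValueError("need at least two scored crops")
    ranked = sorted(crops, key=lambda c: c.score, reverse=True)
    good = ranked[0]
    poor = rng.choice(ranked[1:])
    angle = rng.uniform(-max_angle_deg, max_angle_deg)
    return poor, good, angle


def make_zoom_in_pair(image_size, crops, sr_factor=4):
    """Return (original_box, good_box) in super-resolved pixel coordinates.

    The original image is assumed to be upscaled by `sr_factor`, so the
    good crop's coordinates are scaled by the same factor.
    """
    width, height = image_size
    good = max(crops, key=lambda c: c.score)
    original_box = (0, 0, width * sr_factor, height * sr_factor)
    good_box = (good.x0 * sr_factor, good.y0 * sr_factor,
                good.x1 * sr_factor, good.y1 * sr_factor)
    return original_box, good_box


if __name__ == "__main__":
    crops = [
        Crop(0, 0, 100, 100, 0.9),
        Crop(10, 10, 60, 60, 0.2),
        Crop(50, 0, 150, 100, 0.4),
    ]
    poor, good, angle = sample_shift_pair(crops, random.Random(0))
    print("shift pair:", poor, good, round(angle, 2))
    print("zoom-in pair:", make_zoom_in_pair((200, 100), crops))
```

The sketch only manipulates boxes and angles; actual cropping, rotation, and super resolution (HYPIR in the paper) would be applied to pixel data in a separate step.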