PhotoFramer: Multi-modal Image Composition Instruction
Zhiyuan You, Ke Wang, He Zhang, Xin Cai, Jinjin Gu, Tianfan Xue, Chao Dong, Zhoutong Zhang
TL;DR
PhotoFramer guides casual photographers toward better composition by delivering actionable textual guidance and an illustrative exemplar image within a single multi-modal framework. It organizes guidance into a hierarchy of shift, zoom-in, and view-change tasks, and builds a large-scale dataset of (poor image, good image, text guidance) triplets to train a Bagel-based model that jointly understands and generates text and images. Shift and zoom-in data come from cropping datasets, view-change data come from a degradation-based pipeline, and a text–vision annotation model produces the textual guidance. Experiments show that PhotoFramer outperforms open-source editors and is competitive with GPT-4o, with strong text–image alignment and supporting ablations. The work is a practical step toward in-camera composition assistants that make expert photographic priors available to everyday users, with code and data released to support further research.
Abstract
Composition matters during the photo-taking process, yet many casual users struggle to frame well-composed images. To provide composition guidance, we introduce PhotoFramer, a multi-modal composition instruction framework. Given a poorly composed image, PhotoFramer first describes how to improve the composition in natural language and then generates a well-composed example image. To train such a model, we curate a large-scale dataset. Inspired by how humans take photos, we organize composition guidance into a hierarchy of sub-tasks: shift, zoom-in, and view-change. Shift and zoom-in data are sampled from existing cropping datasets, while view-change data are obtained via a two-stage pipeline. First, we sample pairs with varying viewpoints from multi-view datasets and train a degradation model to transform well-composed photos into poorly composed ones. Second, we apply this degradation model to expert-taken photos to synthesize poorly composed counterparts, which together form training pairs. Using this dataset, we finetune a model that jointly processes and generates both text and images, enabling actionable textual guidance with illustrative examples. Extensive experiments demonstrate that textual instructions effectively steer image composition, and that coupling them with exemplars yields consistent improvements over exemplar-only baselines. PhotoFramer offers a practical step toward composition assistants that make expert photographic priors accessible to everyday users. Code, model weights, and datasets have been released at https://zhiyuanyou.github.io/photoframer.
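
As an illustration of how shift and zoom-in pairs might be sampled from a cropping dataset, the sketch below derives a "poor" window from a well-composed crop box. It is a minimal sketch under stated assumptions, not the authors' pipeline: the Box type, the function names, and the choice of a same-size displaced window (shift) or an enlarged centered window (zoom-in) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned crop window in pixel coordinates (hypothetical helper type)."""
    left: int
    top: int
    right: int
    bottom: int

    @property
    def width(self) -> int:
        return self.right - self.left

    @property
    def height(self) -> int:
        return self.bottom - self.top


def shifted_poor_box(good: Box, img_w: int, img_h: int, dx: int, dy: int) -> Box:
    """Assumed shift-task sampling: a window of the same size as the good crop,
    displaced by (dx, dy) and clamped so it stays inside the image."""
    left = min(max(good.left + dx, 0), img_w - good.width)
    top = min(max(good.top + dy, 0), img_h - good.height)
    return Box(left, top, left + good.width, top + good.height)


def zoomed_out_poor_box(good: Box, img_w: int, img_h: int, scale: float) -> Box:
    """Assumed zoom-in-task sampling: a window `scale` times larger than the
    good crop, centered on it and clipped to the image bounds. Zooming in from
    this wider window recovers the good crop."""
    if scale < 1.0:
        raise ValueError("scale must be >= 1 so the poor view contains the good crop")
    cx = (good.left + good.right) / 2
    cy = (good.top + good.bottom) / 2
    half_w = good.width * scale / 2
    half_h = good.height * scale / 2
    return Box(
        max(0, round(cx - half_w)),
        max(0, round(cy - half_h)),
        min(img_w, round(cx + half_w)),
        min(img_h, round(cy + half_h)),
    )


# Example: a 400x300 source image with an annotated good crop.
good_crop = Box(100, 50, 300, 200)
poor_shift = shifted_poor_box(good_crop, 400, 300, dx=150, dy=-80)
poor_zoom = zoomed_out_poor_box(good_crop, 400, 300, scale=2.0)
```

Each (poor window, good crop) pair would then be cropped from the source image and paired with textual guidance; how the displacement and scale are sampled is left open here because the abstract does not specify it.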
