Learning Subject-Aware Cropping by Outpainting Professional Photos

James Hong, Lu Yuan, Michaël Gharbi, Matthew Fisher, Kayvon Fatahalian

TL;DR

GenCrop is a weakly supervised approach to subject-aware image cropping. It generates a large synthetic training set by outpainting stock images with a diffusion model, then trains a Transformer-based crop regressor conditioned on a subject mask. The method requires no new manual crop annotations and achieves results competitive with supervised baselines on portrait-focused and broader subject datasets. The paper also provides extensive ablations and a new evaluation setup based on Unsplash images. Its contributions are the dataset generation pipeline, new evaluation sets, and a conditioning extension, which together show that generative models can create training data at scale for a visually subjective task. The work suggests that diffusion-based data generation can reduce annotation burden and improve generalization in cropping across diverse subjects and domains.

Abstract

How to frame (or crop) a photo often depends on the image subject and its context; e.g., a human portrait. Recent works have defined the subject-aware image cropping task as a nuanced and practical version of image cropping. We propose a weakly-supervised approach (GenCrop) to learn what makes a high-quality, subject-aware crop from professional stock images. Unlike supervised prior work, GenCrop requires no new manual annotations beyond the existing stock image collection. The key challenge in learning from this data, however, is that the images are already cropped and we do not know what regions were removed. Our insight is to combine a library of stock images with a modern, pre-trained text-to-image diffusion model. The stock image collection provides diversity and its images serve as pseudo-labels for a good crop, while the text-image diffusion model is used to out-paint (i.e., outward inpainting) realistic uncropped images. Using this procedure, we are able to automatically generate a large dataset of cropped-uncropped training pairs to train a cropping model. Despite being weakly-supervised, GenCrop is competitive with state-of-the-art supervised methods and significantly better than comparable weakly-supervised baselines on quantitative and qualitative evaluation metrics.
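
The pairing idea in the abstract can be illustrated with a little geometry: if a stock image is pasted onto a larger canvas, the pasted rectangle is the pseudo-label crop and everything else is the region a diffusion model would outpaint. The following is a minimal sketch under stated assumptions, not the authors' implementation. The function name `place_on_canvas`, the uniform offset sampling, and the canvas size are hypothetical, and the outpainting call itself is omitted.

```python
import numpy as np

def place_on_canvas(image, canvas_size, rng):
    """Paste `image` at a random offset on a square canvas.

    Returns the canvas, a boolean mask that is True where pixels would
    need to be outpainted, and the pseudo-label box (x0, y0, x1, y1)
    in canvas pixel coordinates.
    """
    h, w = image.shape[:2]
    if h > canvas_size or w > canvas_size:
        raise ValueError("image must fit inside the canvas")
    # Assumption: offsets are sampled uniformly; the paper's sampling
    # scheme may differ.
    y0 = int(rng.integers(0, canvas_size - h + 1))
    x0 = int(rng.integers(0, canvas_size - w + 1))
    canvas = np.zeros((canvas_size, canvas_size) + image.shape[2:],
                      dtype=image.dtype)
    canvas[y0:y0 + h, x0:x0 + w] = image
    # True marks the unknown region around the original image.
    mask = np.ones((canvas_size, canvas_size), dtype=bool)
    mask[y0:y0 + h, x0:x0 + w] = False
    return canvas, mask, (x0, y0, x0 + w, y0 + h)

# Usage: a white 200x300 (width x height) stand-in for a stock image.
rng = np.random.default_rng(0)
stock = np.full((300, 200, 3), 255, dtype=np.uint8)
canvas, mask, box = place_on_canvas(stock, 512, rng)
```

The returned `box` is what a cropping model would be trained to predict once the masked region has been filled in by the outpainting model.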

Paper Structure (55 sections, 1 equation, 16 figures, 10 tables)

Figures (16)

  • Figure 1: Generated training pairs. We outpaint professional images (left) to obtain plausible, uncropped input images (right). The original image is treated as a pseudo-label crop (red). Since the images come from stock image collections, each pseudo-label is an acceptable, professional crop.
  • Figure 2: Dataset generation pipeline. Stages are marked (a-f); see the dataset generation section for a detailed explanation. We start with a stock image (a) and estimate its text caption (b). We then sample a blank canvas around the image to define the region to be outpainted (c). Outpainting uses a text-to-image inpainting model (Stable Diffusion) and produces a square image (d). Automated filters then remove poorly generated images (e). Later, when training a cropping model, we sample an enclosing view (f) of the uncropped image at a common aspect ratio (e.g., $3:4$) so that the model generalizes beyond square images. The region containing the original image is treated as a pseudo-label when training a cropping model.
  • Figure 3: Common outpainting failure cases. The original image is marked with a red rectangle. (a) An extra person was synthesized in the outpainted region. This can alter the ideal composition of the scene. (b) The outpainted region is a grid or composite of multiple images (col 1, 2), frames the original image (col 3), or has a border (col 4). These artificial edges can bias the model towards detecting sharp borders.
  • Figure 4: Cropping model architecture. Our design is inspired by CACNet; details are in the cropping architecture section and the supplementary model and training section. We extract CNN features from the input image, $\mathbf{x}_o$, and subject mask, $\mathbf{m}_o$. A Transformer encoder uses these features to generate crop proposals at a grid of anchor points. The crop proposal at each anchor point contained in the subject region is weighted by a second branch. A softmax-weighted sum computes the final crop prediction $\mathbf{\hat{y}}$.
  • Figure 5: Examples of crops with subtle mistakes (input on left; crop on right). First pair: the crop cuts through the subject's feet. Second pair: the crop leaves clutter on the edges and places the subject neither centered for left-right symmetry nor at a third, resulting in an unbalanced image.
  • ...and 11 more figures
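
Stage (f) in the Figure 2 caption, sampling an enclosing view at a common aspect ratio, can be sketched as follows. This is an illustrative sketch only: the function name `sample_enclosing_view`, the uniform sampling of view size and position, and the reading of $3:4$ as width:height are all assumptions rather than details taken from the paper.

```python
import random

def sample_enclosing_view(box, canvas_w, canvas_h,
                          aspect_w=3, aspect_h=4, rng=None):
    """Sample a view of aspect aspect_w:aspect_h (width:height) that
    contains `box` and lies inside the canvas.

    Returns (view, label), where view is (vx0, vy0, vx1, vy1) in canvas
    pixels and label is the box in normalized view coordinates, or None
    if no such view fits.
    """
    rng = rng or random.Random()
    x0, y0, x1, y1 = box
    bw, bh = x1 - x0, y1 - y0
    ratio = aspect_w / aspect_h  # view width divided by view height
    # Smallest view height that still encloses the box.
    min_h = max(bh, bw / ratio)
    # Largest view height that still fits on the canvas.
    max_h = min(canvas_h, canvas_w / ratio)
    if min_h > max_h:
        return None
    # Assumption: size and position are sampled uniformly.
    vh = rng.uniform(min_h, max_h)
    vw = vh * ratio
    # Keep the box inside the view and the view inside the canvas.
    vx0 = rng.uniform(max(0.0, x1 - vw), min(x0, canvas_w - vw))
    vy0 = rng.uniform(max(0.0, y1 - vh), min(y0, canvas_h - vh))
    view = (vx0, vy0, vx0 + vw, vy0 + vh)
    label = ((x0 - vx0) / vw, (y0 - vy0) / vh,
             (x1 - vx0) / vw, (y1 - vy0) / vh)
    return view, label

# Usage: a 200x300 original image located inside a 512x512 outpainted canvas.
result = sample_enclosing_view((100, 80, 300, 380), 512, 512,
                               rng=random.Random(0))
```

Because the view always contains the original image, the normalized `label` stays inside the unit square and can serve directly as a regression target.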
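
The final aggregation step described in the Figure 4 caption, a softmax-weighted sum over the crop proposals at anchors inside the subject region, can be sketched in NumPy. This is a hedged illustration, not the authors' code: the function name `aggregate_crops`, the array shapes, and the use of negative infinity to exclude anchors outside the subject mask are assumptions.

```python
import numpy as np

def aggregate_crops(proposals, logits, anchor_in_subject):
    """Combine per-anchor crop proposals into one crop.

    proposals: (N, 4) array of boxes, one per anchor point.
    logits: (N,) array of scores from the weighting branch.
    anchor_in_subject: (N,) boolean array, True for anchors that fall
        inside the subject mask.
    """
    proposals = np.asarray(proposals, dtype=float)
    logits = np.asarray(logits, dtype=float)
    keep = np.asarray(anchor_in_subject, dtype=bool)
    if not keep.any():
        raise ValueError("no anchor lies inside the subject mask")
    # Anchors outside the subject receive zero weight after the softmax.
    masked = np.where(keep, logits, -np.inf)
    masked = masked - masked.max()  # subtract the max for numerical stability
    weights = np.exp(masked)
    weights /= weights.sum()
    return weights @ proposals

# Usage: the first anchor is outside the subject, so its high score is ignored.
crop = aggregate_crops(
    [[0.0, 0.0, 1.0, 1.0], [0.2, 0.2, 0.8, 0.8], [0.4, 0.4, 0.6, 0.6]],
    [5.0, 0.0, 0.0],
    [False, True, True],
)
```

In this example the two remaining anchors have equal scores, so the result is the average of their proposals.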