
Cropper: Vision-Language Model for Image Cropping through In-Context Learning

Seung Hyun Lee, Jijun Jiang, Yiran Xu, Zhuofang Li, Junjie Ke, Yinxiao Li, Junfeng He, Steven Hickson, Katie Datsenko, Sangpil Kim, Ming-Hsuan Yang, Irfan Essa, Feng Yang

TL;DR

This paper proposes an efficient prompt retrieval mechanism that automates the selection of in-context examples for image cropping, and introduces an iterative refinement strategy that progressively improves the predicted crops.

Abstract

The goal of image cropping is to identify visually appealing crops in an image. Conventional methods are trained on specific datasets and fail to adapt to new requirements. Recent breakthroughs in large vision-language models (VLMs) enable visual in-context learning without explicit training. However, downstream tasks with VLMs remain underexplored. In this paper, we propose an effective approach to leverage VLMs for image cropping. First, we propose an efficient prompt retrieval mechanism for image cropping to automate the selection of in-context examples. Second, we introduce an iterative refinement strategy to progressively enhance the predicted crops. The proposed framework, which we refer to as Cropper, is applicable to a wide range of cropping tasks, including free-form cropping, subject-aware cropping, and aspect ratio-aware cropping. Extensive experiments demonstrate that Cropper significantly outperforms state-of-the-art methods across several benchmarks.
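
The prompt retrieval step can be illustrated with a minimal sketch. It assumes each image has already been mapped to an embedding vector and that cosine similarity serves as the image similarity metric; the abstract fixes neither choice. The names `cosine_similarity` and `retrieve_top_s` and the `(image_id, embedding, crop_box)` database layout are hypothetical, chosen only for illustration.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical database entry: (image_id, embedding, ground-truth crop box).
Entry = Tuple[str, Sequence[float], Tuple[float, float, float, float]]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_s(query_emb: Sequence[float], database: List[Entry], s: int) -> List[Entry]:
    """Return the s database entries most similar to the query embedding.

    Assumption: cosine similarity over precomputed embeddings stands in for
    whatever image similarity metric is actually used.
    """
    ranked = sorted(
        database,
        key=lambda entry: cosine_similarity(query_emb, entry[1]),
        reverse=True,
    )
    return ranked[:s]
```

The retrieved entries, each pairing an image with its crop, would then be formatted as the in-context examples placed in the VLM prompt.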

Paper Structure (11 sections, 1 equation, 11 figures, 16 tables)


Figures (11)

  • Figure 1: Cropper is a unified framework for various cropping tasks, including free-form cropping, subject-aware cropping, and aspect ratio-aware cropping, built on top of a pretrained large vision-language model through in-context learning. Given the input image, the top-K semantically similar images are retrieved as in-context learning prompts and fed to the pretrained vision-language model to generate crops. The crop candidates are iteratively refined to yield a visually pleasing output crop. All images are from Unsplash [unsplash2025].
  • Figure 2: Cropper Overview. Cropper consists of two main steps: visual prompt retrieval and iterative crop refinement. In visual prompt retrieval, the top-$S$ ICL examples are retrieved using an image similarity metric. In the iterative crop refinement stage, the VLM generates candidate crops based on these ICL examples, and a scorer then rates each crop on aesthetics, content similarity, and area size. The VLM refines the crop candidates $L$ times using the feedback from the scorer. All images are from Unsplash [unsplash2025].
  • Figure 3: Relationship between the number of in-context learning examples $S$ and IoU on the GAICD [zeng2020grid] validation dataset for free-form cropping. IoU is highest when $S$ is 30.
  • Figure 4: Relationship between the number of crops $R$ and IoU on the GAICD [zeng2020grid] validation dataset for free-form cropping. IoU is highest when $R$ is 6.
  • Figure 5: Relationship among the number of refinement iterations $L$, the VLM temperature, and IoU on the GAICD [zeng2020grid] validation dataset. The experiments use the previously determined optimal numbers of ICL examples and candidate crops.
  • ...and 6 more figures
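
The refinement loop described in the Figure 2 caption, and the IoU metric used in Figures 3 to 5, can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' implementation: `propose` is a hypothetical stand-in for the VLM call (with the ICL examples already baked into it), `score` is a hypothetical stand-in for the scorer, and boxes are assumed to be `(x1, y1, x2, y2)` tuples. How the scorer combines aesthetics, content similarity, and area size is not specified here, so it is left as a user-supplied function.

```python
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x1, y1, x2, y2)
Scored = Tuple[Box, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def refine_crops(
    propose: Callable[[List[Scored], int], List[Box]],  # hypothetical VLM wrapper
    score: Callable[[Box], float],                      # hypothetical scorer
    num_crops: int,   # R in the captions
    num_iters: int,   # L in the captions
) -> Scored:
    """Generate R candidates, then refine them L times using scorer feedback.

    Returns the best (box, score) pair seen across all rounds.
    """
    feedback: List[Scored] = []
    best: Scored = ((0.0, 0.0, 0.0, 0.0), float("-inf"))
    # One initial generation round plus num_iters refinement rounds.
    for _ in range(num_iters + 1):
        candidates = propose(feedback, num_crops)
        feedback = [(box, score(box)) for box in candidates]
        for item in feedback:
            if item[1] > best[1]:
                best = item
    return best
```

Keeping `propose` and `score` as plain callables lets the loop be exercised with toy functions before any real model is connected.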