HALO: Human-Aligned End-to-end Image Retargeting with Layered Transformations

Yiran Xu, Siqi Xie, Zhuofang Li, Harris Shadmany, Yinxiao Li, Luciano Sbaiz, Miaosen Wang, Junjie Ke, Jose Lezama, Hang Qi, Han Zhang, Jesse Berent, Ming-Hsuan Yang, Irfan Essa, Jia-Bin Huang, Feng Yang

TL;DR

HALO tackles the challenge of image retargeting by introducing layered transformations that treat salient and non-salient regions separately, mitigating artifacts and preserving content. The method employs a Multi-Flow Network with cross-attention between the original and target-size images to predict two warp fields, which are composited to form the output along with a warped saliency map. A key contribution is the Perceptual Structure Similarity Loss (PSSL), which uses DreamSim on a layout-augmented pseudo-ground-truth to supervise structure preservation without paired data. Empirical results on RetargetMe and extensive ablations show HALO achieving state-of-the-art content and structure preservation with strong user preferences, while offering faster inference due to end-to-end training.

Abstract

Image retargeting aims to change the aspect ratio of an image while maintaining its content and structure with fewer visual artifacts. Existing methods still generate many artifacts or fail to maintain the original content or structure. To address this, we introduce HALO, an end-to-end trainable solution for image retargeting. Since humans are more sensitive to distortions in salient areas than in non-salient areas of an image, HALO decomposes the input image into salient and non-salient layers and applies a different warping field to each layer. To further minimize structural distortion in the output images, we propose a perceptual structure similarity loss, which measures the structural similarity between input and output images and aligns with human perception. Both quantitative results and a user study on the RetargetMe dataset show that HALO achieves state-of-the-art performance. In particular, our method achieves an 18.4% higher user preference than the baselines on average.
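
The layered idea in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the alpha-blend form of the composition, the nearest-neighbour backward warp, the absolute-coordinate flow layout, and the names `backward_warp` and `compose_layers` are all assumptions made for illustration. The masked background also only stands in for an inpainted non-salient layer.

```python
import numpy as np

def backward_warp(image, flow):
    """Nearest-neighbour backward warp (illustrative assumption).

    image: (H, W, C) array.
    flow:  (H_out, W_out, 2) array of absolute (row, col) source
           coordinates, one pair per output pixel.
    """
    rows = np.clip(np.rint(flow[..., 0]).astype(int), 0, image.shape[0] - 1)
    cols = np.clip(np.rint(flow[..., 1]).astype(int), 0, image.shape[1] - 1)
    return image[rows, cols]

def compose_layers(salient, background, mask, flow_sl, flow_nsl):
    """Warp each layer with its own field, then blend with the warped mask."""
    warped_sl = backward_warp(salient, flow_sl)
    warped_nsl = backward_warp(background, flow_nsl)
    warped_mask = backward_warp(mask, flow_sl)  # mask follows the salient flow
    output = warped_mask * warped_sl + (1.0 - warped_mask) * warped_nsl
    return output, warped_mask

# Toy example: squeeze an 8-pixel-wide image to 4 pixels.
H, W, W_out = 4, 8, 4
image = np.random.default_rng(0).random((H, W, 3))
mask = np.zeros((H, W, 1))
mask[1:3, 5:7] = 1.0                      # a 2x2 "salient object"
salient = image * mask
background = image * (1.0 - mask)         # stand-in for the inpainted layer

rows, cols = np.meshgrid(np.arange(H), np.arange(W_out), indexing="ij")
# Salient layer: pure translation, so the object keeps its shape.
flow_sl = np.stack([rows, cols + 3], axis=-1).astype(float)
# Non-salient layer: uniform horizontal squeeze.
flow_nsl = np.stack([rows, cols * (W - 1) / (W_out - 1)], axis=-1)

output, warped_mask = compose_layers(salient, background, mask, flow_sl, flow_nsl)
```

In this toy setup the salient object is only translated, so its area in the warped mask matches the original, while the background absorbs the whole change in width.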

Paper Structure

This paper contains 31 sections, 10 equations, 21 figures, 4 tables.

Figures (21)

  • Figure 1: Content- and structure-aware image retargeting. Our method, HALO, takes a single image as input and reformats it for different aspect-ratios. Compared to previous methods: Self-Play-RL kajiura2020self, GPNN granot2022drop, and DragonDiffusion mou2024dragondiffusion, our method shows better performance in preserving the structure and content of the input image and has better visual quality.
  • Figure 2: Limitations of existing retargeting methods. Previous image retargeting methods have difficulty preserving the input image content and structure. (b) A traditional method, Shift-Map pritch2009shift, duplicates the structure of the car. (c) A generative modeling method, GPDM elnekave2022generating, adds extra content. (d) A feed-forward method, WSSDCNN cho2017wssdcnn, introduces out-of-boundary (OOB) artifacts.
  • Figure 3: Overview of HALO. We retarget an input image $\boldsymbol{I} \in \mathbb{R}^{H \times W}$ to an output image $\boldsymbol{I}^{\prime}$ at the target size $H^{\prime} \times W^{\prime}$. (a) Layered Transformation. We decompose the input image into a salient layer (SL) $\boldsymbol{I}_{SL}$ and a non-salient layer (NSL) $\boldsymbol{I}_{NSL}$ using a saliency map from gao2024multiscale. We inpaint the hole in $\boldsymbol{I}_{NSL}$ with suvorov2022lama to obtain the inpainted NSL $\boldsymbol{I}_{NSLI}$. We then transform $\boldsymbol{I}_{SL}$ and $\boldsymbol{I}_{NSLI}$ with the predicted warping fields $\mathcal{F}_{SL}$ and $\mathcal{F}_{NSL}$, respectively. We also warp the saliency map $\boldsymbol{M}$ with $\mathcal{F}_{SL}$ to obtain a warped saliency map $\boldsymbol{M}^{\prime}$. We obtain the output $\boldsymbol{I}^{\prime}$ by composing the warped layers with $\boldsymbol{M}^{\prime}$ via Eqn. \ref{eqn:layered_output}. To train our model, we use our Perceptual Structure Similarity Loss (PSSL, Eqn. \ref{eqn:pssl}) and non-saliency regularization (Eqn. \ref{eqn:NSReg}). (b) Multi-Flow Network. Our Multi-Flow Network (MFN) takes the input image $\boldsymbol{I} \in \mathbb{R}^{H \times W}$ and its resized version $\boldsymbol{I}_{R} \in \mathbb{R}^{H^{\prime} \times W^{\prime}}$ as input. $\boldsymbol{I}$ and $\boldsymbol{I}_{R}$ are encoded with a shared encoder. The resulting feature maps are then passed into $L$ cross-attention blocks. Finally, the Salient-Layer (SL) head and the Non-Salient-Layer (NSL) head predict a salient flow $\mathcal{F}_{SL}$ and a non-salient flow $\mathcal{F}_{NSL}$ for the corresponding layers.
  • Figure 4: Comparison between DreamSim and LPIPS. We calculate the similarities of the features from LPIPS zhang2018perceptual and DreamSim fu2023dreamsim for image pairs (a, b) and (a, c), and report the results under each column (LPIPS sim.$\uparrow$ / DreamSim sim.$\uparrow$). Surprisingly, the distorted result in (c) shows a higher LPIPS similarity to the source image compared to the undistorted image in (b). DreamSim, however, is more sensitive to structural similarity, showing a higher score for the undistorted image pair (a, b) and a lower score for the distorted pair (a, c).
  • Figure 5: Layout Augmentation. Because DreamSim fu2023dreamsim preprocesses the images by resizing them to $224\times224$, after preprocessing, the naively resized input $\boldsymbol{I}_{R}$ (distorted at the target size $H^{\prime} \times W^{\prime}$) and the input $\boldsymbol{I}$ have a similar structure and result in a small DreamSim loss. On the other hand, the layout augmentation $\boldsymbol{I}_{aug}$ (undistorted at the target size) has a small DreamSim loss with the (ideally) undistorted output $\boldsymbol{I}^{\prime}$. Therefore, to obtain an undistorted output, we compute the DreamSim loss between the output $\boldsymbol{I}^{\prime}$ and $\boldsymbol{I}_{aug}$ as supervision, instead of between $\boldsymbol{I}^{\prime}$ and $\boldsymbol{I}$.
  • ...and 16 more figures
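
The argument in the Figure 5 caption can be checked numerically with a rough sketch. It assumes a nearest-neighbour resize as a stand-in for DreamSim's 224×224 preprocessing and a synthetic disc image. The centre crop is only a hypothetical example of an undistorted target-size image, not the paper's layout augmentation, and `resize_nearest` and `preprocess` are illustrative names.

```python
import numpy as np

def resize_nearest(image, out_h, out_w):
    """Nearest-neighbour resize of a 2-D array (illustrative stand-in)."""
    rows = np.arange(out_h) * image.shape[0] // out_h
    cols = np.arange(out_w) * image.shape[1] // out_w
    return image[rows][:, cols]

def preprocess(image, size=224):
    """Mimics a fixed-size square resize applied before feature extraction."""
    return resize_nearest(image, size, size)

# Synthetic 448x448 input containing a centred disc of radius 100.
H = W = 448
yy, xx = np.mgrid[:H, :W]
image = ((yy - 224) ** 2 + (xx - 224) ** 2 <= 100 ** 2).astype(float)

target_w = 224
naive_resized = resize_nearest(image, H, target_w)  # squeezed disc (distorted)
center_crop = image[:, 112:336]                     # undistorted at 448x224

# After square preprocessing, the squeezed image matches the original...
gap_naive = np.abs(preprocess(image) - preprocess(naive_resized)).mean()
# ...while the undistorted crop does not, because preprocessing stretches it.
gap_crop = np.abs(preprocess(image) - preprocess(center_crop)).mean()
print(gap_naive, gap_crop)
```

Under these assumptions the naively resized image is indistinguishable from the input after preprocessing, so comparing the output against $\boldsymbol{I}$ would reward distortion. Comparing two images at the same target size, as the caption describes for $\boldsymbol{I}^{\prime}$ and $\boldsymbol{I}_{aug}$, puts both under the same preprocessing stretch.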