Spatial-Semantic Collaborative Cropping for User Generated Content

Yukun Su; Yiwen Cao; Jingliang Deng; Fengyun Rao; Qingyao Wu

TL;DR

This work tackles the challenge of generating aesthetically pleasing, content-preserving thumbnails for diverse UGC under a fixed aspect ratio. It introduces S2CNet, a Spatial-Semantic Collaborative Cropping Network that builds a fully connected, adaptive graph over RoIs and the crop candidate, integrating semantic similarity and spatial topology through a graph-aware attention mechanism to propagate information toward the crop candidate. A key contribution is the UGCrop5K dataset, comprising 5,000 images and 450,000 densely labeled candidate crops with MOS ratings, enabling robust evaluation in real-world, multi-object scenes. Experiments show that S2CNet outperforms state-of-the-art cropping methods on UGCrop5K and GAICv1/v2 benchmarks while maintaining real-time efficiency (~162 FPS on an RTX 2080Ti), demonstrating practical impact for thumbnailing, cover images, and icon generation in UGC platforms.

Abstract

A large amount of User Generated Content (UGC) is uploaded to the Internet daily and displayed to people worldwide through the client side (e.g., mobile and PC). This requires cropping algorithms to produce aesthetic thumbnails within a specific aspect ratio on different devices. However, existing image cropping works mainly focus on landmark or landscape images and fail to model the relations among multiple objects against the complex backgrounds found in UGC. Besides, previous methods consider only the aesthetics of the cropped images while ignoring content integrity, which is crucial for UGC cropping. In this paper, we propose a Spatial-Semantic Collaborative cropping network (S2CNet) for arbitrary user generated content, accompanied by a new cropping benchmark. Specifically, we first mine the visual genes of the potential objects. Then, the suggested adaptive attention graph recasts this task as a procedure of information association over visual nodes. The underlying spatial and semantic relations are ultimately centralized to the crop candidate through differentiable message passing, which helps our network efficiently preserve both the aesthetics and the content integrity. Extensive experiments on the proposed UGCrop5K and other public datasets demonstrate the superiority of our approach over state-of-the-art counterparts. Our project is available at https://github.com/suyukun666/S2CNet.
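The message-passing idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the center-distance spatial bias, the residual update, and the random weights below are all illustrative assumptions. It only shows how semantic similarity and a spatial term might be combined into attention weights over a fully connected graph whose node 0 is the crop candidate and whose remaining nodes are RoIs.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def box_centers(boxes):
    # boxes: (n, 4) array of (x1, y1, x2, y2) in normalized image coordinates.
    cx = (boxes[:, 0] + boxes[:, 2]) / 2
    cy = (boxes[:, 1] + boxes[:, 3]) / 2
    return np.stack([cx, cy], axis=1)

def graph_attention_step(feats, boxes, w_q, w_k, w_v, spatial_scale=1.0):
    # Semantic term: scaled dot-product similarity between node features.
    q, k, v = feats @ w_q, feats @ w_k, feats @ w_v
    semantic = q @ k.T / np.sqrt(q.shape[1])
    # Spatial term (assumed form): closer box centers get a larger bias.
    c = box_centers(boxes)
    dist = np.linalg.norm(c[:, None, :] - c[None, :, :], axis=-1)
    spatial = -spatial_scale * dist
    # Each row is one node's attention over all nodes in the graph.
    attn = softmax(semantic + spatial, axis=-1)
    # Residual update: every node aggregates messages from all nodes.
    return feats + attn @ v, attn

rng = np.random.default_rng(0)
n, d = 4, 8                              # 1 crop candidate + 3 RoIs, d-dim features
feats = rng.normal(size=(n, d))          # stand-in for RoIAlign/RoDAlign features
boxes = np.array([[0.1, 0.1, 0.9, 0.9],  # node 0: crop candidate
                  [0.2, 0.2, 0.4, 0.5],
                  [0.5, 0.3, 0.8, 0.7],
                  [0.0, 0.8, 0.2, 1.0]])
w_q, w_k, w_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
w_out = rng.normal(size=d)               # hypothetical scoring head

updated, attn = graph_attention_step(feats, boxes, w_q, w_k, w_v)
score = float(updated[0] @ w_out)        # read the score off the crop-candidate node
```

In this sketch only the updated crop-candidate node feeds the score, which mirrors the abstract's description of relations being centralized to the crop candidate; the actual network's graph construction and scoring head may differ.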

Paper Structure (14 sections, 9 equations, 5 figures, 5 tables)

Introduction
Related Work
Methodology
Network Overview
Adaptive Attention Graph
Graph-Aware Attention Module
Network Optimization
Experiment
Datasets and Metrics
Implementation Details
Ablation Analysis
Compare with the State-of-the-art Methods
Conclusion
Acknowledgments

Figures (5)

Figure 1: Illustrative example of cropping for UGC in a real-life application. Such content is more complicated, with multiple objects and confounding backgrounds, and ranges across life clips, news, sports, games, and lyric videos. Note that the original size of the UGC is marked above each image. For intuitive explanation, the red dashed box indicates the cropped image produced by our algorithm for a fixed aspect ratio, with the extraneous content removed. Best viewed by zooming.
Figure 2: The overall pipeline of our proposed framework. We first use the convolutional backbone to extract visual features, followed by RoIAlign (He et al., 2017) and RoDAlign (Zeng et al., 2019) to extract $d$-dimensional features for each potential object and the crop candidate. These features are then provided as inputs to the proposed adaptive attention graph (AAG), which performs joint spatial-semantic information propagation over each node in the graph. Ultimately, the updated messages are centralized to the crop candidate node to perform aesthetic score prediction.
Figure 3: Statistics of the proposed UGCrop5K dataset, including (a) some visualization sample images, (b) taxonomic structure, (c) scatter plot of image width versus image height distribution with marker size indicating the number, and (d) histograms of the MOS.
Figure 4: The t-SNE feature visualization before and after the proposed graph. Different colours indicate the crop candidates and the regions that should be removed, reserved, or partially reserved, respectively. The features before the graph show indistinguishable clusters, while the features learned by our graph are more discriminative, which can guide the model to find good views more reasonably. Zoom in for the best view.
Figure 5: (a) Qualitative comparisons of different state-of-the-art methods. The first two rows of images are from the GAICv1 and GAICv2 datasets, and the last two rows are from the UGCrop5K dataset. The top-scored crops are shown in the yellow dotted boxes. (b) Image cropping results with different aspect ratios.
