Venus: Benchmarking and Empowering Multimodal Large Language Models for Aesthetic Guidance and Cropping

Tianxiang Du; Hulingxiao He; Yuxin Peng

TL;DR

Venus is a two-stage framework that first equips multimodal large language models (MLLMs) with aesthetic guidance (AG) capability through progressively complex aesthetic questions, then activates their aesthetic cropping ability via chain-of-thought (CoT) rationales. This enables interpretable and interactive aesthetic refinement across both stages of photo creation: during capture and after it.

Abstract

The widespread use of smartphones has made photography ubiquitous, yet a clear gap remains between ordinary users and professional photographers, who can identify aesthetic issues and provide actionable shooting guidance during capture. We define this capability as aesthetic guidance (AG) -- an essential but largely underexplored domain in computational aesthetics. Existing multimodal large language models (MLLMs) primarily offer overly positive feedback, failing to identify issues or provide actionable guidance. Without AG capability, they cannot effectively identify distracting regions or optimize compositional balance, thus also struggling in aesthetic cropping, which aims to refine photo composition through reframing after capture. To address this, we introduce AesGuide, the first large-scale AG dataset and benchmark with 10,748 photos annotated with aesthetic scores, analyses, and guidance. Building upon it, we propose Venus, a two-stage framework that first empowers MLLMs with AG capability through progressively complex aesthetic questions and then activates their aesthetic cropping power via CoT-based rationales. Extensive experiments show that Venus substantially improves AG capability and achieves state-of-the-art (SOTA) performance in aesthetic cropping, enabling interpretable and interactive aesthetic refinement across both stages of photo creation. Code is available at https://github.com/PKU-ICST-MIPL/Venus_CVPR2026.
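
To make the two annotation and refinement targets in the abstract concrete, the following is a minimal sketch of how one AG annotation (score, analysis, and guidance split into issue identification and shooting guidance) and a post-capture crop box might be represented. All names here (`AGRecord`, its fields, `clamp_crop`, `summarize`) are illustrative assumptions; they are not the released AesGuide schema or the Venus code, and the score scale is not stated in the source.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class AGRecord:
    """One annotated photo. Field names are hypothetical, not the released schema."""
    image_path: str
    aesthetic_score: float   # scale is an assumption; the source does not state it
    analysis: str            # free-text aesthetic analysis
    issue: str               # issue identification
    shooting_guidance: str   # actionable advice for the capture stage


def summarize(record: AGRecord) -> str:
    """Render the guidance part of a record as two labelled lines."""
    return f"Issue: {record.issue}\nGuidance: {record.shooting_guidance}"


def clamp_crop(box: Tuple[int, int, int, int], width: int, height: int) -> Tuple[int, int, int, int]:
    """Clip an (x1, y1, x2, y2) crop box to the image bounds.

    Raises ValueError if nothing of the box remains inside the image.
    """
    x1, y1, x2, y2 = box
    # Clip each coordinate to the image, then order the corners.
    x1, x2 = sorted((max(0, min(x1, width)), max(0, min(x2, width))))
    y1, y2 = sorted((max(0, min(y1, height)), max(0, min(y2, height))))
    if x1 == x2 or y1 == y2:
        raise ValueError("crop box has zero area after clamping")
    return x1, y1, x2, y2
```

A predicted crop box that spills outside the frame can be passed through `clamp_crop` before the image is actually reframed, which keeps the capture-stage guidance and the post-capture crop in one record-oriented workflow.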

Paper Structure (16 sections, 2 equations, 4 figures, 4 tables)

Introduction
Related Work
  Image Aesthetic Tasks and Datasets
  Aesthetic MLLMs
  Specialized Aesthetic Cropping Models
A New Dataset and Benchmark: AesGuide
  Data Collection and Annotation
  Benchmark
Method
  Aesthetic Guidance Capability Building
  Aesthetic Cropping Power Activation
Experiments
  Implementation Details
  Main Results
  Ablation Studies
...and 1 more section

Figures (4)

Figure 1: Overview of image aesthetic tasks and datasets. We follow and refine the comprehensive aesthetic task taxonomy proposed by Jin et al. [jin2024apddv2]. A user survey of 1,069 participants shows that 91% prefer AG, a largely underexplored task with no dedicated dataset.
Figure 2: Illustration of the AGGF and examples from the proposed AesGuide dataset, showing aesthetic scores (in purple), aesthetic analyses (in orange), and aesthetic guidance (issue identification in green and shooting guidance in red) on the right side of each image.
Figure 3: Overview of the Venus framework: (1) Aesthetic guidance capability building, where AesGuide is leveraged to empower MLLMs with AG capability. (2) Aesthetic cropping power activation, which unlocks the cropping ability using CoT-based rationales.
Figure 4: (a) Qualitative comparison of aesthetic cropping results from GPT-4o, AesExpert, and Venus-Q (ours). (b) Demonstration of Venus-Q’s interpretable and interactive aesthetic cropping capabilities.
