BeautyGRPO: Aesthetic Alignment for Face Retouching via Dynamic Path Guidance and Fine-Grained Preference Modeling

Jiachen Yang, Xianhui Lin, Yi Dong, Zebiao Zheng, Xing Liu, Hong Gu, Yanmei Fang

TL;DR

BeautyGRPO is a reinforcement learning framework that aligns face retouching with human aesthetic preferences. It outperforms both specialized face retouching methods and general image editing models, with better texture quality and more accurate blemish removal.

Abstract

Face retouching requires removing subtle imperfections while preserving unique facial identity features, in order to enhance overall aesthetic appeal. However, existing methods suffer from a fundamental trade-off. Supervised learning on labeled data is constrained to pixel-level label mimicry, failing to capture complex subjective human aesthetic preferences. Conversely, while online reinforcement learning (RL) excels at preference alignment, its stochastic exploration paradigm conflicts with the high-fidelity demands of face retouching and often introduces noticeable noise artifacts due to accumulated stochastic drift. To address these limitations, we propose BeautyGRPO, a reinforcement learning framework that aligns face retouching with human aesthetic preferences. We construct FRPref-10K, a fine-grained preference dataset covering five key retouching dimensions, and train a specialized reward model capable of evaluating subtle perceptual differences. To reconcile exploration and fidelity, we introduce Dynamic Path Guidance (DPG). DPG stabilizes the stochastic sampling trajectory by dynamically computing an anchor-based ODE path and replanning a guided trajectory at each sampling timestep, effectively correcting stochastic drift while maintaining controlled exploration. Extensive experiments show that BeautyGRPO outperforms both specialized face retouching methods and general image editing models, achieving superior texture quality, more accurate blemish removal, and overall results that better align with human aesthetic preferences.
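The idea behind Dynamic Path Guidance can be illustrated with a toy sampler. The sketch below is not the paper's update rule: it assumes a rectified-flow style straight path `x_t = (1 - t) * anchor + t * eps`, a fixed anchor vector, and a hypothetical blend weight `lam`, and every function name is invented for illustration. At each step it recomputes the noise implied by the path through the anchor (the correction vector), blends it linearly with fresh Gaussian noise, and replans the next state along the resulting path.

```python
import random

def anchor_noise(x_t, anchor, t):
    """Noise implied by the straight path x_t = (1 - t) * anchor + t * eps."""
    return [(x - (1.0 - t) * a) / t for x, a in zip(x_t, anchor)]

def guided_step(x_t, anchor, t, t_next, lam, rng):
    """One hypothetical guided step from time t to t_next (1 >= t > t_next >= 0).

    lam = 0 follows the deterministic anchor-based path;
    larger lam injects more fresh noise (more exploration).
    """
    correction = anchor_noise(x_t, anchor, t)        # pulls back toward the anchor path
    fresh = [rng.gauss(0.0, 1.0) for _ in x_t]       # standard Gaussian exploration noise
    blended = [(1.0 - lam) * c + lam * z for c, z in zip(correction, fresh)]
    # Replan: place the next state on the path defined by the blended noise.
    return [(1.0 - t_next) * a + t_next * e for a, e in zip(anchor, blended)]

def sample(anchor, steps, lam, seed=0):
    """Run the toy sampler from pure noise (t = 1) down to t = 0."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in anchor]
    for i in range(steps):
        t = 1.0 - i / steps
        t_next = 1.0 - (i + 1) / steps
        x = guided_step(x, anchor, t, t_next, lam, rng)
    return x
```

Because the anchor is fixed in this toy, every trajectory ends exactly on it and only the intermediate states vary with `lam`. In a real sampler the anchor would presumably be re-estimated from the model at each timestep, so final outputs would differ between rollouts.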

Paper Structure (46 sections, 27 equations, 16 figures, 5 tables)

Figures (16)

  • Figure 1: Comparison of face retouching methods and training paradigms. (a) Our approach outperforms specialized (RetouchFormer) and general (NanoBanana) models in blemish removal, identity preservation, and human preference. (b) By stabilizing stochastic exploration, BeautyGRPO overcomes the limitations of Supervised Fine-Tuning (SFT) and standard online RL (FlowGRPO), achieving natural and highly realistic results.
  • Figure 2: Overview of our FRPref-10K construction, reward model training, and BeautyGRPO with Dynamic Path Guidance (DPG). Top: FRPref-10K dataset curation pipeline. Multiple retouched candidates are generated with diverse editing models, preference pairs are formed via output-vs-output/label comparisons, and are annotated by VLMs across five quality dimensions before human verification. Bottom left: Three-stage reward model training, including SFT, self-training with consistency filtering, and GRPO. Bottom right: BeautyGRPO training with DPG on a FluxKontext-LoRA backbone.
  • Figure 3: Overview of different sampling trajectories. (a) Standard flow-matching ODE trajectory for SFT. (b) Uncontrolled SDE trajectory in FlowGRPO, which gradually drifts from the high-fidelity anchor point and introduces noise artifacts. (c) Proposed BeautyGRPO with Dynamic Path Guidance (DPG). At each timestep, DPG dynamically computes an anchor-based ODE path and replans a guided trajectory by linearly blending a correction vector with standard Gaussian noise, correcting stochastic drift while maintaining controlled exploration.
  • Figure 4: Visual comparison of face retouching results across different methods on FFHQR and in-the-wild datasets, where Flux.K denotes FluxKontext. Our BeautyGRPO removes blemishes cleanly while retaining natural texture, skin gloss, and moles, unlike existing models that exhibit incomplete blemish removal, over-smoothing, or artificial appearance. Please refer to the supplementary materials for more results.
  • Figure 5: Radar chart comparing human judgment accuracy for our reward model against various VLMs.
  • ...and 11 more figures
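
The captions for Figures 1 to 3 refer to GRPO-style training, in which several candidates are sampled for the same input and scored by a reward model. A minimal sketch of the generic group-relative advantage computation follows. It shows the standard GRPO normalization, not necessarily the exact variant used in the paper; the function name, the use of the population standard deviation, and the `eps` constant are assumptions.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples drawn for the same input.

    Each advantage is the reward's deviation from the group mean, scaled by
    the group's standard deviation (eps avoids division by zero).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    return [(r - mean) / (std + eps) for r in rewards]
```

Candidates scored above the group average receive positive advantages and are reinforced, while those below average are discouraged. A group with identical rewards yields all-zero advantages and therefore no learning signal.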