The Photographer Eye: Teaching Multimodal Large Language Models to Understand Image Aesthetics like Photographers
Daiqing Qi, Handong Zhao, Jing Shi, Simon Jenni, Yifei Fan, Franck Dernoncourt, Scott Cohen, Sheng Li
TL;DR
The paper addresses the gap between general visual understanding and aesthetic visual understanding in Multimodal LLMs. It introduces PhotoCritique, a large-scale dataset of 450K images with 2.63M instruction-tuning pairs derived from photography communities; PhotoEye, a language-guided multi-view vision fusion model designed for aesthetics; and PhotoBench, a professional benchmark with 284 sub-topics. Empirical results show PhotoEye outperforms open-source baselines on Q-Bench and PhotoBench, and analyses reveal how language-guided fusion and diverse vision encoders capture both high-level aesthetics and low-level attributes. This work advances practical aesthetic reasoning in MLLMs, enabling more informed image critique, editing guidance, and artistically driven applications in photography-related tasks.
Abstract
"While editing directly from life, photographers have found it too difficult to see simultaneously both the blue and the sky." With this remark, photographer and curator Szarkowski insightfully revealed one of the notable gaps between general and aesthetic visual understanding: while the former focuses on identifying the factual element in an image (sky), the latter transcends such object identification, viewing it instead as an aesthetic component: a pure color block (blue). Such fundamental distinctions between general (detection, localization, etc.) and aesthetic (color, lighting, composition, etc.) visual understanding present a significant challenge for Multimodal Large Language Models (MLLMs). Although some recent works have made initial explorations, they are often limited to general and basic aesthetic commonsense. As a result, they frequently fall short in real-world scenarios (Fig. 1), which require extensive expertise, including photographic techniques, photo pre/post-processing knowledge, and more, to provide a detailed analysis and description. To fundamentally enhance the aesthetic understanding of MLLMs, we first introduce a novel dataset, PhotoCritique, derived from extensive discussions among professional photographers and enthusiasts, and characterized by its large scale, expertise, and diversity. Then, to better learn visual aesthetics from PhotoCritique, we further propose a novel model, PhotoEye, featuring a language-guided multi-view vision fusion mechanism to understand image aesthetics from multiple perspectives. Finally, we present PhotoBench, a comprehensive and professional benchmark for aesthetic visual understanding. On existing benchmarks and PhotoBench, our model demonstrates clear advantages over existing models.
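The abstract does not spell out how the language-guided multi-view fusion works, so the following is only a generic illustration of the idea, not PhotoEye's actual mechanism. It assumes each vision encoder ("view") yields one pooled feature vector of the same dimension as a text embedding, and that views are weighted by a softmax over scaled dot products with the text. The function name `language_guided_fusion`, the view names, and all numbers are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def language_guided_fusion(text_emb, view_feats):
    """Fuse per-view feature vectors with text-conditioned weights.

    Hypothetical sketch: the weighting rule (scaled dot product + softmax)
    is an assumption, not taken from the paper.
    """
    names = list(view_feats)
    scale = math.sqrt(len(text_emb))
    # One relevance score per view, conditioned on the text embedding.
    scores = [dot(text_emb, view_feats[n]) / scale for n in names]
    weights = softmax(scores)
    # Weighted sum of the view features, dimension by dimension.
    fused = [
        sum(w * view_feats[n][i] for w, n in zip(weights, names))
        for i in range(len(text_emb))
    ]
    return fused, dict(zip(names, weights))


# Toy inputs (made up): a "color"-oriented question and three encoder views.
text_emb = [1.0, 0.0, 0.0, 0.0]
view_feats = {
    "semantic": [0.0, 1.0, 0.0, 0.0],
    "color": [2.0, 0.0, 0.0, 0.0],
    "composition": [0.0, 0.0, 1.0, 0.0],
}
fused, weights = language_guided_fusion(text_emb, view_feats)
print(weights)
```

In this toy setup the view whose features align best with the text embedding receives the largest weight, which is the intuition behind letting the question steer which visual perspective dominates.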
