The Photographer Eye: Teaching Multimodal Large Language Models to Understand Image Aesthetics like Photographers

Daiqing Qi, Handong Zhao, Jing Shi, Simon Jenni, Yifei Fan, Franck Dernoncourt, Scott Cohen, Sheng Li

TL;DR

The paper addresses the gap between general visual understanding and aesthetic visual understanding in Multimodal LLMs. It introduces PhotoCritique, a large-scale dataset of 450K images with 2.63M instruction-tuning pairs derived from photography communities; PhotoEye, a language-guided multi-view vision fusion model designed for aesthetics; and PhotoBench, a professional benchmark with 284 sub-topics. Empirical results show PhotoEye outperforms open-source baselines on Q-Bench and PhotoBench, and analyses reveal how language-guided fusion and diverse vision encoders capture both high-level aesthetics and low-level attributes. This work advances practical aesthetic reasoning in MLLMs, enabling more informed image critique, editing guidance, and artistically driven applications in photography-related tasks.

Abstract

"While editing directly from life, photographers have found it too difficult to see simultaneously both the blue and the sky." With this remark, photographer and curator Szarkowski insightfully revealed one of the notable gaps between general and aesthetic visual understanding: while the former focuses on identifying the factual element in an image (sky), the latter transcends such object identification, viewing it instead as an aesthetic component: a pure color block (blue). Such fundamental distinctions between general (detection, localization, etc.) and aesthetic (color, lighting, composition, etc.) visual understanding present a significant challenge for Multimodal Large Language Models (MLLMs). Although some recent works have made initial explorations, they are often limited to general and basic aesthetic commonsense. As a result, they frequently fall short in real-world scenarios (Fig. 1), which require extensive expertise, including photographic techniques, photo pre/post-processing knowledge, and more, to provide a detailed analysis and description. To fundamentally enhance the aesthetic understanding of MLLMs, we first introduce a novel dataset, PhotoCritique, derived from extensive discussions among professional photographers and enthusiasts, and characterized by its large scale, expertise, and diversity. Then, to better learn visual aesthetics from PhotoCritique, we further propose a novel model, PhotoEye, featuring a language-guided multi-view vision fusion mechanism to understand image aesthetics from multiple perspectives. Finally, we present PhotoBench, a comprehensive and professional benchmark for aesthetic visual understanding. On existing benchmarks and PhotoBench, our model demonstrates clear advantages over existing models.
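The abstract names a language-guided multi-view vision fusion mechanism but does not specify it here. As a rough illustration only, the sketch below shows one way such a fusion could work: a text embedding scores the feature vector from each vision encoder ("view"), and the views are combined using softmax weights. The function name `language_guided_fusion`, the dot-product scoring, and the toy vectors are all assumptions for illustration, not the PhotoEye architecture.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def language_guided_fusion(text_embedding, view_features):
    """Fuse per-view feature vectors with weights derived from the text.

    Hypothetical helper, not the paper's method: each view is scored by a
    scaled dot product with the text embedding, the scores are normalized
    with softmax, and the views are averaged using those weights.
    """
    dim = len(text_embedding)
    if not view_features:
        raise ValueError("at least one view is required")
    if any(len(view) != dim for view in view_features):
        raise ValueError("every view must match the text embedding size")

    scale = math.sqrt(dim)
    scores = [
        sum(t * f for t, f in zip(text_embedding, view)) / scale
        for view in view_features
    ]
    weights = softmax(scores)

    # Weighted sum of the views, computed one feature dimension at a time.
    fused = [
        sum(w * view[i] for w, view in zip(weights, view_features))
        for i in range(dim)
    ]
    return weights, fused


# Toy usage: the first view aligns with the text, so it gets more weight.
text = [1.0, 0.0]
views = [[1.0, 0.0], [0.0, 1.0]]
weights, fused = language_guided_fusion(text, views)
```

In this toy setup a question about, say, exposure would shift weight toward whichever encoder's features align with it; how PhotoEye actually conditions its fusion on language is described in the paper body, not in this summary.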

Paper Structure

This paper contains 15 sections, 11 figures, 5 tables.

Figures (11)

  • Figure 1: Examples from our model (PhotoEye), existing MLLMs tailored for low-level vision or aesthetics, and GPT-4o (2024-08-06). The left example and the middle top example highlight a notable limitation of existing open-source MLLMs for low-level vision and aesthetics: insufficient coverage of visual aesthetics. When they fail to identify issues, they either list positive aspects or claim there are no issues, significantly limiting their usefulness in real-world scenarios. The middle bottom example shows existing MLLMs' lack of expertise in photography: lowering exposure properly can in fact enhance the colors of objects, so B is correct. While other models, including GPT-4o, made mistakes, our model is correct. The right example reveals another clear limitation of existing open-source MLLMs: their vision encoders are insensitive to low-level vision and aesthetics. In a series of increasingly overexposed photos (2nd, 3rd, and 4th), PhotoEye’s vision modules, more attuned to low-level and aesthetic features, identified overexposure by the 2nd photo, while other models recognized it only when the photos were severely overexposed (i.e., the 3rd and 4th images). High-quality aesthetic content is highlighted.
  • Figure 2: An example from the dpchallenge platform. 3 of the total 64 comments for the photo are presented for illustration.
  • Figure 3: Data Generation Pipeline. The generation of one critique from a group of comments of one image is shown as an example.
  • Figure 4: Top: Examples from PhotoCritique. Bottom: Comparison with examples in Q-Instruct (Wu et al., 2024).
  • Figure 5: Comparison of the description-length distributions of aesthetic comments in PhotoCritique and Q-Instruct. The scale for PhotoCritique is shown in blue and that for Q-Instruct in red.
  • ...and 6 more figures