Representing Beauty: Towards a Participatory but Objective Latent Aesthetics

Alexander Michael Rusnak

TL;DR

This work interrogates what it means for a machine to recognize beauty, proposing that aesthetic representations converge across diverse models and modalities, which implies a realist, universal geometry of beauty. It builds on the universal representation hypothesis and anchors it in hylomorphic principles, arguing that form is inseparable from material constraints and that beauty functions as a teleological binder in latent spaces. Empirically, it reports higher embedding self-similarity for aesthetic content and greater cross-model alignment, especially in mid-layer representations, suggesting a hierarchical abstraction from particular to transcendent concepts. The study frames human co-creation as foundational, with machines capable of offering novel insights at scale while remaining grounded in human intentions and cultural processes, signaling a productive partnership in cultural production and analysis.

Abstract

What does it mean for a machine to recognize beauty? While beauty remains a culturally and experientially compelling but philosophically elusive concept, deep learning systems increasingly appear capable of modeling aesthetic judgment. In this paper, we explore the capacity of neural networks to represent beauty despite the immense formal diversity of objects to which the term applies. Drawing on recent work on cross-model representational convergence, we show that aesthetic content produces more similar and more aligned representations across models trained on distinct data and modalities, while unaesthetic images do not. This finding implies that the formal structure of beautiful images has a realist basis, rather than being solely a reflection of socially constructed values. Furthermore, we propose that these realist representations exist because aesthetic form is jointly grounded in physical and cultural substance. We argue that human perceptual and creative acts play a central role in shaping the latent spaces of deep learning systems, but that a realist basis for aesthetics shows that machines are not mere creative parrots: they can produce novel creative insights from the unique vantage point of scale. Our findings suggest that human-machine co-creation is not merely possible but foundational, with beauty serving as a teleological attractor in both cultural production and machine perception.

Paper Structure

This paper contains 6 sections and 2 figures.

Figures (2)

  • Figure 1: (A) The internal similarity of embeddings produced by the same model, measured by cosine similarity for the CLIP-type models and by Euclidean distance for DINOv2 [oquab2024dinov2learningrobustvisual]. We show the average similarity among embeddings of aesthetic images minus that among embeddings of unaesthetic images (as classified by the Aesthetic Visual Analysis (AVA) dataset). The aesthetic embeddings are more self-similar, and this excess similarity increases with model size and performance. (B) Mutual-nearest-neighbor representational alignment (following [platonic]) between DINOv2-Large and the two CLIP-variant model families, split by the aesthetic classification of the source images. Alignment is distinctly higher for representations of aesthetic images than for those of unaesthetic or aesthetically ambiguous images.
  • Figure 2: Layerwise alignment to DINOv2-Large for multiple models, computed separately for aesthetic and unaesthetic images. Not only do the aesthetic representations have higher overall alignment, but they also exhibit the abstraction pattern we described more clearly, with the middle layers displaying more universal, abstract representations.
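The two quantities behind Figure 1 can be illustrated with a minimal NumPy sketch. This is not the authors' code, and several details are assumptions: panel (A)'s "excess self-similarity" is taken as the mean pairwise similarity over distinct pairs in the aesthetic set minus the same quantity in the unaesthetic set, with Euclidean distances negated so that larger always means more similar; panel (B)'s mutual-nearest-neighbor alignment is taken as the mean fraction of shared k-nearest neighbors (cosine similarity, self excluded) when the same images are embedded by two models, in the general form of the cited [platonic] metric. The function names and the default `k=10` are hypothetical.

```python
import numpy as np


def mean_pairwise_similarity(x: np.ndarray, metric: str = "cosine") -> float:
    """Average similarity over all distinct pairs of rows of x (shape n x d).

    'cosine' gives mean cosine similarity; 'euclidean' gives the NEGATIVE mean
    Euclidean distance (an assumed convention so larger = more similar).
    """
    n = x.shape[0]
    if metric == "cosine":
        xn = x / np.linalg.norm(x, axis=1, keepdims=True)
        sim = xn @ xn.T
    elif metric == "euclidean":
        diff = x[:, None, :] - x[None, :, :]
        sim = -np.linalg.norm(diff, axis=-1)
    else:
        raise ValueError(f"unknown metric: {metric}")
    off_diagonal = ~np.eye(n, dtype=bool)  # drop self-pairs
    return float(sim[off_diagonal].mean())


def excess_self_similarity(aesthetic: np.ndarray, unaesthetic: np.ndarray,
                           metric: str = "cosine") -> float:
    """Panel (A)-style quantity: aesthetic minus unaesthetic self-similarity."""
    return (mean_pairwise_similarity(aesthetic, metric)
            - mean_pairwise_similarity(unaesthetic, metric))


def knn_indices(x: np.ndarray, k: int) -> np.ndarray:
    """Indices of each row's k nearest neighbors by cosine similarity, excluding itself."""
    xn = x / np.linalg.norm(x, axis=1, keepdims=True)
    sim = xn @ xn.T
    np.fill_diagonal(sim, -np.inf)  # a point is never its own neighbor
    return np.argsort(-sim, axis=1)[:, :k]  # most similar first


def mutual_knn_alignment(a: np.ndarray, b: np.ndarray, k: int = 10) -> float:
    """Mean fraction of shared k-NN when the same n inputs are embedded by two models.

    a and b may have different feature dimensions; only the row order must match.
    """
    if a.shape[0] != b.shape[0]:
        raise ValueError("a and b must embed the same inputs in the same order")
    if not 0 < k < a.shape[0]:
        raise ValueError("k must satisfy 0 < k < number of inputs")
    nn_a, nn_b = knn_indices(a, k), knn_indices(b, k)
    overlaps = [len(set(ra) & set(rb)) / k for ra, rb in zip(nn_a, nn_b)]
    return float(np.mean(overlaps))
```

For a Figure 2-style curve, one could call `mutual_knn_alignment` once per layer, pairing that layer's pooled features with DINOv2-Large embeddings of the same images and running it separately on the aesthetic and unaesthetic subsets. How the paper pools layer activations, which DINOv2 layer it uses as the reference, and what value of k it chooses are not stated in this summary. Identical representations score 1.0, while unrelated ones score near chance, roughly k / (n - 1).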