Using Multimodal Deep Neural Networks to Disentangle Language from Visual Aesthetics

Colin Conwell, Christopher Hamblin, Chelsea Boccagno, David Mayo, Jesse Cummings, Leyla Isik, Andrei Barbu

TL;DR

The results suggest that whatever words the authors may eventually find to describe their experience of beauty, the ineffable computations of feedforward perception may provide sufficient foundation for that experience.

Abstract

When we experience a visual stimulus as beautiful, how much of that experience derives from perceptual computations we cannot describe versus conceptual knowledge we can readily translate into natural language? Disentangling perception from language in visually-evoked affective and aesthetic experiences through behavioral paradigms or neuroimaging is often empirically intractable. Here, we circumnavigate this challenge by using linear decoding over the learned representations of unimodal vision, unimodal language, and multimodal (language-aligned) deep neural network (DNN) models to predict human beauty ratings of naturalistic images. We show that unimodal vision models (e.g. SimCLR) account for the vast majority of explainable variance in these ratings. Language-aligned vision models (e.g. SLIP) yield small gains relative to unimodal vision. Unimodal language models (e.g. GPT2) conditioned on visual embeddings to generate captions (via CLIPCap) yield no further gains. Caption embeddings alone yield less accurate predictions than image and caption embeddings combined (concatenated). Taken together, these results suggest that whatever words we may eventually find to describe our experience of beauty, the ineffable computations of feedforward perception may provide sufficient foundation for that experience.

Paper Structure

This paper contains 8 sections, 4 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: Schematic of our feature regression pipeline for decoding affective information from deep net responses, a reproduction (with only minimal modification) of the methods described in conwell2021_feeling. Our targets in these experiments are group-average beauty ratings, which we predict by extracting image features from a candidate deep neural network model, (optionally) reducing their dimensionality, then using them as predictors in a cross-validated ridge regression with the group-average beauty ratings as output. This method gives us one beauty decoding score per layer per candidate model.
  • Figure 2: (A) Schematic of our controlled modeling experiment using the SLIP model family mu2021slip. 'Controlled' in this case refers to the isolation of single axes of interest across distinct sets of models that vary exclusively along these axes (with other possible variations held constant). In SLIP, both the training dataset (YFCC15M) and architecture (ViT-[S,B,L]) are held constant across 3 model variants (SimCLR, CLIP, and SLIP). The difference between SimCLR and SLIP (a combination of SimCLR's visual augmentation regime with CLIP's language alignment in a unified contrastive learning pipeline) is a direct empirical instantiation of the presence or absence of training signal provided by language. (B) Results from our feature regression pipeline as applied to SimCLR (a unimodal vision model), CLIP (a language-aligned model) and SLIP (a model that combines unimodal vision training and language alignment), holding dataset and architecture constant. (B1) In the top plot, we see results across layers (the semitransparent jagged lines are individual layer scores; the curves are the output of a generalized additive smoother across layers; the SLIP models each have 3 variants: ViT-[Small, Base, Large]). The takeaway here is that for all models, predictive accuracy is generally higher in deeper layers (with the final embedding layer often the highest). (B2) In the bottom plot, we see the results from the maximally predictive layer of each model. Error bars are 95% confidence intervals across 1000 bootstrap resamples of the human subject pool. The takeaway here is that adding language alignment (without taking away unimodal vision training) in the form of the SLIP objective does significantly increase downstream readout of aesthetic information.
  • Figure 3: (A) Schematic of our experiment using CLIPCap mokady2021clipcap to translate the visual embeddings of CLIP into natural language by way of a GPT2 text decoder: The process begins with the embedding of an image (red line) into the latent space of a CLIP-ViT-B32 model. These embeddings contain only feedforward visual information. CLIP's latent visual embedding is then piped into GPT2 by way of CLIPCap's MLP adapter, and in the first pass through GPT2 (blue line), the only context available to GPT2 for next token generation is the visual information instantiated in CLIPCap's 'prefix' tokens. Once a caption is produced, we concatenate (purple line) this caption with the original visual prefix and pipe it once again through GPT2 to extract embeddings that instantiate both the original visual information in the prefix and any added information instantiated in the caption. Finally, we remove the visual prefix from the caption and extract the GPT2 embeddings for the generated caption alone, effectively extracting the pure linguistic context provided by this caption. (B) Results of the CLIPCap translation experiment: The red line in the facet on the left shows the scores across the layers of the CLIP visual encoder used to generate an image 'prefix' embedding that is subsequently passed to GPT2 for captioning. The blue line in the facet on the right is the predictive power of that prefix embedding as it is processed across the layers of GPT2. In other words, this blue line tracks the potential of GPT2 to facilitate better aesthetic decoding by extracting further information from the visual prefix. The green line is the predictive power of the generated caption passed back through GPT2 without the prefix embedding. This line tracks how well (machine-generated, image-conditioned) language alone might predict aesthetic ratings. The purple line is the predictive power of the generated caption passed back through GPT2 with the prefix embedding. This line tracks whether visual embeddings and image-conditioned language together might outperform either one alone. The difference between the blue line and the green line represents the difference in predictive power between CLIP's visual features and GPT2's linguistic features: the difference, in other words, between language-aligned perception and language alone. This gap is substantial. The negative slope of the purple line seems to be an artifact of the feature regression overfitting to the embedding complexity added by the caption. Each line in this plot may be thought of as instantiating a form of 'context window', a term used in natural language processing to describe what information provides precedent for any given 'next token' prediction in the language-generating process.
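
The feature regression pipeline summarized in the Figure 1 caption (extract layer features, optionally reduce their dimensionality, then fit a cross-validated ridge regression against group-average ratings) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration rather than the authors' implementation: the function names, the dense Gaussian random projection, the 5-fold split, the fixed penalty `alpha`, and the use of the Pearson correlation between held-out predictions and ratings as the decoding score are not specified in the text above. The features and ratings are synthetic stand-ins for real network activations and human data.

```python
import numpy as np


def random_projection(features, n_components, seed=0):
    """Optionally reduce feature dimensionality (assumed: dense Gaussian projection)."""
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((features.shape[1], n_components))
    return features @ (proj / np.sqrt(n_components))


def ridge_fit_predict(x_train, y_train, x_test, alpha):
    """Fit ridge regression in closed form on the training fold, predict the test fold."""
    mean_x = x_train.mean(axis=0)
    mean_y = y_train.mean()
    xc = x_train - mean_x
    gram = xc.T @ xc + alpha * np.eye(xc.shape[1])
    weights = np.linalg.solve(gram, xc.T @ (y_train - mean_y))
    return (x_test - mean_x) @ weights + mean_y


def decoding_score(features, ratings, alpha=1.0, n_folds=5, seed=0):
    """Cross-validated score: correlation of held-out predictions with the ratings."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(ratings))
    predictions = np.empty(len(ratings))
    for test_idx in np.array_split(order, n_folds):
        train_idx = np.setdiff1d(order, test_idx)
        predictions[test_idx] = ridge_fit_predict(
            features[train_idx], ratings[train_idx], features[test_idx], alpha
        )
    return np.corrcoef(predictions, ratings)[0, 1]


if __name__ == "__main__":
    # Synthetic stand-ins: 200 "images", one layer with 512 "units".
    rng = np.random.default_rng(1)
    layer_features = rng.standard_normal((200, 512))
    ratings = layer_features[:, :10].sum(axis=1) + 0.5 * rng.standard_normal(200)
    reduced = random_projection(layer_features, n_components=64)
    print("decoding score:", decoding_score(reduced, ratings))
```

Running `decoding_score` once per layer of each candidate model would yield the per-layer, per-model scores that the captions refer to. In practice the penalty would normally be tuned rather than fixed, and the score could be any held-out accuracy measure.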