Visual Motif Identification: Elaboration of a Curated Comparative Dataset and Classification Methods

Adam Phillips, Daniel Grandes Rodriguez, Miriam Sánchez-Manzano, Alan Salvadó, Manuel Garin, Gloria Haro, Coloma Ballester

TL;DR

This work shows how features extracted from a CLIP model can be leveraged by using a shallow network and an appropriate loss to classify images into 20 different motifs, with surprisingly good results: an $F_1$-score of 0.91 on the authors' test set.

Abstract

In cinema, visual motifs are recurrent iconographic compositions that carry artistic or aesthetic significance. Their use throughout the history of visual arts and media is interesting to researchers and filmmakers alike. Our goal in this work is to recognise and classify these motifs by proposing a new machine learning model that uses a custom dataset to that end. We show how features extracted from a CLIP model can be leveraged by using a shallow network and an appropriate loss to classify images into 20 different motifs, with surprisingly good results: an $F_1$-score of 0.91 on our test set. We also present several ablation studies justifying the input features, architecture and hyperparameters used.
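The pipeline described in the abstract, a shallow network over precomputed CLIP features, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' architecture: the feature size of 512 (the embedding size of a CLIP ViT-B/32 image encoder), the hidden width, the per-motif sigmoid outputs, and the binary cross-entropy loss are all choices made for this sketch, since this section does not say which loss the paper uses. Random vectors stand in for real CLIP embeddings.

```python
import numpy as np

NUM_MOTIFS = 20    # number of motifs, from the abstract
FEATURE_DIM = 512  # assumption: embedding size of a CLIP ViT-B/32 image encoder
HIDDEN_DIM = 256   # assumption: arbitrary hidden width for this sketch

rng = np.random.default_rng(0)


def init_head(feature_dim, hidden_dim, num_classes):
    """Randomly initialise a one-hidden-layer classifier head."""
    return {
        "W1": rng.normal(0.0, 0.02, size=(feature_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "W2": rng.normal(0.0, 0.02, size=(hidden_dim, num_classes)),
        "b2": np.zeros(num_classes),
    }


def forward(params, features):
    """Map (batch, feature_dim) features to (batch, num_classes) probabilities."""
    hidden = np.maximum(features @ params["W1"] + params["b1"], 0.0)  # ReLU
    logits = hidden @ params["W2"] + params["b2"]
    return 1.0 / (1.0 + np.exp(-logits))  # one independent sigmoid per motif


def binary_cross_entropy(probs, targets, eps=1e-7):
    """Mean binary cross-entropy (an assumed loss, not necessarily the paper's)."""
    probs = np.clip(probs, eps, 1.0 - eps)
    return float(-np.mean(targets * np.log(probs)
                          + (1.0 - targets) * np.log(1.0 - probs)))


# Stand-in for precomputed CLIP image embeddings (random, for shape checking only).
features = rng.normal(size=(4, FEATURE_DIM))
targets = np.zeros((4, NUM_MOTIFS))
targets[np.arange(4), [0, 3, 7, 19]] = 1.0  # one hypothetical motif per image

params = init_head(FEATURE_DIM, HIDDEN_DIM, NUM_MOTIFS)
probs = forward(params, features)
loss = binary_cross_entropy(probs, targets)
```

Keeping the CLIP encoder frozen and training only a small head like this one is what makes the approach cheap: the embeddings can be computed once and reused across every ablation.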

Paper Structure

This paper contains 21 sections, 5 equations, 11 figures, 1 table.

Figures (11)

  • Figure 1: A selection of images from our dataset representing the motif Facing the Image. On the left, three stills from films of different styles and periods: from top to bottom, (500) Days of Summer (2009) by Marc Webb, Andrei Rublev (1966) by Andrei Tarkovsky, and The Wind Rises (2013) by Hayao Miyazaki. On the right, the painting Jacob Cornelisz. van Oostsanen Painting a Portrait of His Wife (circa 1530-1550) by Dirck Jacobsz and a press photograph of French President Emmanuel Macron at the Picasso National Museum in Paris.
  • Figure 2: Graph showing the 20 motifs of our proposed Curated Comparative Dataset, and the corresponding number of image samples per motif (proportional bar heights)
  • Figure 3: A frame from Beau Travail (1999) by Claire Denis. Although the image clearly represents the Hug motif, we can see that in the background, there is also a Brawl going on, mostly off camera. The classification of this image within our dataset is therefore Primary Motifs: Hug, and Secondary Motifs: Brawl.
  • Figure 4: Two examples from the Brawl motif. On the left, a press photograph from the BBC of a fight that broke out between the Georgetown Hoyas and the Bayi Rockets basketball teams in 2011. The typical sports setting, the blatant pushing and grabbing among a large group of people, and the wide shot showing the whole scene make this a Canonical image of the Brawl motif. On the right, Duel (July 4th) (2004) by Barnaby Furnas. The mix of colours, imagery of conflict, and general chaos present in this painting make it an instance of the Brawl motif, but its abstract nature means that it is tagged as a Red Flag within the dataset.
  • Figure 5: Four examples of images from our test set, and their associated results after applying our model. For each image, we specify, when relevant, its Primary Motifs (PM), Secondary Motifs (SM), and tags (Red Flag or Canonical). We also give all motifs predicted by our model for that image, i.e. those having a probability of at least 0.5 (cf. $O^I$ in the section on evaluation metrics), and their probability. The images are, from top to bottom and left to right: a frame from Orpheus (1950) by Jean Cocteau, a frame from the video game Final Fantasy VIII (1999), the painting The Kiss (1969) by Pablo Picasso, and a frame from S5E13 Homer and Apu (1994) of The Simpsons, by Mark Kirkland.
  • ...and 6 more figures
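
The prediction rule stated in the Figure 5 caption (a motif is predicted when its probability is at least 0.5) and an $F_1$-score over the resulting label sets can be illustrated with a short sketch. The probabilities and ground-truth sets below are hypothetical, not taken from the paper. Micro-averaging and treating Secondary Motifs as part of the ground truth are assumptions made here; the paper's own metric is defined in its evaluation metrics section, which is not reproduced in this excerpt.

```python
THRESHOLD = 0.5  # from the Figure 5 caption


def predict_motifs(probabilities, threshold=THRESHOLD):
    """Return the set of motifs whose probability is at least `threshold`."""
    return {motif for motif, p in probabilities.items() if p >= threshold}


def micro_f1(true_sets, predicted_sets):
    """Micro-averaged F1 over paired ground-truth and predicted label sets.

    Assumption: micro-averaging is used here for illustration only.
    """
    tp = sum(len(t & p) for t, p in zip(true_sets, predicted_sets))
    fp = sum(len(p - t) for t, p in zip(true_sets, predicted_sets))
    fn = sum(len(t - p) for t, p in zip(true_sets, predicted_sets))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical model outputs for two images (values invented for this example).
outputs = [
    {"Hug": 0.92, "Brawl": 0.61, "Facing the Image": 0.05},
    {"Hug": 0.30, "Brawl": 0.55, "Facing the Image": 0.51},
]
# Hypothetical ground truth: primary and secondary motifs merged into one set.
ground_truth = [{"Hug", "Brawl"}, {"Brawl"}]

predictions = [predict_motifs(o) for o in outputs]
score = micro_f1(ground_truth, predictions)
```

In this toy case the second image yields one spurious motif (Facing the Image at 0.51), so precision is 3/4, recall is 1, and the micro-averaged $F_1$ is 6/7.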