Contrastive Language-Image Pre-training for the Italian Language

Federico Bianchi, Giuseppe Attanasio, Raphael Pisoni, Silvia Terragni, Gabriele Sarti, Sri Lakshmi

TL;DR

This work presents CLIP-Italian, the first language-specific CLIP model for Italian, trained on 1.4 million image-caption pairs assembled from four data sources and refined through translations and data cleaning. By fine-tuning pre-trained vision and language encoders, the authors achieve notable improvements over a multilingual CLIP baseline in image retrieval and zero-shot classification, demonstrating the value of language-specific adaptation. They provide an online demo and release the model and code, while acknowledging translation biases, data limits, and environmental costs. The results advance Italian multi-modal capabilities and outline practical considerations for scaling such efforts to other languages.

Abstract

CLIP (Contrastive Language-Image Pre-training) is a very recent multi-modal model that jointly learns representations of images and texts. The model is trained on a massive amount of English data and shows impressive performance on zero-shot classification tasks. Training the same model on a different language is not trivial, since data in other languages may be insufficient and the model needs high-quality translations of the texts to guarantee good performance. In this paper, we present the first CLIP model for the Italian language (CLIP-Italian), trained on more than 1.4 million image-text pairs. Results show that CLIP-Italian outperforms the multilingual CLIP model on the tasks of image retrieval and zero-shot classification.

Paper Structure

This paper contains 14 sections, 5 figures, 4 tables.

Figures (5)

  • Figure 1: NumPy-like pseudocode that describes the CLIP loss; the image is inspired by the one provided by CLIP21.
  • Figure 2: Result of the query "due cani sulla neve" (two dogs on the snow) on the Unsplash25K dataset.
  • Figure 3: Result of the query "una coppia al tramonto" (a couple during sunset) on the Unsplash25K dataset.
  • Figure 4: Result of the query "una coppia in montagna" (a couple on the mountain) on the Unsplash25K dataset.
  • Figure 5: Result of the query "un topolino" (a small mouse) on the Unsplash25K dataset.
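
Figure 1 refers to NumPy-like pseudocode for the CLIP loss, which is not reproduced on this page. As a rough illustration only, the symmetric contrastive objective commonly associated with CLIP can be sketched as below. This is not the authors' code: the function names, the fixed `temperature` value (CLIP-style models usually learn this scale), and the random embeddings standing in for encoder outputs are all assumptions made for the example.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    Row i of image_emb is assumed to be paired with row i of text_emb.
    The fixed temperature is an assumption made for this sketch.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature  # shape: (batch, batch)

    idx = np.arange(logits.shape[0])
    # Image-to-text: each row should assign the highest score to its own caption.
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-image: each column should assign the highest score to its own image.
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return (loss_i2t + loss_t2i) / 2.0


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    images = rng.normal(size=(8, 32))  # stand-in for image encoder outputs
    texts = rng.normal(size=(8, 32))   # stand-in for text encoder outputs
    print(clip_loss(images, texts))
```

Matching pairs sit on the diagonal of the similarity matrix, so the loss is low when each image scores highest against its own caption and vice versa. The same similarity matrix is what a retrieval query such as those in Figures 2 to 5 would rank over.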