Table of Contents
Fetching ...

LARE: Latent Augmentation using Regional Embedding with Vision-Language Model

Kosuke Sakurai, Tatsuya Ishii, Ryotaro Shimizu, Linxin Song, Masayuki Goto

TL;DR

The Latent Augmentation using Regional Embedding (LARE), which embeds the image as a region in the unified embedding space learned by the VLM, achieves robust image classification for domains in and out using augmented image embeddings to fine-tune VLMs.

Abstract

In recent years, considerable research has been conducted on vision-language models that handle both image and text data; these models are being applied to diverse downstream tasks, such as "image-related chat," "image recognition by instruction," and "answering visual questions." Vision-language models (VLMs), such as Contrastive Language-Image Pre-training (CLIP), are also high-performance image classifiers that are being developed into domain adaptation methods that can utilize language information to extend into unseen domains. However, because these VLMs embed images as a single point in a unified embedding space, there is room for improvement in the classification accuracy. Therefore, in this study, we proposed the Latent Augmentation using Regional Embedding (LARE), which embeds the image as a region in the unified embedding space learned by the VLM. By sampling the augmented image embeddings from within this latent region, LARE enables data augmentation to various unseen domains, not just to specific unseen domains. LARE achieves robust image classification for domains in and out using augmented image embeddings to fine-tune VLMs. We demonstrate that LARE outperforms previous fine-tuning models in terms of image classification accuracy on three benchmarks. We also demonstrate that LARE is a more robust and general model that is valid under multiple conditions, such as unseen domains, small amounts of data, and imbalanced data.

LARE: Latent Augmentation using Regional Embedding with Vision-Language Model

TL;DR

The Latent Augmentation using Regional Embedding (LARE), which embeds the image as a region in the unified embedding space learned by the VLM, achieves robust image classification for domains in and out using augmented image embeddings to fine-tune VLMs.

Abstract

In recent years, considerable research has been conducted on vision-language models that handle both image and text data; these models are being applied to diverse downstream tasks, such as "image-related chat," "image recognition by instruction," and "answering visual questions." Vision-language models (VLMs), such as Contrastive Language-Image Pre-training (CLIP), are also high-performance image classifiers that are being developed into domain adaptation methods that can utilize language information to extend into unseen domains. However, because these VLMs embed images as a single point in a unified embedding space, there is room for improvement in the classification accuracy. Therefore, in this study, we proposed the Latent Augmentation using Regional Embedding (LARE), which embeds the image as a region in the unified embedding space learned by the VLM. By sampling the augmented image embeddings from within this latent region, LARE enables data augmentation to various unseen domains, not just to specific unseen domains. LARE achieves robust image classification for domains in and out using augmented image embeddings to fine-tune VLMs. We demonstrate that LARE outperforms previous fine-tuning models in terms of image classification accuracy on three benchmarks. We also demonstrate that LARE is a more robust and general model that is valid under multiple conditions, such as unseen domains, small amounts of data, and imbalanced data.
Paper Structure (14 sections, 5 equations, 4 figures, 5 tables)

This paper contains 14 sections, 5 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Overview of Latent Augmentation using Regional Embedding (LARE). LARE embeds the image as a region (box) in the vision-language embedding space instead of one embedding point like the classic vision-language model (VLM). A more robust image classification model can be constructed by fine-tuning the VLM, including the augmented embedding of various domains obtained from the latent region. Note that the augmented data on the right of the figure CUBCUB-Painting is a hallucinated image and is not actually generated by LARE.
  • Figure 2: Overview of CLIP, CoCa, and LADS
  • Figure 3: Overview of Stage 1 in LARE. The network in Stage 1 outputs a region (box) in the latent space based on the embeddings obtained from the encoder of the VLM. The latent region is described by two points in the vision-language embedding space. The new neural network $f_{Box}$ is trained based on four losses: one box volume loss and three class consistency losses. After training the region, augmented data for unseen domains is created by randomly sampling from within the region (box).
  • Figure 4: Few-shot accuracy on CIFAR-100