CLAMP: Contrastive LAnguage Model Prompt-tuning

Piotr Teterwak, Ximeng Sun, Bryan A. Plummer, Kate Saenko, Ser-Nam Lim

TL;DR

This paper proposes an approach for light fine-tuning of LLMs using the same contrastive image-caption matching objective as CLIP, and shows that LLMs can achieve good image classification performance when adapted this way.

Abstract

Large language models (LLMs) have emerged as powerful general-purpose interfaces for many machine learning problems. Recent work has adapted LLMs to generative visual tasks like image captioning, visual question answering, and visual chat, using a relatively small amount of instruction-tuning data. In this paper, we explore whether modern LLMs can also be adapted to classifying an image into a set of categories. First, we evaluate multimodal LLMs that are tuned for generative tasks on zero-shot image classification and find that their performance is far below that of specialized models like CLIP. We then propose an approach for light fine-tuning of LLMs using the same contrastive image-caption matching objective as CLIP. Our results show that LLMs can, indeed, achieve good image classification performance when adapted this way. Our approach beats state-of-the-art mLLMs by 13% and slightly outperforms contrastive learning with a custom text model, while also retaining the LLM's generative abilities. LLM initialization appears to particularly help classification in domains under-represented in the visual pre-training data.
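The contrastive image-caption matching objective mentioned above can be sketched as follows. This is a minimal NumPy illustration of a CLIP-style symmetric loss, not the authors' implementation: the function names, the fixed temperature of 0.07, and the use of precomputed embeddings are assumptions made for the example.

```python
import numpy as np

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric image-caption matching loss over a batch of paired embeddings.

    image_emb, text_emb: arrays of shape (batch, dim); row i of each is a pair.
    temperature: assumed fixed here; in practice it is often a learned scalar.
    """
    # L2-normalize so that dot products are cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Pairwise similarity matrix; entry (i, j) compares image i with caption j.
    logits = image_emb @ text_emb.T / temperature

    # The matching caption for image i sits on the diagonal.
    idx = np.arange(logits.shape[0])
    loss_image_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_text_to_image = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_image_to_text + loss_text_to_image)
```

In a setup like the one described, `text_emb` would come from the adapted LLM and `image_emb` from a vision encoder. The loss is low when each image is most similar to its own caption and high when the pairs are misaligned.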

Paper Structure (26 sections, 12 equations, 7 figures, 11 tables)

Figures (7)

  • Figure 1: State-of-the-art multimodal LLMs excel at generative visual tasks like answering questions that involve common sense, but underperform on standard image classification tasks like predicting the car type (generated using LLaVA1.5 liu2023visual). On a suite of 24 zero-shot image classification datasets they underperform SOTA zero-shot classification models like CLIP radford2021learning by 13%. In this paper, we present CLAMP, an approach to add classification abilities to a base LLM. This extends an LLM's visual reasoning ability to include visual discrimination, a fundamental computer vision task that true foundation models need to support. Combining prior mLLM adapter modules with CLAMP, LLMs are now able to generate text, answer visually-grounded questions, chat interactively, and do zero-shot object classification.
  • Figure 2: Adapting LLMs for image classification: a) Applying prior multimodal LLMs such as LLaVA liu2023visual and MiniGPT zhu2023minigpt to classification by computing the GPTScore lin2023visualgptscore yields poor accuracy; b) Our approach CLAMP achieves high accuracy by lightly fine-tuning the LLM with a contrastive image-caption objective.
  • Figure 3: Training CLAMP: a) The overall training loss of CLAMP. CLAMP is trained with a CLIP loss together with a distillation loss. b) An overview of trainable parameters. We combine Read-only Prompt Optimization, LoRA, and Attention Pooling.
  • Figure 4: Scaling training data. We confirm that data scale remains very important even with our strong language prior by subsampling our data and training. As data grows, so does zero-shot ImageNet accuracy.
  • Figure 5: Read-only prompts. The attention pattern we use. The learned prompts can attend to all positions in the sequence, while text tokens can attend only to preceding positions.
  • ...and 2 more figures
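
The read-only prompt attention pattern described in the Figure 5 caption can be sketched as a boolean mask. This is an illustration under stated assumptions, not the paper's code: the function name `read_only_prompt_mask` is hypothetical, the learned prompts are assumed to be appended after the text tokens, and each text token is assumed to attend to itself as in a standard causal mask.

```python
import numpy as np

def read_only_prompt_mask(num_text, num_prompts):
    """Build a (total, total) mask where mask[i, j] is True if query i may attend to key j.

    Assumed layout: [text tokens ..., learned prompt tokens ...].
    """
    total = num_text + num_prompts
    # Causal mask: each position sees itself and earlier positions only.
    mask = np.tril(np.ones((total, total), dtype=bool))
    # Learned prompts may read from every position in the sequence.
    mask[num_text:, :] = True
    return mask

# Example: 3 text tokens followed by 2 learned prompts.
mask = read_only_prompt_mask(3, 2)
```

Because the prompts sit after the text in this layout, the causal part of the mask already keeps text tokens from attending to them, so the text representations are unchanged by the prompts. That is the sense in which the prompts are "read-only".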