Few-Shot Class-Incremental Model Attribution Using Learnable Representation From CLIP-ViT Features

Hanbyul Lee, Juneho Yi

TL;DR

This work addresses the rapid emergence of unseen generative models by reframing model attribution (MA) as a few-shot class-incremental learning (FSCIL) problem. It introduces a learnable, multi-level CLIP-ViT representation built by an Adaptive Integration Module (AIM), which assigns per-image weights to the block features, and integrates this representation into a TEEN-based FSCIL framework. The approach is evaluated on 28 generators, with a base session of GANs and incremental sessions that add diffusion-based models, and attribution remains strong as generators evolve and across CLIP backbones. The results indicate that low-level information is crucial for MA and that combining information from all levels with AIM yields the best performance, enabling rapid adaptation to newly released generators with minimal data.

Abstract

Recently, images that distort or fabricate facts using generative models have become a social concern. To cope with the continuous evolution of generative artificial intelligence (AI) models, model attribution (MA) is necessary beyond mere detection of synthetic images. However, current deep learning-based MA methods must be trained from scratch with new data to recognize unseen models, which is time-consuming and data-intensive. This work proposes a new strategy to deal with persistently emerging generative models. We adapt few-shot class-incremental learning (FSCIL) mechanisms to the MA problem to uncover novel generative AI models. Unlike existing FSCIL approaches, which focus on object classification using high-level information, MA requires analyzing low-level details such as color and texture in synthetic images. Thus, we utilize a learnable representation built from different levels of CLIP-ViT features. To learn an effective representation, we propose the Adaptive Integration Module (AIM), which calculates a weighted sum of CLIP-ViT block features for each image, enhancing the ability to identify generative models. Extensive experiments show that our method extends effectively from prior generative models to recent ones.
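
The per-image weighted sum of block features described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how AIM computes its weights, so `block_weights` below is a hypothetical stand-in (a softmax across blocks applied to the raw feature values); the real AIM is a trainable module. The function names, the toy shapes, and the `score_scale` parameter are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def block_weights(block_feats, score_scale=1.0):
    """Hypothetical stand-in for AIM: one weight per (block, channel).

    ASSUMPTION: raw feature values are used as scores and normalized
    across blocks with a softmax. The actual AIM is learned.
    """
    n_blocks = len(block_feats)
    n_channels = len(block_feats[0])
    weights = [[0.0] * n_channels for _ in range(n_blocks)]
    for c in range(n_channels):
        col = softmax([score_scale * block_feats[b][c] for b in range(n_blocks)])
        for b in range(n_blocks):
            weights[b][c] = col[b]
    return weights


def integrate(block_feats, weights):
    """Channel-wise weighted sum of block features for a single image."""
    n_channels = len(block_feats[0])
    return [
        sum(w_row[c] * f_row[c] for w_row, f_row in zip(weights, block_feats))
        for c in range(n_channels)
    ]


# Toy example: 3 blocks, 4 channels (a real CLIP-ViT has more of both).
feats = [
    [0.2, 1.0, -0.5, 0.0],
    [0.1, 0.0, 0.5, 0.0],
    [0.9, -1.0, 0.0, 0.0],
]
w = block_weights(feats)
rep = integrate(feats, w)  # one integrated feature vector for this image
```

Because the weights depend on the input image, two images can draw on different blocks, which is what allows the representation to emphasize low-level or high-level information case by case.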

Paper Structure

This paper contains 16 sections, 8 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Frequencies at which each block is ranked as the most important block in MA using the features from the Transformer blocks of pre-trained CLIP-ViT. (a) CycleGAN, (b) DDPM, (c) DALL-E 2, (d) InfoMaxGAN, (e) Improved Diffusion, (f) Stable Diffusion 1.4. The frequencies are measured as follows. First, we calculate the weight of each block feature with AIM. Then, for each channel of the weights, we find the block with the largest value and the block with the second largest value. The block with the largest weight is counted; if the second largest value is greater than half of the largest value, the block with the second largest weight is counted as well. Since early blocks of CLIP-ViT extract low-level information and later blocks extract high-level information, these results indicate that the level containing the most important information varies with the generative model. Specifically, GAN-based models such as CycleGAN and InfoMaxGAN tend to have a high frequency of large weights in the middle blocks. In contrast, models based on more recent architectures, such as DDPM, Improved Diffusion, DALL-E 2, and Stable Diffusion 1.4, have higher frequencies for later blocks than for early blocks.
  • Figure 2: Overview of our proposed method. We utilize a learnable representation for attributing generative models. All the features from the blocks of the pre-trained CLIP image encoder are taken and integrated through a weighted sum, with the weights obtained from the Adaptive Integration Module (AIM). AIM is a trainable module that calculates appropriate weights for a given image.
  • Figure 3: Correct and incorrect cases in attributing images generated by DALL-E. (a) shows sample images and the frequencies of the CLIP-ViT blocks that are considered the most important for each image. The frequencies are obtained in the same way as in Figure 1. $\blacktriangle$ marks a correctly attributed DALL-E test image, and $\blacktriangledown$ marks a DALL-E test image incorrectly attributed to Guided Diffusion. $\star$ and $\star$ indicate an image in the DALL-E and Guided Diffusion support sets, respectively. (b) is the result of t-SNE performed on CLIP-ViT low-level features of the DALL-E support set, the Guided Diffusion support set, and the DALL-E test images. As noted above and shown in (a), attribution results are strongly affected by features from early blocks. Consequently, if the low-level representation of an image generated by a particular model is similar to those of other generative models, the image may be attributed incorrectly.
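
The counting rule described in the Figure 1 caption can be sketched as follows. This is an illustrative reading of the caption, not the authors' code: the function name `count_top_blocks`, the nested-list layout `weights[block][channel]`, and the toy values are assumptions, and the sketch assumes non-negative weights and at least two blocks. Ties are broken in favor of the lower block index.

```python
from collections import Counter


def count_top_blocks(weights, counts=None):
    """Accumulate per-block counts for one image.

    weights[b][c] is the AIM weight for block b and channel c
    (layout assumed for this sketch). For every channel, the block
    with the largest weight is counted; the runner-up is also counted
    when its weight exceeds half of the largest weight.
    """
    if counts is None:
        counts = Counter()
    n_blocks = len(weights)
    n_channels = len(weights[0])
    for c in range(n_channels):
        # Block indices sorted by weight in this channel, largest first.
        ranked = sorted(range(n_blocks), key=lambda b: weights[b][c], reverse=True)
        top, second = ranked[0], ranked[1]
        counts[top] += 1
        if weights[second][c] > 0.5 * weights[top][c]:
            counts[second] += 1
    return counts


# Toy weights for one image: 3 blocks, 3 channels.
toy_weights = [
    [0.70, 0.10, 0.40],
    [0.20, 0.50, 0.35],
    [0.10, 0.40, 0.25],
]
freq = count_top_blocks(toy_weights)
print(dict(freq))
```

Passing the same `Counter` back in through `counts` for every image of a generator accumulates the histogram that each panel of Figure 1 would plot.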