HyperDet: Generalizable Detection of Synthesized Images by Generating and Merging A Mixture of Hyper LoRAs

Huangsen Cao, Yongwei Wang, Yinfeng Liu, Sixian Zheng, Kangtao Lv, Zhimeng Zhang, Bo Zhang, Xin Ding, Fei Wu

TL;DR

This work introduces HyperDet, a generalizable detection framework that captures and integrates shared knowledge from a collection of functionally distinct, lightweight expert detectors. It also points to a new way of building generalizable, domain-specific fake image detectors on top of pretrained large vision models.

Abstract

The emergence of diverse generative vision models has recently enabled the synthesis of visually realistic images, underscoring the critical need to distinguish these generated images from real photos. Despite advances in this field, existing detection approaches often struggle to accurately identify synthesized images produced by different generative models. In this work, we introduce a novel and generalizable detection framework termed HyperDet, which captures and integrates shared knowledge from a collection of functionally distinct and lightweight expert detectors. HyperDet leverages a large pretrained vision model to extract general detection features while simultaneously capturing and enhancing task-specific features. To achieve this, HyperDet first divides SRM filters into five groups, based on their functionality and complexity, to efficiently capture varying levels of pixel artifacts. Then, HyperDet uses a hypernetwork to generate LoRA model weights with distinct embedding parameters. Finally, we merge the LoRA networks to form an efficient model ensemble. We also propose a novel objective function that balances pixel and semantic artifacts. Extensive experiments on the UnivFD and Fake2M datasets demonstrate the effectiveness of our approach, which achieves state-of-the-art performance. Moreover, our work paves a new way to establish generalizable domain-specific fake image detectors based on pretrained large vision models.
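
The generate-then-merge step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the single linear hypernetwork, the averaging merge rule, the toy dimensions, and every name (`make_hypernetwork`, `generate_lora`, `merge_loras`) are assumptions made for illustration. The layer and position embeddings are omitted, and only the number of experts (five, one per SRM filter group) comes from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes, chosen only for illustration (assumptions, not from the paper).
D_IN, D_OUT, RANK, EMB_DIM = 8, 8, 2, 4
NUM_EXPERTS = 5  # one expert per SRM filter group, as stated in the abstract


def make_hypernetwork(emb_dim, d_in, d_out, rank, rng):
    """Weights of a single linear hypernetwork (a simplifying assumption)."""
    n_params = rank * d_in + d_out * rank  # flattened sizes of A and B
    return rng.normal(scale=0.02, size=(n_params, emb_dim))


def generate_lora(hyper_w, embedding, d_in, d_out, rank):
    """Map one task embedding to a pair of low-rank LoRA factors (A, B)."""
    flat = hyper_w @ embedding
    a = flat[: rank * d_in].reshape(rank, d_in)
    b = flat[rank * d_in:].reshape(d_out, rank)
    return a, b


def merge_loras(base_w, loras, scale=1.0):
    """Fold the averaged low-rank updates B @ A into the frozen base weight.

    Plain averaging is an assumed merge rule; the source only says 'merge'.
    """
    delta = sum(b @ a for a, b in loras) / len(loras)
    return base_w + scale * delta


base_w = rng.normal(size=(D_OUT, D_IN))            # frozen pretrained weight
hyper_w = make_hypernetwork(EMB_DIM, D_IN, D_OUT, RANK, rng)
task_embeddings = rng.normal(size=(NUM_EXPERTS, EMB_DIM))

loras = [generate_lora(hyper_w, e, D_IN, D_OUT, RANK) for e in task_embeddings]
merged_w = merge_loras(base_w, loras)
```

Under this assumed averaging rule, a forward pass through `merged_w` equals the mean of the five expert forward passes, so the ensemble costs a single matrix multiply at inference time.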

Paper Structure (22 sections, 9 equations, 13 figures, 4 tables, 2 algorithms)

Figures (13)

  • Figure 1: Overview of the proposed HyperDet framework. For a given input image, we first generate different filtered views using various groups of filter modules. These filtered views are then used to produce the corresponding task embeddings. Subsequently, the different views, along with the original image, are fed into the ViT module of the CLIP model. Simultaneously, the task embeddings, layer embeddings, and position embeddings are fed to the Hyper LoRAs, which generate the corresponding LoRA weights for fine-tuning CLIP. Finally, the outputs of the different LoRA experts are merged into a single output feature that integrates the knowledge from each expert and is used for synthetic image detection.
  • Figure 2: Four filter matrices, each of size 5×5. Gray areas mark matrix elements with a value of zero; negative values correspond to the central data to be predicted, and positive values to the surrounding data used for prediction. The SRM filter derives residual features by subtracting the central data from the prediction formed from the surrounding data.
  • Figure 3: Frequency analysis of fake and real images. This figure presents a comparison of the feature maps generated by five models (BigGAN, StyleGAN, StarGAN, CycleGAN, CRN) before and after applying the SRM filter. The top row shows the original feature maps produced by each generative model, while the bottom row displays the corresponding SRM-processed feature maps. After SRM filtering, the edge high-frequency features are enhanced, revealing potential artifacts and inconsistencies, while the central low-frequency features are suppressed, reducing the semantic impact on detection.
  • Figure 4: Generalization results on the Fake2M dataset [lu2024seeing]. Radar charts present the detection accuracy, with each concentric circle representing a 20% step. Our method achieves the best performance on multiple datasets; on Midjourney, its results are slightly inferior.
  • Figure 5: Robustness evaluation results of accuracy. We conducted robustness evaluations on four detection methods under two post-processing conditions: (a) Gaussian blur and (b) JPEG compression, using the UnivFD dataset. The results indicate that our method (HyperDet) outperforms other networks across all post-processing scenarios involving Gaussian blur, while underperforming slightly compared to the UnivFD method in some cases of JPEG compression.
  • ...and 8 more figures
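
The residual filtering described in the Figure 2 caption can be sketched as follows. This is a hedged illustration, not the paper's filter bank: the kernel `SECOND_ORDER_H` is a generic second-order horizontal predictor embedded in a 5×5 matrix (zeros elsewhere, a negative center, positive neighbors), chosen here as an assumption, and `srm_residual` is a hypothetical helper name.

```python
import numpy as np

# Illustrative kernel (an assumption, not one of the figure's four matrices):
# the center pixel is predicted as the mean of its left and right neighbors,
# so the residual is left - 2 * center + right.
SECOND_ORDER_H = np.zeros((5, 5))
SECOND_ORDER_H[2, 1:4] = [1.0, -2.0, 1.0]


def srm_residual(image, kernel):
    """Valid-mode 2D correlation of a grayscale image with a residual kernel."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    residual = np.zeros((out_h, out_w))
    # Accumulate one shifted, weighted copy of the image per kernel entry.
    for i in range(kh):
        for j in range(kw):
            residual += kernel[i, j] * image[i:i + out_h, j:j + out_w]
    return residual


# A smooth horizontal ramp stands in for low-frequency image content.
smooth = np.tile(np.arange(8, dtype=float), (8, 1))

# The same ramp with one perturbed pixel stands in for a pixel-level artifact.
noisy = smooth.copy()
noisy[4, 4] += 1.0

print(np.abs(srm_residual(smooth, SECOND_ORDER_H)).max())
print(np.abs(srm_residual(noisy, SECOND_ORDER_H)).max())
```

The ramp is predicted perfectly from its neighbors, so its residual vanishes, while the single perturbed pixel survives the filter. This is the sense in which residual filtering suppresses smooth content and keeps high-frequency pixel artifacts.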