Boosting Few-Shot Detection with Large Language Models and Layout-to-Image Synthesis

Ahmed Abdullah, Nikolas Ebert, Oliver Wasenmüller

TL;DR

This work proposes a collaborative framework that pairs a Large Language Model (LLM) with a Layout-to-Image Synthesis (LIS) model to improve few-shot detection beyond state-of-the-art generative augmentation approaches. It also introduces a novel layout-aware CLIP score for sample ranking, which tightly couples generated layouts and images.

Abstract

Recent advancements in diffusion models have enabled a wide range of works exploiting their ability to generate high-volume, high-quality data for use in various downstream tasks. One subclass of such models, dubbed Layout-to-Image Synthesis (LIS), learns to generate images conditioned on a spatial layout (bounding boxes, masks, poses, etc.) and has shown a promising ability to generate realistic images, albeit with limited layout adherence. Moreover, the question of how to effectively transfer those models for scalable augmentation of few-shot detection data remains unanswered. Thus, we propose a collaborative framework employing a Large Language Model (LLM) and an LIS model for enhancing few-shot detection beyond state-of-the-art generative augmentation approaches. We leverage the LLM's reasoning ability to extrapolate the spatial prior of the annotation space by generating new bounding boxes given only a few example annotations. Additionally, we introduce our novel layout-aware CLIP score for sample ranking, enabling tight coupling between generated layouts and images. We observe significant improvements on COCO few-shot benchmarks: with our approach, a YOLOX-S baseline is boosted by more than 140%, 50%, and 35% in mAP on the COCO 5-, 10-, and 30-shot settings, respectively.
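To make the layout-generation step concrete, the following is a minimal sketch of how few-shot annotations might be turned into in-context examples and how an LLM response might be parsed back into boxes. The text format, the `[x, y, w, h]` box convention, and the helper names `format_layout`, `build_prompt`, and `parse_layout` are assumptions made for illustration; the prompt template actually used in the paper is not reproduced here, and the call to the LLM itself is left out.

```python
import random
import re

# Hypothetical text format for one object: "name: [x, y, w, h]".
BOX_PATTERN = re.compile(r"([A-Za-z ]+?):\s*\[(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\]")


def format_layout(boxes):
    """Render [(category, (x, y, w, h)), ...] as a one-line layout description."""
    caption = "An image of " + ", ".join(name for name, _ in boxes)
    objects = "; ".join(
        f"{name}: [{x}, {y}, {w}, {h}]" for name, (x, y, w, h) in boxes
    )
    return f"{caption}. Layout: {objects}"


def build_prompt(examples, query_boxes, rng=random):
    """Stack in-context examples, then open a caption for the LLM to complete.

    The query caption lists the categories of one example in random order,
    and the boxes are left for the model to generate.
    """
    names = [name for name, _ in query_boxes]
    rng.shuffle(names)
    lines = [format_layout(boxes) for boxes in examples]
    lines.append("An image of " + ", ".join(names) + ". Layout:")
    return "\n".join(lines)


def parse_layout(response):
    """Extract (category, (x, y, w, h)) pairs from the generated text."""
    return [
        (m.group(1).strip(), tuple(int(v) for v in m.groups()[1:]))
        for m in BOX_PATTERN.finditer(response)
    ]
```

In this sketch, the string returned by `build_prompt` would be sent to a pretrained LLM, and `parse_layout` would be applied to its completion; malformed objects are simply skipped by the regular expression.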

Paper Structure

This paper contains 21 sections, 5 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Overview of proposed framework. First, we employ a pretrained LLM to generate new layouts with the help of a prompt template and the available few-shot annotations. Next, we generate new images using a layout-to-image synthesis diffusion model, conditioned on the newly created layouts. Finally, we employ a layout-aware CLIP score (LACS) to rate the generated samples, and construct a generated set of images G with high layout-adherence by picking the highest scoring images from a given batch. After reformatting the generated layouts into detection annotations, the resulting image-annotation pairs are used to augment the few-shot detection data.
  • Figure 2: LLM-based spatial prior extrapolation. First, we embed a batch of few-shot annotations into a prompt template by formatting them as layout descriptions. The embedded layout descriptions serve as in-context examples (in green) to steer the text generation process. We then prompt the LLM to complete a caption of randomly reordered objects (brown) from one of the layout descriptions and obtain a response containing generated bounding boxes (red). Finally, we parse the response to obtain new layouts.
  • Figure 3: Overview of LACS. First, we create $\mathbf{n}$ masked images from a generated image, where $\mathbf{n}$ is the number of object categories in the image. Next, for each category, we perform zero-shot classification on both the generated image and the masked image and obtain a per-category layout-adherence score by subtracting the two classification scores. We average over all categories to arrive at the final sample score.
  • Figure 4: Samples generated with InstanceDiffusion (Wang et al., 2024) and their LACS values. Green boxes highlight the conditional layout, while red regions highlight out-of-layout hallucinations.
  • Figure 5: Quality vs. quantity analysis. We analyse the effect of picking the top n samples from a generated batch of eight images sorted by the layout-aware CLIP score (LACS).
  • ...and 2 more figures
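
The scoring and selection steps described in the captions of Figures 3 and 5 can be sketched as follows. This is an illustrative approximation, not the authors' implementation: the zero-shot classifier is passed in as a hypothetical callable `classify(image, category)` standing in for CLIP, images are plain lists of pixel rows, masking fills boxes with a constant, and the per-category score is assumed to be the score on the generated image minus the score on the masked image.

```python
def mask_boxes(image, boxes, fill=0.0):
    """Return a copy of `image` with every (x, y, w, h) box filled with `fill`."""
    masked = [row[:] for row in image]
    for x, y, w, h in boxes:
        for r in range(max(y, 0), min(y + h, len(masked))):
            for c in range(max(x, 0), min(x + w, len(masked[r]))):
                masked[r][c] = fill
    return masked


def lacs(image, layout, classify):
    """Layout-aware score for one generated image.

    layout:   [(category, (x, y, w, h)), ...]
    classify: hypothetical stand-in for zero-shot classification,
              mapping (image, category) to a score.
    """
    by_category = {}
    for name, box in layout:
        by_category.setdefault(name, []).append(box)
    if not by_category:
        return 0.0
    # Assumed direction: a large drop after masking suggests the object
    # really sits inside its boxes.
    diffs = [
        classify(image, name) - classify(mask_boxes(image, boxes), name)
        for name, boxes in by_category.items()
    ]
    return sum(diffs) / len(diffs)


def select_top_n(samples, classify, n):
    """Keep the n highest-scoring (image, layout) pairs from a generated batch."""
    ranked = sorted(samples, key=lambda s: lacs(s[0], s[1], classify), reverse=True)
    return ranked[:n]
```

With a batch of eight generated images per layout, `select_top_n(batch, classify, n)` corresponds to the quality vs. quantity trade-off of Figure 5: a smaller `n` keeps only the samples with the strongest layout adherence, while a larger `n` yields more training data.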