LLaFS: When Large Language Models Meet Few-Shot Segmentation

Lanyun Zhu, Tianrun Chen, Deyi Ji, Jieping Ye, Jun Liu

TL;DR

LLaFS introduces a novel framework that leverages large language models to perform few-shot segmentation by translating image understanding into a text-driven polygon prediction task. It couples a segmentation-task instruction with a fine-grained in-context instruction, enabling the LLM to propose a 16-point polygon that delineates target objects, followed by a lightweight refinement network to yield precise masks. The approach is trained with pseudo-sample curriculum pretraining, using progressively harder synthetic data to augment limited labeled samples. Across PASCAL-5^i and COCO-20^i, LLaFS achieves state-of-the-art results, underscoring the potential of LLMs for cross-modal few-shot vision tasks and suggesting a path toward multi-domain, LLM-enabled perception systems.
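As a rough illustration of the polygon step, the sketch below parses a 16-point polygon from text and rasterizes it into a coarse binary mask using the even-odd rule. The output format `(x,y),(x,y),...`, the function names, and the pixel-center sampling are assumptions made for illustration; the summary does not specify how LLaFS formats or rasterizes its polygons, and the refinement network is not modeled here.

```python
import re
from typing import List, Tuple

Point = Tuple[float, float]


def parse_polygon(text: str) -> List[Point]:
    """Extract (x, y) pairs from text such as "(1,1),(3,1),(3,3)".

    The textual format is hypothetical, not the one used by LLaFS.
    """
    pairs = re.findall(r"\((\d+(?:\.\d+)?)\s*,\s*(\d+(?:\.\d+)?)\)", text)
    return [(float(x), float(y)) for x, y in pairs]


def point_in_polygon(x: float, y: float, polygon: List[Point]) -> bool:
    """Even-odd ray casting: count edge crossings to the right of (x, y)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can cross it.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def polygon_to_mask(polygon: List[Point], height: int, width: int) -> List[List[int]]:
    """Rasterize a polygon into a binary mask, sampling each pixel at its center."""
    return [
        [1 if point_in_polygon(col + 0.5, row + 0.5, polygon) else 0
         for col in range(width)]
        for row in range(height)
    ]
```

For example, `polygon_to_mask(parse_polygon("(1,1),(3,1),(3,3),(1,3)"), 4, 4)` marks only the four central pixels of a 4x4 grid. A mask of this kind would then be the coarse input that a refinement stage sharpens.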

Abstract

This paper proposes LLaFS, the first attempt to leverage large language models (LLMs) in few-shot segmentation. In contrast to conventional few-shot segmentation methods, which rely only on the limited and biased information from the annotated support images, LLaFS leverages the vast prior knowledge of LLMs as an effective supplement and directly uses the LLM to segment images in a few-shot manner. To enable the text-based LLM to handle image-related tasks, we carefully design an input instruction that allows the LLM to produce segmentation results represented as polygons, and propose a region-attribute table to simulate the human visual mechanism and provide multi-modal guidance. We also synthesize pseudo samples and use curriculum learning for pretraining to augment data and achieve better optimization. LLaFS achieves state-of-the-art results on multiple datasets, showing the potential of using LLMs for few-shot computer vision tasks.

Paper Structure

This paper contains 30 sections, 4 equations, 13 figures, and 13 tables.

Figures (13)

  • Figure 1: Overview of LLaFS. The image encoder and Q-former extract image features and generate a set of visual tokens. Subsequently, a segmentation task instruction and a fine-grained in-context instruction are introduced to provide detailed and comprehensive information. These two instructions are integrated and fed into the LLM to produce the vertex coordinates of a polygon that encloses the target object. The segmentation mask represented by this polygon is processed by a refinement network to get the final result.
  • Figure 2: Illustration of how to construct the region-attribute corresponding table used in the fine-grained in-context instruction.
  • Figure 3: Examples of using ChatGPT for (a) class attributes generation, (b) ambiguity detection and (c) discriminative attributes generation.
  • Figure 4: Examples of pseudo samples generated at different pretraining stages. Foreground regions are marked by white contours. As pretraining progresses, pseudo images have reduced intra-image foreground-background differences and greater support-query foreground differences. Meanwhile, the number of polygon vertex coordinates provided in the instruction decreases, while the predicted vertex count increases. These changes gradually increase the pretraining difficulty. (Best viewed in color)
  • Figure 5: Pretraining (a) and training (b) loss curves in different settings. Curriculum pretraining results in the best convergence in both pretraining and training stages. (Best viewed in color)
  • ...and 8 more figures
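
The difficulty schedule described in the Figure 4 caption could be sketched roughly as follows. This is a minimal sketch under stated assumptions: the linear interpolation, the numeric ranges, the `max_given` default, and the rule that predicted vertices equal 16 minus the given ones are all hypothetical and are not taken from the paper. Only the direction of each trend follows the caption.

```python
from dataclasses import dataclass

TOTAL_VERTICES = 16  # polygon size quoted in the TL;DR


@dataclass(frozen=True)
class StageConfig:
    fg_bg_difference: float          # intra-image foreground-background difference
    support_query_difference: float  # support-query foreground difference
    vertices_given: int              # vertex coordinates provided in the instruction
    vertices_to_predict: int         # vertex coordinates the model must predict


def lerp(start: float, end: float, t: float) -> float:
    """Linear interpolation between start and end for t in [0, 1]."""
    return start + (end - start) * t


def stage_config(progress: float, max_given: int = 12) -> StageConfig:
    """Map pretraining progress in [0, 1] to a hypothetical difficulty setting.

    Difficulty rises with progress: foreground and background become more
    alike, support and query foregrounds become less alike, and fewer
    vertices are given as hints.
    """
    t = min(max(progress, 0.0), 1.0)  # clamp out-of-range progress values
    given = round(lerp(max_given, 0, t))
    return StageConfig(
        fg_bg_difference=lerp(1.0, 0.2, t),          # assumed range
        support_query_difference=lerp(0.0, 0.8, t),  # assumed range
        vertices_given=given,
        vertices_to_predict=TOTAL_VERTICES - given,  # assumed complement rule
    )
```

Under these assumptions, `stage_config(0.0)` gives 12 vertices as hints and asks for 4, while `stage_config(1.0)` gives none and asks for all 16.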