LLM as Dataset Analyst: Subpopulation Structure Discovery with Large Language Model

Yulin Luo, Ruichuan An, Bocheng Zou, Yiming Tang, Jiaming Liu, Shanghang Zhang

TL;DR

Subpopulation distributions hidden in datasets are crucial for robust ML but remain understudied. The paper introduces SSD-LLM, a pipeline that uses an MLLM to generate informative image captions and an LLM to autonomously discover and refine a four-layer subpopulation structure (class–dimension–attribute–subpopulation), enabling automated downstream tasks. The method proceeds through caption extraction, Criteria Initialization, Criteria Refinement, and Subpopulation Assignment, followed by Task-specific Tuning for Dataset Subpopulation Organization, Subpopulation Shift, and Slice Discovery. Empirical results show SSD-LLM achieving competitive or superior performance across these tasks, including improvements in worst-group accuracy and slice-topic consistency, with ablations highlighting the impact of refinement and prompt design. This work provides a principled, automated framework for interpretable, data-centric analysis of subpopulations and paves the way for fairer, more balanced dataset construction and evaluation in multimodal settings.
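Worst-group accuracy, mentioned above, is the accuracy on the subpopulation where the model performs worst. The following is a minimal sketch of that metric, not the paper's evaluation code; the function name `worst_group_accuracy`, the group labels, and the toy predictions are all illustrative assumptions.

```python
from collections import defaultdict

def worst_group_accuracy(y_true, y_pred, groups):
    """Return (worst accuracy, per-group accuracy) over subpopulation groups.

    `groups` holds one subpopulation label per sample (hypothetical labels here).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    per_group = {g: correct[g] / total[g] for g in total}
    return min(per_group.values()), per_group

# Toy, made-up data: the model is weaker on dogs photographed in water.
y_true = ["dog", "dog", "dog", "dog", "cat", "cat"]
y_pred = ["dog", "dog", "dog", "cat", "cat", "cat"]
groups = ["dog/grass", "dog/grass", "dog/water", "dog/water",
          "cat/indoor", "cat/indoor"]

worst, per_group = worst_group_accuracy(y_true, y_pred, groups)
print(worst)      # accuracy of the weakest subpopulation
print(per_group)  # accuracy of every subpopulation
```

Average accuracy on this toy data looks high because most groups are easy; the per-group breakdown exposes the weak subpopulation that the average hides.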

Abstract

The distribution of subpopulations is an important property hidden within a dataset. Uncovering and analyzing the subpopulation distribution within datasets provides a comprehensive understanding of the datasets, standing as a powerful tool beneficial to various downstream tasks, including Dataset Subpopulation Organization, Subpopulation Shift, and Slice Discovery. Despite its importance, to our knowledge no prior work has systematically explored the subpopulation distribution of datasets. To address this limitation and solve all the mentioned tasks in a unified way, we introduce a novel concept of subpopulation structures to represent, analyze, and utilize subpopulation distributions within datasets. To characterize the structures in an interpretable manner, we propose the Subpopulation Structure Discovery with Large Language Models (SSD-LLM) framework, which employs world knowledge and instruction-following capabilities of Large Language Models (LLMs) to linguistically analyze informative image captions and summarize the structures. Furthermore, we propose complete workflows to address downstream tasks, named Task-specific Tuning, showcasing the application of the discovered structure to a spectrum of subpopulation-related tasks, including dataset subpopulation organization, subpopulation shift, and slice discovery.

Paper Structure

This paper contains 19 sections, 4 figures, 5 tables, and 3 algorithms.

Figures (4)

  • Figure 1: (A) The workflow of Subpopulation Structure Discovery with Large Language Models (SSD-LLM). SSD-LLM can further support several downstream tasks, including: (B) Dataset Subpopulation Organization; (C) Subpopulation Shift; (D) Slice Discovery.
  • Figure 2: MetaShift places the attributes Surfboard, Water, and Grass at the same level for class Dog, which is unreasonable because they can overlap. As an improvement, we take dimensions into consideration. The class Dog has dimensions including Action, Co-occurrence Object, Location, etc., and the dimension Location includes various attributes such as Water, Grass, etc., which offers a more appropriate assignment for the samples.
  • Figure 3: Subpopulation Structure Discovery with Large Language Model (SSD-LLM). (Step 1) A Multimodal Large Language Model (MLLM) extracts informative captions from images. (Step 2) The LLM initializes the criteria with a sample-based generate-and-select paradigm. (Step 3) The LLM refines the criteria using self-consistency as an indicator. (Step 4) The LLM assigns specific attributes to each caption according to the refined criteria, uncovering the intrinsic subpopulation structures hidden in the dataset. The resulting criteria and subpopulations are used in several downstream tasks.
  • Figure 4: A visualization of organised subpopulations in a dataset of cats.
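
The dimension-scoped layout described in the Figure 2 caption can be sketched as a small data structure. This is an illustrative sketch only, not the authors' implementation: the `ClassStructure` type and `assign` function are hypothetical names, and plain keyword matching stands in for the LLM-based assignment of Step 4. The dimension and attribute names are taken from the Figure 2 caption.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ClassStructure:
    """One class with its dimensions, each holding a list of attributes."""
    name: str
    dimensions: Dict[str, List[str]] = field(default_factory=dict)

def assign(structure: ClassStructure, caption: str) -> Dict[str, str]:
    """Pick at most one attribute per dimension for a caption.

    Keyword matching is a stand-in (assumption) for the LLM call.
    """
    text = caption.lower()
    assignment = {}
    for dimension, attributes in structure.dimensions.items():
        for attribute in attributes:
            if attribute.lower() in text:
                assignment[dimension] = attribute
                break  # attributes within one dimension are treated as exclusive
    return assignment

dog = ClassStructure(
    name="Dog",
    dimensions={
        "Location": ["Water", "Grass"],
        "Co-occurrence Object": ["Surfboard"],
    },
)

result = assign(dog, "A dog riding a surfboard on the water")
print(result)
```

Because Surfboard and Water live in different dimensions, one sample can carry both without the overlap that arises when all attributes sit at the same level.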