Representative Language Generation
Charlotte Peale, Vinod Raman, Omer Reingold
TL;DR
The paper introduces representative generation, a framework that extends language generation in the limit so that a generator's outputs reflect the group proportions of the training data, mitigating bias and mode collapse. It formalizes this group-structured objective as α-representativeness and introduces the group closure dimension GC_α(H, 𝔄) to characterize when representative uniform generation is feasible, proving tight necessary and sufficient conditions for finite GC_α. It also extends the results to representative non-uniform generation and to representative generation in the limit, giving information-theoretic guarantees for countable hypothesis classes alongside a negative computability result: methods restricted to membership queries cannot achieve representative generation in the limit in full generality. The work connects to fairness notions and to notions of breadth, and outlines future directions such as richer, dynamic group collections and other distance measures, providing a rigorous foundation for designing more diverse and representative generative models.
Abstract
We introduce "representative generation," extending the theoretical framework for generation proposed by Kleinberg et al. (2024) and formalized by Li et al. (2024), to additionally address diversity and bias concerns in generative models. Our notion requires outputs of a generative model to proportionally represent groups of interest from the training data. We characterize representative uniform and non-uniform generation, introducing the "group closure dimension" as a key combinatorial quantity. For representative generation in the limit, we analyze both information-theoretic and computational aspects, demonstrating feasibility for countably infinite hypothesis classes and collections of groups under certain conditions, but proving a negative result for computability using only membership queries. This contrasts with Kleinberg et al.'s (2024) positive results for standard generation in the limit. Our findings provide a rigorous foundation for developing more diverse and representative generative models.
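To make the proportional-representation requirement concrete, the following is a minimal sketch rather than the paper's formal definition. It assumes that representativeness is checked by comparing, for each group, the fraction of training examples and the fraction of generated outputs that fall in that group, and that the largest such gap must be at most a tolerance α. The helper names `group_proportions` and `is_alpha_representative`, the maximum-gap distance, and the toy groups are all illustrative assumptions.

```python
from typing import Callable, Dict, Sequence

# A group is modeled (as an assumption) by a membership predicate over strings.
Group = Callable[[str], bool]


def group_proportions(items: Sequence[str], groups: Dict[str, Group]) -> Dict[str, float]:
    """Return, for each group, the fraction of items that belong to it."""
    if not items:
        raise ValueError("need at least one item to compute proportions")
    return {
        name: sum(1 for x in items if member(x)) / len(items)
        for name, member in groups.items()
    }


def is_alpha_representative(
    training: Sequence[str],
    outputs: Sequence[str],
    groups: Dict[str, Group],
    alpha: float,
) -> bool:
    """Check that every group's share of the outputs is within alpha of its
    share of the training data (hypothetical max-gap criterion)."""
    p_train = group_proportions(training, groups)
    p_out = group_proportions(outputs, groups)
    # Largest per-group deviation; 0.0 when no groups are specified.
    gap = max((abs(p_train[g] - p_out[g]) for g in groups), default=0.0)
    return gap <= alpha


# Toy groups: strings starting with "a" and strings starting with "b".
groups = {
    "starts_a": lambda s: s.startswith("a"),
    "starts_b": lambda s: s.startswith("b"),
}
training = ["apple", "avocado", "banana", "blueberry"]
balanced = ["apricot", "almond", "blackberry", "beet"]
collapsed = ["apricot", "almond", "anise", "acai"]  # mode collapse onto one group

balanced_ok = is_alpha_representative(training, balanced, groups, alpha=0.1)
collapsed_ok = is_alpha_representative(training, collapsed, groups, alpha=0.1)
```

In this toy setting the balanced outputs match the training proportions exactly, while the collapsed outputs put every string in one group and so fail the check for any α below 0.5.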
