No Single Best Model for Diversity: Learning a Router for Sample Diversity

Yuhan Liu, Fangyuan Xu, Vishakh Padmakumar, Daphne Ippolito, Eunsol Choi

Abstract

When posed with prompts that permit a large number of valid answers, generating them comprehensively is the first step towards satisfying a wide range of users. In this paper, we study methods to elicit a comprehensive set of valid responses. To evaluate this, we introduce diversity coverage, a metric that measures the total quality score of the unique answers in a predicted answer set, relative to the best possible answer set of the same size. Using this metric, we evaluate 18 LLMs and find that no single model dominates at generating diverse responses across a wide range of open-ended prompts. Yet, for each prompt, there exists a model that significantly outperforms all others at generating a diverse answer set. Motivated by this finding, we introduce a router that predicts the best model for each query. On NB-WildChat, our trained router outperforms the single-best-model baseline (26.3% vs. 23.8%). We further show generalization to an out-of-domain dataset (NB-Curated) as well as to different answer-generation prompting strategies. Our work lays a foundation for studying how to generate comprehensive answers when we have access to a suite of models.
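
Read literally, the definition above admits the following formalization; this is one plausible reading rather than the paper's exact equation, and it assumes $q(a)$ denotes the quality score assigned to answer $a$:

$$\text{Div-Cov}(\hat{A}) = \frac{\sum_{a \in \mathrm{unique}(\hat{A})} q(a)}{\max_{A^{*} :\, |A^{*}| = |\hat{A}|} \sum_{a \in A^{*}} q(a)}$$

where $\hat{A}$ is the predicted answer set and the denominator is the total quality of the best possible answer set of the same size. Duplicates in $\hat{A}$ contribute nothing to the numerator, so a model is rewarded only for distinct valid answers.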

Paper Structure

This paper contains 55 sections, 1 equation, 20 figures, 16 tables, and 1 algorithm.

Figures (20)

  • Figure 1: Left: LLMs exhibit different diversity coverage. Right: There is no universal best model on NB-WildChat. A model is considered the best model only if its diversity score is $5\%$ higher than that of the second-best candidate. Queries without a model satisfying this margin are labeled as "No dominant single models". On Simple Questions, all models perform similarly, resulting in $100\%$ "No dominant single models". On NB-WildChat, no model consistently dominates across queries.
  • Figure 2: Scaling training data improves router performance on Infinity-Chat.
  • Figure 3: Efficiency analysis comparing the time (seconds per query) of routing (Router), Top overall (Top), and Top model per query (Oracle). We include routing to one model per query and routing to two models per query. Sample$_{n}$ denotes sampling $n$ answers. Oracle incurs the highest cost, as it requires exhaustively comparing all candidate models or model pairs.
  • Figure 4: Div-Cov (%) results on NB-WildChat with various prompting strategies. The router is trained under each prompting strategy and evaluated both in-domain and out-of-domain.
  • Figure 5: The generate-one prompt yields higher answer quality. Under the generate-all prompt, answer quality decreases, with large variation, as more answers are listed.
  • ...and 15 more figures
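
To make the routing setup concrete, here is a minimal sketch of a learned per-query router in the spirit of the abstract and Figure 3. All names (`RouterMLP`, `route`, `CANDIDATE_MODELS`) are hypothetical illustrations, not the paper's implementation; the sketch assumes query embeddings from a frozen encoder are already computed, and a small classification head trained to predict which candidate model achieves the highest diversity coverage.

```python
# Hypothetical sketch of a per-query model router (not the paper's code).
# A small MLP head scores each candidate LLM given a query embedding; the
# head is trained to predict which model yields the highest Div-Cov.

import torch
import torch.nn as nn

CANDIDATE_MODELS = ["model-a", "model-b", "model-c"]  # placeholder names


class RouterMLP(nn.Module):
    """Maps a query embedding to one score (logit) per candidate model."""

    def __init__(self, embed_dim: int, num_models: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_models),
        )

    def forward(self, query_emb: torch.Tensor) -> torch.Tensor:
        return self.net(query_emb)  # shape: (batch, num_models)


def route(router: RouterMLP, query_emb: torch.Tensor, top_k: int = 1) -> list[str]:
    """Pick the top-k candidate models for a single query embedding."""
    with torch.no_grad():
        logits = router(query_emb.unsqueeze(0)).squeeze(0)
    idx = torch.topk(logits, k=top_k).indices.tolist()
    return [CANDIDATE_MODELS[i] for i in idx]


# Training signal (sketch): for each training query, the label is the model
# with the highest measured diversity coverage, optimized by cross-entropy:
#   loss = nn.functional.cross_entropy(router(embs), best_model_labels)
```

Setting `top_k=2` corresponds to the "routing to two models per query" variant in Figure 3. The training labels would come from exhaustively measuring each candidate model's diversity coverage on training queries, which is exactly the per-query comparison cost the Oracle baseline pays at test time; the trained router amortizes that cost into a single forward pass.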