FRACTAL: Fine-Grained Scoring from Aggregate Text Labels

Yukti Makhija, Priyanka Agrawal, Rishi Saket, Aravindan Raghuveer

TL;DR

FRACTAL derives fine-grained, sentence-level scores from coarser response-level labels for complex text generation tasks. It combines multiple instance learning (MIL) and learning from label proportions (LLP) with priors computed from sentence-document similarities and sentence correlations, and adds a max-likelihood pseudo-labeling scheme that uses model predictions to refine the sentence-level labels. Evaluated on six datasets across four tasks (retrieval, question answering, summarization, and math reasoning), it improves sentence-level scoring over bag-loss baselines on most tasks, and in an end-to-end fine-tuning evaluation it performs comparably to a model trained on fine-grained human-annotated labels. The work suggests that task-specific priors and pseudo-labeling can turn aggregate feedback into usable sentence-level signals for fine-tuning and RLHF-style training at reduced annotation cost.

Abstract

Large language models (LLMs) are increasingly being tuned to power complex generation tasks such as writing, fact-seeking, querying and reasoning. Traditionally, human or model feedback for evaluating and further tuning LLM performance has been provided at the response level, enabling faster and more cost-effective assessments. However, recent work (Amplayo et al. [2022], Wu et al. [2023]) indicates that sentence-level labels may provide more accurate and interpretable feedback for LLM optimization. In this work, we introduce methods to disaggregate response-level labels into sentence-level (pseudo-)labels. Our approach leverages multiple instance learning (MIL) and learning from label proportions (LLP) techniques in conjunction with prior information (e.g., document-sentence cosine similarity) to train a specialized model for sentence-level scoring. We also employ techniques that use model predictions to pseudo-label the training set at the sentence level, which further improves performance. We conduct extensive evaluations of our methods across six datasets and four tasks: retrieval, question answering, summarization, and math reasoning. Our results demonstrate improved performance compared to multiple baselines across most of these tasks. Our work is the first to develop techniques that convert response-level feedback into sentence-level scores by leveraging sentence-level prior information. It also provides comprehensive evaluations on multiple tasks, as well as an end-to-end fine-tuning evaluation showing performance comparable to a model trained on fine-grained human-annotated labels.

Paper Structure (26 sections, 6 equations, 1 figure, 16 tables)

Figures (1)

  • Figure 1: Overview of our proposed method, FRACTAL. Input is a set of responses, each with a response label. A response is a bag of sentences. The output is a model that can predict the score for each sentence in a response. The semantic meaning of a score depends on how the response label was defined. FRACTAL consists of three key components: (a) Loss Function Design (Section \ref{sec:bagloss}), (b) Differentiable Approximations of Aggregation Functions (Section \ref{sec:min_approx_main}), and (c) Max-Likelihood Pseudolabeling (Section \ref{sec:pseudolab}).
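
The first two components in the caption can be illustrated with a minimal sketch. This is not the paper's loss, which is not given in this excerpt. It is a generic bag-level loss under stated assumptions: sentence scores lie in [0, 1], the response label is the minimum of its sentence labels (one possible aggregation), the minimum is replaced by a softmax-weighted average so that it stays differentiable, and a prior term pulls each sentence score toward a precomputed prior such as a document-sentence cosine similarity. The names `soft_min`, `bce`, `bag_loss_with_prior`, `temperature`, and `prior_weight` are hypothetical.

```python
import math


def soft_min(scores, temperature=0.1):
    """Smooth stand-in for min(scores): a softmax-weighted average.

    Smaller scores get exponentially larger weights, so the result lies
    between min(scores) and the mean. Assumed form, not the paper's.
    """
    shift = min(scores)  # subtract the minimum for numerical stability
    weights = [math.exp(-(s - shift) / temperature) for s in scores]
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, scores)) / total


def bce(target, pred, eps=1e-7):
    """Binary cross-entropy between a (possibly soft) target and a prediction."""
    pred = min(max(pred, eps), 1.0 - eps)
    return -(target * math.log(pred) + (1.0 - target) * math.log(1.0 - pred))


def bag_loss_with_prior(sentence_scores, response_label, priors, prior_weight=0.5):
    """Bag-level loss plus a sentence-level prior term (illustrative only)."""
    # Bag term: the aggregated sentence scores should match the response label.
    bag_term = bce(response_label, soft_min(sentence_scores))
    # Prior term: each sentence score should stay close to its prior.
    prior_term = sum(bce(p, s) for p, s in zip(priors, sentence_scores))
    prior_term /= len(sentence_scores)
    return bag_term + prior_weight * prior_term


# A response labeled 0 whose second sentence has a low prior.
priors = [0.9, 0.1, 0.8]
loss_consistent = bag_loss_with_prior([0.9, 0.1, 0.8], 0.0, priors)
loss_inconsistent = bag_loss_with_prior([0.9, 0.9, 0.9], 0.0, priors)
print(loss_consistent < loss_inconsistent)
```

In this sketch, scores that single out the low-prior sentence give a smaller loss than scores that rate every sentence highly, which is the behavior a response-level label of 0 should encourage under min aggregation. Other aggregation functions (for example, the mean for proportion-style labels) would replace `soft_min`.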