ConCISE: A Reference-Free Conciseness Evaluation Metric for LLM-Generated Answers

Seyed Mohssen Ghafari, Ronny Kol, Juan C. Quiroz, Nella Luan, Monika Patial, Chanaka Rupasinghe, Herman Wandabwa, Luiz Pizzato

TL;DR

ConCISE introduces a reference-free conciseness metric for LLM-generated answers by averaging three content-reduction signals: abstractive summary compression, extractive summary compression, and word-removal compression, each obtained via LLMs. The method is evaluated on WikiEval using verbose rewrites and human judgments, showing meaningful alignment with human conciseness ratings ($r_s = 0.628$, $\tau = 0.523$, $p<0.001$) and high pairwise agreement (~94%) with human preferences, outperforming a naive GPT Score baseline. The approach provides a practical, low-cost tool for automatic conciseness evaluation without ground-truth references, though it acknowledges domain-dependence and highlights future improvements such as domain adaptation and reduced cross-technique bias. Overall, ConCISE offers a scalable solution for monitoring and improving the brevity of LLM responses in real-world conversational AI systems.

Abstract

Large language models (LLMs) frequently generate responses that are lengthy and verbose, filled with redundant or unnecessary details. This diminishes clarity and user satisfaction, and it increases costs for model developers, especially with well-known proprietary models that charge based on the number of output tokens. In this paper, we introduce a novel reference-free metric for evaluating the conciseness of responses generated by LLMs. Our method quantifies non-essential content without relying on gold-standard references by averaging three measures: i) a compression ratio between the original response and an LLM abstractive summary; ii) a compression ratio between the original response and an LLM extractive summary; and iii) word-removal compression, where an LLM removes as many non-essential words as possible from the response while preserving its meaning, with the number of tokens removed indicating the conciseness score. Experimental results demonstrate that our proposed metric identifies redundancy in LLM outputs, offering a practical tool for automated evaluation of response brevity in conversational AI systems without the need for ground-truth human annotations.
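
The averaging step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact formula, so the sketch assumes each signal is the fraction of tokens retained after compression, that tokens are whitespace-separated words, and that the three compressed texts have already been produced by an LLM. The function names `retained_fraction` and `conciseness_sketch` are hypothetical.

```python
from statistics import mean


def token_count(text: str) -> int:
    # Assumption: whitespace tokenization stands in for the real tokenizer.
    return len(text.split())


def retained_fraction(original: str, compressed: str) -> float:
    # Fraction of the original's tokens that survive compression, capped at 1.0.
    # Assumption: this is one plausible reading of "compression ratio".
    n_orig = token_count(original)
    if n_orig == 0:
        return 1.0  # nothing to remove from an empty response
    return min(1.0, token_count(compressed) / n_orig)


def conciseness_sketch(original: str, abstractive: str,
                       extractive: str, pruned: str) -> float:
    # Average the three signals; under this reading, a higher value means
    # less removable content, i.e. a more concise original response.
    signals = [retained_fraction(original, c)
               for c in (abstractive, extractive, pruned)]
    return mean(signals)
```

Under these assumptions, a response that an LLM can shrink substantially in all three ways receives a low score, while a response that is already tight scores close to 1.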

Paper Structure

This paper contains 14 sections, 2 equations, 1 figure, 2 tables.

Figures (1)

  • Figure 1: ConCISE Architecture