MOSAIC: Modular Opinion Summarization using Aspect Identification and Clustering

Piyush Kumar Singh, Jayesh Choudhari

Abstract

Reviews are central to how travelers evaluate products on online marketplaces, yet existing summarization research often emphasizes end-to-end quality while overlooking benchmark reliability and the practical utility of granular insights. To address this, we propose MOSAIC, a scalable, modular framework designed for industrial deployment that decomposes summarization into interpretable components, including theme discovery, structured opinion extraction, and grounded summary generation. We validate the practical impact of our approach through online A/B tests on live product pages, showing that surfacing intermediate outputs improves customer experience and delivers measurable value even prior to full summarization deployment. We further conduct extensive offline experiments to demonstrate that MOSAIC achieves superior aspect coverage and faithfulness compared to strong baselines for summarization. Crucially, we introduce opinion clustering as a system-level component and show that it significantly enhances faithfulness, particularly under the noisy and redundant conditions typical of user reviews. Finally, we identify reliability limitations in the standard SPACE dataset and release a new open-source tour experience dataset (TRECS) to enable more robust evaluation.

Paper Structure (35 sections, 1 equation, 3 figures, 10 tables)

Figures (3)

  • Figure 4: Preference of three LLMs (GPT-4o-mini, GPT-4.1-mini, and Llama-3.1-70B) when selecting between summaries generated from redundant opinions and those generated from deduplicated opinions, based on aspect coverage across 10 datasets. For each dataset (X-axis), bars represent the percentage of model responses favoring the non-redundant summary (blue), the redundant summary (red), or indicating equal coverage (green).
  • Figure 5: Preference of three LLMs (GPT-4o-mini, GPT-4.1-mini, and Llama-3.1-70B) when selecting between summaries generated from redundant opinions and those generated from deduplicated opinions, based on aspect faithfulness across 10 datasets. For each dataset (X-axis), bars represent the percentage of model responses favoring the non-redundant summary (blue), the redundant summary (red), or indicating equal faithfulness (green).
  • Figure 6: Distribution of sentiment scores for theme-level summaries and of the number of themes identified in product summaries for SPACE and MOSAIC, both evaluated using Llama-3.1-70B.