MOSAIC: Modular Opinion Summarization using Aspect Identification and Clustering

Piyush Kumar Singh; Jayesh Choudhari

Abstract

Reviews are central to how travelers evaluate products on online marketplaces, yet existing summarization research often emphasizes end-to-end quality while overlooking benchmark reliability and the practical utility of granular insights. To address this, we propose MOSAIC, a scalable, modular framework designed for industrial deployment that decomposes summarization into interpretable components, including theme discovery, structured opinion extraction, and grounded summary generation. We validate the practical impact of our approach through online A/B tests on live product pages, showing that surfacing intermediate outputs improves customer experience and delivers measurable value even prior to full summarization deployment. We further conduct extensive offline experiments to demonstrate that MOSAIC achieves superior aspect coverage and faithfulness compared to strong baselines for summarization. Crucially, we introduce opinion clustering as a system-level component and show that it significantly enhances faithfulness, particularly under the noisy and redundant conditions typical of user reviews. Finally, we identify reliability limitations in the standard SPACE dataset and release a new open-source tour experience dataset (TRECS) to enable more robust evaluation.

Paper Structure (35 sections, 1 equation, 3 figures, 10 tables)

Results
Online Evaluation
Offline Evaluation
Public Datasets
TRECS
Importance of Opinion Clustering
SPACE Deep Dive
Future Work & Conclusion
Ethics Statement
Dataset
Statistics
TRECS
Opinion Redundancy Dataset
Cluster Quality Filtering
Diverse Opinion Sampling
...and 20 more sections

Figures (3)

Figure 4: Preference of three LLMs (GPT-4o-mini, GPT-4.1-mini, and Llama-3.1-70B) when selecting between summaries generated from redundant opinions and those generated from deduplicated opinions, based on aspect coverage across 10 datasets. For each dataset (X-axis), bars represent the percentage of model responses favoring the non-redundant summary (blue), the redundant summary (red), or indicating equal coverage (green).
Figure 5: Preference of three LLMs (GPT-4o-mini, GPT-4.1-mini, and Llama-3.1-70B) when selecting between summaries generated from redundant opinions and those generated from deduplicated opinions, based on aspect faithfulness across 10 datasets. For each dataset (X-axis), bars represent the percentage of model responses favoring the non-redundant summary (blue), the redundant summary (red), or indicating equal coverage (green).
Figure 6: Distribution of sentiment scores for theme-level summaries and of the number of themes identified in product summaries for SPACE and MOSAIC, both evaluated using Llama-3.1-70B.
