Distilling Opinions at Scale: Incremental Opinion Summarization using XL-OPSUMM
Sri Raghava Muddu, Rupasai Rangaraju, Tejpalsingh Siledar, Swaroop Nath, Pushpak Bhattacharyya, Swaprava Nath, Suman Banerjee, Amey Patil, Muthusamy Chelliah, Sudhanshu Shekhar Singh, Nikesh Garera
TL;DR
XL-OpSumm tackles the scalability challenge of opinion summarization over thousands of reviews with an incremental, chunk-based framework. It maintains a Global Summary and an Aspect Dictionary and updates the summary through aspect-aware, per-chunk processing. The approach splits reviews into non-overlapping chunks of up to $\tau$ tokens, updates aspect sentiments with ABSA, and uses LLMs to generate a Local Summary for each chunk before merging it into the Global Summary, so the number of reviews can grow beyond typical LLM context limits. Experiments on a new large-scale test set, Xl-Flipkart (~3,680 reviews/product across 25 products), and on the existing AMASUM dataset show average gains of 4.38% in ROUGE-1 F1 and 3.70% in ROUGE-L F1 over the next best-performing model, along with strong results on reference-free metrics such as BooookScore, fluency, and coherence. The work highlights practical impact for live e-commerce platforms by enabling continuous, scalable opinion synthesis, and it points to future directions such as incorporating additional data sources (Q&A, product descriptions) and addressing remaining evaluation limitations.
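The incremental loop described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`chunk_reviews`, `incremental_summarize`), the whitespace-based token count, the count-based Aspect Dictionary layout, and the three callables standing in for the ABSA model and the LLM prompts are all assumptions made for illustration.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical layout: aspect -> polarity -> number of mentions seen so far.
AspectDict = Dict[str, Dict[str, int]]


def chunk_reviews(reviews: Iterable[str], tau: int) -> List[List[str]]:
    """Greedily pack reviews into non-overlapping chunks of at most tau tokens.

    Tokens are approximated by whitespace-separated words (an assumption).
    A single review longer than tau is kept whole in its own chunk.
    """
    chunks: List[List[str]] = []
    current: List[str] = []
    used = 0
    for review in reviews:
        n_tokens = len(review.split())
        if current and used + n_tokens > tau:
            chunks.append(current)
            current, used = [], 0
        current.append(review)
        used += n_tokens
    if current:
        chunks.append(current)
    return chunks


def incremental_summarize(
    reviews: Iterable[str],
    tau: int,
    absa: Callable[[List[str]], List[Tuple[str, str]]],
    summarize_chunk: Callable[[List[str], AspectDict], str],
    merge: Callable[[str, str, AspectDict], str],
) -> Tuple[str, AspectDict]:
    """Process chunks one at a time, carrying only the Global Summary
    and the Aspect Dictionary forward between chunks.

    absa, summarize_chunk, and merge are placeholders for an ABSA model
    and LLM calls; their signatures are hypothetical.
    """
    global_summary = ""
    aspects: AspectDict = {}
    for chunk in chunk_reviews(reviews, tau):
        # Update aspect sentiments from this chunk's (aspect, polarity) pairs.
        for aspect, polarity in absa(chunk):
            counts = aspects.setdefault(aspect, {})
            counts[polarity] = counts.get(polarity, 0) + 1
        # Generate a Local Summary, then fold it into the Global Summary.
        local_summary = summarize_chunk(chunk, aspects)
        global_summary = merge(global_summary, local_summary, aspects)
    return global_summary, aspects
```

Because each step sees only one chunk plus the running Global Summary and Aspect Dictionary, the input passed to the model stays bounded regardless of how many reviews a product accumulates.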
Abstract
Opinion summarization in e-commerce encapsulates the collective views of numerous users about a product based on their reviews. Typically, a product on an e-commerce platform has thousands of reviews, each review comprising around 10-15 words. While Large Language Models (LLMs) have shown proficiency in summarization tasks, they struggle to handle such a large volume of reviews due to context limitations. To mitigate this, we propose a scalable framework called Xl-OpSumm that generates summaries incrementally. However, the existing test set, AMASUM, has only 560 reviews per product on average. Due to the lack of a test set with thousands of reviews, we created a new test set called Xl-Flipkart by gathering data from the Flipkart website and generating summaries using GPT-4. Through various automatic evaluations and extensive analysis, we evaluated the framework's efficiency on two datasets, AMASUM and Xl-Flipkart. Experimental results show that our framework, Xl-OpSumm, powered by Llama-3-8B-8k, achieves an average ROUGE-1 F1 gain of 4.38% and a ROUGE-L F1 gain of 3.70% over the next best-performing model.
