Rakuten Data Release: A Large-Scale and Long-Term Reviews Corpus for Hotel Domain
Yuki Nakayama, Koki Hikichi, Yun Ching Liu, Yu Hirate
TL;DR
The paper introduces Rakuten Data Release (RDR), a large-scale hotel-review corpus spanning 2009–2024, intended to support NLP and recommendation research and to help address data drift in the hotel domain. The release is public, for non-commercial use, and comprises User Evaluation Data and Hotel Master Data, covering more than 2.5 million users and 27.6K hotels, with 18 fields per record, including multi-aspect ratings and staff replies. Using a chi-square analysis of word distributions, the authors identify post-COVID vocabulary shifts, SDG-related terms, new product features, inbound tourism, inflation, and slang as key drivers of data drift, which underscores the need for up-to-date data in model evaluation. The corpus offers a long-running resource for hotel-domain modeling on real-world data and illustrates why datasets need ongoing maintenance as consumer language and industry practices evolve.
Abstract
This paper presents a large-scale corpus of Rakuten Travel reviews. Our collection contains 7.3 million customer reviews spanning 16 years, from 2009 to 2024. Each record in the dataset contains the review text, the accommodation's response to it, an anonymized reviewer ID, the review date, accommodation ID, plan ID, plan title, room type, room name, purpose of travel, accompanying group, and user ratings for several aspect categories, as well as an overall score. We present statistical information about our corpus and use statistical approaches to provide insights into the factors driving data drift between 2019 and 2024.
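As an illustration of how a chi-square comparison of word distributions between two periods could surface drift-related vocabulary, the following is a minimal sketch and not the authors' implementation. The function name `chi_square_drift`, the toy token lists, and the choice of a 2x2 Pearson statistic without continuity correction are all assumptions made for this example; tokenization is assumed to have been done upstream.

```python
from collections import Counter


def chi_square_drift(tokens_old, tokens_new):
    """Rank words by a 2x2 chi-square statistic comparing two periods.

    Hypothetical helper for illustration; the inputs are flat lists of
    tokens from an earlier and a later period.
    """
    old, new = Counter(tokens_old), Counter(tokens_new)
    n_old, n_new = sum(old.values()), sum(new.values())
    total = n_old + n_new
    scores = {}
    for word in set(old) | set(new):
        a = old[word]      # occurrences of the word in the old period
        b = new[word]      # occurrences of the word in the new period
        c = n_old - a      # all other tokens in the old period
        d = n_new - b      # all other tokens in the new period
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        # Pearson chi-square for a 2x2 contingency table
        scores[word] = total * (a * d - b * c) ** 2 / denom if denom else 0.0
    # Highest scores first: words whose relative frequency changed the most
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy example with made-up tokens (not taken from the corpus)
ranked = chi_square_drift(
    ["room", "room", "bath"],
    ["room", "room", "sauna"],
)
for word, score in ranked:
    print(f"{word}\t{score:.2f}")
```

A word with the same relative frequency in both periods scores zero, while words that appear in only one period rank at the top. On a real corpus, one would typically also apply a significance threshold and filter very rare words before interpreting the ranking.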
