Learning Unsupervised Semantic Document Representation for Fine-grained Aspect-based Sentiment Analysis
Hao-Ming Fu, Pu-Jen Cheng
TL;DR
The paper addresses unsupervised document representation for fine-grained sentiment analysis. It learns a document vector $v_D$ by predicting a target sentence from a context of $k$ surrounding sentences, contrasted against $r$ negative samples. The model introduces two CNN-based sentence encoders, a context vector $v_{cntx}$ assembled by averaging and length adjustment, and a logit-based negative-sampling loss. This context loss is combined with a document-level loss as $L_{total} = \alpha L_{cntx} + (1-\alpha) L_{doc}$ to capture both local and global relationships. At inference, a document is represented by the length-adjusted average of its sentence vectors, so new documents can be encoded without retraining. Experiments on IMDB and BeerAdvocate show substantial improvements over state-of-the-art unsupervised methods on both sentiment analysis and aspect-based sentiment analysis, demonstrating the generality and robustness of the approach. The resulting unsupervised representation preserves intra-sentence word order while supporting document-level aggregation for downstream SA tasks.
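To make the combined objective concrete, here is a minimal Python sketch of how $L_{total} = \alpha L_{cntx} + (1-\alpha) L_{doc}$ could be computed for one target sentence. It rests on several assumptions not stated in this summary: the negative-sampling loss takes the standard logistic (word2vec-style) form, the document-level loss uses the same form with $v_D$ as the query, and scores are plain dot products. The function names and the toy vectors are hypothetical and chosen only for illustration.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dot(u, v):
    """Dot product of two equal-length vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))


def negative_sampling_loss(query, target, negatives):
    """Logistic negative-sampling loss (ASSUMED form, not taken from the paper).

    Rewards a high score for the true target sentence vector and a low
    score for each of the r negative sentence vectors.
    """
    loss = -log_sigmoid(dot(query, target))
    for neg in negatives:
        loss -= log_sigmoid(-dot(query, neg))
    return loss


def total_loss(l_cntx, l_doc, alpha):
    """Convex combination L_total = alpha * L_cntx + (1 - alpha) * L_doc."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * l_cntx + (1.0 - alpha) * l_doc


# Toy 2-dimensional vectors (hypothetical values, for illustration only).
v_cntx = [0.5, -0.2]                      # context vector
v_D = [0.1, 0.3]                          # document vector
v_target = [0.4, 0.1]                     # encoded target sentence
v_negs = [[-0.3, 0.2], [0.0, -0.5]]       # r = 2 negative samples

l_cntx = negative_sampling_loss(v_cntx, v_target, v_negs)
l_doc = negative_sampling_loss(v_D, v_target, v_negs)  # assumed same form
print(total_loss(l_cntx, l_doc, alpha=0.5))
```

In this sketch, $\alpha$ close to 1 weights the local (context) signal more heavily, while $\alpha$ close to 0 emphasizes the global (document) signal.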
Abstract
Document representation is at the core of many NLP tasks in machine understanding. A general representation learned in an unsupervised manner preserves generality and can be used for various applications. In practice, sentiment analysis (SA) is a challenging task that is regarded as deeply semantic and is often used to assess general representations. Existing methods for unsupervised document representation learning fall into two families: sequential ones, which explicitly take the ordering of words into consideration, and non-sequential ones, which do not. However, each family suffers from its own weaknesses. In this paper, we propose a model that overcomes the difficulties encountered by both families of methods. Experiments show that our model outperforms state-of-the-art methods by a large margin on popular SA datasets and on a fine-grained aspect-based SA task.
