
Fine-grainedly Synthesize Streaming Data Based On Large Language Models With Graph Structure Understanding For Data Sparsity

Xin Zhang, Linhai Zhang, Deyu Zhou, Guoqiang Xu

TL;DR

This work proposes a fine-grained streaming data synthesis framework that divides sparse users into three categories (Mid-tail, Long-tail, and Extreme) and designs LLMs to understand three key graph elements in streaming data: Local-global Graph Understanding, Second-Order Relationship Extraction, and Product Attribute Understanding.

Abstract

Due to the sparsity of user data, sentiment analysis on user reviews in e-commerce platforms often suffers from poor performance, especially when faced with extremely sparse user data or long-tail labels. Recently, the emergence of LLMs has introduced new solutions to such problems by leveraging graph structures to generate supplementary user profiles. However, previous approaches have not fully utilized the graph understanding capabilities of LLMs and have struggled to adapt to complex streaming data environments. In this work, we propose a fine-grained streaming data synthesis framework that categorizes sparse users into three categories: Mid-tail, Long-tail, and Extreme. Specifically, we design LLMs to comprehensively understand three key graph elements in streaming data, including Local-global Graph Understanding, Second-Order Relationship Extraction, and Product Attribute Understanding, which enables the generation of high-quality synthetic data to effectively address sparsity across different categories. Experimental results on three real datasets demonstrate significant performance improvements, with synthesized data contributing to MSE reductions of 45.85%, 3.16%, and 62.21%, respectively.
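
As a rough illustration of the second-order relationships mentioned in the abstract, the sketch below finds users who are two hops away from a target user in a user-product bipartite graph, that is, users who reviewed at least one of the same products. The function name `second_order_users`, the edge-list format, and the shared-product count are assumptions made for this example; the paper's actual extraction is performed by an LLM and may differ.

```python
from collections import defaultdict


def second_order_users(edges, target_user):
    """Map each user two hops from target_user to the number of shared products.

    edges is an iterable of (user, product) pairs; this format is an
    assumption for illustration, not the paper's data schema.
    """
    products_by_user = defaultdict(set)
    users_by_product = defaultdict(set)
    for user, product in edges:
        products_by_user[user].add(product)
        users_by_product[product].add(user)

    shared = defaultdict(int)
    # Hop 1: target user -> products; hop 2: products -> other users.
    for product in products_by_user[target_user]:
        for other in users_by_product[product]:
            if other != target_user:
                shared[other] += 1
    return dict(shared)


# Hypothetical toy graph.
edges = [
    ("u1", "p1"), ("u2", "p1"), ("u2", "p2"), ("u3", "p2"),
    ("u1", "p3"), ("u4", "p3"), ("u4", "p1"),
]
neighbors = second_order_users(edges, "u1")
print(neighbors)
```

In a pipeline like the one described, the reviews of such second-order users could be passed to the LLM as context for a sparse user's supplementary profile; how that context is selected and prompted is not specified here.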

Paper Structure

This paper contains 19 sections, 13 equations, 17 figures, and 6 tables.

Figures (17)

  • Figure 1: An example of temporal sparsity among users in streaming data. The LLM leverages second-order relationships to synthesize similar product-user data, filling in the temporal gaps.
  • Figure 2: Framework for using the LLM as a handler for streaming data sparsity. The bipartite graph stream serves as input, and the LLM must understand three key components of the graph: Local-Global Graph Understanding, Second-Order Relationship Extraction, and Product Attribute Understanding (product information sometimes comes directly from the initial input and sometimes from other products selected under different rules). Finally, sparse user information is combined with the selected product information to produce the synthesized data; each synthesized review includes both review text and a corresponding rating.
  • Figure 3: Long-tail user scenario. Local and global bipartite graphs serve as inputs. The LLM simultaneously analyzes the second-order homogeneous user relationships of Long-tail users in both the local and the global bipartite graph to obtain supplementary Long-tail user profiles. It also analyzes the third-order product relationships of Long-tail users in the global bipartite graph to obtain product profiles for data synthesis.
  • Figure 4: Extremely sparse scenario. Synthetic data is generated by creating artificial connections between the top products and Extreme users to simulate pseudo interactions.
  • Figure 5: Vocabulary Richness Comparison.
  • ...and 12 more figures
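
The pseudo interactions described in the Figure 4 caption can be sketched as follows. This is a minimal example under stated assumptions: "top products" is taken to mean the products with the most interactions, and the function name `pseudo_interactions`, the parameter `k`, and the edge-list format are hypothetical. The paper may rank products differently, and the review text and rating for each pseudo edge would still need to be generated by the LLM.

```python
from collections import Counter


def pseudo_interactions(edges, extreme_users, k=2):
    """Pair each Extreme user with the k most-interacted products.

    Ranking by interaction count is an assumption for illustration;
    existing (user, product) pairs are skipped so only new edges are returned.
    """
    counts = Counter(product for _, product in edges)
    top_products = [product for product, _ in counts.most_common(k)]
    existing = set(edges)
    return [
        (user, product)
        for user in extreme_users
        for product in top_products
        if (user, product) not in existing
    ]


# Hypothetical toy data: p1 has 3 interactions, p2 has 2, p3 has 1.
edges = [
    ("u1", "p1"), ("u2", "p1"), ("u3", "p1"),
    ("u1", "p2"), ("u2", "p2"), ("u5", "p3"),
]
candidates = pseudo_interactions(edges, ["u5", "u6"], k=2)
print(candidates)
```

Each returned pair is only a candidate edge for synthesis, not a real observation.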