Textual semantics and machine learning methods for data product pricing

Ruize Gao, Feng Xiao, Jinpu Li, Shaoze Cui

TL;DR

The paper addresses data-product pricing in data marketplaces by systematically evaluating how textual semantics influence price prediction. It benchmarks five textual representations (BoW, TF-IDF, Word2Vec, LDA, BERTopic) with six ML models across regression and price-tier classification, using mRMR for feature selection and SHAP for interpretability. Key findings show Word2Vec excels for continuous price prediction, while frequency-based representations best support classification; SHAP reveals healthcare/demographics raise prices, whereas weather/environment lowers them. The work offers practical guidance for description strategies and pricing tooling, and demonstrates that embedding-to-word mappings can enhance model explainability in data markets.

Abstract

Reasonable pricing of data products enables data trading platforms to maximize revenue and foster the growth of the data trading market. The textual semantics of data products are vital for pricing and contain significant value that remains largely underexplored. To investigate how textual features influence data product pricing, we encode the descriptive text of data products with five prevalent text representation techniques. We then predict data product prices with six machine learning methods: linear regression, neural networks, decision trees, support vector machines, random forests, and XGBoost. Our empirical design consists of two tasks: a regression task that predicts the continuous price of data products, and a classification task that discretizes price into ordered categories. We further analyze feature importance using the mRMR feature selection method and SHAP-based interpretability techniques. Based on empirical data from the AWS Data Exchange, we find that for predicting continuous prices, Word2Vec text representations, which capture semantic similarity, yield superior performance. In contrast, for price-tier classification, simpler representations that do not rely on semantic similarity, such as Bag-of-Words and TF-IDF, perform better. SHAP analysis reveals that semantic features related to healthcare and demographics tend to increase prices, whereas those associated with weather and environmental topics are linked to lower prices. This analytical framework significantly enhances the interpretability of pricing models.
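To make the two-task setup concrete, the following is a minimal sketch, not the authors' implementation. It builds TF-IDF vectors from product descriptions and derives ordered price tiers from continuous prices. The descriptions and prices are made up for illustration, the smoothed IDF formula is one common variant, and the rank-based quantile cut is an assumption, since the abstract does not state how prices are discretized. The helper names `tokenize`, `tfidf_vectors`, and `price_tiers` are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace, keep purely alphabetic tokens."""
    return [t for t in text.lower().split() if t.isalpha()]


def tfidf_vectors(docs):
    """Return (vocabulary, TF-IDF vectors), one vector per document."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(t for doc in tokenized for t in set(doc))
    # Smoothed IDF (assumed variant): log((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        total = len(doc) or 1  # guard against empty descriptions
        vectors.append([counts[t] / total * idf[t] for t in vocab])
    return vocab, vectors


def price_tiers(prices, n_tiers=3):
    """Assign ordered tiers 0..n_tiers-1 by price rank (assumed quantile cut)."""
    order = sorted(range(len(prices)), key=lambda i: prices[i])
    tiers = [0] * len(prices)
    for rank, i in enumerate(order):
        tiers[i] = rank * n_tiers // len(prices)
    return tiers


# Made-up listings, for illustration only.
descriptions = [
    "hospital patient claims data",
    "daily weather station data",
    "census demographics by region",
]
prices = [5000.0, 200.0, 3000.0]

vocab, X = tfidf_vectors(descriptions)  # features shared by both tasks
tiers = price_tiers(prices)             # classification target
# The regression task would use `prices` directly as the target.
```

In this sketch the same feature matrix `X` feeds both tasks; only the target changes, from the raw price to its tier. A term that appears in every description, such as "data", receives a lower weight than a term unique to one listing.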

Paper Structure

This paper contains 34 sections, 35 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Distribution analysis of data product price
  • Figure 2: Prediction results based on mRMR method
  • Figure 3: Feature importance ranking (Top 20). In the figure, the color of each data point represents the magnitude of the corresponding feature value: redder points indicate larger feature values, while bluer points indicate smaller values. A positive SHAP value (SHAP value $> 0$) signifies a positive influence on the output, while a negative SHAP value indicates a negative influence.
  • Figure 4: Classification results based on XGBoost+mRMR method
  • Figure 5: Feature importance ranking of classification model (Top 20)