
TrustDataFilter: Leveraging Trusted Knowledge Base Data for More Effective Filtering of Unknown Information

Jinghong Zhang, Yidong Cui, Weiling Wang, Xianyou Cheng

TL;DR

TrustDataFilter addresses the challenge of filtering unknown information in domain-specific knowledge bases by integrating large language models with natural language inference in a self-NLI framework. It uses a two-layer architecture with Confidence Evaluation, Contradiction Evaluation, and Decision Evaluation to iteratively filter data against a trusted knowledge base, with vector matching and dynamic knowledge base expansion. Experiments across biological, science, and radiation domains with RoBERTa, GPT-3.5, and Qwen2-7B show average improvements of about $3\%$ in accuracy, with notable gains in precision and F1 and with domain-dependent variations. The work provides open-source software and three domain datasets, enabling robust domain-specific knowledge filtering and iterative knowledge validation.

Abstract

As technology advances and markets change, demand for domain-specific knowledge bases has grown, whether to improve model performance or to promote enterprise innovation and competitiveness. Such knowledge bases are typically built from web crawls or existing industry databases, which leads to problems with the accuracy and consistency of the data. To address these challenges, we exploit a characteristic of domain data, namely that its internal knowledge is interconnected, and propose the Self-Natural Language Inference Data Filtering (self-nli-TDF) framework. The framework compares trusted, already-filtered knowledge with the data to be filtered and infers the relationship between them, thereby improving filtering performance. It uses plug-and-play large language models for trustworthiness assessment and the RoBERTa-MNLI model for natural language inference. We construct three datasets in the domains of biology, radiation, and science, and conduct experiments using RoBERTa, GPT-3.5, and a locally run Qwen2 model. The experimental results show that the framework improves filtering quality, producing more consistent and reliable filtering results.
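
The filtering loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the decision rule (reject on any contradiction, otherwise accept if entailed or confidently scored), and the 0.5 threshold are all assumptions made for illustration. The toy helpers replace the real components: token overlap stands in for vector matching, a set-difference rule stands in for RoBERTa-MNLI, and a lookup table stands in for an LLM confidence score.

```python
from typing import Callable, List, Tuple

# All names below are hypothetical stand-ins, not the paper's API.
ConfidenceFn = Callable[[str], float]               # trust score in [0, 1]
NliFn = Callable[[str, str], str]                   # (premise, hypothesis) -> label
RetrieveFn = Callable[[str, List[str]], List[str]]  # related trusted statements


def filter_with_trusted_kb(
    candidates: List[str],
    trusted_kb: List[str],
    confidence: ConfidenceFn,
    nli: NliFn,
    retrieve: RetrieveFn,
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> Tuple[List[str], List[str]]:
    """Filter candidates one by one, growing the trusted KB as items pass."""
    kb = list(trusted_kb)
    rejected: List[str] = []
    for statement in candidates:
        # Confidence evaluation: the model's own trust in the statement.
        confident = confidence(statement) >= threshold
        # Contradiction evaluation: NLI against related trusted statements.
        related = retrieve(statement, kb)
        labels = [nli(premise, statement) for premise in related]
        contradicted = "contradiction" in labels
        supported = "entailment" in labels
        # Decision evaluation (assumed rule): a contradiction always rejects.
        if not contradicted and (supported or confident):
            kb.append(statement)  # accepted items become trusted knowledge
        else:
            rejected.append(statement)
    return kb, rejected


def toy_retrieve(statement: str, kb: List[str]) -> List[str]:
    # Token overlap as a crude substitute for vector matching.
    words = set(statement.split())
    return [k for k in kb if len(words & set(k.split())) >= 2]


def toy_nli(premise: str, hypothesis: str) -> str:
    # Rule-based substitute for an NLI model; only handles a lone "not".
    p, h = set(premise.split()), set(hypothesis.split())
    if h <= p:
        return "entailment"
    if p ^ h == {"not"}:
        return "contradiction"
    return "neutral"


toy_scores = {"rna is a nucleic acid": 0.9, "dna is not a nucleic acid": 0.8}


def toy_confidence(statement: str) -> float:
    # Lookup table standing in for an LLM-produced confidence score.
    return toy_scores.get(statement, 0.2)


kb, rejected = filter_with_trusted_kb(
    candidates=[
        "dna is not a nucleic acid",   # confident, but contradicts the KB
        "rna is a nucleic acid",       # neutral to the KB, confident
        "rna is not a nucleic acid",   # contradicts the item just accepted
        "the moon is made of cheese",  # unrelated and low confidence
    ],
    trusted_kb=["dna is a nucleic acid"],
    confidence=toy_confidence,
    nli=toy_nli,
    retrieve=toy_retrieve,
)
print(kb)
print(rejected)
```

In this toy run the third candidate is rejected only because the second one was accepted into the knowledge base first, which is the effect of expanding the trusted set during filtering. In practice each helper would be swapped for a real model call.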

Paper Structure

This paper contains 24 sections, 6 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Comparison of traditional filtering and the proposed self-nli-tdf framework. The robot icon represents pre-trained filtering models (e.g., large language models), and the assistant icon represents NLI (Natural Language Inference) models. (a) Basic Filtering: This method uses pre-trained models to filter domain-specific data, relying on prior knowledge embedded in the model. However, it struggles with domain data that deviates from the training distribution, leading to more false negatives (FN) and reduced filtering accuracy. (b) self-nli-tdf Filtering: This iterative method leverages the interrelations within domain knowledge and combines reliable domain data with the model's internal knowledge for dual error detection. It removes incorrect data by exploiting correlations within that knowledge. Experimental Results: Compared to basic filtering, self-nli-tdf improves accuracy by reducing false negatives (red section in the figure). Overall, the framework achieves better accuracy through dual detection.
  • Figure 2: self-nli-datafilter architecture
  • Figure 3: self-nli-datafilter architecture
  • Figure 4: Distribution of Topics
  • Figure 5: Qwen Accuracy Improvement
  • ...and 1 more figure