Frequency Sensitive Duplicate Detection Using Multi-Metric Spaces

Debjyoti Chatterjee, Shashi Bajaj Mukherjee

TL;DR

This work addresses the inadequacy of classical metrics when repetition carries meaning by introducing frequency-sensitive multi-metric spaces defined on multisets and valued in the multi-real number system $m(\mathbb{R})$. It develops the theoretical foundations (multisets, multi-points, and multi-real numbers) and constructs multi-metrics that lift classical metrics, enabling frequency-aware distance computations. The paper then applies this framework to duplicate detection, proposing a multiplicity-based criterion $\delta$ and analyzing its behavior through analytical and numerical examples, algorithmic procedures, and scalability considerations (blocking, indexing, and streaming). The resulting approach preserves frequency information, offers tunable sensitivity, and supports both batch and streaming settings, with clear implications for data cleaning, record linkage, and data integration in data-intensive systems.

Abstract

Classical metric spaces often fail to model data-intensive systems where repetition and frequency of values are meaningful. In applications such as transactional databases, sensor logs, and record linkage, conventional distance measures ignore multiplicity information, leading to information loss and incorrect similarity judgments. This paper introduces multi-metric spaces defined on multisets and valued in the multi-real number system $m(\mathbb{R})$, providing a principled way to incorporate frequency into distance computations. We demonstrate the usefulness of multi-metrics through a frequency-sensitive duplicate detection example, showing improved accuracy over classical metric-based approaches.
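
To make the information-loss claim concrete, the following minimal sketch compares a distance computed on plain sets with one computed on multisets. It is an illustration only: the function names `set_distance` and `multiset_distance` and the sample logs are hypothetical, and the multiset distance used here is the ordinary sum of count differences, a stand-in rather than the paper's multi-metric valued in $m(\mathbb{R})$.

```python
from collections import Counter

def set_distance(a, b):
    """Size of the symmetric difference of the underlying sets.

    Repetition is discarded by the conversion to set().
    """
    return len(set(a) ^ set(b))

def multiset_distance(a, b):
    """Sum over all values of |count_a(v) - count_b(v)|.

    Assumption: a simple stand-in for a frequency-aware distance,
    not the multi-metric constructed in the paper.
    """
    ca, cb = Counter(a), Counter(b)
    # A Counter returns 0 for missing keys, so the union of keys is safe.
    return sum(abs(ca[v] - cb[v]) for v in ca.keys() | cb.keys())

# Two hypothetical event logs that differ only in how often "login" occurs.
log_a = ["login", "login", "login", "purchase"]
log_b = ["login", "purchase"]

print(set_distance(log_a, log_b))       # 0: the logs look identical as sets
print(multiset_distance(log_a, log_b))  # 2: the two extra logins are counted
```

A set-based measure would flag these two logs as exact duplicates, whereas the multiset-based measure keeps them apart; this is the kind of multiplicity information the abstract says conventional measures ignore.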

Paper Structure (30 sections, 68 equations, 1 algorithm)