On the selection of the correct number of terms for profile construction: theoretical and empirical analysis

Luis M. de Campos; Juan M. Fernández-Luna; Juan F. Huete

TL;DR

This work addresses how to select the number of terms used to build a user profile from a collection of documents. It develops an axiomatic framework, based on seven properties inspired by concentration theory, for evaluating cutoff functions, and introduces a cosine-based similarity cutoff (SC) that uses $Sim(D_j^i,D_j)$ to adapt the profile size to the weight distribution. Empirically, Diff emerges as a strong weighting scheme, and SC delivers competitive accuracy with smaller profiles, reducing index size and improving recommendation speed across parliamentary datasets. The study shows that coupling a principled cutoff with an appropriate weighting scheme yields efficient and effective document-based profiles for content-based filtering in real-world settings.

Abstract

In this paper, we examine the problem of building a user profile from a set of documents. The profile consists of a subset of the terms in the documents that best represent the user's preferences or interests. Inspired by discrete concentration theory, we have conducted an axiomatic study of seven properties that a selection function should fulfill: the minimum and maximum uncertainty principles, invariance to adding zeros, invariance to scale transformations, the principle of nominal increase, the transfer principle, and the richest-get-richer inequality. We also present a novel selection function based on similarity metrics, specifically the cosine measure commonly used in information retrieval, and demonstrate that it satisfies six of the properties plus a weaker variant of the transfer principle, which makes it a good selection approach. The theoretical study is complemented by an empirical study that compares the performance of different selection criteria (weight- and unweight-based) using real data from a parliamentary setting. In this study, we analyze the performance of the different functions, focusing on the two main factors that affect the selection process: profile size (number of terms) and weight distribution. The resulting profiles are then used in a document filtering task to show that our similarity-based approach performs well in terms of both recommendation accuracy and efficiency (we obtain smaller profiles and consequently faster recommendations).
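
The cosine-based selection idea can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that the cutoff keeps the smallest number of top-weighted terms whose truncated weight vector reaches a given cosine similarity with the full weight vector. The function name `similarity_cutoff` and the default threshold of 0.9 are hypothetical choices made only for this illustration. The sketch relies on the fact that, when the truncated vector keeps the top-i weights and zeroes the rest, its cosine with the full vector equals the ratio of their Euclidean norms.

```python
import math


def similarity_cutoff(weights, threshold=0.9):
    """Return the smallest number of top-weighted terms whose truncated
    vector has cosine similarity >= threshold with the full vector.

    `threshold` is an illustrative assumption, not a value from the paper.
    """
    if not 0.0 < threshold <= 1.0:
        raise ValueError("threshold must be in (0, 1]")

    # Zero weights contribute nothing to either norm, so they are dropped.
    ranked = sorted((w for w in weights if w > 0), reverse=True)
    if not ranked:
        return 0

    total_sq = sum(w * w for w in ranked)
    partial_sq = 0.0
    for i, w in enumerate(ranked, start=1):
        partial_sq += w * w
        # cos(truncated, full) = ||truncated|| / ||full||
        if math.sqrt(partial_sq / total_sq) >= threshold:
            return i
    return len(ranked)


if __name__ == "__main__":
    weights = [14, 13, 12, 7, 6, 5, 5, 5, 5, 1]
    print(similarity_cutoff(weights))       # concentrated weights
    print(similarity_cutoff([1] * 10))      # uniform weights
```

Under these assumptions the selected size depends only on the shape of the weight distribution: rescaling all weights or appending zero-weight terms leaves the result unchanged, which matches two of the invariance properties listed in the abstract.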

Paper Structure (21 sections, 18 equations, 8 figures, 7 tables)

Introduction
Related work
Theoretical foundations
Cutoff functions
Unweight-oriented:
Weight-oriented:
A new approach for term selection
Cutoff points for document-based collections
Source collections
Weighting measures
Analyzing weighted-oriented cutoff functions
Analyzing unweighted-oriented cutoff functions
Performance in the MP filtering task
Experimental setting
Analyzing weighting measures performance
...and 6 more sections

Figures (8)

Figure 1: Concentration curves for the distributions $L= \{14, 13, 12, 7, 6, 5, 5, 5, 5, 1\}$, $L+=\{14, 13, 12, 11, 6, 5, 5, 5, 1, 1\}$ and $L+h=\{54, 53, 52, 47, 46, 45, 45, 45, 45, 41\}$.
Figure 2: Coefficient of Variation for the different weighting criteria using C-Col.
Figure 3: VT and SC functions versus Coefficient of Variation
Figure 4: Comparing Variable Threshold and Cosine Similarity criteria
Figure 5: Analyzing the performance of the SC function in terms of the document length for each s-collection (using a log-log scale)
...and 3 more figures
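
The three distributions in the Figure 1 caption can be compared numerically. From the numbers alone, L+ has the same total as L (73) but moves 4 units from a weight-5 term to the weight-7 term, and L+h adds 40 to every weight of L. The sketch below assumes that a concentration curve is the cumulative share of total weight held by the k largest terms, and it uses the population standard deviation for the coefficient of variation; both definitions, and the function names, are assumptions for this illustration and may differ from those used in the paper.

```python
import statistics


def concentration_curve(weights):
    """Cumulative share of the total weight held by the k largest weights,
    for k = 1..n (assumed definition, see the lead-in)."""
    ranked = sorted(weights, reverse=True)
    total = sum(ranked)
    curve = []
    running = 0
    for w in ranked:
        running += w
        curve.append(running / total)
    return curve


def coefficient_of_variation(weights):
    """Population standard deviation divided by the mean."""
    return statistics.pstdev(weights) / statistics.mean(weights)


L = [14, 13, 12, 7, 6, 5, 5, 5, 5, 1]
L_plus = [14, 13, 12, 11, 6, 5, 5, 5, 1, 1]
L_plus_h = [54, 53, 52, 47, 46, 45, 45, 45, 45, 41]

if __name__ == "__main__":
    for name, dist in [("L", L), ("L+", L_plus), ("L+h", L_plus_h)]:
        curve = concentration_curve(dist)
        cv = coefficient_of_variation(dist)
        print(name, [round(c, 3) for c in curve], round(cv, 3))
```

With these definitions the curve of L+ lies on or above that of L at every k, while the curve of L+h lies on or below it, and the coefficient of variation orders the three distributions the same way (L+h lowest, L+ highest).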

Theorems & Definitions (1)

Example 1
