(De)-Indexing and the Right to be Forgotten

Salvatore Vilella, Giancarlo Ruffo

TL;DR

Various IR models are explored, including boolean, probabilistic, vector space, and embedding-based approaches, as well as the role of Large Language Models (LLMs) in enhancing data processing capabilities.

Abstract

In the digital age, the challenge of forgetfulness has emerged as a significant concern, particularly regarding the management of personal data and its accessibility online. The right to be forgotten (RTBF) allows individuals to request the removal of outdated or harmful information from public access, yet implementing this right poses substantial technical difficulties for search engines. This paper aims to introduce non-experts to the foundational concepts of information retrieval (IR) and de-indexing, which are critical for understanding how search engines can effectively "forget" certain content. We will explore various IR models, including boolean, probabilistic, vector space, and embedding-based approaches, as well as the role of Large Language Models (LLMs) in enhancing data processing capabilities. By providing this overview, we seek to highlight the complexities involved in balancing individual privacy rights with the operational challenges faced by search engines in managing information visibility.

Paper Structure (16 sections, 16 equations, 2 figures)

Figures (2)

  • Figure 1: Left: interest over time in the queries "machine learning" and "llm" on Google. Interest in LLMs gains momentum and approaches the level of the "machine learning" query over time. Right: the evolutionary tree of LLMs [yang2023harnessing].
  • Figure 2: An intuitive comparison between the steps required to train a special-service dog and the training and fine-tuning phases of an LLM [google_intro_llms].