A Mathematical Model of the Hidden Feedback Loop Effect in Machine Learning Systems

Andrey Veprikov; Alexander Afanasiev; Anton Khritankov

TL;DR

This work models the long-term effects of repeated machine learning by formulating a dynamical system on probability densities, where each step f_{t+1} = D_t(f_t) captures data sampling, learning, and prediction feedback. The authors prove sufficient conditions for D_t to map PDFs to PDFs, and establish a dichotomy in the limit behavior: weak convergence to a delta function (positive feedback/accurate residuals) or to a zero distribution (error amplification). They derive an autonomy criterion for when the evolution is time-invariant and provide concrete examples of D_t, along with a suite of experiments on synthetic data that validate the theoretical predictions, including moment decay and normality breakdown of prediction errors. The results offer a principled basis for diagnosing hidden feedback loops and guiding design choices to mitigate bias amplification and trustworthiness violations in societal-scale ML systems.

Abstract

Widespread deployment of societal-scale machine learning systems necessitates a thorough understanding of the long-term effects these systems have on their environment, including loss of trustworthiness, bias amplification, and violation of AI safety requirements. We introduce a repeated learning process to jointly describe several phenomena attributed to unintended hidden feedback loops, such as error amplification, induced concept drift, echo chambers, and others. The process comprises the entire cycle of obtaining the data, training the predictive model, and delivering predictions to end-users within a single mathematical model. A distinctive feature of such a repeated learning setting is that the state of the environment becomes causally dependent on the learner itself over time, thus violating the usual assumptions about the data distribution. We present a novel dynamical systems model of the repeated learning process and establish the limiting set of probability distributions for the positive and negative feedback loop modes of system operation. We conduct a series of computational experiments using an exemplary supervised learning problem on two synthetic data sets. The results of the experiments agree with the theoretical predictions derived from the dynamical model. Our results demonstrate the feasibility of the proposed approach for studying repeated learning processes in machine learning systems and open a range of opportunities for further research in the area.

Paper Structure (30 sections, 6 theorems, 40 equations, 15 figures)

Introduction
Related Work
Mathematical Model and Problem Statement
Main Results
Basic Notation
Preliminary and Modeling Considerations
Discussion of Theorem \ref{R_to_R}.
Results for a General System (\ref{system})
Discussion of Theorem \ref{delta}.
Example of mappings $\text{D}_t$.
Analysis of Conjecture 1 (khritankov2021hidden)
Discussion of Lemma \ref{moments}.
Discussion of Lemma \ref{ineq_q}.
Results for an Autonomous System (\ref{system_aut})
Discussion of Theorem \ref{semigroup}.
...and 15 more sections

Key Result

Theorem 1

If a function $f: \mathbb{R}^n \to \mathbb{R}$ satisfies $f(x) \geq 0$ for almost every $x \in \mathbb{R}^n$ and $\|f\|_1 = \int\limits_{\mathbb{R}^n} f(x) dx = 1$, then there exists a random vector $\mathbf{\xi}$ whose probability density function is $f$.
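As a worked illustration of how Theorem 1 can be applied, consider a rescaling of a density. The map $g(x) = \psi^n f(\psi x)$ with a constant $\psi > 0$ is an assumption chosen here for illustration; this summary does not state that it is one of the authors' mappings $\text{D}_t$. The sketch checks the two hypotheses of Theorem 1 for $g$:

```latex
% Assumption: f is a density on R^n and \psi > 0 is a fixed constant.
\begin{align*}
g(x) &= \psi^{n} f(\psi x) \;\geq\; 0 \quad \text{for almost every } x \in \mathbb{R}^n, \\
\|g\|_1 &= \int\limits_{\mathbb{R}^n} \psi^{n} f(\psi x)\, dx
        \;=\; \int\limits_{\mathbb{R}^n} f(u)\, du \;=\; 1,
        \qquad u = \psi x,\quad du = \psi^{n}\, dx .
\end{align*}
```

Both hypotheses hold, so by Theorem 1 the function $g$ is the probability density function of some random vector. In this particular case the vector can be named explicitly: if $\mathbf{\xi}$ has density $f$, then $\mathbf{\xi}/\psi$ has density $g$.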

Figures (15)

Figure 1: Illustration of weak limit to delta function. $\mathcal{N}(0, 5^2)$ (left), $\mathcal{U}[-2.5, 2.5]$ (right).
Figure 2: Two different experiment schemes. Sliding window update setup (left) and sampling update setup (right).
Figure 3: Change in the standard deviation of the model error for different usage and adherence values. Sliding window setup (left), sampling update setup (right). The graph is almost everywhere either red or blue, hence Theorem \ref{delta} is applicable in practice.
Figure 4: $f_t(0)$ and $\int_{-\kappa}^{\kappa}f_t(x)dx$ computed in the sliding window setup for the SGD regression model on the synthetic linear data set. The (usage, adherence) parameters are $(1, 0)$ (left), $(0.1, 0.9)$ (middle), and $(1, 3)$ (right). Together, the panels show the entire limit set of the system \ref{system} from Theorem \ref{delta}.
Figure 5: $f_t(0)$ and $\int_{-\kappa}^{\kappa}f_t(x)dx$ computed in the sampling update setup for the SGD regression model on the synthetic linear data set. The (usage, adherence) parameters are $(1, 0)$ (left), $(0.1, 0.9)$ (middle), and $(1, 3)$ (right). Together, the panels show the entire limit set of the system \ref{system} from Theorem \ref{delta}.
...and 10 more figures
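
The two quantities tracked in Figures 4 and 5, $f_t(0)$ and $\int_{-\kappa}^{\kappa}f_t(x)dx$, can be illustrated with a minimal numerical sketch. It assumes a one-dimensional, time-invariant scaling map $\text{D}(f)(x) = \psi f(\psi x)$ and the initial density $\mathcal{N}(0, 5^2)$ from Figure 1. The map, the constant `psi`, the radius `kappa`, and all function names are hypothetical choices for illustration, not the authors' experimental setup, which trains a regression model on synthetic data.

```python
import numpy as np

def gaussian_pdf(x, sigma=5.0):
    """Density of N(0, sigma^2), the initial density used in Figure 1 (left)."""
    x = np.asarray(x, dtype=float)
    return np.exp(-0.5 * (x / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def evolve(f0, psi, steps):
    """Apply the ASSUMED map D(f)(x) = psi * f(psi * x) `steps` times.

    Composing the map t times gives f_t(x) = psi**t * f0(psi**t * x).
    """
    scale = psi ** steps
    return lambda x: scale * f0(scale * np.asarray(x, dtype=float))

def interval_mass(f, kappa, n=20001):
    """Trapezoidal estimate of the integral of f over [-kappa, kappa]."""
    x = np.linspace(-kappa, kappa, n)
    y = f(x)
    return float(np.sum(0.5 * (y[:-1] + y[1:]) * np.diff(x)))

kappa = 0.1  # hypothetical neighbourhood radius around zero
for psi in (1.5, 0.5):  # psi > 1 contracts the density, psi < 1 spreads it
    for t in (0, 10, 20):
        f_t = evolve(gaussian_pdf, psi, t)
        print(f"psi={psi} t={t:2d} f_t(0)={float(f_t(0.0)):.3e} "
              f"mass={interval_mass(f_t, kappa):.3e}")
```

Under this assumed map, `psi > 1` makes $f_t(0)$ grow while the mass on $[-\kappa, \kappa]$ approaches 1, which is the behavior expected of a weak limit to a delta function. With `psi < 1` both quantities decay toward zero, matching the zero-distribution side of the dichotomy.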

Theorems & Definitions (11)

Theorem 1: feller1991introduction
Theorem 2: Conditions for $\text{D}_t$ to be a transformation on $\textbf{F}$
Theorem 3: Limit set
Lemma 1: Decreasing moments
Lemma 2: Inequality on $\|\text{D}_t\|_q$
Theorem 4: Autonomy criterion
proof
proof
proof
proof
...and 1 more
