How does the primate brain combine generative and discriminative computations in vision?

Benjamin Peters, James J. DiCarlo, Todd Gureckis, Ralf Haefner, Leyla Isik, Joshua Tenenbaum, Talia Konkle, Thomas Naselaris, Kimberly Stachenfeld, Zenna Tavares, Doris Tsao, Ilker Yildirim, Nikolaus Kriegeskorte

TL;DR

This paper challenges the view that primate vision rests either solely on discriminative feedforward processing or solely on generative inference, arguing instead for a hybrid algorithm that blends both approaches. It clarifies the terminology, surveys behavioral and neural evidence, and proposes an integrative research program that uses task design, computational modeling, and multi-area neural measurements to reveal how the brain combines generative priors with discriminative inference. By framing vision as latent-variable inference and outlining how hybrids can be implemented across representation levels and time, the work aims to operationalize tests that move beyond a false dichotomy toward a unified theory of visual computation. The proposed framework emphasizes normative probabilistic reasoning, resource-aware computation, and the need for tasks that probe generalization, occlusion, imagery, and spontaneous activity to illuminate the brain’s hybrid algorithms.

Abstract

Vision is widely understood as an inference problem. However, two contrasting conceptions of the inference process have each been influential in research on biological vision as well as the engineering of machine vision. The first emphasizes bottom-up signal flow, describing vision as a largely feedforward, discriminative inference process that filters and transforms the visual information to remove irrelevant variation and represent behaviorally relevant information in a format suitable for downstream functions of cognition and behavioral control. In this conception, vision is driven by the sensory data, and perception is direct because the processing proceeds from the data to the latent variables of interest. The notion of "inference" in this conception is that of the engineering literature on neural networks, where feedforward convolutional neural networks processing images are said to perform inference. The alternative conception is that of vision as an inference process in Helmholtz's sense, where the sensory evidence is evaluated in the context of a generative model of the causal processes giving rise to it. In this conception, vision inverts a generative model through an interrogation of the evidence in a process often thought to involve top-down predictions of sensory data to evaluate the likelihood of alternative hypotheses. The authors include scientists rooted in roughly equal numbers in each of the conceptions and motivated to overcome what might be a false dichotomy between them and engage the other perspective in the realm of theory and experiment. The primate brain employs an unknown algorithm that may combine the advantages of both conceptions. We explain and clarify the terminology, review the key empirical evidence, and propose an empirical research program that transcends the dichotomy and sets the stage for revealing the mysterious hybrid algorithm of primate vision.

Paper Structure (58 sections, 2 equations, 5 figures)

Figures (5)

  • Figure 1: Toy example of the visual inference problem. (a) As an example, we'll consider a world we can fully control that generates the sensory experiences for an agent. Our example world consists of a simplified environment for which a graphics engine renders a single object (a bear). The graphics engine has only a few parameters: whether the bear is a real bear or a teddy bear, a size distribution conditioned on the type of bear (big for real, small for toy), and the distance of the bear to the camera (i.e., the observer). (b) From the world state $\mathbf{s}$, the graphics engine renders a sensory pattern on the retina of the agent: the observation $\mathbf{o}$. (c) The agent infers some latent variables $\mathbf{z}$ that reflect aspects of the world, here the size and distance of the object, which can be represented by a two-dimensional vector $\mathbf{z}$ in the agent's mind. (d) When presented with a specific input $\mathbf{o}$, the agent infers the corresponding latent variables $\mathbf{z}$. Inference can therefore be understood as a function $f$ that maps each observation $\mathbf{o}$ onto a belief about the latent variables. The belief about $\mathbf{z}$ might contain uncertainty, requiring a probabilistic representation $p(\mathbf{z}|\mathbf{o})$. The agent might compute and represent the exact posterior $p(\mathbf{z}|\mathbf{o})$, a parametric representation $\phi$ (e.g., means and covariances of a multivariate Gaussian), a set of samples $\mathbf{z} \sim p(\cdot|\mathbf{o})$, or a single point estimate (e.g., the vector $\mathbf{z}^*$ that maximizes $p(\mathbf{z}|\mathbf{o})$). (e) The statistical structure of the inference problem: States in the world $\mathbf{s}$ induce a distribution over observations $p(\mathbf{o})$ that an agent encounters. Each observation is associated with a particular latent variable state, giving rise to a distribution over latent variables $p(\mathbf{z}) = \int p(\mathbf{z}|\mathbf{o}) p(\mathbf{o}) d \mathbf{o}$ that an agent can expect to experience in the world. (f) Latent variables in the brain need not correspond to the world state. Another rendering engine with a different parameterization of the world (e.g., in terms of the position of individual vertices in space) could generate the exact same observations (and the exact same $p(\mathbf{o})$). This highlights the elusive relationship between $\mathbf{s}$ and $\mathbf{z}$. In particular, it is not warranted to simply assume $\mathbf{z}$ from a particular, hypothesized 'world representation'. Instead, how the brain's latent variables relate to a researcher's model of the world $\mathbf{s}$ is an empirical question. The variables $\mathbf{s}$ in the researcher's mental model (e.g., bear category, bear size) might not correspond to latent variables in the brain's visual inference model (see (g) for other examples of possible latent variables). This raises the fundamental question of how the latent variables $\mathbf{z}$ arise from the agent's interactions with the world.
  • Figure 2: Frameworks for constructing inference models. We are interested in inference models, i.e., models that map $\mathbf{o}$ onto an estimate of $\mathbf{z}$. "Discriminative" and "generative" can be seen as frameworks to arrive at hypothetical models of how the visual system performs inference (bottom ellipse: the space of all possible inference models). Model construction starts with defining the system's overall objective (upper ellipse): discriminative ($p(\mathbf{z}|\mathbf{o})$), generative ($p(\mathbf{o}, \mathbf{z})$), or hybrids between generative and discriminative objectives. For example, the generative framework starts by positing that the overall system's objective is to represent the joint distribution over observations $\mathbf{o}$ and latent variables $\mathbf{z}$. Middle ellipse: The choice of goal/objective leads to particular choices for representations and algorithms that implement inference. Inference with a generative model (in contrast to a discriminative model) needs to "invert" the generative model such that its components compute an estimate of $p(\mathbf{z}|\mathbf{o})$ (red arrow between upper two ellipses). This difference between the discriminative and generative models biases the choice of algorithms and representations under both frameworks (reflected in the notion of "generative computations" and "discriminative computations"). Importantly, however, both frameworks may lead to overlapping choices of individual representational and algorithmic motifs (e.g., discriminative models may be iterative and involve sampling). Lower ellipse: The resulting classes of inference models are distinct by virtue of their construction framework. Possibly, different frameworks may lead to the same inference models (* in the intersection of generative and discriminative inference models). For example, a discriminatively trained RNN might, in principle, learn representations and algorithms that implement inference with a generative model.
  • Figure 3: Examples of interpreting systems in terms of discriminative and generative models. A brain-behavioral model is an interpretation of a particular system abstraction (e.g., firing rates of V1 neurons, or ventral visual stream pattern activations) that involves an inference model and a mapping function that maps components of the inference model $\mathbf{z}$ to measured neural activity and/or behavior $\mathbf{r}$. Ventral stream (top box): two examples of interpreting brain pattern activations in the primate ventral stream. Top: Individual layers of a trained AlexNet can be interpreted as discriminative components of an inference model. Bottom: Alternatively, these neural representations can be interpreted as corresponding to beliefs in a hierarchical belief propagation network, a model of generative inference suggested by, e.g., [lee_hierarchical_2003]. V1 (bottom box): two examples of interpreting V1 neuronal firing rates. Top: Firing rates in V1 can be mapped onto the latent variables of a model in which inhibitory lateral connections between representational units $z_1, \dots, z_n$ perform a version of explaining away [olshausen_emergence_1996]. Bottom: An example of V1 neural activations explained via an analysis-by-synthesis model [olshausen_sparse_1997] that 'inverts' the generative model. Note that this model includes error units, representing the momentary difference between predicted and observed input, a prediction which can be included in the set of latent variables that are mapped onto neural observations.
  • Figure 4: Spectrum of perspectives on vision. There is a spectrum of models whose extreme poles are purely discriminative and purely generative models of visual inference (i.e., "seeing"). A predominantly discriminative perspective on visual inference may be compatible with the notion that generativity is involved in other visual tasks or in training the inference component of the visual system (2nd column). The hybrid perspective suggests that the visual system combines discriminative and generative components, which may be identified in time and space (3rd column).
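
The toy problem in Figure 1 can be made concrete with a small numerical sketch. The following Python example is an illustration under stated assumptions, not the authors' model: the rendering rule (retinal size equals object size divided by distance), the Gaussian sensory noise, the prior parameters for toy and real bears, the grids, and all function names are hypothetical choices made here. It contrasts two routes to a belief about $\mathbf{z}$ = (size, distance): inverting a generative model with Bayes' rule on a grid, and a discriminative lookup table fitted to sampled (observation, latent) pairs that maps $\mathbf{o}$ directly to an estimate of $\mathbf{z}$.

```python
import numpy as np

# Hypothetical toy world loosely following Figure 1; all numbers are assumptions.
SIZES = np.linspace(0.1, 3.0, 60)    # candidate object sizes (m)
DISTS = np.linspace(1.0, 20.0, 80)   # candidate distances (m)
NOISE_SD = 0.02                      # assumed sensory noise on retinal size


def gaussian(x, mean, sd):
    """Gaussian density, evaluated elementwise."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))


def prior_grid():
    """p(z) on the (size, distance) grid: bear type marginalized out."""
    # Assumed size distributions: teddy bears ~0.3 m, real bears ~2.0 m.
    p_size = 0.5 * gaussian(SIZES, 0.3, 0.05) + 0.5 * gaussian(SIZES, 2.0, 0.3)
    p_dist = np.ones_like(DISTS)     # assumed uniform prior over distance
    p = np.outer(p_size, p_dist)
    return p / p.sum()


def render(size, dist):
    """Noise-free stand-in for the graphics engine: retinal size."""
    return size / dist


def posterior_grid(o):
    """Generative route: p(z|o) proportional to p(o|z) p(z), on the grid."""
    predicted = render(SIZES[:, None], DISTS[None, :])
    unnorm = gaussian(o, predicted, NOISE_SD) * prior_grid()
    return unnorm / unnorm.sum()


def map_estimate(o):
    """Point estimate z* that maximizes the grid posterior."""
    post = posterior_grid(o)
    i, j = np.unravel_index(np.argmax(post), post.shape)
    return SIZES[i], DISTS[j]


def fit_discriminative(n_samples=20000, n_bins=50, seed=0):
    """Discriminative route: fit a direct map from o to a mean estimate of z."""
    rng = np.random.default_rng(seed)
    prior = prior_grid()
    # Sample latent states from the prior, then render noisy observations.
    idx = rng.choice(prior.size, size=n_samples, p=prior.ravel())
    i, j = np.unravel_index(idx, prior.shape)
    z = np.stack([SIZES[i], DISTS[j]], axis=1)
    o = render(z[:, 0], z[:, 1]) + rng.normal(0.0, NOISE_SD, n_samples)
    # Quantile bins over o; each bin stores the average latent state seen there.
    edges = np.quantile(o, np.linspace(0.0, 1.0, n_bins + 1))
    which = np.clip(np.searchsorted(edges, o, side="right") - 1, 0, n_bins - 1)
    table = np.array([z[which == b].mean(axis=0) for b in range(n_bins)])

    def predict(o_new):
        b = np.clip(np.searchsorted(edges, o_new, side="right") - 1, 0, n_bins - 1)
        return table[b]

    return predict


if __name__ == "__main__":
    o = 0.2  # ambiguous: a small near object or a large distant one
    print("MAP (size, distance):", map_estimate(o))
    print("Discriminative estimate:", fit_discriminative()(o))
```

For an observation such as `o = 0.2`, `posterior_grid(0.2)` spreads belief over both a small, near object and a large, distant one, which is the ambiguity the size prior helps resolve. `fit_discriminative()` returns a function that approximates the posterior mean without evaluating the generative model at inference time. The binned average is used only for brevity; a feedforward network would be a more typical discriminative model.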