MetaCOG: A Hierarchical Probabilistic Model for Learning Meta-Cognitive Visual Representations

Marlene D. Berke, Zhangir Azerbayev, Mario Belledonne, Zenna Tavares, Julian Jara-Ettinger

TL;DR

MetaCOG is shown to be robust to varying levels of error in object detector outputs, providing a proof of concept for a novel approach to detecting and correcting errors in vision systems when ground truth is not available.

Abstract

Humans have the capacity to question what we see and to recognize when our vision is unreliable (e.g., when we realize that we are experiencing a visual illusion). Inspired by this capacity, we present MetaCOG: a hierarchical probabilistic model that can be attached to a neural object detector to monitor its outputs and determine their reliability. MetaCOG achieves this by learning a probabilistic model of the object detector's performance via Bayesian inference -- i.e., a meta-cognitive representation of the network's propensity to hallucinate or miss different object categories. Given a set of video frames processed by an object detector, MetaCOG performs joint inference over the underlying 3D scene and the detector's performance, grounding inference on a basic assumption of object permanence. Paired with three neural object detectors, we show that MetaCOG accurately recovers each detector's performance parameters and improves the overall system's accuracy. We additionally show that MetaCOG is robust to varying levels of error in object detector outputs, showing proof-of-concept for a novel approach to the problem of detecting and correcting errors in vision systems when ground-truth is not available.
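
To make the role of object permanence concrete, here is a minimal sketch of how known hallucination and miss rates could be used to decide whether a detected object is really there. It is not MetaCOG's inference procedure: it assumes a single object category, frames that are conditionally independent, an object that is either visible in every frame or in none, and rates strictly between 0 and 1. The function name `posterior_present` and its parameters are hypothetical.

```python
def posterior_present(k, n, halluc_rate, miss_rate, prior=0.5):
    """P(object present | k detections in n frames).

    Illustrative assumption: the object is present in all n frames or in
    none of them (a crude stand-in for object permanence).
    """
    if not 0 <= k <= n:
        raise ValueError("need 0 <= k <= n")
    # If present, each frame yields a detection with probability 1 - miss_rate.
    like_present = (1 - miss_rate) ** k * miss_rate ** (n - k)
    # If absent, every detection must be a hallucination.
    like_absent = halluc_rate ** k * (1 - halluc_rate) ** (n - k)
    # Bayes' rule; the binomial coefficient is the same in both terms and cancels.
    numerator = prior * like_present
    return numerator / (numerator + (1 - prior) * like_absent)


if __name__ == "__main__":
    # One detection in ten frames is more plausibly a hallucination.
    print(posterior_present(k=1, n=10, halluc_rate=0.1, miss_rate=0.2))
    # Eight detections in ten frames are more plausibly a real object.
    print(posterior_present(k=8, n=10, halluc_rate=0.1, miss_rate=0.2))
```

Under these assumptions, a detection that appears in only one frame is discounted, while an object missed in a couple of frames is still retained. This is the kind of correction that a learned model of the detector's errors makes possible.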

Paper Structure

This paper contains 58 sections, 7 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Conceptual schematic of MetaCOG. Information flows from left to right. From the left, images of a scene (described by a "World State") are processed by a neural "Object Detector," which produces detections (semantic labels with bounding boxes in the 2D image). The MetaCOG model takes detections as input (without any access to the underlying world state or the ground-truth accuracy of these detections), and jointly infers a meta-cognitive representation of the detector, and the objects present in the scene ("Inferred World State"). Specifically, MetaCOG's meta-cognitive representation consists of learned category-specific probabilities of the object detector generating hallucinations (detections of objects that were not actually there) and missing objects (failures to detect an object that was actually there). MetaCOG simultaneously infers this meta-cognitive representation and the world state (i.e., semantic labels and locations in 3D space).
  • Figure 2: Results for MetaCOG and comparison models. Throughout, yellow codes for RetinaNet, magenta for Faster R-CNN, and blue-grey for DETR. A) Scatterplot showing MetaCOG's inferred values for the hallucination rates (circles) and miss rates (triangles) against the ground-truth values. B) The MSE (averaged across categories) of MetaCOG's inferences about $\theta$ as a function of the number of videos observed. C) Comparisons between MetaCOG and the two baseline models on the test set (after conditioning on the meta-cognitive representation that MetaCOG inferred on the training set). The left group of bars shows the difference between MetaCOG and the NN's output, and the right group shows the difference between MetaCOG and the lesioned model. Positive values indicate that MetaCOG outperforms the comparison model.
  • Figure 3: The results of MetaCOG fine-tuning Faster R-CNN. Each point shows the average accuracy of a model on the test set, and the bars are bootstrapped 95% CIs. From left to right, the first point shows the accuracy of pre-trained, off-the-shelf Faster R-CNN. The second point shows the accuracy of MetaCOG when paired with Faster R-CNN. The difference between these two points is depicted by the magenta $\Delta$(MetaCOG,NN) bar in Fig. 2C. The third point shows the accuracy of Faster R-CNN after fine-tuning using MetaCOG's inferences. The rightmost point shows the accuracy of MetaCOG with inputs from fine-tuned Faster R-CNN. For exact accuracy values, see the retraining table in the "Training Faster R-CNN" section.
  • Figure 4: Lightweight MetaCOG's average performance over 40000 simulated detectors varying in faultiness, each processing 75 world states. A) MSE between true and inferred hallucination and miss rates as a function of the number of processed world states. The horizontal dotted line represents the average MSE for the mean of the prior (the $\theta$ used in Lesioned MetaCOG). B) Mean accuracy as a function of the number of processed world states. C) Average accuracy of MetaCOG and Lesioned MetaCOG, and D) the difference between them, as a function of faultiness ($\zeta$) in the detections.
  • Figure 5: Schematic depicting the forward generative model. $G_t$ is the prior over meta-cognition ($V_t$) at time $t$; $W_t$ is the world state; and $D_t$ are the detections that are generated. $W_t$, $D_t$, and $G_t$ collectively influence $G_{t+1}$, the prior over $V_{t+1}$.
  • ...and 2 more figures
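
The forward generative model in Figure 5 and the learning curves in Figures 2B and 4A can be illustrated with a toy simulation. The sketch below is a deliberate simplification, not the paper's model: it assumes at most one object per category per scene, Beta priors with Bernoulli hallucination and miss events, and, unlike MetaCOG, it observes the world state $W_t$ directly so that the update from $G_t$ to $G_{t+1}$ is a closed-form count. The names `simulate`, `p_present`, and `n_scenes` are hypothetical.

```python
import numpy as np


def simulate(true_halluc, true_miss, n_scenes=2000, p_present=0.5, seed=0):
    """Toy generative loop: sample W_t, generate D_t, update G_t to G_{t+1}."""
    rng = np.random.default_rng(seed)
    true_halluc = np.asarray(true_halluc, dtype=float)
    true_miss = np.asarray(true_miss, dtype=float)
    k = true_halluc.size
    # G_t: Beta(a, b) pseudo-counts per category, starting from a uniform prior.
    halluc_ab = np.ones((k, 2))
    miss_ab = np.ones((k, 2))
    for _ in range(n_scenes):
        # W_t: which categories are present in this scene.
        present = rng.random(k) < p_present
        # D_t: present objects are detected unless missed;
        # absent objects are "detected" only when hallucinated.
        detected = np.where(present,
                            rng.random(k) >= true_miss,
                            rng.random(k) < true_halluc)
        # G_{t+1}: conjugate Beta update (assumes W_t is observed).
        miss_ab[:, 0] += present & ~detected
        miss_ab[:, 1] += present & detected
        halluc_ab[:, 0] += ~present & detected
        halluc_ab[:, 1] += ~present & ~detected
    # Posterior means of the hallucination and miss rates.
    return (halluc_ab[:, 0] / halluc_ab.sum(axis=1),
            miss_ab[:, 0] / miss_ab.sum(axis=1))


if __name__ == "__main__":
    truth_h, truth_m = np.array([0.1, 0.3]), np.array([0.2, 0.4])
    for n in (10, 100, 2000):
        est_h, est_m = simulate(truth_h, truth_m, n_scenes=n)
        mse = np.mean(np.concatenate([est_h - truth_h, est_m - truth_m]) ** 2)
        print(n, mse)
```

The hard part of the actual problem is that $W_t$ is not observed, so the world state and the detector's error rates have to be inferred jointly. This sketch only shows how the prior over the error rates can sharpen as more scenes are processed.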