SPARLING: Learning Latent Representations with Extremely Sparse Activations

Kavi Gupta; Osbert Bastani; Armando Solar-Lezama

TL;DR

The paper proves the Motif Identifiability Theorem: under certain assumptions, motifs can be precisely identified exclusively by reducing end-to-end error. Because identifiability is assumed for a latent intermediate representation rather than for parameters, these representations can be arbitrarily complex functions of the input.

Abstract

Real-world processes often contain intermediate state that can be modeled as an extremely sparse activation tensor. In this work, we analyze the identifiability of such sparse and local latent intermediate variables, which we call motifs. We prove our Motif Identifiability Theorem, stating that under certain assumptions it is possible to precisely identify these motifs exclusively by reducing end-to-end error. Notably, we do not assume identifiability of parameters, but rather of a latent intermediate representation output by a local model, thus allowing these representations to be arbitrarily complex functions of the input. Additionally, we provide the Sparling algorithm, which uses a new kind of informational bottleneck that enforces levels of activation sparsity unachievable using other techniques. We confirm empirically that extreme sparsity is necessary to achieve good intermediate state modeling. On synthetic domains, we are able to precisely localize the intermediate states up to feature permutation with > 90% accuracy, even though we only train end-to-end.
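To make "activation sparsity" concrete, the following is a minimal sketch of one simple way to force a fixed fraction of activations to be nonzero: a top-fraction threshold. It is an illustrative assumption, not the paper's Spatial Sparsity Layer or Adaptive Sparsity Algorithm; the names `sparsify`, `density`, and `density_of` are hypothetical.

```python
def sparsify(activations, density):
    """Keep roughly the top `density` fraction of activations; zero the rest.

    Hypothetical helper for illustration only. `activations` is a flat list
    of floats and `density` is the target fraction of nonzero entries.
    """
    if not 0.0 < density <= 1.0:
        raise ValueError("density must be in (0, 1]")
    n = len(activations)
    if n == 0:
        return []
    # Number of entries allowed to stay nonzero (at least one).
    keep = min(n, max(1, round(density * n)))
    ranked = sorted(activations, reverse=True)
    # Entries strictly above the threshold survive; ties at the threshold are dropped.
    threshold = ranked[keep] if keep < n else float("-inf")
    return [a if a > threshold else 0.0 for a in activations]


def density_of(activations):
    """Fraction of entries that are nonzero."""
    if not activations:
        return 0.0
    return sum(1 for a in activations if a != 0.0) / len(activations)


example = [0.1, 0.9, 0.3, 0.7, 0.2, 0.05, 0.6, 0.4, 0.8, 0.0]
sparse = sparsify(example, density=0.2)
print(sparse, density_of(sparse))
```

With a density of 0.2 on ten values, only the two largest entries survive. An "extremely sparse" bottleneck in the sense of the abstract would correspond to a much smaller density applied to a much larger activation tensor.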

Paper Structure (65 sections, 110 equations, 12 figures, 2 tables, 1 algorithm)

Introduction
Preliminaries
Motif Identifiability Theorem
Theorem Statement
Error Metrics
Assumptions
Non-Overlapping
Motif-Sufficiency
$\alpha$-Motif-Necessity
Proof Sketch
Methods
Spatial Sparsity Layer
Adaptive Sparsity Algorithm
Experiments
Experimental Setup
...and 50 more sections

Figures (12)

Figure 1: (a) Example of the DigitCircle domain, alongside (b) a cartoon of the splicing problem. The input $x$ is mapped by the ground-truth function $g^*$ to the motif map $m^*$, which gives the positions of every digit (DigitCircle) or protein binding site (splicing). The motif map is in turn mapped by the ground-truth function $h^*$ to the output $y^*$: the sequence 072634 (DigitCircle) or the splice sites (splicing). Only $x$ and $y^*$ are available during training; the goal is to reconstruct $g^*$ and $h^*$. Note that in splicing, unlike DigitCircle, the motifs can overlap. The $M$ dots indicate the representation described in Section \ref{sec:problem-setup}, which is a one-hot encoding at each location.
Figure 2: Two examples of inputs (images), outputs (sequences in titles), and our $\hat{g}$ predictions for seed=1 (colored dots) for DigitCircle, LaTeX-OCR, and AudioMnistSequence. For LaTeX-OCR, we provide the output twice, first as the sequence of commands generated by the network and second as the translation of those commands into LaTeX. We place a dot for every maximal motif, colored/labeled by the channel that it appears in (e.g., the 0th channel is A or #00, 1st is B or #01, etc.). Stars indicate sites where non-maximal motifs are present as well.
Figure 3: Motif error across three different metrics. Bar height depicts the mean across 9 seeds, individual dots represent individual seeds, and the error bars represent a 95% bootstrap CI. AudioMnistSequence has an FPE of exactly 0. The high FNE on LaTeX-OCR arises because fraction bars, parentheses, and plus signs are not recognized in all cases, since the output can be inferred without them. For a comparison of our technique to less-sparse models, see Figure \ref{fig:error-vs-sparsity}.
Figure 4: Motif and end-to-end error metrics versus $\delta$. Note that the $x$ axis is a reversed log-scale, since the adaptive sparsity algorithm starts with high density and narrows it exponentially.
Figure 5: Retrained tends to perform as well as or slightly worse than Non-Sparse, making up most of the gap from Sparling. The apparent improvement from Non-Sparse to Retrained should not be interpreted as real: the numerical difference is tiny and the sample accuracies overlap.
...and 7 more figures
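The Figure 3 caption reports a mean across 9 seeds with a 95% bootstrap CI. As a rough sketch of how such an interval can be computed, the following uses the percentile bootstrap on the mean. The function name `bootstrap_ci`, its parameters, and the per-seed values are all hypothetical placeholders; the caption does not say which bootstrap variant the authors used, and the numbers are not results from the paper.

```python
import random
import statistics


def bootstrap_ci(values, n_resamples=10_000, level=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    tail = (1.0 - level) / 2.0
    lo = means[int(tail * n_resamples)]
    hi = means[min(n_resamples - 1, int((1.0 - tail) * n_resamples))]
    return lo, hi


# Made-up per-seed error values (9 seeds), for illustration only.
per_seed_error = [0.02, 0.05, 0.03, 0.04, 0.06, 0.01, 0.03, 0.05, 0.04]
low, high = bootstrap_ci(per_seed_error)
print(f"mean={statistics.fmean(per_seed_error):.4f}, 95% CI=({low:.4f}, {high:.4f})")
```

With only 9 seeds, the interval is driven by a small number of distinct resamples, so it is best read as a rough indication of seed-to-seed variability.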
