Interpretable Multi-task Learning with Shared Variable Embeddings

Maciej Żelaszczyk, Jacek Mańdziuk

TL;DR

The paper tackles learning across multiple tasks with heterogeneous input/output spaces by introducing shared variable embeddings (SVE) that reuse a common embedding base through attention. The method enables predictions using a cross-attention mechanism where raw variable embeddings query a compact set of shared embeddings, followed by a shared encoder–decoder pipeline, trained end-to-end with a squared hinge loss. Key contributions include the SVE architecture, demonstrations that it matches vanilla variable embeddings in accuracy while offering interpretability benefits, and systematic ablations showing gains in training efficiency with sparse attention at some interpretability cost. The work advances interpretable multi-task learning on tabular data and suggests practical trade-offs between interpretability and performance, with potential extensions to other domains and self-supervised settings.
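The squared hinge loss mentioned above can be illustrated with a short sketch. This summary does not say how targets are encoded or how the loss is reduced over a batch, so the binary labels in {-1, +1}, the unit margin, the mean reduction, and the function name `squared_hinge_loss` are assumptions made only for illustration.

```python
import numpy as np


def squared_hinge_loss(scores, targets):
    """Mean squared hinge loss for real-valued scores.

    Assumptions (not taken from the paper): targets are encoded as -1 or +1,
    the margin is 1, and the loss is averaged over the batch.
    """
    scores = np.asarray(scores, dtype=float)
    targets = np.asarray(targets, dtype=float)
    # Hinge term: zero once the score is on the correct side with margin >= 1.
    margins = np.maximum(0.0, 1.0 - targets * scores)
    # Squaring penalizes large violations more strongly than the plain hinge.
    return float(np.mean(margins ** 2))
```

Compared with the plain hinge loss, the squared variant is differentiable at the margin boundary, which may be one reason it is convenient for end-to-end gradient training.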

Abstract

This paper proposes a general interpretable predictive system with shared information. The system is able to perform predictions in a multi-task setting where distinct tasks are not bound to have the same input/output structure. Embeddings of input and output variables in a common space are obtained, where the input embeddings are produced through attending to a set of shared embeddings, reused across tasks. All the embeddings are treated as model parameters and learned. Specific restrictions on the space of shared embeddings and the sparsity of the attention mechanism are considered. Experiments show that the introduction of shared embeddings does not deteriorate the results obtained from a vanilla variable embeddings method. We run a number of further ablations. Inducing sparsity in the attention mechanism leads to both an increase in accuracy and a significant decrease in the number of training steps required. Shared embeddings provide a measure of interpretability in terms of both a qualitative assessment and the ability to map specific shared embeddings to pre-defined concepts that are not tailored to the considered model. There seems to be a trade-off between accuracy and interpretability. The basic shared embeddings method favors interpretability, whereas the sparse attention method promotes accuracy. The results lead to the conclusion that variable embedding methods may be extended with shared information to provide increased interpretability and accuracy.

Paper Structure (32 sections, 16 equations, 4 figures, 18 tables)


Figures (4)

  • Figure 1: The overview of the shared variable embeddings method. The variable space contains both the observed and target variables which are associated with their learnable variable embeddings (VEs). The observable variables are first linked to raw VEs which are used as queries in the attention mechanism. A separate set of shared VEs plays the role of both keys and values. The processed VEs are the output of attention. Together with the corresponding variable values they are processed, each (value, VE) pair separately, by the encoder. The outputs of the encoder are summed and passed to the initial decoder. The target variable of interest is directly linked to its VE and this VE is passed with the output of the initial decoder to the final decoder to actually perform the prediction of the value of the target variable of interest. Additional details of the architecture are available in Appendix \ref{app:architecture} (Figure \ref{fig:architecture}). The differences between our method and VQ-VAE (vandenOord2017) are highlighted in Appendix \ref{app:vq-vae}.
  • Figure 2: UCI-121 test set accuracy for a given train step (in thousands). SVE - shared embedding method, ENT - $1.05$-entmax with embeddings initialized from $\mathcal{N}(0, 1)$, SR - stable rank with $\alpha_{\text{sr}} = 0.05$.
  • Figure 3: (a) Training steps to reach best test set accuracy. (b) Stable rank of the shared embedding matrix after training - best accuracy model. SVE - shared embedding method, ENT - $1.05$-entmax with embeddings initialized from $\mathcal{N}(0, 1)$, SR - stable rank with $\alpha_{\text{sr}} = 0.05$, RAND - random embedding matrix with entries from $\mathcal{N}(0, 1)$.
  • Figure 4: Architecture of the shared variable embeddings method. SE - shared embeddings, Attn - attention, S - shared embedding matrix, FC - fully connected layers, FiLM - layers proposed by Perez2018, Drop - dropout, ReLU - rectified linear units.
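
The attention step in the Figure 1 caption and the stable rank reported in Figure 3 can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation: it uses plain scaled dot-product attention with a softmax and no learned query/key/value projections, whereas the paper also studies sparse entmax attention. The stable rank uses the standard definition, the squared Frobenius norm divided by the squared spectral norm. The names `attend_to_shared` and `stable_rank`, and all shapes, are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def attend_to_shared(raw_ve, shared):
    """Produce processed VEs by attending from raw VEs to shared VEs.

    raw_ve: (n_vars, d) raw variable embeddings, used as queries.
    shared: (n_shared, d) shared embeddings, used as both keys and values.
    Assumption: scaled dot-product scores with no learned projections.
    """
    scores = raw_ve @ shared.T / np.sqrt(shared.shape[1])
    weights = softmax(scores, axis=-1)  # (n_vars, n_shared), rows sum to 1
    processed = weights @ shared        # (n_vars, d)
    return processed, weights


def stable_rank(matrix):
    """Stable rank: squared Frobenius norm over squared spectral norm."""
    fro_sq = np.sum(matrix ** 2)
    spectral = np.linalg.norm(matrix, ord=2)  # largest singular value
    return float(fro_sq / spectral ** 2)


# Toy example: 5 observed variables, 3 shared embeddings, dimension 8.
rng = np.random.default_rng(0)
raw_ve = rng.normal(size=(5, 8))
shared = rng.normal(size=(3, 8))
processed, weights = attend_to_shared(raw_ve, shared)
```

Each row of `weights` shows how strongly one variable draws on each shared embedding, which is the quantity one would inspect for interpretability. The stable rank lies between 1 and the true rank of the matrix, so a low value suggests that the shared embeddings concentrate along a few directions.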