MAL: Cluster-Masked and Multi-Task Pretraining for Enhanced xLSTM Vision Performance

Wenjun Huang, Jianguo Hu

TL;DR

MAL addresses scaling and feature extraction limitations of LSTM-based vision models by introducing cluster-masked masking and autoregressive pretraining for xLSTM, paired with a universal encoder-decoder multitask framework. It enables image autoregression, depth estimation, and segmentation within a single pretraining scheme, improving representation quality while maintaining architectural consistency for fine-tuning. Empirical results on ImageNet-1K and ADE20K demonstrate state-of-the-art performance and robust generalization, outperforming traditional supervised and other Vision-LSTM approaches. This work highlights the viability and efficiency of combining autoregressive and multitask learning for scalable, adaptable visual representation learning.

Abstract

Long Short-Term Memory (LSTM) networks have traditionally faced challenges in scaling and in effectively capturing complex dependencies in visual tasks. The xLSTM architecture has emerged to address these limitations, incorporating exponential gating and a parallel matrix memory structure to enhance performance and scalability. Despite these advancements, the potential of xLSTM in visual computing has not been fully realized, particularly in leveraging autoregressive techniques for improved feature extraction. In this paper, we introduce MAL (Cluster-Masked and Multi-Task Pretraining for Enhanced xLSTM Vision Performance), a novel framework that enhances xLSTM's capabilities through new pretraining strategies. We propose a cluster-masked masking method that significantly improves local feature capture and optimizes image scanning efficiency. Additionally, our universal encoder-decoder pretraining approach integrates multiple tasks, including image autoregression, depth estimation, and image segmentation, thereby enhancing the model's adaptability and robustness across diverse visual tasks. Our experimental results demonstrate that MAL surpasses traditional supervised models and fully leverages the scaling potential of xLSTM, setting a new benchmark in visual task performance.

Paper Structure

This paper contains 38 sections, 9 equations, 4 figures, 8 tables.

Figures (4)

  • Figure 1: Overall architecture.
  • Figure 2: Different prediction units in the autoregressive modeling.
  • Figure 3: Different prediction orderings of a visual sentence.
  • Figure 4: pretrain.