PixelBytes: Catching Unified Representation for Multimodal Generation

Fabien Furfaro

TL;DR

This report evaluates models in terms of data reduction strategies and autoregressive learning, focusing on Long Short-Term Memory (LSTM) networks in predictive and autoregressive modes. The results indicate that autoregressive models outperform predictive models in this context.

Abstract

This report presents PixelBytes, an approach for unified multimodal representation learning. Drawing inspiration from sequence models like Image Transformers, PixelCNN, and Mamba-Bytes, we explore integrating text, audio, action-state, and pixelated images (sprites) into a cohesive representation. We conducted experiments on a PixelBytes Pokemon dataset and an Optimal-Control dataset. Our investigation covered various model architectures, including Recurrent Neural Networks (RNNs), State Space Models (SSMs), and Attention-based models, with a focus on bidirectional processing and our PxBy embedding technique. We evaluated models based on data reduction strategies and autoregressive learning, specifically examining Long Short-Term Memory (LSTM) networks in predictive and autoregressive modes. Our results indicate that autoregressive models perform better than predictive models in this context. Additionally, we found that diffusion models can be applied to control problems and parallelized generation. PixelBytes aims to contribute to the development of foundation models for multimodal data processing and generation. The project's code, models, and datasets are available online.

Paper Structure (24 sections, 3 equations, 4 figures, 4 tables, 4 algorithms)

Figures (4)

  • Figure 1: Overview of the PixelBytes approach: (Left) Scanning process reads different modalities at specific positions. (Center) Example of our dataset and autoregressive generation across modalities. (Right) Generation example, with PxBy embedding window displayed below. Note: The model currently shows some inaccuracies in image generation and may produce incorrect words, indicating areas for future improvement.
  • Figure 2: Training and validation metrics for RNN, Transformer, and SSM models over 200 epochs.
  • Figure 3: Generation results for the 1st Generation Starter Pokémon using the autoregressive model with a temperature of 0.1. Reference images and generated images are paired sequentially.
  • Figure 4: Generation approach for control problem. Left: generation result for linear problem. Right: generation step with setpoint and targeted diffusion action following input mask.
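
The caption of Figure 3 mentions sampling with a temperature of 0.1. As a minimal sketch of what low-temperature sampling generally does (not the authors' implementation), the snippet below divides the logits by the temperature before the softmax, which concentrates probability mass on the most likely token. The function names `softmax_with_temperature` and `sample_token` and the example logits are illustrative assumptions, not taken from the PixelBytes code.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Return softmax probabilities of `logits` scaled by 1 / temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum for numerical stability before exponentiating.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature=1.0, rng=None):
    """Draw one token index from the temperature-scaled distribution."""
    rng = rng or random.Random()
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical logits over a three-token vocabulary (illustration only).
example_logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(example_logits, temperature=0.1)
flat = softmax_with_temperature(example_logits, temperature=1.0)
token = sample_token(example_logits, temperature=0.1, rng=random.Random(0))
```

With a temperature of 0.1 the distribution is nearly deterministic, so generation stays close to the most likely continuation. A temperature of 1.0 leaves the model's original distribution unchanged.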