PixelBytes: Catching Unified Representation for Multimodal Generation

Fabien Furfaro

TL;DR

This report evaluates models based on data reduction strategies and autoregressive learning, focusing on Long Short-Term Memory (LSTM) networks in predictive and autoregressive modes. The results indicate that autoregressive models outperform predictive models in this context.

Abstract

This report presents PixelBytes, an approach for unified multimodal representation learning. Drawing inspiration from sequence models like Image Transformers, PixelCNN, and Mamba-Bytes, we explore integrating text, audio, action-state, and pixelated images (sprites) into a cohesive representation. We conducted experiments on a PixelBytes Pokemon dataset and an Optimal-Control dataset. Our investigation covered various model architectures, including Recurrent Neural Networks (RNNs), State Space Models (SSMs), and Attention-based models, with a focus on bidirectional processing and our PxBy embedding technique. We evaluated models based on data reduction strategies and autoregressive learning, specifically examining Long Short-Term Memory (LSTM) networks in predictive and autoregressive modes. Our results indicate that autoregressive models perform better than predictive models in this context. Additionally, we found that diffusion models can be applied to control problems and parallelized generation. PixelBytes aims to contribute to the development of foundation models for multimodal data processing and generation. The project's code, models, and datasets are available online.

Paper Structure (24 sections, 3 equations, 4 figures, 4 tables, 4 algorithms)

Introduction
Exploration for a Unified Representation
Hypothesis Testing Framework
Conceptual Multimodal Embedding
Dataset Construction
Embedding Techniques
Model Architectures Evaluated
Comparative Analysis
Generation Evaluation Metrics
Identified Challenges
Optimizing Unified Representation
Refined Embedding Approach
Dataset Construction
Enhanced Tokenization Strategy
Autoregressive Model Architecture
...and 9 more sections

Figures (4)

Figure 1: Overview of the PixelBytes approach: (Left) Scanning process reads different modalities at specific positions. (Center) Example of our dataset and autoregressive generation across modalities. (Right) Generation example, with PxBy embedding window displayed below. Note: The model currently shows some inaccuracies in image generation and may produce incorrect words, indicating areas for future improvement.
Figure 2: Training and validation metrics for RNN, Transformer, and SSM models over 200 epochs.
Figure 3: Generation results for the 1st Generation Starter Pokémon using the autoregressive model with a temperature of 0.1. Reference images and generated images are paired sequentially.
Figure 4: Generation approach for control problem. Left: generation result for linear problem. Right: generation step with setpoint and targeted diffusion action following input mask.
