OmniFusion Technical Report

Elizaveta Goncharova; Anton Razzhigaev; Matvey Mikhalchuk; Maxim Kurkin; Irina Abdullaeva; Matvey Skripkin; Ivan Oseledets; Denis Dimitrov; Andrey Kuznetsov

TL;DR

This paper tackles the challenge of integrating visual data into large language models by introducing OmniFusion, a pretrained LLM augmented with trainable visual adapters and multiple vision encoders. It systematically evaluates adapters (MLP vs transformer), image-encoding strategies (whole-image vs tiles), and encoder mixing, including high-resolution and document-domain enhancements, across eight visual-language benchmarks. The study demonstrates strong VQA performance and detailed domain-specific responses, achieving competitive results with open-source baselines and matching or approaching larger LLMs in several tasks. The work provides an open-source Mistral-based implementation with training and inference scripts, facilitating broader adoption and further multimodal research.

Abstract

Last year, multimodal architectures brought about a revolution in AI-based approaches and solutions, extending the capabilities of large language models (LLMs). We propose OmniFusion, a model based on a pretrained LLM and adapters for the visual modality. We evaluated and compared several architecture design principles for better coupling of text and visual data: MLP and transformer adapters, various CLIP ViT-based encoders (SigLIP, InternViT, etc.) and approaches to fusing them, the image encoding method (whole-image or tile encoding), and two 7B LLMs (a proprietary one and the open-source Mistral). Experiments on 8 visual-language benchmarks (VizWiz, POPE, MM-Vet, ScienceQA, MMBench, TextVQA, VQAv2, MMMU) show that the best OmniFusion setup achieves the top score on different VQA tasks in comparison with open-source LLaVA-like solutions. We also present a variety of situations in which OmniFusion provides highly detailed answers in different domains: housekeeping, sightseeing, culture, medicine, recognition of handwritten and scanned equations, etc. The Mistral-based OmniFusion model is an open-source solution with weights, training and inference scripts available at https://github.com/AIRI-Institute/OmniFusion.
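The abstract contrasts whole-image encoding with tile encoding. As a rough illustration of the difference, the sketch below computes the crop boxes an encoder would receive in each mode. The function names (`tile_boxes`, `encoding_plan`), the tile size of 448, and the choice to clip edge tiles rather than pad or resize them are all assumptions made for illustration; they are not taken from the OmniFusion code.

```python
from typing import List, Tuple

# (left, top, right, bottom); right and bottom are exclusive.
Box = Tuple[int, int, int, int]


def tile_boxes(width: int, height: int, tile: int) -> List[Box]:
    """Return non-overlapping crop boxes that cover a width x height image.

    Assumption: edge tiles are clipped to the image bounds. Padding or
    resizing the image to a multiple of the tile size would work as well.
    """
    if width <= 0 or height <= 0 or tile <= 0:
        raise ValueError("width, height and tile must be positive")
    boxes: List[Box] = []
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            right = min(left + tile, width)
            bottom = min(top + tile, height)
            boxes.append((left, top, right, bottom))
    return boxes


def encoding_plan(width: int, height: int, tile: int, use_tiles: bool) -> List[Box]:
    """List the regions passed to the vision encoder.

    Whole-image mode yields a single box covering the image;
    tile mode yields one box per tile (hypothetical helper).
    """
    if not use_tiles:
        return [(0, 0, width, height)]
    return tile_boxes(width, height, tile)


if __name__ == "__main__":
    # Hypothetical 896 x 448 input with 448-pixel tiles.
    print(encoding_plan(896, 448, 448, use_tiles=False))
    print(encoding_plan(896, 448, 448, use_tiles=True))
```

In tile mode each box would be cropped and encoded separately, so the number of visual tokens grows with the number of tiles; in whole-image mode the encoder sees one (typically downscaled) view of the image.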

Paper Structure (17 sections, 4 figures, 9 tables)

Introduction
OmniFusion
Model Architecture
Training pipeline
Stage 1: Pretraining.
Stage 2: Fine-tuning.
Training hyperparameters.
Experiments
Experimental setup
Vision encoders.
Mix of image encoders.
Scaling Images to HD.
Tuning with synthesized TeX formulas.
Evaluation on benchmarks
Main results.
...and 2 more sections

Figures (4)

Figure 1: Comparison of OmniFusion performance on the benchmarks and generation examples.
Figure 2: OmniFusion VQA examples.
Figure 3: OmniFusion architecture with feature merging (left) and with single adapter (right): MLP or transformer layer.
Figure 4: An example of LaTeX formula understanding by OmniFusion fine-tuned with the Texify vision encoder. The upper image shows the input image; the lower image shows the compiled LaTeX code generated by the model.
