MANTA -- Model Adapter Native generations that's Affordable

Ansh Chaurasia

TL;DR

MANTA addresses the model-adapter composition problem under consumer hardware and cost constraints by introducing a retrieval-driven four-stage pipeline that jointly selects checkpoints and adapters while enabling prompt-driven diversity. The approach uses Structured Concept Development and Detail Enhancement to decompose prompts into task-specific concepts, followed by checkpoint/document retrieval with a triplet-loss-inspired mechanism, and ends with output refinement. Empirical evaluations on COCO 2014 show MANTA delivering strong gains in image diversity ($\text{Diversity}$) and quality ($\text{Quality}$) with a modest decline in alignment, achieving up to a 94% diversity win rate and an 80% quality win rate against the best prior system, while reducing LLM token usage by roughly 40x. The work demonstrates practical potential for synthetic data generation and creative AI applications, offering a scalable, open-path workflow with consumer-friendly hardware profiles and emphasis on reproducibility.
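To make the four-stage flow concrete, here is a minimal Python sketch of its shape: concept development, checkpoint selection, adapter selection, and a pass-through refinement step. Everything in it is an illustrative assumption rather than the paper's implementation. The names `Plan`, `develop_concepts`, `retrieve`, `refine`, and `build_plan` are hypothetical, comma splitting stands in for the LLM-based concept stages, and a toy word-overlap score stands in for the triplet-loss-inspired retrieval.

```python
import re
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Plan:
    """Hypothetical container for what an image backend would need for one prompt."""
    prompt: str
    concepts: List[str]
    checkpoint: str
    adapters: List[str]


def words(text: str) -> set:
    # Lowercased word set, punctuation ignored.
    return set(re.findall(r"\w+", text.lower()))


def develop_concepts(prompt: str) -> List[str]:
    # Stage 1 stand-in: the paper uses an LLM here; this toy version
    # simply splits the prompt on commas.
    return [part.strip() for part in prompt.split(",") if part.strip()]


def retrieve(query: str, catalog: Dict[str, str], k: int = 1) -> List[str]:
    # Toy retrieval: rank catalog entries by word overlap with the query.
    # This is NOT the paper's triplet-loss-inspired mechanism.
    ranked = sorted(
        catalog,
        key=lambda name: len(words(query) & words(catalog[name])),
        reverse=True,
    )
    return ranked[:k]


def refine(plan: Plan) -> Plan:
    # Stage 4: a pass-through, mirroring how the refinement stage is
    # described for now; alignment mechanisms could be inserted here.
    return plan


def build_plan(prompt: str, checkpoints: Dict[str, str], adapters: Dict[str, str]) -> Plan:
    concepts = develop_concepts(prompt)                  # stage 1
    checkpoint = retrieve(prompt, checkpoints, k=1)[0]   # stage 2
    chosen: List[str] = []
    for concept in concepts:                             # stage 3
        for name in retrieve(concept, adapters, k=1):
            if name not in chosen:
                chosen.append(name)
    return refine(Plan(prompt, concepts, checkpoint, chosen))  # stage 4
```

Calling `build_plan` with a prompt and two small dictionaries that map made-up checkpoint and adapter names to short text descriptions returns a `Plan` holding one checkpoint and at most one adapter per concept. Selecting adapters per concept rather than per prompt is the design choice this sketch is meant to illustrate.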

Abstract

The prevailing model generation algorithms rely on simple, inflexible adapter selection to provide personalized results. We propose the model-adapter composition problem as a generalization of past work that factors in practical hardware and affordability constraints, and introduce MANTA as a new approach to the problem. Experiments on the COCO 2014 validation set show that MANTA is superior in image task diversity and quality, at the cost of a modest drop in alignment. Our system achieves a $94\%$ win rate in task diversity and an $80\%$ win rate in task quality versus the best known system, and demonstrates strong potential for direct use in synthetic data generation and the creative art domains.
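For readers unfamiliar with the metric, a win rate is conventionally the share of pairwise comparisons in which one system's output is preferred over the other's. The sketch below assumes that convention; the function name `win_rate`, the label strings, and the handling of ties (a tie simply counts as a non-win) are assumptions, since this summary does not specify the evaluation protocol.

```python
from typing import Sequence


def win_rate(winners: Sequence[str], system: str) -> float:
    """Share of pairwise comparisons in which `system` was the preferred output."""
    if not winners:
        raise ValueError("need at least one comparison")
    # Any label other than `system` (including a tie label) counts as a non-win.
    return sum(1 for w in winners if w == system) / len(winners)
```

With a toy list of four judgments in which three name the system under test, the function returns 0.75. The judgments are made up for illustration and are not data from the paper.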

Paper Structure (52 sections, 1 equation, 18 figures, 2 tables)


Introduction
Background
Core Research Problem
Past Work
Model Based Methods
Retrieval Based Methods
Retrieval Methods Limitations and Opportunities
Lack of Task Diversity
Low Alignment
Current Image Generation Workflow Challenges
Image Resolution
Alignment
Image Diversity
Consumer Friendliness
Related Works
...and 37 more sections

Figures (18)

Figure 1: Examples of images generated via Stylus
Figure 2: Example of a "low image diversity" generation, source Stylus. The majority of the cars synthetically generated look extremely similar and generic, and all have muted backgrounds.
Figure 3: Example of a low alignment output from Stylus. Prompt: A stop sign that has the picture of George Bush in place of the letter O.
Figure 4: MANTA algorithm. The system consists of four stages - concept development, checkpoint selection, adapter selection, and refinement. The output refinement procedure simply acts as a pass through for the time being, but serves as a location to insert alignment mechanisms.
Figure 5: Overview of the detail enhancement process. The prompt is analyzed into a main concept and a set of supporting concepts, and then each concept is individually processed through the LLM to come up with more details.
...and 13 more figures
