MANTA -- Model Adapter Native generations that's Affordable

Ansh Chaurasia

TL;DR

MANTA addresses the model-adapter composition problem under consumer hardware and cost constraints by introducing a retrieval-driven four-stage pipeline that jointly selects checkpoints and adapters while enabling prompt-driven diversity. The approach uses Structured Concept Development and Detail Enhancement to decompose prompts into task-specific concepts, followed by checkpoint/document retrieval with a triplet-loss-inspired mechanism, and ends with output refinement. Empirical evaluations on COCO 2014 show MANTA delivering strong gains in image diversity ($\text{Diversity}$) and quality ($\text{Quality}$) with a modest decline in alignment, achieving up to a 94% diversity win rate and an 80% quality win rate against the best prior system, while reducing LLM token usage by roughly 40x. The work demonstrates practical potential for synthetic data generation and creative AI applications, offering a scalable, open-path workflow with consumer-friendly hardware profiles and emphasis on reproducibility.

Abstract

Prevailing model generation algorithms rely on simple, inflexible adapter selection to provide personalized results. We pose the model-adapter composition problem as a generalization of past work that factors in practical hardware and affordability constraints, and introduce MANTA as a new approach to it. Experiments on the COCO 2014 validation set show MANTA to be superior in image task diversity and quality at the cost of a modest drop in alignment. Our system achieves a $94\%$ win rate in task diversity and an $80\%$ win rate in task quality versus the best known system, and demonstrates strong potential for direct use in synthetic data generation and the creative art domains.
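
The win rates quoted above are pairwise-comparison statistics. As a rough illustration of how such a number can be computed, here is a minimal sketch. It assumes each comparison yields the name of the preferred system or "tie", and that ties are excluded from the denominator. Both are assumptions made for this example and not necessarily the paper's protocol. The function name `win_rate`, the label "baseline", and the sample judgments are hypothetical.

```python
from typing import Sequence


def win_rate(judgments: Sequence[str], system: str = "MANTA") -> float:
    """Fraction of decided pairwise comparisons won by `system`.

    Each judgment is the name of the preferred system, or "tie".
    Ties are dropped from the denominator (an assumption for this
    sketch, not necessarily the paper's evaluation protocol).
    """
    decided = [j for j in judgments if j != "tie"]
    if not decided:
        raise ValueError("no decided comparisons")
    wins = sum(1 for j in decided if j == system)
    return wins / len(decided)


# Hypothetical judgments, for illustration only (not data from the paper).
example = ["MANTA", "MANTA", "baseline", "tie", "MANTA"]
print(f"{win_rate(example):.2f}")  # prints 0.75
```

Under this convention, a 94% diversity win rate would mean MANTA's output was preferred in 94 of every 100 decided comparisons against the baseline.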

Paper Structure

This paper contains 52 sections, 1 equation, 18 figures, 2 tables.

Figures (18)

  • Figure 1: Examples of images generated via Stylus
  • Figure 2: Example of a "low image diversity" generation (source: Stylus). Most of the synthetically generated cars look extremely similar and generic, and all have muted backgrounds.
  • Figure 3: Example of a low alignment output from Stylus. Prompt: A stop sign that has the picture of George Bush in place of the letter O.
  • Figure 4: MANTA algorithm. The system consists of four stages: concept development, checkpoint selection, adapter selection, and refinement. The output refinement stage currently acts as a simple pass-through, but it serves as the place to insert alignment mechanisms.
  • Figure 5: Overview of the detail enhancement process. The prompt is decomposed into a main concept and a set of supporting concepts, and each concept is then processed individually by the LLM to generate additional details.
  • ...and 13 more figures