MANTA -- Model Adapter Native generations that's Affordable
Ansh Chaurasia
TL;DR
MANTA addresses the model-adapter composition problem under consumer hardware and cost constraints by introducing a retrieval-driven four-stage pipeline that jointly selects checkpoints and adapters while enabling prompt-driven diversity. The approach uses Structured Concept Development and Detail Enhancement to decompose prompts into task-specific concepts, followed by checkpoint/document retrieval with a triplet-loss-inspired mechanism, and ends with output refinement. Empirical evaluations on COCO 2014 show MANTA delivering strong gains in image diversity ($\text{Diversity}$) and quality ($\text{Quality}$) with a modest decline in alignment, achieving up to a 94% diversity win rate and an 80% quality win rate against the best prior system, while reducing LLM token usage by roughly 40x. The work demonstrates practical potential for synthetic data generation and creative AI applications, offering a scalable, open-path workflow with consumer-friendly hardware profiles and emphasis on reproducibility.
Abstract
The prevailing model generation algorithms rely on simple, inflexible adapter selection to provide personalized results. We propose the model-adapter composition problem as a generalization of past work that factors in practical hardware and affordability constraints, and introduce MANTA as a new approach to the problem. Experiments on the COCO 2014 validation set show MANTA to be superior in image task diversity and quality, at the cost of a modest drop in alignment. Our system achieves a $94\%$ win rate in task diversity and an $80\%$ win rate in task quality versus the best known system, and demonstrates strong potential for direct use in synthetic data generation and the creative art domains.
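For readers unfamiliar with the metric, the snippet below is a minimal sketch of how a pairwise win rate like the ones quoted above could be computed. It is not the paper's evaluation code: the function name `win_rate`, the label strings, the handling of ties (excluded from the denominator), and the toy judgments are all assumptions made for illustration.

```python
from typing import Sequence


def win_rate(preferences: Sequence[str], system: str = "MANTA") -> float:
    """Fraction of decided pairwise comparisons won by `system`.

    Each entry in `preferences` names the preferred system for one prompt,
    or "tie". Ties are dropped from the denominator (an assumption; the
    source does not say how ties are treated).
    """
    decided = [p for p in preferences if p != "tie"]
    if not decided:
        raise ValueError("no decided comparisons")
    return sum(p == system for p in decided) / len(decided)


# Toy judgments for four prompts (hypothetical, not data from the paper).
judgments = ["MANTA", "MANTA", "baseline", "MANTA"]
print(f"{win_rate(judgments):.0%}")
```

Under this definition, a 94% diversity win rate would mean MANTA's output was preferred in 94 of every 100 decided comparisons against the baseline.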
