CogMorph: Cognitive Morphing Attacks for Text-to-Image Models
Zonglei Jing, Zonghao Ying, Le Wang, Siyuan Liang, Aishan Liu, Xianglong Liu, Dacheng Tao
TL;DR
CogMorph introduces a cognitive morphing framework that manipulates prompts to inject toxic contextual elements into text-to-image (T2I) outputs while preserving the core subjects. It combines Cognitive Toxicity Augmentation (via Retrieval-Augmented Generation) with Contextual Hierarchical Morphing (hierarchical prompt parsing and feature fusion), and is built on an imagery toxicity taxonomy of 10 major categories and 48 sub-categories together with a 5-dimension toxicity risk matrix. The approach is evaluated on multiple open-source T2I models and DALL·E-3, where it outperforms baselines in toxicity escalation (TESR, ATI) and jailbreak success. Human studies corroborate the increased harm, and the work also presents a robust, adaptable image-checking defense (A-VLIC). The work highlights a critical ethical risk in generative AI, provides a dataset of 1,176 prompts and 283 keywords, and suggests layered defenses and future directions for safer deployment of T2I systems.
Abstract
The development of text-to-image (T2I) generative models, which enable the creation of high-quality synthetic images from textual prompts, has opened new frontiers in creative design and content generation. However, this paper reveals a significant and previously unrecognized ethical risk inherent in this technology. It introduces a novel method, termed the Cognitive Morphing Attack (CogMorph), which manipulates T2I models to generate images that retain the original core subjects but embed toxic or harmful contextual elements. This nuanced manipulation exploits the cognitive principle that human perception of concepts is shaped by the entire visual scene and its context, producing images that amplify emotional harm far beyond attacks that merely preserve the original semantics. To study this risk, we first construct an imagery toxicity taxonomy spanning 10 major categories and 48 sub-categories, aligned with human cognitive-perceptual dimensions, and further build a toxicity risk matrix, resulting in 1,176 high-quality T2I toxic prompts. Based on this, CogMorph first introduces Cognitive Toxicity Augmentation, which builds a cognitive toxicity knowledge base of rich external representations of what humans perceive as toxic (e.g., fine-grained visual features); this knowledge base is then used to guide the optimization of adversarial prompts. In addition, we present Contextual Hierarchical Morphing, which hierarchically extracts critical parts of the original prompt (e.g., scenes, subjects, and body parts) and then iteratively retrieves and fuses toxic features to inject harmful contexts. Extensive experiments on multiple open-source T2I models and black-box commercial APIs (e.g., DALL·E-3) demonstrate the efficacy of CogMorph, which outperforms other baselines by large margins (+20.62% on average).
