CLIPSwarm: Generating Drone Shows from Text Prompts with Vision-Language Models
Pablo Pueyo, Eduardo Montijano, Ana C. Murillo, Mac Schwager
TL;DR
CLIPSwarm generates drone swarm formations from natural language by leveraging a vision-language model with prompt enrichment. It iteratively optimizes a 2D formation to maximize the CLIP similarity $CS(t, I_{f,\alpha})$ between an enriched prompt $t$ and contour images $I_{f,\alpha}$ of candidate formations, where the images are rendered as colored alpha-shapes. The best 2D contour is then mapped to 3D drone trajectories, using the Hungarian algorithm for assignment and ORCA for collision avoidance, so that the resulting drone shows can be executed in photorealistic simulation. The approach demonstrates autonomous, language-driven formation design without retraining foundation models and suggests avenues for handling more complex shapes and full 3D optimization.
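The assignment step mentioned above can be illustrated with a small sketch. It is not the authors' implementation: the function name `assign_drones`, the coordinates, and the Euclidean cost are assumptions made for illustration. For brevity it finds the minimum-cost matching by exhaustive search over permutations, which returns the same optimum the Hungarian algorithm would but only scales to a handful of drones. A polynomial-time solver such as SciPy's `linear_sum_assignment` would be the practical choice. Collision avoidance (ORCA) is not modeled here.

```python
import itertools
import math


def assign_drones(starts, targets):
    """Match each drone to one target so the summed travel distance is minimal.

    Exhaustive search over all permutations: a stand-in for the Hungarian
    algorithm that yields the same optimal matching for small swarms.
    Returns (assignment, total_cost), where assignment[i] is the index of
    the target given to drone i.
    """
    n = len(starts)
    best_perm, best_cost = None, math.inf
    for perm in itertools.permutations(range(n)):
        cost = sum(math.dist(starts[i], targets[perm[i]]) for i in range(n))
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return list(best_perm), best_cost


# Hypothetical example: three drones on the ground, three goal positions
# of a formation lifted to an altitude of 5 units.
starts = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
targets = [(4.0, 1.0, 5.0), (0.0, 5.0, 5.0), (0.0, 1.0, 5.0)]

assignment, cost = assign_drones(starts, targets)
print(assignment, round(cost, 3))
```

In this toy case each drone is sent to the goal closest to it, because those closest goals happen to be distinct. In general the optimal matching trades off individual distances against one another.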
Abstract
This paper introduces CLIPSwarm, a new algorithm designed to automate the modeling of drone swarm formations from natural language. The algorithm first enriches a provided word into a text prompt, which serves as input to an iterative search for the formation that best matches that word. The search refines formations of robots to align with the textual description, using separate steps for "exploration" and "exploitation". Our framework is currently evaluated on simple formation targets, limited to contour shapes. A formation is represented visually by alpha-shape contours, and the most representative color for the input word is found automatically. To measure the similarity between the description and the visual representation of the formation, we use CLIP [1], which encodes text and images into vectors whose similarity can then be assessed. The algorithm then rearranges the formation to represent the word more effectively, within the constraint of the number of available drones. Control actions are then assigned to the drones, ensuring robotic behavior and collision-free movement. Experimental results demonstrate the system's efficacy in accurately modeling robot formations from natural language descriptions. The algorithm's versatility is showcased through the execution of drone shows with varying shapes in photorealistic simulation. We refer the reader to the supplementary video for a visual reference of the results.
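To make the iterative refinement concrete, the following is a minimal sketch of a score-driven search with a coarse "exploration" phase followed by a fine "exploitation" phase. Everything in it is an assumption for illustration: the abstract does not specify the update rules, so a simple hill climber with two step sizes is used instead, and `toy_score` replaces CLIP with the cosine similarity between the flattened formation and a fixed, hypothetical target vector. In the actual system the score would come from comparing the CLIP embeddings of the enriched prompt and of the rendered alpha-shape image.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors, the usual CLIP-style score."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical stand-in for the text embedding: corners of a square.
TARGET = [(1.0, 1.0), (-1.0, 1.0), (-1.0, -1.0), (1.0, -1.0)]


def toy_score(formation):
    """Assumed scoring function; NOT CLIP. Higher means a better match."""
    flat_formation = [c for point in formation for c in point]
    flat_target = [c for point in TARGET for c in point]
    return cosine_similarity(flat_formation, flat_target)


def optimize_formation(score, n_drones, iterations=200, seed=0):
    """Hill climbing with large steps first (exploration), small steps later
    (exploitation). A candidate is kept only if it improves the score."""
    rng = random.Random(seed)
    best = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n_drones)]
    best_score = score(best)
    history = [best_score]
    for step in range(iterations):
        sigma = 0.5 if step < iterations // 2 else 0.05
        candidate = [(x + rng.gauss(0, sigma), y + rng.gauss(0, sigma))
                     for x, y in best]
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
        history.append(best_score)
    return best, best_score, history


best, best_score, history = optimize_formation(toy_score, n_drones=4)
print(f"initial score {history[0]:.3f}, final score {best_score:.3f}")
```

Because a candidate is accepted only when it scores higher, the recorded score never decreases. The two step sizes are one simple way to trade broad search against local refinement, chosen here for clarity rather than taken from the paper.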
