CLIPSwarm: Generating Drone Shows from Text Prompts with Vision-Language Models

Pablo Pueyo, Eduardo Montijano, Ana C. Murillo, Mac Schwager

TL;DR

CLIPSwarm generates drone swarm formations from natural language using a vision-language model with prompt enrichment. It iteratively optimizes a 2D formation to maximize the CLIP similarity $CS(t, I_{f,\alpha})$ between an enriched prompt $t$ and the contour image $I_{f,\alpha}$ of each candidate formation $f$, rendered as a colored alpha-shape with parameter $\alpha$. The best 2D formation is then mapped to 3D drone trajectories, with the Hungarian algorithm assigning drones to target positions and ORCA handling collision avoidance, and the resulting drone shows are run in photorealistic simulation. The approach shows that formations can be designed autonomously from language without retraining foundation models, and it points to future work on more complex shapes and full 3D optimization.
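To make the score $CS(t, I_{f,\alpha})$ concrete, here is a minimal sketch that assumes the similarity is the cosine similarity between a text embedding and an image embedding, as is typical for CLIP. The embeddings below are hypothetical 3-dimensional stand-ins rather than real CLIP outputs, and the function names `cosine_similarity` and `best_candidate` are illustrative, not taken from the paper.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("a zero vector has no direction")
    return dot / (norm_u * norm_v)

def best_candidate(text_embedding, image_embeddings):
    """Return (index, score) of the image embedding closest to the text."""
    scores = [cosine_similarity(text_embedding, e) for e in image_embeddings]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]

# Hypothetical embeddings (assumption): one prompt, three candidate images.
text = [1.0, 0.0, 1.0]
images = [[0.0, 1.0, 0.0], [2.0, 0.1, 1.9], [1.0, 1.0, 0.0]]
index, score = best_candidate(text, images)
```

In this toy setup the second candidate is selected because its embedding points in nearly the same direction as the text embedding; in the actual pipeline both vectors would come from the CLIP text and image encoders.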

Abstract

This paper introduces CLIPSwarm, a new algorithm designed to automate the modeling of swarm drone formations based on natural language. The algorithm begins by enriching a provided word to compose a text prompt, which serves as input to an iterative search for the formation that best matches the provided word. The algorithm iteratively refines formations of robots to align with the textual description, employing different steps for "exploration" and "exploitation". Our framework is currently evaluated on simple formation targets, limited to contour shapes. A formation is visually represented through alpha-shape contours, and the most representative color is automatically found for the input word. To measure the similarity between the description and the visual representation of the formation, we use CLIP [1], encoding text and images into vectors and assessing their similarity. Subsequently, the algorithm rearranges the formation to visually represent the word more effectively, within the given constraints of available drones. Control actions are then assigned to the drones, ensuring robotic behavior and collision-free movement. Experimental results demonstrate the system's efficacy in accurately modeling robot formations from natural language descriptions. The algorithm's versatility is showcased through the execution of drone shows in photorealistic simulation with varying shapes. We refer the reader to the supplementary video for a visual reference of the results.
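The iterative refinement with "exploration" and "exploitation" steps can be illustrated with a generic elitist random search. This is only a sketch under stated assumptions: the exact update rules of CLIPSwarm are not given in this section, `toy_score` is a hypothetical stand-in for the CLIP similarity (it rewards robots lying near a unit circle), and all parameter names and values are illustrative.

```python
import random

def toy_score(formation):
    """Stand-in for CLIP similarity (assumption): best when robots lie on a unit circle."""
    err = sum(abs((x * x + y * y) ** 0.5 - 1.0) for x, y in formation)
    return -err / len(formation)

def perturb(formation, scale, rng):
    """Return a copy of the formation with Gaussian noise added to every position."""
    return [(x + rng.gauss(0.0, scale), y + rng.gauss(0.0, scale))
            for x, y in formation]

def optimize(n_robots=10, pool_size=20, keep=5, iterations=50, seed=0):
    rng = random.Random(seed)
    # Initialization: formations sampled uniformly at random.
    pool = [[(rng.uniform(-2, 2), rng.uniform(-2, 2)) for _ in range(n_robots)]
            for _ in range(pool_size)]
    history = []
    for _ in range(iterations):
        # Evaluation: keep the best-scoring formations.
        pool.sort(key=toy_score, reverse=True)
        elite = pool[:keep]
        history.append(toy_score(elite[0]))
        # Update: refill the pool from the elite set.
        children = []
        while len(elite) + len(children) < pool_size:
            parent = rng.choice(elite)
            # Alternate large steps (exploration) and small steps (exploitation).
            scale = 0.5 if len(children) % 2 == 0 else 0.05
            children.append(perturb(parent, scale, rng))
        pool = elite + children
    best = max(pool, key=toy_score)
    return best, history

best, history = optimize()
```

Because the elite formations are carried over unchanged, the best score in `history` never decreases from one iteration to the next, which mirrors the idea of improving similarity across iterations.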

Paper Structure

This paper contains 16 sections, 8 equations, 9 figures, 1 table.

Figures (9)

  • Figure 1: Drone formation automatically crafted to match a given text. CLIPSwarm takes a single word describing a shape as input and determines automatically the color and positions of a robotic swarm formation that best fits the given text. The example illustrates the shape created by a formation of 30 robots. The drones move to positions that collectively form a shape corresponding to the word "Leaf". Left: graphical representation of the shape formed by the robot formation. Right: formation of drones as part of a show in a photorealistic simulation representing the input word.
  • Figure 2: CLIPSwarm algorithm diagram. A schematic summary of the platform, its modules, and their interactions. I. Input of the system, which is a word describing the desired formation. II. CLIPSwarm algorithm, including the three modules of the system. (A) Prompt enrichment, involving color selection and prompt engineering to enrich the input word and form a text. (B) Formation Optimization, incorporating the steps that select the formation that best describes the input text. (a) Initialization. A set of formations (each consisting of robot positions) is randomly sampled from a uniform distribution. Some predefined shapes are added as part of a 'warm start'. Evaluation. The formations are converted to images. Then, CLIP computes the similarity between the images and the input text, and the formations with the best similarities are selected. (b) Update. New formations are iteratively created employing an "exploration-exploitation" strategy, improving the CLIP similarity across iterations. (C) From shapes to drone show. The positions of the obtained formation are optimized through robot position selection and a navigation algorithm. III. Output of the system: drone positions for performing a drone show, in which the drones move and take the selected color to represent the shape described by the input word.
  • Figure 3: Influence of Alpha Value on Contour: Alpha-shape representation of a robot formation with varying alpha values. Different alpha values produce distinct representations of the contour. When the alpha value is zero, the contour is the convex hull. Larger alpha values result in a more finely detailed and intricate contour. Each formation has a maximum value of alpha ($\alpha_f$) for which all points still lie inside a single polygon and the contour has maximum concavity.
  • Figure 4: Predefined shapes. Columns 1-5 display predefined shapes along with random variations of them, which are added to the initialization pool as a 'warm start' during the Initialization stage. Column 6 shows some random samples from the initialization pool for comparison.
  • Figure 5: Postprocessing step to determine the position of the robots. The postprocessing step determines the positions of the robots to meaningfully represent the given shape. On the left are the end positions of the 30 robots calculated by the second step of the algorithm. On the right, the postprocessing step redistributes the same number of robots evenly to better represent the same shape.
  • ...and 4 more figures
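The even redistribution of robots described in the Figure 5 caption can be sketched as equal arc-length resampling of a closed contour. This assumes the robots are spread along the contour polygon, which the caption does not state explicitly; the function name `resample_closed_contour` and the square test contour are hypothetical.

```python
import math

def resample_closed_contour(vertices, n_points):
    """Place n_points at equal arc-length spacing along a closed polygon."""
    if len(vertices) < 2 or n_points < 1:
        raise ValueError("need at least 2 vertices and 1 point")
    # Edges of the closed polygon, including the one back to the first vertex.
    edges = list(zip(vertices, vertices[1:] + vertices[:1]))
    lengths = [math.dist(a, b) for a, b in edges]
    perimeter = sum(lengths)
    if perimeter == 0.0:
        raise ValueError("degenerate contour")
    step = perimeter / n_points
    points = []
    edge_index, edge_start = 0, 0.0  # arc length at the start of the current edge
    for k in range(n_points):
        target = k * step
        # Advance to the edge that contains the target arc length.
        while (edge_index < len(edges) - 1
               and target > edge_start + lengths[edge_index]):
            edge_start += lengths[edge_index]
            edge_index += 1
        (ax, ay), (bx, by) = edges[edge_index]
        length = lengths[edge_index]
        t = (target - edge_start) / length if length > 0 else 0.0
        # Linear interpolation along the current edge.
        points.append((ax + t * (bx - ax), ay + t * (by - ay)))
    return points

# Hypothetical contour: a unit square, with 8 robots to place on it.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
robots = resample_closed_contour(square, 8)
```

On the unit square with 8 robots, consecutive robots end up half a side length apart along the perimeter, one at each corner and one at each edge midpoint.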