Social Media Ready Caption Generation for Brands
Himanshu Maheshwari, Koustava Goswami, Apoorv Saxena, Balaji Vasan Srinivasan
TL;DR
This work introduces a two-stage pipeline for generating social media captions that align with a specific brand personality. It first produces a plain image caption with a vision-language model (BLIP-2), then refines it into a catchy, brand-aligned caption using an LLM, either a fine-tuned FlanT5-XL or GPT-3.5-turbo. The pipeline also supports user-specified hashtags, handles, URLs, and named entities. The authors create an Instagram-based dataset across five brand personalities and evaluate image grounding with CLIPScore and personality alignment with G-Eval, showing that a selective prompting strategy and a fine-tuned model can outperform end-to-end multimodal baselines. The approach offers practical benefits for brands seeking personality-consistent captions while preserving privacy and enabling attribute-rich captions, though it is limited by dataset noise and the high cost of GPT-based evaluation. Overall, the paper lays a foundation for brand-aware caption generation and highlights dataset quality and efficiency as important directions for multimodal branding applications.
Abstract
Social media advertisements are key for brand marketing, aiming to attract consumers with captivating captions and pictures or logos. While previous research has focused on generating captions for general images, incorporating brand personalities into social media captioning remains unexplored. Brand personalities have been shown to affect consumers' behaviours and social interactions and are thus a key aspect of marketing strategies. Current open-source multimodal LLMs are not directly suited for this task. Hence, we propose a pipeline solution to assist brands in creating engaging social media captions that align with the image and the brand personalities. Our architecture has two parts: (a) the first part contains an image captioning model that takes in an image that the brand wants to post online and gives a plain English caption; (b) the second part takes in the generated caption along with the target brand personality and outputs a catchy personality-aligned social media caption. Along with brand personality, our system also gives users the flexibility to provide hashtags, Instagram handles, URLs, and named entities they want the caption to contain, making the captions more semantically related to the social media handles. Comparative evaluations against various baselines demonstrate the effectiveness of our approach, both qualitatively and quantitatively.
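
The two-part architecture described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `CaptionRequest`, `build_prompt`, and `generate_caption`, the prompt wording, and the example personality label are all hypothetical, and the two models are passed in as plain callables so the sketch stays independent of any particular captioning model or LLM API.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class CaptionRequest:
    """User-supplied constraints for the second stage (hypothetical structure)."""
    personality: str
    hashtags: List[str] = field(default_factory=list)
    handles: List[str] = field(default_factory=list)
    urls: List[str] = field(default_factory=list)
    entities: List[str] = field(default_factory=list)


def build_prompt(plain_caption: str, request: CaptionRequest) -> str:
    """Assemble a text prompt for the second-stage language model.

    The wording is an assumption made for illustration only.
    """
    lines = [
        "Rewrite the image description as a catchy social media caption.",
        f"Image description: {plain_caption}",
        f"Brand personality: {request.personality}",
    ]
    # Only mention the optional attributes the user actually supplied.
    optional = [
        ("Hashtags to include", request.hashtags),
        ("Handles to mention", request.handles),
        ("URLs to include", request.urls),
        ("Named entities to mention", request.entities),
    ]
    for label, values in optional:
        if values:
            lines.append(f"{label}: {', '.join(values)}")
    return "\n".join(lines)


def generate_caption(
    image: Any,
    request: CaptionRequest,
    captioner: Callable[[Any], str],
    rewriter: Callable[[str], str],
) -> str:
    """Run both stages: image -> plain caption -> personality-aligned caption."""
    plain_caption = captioner(image)   # stage (a): image captioning model
    prompt = build_prompt(plain_caption, request)
    return rewriter(prompt)            # stage (b): LLM rewrite
```

In practice, `captioner` would wrap an image captioning model and `rewriter` a fine-tuned or hosted LLM. Keeping them as injected callables means the second stage only ever sees text, which mirrors the separation between the two parts of the architecture.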
