Image captioning for Brazilian Portuguese using GRIT model
Rafael Silva de Alencar, William Alberto Cruz Castañeda, Marcellus Amadeus
TL;DR
This work tackles image captioning for Brazilian Portuguese by adapting GRIT, a Grid- and Region-based Image captioning Transformer, to a translated COCO dataset. The model fuses grid features $V_{L_b} \in \mathbb{R}^{d \times d_{L_b}}$ with region features in an autoregressive caption generator that uses sinusoidal positional embeddings and $L_c$ layers. Initial experiments with one epoch of training yield Portuguese captions with BLEU = $0.758$, METEOR = $0.268$, ROUGE-L = $0.557$, and CIDEr = $1.100$, which approach the corresponding English baselines and highlight the semantics of translated captions as a challenge. The work opens a path toward multilingual captioning, with vocabulary-free setups (vicap branch) and broader Portuguese datasets as future directions for scalable image captioning in non-English languages.
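As an illustration of the sinusoidal positional embeddings mentioned above, the following is a minimal sketch using only the Python standard library. It assumes the standard formulation from the original Transformer (sine on even dimensions, cosine on odd dimensions, base 10000); the exact variant used in the GRIT caption generator is not stated here and may differ. The function name `sinusoidal_positional_embeddings` and its parameters are hypothetical.

```python
import math


def sinusoidal_positional_embeddings(num_positions, d_model):
    """Return a num_positions x d_model table of fixed positional embeddings.

    Assumption: standard Transformer formulation, not necessarily the
    exact variant used by GRIT.
    """
    if d_model % 2 != 0:
        raise ValueError("this sketch assumes an even d_model")

    table = []
    for pos in range(num_positions):
        row = []
        # Each pair of dimensions (i, i + 1) shares one frequency.
        for i in range(0, d_model, 2):
            angle = pos / (10000.0 ** (i / d_model))
            row.append(math.sin(angle))  # even dimension
            row.append(math.cos(angle))  # odd dimension
        table.append(row)
    return table


# Hypothetical usage: embeddings for a 4-token caption prefix with d_model = 8.
pe = sinusoidal_positional_embeddings(4, 8)
```

In an autoregressive generator, each row would typically be added to the embedding of the token at that position before the first decoder layer, so the model can distinguish word order without learned position parameters.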
Abstract
This work presents the early development of an image captioning model for Brazilian Portuguese. We use GRIT (Grid- and Region-based Image captioning Transformer), a Transformer-only neural architecture that effectively combines two kinds of visual features, grid features and region features, to generate better captions. GRIT was proposed as a more efficient approach to image captioning. In this work, we adapt GRIT to be trained on a Brazilian Portuguese dataset, yielding an image captioning method for the Brazilian Portuguese language.
