Multimodal Query-guided Object Localization

Aditay Tripathi, Rajath R Dani, Anand Mishra, Anirban Chakraborty

TL;DR

A novel multimodal approach to object localization is proposed that combines sketch queries with linguistic category definitions to better represent visual and semantic cues. The approach outperforms related baselines in both open- and closed-set settings.

Abstract

Consider a scenario in one-shot query-guided object localization where neither an image of the object nor the object category name is available as a query. In such a scenario, a hand-drawn sketch of the object could be a choice for a query. However, hand-drawn crude sketches alone, when used as queries, might be ambiguous for object localization, e.g., a sketch of a laptop could be confused for a sofa. On the other hand, a linguistic definition of the category, e.g., "a small portable computer small enough to use in your lap", along with the sketch query, gives better visual and semantic cues for object localization. In this work, we present a multimodal query-guided object localization approach under the challenging open-set setting. In particular, we use queries from two modalities, namely, a hand-drawn sketch and a description of the object (also known as a gloss), to perform object localization. Multimodal query-guided object localization is a challenging task, especially when a large domain gap exists between the queries and the natural images, as well as due to the challenge of combining the complementary and minimal information present across the queries. For example, hand-drawn crude sketches contain abstract shape information of an object, while the text descriptions often capture partial semantic information about a given object category. To address the aforementioned challenges, we present a novel cross-modal attention scheme that guides the region proposal network to generate object proposals relevant to the input queries and a novel orthogonal projection-based proposal scoring technique that scores each proposal with respect to the queries, thereby yielding the final localization results. ...

Paper Structure (33 sections, 12 equations, 4 figures, 6 tables)

Figures (4)

  • Figure 1: Given an image and a query, our aim is to localize the object in the image (a laptop in this example). A hand-drawn sketch of a laptop alone, when used as a query, might be ambiguous for object localization as it could be confused for a sofa. On the other hand, descriptions obtained from different modalities, such as a category label, e.g., "laptop", or a linguistic definition of the category, e.g., "a small portable computer small enough to use in your lap", along with the sketch query, give better visual and semantic cues for object localization.
  • Figure 2: Given an image and queries of different modalities, our object localization framework works in two stages. (i) Query-guided proposal generation: the global fused feature vector of the queries (blue) is scored against the image feature vector at each location of the image feature map (pink) to generate spatial compatibility scores, also called attention scores (Block 1). Next, these attention scores (violet) are multiplied with the image feature maps (pink) to obtain the attention features (Block 2). Before being passed through the region proposal network (RPN), the attention features are concatenated with the original feature maps and projected to the original dimension. The RPN is able to generate relevant object proposals because the spatial compatibility between the global fused query representation and the regional image representations is integrated into the image feature maps (Block 3). (ii) Orthogonal projection-based proposal scoring: the representation of each pooled object proposal (indigo) is scored against the query feature vectors from multiple modalities to localize the object of interest (Block 5). The proposal vector is projected onto the subspace spanned by the queries, and the projection vector is then used to query against the proposal vector. [Best viewed in color].
  • Figure 3: Localization results are shown for three cases: only the sketch query (third column), only the gloss query (fourth column), and both sketch and gloss queries (fifth column). The results are shown for the open-set setting, i.e., these categories are unseen during training. The first two columns show the sketch and gloss queries. We observe that the gloss brings semantics to the model and thereby enables it to perform better than sketch-only localization. The last two rows show some failure cases.
  • Figure 4: Multi-target localization results are shown for the case when both sketch and gloss queries are available. The results are shown for the open-set setting. The first column shows the sketch queries, and the second column shows the corresponding gloss queries using two different colors, i.e., blue and purple. The corresponding localizations in the target image are shown using the same colors as the gloss queries. [Best viewed in color].
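
The two stages described in the Figure 2 caption can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the sigmoid applied to the attention scores, the single linear projection matrix, and the cosine similarity used for the final score are all assumptions made for illustration, and the RPN and feature extractors are left out entirely.

```python
import numpy as np


def query_guided_attention(feature_map, fused_query, proj_weight):
    """Hypothetical sketch of Blocks 1-3 (query-guided feature modulation).

    feature_map: (H, W, C) image features
    fused_query: (C,) global fused query vector (sketch + gloss)
    proj_weight: (2C, C) projection back to the original dimension
    """
    # Block 1: compatibility of every spatial location with the fused query.
    logits = feature_map @ fused_query              # (H, W)
    attention = 1.0 / (1.0 + np.exp(-logits))       # sigmoid is an assumption

    # Block 2: weight the image features by the attention scores.
    attended = feature_map * attention[..., None]   # (H, W, C)

    # Concatenate with the original features and project back to C channels.
    stacked = np.concatenate([feature_map, attended], axis=-1)  # (H, W, 2C)
    return stacked @ proj_weight, attention


def projection_score(proposal, queries, eps=1e-12):
    """Hypothetical sketch of orthogonal projection-based scoring.

    proposal: (D,) pooled proposal vector
    queries:  (K, D) one row per query modality (e.g. sketch, gloss)
    """
    # The least-squares solution gives the orthogonal projection of the
    # proposal onto the subspace spanned by the query vectors.
    coeffs, *_ = np.linalg.lstsq(queries.T, proposal, rcond=None)
    projection = queries.T @ coeffs

    # Compare projection and proposal; cosine similarity is an assumption.
    denom = np.linalg.norm(projection) * np.linalg.norm(proposal)
    if denom < eps:
        return 0.0
    return float(projection @ proposal / denom)
```

Under these assumptions, a proposal that lies in the span of the queries scores close to 1 and a proposal orthogonal to every query scores 0. For example, with queries (1, 0, 0) and (0, 1, 0), the proposal (3, 0, 4) projects to (3, 0, 0) and receives a score of 0.6.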