Published 2019 | Version v1
Publication

Enhancing visual embeddings through weakly supervised captioning for zero-shot learning

Description

Visual features designed for image classification have been shown to be useful in zero-shot learning (ZSL) when generalizing to classes not seen during training. In this paper, we argue that a more effective way of building visual features for ZSL is to extract them through captioning, so as not just to classify an image but to describe it. However, modern captioning models rely on a massive level of supervision, e.g., up to 15 extended descriptions per instance provided by humans, which is simply not available for ZSL benchmarks. In the latter, in fact, the available annotations only indicate the presence or absence of attributes within a fixed list. Worse, attributes are seldom annotated at the image level but rather at the class level only; because of this, the annotation cannot be visually grounded. In this paper, we deal with such a weakly supervised regime to train an end-to-end LSTM captioner, whose backbone CNN image encoder can provide better features for ZSL. Our enhancement of visual features, called 'VisEn', is compatible with any generic ZSL method, without requiring changes in its pipeline (apart from adapting hyper-parameters). Experimentally, VisEn is capable of sharply improving recognition performance on unseen classes, as we demonstrate through an ablation study which encompasses different ZSL approaches. Further, on the challenging fine-grained CUB dataset, VisEn improves on state-of-the-art methods by a margin, while using visual descriptors one order of magnitude smaller.

Additional details

Created:
October 11, 2023
Modified:
November 28, 2023