ContextCLIP: Contextual Alignment of Image-Text Pairs on CLIP Visual Representations (ICVGIP 2022)

State-of-the-art empirical work has shown that visual representations learned by deep neural networks are robust and capable of performing classification tasks on diverse datasets. For example, CLIP demonstrated zero-shot transfer performance on multiple datasets for classification tasks in a joint embedding space of image-text pairs. However, it showed negative transfer performance on standard datasets such as Birdsnap, RESISC45, and MNIST. In this paper, we propose ContextCLIP, a contextual and contrastive learning framework for the contextual alignment of image-text pairs on CLIP representations, aimed at improving zero-shot transfer on the datasets where CLIP showed negative transfer. Combining contextual learning with a contrastive objective was observed to improve image-text alignment by bringing text and image representations closer in the joint embedding space. ContextCLIP learns robust visual representations, shows good qualitative performance on text-to-image retrieval, and enhances classification accuracy. Quantitative results are reported for zero-shot transfer and fine-tuning experiments on the CIFAR-10, CIFAR-100, Birdsnap, RESISC45, and MNIST datasets.
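
ContextCLIP builds on CLIP's contrastive alignment of image and text embeddings in a joint space. As a rough illustration of that underlying contrastive objective (a minimal sketch of the CLIP-style symmetric InfoNCE loss, not the exact ContextCLIP training objective; all tensor shapes and names below are illustrative assumptions), a PyTorch version might look like this:

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss over a batch of paired
    image/text embeddings, as used in CLIP-style joint-embedding training.

    image_emb, text_emb: (batch, dim) tensors; row i of each is a matched pair.
    """
    # L2-normalize so the dot product is cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity logits, scaled by temperature.
    logits = image_emb @ text_emb.t() / temperature

    # Matched image-text pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (image-to-text and text-to-image).
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2


if __name__ == "__main__":
    # Toy batch of 8 image/text embedding pairs with 512 dimensions.
    img = torch.randn(8, 512)
    txt = torch.randn(8, 512)
    print(clip_style_contrastive_loss(img, txt))
```

Pulling matched pairs together and pushing mismatched pairs apart in this joint space is what enables zero-shot classification and text-to-image retrieval; ContextCLIP's contextual alignment is applied on top of this kind of objective.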

The project page will be updated.