Pre-training Visual Language Models: Harnessing the Power of Combining Caption and Classification Datasets
Pre-training visual language (VL) models on web-scale image-caption datasets has emerged as a powerful alternative to traditional pre-training on image classification data. However, directly combining these datasets for pre-training can result in biased representations that do not generalize well to downstream tasks. In this article, we present a pre-training strategy called “Prefix Conditioning” that uses both classification and caption datasets to provide complementary benefits and improve performance on zero-shot recognition tasks.
The Biases in Classification and Caption Datasets:
Classification datasets tend to be biased in two ways: limited scene types and restricted vocabulary. Caption datasets, on the other hand, contain a wider variety of scenes and vocabularies. Naively learning from both datasets entangles their biases with the learned representations, leading to decreased generalization in zero-shot classification.
Prefix Conditioning: Disentangling Dataset Biases
Prefix conditioning is a novel approach that disentangles dataset biases from visual concepts. It involves using prefix tokens to inform the model about the dataset type (classification or caption). During training, prefix tokens absorb the bias of the dataset, allowing the remaining tokens to focus on learning visual concepts. This disentanglement of bias improves the generalization in zero-shot classification.
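The mechanism can be illustrated with a toy sketch. The names and shapes below are illustrative assumptions, not the paper's implementation: a learned prefix embedding per dataset type is prepended to the text token embeddings before they enter the shared text encoder.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8

# Hypothetical learned prefix embeddings, one per dataset type; in the
# real model these are trained jointly with the text encoder.
prefix_embed = {
    "classification": rng.normal(size=DIM),
    "caption": rng.normal(size=DIM),
}

def condition_on_prefix(token_embeds, dataset_type):
    """Prepend the dataset-type prefix embedding to the text token
    embeddings, so the shared text encoder sees (1 + seq_len) tokens.
    During training the prefix absorbs dataset-specific bias, freeing
    the remaining tokens to encode visual concepts."""
    prefix = prefix_embed[dataset_type][None, :]           # (1, DIM)
    return np.concatenate([prefix, token_embeds], axis=0)  # (1 + seq_len, DIM)

tokens = rng.normal(size=(4, DIM))   # embeddings of 4 text tokens
conditioned = condition_on_prefix(tokens, "caption")
print(conditioned.shape)             # (5, 8)
```

Because the prefix is the only input that differs between the two data sources, any dataset-specific signal has a dedicated place to accumulate, rather than leaking into the shared token representations.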
Application of Prefix Conditioning:
We apply prefix conditioning to two contrastive loss methods: CLIP and UniCL. The models trained with prefix conditioning show significant improvements in zero-shot classification accuracy compared to models trained on the ImageNet or Conceptual 12M dataset alone.
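For context, a minimal NumPy sketch of the CLIP-style symmetric contrastive objective that prefix conditioning plugs into (the temperature value and function name are illustrative assumptions):

```python
import numpy as np

def clip_contrastive_loss(img_feats, txt_feats, temperature=0.07):
    """Symmetric InfoNCE loss over L2-normalized image/text features,
    as in CLIP; matched pairs lie on the diagonal of the logit matrix."""
    img = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    txt = txt_feats / np.linalg.norm(txt_feats, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (batch, batch) similarity scores
    n = logits.shape[0]

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[np.arange(n), np.arange(n)].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

# Perfectly matched pairs give a near-zero loss.
feats = np.eye(4)
print(clip_contrastive_loss(feats, feats))
```

Under prefix conditioning, the text features fed into this loss are produced by the prefix-conditioned text encoder, with the prefix chosen according to each sample's source dataset.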
Impact of Test-Time Prefix:
Choosing the right prefix during test time has a significant impact on performance. Using the prefix tailored for the classification dataset improves classification accuracy, while using the prefix tailored for the image-caption dataset improves performance in zero-shot recognition.
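A toy sketch of how the test-time prefix enters zero-shot prediction. The `embed` and `encode_text` helpers below are hypothetical stand-ins for a trained encoder, not the paper's code; only the flow (prefix token prepended, then prompts scored against the image feature) mirrors the method:

```python
import numpy as np

DIM = 8

def embed(word):
    # Toy deterministic word embedding (stand-in for a trained encoder).
    seed = abs(hash(word)) % (2**32)
    return np.random.default_rng(seed).normal(size=DIM)

def encode_text(prompt, prefix_type):
    # Prefix-conditioned encoding: prepend a dataset-type token, then
    # mean-pool the token embeddings and L2-normalize the result.
    vecs = [embed(f"<{prefix_type}>")] + [embed(w) for w in prompt.split()]
    v = np.mean(vecs, axis=0)
    return v / np.linalg.norm(v)

def zero_shot_predict(image_feat, class_names, prefix_type):
    """Pick the class whose prefix-conditioned prompt embedding is most
    similar to the image feature."""
    prompts = [f"a photo of a {c}" for c in class_names]
    feats = np.stack([encode_text(p, prefix_type) for p in prompts])
    return class_names[int(np.argmax(feats @ image_feat))]
```

Swapping `prefix_type` changes every prompt embedding at once, which is why the choice of test-time prefix shifts accuracy across benchmarks.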
Robustness to Image Distribution Shift:
Prefix conditioning also improves robustness to image distribution shift. It achieves better performance on domains far from the classification dataset, indicating its effectiveness in generalization.
Conclusion and Future Work:
Prefix conditioning is a promising technique for unifying image-caption and classification datasets for better zero-shot classification. However, identifying the optimal prefix for each test dataset remains a challenge and an interesting direction for future research.
This research was conducted by Kuniaki Saito, Kihyuk Sohn, Xiang Zhang, Chun-Liang Li, Chen-Yu Lee, Kate Saenko, and Tomas Pfister. Special thanks to Zizhao Zhang and Sergey Ioffe for their valuable feedback.