Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones), enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset-specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.
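The pre-training task described above, predicting which caption goes with which image, is implemented as a symmetric contrastive objective over a batch of (image, text) pairs: each image's embedding is scored against every text embedding in the batch, and a cross-entropy loss is applied in both directions. A minimal PyTorch sketch of that loss (a simplification; function and argument names here are our own, and the temperature is fixed rather than learned):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of (image, text) pairs.

    image_emb, text_emb: [N, d] embeddings in the joint space; row i of
    each tensor is assumed to come from the same (image, text) pair.
    """
    # L2-normalize so the dot product below is cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # [N, N] matrix of temperature-scaled pairwise similarities.
    logits = image_emb @ text_emb.t() / temperature

    # The matching caption for image i sits on the diagonal.
    targets = torch.arange(image_emb.shape[0])

    # Cross-entropy in both directions: image->text and text->image.
    loss_i = F.cross_entropy(logits, targets)
    loss_t = F.cross_entropy(logits.t(), targets)
    return (loss_i + loss_t) / 2
```

Because the targets are just the diagonal of the similarity matrix, every incorrect pairing in the batch serves as a negative example, which is what makes the objective scale efficiently with batch size.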
Pre-training methods which learn directly from raw text have revolutionized NLP over the last few years. Task-agnostic o...
At the core of our approach is the idea of learning perception from supervision contained in natural language. We emphas...
Existing work has mainly used three datasets, MS-COCO (Lin et al., 2014), Visual Genome (Krishna et al., 2017), and YFCC...
State-of-the-art computer vision systems use very large amounts of compute. In the course of our efforts, we found train...
We consider two different architectures for the image encoder. For the first, we use ResNet-50 as the base architecture ...
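Both encoder families plug into the same interface: a feature backbone followed by a projection into the joint image-text embedding space. A hypothetical sketch of that pattern in PyTorch (a toy backbone stands in for the real trunk, and the paper's ResNet variant actually uses attention pooling rather than the plain linear head shown here):

```python
import torch
import torch.nn as nn

class ProjectedEncoder(nn.Module):
    """A feature backbone plus a linear projection into the joint
    embedding space shared by the image and text encoders."""
    def __init__(self, backbone, feat_dim, embed_dim):
        super().__init__()
        self.backbone = backbone
        self.proj = nn.Linear(feat_dim, embed_dim, bias=False)

    def forward(self, x):
        return self.proj(self.backbone(x))

# A stand-in MLP backbone for illustration; in practice this slot is
# filled by a ResNet-50 or Vision Transformer trunk.
toy_backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 64), nn.ReLU())
encoder = ProjectedEncoder(toy_backbone, feat_dim=64, embed_dim=32)
```

Keeping the projection separate from the backbone is what lets the two architecture families be swapped without changing the contrastive objective.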
We train a series of 5 ResNets and 3 Vision Transformers. For the ResNets we train a ResNet-50, a ResNet-101, and then 3...
In computer vision, zero-shot learning usually refers to the study of generalizing to unseen object categories in image ...
CLIP is pre-trained to predict if an image and a text snippet are paired together in its dataset. To perform zero-shot c...
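The zero-shot classifier built from this pre-training task works by embedding one text prompt per class (e.g. "a photo of a {label}") and treating the cosine similarity between the image embedding and each prompt embedding as a class logit. A minimal sketch of that scoring step, assuming the embeddings are already computed (in the released code they would come from `model.encode_image` and `model.encode_text`; the fixed temperature here is an illustration, as the model learns it):

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(image_emb, class_text_emb, temperature=0.07):
    """Score one image against one text embedding per class.

    image_emb:      [d] embedding of the query image.
    class_text_emb: [C, d] embeddings of per-class prompts such as
                    "a photo of a {label}".
    Returns a [C] vector of class probabilities.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    class_text_emb = F.normalize(class_text_emb, dim=-1)
    # Cosine similarity to each class prompt acts as the logit.
    logits = class_text_emb @ image_emb / temperature
    return logits.softmax(dim=-1)
```

Because the class set is specified entirely through natural-language prompts, the same frozen model can be pointed at a new dataset just by writing new prompt strings, with no dataset-specific training.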
In Table 1 we compare Visual N-Grams to CLIP. The best CLIP model improves accuracy on ImageNet from a proof of concept ...
Since task-agnostic zero-shot classifiers for computer vision have been understudied, CLIP provides a promising opportun...
While we have extensively analyzed the task-learning capabilities of CLIP through zero-shot transfer in the previous sec...
In 2015, it was announced that a deep learning model exceeded human performance on the ImageNet test set. However, resea...
How does CLIP compare to human performance and human learning? To get a better understanding of how well humans perform ...
A concern with pre-training on a very large internet dataset is unintentional overlap with downstream evals. This is imp...
There are still many limitations to CLIP. On datasets with training splits, the performance of zero-shot CLIP is on aver...
CLIP has a wide range of capabilities due to its ability to carry out arbitrary image classification tasks. CLIP also in...
Algorithmic decisions, training data, and choices about how classes are defined and taxonomized can all contribute to an...
We next sought to characterize model performance in relation to a downstream task for which there is significant societa...
This preliminary analysis is intended to illustrate some of the challenges that general purpose computer vision models p...
Any model that leverages written, spoken, signed or any other form of human language as part of its training signal is a...
We have investigated whether it is possible to transfer the success of task-agnostic web-scale pre-training in NLP to an...