sentence-transformers/clip-ViT-B-32

$0.005 / 1M tokens

The CLIP model maps text and images to a shared vector space, enabling applications such as image search, zero-shot image classification, and image clustering. Once installed, the model can be used directly, and its performance is reported as zero-shot accuracy on the ImageNet validation set. Multilingual versions of the model are also available for 50+ languages.

Public
77

Model Information

clip-ViT-B-32

This is the image and text model CLIP, which maps text and images to a shared vector space. For applications of the model, see our documentation: SBERT.net - Image Search.
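Because text and images share one vector space, an image can be compared with candidate captions by cosine similarity. The following is a minimal sketch under stated assumptions, not an official recipe: `cosine_similarity`, `rank_captions`, and `embed_with_clip` are hypothetical helper names, and `embed_with_clip` assumes the `sentence-transformers` and `Pillow` packages are installed and that the model can be loaded under the name `clip-ViT-B-32`.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_captions(
    image_emb: Sequence[float], caption_embs: Sequence[Sequence[float]]
) -> List[Tuple[int, float]]:
    """Return (caption index, similarity) pairs, best match first."""
    scores = [
        (i, cosine_similarity(image_emb, emb))
        for i, emb in enumerate(caption_embs)
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


def embed_with_clip(image_path: str, captions: List[str]):
    """Hypothetical helper: embed one image and several captions.

    Assumption: sentence-transformers and Pillow are installed. They are
    imported lazily so the ranking helpers above work without them.
    """
    from PIL import Image
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("clip-ViT-B-32")
    image_emb = model.encode(Image.open(image_path))
    caption_embs = model.encode(captions)
    return image_emb.tolist(), caption_embs.tolist()
```

With real embeddings, `rank_captions(*embed_with_clip(path, captions))` would list the captions in order of similarity to the image; the first entry is the zero-shot prediction.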

Performance

The following table reports zero-shot top-1 accuracy on the ImageNet validation set:

Model            Top-1 accuracy (%)
clip-ViT-B-32    63.3
clip-ViT-B-16    68.1
clip-ViT-L-14    75.4
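For readers unfamiliar with the metric, top-1 accuracy is the percentage of images whose highest-scoring class is the correct one. The sketch below shows that standard definition only; the page does not describe the evaluation protocol behind the numbers above, and `top1_accuracy` and its inputs are hypothetical names chosen for illustration.

```python
from typing import Sequence


def top1_accuracy(scores: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Percentage of rows whose highest score sits at the true class index.

    scores[i][c] is the similarity between image i and the text prompt for
    class c; labels[i] is the index of the true class of image i.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one example is required")

    correct = 0
    for row, label in zip(scores, labels):
        # Index of the best-scoring class for this image.
        predicted = max(range(len(row)), key=row.__getitem__)
        if predicted == label:
            correct += 1
    return 100.0 * correct / len(labels)
```

In a zero-shot setting, each row of `scores` would hold the similarities between one image embedding and the embeddings of one text prompt per class.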

For a multilingual version of the CLIP model covering 50+ languages, see clip-ViT-B-32-multilingual-v1.