The CLIP model maps text and images to a shared vector space, enabling applications such as image search, zero-shot image classification, and image clustering. Its zero-shot ImageNet validation accuracy is reported below, and multilingual versions covering 50+ languages are also available.
This is the Image & Text model CLIP, which maps text and images to a shared vector space. For applications of the model, see the documentation at SBERT.net - Image Search.
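As a minimal sketch of how the shared vector space can be used, the snippet below (assuming the sentence-transformers and Pillow packages; the image file name is a placeholder) encodes one image and a few candidate captions with clip-ViT-B-32 and compares them with cosine similarity:

```python
from sentence_transformers import SentenceTransformer, util
from PIL import Image

# Load the CLIP model through sentence-transformers
model = SentenceTransformer('clip-ViT-B-32')

# Encode an image (placeholder file name) and a few candidate captions
img_emb = model.encode(Image.open('two_dogs_in_snow.jpg'))
text_emb = model.encode([
    'Two dogs in the snow',
    'A cat on a table',
    'A picture of London at night',
])

# Cosine similarity between the image embedding and each caption embedding
print(util.cos_sim(img_emb, text_emb))
```

The same pattern scales to image search: precompute embeddings for an image collection and rank them by cosine similarity against a text query.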
The following table reports the zero-shot ImageNet validation set accuracy:
| Model | Top-1 Accuracy (%) |
|---|---|
| clip-ViT-B-32 | 63.3 |
| clip-ViT-B-16 | 68.1 |
| clip-ViT-L-14 | 75.4 |
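Zero-shot classification follows the same embedding pattern: encode the candidate class names as text prompts and pick the label whose embedding is closest to the image embedding. The sketch below uses a small, hypothetical label set and a placeholder image file; it is not the full ImageNet evaluation.

```python
from sentence_transformers import SentenceTransformer, util
from PIL import Image

model = SentenceTransformer('clip-ViT-B-32')

# A handful of hypothetical class labels; full ImageNet evaluation uses all 1,000 classes
labels = ['dog', 'cat', 'airplane', 'pizza']
label_emb = model.encode([f'a photo of a {label}' for label in labels])

# Encode the query image (placeholder file name) and pick the most similar label
img_emb = model.encode(Image.open('query.jpg'))
scores = util.cos_sim(img_emb, label_emb)[0]
print(labels[int(scores.argmax())])
```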
For a multilingual version of the CLIP model supporting 50+ languages, see clip-ViT-B-32-multilingual-v1.
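A minimal sketch of combining the two models, assuming the multilingual model encodes text into the same vector space as the clip-ViT-B-32 image encoder (the file name and captions are placeholders):

```python
from sentence_transformers import SentenceTransformer, util
from PIL import Image

# Images go through the original CLIP model ...
img_model = SentenceTransformer('clip-ViT-B-32')
# ... while text in 50+ languages goes through the multilingual text encoder
text_model = SentenceTransformer('clip-ViT-B-32-multilingual-v1')

img_emb = img_model.encode(Image.open('two_dogs_in_snow.jpg'))
text_emb = text_model.encode(['Zwei Hunde im Schnee', 'Deux chiens dans la neige'])

# Similarities between the image and the non-English captions
print(util.cos_sim(img_emb, text_emb))
```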