sentence-transformers/

clip-ViT-B-32-multilingual-v1

$0.005

/ 1M tokens

This model is a multilingual version of the OpenAI CLIP-ViT-B32 model, which maps text and images to a common dense vector space. It includes a text embedding model that works for 50+ languages and an image encoder from CLIP. The model was trained using Multilingual Knowledge Distillation, where a multilingual DistilBERT model was trained as a student model to align the vector space of the original CLIP image encoder across many languages.

Supports Priority Tier

Public

512

api versions

Input

inputs

You can add more items with the button on the right

You need to log in to use this model

Log In

Settings

ServiceTier

The service tier used for processing the request. 'priority' processes the request with higher priority (premium rate); 'flex' processes it at lower priority for a discount, served only when spare capacity exists and may be retried/timed out under load. Both apply only to models that support the respective tier.

Normalize

whether to normalize the computed embeddings

Dimensions

The number of dimensions in the embedding. If not provided, the model's default will be used.If provided bigger than model's default, the embedding will be padded with zeros. (Default: empty, 32 ≤ dimensions ≤ 8192)

Custom Instruction

Custom instruction prepending to each input. If empty, no instruction will be used.. (Default: empty)

Multimodal Inputs

Enter a JSON array

Service tier

Choose the tier these requests run on

Output

[
  [
    0,
    0.5,
    1
  ],
  [
    1,
    0.5,
    0
  ]
]

Model Information

sentence-transformers/clip-ViT-B-32-multilingual-v1

This is a multi-lingual version of the OpenAI CLIP-ViT-B32 model. You can map text (in 50+ languages) and images to a common dense vector space such that images and the matching texts are close. This model can be used for image search (users search through a large collection of images) and for multi-lingual zero-shot image classification (image labels are defined as text).

Multilingual Image Search - Demo

For a demo of multilingual image search, have a look at: Image_Search-multilingual.ipynb ( Colab version )

For more details on image search and zero-shot image classification, have a look at the documentation on SBERT.net.

Training

This model has been created using Multilingual Knowledge Distillation. As teacher model, we used the original clip-ViT-B-32 and then trained a multilingual DistilBERT model as student model. Using parallel data, the multilingual student model learns to align the teachers vector space across many languages. As a result, you get an text embedding model that works for 50+ languages.

The image encoder from CLIP is unchanged, i.e. you can use the original CLIP image encoder to encode images.

Have a look at the SBERT.net - Multilingual-Models documentation on more details and for training code.

We used the following 50+ languages to align the vector spaces: ar, bg, ca, cs, da, de, el, es, et, fa, fi, fr, fr-ca, gl, gu, he, hi, hr, hu, hy, id, it, ja, ka, ko, ku, lt, lv, mk, mn, mr, ms, my, nb, nl, pl, pt, pt, pt-br, ro, ru, sk, sl, sq, sr, sv, th, tr, uk, ur, vi, zh-cn, zh-tw.

The original multilingual DistilBERT supports 100+ lanugages. The model also work for these languages, but might not yield the best results.

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: DistilBertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Dense({'in_features': 768, 'out_features': 512, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
)

Citing & Authors

This model was trained by sentence-transformers.

If you find this model helpful, feel free to cite our publication Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks:

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "http://arxiv.org/abs/1908.10084",
}
copy