thenlper/gte-large

$0.010 / 1M tokens

The GTE models are trained by Alibaba DAMO Academy. They are based mainly on the BERT framework and currently come in three sizes: GTE-large, GTE-base, and GTE-small. The models are trained on a large-scale corpus of relevance text pairs covering a wide range of domains and scenarios. This lets them be applied to various downstream text-embedding tasks, including information retrieval, semantic textual similarity, and text reranking.

Input

inputs

Settings

ServiceTier

The service tier used for processing the request. When set to 'priority', the request will be processed with higher priority.

Normalize

Whether to normalize the computed embeddings.
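
As a rough illustration of what normalization typically means for embeddings, the sketch below scales a vector to unit Euclidean (L2) length. The page does not say which norm the service applies, so L2 is an assumption here, and `l2_normalize` is a hypothetical helper rather than part of any API.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (assumed meaning of 'normalize')."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        # An all-zero vector has no direction, so return it unchanged.
        return list(vector)
    return [x / norm for x in vector]

unit = l2_normalize([3.0, 4.0])
print(unit)  # [0.6, 0.8]
```

With unit-length embeddings, the dot product of two vectors equals their cosine similarity, which is why many retrieval setups normalize before indexing.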

Dimensions

The number of dimensions in the embedding. If not provided, the model's default is used. If the value is larger than the model's default, the embedding is padded with zeros. (Default: empty, 32 ≤ dimensions ≤ 8192)
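
The zero-padding behaviour described above can be sketched as follows. This is a minimal illustration, not the service's implementation: `pad_embedding` is a hypothetical helper, appending the zeros at the end of the vector is an assumption, and the handling of values smaller than the model's default is not described on this page, so the sketch simply returns the input unchanged in that case.

```python
def pad_embedding(embedding, dimensions):
    """Pad an embedding with trailing zeros up to `dimensions` (assumed placement)."""
    if not 32 <= dimensions <= 8192:
        raise ValueError("dimensions must be between 32 and 8192")
    if dimensions <= len(embedding):
        # Behaviour for smaller values is not documented here; assumption: no change.
        return list(embedding)
    return list(embedding) + [0.0] * (dimensions - len(embedding))

native = [0.1] * 1024  # 1024 is the dimension listed for gte-large in the metrics
padded = pad_embedding(native, 2048)
print(len(padded))  # 2048
```

Zero padding leaves dot products and vector norms unchanged, so cosine similarity between two padded embeddings is the same as between the originals.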

Custom Instruction

Custom instruction prepended to each input. If empty, no instruction is used. (Default: empty)
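
A minimal sketch of the prepending step, assuming the instruction is concatenated directly in front of each input string. The exact separator the service uses is not stated, and both `apply_instruction` and the sample instruction text are hypothetical.

```python
def apply_instruction(inputs, instruction=""):
    """Prepend `instruction` to every input; an empty instruction changes nothing."""
    if not instruction:
        return list(inputs)
    # Assumption: plain concatenation with no separator added automatically.
    return [instruction + text for text in inputs]

texts = ["what is a text embedding?", "how does reranking work?"]
prepared = apply_instruction(texts, "Represent this sentence: ")
print(prepared[0])  # Represent this sentence: what is a text embedding?
```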

Output

[
  [
    0,
    0.5,
    1
  ],
  [
    1,
    0.5,
    0
  ]
]
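
The output is a list with one embedding per input. A common next step is to compare two embeddings with cosine similarity; the sketch below does this for the two sample vectors shown above. `cosine_similarity` is a hypothetical helper that assumes equal-length, non-zero vectors.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

embeddings = [[0, 0.5, 1], [1, 0.5, 0]]  # the sample output from this page
score = cosine_similarity(embeddings[0], embeddings[1])
print(round(score, 4))  # 0.2
```

Here the dot product is 0.25 and each vector has squared length 1.25, giving 0.25 / 1.25 = 0.2.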
Model Information

gte-large

General Text Embeddings (GTE) model, introduced in the paper "Towards General Text Embeddings with Multi-stage Contrastive Learning".

Metrics

We compared the performance of the GTE models with other popular text embedding models on the MTEB benchmark. For more detailed comparison results, please refer to the MTEB leaderboard.

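| Model Name | Model Size (GB) | Dimension | Sequence Length | Average (56) | Clustering (11) | Pair Classification (3) | Reranking (4) | Retrieval (15) | STS (10) | Summarization (1) | Classification (12) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| gte-large | 0.67 | 1024 | 512 | 63.13 | 46.84 | 85.00 | 59.13 | 52.22 | 83.35 | 31.66 | 73.33 |
| gte-base | 0.22 | 768 | 512 | 62.39 | 46.2 | 84.57 | 58.61 | 51.14 | 82.3 | 31.17 | 73.01 |
| e5-large-v2 | 1.34 | 1024 | 512 | 62.25 | 44.49 | 86.03 | 56.61 | 50.56 | 82.05 | 30.19 | 75.24 |
| e5-base-v2 | 0.44 | 768 | 512 | 61.5 | 43.80 | 85.73 | 55.91 | 50.29 | 81.05 | 30.28 | 73.84 |
| gte-small | 0.07 | 384 | 512 | 61.36 | 44.89 | 83.54 | 57.7 | 49.46 | 82.07 | 30.42 | 72.31 |
| text-embedding-ada-002 | - | 1536 | 8192 | 60.99 | 45.9 | 84.89 | 56.32 | 49.25 | 80.97 | 30.8 | 70.93 |
| e5-small-v2 | 0.13 | 384 | 512 | 59.93 | 39.92 | 84.67 | 54.32 | 49.04 | 80.39 | 31.16 | 72.94 |
| sentence-t5-xxl | 9.73 | 768 | 512 | 59.51 | 43.72 | 85.06 | 56.42 | 42.24 | 82.63 | 30.08 | 73.42 |
| all-mpnet-base-v2 | 0.44 | 768 | 514 | 57.78 | 43.69 | 83.04 | 59.36 | 43.81 | 80.28 | 27.49 | 65.07 |
| sgpt-bloom-7b1-msmarco | 28.27 | 4096 | 2048 | 57.59 | 38.93 | 81.9 | 55.65 | 48.22 | 77.74 | 33.6 | 66.19 |
| all-MiniLM-L12-v2 | 0.13 | 384 | 512 | 56.53 | 41.81 | 82.41 | 58.44 | 42.69 | 79.8 | 27.9 | 63.21 |
| all-MiniLM-L6-v2 | 0.09 | 384 | 512 | 56.26 | 42.35 | 82.37 | 58.04 | 41.95 | 78.9 | 30.81 | 63.05 |
| contriever-base-msmarco | 0.44 | 768 | 512 | 56.00 | 41.1 | 82.54 | 53.14 | 41.88 | 76.51 | 30.36 | 66.68 |
| sentence-t5-base | 0.22 | 768 | 512 | 55.27 | 40.21 | 85.18 | 53.09 | 33.63 | 81.14 | 31.39 | 69.81 |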

Limitation

This model supports English text only, and inputs longer than 512 tokens are truncated to the first 512 tokens.
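
The truncation limit can be illustrated with a short sketch over a list of token IDs. It assumes truncation keeps the leading tokens and drops the rest, and it uses integer IDs as a stand-in for the output of the model's real tokenizer. Whether special tokens count toward the 512 limit is not stated here, and `truncate_tokens` is a hypothetical helper.

```python
MAX_TOKENS = 512  # limit stated on this page

def truncate_tokens(token_ids, max_tokens=MAX_TOKENS):
    """Keep at most `max_tokens` leading tokens (assumed truncation side)."""
    return list(token_ids[:max_tokens])

token_ids = list(range(600))  # stand-in for a tokenized long document
kept = truncate_tokens(token_ids)
print(len(kept), len(token_ids) - len(kept))  # 512 88
```

Text beyond the limit does not influence the embedding, so long documents are usually split into chunks that are embedded separately.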

Citation

If you find our paper or models helpful, please consider citing them as follows:

@misc{li2023general,
      title={Towards General Text Embeddings with Multi-stage Contrastive Learning}, 
      author={Zehan Li and Xin Zhang and Yanzhao Zhang and Dingkun Long and Pengjun Xie and Meishan Zhang},
      year={2023},
      eprint={2308.03281},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}