
Reranking with cross-encoders

In this guide we will set up Metarank as a simple inference server for cross-encoder LLMs (Large Language Models). In other words, we will use an open-source cross-encoder model to reorder your search results in a zero-shot manner, without collecting any visitor feedback data. We will use a pre-trained MS-MARCO MiniLM-L6-v2 cross-encoder from the sentence-transformers package.

What are cross-encoders?

A cross-encoder LLM is a way to leverage the semantic power of neural networks to reorder the top-N documents matching your query. You can think of it as asking ChatGPT the following question:
For a search query "crocs", reorder the following documents in the decreasing relevance order:
1. Crocs Jibbitz 5-Pack Alien Shoe Charms
2. Crocs Specialist Vent Work
3. Crocs Kids' Baya Clog
For typical reranking scenarios, cross-encoders (even in zero-shot mode) are much more precise than bi-encoders: a cross-encoder reads the query and the document together, so it can capture token-level interactions between them, while a bi-encoder embeds each side independently.
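If you want to try a cross-encoder locally before wiring up Metarank, a minimal sketch with the sentence-transformers package might look like this (cross-encoder/ms-marco-MiniLM-L-6-v2 is the Hugging Face checkpoint of the same MS-MARCO MiniLM-L6-v2 model referenced in this guide):
from sentence_transformers import CrossEncoder

# load the pre-trained MS-MARCO cross-encoder
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# each pair is scored jointly: the model sees query and document together
scores = model.predict([
    ("crocs", "Crocs Jibbitz 5-Pack Alien Shoe Charms"),
    ("crocs", "Crocs Specialist Vent Work"),
    ("crocs", "Crocs Kids' Baya Clog"),
])
print(scores)  # one relevance score per (query, document) pair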

Initial setup

Let's imagine you already have a traditional search engine running (like Elasticsearch, OpenSearch or Solr) with a good recall level: it retrieves all the relevant products, but it sometimes struggles with precision - there can be false positives, and the ranking is not perfect.
In this guide we will take the top-N matching documents from your search engine and re-rank them according to their semantic similarity to the query.
(Figure: reranking flow)

Importing data

We will use the Amazon ESCI e-commerce dataset as a toy but realistic example: it has 1.7M real products that we can easily query with Elasticsearch. You can download the JSON-encoded version here: https://github.com/shuttie/esci-s.
Assuming that you have an Elasticsearch service running on http://localhost:9200, you can import the complete dataset with the following Python script:
import json
from elasticsearch import Elasticsearch

es = Elasticsearch(hosts="http://localhost:9200")

with open('esci.json', 'r') as f:
    for line in f:
        doc = json.loads(line.rstrip())
        if 'title' in doc:
            # index only the title and asin fields
            es.index(index="esci", document={'title': doc['title'], 'asin': doc['asin']})
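Indexing 1.7M documents one request at a time can take a while. As an optional speed-up, here is a sketch of the same import using the bulk helper from the same elasticsearch package:
import json
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(hosts="http://localhost:9200")

def actions():
    # generate one bulk indexing action per product, keeping only title and asin
    with open('esci.json', 'r') as f:
        for line in f:
            doc = json.loads(line.rstrip())
            if 'title' in doc:
                yield {'_index': 'esci', '_source': {'title': doc['title'], 'asin': doc['asin']}}

# send the documents in batches instead of one HTTP request per document
helpers.bulk(es, actions())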
And then you can perform simple keyword searches over the data. For example, you can search for "crocs":
curl -XPOST -d @search.json -H "Content-Type: application/json" http://localhost:9200/esci/_search
Where the search.json looks like this:
{
  "query": {
    "multi_match": {
      "query": "crocs",
      "fields": ["title"]
    }
  },
  "fields": ["asin", "title"]
}
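The same query can also be issued from the Python client used for the import above; the keyword-argument style below assumes the 8.x elasticsearch client:
# reuse the 'es' client from the import script and keep only the top-10 hits
resp = es.search(
    index="esci",
    query={"multi_match": {"query": "crocs", "fields": ["title"]}},
    fields=["asin", "title"],
    size=10,
)
top10 = [hit["_source"] for hit in resp["hits"]["hits"]]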
For this search query, Elasticsearch returned 30 matching products, but we will take only the top 10 for further reranking (the response below is truncated to the first hit):
{
  "took": 7,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 30,
      "relation": "eq"
    },
    "max_score": 10.184343,
    "hits": [
      {
        "_index": "esci",
        "_id": "17qoxocBFbzZBgn-7-iy",
        "_score": 10.184343,
        "_source": {
          "title": "Crocs Jibbitz 5-Pack Alien Shoe Charms | Jibbitz for Crocs",
          "asin": "B089YD2KK5"
        },
        "fields": {
          "asin": [
            "B089YD2KK5"
          ],
          "title": [
            "Crocs Jibbitz 5-Pack Alien Shoe Charms | Jibbitz for Crocs"
          ]
        }
      }
    ]
  }
}
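To prepare these hits for reranking, we need to turn them into query-document pairs. A small sketch, assuming resp is the Elasticsearch response from above:
# build the request body for Metarank's inference API (described below)
payload = {
    "input": [
        {"query": "crocs", "text": hit["_source"]["title"]}
        for hit in resp["hits"]["hits"][:10]
    ]
}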

Metarank as an inference server

Now that we have our top-10 search results for reranking, we can configure Metarank in inference mode for cross-encoders. This can be done with the following configuration file:
inference:
  msmarco:
    type: cross-encoder
    model: metarank/ce-msmarco-MiniLM-L6-v2
After start-up, Metarank will expose its HTTP API, and you can query the /inference endpoint to perform the reranking. See the API Reference for details about the payload format:
curl -XPOST -d @rerank.json -H "Content-Type: application/json" http://metarank:8080/inference/cross/msmarco
Where the rerank.json request looks like this:
{
  "input": [
    {"query": "crocs", "text": "Crocs Jibbitz 5-Pack Alien Shoe Charms"},
    {"query": "crocs", "text": "Crocs Specialist Vent Work"},
    {"query": "crocs", "text": "Crocs Kids' Baya Clog"}
  ]
}
Metarank will respond with a list of scores, one per query-document pair, in the same order as the input:
{"scores": [0.756001, 0.52617, 0.193747]}
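Putting it all together, here is a sketch that sends the payload built earlier and sorts the candidates by score, using the requests library; it assumes Metarank is reachable at localhost:8080 (adjust the host for your deployment):
import requests

# score all query-document pairs with the cross-encoder
response = requests.post("http://localhost:8080/inference/cross/msmarco", json=payload)
scores = response.json()["scores"]

# higher score = more relevant, so sort in decreasing score order
reranked = sorted(zip(payload["input"], scores), key=lambda p: p[1], reverse=True)
for doc, score in reranked:
    print(f"{score:.4f} {doc['text']}")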

Cross-encoder latency

Note that because LLM inference runs for every query-document pair, cross-encoders can be quite slow for large reranking windows:
Benchmark  (batch)  (model)                  Mode  Cnt    Score    Error  Units
encode           1  ce-msmarco-MiniLM-L6-v2  avgt   30   12.298 ±  0.581  ms/op
encode          10  ce-msmarco-MiniLM-L6-v2  avgt   30   58.664 ±  2.064  ms/op
encode         100  ce-msmarco-MiniLM-L6-v2  avgt   30  740.422 ± 13.369  ms/op
As the benchmark above shows, latency grows roughly linearly with the window size: a top-100 window already costs around 740 ms per request, so try to keep the reranking window reasonably small.