Reranking with cross-encoders
In this guide we will set up Metarank as a simple inference server for cross-encoder LLMs (Large Language Models). In other words, we will use an open-source cross-encoder model to reorder your search results in a zero-shot manner, without collecting any visitor feedback data. We will use a pre-trained MS-MARCO MiniLM-L6-v2 cross-encoder from the sentence-transformers package.
A cross-encoder LLM is a way to leverage the semantic power of a neural network to reorder the top-N documents matching your query. You can think of it as asking ChatGPT a question like "On a scale from 0 to 1, how relevant is the document 'Crocs Men's Classic Clog' to the query 'crocs'?" for every retrieved document, and then sorting the documents by the answers.
For typical reranking scenarios, cross-encoders (even in zero-shot mode) are much more precise than bi-encoders, since they process the query and the document together instead of comparing their separately-computed embeddings.
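For example, the sentence-transformers package lets you score query-document pairs directly. A minimal sketch, with illustrative product titles (the model id is the HuggingFace hub name of the MS-MARCO MiniLM-L6-v2 cross-encoder):

```python
from sentence_transformers import CrossEncoder

# pre-trained MS-MARCO cross-encoder from the HuggingFace hub
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# the model reads each (query, document) pair jointly and emits
# an unnormalized relevance score: higher means more relevant
scores = model.predict([
    ("crocs", "Crocs Men's and Women's Classic Clog"),
    ("crocs", "LED Desk Lamp with USB Charging Port"),
])
print(scores)
```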
Let's imagine you already have a traditional search engine running (like Elasticsearch, OpenSearch or Solr) with a good recall level: it retrieves all the relevant products, but it sometimes struggles with precision, as there can be false positives and the ranking is not perfect.
In this guide we will take the top-N matching documents from your search engine and re-rank them according to their semantic similarity with the query.
We will use the Amazon ESCI e-commerce dataset as a toy but realistic example: it contains 1.7M real products that we can easily query with Elasticsearch. You can download the JSON-encoded version here: https://github.com/shuttie/esci-s.
Assuming that you have an Elasticsearch service running on http://localhost:9200, you can import the complete dataset with a short Python script.
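A minimal sketch, assuming the dataset is saved locally as esci.json and indexed with the official elasticsearch Python client (the asin, title and description field names follow the esci-s schema, but double-check them against the actual files):

```python
import json

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch("http://localhost:9200")

def products(path: str):
    # esci-s ships products as JSON-lines: one product per line
    with open(path) as f:
        for line in f:
            doc = json.loads(line)
            yield {
                "_index": "esci",
                "_id": doc["asin"],
                "title": doc.get("title", ""),
                "description": doc.get("description", ""),
            }

# bulk-index all products into the "esci" index
bulk(es, products("esci.json"))
```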
And then you can perform simple keyword searches over the data. For example, you can search for "crocs".
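With curl it might look like this (assuming the esci index name from the import sketch above):

```bash
curl -XPOST 'http://localhost:9200/esci/_search' \
  -H 'Content-Type: application/json' \
  -d @search.json
```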
Where the search.json file holds the Elasticsearch query body.
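For example, a multi_match query over the title and description fields used at import time, requesting 30 hits (a sketch):

```json
{
  "size": 30,
  "query": {
    "multi_match": {
      "query": "crocs",
      "fields": ["title", "description"]
    }
  }
}
```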
For this search query, Elasticsearch returned 30 matching products, but we will take only the top 10 of them for further reranking.
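For example, a sketch with the requests package, keeping only the title of each hit:

```python
import json

import requests

with open("search.json") as f:
    query = json.load(f)

resp = requests.post("http://localhost:9200/esci/_search", json=query).json()

# Elasticsearch returns hits sorted by BM25 score, so the first
# 10 entries are the top-10 candidates for reranking
top10 = [hit["_source"]["title"] for hit in resp["hits"]["hits"][:10]]
print(top10)
```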
Now that we have our top-10 search results for reranking, we're going to configure Metarank in inference mode for cross-encoders. This can be done with a small configuration file.
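A sketch of what it might look like; the exact schema is described in the Metarank configuration reference, and the msmarco handler name and the model id below are assumptions:

```yaml
inference:
  msmarco:
    type: cross-encoder
    # a pre-trained MS-MARCO MiniLM-L6-v2 cross-encoder model;
    # check the Metarank docs for the exact model id
    model: metarank/ce-msmarco-MiniLM-L6-v2
```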
After start-up, Metarank exposes its HTTP API, and you can query the /inference API endpoint to perform the reranking. See the API Reference for details about the payload format.
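For example, with curl (assuming Metarank's default port 8080 and the msmarco handler name from the config sketch above):

```bash
curl -XPOST 'http://localhost:8080/inference/cross-encoder/msmarco' \
  -H 'Content-Type: application/json' \
  -d @rerank.json
```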
Where the rerank.json request holds the query-document pairs to score.
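A plausible shape, assuming the endpoint takes a list of query-document pairs; the exact field names are in the API Reference, and the product titles are illustrative:

```json
{
  "input": [
    {"query": "crocs", "text": "Crocs Men's and Women's Classic Clog"},
    {"query": "crocs", "text": "Crocs Kids' Classic Glitter Clog"},
    {"query": "crocs", "text": "LED Desk Lamp with USB Charging Port"}
  ]
}
```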
Metarank will respond with a set of scores corresponding to the similarity of each query-document pair.
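With one score per input pair, in the same order; the numbers below are placeholders:

```json
{
  "scores": [0.982, 0.871, 0.014]
}
```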
Note that because LLM inference runs for every query-document pair, cross-encoders can be quite slow for large reranking windows.
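You can get a rough feel for the latency with a small timing sketch over the same sentence-transformers cross-encoder (absolute numbers depend heavily on your hardware):

```python
import time

from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "crocs"
doc = "Crocs Men's and Women's Classic Clog"

# inference cost grows with the number of pairs in the window
for window in (10, 50, 100):
    pairs = [(query, doc)] * window
    start = time.time()
    model.predict(pairs)
    print(f"top-{window}: {(time.time() - start) * 1000:.0f} ms")
```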
Since the inference cost grows with the number of query-document pairs, reranking a window of top-100 products may incur a noticeable latency, so try to keep the reranking window reasonably small.