Text
field_match
An extractor that matches a field from the ranking event over an item field. In practice it is useful in search-related tasks, when you need to match a search query over multiple separate fields of a document, like title, tags and category.
The field_match extractor supports the following matching methods:
BM25: a Lucene-specific BM25 score between ranking and item fields (for example, between the query and the item title)
ngram: split item/query fields into N-grams and compute an intersection-over-union score
term: use Lucene to perform language-specific tokenization
bert: build LLM embeddings for item/query fields and compute a distance between them
Dataset example
Given this metadata event:
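A minimal item metadata event could look like this (ids, timestamps and field values are illustrative):

```json
{
  "event": "item",
  "id": "item-event:1",
  "timestamp": "1599391467000",
  "item": "product1",
  "fields": [
    {"name": "title", "value": "red socks"}
  ]
}
```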
And the following ranking event:
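A ranking event carrying the search query as a ranking-level field might then look like this (again, ids and values are illustrative):

```json
{
  "event": "ranking",
  "id": "ranking-event:1",
  "timestamp": "1599391468000",
  "user": "user1",
  "session": "session1",
  "fields": [
    {"name": "query", "value": "socks"}
  ],
  "items": [
    {"id": "product1"},
    {"id": "product2"}
  ]
}
```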
BM25 score
As the BM25 formula requires term frequencies and some other index statistics, using BM25 requires you to build the term-freq dictionary beforehand: see the CLI termfreq docs on how to do it.
Having the term-freq.json file in hand, you can then configure Metarank to compute a BM25 score between a ranking field (for example, query) and an item field (like title):
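A sketch of such a config, assuming the usual field_match layout with itemField/rankingField and a method block (check the exact key names, like termFreq, against your Metarank version):

```yaml
- name: title_match
  type: field_match
  itemField: item.title        # item field to match over
  rankingField: ranking.query  # ranking field with the search query
  method:
    type: bm25
    language: en
    termFreq: "/path/to/term-freq.json" # dictionary built with the termfreq CLI
```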
Ngram matching
With the following config file snippet, you can do per-field matching of the ranking.query field over the item.title field of the items in the ranking with 3-grams:
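A sketch of the ngram configuration (the n parameter controls the n-gram size; key names as assumed above):

```yaml
- name: title_match
  type: field_match
  itemField: item.title
  rankingField: ranking.query
  method:
    type: ngram
    language: en # one of the language codes listed below
    n: 3         # size of the n-gram
```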
Term matching
In a similar way, you can configure term matching:
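A sketch of the same feature with the term method (only the method block changes):

```yaml
- name: title_match
  type: field_match
  itemField: item.title
  rankingField: ranking.query
  method:
    type: term
    language: en
```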
Both term and ngram matching methods leverage Lucene for text analysis and support the following set of languages:
generic: no language specific transformations
en: English
cz: Czech
da: Danish
nl: Dutch
et: Estonian
fi: Finnish
fr: French
de: German
gr: Greek
it: Italian
no: Norwegian
pl: Polish
pt: Portuguese
es: Spanish
sv: Swedish
tr: Turkish
ar: Arabic
zh: Chinese
ja: Japanese
Both term and ngram methods share the same approach to text analysis:
the text line is split into terms (using a language-specific method)
stopwords are removed
for non-generic languages, each term is stemmed
then terms/ngrams from the item and ranking are scored using the intersection/union method.
LLM Bi-Encoders
This text matching method:
computes an embedding for both the query and the document
then computes a cosine similarity between both embeddings.
Semantically similar query-document pairs will have a higher score than irrelevant ones.
With the following config snippet, you can compute a cosine distance between title and query embeddings:
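A sketch of a bi-encoder configuration; the model handle is an example from the metarank HuggingFace namespace, and the exact method type string (bi-encoder vs transformer) may differ between Metarank versions:

```yaml
- name: title_query_match
  type: field_match
  itemField: item.title
  rankingField: ranking.query
  method:
    type: bi-encoder                 # the text below also refers to this as the transformer method
    model: metarank/all-MiniLM-L6-v2 # example handle; see the metarank HF namespace for supported models
```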
Metarank supports two embedding methods:
bi-encoder: ONNX-encoded versions of the sentence-transformers models. See the metarank HuggingFace namespace for a list of currently supported models.
cross-encoder: ONNX-encoded versions of sentence-transformers cross-encoder models.
For both model types, Metarank supports fetching the model directly from the HuggingFace Hub, or loading it from a local directory, depending on the model handle format:
namespace/model: fetch the model from the HuggingFace Hub
file:///<path>/<to>/<model dir>: load an ONNX-encoded embedding model from a local directory.
Using CSV cache of precomputed embeddings
In some performance-sensitive cases you don't want to compute embeddings in realtime, but only use offline precomputed ones. This is possible with the csv field_match method:
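A sketch of the csv method configuration, reusing the itemFieldCache/rankingFieldCache options described below (the dim key is an assumption about how the embedding size is declared):

```yaml
- name: title_query_match
  type: field_match
  itemField: item.title
  rankingField: ranking.query
  method:
    type: csv
    itemFieldCache: /path/to/item-embeddings.csv
    rankingFieldCache: /path/to/query-embeddings.csv
    dim: 384 # dimensionality of the precomputed embeddings (assumed option)
```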
In this case Metarank will load item and query embeddings from a CSV file in the following format:
itemFieldCache:
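For example, assuming the first column is the item field value and the remaining columns are the embedding dimensions (layout and numbers are illustrative):

```csv
"red socks",0.01,0.23,-0.11,0.98
"blue socks",0.03,0.19,-0.07,0.91
```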
rankingFieldCache:
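And similarly for the ranking field, keyed by the query string (illustrative):

```csv
"socks",0.02,0.21,-0.09,0.95
"warm socks",0.04,0.18,-0.05,0.90
```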
when both query and item embeddings are present, then field_match will produce a cosine distance between them.
when at least one of the embeddings is missing, then field_match with the csv method will produce a nil missing value.
when at least one of the embeddings is missing, then field_match with the transformer method will compute the embedding in real time.
LLM Cross-encoders
Cross-encoders are quite similar to bi-encoders, but instead of separately computing embeddings for the query and the document, they feed both texts into the neural network, which produces the matching score.
Compared to the bi-encoder approach:
cross-encoders are much more precise, even with generic pre-trained models.
they require much more resources, as there's no way to pre-compute embeddings for docs and queries: you need to perform full neural inference at query time.
Enabling cross-encoders in Metarank can be done with the following snippet:
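A sketch of a cross-encoder configuration; the model handle is an assumed example, so check the metarank HuggingFace namespace for the actual cross-encoder models:

```yaml
- name: title_query_ce
  type: field_match
  itemField: item.title
  rankingField: ranking.query
  method:
    type: cross-encoder
    model: metarank/ce-msmarco-MiniLM-L6-v2 # assumed handle, verify against the HF namespace
```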
Note that as cross-encoders are very CPU-heavy to run, you can pre-compute a set of query-doc scores offline and supply Metarank with a cache in the following CSV format:
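A sketch of what such a cache could look like, assuming one row per query-document pair with the precomputed score in the last column (the exact column layout is an assumption):

```csv
"socks","red socks",0.93
"socks","blue jeans",0.11
```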