Text
An extractor that matches a field from the ranking event against an item field. In practice, it is useful in search-related tasks, when you need to match a search query against multiple separate fields of a document, like title, tags and category.
The field match extractor supports the following matching methods:
- BM25: a Lucene-specific BM25 score between ranking and item fields (for example, between the query and the item title)
- ngram: split item/query fields into N-grams and compute an intersection-over-union score
- term: use Lucene to perform language-specific tokenization
- bert: build LLM embeddings for item/query fields and compute a distance between them
Given this metadata event:
```json
{
  "event": "item",
  "id": "81f46c34-a4bb-469c-8708-f8127cd67d27",
  "item": "item1",
  "timestamp": "1599391467000",
  "fields": [
    {"name": "title", "value": "red socks"},
    {"name": "category", "value": "socks"},
    {"name": "brand", "value": "buffalo"},
    {"name": "description", "value": "lorem ipsum dolores sit amet"}
  ]
}
```
And the following ranking event:
```json
{
  "event": "ranking",
  "id": "81f46c34-a4bb-469c-8708-f8127cd67d27",
  "timestamp": "1599391467000",
  "user": "user1",
  "session": "session1",
  "fields": [
    {"name": "query", "value": "sock"}
  ],
  "items": [
    {"id": "item3"},
    {"id": "item1"},
    {"id": "item2"}
  ]
}
```
As the BM25 formula requires term frequencies and some other index statistics, using BM25 requires you to build the term-freq dictionary beforehand; see the CLI `termfreq` docs on how to do it. Having the `term-freq.json` file in hand, you can then configure Metarank to compute a BM25 score between a ranking field (for example, `query`) and an item field (like `title`):
```yaml
- name: title_match
  type: field_match
  rankingField: ranking.query
  itemField: item.title
  method:
    type: bm25
    language: english
    termFreq: "/path/to/term-freq.json"
```
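To see why the term-frequency dictionary is needed, here is a minimal sketch of the standard BM25 scoring formula (not Metarank's internal implementation): the IDF part depends on corpus-wide statistics such as document counts and document frequencies, which cannot be derived from a single ranking event. The corpus numbers in the example are purely illustrative.

```python
import math

# Minimal sketch of the standard BM25 formula: IDF needs corpus-wide
# statistics (document count, document frequencies), which is why a
# pre-built term-frequency dictionary is required.
def bm25(query_terms, doc_terms, doc_count, doc_freq, avg_doc_len, k1=1.2, b=0.75):
    score = 0.0
    doc_len = len(doc_terms)
    for term in set(query_terms):
        tf = doc_terms.count(term)
        if tf == 0:
            continue
        df = doc_freq.get(term, 0)
        idf = math.log(1 + (doc_count - df + 0.5) / (df + 0.5))
        norm = k1 * (1 - b + b * doc_len / avg_doc_len)
        score += idf * tf * (k1 + 1) / (tf + norm)
    return score

# Illustrative corpus statistics: 1000 documents, "sock" occurs in 40 of them.
print(bm25(["sock"], ["red", "sock"], doc_count=1000,
           doc_freq={"sock": 40, "red": 120}, avg_doc_len=4.0))
```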
With the following config file snippet you can do per-field matching of the `ranking.query` field against the `item.title` field of the items in the ranking with 3-grams:
```yaml
- name: title_match
  type: field_match
  itemField: item.title # must be a string
  rankingField: ranking.query # must be a string
  method:
    type: ngram # for now only ngram and term are supported
    language: en # ISO-639-1 language code
    n: 3
  refresh: 0s # optional, how frequently we should update the value, 0s by default
  ttl: 90d # optional, how long should we store this field
```
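As a rough illustration of the score this produces, here is the intersection-over-union arithmetic for the query `sock` against the title `red socks`, assuming character 3-grams within each token (an assumption made for illustration, not a description of Metarank's exact analyzer):

```python
# Character 3-grams within each token; intended only to show the
# intersection-over-union arithmetic, not Metarank's exact analyzer.
def ngrams(text, n=3):
    tokens = text.lower().split()
    return {tok[i:i + n] for tok in tokens for i in range(len(tok) - n + 1)}

query = ngrams("sock")       # {'soc', 'ock'}
title = ngrams("red socks")  # {'red', 'soc', 'ock', 'cks'}
print(len(query & title) / len(query | title))  # 2 / 4 = 0.5
```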
In a similar way, you can configure term matching:
```yaml
- name: title_match
  type: field_match
  itemField: item.title # must be a string
  rankingField: ranking.query # must be a string
  method:
    type: term # for now only ngram and term are supported
    language: en # ISO-639-1 language code
```
Both term and ngram matching methods leverage Lucene for text analysis and support the following set of languages:
- generic: no language specific transformations
- en: English
- cz: Czech
- da: Danish
- nl: Dutch
- et: Estonian
- fi: Finnish
- fr: French
- de: German
- gr: Greek
- it: Italian
- no: Norwegian
- pl: Polish
- pt: Portuguese
- es: Spanish
- sv: Swedish
- tr: Turkish
- ar: Arabic
- zh: Chinese
- ja: Japanese
Both term and ngram methods share the same approach to text analysis:
- the text is split into terms (using a language-specific method)
- stopwords are removed
- for non-generic languages, each term is stemmed
- then terms/N-grams from the item and ranking are scored using the intersection-over-union method.
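As a sketch of that pipeline for the `term` method, the snippet below tokenizes, drops stopwords, stems and then scores by intersection-over-union. NLTK's Snowball stemmer and a toy stopword list stand in for Lucene's analyzers, which Metarank uses internally:

```python
# Rough stand-in for the Lucene analysis chain of the `term` method:
# tokenize, drop stopwords, stem, then score by intersection-over-union.
from nltk.stem.snowball import SnowballStemmer

STOPWORDS = {"a", "an", "and", "the", "of", "in"}  # toy stopword list
stemmer = SnowballStemmer("english")

def analyze(text):
    terms = [t for t in text.lower().split() if t not in STOPWORDS]
    return {stemmer.stem(t) for t in terms}

query = analyze("sock")       # {'sock'}
title = analyze("red socks")  # {'red', 'sock'} -- 'socks' stems to 'sock'
print(len(query & title) / len(query | title))  # 1 / 2 = 0.5
```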
The bert text matching method:
- computes an embedding for both the query and the document
- then computes a cosine similarity between the two embeddings.
Semantically similar query-document pairs will have a higher score than irrelevant ones.
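The cosine itself is just the dot product of the two embeddings divided by the product of their norms; here is a minimal sketch with made-up 3-dimensional vectors (real models produce 384 or more dimensions):

```python
import math

# Cosine similarity: dot product divided by the product of the vector norms.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query_emb = [0.1, 0.3, 0.5]  # illustrative vectors; real embeddings have 384+ dims
title_emb = [0.2, 0.1, 0.6]
print(cosine(query_emb, title_emb))
```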
Then with the following config snippet we can compute a cosine distance between title and query embeddings:
```yaml
- type: field_match
  name: title_query_match
  rankingField: ranking.query
  itemField: item.title
  distance: cos # optional, default cos, options: cos/dot
  method:
    type: bi-encoder
    model: metarank/all-MiniLM-L6-v2 # optional, can be only cache-based
    dim: 384 # required, dimensionality of the embedding
    itemFieldCache: /path/to/item.embedding # optional, pre-computed embedding cache for items
    rankingFieldCache: /path/to/query.embedding # optional, pre-computed embedding cache for rankings
```
Metarank supports two embedding methods:
- `bi-encoder`: ONNX-encoded versions of the sentence-transformers models. See the metarank HuggingFace namespace for a list of currently supported models.
- `cross-encoder`: described in its own section below.

For both, Metarank supports fetching the model directly from the HuggingFace Hub or loading it from a local directory, depending on the model handle format:
- `namespace/model`: fetch the model from the HF Hub
- `file:///<path>/<to>/<model dir>`: load an ONNX-encoded embedding model from a local directory.
In some performance-sensitive cases you may not want to compute embeddings in real time, but only use offline pre-computed ones. This is possible with the `csv` `field_match` method:
```yaml
- type: field_match
  name: title_query_match
  rankingField: ranking.query
  itemField: item.title
  distance: cos # optional, default cos, options: cos/dot
  method:
    type: bi-encoder # note that there is no model reference, only caches
    dim: 384
    itemFieldCache: /path/to/item.embedding
    rankingFieldCache: /path/to/query.embedding
```
In this case Metarank will load item and query embeddings from CSV files in the following format:

itemFieldCache:
```csv
item1,0,1,2,3,4,5
item2,5,4,3,2,1,9
item3,1,1,1,1,1,1
```

rankingFieldCache:
```csv
bananas,0,1,2,3,4,5
red socks,5,4,3,2,1,9
stone,1,1,1,1,1,1
```
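One way to produce these cache files offline is with the sentence-transformers library; the sketch below uses the upstream sentence-transformers/all-MiniLM-L6-v2 checkpoint (384 dimensions) as a stand-in for the ONNX port, and the item ids, query strings and output paths are illustrative:

```python
# Sketch of producing the embedding cache files offline with sentence-transformers.
# Model choice, item ids, queries and paths are illustrative assumptions.
import csv
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

items = {"item1": "red socks", "item2": "blue jeans", "item3": "stone"}
queries = ["bananas", "red socks", "stone"]

def write_cache(path, keyed_texts):
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for key, text in keyed_texts:
            embedding = model.encode(text)  # 384-dimensional vector
            writer.writerow([key, *embedding.tolist()])

write_cache("item.embedding", items.items())                # keyed by item id
write_cache("query.embedding", [(q, q) for q in queries])   # keyed by query string
```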
- when both query and item embeddings are present, `field_match` will produce a cosine distance between them.
- when at least one of the embeddings is missing, `field_match` with the `csv` method will produce a nil missing value.
- when at least one of the embeddings is missing, `field_match` with the `transformer` method will compute the embedding in real time.
Cross-encoders are quite similar to bi-encoders, but instead of separately computing embeddings for the query and the document, they feed both texts into the neural network, which produces the matching score.
Compared to the bi-encoder approach:
- cross-encoders are much more precise, even with generic pre-trained models.
- they require many more resources, as there is no way to pre-compute embeddings for docs and queries: you need to perform full neural inference at query time.
Enabling cross-encoders in Metarank can be done with the following snippet:
```yaml
- type: field_match
  name: title_query_match
  rankingField: ranking.query
  itemField: item.title
  method:
    type: cross-encoder
    model: metarank/ce-msmarco-MiniLM-L6-v2 # optional, can be only cache-based
    cache: /path/to/ce.cache # optional, pre-computed query-doc scores
```
Note that as cross-encoders are very CPU-heavy to run, you can pre-compute a set of query-doc scores offline and supply Metarank with a cache in the following CSV format:
```csv
query1,doc1,0.7
query1,doc2,0.1
query2,doc3,0.2
```
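Here is a sketch of how such a cache could be produced offline with the sentence-transformers CrossEncoder. The upstream cross-encoder/ms-marco-MiniLM-L-6-v2 checkpoint stands in for the ONNX port above, the document ids and texts are made up, and the second CSV column is assumed to be an item id, following the example format:

```python
# Sketch of pre-computing query-doc scores with a cross-encoder and writing
# them in the query,doc,score CSV format. All names and texts are illustrative.
import csv
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

docs = {"doc1": "red wool socks", "doc2": "blue denim jeans", "doc3": "granite stone tiles"}
queries = ["query1", "query2"]

rows = []
for query in queries:
    scores = model.predict([(query, text) for text in docs.values()])
    rows += [(query, doc_id, float(s)) for doc_id, s in zip(docs, scores)]

with open("ce.cache", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```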