The "state" section describes how computed features and models are stored. Check Persistence configuration for more information. An example persistence conf block with comments:
state:# a place to store the feature values for the ML inference and the trained model# Local memory# A node-local in-memory storage without any persistence. # Feature values and the trained model is stored in-memory. # Suitable only for local testing, as in case of a restart it will loose all the data.type:memory# Remote redis, with persistence. # Saves the computed features and trained model in a Redis instance.# You can use remote or local Redis installation.#type: redis#host: localhost#port: 6369#format: binary # optional, default=binary, possible values: json, binary# Metarank implements several optimization strategies when using Redis: caching and pipelining# Check https://docs.metarank.ai/reference/overview/persistence#redis-persistence for more details#cache: # optional# maxSize: 4096 # size of in-memory client-side cache for hot keys, optional, default=4096# ttl: 1h # how long should key-values should be cached, optional, default=1h#pipeline: # optional# maxSize: 128 # batch write buffer size, optional, default=128# flushPeriod: 1s # buffer flush interval, optional, default=1s# enabled: true # toggle pipelining, optional, default=true# can be also overridden from environment variables, see the# https://docs.metarank.ai/reference/overview/persistence#redis-persistence for details#auth: # optional# user: <username> # optional when Redis ACL is disabled# password: <password> # required if Redis server is run with requirepass argument# tls: # optional, defaults to disabled# enabled: true # optional, defaults to false# ca: <path/to/ca.crt> # optional path to the CA used to generate the cert, defaults to the default keychain# verify: full # optional, default=full, possible values: full, ca, off# full - verify both certificate and hostname# ca - verify only certificate# off - skip verification#timeout: # optional, defaults to 1s for all sub-timeouts# connect: 1s # optional, defaults to 1s# socket: 1s # optional, defaults to 1s# command: 1s # optional, defaults to 1s
Training
Metarank also computes a click-through data structure, which contains the following bits of information:
ranking: which items were presented to the visitor
interactions: what visitor did after seeing the ranking (like clicks, purchases and so on)
feature values, which were used to produce the ranking in the past.
These click-through events are essential for model training, as they're later translated into the implicit judgement lists for the underlying LambdaMART model:
Metarank has multiple ways of storing these click-throughs with different pros and cons:
Redis: no special configuration needed, it's possible to perform periodic ML model retraining by reading the latest click-through events from it. But it takes quite a lot of RAM and maybe costly in a case when you have millions of click-through events.
train:type:redis# all options from state.redis herettl:<duration># optional, default 365 days.
Discard: do not store click-through events at all.
train:type:discard
Local dir: takes much less RAM (as ct's are not stored in redis), but you need to manage the directory containing the click-through files by yourself.
train:type:filepath:/path/to/dir# path to a directory which will be used for persistence during export/importformat:json# options are: json, binary
S3: like local file, but offloads data to an external block storage, suits well for Kubernetes deployments.
train:type:s3bucket:<bucket name># required, S3 bucket nameprefix:<prefix name># required, Prefix/dir name to store files intoregion:<aws region># required, S3 regioncompress:none | gzip | zst# optional, default: gzippartSizeBytes:10485760# optional, pre-compression, default: 10MbpartSizeEvents:1024# optional, default: 1024 eventspartInterval:1h# optional, default: 1hendpoint:<endpoint URI># optional, custom S3 endpointformat:json | binary# optional, default: binaryawsKey:"<key>"# optional, you should prefer setting# AWS_KEY_ID and AWS_SECRET_KEY_ID env varsawsKeySecret:"<secret>"# optional
S3 click-through store can either use hardcoded AWS credentials from config (which is not good from security perspective), or fall-back to the ones defined in env variables.
Features
This section describes how to map your input events into ML features that Metarank understands. See Feature extractors for an overview of supported types.
# These features can be shared between multiple models, so if you have a model A using features 1-2-3 and# a model B using features 1-2, then all three features will be computed only once. # You need to explicitly include a feature in the model configuration for Metarank to use it.features: - name:popularitytype:numberscope:itemsource:item.popularity# TTL and refresh fields are part of every feature extractor that Metarank supports.# The purpose of TTL is to configure data retention period, so in a case when there were no# feature updates for a long time, it will eventually be dropped.ttl:60d# Refresh parameter is used to downsample the amount of feature updates emitted. For example,# there is a window_counter feature extractor, which can be used to count a number of clicks that happened for# an item. Incrementing such a counter for a single day is an extremely lightweight operation, but computing# window sums is not. As it's not always required to receive up-to-date counter values in ML models,# these window sums can be updated only eventually (like once per hour), which improves the throughput a lot# (but results in a slightly stale data during the inference process)refresh:1h - name:genretype:stringscope:itemsource:item.genresvalues: - drama - comedy - thriller
models:default:# name of the model, used in the inference process as a part of path, like /rank/defaulttype:lambdamart# model typebackend:type:xgboost# supported values: xgboost, lightgbm for lambdamart modeliterations:100# optional (default 100), number of iterations while training the modelseed:0# optional (default = random), a seed to make training deterministicweights:# types and weights of interactions used in the model trainingclick:1# you can increase the weight of some events to hint model to optimize more for themfeatures:# features from the previous section used in the model - popularity - genre# You can specify several models at once.# This can be useful for A\B test scenarios or while testing different sets of features.#random:# type: shuffle # shuffle model type produces random results# maxPositionChange: 5 # controls the amount of randomness that shuffle can introduce in the original ranking# The noop model does nothing with the original ranking and returns results "as is"#noop:# type: noop# A similar-items MF ALS modelsimilar:type:alsinteractions: [click] # which types of interactions to usefactors:100# how many implicit factors to computeiterations:30# number of model training iterations# A simple "popular items" modeltrending:type:trendingweights: - interaction:clickdecay:1.0# 0..1, 0.5 means yesterday is 50% less important than todayweight:1.0# in a case with multiple interaction types
Inference
The "inference" section describes inference model configuration for search results re-ranking. Check the inference models section for more information.
inference:msmarco:# name of the modeltype:cross-encoder# model typemodel:metarank/ce-msmarco-MiniLM-L6-v2# model source
API
The "api" section describes the Metarank API configuration. This section is optional and by default binds service to port 8080 on all network interfaces.
api:port:8080host:"0.0.0.0"
Data sources
The optional "source" section describes the source of the data, and by default expects you to submit all user feedback using the API. Check Supported data sources for more information.
source:type:file# source type, available options: file, kafka, pulsar, kinesis#path: /home/user/ranklens/events/ # path to events file, alternatively you can use CLI to provide file location#offset: earliest|latest|ts=<unixtime>|last=<duration> #default: earliest#format: <json|snowplow:tsv|snowplow:json> # file format, default: json# Check https://docs.metarank.ai/reference/overview/data-sources#apache-kafka for more information#type: kafka#brokers: [broker1, broker2]#topic: events#groupId: metarank#offset: earliest|latest|ts=<unixtime>|last=<duration>#format: <json|snowplow:tsv|snowplow:json># Check https://docs.metarank.ai/reference/overview/data-sources#apache-pulsar for more information#type: pulsar#serviceUrl: <pulsar service URL>#adminUrl: <pulsar service HTTP admin URL>#topic: events#subscriptionName: metarank#subscriptionType: exclusive # options are exclusive, shared, failover#offset: earliest|latest|ts=<unixtime>|last=<duration>#format: <json|snowplow:tsv|snowplow:json># Check https://docs.metarank.ai/reference/overview/data-sources#aws-kinesis-streams for more information#type: kinesis#region: us-east-1#topic: events#offset: earliest|latest|ts=<unixtime>|last=<duration>#format: <json|snowplow:tsv|snowplow:json>
Core
This optional section contains parameters related to the metarank service itself. Default setup:
core:# How rankings and interactions are joined into click-throughs. For details, see the section below in this doc.clickthrough:maxParallelSessions:10000# how many active sessions may happen within a `maxSessionLength` periodmaxSessionLength:30m# after which period of inactivity session is considered finalized# default = 30m (to be consistent with Google Analytics)# Anonymous usage reporting. It is very helpful to us, so please leave this enabled.tracking:analytics:trueerrors:true
Click-through joining
Metarank joins ranking and interaction events together into click-through chains, which are later used for machine learning model training.
As interactions are happening some time later than rankings, Metarank needs to keep a set of rankings in the buffer, awaiting all the interactions that may happen later.
This buffer policy is controlled by the following parameters:
core.clickthrough.maxSessionLength: after which time period the session should be considered finalized, so no more interactions are allowed to happen. Default values is 30m, as in Google Analytics.
core.clickthrough.maxParallelSessions: how many parallel sessions may hang in buffer awaiting interactions. Default is 10k.
Anonymous usage analytics
By default, Metarank collects anonymous usage analytics to improve the tool. No IP addresses are being tracked, only simple counters track what parts of the service are being used.
It is very helpful to us, so please leave this enabled. Counters are sent to https://analytics.metarank.ai on each service startup.
{"state":"memory","modelTypes": [ "shuffle" ],"usedFeatures": [ {"name":"price","type":"number" } ],"system": {"os":"Linux","arch":"amd64","jvm":"17.0.3",// a SHA256 of your network interface MAC address, used as an installation ID"macHash":"3e78137877f66cfb4f1a0875e7eadb3100fb3c6c4755089b3cc6d9f074a3c4b5" },"mode":"train","version":"snapshot","ts": 1662030509517}
Error logging
We use Sentry for error collection. This behavior is enabled by default and can be disabled with core.tracking.errors: false. Sentry is configured with the following options:
Breadcrumbs are disabled: so it won't share parts of your console log with us.
PII tracking is disabled: no hostnames and IP addresses are included in the error message.