The Metarank YAML config file contains the following sections:
  • Persistence: how feature data is stored
  • Models: which models should be trained and used in inference
  • Features: how features are computed from events
  • API: network options for API
  • Data sources: where to read events from
  • Core: service options, like anonymous tracking and error reporting.
See the sample-config.yml file for a full working example.

The "state" section describes how computed features and models are stored. Check Persistence configuration for more information. An example persistence configuration block with comments:
state: # a place to store the feature values for the ML inference and the trained model
  # Local memory
  # A node-local in-memory storage without any persistence.
  # Feature values and the trained model are stored in memory.
  # Suitable only for local testing, as a restart will lose all the data.
  type: memory
  # Remote Redis, with persistence.
  # Saves the computed features and trained model in a Redis instance.
  # You can use a remote or local Redis installation.
  #type: redis
  #host: localhost
  #port: 6379
  # Metarank implements several optimization strategies when using Redis: caching and pipelining.
  # Check Persistence configuration for more details.
  #cache: # optional
  #  maxSize: 4096 # size of in-memory client-side cache for hot keys, optional, default=4096
  #  ttl: 1h # how long key-values should be cached, optional, default=1h
  #pipeline: # optional
  #  maxSize: 128 # batch write buffer size, optional, default=128
  #  flushPeriod: 1s # buffer flush interval, optional, default=1s
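
For comparison, here is a minimal sketch of the same section with the Redis variant switched on instead of the in-memory one. It only uses the keys shown in the commented sample above; the host and port are assumptions (a Redis instance on the same machine listening on the Redis default port), and the cache and pipeline values simply repeat the documented defaults.

```yaml
state:
  type: redis
  host: localhost      # assumption: Redis runs on the same machine
  port: 6379           # assumption: Redis listens on its default port
  cache:               # optional, values below are the documented defaults
    maxSize: 4096
    ttl: 1h
  pipeline:            # optional, values below are the documented defaults
    maxSize: 128
    flushPeriod: 1s
```

Since cache and pipeline are optional, leaving both blocks out should give the same behavior as the values above.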

The "features" section describes how to map your input events into ML features that Metarank understands. See Feature extractors for an overview of supported types.
# These features can be shared between multiple models, so if you have a model A using features 1-2-3 and
# a model B using features 1-2, then all three features will be computed only once.
# You need to explicitly include a feature in the model configuration for Metarank to use it.
features:
  - name: popularity
    type: number
    scope: item
    source: item.popularity
    # TTL and refresh fields are part of every feature extractor that Metarank supports.
    # The purpose of TTL is to configure the data retention period: if a feature receives no
    # updates for a long time, it is eventually dropped.
    ttl: 60d
    # The refresh parameter is used to downsample the number of feature updates emitted. For example,
    # there is a window_counter feature extractor, which can be used to count the number of clicks that happened for
    # an item. Incrementing such a counter for a single day is an extremely lightweight operation, but computing
    # window sums is not. As ML models do not always need up-to-date counter values,
    # these window sums can be updated only eventually (like once per hour), which improves the throughput a lot
    # (but results in slightly stale data during the inference process).
    refresh: 1h
  - name: genre
    type: string
    scope: item
    source: item.genres
    values:
      - drama
      - comedy
      - thriller
For inspiration, see the ranklens feature configuration used for the Metarank demo site.

The "models" section describes ML models used for personalization. Check Supported models for more information.
models:
  default: # name of the model, used in the inference process as a part of the path, like /rank/default
    type: lambdamart # model type
    backend:
      type: xgboost # supported values: xgboost, lightgbm for lambdamart model
      iterations: 100 # optional (default 100), number of iterations while training the model
      seed: 0 # optional (default = random), a seed to make training deterministic
    weights: # types and weights of interactions used in the model training
      click: 1 # you can increase the weight of some events to hint the model to optimize more for them
    features: # features from the previous section used in the model
      - popularity
      - genre
  # You can specify several models at once.
  # This can be useful for A/B test scenarios or while testing different sets of features.
  #random:
  #  type: shuffle # shuffle model type produces random results
  #  maxPositionChange: 5 # controls the amount of randomness that shuffle can introduce in the original ranking
  # The noop model does nothing with the original ranking and returns results "as is"
  #noop:
  #  type: noop
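
As an illustration of running several models side by side, the sketch below defines two lambdamart models over overlapping feature sets, which is the case where shared features are computed only once. The model names `wide` and `narrow` are hypothetical, and nesting the engine settings under a `backend` key is an assumption based on the upstream sample config; the feature names come from the "features" example above.

```yaml
models:
  wide:                  # hypothetical name, would be served at /rank/wide
    type: lambdamart
    backend:             # assumption: engine options are nested under "backend"
      type: xgboost
    weights:
      click: 1
    features:
      - popularity
      - genre
  narrow:                # hypothetical name, would be served at /rank/narrow
    type: lambdamart
    backend:
      type: lightgbm
    weights:
      click: 1
    features:
      - popularity       # shared with "wide", so it is not computed twice
```

An A/B test could then route part of the traffic to each path and compare the outcomes.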

The "api" section describes the Metarank API configuration. This section is optional and by default binds the service to port 8080 on all network interfaces.
api:
  port: 8080
  host: ""

The optional "source" section describes the source of the data, and by default expects you to submit all user feedback using the API. Check Supported data sources for more information.
source:
  type: file # source type, available options: file, kafka, pulsar, kinesis
  #path: /home/user/ranklens/events/ # path to events file, alternatively you can use CLI to provide file location
  #offset: earliest|latest|ts=<unixtime>|last=<duration> # default: earliest
  #format: <json|snowplow:tsv|snowplow:json> # file format, default: json
  # Kafka: check Supported data sources for more information
  #type: kafka
  #brokers: [broker1, broker2]
  #topic: events
  #groupId: metarank
  #offset: earliest|latest|ts=<unixtime>|last=<duration>
  #format: <json|snowplow:tsv|snowplow:json>
  # Pulsar: check Supported data sources for more information
  #type: pulsar
  #serviceUrl: <pulsar service URL>
  #adminUrl: <pulsar service HTTP admin URL>
  #topic: events
  #subscriptionName: metarank
  #subscriptionType: exclusive # options are exclusive, shared, failover
  #offset: earliest|latest|ts=<unixtime>|last=<duration>
  #format: <json|snowplow:tsv|snowplow:json>
  # Kinesis: check Supported data sources for more information
  #type: kinesis
  #region: us-east-1
  #topic: events
  #offset: earliest|latest|ts=<unixtime>|last=<duration>
  #format: <json|snowplow:tsv|snowplow:json>
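
To make the commented variants above more concrete, here is a sketch of a Kafka source with every placeholder resolved to one of the listed options. The broker names, topic and group ID are the placeholder values from the sample, not real endpoints, and would need to be replaced with values from your own cluster.

```yaml
source:
  type: kafka
  brokers: [broker1, broker2]   # placeholder broker addresses
  topic: events                 # placeholder topic name
  groupId: metarank             # placeholder consumer group
  offset: earliest              # one of: earliest, latest, ts=<unixtime>, last=<duration>
  format: json                  # one of: json, snowplow:tsv, snowplow:json
```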

The optional "core" section contains parameters related to the Metarank service itself. Default setup:
core:
  # How rankings and interactions are joined into click-throughs. For details, see the section below in this doc.
  clickthrough:
    maxParallelSessions: 10000 # how many active sessions may happen within a `maxSessionLength` period
    maxSessionLength: 30m # the period of inactivity after which a session is considered finalized
                          # default = 30m (to be consistent with Google Analytics)
  tracking:
    # Anonymous usage reporting. It is very helpful to us, so please leave this enabled.
    analytics: true
    errors: true

Metarank joins ranking and interaction events together into click-through chains, which are later used for machine learning model training.
Because interactions happen some time after rankings, Metarank needs to keep a set of rankings in a buffer, awaiting all the interactions that may arrive later.
This buffer policy is controlled by the following parameters:
  • core.clickthrough.maxSessionLength: the time period after which a session is considered finalized, so no more interactions are accepted for it. The default value is 30m, as in Google Analytics.
  • core.clickthrough.maxParallelSessions: how many parallel sessions may be held in the buffer awaiting interactions. The default is 10k.
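
As a rough illustration of tuning these two parameters, the sketch below widens the join window and enlarges the buffer for a site with long sessions and heavy traffic. The specific values are assumptions picked for the example, not recommendations; a longer session length generally means more rankings held in the buffer at once, so the two settings tend to be raised together.

```yaml
core:
  clickthrough:
    maxSessionLength: 1h         # example value: accept interactions up to 1h after a ranking
    maxParallelSessions: 50000   # example value: allow more sessions to wait in the buffer
```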

By default, Metarank collects anonymous usage analytics to improve the tool. No IP addresses are tracked; only simple counters record which parts of the service are being used.
It is very helpful to us, so please leave this enabled. Counters are sent on each service startup.
An example payload:
{
  "state" : "memory",
  "modelTypes" : [ "shuffle" ],
  "usedFeatures" : [
    {
      "name" : "price",
      "type" : "number"
    }
  ],
  "system" : {
    "os" : "Linux",
    "arch" : "amd64",
    "jvm" : "17.0.3",
    // a SHA256 of your network interface MAC address, used as an installation ID
    "macHash" : "3e78137877f66cfb4f1a0875e7eadb3100fb3c6c4755089b3cc6d9f074a3c4b5"
  },
  "mode" : "train",
  "version" : "snapshot",
  "ts" : 1662030509517
}

We use Sentry for error collection. This behavior is enabled by default and can be disabled with core.tracking.errors: false. Sentry is configured with the following options:
  • Breadcrumbs are disabled: so it won't share parts of your console log with us.
  • PII tracking is disabled: no hostnames and IP addresses are included in the error message.
An example error payload is available in sample-error.json.
Both usage analytics and error reporting can also be disabled by setting the environment variable METARANK_TRACKING=false.
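
For reference, a config-file equivalent might look like the sketch below. The `core.tracking.errors` key is named above; `core.tracking.analytics` is an assumption made by analogy with the `analytics` flag in the default core setup, so verify it against the sample config before relying on it.

```yaml
core:
  tracking:
    analytics: false   # assumption: mirrors the "analytics" flag from the default setup
    errors: false      # disables Sentry error reporting
```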