Automated ML model retraining
The problem: how do you periodically re-train an ML model on new data?
Click-through collection
While receiving and processing incoming events, Metarank collects click-through records:
On each ranking event, it logs all the ML feature values used to produce the ranking. Since dynamic feature values constantly change over time, this makes it easy to tell what the value of any feature was at that exact moment.
Within a default 30-minute window (see the core.clickthrough.maxSessionLength option for details), all interactions belonging to this ranking event are collected, so you can tell which items a visitor has seen and later interacted with. After the click-through join window is finalized, the ranking, interactions, and feature values are persisted in the store.
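For example, if your visitors' sessions tend to run longer than 30 minutes, the join window can be widened in the configuration file. The snippet below is a minimal sketch: the key path is taken from the option name above, and the 60m value is purely illustrative.

```yaml
core:
  clickthrough:
    # How long to wait for interactions after a ranking event
    # before the click-through record is finalized. Default: 30m.
    maxSessionLength: 60m
```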
These click-through records can be unfolded into an implicit judgement list. A judgement list can later be translated into an ML-backend-specific training dataset for LambdaMART model training.
Metarank collects click-through records automatically; you don't need to tune anything to enable this behavior.
Manual retraining
Given a production Metarank instance running somewhere in the cloud, you can re-train an ML model locally, based on the history of already collected click-through records:
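A typical invocation is sketched below, assuming you run the published Docker image and mount a directory containing the same config.yml as the serving instance; the mounted path and the model name are placeholders that should match your own config.

```bash
# Re-train the model defined in config.yml, reading click-throughs from the shared store.
docker run -i -t -v /path/to/config/dir:/opt/metarank \
    metarank/metarank:latest \
    train --config /opt/metarank/config.yml --model xgboost
```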
While training, Metarank performs the following steps:
Pull all stored click-through records from the store. Your local Metarank config file should be the same as the one used by the serving instance: the store config and feature definitions must match.
Convert them into XGBoost/LightGBM-compatible judgement lists (see the sketch after this list).
Train the ML model.
Upload the model into the store and notify all API serving instances to reload the model.
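For intuition, such a judgement list maps onto the LibSVM-style ranking format that both XGBoost and LightGBM accept: one line per displayed item, a relevance label derived from interactions, a qid grouping all items of the same ranking, and the logged feature values. The lines below are a hand-written illustration, not Metarank's literal export output.

```
# <label> qid:<ranking id> <feature index>:<feature value> ...
1 qid:42 1:0.52 2:3.0 3:0.11   # clicked item
0 qid:42 1:0.48 2:1.0 3:0.07   # seen, not clicked
0 qid:42 1:0.33 2:0.0 3:0.02   # seen, not clicked
```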
Automated retraining
Metarank can be deployed inside a Kubernetes cluster using the official Helm chart. The chart's values file can create a Kubernetes CronJob that performs the same retraining action described in the previous section, but automatically, on a user-defined schedule:
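The fragment below is a sketch of the relevant part of the chart's values file: train.schedule comes from the option described next, while the resources block follows common Helm conventions and its exact placement may differ in the actual chart.

```yaml
train:
  # Cron-compatible schedule: retrain every day at 03:00.
  schedule: "0 3 * * *"
  resources:
    requests:
      cpu: "2"
      memory: 4Gi
    limits:
      memory: 8Gi
```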
You can define a custom cron-compatible schedule with the train.schedule option. ML model training can require a lot of resources, so it's recommended to define the resource configuration properly so you don't hit OOM errors.
A retraining CronJob is created by default by the Metarank Helm chart, so if you're using it for production deployment, you're ready to go without any configuration changes.
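If you want to double-check that the job was created after installing the chart, a plain kubectl query is enough; the namespace below is a placeholder for wherever you installed the release.

```bash
kubectl get cronjobs -n metarank
```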