Metarank expects your historical data to be sorted by timestamp in ascending order. If for any reason you cannot generate a sorted file, the sort sub-command can do the job for you.
You can sort both single files and folders containing multiple files. For a folder, the sort command merges all data into one sorted file.
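To make the expected ordering concrete, here is a minimal Python sketch of what a merge-and-sort step amounts to. It is not Metarank's implementation. It assumes the events are stored as JSON lines and that each event has a `timestamp` field holding epoch milliseconds; the function names `load_events` and `sort_events` are hypothetical.

```python
import json
from pathlib import Path


def load_events(path):
    """Read JSON-lines events from one file, or from every file in a folder."""
    p = Path(path)
    files = sorted(f for f in p.iterdir() if f.is_file()) if p.is_dir() else [p]
    events = []
    for f in files:
        with f.open(encoding="utf-8") as fh:
            events.extend(json.loads(line) for line in fh if line.strip())
    return events


def sort_events(path, out_path):
    """Merge all events and write them to one file in ascending timestamp order."""
    # Assumption: "timestamp" is epoch milliseconds, as a number or numeric string.
    events = sorted(load_events(path), key=lambda e: int(e["timestamp"]))
    with open(out_path, "w", encoding="utf-8") as out:
        for event in events:
            out.write(json.dumps(event) + "\n")
    return len(events)
```

The sketch loads everything into memory, which is fine for small datasets only; for real data the sort sub-command is the intended tool.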
If you don't know what features to include in the configuration file, the autofeature sub-command can generate the configuration for you based on the historical data you have.
If the --model <name> option is not given, Metarank trains all the defined models sequentially.
While training the model, Metarank splits your data into train/validation datasets using one of the following splitting strategies:
random: shuffle all the training samples and take N% as the training part. This may result in implicit leakage, where information about the future leaks into the training set.
An example: at Christmas, items with Santa sell much better (and do not sell at all afterwards). Leaking this knowledge into your training set results in better offline scores, as the model knows that Christmas is coming. In production, the model will behave significantly worse, as it can no longer predict the future from the training data.
time: sort all training samples by timestamp and pick the first N% as the training set (the default option).
hold_last: group all samples by user, and sort each user's samples by timestamp. The first N% of samples for each user are picked into the training dataset.
It has the same issue with future information leaking into the model, but optimizes the training dataset to focus on the last user click.
The format of the split strategy CLI flag is --split name=ratio%. For example:
random with 90% ratio: --split random=90%
random with a default 80% ratio: --split random
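The three strategies and the flag format can be sketched in a few lines of Python. This is an illustration of the descriptions above, not Metarank's actual code. The sample shape (`{"user": ..., "ts": ...}`) and the function names `parse_split` and `split` are assumptions made for the example.

```python
import random
from collections import defaultdict


def parse_split(flag):
    """Parse 'name=ratio%' into (name, ratio); the ratio falls back to 80%."""
    name, _, ratio = flag.partition("=")
    return name, (float(ratio.rstrip("%")) / 100.0 if ratio else 0.8)


def split(samples, strategy="time", ratio=0.8, seed=0):
    """Return (train, test) lists for samples shaped like {"user": ..., "ts": ...}."""
    if strategy == "random":
        # Shuffle everything: future samples can end up in the training part.
        shuffled = samples[:]
        random.Random(seed).shuffle(shuffled)
        cut = int(len(shuffled) * ratio)
        return shuffled[:cut], shuffled[cut:]
    if strategy == "time":
        # Oldest samples go to train, newest to test.
        ordered = sorted(samples, key=lambda s: s["ts"])
        cut = int(len(ordered) * ratio)
        return ordered[:cut], ordered[cut:]
    if strategy == "hold_last":
        # Per user: oldest samples go to train, that user's latest go to test.
        by_user = defaultdict(list)
        for s in samples:
            by_user[s["user"]].append(s)
        train, test = [], []
        for user_samples in by_user.values():
            ordered = sorted(user_samples, key=lambda s: s["ts"])
            cut = int(len(ordered) * ratio)
            train.extend(ordered[:cut])
            test.extend(ordered[cut:])
        return train, test
    raise ValueError(f"unknown strategy: {strategy}")
```

With hold_last, one user's training samples can still be newer than another user's test samples, which is where the future leakage comes from.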
Dataset export
Metarank can emit CSV/LibSVM-formatted datasets and corresponding config files for LightGBM and XGBoost, so you can later perform hyper-parameter optimization using your favourite tool.
The export format depends on the model backend type:
For XGBoost, we export LibSVM-encoded train/test files with an embedded qid. This format is not compatible with the LightGBM LibSVM reader implementation (and we were unable to make it work with both).
For LightGBM, we export CSV-encoded train/test files with a header. XGBoost (as of version 1.70) cannot load query information from CSV files, so the CSV format cannot be used for LambdaMART with XGBoost.
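To illustrate the difference between the two encodings, the Python sketch below renders the same toy rows both ways. It is a hedged illustration, not Metarank's exporter. The feature names, the column order and the 1-based LibSVM feature indices are assumptions; only the `label` and `group` column names are taken from the generated lightgbm.conf.

```python
def to_libsvm(rows):
    """Encode rows as '<label> qid:<group> <index>:<value> ...' lines."""
    lines = []
    for label, group, features in rows:
        # Assumption: feature indices start at 1.
        feats = " ".join(f"{i}:{v}" for i, v in enumerate(features, start=1))
        lines.append(f"{label} qid:{group} {feats}")
    return "\n".join(lines)


def to_csv(rows, feature_names):
    """Encode rows as CSV with a header containing label and group columns."""
    header = ",".join(["label", "group", *feature_names])
    body = [
        ",".join(str(x) for x in (label, group, *features))
        for label, group, features in rows
    ]
    return "\n".join([header, *body])


# Hypothetical data: (label, query group id, feature values).
rows = [(1, 10, [0.5, 2.0]), (0, 10, [0.1, 3.0])]
print(to_libsvm(rows))
print(to_csv(rows, ["f1", "f2"]))
```

In the LibSVM form the query group travels inside each line as qid, while in the CSV form it is an ordinary column that the reader has to be told about.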
For both booster implementations, Metarank also emits a corresponding config file with all default values filled in. You can run the lightgbm/xgboost CLI tool externally to replicate what Metarank is doing.
$> ls -l
total 39740
-rw-r--r-- 1 shutty shutty      219 Nov  2 13:01 lightgbm.conf
-rw-r--r-- 1 shutty shutty  6845983 Nov  2 13:01 test.csv
-rw-r--r-- 1 shutty shutty 28453327 Nov  2 13:01 train.csv

$> cat lightgbm.conf
objective=lambdarank
data=train.csv
valid=test.csv
num_iterations=5
learning_rate=0.1
seed=0
max_depth=8
header=true
label_column=name:label
group_column=name:group
lambdarank_truncation_level=10
metric=ndcg
eval_at=10

$> lightgbm config=lightgbm.conf
[LightGBM] [Info] Warning: last line of lightgbm.conf has no end of line, still using this line
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
[LightGBM] [Info] Finished loading parameters
[LightGBM] [Info] Using column label as label
[LightGBM] [Info] Using column group as group/query id
[LightGBM] [Info] Construct bin mappers from text data time 0.14 seconds
[LightGBM] [Info] Finished loading data in 0.279901 seconds
[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.020266 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1837
[LightGBM] [Info] Number of data points in the train set: 164232, number of used features: 29
[LightGBM] [Info] Finished initializing training
[LightGBM] [Info] Started training...
[LightGBM] [Info] Iteration:1, valid_1 ndcg@10 : 0.561739
[LightGBM] [Info] 0.009204 seconds elapsed, finished iteration 1
[LightGBM] [Info] Iteration:2, valid_1 ndcg@10 : 0.572802
[LightGBM] [Info] 0.017980 seconds elapsed, finished iteration 2
[LightGBM] [Info] Iteration:3, valid_1 ndcg@10 : 0.576812
[LightGBM] [Info] 0.032011 seconds elapsed, finished iteration 3
[LightGBM] [Info] Iteration:4, valid_1 ndcg@10 : 0.582428
[LightGBM] [Info] 0.045455 seconds elapsed, finished iteration 4
[LightGBM] [Info] Iteration:5, valid_1 ndcg@10 : 0.582633
[LightGBM] [Info] 0.052650 seconds elapsed, finished iteration 5
[LightGBM] [Info] Finished training
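Since the emitted config is a plain key=value file, one way to start a hyper-parameter search is to generate variants of it and feed each one to the external CLI. The Python sketch below shows that idea only; it is not part of Metarank. The helper names `read_conf`, `write_conf` and `variants` and the example grid values are hypothetical.

```python
import itertools


def read_conf(path):
    """Parse a key=value config file into a dict, skipping blanks and # comments."""
    params = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line and not line.startswith("#"):
                key, _, value = line.partition("=")
                params[key.strip()] = value.strip()
    return params


def write_conf(params, path):
    """Write params back as key=value lines, each terminated by a newline."""
    with open(path, "w", encoding="utf-8") as fh:
        for key, value in params.items():
            fh.write(f"{key}={value}\n")


def variants(base, grid):
    """Yield one parameter dict per value combination in grid (a dict of lists)."""
    keys = list(grid)
    for combo in itertools.product(*(grid[k] for k in keys)):
        yield {**base, **{k: str(v) for k, v in zip(keys, combo)}}
```

A possible use is to call `read_conf("lightgbm.conf")`, iterate over `variants(base, {"learning_rate": [0.05, 0.1], "max_depth": [6, 8]})`, write each variant to its own file and run `lightgbm config=<file>` on it. Ending every line with a newline should also avoid the "last line ... has no end of line" message seen in the log above.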
Metarank supports the same train/test split strategies for the export sub-command as for the train one.
BM25 term frequencies dictionary
To use the BM25 score in the field_match extractor, you first need to compute some statistics over your text data.
The same dictionary can be used for multiple field_match extractors, for example when you want to have separate BM25 scores for query-title and query-description matches.
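To show why corpus-level statistics are needed before scoring, here is a minimal Python sketch of a textbook Okapi BM25 scorer. It is an assumption-laden illustration: the exact statistics Metarank stores in its dictionary, its tokenization and its BM25 variant may differ. The `k1=1.2` and `b=0.75` defaults are common conventions, and `build_stats` and `bm25` are hypothetical names.

```python
import math
from collections import Counter


def build_stats(docs):
    """Collect per-term document frequencies, document count and average length."""
    if not docs:
        raise ValueError("need at least one document")
    df = Counter()
    total_len = 0
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
        total_len += len(tokens)
    return {"df": df, "docs": len(docs), "avg_len": total_len / len(docs)}


def bm25(query, doc, stats, k1=1.2, b=0.75):
    """Score one tokenized document against a tokenized query."""
    tf = Counter(doc)
    score = 0.0
    for term in set(query):
        n = stats["df"].get(term, 0)
        if n == 0 or tf[term] == 0:
            continue  # term unknown to the corpus or absent from this document
        # The +1 inside the log keeps the IDF non-negative.
        idf = math.log(1 + (stats["docs"] - n + 0.5) / (n + 0.5))
        norm = tf[term] + k1 * (1 - b + b * len(doc) / stats["avg_len"])
        score += idf * tf[term] * (k1 + 1) / norm
    return score
```

The statistics are built once over the whole corpus and then reused for every query-field pair, which mirrors how one dictionary can serve several field_match extractors.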
Environment variables
The config file can be passed to Metarank not only as a command-line argument, but also as an environment variable. This is typically used in Docker and k8s-based deployments:
METARANK_CONFIG: path to config file, for example s3://bucket/prefix/config.yml