Automatic feature engineering

A typical problem: to write a Metarank config file with event-to-feature mapping, you need to read the docs on feature extraction and have a good understanding of your click-through input dataset:

  • Which fields do items have? Which values does each field have?

  • Do these values look like categories?

  • How many unique values are there per field?

Nobody likes reading docs and writing YAML, so Metarank has an AutoML level-4 style generator of typical feature extractors based on the historical click-through dataset you already have.

Running the autofeature generator

Use the autofeature sub-command from the main binary:

                __                              __    
  _____   _____/  |______ ____________    ____ |  | __
 /     \_/ __ \   __\__  \\_  __ \__  \  /    \|  |/ /
|  Y Y  \  ___/|  |  / __ \|  | \// __ \|   |  \    < 
|__|_|  /\___  >__| (____  /__|  (____  /___|  /__|_ \
      \/     \/          \/           \/     \/     \/ ver:None
Usage: metarank <subcommand> <options>

Subcommand: autofeature - generate reference config based on existing data
  -d, --data  <arg>      path to a directory with input files
  -f, --format  <arg>    input file format: json, snowplow, snowplow:tsv,
                         snowplow:json (optional, default=json)
  -o, --offset  <arg>    offset: earliest, latest, ts=1663161962, last=1h
                         (optional, default=earliest)
      --out  <arg>       path to an output config file
  -r, --ruleset  <arg>   set of rules to generate config: stable, all (optional,
                         default=stable, values: [stable, all])
  -c, --cat-threshold  <arg>   min threshold of category frequency, when its
                               considered a catergory (optional, default=0.003)
  -h, --help             Show help message

For all other tricks, consult the docs on https://docs.metarank.ai

A minimal example command to generate a config file for your dataset:
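
  metarank autofeature --data /path/to/events/ --out config.yml

The input path above is just a placeholder: point --data at the directory with your own event files, and add --format if they are not in the default json format.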

For example, for the RankLens dataset it will emit a config like the following:
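
The snippet below is only an illustrative sketch of the generated file's shape: the field names (popularity, vote_avg, genres) come from the RankLens movie metadata, and the values list is an abbreviated example rather than the literal generated output:

  features:
    - name: popularity
      type: number
      scope: item
      source: item.popularity
    - name: vote_avg
      type: number
      scope: item
      source: item.vote_avg
    - name: genres
      type: string
      scope: item
      source: item.genres
      values: [drama, comedy, thriller, action, adventure]
    - name: relevancy
      type: relevancy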

Supported heuristics

Metarank has multiple sets of heuristics for generating feature configuration, selected with the --ruleset CLI option:

  • stable: the default ruleset, with less aggressive heuristics that are proven to be safe in production use.

  • all: generates all features it can, even the problematic ones (like CTR, which may introduce biases).

The following stable heuristics are supported:

  • Numeric: all numerical item fields are encoded as a number feature. So for a numeric field budget describing a movie's budget in dollars, it will generate the following feature extractor definition:
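
A minimal sketch of such a definition, assuming the field is sent as item.budget in your item metadata events (parameter names follow the number extractor docs and may differ slightly between Metarank versions):

  - name: budget
    type: number
    scope: item
    source: item.budget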

  • String: string item fields with low cardinality are encoded as a string feature. The movie genres field is a good candidate for this heuristic due to its low cardinality:
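
A sketch of the generated extractor, assuming the genres are sent as item.genres; the category values listed under values are illustrative, as the real list is taken from your data:

  - name: genres
    type: string
    scope: item
    source: item.genres
    values: [drama, comedy, thriller, action]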

If you have many distinct categories and Metarank does not pick them up (e.g. it decides that a category is too infrequent, and you get far fewer categories than expected), you can lower the category frequency threshold with the --cat-threshold flag.

The default --cat-threshold value of 0.003 means that only categories with frequencies above 0.3% are included.
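
For example, to also include categories that appear in at least 0.1% of items (the input path is a placeholder):

  metarank autofeature --data /path/to/events/ --out config.yml --cat-threshold 0.001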

  • InteractedWith: all interactions over low-cardinality fields are translated into an interacted_with feature. So if a user clicked on an item with the horror genre, other horror movies may get extra points:
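
A sketch of such a feature for click interactions over the genres field; parameter names follow the interacted_with extractor docs and may differ slightly between Metarank versions:

  - name: clicked_genres
    type: interacted_with
    interaction: click
    field: item.genres
    scope: user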

  • Relevancy: if rankings with non-zero relevancy are present, then a relevancy feature is built:
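
The generated extractor is minimal, along the lines of:

  - name: relevancy
    type: relevancy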

  • Vector: all numerical vectors are transformed into statically-sized features. Vectors of static size are passed through as-is, and variable-length vectors are reduced to a quadruplet of [min, max, size, avg] values:
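
A sketch, assuming a hypothetical item.embedding vector field; the reduce list mirrors the [min, max, size, avg] reduction described above, but check the vector extractor docs for the exact parameter names in your Metarank version:

  - name: embedding
    type: vector
    scope: item
    source: item.embedding
    reduce: [min, max, size, avg]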

The all ruleset contains all the stable heuristics, with the addition of a couple of extra ones:

  • Rate: for all interaction types, a rate feature is generated over multiple typical time windows:
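
A sketch of a click-through rate computed over 7- and 30-day windows, assuming click and impression interaction types are present in your data; parameter names follow the rate extractor docs:

  - name: click_rate
    type: rate
    top: click
    bottom: impression
    scope: item
    bucket: 24h
    periods: [7, 30]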

Why does the stable ruleset have no counters?

The main difference between the two rulesets is the lack of rate/window_count features in stable, which is a deliberate choice:

  • rate/window_count features usually introduce a popularity bias into the final ranking: as people tend to click more on popular items, the ML model may learn to always put popular items on top just because they're popular.

  • this behavior may not be that bad from a business-KPI standpoint, but it can make your ranking more static and less affected by past visitor actions.
