Snowplow

Metarank can be integrated into an existing Snowplow Analytics setup.

We provide a set of Iglu schemas (see github.com/metarank/metarank-snowplow) that you can use to track metadata and interaction events, which Metarank can later read from Snowplow's enriched event stream.

  • Snowplow Trackers are used to track Metarank-specific events.

  • Metarank will use Snowplow's enriched event stream as a source of events.

Typical Snowplow architecture

Typical Snowplow Analytics setup consists of the following parts:

  • Using Snowplow Trackers, your application emits clickstream telemetry to the Stream Collector

  • The Stream Collector writes all incoming events into the raw stream

  • Enrichment validates these events according to the predefined schemas from the Schema Registry

  • Validated and enriched events are written to the enriched stream

  • Enriched events are delivered to the Analytics DB

Metarank exposes a set of Snowplow-compatible event schemas and can read events directly from the enriched stream.

Schema registry

All incoming raw events have a strict JSON schema that consists of the following parts:

  • predefined fields according to the Snowplow Tracker Protocol

  • an unstructured payload with a user-defined schema

  • multiple context payloads with user-defined schemas

These user-defined schemas are pulled from the Iglu Registry and are standard JSON Schema definitions describing the payload structure.

There are four different Metarank event types with the corresponding schemas:

  • item metadata event: ai.metarank/item/1-0-0

  • user metadata event: ai.metarank/user/1-0-0

  • ranking event: ai.metarank/ranking/1-0-0

  • interaction event: ai.metarank/interaction/1-0-0

These schemas describe native Metarank event types without any modifications.

Check out github.com/metarank/metarank-snowplow for more details about Metarank schemas.

Stream transport types

Snowplow supports multiple streaming platforms for event delivery:

  • AWS Kinesis: supported by Metarank

  • Kafka: supported by Metarank

  • GCP Pubsub: support is planned in the future

  • NSQ: not supported

  • Amazon SQS: not supported

  • stdout: not supported

Setting up event tracking

Metarank needs to receive four types of events, describing items, users, and how users interact with items:

  • item metadata: titles, inventory, tags

  • user metadata: country, age, location

  • ranking: what items, and in what order, were displayed to a visitor

  • interaction: how the visitor interacted with the ranking

These events can be generated both on the frontend and on the backend, depending on your setup and on which data is available on each side.

Frontend tracking

Using the Snowplow JS Tracker SDK, you can track self-describing events, which are JSON payloads with attached schema references.

An example of tracking a ranking event:

import { trackSelfDescribingEvent } from '@snowplow/browser-tracker';

trackSelfDescribingEvent({
  event: {
    schema: 'iglu:ai.metarank/ranking/jsonschema/1-0-0',
    data: {
        event: 'ranking',
        id: '81f46c34-a4bb-469c-8708-f8127cd67d27',
        timestamp: '1599391467000',
        user: 'user1',
        session: 'session1',
        fields: [
            { name: 'query', value: 'cat' },
            { name: 'source', value: 'search' }
        ],
        items: [
            { id: "item1" },
            { id: "item2" }
        ]
    }
  }
});
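Interaction events follow the same pattern. Here is a minimal sketch of tracking a click on one of the ranked items, assuming the ai.metarank/interaction/1-0-0 schema and the ranking id from the example above (the event id value is just an illustration):

import { trackSelfDescribingEvent } from '@snowplow/browser-tracker';

trackSelfDescribingEvent({
  event: {
    schema: 'iglu:ai.metarank/interaction/jsonschema/1-0-0',
    data: {
        event: 'interaction',
        id: 'f741b948-ec0c-4ddb-a52c-9d55f182d3f5', // unique event id, example value
        timestamp: '1599391467100',
        user: 'user1',
        session: 'session1',
        type: 'click',  // interaction type, as configured in Metarank
        item: 'item1',  // the item the visitor clicked on
        ranking: '81f46c34-a4bb-469c-8708-f8127cd67d27' // id of the ranking event shown above
    }
  }
});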

Check out the JSON Schema definitions for events and the Metarank event format reference for details on fields, event types, and their meaning.

Backend tracking

It often happens that the frontend doesn't have all the information required to generate events. A good example is the item metadata event: tags, titles, and prices are usually managed in a back-office system and are not directly exposed to the frontend.

In this case you can generate such events on the backend side. Metarank schemas are language-agnostic, so you can instrument your app using any supported Snowplow Tracker SDK for your favourite language or framework.

For a sample Java backend application, you can track an item update event with the following code, using the Snowplow Java Tracker SDK:

import com.snowplowanalytics.snowplow.tracker.*;
import com.snowplowanalytics.snowplow.tracker.emitter.*;
import com.snowplowanalytics.snowplow.tracker.events.Unstructured;
import com.snowplowanalytics.snowplow.tracker.payload.SelfDescribingJson;

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class JavaTrackerExample {
    public static void main(String[] args) {
        BatchEmitter emitter = BatchEmitter.builder()
                .url("http://collectorEndpoint")
                .build();

        Tracker tracker = new Tracker
                .TrackerBuilder(emitter, "trackerNamespace", "appId")
                .build();

        Map<String, Object> payload = new HashMap<>();
        payload.put("event", "item");
        payload.put("id", "81f46c34-a4bb-469c-8708-f8127cd67d27");
        payload.put("timestamp", String.valueOf(System.currentTimeMillis()));
        payload.put("item", "item1");

        Map<String, Object> fields = new HashMap<>();
        fields.put("title", "your cat");
        fields.put("color", List.of("white", "black"));
        payload.put("fields", fields);

        Unstructured unstructured = Unstructured.builder()
                .eventData(new SelfDescribingJson("iglu:ai.metarank/item/jsonschema/1-0-0", payload))
                .build();

        tracker.track(unstructured);
    }
}

Installing ai.metarank schemas

Metarank schemas are available from a public Iglu server at https://iglu.metarank.ai. To use it, add the following repository to the resolver.json config file of snowplow-enrich:

{
  "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-1",
  "data": {
    "cacheSize": 500,
    "repositories": [
      {
        "name": "Metarank",
        "priority": 0,
        "vendorPrefixes": [ "ai.metarank" ],
        "connection": {
          "http": {
            "uri": "https://iglu.metarank.ai"
          }
        }
      }
    ]
  }
}

Both http and https URL schemes are supported, but https is recommended.
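To double-check that the registry is reachable, you can fetch one of the schemas directly. Iglu repositories conventionally serve schemas under /schemas/{vendor}/{name}/{format}/{version}, so assuming that layout:

curl https://iglu.metarank.ai/schemas/ai.metarank/ranking/jsonschema/1-0-0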

Connecting Metarank with Snowplow

Snowplow Enrich emits processed records in TSV format into the downstream Kinesis/Pubsub/etc. topic. This topic is usually monitored by "loaders", such as the Snowflake loader or the S3 Loader.

Snowplow is flexible enough to use different data loading destinations (Redshift, Postgres, Snowflake, S3, etc.), but to access both live and historical enriched event data, Metarank needs access to:

  • Enriched event stream

  • Historical enriched stream dumps done with S3 Loader

At the moment Metarank supports loading historical events only from S3 Loader.

Realtime events from AWS Kinesis

Snowplow Enrich is usually configured with three destination streams (good/pii/bad) in its output section, each with a similar HOCON definition:

  "output": {
    # Enriched events output
    "good": {
      "type": "Kinesis"

      # Name of the Kinesis stream to write to
      "streamName": "enriched"

      # Optional. Maximum amount of time an enriched event may spend being buffered before it gets sent
      "maxBufferedTime": 100 millis
    }
  }
To make Metarank connect to this stream, configure the kinesis source in the following way:

inference:
  port: 8080
  host: "0.0.0.0"
  source:
    type: kinesis
    region: us-east-1
    topic: enriched
    offset: latest
    format: snowplow:tsv

All the supported Metarank sources have an optional format field, which defines the underlying format of the payload in this stream. Valid options are:

  • json: default value, Metarank native format

  • snowplow, snowplow:tsv: Snowplow default TSV stream format

  • snowplow:json: Snowplow optional JSON stream format

With format: snowplow:tsv, Metarank will read TSV events and transform them into its native event format automatically.
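A Kafka-based Snowplow setup can be wired up in the same way. The following is only a sketch: the field names (brokers, topic, groupId) are assumptions used for illustration, so check the Data Sources reference for the exact kafka source options:

inference:
  source:
    type: kafka
    brokers: ["broker1:9092", "broker2:9092"]
    topic: enriched
    groupId: metarank
    offset: latest
    format: snowplow:tsv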

Historical events from AWS S3

The Snowplow S3 Loader offloads realtime enriched events to GZIP/LZO-compressed files on S3. Given the following sample S3 Loader config snippet:

{
  # Optional, but recommended
  "region": "us-east-1",

  # Options are: RAW, ENRICHED_EVENTS, JSON
  "purpose": "ENRICHED_EVENTS",

  # Input Stream config
  "input": {
    # Kinesis Client Lib app name (corresponds to DynamoDB table name)
    "appName": "acme-s3-loader",
    # Kinesis stream name
    "streamName": "enriched",
    # Options are: LATEST, TRIM_HORIZON, AT_TIMESTAMP
    "position": "LATEST",
    # Max batch size to pull from Kinesis
    "maxRecords": 10
  },

  "output": {
    "s3": {
      # Full path to output data
      "path": "s3://acme-snowplow-output/enriched/",

      # Partitioning format; Optional
      # Valid substitutions are {vendor}, {schema}, {format}, {model} for self-describing jsons
      # and {yy}, {mm}, {dd}, {hh} for year, month, day, hour
      partitionFormat: "{vendor}.{schema}/model={model}/date={yy}-{mm}-{dd}"

      # Output format; Options: GZIP, LZO
      "compression": "GZIP"
    }
  }
}
You can instruct Metarank to load these GZIP-compressed event dumps during the bootstrap process with the file source, configured in the following way:

bootstrap:
  source:
    type: file
    path: "s3://acme-snowplow-output/enriched/"
    offset: earliest
    format: snowplow:tsv

Validating the setup

With Metarank configured to pick up live events from the enriched stream and historical events from the offloaded files in S3, it should be straightforward to go through the usual routine of setting up and running Metarank.
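As a final sanity check, the two source snippets above typically sit in the same Metarank config file. A sketch of how they might fit together, with the rest of the configuration (features, models, persistence) omitted:

bootstrap:
  source:
    type: file
    path: "s3://acme-snowplow-output/enriched/"
    offset: earliest
    format: snowplow:tsv

inference:
  port: 8080
  host: "0.0.0.0"
  source:
    type: kinesis
    region: us-east-1
    topic: enriched
    offset: latest
    format: snowplow:tsv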
