Snowplow
Metarank can be integrated into an existing Snowplow Analytics setup.
We provide a set of Iglu schemas that you can use to track metadata and interaction events, which Metarank can later read from Snowplow's enriched event stream.
Snowplow Trackers are used to track Metarank-specific events.
Metarank will use Snowplow's enriched event stream as a source of events.
A typical Snowplow Analytics setup consists of the following parts:
Using Snowplow Trackers, your application emits clickstream telemetry to the Stream Collector
Stream Collector writes all incoming events into the raw stream
Enrichment validates these events according to the predefined schemas from the Schema Registry
Validated and enriched events are written to the enriched stream
Enriched events are delivered to the Analytics DB
Metarank exposes a set of Snowplow-compatible event schemas and can read events directly from the enriched stream, as shown in the diagram below:
All incoming raw events have a strict JSON schema that consists of the following parts:
predefined fields according to the Snowplow Tracker Protocol
unstructured payload with user-defined schema
multiple context payloads with user-defined schemas
These user-defined schemas are pulled from the Iglu Registry and are standard JSON Schema definitions describing the payload structure.
There are four different Metarank event types with the corresponding schemas:
ai.metarank/item/1-0-0: item metadata event
ai.metarank/user/1-0-0: user metadata event
ai.metarank/ranking/1-0-0: ranking event
ai.metarank/interaction/1-0-0: interaction event
These schemas describe native Metarank event types without any modifications.
Check out github.com/metarank/metarank-snowplow for more details about Metarank schemas.
Snowplow supports multiple streaming platforms for event delivery:
AWS Kinesis: supported by Metarank
Kafka: supported by Metarank
GCP Pubsub: support is planned in the future
NSQ: not supported
Amazon SQS: not supported
stdout: not supported
Metarank needs to receive four types of events describing items, users and how users interact with items:
item metadata: like titles, inventory, tags
user metadata: country, age, location
ranking: what items and in what order were displayed to a visitor
interaction: how visitor interacted with the ranking
These events can be generated on the frontend or on the backend, depending on your setup and where the data is available.
Using the Snowplow JS Tracker SDK, you can track self-describing events: JSON payloads with attached schema references.
An example of tracking a ranking event:
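Below is a minimal sketch using the @snowplow/browser-tracker package; the collector URL, identifiers and item ids are placeholders, and the payload fields mirror the Metarank ranking event format (see the event format reference for the exact schema):

```typescript
import { newTracker, trackSelfDescribingEvent } from '@snowplow/browser-tracker';

// Point the tracker at your Stream Collector (placeholder URL and app id)
newTracker('sp', 'https://collector.example.com', { appId: 'shop-frontend' });

// Report which items were displayed to the visitor and in what order
trackSelfDescribingEvent({
  event: {
    schema: 'iglu:ai.metarank/ranking/jsonschema/1-0-0',
    data: {
      id: 'ranking-81f46c34',            // unique id of this ranking impression
      timestamp: Date.now().toString(),
      user: 'user-1',
      session: 'session-1',
      items: [{ id: 'item-3' }, { id: 'item-1' }, { id: 'item-2' }]
    }
  }
});
```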
Check out the JSON-Schema definitions for events and event format articles for details on fields, event types and their meaning.
Metarank schemas are language-agnostic, so you can instrument your app using any supported Snowplow Tracker SDK for your language or framework of choice.
It often happens that the frontend doesn't have all the information required to generate events. A good example is the item metadata event: tags, titles and prices are usually managed in a back-office system and are not directly exposed to the frontend. In this case you can generate such events on the backend side.
For a sample Java backend application, you can track an item update event with the following code, using the Snowplow Java Tracker SDK:
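A rough sketch using the Snowplow Java Tracker builder API (class names vary slightly between tracker versions; for example, Unstructured is called SelfDescribing in newer releases). The collector URL and item fields are placeholders that mirror the Metarank item metadata event format:

```java
import java.util.List;
import java.util.Map;

import com.snowplowanalytics.snowplow.tracker.Tracker;
import com.snowplowanalytics.snowplow.tracker.emitter.BatchEmitter;
import com.snowplowanalytics.snowplow.tracker.events.Unstructured;
import com.snowplowanalytics.snowplow.tracker.payload.SelfDescribingJson;

public class ItemUpdateTracker {
    public static void main(String[] args) {
        // Emitter pointing at your Stream Collector (placeholder URL)
        BatchEmitter emitter = BatchEmitter.builder()
                .url("https://collector.example.com")
                .build();
        Tracker tracker = new Tracker.TrackerBuilder(emitter, "metarank", "shop-backend").build();

        // Item metadata payload; field names are illustrative and mirror the Metarank item event
        Map<String, Object> item = Map.of(
                "id", "event-1",                       // event id
                "item", "product-1",                   // id of the item being updated
                "timestamp", String.valueOf(System.currentTimeMillis()),
                "fields", List.of(
                        Map.of("name", "title", "value", "Nice red shoes"),
                        Map.of("name", "price", "value", 49.0)
                )
        );

        // Wrap the payload in a self-describing event referencing the Metarank item schema
        Unstructured event = Unstructured.builder()
                .eventData(new SelfDescribingJson("iglu:ai.metarank/item/jsonschema/1-0-0", item))
                .build();
        tracker.track(event);
    }
}
```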
Metarank schemas are available on a public Iglu server at https://iglu.metarank.ai. To use it, add the following snippet to the resolver.json snowplow-enrich config file:
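A sketch of the resolver configuration; the cacheSize value and the Iglu Central entry are illustrative defaults, and the Metarank-specific part is the repository entry pointing at https://iglu.metarank.ai:

```json
{
  "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-1",
  "data": {
    "cacheSize": 500,
    "repositories": [
      {
        "name": "Iglu Central",
        "priority": 0,
        "vendorPrefixes": ["com.snowplowanalytics"],
        "connection": { "http": { "uri": "http://iglucentral.com" } }
      },
      {
        "name": "Metarank",
        "priority": 1,
        "vendorPrefixes": ["ai.metarank"],
        "connection": { "http": { "uri": "https://iglu.metarank.ai" } }
      }
    ]
  }
}
```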
Both http and https schemes are supported, but https is recommended.
Snowplow Enrich emits processed records in TSV format into the downstream Kinesis/Pubsub/etc. topic. This topic is usually monitored by loaders, such as the Snowflake Loader or the S3 Loader.
An example loader integration diagram is shown below:
Snowplow is flexible enough to support different data loading destinations (Redshift, Postgres, Snowflake, S3, etc.), but to access both live and historical enriched event data, Metarank needs access to:
Enriched event stream
Historical enriched stream dumps made with the S3 Loader
At the moment, Metarank supports loading historical events only from the S3 Loader.
Snowplow Enrich is usually configured with three destination streams (output/pii/bad) that share the same HOCON definition:
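A rough sketch of one such stream definition, loosely following the enrich-kinesis reference config; the stream name and region are placeholders, and the exact keys depend on your Enrich flavour and version:

```hocon
"output": {
  "good": {
    "streamName": "snowplow-enriched-good"
    "region": "us-east-1"
  }
  // pii and bad use the same structure with their own stream names
}
```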
To make Metarank connect to this stream, configure the kinesis source in the following way:
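A sketch of the kinesis source block (nest it wherever your Metarank config version expects event sources); the stream name and region are placeholders, and the options follow the Metarank kinesis connector reference:

```yaml
type: kinesis
region: us-east-1
topic: snowplow-enriched-good   # the enriched output stream from the Enrich config
offset: latest                  # or earliest, to replay the whole stream
format: snowplow:tsv            # parse Snowplow enriched TSV records
```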
All the supported Metarank sources have an optional format field, which defines the underlying format of the payload in the stream. Valid options are:
json: the default value, Metarank native format
snowplow, snowplow:tsv: Snowplow default TSV stream format
snowplow:json: Snowplow optional JSON stream format
With format: snowplow:tsv, Metarank reads TSV events and transforms them into its native format automatically.
The Snowplow S3 Loader offloads real-time enriched events to gzip/lzo-compressed files on S3. Given the following sample S3 Loader config snippet:
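A rough sketch of the relevant input/output sections, loosely based on the S3 Loader HOCON config; the stream name and bucket path are placeholders, and field names may differ between loader versions:

```hocon
"input": {
  "streamName": "snowplow-enriched-good"    // the enriched stream produced by Enrich
}
"output": {
  "s3": {
    "path": "s3://acme-snowplow/enriched/"  // destination bucket and prefix
    "compression": "gzip"
  }
}
```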
You can configure Metarank to load these GZIP-compressed event dumps for the bootstrapping process with the file source in the following way:
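A sketch of the file source block, assuming the S3 Loader writes to the bucket prefix above; the options follow the Metarank file connector reference:

```yaml
type: file
path: s3://acme-snowplow/enriched/   # the same prefix the S3 Loader writes to
format: snowplow:tsv                 # GZIP-compressed Snowplow TSV dumps
```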
With Metarank configured to pick up live events from the enriched stream and historical events from the offloaded files in S3, it should be straightforward to follow the usual routine of setting up and running Metarank.