Analytics Concepts Guide

In Beeks Analytics, probes capture huge quantities of data at Visibility Points in the network. The data is then mapped to a common format and used in calculations that output easily readable statistics (“stats”) that tell our customers about their network performance.

Aggregation

An Aggregation (noun) is a set of specific calculations applied to data or previously calculated statistics. For example, an aggregation might calculate a single value, such as the average time (in seconds) it takes for RFQs to receive a price within a 10-second window.

Aggregation (verb) is the act of calculating stats from data. Aggregation can occur at the following places in Beeks Analytics:

Stack Probes in VMX-Capture
A Statistics Collector (or stat collector) is a module in a stack probe that aggregates data to generate statistics from the observed traffic, which it then writes to a downstream component such as VMX-Analysis via a Generic Aggregator Input Event (GAIE) Connector.

Stack probes usually calculate only the stats that require a lot of data, such as microburst detection. In a microburst example, a VMX-Capture Statistics Collector measures the bandwidth in each 1-millisecond window and records the low, average and maximum values seen in each 10-second period. We generate microburst statistics in VMX-Capture as an efficiency measure, because passing large volumes of data to VMX-Analysis may require additional timing checks to guarantee precision.
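
The microburst calculation described above can be sketched as follows. This is a minimal illustration and not the VMX-Capture implementation: the function name, the (timestamp, size) input format and the output field names are all assumptions, and windows with no traffic are assumed to count as zero bandwidth.

```python
from collections import defaultdict

WINDOW_NS = 1_000_000        # 1 ms measurement window
PERIOD_NS = 10_000_000_000   # 10 s reporting period


def microburst_stats(packets):
    """packets: iterable of (timestamp_ns, size_bytes) pairs (hypothetical input format).

    Returns {period_index: {"min_bps", "mean_bps", "max_bps"}}.
    """
    # Sum the bytes observed in each 1 ms window.
    window_bytes = defaultdict(int)
    for ts_ns, size in packets:
        window_bytes[ts_ns // WINDOW_NS] += size

    windows_per_period = PERIOD_NS // WINDOW_NS        # 10,000 windows per period
    windows_per_second = 1_000_000_000 // WINDOW_NS    # 1,000 windows per second

    # Convert each window to a bandwidth in bits per second and group by period.
    rates_by_period = defaultdict(list)
    for window, nbytes in window_bytes.items():
        rates_by_period[window // windows_per_period].append(nbytes * 8 * windows_per_second)

    stats = {}
    for period, rates in rates_by_period.items():
        has_empty_windows = len(rates) < windows_per_period
        stats[period] = {
            # Assumption: a window with no traffic counts as 0 bits/s.
            "min_bps": 0 if has_empty_windows else min(rates),
            "mean_bps": sum(rates) / windows_per_period,
            "max_bps": max(rates),
        }
    return stats
```

Only three values per 10-second period leave the collector, however many packets were observed, which is the efficiency gain the text describes.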

When these stats are passed to VMX-Analysis for use in further aggregation, we refer to them as Pre-aggregated Stats. When these stats are passed to the Core Data Feed, we refer to them as stats.
VMX-Analysis
Aggregation in VMX-Analysis performs calculations on Agent Event data and on Pre-aggregated Stats.
P3 Process
The P3 process also generates Pre-aggregated Stats for further aggregation in VMX-Analysis. However, this pre-aggregation differs from the pre-aggregation in Stack Probes.

P3 pre-aggregation is used when the volume of messages is exceptionally high and cannot be handled efficiently by Stack Probes alone. In this scenario, the message load is split between multiple Stack Probes, and the probe outputs are then split between multiple P3 processes. Each P3 process can then roll up its distribution buckets to VMX-Analysis, which gives effective HDR distributions for high message volume calculations.
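
To illustrate why rolling up buckets works, here is a minimal sketch that merges bucketed distributions from several sources and reads a percentile from the result. The histogram format (a mapping from bucket upper bound to count) and both function names are assumptions made for this example; the actual P3 output format is not described here.

```python
from collections import Counter


def merge_histograms(histograms):
    """Add bucket counts from several histograms ({bucket_upper_bound: count})."""
    merged = Counter()
    for hist in histograms:
        merged.update(hist)  # Counter.update adds counts for matching buckets
    return merged


def percentile(hist, pct):
    """Return the upper bound of the bucket that contains the pct-th percentile."""
    total = sum(hist.values())
    if total == 0:
        raise ValueError("empty histogram")
    threshold = total * pct / 100
    running = 0
    for upper_bound in sorted(hist):
        running += hist[upper_bound]
        if running >= threshold:
            return upper_bound


# Hypothetical latency buckets (microseconds) from two P3 processes.
p3_a = {10: 5, 100: 3}
p3_b = {10: 2, 1000: 1}
combined = merge_histograms([p3_a, p3_b])
```

Because bucket counts are additive, the merged histogram gives the same percentiles as a single histogram built from all of the messages. Percentiles computed separately by each process could not be combined in this way.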

What are Aggregators?

Aggregators perform aggregation. One Aggregator can perform multiple aggregations.

In Beeks Analytics, Aggregators generate stats that are hierarchical, so that you can view top-level summaries and then more granular stats. We think of this hierarchy as a tree, with a root node, branch nodes, and leaf nodes.

A lot of the data you’ll see in VMX-Explorer dashboards is the output of one or more aggregations in an Aggregator.

What Aggregators are provided in Beeks Analytics for Markets?

See Beeks Analytics for Markets Aggregators in the Beeks Analytics Data Guide.

What is Pre-Aggregation?

Within the Beeks Analytics architecture, pre-aggregation most often means that statistics derived from high-volume data are calculated in the VMX-Capture layer and then passed as aggregated statistics to VMX-Analysis. Rather than processing each individual network packet or application message, VMX-Analysis processes only the statistics updates from VMX-Capture. This is how Beeks Analytics achieves its high levels of performance compared to competitor products.

See the Beeks Analytics Performance Guide for details of Beeks Analytics performance benchmarks, and in particular the market-leading market data update processing.

However, pre-aggregation also refers to the way that data is stored in Beeks, and affects how clients can access this data. VMX-Analysis stores all of its data using a fixed hierarchical nodepath.

Nodepath terminology

A nodepath is a hierarchy of the properties we want to aggregate on.

For example, an aggregation for market data traffic with 6 levels might have the following properties:
Level 1: Visibility Point (root node)
Level 2: External Group name (branch node)
Level 3: Switchport (branch node)
Level 4: Market Data feed (branch node)
Level 5: Market Data feed Side - A or B (branch node)
Level 6: Market Data feed Channel (leaf node)

This structure is very similar to the MD_Stats aggregator, which is implemented as part of Beeks Analytics for Markets (BAM) templated deployments. See the Beeks Analytics Data Guide for more information about the different BAM aggregators.

We can represent this aggregation as the nodepath:

vp/extgroup/switchport/feed/side/channel

This is a full nodepath, i.e. it represents the hierarchy from root to leaf. We call the full nodepath a Key. We can also define aggregations using partial nodepaths that terminate in a branch instead of a leaf node. For example:

vp/extgroup/switchport/
vp/extgroup/
vp
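
The relationship between a Key and its partial nodepaths can be sketched in a few lines. The helper name is hypothetical, and the sketch assumes that "/" never appears inside a level's value.

```python
def nodepath_prefixes(key, sep="/"):
    """Return every nodepath from the root down to the full Key."""
    parts = key.strip(sep).split(sep)
    return [sep.join(parts[:depth]) for depth in range(1, len(parts) + 1)]


for path in nodepath_prefixes("vp/extgroup/switchport/feed/side/channel"):
    print(path)
```

Each entry except the last is a partial nodepath that terminates in a branch node; the last entry is the Key.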

In relational or tabular database terms, you can think of the nodepath as the primary key for the table, for example:

Nodepath (primary key)	mAv Packets/s	mAv wiretime
VP/NYSE NY Primary/Port6/NYSE OpenBook Ultra/B/UZ_data	340	2.68 ms

Because each of the cells in the table also has a history recorded for it (or may have a forecast predicted for it), you can also think of the Beeks Analytics native statistics storage model as a multi-dimensional OLAP cube with a temporal dimension, or as a hierarchical time-series database.

Comparison of fixed nodepath hierarchy with other data models

The advantages of writing data with a fixed nodepath hierarchy are:

Because aggregates are calculated and stored at insert time, read performance is not slowed down by having to compute aggregations on the fly.
Aggregations can be kept up-to-date in real-time.

A disadvantage is that you have less flexibility if the hierarchy needs to change - reprocessing of the original data may be required if you want to aggregate the data in a different way.

Although Beeks Analytics natively uses a fixed hierarchical nodepath in its aggregator structures, you can adopt other data models when you write data to an external data source using the Core Data Feed, depending on your particular needs for that data.

This comparison may help determine the correct approach for you to store your data:

Criteria	Hierarchical Node Path (Beeks Analytics native, also supported via CDF)	Flat Data Model (via Core Data Feed)
Query Speed	Fast (precomputed)	Slower (aggregated at query time)
Insert Performance	Fast	Slower (indexing may be needed)
Flexibility	Rigid (fixed hierarchy)	High (dynamically aggregable)
Storage	More storage (duplicate rollups)	Less storage (only raw data)
Hierarchy Changes	Hard to modify	Easy to adjust

The Core Data Feed can also output data to non-relational data models, for example key-value stores, columnar databases, graph databases, data lakes and even vector databases. Its support for Kafka, Airflow and Parquet helps it integrate with enterprise data use cases.

Types of Aggregation calculations

Aggregation calculations fall into the following two types:

Horizontal
Calculates new columns from other columns in the same row of a nodepath.
Vertical
Calculates the stats of a parent node from the stats of its child nodes.

For example, in BAM, pre-aggregation in VMX-Capture provides leaf-node aggregation only, and the results are passed to VMX-Analysis. VMX-Analysis then re-aggregates all parents of the nodepath (vertical) and calculates new columns in the same row (horizontal). One such column is percentage loss, which is derived from two other fields: the total packets sent and the count of packets lost.
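
The two calculation types can be sketched together as follows. This is an illustration only: the function name, the "sent" and "lost" field names and the "loss_pct" column are assumptions, not the BAM schema.

```python
from collections import defaultdict


def aggregate(leaf_rows):
    """leaf_rows: {full nodepath: {"sent": int, "lost": int}} (hypothetical field names)."""
    rows = defaultdict(lambda: {"sent": 0, "lost": 0})

    # Vertical: add each leaf's counters to the leaf row and to every parent row.
    for key, counters in leaf_rows.items():
        parts = key.split("/")
        for depth in range(1, len(parts) + 1):
            row = rows["/".join(parts[:depth])]
            row["sent"] += counters["sent"]
            row["lost"] += counters["lost"]

    # Horizontal: derive a new column from two other columns in the same row.
    for row in rows.values():
        row["loss_pct"] = 100 * row["lost"] / row["sent"] if row["sent"] else 0.0

    return dict(rows)


stats = aggregate({
    "vp/feed/A": {"sent": 1000, "lost": 10},
    "vp/feed/B": {"sent": 1000, "lost": 30},
})
```

In this sketch the horizontal step runs after the vertical step, so a parent's percentage loss is derived from its own summed counters and not by averaging the percentages of its children.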

Aggregation - a worked example

Let’s look at a simple example of one panel that displays the output from the aggregations in an Aggregator in a table.

In the example below, the nodepath column on the far-left includes ICE and ICE/Arca Trade. These are nodepaths that represent a branch node, i.e. summary stats for ICE, and a leaf node i.e., more granular stats for Arca Trade within ICE. Generally, an Aggregator offers top level summary data and multiple levels of breakdown by technical or business attributes.

The screenshot shows the standard Beeks Analytics aggregator structure: a spreadsheet-like view of a flattened tree, in which each cell tracks a different stat.

Example aggregator view in VMX-Explorer.

Alternative worked example using CDF and a flat data model

If you are using CDF-T to output individual stats to a Time-series database, you can choose to retain the fixed hierarchical nodepath model or you can implement your own alternative data model.

If we take the earlier example market data aggregation, which uses a nodepath of vp/extgroup/switchport/feed/side/channel, we might see an aggregation like the following:

Nodepath (primary key)	mAv Packets/s	mAv wiretime
VP/NYSE NY Primary/Port6/NYSE OpenBook Ultra/B/UZ_data	340	2.68 ms

The ‘340’ and ‘2.68 ms’ values might represent the latest values, but you could use the time-series database's functions to query previous values.

The above aggregation might have resulted from a stat update which published to the full nodepath:

{
  "leafnode": "VP/NYSE NY Primary/Port6/NYSE OpenBook Ultra/B/UZ_data",
  "packets": 340,
  "wiretime": 2.68
}

This would be the traditional Beeks Analytics way of publishing the statistics update.

However, the above aggregation might also have resulted from a dynamic query-time reaggregation of the following individual statistic update from CDF-T:

{
  "vp": "VP",
  "extgroup": "NYSE NY Primary",
  "switchport": "Port6",
  "feed": "NYSE OpenBook Ultra",
  "side": "B",
  "channel": "UZ_data",
  "packets": 340,
  "wiretime": 2.68
}

In the above case, the database table might look more like the following:

ID (primary key)	VP	ExtGroup	Feed	Side	mAv Packets/s	mAv wiretime
1	VP	NYSE NY Primary	NYSE OpenBook Ultra	B	340	2.68 ms

Note: The columns for switchport and channel have been excluded from this example.
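
The relationship between the two update shapes can be sketched as a simple conversion. The LEVELS tuple and the flatten function are hypothetical names for this example; the sketch assumes the six-level hierarchy used above and that "/" never appears inside a level's value.

```python
# Assumed level names, in root-to-leaf order, for the example nodepath.
LEVELS = ("vp", "extgroup", "switchport", "feed", "side", "channel")


def flatten(update):
    """Convert a full-nodepath stat update into a flat record with one field per level."""
    parts = update["leafnode"].split("/")
    if len(parts) != len(LEVELS):
        raise ValueError(f"expected {len(LEVELS)} levels, got {len(parts)}")
    record = dict(zip(LEVELS, parts))
    # Carry over the stat fields unchanged.
    record.update({k: v for k, v in update.items() if k != "leafnode"})
    return record


flat = flatten({
    "leafnode": "VP/NYSE NY Primary/Port6/NYSE OpenBook Ultra/B/UZ_data",
    "packets": 340,
    "wiretime": 2.68,
})
```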

The advantage of publishing the data in this second way is that you have the flexibility to group the data by, for example, “side”, giving you the mAv packets/s and mAv wiretime stats for all A and B sides, for example:

Side	mAv Packets/s	mAv wiretime
A	500	1.23 ms
B	500	1.74 ms

This flexibility is not possible where statistics are published with a fixed nodepath.
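
A query-time regrouping of flat records might look like the following sketch. The function name and the sample records are made up for illustration. How each stat should be combined across rows is also an assumption: here packet rates are summed and wiretime is averaged with a packet-rate weighting, and the right choice depends on what each stat means in your deployment.

```python
from collections import defaultdict


def group_by(records, field):
    """Regroup flat stat records by any one field at query time."""
    totals = defaultdict(lambda: {"packets": 0, "weighted_wiretime": 0.0})
    for record in records:
        group = totals[record[field]]
        group["packets"] += record["packets"]
        group["weighted_wiretime"] += record["wiretime"] * record["packets"]
    return {
        value: {
            "packets": group["packets"],
            # Assumption: wiretime is combined as a packet-weighted mean.
            "wiretime": group["weighted_wiretime"] / group["packets"] if group["packets"] else 0.0,
        }
        for value, group in totals.items()
    }


# Made-up records, for illustration only.
records = [
    {"feed": "Feed1", "side": "A", "channel": "ch1", "packets": 500, "wiretime": 1.25},
    {"feed": "Feed1", "side": "B", "channel": "ch1", "packets": 300, "wiretime": 2.0},
    {"feed": "Feed1", "side": "B", "channel": "ch2", "packets": 200, "wiretime": 1.0},
]
by_side = group_by(records, "side")
by_channel = group_by(records, "channel")
```

The same records can be regrouped by "channel", "feed" or any other field without reprocessing, which is the flexibility that the flat model trades for slower queries.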