What is Pre-Aggregation?

Within the Beeks Analytics architecture, pre-aggregation usually means that certain statistics derived from high-volume data are calculated in the VMX-Capture layer and then passed to VMX-Analysis as pre-aggregated statistics. Rather than processing each individual network packet or application message, VMX-Analysis processes only the statistics updates from VMX-Capture. This is what allows Beeks Analytics to achieve high levels of performance compared to competitor products.
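As a rough illustration of the idea (not the VMX-Capture implementation), the following Python sketch collapses a stream of per-packet records into one statistics row per nodepath. The record layout, the `ChannelStats` fields and the `pre_aggregate` function are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class ChannelStats:
    """Running statistics for one nodepath (hypothetical field names)."""
    packets: int = 0
    bytes: int = 0


def pre_aggregate(packets):
    """Collapse (nodepath, size_in_bytes) records into one stats row per nodepath."""
    stats = {}
    for nodepath, size in packets:
        row = stats.setdefault(nodepath, ChannelStats())
        row.packets += 1
        row.bytes += size
    return stats


# Three captured packets become two statistics updates.
packets = [
    ("vp1/groupA/sw1/feedX/A/ch1", 100),
    ("vp1/groupA/sw1/feedX/A/ch1", 120),
    ("vp1/groupA/sw1/feedX/B/ch1", 100),
]
updates = pre_aggregate(packets)
```

The downstream layer only ever sees the contents of `updates`, so its workload scales with the number of nodepaths rather than with the number of packets.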

See the Beeks Analytics Performance Guide for details of Beeks Analytics performance benchmarks, and in particular the market-leading market data update processing.

However, pre-aggregation also refers to the way that data is stored in Beeks Analytics, which affects how clients can access this data. VMX-Analysis stores all of its data using a fixed hierarchical nodepath.

Nodepath terminology

A nodepath is a hierarchy of the properties we want to aggregate on.

For example, an aggregation for market data traffic with 6 levels might have the following properties:
Level 1: Visibility Point (root node)
Level 2: External Group name (branch node)
Level 3: Switchport (branch node)
Level 4: Market Data feed (branch node)
Level 5: Market Data feed Side - A or B (branch node)
Level 6: Market Data feed Channel (leaf node)

This structure is very similar to the MD_Stats aggregator, which is implemented as part of Beeks Analytics for Markets (BAM) templated deployments. See the Beeks Analytics Data Guide for more information about the different BAM aggregators.

We can represent this aggregation as the nodepath:

  • vp/extgroup/switchport/feed/side/channel

This is a full nodepath, i.e. it represents the hierarchy from root to leaf. We call the full nodepath a Key. We can also define aggregations using partial nodepaths that terminate in a branch instead of a leaf node. For example:

  • vp/extgroup/switchport/

  • vp/extgroup/

  • vp
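The relationship between a Key and its partial nodepaths can be sketched in a few lines of Python. The `prefixes` helper is a hypothetical name for this example, and the sketch assumes that nodepath levels are separated by `/` as in the examples above.

```python
def prefixes(key):
    """Return every nodepath from the root down to the full key."""
    parts = key.split("/")
    return ["/".join(parts[:depth]) for depth in range(1, len(parts) + 1)]


key = "vp/extgroup/switchport/feed/side/channel"
paths = prefixes(key)
# paths[0] is the root node, paths[-1] is the full nodepath (the Key),
# and everything in between terminates in a branch node.
```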

In relational/tabular database terms, you can think of the nodepath as the primary key for a table, for example:

Nodepath as primary key for a table

Because each of the cells in the table will also have a history recorded for it (or may have a forecast predicted for it), you can also think of the Beeks Analytics native statistics storage model as a multi-dimensional OLAP cube with a temporal dimension, or as a hierarchical time-series database.
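One way to picture the temporal dimension is as a small time series attached to every cell. This is a minimal sketch, assuming an in-memory structure; the `history`, `record` and `latest` names are hypothetical and do not describe how VMX-Analysis stores data internally.

```python
from collections import defaultdict

# history[nodepath][column] holds (timestamp, value) pairs, oldest first.
history = defaultdict(lambda: defaultdict(list))


def record(nodepath, column, timestamp, value):
    """Append one observation to the history of a single cell."""
    history[nodepath][column].append((timestamp, value))


def latest(nodepath, column):
    """Return the most recently recorded value for a cell."""
    return history[nodepath][column][-1][1]


record("vp/extgroup", "packets", 1000, 50)
record("vp/extgroup", "packets", 1001, 75)
```

Here the nodepath and the column together identify a cell, and the list of timestamped values is that cell's history.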

Comparison of fixed nodepath hierarchy with other data models

The advantages of writing data with a fixed nodepath hierarchy are:

  • Because aggregates are calculated and stored at insert time, read performance is not slowed down by having to compute aggregations on the fly.

  • Aggregations can be kept up to date in real time.

A disadvantage is reduced flexibility if the hierarchy needs to change: reprocessing of the original data may be required if you want to aggregate the data in a different way.

Although Beeks Analytics natively uses a fixed hierarchical nodepath in its aggregator structures, you can easily adopt other techniques when writing data to an external data source using the Core Data Feed, depending on your particular needs for that data.

This comparison may help determine the correct approach for you to store your data:

Hierarchical Node Path vs Flat Data Model

The Core Data Feed can also output data to non-relational data models, for example key-value stores, columnar databases, graph databases, data lakes and even vector databases. Its support for Kafka, Airflow and Parquet makes it easy to integrate with enterprise data use cases.

Types of Aggregation calculations

Aggregation calculations fall into the following two types:

  • Horizontal
    Calculates additional, related columns from other columns in the same nodepath row.

  • Vertical
    Calculates a parent node's values from those of its children (a child-aware parent calculation).

For example, in BAM, pre-aggregation in VMX-Capture provides leaf-node aggregation only, and the result is passed to VMX-Analysis. VMX-Analysis then re-aggregates all parents in the nodepath (vertical) and calculates new columns in the same row (horizontal). One such column is percentage loss, which is derived from two other fields: total packets sent and a count of how many packets were lost.
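The two calculation types can be illustrated with a short Python sketch. This is not the VMX-Analysis implementation: the `vertical` and `horizontal` functions, the `sent`, `lost` and `loss_pct` column names and the sample values are all assumptions made for the example, and it assumes that parent values are simple sums of their children.

```python
from collections import defaultdict


def vertical(leaf_rows):
    """Sum each leaf row into its own nodepath and every ancestor nodepath."""
    rows = defaultdict(lambda: {"sent": 0, "lost": 0})
    for key, counts in leaf_rows.items():
        parts = key.split("/")
        for depth in range(1, len(parts) + 1):
            node = rows["/".join(parts[:depth])]
            node["sent"] += counts["sent"]
            node["lost"] += counts["lost"]
    return dict(rows)


def horizontal(row):
    """Derive a percentage loss column from two columns in the same row."""
    sent = row["sent"]
    loss_pct = 100.0 * row["lost"] / sent if sent else 0.0
    return {**row, "loss_pct": loss_pct}


# Leaf-node statistics, as they might arrive from the capture layer.
leaf_rows = {
    "vp/g/sw/feed/A/ch1": {"sent": 1000, "lost": 10},
    "vp/g/sw/feed/B/ch1": {"sent": 1000, "lost": 30},
}

# Vertical first, then horizontal on every resulting row.
table = {path: horizontal(row) for path, row in vertical(leaf_rows).items()}
```

Running the horizontal step after the vertical step means the percentage at each parent is computed from the summed counts, rather than by averaging the children's percentages, which would give the wrong answer when children carry different packet volumes.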