Aggregation
Relay federates queries to multiple downstream data sources but represents a single upstream collection, so it must aggregate downstream results into a single upstream result.
The general process for this is as follows:
- Relay sends a query to all downstream subnodes.
- Relay waits for results from every subnode.
- Once all results have been received, or a timeout has been reached:
  - All received data are aggregated appropriately for the type of result (see below).
  - The aggregated results (at a record or final level as appropriate) are obfuscated per configuration.
  - Aggregated (and obfuscated) results are transformed to the query source result format and returned.
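The "wait for every subnode, or until a timeout" step above can be sketched roughly as follows. This is only an illustration, not Relay's actual implementation: the function name `collect_results`, the use of `asyncio`, and the choice to skip subnodes that raise an error are all assumptions.

```python
import asyncio
from typing import Awaitable, Iterable, List, TypeVar

T = TypeVar("T")


async def collect_results(
    queries: Iterable[Awaitable[T]], timeout_seconds: float
) -> List[T]:
    """Wait for every subnode query, or until the timeout, whichever is first.

    Returns only the results that arrived in time, in the original order.
    """
    tasks = [asyncio.ensure_future(query) for query in queries]
    if not tasks:
        return []  # asyncio.wait() rejects an empty collection

    done, pending = await asyncio.wait(tasks, timeout=timeout_seconds)

    # Assumption: subnodes that have not answered in time are abandoned.
    for task in pending:
        task.cancel()

    # Assumption: a subnode that failed with an error is treated like a
    # subnode that did not answer at all.
    return [t.result() for t in tasks if t in done and t.exception() is None]
```

The returned list would then be passed to the aggregation, obfuscation and transformation steps for the relevant Task type.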
The actual aggregation behaviour differs for different Task types, described in detail below.
The behaviours described are covered by automated software tests to ensure Relay is behaving as expected.
Task expiry
Relay “expires” running Tasks if not all subnodes have returned results after a certain amount of time, to provide a timely response to the query source, and prevent blocking by offline subnodes.
The timeout differs for different Task types:
Type | Timeout | Notes |
---|---|---|
Availability | 4 minutes | |
Distribution | 2 hours | |
When a Task expires, Relay aggregates whatever subnode results it has received and responds to the query source with the aggregated results.
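As an illustration of the expiry rule, the sketch below encodes the two timeouts from the table and checks whether a running Task has expired. The names `TASK_TIMEOUTS` and `is_expired`, the lowercase task type keys, and the inclusive comparison are assumptions made for the example, not Relay's actual configuration or API.

```python
from datetime import datetime, timedelta, timezone

# Timeouts taken from the table above; the key names are hypothetical.
TASK_TIMEOUTS = {
    "availability": timedelta(minutes=4),
    "distribution": timedelta(hours=2),
}


def is_expired(task_type: str, started_at: datetime, now: datetime) -> bool:
    """Return True once the Task has been running for its full timeout.

    Assumption: the boundary is inclusive (exactly 4 minutes counts as expired).
    """
    return now - started_at >= TASK_TIMEOUTS[task_type]


started = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(is_expired("availability", started, started + timedelta(minutes=5)))  # True
print(is_expired("distribution", started, started + timedelta(minutes=5)))  # False
```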
Obfuscation / Disclosure Control
Relay performs obfuscation of its aggregated results based on its configuration.
Exactly where this obfuscation is applied depends on the aggregation process, and is covered in the breakdown below.
Availability Results
Since availability results return only a count from each subnode, the aggregation behaviour is quite simple:
- Each subnode’s count is added to a running total.
- Once the final total is reached, it is obfuscated per configuration.
Missing data
Missing subnode results (i.e. a subnode did not return results within the timeout period) are omitted from the total, which is equivalent to that subnode returning a count of 0.
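A minimal sketch of this behaviour is shown below. The function name `aggregate_availability`, the use of `None` to mark a subnode that did not respond, and the `obfuscate` callable (standing in for whatever obfuscation Relay is configured with) are assumptions for illustration only.

```python
from typing import Callable, Mapping, Optional


def aggregate_availability(
    counts_by_subnode: Mapping[str, Optional[int]],
    obfuscate: Callable[[int], int],
) -> int:
    """Sum the per-subnode counts, then obfuscate the final total once."""
    total = 0
    for count in counts_by_subnode.values():
        if count is None:
            # Missing result: omitted, i.e. it contributes 0 to the total.
            continue
        total += count
    return obfuscate(total)


# Hypothetical subnode names; "site-b" did not answer before the timeout.
counts = {"site-a": 12, "site-b": None, "site-c": 30}
print(aggregate_availability(counts, obfuscate=lambda n: n))  # 42
```

Note that obfuscation is applied to the final total only, not to each subnode’s count.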
Generic Code Distribution
For generic code distribution, each subnode returns a list of rows, each with a code and a count of matches, optionally along with some summary statistics.
From the downstream results alone, Relay can aggregate the counts and some of the trivial statistics (e.g. min, max), but cannot aggregate other statistics (e.g. mean, quartiles).
Aggregation is effectively performed per row, per subnode, building an aggregated row for each code:
- Each subnode’s results are iterated row by row.
- Each row’s count is added to a running total for that row’s code.
- When all subnodes’ results have been aggregated by code:
  - Each coded row count is obfuscated per configuration.
  - Each coded row’s summary statistics are calculated where possible.
Missing data
It’s not necessary for every subnode to have rows for all codes; it will depend on the subnode’s dataset.
Relay will aggregate by each code present.
Missing subnode results simply don’t contribute to any code’s totals.
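The per-code aggregation described in this section might look something like the sketch below. The row shape (dictionaries with `code`, `count`, and optional `min` / `max` keys), the function name, the example codes, and the `obfuscate` callable are all assumptions; only counts and the trivially mergeable statistics are combined, as described above.

```python
from typing import Any, Callable, Dict, Iterable, List

Row = Dict[str, Any]


def aggregate_code_distribution(
    subnode_results: Iterable[List[Row]],
    obfuscate: Callable[[int], int],
) -> Dict[str, Row]:
    """Build one aggregated row per code from all received subnode results."""
    aggregated: Dict[str, Row] = {}
    for rows in subnode_results:  # missing subnodes are simply absent here
        for row in rows:
            agg = aggregated.setdefault(
                row["code"],
                {"code": row["code"], "count": 0, "min": None, "max": None},
            )
            agg["count"] += row["count"]
            # min and max can be merged directly; mean, quartiles etc. cannot.
            for key, pick in (("min", min), ("max", max)):
                value = row.get(key)
                if value is not None:
                    agg[key] = value if agg[key] is None else pick(agg[key], value)

    # Obfuscation is applied once per aggregated row, after all subnodes.
    for agg in aggregated.values():
        agg["count"] = obfuscate(agg["count"])
    return aggregated


# Hypothetical codes and values for two subnodes.
site_a = [{"code": "C1", "count": 3, "min": 2.0, "max": 9.0}, {"code": "C2", "count": 5}]
site_b = [{"code": "C1", "count": 4, "min": 1.0, "max": 7.0}]
result = aggregate_code_distribution([site_a, site_b], obfuscate=lambda n: n)
print(result["C1"])  # {'code': 'C1', 'count': 7, 'min': 1.0, 'max': 9.0}
```

Because rows are keyed by code as they arrive, a subnode that has no row for a given code needs no special handling.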
Demographics Distribution
For demographics distribution results, each subnode returns a list of rows by code that can present results in several ways:
- Most commonly: a lookup of valid values for that code to a count of matches.
  - e.g. for `GENDER`, valid values of `MALE`, `FEMALE` and `OTHER` might each have counts of matches.
- A simple count of matches for the code, similar to Generic Code Distribution.
- Alternative code-specific value representations, such as for `AGE`.
For the most common form, with a breakdown of counts per valid value for the code, Relay aggregates by each code and value:
- Each subnode’s results are iterated row by row.
- Each row’s valid value counts are added to running totals per value.
- When all subnodes’ results have been aggregated by code and valid value:
  - Each coded value count is obfuscated per configuration.
  - Each coded row’s count and summary statistics are calculated from the obfuscated value counts where possible.
For rows similar to Generic Code Distribution, that aggregation process is followed.
Currently Relay does not aggregate special case codes such as `AGE`, though it is architected such that specialist aggregators could be added later.
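For the most common demographics form (counts per valid value), the aggregation could be sketched as below. The row shape (`code` plus a `values` mapping), the function name, and the `obfuscate` callable are assumptions, as is deriving the row count by summing the obfuscated value counts; the example obfuscator that suppresses counts below 5 is purely hypothetical.

```python
from typing import Any, Callable, Dict, Iterable, List

Row = Dict[str, Any]


def aggregate_demographics(
    subnode_results: Iterable[List[Row]],
    obfuscate: Callable[[int], int],
) -> Dict[str, Row]:
    """Aggregate value counts per code, obfuscate them, then derive row counts."""
    totals: Dict[str, Dict[str, int]] = {}
    for rows in subnode_results:
        for row in rows:
            per_value = totals.setdefault(row["code"], {})
            for value, count in row["values"].items():
                per_value[value] = per_value.get(value, 0) + count

    aggregated: Dict[str, Row] = {}
    for code, per_value in totals.items():
        # Each value count is obfuscated first...
        obfuscated = {value: obfuscate(count) for value, count in per_value.items()}
        # ...and the row count is then derived from the obfuscated counts
        # (assumption: a plain sum).
        aggregated[code] = {
            "code": code,
            "values": obfuscated,
            "count": sum(obfuscated.values()),
        }
    return aggregated


site_a = [{"code": "GENDER", "values": {"MALE": 3, "FEMALE": 12}}]
site_b = [{"code": "GENDER", "values": {"MALE": 4, "FEMALE": 20, "OTHER": 1}}]
result = aggregate_demographics(
    [site_a, site_b], obfuscate=lambda n: 0 if n < 5 else n
)
print(result["GENDER"])
# {'code': 'GENDER', 'values': {'MALE': 7, 'FEMALE': 32, 'OTHER': 0}, 'count': 39}
```

Deriving the row count from the already obfuscated value counts keeps the row total consistent with the values shown alongside it.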