Aggregation Internals
Aggregations allow data to be effectively summarized. The aggregation system in Penguin has been carefully designed to be type safe and extensible, while supporting efficient operations including parallelization across multiple cores.
We start with the core operation. Recall that aggregation operations execute across the rows of a single
column and compute a result. This behavior is represented by the protocol AggregationOperation.
At a high level, the update(with:) method is called once for each entry in the column being
aggregated. Once the operation has seen all the data in a given column, finish() is called, which
computes the result.
AggregationOperation is designed to be fully general, and thus both the Input and Output types
are generic, represented with associatedtypes. This means you can write an
AggregationOperation that takes Strings and returns Doubles (e.g. computing the average
length of the strings in a given column).
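As an illustration of the String-to-Double case, here is a minimal sketch. The protocol SketchAggregationOperation is a hypothetical, simplified stand-in written for this example; only the names Input, Output, update(with:), and finish() come from the text above, and the real AggregationOperation declaration in Penguin may differ in requirements and signatures. MeanStringLength is likewise an invented name.

```swift
// Hypothetical stand-in for AggregationOperation (assumption: the real
// protocol may have different requirements and signatures).
protocol SketchAggregationOperation {
    associatedtype Input
    associatedtype Output
    init()
    /// Called once per entry in the column being aggregated.
    mutating func update(with value: Input)
    /// Called after all entries have been seen; computes the result.
    func finish() -> Output
}

// Illustrative operation: takes Strings, returns a Double.
// Input and Output are inferred as String and Double from the methods below.
struct MeanStringLength: SketchAggregationOperation {
    private var totalLength = 0
    private var count = 0

    init() {}

    mutating func update(with value: String) {
        totalLength += value.count
        count += 1
    }

    func finish() -> Double {
        // An empty column has no meaningful mean; NaN is one possible choice.
        if count == 0 { return Double.nan }
        return Double(totalLength) / Double(count)
    }
}

// Usage: feed each entry of a (pretend) String column, then finish.
var meanOp = MeanStringLength()
for entry in ["a", "bcd", "ef"] {
    meanOp.update(with: entry)
}
let meanLength = meanOp.finish()  // (1 + 3 + 2) / 3 = 2.0
```

The operation only keeps running totals, so the full column never has to be held by the operation itself.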
In order to support parallelism, the column may be divided into pieces, and a separate instance of
the AggregationOperation may be initialized for each piece. In order to compute a single overall
answer, the merge(with:) method will be called until only a single instance remains, whose finish()
method will be called to compute the final result.
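The divide-and-merge flow described above can be sketched as follows. This is an assumption-laden illustration, not Penguin's implementation: SketchMergeableOperation, SumOperation, and aggregate(_:chunks:) are hypothetical names invented here, and the chunks are processed sequentially for simplicity (the per-chunk loop is the part that could run on separate cores).

```swift
// Hypothetical stand-in protocol (assumption: the real AggregationOperation
// may differ); only update(with:), merge(with:), and finish() come from the text.
protocol SketchMergeableOperation {
    associatedtype Input
    associatedtype Output
    init()
    mutating func update(with value: Input)
    /// Folds the state of another instance into this one.
    mutating func merge(with other: Self)
    func finish() -> Output
}

// Illustrative operation: sums Ints.
struct SumOperation: SketchMergeableOperation {
    private var total = 0

    init() {}

    mutating func update(with value: Int) { total += value }
    mutating func merge(with other: SumOperation) { total += other.total }
    func finish() -> Int { return total }
}

// Hypothetical driver: one operation instance per chunk, then merge
// until a single instance remains, then finish.
func aggregate<Op: SketchMergeableOperation>(
    _ type: Op.Type, chunks: [[Op.Input]]
) -> Op.Output {
    // Each chunk gets its own fresh instance; these are independent of
    // each other, which is what makes parallel execution possible.
    var partials = chunks.map { chunk -> Op in
        var op = Op()
        for value in chunk { op.update(with: value) }
        return op
    }
    // Merge the partial states down to a single instance.
    var result = partials.isEmpty ? Op() : partials.removeFirst()
    for partial in partials { result.merge(with: partial) }
    return result.finish()
}

let total = aggregate(SumOperation.self, chunks: [[1, 2, 3], [4, 5], [6]])
// total is 21, the same answer as summing the undivided column
```

For this to give the same answer regardless of how the column is split, the merged state has to be equivalent to having called update(with:) on every entry in one instance, which is why the operation stores a running total rather than a final result.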
Aggregation and its subclasses (e.g. ArbitraryTypedAggregation, NumericAggregation,
StringAggregation, …) and AggregationEngines work together with PTable to efficiently
execute the aggregation operation.
To learn how to write your own aggregation, check out the Custom Aggregations tutorial.
-
AggregationOperations represent the per-group operations within a “split-apply-combine” analysis. Types conforming to the AggregationOperation protocol can be used as aggregation functions within a “groupBy” operation.
Because AggregationOperations can be parallelized across multiple cores or hosts, they must support a “merge” operation, which can be used to aggregate the state within the operations themselves.
Declaration
Swift
public protocol AggregationOperation
-
Undocumented
Declaration
Swift
public class AggregationEngine
-
Undocumented
Declaration
Swift
public class ArbitraryTypedAggregation : Aggregation
-
Undocumented
Declaration
Swift
public class DoubleConvertibleAggregation : Aggregation
-
Undocumented
Declaration
Swift
public class NumericAggregation : Aggregation
-
Undocumented
Declaration
Swift
public class StringAggregation<Op> : Aggregation where Op : AggregationOperation, Op.Input == String