As data volumes grow, the anomaly detection process has to scale with them. RRCF is efficient, but large, high-dimensional datasets can still create performance challenges. The following strategies focus on reducing dimensionality, smoothing bursts of input, and distributing independent work.
Compute Summary Statistics Instead of Shingling
Shingling transforms a single time series into a multivariate one by stacking lagged versions of the data: a shingle of size k turns each scalar reading into a k-dimensional point. This can help capture temporal dependencies, but it also increases the dimensionality of the points inserted into each tree, which can slow the algorithm down.
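To make the dimensionality cost concrete, here is a minimal sketch of shingling in plain Python. The shingle helper and the sample values are illustrative assumptions, not part of any particular RRCF library.

```python
def shingle(series, size):
    """Return overlapping windows of `size` consecutive values (hypothetical helper)."""
    return [tuple(series[i:i + size]) for i in range(len(series) - size + 1)]


readings = [1.0, 2.0, 3.0, 4.0, 5.0]   # made-up univariate series
points = shingle(readings, size=3)      # each point now has 3 dimensions
dimension = len(points[0])
```

Every extra lag adds one more coordinate to each point that the trees have to handle.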
An alternative is to compute summary statistics that capture the types of anomalies you are looking for. For instance, if you are detecting spikes, the data points could consist of second central differences. If you are looking for long-term trends, the data points could consist of rolling means at different window sizes. This reduces the dimension of the points inserted into each tree, improving performance.
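The two summary statistics mentioned above could be computed along these lines. This is a sketch under assumptions: the function names, the window sizes, and the sample signal are made up for illustration, and the resulting points would still need to be inserted into the trees by whichever RRCF implementation you use.

```python
def second_central_differences(series):
    """Return x[t+1] - 2*x[t] + x[t-1] for every interior point."""
    return [series[i + 1] - 2 * series[i] + series[i - 1]
            for i in range(1, len(series) - 1)]


def rolling_means(series, windows):
    """Return one point per time step: a trailing mean for each window size."""
    longest = max(windows)
    return [tuple(sum(series[end - w:end]) / w for w in windows)
            for end in range(longest, len(series) + 1)]


signal = [1.0, 1.0, 1.0, 9.0, 1.0, 1.0, 1.0]         # made-up series with one spike
spike_points = second_central_differences(signal)     # 1-dimensional points
trend_points = rolling_means(signal, windows=(2, 4))  # 2-dimensional points
```

The spike shows up as a large-magnitude second difference, while the rolling means stay at two dimensions regardless of how far back the longest window reaches.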
Buffer Input and Compute Rolling Summary Statistics
When data arrives too quickly to be inserted into the trees directly, buffer the input and compute rolling summary statistics (mean, median, max, etc.) over each buffered batch. Each batch then contributes a single summary point instead of many raw points, which reduces the number of insertions and helps the algorithm keep up with the streaming data.
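One way to sketch the buffering step is shown below. The SummaryBuffer class is hypothetical, and it assumes non-overlapping batches of fixed size; a sliding window or a time-based flush would be equally valid choices.

```python
from statistics import mean, median


class SummaryBuffer:
    """Collect raw readings and emit one summary point per full batch (hypothetical)."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._values = []

    def add(self, value):
        """Buffer one reading; return (mean, median, max) when the batch is full, else None."""
        self._values.append(value)
        if len(self._values) < self.capacity:
            return None
        batch, self._values = self._values, []
        return (mean(batch), median(batch), max(batch))


buffer = SummaryBuffer(capacity=4)
stream = [1.0, 2.0, 3.0, 10.0, 2.0, 3.0, 1.0, 2.0]   # made-up readings
summary_points = []
for reading in stream:
    point = buffer.add(reading)
    if point is not None:
        summary_points.append(point)   # only these would be inserted into the trees
```

Here eight raw readings become two points, so the trees see a quarter of the original insertion rate.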
Parallelisation
RRCF can be parallelised, which is particularly useful when dealing with multiple independent time series. A separate RRCF instance can be run for each time series, in its own process or on its own server instance. This distributes the computational load and can improve performance.
For instance, if you have 10 independent time series, you can run 10 instances of RRCF in parallel, each focusing on one time series. Because the instances share no state, additional series can be handled by adding workers rather than by slowing down the existing ones.
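The per-series fan-out can be sketched with Python's standard concurrent.futures module. Note the assumptions: score_series is a placeholder that scores by absolute deviation from the median and is not RRCF, and score_all and its parameters are hypothetical names. In practice score_series would build and query one RRCF instance.

```python
from concurrent.futures import ProcessPoolExecutor
from statistics import median


def score_series(series):
    """Placeholder detector (NOT RRCF): absolute deviation from the series median."""
    centre = median(series)
    return [abs(x - centre) for x in series]


def score_all(series_by_name, executor_cls=ProcessPoolExecutor, max_workers=None):
    """Score every independent series in its own worker and return {name: scores}."""
    names = list(series_by_name)
    with executor_cls(max_workers=max_workers) as pool:
        # map preserves input order, so results line up with names
        results = list(pool.map(score_series, (series_by_name[n] for n in names)))
    return dict(zip(names, results))
```

With the default ProcessPoolExecutor, call score_all from under an `if __name__ == "__main__":` guard so that worker processes can import the module safely. Passing a different executor class is an easy way to test the wiring without starting processes.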
Conclusion
Scaling RRCF for large datasets usually means reducing the work each tree has to do, controlling input volume, and parallelising where the data allows it. Summary statistics, input buffering, and independent RRCF instances can help manage high-dimensional data and high data velocities without changing the underlying anomaly detection goal.