Once we've applied the RRCF algorithm to our streaming data, the resulting scores measure how anomalous each data point is. To classify data points as "normal" or "anomalous", we still need to set a threshold. This defines the level of deviation considered anomalous and controls how often anomalies are over-identified or missed.
Why is Thresholding Needed?
Thresholding matters in anomaly detection because it separates normal and anomalous behaviour. Without a threshold, the scores still indicate relative degrees of anomalousness, but they do not provide a clear dividing line between normal points and anomalies.
Set the threshold too low and normal data points may be misclassified as anomalies, increasing false positives. Set it too high and actual anomalies may be missed, increasing false negatives.
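To make the trade-off concrete, here is a minimal sketch that counts false positives and false negatives at a few candidate thresholds. The scores, the ground-truth labels, and the helper name `confusion_counts` are all invented for illustration; in practice labelled anomalies are often unavailable and this kind of check can only be done on a small hand-labelled sample.

```python
def confusion_counts(scores, labels, threshold):
    """Count (false positives, false negatives) for a given threshold.

    A point is flagged as anomalous when its score is strictly above the threshold.
    `labels` holds True for points that really are anomalies.
    """
    false_pos = sum(1 for s, y in zip(scores, labels) if s > threshold and not y)
    false_neg = sum(1 for s, y in zip(scores, labels) if s <= threshold and y)
    return false_pos, false_neg


# Hypothetical anomaly scores and hand-assigned labels (assumed for this example).
scores = [1.0, 1.2, 0.9, 2.5, 1.1, 6.0, 0.8, 9.5]
labels = [False, False, False, False, False, True, False, True]

for threshold in (1.0, 4.0, 8.0):
    fp, fn = confusion_counts(scores, labels, threshold)
    print(f"threshold={threshold}: false positives={fp}, false negatives={fn}")
```

On this toy data the lowest threshold flags several normal points, the highest one misses a real anomaly, and the middle value avoids both kinds of error.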
How to Set the Threshold?
There are several ways to set a threshold for RRCF scores, including the Median Absolute Deviation (MAD), Min/Max, and others. The right method depends on the characteristics of the data and the specific use case.
Median Absolute Deviation (MAD)
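The Median Absolute Deviation is a robust measure of variability in a data set: it is the median of the absolute differences between each value and the median of all values. For RRCF scores, a typical approach is to set the threshold at the median plus some multiple k of the MAD, where k is a tuning parameter. Because both the median and the MAD are barely affected by a few extreme values, this approach is robust to outliers and works well when the scores follow a heavy-tailed distribution.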
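As a rough illustration of the MAD rule, the sketch below computes median + k × MAD using only Python's standard library. The function name `mad_threshold`, the multiplier k = 3, and the sample scores are assumptions made for this example, not values from any particular RRCF run.

```python
from statistics import median


def mad_threshold(scores, k=3.0):
    """Return median + k * MAD for a collection of anomaly scores."""
    med = median(scores)
    # MAD: the median of the absolute deviations from the median.
    mad = median(abs(s - med) for s in scores)
    return med + k * mad


# Hypothetical RRCF scores; the last value is an obvious spike.
scores = [1.0, 1.2, 0.9, 1.1, 1.0, 1.3, 0.8, 9.5]

threshold = mad_threshold(scores)
anomalies = [s for s in scores if s > threshold]
print(f"threshold={threshold:.2f}, anomalies={anomalies}")
```

Some practitioners multiply the MAD by 1.4826 so that it is comparable to a standard deviation under a normal distribution; that scaling is left out here to keep the sketch close to the description above.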
Min/Max
Another approach is to use the minimum and maximum RRCF scores to set the threshold, for example by placing it a fixed fraction of the way from the minimum to the maximum: threshold = min + p × (max - min). The method is straightforward, but it is sensitive to extreme score values, because a single very large score raises the maximum and pulls the threshold up with it.
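The following sketch shows one way the range-based rule might look in code, and how one extreme score moves the threshold. The function name `minmax_threshold`, the fraction 0.8, and the sample scores are assumptions chosen for illustration.

```python
def minmax_threshold(scores, fraction=0.8):
    """Place the threshold a given fraction of the way from min to max."""
    lo, hi = min(scores), max(scores)
    return lo + fraction * (hi - lo)


# Hypothetical RRCF scores with one clear spike.
scores = [1.0, 1.2, 0.9, 1.1, 1.0, 1.3, 0.8, 9.5]
threshold = minmax_threshold(scores)
flagged = [s for s in scores if s > threshold]

# Adding a single, much larger score shifts the threshold upwards,
# so the earlier spike is no longer flagged.
scores_with_extreme = scores + [100.0]
shifted_threshold = minmax_threshold(scores_with_extreme)
flagged_after = [s for s in scores_with_extreme if s > shifted_threshold]

print(f"threshold={threshold:.2f}, flagged={flagged}")
print(f"shifted threshold={shifted_threshold:.2f}, flagged={flagged_after}")
```

The second half of the example is the sensitivity in action: the spike that was flagged at first drops below the threshold once a larger score enters the data.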
Other Methods
Several other methods can be used to set the threshold, depending on the data. These include statistical techniques such as placing the threshold a fixed number of standard deviations above the mean (a z-score threshold), deriving it from the quartiles of the scores, or using machine learning techniques to adjust the threshold dynamically as new data is observed.
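The two statistical options mentioned above can be sketched with the standard library as follows. The function names, the multipliers (3 standard deviations, 1.5 times the interquartile range), and the sample scores are assumptions for this example; the quartile rule shown is the common "upper fence" convention, which is only one of several ways to use quartiles.

```python
from statistics import mean, pstdev, quantiles


def zscore_threshold(scores, k=3.0):
    """Threshold at k population standard deviations above the mean."""
    return mean(scores) + k * pstdev(scores)


def iqr_threshold(scores, k=1.5):
    """Threshold at the third quartile plus k times the interquartile range."""
    q1, _, q3 = quantiles(scores, n=4)
    return q3 + k * (q3 - q1)


# Hypothetical RRCF scores with one clear spike.
scores = [1.0, 1.2, 0.9, 1.1, 1.0, 1.3, 0.8, 9.5]

z_flagged = [s for s in scores if s > zscore_threshold(scores)]
iqr_flagged = [s for s in scores if s > iqr_threshold(scores)]

print(f"z-score rule flags: {z_flagged}")
print(f"quartile rule flags: {iqr_flagged}")
```

On this small sample the spike itself inflates the mean and standard deviation enough that the 3-sigma rule does not flag it, while the quartile rule does. This is one reason robust statistics are often preferred when the scores contain extreme values.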
Conclusion
Thresholding gives anomaly detection a clear boundary between normal and anomalous scores, which helps identify potential issues such as cyber threats or system errors. The choice of thresholding method depends on the use case and the characteristics of the data. Whatever method is used, the threshold needs to balance the risk of false positives against the risk of false negatives.