Peakhour.IO - Anomaly Detection

Dive into CVSS Scores

2023-11-10T00:00:00+11:00

Understanding CVSS through Atlassian Confluence Vulnerabilities

The Common Vulnerability Scoring System (CVSS) gives security teams a shared way to rate the severity of software vulnerabilities. It does not predict risk on its own; it describes the characteristics of a specific security flaw. CVSS uses three metric groups: Base, Temporal, and Environmental. The result is a score from 0 to 10, accompanied by a vector string that records the metric values behind the score.

Base Metrics describe the inherent aspects of a vulnerability, including how it can be exploited and its potential system impact.
Temporal Metrics change over time, reflecting current exploitability and available mitigations.
Environmental Metrics account for the specific environment where the vulnerability exists, tailoring the score to the affected organisation.

The National Vulnerability Database (NVD) utilises CVSS to assign base scores and provides tools for calculating Temporal and Environmental scores.

Atlassian Confluence Vulnerability Analysis

Two Atlassian Confluence vulnerabilities show why the vector matters as much as the headline score:

CVE-2023-22515 is a critical flaw with a base score of 10.0. It is exploitable remotely, with low complexity, no privilege requirements, and no need for user interaction. The attack vector is network-based, so exposure is not limited to local access. Its broad scope and impact across confidentiality, integrity, and availability make it a vulnerability that needs immediate attention.

CVE-2023-22518 shares many similarities with CVE-2023-22515, including a critical base score of 10.0. It can also be exploited remotely without privileges or user interaction, and with low complexity. Its impact on the system's confidentiality, integrity, and availability is high, allowing attackers to gain complete control and shut down the affected resources.

Both CVE-2023-22515 and CVE-2023-22518 are critical vulnerabilities that demand urgent remediation. Understanding their CVSS vectors helps prioritise the security response and the mitigations needed.

CVE-2023-22515 carries a CVSS score of 10 because it is remotely exploitable, easy to execute, and does not require privileges or user interaction.

CVSS Vector for CVE-2023-22515

Base Score: 10.0 (Critical)
Vector: CVSS:3.0/AV:N/AC:L/PR:N/UI:N/S:C/C:H/I:H/A:H

This vector indicates:

Attack Vector (AV): Network (N) - The vulnerability is remotely exploitable.
Attack Complexity (AC): Low (L) - It is easy to exploit without major obstacles.
Privileges Required (PR): None (N) - No special access is needed.
User Interaction (UI): None (N) - It can be exploited without user involvement.
Scope (S): Changed (C) - The impact extends beyond the initial target.
Confidentiality, Integrity, Availability (C/I/A): High (H) - There is a complete loss of confidentiality, integrity, and availability.

Atlassian's high CVSS score for CVE-2023-22515 reflects its critical nature and the need for immediate action.

CVE-2023-22518 has the same CVSS score of 10, with similar impact across confidentiality, integrity, and availability.

CVSS Vector for CVE-2023-22518

Base Score: 10.0 (Critical)
Vector: CVSS:3.0/AV:N/AC:L/PR:N/UI:N/S:C/C:H/I:H/A:H

This vector means:

Attack Vector (AV): Network (N) - Exploitable remotely.
Attack Complexity (AC): Low (L) - Easy to exploit with minimal barriers.
Privileges Required (PR): None (N) - No user privileges required.
User Interaction (UI): None (N) - No need for user action.
Scope (S): Changed (C) - Broad impact beyond the initial system.
Confidentiality, Integrity, Availability (C/I/A): High (H) - Complete compromise of the system's security.
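To make the link between a vector and its score concrete, the following is a minimal Python sketch of the CVSS v3.0 base score calculation. The weights and equations are taken from FIRST's published v3.0 specification; the function names (`base_score`, `roundup`) are illustrative, and the rounding helper borrows the float-safe form introduced in v3.1. Treat it as a teaching aid under those assumptions, not a replacement for an official calculator.

```python
# Metric weights from the CVSS v3.0 specification (base metrics only).
AV = {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2}
AC = {"L": 0.77, "H": 0.44}
UI = {"N": 0.85, "R": 0.62}
CIA = {"H": 0.56, "L": 0.22, "N": 0.0}
# Privileges Required is weighted differently when Scope is Changed.
PR = {
    "U": {"N": 0.85, "L": 0.62, "H": 0.27},
    "C": {"N": 0.85, "L": 0.68, "H": 0.5},
}


def roundup(value):
    """Round up to one decimal place without floating-point drift."""
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000
    return (scaled // 10000 + 1) / 10


def base_score(vector):
    """Compute the base score for a CVSS v3.0 base vector string."""
    # Skip the "CVSS:3.0" prefix, then split the "AV:N" style pairs.
    m = dict(part.split(":") for part in vector.split("/")[1:])
    scope = m["S"]
    iss = 1 - (1 - CIA[m["C"]]) * (1 - CIA[m["I"]]) * (1 - CIA[m["A"]])
    if scope == "U":
        impact = 6.42 * iss
    else:
        impact = 7.52 * (iss - 0.029) - 3.25 * (iss - 0.02) ** 15
    exploitability = 8.22 * AV[m["AV"]] * AC[m["AC"]] * PR[scope][m["PR"]] * UI[m["UI"]]
    if impact <= 0:
        return 0.0
    total = impact + exploitability
    if scope == "C":
        total *= 1.08
    return roundup(min(total, 10))


confluence = "CVSS:3.0/AV:N/AC:L/PR:N/UI:N/S:C/C:H/I:H/A:H"
print(base_score(confluence))  # 10.0
```

Changing only `S:C` to `S:U` in the same vector gives 9.8, which shows how much a single metric in the vector can move the headline score.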

Understanding the CVSS scores for these vulnerabilities helps teams prioritise their security response. For a full breakdown and history of CVSS, see Wikipedia. More detailed information on CVSS can also be found in FIRST's official CVSS documentation.

A Risk Based Approach To Vulnerability Scoring

2023-11-10T00:00:00+11:00

The Exploit Prediction Scoring System (EPSS) estimates the likelihood that a published CVE will be exploited in the wild. Its value is that it brings several signals into one risk score, instead of treating every vulnerability with the same CVSS severity as equally urgent. The main inputs are:

Data Sources of EPSS

MITRE’s CVE List: EPSS scores only vulnerabilities that are "published" on this list.
Text-based “Tags”: Extracted from CVE descriptions and related discussions.
Publication Duration: The time period since the CVE was published.
Reference Count: The number of references in the CVE entry.
Published Exploit Code: Code from platforms such as Metasploit, ExploitDB, or GitHub.
Security Scanners: Data from security tools such as Jaeles and Nuclei.
CVSS v3 Vectors: The base metric vectors published in the National Vulnerability Database (NVD).
CPE (vendor) Information: Details about the vendors of the products involved, also from NVD.
Ground Truth Data: Real-world exploitation data from sources such as AlienVault.

EPSS Model and Tools

Version 2022.01.01 of the EPSS model uses 1,164 variables and is based on Gradient Boosting, a machine learning technique. For a visual and interactive view of EPSS scores, the EPSScall tool is useful. It provides historical data and graphs that make score movement easier to inspect.

The Drivers of EPSS Scores

To understand EPSS, it helps to look at which inputs carry the most weight. The variable importance graph shows the strongest contributors to the EPSS score.

Vendor data plays an outsized role in the scoring process. The graph shows how much weight each component has when estimating whether a vulnerability is likely to be exploited.

Why Does This Matter?

EPSS uses these data sources to predict exploit likelihood more directly than severity-only methods. By considering factors from the age of the CVE to real-world exploit instances, EPSS gives defenders a clearer view of which vulnerabilities are more likely to matter operationally. That makes patching and mitigation decisions easier to prioritise when resources are limited.
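As a rough illustration of what likelihood-first triage can look like, the sketch below orders findings by EPSS probability and uses CVSS severity only as a tie-breaker. The `Finding` class, the `prioritise` function, and every identifier and score in the example are hypothetical placeholders, not real CVE data.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    cve: str      # placeholder identifier, not a real CVE
    cvss: float   # severity, 0 to 10
    epss: float   # estimated exploitation probability, 0 to 1


def prioritise(findings):
    """Most likely to be exploited first; CVSS breaks ties."""
    return sorted(findings, key=lambda f: (f.epss, f.cvss), reverse=True)


backlog = [
    Finding("EXAMPLE-A", cvss=9.8, epss=0.02),
    Finding("EXAMPLE-B", cvss=7.5, epss=0.91),
    Finding("EXAMPLE-C", cvss=9.1, epss=0.91),
]
for finding in prioritise(backlog):
    print(finding.cve, finding.epss, finding.cvss)
```

In this made-up backlog the finding with the highest CVSS score ends up last, because its estimated exploitation probability is far lower than the others.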

Understanding the components of EPSS also makes the score easier to interpret. It is not a single severity metric; it is a blend of several data points, each with different weight. Tools like EPSScall make those inputs easier to inspect when tuning a vulnerability management process.

Final Thoughts

EPSS is useful because it shifts vulnerability triage away from severity alone and towards exploit likelihood. Its use of multiple data sources and machine learning makes it a practical tool for defenders who need to decide what to fix first. Prioritising vulnerabilities this way does not replace judgement, but it gives teams a stronger starting point than CVSS alone.

When Bots Break Bad

2023-05-16T13:00:00+10:00

Bots account for a large share of web traffic. Recent studies put automated traffic at nearly 50% of all internet requests. Some bots are useful, such as search engine crawlers that index your site. Some are clearly harmful, such as scrapers and sneaker bots. Others sit in a grey area, including backlink and marketing bots from services such as Ahrefs and SEMrush. Even useful bots can create problems when they crawl too hard. This article looks at the main bot types and how to manage them with robots.txt and bot management tools.

Understanding the Different Types of Bots

'Good Bots'

Good bots perform legitimate work. Search engine crawlers like Googlebot and Bingbot index webpages so search results can stay current and relevant. Other examples include uptime and performance monitoring bots.

'Bad Bots'

Bad bots harm websites, users, or both. Common examples include:

Content scrapers: copy and repurpose data from websites.
Sneaker bots: automatically purchase limited-edition products (like sneakers) before human users can.
Spam bots: post unsolicited messages and advertisements in comment sections or forums.
Vulnerability scanners: try thousands of website URLs to find security vulnerabilities.
Account takeover bots: attempt to gain access to existing user/admin accounts using either credential stuffing or brute-force attacks.

'Grey Bots'

Grey bots sit between good and bad. They often serve a useful purpose and may follow crawling directives in robots.txt, but they can still cause problems when they crawl too aggressively. Common examples include:

AhrefsBot: A backlink analysis bot used by Ahrefs, an SEO tool.
SEMrushBot: A bot used by SEMrush, another popular SEO and digital marketing tool.
MJ12bot: A bot used by Majestic, a service that provides backlink data and analysis.
ScreamingFrog: An SEO analyser run from a local desktop.

When Grey Bots (and Even Good Bots) Go Bad

Left unattended, grey bots can create practical problems:

Slow page loading times, which affect user experience.
Strain on server resources, potentially causing crashes, downtime, and higher costs.
Distorted website analytics, when bot traffic is mistaken for human traffic.

Managing Grey Bots with Robots.txt

The robots.txt file is a simple text file that tells web crawlers which parts of your site they can or cannot access. You can use it to manage bot behaviour and protect your website from aggressive crawling. Useful controls include:

Disallowing specific bots: You can block specific bots from accessing your site by adding a "User-agent" and "Disallow" directive to your robots.txt file. For example:

User-agent: AhrefsBot
Disallow: /

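Limiting crawl rate: You can ask bots to slow down their crawling by adding a "Crawl-delay" directive. Crawl-delay is not part of the robots.txt standard, so support varies by crawler; Googlebot, for example, ignores it: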

User-agent: SEMrushBot
Crawl-delay: 10
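Before deploying rules like these, it can help to check how a standards-following parser reads them. The sketch below feeds the two rule sets above to Python's standard-library `urllib.robotparser`; the `/products` path is a made-up example.

```python
from urllib.robotparser import RobotFileParser

# The two rule sets shown above, combined into one robots.txt body.
RULES = """\
User-agent: AhrefsBot
Disallow: /

User-agent: SEMrushBot
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

print(parser.can_fetch("AhrefsBot", "/products"))   # False
print(parser.can_fetch("SEMrushBot", "/products"))  # True
print(parser.crawl_delay("SEMrushBot"))             # 10
print(parser.can_fetch("Googlebot", "/products"))   # True
```

This only shows what a compliant crawler should do with the file; it says nothing about bots that choose to ignore it.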

Not all bots will follow robots.txt. ScreamingFrog, for example, can be instructed to ignore robots.txt and crawl a site as quickly as possible. You would not want a competitor doing this to your site.

Bot Management Tools

In addition to robots.txt, bot management tools (like those provided by Peakhour) can protect your website from abusive bots. Good bot management tools automatically block most unwanted traffic using a combination of Threat Intelligence, Fingerprinting techniques, Reverse DNS verification, and Header Inspection.

Advanced techniques like rate limiting and machine learning can help identify more sophisticated bad bots.
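Of the techniques listed above, reverse DNS verification is the easiest to sketch. A client claiming to be a search crawler is checked in two steps: its IP must reverse-resolve to a hostname under the crawler's published domains, and that hostname must resolve forward to the same IP. The function name and the injectable lookup parameters below are illustrative assumptions (they are there so the logic can be exercised without network access); this is not a description of how any particular product implements the check.

```python
import socket

# Domain suffixes that Google and Bing document for their crawlers.
GOOGLE_SUFFIXES = ("googlebot.com", "google.com")
BING_SUFFIXES = ("search.msn.com",)


def is_verified_crawler(ip, allowed_suffixes, reverse_lookup=None, forward_lookup=None):
    """Forward-confirmed reverse DNS check for a claimed crawler IP."""
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        # gethostbyname_ex is IPv4-only; IPv6 would need getaddrinfo.
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]
    try:
        hostname = reverse_lookup(ip).rstrip(".").lower()
    except OSError:
        return False  # no PTR record or lookup failure
    # Require a dot before the suffix so "fakegooglebot.com" does not match.
    if not any(hostname.endswith("." + suffix) for suffix in allowed_suffixes):
        return False
    try:
        return ip in forward_lookup(hostname)
    except OSError:
        return False
```

The forward lookup matters because anyone who controls the reverse zone for their own IP range can make it return a crawler-looking hostname.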

Search Bots and Double Crawling

Search bots like Bingbot can sometimes blindly follow links and crawl the same page multiple times due to different URL parameters. This double, triple, or worse crawling can increase server load and make indexing less efficient. eCommerce sites are especially exposed because product catalogues often have several filtering paths. We've seen Bing go haywire on a number of sites. Most recently, it was issuing around 50,000 requests per day to the search function of a Magento 2 store while cycling through parameters. This dropped to 2-3k requests per day when fixed. On a busy OpenCart store, Bing was responsible for nearly half of all page requests (40k page requests). Configuring it to ignore parameters dropped this to around 4k per day.
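One way to gauge how much of a crawler's traffic is duplicate crawling is to strip the parameters that do not change page content and count how many requests collapse onto the same URL. The sketch below does this with the standard library; the parameter names in `IGNORED_PARAMS` and the example paths are hypothetical and would need to match your own store.

```python
from collections import Counter
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameters that only reorder or paginate the same content.
IGNORED_PARAMS = {"sort", "dir", "limit"}


def canonical_url(url, ignored=IGNORED_PARAMS):
    """Drop ignored query parameters and normalise the order of the rest."""
    parts = urlsplit(url)
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in ignored
    )
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def duplicate_crawls(urls):
    """Map each canonical URL requested more than once to its request count."""
    counts = Counter(canonical_url(url) for url in urls)
    return {url: count for url, count in counts.items() if count > 1}


crawled = [
    "/shoes",
    "/shoes?sort=price",
    "/shoes?sort=name&dir=asc",
    "/hats?colour=red&sort=price",
]
print(duplicate_crawls(crawled))  # {'/shoes': 3}
```

Run against a day of crawler requests from your access logs, the same idea gives a quick estimate of how much load parameter cycling is adding.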

Configuring Search Bots to Ignore Query Parameters

Note: Since this article was published, both Google and Bing have removed the ability to ignore parameters when crawling via their webmaster/search console tools. See the article on using robots.txt to instruct search engines to ignore query string parameters.

To help search bots crawl your site efficiently, you can configure them to ignore specific query parameters. Use these methods:

Configuring Bing Webmaster Tools

Bing Webmaster Tools provides an option to specify URL parameters that should be ignored during the crawling process. To configure this setting, follow these steps:

Sign in to your Bing Webmaster Tools account and select the website you want to manage.
Navigate to the "Configure My Site" section and click on "URL Parameters."
Click on "Add Parameter" and enter the parameter name you want Bingbot to ignore.
Select "Ignore this parameter" from the dropdown menu and click on "Save."

Configuring Bing Webmaster Tools this way helps stop Bingbot double crawling pages with specific URL parameters, reducing server load and improving indexing efficiency.

Managing Other Search Bots

For other search engines like Google, use the relevant webmaster tools to manage URL parameters. In Google Search Console, follow these steps:

Sign in to your Google Search Console account and select the property you want to manage.
Navigate to the "Crawl" section and click on "URL Parameters."
Click on "Add Parameter" and enter the parameter name you want Googlebot to ignore.
Choose "No URLs" from the "Does this parameter change page content seen by the user?" dropdown menu.
Click on "Save."

Specifying the parameters you want search bots to ignore can prevent double crawling and make indexing more efficient.

Final Thoughts

When good or grey bots crawl too aggressively, they can cause the same operational problems as malicious bots: overloaded servers, slower pages, and worse user experience. Monitor website traffic and server load, set clear robots.txt rules, and use the major search engines' webmaster tools to control inefficient crawling. Done properly, this improves website performance and can lower infrastructure costs.

Advanced Anomaly Detection

2023-05-15T13:00:00+10:00

Modern Application Security Platforms need reliable anomaly detection to identify and respond to emerging threats in real time. For DevOps, SRE, and DevSecOps teams, machine learning algorithms such as Robust Random Cut Forest (RRCF) provide the foundation for automated threat detection and response systems that can operate at the scale and speed contemporary applications require.

Strategic Importance of Anomaly Detection in Application Security

Real-time anomaly detection is a core Application Security Platform capability. It helps identify threats before attacks affect application performance or security posture:

Enterprise Threat Landscape

Modern applications face attack vectors that traditional signature-based detection cannot address:

Adaptive Bot Networks: AI-powered bots that modify behaviour based on defensive responses
Zero-Day Exploits: Previously unknown attack patterns that bypass traditional security rules
Volumetric Attacks: DDoS attacks that scale dynamically to evade rate limiting
Insider Threats: Subtle anomalies in user behaviour that indicate account compromise

Application Security Platform Requirements

Effective anomaly detection needs to integrate cleanly with broader security capabilities:

Real-Time Processing: Threat identification within milliseconds of a request arriving
Scalable Architecture: Analysis of millions of requests without performance degradation
Context Awareness: Integration with application metadata and user behaviour profiles
Automated Response: Immediate threat mitigation through dynamic rule deployment

Advanced Machine Learning for Security

Robust Random Cut Forest provides anomaly detection capabilities designed for streaming data environments common in Application Security Platforms:

Algorithmic Advantages for Security Applications

Streaming Data Processing: Real-time analysis without historical data dependencies
Dimensionality Handling: Effective analysis of high-dimensional security feature vectors
Adaptive Learning: Continuous model updates based on evolving traffic patterns
Computational Efficiency: Linear scaling suitable for high-throughput security processing
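To give a feel for why random cuts isolate anomalies quickly, here is a deliberately simplified Python sketch. It uses the random cut rule from RRCF (pick a dimension with probability proportional to its range, then cut uniformly within that range), but it scores a point by how many cuts it takes to isolate it, which is the simpler isolation-depth idea rather than RRCF's collusive displacement score. It is also batch-only, with no streaming inserts or deletes. All names and the synthetic data are illustrative assumptions.

```python
import random


def isolation_depth(points, target, rng):
    """Number of random cuts needed to separate `target` from all other points."""
    depth = 0
    while len(points) > 1:
        dims = range(len(target))
        lows = [min(p[d] for p in points) for d in dims]
        spans = [max(p[d] for p in points) - lows[d] for d in dims]
        if sum(spans) == 0:
            break  # remaining points are identical; nothing left to cut
        # Random cut rule: dimension chosen in proportion to its range.
        d = rng.choices(range(len(spans)), weights=spans)[0]
        cut = lows[d] + rng.random() * spans[d]
        kept = [p for p in points if (p[d] <= cut) == (target[d] <= cut)]
        if len(kept) < len(points):  # skip the rare cut that separates nothing
            points = kept
            depth += 1
    return depth


def mean_depth(points, target, trees=50, seed=0):
    """Average isolation depth over several random trees; lower is more anomalous."""
    rng = random.Random(seed)
    return sum(isolation_depth(points, target, rng) for _ in range(trees)) / trees


# Synthetic data: 40 points in the unit square plus one far-away point.
data_rng = random.Random(1)
normal = [(data_rng.random(), data_rng.random()) for _ in range(40)]
outlier = (25.0, 0.5)
points = normal + [outlier]
print(mean_depth(points, outlier), mean_depth(points, normal[0]))
```

The far-away point is usually cut off by the very first cut, while a point inside the cluster needs several, which is the intuition behind forest-based anomaly scores.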

Implementation in Application Security Platforms

RRCF enables threat detection across multiple security dimensions:

Traffic Pattern Analysis: Identification of unusual request volumes, frequencies, and distributions
Behavioural Anomalies: Detection of user actions that deviate from established profiles
Network Fingerprinting: Recognition of abnormal connection patterns and protocol usage
Content Analysis: Identification of malicious payloads and injection attempts

RRCF Advantages for Application Security Platforms

Traditional batch-processing anomaly detection systems are a poor fit for Application Security Platforms that must respond to threats in real time. RRCF's streaming approach provides practical advantages:

Real-Time Threat Detection

Immediate Analysis: Process and analyse security events as they occur, without waiting for batch processing
Adaptive Baselines: Continuously update normal behaviour models based on current traffic patterns
Memory Efficiency: Maintain configurable rolling windows of security data for optimal performance
Scalable Processing: Handle millions of security events per second without degradation

Security-Optimised Implementation

RRCF's forest-based approach is useful for security applications:

Multi-Dimensional Analysis: Analyse request patterns, user behaviour, and network characteristics at the same time
Shape-Sensitive Detection: Identify subtle changes in attack patterns that signature-based systems miss
False Positive Reduction: Leverage ensemble methods to reduce noise in security alerting
Contextual Awareness: Understand normal application behaviour patterns for more accurate threat detection

Application Security Platform Integration

Enterprise Deployment Architecture

Peakhour's Application Security Platform implements RRCF through high-performance Rust-based processing:

Edge Processing Capabilities

Global Deployment: RRCF analysis deployed across CDN edge locations for minimal latency
Distributed Learning: Aggregated threat intelligence from multiple geographic regions
Local Response: Immediate threat mitigation at the edge without central processing delays
Bandwidth Optimisation: Process security events locally to reduce data transmission requirements

Platform Integration Benefits

Unified Threat Detection: RRCF analysis integrated with WAF/WAAP, bot management, and DDoS protection
Automated Response: Dynamic security rule generation based on anomaly detection results
DevSecOps Workflow: API-first architecture enabling integration with security automation tools
Compliance Reporting: Detailed anomaly detection logs for security audits and regulatory requirements

Advanced Security Use Cases

Credential Stuffing Detection

Behavioural Analysis: Identify unusual login patterns that indicate automated credential testing
Geographic Anomalies: Detect impossible travel scenarios and location-based attack patterns
Volume Analysis: Recognise subtle increases in authentication attempts that indicate coordinated attacks
Success Rate Monitoring: Identify campaigns through abnormal authentication success/failure ratios

API Threat Detection

Endpoint Anomalies: Detect unusual API usage patterns that indicate reconnaissance or exploitation
Rate Pattern Analysis: Identify sophisticated rate limiting evasion techniques
Response Time Analysis: Detect performance impacts from malicious API usage
Authentication Anomalies: Recognise token abuse and API key misuse patterns

Zero-Day Threat Identification

Traffic Pattern Deviations: Identify new attack vectors through unusual request characteristics
Response Pattern Analysis: Detect exploitation attempts through server response anomalies
Protocol Anomalies: Recognise malformed requests that indicate exploit attempts
Payload Analysis: Identify suspicious content patterns in request bodies and parameters

Operational Excellence Through Advanced Anomaly Detection

Performance and Security Integration

RRCF implementation delivers measurable improvements across security and performance metrics:

Threat Detection Speed: Sub-millisecond anomaly identification for real-time response
False Positive Reduction: Ensemble methods reduce security alert fatigue
System Performance: Efficient processing maintains CDN performance whilst enhancing security
Adaptive Learning: Continuous improvement in threat detection accuracy over time

DevSecOps Enablement

Modern Application Security Platforms provide APIs and automation capabilities:

Security Automation: Programmatic access to anomaly detection results for automated response
CI/CD Integration: Security testing and validation integrated into development workflows
Monitoring Integration: SIEM and SOC platform integration for security operations
Custom Rule Development: Framework for developing application-specific anomaly detection rules

Final Thoughts

Advanced anomaly detection through RRCF is a fundamental capability for modern Application Security Platforms. By implementing machine learning algorithms at the edge, organisations can achieve real-time threat detection that adapts to evolving attack patterns whilst maintaining application performance.

The integration of RRCF with security capabilities including WAAP, bot management, and DDoS protection creates a unified platform that addresses the security requirements of contemporary applications and APIs. For DevSecOps teams, this approach enables automated threat response whilst providing the visibility and control needed for effective security operations.

Double MAD?

2023-05-15T13:00:00+10:00

This article explores the use of Double Median Absolute Deviation (Double MAD) for [anomaly detection](/learning/threat-detection/what-is-anomaly-detection/) in time series
data, particularly in skewed or non-symmetric distributions. Double MAD, which calculates two median absolute
deviations — one for data below the median and one for data above — provides a more nuanced approach than traditional
MAD, allowing for accurate detection of anomalies even in skewed data distributions. We also delve into its application
in identifying slow abuse, like bots, by catching lower range anomalies. However, it's important to note Double MAD's
limitations such as not capturing seasonal data shape and trends over time. A comparison is also drawn with the Z-score
method, highlighting that the choice between the two depends on the nature of your data. The article provides insights
into the practical implementation of Double MAD and its potential to improve your data analysis toolkit.

Operational systems increasingly rely on time-series data for decisions. Anomaly detection is one practical use: by identifying patterns that deviate from the norm, businesses can investigate potential issues early or understand unexpected opportunities.

One useful technique for anomaly detection is the Median Absolute Deviation (MAD) and, more specifically, its extension, the Double MAD. This article explains where Double MAD fits in time-series anomaly detection and how it can help identify anomalous clients.

Understanding MAD and Double MAD

MAD, a robust measure of variability, is less susceptible to outliers than standard deviation. It calculates the median of absolute deviations from the data's median, often providing a better representation of 'normal' behaviour in datasets with skewed distributions or outliers.

Double MAD is an extension of MAD, where two MADs are calculated — one for the data below the median and another for the data above. This split gives the detection process a better fit for asymmetric data, which is common in real-world time series data.

Why Double MAD?

While MAD provides a robust way to understand the 'normal' range of a dataset, it assumes a symmetric distribution of data around the median, which may not always hold true. Double MAD is useful where that assumption breaks down, offering an improved anomaly detection process for skewed or asymmetric datasets.

In time-series analysis, especially with 24-hour cycles like web traffic or server usage, patterns can exhibit seasonality and trend components. These patterns can often be asymmetric, making Double MAD a valuable tool for capturing the variability in different parts of the data.

Using Double MAD in Anomaly Detection

The Double MAD implementation provided uses Rust, a systems programming language known for speed and memory safety. The code calculates the lower and upper MAD values, along with their respective thresholds. Anomalies can then be detected by comparing each data point to these thresholds.

An anomaly is defined as a data point that deviates significantly from the expected range. If a data point falls below the lower MAD threshold or above the upper one, it can be flagged as an anomaly. This approach is especially effective when handling datasets with high variability or extreme values.

Double MAD for Anomalous Client Detection

Beyond time-series data, Double MAD can also be instrumental in identifying anomalous behaviour among clients. By comparing each client's behaviour against the Double MAD of the time-series data, teams can pinpoint clients that deviate from the norm.

For instance, in the context of web service usage, an anomalous client might be one that is sending an unusually high or low number of requests. By using Double MAD, you can flag such outliers and take appropriate action, such as investigating potential misuse or reaching out to understand and address any issues they may be facing.

Detecting Lower-Range Anomalies: A Case of Slow Abuse

An interesting application of Double MAD is in detecting lower-range anomalies, a pattern often associated with slow abuse such as low-and-slow bots or slow-rate Distributed Denial of Service (DDoS) attacks. These abuses are characterised by an unusually low frequency of activity that is consistent over a prolonged period. This consistent, low-level activity can fly under the radar of typical anomaly detection systems.

By setting a lower MAD threshold, Double MAD can effectively detect these lower-range anomalies, providing early warning of slow abuse. Its ability to detect both high and low anomalies makes Double MAD a flexible tool for anomaly detection.

The Math Behind Double MAD

To illustrate the power of Double MAD, let's consider a dataset from a right-skewed distribution. Applying the conventional MAD approach might lead to false positives where normal data points are marked as outliers. This is because MAD uses a symmetric interval around the median, which doesn't account for the skewed nature of our data.

With Double MAD, we instead calculate two MADs — one for the data below the median (MAD-lower) and another for the data above (MAD-upper). Outlier thresholds are then defined using these two MADs. The lower threshold is calculated as the median minus a multiplier (k) times MAD-lower. The upper threshold is the median plus k times MAD-upper.

This approach takes into account the asymmetric nature of our data, providing more accurate anomaly detection. For example, in a right-skewed distribution, Double MAD would correctly identify only the extreme right tail values as outliers without incorrectly flagging data points on the left tail.
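The thresholds described above can be sketched in a few lines of Python (this is an illustration, not the Rust implementation mentioned earlier). It assumes values equal to the median are included on both sides of the split, uses k = 3, and omits any consistency constant; the request counts are made up to give a right-skewed sample.

```python
from statistics import median


def mad_thresholds(values, k=3.0):
    """Conventional MAD: a symmetric interval around the median."""
    m = median(values)
    mad = median(abs(v - m) for v in values)
    return m - k * mad, m + k * mad


def double_mad_thresholds(values, k=3.0):
    """Double MAD: separate spreads below and above the median."""
    m = median(values)
    mad_lower = median(m - v for v in values if v <= m)
    mad_upper = median(v - m for v in values if v >= m)
    return m - k * mad_lower, m + k * mad_upper


def outside(values, thresholds):
    """Values that fall below the lower or above the upper threshold."""
    low, high = thresholds
    return [v for v in values if v < low or v > high]


# Hypothetical requests per minute for ten clients (right-skewed).
requests = [10, 11, 12, 12, 13, 14, 16, 20, 27, 90]

print(mad_thresholds(requests))         # (6.0, 21.0)
print(outside(requests, mad_thresholds(requests)))         # [27, 90]
print(double_mad_thresholds(requests))  # (9.0, 33.0)
print(outside(requests, double_mad_thresholds(requests)))  # [90]
```

With the symmetric interval, 27 is flagged even though it sits within the long right tail; Double MAD widens the upper bound and flags only 90. Its lower bound is also tighter (9.0 rather than 6.0), so a client sending, say, 8 requests per minute would be caught as a lower-range anomaly.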

Wrapping Up

Accurate anomaly detection matters when teams rely on time-series data to operate and investigate systems. The Double MAD approach provides a robust method for this, allowing businesses to better understand their data, spot potential issues early, and make more informed decisions.

Whether you're monitoring web traffic, server usage, or client behaviour, leveraging Double MAD can offer valuable insights and help ensure your operations continue to run smoothly. The ability to detect both high and low anomalies makes it especially powerful, providing protection against potential threats like slow abuse.

Understanding and implementing Double MAD gives your data analysis toolkit a more complete view of asymmetric data and helps you detect potential anomalies earlier.

Double MAD vs the Rest

2023-05-15T13:00:00+10:00

Limitations of Double MAD and Comparison with Z-Score

Double MAD is useful for anomaly detection, but it has clear limits. One is that it does not account for the shape of seasonal data. Time series data often show cyclical patterns by time of day, week, or year. For instance, web traffic to an e-commerce site might spike during holidays and dip on off-peak days.

Double MAD can capture shifts in the median of these data, but it does not consider the shape or pattern within these cycles. It might therefore miss anomalies that occur within a specific season, or flag normal seasonal variations as anomalies.

Another limitation is that Double MAD does not account for trends over time. If your time series data shows a consistent increase or decrease, Double MAD might misinterpret this trend as a series of anomalies.

Double MAD vs. Z-Score

In anomaly detection, Double MAD is often compared with the more traditional Z-score method. A Z-score measures how many standard deviations a data point is from the mean. It assumes that the data follows a Gaussian (or normal) distribution, which often does not hold true for real-world data.

Double MAD, on the other hand, is a non-parametric method that does not make assumptions about the distribution of data. This makes it more robust to outliers and skewed distributions.

However, Z-score has an advantage when data follows a Gaussian distribution, or when the values being scored are averages over enough observations for the Central Limit Theorem to make them approximately Gaussian. It also accounts for the mean and standard deviation, giving it an edge in datasets where these measures are informative.

In contrast, Double MAD is more robust for datasets with outliers or skewed distributions, as it uses the median and absolute deviations from the median, which are less sensitive to extreme values.
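The sensitivity of the Z-score to extreme values is easy to demonstrate. In the sketch below, a single very large value inflates both the mean and the standard deviation enough that it no longer exceeds the common cut-off of 3 standard deviations. The data and the cut-off are illustrative assumptions, and the population standard deviation is used for simplicity.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Distance of each value from the mean, in standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)  # assumes the values are not all identical
    return [(v - mu) / sigma for v in values]


def z_score_outliers(values, limit=3.0):
    """Values whose absolute Z-score exceeds the limit."""
    return [v for v, z in zip(values, z_scores(values)) if abs(z) > limit]


# Hypothetical right-skewed sample with one extreme value.
requests = [10, 11, 12, 12, 13, 14, 16, 20, 27, 90]

print(round(max(z_scores(requests)), 2))  # 2.93
print(z_score_outliers(requests))         # []
```

The value 90 is nine times the smallest observation, yet its Z-score stays just under 3 because it drags the mean and standard deviation towards itself. Median-based methods do not suffer from this masking effect to the same degree.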

Both Double MAD and Z-score have strengths, and the choice between them should be guided by the nature of your data. Understanding these nuances helps you choose the method that fits your specific use case.

Scaling anomaly detection with RRCF

2023-05-15T13:00:00+10:00

As data volumes grow, the anomaly detection process has to scale with them. RRCF is efficient, but large, high-dimensional datasets can still create performance challenges. The following strategies focus on reducing dimensionality, smoothing bursts of input, and distributing independent work.

Compute Summary Statistics Instead of Shingling

Shingling transforms a single time series into a multivariate one by stacking lagged versions of the data. This can help capture temporal dependencies, but it also increases the dimensionality of the points inserted into each tree, which can slow the algorithm down.

An alternative is to compute summary statistics that capture the types of anomalies you are looking for. For instance, if you are detecting spikes, the data points could consist of second central differences. If you are looking for long-term trends, the data points could consist of rolling means at different window sizes. This reduces the dimension of the points inserted into each tree, improving performance.
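The difference in dimensionality is easy to see side by side. The helper names below are illustrative; `shingle` builds the lagged vectors, while the other two functions produce one value per time step.

```python
def shingle(series, size):
    """Stack `size` consecutive values into one point (dimension = size)."""
    return [tuple(series[i:i + size]) for i in range(len(series) - size + 1)]


def second_central_differences(series):
    """x[i-1] - 2*x[i] + x[i+1]: large in magnitude around spikes."""
    return [
        series[i - 1] - 2 * series[i] + series[i + 1]
        for i in range(1, len(series) - 1)
    ]


def rolling_mean(series, window):
    """Mean of each run of `window` consecutive values: tracks slower trends."""
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]


series = [1, 1, 1, 9, 1, 1]
print(shingle(series, 3))                  # [(1, 1, 1), (1, 1, 9), (1, 9, 1), (9, 1, 1)]
print(second_central_differences(series))  # [0, 8, -16, 8]
```

A shingle of size 3 gives three-dimensional points, and realistic shingle sizes are much larger. The second central difference keeps each point one-dimensional while still making the spike at 9 stand out.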

Buffer Input and Compute Rolling Summary Statistics

When data arrives too quickly to be inserted into the trees directly, buffer the input and compute rolling summary statistics (mean, median, max, etc.). This reduces the number of points that need to be inserted into the trees and helps the algorithm keep up with the streaming data.
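A minimal sketch of this buffering step is shown below. The `SummaryBuffer` class is a hypothetical helper: it collects a fixed number of raw values and then emits one (mean, median, max) tuple, which would be the point handed to the trees. For simplicity it summarises non-overlapping batches; a sliding window would be a small variation.

```python
from statistics import mean, median


class SummaryBuffer:
    """Collect `size` raw values, then emit one summary point."""

    def __init__(self, size):
        self.size = size
        self._items = []

    def push(self, value):
        """Add a value; return a (mean, median, max) tuple when the buffer fills."""
        self._items.append(value)
        if len(self._items) < self.size:
            return None
        summary = (mean(self._items), median(self._items), max(self._items))
        self._items = []  # start the next batch
        return summary


buffer = SummaryBuffer(size=3)
for value in [1, 2, 6, 4, 4, 4]:
    point = buffer.push(value)
    if point is not None:
        print(point)  # one summary per three raw values
```

With a buffer size of 3, the detector sees a third as many points, each carrying the centre and the peak of its batch.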

Parallelisation

RRCF can be parallelised, which is particularly useful when dealing with multiple independent time series. Different RRCF instances can be run for each time series, using separate processes or server instances. This distributes the computational load and can improve performance.

For instance, if you have 10 independent time series, you can run 10 instances of RRCF in parallel, each focusing on one time series. This scales the anomaly detection process to handle larger volumes of data.
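The fan-out pattern can be sketched with Python's `concurrent.futures`. The `largest_jump` function below is only a placeholder for the per-series work (in practice, feeding one series through its own RRCF instance), and the series names and values are made up.

```python
from concurrent.futures import ThreadPoolExecutor


def largest_jump(series):
    """Placeholder for per-series detection: the biggest step between neighbours."""
    return max((abs(b - a) for a, b in zip(series, series[1:])), default=0)


def score_all(series_by_name, executor, job=largest_jump):
    """Run one independent job per time series on the given executor."""
    names = list(series_by_name)
    results = executor.map(job, (series_by_name[name] for name in names))
    return dict(zip(names, results))


traffic = {
    "checkout": [5, 6, 5, 40, 6],
    "search": [20, 21, 19, 20, 22],
}
with ThreadPoolExecutor(max_workers=2) as pool:
    print(score_all(traffic, pool))  # {'checkout': 35, 'search': 2}
```

The example uses threads so it runs anywhere. Because `score_all` accepts any executor, passing a `ProcessPoolExecutor` instead runs each series in a separate process, which is closer to the setup described above for CPU-bound detection.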

Conclusion

Scaling RRCF for large datasets usually means reducing the work each tree has to do, controlling input volume, and parallelising where the data allows it. Summary statistics, input buffering, and independent RRCF instances can help manage high-dimensional data and high data velocities without changing the underlying anomaly detection goal.

Applied RRCF - Thresholding Techniques

2023-05-15T13:00:00+10:00

Once we've applied the RRCF algorithm to our streaming data, the resulting scores measure how anomalous each data point is. To classify data points as "normal" or "anomalous", we still need to set a threshold. This defines the level of deviation considered anomalous and controls how often anomalies are over-identified or missed.

Why is Thresholding Needed?

Thresholding matters in anomaly detection because it separates normal and anomalous behaviour. Without a threshold, the scores still indicate relative degrees of anomalousness, but they do not provide a clear dividing line between normal points and anomalies.

Set the threshold too low and normal data points may be misclassified as anomalies, increasing false positives. Set it too high and actual anomalies may be missed, increasing false negatives.

How to Set the Threshold?

There are several ways to set a threshold for RRCF scores, including the Median Absolute Deviation (MAD), Min/Max, and others. The right method depends on the characteristics of the data and the specific use case.

Median Absolute Deviation (MAD)

The Median Absolute Deviation is a robust measure of variability in a data set. For RRCF scores, MAD can be used to set a threshold. A typical approach is to set the threshold as some multiple of the MAD above the median. This approach is robust to outliers and can be useful when the data has heavy-tailed distributions.
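As a minimal sketch, the threshold is the median score plus k times the MAD. The multiplier k = 3 and the example scores are illustrative assumptions, not recommended values.

```python
from statistics import median


def mad_threshold(scores, k=3.0):
    """Threshold = median + k * MAD of the anomaly scores."""
    m = median(scores)
    mad = median(abs(s - m) for s in scores)
    return m + k * mad


# Hypothetical anomaly scores, one clearly elevated.
scores = [2, 3, 3, 4, 4, 5, 30]
threshold = mad_threshold(scores)
print(threshold)                               # 7.0
print([s for s in scores if s > threshold])    # [30]
```

Because both the median and the MAD ignore the size of the extreme score, the threshold would stay at 7.0 even if the 30 were 300.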

Min/Max

Another approach is to use the minimum and maximum RRCF scores to set the threshold. This could mean setting the threshold as a percentage of the range between the minimum and maximum scores. The method is straightforward, but it can be sensitive to extreme score values.
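A sketch of the range-based threshold follows; the 50% fraction and the example scores are illustrative assumptions.

```python
def minmax_threshold(scores, fraction=0.5):
    """Threshold placed `fraction` of the way from the minimum to the maximum score."""
    low, high = min(scores), max(scores)
    return low + fraction * (high - low)


# Hypothetical anomaly scores, one clearly elevated.
scores = [2, 3, 3, 4, 4, 5, 30]
print(minmax_threshold(scores))                        # 16.0
print(minmax_threshold([2, 3, 3, 4, 4, 5, 30, 300]))   # 151.0
```

The second call shows the sensitivity to extremes: one very large score moves the threshold from 16.0 to 151.0, so the score of 30 would no longer be flagged.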

Z-Score and Other Methods

Several other methods can be used to set the threshold, depending on the data. These include the Z-score approach of setting the threshold a fixed number of standard deviations from the mean, using quartiles of the data, or using machine learning techniques to dynamically adjust the threshold based on observed data.
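Two of these alternatives, standard deviations from the mean and quartiles, can be sketched with the standard library. The multipliers (k = 2 for the standard deviation rule, 1.5 for the interquartile range) and the example scores are illustrative assumptions; `statistics.quantiles` requires Python 3.8 or later.

```python
from statistics import mean, pstdev, quantiles


def stdev_threshold(scores, k=2.0):
    """Threshold = mean + k standard deviations."""
    return mean(scores) + k * pstdev(scores)


def iqr_threshold(scores, k=1.5):
    """Threshold = third quartile + k * interquartile range."""
    q1, _, q3 = quantiles(scores, n=4)
    return q3 + k * (q3 - q1)


# Hypothetical anomaly scores, one clearly elevated.
scores = [2, 3, 3, 4, 4, 5, 30]
print(iqr_threshold(scores))    # 8.0
print(stdev_threshold(scores))  # higher, because the extreme score inflates it
```

The quartile-based threshold sits close to the bulk of the scores, while the standard deviation threshold is pulled upwards by the extreme value it is meant to detect.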

Conclusion

Thresholding gives anomaly detection a clear boundary between normal and anomalous scores, which helps identify potential issues such as cyber threats or system errors. The choice of thresholding method depends on the use case and the characteristics of the data. Whatever method is used, the threshold needs to balance anomaly detection against the risk of false positives and false negatives.