How to enhance digital threat monitoring with machine learning
Traditional cyber security defences are designed to protect assets within an organisation's network, yet these assets often extend beyond network perimeters, increasing a company's risk of exposure, theft and financial loss.
As part of a complete digital risk protection solution, a digital threat monitoring (DTM) solution automatically collects and analyses content streamed from external online sources, and subsequently alerts defenders whenever a potential threat is detected.
This capability allows organisations to expose threats early, and more effectively identify potential breaches and exposures before they escalate – without adding operational complexity for already overburdened security teams.
But what is digital threat monitoring, and why is it so hard to get right?
One newly released DTM module alerts customers to threats emanating from social media, the deep and dark web, paste sites and other online channels. An organisation can use this module to monitor, in real time, digital threats that directly or indirectly target its assets.
Advanced DTM can also provide pivoting opportunities for further enrichment, context, or threat hunting. DTM supports a wide variety of use cases.
- A threat intelligence analyst wants to discover threat actors actively targeting their infrastructure, so they can prioritise defences and remediation.
- A CISO needs to identify threats to their vendors and supply chain, so they can proactively mitigate that risk.
- A threat hunter wishes to identify possible data leaks and breaches to uncover attackers in an environment and minimise dwell time.
DTM is a continuous loop: data collection, content analysis, alerting, remediation and takedowns, followed by search refinement and further collection. A DTM module needs to evolve continuously to enable organisations to be proactive about digital threats.
In addition to the dynamically changing nature of ingested content and the threat landscape itself, the diversity of ingested sources presents another significant technical challenge. While a customer wants a seamless, consistent end-to-end experience for each new source plumbed through DTM, documents derived from different sources can vary widely in terms of their structure, semantic composition, language and length.
Legacy solutions rely primarily on keyword matching to address the issues outlined above. However, individual keywords can match documents in a variety of irrelevant contexts. Also, keyword matching is a brittle, signature-based approach that inevitably fails to recognise novel entities and threats as they evolve.
Worse, trying to define complex threat concepts, such as credential dumps or the release of new exploits, using simple combinations of keywords can be an impossible task. Often, this results in huge, unmanageable monitoring rules with hundreds or thousands of independent keywords.
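To make the brittleness concrete, here is a minimal sketch of a naive keyword rule; the keyword set and documents are invented for illustration. Because the rule has no notion of context, an irrelevant document triggers it just as readily as a genuine threat.

```python
# Invented keyword set for a hypothetical "credential dump" monitor.
KEYWORDS = {"dump", "credentials", "leak"}

def keyword_match(document: str) -> bool:
    """Alert if any keyword appears anywhere in the document."""
    words = set(document.lower().split())
    return bool(words & KEYWORDS)

docs = [
    "fresh credentials dump posted on the forum",  # true positive
    "council to dump waste collection charges",    # false positive
    "how to leak proof your water bottle",         # false positive
]
print([keyword_match(d) for d in docs])  # all three documents match
```

Tightening the rule with more keywords only grows it, while loosening it misses novel phrasings: the dilemma that drives rules to hundreds or thousands of terms.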
Given these challenges, it is essential to take a data-driven approach using machine learning to extract valuable information and present it in a user-friendly way.
The latest DTM modules leverage machine learning (ML) and natural language processing (NLP) to analyse and extract actionable patterns continuously from millions of documents each day. This empowers DTM customers to craft custom monitoring rules to expeditiously identify content that matters most to their organisation.
DTM is underpinned by seven conditionally gated machine learning models that have been implemented, evaluated and deployed to production. Together, these form an end-to-end, cloud-based NLP pipeline that enriches ingested documents with entity extractions and classifications.
This makes it convenient for customers to query proprietary data stores and customise alerts to what they care about most. From a technical point of view, this architecture also derives immediate benefits in terms of being able to:
- Measurably reduce false positives and improve the quality of dispositioned alerts
- Scale horizontally to handle arbitrary increases in document volume
- Quickly capture errors and feedback, enabling rapid iteration
- Expose entities and classifications produced by individual models to populate global views and historical trends.
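The architecture described above can be sketched as a chain of model stages that each enrich a document in turn. The stage names and stub logic below are assumptions for illustration only; the production pipeline uses seven conditionally gated ML models rather than these toy rules.

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    text: str
    entities: list = field(default_factory=list)
    topics: list = field(default_factory=list)

def extract_entities(doc: Document) -> Document:
    # Stub stage: the real pipeline uses transformer-based entity extraction.
    if "Apple" in doc.text:
        doc.entities.append(("Apple", "organisation"))
    return doc

def classify_topics(doc: Document) -> Document:
    # Stub stage: the real pipeline uses a semi-supervised topic classifier.
    if "vulnerability" in doc.text.lower():
        doc.topics.append("information-security/vulnerability")
    return doc

PIPELINE = [extract_entities, classify_topics]

def enrich(doc: Document) -> Document:
    # Each stage attaches its outputs; conditional gating could skip a
    # stage based on earlier results.
    for stage in PIPELINE:
        doc = stage(doc)
    return doc

doc = enrich(Document("New vulnerability affecting Apple devices"))
print(doc.entities, doc.topics)
```

Because each stage is independent, stages can be scaled horizontally and individual model outputs exposed downstream, matching the benefits listed above.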
Advanced neural network-based NLP techniques have been integrated in developing the individual machine learning models that make up the pipeline. State-of-the-art transformer neural networks have been applied to security tasks like detecting social media information operations, malicious URLs, and even malware binaries.
Transformers learn context in parallel by tracking long-distance relationships among sequential data, like words in a document, beating out the previous generation of models that inefficiently processed words within a limited window and produced more errors when related words occurred far away from each other.
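The core idea can be illustrated with a toy version of attention: every token is compared against every other token at once, so a relationship between distant words costs no more than one between neighbours. The similarity scores and token values below are invented numbers, not learned weights.

```python
import math

def softmax(scores):
    """Normalise raw similarity scores into attention weights."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query_scores, values):
    """Weighted sum over ALL token values, regardless of distance."""
    weights = softmax(query_scores)
    return sum(w * v for w, v in zip(weights, values))

# Token 0 ("credentials") attends strongly to token 5 ("dumped"),
# even though five positions separate them.
scores = [0.1, 0.0, 0.0, 0.0, 0.0, 2.0]  # similarity to each token
values = [1.0, 0.0, 0.0, 0.0, 0.0, 5.0]  # per-token representations
out = attend(scores, values)
print(out)  # dominated by the distant token's value
```

A windowed model would never have compared tokens 0 and 5 directly; attention makes the long-distance link explicit, which is why transformers handle related words that occur far apart.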
Additionally, a novel semi-supervised topic classifier combines subject matter expert knowledge with a data-driven ML approach to identify high-level threat topics within each document. High levels of accuracy and noise reduction have been achieved by utilising Transformer models and topic modelling.
High levels of accuracy from the pipeline's machine learning models translate into improved experiences for customers using DTM. Because entity types are extracted from ingested documents, organisations looking for supply chain vulnerabilities affecting Apple products do not need to scroll past noisy documents mentioning apple pie recipes.
Entities help customers cut through the noise present within large volumes of documents. The most advanced pipeline currently supports over 40 distinct entity types, with more planned, giving customers a rich set of accurately detected entities for crafting precise monitors that surface the most relevant information.
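A hedged sketch of how entity-aware filtering differs from the keyword rule shown earlier; the enriched document shape and entity labels here are assumptions for illustration. The apple pie document contains the keyword but carries no organisation entity, so it never reaches the customer.

```python
def matches_monitor(doc: dict, entity: tuple) -> bool:
    """Alert only when the required (text, type) entity was extracted."""
    return entity in doc.get("entities", [])

docs = [
    {"text": "Supply chain flaw in Apple build tools",
     "entities": [("Apple", "organisation")]},
    {"text": "Grandma's best apple pie recipes",
     "entities": []},  # keyword present, but no organisation extracted
]
hits = [d for d in docs if matches_monitor(d, ("Apple", "organisation"))]
print(len(hits))  # only the supply chain document matches
```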
Finally, machine learning simplifies the creation of monitors by empowering customers to filter documents by high level topics. Documents flowing through the NLP analysis pipeline are tagged with up to 40 industry or threat topic labels, allowing customers to tailor alerts they receive to common threats and categorised security-related content, or to those pertaining specifically to their industry vertical.
Topics give DTM customers another way to refine their alerts beyond simple keyword matching: a monitor condition requiring the information-security/compromised topic, for example, filters away incoming documents about life hacks or growth hacking.
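A minimal sketch of such a topic-gated monitor condition. The topic label follows the information-security/compromised example above; the document shape and the other topic labels are assumptions for illustration.

```python
REQUIRED_TOPIC = "information-security/compromised"

def monitor_alert(doc: dict) -> bool:
    # A keyword hit alone is not enough: the topic classifier must agree.
    return "hack" in doc["text"].lower() and REQUIRED_TOPIC in doc["topics"]

docs = [
    {"text": "Database hacked, records for sale",
     "topics": ["information-security/compromised"]},
    {"text": "Ten life hacks for a tidy desk", "topics": ["lifestyle"]},
    {"text": "Growth hacking your newsletter", "topics": ["marketing"]},
]
alerts = [d for d in docs if monitor_alert(d)]
print(len(alerts))  # only the breach document alerts
```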
DTM has undergone rigorous internal evaluation, so users can be confident that the entities and classifications from which monitors are built reflect state-of-the-art NLP and threat intelligence capabilities.