If I ask anyone working in a SOC for a high-level description of their monitoring process, the answer will most likely look like this:
“The SIEM generates an alert, the first level analyst validates it and sends it to the second level. Then…”
Most SOCs today work by having their first level analysts – the most junior analysts, usually assigned as the 24×7 eyes on the console – parse the alerts generated by the security monitoring infrastructure and decide whether something needs action from the more experienced/skilled second level. There is usually some prioritisation of the alerts through the assignment of severity levels, reminiscent of the old syslog severity labels such as CRITICAL, WARNING, INFORMATIONAL and DEBUG.
Most SOCs will have far more alerts being generated than the manpower to address them all, so they usually put rules in place such as “address all HIGHs immediately, address as many of the MEDIUMs as we can, don’t touch the LOWs”. It is certainly prioritisation 101, but what happens when there are too many critical/high alerts?
Should they prioritise within that group as well? And if many medium or low severity alerts are being generated about the same entity (an IP, or a user), isn’t that something that should be bumped up the prioritisation queue? Many teams and tools try to address those concerns in one way or another, but I have the impression that this entire model is showing signs of decline.
If we take a careful look at the interfaces of the newest generation of security tools, we will notice that alerts are no longer the entities listed on the primary screens; what most tools do now is consolidate the many different alerts generated into a numeric score for each entity, most commonly users and endpoints. Most tools call these scores “risk scores” (which is awful and confusing, as they usually have little to do with “risk”).
The idea is to show on the main screen the entities with the top scores – those with the most signs of security issues linked to them – so the analyst can click on one of them and see all the reasons, or alerts, behind the high score. This automatically addresses both the issue of prioritising among the most critical alerts and the concern about multiple alerts on a single entity.
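To make the idea concrete, here is a minimal sketch of that “main screen” logic. All entity names, alert names and point values are illustrative, not taken from any specific product: alerts increment per-entity scores, and the view lists entities ranked by score, with the contributing alerts available as drill-down.

```python
from collections import defaultdict

# Hypothetical alerts: (entity, alert name, points). Names and
# weights are made up for illustration.
alerts = [
    ("host-17", "malware detected", 40),
    ("user-alice", "impossible travel", 25),
    ("host-17", "C&C beacon", 35),
    ("user-bob", "failed logins", 10),
    ("user-alice", "large upload", 30),
]

scores = defaultdict(int)
reasons = defaultdict(list)
for entity, name, points in alerts:
    scores[entity] += points       # consolidate alerts into one score
    reasons[entity].append(name)   # keep the evidence behind the score

# The "main screen": entities ranked by score, highest first.
for entity in sorted(scores, key=scores.get, reverse=True):
    print(entity, scores[entity], reasons[entity])
```

Note how an entity with several individually unremarkable alerts can still rise to the top of the list, which is exactly the “many low alerts on one entity” concern the score view addresses.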
For a SOC using a score-based view, the triage process could be adapted in two different ways. In the first, the highest scores are addressed directly by the second level, removing the first level pre-assessment and allowing a faster response to what is more likely to be a serious issue, while the first level works on a second tier of scores.
The second way keeps the initial parsing by the first level, but with one basic difference: analysts keep picking entities from the top of the list and work as far down it as they can, sending the cases that require further action to the second level (which can apply the same approach to the cases forwarded by the L1).
This may look like a simple change (or, for the cynics, no change at all), but using scores can really be a good way to improve the prioritisation of SOC efforts. Scores are not only useful for that, though: they are also a mechanism to improve the correlation of security events, which usually come from different security monitoring systems or even from SIEM correlation rules.
What we normally see as security event correlation is something like “if you see X and Y, alert”, or “if you see X n times and then Y, alert”. Recently, many correlation rules have been written to reflect the “attack chain”: “if you find a malware infection event, following a payload download and an established C&C channel, alert”.
The issue with that is that you need very good normalisation of the existing events to keep the number of rules at an acceptable level (you don’t want to write a rule for every combination that includes a C&C detection event, for example). You can also miss attacks where the observed events do not follow the expected attack chain.
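The brittleness of chain-style rules is easy to demonstrate. Below is a toy correlation rule (event names are illustrative) that fires only when the expected sequence appears in order; the same signals arriving in a different order, or with a step missing, leave it silent.

```python
# A classic "attack chain" correlation rule, reduced to a sketch:
# fire only if these steps appear as an ordered subsequence.
CHAIN = ["payload_download", "malware_infection", "cnc_established"]

def chain_rule_fires(events):
    """True if the events contain CHAIN as an ordered subsequence."""
    it = iter(events)
    # Each "step in it" consumes the iterator up to the match,
    # so the steps must occur in the expected order.
    return all(step in it for step in CHAIN)

# Full chain observed in order (with noise in between): rule fires.
print(chain_rule_fires(
    ["payload_download", "port_scan", "malware_infection", "cnc_established"]
))  # True

# Same kind of signals, unexpected order: rule stays silent.
print(chain_rule_fires(["cnc_established", "malware_infection"]))  # False
```

A score-based model would have raised the entity’s score in both cases, since each event contributes points regardless of ordering.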
The improvement in prioritisation comes from the fact that, in this new model, every event increments the scores of the associated entities by a certain discrete amount. Whenever a new event or event type is defined within the system, the number of points it adds to the score is determined. Smarter systems could even set those points dynamically according to attributes of the event (more points for a data exfiltration event when a larger data transfer is detected).
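A sketch of that dynamic point assignment, with made-up event types and weights: each event type has base points, and attributes of the individual event (here, the size of a transfer) can adjust them.

```python
# Base points per event type (all values illustrative).
BASE_POINTS = {"failed_login": 5, "cnc_beacon": 35, "data_exfiltration": 30}

def points_for(event_type, **attrs):
    pts = BASE_POINTS.get(event_type, 10)  # default for unknown types
    # Dynamic adjustment: larger transfers make exfiltration worth more.
    if event_type == "data_exfiltration":
        mb = attrs.get("megabytes", 0)
        pts += min(50, mb // 100)  # bonus capped so one event can't dominate
    return pts

score = 0
score += points_for("failed_login")                       # 5
score += points_for("data_exfiltration", megabytes=2000)  # 30 + 20 = 50
print(score)  # 55
```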
The beauty of the score model is that scores go up (and eventually hit a point where the entity becomes a target for further investigation) through any combination of events, with no need to envision the full attack chain in advance and describe it in a correlation rule. This is how most modern UEBA (User and Entity Behaviour Analytics) tools work today: a set of interesting anomalies is defined within the system (either as pre-packaged content or by the users), and every time one is observed the scores of the affected entities are incremented.
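The accumulation mechanism can be sketched in a few lines (threshold, entities and point values all illustrative): events add points, and crossing a threshold queues the entity for investigation, with no correlation rule ever naming the specific combination.

```python
INVESTIGATE_THRESHOLD = 60  # illustrative cut-off

scores = {}
investigate = []  # entities queued for further investigation

def record(entity, points):
    scores[entity] = scores.get(entity, 0) + points
    if scores[entity] >= INVESTIGATE_THRESHOLD and entity not in investigate:
        investigate.append(entity)

# Two signals no rule was ever written for as a pair
# still push the entity over the line together.
record("host-42", 35)   # e.g. C&C beacon anomaly
record("host-42", 30)   # e.g. unusual admin activity
record("user-bob", 10)  # a lone low-severity event stays below the bar

print(investigate)  # ['host-42']
```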
Here is a nice example of a UEBA tool interface using scores:
Score-based monitoring systems can improve even further: feedback from the analysts could be used to dynamically adapt the score contribution of each event or event type, using something like Naive Bayes, for example. We’ve been doing that for spam filtering for ages.
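One way that feedback loop could look, loosely in the spirit of Bayesian spam filters (the event types, counts and scaling are all invented for illustration): track how often analysts confirm versus dismiss each event type, and derive its points from the smoothed log-odds of it indicating a real incident.

```python
import math

# Analyst feedback per event type: (confirmed, dismissed).
# Numbers are illustrative.
feedback = {
    "failed_login": (2, 40),  # almost always dismissed
    "cnc_beacon": (30, 5),    # usually confirmed
}

def adaptive_points(event_type, scale=10.0):
    confirmed, dismissed = feedback.get(event_type, (0, 0))
    # Add-one smoothing keeps unseen event types at a neutral log-odds
    # of 0, and avoids division by zero.
    log_odds = math.log((confirmed + 1) / (dismissed + 1))
    return max(0.0, round(scale * log_odds, 1))

print(adaptive_points("cnc_beacon"))   # high: analysts keep confirming it
print(adaptive_points("failed_login")) # floors at zero: mostly dismissed
```

In a real deployment the counts would be updated every time an analyst closes a case, so noisy event types gradually lose their influence on the entity scores without anyone retuning weights by hand.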
The score trend is already clear on the technology side; SOC managers and analysts should review their processes and training to get the full benefit of this approach. How do you see your organisation adopting it? Feasible? A distant dream? Or maybe you think it doesn’t make sense?
Of course, if your SOC is already working with a score-based approach, I’d also love to hear about that experience!
Article by Augusto Barros, Gartner research director