As you probably know, at Hunt and Hackett we absolutely love automation. We apply it everywhere we can to avoid mistakes and improve repeatability. It helps us support multiple different technologies and saves our customers time and effort when working with us. A significant part of our infrastructure is as-code, and we have multiple automated build processes running in several places.
One of the more difficult departments to bring automation into has got to be the Security Operations Center (or Cyber Defense Center) - a place where every second counts and mistakes can be costly, yet human judgement is needed most of the time. At first glance, questions like “is this malicious?” or “is this abnormal?” have simple yes/no answers. But as soon as you actually try to answer them, you see that there are so many variables, dependencies, and pieces of context - just too much information - that you usually end up with the cliché “it depends”. But computers don’t work with nuanced answers like that. It has to be binary: one or zero, true or false, if-then-else.
So human analysts are still very much needed to avoid “computer says no” situations: you can imagine shutting down the wrong computer could have a major impact on your business. Humans are capable of reasoning, judgement, and ownership, can understand the full context in which they work, and are better at handling unexpected situations.
But most of the time there are no unexpected situations. Detection rules get triggered by benign actions, some might even be false positives, but they must all be looked at and handled according to service level agreements. Basically, “monkey / monkish” analyst work that is both not very challenging (could be done by monkeys) and very repetitive (only monks would have the discipline). And here is where mistakes can creep in: you assume too quickly that a large flood of alerts is all completely the same and can be closed as false positive; you copy-paste an email template and send it to the wrong customer; you get so sick and tired of seeing the same alert over and over again that you don’t have any brain power left for analyzing a true positive.
So that’s where our Automatic Triage Framework comes in: a robust and efficient way to automate alert handling, freeing up analysts’ time for the things that really matter, increasing our code proficiency, and making sure the SOC remains a fun place to work. This article will show you how and why we implemented it.
Just like most SOCs we use a SOAR for alert handling, specifically XSOAR by Palo Alto. This enables us to connect with several different alert sources (like Chronicle/Google SecOps, Microsoft Defender, or CarbonBlack) while working in a single interface. It also provides automation in the form of playbooks – flowchart-like work plans consisting of both manual and automated tasks, structured into branches by conditional checks. The platform community maintains many API integrations, mostly written in Python, which are used to actually do things: running queries, sending emails, isolating hosts, disabling user accounts, and so on.
Figure 1 - Workflow for setting the SLA time
However, while these playbooks are easy to understand when kept simple, they quickly become difficult to follow once you try to abstract things away. Sub-playbooks calling sub-sub-playbooks get confusing and are hard to manage. The only way to build playbooks in XSOAR is the visual editor, which yields large YAML files with a custom schema that is not meant to be human-readable. So that is when we started to focus more on actual code. Instead of creating a playbook with complex logic flows and error handling, we decided to write it as an automation script. This is possible because the functionality of API integrations is exposed as commands, which can be called from playbooks, from scripts, or manually during analysis. It also lets us use tried-and-true methods for version control, code reviews, and design patterns, instead of having to invent best practices for a format we did not design.
Working this way gave us some very handy tools on top of the quite expansive functionality already built into XSOAR. We can now support complex SLA structures, render information in custom views, and load and validate externally managed service configurations. The prototype for automatic triage started out as an allowlisting feature: a script that checked each incident against a predefined list of simple conditions and closed it on a match, without any human needing to look at it.
That of course really tickled our fancy, and we wanted more, but we hit the limits of how we had designed our allowlist system: conditions written in a config file, with only a limited set of logic available. Some commonly occurring alerts could be closed based on just slightly more complex rules – and if we could automate those, it would free up lots of time.
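To give an idea of the kind of rules that format supported, here is a minimal sketch of a config-driven allowlist check; the field names, operators, and rule structure are illustrative, not our actual configuration format:

```python
# Minimal sketch of a config-driven allowlist check (hypothetical format).
# Each rule is a list of conditions; every condition must match for the
# incident to be auto-closed.
from typing import Any

ALLOWLIST_RULES = [  # illustrative example rule
    [
        {"field": "customer", "operator": "equals", "value": "example-customer"},
        {"field": "parentprocess", "operator": "endswith", "value": "\\setup.exe"},
    ],
]

OPERATORS = {
    "equals": lambda actual, expected: actual == expected,
    "endswith": lambda actual, expected: str(actual).endswith(expected),
    "contains": lambda actual, expected: expected in str(actual),
}


def allowlist_match(incident_fields: dict[str, Any]) -> bool:
    """Return True if any rule has all of its conditions satisfied."""
    for rule in ALLOWLIST_RULES:
        if all(
            OPERATORS[c["operator"]](incident_fields.get(c["field"]), c["value"])
            for c in rule
        ):
            return True
    return False
```

Anything that could not be expressed with this handful of operators simply could not be automated, which is exactly where we kept bumping into the ceiling.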
Eventually we decided to fully open up the logic and have actual code to check if an alert can be closed. This would be flexible enough to accommodate a larger scale and would save us having to use a custom condition language. Furthermore, instead of writing these conditions, our analysts would actually become proficient in regular Python, which is a very useful skill in many other areas, and contributes to our goal of having a skilled SOC. This is the foundation of the framework: a solution that is flexible and scalable enough for our needs, yet usable by people with varying programming skills.
The most important effect, however, is that it gives us control to make our work better, less boring, and more challenging, making sure the SOC remains a place where people can learn and grow.
The core of the framework is the interface with XSOAR, especially running commands. As you may know, XSOAR connects with APIs using its integrations, which then expose commands to be used in playbooks and in the command input of the web interface. Automation scripts are also called this way. These commands are commonly built to write their output into XSOAR’s incident context data structure, which makes them easy to use in playbooks. They can also be called from within automation scripts themselves (using the Python API), but there that same output format is convoluted and confusing, because all you need is the command’s result.
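To make that concrete: calling a command from a script boils down to something like the sketch below, using the standard demisto Python API (demisto.executeCommand and the isError helper from CommonServerPython). The run_command wrapper itself is a hypothetical simplification of what the framework does.

```python
import demistomock as demisto  # provided by the XSOAR script runtime
from CommonServerPython import *  # noqa - isError() and friends live here


def run_command(command: str, args: dict) -> list:
    """Run an XSOAR command and return only the useful contents of each entry.

    executeCommand returns a list of war-room entries; inside a script we
    usually only care about the 'Contents' of the non-error entries.
    """
    entries = demisto.executeCommand(command, args) or []
    contents = []
    for entry in entries:
        if isError(entry):
            raise RuntimeError(f"{command} failed: {entry.get('Contents')}")
        contents.append(entry.get("Contents"))
    return contents
```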
Another core functionality is the ability to access incident investigation data. This is also stored in XSOAR, in its incident context and fields. When the framework is started up, all this data gets passed to it by the entry point. It then exposes simple functions to retrieve and use it in various ways.
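A rough idea of what those retrieval functions can look like, assuming the standard demisto.incident() and demisto.context() calls; the helper names and the dotted-path lookup are our own illustration:

```python
import demistomock as demisto  # provided by the XSOAR script runtime


def get_incident_field(name: str, default=None):
    """Read a custom or built-in field from the current incident."""
    incident = demisto.incident()
    # Custom fields live under 'CustomFields'; built-in ones sit at the top level.
    return incident.get("CustomFields", {}).get(name, incident.get(name, default))


def get_context(path: str, default=None):
    """Read a value from the investigation context, e.g. 'Endpoint.Hostname' (sketch)."""
    value = demisto.context()
    for key in path.split("."):
        if not isinstance(value, dict):
            return default
        value = value.get(key, default)
    return value
```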
Figure 2 - Class diagram for the module “defender_prevented”
The framework is divided into two main parts: the Matcher and a corresponding Handler. Specifically, every case is a single Python module containing one of each, inheriting from their base classes. Any functionality needed in both is inherited from Base, located a level higher. Instead of inheriting from the Handler base class directly, it is also possible to use what we have named intermediate handlers. These can be used to group certain helper functions, for instance related to a single security product (like querying Defender for Endpoint in the example above).
We have done this because the matching is run on every single alert we get. As the number of matchers increases, we need to make sure that it doesn’t slow down the playbook with expensive and/or slow queries. The matching logic should run only on data already available at the time it’s executed and be as specific as possible. Anything more complex should be placed in the handler, which at any point is able to abort and leave the incident to the analyst.
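To give an impression of the structure, here is a heavily simplified skeleton with names modelled on Figure 2; the real base classes contain considerably more, and the details below are illustrative:

```python
from abc import ABC, abstractmethod


class Base(ABC):
    """Functionality shared by matchers and handlers, e.g. access to incident data."""

    def __init__(self, incident: dict):
        self.incident = incident

    def note(self, message: str) -> None:
        """Post a War Room note (see further below); body omitted in this sketch."""
        ...


class Matcher(Base):
    """Cheap check that runs on every alert; uses only data that is already available."""

    @abstractmethod
    def matches(self) -> bool:
        ...


class Handler(Base):
    """Does the (possibly expensive) follow-up work and decides whether to close."""

    @abstractmethod
    def handle(self) -> bool:
        """Return True to close the alert, False to hand it back to the analyst."""


class DefenderHandler(Handler):
    """Intermediate handler: groups reusable Microsoft Defender query helpers."""

    def query_defender(self, query: str):
        ...  # e.g. wrap the relevant XSOAR integration command
```

A case module such as defender_prevented then contributes one Matcher and one Handler subclass, as shown in the examples further on.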
The framework entry point is a small automation script based on the Docker image, situated in the playbook flow just after enrichment but before analyst assignment. It makes sure that any errors get caught and that results are normalized, so the playbook either closes the alert or continues down the manual analysis path. To resolve errors as soon as possible, we send a notification to on-call colleagues who have the technical skills to fix any problems that might occur. (This error flow is also implemented in other services that use custom components, like automated vulnerability reporting.)
Figure 3 - AutoTriage integrated in the playbook, showing the error path
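Conceptually, the entry point does little more than the sketch below; the matching and notification stubs are hypothetical placeholders, and the output shape is illustrative:

```python
import demistomock as demisto  # XSOAR script runtime
from CommonServerPython import *  # noqa - return_results / CommandResults live here


def find_matching_handler(incident):
    """Hypothetical stub: run every registered Matcher, return the case's Handler on a hit."""
    return None


def notify_on_call(exc):
    """Hypothetical stub: page the on-call engineer, e.g. via a messaging integration."""
    demisto.error(f"AutoTriage failed: {exc}")


def main():
    try:
        handler = find_matching_handler(demisto.incident())
        verdict = "close" if handler and handler.handle() else "manual"
    except Exception as exc:
        notify_on_call(exc)
        verdict = "manual"  # fail safe: never auto-close when something went wrong
    # The playbook branches on this output: close the alert or assign an analyst.
    return_results(CommandResults(outputs_prefix="AutoTriage", outputs={"verdict": verdict}))


if __name__ in ("__main__", "__builtin__", "builtins"):
    main()
```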
Another useful function is sending notes to the War Room. These are used to show what the framework is doing and to guide the analyst in case the alert has been returned to manual analysis. They show up prominently on the first tab of the incident layout, and by default only contain the message “no autotriage match found”, indicating that the framework does run on each and every alert that reaches our XSOAR analysis channel.
Figure 4 - Example of AutoTriage war room note
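Assuming the CommandResults helper from CommonServerPython and its mark_as_note option, posting such a note can be as small as this sketch:

```python
from CommonServerPython import *  # noqa - CommandResults / return_results live here


def send_war_room_note(message: str) -> None:
    """Post a message to the incident's War Room as a note entry (sketch)."""
    return_results(CommandResults(readable_output=message, mark_as_note=True))


# Default note when no case matched, so analysts can see AutoTriage did run:
send_war_room_note("no autotriage match found")
```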
In order to increase the robustness of our code, we added unit tests. For the matcher, this is easy: you only need a dictionary containing the relevant incident data to test matching and non-matching cases. This is also documented in our manual on creating new implementation modules and we plan to use code generation to further increase efficiency. The handler, however, is a different story. It depends a lot on running commands in XSOAR and parsing the results. For this, we wrote a Mocker class that uses hashes of the commands and their parameters to read local files containing their results. Because these files are committed to version control, we need to ensure that any personally identifiable and/or customer-specific information is redacted, which is done with the help of a script included with the codebase.
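In outline, the Mocker looks something like this; the hashing scheme and fixture layout shown here are simplified stand-ins for our actual implementation:

```python
import hashlib
import json
from pathlib import Path


class Mocker:
    """Replays recorded command results in unit tests instead of calling XSOAR.

    The fixture file name is derived from a hash of the command and its
    arguments, so each unique call maps to exactly one recorded result.
    """

    def __init__(self, fixture_dir: str = "tests/fixtures"):
        self.fixture_dir = Path(fixture_dir)

    def _key(self, command: str, args: dict) -> str:
        raw = json.dumps({"command": command, "args": args}, sort_keys=True)
        return hashlib.sha256(raw.encode()).hexdigest()

    def execute_command(self, command: str, args: dict):
        """Drop-in replacement for the real command runner during tests."""
        fixture = self.fixture_dir / f"{self._key(command, args)}.json"
        return json.loads(fixture.read_text())
```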
One of our SIEM rules looks for Microsoft Defender being disabled through the Windows Registry, a form of Defense Evasion. We found that this rule generated a lot of alerts daily, and together with customer system administrators we quickly found these to be benign – caused by built-in processes to update and re-enroll Windows devices into the Defender portal. At first, we used our allowlist feature to handle this at the customer level, but then decided it was a good proof-of-concept case to test our then brand-new framework with – especially when the same behaviour started showing up for multiple customers, each doing it in a different way.
The AutoTriage framework PoC was at feature parity with the allowlist feature – it simply checked the fields for customer name and parent process and then closed the alert.
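In code, building on the skeleton above, such a matcher is only a handful of lines; the field names and values below are placeholders, not a real customer:

```python
class DefenderDisabledMatcher(Matcher):  # Matcher base class from the earlier sketch
    """Matches the known-benign 'Defender disabled via registry' alerts."""

    BENIGN_PARENTS = ("\\enrollment_script.cmd",)  # placeholder parent process

    def matches(self) -> bool:
        fields = self.incident.get("CustomFields", {})
        return (
            fields.get("customername") == "example-customer"  # placeholder
            and str(fields.get("parentprocess", "")).endswith(self.BENIGN_PARENTS)
        )
```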
Improving on this, we included another use case which needs some extra data from the SIEM. Analysts and existing scripts already regularly use the required XSOAR commands to query this, so converting it to code was quite trivial.
Here, we search for logs containing a specific onboarding script (recommended in Microsoft documentation) that ran within five minutes of the alert time. If we find any results, we close the alert. (Note: the command used here returns a list or a dictionary if anything is found. If not, it simply returns a message like “no results” as a single string.)
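A hedged sketch of that handler, building on the base classes above; the SIEM command name, its arguments, and the script name are illustrative, and note the type check that deals with the “no results” string:

```python
from datetime import datetime, timedelta

import demistomock as demisto  # XSOAR script runtime


class DefenderDisabledHandler(Handler):  # Handler base class from the earlier sketch
    """Closes the alert if the documented onboarding script ran around the alert time."""

    ONBOARDING_SCRIPT = "WindowsDefenderATPOnboardingScript.cmd"  # illustrative script name

    def handle(self) -> bool:
        alert_time = datetime.fromisoformat(self.incident["occurred"].replace("Z", "+00:00"))
        entries = demisto.executeCommand(
            "siem-search",  # illustrative command name, not the real integration command
            {
                "query": f'process_command_line contains "{self.ONBOARDING_SCRIPT}"',
                "start_time": (alert_time - timedelta(minutes=5)).isoformat(),
                "end_time": (alert_time + timedelta(minutes=5)).isoformat(),
            },
        )
        results = entries[0].get("Contents") if entries else None
        # The command returns a list or dict when there are hits,
        # or just the string "no results" when there are none.
        return isinstance(results, (list, dict)) and bool(results)
```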
To confirm what caused an alert, you usually need a lot of information. If you have a well-defined list of conditions, you might be able to automate checking for them. This was the case for a common outcome we saw in alerts about prevented malware, where no further action is required. Our way of working with these alerts is to verify that the malware did not manage to execute any actions, using the following conditions:
If all of these conditions are met, we don’t need to do anything and thus regard the alert as Informational.
This is where AutoTriage guarantees repeatability and avoids errors, by having queries defined in the code to fetch the information needed to check the conditions. We decided to make these more reusable (and keep the implementation module clean and readable) by putting them in an intermediate handler class for Microsoft Defender. Then it’s just a simple matter of checking each condition by running the corresponding query and exiting the handler as soon as one doesn’t match, which returns the alert to manual analysis.
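Put together, the case handler reduces to a loop over condition checks, each wrapping one query on the Defender intermediate handler. In this sketch the condition names and queries are placeholders for the ones in our runbook:

```python
class DefenderPreventedHandler(DefenderHandler):  # intermediate handler from the earlier sketch
    """Closes 'prevented malware' alerts when every condition confirms nothing executed."""

    def handle(self) -> bool:
        # Condition names and queries are placeholders; each check wraps one
        # reusable query defined on the Defender intermediate handler.
        conditions = (
            ("threat was fully remediated", self.threat_remediated),
            ("no follow-up process activity", self.no_followup_activity),
        )
        for description, check in conditions:
            if not check():
                self.note(f"Condition not met: {description}, returning alert to the analyst")
                return False  # exit as soon as one condition fails
        self.note("All conditions met, closing as Informational")
        return True

    def threat_remediated(self) -> bool:
        return bool(self.query_defender("<remediation status query>"))  # placeholder

    def no_followup_activity(self) -> bool:
        return not self.query_defender("<child process activity query>")  # placeholder
```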
The benefits of our Automatic Triage framework can be nicely explained with the metaphor of a production-line robot arm replacing a human worker: even though it takes quite specific know-how and an initial investment to set up, in the long run it makes the process faster, safer, and of higher quality.
Because our “robot arm” is just code running in the digital realm, the setup cost is only development time, and we can program as many cases as we want: we are not limited by physical space (maybe by computing power, but that’s a different story). Coding is also easier to learn than robotics and more fun than factory floor safety protocols. In our multi-disciplinary SOC, people can now use their freed-up time for much more interesting and/or important matters.
There is a proverb among coders that compares a piece of software to a cursed hammer: when wielded, everything looks like a nail, especially your thumb. (Basically, the software development flavor of Maslow’s Hammer.) There is a chance that AutoTriage will get used for things that it is not meant for, and for that reason we have documented a small design process to combat this (yielding flowcharts like the one above for example). Furthermore, when we organized a hackathon to attempt to write AutoTriage cases for our most common alerts, we found out that many incidents actually need human judgement.
There is also the problem of the black box – when the author of a case does not clearly log what is being done, the SOC might not be able to adequately explain to customers the reasoning behind any actions taken. We try to mitigate this with our coding style, by making logging easily accessible, and by using as many reusable components as possible. This is enforced through code review, which also provides opportunities for general comments and collaboration.
Software is never finished, so of course we are still working on additional functionality for the framework. Firstly, a way to send emails to customers. This function will be used to notify the relevant people as fast as possible when certain alerts come in, for instance when new accounts are added to administrator groups. It will make our response times faster, our communication more consistent, and our work less boring (no more copy-pasting email templates). Currently we have some proofs of concept in place and are testing and improving various components, like our templating system.
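For the templating part, something as simple as Python’s built-in string templates may be enough; a hypothetical sketch, in which the template text and field names are made up for illustration:

```python
from string import Template

# Hypothetical notification template; real templates are per-customer and per-alert-type.
NEW_ADMIN_TEMPLATE = Template(
    "Dear $customer,\n\n"
    "At $time we detected that account '$account' was added to the "
    "'$group' group on $hostname. Please confirm whether this change was expected.\n"
)


def render_notification(fields: dict) -> str:
    """Fill the template with incident fields; raises KeyError if a field is missing."""
    return NEW_ADMIN_TEMPLATE.substitute(fields)
```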
Secondly, we want AutoTriage to execute response actions. This can vary from blocking IPs or domains on firewalls, to isolating machines and disabling user accounts. Because this has a lot of (potential) impact we need to be very careful that our design includes important features, for instance making the actions easily reversible, sending a notification to the on-call analyst about what has been done, and ensuring we are allowed to act according to the response mandate agreed upon with the customer.
At the moment we are working on simplifying our tools for running manual response actions, while keeping in mind that they might be included in the AutoTriage framework later. This is needed because we support multiple different technologies that can execute the same action but expose it differently in their APIs. By adding an abstraction layer, our analysts can respond more quickly because they don’t need to know which customer uses which technology – which is also useful in AutoTriage.
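Conceptually, that abstraction layer is a dispatch table from an action and a technology to the underlying XSOAR command; a hedged sketch, in which the command names and argument shapes are illustrative rather than the real integration commands:

```python
import demistomock as demisto  # XSOAR script runtime

# Illustrative mapping; real integration command names and argument shapes differ per vendor.
ISOLATE_COMMANDS = {
    "defender": ("example-defender-isolate-machine", lambda host: {"machine_id": host}),
    "carbonblack": ("example-cb-quarantine-device", lambda host: {"sensor_id": host}),
}


def isolate_host(technology: str, host_id: str):
    """Isolate a host without the analyst needing to know which EDR the customer runs."""
    command, build_args = ISOLATE_COMMANDS[technology]
    return demisto.executeCommand(command, build_args(host_id))
```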
Ever since the first proof of concept, AutoTriage has proven its value and shown its potential. Now the task at hand is realizing that potential by continuing to build on it, expanding our automated toolset so that we can scale like we want to. There is much efficiency to be gained, but probably also lessons to be learned – and we’re looking forward to it.