
2024.07.23

Tips for SOCLess Oncall

Handling alerts when there's no alert handlers

Tell me if you’ve heard this one before.

  1. The last thing we want is a bunch of lame alerts creating busy work for a large standing SOC.

  2. Instead, we’ll take an adaptive, agile, and highly automated approach to threat management (Autonomic Security Operations / SOCLess)

  3. We’ll use detection-as-code, and an on-call alert rotation instead of a triage bench

  4. On-call alert rotations make it possible to continue hiring a few skilled engineers instead of many analysts

  5. We’re drowning in alerts

    Gru-describes-devious-plot-with-easel meme. Panel one: 'We'll get rid of the large standing SOC'. Panel two: 'Our SOCLess, high signal detection-as-code will let us run an on-call alert rotation of high skilled engineers'. Panel three: 'We're drowning in alerts'. Panel four: Gru looks again at the easel in confusion.

What is a SOCLess Oncall / Autonomic Security Operations?

🟰 Because I want to, I’m going to treat these two terms interchangeably, and lean on SOCLess as the pithier one.

SOCLess was coined by Alex Maestretti, based on Netflix’s program. The core tenets were:

  1. Avoid a large standing SOC, by
  2. Focusing on mature rules with defined response plans, where
  3. Alert triage is decentralized to system experts, and
  4. You own the alerts you write

Autonomic Security Operations is mostly an Anton Chuvakin-ism. It builds on years of his writing (at Gartner, and later at Google). It includes:

  • 10X analyst productivity and effectiveness through upskilling and automation
  • 10X process through “automation, consistent and predictable processes running at machine speed”
  • 10X technology through comprehensive visibility, interoperability, performance, and improved quality of detection signals

These two approaches try to move the time (and staffing) a SOC traditionally spends on repetitive alert triage towards engineers who wire controls into a detection and response platform, and who do work that compounds (versus toil).

SOCLess Oncall Friction

🤐 Don’t mind the man behind the curtain … my biases (naivete? youth?): I’ve actually never worked at a company with a traditional SOC!

Instead, the security teams I’ve worked with all navigate this “SOCLess” model, amidst generally cloud-native environments at companies building SaaS.

I’ve noticed a pattern: SOCLess detection programs still have the same innate entropy towards lower signal. In a traditional SOC, this might result in ineffective staffing, analyst burnout, or reduced triage efficacy.

In a SOCLess program, the same drift also puts tension on the strategic decision to run SOCLess and stresses the cultural identity of teams that choose this approach.

Let’s circle back:

The last thing we want is a bunch of lame alerts creating busy work for a large standing SOC.

But we have a bunch of clever alerts, based on the latest and greatest threat research scooped from the pages of Detection Engineering Weekly, that will be our only signal someone has $DONE_THE_BAD_THING!

Instead, we’ll take an adaptive, agile, and highly automated approach to threat management (Autonomic SOC / SOCLess)

But our alerts are generally constructed in isolation, by a single engineer, as a side task during their real project

We use detection-as-code, and an on-call alert rotation

But that alert rotation really is a triage desk, and also a runner (in the Stripe mold), handling not just alerts but also inbound requests from humans. Also, they should have time to work on other engineering tasks during their oncall week, not just handle alerts

On-call alert rotations make it possible to continue hiring a few skilled engineers instead of many analysts.

But our skilled engineers are skilled at application security, or 1337 hAckS, or cryptography, not at operations or detection engineering


I think a root cause of this issue is that SOCLess is often adopted as a repudiation of the traditional SOC, more than as a holistic strategy. It starts with the conviction that great teams are not solving detection problems with analysts.

Engineering-oriented security teams, especially at cloud-native startups, want to take an engineering-led approach to detection and response. SOCLess? Well, if it worked for Netflix and Google…

This reflexive move skips the scaffolding and primitives that make SOCLess viable - causing the drift towards noise and tension in operations and oncall.

But it doesn’t have to be this way! I think there are a few core primitives that can buy you a lot of scale in your SOCLess program.

Alert Taxonomy

The cardinal sin of immature SOCLess might be a failure to set an Alert Taxonomy.

XKCD comic. Panel one, two people and a rabbit: 'Bun alert!' 'Oh, yeah! Cute!' 'Gotta document this. I'll notify everyone, send out a push alert.' '...To who?' Panel two: 'Everyone subscribed to the alert system.' 'Alert system?' 'Yeah! We built it over the last few years. It's pretty small. Still looking for investors.' 'But...Why are you alerting people about rabbits?' Panel three: 'I mean...Look at them. They're like loaves of bread that hop.' 'I see.' (Emphatically)'People need to know.' Panel four: 'They need to know: THERE ARE BUNS.' Panel five: 'Okay, uhh, I'm gonna go.' A third person arrives. 'I got the alert! Where is the bun? IS IT SMALL?!' 'Extremely.' 'Oh my god.'

Alert Severity is notoriously squirrely - check out Absolute measurement corrupts severity, absolutely. Despite the challenges, many teams set up Info/Low/Medium/High and go on their way. What they miss is that a SOCLess program should actually generate a few distinct classes of alerts.

🧠 This is heavily inspired by Datadog’s Monitoring 101: Alerting on what matters

  1. Records: alerts that are not timely, but might be useful for future reference or investigation. This may also include alerts that are part of a Correlated Alert but are not actionable on their own.
  2. Notifications: alerts that require further analysis or intervention, but where response can be delayed. The SLA might be on the order of hours, not minutes. Charity Majors calls these “second lane alerts,” which I like.
  3. Pages: alerts that require immediate analysis or intervention from the security on-call.

There is actually one more level of granularity that we’ll circle back to - the User Notification.
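
If your detections live in code, the taxonomy can live right next to them as metadata. Here's a minimal sketch of what that might look like; the AlertType and Detection names, the Okta-flavored example rule, and the runbook URL are all illustrative, not tied to any particular framework:

```python
# A minimal sketch of attaching an alert taxonomy to detections at definition time.
# AlertType, Detection, and the example rule are illustrative, not a real framework.
from dataclasses import dataclass
from enum import Enum


class AlertType(Enum):
    RECORD = "record"                        # not timely; kept for reference or correlation
    USER_NOTIFICATION = "user_notification"  # needs confirmation from the user involved
    NOTIFICATION = "notification"            # "second lane": respond within hours
    PAGE = "page"                            # immediate analysis or intervention


@dataclass
class Detection:
    name: str
    query: str
    alert_type: AlertType
    runbook_url: str | None = None


# Hypothetical example: an admin-role grant is worth a Notification, not a Page.
okta_admin_grant = Detection(
    name="okta_admin_role_granted",
    query='eventType:"user.account.privilege.grant"',
    alert_type=AlertType.NOTIFICATION,
    runbook_url="https://wiki.example.internal/runbooks/okta-admin-grant",
)
```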

Alert Handling

Setting a taxonomy is mostly powerful in how it implies consistent tactics for alert handling. These tactics cover the where, what, and when of alert response.

For example: Pages go to our Ticketing System and PagerDuty, where they are ACK’d within 5 minutes. All tickets are reviewed weekly in our retro.

The taxonomy also allows for more flexibility when combined with severity. You might want to retro and resolve High Risk Notifications, but auto-close Low Risk ones.
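
One way to keep those tactics consistent is to encode the handling policy as data instead of tribal knowledge. A rough sketch, using the example above; the destinations, SLAs, and the handling_for helper are my own illustration, not a prescription:

```python
# A rough sketch of handling policy as data, keyed on (alert type, severity).
# Destinations and SLAs are examples, not prescriptions.

HANDLING_POLICY = {
    ("page", "high"):         {"route": ["ticketing", "pagerduty"], "ack_sla_minutes": 5, "weekly_retro": True},
    ("notification", "high"): {"route": ["ticketing", "slack"], "resolve": True, "weekly_retro": True},
    ("notification", "low"):  {"route": ["slack"], "auto_close": True},
    ("record", "any"):        {"route": ["datalake"]},  # searchable later, never sent to a human
}


def handling_for(alert_type: str, severity: str) -> dict:
    """Look up handling, falling back to a severity-agnostic default for the type."""
    return (
        HANDLING_POLICY.get((alert_type, severity))
        or HANDLING_POLICY.get((alert_type, "any"))
        or {"route": ["ticketing"]}
    )
```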

Automation Maturity

This diversity in Alert Handling implies a third prerequisite to run SOCLess - Automation Maturity. Small teams often lack the dedicated detection engineering staffing you might find at Netflix scale. As a result, automation platforms often end up built in parallel to alert development, or a SaaS SOAR is adopted, with its own set of challenges.

Twilio’s SOCLess is an example of the standard architecture I’ve seen teams build, which throws together Lambdas with other cloud infrastructure. These platforms are easy to start but can be hard to scale. Mixing a flexible workflow platform with data engineering challenges creates a problem that is hard to address in the margins of your days.
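
To make that concrete, here's roughly the dispatch layer those platforms converge on: detections land on a queue, and a function fans them out by alert type. This is a sketch of the pattern, not Twilio's actual implementation; the topic ARNs and payload shape are hypothetical, and the missing retries, batching, and dead-lettering are exactly the edges that don't fit in spare moments:

```python
# A sketch of the common Lambda dispatcher pattern: SQS in, fan out by alert type.
# Topic ARNs and the alert payload shape are hypothetical.
import json

import boto3

sns = boto3.client("sns")

TOPIC_BY_TYPE = {
    "page": "arn:aws:sns:us-east-1:111111111111:security-pages",
    "notification": "arn:aws:sns:us-east-1:111111111111:security-notifications",
}


def handler(event, context):
    """SQS-triggered entrypoint; forwards each alert to the topic for its type."""
    for record in event.get("Records", []):
        alert = json.loads(record["body"])
        topic = TOPIC_BY_TYPE.get(alert.get("alert_type"))
        if topic:
            sns.publish(TopicArn=topic, Message=json.dumps(alert))
        # Records and unknown types fall through to storage-only handling.
```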

Tines’ SOC Automation Capability Matrix offers a good survey of how broad the capabilities can be in a fully automated setup.

I’d highlight User Interaction as the oft-missing stair in small SOCLess programs. The minimally lovable shape is:

  1. A single interface, often Slack
  2. Acknowledgement (this was me), or disavowal (uhhh, what is this?) - with timed escalation
  3. Strong MFA prompting

This capability unlocks the fourth alert type, the User Notification: alerts that require more information from a human to become a Record, Security Notification, or Page.
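
As a sketch, that minimally lovable flow can be as small as a Slack DM with two buttons and a timer. The message payload below follows Slack's Block Kit format, but the timeout, the answered callback, and the page_oncall hand-off are assumptions, not how any of the bots mentioned below actually work:

```python
# A sketch of the user-notification flow: ask the user in Slack, escalate if silent.
# Requires slack_sdk and a bot token with chat:write; page_oncall is a stand-in.
import time
from typing import Callable

from slack_sdk import WebClient

slack = WebClient(token="xoxb-...")  # your bot token


def page_oncall(alert_id: str) -> None:
    """Stand-in for handing the alert to your paging system (e.g. PagerDuty)."""


def ask_user(user_id: str, activity: str) -> None:
    """DM the user who generated the event with acknowledge / disavow buttons."""
    slack.chat_postMessage(
        channel=user_id,
        text=f"Was this you? {activity}",
        blocks=[
            {"type": "section",
             "text": {"type": "mrkdwn", "text": f"We saw: *{activity}*. Was this you?"}},
            {"type": "actions", "elements": [
                {"type": "button", "text": {"type": "plain_text", "text": "That was me"},
                 "action_id": "ack"},
                {"type": "button", "text": {"type": "plain_text", "text": "Not me"},
                 "style": "danger", "action_id": "disavow"},
            ]},
        ],
    )


def escalate_if_unanswered(alert_id: str, answered: Callable[[str], bool],
                           timeout_s: int = 900) -> None:
    """Promote to a Page if the user hasn't responded in time (polling for simplicity;
    a real system would use a scheduled check instead of blocking)."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        if answered(alert_id):
            return
        time.sleep(30)
    page_oncall(alert_id)
```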

This contributes to a crucial element of the SOCLess model — the ability to distribute alerts to system owners (aka alert decentralization).

your SOC team is basically an air traffic controller - they assess what’s going on and they direct the issue to where it needs to go - but they aren’t actually flying the plane.

If this is the case, why can’t we simply automate this air traffic controller process? For example, if a laptop’s endpoint protection fires an alert, why not direct it right to IT?

RIP SOC. Hello D-IR

Take inspiration for your User Notifications from Square:

Square alert dashboard with a tooltip for a sample alert: 'Trying to take over the account of another employee can often be a strong indicator of compromise. There should never be a legitimate case where this kind of access is necessary to perform your job -- if you believe that you do need to do this, please let the PLATSEC team know, and we will be happy to work with you to find a compromise.'

Dropbox:

Dropbox's securitybot that sends you an alert in Slack and asks you to confirm or deny it was you (and provide an explanation of what you were doing).

and Slack:

Slack's securitybot with a sample alert message in Slack: 'I see you just ran the command 'flurb -export' on 'accountingserver01'. This is a sensitive command, so please acknowledge this activity by typing 'acknowledge'.

💭 Fun fact: Crowdalert co-founder & CTO John Sonnenschein was on the team at Slack that worked on securitybot! They’ve been tackling this problem for almost a decade, and Crowdalert is the culmination of lots of lessons learned.

Karthik Rangarajan has some great notes on how OpenAI takes this a step further by integrating LLMs!

Playbook: Maturing your SOCless Oncall

So, we have a diagnosis and cure, but what’s the course of treatment?

☎️ Looking for a quick path to increased signal and reduced alert fatigue? Reach out for a demo of Crowdalert’s platform for Alert Verification, Prioritization, Dispatch, and Identity Visibility.

Feel free to skip steps based on your current maturity!

  1. Classify existing alerts: revise severity (but don’t overthink it!) and set metadata on alert type (Record, User Notification, Security Notification, Page)
  2. Create the requisite hierarchies in adjacent systems (Slack channels, ticket types, etc.) for alert routing
  3. Implement minimally lovable automation:
    1. Quality Deduplication: with grouping and snoozing (see the sketch after this list)
    2. User Notification
  4. Move from atomic alerts to correlated ones, where possible
  5. Standardize maturity requirements based on alert type: for example, all Pages have a low-to-no False Positive tolerance, and all Notifications and Pages must trigger automation before alerting and have a concrete runbook.
  6. Set up a process to monitor alert health and program metrics on an ongoing basis
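
For step 3's deduplication (the sketch promised above), the core is just a stable group key, a grouping window, and a snooze that the on-call controls. The key fields and the one-hour window are assumptions to tune against your own alert volume:

```python
# A rough sketch of grouping and snoozing, so repeat firings don't each cost attention.
# The group-key fields and the one-hour window are assumptions to tune per rule.
import hashlib
import time

SNOOZES: dict[str, float] = {}      # group_key -> snooze expiry (epoch seconds)
OPEN_GROUPS: dict[str, float] = {}  # group_key -> grouping window expiry


def group_key(alert: dict) -> str:
    raw = f"{alert['rule']}|{alert.get('user')}|{alert.get('host')}"
    return hashlib.sha256(raw.encode()).hexdigest()


def should_notify(alert: dict, window_s: int = 3600) -> bool:
    key = group_key(alert)
    now = time.time()
    if SNOOZES.get(key, 0) > now:
        return False  # explicitly snoozed by the on-call
    if OPEN_GROUPS.get(key, 0) > now:
        return False  # already notified within the grouping window
    OPEN_GROUPS[key] = now + window_s
    return True


def snooze(alert: dict, hours: int = 24) -> None:
    SNOOZES[group_key(alert)] = time.time() + hours * 3600
```

In practice you'd back this with durable storage and expose the snooze from the alert itself (a Slack button or ticket field), but even this much keeps a repeat-firing rule from charging the on-call per event.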


This post was written by Rami McCarthy for Crowdalert.

Last Updated 2024.07.23