When I first came on at Secnap as a SOC analyst, the security operation was built around NDR (network detection and response). No endpoint detection at all. The SOC team had deep experience with network-based detection, but there was a complete blind spot on the endpoint side.
Today, we run a full MDR (managed detection and response) operation: endpoint detection and response (EDR), identity threat detection and response (ITDR), SIEM, custom detection coverage, threat hunting. All of it had to be built. This is what that process actually looks like when you're doing it at a company that's lean on people and budget, and when you're figuring out a lot of it as you go. It's also a story about how you develop as a hunter, because it turns out that building the thing that generates the telemetry teaches you to read it differently than if someone just handed it to you.
There was no budget for a commercial EDR, so I started digging into ETW. Event Tracing for Windows is a kernel-level instrumentation framework built into Windows that exposes the telemetry most security tools quietly sit on top of. Process creation, file system activity, registry modifications, network connections, image loads, thread injection. It's all there if you know how to access it.
That research led me to Fibratus, an open-source tool that hooks directly into ETW providers and exposes that kernel telemetry for detection. Fibratus lets you write detection rules in YAML that evaluate against real-time ETW events on the endpoint. Process behaviors, file I/O patterns, suspicious registry modifications, network callouts. You're writing detections against the same kernel-level data that commercial EDRs use, just without the managed infrastructure around it.
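To make that concrete, here's roughly the shape of a Fibratus rule. The exact schema has shifted across Fibratus versions, so treat this as an illustrative sketch rather than copy-paste syntax; the field names and the process-spawn condition follow the general shape of Fibratus's published rules.

```yaml
# Illustrative sketch of a Fibratus-style rule, not exact current syntax.
# Flags Office applications spawning a shell, a classic macro-payload pattern.
name: Office application spawning a shell
description: |
  Word or Excel launching cmd.exe or powershell.exe rarely has a
  benign explanation and usually warrants a look.
condition: >
  spawn_process
    and ps.parent.name iin ('winword.exe', 'excel.exe')
    and ps.name iin ('cmd.exe', 'powershell.exe')
```

The condition evaluates against the live ETW process-creation stream on the endpoint, which is exactly the same data a commercial EDR sensor is reading.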
The key constraint with Fibratus is that everything runs locally. The YAML rule files live on the endpoint. The detection logic executes on the endpoint. There's no cloud console, no centralized rule management, no push-to-all-sensors button. Every rule update means touching every machine. That's a real limitation. But we had nothing, so it was what we were going to use.
The pipeline I put together looked like this: Fibratus detection rules evaluated ETW events in real time on the endpoint. When a rule matched, it wrote an event to the Windows Event Log. From there, Wazuh ingested those Windows Event Log entries so the SOC team could see endpoint detections alongside their existing network alerts in a single pane. When we needed to respond, we used Velociraptor for remote investigation and containment.
It wasn't a product you could buy off a shelf. It was a pipeline assembled from open-source tools to solve a specific problem: we had zero endpoint detection, and we needed some. The SOC team, who had only ever worked with network telemetry, could now see process-level activity for the first time. That shift alone changed how we operated.
To make all of this work at any kind of scale, I wrote a custom agent in C# that handled deployment and management. It installed Fibratus, Wazuh, and Velociraptor on endpoints, made Fibratus tamper-protected, pulled rule updates from a centralized server, and could update itself. It solved the "how do I manage all of this across machines" problem well enough to keep things running.
But even with that, there was no central telemetry to hunt across. Every environment was still its own island. The agent helped with management, but the architecture was still fundamentally local. It got us endpoint visibility we didn't have before, and it had a ceiling we were going to hit.
I left Secnap for Red Canary and spent five months doing threat hunting. Completely different world. Multiple EDR platforms under one roof: SentinelOne, CrowdStrike, Microsoft Defender for Endpoint, Carbon Black, Palo Alto Cortex. Mature hunting processes, structured hypothesis development, peer review, and a detection engineering team that took hunt findings and converted them into production rules.
Hunting across multiple EDR platforms teaches you something specific. Each platform surfaces telemetry differently. The same technique looks different in CrowdStrike than it does in SentinelOne. The same suspicious behavior might generate a high-fidelity alert in one platform and barely register in another. Working across all of them forces you to stop relying on how a specific vendor presents data and start paying attention to the behavior itself.
That experience changed how I think about hunting. Before Red Canary, most of what I knew about endpoint detection came from writing Fibratus rules, which is basically writing detections against raw ETW data. I understood the telemetry from the ground up, but I'd only seen it in one context. Hunting across five different EDR platforms in customer environments I'd never seen before was a different exercise entirely. You can't rely on familiarity with the environment when you're seeing it for the first time. You have to rely on understanding what attack behavior looks like regardless of context, and that's a skill that only develops through doing it over and over.
Red Canary also showed me what the feedback loop between hunting and detection engineering looks like when it's working. A hunter finds something. The finding gets documented. Detection engineering evaluates whether it can be automated. If it can, it becomes a production rule. If it can't, it becomes a recurring hunt. That cycle is how a detection program actually matures. It's not about writing more rules. It's about building a system where coverage grows from real findings.
Zscaler acquired Red Canary, and I ended up leaving. But I left knowing what a mature operation actually looks like, and I had ideas about what I'd do differently if I got the chance to build one again.
When I came back to Secnap, the Fibratus pipeline was still running. It had done its job, but I knew what the ceiling looked like now. The core limitation hadn't changed: everything was local. Local rules, local execution, no centralized telemetry, no way to hunt across customer environments from a single console. And beyond endpoint, we still had gaps in identity and log aggregation that an MDR operation needs to cover.
I found LimaCharlie. It's not a traditional EDR where the vendor gives you a product with built-in detections and you just deploy it. LimaCharlie gives you the infrastructure: an EDR sensor with kernel-level telemetry, ITDR integrations for M365, Entra ID, Google Workspace, log ingestion for SIEM-like aggregation, and a cloud-native D&R rule engine. But it ships with nothing turned on. No detections. No response playbooks. No out-of-the-box alerting. You get the platform and the telemetry. What you do with it is entirely on you.
That's the tradeoff. A traditional EDR gives you vendor-managed detections but limits how much you can customize. LimaCharlie gives you full control over the detection logic, the response actions, the entire pipeline, but you have to build and manage all of it yourself. Every D&R rule, every integration, every deployment decision. For a lean team, that's a lot of work. But it also means your detection coverage is shaped by your environments and your threat landscape, not by whatever the vendor decided to ship.
Night and day difference from where we were. With Fibratus, every rule lived as a YAML file on the endpoint. Updating coverage meant touching every machine. With LimaCharlie, you write a D&R rule, deploy it, and it's active across every sensor immediately. That alone was worth the switch.
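For comparison with the Fibratus YAML above, here's roughly what a LimaCharlie D&R rule looks like. The detect/respond structure follows LimaCharlie's documented rule format, but treat the specific event paths and values as an illustrative sketch rather than a production rule.

```yaml
# Illustrative D&R rule: flag encoded PowerShell command lines.
# Deployed once in the cloud console, it's active on every sensor immediately.
detect:
  event: NEW_PROCESS
  op: and
  rules:
    - op: ends with
      path: event/FILE_PATH
      value: powershell.exe
      case sensitive: false
    - op: contains
      path: event/COMMAND_LINE
      value: "-enc"
      case sensitive: false
respond:
  - action: report
    name: encoded_powershell_commandline
```

The `respond` section is where the difference from the local model really shows: the same rule that detects can also isolate a sensor or tag it for follow-up, without touching the machine by hand.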
The thing about standing up an MDR operation is how many surfaces you have to cover at the same time. It's not just endpoint detection. A real MDR program has to see across the full attack surface, and each layer comes with its own telemetry, its own detection logic, and its own tuning challenges.
EDR was the first and most urgent gap. Endpoint telemetry is where you see process execution, persistence mechanisms, lateral movement, credential access. It's where most hands-on-keyboard activity becomes visible. Getting this right was the foundation everything else sits on.
ITDR was the next priority. Identity-based attacks are increasingly where initial access and privilege escalation happen, and that means you need visibility into M365, Entra ID, Google Workspace, Active Directory. Detecting things like impossible travel, MFA fatigue, anomalous authentication patterns, suspicious OAuth consent grants, mailbox rule manipulation, inbox forwarding to external addresses. You can't find any of that in process trees. You need authentication logs, directory change events, audit logs from SaaS platforms, and the ability to baseline what normal access looks like for individual users so you can spot when something deviates. A lot of these threats never touch an endpoint. If your MDR operation only sees endpoints, you're blind to an entire category of compromise.
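Impossible travel is a good example of how simple the core logic can be once you have the authentication telemetry in one place. Here's a minimal pandas sketch; the column names (`user`, `ts`, `lat`, `lon`) and the 900 km/h threshold are my assumptions, not any vendor's implementation.

```python
import pandas as pd
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def impossible_travel(logins: pd.DataFrame, max_kmh: float = 900) -> pd.DataFrame:
    """Flag consecutive logins per user whose implied travel speed exceeds
    max_kmh. Assumed columns: user, ts (datetime), lat, lon."""
    df = logins.sort_values(["user", "ts"]).copy()
    prev = df.groupby("user").shift(1)  # previous login per user
    dist = [
        haversine_km(pa, po, la, lo) if pd.notna(pa) else 0.0
        for pa, po, la, lo in zip(prev["lat"], prev["lon"], df["lat"], df["lon"])
    ]
    hours = (df["ts"] - prev["ts"]).dt.total_seconds() / 3600
    # Floor the interval at 15 minutes so back-to-back logins don't divide by ~zero.
    df["speed_kmh"] = pd.Series(dist, index=df.index) / hours.clip(lower=0.25)
    return df[df["speed_kmh"] > max_kmh]
```

The hard part isn't this math; it's the baselining around it, like knowing which users legitimately bounce between a VPN egress point and a home ISP.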
SIEM and log aggregation tied it all together. Having EDR and ITDR telemetry is only useful if you can correlate across them. An attacker who compromises credentials through phishing, uses those credentials to access a VPN, and then moves laterally across endpoints is going to leave traces in both layers. Being able to see that chain of activity in one place, rather than investigating each layer in isolation, is what turns a collection of security tools into an actual detection operation.
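That kind of cross-layer chain is ultimately a time-windowed join. A rough pandas sketch of the idea, attaching the most recent login to each endpoint process event for the same user; the column names are assumptions for illustration:

```python
import pandas as pd

def login_to_process_chain(logins: pd.DataFrame, procs: pd.DataFrame,
                           window: str = "30min") -> pd.DataFrame:
    """Attach the most recent login (within `window`) to each process event
    for the same user, so identity and endpoint activity read as one chain.
    Assumed columns -- logins: user, ts, src_ip; procs: user, ts, process."""
    logins = logins.sort_values("ts")
    procs = procs.sort_values("ts")
    merged = pd.merge_asof(
        procs, logins, on="ts", by="user",
        direction="backward", tolerance=pd.Timedelta(window),
    )
    # Process events with no login inside the window get NaN; drop them.
    return merged.dropna(subset=["src_ip"])
```

In practice this runs against normalized SIEM data rather than two tidy DataFrames, but the shape of the correlation is the same.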
None of this is plug-and-play. Each data source needs its own onboarding, its own detection rules, its own tuning. Identity events need baselining before you can detect anomalies meaningfully. The SIEM pipeline needs to be structured so the SOC can actually work from it without drowning in volume. Every layer is its own project.
One of the first things I did on the new platform was import Sigma rules directly into the D&R engine using detectionstream.com. Around 1,300 rules. The import itself is straightforward. What follows is not.
Sigma rules are written to be portable and broad. That's the whole point of Sigma, but it's also the problem. A significant percentage of them will be noisy in any specific environment. Some fire constantly on legitimate admin activity. Some are too broad to be actionable without additional context. Some are written for threat behaviors that aren't relevant to the environments you're protecting. Figuring out which rules actually produce signal in your specific customer base takes months of tuning, and it never fully stops. Environments change. Customer behavior shifts. New software gets deployed. The rules have to evolve with it.
Tuning is where the real detection engineering happens, and it's the part that rarely gets talked about. The conference talk version of detection engineering is writing clever rules for novel techniques. The actual day-to-day version is adjusting thresholds, adding exclusions for known-good activity, reclassifying severity levels based on what you're seeing in production, and figuring out why a rule that was quiet for three months suddenly started firing across half your customer base. The import gives you breadth. Tuning gives you precision. Without precision, the SOC stops trusting the alerts, and at that point your coverage is effectively zero regardless of how many rules you have deployed.
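A lot of that triage starts with something as unglamorous as a fire-count report. Here's a small pandas sketch of the kind of summary I mean; the column names (`rule`, `host`, `ts`) are assumptions:

```python
import pandas as pd

def tuning_report(alerts: pd.DataFrame, top_n: int = 10) -> pd.DataFrame:
    """Rank rules by alert volume and concentration. A rule firing thousands
    of times on a handful of hosts is usually a tuning candidate, not a
    threat. Assumed columns: rule, host, ts."""
    report = (alerts.groupby("rule")
              .agg(fires=("ts", "count"), hosts=("host", "nunique"))
              .assign(fires_per_host=lambda d: d["fires"] / d["hosts"])
              .sort_values("fires", ascending=False))
    return report.head(top_n)
```

The top of that list is where the exclusions and threshold changes go; the bottom is worth checking too, because a rule that never fires may be pointed at telemetry you aren't actually collecting.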
Beyond the Sigma imports, I've written 60+ custom detection rules. I've also kept Fibratus rules in play for specific use cases where its direct ETW access catches kernel-level behaviors that other tooling might not surface the same way. Some of those custom rules came from hunts. Some came from incidents. Some came from sitting in telemetry long enough to notice a pattern that didn't have coverage yet. That's what detection engineering looks like at a small company: it's not a dedicated role with a dedicated team. It's something that happens alongside hunting, incident response, SOC operations, and everything else.
The SOC team at Secnap has solid analysts who know network detection well. But NDR and endpoint detection are genuinely different disciplines. The telemetry is different. The investigation workflow is different. The mental models are different.
An analyst who can read network traffic and identify C2 beaconing doesn't automatically know what a suspicious parent-child process relationship looks like, or when a PowerShell execution is worth investigating versus normal admin activity. An analyst who's used to reading packet captures needs to build entirely new instincts for reading process trees, understanding LOLBin (living-off-the-land binary) context, and evaluating whether a service installation is persistence or just software doing its job.
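Part of that parent-child instinct can be bootstrapped as a lookup while the real pattern recognition develops. A minimal sketch; the pair list and verdict labels are illustrative, not a complete or authoritative set:

```python
# Illustrative lookup of parent->child process pairs that rarely have a
# benign explanation, plus a coarse triage verdict for everything else.
SUSPICIOUS_PAIRS = {
    ("winword.exe", "cmd.exe"),
    ("winword.exe", "powershell.exe"),
    ("excel.exe", "powershell.exe"),
    ("outlook.exe", "wscript.exe"),
    ("w3wp.exe", "cmd.exe"),  # webshell-style: IIS worker spawning a shell
}

def triage_process_event(parent: str, child: str) -> str:
    """Return a rough triage verdict for a parent->child process creation."""
    pair = (parent.lower(), child.lower())
    if pair in SUSPICIOUS_PAIRS:
        return "investigate"
    if pair[1] in {"powershell.exe", "wscript.exe", "mshta.exe"}:
        return "review"  # LOLBin child: command line and context decide
    return "baseline"
```

A table like this encodes the easy calls. The hard calls, the "review" bucket, are exactly where the analyst's time in the telemetry matters.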
The thing I've noticed watching this transition is that the analytical skills transfer but the pattern recognition doesn't. The NDR analysts know how to investigate. They know how to follow a thread, document findings, communicate results. But knowing what's worth investigating in the first place is tied to the specific telemetry you've spent time in. An analyst who can immediately spot anomalous DNS traffic needs time to build that same instinct for process trees or identity logs. There's no shortcut for it. You have to see enough normal to recognize when something isn't.
A big part of this process has been working alongside the team as everyone builds that familiarity. Not just endpoint, but identity events and SaaS audit logs too. What does a suspicious OAuth consent grant look like? When is a mailbox forwarding rule worth investigating? How do you pivot from an endpoint alert into a broader investigation that spans identity telemetry? We've documented procedures for all of this, but knowing when to deviate from the procedure is the part that takes real time in the data.
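The mailbox-forwarding question at least has a mechanical starting point. A pandas sketch of flagging forwarding to external domains from audit-log-style records; the `New-InboxRule` and `Set-InboxRule` operation names come from the M365 unified audit log, while the column names and the internal-domain set are assumptions:

```python
import pandas as pd

INTERNAL_DOMAINS = {"example.com"}  # assumption: the customer's own mail domains

def external_forwarding_rules(audit: pd.DataFrame) -> pd.DataFrame:
    """Flag inbox-rule events that forward mail outside the org.
    Assumed columns: user, operation, forward_to (may be missing)."""
    rules = audit[audit["operation"].isin({"New-InboxRule", "Set-InboxRule"})].copy()
    rules = rules.dropna(subset=["forward_to"])
    domain = rules["forward_to"].str.split("@").str[-1].str.lower()
    return rules[~domain.isin(INTERNAL_DOMAINS)]
```

The mechanical part finds the candidates. Deciding whether a given external forward is an accountant's legitimate workflow or an attacker staging exfiltration is the judgment call the procedure can't make for you.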
None of that comes quickly. The team is getting there. I'm still building my own instincts too. I don't have a decade of experience across all these domains. We're all getting more familiar with what normal looks like in these environments, and that's just where we are.
There was no formal hunting program at Secnap before this. No hypothesis framework, no recurring hunts, no structured output process. Now there is: hypothesis-driven hunts across customer environments using centralized telemetry, Jupyter Notebooks for analysis, and Pandas for data manipulation. When a hunt produces a finding, it feeds back into the detection program. Either the finding becomes a D&R rule, or it becomes a recurring hunt that runs on a regular cadence.
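A typical cell in one of those notebooks looks something like this least-frequency stacking sketch: count a behavior across every environment and surface the rare combinations for review. Column names are assumptions for illustration:

```python
import pandas as pd

def rare_pairs(procs: pd.DataFrame, max_count: int = 3) -> pd.DataFrame:
    """Least-frequency stacking: count parent->child process pairs across
    all environments and surface the rare ones for a hunter to review.
    Assumed columns: env, parent, child."""
    counts = (procs.groupby(["parent", "child"])
              .agg(total=("env", "count"), envs=("env", "nunique"))
              .reset_index())
    return counts[counts["total"] <= max_count].sort_values("total")
```

Most of what this surfaces is benign oddity, which is fine; the point of the hunt is to look at the tail the detection rules never would have shown you.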
That feedback loop is the part that matters most, and it's the thing I took away most clearly from Red Canary. A hunting program that doesn't feed back into detection coverage is just an expensive way to do incident response after the fact. The goal is for every hunt to either confirm that existing coverage is adequate or reveal a gap that needs to be closed. Over time, the detection surface grows because it's being shaped by what hunters are actually finding in real environments, not just by what a Sigma ruleset says you should care about.
Something I've been thinking about a lot lately is the difference between what you can document about hunting and what you can't. We have hypothesis templates, hunt packages, documented queries. Those are useful and I'm glad we built them. But the hunts that have actually found things rarely started from a documented package. They started from noticing something in the telemetry that looked off and deciding to dig into it. A scheduled task that didn't match the customer's change window. An authentication pattern that was technically valid but didn't make sense for that user. Stuff that wouldn't trigger a rule but didn't sit right.
I don't think you can train that through documentation alone. It comes from spending time in specific environments and learning what their version of normal looks like. The Fibratus days helped with that more than I expected. Writing YAML detections against raw ETW events forced me to understand the telemetry at a level that just using a commercial EDR probably wouldn't have. When you've built the detection pipeline yourself, you understand the data differently. You know what the sensor is actually capturing, what it's missing, and where the gaps are. That background makes hunting more productive because you're not just running queries, you know what the queries are actually searching through.
This is also why the hunt-to-detection feedback loop matters so much. When a hunt finds something and it becomes a detection rule, that specific thing is now covered automatically. The hunter can move on to the next gap. The detection program handles what's known. The hunting program goes after what isn't. When those two sides feed each other, the operation improves in a way that neither one can do alone.
Having telemetry that spans endpoint and identity makes the hunting richer. An identity-based hunt that starts with anomalous authentication patterns can pivot into endpoint telemetry to see what happened after access was gained. A hunt that surfaces suspicious mailbox rule creation in M365 can be correlated with login events to determine whether the account was compromised. The more surfaces you can see across, the more complete your hunts become, and the better your detection rules get as a result.
The challenge at a lean company is that hunting doesn't exist in isolation. When a hunt surfaces a real threat, you're also the one handling the incident, communicating with the customer, and coordinating containment. At a larger company, those are separate roles with separate people. Here, the boundaries blur. That makes the work harder to scope, but you end up very close to the actual impact on the customer. There's no distance between finding the threat and understanding what it means for them. And that proximity, being the one who finds the threat and then lives with the consequences of it, is its own kind of feedback loop. It shapes what you look for next.
A few things that have gotten clearer over the course of building this:
Tuning is the real detection engineering. Importing rules feels like progress. Tuning them into something the SOC can actually work from is where the coverage becomes real. The industry talks a lot about writing detections. It doesn't talk enough about the months of iterative refinement that determine whether those detections produce signal or noise.
Multi-surface visibility changes the quality of everything. When you can only see endpoints, your hunts are limited and your detection gaps are invisible. Adding identity telemetry from M365, Entra ID, and Google Workspace doesn't just give you more alerts. It lets you see attack chains end-to-end, and that changes how you write detections, how you hunt, and how you investigate. Each surface you add makes every other surface more useful.
The gap between local and cloud-based detection is larger than it appears. Fibratus on individual endpoints gave us something when we had nothing. But going from that to cloud-native D&R rules changed what we could actually do. If you've only ever worked with cloud-managed detections, you probably take that for granted. Do it the local way first and you'll understand real fast what that infrastructure saves you.
The hunt-to-detection feedback loop is the whole point. A hunting program that exists separately from detection engineering is incomplete. Every hunt should be asking: is this something we should be detecting automatically? If yes, write the rule. If no, schedule the recurring hunt. That cycle is how coverage actually matures over time, and it's the mechanism that separates an MDR operation that gets better from one that stays static.
The best hunts don't start from playbooks. They start from noticing something that doesn't fit. Hunt packages and documented procedures are valuable for structure, but the finds that matter tend to come from time spent in the data and knowing what normal looks like well enough to spot when something is wrong. That's hard to teach. It comes from repetition.
Working alongside a team that's learning new domains changes how you think about what you know. Explaining why a particular process tree is suspicious, or why an OAuth consent grant is worth investigating, forces you to put words to instincts you'd otherwise just act on. When everyone's building something new together, teaching and learning aren't separate things.
The operation isn't done. I'm still building detection coverage across all surfaces, still converting hunts into recurring processes, still developing tools for SOC triage and identity baselining. The team is still growing into the full breadth of telemetry we now have access to. There's plenty of work ahead.
But compared to where this started, with NDR and barely a SIEM, the distance covered is real. We have EDR with custom detection coverage. ITDR across M365, Entra ID, and Google Workspace. A centralized telemetry pipeline the SOC works from every day. A hunting program that feeds findings back into detections. It's not finished. But the systems are in place for it to keep getting better, and it does.
If you're at a small company trying to build something similar, the thing I keep coming back to is this: start with whatever you have, even if it's open-source tools and a pipeline you put together yourself. The Fibratus stack wasn't elegant, but it solved the problem we had at the time and proved the value of endpoint visibility. When the opportunity came to do it with better tooling, we were ready because we already understood the problems we were solving. You don't need the perfect stack. You just need to start, and keep building from what you learn.