Episode 40 — Measure OT Security With Purpose: Metrics, Measures, and What They Really Signal

Most people start thinking about measurement in operational technology because someone asks a simple question that turns out not to be simple at all: are we getting safer, or are we just getting busier? If you are brand new to cybersecurity, it can feel like metrics are just numbers you collect to prove you did work, like counting how many locks you bought instead of whether the doors are actually secure. In an OT environment, that confusion gets sharper because security shares the stage with safety, reliability, uptime, and the real-world consequences of interruptions. Measurement is still useful, but only when you understand what a number can and cannot tell you, and when you connect it to a decision someone needs to make. By the end of this lesson, you should have a practical sense for what to measure, how to interpret it, and how to avoid being fooled by your own dashboard.

Before we continue, a quick note: this audio course has two companion books. The first book is about the exam and explains in detail how best to pass it. The second book is a Kindle-only eBook that contains 1,000 flashcards you can use on your mobile device or Kindle. Check them both out at Cyber Author dot me, in the Bare Metal Study Guides Series.

A good way to begin is to separate the idea of a measurement from the idea of a metric, because people often blend the two. A measurement is a single observed value, like how many unpatched devices were discovered during a review this month, or how long it took to restore a controller after a failure. A metric is a defined way of using one or more measurements to answer a question, like the percentage of devices with critical patches applied within an agreed time window, or the median time to restore normal operation after a disruptive event. When you hear someone say we need metrics, what they really need is a clear question, because the same measurements can be turned into different metrics that tell different stories. In OT, you also want to be careful that the story you tell lines up with operational reality, because a security metric that rewards speed can accidentally encourage risky changes. The point is not to create impressive numbers, but to reduce uncertainty and guide choices that protect people, processes, and production.
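To make the distinction concrete, here is a minimal Python sketch. The sample values, the 60-day window, and the function names are all illustrative assumptions, not figures from any real program. The lists hold raw measurements; the two functions turn them into metrics that each answer a specific question.

```python
from statistics import median

# Hypothetical measurements: hours to restore normal operation after each disruptive event.
restore_hours = [2.0, 3.5, 1.5, 12.0, 4.0]

# Hypothetical measurements: days each critical patch took to apply on a device.
patch_days = [10, 45, 20, 95, 30, 60]

# Assumed agreed time window for critical patches (illustrative only).
AGREED_WINDOW_DAYS = 60


def median_time_to_restore(hours):
    """Metric: the typical restore time, less skewed by one bad event than a mean."""
    return median(hours)


def pct_within_window(days, window):
    """Metric: percentage of patches applied within the agreed window."""
    if not days:
        return 0.0
    within = sum(1 for d in days if d <= window)
    return 100.0 * within / len(days)


print("Median hours to restore:", median_time_to_restore(restore_hours))
print("Percent patched within window:", round(pct_within_window(patch_days, AGREED_WINDOW_DAYS), 1))
```

The same patch_days list could just as easily feed a different metric, such as the age of the slowest patch, which is why the question has to come first.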

Before choosing what to track, it helps to understand why OT measurement is different from office technology. In many business systems you can patch quickly, restart services easily, and measure security by how well you follow standard routines, but OT systems often have tight maintenance windows, vendor constraints, and safety requirements that limit what you can change and when. That means a security program can be effective even if it cannot meet the same patch timing targets you might expect on a laptop fleet. It also means measurement should reflect the constraints, not punish the environment for being what it is. Another difference is that in OT, availability is not just convenience, it can be a safety factor or an economic requirement, so your metrics should avoid pushing teams toward actions that increase downtime risk. When you design measures for OT, you are really designing feedback that influences behavior, so it needs to be purposeful and aligned with safe operations.

A practical way to make metrics meaningful is to connect each metric to a decision and an owner. If nobody uses the number to decide something, the metric becomes a reporting ritual that consumes time without creating value. For example, measuring the number of security incidents in OT might sound useful, but if the definition of incident changes or reporting improves, the number can increase even while security improves, which can lead to wrong conclusions. A better approach is to ask, what decision does leadership need to make, and what does the operations team need to prioritize this week. Leadership might need to decide where to invest, such as network segmentation or improved backup and recovery, while operations might need to decide which systems need attention first. Metrics should support those decisions by highlighting risk, progress, and constraints in a way that can be acted on. When you can point to a number and say, because of this we will do that, you are measuring with purpose.

It also helps to group metrics into a few categories so you do not drown in details. One category is outcome metrics, which reflect real-world results, such as unplanned downtime caused by cyber events, successful recovery without safety impact, or reductions in exposure for critical assets. Another category is performance metrics, which show how well processes are working, like how quickly vulnerabilities are triaged, how consistently backups are tested, or how often access reviews are completed on schedule. A third category is capability or maturity metrics, which show whether the building blocks exist, like whether asset inventory coverage has improved, whether network visibility is expanding, or whether incident response plans have been exercised. In OT, it is common to rely heavily on capability and performance metrics early, because true outcomes might be rare events, and waiting for an outcome can be like waiting for a fire to test your smoke alarms. The mix matters, because outcome metrics alone can be misleading, while capability metrics alone can create the illusion of safety without proof.

One of the easiest mistakes beginners make is confusing activity with effectiveness. Counting how many scans were run, how many alerts were reviewed, or how many tickets were closed tells you that people were busy, but it does not necessarily tell you whether risk went down. In OT, this is especially important because excessive scanning, noisy alerting, or constant changes can increase operational risk even as they increase security activity. A useful metric asks whether the activity produced a meaningful improvement, such as reducing the number of critical vulnerabilities on high-impact assets, or reducing the number of remote access paths that bypass strong authentication. You want metrics that reward the right work, not the most work. A helpful thought is to imagine a clever team trying to make the number look good without actually improving security, because that shows you where the metric can be gamed. If it can be gamed easily, it will eventually mislead you, even if nobody intends to cheat.

Another common trap is using averages in ways that hide the problem you care about. If one site patches quickly and another cannot patch at all, an average patch time can look acceptable even while a critical plant remains exposed. In OT, the most important assets and the most dangerous weaknesses tend to be unevenly distributed, so you often want metrics that highlight the worst cases, the oldest exposures, or the highest-impact gaps. Percentiles, medians, and counts by criticality can tell a truer story than a single average number. This is also why context matters, because a vulnerability on a test system is not the same as a vulnerability on a safety-critical controller, even if the technical score is the same. When you measure, you are choosing what to pay attention to, and OT demands that attention follow consequence, not convenience. A good metric makes the highest-risk reality hard to ignore.
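A small Python sketch can show how an average hides the worst case. The site names, device counts, and day values below are invented for illustration. One site closes exposures in days while the other has been open for months, yet the blended mean looks tolerable.

```python
from statistics import mean, median

# Hypothetical days-of-exposure per device, grouped by site (illustrative values only).
open_exposure_days = {
    "site_a": [5] * 20,     # twenty devices, each exposed for about 5 days
    "site_b": [200, 200],   # two devices at a critical plant, exposed for 200 days
}

# Flatten all sites into one list, which is what a single fleet-wide average does.
all_days = [d for days in open_exposure_days.values() for d in days]

overall_mean = mean(all_days)                                   # looks acceptable
per_site_median = {s: median(d) for s, d in open_exposure_days.items()}
worst_case = max(all_days)                                      # the number that matters

print("Fleet-wide mean days:", round(overall_mean, 1))
print("Median days by site:", per_site_median)
print("Oldest exposure in days:", worst_case)
```

Reporting the per-site medians and the worst case next to the mean keeps the exposed plant visible instead of averaging it away.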

If you are wondering what to actually measure first, a strong starting point is visibility and inventory, because you cannot protect what you do not know exists. A purposeful metric here is not just how many devices you found, but how complete your view is compared to what operations believes is present. You might track inventory coverage as a percentage of known segments, or the percentage of critical assets that have an owner assigned and a known function documented. You can also measure how often the inventory changes unexpectedly, because unexpected change can signal unmanaged devices, undocumented vendor access, or drifting configurations. In OT, even a simple metric like the number of assets with unknown firmware versions can be useful, because it points to uncertainty that blocks risk decisions. The key is to treat visibility metrics as progress markers, not as proof of safety, because knowing what exists is necessary, but it does not by itself reduce exposure. Still, without visibility, later metrics become guesses.
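As a rough sketch of two of these visibility metrics in Python: the Asset fields, the sample records, and the function names are assumptions made for illustration, not a standard inventory schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Asset:
    """Hypothetical inventory record; field names are illustrative."""
    name: str
    critical: bool
    owner: Optional[str]
    function: Optional[str]
    firmware: Optional[str]


# Invented sample inventory.
assets = [
    Asset("plc-01", True, "ops-team", "line 1 control", "2.1"),
    Asset("plc-02", True, None, "line 2 control", None),
    Asset("hmi-01", False, "ops-team", "operator display", None),
    Asset("sis-01", True, "safety-team", "safety shutdown", "1.4"),
]


def pct_critical_documented(inventory):
    """Percentage of critical assets with both an owner and a documented function."""
    critical = [a for a in inventory if a.critical]
    if not critical:
        return 0.0
    documented = [a for a in critical if a.owner and a.function]
    return 100.0 * len(documented) / len(critical)


def unknown_firmware_count(inventory):
    """Number of assets whose firmware version is not recorded."""
    return sum(1 for a in inventory if a.firmware is None)


print("Critical assets documented (%):", round(pct_critical_documented(assets), 1))
print("Assets with unknown firmware:", unknown_firmware_count(assets))
```

Both numbers describe how much is known, not how safe the plant is, which matches the idea of treating them as progress markers.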

A second high-value area is access and remote connectivity, because many OT incidents begin with someone getting a foothold and moving toward control systems. You can measure the number of remote access paths into OT networks, how many of them use strong authentication like multi-factor authentication (MFA), and how many are limited by time windows and approvals. You can also track privileged access, such as how many shared accounts still exist, how often privileged sessions are monitored, and how quickly access is removed when a vendor engagement ends. These metrics matter because they connect directly to the likelihood of compromise, not just to the volume of work performed. They also encourage a healthy kind of simplification, where the goal is to reduce unnecessary paths rather than to bolt security on top of every path. In OT, fewer, well-controlled pathways are often safer and easier to manage than many pathways with inconsistent rules. When access metrics improve, you can explain clearly what risk is being reduced, which makes them powerful for communication.
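Here is one way the remote access counts might be tallied in Python. The path names, the dictionary keys, and the idea of a "fully controlled" path (strong authentication plus time limits plus approvals) are assumptions made for this sketch.

```python
# Hypothetical remote access paths into the OT network (illustrative records).
paths = [
    {"name": "vendor-vpn", "mfa": True, "time_limited": True, "approval_required": True},
    {"name": "legacy-modem", "mfa": False, "time_limited": False, "approval_required": False},
    {"name": "jump-host", "mfa": True, "time_limited": False, "approval_required": True},
]


def summarize_remote_access(access_paths):
    """Return the path count plus the share using MFA and the share with all controls."""
    total = len(access_paths)
    if total == 0:
        return {"total_paths": 0, "pct_mfa": 0.0, "pct_fully_controlled": 0.0}
    with_mfa = sum(1 for p in access_paths if p["mfa"])
    fully_controlled = sum(
        1 for p in access_paths
        if p["mfa"] and p["time_limited"] and p["approval_required"]
    )
    return {
        "total_paths": total,
        "pct_mfa": 100.0 * with_mfa / total,
        "pct_fully_controlled": 100.0 * fully_controlled / total,
    }


summary = summarize_remote_access(paths)
print(summary)
```

Keeping total_paths in the output matters, because retiring the legacy path improves both percentages and shrinks the count, which is the simplification the metric is meant to reward.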

A third area that benefits from careful measurement is vulnerability and patch management, but you have to adapt it to OT reality. Instead of measuring how fast everything is patched, measure how well risk is triaged and controlled given constraints. A metric might track the percentage of critical vulnerabilities on high-criticality assets that have a documented mitigation, such as network isolation, compensating controls, or scheduled patch windows with vendor support. You might also measure the age of the oldest critical exposure on a critical asset, because lingering exposure can indicate a backlog that is becoming dangerous. Another meaningful measure is the time from discovery to decision, not just discovery to patch, because in OT the decision might be to isolate, monitor, or accept risk with a plan. This keeps the program honest by showing whether the organization is actively managing risk rather than passively accumulating it. The signal you want is whether exposure is understood, prioritized, and reduced in realistic ways.
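The three measures in this area can be sketched in Python as follows. The finding records, the dates, the reporting date, and the rule that "open" means no documented mitigation are all assumptions chosen for illustration.

```python
from datetime import date
from statistics import median

# Hypothetical vulnerability findings (illustrative data only).
findings = [
    {"id": "F-1", "asset_critical": True, "severity": "critical",
     "discovered": date(2024, 1, 10), "decided": date(2024, 1, 17),
     "mitigation": "network isolation"},
    {"id": "F-2", "asset_critical": True, "severity": "critical",
     "discovered": date(2023, 11, 1), "decided": None, "mitigation": None},
    {"id": "F-3", "asset_critical": True, "severity": "critical",
     "discovered": date(2024, 2, 1), "decided": date(2024, 2, 5),
     "mitigation": "scheduled patch window"},
    {"id": "F-4", "asset_critical": False, "severity": "medium",
     "discovered": date(2024, 2, 10), "decided": None, "mitigation": None},
]

AS_OF = date(2024, 3, 1)  # assumed reporting date

# Scope: critical vulnerabilities on high-criticality assets.
in_scope = [f for f in findings if f["asset_critical"] and f["severity"] == "critical"]

# Metric 1: percentage of in-scope findings with a documented mitigation.
mitigated = [f for f in in_scope if f["mitigation"]]
pct_mitigated = 100.0 * len(mitigated) / len(in_scope) if in_scope else 0.0

# Metric 2: age in days of the oldest in-scope finding that has no mitigation yet.
unmitigated = [f for f in in_scope if not f["mitigation"]]
oldest_open_days = max(((AS_OF - f["discovered"]).days for f in unmitigated), default=0)

# Metric 3: median days from discovery to a decision (isolate, monitor, patch, or accept).
decision_days = [(f["decided"] - f["discovered"]).days for f in in_scope if f["decided"]]
median_days_to_decision = median(decision_days) if decision_days else None

print("Mitigated (%):", round(pct_mitigated, 1))
print("Oldest open exposure (days):", oldest_open_days)
print("Median days to decision:", median_days_to_decision)
```

Note that the undecided finding drops out of the median entirely, so the oldest-exposure number is what stops a stalled item from hiding behind a healthy decision time.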

Detection and response metrics are another place where beginners can get misled, because more alerts do not mean better security. In OT, many alerts can be false positives, and too much noise can cause teams to miss the rare signal that truly matters. A better approach is to measure quality and timeliness, such as the percentage of high-severity alerts that receive review within a defined time, or the proportion of alerts that lead to a confirmed finding. You can also measure coverage, like which network zones have monitoring in place, or which critical assets generate meaningful logs. Recovery metrics can be especially valuable, because the ability to restore control system configurations and return to safe operation is a major resilience factor. Measuring how often backups are tested and whether restores succeed within safe time constraints gives you insight that is hard to fake. In OT, response is not just about speed, it is about correctness and safety, so metrics should reflect that.
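A minimal Python sketch of these quality and recovery measures follows. The alert records, the 60-minute review target, and the 120-minute restore limit are illustrative assumptions, not recommended thresholds.

```python
# Hypothetical alert records; minutes_to_review is None when nobody reviewed the alert.
alerts = [
    {"severity": "high", "minutes_to_review": 20, "confirmed": True},
    {"severity": "high", "minutes_to_review": 90, "confirmed": False},
    {"severity": "high", "minutes_to_review": None, "confirmed": False},
    {"severity": "low", "minutes_to_review": 300, "confirmed": False},
    {"severity": "high", "minutes_to_review": 45, "confirmed": True},
]

# Hypothetical backup restore tests; minutes is None when the restore failed.
restore_tests = [
    {"succeeded": True, "minutes": 40},
    {"succeeded": True, "minutes": 130},
    {"succeeded": False, "minutes": None},
]

REVIEW_TARGET_MINUTES = 60    # assumed review target for high-severity alerts
RESTORE_LIMIT_MINUTES = 120   # assumed safe time constraint for a restore


def pct(part, whole):
    """Percentage helper that avoids dividing by zero."""
    return 100.0 * part / whole if whole else 0.0


high = [a for a in alerts if a["severity"] == "high"]
reviewed_on_time = [
    a for a in high
    if a["minutes_to_review"] is not None and a["minutes_to_review"] <= REVIEW_TARGET_MINUTES
]
pct_high_reviewed_on_time = pct(len(reviewed_on_time), len(high))

# Share of all alerts that led to a confirmed finding (a rough signal-to-noise measure).
pct_confirmed = pct(sum(1 for a in alerts if a["confirmed"]), len(alerts))

# A restore only counts if it succeeded and finished within the safe time limit.
good_restores = [
    t for t in restore_tests
    if t["succeeded"] and t["minutes"] is not None and t["minutes"] <= RESTORE_LIMIT_MINUTES
]
pct_restores_within_limit = pct(len(good_restores), len(restore_tests))

print("High-severity alerts reviewed on time (%):", pct_high_reviewed_on_time)
print("Alerts confirmed as findings (%):", pct_confirmed)
print("Restores within safe limit (%):", round(pct_restores_within_limit, 1))
```

The unreviewed alert and the slow restore both count against the result here, on the assumption that a restore outside the safe window is not a success for operations.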

Because OT is tied to the physical world, you should also consider how your security metrics relate to safety and reliability indicators. This does not mean security teams should take over safety measurement, but it does mean you should watch for interactions, like whether security changes correlate with increased maintenance issues or downtime. If a new control causes operators to bypass it during emergencies, the metric might look good on paper while real behavior undermines it. A purposeful metric can include the rate of approved exceptions, the number of emergency access events, or the frequency of control overrides, because those can signal friction that will eventually become risk. Another useful measure is how often procedures are followed during abnormal events, because OT environments rely on disciplined response when systems behave unexpectedly. When security supports safety and reliability, the best signal is often the absence of chaos during disruptions, and your metrics should help you see whether preparedness is real or just documented. The goal is alignment, not competition, so the numbers should be discussed in joint language that operations respects.

Once you have metrics, the next skill is interpreting what they really signal, because numbers can lie in subtle ways. A drop in incidents might mean improved security, or it might mean reduced detection, or it might mean people stopped reporting because it became painful. An increase in vulnerabilities found might mean the environment got worse, or it might mean your visibility improved, which is actually progress. This is why you should pair metrics with narrative context, explaining what changed in the environment, what changed in measurement, and what actions were taken as a result. In OT, context can include planned outages, vendor projects, new production lines, or safety upgrades that alter the asset landscape. A metric without context is like a temperature reading without knowing whether the thermometer was moved from the shade into the sun. When you train yourself to ask what could explain this change besides the thing we hope is true, you become harder to fool, and your reporting becomes more trustworthy.

Finally, you want a measurement approach that stays sustainable, because OT programs fail when measurement becomes a burden that steals time from real risk reduction. A small set of well-chosen metrics, consistently defined and reviewed, beats a massive dashboard that nobody trusts. Definitions matter, because if each site counts assets or incidents differently, comparisons become misleading and arguments replace improvement. Ownership matters too, because metrics should have someone accountable for maintaining the definition, collecting the data, and driving the follow-up action. In a healthy program, metrics lead to conversations that end with decisions, not blame, and they evolve over time as the program matures. When measurement is purposeful, it becomes a way to learn, prioritize, and build confidence that you are controlling what matters most. And when you can explain what a number signals, what it does not signal, and what you will do next, you have turned metrics into a practical tool for OT security.
