Episode 51 — Use Failure Mode and Criticality Thinking: Safety, Reliability, and Cascading Effects

In this episode, we’re going to connect OT security risk thinking to a mindset that OT teams already use to keep systems safe and reliable: failure mode and criticality thinking. If you are brand new to cybersecurity, you may hear words like threat, vulnerability, and attack and assume the whole topic is about adversaries doing malicious things. In operational technology, many of the worst outcomes look the same whether they start with an attacker, a mistake, or a component failure, because the process still ends up in an unsafe or unstable state. Failure mode thinking asks a simple question with powerful consequences: how can this system fail, and what happens when it does. Criticality thinking asks a related question: which failures matter most, and why. When you combine them, you get a structured way to understand safety and reliability impacts, including the cascading effects that turn a small fault into a major incident. This is incredibly useful for OT security because good security controls often reduce the likelihood of specific failure modes, improve detection before they cascade, or strengthen recovery when failures do happen. The goal here is to help you think in realistic paths that connect technical behavior to real-world consequences without drifting into tool-specific details or engineering math.

Before we continue, a quick note: this audio course is a companion to our two study books. The first book covers the exam and gives detailed guidance on how best to pass it. The second is a Kindle-only eBook containing 1,000 flashcards that you can use on your mobile device or Kindle. Check them both out at Cyber Author dot me, in the Bare Metal Study Guides Series.

A failure mode is a particular way something can go wrong, and it is more specific than just saying the system fails. A sensor can fail by reading zero, by drifting slowly, by becoming noisy, or by freezing at a past value, and each of those failure modes can lead to different outcomes. A controller can fail by rebooting, by losing network connectivity, by executing incorrect logic, or by accepting unauthorized changes, and again the consequences differ. Beginners often think failure is always obvious, like a system stops working, but OT systems can fail in subtle ways that are harder to notice and more dangerous because the process continues. This matters for cybersecurity because many cyber events are essentially forced failure modes, such as causing a device to behave incorrectly or causing communication to become unreliable. Even if you never mention a specific attacker technique, you can still model cyber risk as the possibility of triggering failure modes that lead to harm. That framing helps OT teams and beginners alike because it ties cybersecurity to familiar reliability concepts. When you can describe a cyber risk as a specific failure mode, you can also describe the controls that prevent it, detect it, or limit its impact. That is what makes failure mode thinking practical.
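
To make the sensor example above concrete, here is a minimal Python sketch that labels a short window of readings with one of the failure modes just described. The function name, the labels, and both thresholds are illustrative assumptions, not values from any real system or standard, and a real plant would tune and validate anything like this with its own engineers.

```python
from statistics import mean, pstdev


def classify_sensor_window(readings, noise_limit, drift_limit):
    """Give a coarse failure-mode label for one window of sensor readings.

    noise_limit and drift_limit are hypothetical thresholds chosen for
    illustration only.
    """
    if len(readings) < 4:
        return "insufficient_data"
    if all(r == 0 for r in readings):
        return "reads_zero"  # stuck at zero
    if len(set(readings)) == 1:
        return "frozen"  # stuck at a single non-zero value
    # Noise shows up as erratic sample-to-sample steps.
    steps = [b - a for a, b in zip(readings, readings[1:])]
    if pstdev(steps) > noise_limit:
        return "noisy"
    # Drift shows up as a steady shift between the two halves of the window.
    half = len(readings) // 2
    if abs(mean(readings[half:]) - mean(readings[:half])) > drift_limit:
        return "drifting"
    return "normal"
```

The point of the sketch is that each label is a different failure mode with a different downstream consequence, even though all of them could be summarized as "the sensor failed."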

Criticality in this context means the severity of the consequences when a failure mode occurs, and OT consequences can be physical, operational, and financial all at once. A failure mode in a safety interlock system might be critical because it could allow unsafe conditions, while a failure mode in a non-critical reporting system might be inconvenient but not dangerous. Criticality is also shaped by where the failure occurs in the process, because a failure in a bottleneck step can stop an entire line, while the same failure elsewhere might be worked around. Beginners sometimes assume criticality is only about the device itself, like how important the device seems, but criticality is really about what depends on it and what happens if it behaves incorrectly. It also includes time, because a brief failure might be manageable while a prolonged failure can force shutdowns, product loss, or unsafe manual operation. When you assign criticality to failure modes, you stop treating all problems as equal and start prioritizing the ones that truly threaten safety and continuity. This is one of the biggest shifts a beginner can make in OT security thinking, because it replaces vague worry with structured prioritization. When you know which failure modes are critical, you know where to focus protective effort first.
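
One simple way to picture this prioritization is an ordered ranking rather than a formula. The Python sketch below assumes a hypothetical three-part rating for each failure mode (safety impact, operational impact, and number of dependent systems) and sorts with safety first. The field names, scales, and example entries are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    name: str
    safety_impact: int       # hypothetical scale: 0 (none) to 3 (could allow unsafe conditions)
    operational_impact: int  # hypothetical scale: 0 (inconvenient) to 3 (stops the line)
    dependents: int          # how many other systems rely on the affected item


def rank_by_criticality(modes):
    """Order failure modes so safety outranks operations, which outranks reach."""
    return sorted(
        modes,
        key=lambda m: (m.safety_impact, m.operational_impact, m.dependents),
        reverse=True,
    )


EXAMPLE_MODES = [
    FailureMode("reporting_dashboard_offline", 0, 1, 1),
    FailureMode("bottleneck_controller_reboot", 1, 3, 6),
    FailureMode("safety_interlock_sensor_frozen", 3, 2, 4),
]
```

Calling `rank_by_criticality(EXAMPLE_MODES)` puts the interlock sensor first, the bottleneck controller second, and the reporting dashboard last, which mirrors the reasoning in the paragraph above. How long a failure lasts also matters, and a fuller model would need to capture that as well.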

Safety is often the most important lens for failure mode and criticality thinking because safety outcomes are non-negotiable. In OT environments, safety controls are designed to prevent harm even when other systems malfunction, and these controls often operate on assumptions about correct data, correct logic, and predictable timing. A failure mode that undermines those assumptions can become critical because it can reduce the margin of safety. For example, if a safety-related sensor provides incorrect values, or if a control system stops responding and forces manual operation, safety risk can increase. Beginners should understand that safety systems and operational control systems are related but not identical, and a failure in one can affect the other through shared dependencies like networks, power, time synchronization, or shared engineering tools. Failure mode thinking helps you ask whether a cybersecurity control might inadvertently affect safety, such as by creating delays in emergency access or by introducing new points of failure. It also helps you identify where safety depends on reliable information, which is important because cyber events often target integrity, not just availability. The goal is not to make security compete with safety, but to make security support safety by reducing the chance that critical safety-related failure modes are triggered. When you think in failure modes, safety becomes a clear part of your risk model rather than an afterthought.

Reliability is the next lens, and it matters because a reliable system behaves predictably over time, even as conditions change. OT environments rely on predictable behavior for stable operations, and reliability failures can lead to downtime, equipment damage, and unsafe workarounds. A key idea for beginners is that reliability is not only about components not breaking, it is about systems staying within acceptable behavior boundaries. A network that becomes intermittent can create reliability problems even if it never fully fails, because intermittent issues can cause timeouts, partial updates, and inconsistent state across devices. In cybersecurity, reliability is affected by noise and disruption, such as excessive scanning, uncontrolled changes, or unplanned reboots, which can create failure modes even without malicious intent. Failure mode thinking helps you identify how reliability degrades, such as communication jitter, configuration drift, or loss of synchronization, and then assess how those modes affect operations. It also helps you see that some security practices must be adapted in OT to avoid creating reliability failures, which is why OT security emphasizes controlled change and careful scheduling. When you align security with reliability, you reduce the chance that security work becomes the cause of downtime. A mature OT approach treats reliability as a core requirement that security must respect.

Cascading effects are where failure mode and criticality thinking becomes especially powerful, because cascades are how small failures turn into big incidents. A cascade happens when an initial failure changes conditions in a way that triggers other failures, like dominoes. In OT, a simple network issue might cause a controller to lose a signal, which might cause a process to enter an abnormal state, which might cause alarms and operator stress, which might lead to manual actions that further complicate recovery. A cascade can also be technical, like a misconfiguration that spreads across multiple devices due to shared templates or centralized management tools. Beginners often imagine a single failure with a single consequence, but real OT incidents often involve compounded effects across multiple systems and teams. Failure mode thinking encourages you to map these chains and identify where the cascade accelerates, such as points where visibility is lost, where control becomes unstable, or where response becomes uncoordinated. Those acceleration points are valuable because they reveal where mitigations can have outsized benefit. If you can prevent loss of visibility, you may prevent a cascade of wrong decisions. If you can stabilize recovery procedures, you may prevent prolonged downtime even after a failure occurs.
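
The chain described above can be sketched as a small dependency graph and walked outward from the first failure. This is an illustrative Python sketch only: the node names are assumptions based on the example in this episode, and a real cascade map would come from the plant's own engineering knowledge.

```python
from collections import deque

# Hypothetical cascade: each key can trigger the effects listed for it.
CASCADE = {
    "network_fault": ["controller_loses_signal"],
    "controller_loses_signal": ["process_abnormal_state", "operator_display_stale"],
    "process_abnormal_state": ["alarm_flood"],
    "operator_display_stale": ["manual_intervention"],
    "alarm_flood": ["manual_intervention"],
    "manual_intervention": [],
}


def downstream_effects(graph, initial_failure):
    """Breadth-first walk returning {effect: steps from the initial failure}."""
    depth = {initial_failure: 0}
    queue = deque([initial_failure])
    while queue:
        node = queue.popleft()
        for effect in graph.get(node, []):
            if effect not in depth:
                depth[effect] = depth[node] + 1
                queue.append(effect)
    del depth[initial_failure]  # report only the consequences
    return depth
```

Nodes that many paths pass through, such as the stale operator display in this example, are candidates for the acceleration points mentioned above, because protecting them cuts off several branches at once.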

An important step in applying this thinking is distinguishing between availability failures and integrity failures, because they lead to different cascades. An availability failure means something is not accessible, such as a device being offline or a communication path being blocked. An integrity failure means something is accessible but wrong, such as a sensor value being manipulated, a configuration being altered, or a process display showing incorrect state. In OT, integrity failures can be more dangerous because they can cause the system to operate incorrectly while people believe it is operating correctly. Beginners often focus on outages because they are visible, but the more subtle failures can lead to longer, more complex cascades. Failure mode analysis helps you ask, what happens if the data is wrong, not just what happens if the data is missing. It also helps you identify where checks exist to detect wrongness, like plausibility checks, redundant sensors, alarms, and manual verification procedures. Security controls that protect integrity, such as strong change control and monitoring for unauthorized modifications, can therefore be framed as reducing specific integrity-related failure modes. When integrity is protected, cascades become less likely because the system’s behavior remains predictable and observable.
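
As a rough illustration of the difference, the Python sketch below separates a missing reading (an availability problem) from a present but suspicious one (a possible integrity problem), using a range check and a comparison against a redundant sensor. The function name, labels, and limits are hypothetical and chosen only to show the idea.

```python
def check_reading(primary, redundant, low, high, max_disagreement):
    """Label one reading as missing, implausible, disagreeing, or ok.

    primary and redundant are readings from two sensors measuring the same
    quantity; None means no value arrived. All limits are illustrative.
    """
    if primary is None:
        return "missing"  # availability failure: the data is not there
    if not (low <= primary <= high):
        return "implausible"  # integrity concern: value outside physical range
    if redundant is not None and abs(primary - redundant) > max_disagreement:
        return "disagrees"  # integrity concern: sensors contradict each other
    return "ok"
```

Note that a value can pass the range check and still be wrong, which is why the redundant comparison and manual verification procedures remain useful as separate layers.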

Another key idea is that humans are part of the system, and human response is a major factor in cascades. Under stress, people may take shortcuts, repeat actions that make things worse, or misinterpret symptoms, especially if they lack clear information. A failure mode analysis can include human failure modes, such as miscommunication between shifts, unclear escalation pathways, or unauthorized emergency access that bypasses logging. This is not about blaming people, it is about designing procedures and controls that support correct action when time is short. Beginners should understand that training, documentation, and practiced response are mitigations because they reduce the chance of human-driven cascades. For example, if the team knows how to verify whether a controller configuration changed and where to find the last known-good baseline, recovery becomes faster and safer. If the team has a clear process for coordinating with safety and engineering, response decisions become more coherent. When you include humans in your failure mode thinking, you see that some of the best mitigations are clarity and rehearsal rather than new technology. In OT, calm and coordinated response is often the most critical capability during abnormal events.
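
The baseline check mentioned above can be as simple as comparing a fingerprint of the current configuration with one recorded when the configuration was known to be good. The Python sketch below assumes the configuration can be exported as bytes; the function names are hypothetical, and how a given controller exports its configuration is outside the scope of this example.

```python
import hashlib


def fingerprint(config_bytes):
    """Return a SHA-256 hex digest of an exported configuration."""
    return hashlib.sha256(config_bytes).hexdigest()


def config_changed(current_config, baseline_fingerprint):
    """True if the current export no longer matches the known-good baseline."""
    return fingerprint(current_config) != baseline_fingerprint
```

A check like this only tells the team that something changed, not what or why, so it supports the documented procedure rather than replacing it.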

Failure mode and criticality thinking also helps you prioritize controls by matching controls to the failure modes they reduce. If a critical failure mode involves unauthorized changes to control logic, then controls like strong access management, controlled remote support, and monitoring for configuration changes become high priority. If a critical failure mode involves loss of visibility during incidents, then controls like reliable logging, protected historian data, and tested procedures for restoring monitoring become important. If a critical failure mode involves recovery delays, then backups and restore validation become meaningful mitigations because they reduce consequence. Beginners sometimes choose controls based on what sounds modern or impressive, but failure mode analysis anchors control selection to real outcomes. It also helps avoid overcontrol, because not every failure mode deserves the same investment, and some controls may add operational risk without reducing critical failure modes. When you can say, we are implementing this control because it reduces this specific high-criticality failure mode, you create a defensible security plan. That defensibility matters because OT controls must be justified in terms of safety and reliability. The framework turns security into a support function for operational goals.
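
The matching described above can be written down as a small lookup: for each control, list the critical failure modes it actually reduces, and treat an empty list as a prompt to question the investment. This Python sketch uses invented control and failure mode names taken loosely from the examples in this episode.

```python
# Hypothetical mapping from each control to the failure modes it reduces.
CONTROL_MAP = {
    "monitor_config_changes": ["unauthorized_logic_change"],
    "controlled_remote_support": ["unauthorized_logic_change"],
    "tested_restore_procedure": ["recovery_delay", "loss_of_visibility"],
    "shiny_new_dashboard": ["slow_report_generation"],
}

# Hypothetical set of failure modes already rated as high criticality.
CRITICAL_MODES = {"unauthorized_logic_change", "loss_of_visibility", "recovery_delay"}


def justify_controls(control_map, critical_modes):
    """Map each control to the critical failure modes it reduces (sorted)."""
    return {
        control: sorted(set(modes) & set(critical_modes))
        for control, modes in control_map.items()
    }
```

In this example, `shiny_new_dashboard` comes back with an empty list, which is the kind of control that may sound impressive without reducing any high-criticality failure mode.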

Finally, using failure mode and criticality thinking in OT security is about building a shared mental model that different teams can use to make consistent decisions. When you describe risks as failure modes with criticality ratings and cascading paths, you create a language that resonates with operations and engineering. You also give security a way to talk about adversarial risk without sounding disconnected from plant reality. For beginners, the most important takeaway is that cybersecurity can be understood as controlling the conditions that lead to harmful failures, whether those conditions are technical weaknesses, uncontrolled access paths, or unclear procedures. This approach makes risk assessment more realistic because it focuses on how systems actually behave and how incidents actually unfold. It also makes mitigations more meaningful because you choose actions that break cascades and protect safety margins. Over time, as you learn more about the environment, you refine the failure modes and adjust criticality based on evidence and experience, keeping the model alive. When you can identify what fails first, what hurts most, and how cascades spread, you are no longer guessing about OT risk, you are reasoning about it in a way that supports safer, more reliable operations.
