Episode 38 — Define OT SLAs: Internal Versus External Expectations That Protect Uptime

In this episode, we’re going to talk about Service Level Agreements, or S L A s, and why they matter in O T security even though they sound like business documents rather than technical controls. In Operational Technology (O T), uptime is not a vague preference, because uptime supports safety, production commitments, and stable process control. When something fails, the speed and quality of the response often determine whether the impact stays small or becomes a long, expensive disruption. That is where S L A s come in. They set expectations for response times, availability targets, escalation paths, and the practical details of who does what when systems are degraded. For beginners, it can be tempting to treat S L A s as something procurement handles after the “real” work is done, but in industrial environments, S L A s are part of resilience. They influence whether critical systems are monitored, whether issues are addressed quickly, whether vendors can be reached during a crisis, and whether security requirements are honored when the pressure to restore service is intense. By the end, you should be able to explain what an S L A is in O T terms, how internal S L A s differ from external ones, and how the right expectations protect uptime without forcing unsafe shortcuts.

Before we continue, a quick note: this audio course is a companion to our course books. The first book is about the exam and provides detailed information on how best to pass it. The second book is a Kindle-only eBook that contains 1,000 flashcards that can be used on your mobile device or Kindle. Check them both out at Cyber Author dot me, in the Bare Metal Study Guides Series.

A useful starting point is to define an S L A in plain language as a written agreement about service expectations. It typically specifies how available a service should be, how quickly issues should be acknowledged and addressed, and how performance will be measured. In O T, “service” can mean many things, including a monitoring platform, a remote access capability, a historian system, a network segment that supports production, or a vendor support function that keeps equipment running. An S L A is different from a wish list because it is meant to be measurable, and it is different from a policy because it focuses on outcomes like response and availability rather than on rules like “use strong passwords.” Beginners sometimes assume S L A s are only for vendors, but internal S L A s are just as important, because many failures in industrial environments are not caused by a vendor being slow, but by internal confusion over who owns response and what “fast enough” means. When expectations are not explicit, teams can argue during outages, and arguments waste time. In O T, time wasted during an outage can turn a manageable issue into a broader production problem, so making expectations explicit is a form of risk reduction.

To understand internal versus external S L A s, it helps to think about what each one is trying to accomplish. An internal S L A sets expectations between groups inside the same organization, such as between operations and the network team, or between engineering and security, or between a central support group and a plant site. It helps ensure that support is not dependent on personal relationships, and it creates a predictable process for resolving problems quickly. An external S L A sets expectations between the organization and an outside party, such as a vendor, managed service provider, telecom carrier, or cloud service provider. It ensures that when an outside dependency fails, the organization has clear rights to support, escalation, and remedies. Both types protect uptime, but they operate differently. Internal S L A s are about coordination and accountability within your control, while external S L A s are about enforcing performance from parties you depend on but cannot command. Beginners should recognize that a strong O T environment usually needs both, because uptime depends on internal readiness and external support working together under stress.

Availability targets are often the first thing people think of when they hear S L A, but in O T security, availability targets must be treated carefully. An availability target is typically expressed as a percentage, and it implies how much downtime is acceptable in a given period. In industrial settings, acceptable downtime varies widely depending on what the service supports. A system that provides real-time operational visibility might need extremely high availability, while a reporting system used for weekly analysis might tolerate more downtime without affecting safety. The danger is that beginners may assume the highest possible availability is always the goal, but extremely high availability can be expensive and sometimes unnecessary if there are safe fallback procedures. The right approach is to link availability targets to operational impact and safety, such as whether downtime would force unsafe manual operation or whether it would simply delay convenience features. This is also where risk appetite comes into play, because the organization must decide what interruptions are acceptable given the consequences. When availability targets are aligned to operational needs, S L A s protect uptime in meaningful ways rather than chasing numbers.
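To make the link between a percentage and acceptable downtime concrete, here is a minimal sketch that converts an availability target into a downtime budget. The function name, the 30-day measurement period, and the three sample targets are illustrative assumptions, not values taken from any particular agreement.

```python
def allowed_downtime_minutes(availability_pct: float, period_days: float = 30.0) -> float:
    """Return the downtime budget, in minutes, implied by an availability target.

    The 30-day default period is an assumption; a real S L A states its own
    measurement window and what counts as downtime.
    """
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    period_minutes = period_days * 24 * 60
    return period_minutes * (1.0 - availability_pct / 100.0)


if __name__ == "__main__":
    # Hypothetical targets, chosen only to show how quickly the budget shrinks.
    for target in (99.0, 99.9, 99.99):
        budget = allowed_downtime_minutes(target)
        print(f"{target}% over 30 days allows about {budget:.1f} minutes of downtime")
```

Over 30 days, 99 percent allows roughly 432 minutes of downtime, 99.9 percent allows roughly 43 minutes, and 99.99 percent allows a little over 4 minutes. Seeing the budget in minutes makes it easier to ask whether a safe fallback procedure could cover that gap, or whether the higher target is worth its cost.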

Response time expectations are often more valuable than raw availability targets, because many incidents are not full outages but degradations that get worse if ignored. Response time in an S L A usually includes how quickly an issue is acknowledged, how quickly it is triaged, and how quickly it is either resolved or escalated. In O T, these distinctions matter because the fastest “resolution” might be to do something risky, like bypassing controls, while the safer approach might be to stabilize the process first and then address root cause. A strong S L A recognizes that speed must be paired with discipline, such as requiring that emergency actions be documented and reviewed afterward. Internal S L A s might specify that certain alarms must be acknowledged within minutes and that certain classes of incidents require involvement of operations leadership and safety consultation. External S L A s might specify that a vendor must respond within a certain time and provide qualified personnel who can support recovery. Beginners should learn that response expectations are part of safety, because delayed response can increase hazard, but rushed response can also increase hazard if it introduces uncontrolled change.
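The three response stages described above (acknowledge, triage, resolve or escalate) can be sketched as a simple check against an incident timeline. Every name and time limit here is a hypothetical illustration; a real S L A defines its own stages, clocks, and severity classes.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class ResponseTargets:
    """Hypothetical time limits, each measured from when the incident opened."""
    acknowledge: timedelta
    triage: timedelta
    resolve_or_escalate: timedelta


@dataclass
class Incident:
    opened: datetime
    acknowledged: Optional[datetime] = None
    triaged: Optional[datetime] = None
    resolved_or_escalated: Optional[datetime] = None


def breached_stages(incident: Incident, targets: ResponseTargets, now: datetime) -> List[str]:
    """List the stages that missed their target.

    A stage that has not happened yet is judged against `now`, so an
    overdue stage is reported while the incident is still open.
    """
    stages = [
        ("acknowledge", incident.acknowledged, targets.acknowledge),
        ("triage", incident.triaged, targets.triage),
        ("resolve_or_escalate", incident.resolved_or_escalated, targets.resolve_or_escalate),
    ]
    breached = []
    for name, completed_at, limit in stages:
        effective = completed_at if completed_at is not None else now
        if effective - incident.opened > limit:
            breached.append(name)
    return breached
```

Note that the last stage is "resolve or escalate" rather than "resolve." That choice reflects the point made above: the clock should reward a safe handoff, not pressure responders into a risky fix just to stop the timer.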

Escalation and communication expectations are another essential part of S L A design, because confusion during incidents is a major cause of prolonged downtime. An S L A should define how issues are escalated when initial responders cannot resolve them, and it should define who is notified at each stage. In O T, this often includes escalation to engineering when control behavior is affected, escalation to network teams when connectivity issues appear, escalation to security when suspicious activity is possible, and consultation with safety if actions could affect protective functions. Communication expectations should also include how status updates are provided and how often, because leadership and operations need timely situational awareness to make safe decisions about whether to continue production, shift to a degraded mode, or shut down. External S L A s should define how vendors communicate during incidents, including what information they provide, how they handle confidentiality, and how they coordinate with internal teams. Beginners should see that an S L A is not only about technical response; it is about information flow, because good information flow prevents duplicated effort and prevents contradictory actions that can make outages worse.

Security requirements often have to be embedded into S L A expectations, because uptime pressure can tempt teams to bypass controls. For example, if a vendor cannot meet a response time unless they have persistent remote access, operations may push for an always-on pathway, which can increase risk. A well-designed S L A can instead specify that rapid access will be provided through a controlled, time-limited process that still meets response needs. Similarly, internal S L A s can specify that certain emergency changes are allowed but must be logged and reviewed, which discourages silent shortcuts. An S L A can also specify monitoring and logging expectations, such as ensuring that critical systems generate operational and security logs and that log review happens within a defined timeframe after incidents. This is where S L A s protect uptime in a deeper way: they encourage disciplined recovery and prevent repeated incidents caused by undocumented changes. Beginners should learn that strong security and strong uptime are not opposites when expectations are designed thoughtfully. The S L A is one of the places where that thoughtful design becomes enforceable.
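As one way to picture the controlled, time-limited access described above, here is a minimal sketch of an access grant that expires on its own and is tied to a ticket so it can be reviewed later. The class, its fields, and the idea of keying the grant to a ticket identifier are assumptions made for illustration, not a description of any specific remote access product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class AccessGrant:
    """A hypothetical record of vendor remote access approved for one incident."""
    vendor: str
    ticket_id: str          # ties the access to a reviewable incident record
    granted_at: datetime
    duration: timedelta     # the grant lapses without anyone having to revoke it

    @property
    def expires_at(self) -> datetime:
        return self.granted_at + self.duration

    def is_active(self, now: datetime) -> bool:
        """True only inside the approved window; the expiry instant is excluded."""
        return self.granted_at <= now < self.expires_at
```

The design point is that the default state is "no access." The vendor still gets in quickly when an incident is open, but nothing stays connected after the window closes.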

Measurement and evidence are another core element, because an S L A is only useful if performance can be observed and evaluated. In internal relationships, measurement helps leadership see whether support teams are meeting expectations and where bottlenecks exist. In external relationships, measurement helps determine whether vendors are meeting contractual obligations and whether escalation or remedies are necessary. In O T, measurement must also consider what is being measured and why, because some metrics can encourage bad behavior. For example, measuring “time to close tickets” without considering quality can encourage rushed fixes that create instability. Measuring availability without considering the existence of safe fallback modes can encourage expensive overengineering. Better measurements often include a mix of speed, stability, and effectiveness, such as how quickly critical incidents are acknowledged, how often outages repeat, and how often emergency changes are reviewed and corrected. Evidence that holds up might include logs, incident timelines, change records, and post-incident reviews. Beginners should recognize that measurement is not meant to punish; it is meant to improve reliability by making performance visible and by turning vague complaints into actionable improvements.
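The mix of speed, stability, and effectiveness measurements mentioned above can be sketched as a small summary over incident records. The record fields and the sample data are invented for illustration; in practice these values would come from ticketing and change management records.

```python
from statistics import median
from typing import Dict, List, Optional

# Invented sample records, used only to show the shape of the calculation.
SAMPLE_INCIDENTS = [
    {"service": "historian", "ack_minutes": 4, "emergency_change": True, "change_reviewed": True},
    {"service": "remote_access", "ack_minutes": 12, "emergency_change": True, "change_reviewed": False},
    {"service": "historian", "ack_minutes": 6, "emergency_change": False, "change_reviewed": False},
    {"service": "plant_network", "ack_minutes": 3, "emergency_change": False, "change_reviewed": False},
]


def summarize(incidents: List[dict]) -> Dict[str, Optional[float]]:
    """Summarize speed, repetition, and review discipline for a set of incidents."""
    if not incidents:
        raise ValueError("at least one incident is required")

    # Speed: how quickly incidents were acknowledged.
    ack_times = [i["ack_minutes"] for i in incidents]

    # Stability: incidents beyond the first for any given service.
    services = [i["service"] for i in incidents]
    repeat_incidents = len(services) - len(set(services))

    # Effectiveness: share of emergency changes that were reviewed afterward.
    emergency = [i for i in incidents if i["emergency_change"]]
    reviewed = [i for i in emergency if i["change_reviewed"]]
    review_rate = len(reviewed) / len(emergency) if emergency else None

    return {
        "median_ack_minutes": median(ack_times),
        "repeat_incidents": repeat_incidents,
        "emergency_review_rate": review_rate,
    }


if __name__ == "__main__":
    print(summarize(SAMPLE_INCIDENTS))
```

None of these numbers rewards closing a ticket quickly on its own. A team that acknowledges fast but leaves emergency changes unreviewed, or sees the same service fail repeatedly, shows up clearly in the summary.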

The difference between internal and external S L A expectations also shows up in what you can do when expectations are not met. Internally, you can change processes, assign resources differently, and adjust governance. Externally, you may rely on escalation, service credits, or contractual remedies, which may not help in the moment of an outage but can influence long-term vendor behavior and investment. This is why external S L A s should be designed with realistic enforcement and meaningful escalation paths, not just optimistic numbers. It is also why internal S L A s are crucial: you cannot outsource readiness. Even with a strong vendor, your organization must have the ability to recognize issues, stabilize operations, and coordinate response while waiting for external support. Beginners should learn that the strongest uptime protection comes from combining internal discipline with external accountability. If either side is weak, outages become longer and more chaotic.

Finally, S L A design should reflect the real world of O T operations, where not all services have the same criticality and where “one size fits all” expectations can create waste and conflict. A monitoring system for safety-related signals may require stronger availability and response expectations than a system used for routine reporting. A remote access service used for emergency vendor support may require strict security controls and rapid response expectations, while a noncritical analytics service may prioritize data integrity and scheduled maintenance windows. Internal S L A s might differ by plant or by production line depending on operational tolerance for downtime. External S L A s might differ by service type, such as on-site support versus remote support. Beginners should understand that the purpose of an S L A is to align expectations with criticality, so teams focus their energy where uptime truly protects safety and business outcomes. If expectations are aligned, stakeholders are less likely to fight, because they can see why some services deserve faster, stronger support than others.
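Aligning expectations with criticality can be sketched as a lookup from each service to a tier, and from each tier to its targets. The tier names, service names, and every number below are hypothetical placeholders; the real values come from the organization's own analysis of operational impact and safety.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlaTier:
    availability_pct: float
    acknowledge_minutes: int
    maintenance_window_allowed: bool


# Hypothetical tiers; the numbers are placeholders, not recommendations.
TIERS = {
    "safety_critical": SlaTier(99.99, 5, False),
    "production": SlaTier(99.9, 15, True),
    "reporting": SlaTier(99.0, 240, True),
}

# Hypothetical assignment of services to tiers.
SERVICE_CRITICALITY = {
    "safety_signal_monitoring": "safety_critical",
    "emergency_remote_access": "production",
    "weekly_analytics": "reporting",
}


def sla_for(service: str) -> SlaTier:
    """Return the targets for a service, failing loudly if it was never classified."""
    criticality = SERVICE_CRITICALITY.get(service)
    if criticality is None:
        raise KeyError(f"no criticality assigned to {service!r}")
    return TIERS[criticality]
```

Keeping the tier definitions separate from the service assignments means a plant can reclassify a service without rewriting the targets, and an unclassified service raises an error instead of silently inheriting a default.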

As we wrap up, defining O T S L A s is a resilience and security activity because S L A s turn vague hopes about uptime into explicit, measurable expectations that guide response under pressure. Internal S L A s create predictable coordination between engineering, operations, and security, reducing delay and confusion during incidents. External S L A s create enforceable expectations with vendors and service providers, improving support quality and accountability for critical dependencies. Strong S L A s focus on availability, response times, escalation, and communication, but they also embed security discipline so that uptime is protected without unsafe shortcuts. Measurement and evidence ensure S L A performance can be trusted and improved over time, and thoughtful design ensures expectations match the criticality of the services involved. If you can explain how internal and external S L A s work together to protect stable operations and safe recovery, you will understand why S L A s are not just business paperwork, but practical tools that shape how O T security and uptime are actually delivered.
