What I Learned at Work this Week: Design for Resilience

Mike Diaz
Mar 21, 2021
credit: Pixabay on Pexels

It’s book club week again, so I’ve been working through Chapter 8 of Building Secure & Reliable Systems. I’ve never been a great reader and it’s easy for me to lose focus and find myself having “read” a page without really retaining any of it. To prepare myself for my next group discussion, I’ll do my best to summarize some of the concepts that will help us Design for Resilience.

Resilience describes a system’s ability to withstand attack or failure not necessarily by preventing it, but by building thoughtful contingencies for the inevitable moment of vulnerability. Our text outlines six principles to promote resilience:

Defense in Depth (Independent Resilience)

Depth in this case refers to multiple security layers that attackers must pass in order to reach their goal. This makes sense — it’s harder for me to climb two walls than to climb one — but I learned that this isn’t necessarily about building firewalls or adding security questions to a login. One theme of this book is to look at security from a wider scope. While those methods are helpful in securing my system, stopping someone at the access point isn’t the only thing I should consider.

It’s possible to flag unusual behavior during an attacker’s preparation. One clever example from the book is to keep track of DNS registrations of domain names similar to our own. An unusual registration might mean that an attacker is hoping to spoof our site, potentially to trick employees or customers into sharing personal information. It’s generally good practice to watch out for an individual who’s a bit too curious about how things work, or a program that’s scanning ports and applications.
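To make the lookalike-registration idea concrete, here’s a minimal sketch that flags newly registered domains within a small edit distance of our own. The domain names, the `max_distance` threshold, and the list of new registrations are all my own assumptions for illustration; the book doesn’t prescribe a specific detection method.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed one row of the table at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def find_lookalikes(our_domain, new_registrations, max_distance=2):
    """Return registrations that are close to, but not the same as, ours."""
    flagged = []
    for candidate in new_registrations:
        distance = edit_distance(our_domain.lower(), candidate.lower())
        if 0 < distance <= max_distance:
            flagged.append(candidate)
    return flagged


# Hypothetical feed of newly registered domains.
new_domains = ["examp1e.com", "exampel.com", "unrelated.org", "example.com"]
suspicious = find_lookalikes("example.com", new_domains)
```

A real monitor would also need to handle tricks that edit distance misses, like visually similar Unicode characters, so I’d treat this as a starting point only.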

Outside of prevention at the point of vulnerability, the book also warns us to prepare for what happens after an attack has been carried out. If we find that our system has been breached, we should have safeguards built in that localize the damage and prevent a complete shutdown (more on this later). Finally, it’s valuable to decide ahead of time what we’re willing to give up if we find ourselves in a compromised situation. Hopefully it never happens, but what can we stand to lose in order to stop the attack?

Controlling Degradation (Prioritize Features)

Another theme of this book is that security and reliability mean preparing for the inevitable. One of those inevitabilities is that parts of our system will fail at some point. Whether that’s because of an oversight on our part, a DDoS attack, a third party API going down, or a server melting in Virginia, we owe it to ourselves to make a plan for how we are going to shift our focus if and when we lose computing resources. That’s why the section is called controlling degradation, rather than preventing degradation.

The text presents us with three key actions to take when planning our controlled degradation:

Differentiate Costs of Failures

We receive some very practical advice here, being reminded that system failure comes with a computational cost. If our system is prompted to re-try an action upon failure, the cost of an isolated incident can start to affect other parts of our system, leading to a total crash. It’s useful to check the CPU/memory/bandwidth costs of various operations and failures and identify parts of our system where there’s a high concentration of resources. Once we’ve found the process that incurs these costs, we’ve got a target for controlled degradation.

This may mean that we’re going to redesign the failure process to retry less frequently or more thoroughly check for the cause of the failure and act accordingly. In a different example, we may learn that certain front end features use a lot of CPU but that we can disable them and still provide a downgraded, though functional, version of our platform in case of emergency.
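One common way to “retry less frequently” is capped exponential backoff with jitter and a hard limit on attempts. This is my own sketch rather than anything from the book, and the function name, delays, and attempt count are illustrative assumptions.

```python
import random
import time


def call_with_backoff(operation, max_attempts=4, base_delay=0.1,
                      max_delay=2.0, sleep=time.sleep):
    """Retry `operation`, waiting longer after each failure.

    Giving up after `max_attempts` keeps one failing dependency from
    soaking up CPU and bandwidth with endless retries.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure
            # Double the delay ceiling each time, cap it, and add random
            # jitter so many clients don't all retry at the same instant.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))


# Usage: a hypothetical operation that fails twice, then succeeds.
attempts = []


def flaky_operation():
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("temporary failure")
    return "ok"


waits = []  # record the delays instead of actually sleeping
result = call_with_backoff(flaky_operation, sleep=waits.append)
```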

Deploy Response Mechanisms

In other words, we want our system to react to a loss of resources automatically rather than manually. There’s not much that needs explaining here, so the authors provided an example for how we might handle the issue of excessive load, like if a celebrity promotes our product and we suddenly receive a huge influx of traffic and requests (remember it’s not just malicious attacks that can cause an incident).

There are two common strategies if we’re receiving more requests than we can handle: load shedding and throttling. Load shedding means returning an error instead of processing the request, and throttling means delaying the response to buy our system some more time. The book includes a nifty illustration of both techniques at work.

We still have to reject some traffic, but we’re serving important requests while doing it rather than crashing completely. Even better, thanks to throttling, we can process requests that otherwise might have been rejected. Keep in mind that we should be monitoring traffic and request response time to help us think critically about how we want to address load issues in the future.
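Here’s a rough sketch of how a server might choose between the two strategies. The `critical` flag and the load thresholds are assumptions I made up for the example, not values from the book, and a production system would likely cap critical traffic too.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    critical: bool  # hypothetical flag: is this request business-critical?


def decide(request: Request, load: float,
           throttle_at: float = 0.7, shed_at: float = 0.9) -> str:
    """Pick a response strategy from the current load (0.0 to 1.0)."""
    if load < throttle_at:
        return "serve"
    if request.critical:
        return "serve"      # keep important traffic flowing
    if load < shed_at:
        return "throttle"   # delay the response to buy time
    return "shed"           # reject with an error (e.g. HTTP 503)


browse = Request("browse-catalog", critical=False)
checkout = Request("checkout", critical=True)
```

With these thresholds, a non-critical request is served normally under light load, delayed under moderate load, and rejected only when the system is close to capacity.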

Automate Responsibly

This point introduces a fascinating dichotomy for which there is no universal balance: security versus reliability. After all, it’s more secure for your system to shut down at the first sign of uncertainty, but it’s more reliable for it to maintain activity in the face of that same uncertainty. Organizations must make difficult decisions about what they’re willing to sacrifice for the sake of security and reliability and what types of uncertainty or vulnerability should trigger a more secure or more reliable response.
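This tradeoff is often described as failing open versus failing closed. A minimal sketch, assuming a hypothetical permission check that can itself become unavailable:

```python
def authorize(user, check_permission, fail_open: bool) -> bool:
    """Decide access when the permission check itself might be down.

    fail_open=True favors reliability: users keep working during an outage.
    fail_open=False favors security: nobody gets in while we're uncertain.
    """
    try:
        return check_permission(user)
    except Exception:
        return fail_open


def broken_check(user):
    # Stand-in for a permission service that is currently unreachable.
    raise TimeoutError("permission service unavailable")


reliable_choice = authorize("ana", broken_check, fail_open=True)
secure_choice = authorize("ana", broken_check, fail_open=False)
```

Which value of `fail_open` is right depends on what the check protects, which is exactly the difficult decision described above.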

Controlling the Blast Radius (Compartmentalize the System)

From the text: Compartmentalization involves deliberately creating small individual operational units (compartments) and limiting access to and from each one. It doesn’t take too much imagination to understand why this would be valuable. If an attacker gains access to part of our system, we’d like to prevent them from accessing all the rest of it. A compartment has to have some sort of opening so that all the components of our system can work together, but we want to control that opening and authenticate requests that use it. We’re presented with a few common techniques for compartmentalization:

Role Separation

A role is a type of permission given to a job that allows it to access certain microservices. The idea here is that we give different jobs different roles so that compromising one does not lead to a universal system compromise.
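As a toy illustration (the job and service names are invented), role separation can be as simple as a lookup that denies anything not explicitly granted:

```python
# Each job gets its own role, and each role reaches only what it needs.
ROLE_PERMISSIONS = {
    "billing-job": {"payments-service"},
    "report-job": {"analytics-service"},
}


def can_access(role: str, service: str) -> bool:
    """Allow access only if the role was explicitly granted the service."""
    return service in ROLE_PERMISSIONS.get(role, set())
```

If an attacker compromises `report-job`, this check still keeps them out of `payments-service`.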

Location Separation

We can add the expectation that certain system requests will be associated with a specific physical location. If all of our employees work out of an office in New York, it’s probably safe to assume that we won’t be receiving legitimate sensitive database requests from London. The book is careful to remind us not to grant permissions based on location alone, but to scope permissions by location so that access in one place doesn’t carry over to another. At Google, for example, encryption keys are isolated by location so that gaining access to a part of a system specific to a single region will not allow you to decrypt data for another region.
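To show the idea of keys isolated by region, here’s a small sketch using Python’s standard `hmac` module. Note the swap: it signs messages rather than encrypting them, since the standard library has no encryption primitives, and the region names are hypothetical. This is not how Google implements it; it only demonstrates that a key from one region is useless in another.

```python
import hashlib
import hmac
import secrets

# One independent, randomly generated key per region.
REGION_KEYS = {
    region: secrets.token_bytes(32) for region in ("us-east", "eu-west")
}


def sign(region: str, message: bytes) -> bytes:
    """Sign a message with the key that belongs to one region."""
    return hmac.new(REGION_KEYS[region], message, hashlib.sha256).digest()


def verify(region: str, message: bytes, signature: bytes) -> bool:
    """Check a signature against the given region's key only."""
    return hmac.compare_digest(sign(region, message), signature)


us_signature = sign("us-east", b"read customer record")
```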

Time Separation

It’s always good practice to update or rotate permissions to prevent an attacker from gaining unlimited access to your system after uncovering one vulnerability. These rotations or credential handoffs can be vulnerabilities themselves, so we have to be delicate about how we handle updating our secrets.
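One way to make the handoff less fragile is to keep accepting the previous secret for a short grace period after rotating. The class below is my own sketch under that assumption; the names and the grace period are illustrative.

```python
import hmac


class RotatingSecret:
    """Holds the current secret plus the previous one for a grace period."""

    def __init__(self, initial: bytes, grace_seconds: float = 300.0):
        self.current = initial
        self.previous = None
        self.rotated_at = None
        self.grace_seconds = grace_seconds

    def rotate(self, new_secret: bytes, now: float) -> None:
        """Install a new secret and remember when the switch happened."""
        self.previous, self.current = self.current, new_secret
        self.rotated_at = now

    def is_valid(self, candidate: bytes, now: float) -> bool:
        if hmac.compare_digest(candidate, self.current):
            return True
        if self.previous is None:
            return False
        # The old secret works only until the grace period runs out.
        within_grace = now - self.rotated_at <= self.grace_seconds
        return within_grace and hmac.compare_digest(candidate, self.previous)


secret = RotatingSecret(b"old-key", grace_seconds=60)
secret.rotate(b"new-key", now=1000.0)
```

Passing `now` in explicitly (rather than reading the clock inside the class) is just a choice to keep the example easy to test.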

Failure Domains and Redundancies

If all else fails, save a backup copy! It’s never that simple, but the idea behind failure domains is to partition a system into multiple equivalent but completely independent copies for the sake of security and reliability. That independence is challenging, as we want the domains to be similar enough to fill in for each other, but without relying on the same structures (functional isolation) or databases (data isolation), which could cause both to fail simultaneously. The authors provide some non-malicious examples of why failure domains are useful: updating one domain but not the others will allow us to quickly revert to a functional system if our latest change causes issues. Likewise, redundancies that span geographic locations can help protect us against natural disasters that affect our physical hard drives and servers.
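The update-one-domain-first idea could be sketched like this. The domain names, version strings, and `is_healthy` check are all hypothetical stand-ins.

```python
def staged_rollout(domains: dict, new_version: str, is_healthy) -> bool:
    """Update one failure domain first; touch the rest only if it stays healthy.

    `domains` maps a domain name to the version it currently runs.
    """
    first, *rest = list(domains)
    old_version = domains[first]
    domains[first] = new_version
    if not is_healthy(first):
        domains[first] = old_version  # revert; other domains never changed
        return False
    for name in rest:
        domains[name] = new_version
    return True


domains = {"domain-a": "v1", "domain-b": "v1", "domain-c": "v1"}

# A bad release is caught in the first domain and rolled back.
bad_release_ok = staged_rollout(domains, "v2-broken", is_healthy=lambda name: False)
after_bad_release = dict(domains)

# A good release proceeds to every domain.
good_release_ok = staged_rollout(domains, "v2", is_healthy=lambda name: True)
```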

So how much energy should we put into these copies? It would be great to produce 100 or 1,000 redundancies for every component, but such extreme measures are impractical. To help us understand where we should focus our resources, we can break our system’s components into categories based on their use. High-capacity components are what drive our business or user experience. These are core to our product and therefore deserve the most attention and redundancy. We can develop high-availability components by introducing failure domains for our high-capacity components. We want the critical parts of our product to be widely available in case of emergency, and reliable, automated alternate systems are a great way to achieve this. We want updates to these features to be controlled by limiting frequency and permissions.

A low-dependency component is a stripped down version of our high-capacity component. We should be prepared to deploy these types of components if an attack or incident has affected the dependencies of our primary components. Rather than shut down our service completely, we can use a low-dependency component to keep the lights on while we troubleshoot. It can be a useful exercise to even consider what low-dependency components would look like for some of our critical functions. If we lose 30% of our storage capacity, is it still possible for us to serve up our product?
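Here’s what that might look like for a made-up recommendations feature: the primary path needs a live database, while the low-dependency path serves a static list that ships with the application. Every name here is invented for the example.

```python
# Baked-in data that needs no external dependency (hypothetical).
STATIC_TOP_SELLERS = ["item-1", "item-2", "item-3"]


def personalized_recommendations(user_id, database):
    """High-capacity path: requires a working database connection."""
    return database.fetch_recommendations(user_id)


def recommendations(user_id, database):
    """Serve personalized results, or a stripped-down list if the DB is down."""
    try:
        return personalized_recommendations(user_id, database)
    except ConnectionError:
        # Low-dependency path: less useful, but keeps the lights on.
        return STATIC_TOP_SELLERS


class DownDatabase:
    """Stand-in for a database that is unreachable during an incident."""

    def fetch_recommendations(self, user_id):
        raise ConnectionError("database unreachable")


degraded = recommendations("user-42", DownDatabase())
```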

We don’t want to skimp on security for our low-dependency components — the authors actually posit that it’s worthwhile to make these more secure than our high-capacity components. After all, if we’re using an alternate component, it means that something has gone wrong and we’re likely already at a disadvantage against an incident or attacker. We also want to make sure that our redundancies are regularly maintained and tested. We don’t want to have to constantly use them in production, but we want them to be ready when we need them.

Continuous Validation (Automate Resilience Measures)

Maintaining security and reliability is a constant process and we want to be constantly aware of any vulnerabilities that could compromise our system. When it comes to validation, the text stresses testing our systems under realistic but controlled circumstances. We want to know what we’re testing for and we want to know how it will translate to a practical situation that might put our system under stress. There is such a thing as information overload, so we’re presented with validation as a cycle that focuses on known vulnerabilities:

  1. Discovering new failures
  2. Implementing validators for each failure
  3. Executing all validators repeatedly
  4. Phasing out validators when the relevant features or behaviors no longer exist
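The cycle above could be sketched as a small registry of checks. This is only my interpretation; the validator names are hypothetical, and a real setup would run the checks on a schedule and alert on failures.

```python
validators = {}


def register(name, check):
    """Step 2: implement a validator for a newly discovered failure."""
    validators[name] = check


def run_all():
    """Step 3: execute every validator; call this repeatedly."""
    return {name: check() for name, check in validators.items()}


def retire(name):
    """Step 4: phase out a validator once its feature no longer exists."""
    validators.pop(name, None)


register("config-has-owner", lambda: True)
register("legacy-endpoint-disabled", lambda: False)
first_run = run_all()

retire("legacy-endpoint-disabled")
second_run = run_all()
```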

We’ll want to be vigilant about discovering these new failures by paying attention to bug reports, consulting experienced veterans on our team, or running tests like a fuzz test (using a program to generate a bunch of unusual/random requests). There are a ton of different ways to test and check on our systems. Another example is logging interactions between components and checking to see if they are crossing any boundaries we intended to set with our compartments.
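A very small fuzz test might look like the following. The `parse_quantity` function is a made-up target, and real fuzzers are far smarter about generating inputs, so take this as a sketch of the concept only.

```python
import random
import string


def parse_quantity(text: str) -> int:
    """Hypothetical function under test: parse a quantity from 1 to 100."""
    value = int(text)
    if not 1 <= value <= 100:
        raise ValueError("quantity out of range")
    return value


def fuzz(target, runs=1000, seed=0):
    """Feed random strings to `target` and collect unexpected exceptions."""
    rng = random.Random(seed)  # fixed seed so failures are reproducible
    crashes = []
    for _ in range(runs):
        length = rng.randint(0, 8)
        data = "".join(rng.choice(string.printable) for _ in range(length))
        try:
            target(data)
        except ValueError:
            pass  # a clean rejection of bad input is the expected outcome
        except Exception as exc:
            crashes.append((data, exc))
    return crashes


crashes = fuzz(parse_quantity)
```

Anything that lands in `crashes` is an input the function didn’t reject cleanly, which makes it a candidate for a new validator.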

We’re given five different examples from inside Google, including information about a library that can add delays or failures into a server. Google engineers can methodically add latency and other issues and see how their systems will react, checking specifically for red flags like cascading failures that allow one error to spread throughout a system.
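I don’t know what Google’s library looks like, so the wrapper below is purely my own sketch of the general idea: wrap a function so that it sometimes runs slowly or fails, then watch how callers cope. The parameter names are assumptions.

```python
import random
import time


def with_faults(func, failure_rate=0.0, delay_seconds=0.0,
                rng=random, sleep=time.sleep):
    """Wrap `func` so calls can be delayed or made to fail on purpose."""
    def wrapper(*args, **kwargs):
        if delay_seconds:
            sleep(delay_seconds)  # injected latency
        if rng.random() < failure_rate:
            raise ConnectionError("injected failure")
        return func(*args, **kwargs)
    return wrapper


# Inject latency only; record the delay instead of really sleeping.
delays = []
slow = with_faults(lambda: "ok", delay_seconds=0.5, sleep=delays.append)
slow_result = slow()

# Inject a failure on every call.
always_failing = with_faults(lambda: "ok", failure_rate=1.0)
try:
    always_failing()
    failure_was_injected = False
except ConnectionError:
    failure_was_injected = True
```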

Practical Advice

The authors do us one final favor in this chapter by organizing the described principles in order of expected cost to help us prioritize what to try first. They advocate starting with failure domains and blast radius control because they tend to remain static and therefore don’t require a lot of resources to maintain. High-availability services are next, followed by load-shedding/throttling and DoS defense.

I’m grateful that I’m not in a position to make these decisions right now, but I know that understanding these concepts will help me better empathize with those who are. There’s no limit to what we can learn about software development and computer engineering, so it’s all about learning a little bit more every day. Keep it up, readers!