What I Learned at Work this Week: Design for Resilience

credit: Pixabay on Pexels

It’s book club week again, so I’ve been working through Chapter 8 of Building Secure & Reliable Systems. I’ve never been a great reader and it’s easy for me to lose focus and find myself having “read” a page without really retaining any of it. To prepare myself for my next group discussion, I’ll do my best to summarize some of the concepts that will help us Design for Resilience.

Resilience describes a system’s ability to withstand attack or failure not necessarily by preventing it, but by building thoughtful contingencies for the inevitable moment of vulnerability. Our text outlines six principles to promote resilience:

Defense in Depth (Independent Resilience)

It’s possible to flag unusual behavior during an attacker’s preparation. One clever example from the book is to keep track of DNS registrations of domain names similar to our own. An unusual registration might mean that an attacker is hoping to spoof our site, potentially to trick employees or customers into sharing personal information. It’s generally good practice to watch out for an individual who’s a bit too curious about how things work, or a program that’s scanning ports and applications.
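To make the DNS idea concrete, here’s a minimal sketch of how we might generate lookalike names for our own domain and compare them against a feed of new registrations. The function names, the tiny homoglyph table, and the idea of a "newly registered" list are my own assumptions for illustration; the book doesn’t prescribe an implementation, and a real monitor would cover far more variations.

```python
def lookalike_domains(domain: str) -> set[str]:
    """Generate simple typo variants of a domain such as 'example.com'.

    Illustrative only: assumes the domain contains a dot, and covers just
    four kinds of variation.
    """
    name, _, tld = domain.rpartition(".")
    # Hypothetical, far-from-complete table of look-alike characters.
    homoglyphs = {"o": "0", "l": "1", "i": "1", "e": "3"}
    variants = set()
    for i, char in enumerate(name):
        # Omission: drop one character.
        variants.add(name[:i] + name[i + 1:])
        # Repetition: double one character.
        variants.add(name[:i] + char + name[i:])
        # Homoglyph: swap in a similar-looking character.
        if char in homoglyphs:
            variants.add(name[:i] + homoglyphs[char] + name[i + 1:])
        # Transposition: swap two neighboring characters.
        if i < len(name) - 1:
            variants.add(name[:i] + name[i + 1] + char + name[i + 2:])
    # Never flag our own name, and ignore an empty result.
    variants.discard(name)
    variants.discard("")
    return {f"{variant}.{tld}" for variant in variants}


def suspicious_registrations(ours: str, newly_registered: list[str]) -> list[str]:
    """Return the newly registered domains that look like ours."""
    candidates = lookalike_domains(ours)
    return [d for d in newly_registered if d.lower() in candidates]
```

Calling `suspicious_registrations("example.com", ["examp1e.com", "unrelated.org"])` would flag only the first name, which we could then hand to a human for review.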

Outside of prevention at the point of vulnerability, the book also advises us to prepare for the situation after an attack has been executed. If we find that our system has been breached, we should have safeguards built in that localize the damage and prevent a complete shutdown (more on this later). Finally, it’s valuable to decide ahead of time what we’re willing to give up if we find ourselves compromised. Hopefully it never happens, but what can we stand to lose in order to stop the attack?

Controlling Degradation (Prioritize Features)

The text presents us with three key actions to take when planning our controlled degradation:

Differentiate Costs of Failures

This may mean that we’re going to redesign the failure process to retry less frequently, or to check more thoroughly for the cause of the failure and act accordingly. In a different example, we may learn that certain front-end features use a lot of CPU, but that we can disable them and still provide a downgraded, though functional, version of our platform in an emergency.
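Here’s a rough sketch of the second example: given a list of features with an estimated CPU cost, switch off the most expensive non-essential ones until we fit a budget. The `Feature` class, the cost numbers, and the notion of a single CPU budget are all assumptions I made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    name: str
    cpu_cost: float   # assumed relative cost, not a real measurement
    essential: bool   # essential features are never disabled here


def features_to_keep(features: list[Feature], cpu_budget: float) -> list[str]:
    """Disable the costliest non-essential features until we fit the budget."""
    kept = list(features)
    optional = sorted(
        (f for f in features if not f.essential),
        key=lambda f: f.cpu_cost,
        reverse=True,
    )
    for feature in optional:
        if sum(f.cpu_cost for f in kept) <= cpu_budget:
            break
        kept.remove(feature)
    # Essential features stay on even if they alone exceed the budget.
    return [f.name for f in kept]
```

The interesting design question isn’t the loop, it’s the `essential` flag: someone has to decide in advance which features the product can live without.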

Deploy Response Mechanisms

There are two common strategies if we’re receiving more requests than we can handle: load shedding and throttling. Load shedding means returning an error instead of processing the request, and throttling means delaying the request response to buy our system some more time. Here’s a nifty illustration of both techniques at work, from the book:

We still have to reject some traffic, but we’re serving important requests while doing it rather than crashing completely. Even better, thanks to throttling, we can process requests that otherwise might have been rejected. Keep in mind that we should be monitoring traffic and request response time to help us think critically about how we want to address load issues in the future.
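As a minimal sketch of how the two strategies might sit side by side, here’s a function that decides what to do with a request based on current load. The thresholds, the `high_priority` flag, and the rule that important requests are always served are my own simplifying assumptions, not something taken from the book.

```python
from enum import Enum


class Action(Enum):
    SERVE = "serve"        # process the request right away
    THROTTLE = "throttle"  # delay the response to buy time
    SHED = "shed"          # return an error instead of processing


def decide(load: float, high_priority: bool,
           throttle_at: float = 0.8, shed_at: float = 1.0) -> Action:
    """Pick an action for one request.

    `load` is the assumed fraction of capacity in use (1.0 means full).
    """
    if load < throttle_at or high_priority:
        # Plenty of headroom, or a request we have chosen to protect.
        return Action.SERVE
    if load < shed_at:
        # Getting busy: slow low-priority requests down instead of rejecting.
        return Action.THROTTLE
    # Over capacity: reject low-priority requests outright.
    return Action.SHED
```

In a real service the load figure would come from the monitoring mentioned above, and even high-priority traffic would need a hard limit somewhere.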

Automate Responsibly

Controlling the Blast Radius (Compartmentalize the System)

Role Separation

Location Separation

Time Separation

Failure Domains and Redundancies

So how much energy should we put into these copies? It would be great to produce 100 or 1,000 redundancies for every component, but such extreme measures are impractical. To help us understand where we should focus our resources, we can break our system’s components into categories based on their use. High-capacity components are what drive our business or user experience. These are core to our product and therefore deserve the most attention and redundancy. We can develop high-availability components by introducing failure domains for our high-capacity components. We want the critical parts of our product to be widely available in case of emergency, and reliable, automated alternate systems are a great way to achieve this. We want updates to these features to be controlled by limiting frequency and permissions.

A low-dependency component is a stripped down version of our high-capacity component. We should be prepared to deploy these types of components if an attack or incident has affected the dependencies of our primary components. Rather than shut down our service completely, we can use a low-dependency component to keep the lights on while we troubleshoot. It can be a useful exercise to even consider what low-dependency components would look like for some of our critical functions. If we lose 30% of our storage capacity, is it still possible for us to serve up our product?
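A tiny sketch of that idea, with made-up names: a full-featured page that needs a database, and a stripped-down version we fall back to when the database is unavailable. The `DependencyUnavailable` exception and the `database_up` flag are stand-ins I invented to simulate an outage.

```python
class DependencyUnavailable(Exception):
    """Raised when a dependency of the primary component is down."""


def full_profile_page(user_id: str, database_up: bool) -> str:
    # High-capacity component: relies on the database.
    if not database_up:
        raise DependencyUnavailable("database")
    return f"full profile for {user_id}"


def basic_profile_page(user_id: str) -> str:
    # Low-dependency component: no database needed.
    return f"basic profile for {user_id}"


def render_profile(user_id: str, database_up: bool) -> str:
    """Serve the full page when possible, the basic one otherwise."""
    try:
        return full_profile_page(user_id, database_up)
    except DependencyUnavailable:
        return basic_profile_page(user_id)
```

The fallback only catches the one failure it was designed for; any other error still surfaces, which is what we’d want while troubleshooting.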

We don’t want to skimp on security for our low-dependency components — the authors actually posit that it’s worthwhile to make these more secure than our high-capacity components. After all, if we’re using an alternate component, it means that something has gone wrong and we’re likely already at a disadvantage against an incident or attacker. We also want to make sure that our redundancies are regularly maintained and tested. We don’t want to have to constantly use them in production, but we want them to be ready when we need them.

Continuous Validation (Automate Resilience Measures)

  1. Discovering new failures
  2. Implementing validators for each failure
  3. Executing all validators repeatedly
  4. Phasing out validators when the relevant features or behaviors no longer exist
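The cycle above can be sketched as a small registry: we register a validator when we discover a failure, run every validator on each pass, and retire one when its feature goes away. The `ValidatorRegistry` class and its method names are hypothetical, and a real setup would run the checks on a schedule and alert on failures.

```python
from typing import Callable


class ValidatorRegistry:
    """Holds named checks that each return True when the system is healthy."""

    def __init__(self) -> None:
        self._validators: dict[str, Callable[[], bool]] = {}

    def register(self, name: str, check: Callable[[], bool]) -> None:
        # Step 2: implement a validator for a newly discovered failure.
        self._validators[name] = check

    def retire(self, name: str) -> None:
        # Step 4: phase out a validator whose feature no longer exists.
        self._validators.pop(name, None)

    def run_all(self) -> dict[str, bool]:
        # Step 3: execute every validator; call this repeatedly.
        return {name: check() for name, check in self._validators.items()}
```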

We’ll want to be vigilant about discovering these new failures by paying attention to bug reports, consulting experienced veterans on our team, or running tests like a fuzz test (using a program to generate a bunch of unusual or random requests). There are a ton of different ways to test and check on our systems. Another example is logging interactions between components and checking to see if they cross any of the boundaries we intended to set with our compartments.
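The logging example is simple enough to sketch. Assuming our logs can be reduced to (caller, callee) pairs and that we keep a list of the calls we intend to allow, a validator just reports every pair that isn’t on the list. The component names here are invented.

```python
# Hypothetical compartment rules: which component may call which.
ALLOWED_CALLS = {
    ("frontend", "api"),
    ("api", "database"),
}


def boundary_violations(
    interactions: list[tuple[str, str]],
    allowed: set[tuple[str, str]],
) -> list[tuple[str, str]]:
    """Return each logged (caller, callee) pair that crosses a boundary."""
    return sorted({pair for pair in interactions if pair not in allowed})
```

If the frontend ever shows up talking to the database directly, this is the kind of check that would tell us our compartments are leaking.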

We’re given five different examples from inside Google, including information about a library that can add delays or failures into a server. Google engineers can methodically add latency and other issues and see how their systems will react, checking specifically for red flags like cascading failures that allow one error to spread throughout a system.
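I don’t know what Google’s library looks like internally, so the following is only my own toy version of the idea: a wrapper that adds a delay and, with some probability, raises an error before calling the real function. Every name in it (`with_faults`, `InjectedFailure`, the parameters) is made up for this post.

```python
import random
import time
from typing import Callable, Optional


class InjectedFailure(Exception):
    """Raised by the wrapper to simulate a failing dependency."""


def with_faults(func: Callable, latency_s: float = 0.0,
                failure_rate: float = 0.0,
                rng: Optional[random.Random] = None,
                sleep: Callable[[float], None] = time.sleep) -> Callable:
    """Wrap `func` so calls are delayed and sometimes fail on purpose."""
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        if latency_s > 0:
            sleep(latency_s)  # injected latency
        if rng.random() < failure_rate:
            raise InjectedFailure("injected failure")
        return func(*args, **kwargs)

    return wrapper
```

Wrapping a dependency with something like `with_faults(fetch_user, latency_s=0.2, failure_rate=0.1)` in a test environment (where `fetch_user` is whatever function we want to stress) lets us watch whether callers degrade gracefully or start a cascade.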

Practical Advice

I’m grateful that I’m not in a position to make these decisions right now, but I know that understanding these concepts will help me better empathize with those who are. There’s no limit to what we can learn about software development and computer engineering, so it’s all about learning a little bit more every day. Keep it up, readers!

