The Attentive Technical Book Club has started up again, and this time we’re looking to tackle Building Secure & Reliable Systems by Heather Adkins, Betsy Beyer, Paul Blankinship, Piotr Lewandowski, Ana Oprea, and Adam Stubblefield. The book is about 500 pages long and the group is looking to finish it in 10 weeks, which doesn’t give me much time to carefully review any topics that I didn’t fully grasp the first time around. This week (and likely in the future as well), I’ll use this blog to unpack one of the chapters I’ve recently read: Safe Proxies.
What we know so far
Safe Proxies is the first chapter of Part II in Building Secure & Reliable Systems. In Part I, we learn about why it’s important to build such systems and the various threats that our systems might face. Individuals or groups looking to take advantage of vulnerabilities in our systems are referred to as adversaries, and naturally they can take many forms. From hobbyists to hacktivists to insiders, it’s important to understand that there is no single version of a “hacker” or “criminal” trying to break our systems. Their backgrounds and motivations will vary, so we have to keep an open mind when building secure and reliable systems.
We’re also reminded that security and reliability go hand in hand. An unreliable system is not very secure, and an insecure system is certainly not reliable. It’s best to consider these factors from the very beginning of our work, but a safe proxy is one option when we have to add security to an existing system.
What’s a Safe Proxy?
According to the text, safe proxies are a framework that allows authorized persons to access or modify the state of physical servers, virtual machines, or particular applications. The framework can be added as a step between the user’s command and its execution to ensure that the command is safe and that the user has the appropriate permissions for the system in question. The proxy can also be written with a custom log, which makes it easier to keep track of the changes we’re making in case one of them causes an issue.
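To make that concrete, here’s a minimal sketch in Python of the “check, log, then execute” pattern the chapter describes. The command names, the ALLOWED table, and the group labels are all hypothetical, and a real proxy would sit behind an RPC interface rather than shelling out; the shape is the point: the client never runs the command directly.

import logging
import shlex
import subprocess

logging.basicConfig(level=logging.INFO)
audit_log = logging.getLogger("proxy-audit")

# Hypothetical permission table: which groups may run which commands.
ALLOWED = {
    "restart-frontend": {"group:admin"},
    "show-status": {"group:admin", "group:support"},
}

def proxy_execute(user: str, user_groups: set[str], command: str) -> None:
    """Run `command` on behalf of `user` only if policy allows it."""
    name = shlex.split(command)[0]
    permitted = ALLOWED.get(name, set())
    if not user_groups & permitted:
        audit_log.warning("DENY user=%s command=%r", user, command)
        raise PermissionError(f"{user} may not run {name}")
    # Every allowed action is logged before it runs, so there is a trail
    # to consult if a change later causes an issue.
    audit_log.info("ALLOW user=%s command=%r", user, command)
    subprocess.run(shlex.split(command), check=True)

In this sketch, a caller in group:support could run show-status but would be denied restart-frontend, and both outcomes land in the audit log.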
The authors provide this diagram to illustrate the restrictions imposed by a proxy used at Google:
Clients, such as engineers, salespeople, product users, or even automated scripts, can make a request to access a target system or application. This request could be a change, but as we can see, it could also simply be a request to view information. After all, security means protecting private data from those who don’t need to see it. The proxy itself includes both a logging service and an approval service, which may require review from a group of Approvers, likely more senior engineers.
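The approval service is easy to picture as well. Below is a hypothetical sketch, not the book’s implementation: a request has to be signed off by someone in an approver group before the proxy will execute it. The SENSITIVE_COMMANDS and APPROVERS sets are made up for illustration.

from dataclasses import dataclass
from typing import Optional

# Hypothetical model of the approval step: sensitive requests are held
# until someone from the approver group signs off.
SENSITIVE_COMMANDS = {"restart-frontend", "delete-index"}
APPROVERS = {"alice", "bob"}  # e.g. the more senior engineers

@dataclass
class Request:
    requester: str
    command: str
    approved_by: Optional[str] = None

def approve(request: Request, approver: str) -> None:
    if approver not in APPROVERS:
        raise PermissionError(f"{approver} is not an approver")
    if approver == request.requester:
        raise PermissionError("requests cannot be self-approved")
    request.approved_by = approver

def ready_to_execute(request: Request) -> bool:
    name = request.command.split()[0]
    if name not in SENSITIVE_COMMANDS:
        return True  # benign or read-only requests pass straight through
    return request.approved_by is not None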
Proxies can abstract processes to make sure they are executed safely and efficiently. The authors provide the example of rate limiting a system restart, which should happen gradually so that an error in one part of the rollout doesn’t propagate to the rest. They also point out some drawbacks of using a proxy: engineers may lament that they no longer have direct access to certain resources, and adding this extra step can slow down work that was easier without it (though it was also easier to make a mistake and break a system). There’s an added cost to running a proxy, and it introduces a central point of failure: your team could be hamstrung if the proxy goes down or is taken over by an adversary. Google makes use of multiple, redundant proxies to address this risk.
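That rate-limiting example translates naturally to code. Here’s a rough sketch; the restart callable, batch size, and pause are my own assumptions, but the idea is to restart a fleet a few hosts at a time so a failure is caught before it spreads.

import time
from typing import Callable, Sequence

def rolling_restart(
    hosts: Sequence[str],
    restart: Callable[[str], bool],  # per-host restart; returns True on success
    batch_size: int = 3,
    pause_s: float = 30.0,
) -> None:
    """Restart hosts a few at a time instead of all at once."""
    for i in range(0, len(hosts), batch_size):
        batch = list(hosts[i:i + batch_size])
        if not all(restart(h) for h in batch):
            # Halt early so a failure in one batch doesn't propagate
            # across the whole fleet.
            raise RuntimeError(f"restart failed in batch {batch}; halting rollout")
        time.sleep(pause_s)  # give the restarted hosts time to become healthy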
The Google Example
This book was written from the perspective of best practices often used at Google. In the chapter on Safe Proxies, the authors use Google’s Tool Proxy to illustrate how this framework improves security. The Tool Proxy runs between request and execution for Google’s command line tools, which control the majority of their administrative operations. With the proxy in place, no one at Google can write directly to a production server, which does seem like a good way to improve security and reliability.
Tool Proxy instances are deployed as Borg jobs, each configured with a policy that specifies which commands are allowed, who can run them, and whose approval is required. The authors provide this example of a Tool Proxy policy:
config = {
  proxy_role = 'admin-proxy'
  tools = {
    borg = {
      mpm = 'client@live'
      binary_in_mpm = 'borg'
      any_command = true
      allow = ['group:admin']
      require_mpa_approval_from = ['group:admin-leads']
      unit_tests = [{
        expected = 'ALLOW'
        command = 'file.borgcfg up'
      }]
    }
  }
}
What we’re seeing here is that members of the group admin are allowed to run commands through the proxy’s RPC (remote procedure call) methods, provided they get approval from a member of the group admin-leads. Once that approval is received, the proxy will execute the command! All the while, it’s logging its actions for auditing purposes, of course.
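The chapter doesn’t show how the Tool Proxy actually evaluates a policy like this, but the fields suggest the shape of the check. Here’s my own hypothetical rendering of that logic in Python; the field names and group labels come from the policy above, and everything else is assumption (the command itself isn’t inspected here because any_command is true).

# Hypothetical evaluation of a policy like the one above; the real Tool
# Proxy implementation is not shown in the book.
POLICY = {
    "allow": {"group:admin"},
    "require_mpa_approval_from": {"group:admin-leads"},
}

def evaluate(caller_groups: set[str], approver_groups: set[str]) -> str:
    # The caller must belong to an allowed group...
    if not caller_groups & POLICY["allow"]:
        return "DENY"
    # ...and a multi-party approval must come from someone in the
    # required approver group before the command runs.
    if not approver_groups & POLICY["require_mpa_approval_from"]:
        return "PENDING_APPROVAL"
    return "ALLOW"

# Mirrors the unit test in the policy: an admin with approval from
# admin-leads is allowed; a caller outside the allow list is not.
assert evaluate({"group:admin"}, {"group:admin-leads"}) == "ALLOW"
assert evaluate({"group:sales"}, set()) == "DENY"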
It’s not surprising that this practical example takes advantage of the many aspects of Safe Proxies detailed in Chapter 3 of Building Secure & Reliable Systems. And it’s also no surprise that I’ve seen patterns like this used frequently in my workplace. It’s convenient to allow members of our Client Strategy Team to make certain backend changes on behalf of their clients, but granting additional DB access presents a security and reliability risk. We therefore built a Slack-integrated tool that checks things like syntax and permissions before applying the change. Ultimately, it provided the ideal outcome of a proxy: maintaining security and reliability while also streamlining a process. The system works!
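The details of our tool aren’t important, but the general shape is easy to sketch. The team roster and command syntax below are invented for illustration; the point is just that a request is syntax-checked and permission-checked before anything touches the backend.

import re

# Hypothetical checks in the spirit of the Slack-integrated tool
# described above; the roster and command format are made up.
CLIENT_STRATEGY_TEAM = {"carol", "dave"}
UPDATE_PATTERN = re.compile(
    r"^set (?P<field>plan|owner) for client (?P<client_id>\d+) to (?P<value>\S+)$"
)

def handle_request(user: str, text: str) -> str:
    if user not in CLIENT_STRATEGY_TEAM:
        return "denied: requester is not on the Client Strategy Team"
    match = UPDATE_PATTERN.match(text)
    if match is None:
        return "rejected: request does not match the expected syntax"
    # A real tool would log the request here and apply the change to the
    # backend on the user's behalf.
    return f"ok: setting {match['field']} for client {match['client_id']} to {match['value']}"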
Sources:
- Building Secure and Reliable Systems by Heather Adkins, Betsy Beyer, Paul Blankinship, Piotr Lewandowski, Ana Oprea, and Adam Stubblefield
- Large-scale cluster management at Google with Borg