What I Learned at Work this Week: A Culture of Security and Reliability

Mike Diaz
8 min read · May 2, 2021
Photo by fauxels from Pexels

This was a busy week at work, but one where I learned a lot more about the internal workings of our systems than general concepts that I can share. For the first time in a while, I woke up on Saturday morning without knowing what I was going to write about! I was tempted to take the weekend off, but I know myself well enough to realize that if I break my routine I might never come back to it. So instead, I’ll bring things back to Building Secure & Reliable Systems, which we’re still reading in our engineering book club.

Defining a Healthy Security and Reliability Culture

We’ll be reviewing the final few chapters of the book this Friday and, though they’re not very technical, they still contain valuable lessons for developing a secure and reliable product. Culture has always been of special interest to me, and it’s another important aspect of security and reliability. The chapter splits the topic into two main parts, the first of which is defining our culture. The authors describe six points of focus here:

Culture of Security and Reliability by Default

We want security and reliability to be on our minds from the early stages of product development. Rather than building first and then attempting a retrofit to address security/reliability concerns, we want to develop a culture where we choose patterns and frameworks that will, for example, protect against XSS and SQL injection or avoid memory corruption errors.
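To make that concrete, here’s a minimal Python sketch (my own illustration, not an example from the book) of what “secure by default” can look like in practice: parameterized queries so user input is never spliced into SQL, and output escaping so user-supplied text can’t turn into script:

```python
import sqlite3
import html

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, bio TEXT)")
conn.execute("INSERT INTO users VALUES (?, ?)", ("mike", "<b>hello</b>"))

def find_user(conn, name):
    # Parameterized query: the driver treats the value as data, so input
    # like "x' OR '1'='1" cannot change the shape of the SQL statement.
    return conn.execute(
        "SELECT name, bio FROM users WHERE name = ?", (name,)
    ).fetchone()

def render_bio(bio):
    # Escape on output so user-supplied markup can't become live HTML/JS.
    return f"<p>{html.escape(bio)}</p>"

row = find_user(conn, "mike")
print(render_bio(row[1]))  # prints <p>&lt;b&gt;hello&lt;/b&gt;</p>
```

The point is that when the framework or library makes the safe path the easy path, developers don’t have to remember to be careful at every call site.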

Culture of Review

Not all PR reviews are created equal. When code is up for review, we’re ideally considering security and reliability (because we have a culture of doing so by default), but that means we have to know what these things look like. Reviewers should have enough context to make decisions, especially if they’re reviewing a change to, say, an access request.

This context comes from education — reviewers should know the goals and expectations of a project and be trained in providing structured, documented feedback. If goals and feedback are clear, it can help address hard feelings when an author is asked to revise their code. And of course all changes should require review, even if they’re being proposed by a team lead or code owner.

Culture of Awareness

This section was especially insightful because it offered some suggestions based on how Google maintains vulnerability awareness among its employees. Awareness helps us understand when a situation is especially risky (a public network vs. a VPN) or which employees might be more attractive targets for attackers (executives, or anyone with broad access). If we are aware of the risks, we can better plan for them.

The authors explain that hands-on, interactive messaging is especially effective for learning and retaining information. For example, Google has produced a game for learning about cross-site scripting risks and how to avoid them. They also employ public presentations (ideally interactive), thorough documentation (necessary to maintain the specific nuance that more engaging methods might gloss over), and awareness campaigns. At Google, they’ve developed a security and reliability newsletter called “Testing on the Toilet,” which provides tips that can be read while developers are in the restroom. Personally I wonder if this takes focus on work a little too far, but it’s apparently been quite effective for them.

Culture of Yes

As we learn about the various risks and attackers who may expose vulnerabilities in our products, it’s natural to become conservative with changes and updates. Our text warns that this perspective could create a “culture of no,” defaulting to avoiding anything new because it could have adverse effects on our environment. Becoming too conservative removes our opportunity for innovation and growth; it’s more effective to embrace change while maintaining a reasonable caution around security and reliability.

At Google, the authors have developed a strong relationship with feature developers and site reliability engineers, but not every company has an entire team devoted to SRE (Site Reliability Engineering). If you do, it makes sense to do the work and build that relationship. The SRE team can offer suggestions which will help development rather than halt it before it starts. If that’s not an option, however, it’s helpful to take responsibility away from individuals by adopting design strategies such as least privilege, resiliency, and testing. With those in place, vulnerabilities that were potentially missed by human developers can be caught by our security systems.

The book also references the idea of an error budget, which allows for some errors but sets a threshold for acceptability. Once we reach a certain scale, it’s unrealistic to expect zero errors on every service and platform, so agreeing on a budget that leaves room for experimentation without degrading the user experience can be a big help in developing that Culture of Yes.
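As a rough illustration (the numbers are assumed, not taken from the book), an error budget is just the complement of the availability target over some window:

```python
# Back-of-the-envelope error budget math, assuming a 99.9% availability SLO
# over a 30-day window (illustrative numbers only).
slo = 0.999
period_minutes = 30 * 24 * 60

error_budget = 1 - slo                      # fraction of the window allowed to fail
budget_minutes = period_minutes * error_budget

print(f"Allowed downtime per 30 days: {budget_minutes:.1f} minutes")
# About 43.2 minutes: while budget remains, risky launches can proceed;
# once it's spent, the team shifts focus back to reliability work.
```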

Culture of Inevitability

Error budgets transition well into the understanding that failure of some sort is inevitable. The culture of inevitability means that we prepare for such failures with documented incident response plans and even drills or role-playing exercises.

Culture of Sustainability

This section could have been about the long-term efficacy of our code, but I was happy to learn that it focuses more on avoiding employee burnout. Handle incident response and on-call rotations in reasonable shifts, avoiding extra-long working hours when possible. And when an incident forces our hand, we should stress that the situation is temporary. It’s not sustainable to maintain an “all hands on deck” pace permanently, so build a culture where the norm is much lighter than that, with the ability to flex up in an emergency.

Changing Culture Through Good Practice

As I was reading through the Defining Culture section, I was looking around for tips on how to develop the culture we were aiming for. That advice is largely contained in the next section, on changing culture through good practice. It’s one thing to teach people what you want them to do, but it’s not always easy to get them to follow through and put those principles into practice.

Align Project Goals and Participant Incentives

It’s not enough to say that we’re focused on security and reliability — we have to reward and recognize those who follow through on that value. Those standards should be documented and applied consistently so that there is no confusion over priorities when a project is being planned or developed. SRE work might not be as exciting or marketable as fast, flashy feature development, but the people doing it should be promoted as frequently as any other type of contributor.

Reduce Fear with Risk-Reduction Mechanisms

Reducing fear of failure goes hand-in-hand with a Culture of Yes. If we’re working on developing a Culture of Security and Reliability, we may start to hit roadblocks when proposing certain software updates that seem risky. To address these fears, the authors suggest mechanisms like canaries and staged rollouts, which greatly reduce the blast radius if a vulnerability is exposed. “Dogfooding” and “opt in before mandatory” are also versions of staged rollouts, whereby those who feel comfortable can integrate new features before they’re required for all. Dogfooding is short for “eating your own dog food”: applying a change to a part of the system that you own, to prove its safety and viability, before imposing it on others.
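Here’s a small, hypothetical sketch of how a staged rollout might be wired up in Python: users are bucketed deterministically by ID, so the exposed population grows stage by stage while any single stage keeps the blast radius small (the helper names and percentages are my own, not from the book):

```python
import hashlib

# Hypothetical staged-rollout helper (my own sketch, not code from the book).
# Users are bucketed deterministically, so the same user always gets the same
# decision while the rollout percentage ramps up stage by stage.
ROLLOUT_STAGES = [1, 10, 50, 100]  # percent of users exposed at each stage

def in_rollout(user_id: str, feature: str, percent: int) -> bool:
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100     # stable bucket in [0, 99]
    return bucket < percent

# Stage 1: roughly 1% of users (the canary group) see the new code path.
print(in_rollout("user-42", "new-db-proxy", ROLLOUT_STAGES[0]))
```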

Make Safety Nets the Norm

On the flip side of introducing a potentially risky new feature, security measures may prompt us to alter familiar processes to reduce risk (e.g., “we’re going to change the way people access our DB by implementing a proxy”). Members of our team may be resistant to this change because it will affect their workflow — they may even complain that it will make it impossible for them to do their jobs in certain situations. To address these concerns, we can implement emergency mechanisms that revert the change if it truly does turn out to be a blocker at a critical juncture. The authors take care to point out that “break glass” mechanisms should be reserved for genuine emergencies. We want our team to have the confidence that these mechanisms exist if needed, but not to use them to circumvent our new secure practices.
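Here is a hedged sketch of what such a “break glass” path might look like in code (entirely my own illustration; the function names and logging policy are assumptions, not the book’s): the secure path is the default, and the emergency bypass demands a justification and leaves an audit trail.

```python
import logging
from datetime import datetime, timezone

logger = logging.getLogger("break_glass")

def open_connection_via_proxy(user):
    # Placeholder for the normal, proxied (audited) database path.
    return f"proxy-connection for {user}"

def open_direct_connection(user):
    # Placeholder for the emergency-only direct path.
    return f"direct-connection for {user}"

def connect_to_db(user, break_glass=False, justification=""):
    if not break_glass:
        return open_connection_via_proxy(user)
    if not justification:
        raise ValueError("break-glass access requires a written justification")
    # The bypass still works, but it is loud: logged with who, when, and why,
    # so it can be reviewed after the emergency.
    logger.warning(
        "BREAK GLASS by %s at %s: %s",
        user,
        datetime.now(timezone.utc).isoformat(),
        justification,
    )
    return open_direct_connection(user)
```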

Increase Productivity and Usability

As discussed in the previous point, we may be facing an uphill battle if there is a perception that increased security or reliability will negatively impact productivity. We can address these concerns by finding and promoting security measures that improve productivity, or at least by stressing the parts of our proposals that do. The text provides an example in which a password-rotation requirement was relaxed because it turned out to be riskier to frequently make users create and memorize new credentials. It also discusses self-service security portals for approving software. Rather than relying on a central team of SREs, we can automate implementation. Since seeking approval for new software can still be a bottleneck, Google introduced a self-help portal called Upvote. If the SRE team cannot approve a new application quickly enough, developers can seek peer approval on Upvote, with colleagues vouching for its safety and reliability. The authors admit that they have seen some non-work software approved through Upvote (video games), but they don’t appear concerned that a less-than-secure application will slip through.

Overcommunicate and Be Transparent

I’ll insert my own perspective here: transparency builds trust and should be implemented at every possible level of our organizations. Trust is faith that the other party is being honest, and by demonstrating consistent transparency we build that trust with concrete examples of honesty. In the case of security and reliability initiatives, communication can come in the form of documenting the rationale behind decisions and benchmarks, providing platforms for feedback (not just saying “we’re open to feedback” but actually creating a channel for that feedback and demonstrating that it’s taken seriously), building dashboards, and sending frequent updates.

Build Empathy

Disagreements between two teams are often framed in the context of “you don’t understand what this will do to our process.” Empathy is the development of that understanding. At Google, they build empathy through shadowing or temporary job swaps, ranging from a few hours to six months. The authors describe how SREs shadow each other to learn more about other parts of their department. The longer version, Mission Control (a six-month program in which non-SREs temporarily join the organization), requires more management oversight and buy-in, but reaps huge benefits. Finally, simple recognition of good work through thank-you notes and small bonuses is always helpful for empathy and morale.

None of these points are surprising or particularly unique at face value, but it’s useful to have them all gathered in one place. The lesson here is that there’s always more we can be doing to foster a culture of security and reliability. We should see significant return on our invested time and energy.
