Guiding Principles for Site Reliability Engineering (and the Engineer).
Being involved with SRE is exciting, scary, exhilarating, stressful, positive emotion, negative feeling, positive emotion, negative feeling ... ad infinitum. To be a good Reliability Engineer (not 'great' .. just 'good'), having the right lens to troubleshoot and find root causes is CRITICAL.
But it can be difficult....
to figure out where to start looking for the main cause of system instability. Of course, there is the in-your-face alert that gives you an indication of what may be wrong, but that is (in my humble experience) usually just the tip of a pointy icicle. Brrr, and ouch!
It is therefore no surprise that SREs may want to establish a set of guiding principles they can use while troubleshooting and fixing existing issues, and also when they are starting a greenfield project.
The list provided here is based on personal experience, discussions with other SREs, and a literature review of other Agile-inspired methodologies such as DevOps.
A set of SRE guiding principles
Depending upon the context (repairing vs. designing), each principle can also be framed as a question, the answer to which could shed light on issue resolution.
Principle # 1: Design your operating model to be scalable (or "is the operating model scalable?")
Operating models are based on the processes/frameworks in use and the people who have to implement the edicts of said processes/frameworks. A scalable operating model is one that can increase the scope of its influence without incurring a proportional increase in the cost of maintaining that influence.
Most operations teams will include IT support and some element of SRE, since these two functions are often the first line of defence against systemic disruptions. As the systems we run scale (i.e. become more distributed, with more web services, and therefore a larger surface area for mishap), we CANNOT (and should not) simply increase the number of SREs and/or IT support staff in step.
"But shouldn't more infrastructure mean more eyes for monitoring and fixing it?" you may ask.
Ideally, no.
This is not to say that adding more muscle to the team is discouraged. Rather, before beefing up the team, think about what can be improved within the context of the status quo. For example, is there an architectural change that would reduce the risk of system failure? Is the approach for triaging and prioritizing fixes clumsy, and could it be streamlined?
Takeaway: Don't scale your team before making sure the philosophical constructs of the operating model are optimized, standardized and ultimately scalable.
Must haves:
- Eliminate toil wherever and whenever possible.
- Don't keep things in your head. If it needs documentation to ensure reliability, so be it.
- Create runbooks for SOPs.
- Optimize your approach, optimize it a bit more, and then automate it (a toy example of codifying a runbook check follows these lists).
Avoid:
- Making team size directly proportional to the rising system load.
- Keeping things in your head.
- The mentality that automation is the only response to anything that can and does go wrong with the system's reliability.
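To make the "automate it" point concrete, here is a minimal sketch of what codifying a single runbook step might look like. The volume path, threshold and the alerting side effect are hypothetical stand-ins for whatever your runbook actually prescribes.

```python
#!/usr/bin/env python3
"""Toy example: a runbook step ("check disk usage on the app volume,
raise an alert if it is above 85%") codified so a scheduler can run it
instead of a human. Paths and thresholds here are made up."""

import shutil

# Hypothetical values; in a real setup these would come from config.
VOLUME = "/"
THRESHOLD_PERCENT = 85.0

def disk_usage_percent(path: str) -> float:
    """Return used space on `path` as a percentage of total capacity."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100

def main() -> None:
    used = disk_usage_percent(VOLUME)
    if used >= THRESHOLD_PERCENT:
        # In practice this would call your alerting or ticketing system;
        # printing stands in for that side effect here.
        print(f"ALERT: {VOLUME} is {used:.1f}% full (threshold {THRESHOLD_PERCENT}%)")
    else:
        print(f"OK: {VOLUME} is {used:.1f}% full")

if __name__ == "__main__":
    main()
```

The point is not the disk check itself; it is that once a manual step is written down as a runbook, it is usually a short hop to running it on a schedule and removing the toil entirely.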
Principle # 2: Adopt the engineering mindset (in fact it's expected; it's right there in the title).
Go deep and go wide. Yup! That's right. Technology is becoming increasingly complex, and if you want to keep up with this runaway train, you have to keep an eye out for that new tool that has proven successful in reducing toil and increasing system stability, and that looks promising for your own tech stack.
Along with the hard skills, don't forget that people in crisis look for a leader (and a system failure in the middle of a busy shopping season is the definition of a crisis). Think like a leader, and if you can't right now, observe those who do.
Remember, Wikipedia defines Engineering as
the practice of using natural science, mathematics, and the engineering design process to solve technical problems, increase efficiency and productivity, and improve systems
However, alongside the ability to design solutions that increase efficiency and productivity, one cannot ignore the human beings who will be impacted (positively or negatively).
Engineer your tech IQ and your people EQ. They both count.
Nice to do
- Let engineers carve out time to keep their tech skills sharp and current.
- Shift-left some of the responsibilities of the SRE team to developers. Example: developers should consider security-by-design or architect a scalable, fault-tolerant solution; these are things SREs will usually recommend anyway.
No, No, No.
- Turning your SRE team into a help desk / IT support function.
- Refusing to shift left.
Principle # 3: Make adding observability capabilities non-negotiable.
This is in line with the old adage, "what you don't measure, you can't improve".
There are many logging libraries that can be used for monitoring, collecting and gaining insight into a system's behavior. All the big CSPs offer log collectors, alerting systems and reporting panels as a standard set of services. There is little to no reason left to justify NOT building these capabilities into your software systems (a minimal sketch of what such instrumentation might look like follows the list below).
Non-negotiables.
- Assess system behavior using LETS (Latency, Error Rate, Traffic and Saturation) or STELA (Saturation, Traffic, Error Rate, Latency, Availability).
- Measure system behavior using MELT (Metrics, Events, Logs and Traces) and other quantitative measures.
- Continuously test the system to track its behavior (response rates, errors, timeouts, etc.).
- Always try to articulate the state of the system using LETS, STELA and MELT, BUT within the context of user experience. Numbers on their own mean nothing.
Avoid doing anything that contradicts the 4 points provided above.
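As an illustration (not a prescription), here is a minimal sketch of exposing the LETS signals from a Python service using the prometheus_client library. The metric names and the simulated work inside handle_request() are invented for the example.

```python
"""Sketch: exposing latency, error rate, traffic and saturation (LETS)
from a Python service with prometheus_client. Everything here is a toy."""

import random
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

REQUESTS = Counter("app_requests_total", "Traffic: total requests handled")
ERRORS = Counter("app_request_errors_total", "Error rate: failed requests")
LATENCY = Histogram("app_request_latency_seconds", "Latency: request duration")
IN_FLIGHT = Gauge("app_requests_in_flight", "Saturation proxy: concurrent requests")

def handle_request() -> None:
    REQUESTS.inc()
    IN_FLIGHT.inc()
    start = time.monotonic()
    try:
        time.sleep(random.uniform(0.01, 0.2))  # stand-in for real work
        if random.random() < 0.02:             # simulate an occasional failure
            raise RuntimeError("downstream call failed")
    except RuntimeError:
        ERRORS.inc()
    finally:
        LATENCY.observe(time.monotonic() - start)
        IN_FLIGHT.dec()

if __name__ == "__main__":
    start_http_server(8000)  # metrics exposed at http://localhost:8000/metrics
    while True:
        handle_request()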
Principle # 4: Make user-centric Service Level Agreements.
Points of view matter. Unlike this famous line from the movie Cloud Atlas,
Truth is singular. Its 'versions' are mistruths.
I believe truth is actually 'perspective'. What is good for the goose may not be good for the gander, and therefore it behooves us to focus on things that are truly needed and appropriate for our functions.
SLAs, SLOs, SLIs and Error Budgets are agreements SREs make with the system/product owners and, like all agreements, they must be developed in good faith and be reasonable.
A data aggregation system for cancer treatment may not require five 9s of availability, but it would be useless if data integrity were compromised. Similarly, a hospital management system can NEVER afford to be inaccessible or fragile; it could mean the difference between a patient surviving and facing other adverse outcomes.
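To make the SLI/SLO/error-budget relationship concrete, here is a back-of-the-envelope sketch. The traffic and failure numbers are invented, and the SLI here is simply "good requests over total requests" as seen from the user's side.

```python
"""Back-of-the-envelope error budget math for a request-based SLI.
All numbers are hypothetical."""

SLO_TARGET = 0.999            # 99.9% of requests should succeed
total_requests = 10_000_000   # hypothetical monthly traffic
failed_requests = 7_200       # hypothetical failures observed this month

sli = (total_requests - failed_requests) / total_requests   # user-centric SLI
error_budget = (1 - SLO_TARGET) * total_requests            # failures you can "afford"
budget_remaining = error_budget - failed_requests

print(f"SLI:              {sli:.4%}")                # 99.9280%
print(f"Error budget:     {error_budget:,.0f} failed requests")   # 10,000
print(f"Budget remaining: {budget_remaining:,.0f}")  # 2,800
# Still within the 99.9% SLO, but an SLA promising 99.99% (only 1,000
# allowed failures) would already be breached: SLO first, SLA second.
```

This is exactly why creating SLAs without knowing your SLOs is a trap; the arithmetic above is the whole argument.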
Do this.
- SLAs should come with penalties and fines if they are not met.
- Define SLIs from the user's perspective.
- Use SLIs to create SLOs (it seems backwards, but it works).
- Adopt the engineering mindset to tighten the SLXs as new application and system reliability tactics come to light.
Tsk, Tsk, Tsk.
- SLXs promising 100% success CANNOT HAPPEN, and even if they could, they would leave the SRE with no time for ANYTHING ELSE (including feature releases, which are the bread and butter for a lot of online products).
- Creating SLAs without knowing what your SLOs are. If your system can only do X% on the SLOs, agreeing to X+1% on the SLA is just silly, and you WILL fail.
Principle # 5: Don't be 'trigger (alert) happy'.
Alerts can exhaust their recipients, especially if there is one for every event. This is the classic 'cry wolf' strategy, and the day will come when the SRE team gets the alert, assumes it's a whole lot of noise over something minor, shuts off the beeper and goes back to sleeping, playing Xbox, watching their favorite sports team or whatever else they may be doing at the moment.
Inform SREs about a problem when..
- ..the problem impacts user experience and is one the SRE can resolve. A bad process flow, based on rules established by a business team, that impedes user experience cannot be fixed by an SRE.
Avoid at all costs.
- Alert mania. Warnings are not alerts and therefore should not be treated as such (a sketch of paging on budget burn rather than on every event follows this list).
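One way to avoid alert mania is to page on error-budget burn rate rather than on individual events. The sketch below is illustrative only: the 14.4x threshold is a commonly cited fast-burn figure (roughly "a 30-day budget gone in about two days"), and the one-hour window and numbers are assumptions, not a standard your system must follow.

```python
"""Sketch: page humans on error-budget burn rate, not on every warning."""

def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'sustainable' the budget is being spent."""
    budget_ratio = 1 - slo_target          # e.g. 0.001 for a 99.9% SLO
    return error_ratio / budget_ratio

def should_page(error_ratio_1h: float, slo_target: float = 0.999,
                threshold: float = 14.4) -> bool:
    """Page only on fast burn (14.4x ~ a 30-day budget gone in ~2 days)."""
    return burn_rate(error_ratio_1h, slo_target) >= threshold

# A warning-level blip (0.2% errors over the last hour) does not page...
print(should_page(0.002))   # False (burn rate 2x)
# ...but a genuine incident (2% errors over the last hour) does.
print(should_page(0.02))    # True  (burn rate 20x)
```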
Principle # 6: Pointing fingers is not ideal (but sometimes necessary).
The blameless post mortem is important and needs to be handled cautiously. However, its core tenet, that a person is not responsible for the problem, is founded on a falsehood. Sometimes people are the reason, and there is no harm in calling a spade a spade. What matters is how this spade-calling is done. (Suggestion: 'spading' in private is desirable.)
A blameless post mortem is the emperor's new clothes wrapped around his great-great-grandfather's withering body (the grandfather here being Root Cause Analysis). In the end, a BPM is just that: an RCA, a peeling of the onion to get to the main issues hampering reliability.
Do's.
- Ask questions, lots of them, from different points of view.
Don't.
- Publicly deride anyone.
- Avoid having the hard talk with those who need one.
I write to remember and if in the process, I can help someone learn about Containers, Orchestration (Docker Compose, Kubernetes), GitOps, DevSecOps, VR/AR, Architecture, and Data Management, that is just icing on the cake.