Previously we blogged about improving DevOps and SRE job interviews, where we mentioned that DevOps and SRE roles come in various forms but usually involve two broad types of tasks: building things or fixing things. Imagine a spectrum ranging from 0% to 100% that shows the average share of time engineering employees spend building things versus fixing things. Software engineers lean towards mostly building things, whereas DevOps and SREs should ideally sit in the middle of that spectrum, spending around 50% of their time building things (automation, tools, systems engineering) and the other 50% fixing things.
As a company goes through different stages, these time allocations can change. For example, as a company scales, systems can break more often because they are pushed to their limits. The company may focus on new features and target new users, paying less attention to fixing root causes or recurring issues. In other cases, a company has legacy systems with a low Mean Time Between Failures (MTBF) that are difficult or expensive to change. Regardless of the underlying causes, it is common to see DevOps and SREs spend significant amounts of time fixing things. This is not necessarily a bad thing; it’s partly the nature of the work involved in these roles, and some people find troubleshooting very rewarding. However, if your DevOps and SREs spend most of their time fixing issues over sustained periods, then there is a root cause that needs to be addressed. It could also indicate that you do not have enough DevOps or SREs in the company.
This leads us to the key difference between hiring DevOps/SREs and hiring software engineers. When a role involves spending 50% or more of the time on troubleshooting and fixing things, it is crucial to look for that skill in the hiring process. You want candidates who can do this alongside building automation and tools and doing systems engineering.
Fixing things involves identifying and isolating symptoms, digging deeper to find root causes, and deciding whether to fix, defer or work around them. DevOps and SREs have to deal with the stressful situations that arise from broken systems, especially ones that cause outages and impact users directly. Being able to stay calm and think clearly under pressure is important. So is being able to communicate in a timely and concise manner. Furthermore, when dealing with production systems, an attitude of “playing it safe” works best: for example, components are double-checked or backed up before they are changed, and revert steps are prepared in advance.
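To make the “playing it safe” idea concrete, here is a minimal sketch of a change that is backed up first and reverted if a check fails. The function name `safe_edit`, the `.bak` suffix and the `check` callback are all assumptions made for illustration; real production changes would normally go through your own tooling and change process.

```python
import shutil
from pathlib import Path
from typing import Callable


def safe_edit(path: Path, new_text: str, check: Callable[[Path], bool]) -> bool:
    """Back up `path`, write `new_text`, and revert if `check` fails.

    Returns True if the change was kept, False if it was reverted.
    `check` is a hypothetical validation hook supplied by the caller.
    """
    backup = path.with_name(path.name + ".bak")
    shutil.copy2(path, backup)  # back up before touching anything
    path.write_text(new_text)   # apply the change
    try:
        ok = check(path)        # double-check the result
    except Exception:
        ok = False              # a crashing check counts as a failed check
    if not ok:
        shutil.copy2(backup, path)  # revert step, prepared in advance
    return ok
```

A caller might pass a `check` that validates a config file or probes a health endpoint; the point is only that the backup and the revert path exist before the change is made.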
In the same way that you would never hire a software engineer without seeing their coding ability, you should test DevOps and SREs on their troubleshooting abilities. The best way we have found to do this is to use live, running scenarios with issues deliberately introduced. They provide a rich environment for candidates to demonstrate their troubleshooting skills, how careful they are when making changes to the system, how well they articulate their hypotheses, and whether they put a hack or workaround in place or fix the root cause.
This can be done by running live environments, breaking different components in the stack, and then asking candidates to diagnose and fix the issues. Their approach and results can then be assessed. For example, if you are hiring for junior positions, you might want to check the basics with 1-2 easy scenarios; if you are hiring for senior positions, a longer test of 3-5 scenarios covering different aspects is better suited.
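As a rough sketch of how such a test could be planned, the snippet below picks scenarios by seniority: up to two easy ones for junior roles, and up to five covering different areas for senior roles. The scenario names, areas and the `plan_test` function are hypothetical examples, not a recommended catalogue.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Scenario:
    name: str
    area: str        # part of the stack that is broken
    difficulty: str  # "easy" or "hard"


# Hypothetical pool of broken-environment scenarios, for illustration only.
POOL = [
    Scenario("full-disk", "storage", "easy"),
    Scenario("stopped-web-server", "services", "easy"),
    Scenario("bad-dns-resolver", "networking", "hard"),
    Scenario("db-connection-exhaustion", "database", "hard"),
    Scenario("expired-tls-cert", "security", "hard"),
]


def plan_test(level: str, pool: List[Scenario] = POOL) -> List[Scenario]:
    """Return the scenarios to run for a given seniority level."""
    if level == "junior":
        # Check the basics: at most two easy scenarios.
        return [s for s in pool if s.difficulty == "easy"][:2]
    if level == "senior":
        # Cover different aspects: one scenario per area, at most five.
        picked, seen = [], set()
        for s in pool:
            if s.area not in seen:
                seen.add(s.area)
                picked.append(s)
        return picked[:5]
    raise ValueError(f"unknown level: {level!r}")
```

Keeping the pool as plain data makes it easy to add or retire scenarios without touching the selection logic.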