DevOps and SREs are the new Linux admins. But what the heck does this mean?
You have probably heard these buzzwords already. With the current boom in cloud computing, “Linux admins” are going out of fashion and “DevOps Engineers” and “SRE specialists” are popping up all around the globe. But what do these terms, or positions, actually involve? Should employees with the above-mentioned titles work side by side, or should they be afraid of being replaced by one another? Let’s shed some light on this topic.
DevOps
Let’s start with DevOps. It is a community-driven approach to organisational structure, practices, and tools, which:
- Speeds up the delivery of applications and thus reduces time to market
- Implements small, gradual changes rather than big rollouts
- Brings focus on tooling and automation
DevOps is an abstract concept, a culture, an ideology or philosophy if you will. It was developed as a reaction to typical organisational problems. One of the biggest obstacles in the traditional development flow was the so-called “siloed” teams. Imagine a giant wall, with folks from Development and Operations sitting on either side of it. They can’t see each other, and they can’t communicate with each other. A typical deployment process looked like this: Developers created a new version of a product, and then threw it over the wall to the Ops folks to “take care of it.”
The main problem here (apart from the lack of communication and shared knowledge) is that these two groups have vastly different goals. Developers want to build new features and updates and ship them ASAP, whereas the Ops folks approach things in an “if it ain’t broke, don’t fix it” kind of way. They are concerned with stability more than anything else, and with each new feature or update, they can already see yet another sleepless night.
The DevOps methodology transforms this concept. It tears down the wall and brings both teams together, often into a single person. Hence the name DevOps: Developers + Operations working together, deploying, maintaining and automating stuff.
Nowadays, the trend is to bring other teams into direct cooperation as well. A good example would be connecting sales and marketing departments, making the most of different competencies and shared knowledge to create a less siloed, more fluent workflow. But the term DevOps is probably here to stay, as something like “DevPMOpSale” doesn’t sound quite as fancy, does it?
SRE
We’ve talked about DevOps as an abstract ideology. SRE, or Site Reliability Engineering, on the other hand, is a concrete, prescriptive way of implementing very similar principles. It was designed by Google for its internal purposes, in parallel with and at around the same time as the DevOps principles. Later on, Google decided that SRE principles should be shared with the public and become a common practice. They even published an SRE book that you can read online for free.
SRE actually covers most of the DevOps ideology with concrete principles. As Google folks like to say, “class SRE implements DevOps”. If you dig deep into the SRE books, you will find principles of shared ownership (mixed teams), toil and automation, as well as an emphasis on small increments.
SRE also brings additional terms into the mix, such as:
- SLI: Service-Level Indicator
  - A measurement of a service’s behaviour, e.g. is the latency of a request below 300ms?
- SLO: Service-Level Objective
  - The target amount (percentage) of the SLI that must be “healthy”, e.g. 99.9% of the requests will have a latency below 300ms.
- SLA: Service-Level Agreement
  - An agreement between the service provider and the client on what happens when the defined SLO is not met, e.g. compensation for the loss of profit.
- PM: Post Mortem
  - A blameless evaluation of service incidents, with an analysis of the root cause and the next steps to prevent it from happening again.
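To make the SLI and SLO examples above a bit more tangible, here is a minimal Python sketch. It assumes the SLI is computed as the fraction of requests with latency below 300ms and the SLO target is 99.9%, as in the examples. The function names and the sample latencies are made up for illustration; a real setup would read these numbers from a monitoring system.

```python
from typing import Sequence

LATENCY_THRESHOLD_MS = 300.0  # threshold from the SLI example in the text
SLO_TARGET = 0.999            # 99.9%, from the SLO example in the text


def latency_sli(latencies_ms: Sequence[float],
                threshold_ms: float = LATENCY_THRESHOLD_MS) -> float:
    """Return the fraction of requests with latency strictly below the threshold."""
    if not latencies_ms:
        raise ValueError("need at least one request to compute an SLI")
    good = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return good / len(latencies_ms)


def meets_slo(sli: float, target: float = SLO_TARGET) -> bool:
    """Return True when the measured SLI reaches the SLO target."""
    return sli >= target


if __name__ == "__main__":
    # Hypothetical sample: 998 fast requests and 2 slow ones.
    sample = [120.0] * 998 + [450.0] * 2
    sli = latency_sli(sample)
    print(f"SLI: {sli:.4f}, SLO met: {meets_slo(sli)}")
```

The SLA is deliberately left out of the sketch: it is a contract about what happens when `meets_slo` returns `False`, not something you compute.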
SRE relies on a method called the Error Budget. With this method, whenever you are about to break your defined SLO, or already have, all deployments of new features, updates and maintenance windows are postponed, and all work is focused on reliability until you have enough error budget again or the SLO period ends.
So in summary, SRE and DevOps are not enemies, and they are not just friends who complement each other either. It’s more like one guy talking about a dream house and the other guy actually picking up the tools and starting to build.
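The arithmetic behind the error budget can be sketched in a few lines of Python. This is only an illustration under some assumptions: the budget is counted in failed requests and taken as the share of requests the SLO allows to fail (1 minus the SLO target), and the function names and request counts are hypothetical.

```python
def error_budget(total_requests: int, slo_target: float) -> float:
    """Number of requests that may fail in the SLO period without breaking the SLO."""
    return total_requests * (1.0 - slo_target)


def budget_remaining(total_requests: int, failed_requests: int,
                     slo_target: float) -> float:
    """Error budget left after subtracting the failures seen so far."""
    return error_budget(total_requests, slo_target) - failed_requests


def deployments_allowed(total_requests: int, failed_requests: int,
                        slo_target: float) -> bool:
    """Feature rollouts continue only while some error budget is left."""
    return budget_remaining(total_requests, failed_requests, slo_target) > 0


if __name__ == "__main__":
    # Hypothetical period: 1,000,000 requests against a 99.9% SLO,
    # which gives a budget of roughly 1,000 failed requests.
    for failed in (400, 1500):
        left = budget_remaining(1_000_000, failed, 0.999)
        allowed = deployments_allowed(1_000_000, failed, 0.999)
        print(f"failed={failed} remaining={left:.0f} deploy={allowed}")
```

In practice, teams often stop deploying before the budget hits zero (the “about to break” case above), so a real gate would likely compare against a safety margin rather than zero.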
FAQs
Q1: What is DevOps?
DevOps is a community-driven approach, culture, and philosophy for organizational structure. It aims to speed up the delivery of applications, implement small, gradual changes, and bring a strong focus on tooling and automation.
Q2: What core problem was DevOps created to solve?
DevOps was developed as a reaction to “siloed” teams, particularly the functional barrier between Development and Operations. These two teams traditionally had conflicting goals: developers wanted to ship new features quickly, while operations prioritized stability above all else, slowing things down.
Q3: How does the DevOps methodology solve this problem?
DevOps tears down the wall between Development and Operations, bringing the teams to work together and share knowledge. The name itself reflects this, combining “Developers” and “Operations” to work collaboratively on deploying, maintaining, and automating products.
Q4: What is Site Reliability Engineering (SRE)?
SRE is a concrete and prescriptive way of implementing principles that are very similar to DevOps. It was originally designed by Google for its own internal use before being shared publicly as a common practice.
Q5: How do DevOps and SRE relate to each other?
SRE is essentially a specific implementation of the DevOps philosophy. As Google states, “class SRE implements DevOps.” They are not competing ideas; rather, DevOps is the abstract ideology (“the dream house”), and SRE is the concrete practice of applying it (“picking up the tools and starting to build”).
Q6: What are the key terms that SRE brings into practice?
SRE introduces several specific terms, including:
- SLI (Service-Level Indicator): A direct measurement of a service’s behavior, like request latency.
- SLO (Service-Level Objective): The target percentage for an SLI to be considered healthy (e.g., 99.9% of requests having a latency below 300ms).
- SLA (Service-Level Agreement): A contract that defines what happens if an SLO is not met, such as compensation for the client.
- PM (Post Mortem): A blameless evaluation of a service incident to analyze the root cause and prevent it from happening again.
Q7: What is the “Error Budget” in SRE?
The Error Budget is a core SRE method. The budget is the amount of unreliability that the defined SLO still allows, and every failure consumes part of it. Once the budget is used up, or is about to be, all new feature deployments and updates are paused, and all work is refocused on improving reliability until the budget is restored or the SLO period ends.