Revolgy stack: empowering Ops team with efficient tools and practices
Revolgy stack is a comprehensive solution that empowers our team to manage incidents and provide top-notch observability effectively.
This article introduces a series of use cases that focus on the Revolgy Operations Team’s observability and inner workings. We will share our experiences, highlighting the pain points we encountered and how we overcame them by implementing an optimized toolset tailored to our operational needs. We aim to enhance reliability and observability, ensuring seamless operations for our clients.
Join us on this journey as we delve into the intricacies of our operational processes and share insights on how the Revolgy stack has revolutionized our approach to incident management.
Let’s look more closely at each pain point and how our team addressed it. We will focus on the following operational pains:
- Dealing with inconsistent monitoring tools and systems
- Struggling with vendor/provider agnosticism
- Complex and time-consuming updates and maintenance
- Inefficient workflows and manual processes
- Manual checks and inadequate quality assurance
Inconsistent monitoring tools and systems
In our journey toward effective incident management and operations management (IM/OM), we experienced the frustration of running different variations of Prometheus, including the operator and the vanilla-OS versions. Inconsistent Prometheus rules across environments and clients added further complexity.
To address these challenges, we adopted the kube-prometheus-stack, an open-source project with excellent community support. The stack bundles widely used, well-documented monitoring and alerting tools, such as Prometheus Operator, Grafana, and Alertmanager, packaged as a Helm chart deployment.
We combine the kube-prometheus-stack with AWS CloudWatch and Google Cloud Operations. Even so, the kube-prometheus-stack remains the base of the Revolgy stack, giving us standardized configurations that we customize to each client’s specific needs.
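To make the idea of standardized rules concrete, here is a minimal sketch of how rule drift between environments could be detected. It is an illustration only, not our actual tooling: the environment names, alert names, and the find_rule_drift helper are all hypothetical, and it assumes the alert rules of each environment have already been loaded into a plain dictionary.

```python
import hashlib
from collections import defaultdict


def rule_fingerprint(expr: str) -> str:
    """Hash a PromQL expression after collapsing runs of whitespace."""
    normalized = " ".join(expr.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:12]


def find_rule_drift(rules_by_env):
    """Return alerts that differ between, or are missing from, environments.

    rules_by_env maps an environment name to {alert_name: promql_expr}.
    """
    # alert name -> expression fingerprint -> environments using that variant
    variants = defaultdict(lambda: defaultdict(list))
    for env, rules in rules_by_env.items():
        for alert, expr in rules.items():
            variants[alert][rule_fingerprint(expr)].append(env)

    all_envs = set(rules_by_env)
    drift = {}
    for alert, by_hash in variants.items():
        present = {env for envs in by_hash.values() for env in envs}
        missing = sorted(all_envs - present)
        if len(by_hash) > 1 or missing:
            drift[alert] = {"variants": len(by_hash), "missing_in": missing}
    return drift


# Hypothetical sample data, not taken from any real cluster.
SAMPLE = {
    "client-a-prod": {
        "HighErrorRate": "rate(http_errors_total[5m]) > 0.05",
        "PodCrashLooping": "rate(kube_pod_container_status_restarts_total[15m]) > 0",
    },
    "client-b-prod": {
        "HighErrorRate": "rate(http_errors_total[5m])   >   0.05",
        "PodCrashLooping": "rate(kube_pod_container_status_restarts_total[5m]) > 0",
    },
    "client-c-dev": {
        "HighErrorRate": "rate(http_errors_total[5m]) > 0.05",
    },
}

if __name__ == "__main__":
    for alert, info in sorted(find_rule_drift(SAMPLE).items()):
        print(alert, info)
```

In this sample, HighErrorRate differs only in whitespace and is treated as consistent, while PodCrashLooping is flagged because two environments use different windows and a third does not define it at all.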
Vendor/Provider agnosticism
Our primary focus has been to avoid vendor lock-in, as it fosters flexibility and encourages innovation in managing our cloud infrastructure.
To achieve this goal, we deliberately chose the kube-prometheus-stack, which is compatible with major cloud providers such as GCP and AWS. We have also integrated UptimeRobot, an external uptime and health check monitoring platform, and we rely on PagerDuty, a SaaS incident response platform, to aggregate alerts from all of these sources.
This approach enables us to seamlessly connect our technology stack with tools and services from each provider, ensuring that our incident management remains adaptable to the changing requirements of our infrastructure.
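As a rough illustration of provider-agnostic alert aggregation, the sketch below maps alerts from any source onto one common shape and builds an event body in the style of the PagerDuty Events API v2 (routing_key, event_action, dedup_key, payload). The Alert class, its field names, and the routing key value are assumptions made for this example; nothing is sent over the network, and the real integrations may be wired differently.

```python
import hashlib
from dataclasses import dataclass

# Severity values accepted by the PagerDuty Events API v2.
PAGERDUTY_SEVERITIES = {"critical", "error", "warning", "info"}


@dataclass(frozen=True)
class Alert:
    """Hypothetical provider-neutral alert; field names are assumptions."""
    origin: str    # e.g. "alertmanager", "uptimerobot", "cloudwatch"
    name: str      # alert or check name
    resource: str  # affected cluster, host, or URL
    severity: str
    firing: bool


def to_pagerduty_event(alert: Alert, routing_key: str) -> dict:
    """Build an Events API v2 style body for a firing or resolved alert."""
    # Unknown severities fall back to "warning" rather than being rejected.
    severity = alert.severity if alert.severity in PAGERDUTY_SEVERITIES else "warning"
    # The same origin, name, and resource always yield the same dedup key,
    # so a later "resolve" event closes the incident its "trigger" opened.
    dedup_source = f"{alert.origin}:{alert.name}:{alert.resource}"
    dedup_key = hashlib.sha256(dedup_source.encode("utf-8")).hexdigest()[:32]
    return {
        "routing_key": routing_key,
        "event_action": "trigger" if alert.firing else "resolve",
        "dedup_key": dedup_key,
        "payload": {
            "summary": f"[{alert.origin}] {alert.name} on {alert.resource}",
            "source": alert.resource,
            "severity": severity,
        },
    }


if __name__ == "__main__":
    down = Alert("uptimerobot", "HomepageDown", "https://example.com", "critical", True)
    print(to_pagerduty_event(down, routing_key="HYPOTHETICAL_ROUTING_KEY"))
```

Because every source is reduced to the same few fields before it reaches the incident platform, adding or replacing a monitoring provider only requires a new adapter that produces an Alert.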
Streamlining updates and maintenance
Keeping our toolset up to date posed a significant challenge. We were using Prometheus Operator Helm charts, deploying Prometheus stacks via Kustomize, and integrating with CloudWatch and Google Cloud Operations (formerly Stackdriver). To establish consistency and simplify maintenance, we consolidated everything into a central repository.
This approach allowed us to efficiently manage version control, releases, patches, and updates. By leveraging Helm charts and deploying our tools using Terraform code, we established a well-defined process to ensure smooth deployment and maintenance of our stack.
Automating workflows for efficiency
Manual work in incident management can introduce errors and slow down response times. To address this, we embraced automation throughout our workflows. Automated pipelines played a crucial role in streamlining the deployment of our Terraform code.
With this automation, we eliminated the need to run code locally, simplifying the process and ensuring consistent deployments. By automating routine tasks, we freed up valuable time and resources to focus on critical aspects of incident resolution, ultimately enhancing the reliability of our services.
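A pipeline step of this kind could look roughly like the sketch below. It is an assumption about the shape of such a job, not a copy of our pipeline: the next_step policy (review on merge requests, apply on the default branch) and the tfplan file name are hypothetical. The exit codes are those documented for terraform plan with the -detailed-exitcode flag: 0 for no changes, 1 for an error, and 2 for pending changes.

```python
import subprocess

# Exit codes documented for `terraform plan -detailed-exitcode`.
PLAN_NO_CHANGES = 0
PLAN_ERROR = 1
PLAN_HAS_CHANGES = 2


def run_plan(workdir: str) -> int:
    """Run `terraform plan` non-interactively and return its exit code."""
    result = subprocess.run(
        ["terraform", "plan", "-input=false", "-detailed-exitcode", "-out=tfplan"],
        cwd=workdir,
        check=False,  # exit code 2 means "changes present", not a failure
    )
    return result.returncode


def next_step(plan_exit_code: int, on_default_branch: bool) -> str:
    """Decide what the pipeline does after the plan (hypothetical policy)."""
    if plan_exit_code == PLAN_NO_CHANGES:
        return "skip"
    if plan_exit_code == PLAN_HAS_CHANGES:
        # Merge requests only show the plan; the default branch applies it.
        return "apply" if on_default_branch else "review"
    return "fail"
```

A pipeline job would call next_step(run_plan("."), on_default_branch=...) and run terraform apply tfplan only when the answer is "apply", so nobody needs to run Terraform from a local machine.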
Automated checks and quality assurance
Maintaining the reliability and accuracy of our monitoring systems is vital. To ensure the integrity of our infrastructure, we implemented automated checks and quality assurance practices.
We use the Renovate bot to track dependencies and keep their versions current. It monitors for new releases, reports changelogs, and performs a test upgrade on our development cluster for quality assurance. If any issues arise, the bot reverts the changes, preserving the stability of our deployments.
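The upgrade, test, and revert cycle described above can be sketched as plain control flow. This is only an illustration under stated assumptions: try_upgrade, UpgradeResult, and the deploy and healthy callables are hypothetical placeholders, and the sketch does not show how Renovate or the development cluster are actually wired together.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class UpgradeResult:
    version: str    # version left running after the attempt
    upgraded: bool
    reverted: bool


def try_upgrade(
    current: str,
    candidate: str,
    deploy: Callable[[str], None],
    healthy: Callable[[], bool],
) -> UpgradeResult:
    """Deploy a candidate version and roll back if the health check fails.

    `deploy` and `healthy` are injected placeholders, for example a Helm
    upgrade on a development cluster and a smoke test against it.
    """
    deploy(candidate)
    if healthy():
        return UpgradeResult(version=candidate, upgraded=True, reverted=False)
    # The candidate failed its checks: restore the known-good version.
    deploy(current)
    return UpgradeResult(version=current, upgraded=False, reverted=True)


if __name__ == "__main__":
    history = []
    result = try_upgrade("1.0.0", "1.1.0", history.append, lambda: False)
    print(result, history)
```

Passing the deploy and health check steps in as functions keeps the decision logic testable without touching a real cluster.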
Improvements leading to exceptional services
Our focus on unified monitoring, vendor/provider agnosticism, streamlined updates and maintenance, automated workflows, and proactive monitoring has enabled us to deliver exceptional services to our customers.
- Shift from ClickOps to GitOps: One significant change we implemented was transitioning from traditional ClickOps to GitOps activities. Embracing GitOps principles has allowed us to automate and version control our infrastructure, significantly reducing manual interventions and human errors.
- Increased Merge Request (MR) submissions: Since migrating to GitOps, we have observed a substantial increase in MR submissions. The monthly MR count has doubled (a 100% increase), reflecting the efficiency and collaboration fostered by this new approach.
- Enhanced code review knowledge and experience: As part of our commitment to GitOps, we have invested in training and upskilling our team members. This dedication to improving code review knowledge and expertise has resulted in higher-quality code and improved overall system stability.
- Significant reduction in delivery time: Thanks to implementing automated pipelines within our GitOps setup, we have drastically reduced delivery time for new features and updates. This increased speed to market allows us to respond to customer demands more promptly.
- Amplified patching activities: With the integration of GitOps and automated workflows, our ability to apply patches and updates has significantly improved. This proactive approach ensures that our infrastructure always remains secure and up-to-date.
Conclusion
Efficient incident and operations management is a cornerstone of reliable cloud infrastructure. Adopting a comprehensive set of tools and practices has revolutionized our Incident Management (IM) and Operations Management (OM) approach.
In our next blog post, we will explore the components, configurations, and methodologies of our stack that enable us to manage incidents effectively and improve our operational performance. These advancements and our commitment to innovation continue to strengthen our cloud infrastructure and solidify our position as a trusted and reliable service provider.
FAQs
Q1: What is the Revolgy stack?
The Revolgy stack is a comprehensive solution designed to manage incidents and provide observability. Its foundation is the kube-prometheus-stack, which is used in combination with AWS CloudWatch and Google Cloud Operations to ensure standardized configurations.
Q2: What operational pains did the Revolgy Operations Team originally face?
The team encountered several challenges, including inconsistent monitoring tools and systems, difficulties with vendor/provider agnosticism, complex and time-consuming updates, inefficient manual workflows, and inadequate quality assurance due to manual checks.
Q3: How did Revolgy solve the problem of having inconsistent monitoring tools?
The team adopted the kube-prometheus-stack, an open-source project that includes industry-standard tools like Prometheus Operator, Grafana, and Alertmanager. This stack serves as the base for monitoring, ensuring standardized configurations tailored to specific client needs.
Q4: How does the Revolgy stack remain vendor-agnostic?
To avoid vendor lock-in, the stack uses the kube-prometheus-stack, which is compatible with major cloud providers like GCP and AWS. It also integrates with UptimeRobot for external health checks and PagerDuty for aggregating alerts, enabling seamless connection with tools from any provider.
Q5: What process was implemented to streamline updates and maintenance?
A centralized repository was created to manage version control, releases, patches, and updates. By using Helm charts and deploying tools as Terraform code, the team established a well-defined process for smooth deployment and maintenance.
Q6: Why was automation introduced into the team’s workflows?
Automation was implemented to reduce the errors and slow response times associated with manual work in incident management. This frees up team resources to focus on critical aspects of incident resolution and enhances service reliability.
Q7: How are automated checks and quality assurance performed?
The team uses the Renovate bot to track dependencies and manage version control. The bot monitors for changes, reports changelogs, and performs a test upgrade on a development cluster. If any issues are detected, the bot automatically reverts the changes to ensure stability.
Q8: Why did the team shift from ClickOps to GitOps?
The transition to GitOps was made to automate and version control the infrastructure. This approach significantly reduces manual interventions and human errors.
Q9: What were the key benefits of moving to GitOps and implementing automation?
The key benefits included a 100% increase in monthly Merge Request (MR) submissions, enhanced code review knowledge within the team, a significant reduction in delivery time for new features, and an amplified ability to apply patches and security updates.