Revolgy stack: empowering Ops team with efficient tools and practices
The Revolgy stack is a comprehensive solution that helps our team manage incidents effectively and provide top-notch observability.
This article introduces a series of use cases that focus on the Revolgy Operations Team’s observability and inner workings. We will share our experiences, highlighting the pain points we encountered and how we overcame them by implementing an optimized toolset tailored to our operational needs. We aim to enhance reliability and observability, ensuring seamless operations for our clients.
Join us on this journey as we delve into the intricacies of our operational processes and share insights on how the Revolgy stack has revolutionized our approach to incident management.
Let’s take a closer look at each pain point and how our team decided to solve it. We will focus on the following operational pain points:
- Dealing with inconsistent monitoring tools and systems
- Struggling with vendor/provider agnosticism
- Complex and time-consuming updates and maintenance
- Inefficient workflows and manual processes
- Manual checks and inadequate quality assurance
Inconsistent monitoring tools and systems
In our journey toward effective incident management and operations management (IM/OM), we experienced the frustrations of running different variations of Prometheus, including the operator and the vanilla-OS versions. On top of that, Prometheus rules that differed across environments and clients added further complexity.
To address these challenges, we adopted the kube-prometheus-stack, an open-source project with excellent community support. The stack bundles widely used, well-documented monitoring and alerting tools, such as the Prometheus Operator, Grafana, and Alertmanager, and packages them as a single Helm chart deployment.
We combine the kube-prometheus-stack with Amazon CloudWatch and Google Cloud Operations. Still, the kube-prometheus-stack serves as the base of the Revolgy stack, ensuring standardized configurations while leaving room for customizations tailored to our clients’ specific needs.
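To make the rule-consistency problem concrete, here is a minimal sketch of a drift check that compares alert rules across environments. It is not part of the Revolgy stack as described; the environment names, alert names, and thresholds are hypothetical, and the rules are assumed to have already been loaded into plain dictionaries.

```python
from collections import defaultdict


def find_rule_drift(rules_by_env):
    """Return alert rules whose expression differs, or is missing, across environments.

    rules_by_env maps an environment name to {alert_name: PromQL expression}.
    The result maps each drifting alert to {expression: [environments]};
    the key None marks environments where the alert is not defined at all.
    """
    all_alerts = set()
    for rules in rules_by_env.values():
        all_alerts.update(rules)

    drift = {}
    for alert in sorted(all_alerts):
        # Group environments by the expression they use for this alert.
        envs_by_expr = defaultdict(list)
        for env, rules in sorted(rules_by_env.items()):
            envs_by_expr[rules.get(alert)].append(env)
        if len(envs_by_expr) > 1:
            drift[alert] = dict(envs_by_expr)
    return drift


# Hypothetical rule sets for two client environments.
rules = {
    "client-a-prod": {
        "HighErrorRate": 'rate(http_requests_total{code=~"5.."}[5m]) > 0.05',
        "PodCrashLooping": "increase(kube_pod_container_status_restarts_total[15m]) > 3",
    },
    "client-b-prod": {
        "HighErrorRate": 'rate(http_requests_total{code=~"5.."}[5m]) > 0.10',
        "PodCrashLooping": "increase(kube_pod_container_status_restarts_total[15m]) > 3",
    },
}

drift = find_rule_drift(rules)
for alert, variants in drift.items():
    print(alert, variants)
```

Here only `HighErrorRate` is reported, because its threshold differs between the two environments. Keeping rules in one shared base makes this kind of drift less likely to appear in the first place.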
Vendor/provider agnosticism
Our primary focus has been to avoid vendor lock-in, as it fosters flexibility and encourages innovation in managing our cloud infrastructure.
To achieve this goal, we deliberately chose the kube-prometheus-stack, which is compatible with major cloud providers such as GCP and AWS. We have also integrated UptimeRobot, an external uptime and health-check monitoring platform, into our system. To aggregate alerts from all of these sources in one place, we rely on PagerDuty, a SaaS incident response platform.
This approach enables us to seamlessly connect our technology stack with tools and services from each provider, ensuring that our incident management remains adaptable to the changing requirements of our infrastructure.
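As an illustration of what aggregating alerts from several sources can look like, the sketch below normalizes two alert shapes into the event format of the PagerDuty Events API v2. It only builds the payloads and sends nothing. The Alertmanager fields follow its webhook format, while the `cluster` label, the uptime-check fields (`monitor`, `url`, `up`), and the routing key are assumptions made for this example, not UptimeRobot's actual webhook schema or our production setup.

```python
ALLOWED_SEVERITIES = {"critical", "error", "warning", "info"}


def alertmanager_to_event(alert, routing_key):
    """Map one alert from an Alertmanager webhook to an Events API v2 event."""
    labels = alert.get("labels", {})
    annotations = alert.get("annotations", {})
    severity = labels.get("severity", "warning")
    if severity not in ALLOWED_SEVERITIES:
        severity = "warning"  # fall back to a value the API accepts
    resolved = alert.get("status") == "resolved"
    name = labels.get("alertname", "unnamed alert")
    return {
        "routing_key": routing_key,
        "event_action": "resolve" if resolved else "trigger",
        # A stable dedup key lets a later "resolve" close the same incident.
        "dedup_key": alert.get("fingerprint") or name,
        "payload": {
            "summary": annotations.get("summary") or name,
            "source": labels.get("cluster", "unknown"),  # "cluster" label is an assumption
            "severity": severity,
        },
    }


def uptime_check_to_event(check, routing_key):
    """Map a hypothetical uptime-check result to an Events API v2 event."""
    state = "up" if check["up"] else "down"
    return {
        "routing_key": routing_key,
        "event_action": "resolve" if check["up"] else "trigger",
        "dedup_key": "uptime-" + check["monitor"],
        "payload": {
            "summary": check["monitor"] + " is " + state,
            "source": check["url"],
            "severity": "info" if check["up"] else "critical",
        },
    }


# Hypothetical inputs from two different monitoring sources.
prom_alert = {
    "status": "firing",
    "fingerprint": "a1b2c3",
    "labels": {"alertname": "HighErrorRate", "severity": "critical", "cluster": "client-a-prod"},
    "annotations": {"summary": "5xx rate above threshold"},
}
uptime_check = {"monitor": "client-a-website", "url": "https://example.com", "up": False}

events = [
    alertmanager_to_event(prom_alert, "HYPOTHETICAL_ROUTING_KEY"),
    uptime_check_to_event(uptime_check, "HYPOTHETICAL_ROUTING_KEY"),
]
```

In practice, Alertmanager can send to PagerDuty through its own receiver configuration; the point of the sketch is only that every source ends up in one common event shape, regardless of the provider it came from.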
Streamlining updates and maintenance
Keeping our toolset up-to-date posed a significant challenge. We used Prometheus-operator Helm charts, deployed Prometheus stacks via Kustomize, and integrated with CloudWatch and Google Cloud Operations (formerly Stackdriver). To establish consistency and simplify maintenance, we centralized our repository.
This approach allowed us to efficiently manage version control, releases, patches, and updates. By leveraging Helm charts and deploying our tools using Terraform code, we established a well-defined process to ensure smooth deployment and maintenance of our stack.
Automating workflows for efficiency
Manual work in incident management can introduce errors and slow down response times. To address this, we embraced automation throughout our workflows. Automated pipelines played a crucial role in streamlining the deployment of our Terraform code.
With this automation, we eliminated the need to run code locally, simplifying the process and ensuring consistent deployments. By automating routine tasks, we freed up valuable time and resources to focus on critical aspects of incident resolution, ultimately enhancing the reliability of our services.
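A pipeline step of this kind can be sketched as a small wrapper around the Terraform CLI. This is an illustrative outline rather than our actual pipeline definition: the directory `envs/dev` and the plan file name are hypothetical, and the `runner` parameter exists only so the sequence can be exercised without Terraform installed.

```python
import subprocess


def terraform_commands(workdir, plan_file="tfplan"):
    """Build the non-interactive init, plan, and apply commands for one directory."""
    base = ["terraform", "-chdir=" + workdir]
    return [
        base + ["init", "-input=false"],
        # Save the plan so that apply executes exactly what was planned.
        base + ["plan", "-input=false", "-out=" + plan_file],
        base + ["apply", "-input=false", plan_file],
    ]


def run_pipeline(workdir, runner=subprocess.run):
    """Run each step in order; check=True stops the pipeline on the first failure."""
    for command in terraform_commands(workdir):
        runner(command, check=True)


# Print the commands a CI job would run for a hypothetical directory.
for command in terraform_commands("envs/dev"):
    print(" ".join(command))
```

Running the same fixed sequence in a pipeline, instead of on individual laptops, is what keeps deployments consistent: everyone's change goes through identical commands with identical flags.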
Automated checks and quality assurance
Maintaining the reliability and accuracy of our monitoring systems is vital. To ensure the integrity of our infrastructure, we implemented automated checks and quality assurance practices.
We use the Renovate bot to track dependencies and keep their versions current. The bot monitors for new releases, reports changelogs, and performs a test upgrade on our development cluster for quality assurance. If any issues arise, the bot reverts the changes, keeping our deployments stable.
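The upgrade-then-revert flow described above can be outlined as follows. This is a simplified sketch of the logic, not Renovate's API or configuration: `deploy` and `health_check` are hypothetical callables standing in for a deployment to the development cluster and a post-upgrade check, and the version numbers are made up.

```python
def parse_version(version):
    """Turn a dotted numeric version like '45.7.1' or 'v45.7.1' into a comparable tuple."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def trial_upgrade(current, candidate, deploy, health_check):
    """Deploy the candidate version and roll back if the health check fails.

    Returns the version that is running once the trial is over.
    """
    if parse_version(candidate) <= parse_version(current):
        return current  # nothing newer to try
    deploy(candidate)
    if health_check():
        return candidate
    deploy(current)  # revert to the last known good version
    return current


# Simulate a failed upgrade: the health check reports a problem.
deployed = []
running = trial_upgrade("45.7.1", "45.8.0", deployed.append, lambda: False)
print(running, deployed)
```

In the simulated run the candidate is deployed, fails its check, and the previous version is deployed again, so the development cluster ends up where it started.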
Improvements leading to exceptional services
Our focus on unified monitoring, vendor/provider agnosticism, streamlined updates and maintenance, automated workflows, and proactive monitoring has enabled us to deliver exceptional services to our customers.
- Shift from ClickOps to GitOps: One significant change we implemented was transitioning from traditional ClickOps to GitOps activities. Embracing GitOps principles has allowed us to automate and version control our infrastructure, significantly reducing manual interventions and human errors.
- Increased Merge Request (MR) submissions: Since migrating to GitOps, we have observed a substantial increase in MR submissions. The monthly MR count has doubled (a 100% increase), reflecting the efficiency and collaboration fostered by this new approach.
- Enhanced code review knowledge and experience: As part of our commitment to GitOps, we have invested in training and upskilling our team members. This dedication to improving code review knowledge and expertise has resulted in higher-quality code and improved overall system stability.
- Significant reduction in delivery time: Thanks to implementing automated pipelines within our GitOps setup, we have drastically reduced delivery time for new features and updates. This increased speed to market allows us to respond to customer demands more promptly.
- Amplified patching activities: With the integration of GitOps and automated workflows, our ability to apply patches and updates has significantly improved. This proactive approach ensures that our infrastructure always remains secure and up-to-date.
Efficient incident and operations management is a cornerstone of reliable cloud infrastructure. Adopting a comprehensive set of tools and practices has revolutionized our Incident Management (IM) and Operations Management (OM) approach.
In our next blog post (link to be added), we will explore the components, configurations, and methodologies of our carefully crafted stack, which enable us to manage incidents effectively and elevate our operational performance. These advancements and our commitment to innovation continue to strengthen our cloud infrastructure and solidify our position as a trusted and reliable service provider.