The Revolgy stack is a comprehensive solution that empowers our team to manage incidents effectively and provide top-notch observability.
This article introduces a series of use cases that focus on the Revolgy Operations Team’s observability and inner workings. We will share our experiences, highlighting the pain points we encountered and how we overcame them by implementing an optimized toolset tailored to our operational needs. We aim to enhance reliability and observability, ensuring seamless operations for our clients.
Join us on this journey as we delve into the intricacies of our operational processes and share insights on how the Revolgy stack has revolutionized our approach to incident management.
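So, let’s take a closer look at each pain point we had and how our team decided to solve it. We will focus on the following operational pains: fragmented monitoring, vendor lock-in, toolset updates and maintenance, manual work, and the reliability of the monitoring itself.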
In our journey to achieve effective incident management and operations management (IM/OM), we experienced the frustrations of running different variations of Prometheus, including the operator and the vanilla-OS versions. Additionally, inconsistent Prometheus rules across environments and clients created further complexity.
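To make the rule-inconsistency problem concrete, here is a minimal sketch of how drift between environments could be detected. It assumes each environment's alert rules have already been loaded into a plain mapping of rule name to PromQL expression. The function name, environment names, and rules are hypothetical and are not taken from our actual setup.

```python
from typing import Dict, List

# Hypothetical representation: alert rule name -> PromQL expression.
RuleSet = Dict[str, str]


def find_rule_drift(
    baseline: RuleSet, environments: Dict[str, RuleSet]
) -> Dict[str, Dict[str, List[str]]]:
    """Compare each environment's rules against a shared baseline.

    Returns only the environments that differ, listing rules that are
    missing, extra, or defined with a different expression.
    """
    report: Dict[str, Dict[str, List[str]]] = {}
    for env, rules in environments.items():
        missing = sorted(set(baseline) - set(rules))
        extra = sorted(set(rules) - set(baseline))
        changed = sorted(
            name
            for name in set(baseline) & set(rules)
            if baseline[name].strip() != rules[name].strip()
        )
        if missing or extra or changed:
            report[env] = {"missing": missing, "extra": extra, "changed": changed}
    return report


# Illustrative data only (assumed rule names and thresholds).
BASELINE: RuleSet = {
    "InstanceDown": "up == 0",
    "HighErrorRate": 'rate(http_requests_total{status=~"5.."}[5m]) > 0.05',
}
ENVIRONMENTS: Dict[str, RuleSet] = {
    "client-a-prod": dict(BASELINE),
    "client-b-prod": {
        "InstanceDown": "up == 0",
        "HighErrorRate": 'rate(http_requests_total{status=~"5.."}[5m]) > 0.10',
        "DiskAlmostFull": "node_filesystem_avail_bytes < 1e9",
    },
}

if __name__ == "__main__":
    for env, diff in find_rule_drift(BASELINE, ENVIRONMENTS).items():
        print(env, diff)
```

A report like this makes it easy to see which clients have diverged from the shared rule set before standardizing them.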
To address these challenges, we adopted the kube-prometheus-stack, an open-source project with excellent community support. The stack bundles widely used, well-documented monitoring and alerting tools, such as Prometheus Operator, Grafana, Alertmanager, and others, packaged as a Helm chart deployment.
We combine the kube-prometheus-stack with Amazon CloudWatch and Google Cloud Operations. Still, the kube-prometheus-stack serves as the base of the Revolgy stack, ensuring standardized configurations and customizations tailored to our clients’ specific needs.
Our primary focus has been to avoid vendor lock-in, as it fosters flexibility and encourages innovation in managing our cloud infrastructure.
To achieve this goal, we deliberately chose the kube-prometheus-stack, which is compatible with major cloud providers such as GCP and AWS. Furthermore, we have integrated UptimeRobot, an external uptime and health-check monitoring platform, into our system. To streamline the aggregation of alerts, we rely on PagerDuty, a SaaS incident response platform.
This approach enables us to seamlessly connect our technology stack with tools and services from each provider, ensuring that our incident management remains adaptable to the changing requirements of our infrastructure.
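As an illustration of what aggregating alerts from several sources can look like, the sketch below maps an alert from any monitoring source onto the event shape accepted by the PagerDuty Events API v2. The function name, the dedup key scheme, and the fallback severity are assumptions made for this example. In practice, tools such as Alertmanager ship with their own PagerDuty integration, so this is not a description of how our alerts are actually delivered.

```python
from typing import Any, Dict

# Severity values accepted by the PagerDuty Events API v2.
PD_SEVERITIES = {"critical", "error", "warning", "info"}


def to_pagerduty_event(
    routing_key: str,
    source_system: str,
    alert_name: str,
    target: str,
    severity: str,
    details: Dict[str, Any],
) -> Dict[str, Any]:
    """Build an Events API v2 'trigger' event from a source-agnostic alert."""
    sev = severity.lower()
    if sev not in PD_SEVERITIES:
        # Assumption for this sketch: unknown severities are treated as errors.
        sev = "error"
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        # Hypothetical scheme: a stable key lets repeated alerts for the
        # same problem collapse into a single incident.
        "dedup_key": f"{source_system}/{target}/{alert_name}",
        "payload": {
            "summary": f"[{source_system}] {alert_name} on {target}",
            "source": target,
            "severity": sev,
            "custom_details": details,
        },
    }


if __name__ == "__main__":
    event = to_pagerduty_event(
        routing_key="EXAMPLE_ROUTING_KEY",  # placeholder, not a real key
        source_system="uptimerobot",
        alert_name="HTTP check failed",
        target="shop.example.com",
        severity="CRITICAL",
        details={"status_code": 503},
    )
    print(event)
```

Because every source is reduced to the same event shape, adding or replacing a monitoring provider only requires a new mapping, not a new incident workflow.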
Keeping our toolset up to date posed a significant challenge. We used Prometheus-operator Helm charts, deployed Prometheus stacks via Kustomize, and integrated with CloudWatch and Google Cloud Operations (formerly Stackdriver). To establish consistency and simplify maintenance, we consolidated everything into a single central repository.
This approach allowed us to efficiently manage version control, releases, patches, and updates. By leveraging Helm charts and deploying our tools using Terraform code, we established a well-defined process to ensure smooth deployment and maintenance of our stack.
Manual work in incident management can introduce errors and slow down response times. To address this, we embraced automation throughout our workflows. Automated pipelines played a crucial role in streamlining the deployment of our Terraform code.
With this automation, we eliminated the need to run code locally, simplifying the process and ensuring consistent deployments. By automating routine tasks, we freed up valuable time and resources to focus on critical aspects of incident resolution, ultimately enhancing the reliability of our services.
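The pipeline idea can be sketched as a fixed, non-interactive sequence of Terraform commands that a CI job runs instead of an engineer's laptop. The helper names and the `tfplan` file name below are hypothetical. The sketch only assumes a Terraform version that supports the global `-chdir` option, and it does not reflect the exact pipeline we use.

```python
import subprocess
from typing import Callable, List


def terraform_steps(workdir: str, plan_file: str = "tfplan") -> List[List[str]]:
    """Return the init/plan/apply commands for one Terraform root module."""
    base = ["terraform", f"-chdir={workdir}"]
    return [
        # -input=false makes Terraform fail instead of prompting in CI.
        base + ["init", "-input=false"],
        # Save the plan so that apply executes exactly what was planned.
        base + ["plan", "-input=false", f"-out={plan_file}"],
        base + ["apply", "-input=false", plan_file],
    ]


def run_pipeline(workdir: str, runner: Callable[..., object] = subprocess.run) -> None:
    """Run each step in order; check=True stops at the first failing command."""
    for command in terraform_steps(workdir):
        runner(command, check=True)


if __name__ == "__main__":
    # Print the commands rather than executing them, so the sketch is safe to run.
    for command in terraform_steps("environments/dev"):
        print(" ".join(command))
```

Passing the runner as a parameter keeps the sequence testable without Terraform installed, and applying a saved plan file avoids drift between what was reviewed and what is deployed.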
Maintaining the reliability and accuracy of our monitoring systems is vital. To ensure the integrity of our infrastructure, we implemented automated checks and quality assurance practices.
Leveraging the Renovate bot, we track dependencies and manage version control. The Renovate bot monitors for changes, reports changelogs, and performs a test upgrade on our development cluster for quality assurance. If any issues arise, the bot reverts the changes, ensuring the stability of our deployments.
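The upgrade-then-revert flow described above can be sketched as follows. This is not Renovate's implementation. It is a simplified model with hypothetical function names and made-up version numbers, assuming plain `major.minor.patch` versions and caller-supplied `deploy` and `healthy` callbacks.

```python
from typing import Callable, Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Parse 'major.minor.patch' (optionally prefixed with 'v') into integers."""
    major, minor, patch = version.lstrip("v").split(".")
    return int(major), int(minor), int(patch)


def update_type(current: str, candidate: str) -> str:
    """Classify a candidate version as 'major', 'minor', 'patch', or 'none'."""
    cur, new = parse_version(current), parse_version(candidate)
    if new <= cur:
        return "none"
    if new[0] != cur[0]:
        return "major"
    if new[1] != cur[1]:
        return "minor"
    return "patch"


def trial_upgrade(
    current: str,
    candidate: str,
    deploy: Callable[[str], None],
    healthy: Callable[[], bool],
) -> str:
    """Deploy the candidate to a test cluster and roll back if checks fail.

    Returns the version that is running once the trial has finished.
    """
    if update_type(current, candidate) == "none":
        return current
    deploy(candidate)
    if healthy():
        return candidate
    deploy(current)  # roll back to the last known-good version
    return current


if __name__ == "__main__":
    history = []
    # Simulate a failing health check to show the rollback path.
    running = trial_upgrade("45.7.1", "46.0.0", history.append, lambda: False)
    print(running, history)
```

Classifying the update type first makes it possible to treat major upgrades more cautiously than patch releases, for example by requiring a manual review.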
Our focus on unified monitoring, vendor/provider agnosticism, streamlined updates and maintenance, automated workflows, and proactive monitoring has enabled us to deliver exceptional services to our customers.
Efficient incident and operations management is a cornerstone of reliable cloud infrastructure. Adopting a comprehensive set of tools and practices has revolutionized our approach to Incident Management (IM) and Operations Management (OM).
In our next blog post, we will explore the components, configurations, and methodologies of our carefully crafted stack, which enable us to manage incidents effectively and elevate our operational performance. These advancements and our commitment to innovation continue to strengthen our cloud infrastructure and solidify our position as a trusted and reliable service provider.