Managed Services

All you need to know before deploying a managed monitoring solution

October 31, 2019

We recently got a chance to try out a SaaS monitoring solution called SignalFx. We really enjoyed working with it and thought our experience was worth sharing. Let’s look at some of its main benefits and at everything else you need to know before you deploy any managed monitoring solution:

SignalFx is a tool that lets you offload all the complicated monitoring infrastructure to a reliable partner who really understands how to run such important components properly. Do you sometimes ask yourself any of the following questions?

What happens when my monitoring infrastructure dies?
How can I monitor my monitoring infrastructure?
What happens when the storage holding my historical data gets corrupted?
How am I supposed to provision cost-effective infrastructure while the monitoring components are so huuuuuge?

With a managed solution, you don’t have to ask such questions. Instead, you can fully focus on your business and reduce operational expenses.

Of course, there are some drawbacks, as every technology requires a certain level of “special treatment”. Today I’m going to show you some of the best practices for using SignalFx deployed in AWS EKS.

Step 1: Install the SignalFx Smart Agent

Simply follow the instructions in the official documentation. I’d just like to add that it is very important to always use the advanced installation method (you will be using good old kubectl), since it allows you to customise the entire Smart Agent configuration. I’ll explain the reason for this recommendation later on. ;)

Step 2: Customise Kubelet Metrics

Even after 5 years of its existence, the world of Kubernetes is still pretty wild. I mean, each of the managed services might behave a bit differently. EKS is a good example, as it has slightly different kubelet endpoints from other managed services. Hence, you might not see all the metrics in the SignalFx interface.

Thankfully, this requires just a minor adjustment of the Smart Agent ConfigMap. Open the ConfigMap YAML file and extend the kubelet section to something like this:

     - type: kubelet-stats
       kubeletAPI:
         url: "https://${MY_NODE_NAME}:10250"
         skipVerify: true
         authType: serviceAccount

Step 3: Even better Kubelet scraping

Have you ever read the horror stories about CPU throttling? If you have, then you know that we should be carefully observing these stats. If you haven’t … well, follow my lead anyway. While you are in the kubelet section, extend the configuration a bit and add the following additional metrics: container_cpu_cfs_periods and container_cpu_cfs_throttled_periods. The example below also adds container_fs_limit_bytes and container_fs_usage_bytes.

Then, the whole kubelet section will look like this:

     - type: kubelet-stats
       extraMetrics:
         - container_cpu_cfs_periods
         - container_cpu_cfs_throttled_periods
         - container_fs_limit_bytes
         - container_fs_usage_bytes
       kubeletAPI:
         url: "https://${MY_NODE_NAME}:10250"
         skipVerify: true
         authType: serviceAccount

Now you can watch CPU throttling in real time. No more oppression!
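To make sense of these two counters, it helps to turn them into a ratio. Here is a minimal Python sketch, assuming both metrics behave as ever-growing counters and that you have two samples of each; the function name and the sample numbers are mine and purely illustrative.

```python
def throttled_ratio(periods_start, throttled_start, periods_end, throttled_end):
    """Fraction of CFS periods in which the container was throttled
    between two samples of the counters."""
    elapsed_periods = periods_end - periods_start
    if elapsed_periods <= 0:
        # No periods elapsed (or the counter was reset): nothing to report.
        return 0.0
    return (throttled_end - throttled_start) / elapsed_periods


# Hypothetical samples of container_cpu_cfs_periods and
# container_cpu_cfs_throttled_periods, taken some time apart.
ratio = throttled_ratio(
    periods_start=1000, throttled_start=50,
    periods_end=1600, throttled_end=200,
)
print(f"{ratio:.0%} of periods throttled")
```

A ratio that stays high for a container is usually a hint that its CPU limit is too tight for the work it does.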

Step 4: Monitor Kubernetes volumes

The best practice for fully leveraging scaling capabilities, shortening recovery times etc. is to run stateless workloads on Kubernetes. However, sometimes it’s just not possible to stick to this recommendation (for instance, how would you cope with a ZooKeeper cluster?) and we must work with stateful workloads. In such cases, it’s probably a good idea to monitor the free space to prevent data loss or corruption caused by a full file system.

In SignalFx you have to opt in to this feature by enabling the kubernetes-volumes exporter.

     - type: kubernetes-volumes
       kubeletAPI:
         url: "https://${MY_NODE_NAME}:10250"
         skipVerify: true
         authType: serviceAccount

But this isn’t all! You also need to modify the ClusterRole to allow SignalFx to read information about Persistent Volumes and Persistent Volume Claims. This adjustment is actually pretty easy. Just open the manifest for the ClusterRole and add the following entries to the list of resources:

  - persistentvolumeclaims
  - persistentvolumes

Now you can get all the metrics you need to observe the available space in persistent volumes.
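What you do with those numbers is simple arithmetic. The following Python sketch assumes you have an "available bytes" and a "capacity bytes" value for each volume; the function names and the 85% threshold are my own illustrative choices, not SignalFx defaults.

```python
def volume_used_percent(available_bytes, capacity_bytes):
    """Percentage of the volume that is already in use."""
    if capacity_bytes <= 0:
        raise ValueError("capacity_bytes must be positive")
    return 100.0 * (capacity_bytes - available_bytes) / capacity_bytes


def needs_attention(available_bytes, capacity_bytes, threshold_percent=85.0):
    """True when usage has reached the (hypothetical) alert threshold."""
    return volume_used_percent(available_bytes, capacity_bytes) >= threshold_percent


GIB = 1024 ** 3
# Hypothetical 20 GiB volume with 2 GiB left.
print(volume_used_percent(2 * GIB, 20 * GIB))
print(needs_attention(2 * GIB, 20 * GIB))
```

Alerting on a percentage rather than on absolute free bytes keeps one rule usable for volumes of very different sizes.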

Step 5: Add metrics for StatefulSets

By default, SignalFx collects metrics only for Kubernetes Deployments. This is useful when you want to identify failing workloads, Deployments which are not fully ready etc. This step is actually connected with the previous one. Sometimes you really need to run a stateful workload. In such cases, it comes in handy if you can receive the same metrics for StatefulSets as well.

To achieve this, you just need to add a few extra metrics to the kubernetes-cluster section:

     - type: kubernetes-cluster
       useNodeName: true
       extraMetrics:
         - "kubernetes.stateful_set.current"
         - "kubernetes.stateful_set.desired"
         - "kubernetes.stateful_set.ready"
         - "kubernetes.stateful_set.updated"

From now on, you should be alerted when something happens to StatefulSets as well!
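The check behind such an alert is just a comparison of the desired and ready counts. Here is a small Python sketch of that logic, assuming you already have the latest value of each metric per StatefulSet; the dictionary layout and the StatefulSet names are made up for illustration.

```python
def unhealthy_stateful_sets(latest):
    """Return (name, missing_replicas) for every StatefulSet whose ready
    count is below its desired count.

    `latest` maps a StatefulSet name to a dict of metric name -> value.
    """
    problems = []
    for name, metrics in sorted(latest.items()):
        desired = metrics["kubernetes.stateful_set.desired"]
        ready = metrics["kubernetes.stateful_set.ready"]
        if ready < desired:
            problems.append((name, desired - ready))
    return problems


# Hypothetical latest datapoints for two StatefulSets.
sample = {
    "zookeeper": {
        "kubernetes.stateful_set.desired": 3,
        "kubernetes.stateful_set.ready": 2,
    },
    "kafka": {
        "kubernetes.stateful_set.desired": 3,
        "kubernetes.stateful_set.ready": 3,
    },
}
print(unhealthy_stateful_sets(sample))
```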

Always keep an eye on the usage meter

Now let’s talk about the most important part: almost everything mentioned in this article counts as custom metrics. These are, of course, limited depending on your subscription level. The standard license gets 50 custom metrics per host. With the enterprise license you can have 200 of them. Please note that it’s extremely easy to exceed the limit, and that can bring you an unpleasant surprise when the invoice arrives.
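A quick back-of-the-envelope check can tell you how close you are to the limit. The Python sketch below uses the per-host figures quoted above and assumes the allowance is pooled across all hosts, which is my simplification; check your own contract for how your plan actually counts them.

```python
# Per-host custom metric allowances as quoted in this article.
CUSTOM_METRICS_PER_HOST = {"standard": 50, "enterprise": 200}


def custom_metric_headroom(license_level, hosts, custom_metrics_in_use):
    """Remaining custom metrics before the allowance is exceeded.

    Assumption: the allowance is pooled (per-host figure times host count).
    A negative result means you are already over the limit.
    """
    allowance = CUSTOM_METRICS_PER_HOST[license_level] * hosts
    return allowance - custom_metrics_in_use


# Hypothetical cluster: 10 hosts currently sending 620 custom metrics.
print(custom_metric_headroom("standard", hosts=10, custom_metrics_in_use=620))
print(custom_metric_headroom("enterprise", hosts=10, custom_metrics_in_use=620))
```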

I strongly recommend always reading the documentation of the monitored components carefully and trying to filter out metrics and dimensions you don’t need. For instance, I really don’t need Docker metrics as I’m able to retrieve similar metrics from other components. So there’s nothing easier than just deleting this section from the ConfigMap.

     - type: docker-container-stats
       dockerURL: unix:///var/run/docker.sock
       excludedImages:
         - '*pause-amd64*'
         - 'k8s.gcr.io/pause*'
       labelsToDimensions:
         io.kubernetes.container.name: container_spec_name
         io.kubernetes.pod.name: kubernetes_pod_name
         io.kubernetes.pod.uid: kubernetes_pod_uid
         io.kubernetes.pod.namespace: kubernetes_namespace

And you know what? We’ve just saved tons of metrics, so you can reuse this capacity for some other important business metrics.

Key takeaways

If you’d like my advice, I’d say take some time and go through all these recommendations one by one. Try to fetch the metrics endpoints with curl or wget and try to understand what actually happens in each of the infrastructure components of Kubernetes. Playing with monitoring is actually a great way to get closer to Kubernetes and mentally connect all these things together.
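Once you have fetched an endpoint, counting what it returns is a good first exercise. Below is a minimal Python sketch that counts time series per metric name, assuming the endpoint answers in the Prometheus text exposition format; the sample text is invented and far shorter than a real response.

```python
from collections import Counter


def series_per_metric(exposition_text):
    """Count how many time series each metric name has in a
    Prometheus-style text response."""
    counts = Counter()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The metric name ends at the first '{' (labels) or space (value).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        counts[name] += 1
    return counts


# Made-up sample, for illustration only.
SAMPLE = """\
# HELP container_cpu_cfs_throttled_periods_total Number of throttled period intervals.
# TYPE container_cpu_cfs_throttled_periods_total counter
container_cpu_cfs_throttled_periods_total{container="api",pod="api-0"} 12
container_cpu_cfs_throttled_periods_total{container="db",pod="db-0"} 3
container_fs_usage_bytes{container="db",pod="db-0"} 1048576
"""

for metric, count in series_per_metric(SAMPLE).most_common():
    print(metric, count)
```

Sorting by count shows at a glance which metrics produce the most series, and those are the first candidates to filter out.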

My second piece of advice would be: always think twice when you are adding new metrics endpoints to the SignalFx configuration. Ask yourself whether you really need these metrics, because custom metrics are limited based on your subscription and exceeding the limit can be a bit pricey.

And last but not least, have fun! Everything is just about a small portion of math and experimenting with the actual metrics. There’s no need to be afraid of observability.

And what if something bad happens and you can’t see it in the monitoring? It’ll get better in the next iteration. Continuous improvement is an integral part of monitoring, so always bear this in mind and don’t panic.

FAQs

Q1: What is the primary benefit of using a managed monitoring solution like SignalFx?

A managed solution offloads the complexity of running and maintaining monitoring infrastructure. This means you no longer have to worry about what happens if your monitoring system fails, its storage gets corrupted, or how to provision its huge components cost-effectively. It allows you to focus on your business and reduce operational expenses.

Q2: What installation method is recommended for the SignalFx Smart Agent and why?

The “advanced installation” method using kubectl is recommended because it allows you to customize the entire Smart Agent configuration, which is necessary for the specific adjustments required for certain environments like AWS EKS.

Q3: Why might some Kubelet metrics be missing when monitoring AWS EKS, and what is the fix?

AWS EKS has slightly different Kubelet endpoints compared to other managed Kubernetes services, which can cause some metrics not to appear by default. This is fixed with a minor adjustment to the Smart Agent ConfigMap, specifically by adding a kubeletAPI section with the correct URL (https://${MY_NODE_NAME}:10250) and authentication type.

Q4: How can you configure SignalFx to monitor for CPU throttling?

To watch for CPU throttling in real-time, you must extend the kubelet-stats configuration in the Smart Agent ConfigMap by adding container_cpu_cfs_periods and container_cpu_cfs_throttled_periods to the extraMetrics section.

Q5: What two steps are required to monitor Kubernetes persistent volumes in SignalFx?

Monitoring persistent volumes is an opt-in feature that requires two steps:

Enable the kubernetes-volumes exporter in the Smart Agent ConfigMap.
Modify the SignalFx ClusterRole manifest to add persistentvolumeclaims and persistentvolumes to the list of resources it is allowed to read.

Q6: How can you enable monitoring for StatefulSets, which is not on by default?

By default, SignalFx collects metrics only for Deployments. To get metrics for StatefulSets, you must add extra metrics like kubernetes.stateful_set.current and kubernetes.stateful_set.desired to the kubernetes-cluster configuration section.

Q7: What is the most important consideration regarding cost when customizing SignalFx monitoring?

The most important thing to consider is that many of the added metrics are considered “custom metrics.” The number of custom metrics you can have is limited by your subscription level (e.g., 50 for a standard license, 200 for enterprise). It is extremely easy to exceed this limit, which can result in an unpleasantly high bill.

Q8: What is a practical way to reduce the number of custom metrics and manage costs?

You should carefully review which metrics you actually need and filter out or remove those you don’t. For example, if you can get container metrics from other components, you can delete the entire docker-container-stats section from the ConfigMap to save a significant number of metrics.