Recommended alert policies and dashboards

When you first install the Kubernetes integration, we deploy a default set of recommended alert conditions and dashboards to your account. These form the basis for alert conditions and dashboards on your Kubernetes cluster. Alert conditions are grouped into two policies: Kubernetes alert policy and Google Kubernetes Engine alert policy.

While we've tried to address the most common use cases in all environments, there are a number of additional alerts you can set up to extend the default policy. See Getting started with New Relic alerts to learn more about alerts.

Adding the recommended alert conditions and dashboards

To add recommended alert policies and dashboards, follow these steps:

1. Go to one.newrelic.com > Integrations & Agents.
2. In the search box, type kubernetes.
3. Select one of these options:
   - Kubernetes: Adds the default set of recommended alert conditions and a dashboard.
   - Google Kubernetes Engine: Adds the default set of recommended Google Kubernetes Engine alert conditions and a dashboard.
4. Click Begin installation if you need to install the Kubernetes integration, or click Skip this step if you've already set up this integration.
5. Depending on the option you selected in step 3, you'll see different resources to add:
   - If you selected Kubernetes: the default set of recommended alert conditions and a dashboard.
   - If you selected Google Kubernetes Engine: the default set of recommended Google Kubernetes Engine alert conditions and a dashboard.
6. Click See your data to see a dashboard with your Kubernetes data in New Relic.

How to see the recommended alert policies

To view the recommended alert policies you've added, do this:

1. Go to one.newrelic.com > All capabilities > Alerts.
2. Click Alert Policies in the left navigation pane.
3. You'll see Kubernetes alert policy and Google Kubernetes Engine alert policy.

How to see the Kubernetes dashboards

A collection of recommended pre-built dashboards helps you instantly visualize your Kubernetes data for common use cases. See Manage your recommended dashboards to learn how to view these dashboards.

Kubernetes alert policy

This is the default set of recommended alert conditions you'll add:

This alert condition generates an alert when a container is throttled by more than 25% for more than 5 minutes. It runs this query:

FROM K8sContainerSample
SELECT sum(containerCpuCfsThrottledPeriodsDelta) / sum(containerCpuCfsPeriodsDelta) * 100 
WHERE clusterName IN ('YOUR_CLUSTER_NAME') 
AND namespaceName IN ('YOUR_NAMESPACE_NAME') 
FACET containerName, podName, namespaceName, clusterName

See the GitHub configuration file for more info.
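
The arithmetic behind the throttling query can be sketched in Python. This is only an illustration of the ratio the query computes, not New Relic's evaluation code; the function name and the sample numbers are hypothetical, and returning 0.0 when no periods were recorded is a choice made for this sketch.

```python
def throttled_percent(throttled_periods: list[int], total_periods: list[int]) -> float:
    """Mirror sum(throttled deltas) / sum(period deltas) * 100 over a set of samples."""
    total = sum(total_periods)
    if total == 0:
        # Assumption for this sketch: no CFS periods recorded means nothing to report.
        return 0.0
    return sum(throttled_periods) / total * 100


# Hypothetical deltas from three samples of one container.
throttled = [10, 40, 30]
periods = [100, 100, 100]
pct = throttled_percent(throttled, periods)
print(f"{pct:.1f}% throttled, above the 25% threshold: {pct > 25}")
```

Because the query sums both deltas before dividing, a single heavily throttled sample can be offset by quieter samples in the same window.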

This alert condition generates an alert when the average container CPU usage against the limit exceeds 90% for over 5 minutes. It runs this query:

FROM K8sContainerSample
SELECT average(cpuCoresUtilization)
WHERE clusterName IN ('YOUR_CLUSTER_NAME') 
AND namespaceName IN ('YOUR_NAMESPACE_NAME') 
FACET containerName, podName, namespaceName, clusterName

See the GitHub configuration file for more info.
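
Several conditions in this policy trigger only when a value stays above a threshold for a set duration. A simplified model of that behavior, assuming one aggregated value per minute, might look like the sketch below. The function name and the per-minute assumption are hypothetical; the real evaluation depends on the settings in the condition's configuration file.

```python
def breaches_for_duration(values: list[float], threshold: float, minutes: int) -> bool:
    """Return True if the most recent `minutes` values all exceed `threshold`.

    `values` is assumed to hold one aggregated data point per minute, oldest first.
    """
    if len(values) < minutes:
        return False  # not enough history to cover the full duration
    return all(v > threshold for v in values[-minutes:])


# Hypothetical average cpuCoresUtilization values, one per minute.
sustained = [91.0, 92.5, 95.0, 93.2, 96.1]
with_dip = [91.0, 92.5, 88.0, 93.2, 96.1]
print(breaches_for_duration(sustained, 90, 5))
print(breaches_for_duration(with_dip, 90, 5))
```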

This alert condition generates an alert when the average container memory usage against the limit exceeds 90% for over 5 minutes. It runs this query:

FROM K8sContainerSample
SELECT average(memoryWorkingSetUtilization)
WHERE clusterName IN ('YOUR_CLUSTER_NAME') 
AND namespaceName IN ('YOUR_NAMESPACE_NAME') 
FACET containerName, podName, namespaceName, clusterName

See the GitHub configuration file for more info.

This alert condition generates an alert when container restarts exceed 0 in a 5-minute sliding window. It runs this query:

FROM K8sContainerSample
SELECT sum(restartCountDelta)
WHERE clusterName IN ('YOUR_CLUSTER_NAME') 
AND namespaceName IN ('YOUR_NAMESPACE_NAME') 
FACET containerName, podName, namespaceName, clusterName

See the GitHub configuration file for more info.
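
To illustrate what a 5-minute sliding window means for this condition, here is a minimal Python sketch. It assumes one restartCountDelta value per minute and a window of five data points; the function name and sample data are hypothetical.

```python
from collections import deque


def restart_flags(deltas: list[int], window: int = 5) -> list[bool]:
    """For each minute, report whether restarts summed over the trailing window exceed 0."""
    recent: deque[int] = deque(maxlen=window)  # keeps only the last `window` deltas
    flags = []
    for delta in deltas:
        recent.append(delta)
        flags.append(sum(recent) > 0)
    return flags


# One restart at minute 2; the condition stays true while that restart is in the window.
print(restart_flags([0, 0, 1, 0, 0, 0, 0, 0]))
```

In this model a single restart keeps the condition true until that data point slides out of the window.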

This alert condition generates an alert when a container stays in the Waiting status for more than 5 minutes. It runs this query:

FROM K8sContainerSample
SELECT uniqueCount(podName)
WHERE status = 'Waiting' AND clusterName IN ('YOUR_CLUSTER_NAME') 
AND namespaceName IN ('YOUR_NAMESPACE_NAME') 
FACET containerName, podName, namespaceName, clusterName

See the GitHub configuration file for more info.

This alert condition generates an alert when the daemonset is missing any pods for a period longer than 5 minutes. It runs this query:

FROM K8sDaemonsetSample
SELECT latest(podsMissing)
WHERE clusterName IN ('YOUR_CLUSTER_NAME')
AND namespaceName IN ('YOUR_NAMESPACE_NAME')
FACET daemonsetName, namespaceName, clusterName

See the GitHub configuration file for more info.

This alert condition generates an alert when the deployment is missing any pods for a period longer than 5 minutes. It runs this query:

FROM K8sDeploymentSample
SELECT latest(podsMissing)
WHERE clusterName IN ('YOUR_CLUSTER_NAME') 
AND namespaceName IN ('YOUR_NAMESPACE_NAME') 
FACET deploymentName, namespaceName, clusterName

See the GitHub configuration file for more info.

This alert condition generates an alert when the Etcd file descriptor usage exceeds 90% for over 5 minutes. It runs this query:

FROM K8sEtcdSample
SELECT max(processFdsUtilization)
WHERE clusterName IN ('YOUR_CLUSTER_NAME')
FACET displayName, clusterName

See the GitHub configuration file for more info.

This alert condition generates an alert when Etcd is leaderless for over 1 minute. It runs this query:

FROM K8sEtcdSample
SELECT min(etcdServerHasLeader)
WHERE clusterName IN ('YOUR_CLUSTER_NAME') 
FACET displayName, clusterName

See the GitHub configuration file for more info.

This alert condition generates an alert when the current replicas of a horizontal pod autoscaler are lower than the desired replicas for more than 5 minutes. It runs this query:

FROM K8sHpaSample
SELECT latest(desiredReplicas - currentReplicas)
WHERE clusterName IN ('YOUR_CLUSTER_NAME') 
AND namespaceName IN ('YOUR_NAMESPACE_NAME') 
FACET displayName, namespaceName, clusterName

See the GitHub configuration file for more info.

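This alert condition generates an alert when a horizontal pod autoscaler reaches its maximum number of replicas, that is, when the difference between its maximum and current replicas drops to zero. It runs this query: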

FROM K8sHpaSample
SELECT latest(maxReplicas - currentReplicas)
WHERE clusterName IN ('YOUR_CLUSTER_NAME') 
AND namespaceName IN ('YOUR_NAMESPACE_NAME') 
FACET displayName, namespaceName, clusterName

See the GitHub configuration file for more info.

This alert condition generates an alert when a job reports a failed status. It runs this query:

FROM K8sJobSample
SELECT uniqueCount(jobName)
WHERE failed = 'true'
AND clusterName IN ('YOUR_CLUSTER_NAME')
AND namespaceName IN ('YOUR_NAMESPACE_NAME')
FACET jobName, namespaceName, clusterName, failedPodsReason

See the GitHub configuration file for more info.

This alert condition generates an alert when more than 5 pods in a namespace fail for more than 5 minutes. It runs this query:

FROM K8sPodSample
SELECT uniqueCount(podName)
WHERE clusterName IN ('YOUR_CLUSTER_NAME') 
AND namespaceName IN ('YOUR_NAMESPACE_NAME') 
AND status = 'Failed'
FACET namespaceName, clusterName

See the GitHub configuration file for more info.

This alert condition generates an alert when the average node allocatable CPU utilization exceeds 90% for more than 5 minutes. It runs this query:

FROM K8sNodeSample
SELECT average(allocatableCpuCoresUtilization)
WHERE clusterName IN ('YOUR_CLUSTER_NAME') 
FACET nodeName, clusterName

See the GitHub configuration file for more info.

This alert condition generates an alert when the average node allocatable memory utilization exceeds 90% for more than 5 minutes. It runs this query:

FROM K8sNodeSample
SELECT average(allocatableMemoryUtilization)
WHERE clusterName IN ('YOUR_CLUSTER_NAME') 
FACET nodeName, clusterName

See the GitHub configuration file for more info.

This alert condition generates an alert when a node is unavailable for 5 minutes. It runs this query:

FROM K8sNodeSample
SELECT latest(condition.Ready)
WHERE clusterName IN ('YOUR_CLUSTER_NAME') 
FACET nodeName, clusterName

See the GitHub configuration file for more info.

This alert condition generates an alert when a node is marked unschedulable. It runs this query:

FROM K8sNodeSample
SELECT latest(unschedulable)
WHERE clusterName IN ('YOUR_CLUSTER_NAME') 
FACET nodeName, clusterName

See the GitHub configuration file for more info.

This alert condition generates an alert when a node's running pods exceed 90% of the node's pod capacity for more than 5 minutes. It runs this query:

FROM K8sPodSample, K8sNodeSample
SELECT ceil(
  filter(uniqueCount(podName), WHERE status = 'Running') / latest(capacityPods) * 100
) AS 'Pod Capacity %'
WHERE nodeName != '' AND nodeName IS NOT NULL 
AND clusterName IN ('YOUR_CLUSTER_NAME') 
FACET nodeName, clusterName

See the GitHub configuration file for more info.
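
As a worked example of the pod capacity calculation, the following Python sketch reproduces the rounded-up percentage that the query reports. The function name and the node sizes are hypothetical, and the multiplication is done before the division only to keep the floating-point result exact for whole percentages.

```python
import math


def pod_capacity_percent(running_pods: int, capacity_pods: int) -> int:
    """Return running pods as a percentage of node pod capacity, rounded up."""
    if capacity_pods <= 0:
        raise ValueError("capacity_pods must be positive")
    return math.ceil(running_pods * 100 / capacity_pods)


# Hypothetical node with a capacity of 110 pods.
for running in (50, 99, 100):
    pct = pod_capacity_percent(running, 110)
    print(f"{running} running pods -> {pct}% (above 90%: {pct > 90})")
```

Because the value is rounded up, a node at 90.1% of capacity is reported as 91% and counts as exceeding the 90% threshold.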

This alert condition generates an alert when the average node root file system capacity utilization exceeds 90% for more than 5 minutes. It runs this query:

FROM K8sNodeSample
SELECT average(fsCapacityUtilization)
WHERE clusterName IN ('YOUR_CLUSTER_NAME') 
FACET nodeName, clusterName

See the GitHub configuration file for more info.

This alert condition generates an alert when a persistent volume is in a failed or pending state for more than 5 minutes. It runs this query:

FROM K8sPersistentVolumeSample
SELECT uniqueCount(volumeName)
WHERE statusPhase IN ('Failed','Pending') 
AND clusterName IN ('YOUR_CLUSTER_NAME') 
FACET volumeName, clusterName

See the GitHub configuration file for more info.

This alert condition generates an alert when a pod is unable to be scheduled for more than 5 minutes. It runs this query:

FROM K8sPodSample
SELECT latest(isScheduled)
WHERE clusterName IN ('YOUR_CLUSTER_NAME')
AND namespaceName IN ('YOUR_NAMESPACE_NAME') 
FACET podName, namespaceName, clusterName

See the GitHub configuration file for more info.

This alert condition generates an alert when a pod is unavailable for over 5 minutes. It runs this query:

FROM K8sPodSample
SELECT latest(isReady)
WHERE status NOT IN ('Failed', 'Succeeded') 
AND clusterName IN ('YOUR_CLUSTER_NAME')
AND namespaceName IN ('YOUR_NAMESPACE_NAME')
FACET podName, namespaceName, clusterName

See the GitHub configuration file for more info.

This alert condition generates an alert when a statefulset is missing any pods for more than 5 minutes. It runs this query:

FROM K8sStatefulsetSample
SELECT latest(podsMissing)
WHERE clusterName IN ('YOUR_CLUSTER_NAME')
AND namespaceName IN ('YOUR_NAMESPACE_NAME') 
FACET statefulsetName, namespaceName, clusterName

See the GitHub configuration file for more info.
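
The queries in this policy use YOUR_CLUSTER_NAME and YOUR_NAMESPACE_NAME as placeholders. If you want to reuse a query outside the installer, a small helper can fill them in. The sketch below is only an illustration: the helper names and the example cluster and namespace names are hypothetical, and it rejects names containing single quotes rather than assuming an escaping rule.

```python
# Query text taken from the "pod is unavailable" condition above.
QUERY_TEMPLATE = """FROM K8sPodSample
SELECT latest(isReady)
WHERE status NOT IN ('Failed', 'Succeeded')
AND clusterName IN ('YOUR_CLUSTER_NAME')
AND namespaceName IN ('YOUR_NAMESPACE_NAME')
FACET podName, namespaceName, clusterName"""


def quoted_list(names: list[str]) -> str:
    """Render names as a quoted, comma-separated list for an IN (...) clause."""
    if not names:
        raise ValueError("at least one name is required")
    for name in names:
        if "'" in name:
            raise ValueError(f"unsupported character in name: {name!r}")
    return ", ".join(f"'{name}'" for name in names)


def fill_placeholders(template: str, clusters: list[str], namespaces: list[str]) -> str:
    """Replace the quoted placeholders with the given cluster and namespace names."""
    return (
        template
        .replace("'YOUR_CLUSTER_NAME'", quoted_list(clusters))
        .replace("'YOUR_NAMESPACE_NAME'", quoted_list(namespaces))
    )


print(fill_placeholders(QUERY_TEMPLATE, ["prod-east"], ["default", "payments"]))
```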

Google Kubernetes Engine alert policy

This is the default set of recommended Google Kubernetes Engine alert conditions you'll add:

This alert condition generates an alert when a node's CPU utilization exceeds 90% for at least 15 minutes. It runs this query:

FROM Metric
SELECT max(`gcp.kubernetes.node.cpu.allocatable_utilization`) * 100
WHERE clusterName LIKE '%'
FACET gcp.kubernetes.nodeName

See the GitHub configuration file for more info.

This alert condition generates an alert when a node's memory usage exceeds 85% of its total capacity. It runs this query:

FROM Metric
SELECT max(`gcp.kubernetes.node.memory.allocatable_utilization`) * 100
WHERE clusterName LIKE '%'
FACET gcp.kubernetes.nodeName

See the GitHub configuration file for more info.

Recommended alert policies and dashboards

Adding the recommended alert conditions and dashboards

How to see the recommended alert policies

How to see the Kubernetes dashboards

Kubernetes alert policy

Kubernetes Dashboard (dashboard)

Container CPU throttling is high (alert condition)

Container high CPU utilization (alert condition)

Container high memory utilization (alert condition)

Container is restarting (alert condition)

Container is waiting (alert condition)

Daemonset is missing pods (alert condition)

Deployment is missing pods (alert condition)

`Etcd` file descriptor utilization is high (alert condition)

`Etcd` has no leader (alert condition)

HPA current replicas < desired replicas (alert condition)

HPA has reached maximum replicas (alert condition)

Job Failed (alert condition)

More than 5 pods failing in namespace (alert condition)

Node allocatable CPU utilization is high (alert condition)

Node allocatable memory utilization is high (alert condition)

Node is not ready (alert condition)

Node is unschedulable (alert condition)

Node pod count nearing capacity (alert condition)

Node root file system capacity utilization is high (alert condition)

Persistent volume has errors (alert condition)

Pod cannot be scheduled (alert condition)

Pod is not ready (alert condition)

`statefulset` is missing pods (alert condition)

Google Kubernetes engine alert policy

Google Kubernetes Engine (dashboard)

High CPU utilization (alert condition)

High memory usage (alert condition)
