Problem
You have installed the Prometheus OpenMetrics integration for Docker or Kubernetes, but you notice sparse data, missing metrics, and data gaps in New Relic's UI.
Solution
If some metrics are not being collected regularly, do the following:
Check whether the CPU is being throttled and whether the memory allocated to the container is enough.
If the container doesn't have enough resources available, it may send data with a longer interval between samples. A low memory limit may cause the integration to be killed and restarted. The correct resource requests and limits vary with the number of targets monitored and the number of metrics exposed by each target. For more information, see the guidelines in the large environment configuration documentation.
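On Kubernetes, CPU throttling and out-of-memory restarts are governed by the container's resource requests and limits. The fragment below is a minimal sketch of how those limits could be raised; the container name and the values are illustrative assumptions to tune to the number of targets and metrics in your cluster.

containers:
  - name: nri-prometheus        # assumed container name; check your manifests
    resources:
      requests:
        cpu: 500m
        memory: 512Mi
      limits:
        cpu: "1"                # raise this if the CPU is being throttled
        memory: 1Gi             # raise this if the container is OOM-killed and restarted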
Check for the following error message in the logs of the integration:
{ "err": "unexpected post response code: 413: Request Entity Too Large" }This issue causes some payloads to be dropped, and it is currently fixed in new releases. If present, update the image to the most recent version.
If some /metrics endpoints monitored by the integration time out or take several seconds to answer, they may slow down the scraping of data. The integration's performance can degrade if multiple endpoints take a long time to answer, which leads to intermittent and missing data. To detect those endpoints, run:

SELECT average(nr_stats_integration_fetch_target_duration_seconds) FROM Metric WHERE clusterName = 'clustername' SINCE 30 MINUTES AGO FACET target LIMIT 30

This query retrieves the data exposed by the Prometheus OpenMetrics integration and shows the time required to fetch each endpoint.
Fix the endpoints with a latency above 1s, or exclude them from monitoring by removing the Prometheus scraping label. If removing these endpoints is not feasible and the response latency can't be avoided, configure the integration to run more workers in parallel. This allows the integration to fetch more endpoints at the same time.
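For example, on Kubernetes you can exclude a slow target by removing the scraping label from the pod or service that exposes it, or by setting it to "false". The fragment below assumes the default prometheus.io/scrape label is what enables scraping in your setup.

metadata:
  labels:
    # Setting the label to "false" (or removing it) excludes this target from scraping.
    prometheus.io/scrape: "false"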
To do so, update the integration to the most recent version and apply the new configuration option worker_threads. We advise increasing it gradually: from 4 to 6, 8, and up to 16. This workaround only minimizes the issue; if multiple endpoints are misbehaving, performance will still be degraded. Moreover, memory and CPU consumption grow with the number of workers, so the container's memory and CPU must be increased accordingly.
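On Kubernetes, worker_threads is set in the integration's configuration file. The example below is a sketch assuming the configuration is stored in a ConfigMap named nri-prometheus-cfg under a config.yaml key; the names and values may differ in your manifests or Helm chart.

apiVersion: v1
kind: ConfigMap
metadata:
  name: nri-prometheus-cfg      # assumed name; check your deployment
data:
  config.yaml: |
    cluster_name: "clustername"
    # Number of endpoints fetched in parallel. Increase gradually (4, 6, 8, up to 16)
    # and raise the container's CPU and memory limits accordingly.
    worker_threads: 4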
If all the monitored endpoints have a low latency and the container is not being throttled, run the following query. It helps you verify how much time the integration takes to scrape all the targets and send the data, and whether that time exceeds the configured scrape_duration.

SELECT latest(nr_stats_integration_process_duration_seconds) FROM Metric WHERE clusterName = 'clustername' SINCE 30 MINUTES AGO TIMESERIES

First, update the integration to the latest published image. Then, to reduce the time needed to scrape all the targets, increase the number of workers with the worker_threads configuration option. We advise increasing it gradually, from 4 to 6, 8, and up to 16, until nr_stats_integration_process_duration_seconds approaches the defined scrape_duration. Note that memory consumption and CPU usage will increase.