Problem
You have a device that collects the expected data sometimes, but there are inconsistent gaps in the charts.
This usually happens when a device can't reliably respond to the SNMP requests within the timeout period because the network between the ktranslate
container and the polled device is experiencing bandwidth contention, losing packets, or has high latency.
Another scenario might be that the device is overloaded and can't respond to the SNMP requests quickly. This usually happens when you try to collect OIDs from very large tables with a poll_time_sec
that is too fast for the device to keep up with.
Solution
As a general rule, locate your polling container as close to the monitored devices as you can to reduce the chance of a the UDP SNMP payloads not making the trip.
In cases where you must poll across WAN links with more latency, you may need to edit the snmp-base.yaml
file to increase the timeout_ms
from the default 5000ms to a longer interval.
If you believe you may have an unreliable connection to the remote site, consider increasing the retries
from the default of 0. Having a high number of retries, but a timeout that is too short, will likely not improve your situation and can potentially increase the loads on the monitored device as it's trying to respond to more requests and none of them get back before the timeout expires.
If you're polling large tables of data from devices, such as a busy load balancer, then it can take some additional time for the monitored device to gather the required data for the response. This can require setting a longer timeout_ms
period and and setting a longer delay for poll_time_sec
.