Technical details on how the discovery_server probe communicates
How does the discovery_server probe communicate?
- The discovery_server probe asks the primary hub for a list of hubs using the 'gethubs' callback.
- The discovery_server probe then issues 'getrobots' to each hub in that list to get a list of all robots.
- The discovery_server probe then does a 'nametoip' callback to find the IP address and port of each robot.
- If there IS a tunnel, nametoip returns the IP/port of a tunnel session; the discovery_server connects to that tunnel session, which handles the request to the robot.
- If there is NO tunnel, nametoip returns the actual IP address of the robot itself with port 48000, and the discovery_server tries to connect directly to the robot. This may fail if a) there is no route to the robot from the primary or b) there is no tunnel between hubs.
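The lookup order above can be sketched in a few lines of Python. This is only an illustration of the sequence, not the probe's actual implementation: the function name, the three callables passed in, and all addresses and robot paths are hypothetical stand-ins for the 'gethubs', 'getrobots' and 'nametoip' callbacks.

```python
from typing import Callable, Dict, List, Tuple

Address = Tuple[str, int]  # (ip, port)


def discover_robot_addresses(
    primary_hub: str,
    get_hubs: Callable[[str], List[str]],
    get_robots: Callable[[str], List[str]],
    name_to_ip: Callable[[str], Address],
) -> Dict[str, Address]:
    """Walk hubs -> robots -> addresses in the order described above."""
    addresses: Dict[str, Address] = {}
    for hub in get_hubs(primary_hub):             # step 1: 'gethubs' on the primary
        for robot in get_robots(hub):             # step 2: 'getrobots' on each hub
            addresses[robot] = name_to_ip(robot)  # step 3: 'nametoip' per robot
    return addresses


# Hypothetical sample data, for illustration only.
_hubs = ["/Domain/PrimaryHub/primary", "/Domain/RemoteHub/remote"]
_robots = {
    "/Domain/PrimaryHub/primary": ["/Domain/PrimaryHub/robot-a"],
    "/Domain/RemoteHub/remote": ["/Domain/RemoteHub/robot-b"],
}
_addresses = {
    # No tunnel: the robot's own address on port 48000.
    "/Domain/PrimaryHub/robot-a": ("192.0.2.10", 48000),
    # Tunnel: the address of a tunnel session (this port is made up).
    "/Domain/RemoteHub/robot-b": ("192.0.2.1", 48003),
}

result = discover_robot_addresses(
    "/Domain/PrimaryHub/primary",
    get_hubs=lambda hub: _hubs,
    get_robots=lambda hub: _robots[hub],
    name_to_ip=lambda robot: _addresses[robot],
)
```

The point of the sketch is that the discovery_server simply connects to whatever address nametoip returns, so without a tunnel that address must be directly reachable from the primary.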
Example scenario (no metrics displayed in USM for one or more robots):
In one customer case, there was no tunnel between the primary and remote hubs. The discovery_server relied on the nametoip callback, received the robot's address in the result (e.g., 10.24.x.x), and then attempted to connect directly to that IP on port 48000.
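One way to confirm whether a direct connection of this kind is possible is a plain TCP connect test run from the primary hub server. The following is a minimal sketch using only the Python standard library; the function name and the timeout value are assumptions for illustration, and the test checks only that the port accepts a TCP connection, not that the robot answers probe requests.

```python
import socket


def can_connect(host: str, port: int = 48000, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `can_connect()` with the address returned by nametoip should return False in a situation like the one described here, where there is no route from the primary to the robot.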
In the discovery_server log, search for errors that mention the hostname of the affected robot, or of any other host for which no metrics are displayed in USM. The entry below is typical of what you would expect to see: it shows that the discovery_server tried to communicate directly with the robot at 10.24.x.xxx to fetch the nis_cache elements, and that the attempt failed.
Error showing failure to fetch the nis_cache elements on the given robot:
15 Jan 2016 17:37:02,322 [robotWorker-2] WARN com.nimsoft.discovery.server.nimbus.scan.NisCacheUpdater - fetch nis cache failed on pass=0 with 0 total elems received for /HIXXX/XX_Secondary_Hub_Servers/<hostname> : (80) Session error, Unable to open a client session for 10.24.xx.xxx:48000: Connection timed out: connect
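When many robots are affected, entries like this can be pulled out of the log with a short script. The sketch below assumes the message layout matches the sample entry above; the regular expression, the function name, and the field names are illustrative and may need adjusting for other probe versions.

```python
import re
from typing import Dict, Optional

# Pattern derived from the sample entry above (an assumption, not a documented format).
FETCH_FAILED = re.compile(
    r"fetch nis cache failed on pass=(?P<pass_number>\d+) "
    r"with (?P<elems>\d+) total elems received for (?P<robot>\S+) : "
    r"\((?P<code>\d+)\) (?P<reason>.*)$"
)


def parse_fetch_failure(line: str) -> Optional[Dict[str, str]]:
    """Return the fields of a 'fetch nis cache failed' entry, or None if no match."""
    match = FETCH_FAILED.search(line)
    return match.groupdict() if match else None


SAMPLE = (
    "15 Jan 2016 17:37:02,322 [robotWorker-2] WARN "
    "com.nimsoft.discovery.server.nimbus.scan.NisCacheUpdater - "
    "fetch nis cache failed on pass=0 with 0 total elems received for "
    "/HIXXX/XX_Secondary_Hub_Servers/<hostname> : (80) Session error, "
    "Unable to open a client session for 10.24.xx.xxx:48000: "
    "Connection timed out: connect"
)

parsed = parse_fetch_failure(SAMPLE)
```

Applied to each line of the log file, `parse_fetch_failure` gives the robot path and the failure reason for every failed fetch, which makes it easier to see which robots the discovery_server could not reach.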
Set the discovery_server log to loglevel 5 with a large logsize (e.g., 20000), and check the actual log file on the file system to be sure.