Nvidia GPU Monitoring

Telegraf can monitor Nvidia GPUs through the Nvidia-SMI interface. The nvidia_smi input queries the SMI, which returns its data as XML, and parses that output into metrics.

This guide assumes you already have InfluxDB and Grafana set up with a Telegraf data source.

1. Install Telegraf, if it is not installed already.

2. Edit your telegraf.conf file and add the following input plugin:

[[inputs.nvidia_smi]]
  bin_path = "C:\\Program Files\\NVIDIA Corporation\\NVSMI\\nvidia-smi.exe"
  timeout = "5s"

Make sure "bin_path" is set correctly for your environment!

Telegraf's configuration file is written in TOML, and inside a double-quoted TOML string a backslash starts an escape sequence. Every backslash in a Windows path therefore has to be doubled ("\\") in your "bin_path" string. Otherwise Telegraf won't be able to find nvidia-smi.exe!
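If you would rather not double the backslashes by hand, the escaping can be sketched in a few lines of Python. This is only an illustration: the function name toml_bin_path_line is made up for this example, and the path is the one from the config above, so replace it with the location of nvidia-smi.exe on your own machine.

```python
def toml_bin_path_line(windows_path):
    """Return a bin_path line with every backslash doubled for a quoted string."""
    escaped = windows_path.replace("\\", "\\\\")
    return 'bin_path = "{}"'.format(escaped)


# Raw string: the path exactly as it appears in Windows Explorer.
raw_path = r"C:\Program Files\NVIDIA Corporation\NVSMI\nvidia-smi.exe"
print(toml_bin_path_line(raw_path))
```

The printed line can be pasted into the [[inputs.nvidia_smi]] section as-is.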

3. Restart Telegraf to pick up the new .conf file.
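Before or after restarting, it can help to check that the new input actually produces data. Telegraf has a test mode that gathers metrics once and prints them instead of sending them to InfluxDB. The Python sketch below wraps that call; the function names are made up for this example, and it assumes that "telegraf" is on your PATH and that you pass in the location of your own telegraf.conf.

```python
import subprocess


def telegraf_test_command(config_path, telegraf_bin="telegraf"):
    """Build the command that runs only the nvidia_smi input in test mode."""
    return [
        telegraf_bin,
        "--config", config_path,
        "--input-filter", "nvidia_smi",
        "--test",
    ]


def run_telegraf_test(config_path, telegraf_bin="telegraf"):
    """Run the test command and return (exit code, printed metrics)."""
    result = subprocess.run(
        telegraf_test_command(config_path, telegraf_bin),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return result.returncode, result.stdout
```

Calling run_telegraf_test with the path to your config should print one or more nvidia_smi lines. If nothing is printed, or an error appears, "bin_path" is the first thing to check.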

4. Open Grafana and create the panels.

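Each panel uses one or more of the queries below. Change the "host" value in every query ('GUARDIAN' in these examples) to match the host name of your own machine!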

GPU Memory:

SELECT mean("memory_used") FROM "nvidia_smi" WHERE ("host" = 'GUARDIAN') AND $timeFilter GROUP BY time($__interval) fill(null)

SELECT mean("memory_total") FROM "nvidia_smi" WHERE ("host" = 'GUARDIAN') AND $timeFilter GROUP BY time($__interval) fill(null)

SELECT mean("memory_free") FROM "nvidia_smi" WHERE ("host" = 'GUARDIAN') AND $timeFilter GROUP BY time($__interval) fill(null)

GPU Utilization:

SELECT mean("utilization_gpu") FROM "nvidia_smi" WHERE ("host" = 'GUARDIAN') AND $timeFilter GROUP BY time($__interval) fill(null)

Memory Utilization:

SELECT mean("utilization_memory") FROM "nvidia_smi" WHERE ("host" = 'GUARDIAN') AND $timeFilter GROUP BY time($__interval) fill(null)

Power Usage:

SELECT mean("power_draw") FROM "nvidia_smi" WHERE ("host" = 'GUARDIAN') AND $timeFilter GROUP BY time($__interval) fill(null)

GPU Fan Speed:

SELECT mean("fan_speed") FROM "nvidia_smi" WHERE ("host" = 'GUARDIAN') AND $timeFilter GROUP BY time($__interval) fill(null)

GPU Temp:

SELECT mean("temperature_gpu") FROM "nvidia_smi" WHERE ("host" = 'GUARDIAN') AND $timeFilter GROUP BY time($__interval) fill(null)
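All of the queries above follow the same pattern and differ only in the field name, so they can be generated for your own host instead of edited one by one. The following Python sketch does that; the names FIELDS and build_query are made up for this example, and the host name is inserted as-is, so it must not contain a single quote.

```python
# Field names taken from the queries in this guide.
FIELDS = [
    "memory_used",
    "memory_total",
    "memory_free",
    "utilization_gpu",
    "utilization_memory",
    "power_draw",
    "fan_speed",
    "temperature_gpu",
]


def build_query(field, host):
    """Return the Grafana query for one nvidia_smi field on one host."""
    return (
        'SELECT mean("{field}") FROM "nvidia_smi" '
        "WHERE (\"host\" = '{host}') AND $timeFilter "
        "GROUP BY time($__interval) fill(null)"
    ).format(field=field, host=host)


# Print every query for the example host used in this guide.
for field in FIELDS:
    print(build_query(field, "GUARDIAN"))
```

$timeFilter and $__interval are left untouched because Grafana fills them in when the panel is rendered.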

The current implementation of the SMI input for Telegraf does not yet support all of the query options of the SMI interface. Depending on the version you have installed, you may see more fields than the ones used here! (Version used for this tutorial: 1.11.0.)


Revision #3
Created Tue, Jul 2, 2019 6:23 PM by Alexander Henderson
Updated Fri, Jul 12, 2019 1:02 PM by Alexander Henderson