Monitoring iLO2 with Nagios
We have a whole bunch of HP servers, purchased in large part for their excellent Lights-Out management software. We wanted Nagios to monitor some basic things: the status of the fans, internal temperatures, power supplies, and voltage regulator modules (VRMs).
Fortunately, that turned out to be pretty easy.
The first step was to download the HP Lights-Out XML Perl Scripting Sample for Linux, which I did even though I don’t run Linux. The download contains a collection of sample XML scripts for accomplishing various tasks, and a Perl script (locfg.pl) that submits them to an iLO2 processor.
I then used locfg.pl to submit the following XML to each of our servers, in order to create a user with no substantial privileges.
<RIBCL VERSION="2.0">
  <LOGIN USER_LOGIN="adminuser" PASSWORD="adminpass">
    <USER_INFO MODE="write">
      <ADD_USER
        USER_NAME="Nagios Monitor"
        USER_LOGIN="nagiosuser"
        PASSWORD="nagiospass">
        <ADMIN_PRIV value="N"/>
        <REMOTE_CONS_PRIV value="N"/>
        <RESET_SERVER_PRIV value="N"/>
        <VIRTUAL_MEDIA_PRIV value="N"/>
        <CONFIG_ILO_PRIV value="N"/>
      </ADD_USER>
    </USER_INFO>
  </LOGIN>
</RIBCL>
This script allowed me to quickly create unprivileged users on each of the iLO2 consoles.
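If the account details differ from server to server, the same document can be generated rather than edited by hand. The following is a minimal sketch, not part of HP's sample scripts: the helper name add_user_ribcl and its parameters are my own invention, and it only builds the XML text; submitting it is still left to locfg.pl.

```ruby
require 'rexml/document'

# Hypothetical helper (not from HP's samples): builds the RIBCL document
# that adds one iLO2 user with every privilege denied.
def add_user_ribcl(admin_login, admin_pass, user_name, user_login, user_pass)
  doc   = REXML::Document.new
  ribcl = doc.add_element('RIBCL', 'VERSION' => '2.0')
  login = ribcl.add_element('LOGIN', 'USER_LOGIN' => admin_login, 'PASSWORD' => admin_pass)
  info  = login.add_element('USER_INFO', 'MODE' => 'write')
  user  = info.add_element('ADD_USER',
                           'USER_NAME'  => user_name,
                           'USER_LOGIN' => user_login,
                           'PASSWORD'   => user_pass)
  # Same five privilege elements as the hand-written XML, all set to "N".
  %w[ADMIN_PRIV REMOTE_CONS_PRIV RESET_SERVER_PRIV VIRTUAL_MEDIA_PRIV CONFIG_ILO_PRIV].each do |priv|
    user.add_element(priv, 'value' => 'N')
  end
  out = ''
  doc.write(out, 2) # pretty-print with two-space indentation
  out
end

puts add_user_ribcl('adminuser', 'adminpass', 'Nagios Monitor', 'nagiosuser', 'nagiospass')
```

Building the document with REXML instead of string interpolation means passwords containing characters such as & or < are escaped properly.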
Armed with an unprivileged user, I set about writing the actual plugin, and ended up with this:
#!/usr/bin/env ruby
require 'optparse'
require 'socket'
require 'openssl'
require 'rexml/document'
# Command Line Options
options = {
  :server => nil
}
opts = OptionParser.new do |opt|
  opt.banner = "Usage: #{$0} [options]"
  opt.on('-s', '--server HOSTNAME', String, "Hostname or IP of the server to query") { |i| options[:server] = i }
end
opts.parse!(ARGV)
if not options[:server]
  $stderr.puts "Server must be specified"
  # Exit code 3 is the Nagios UNKNOWN state
  exit 3
end
# iLO XML
xml_start = <<EOF
<RIBCL VERSION="2.22">
<LOGIN USER_LOGIN="nagiosuser" PASSWORD="nagiospass">
EOF
xml_end = <<EOF
</LOGIN>
</RIBCL>
EOF
xml_emhealth = <<EOF
<SERVER_INFO MODE="read">
<GET_EMBEDDED_HEALTH />
</SERVER_INFO>
EOF
error_cnt = 0
error_msg = ''
error_summary = ''
# Connect to the iLO2 over SSL and submit the RIBCL request
s = TCPSocket.open(options[:server], 443)
ssl = OpenSSL::SSL::SSLSocket.new(s, OpenSSL::SSL::SSLContext.new)
ssl.sync = true
ssl.connect
ssl.write("<?xml version=\"1.0\"?>\r\n")
ssl.write(xml_start)
ssl.write(xml_emhealth)
ssl.write(xml_end)
ssl.flush
res = ssl.readlines
ssl.close
s.close
# Pull the health data out of the response and parse it
doc = REXML::Document.new(res.join.match(/<GET_EMBEDDED_HEALTH_DATA>.*<\/GET_EMBEDDED_HEALTH_DATA>/m).to_s)
if ! doc.elements["GET_EMBEDDED_HEALTH_DATA"]
  # Without health data there is nothing to check, so report UNKNOWN and stop
  puts "UNKNOWN: Unable to fetch embedded health data"
  exit 3
end
# Fans
doc.root.elements["FANS"].each_element('//FAN') { |mod|
  if mod.elements["STATUS"].attributes['VALUE'] != 'Ok'
    error_cnt += 1
    error_msg += "#{mod.elements['LABEL'].attributes['VALUE']} - #{mod.elements['ZONE'].attributes['VALUE']} - #{mod.elements['STATUS'].attributes['VALUE']}\n"
    error_summary += "#{mod.elements['LABEL'].attributes['VALUE']}."
  end
}
# Temperature sensors (a status of 'n/a' is not treated as a problem)
doc.root.elements["TEMPERATURE"].each_element('//TEMP') { |mod|
  if mod.elements["STATUS"].attributes['VALUE'] != 'Ok' and mod.elements["STATUS"].attributes['VALUE'] != 'n/a'
    error_cnt += 1
    error_msg += "#{mod.elements['LABEL'].attributes['VALUE']} - #{mod.elements['LOCATION'].attributes['VALUE']} - #{mod.elements['STATUS'].attributes['VALUE']} - #{mod.elements['CURRENTREADING'].attributes['VALUE']} #{mod.elements['CURRENTREADING'].attributes['UNIT']} (Caution/Critical: #{mod.elements['CAUTION'].attributes['VALUE']}/#{mod.elements['CRITICAL'].attributes['VALUE']})\n"
    error_summary += "#{mod.elements['LABEL'].attributes['VALUE']}."
  end
}
# Voltage regulator modules
doc.root.elements["VRM"].each_element('//MODULE') { |mod|
  if mod.elements["STATUS"].attributes['VALUE'] != 'Ok'
    error_cnt += 1
    error_msg += "#{mod.elements['LABEL'].attributes['VALUE']} - #{mod.elements['STATUS'].attributes['VALUE']}\n"
    error_summary += "#{mod.elements['LABEL'].attributes['VALUE']}."
  end
}
# Power supplies
doc.root.elements["POWER_SUPPLIES"].each_element('//SUPPLY') { |mod|
  if mod.elements["STATUS"].attributes['VALUE'] != 'Ok'
    error_cnt += 1
    error_msg += "#{mod.elements['LABEL'].attributes['VALUE']} - #{mod.elements['STATUS'].attributes['VALUE']}\n"
    error_summary += "#{mod.elements['LABEL'].attributes['VALUE']}."
  end
}
# Report the result using the Nagios exit codes (0 = OK, 2 = CRITICAL)
if error_cnt == 0
  puts "OK: 0 problems"
  rc = 0
else
  puts "Critical: #{error_cnt} problems. #{error_summary}"
  puts error_msg
  rc = 2
end
exit rc
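The four checks in the plugin all follow the same pattern: find a section, walk its child elements, and flag anything whose STATUS is not 'Ok'. The sketch below shows that pattern on its own, so it can be tried without a live iLO2. The sample XML is a hypothetical, heavily trimmed response that I made up for illustration (a real iLO2 returns far more), and failing_labels is an illustrative helper name, not something the plugin defines.

```ruby
require 'rexml/document'

# Hypothetical, trimmed-down response; not captured from a real iLO2.
sample_xml = <<EOF
<GET_EMBEDDED_HEALTH_DATA>
  <FANS>
    <FAN>
      <LABEL VALUE="Fan 1"/>
      <ZONE VALUE="System"/>
      <STATUS VALUE="Ok"/>
    </FAN>
    <FAN>
      <LABEL VALUE="Fan 2"/>
      <ZONE VALUE="System"/>
      <STATUS VALUE="Failed"/>
    </FAN>
  </FANS>
  <POWER_SUPPLIES>
    <SUPPLY>
      <LABEL VALUE="Power Supply 1"/>
      <STATUS VALUE="Ok"/>
    </SUPPLY>
  </POWER_SUPPLIES>
</GET_EMBEDDED_HEALTH_DATA>
EOF

# Returns the labels of the components in one section whose STATUS value
# is not listed in `healthy`. A missing section yields an empty list.
def failing_labels(doc, section, item, healthy = ['Ok'])
  failing = []
  parent = doc.root.elements[section]
  return failing unless parent
  parent.each_element(item) do |mod|
    status = mod.elements['STATUS'].attributes['VALUE']
    failing << mod.elements['LABEL'].attributes['VALUE'] unless healthy.include?(status)
  end
  failing
end

doc = REXML::Document.new(sample_xml)
problems = failing_labels(doc, 'FANS', 'FAN') +
           failing_labels(doc, 'POWER_SUPPLIES', 'SUPPLY')
puts problems.empty? ? 'OK: 0 problems' : "Critical: #{problems.size} problems. #{problems.join('. ')}."
```

With this sample the script prints "Critical: 1 problems. Fan 2." The temperature check would pass ['Ok', 'n/a'] as the healthy list, since the plugin does not treat 'n/a' as a problem there.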
With that in place, I added all of the iLO2 hosts to a hostgroup and added a service that checks the group using the script. All my fans, power supplies, and such are now monitored without any operating-system-level overhead.