Monitoring iLO2 with Nagios

Posted by Kevin Way Wed, 05 Sep 2007 12:21:00 GMT

We have a whole bunch of HP servers, purchased in large part for their excellent Lights-Out management software. We wanted to have Nagios monitor some basic things, like the status of the fans, internal temperatures, power supplies, and VRMs.

Fortunately, that turned out to be pretty easy.

The first step was to download the HP Lights-Out XML Perl Scripting Sample for Linux. I did this even though I don’t use Linux as a platform. The resulting file contains a bunch of sample XML scripts for accomplishing various goals, and a perl script (locfg.pl) that submits them to an iLO2 processor.

I then used locfg.pl to submit the following XML to each of our servers, in order to create a user with no substantial privileges.


<RIBCL VERSION="2.0">
  <LOGIN USER_LOGIN="adminuser" PASSWORD="adminpass">
  <USER_INFO MODE="write">
    <ADD_USER 
      USER_NAME="Nagios Monitor" 
      USER_LOGIN="nagiosuser" 
      PASSWORD="nagiospass">
      <ADMIN_PRIV value ="N"/>
      <REMOTE_CONS_PRIV value ="N"/>
      <RESET_SERVER_PRIV value ="N"/>
      <VIRTUAL_MEDIA_PRIV value ="N"/>
      <CONFIG_ILO_PRIV value="N"/>
    </ADD_USER>
  </USER_INFO>
  </LOGIN>
</RIBCL>

This script allowed me to quickly create unprivileged users on each of the iLO2 consoles.

Armed with an unprivileged user, I set about writing the actual plugin, and ended up with this:


#!/usr/bin/env ruby
require 'optparse'
require 'socket'
require 'openssl'
require 'rexml/document'

# Command Line Options
options = {
  :server => nil
}

opts = OptionParser.new do |opt|
  opt.banner = "Usage: #{$0}  [options]" 
  opt.on('-s', '--server HOSTNAME', String, "Hostname or IP of the server to query") { |i| options[:server] = i }
end
opts.parse!(ARGV)

if not options[:server]
  $stderr.puts "Server must be specified" 
  exit!
end

# iLO XML
xml_start = <<EOF
<RIBCL VERSION="2.22">
  <LOGIN USER_LOGIN="nagiosuser" PASSWORD="nagiospass">
EOF

xml_end = <<EOF
  </LOGIN>
</RIBCL>
EOF

xml_emhealth = <<EOF
<SERVER_INFO MODE="read">
  <GET_EMBEDDED_HEALTH />
</SERVER_INFO>
EOF

error_cnt = 0
error_msg = ''
error_summary = ''

s = TCPsocket.open(options[:server], 443)
ssl = OpenSSL::SSL::SSLSocket.new(s, OpenSSL::SSL::SSLContext.new)
ssl.sync
ssl.connect
ssl.write("<?xml version=\"1.0\"?>\r\n")
ssl.write(xml_start)
ssl.write(xml_emhealth)
ssl.write(xml_end)
ssl.flush
res = ssl.readlines

ssl.close
s.close

doc = REXML::Document.new(res.to_s.match(/<GET_EMBEDDED_HEALTH_DATA>.*<\/GET_EMBEDDED_HEALTH_DATA>/m).to_s)

if ! doc.elements["GET_EMBEDDED_HEALTH_DATA"]
  error_cnt += 1
  error_msg += "Unable to fetch embedded health data\n" 
end

doc.root.elements["FANS"].each_element('//FAN') { |mod|
  if mod.elements["STATUS"].attributes['VALUE'] != 'Ok'
    error_cnt += 1
    error_msg += "#{mod.elements['LABEL'].attributes['VALUE']} - #{mod.elements['ZONE'].attributes['VALUE']} - #{mod.elements['STATUS'].attributes['VALUE']}\n" 
    error_summary += "#{mod.elements['LABEL'].attributes['VALUE']}." 
  end
}

doc.root.elements["TEMPERATURE"].each_element('//TEMP') { |mod|
  if mod.elements["STATUS"].attributes['VALUE'] != 'Ok' and mod.elements["STATUS"].attributes['VALUE'] != 'n/a'
    error_cnt += 1
    error_msg += "#{mod.elements['LABEL'].attributes['VALUE']} - #{mod.elements['LOCATION'].attributes['VALUE']} - #{mod.elements['STATUS'].attributes['VALUE']} - #{mod.elements['CURRENTREADING'].attributes['VALUE']} #{mod.elements['CURRENTREADING'].attributes['UNIT']} (Caution/Critical: #{mod.elements['CAUTION'].attributes['VALUE']}/#{mod.elements['CRITICAL'].attributes['VALUE']})\n" 
    error_summary += "#{mod.elements['LABEL'].attributes['VALUE']}." 
  end
}

doc.root.elements["VRM"].each_element('//MODULE') { |mod|
  if mod.elements["STATUS"].attributes['VALUE'] != 'Ok'
    error_cnt += 1
    error_msg += "#{mod.elements['LABEL'].attributes['VALUE']} - #{mod.elements['STATUS'].attributes['VALUE']}\n" 
    error_summary += "#{mod.elements['LABEL'].attributes['VALUE']}." 
  end
}

doc.root.elements["POWER_SUPPLIES"].each_element('//SUPPLY') { |mod|
  if mod.elements["STATUS"].attributes['VALUE'] != 'Ok'
    error_cnt += 1
    error_msg += "#{mod.elements['LABEL'].attributes['VALUE']} - #{mod.elements['STATUS'].attributes['VALUE']}\n" 
    error_summary += "#{mod.elements['LABEL'].attributes['VALUE']}." 
  end
}

if error_cnt == 0
  puts "OK: 0 problems" 
  rc=0
else
  puts "Critical: #{error_cnt} problems. #{error_summary}" 
  puts error_msg
  rc=2
end

exit rc

Now having that in place, I simply added all of the iLO2 hosts to a hostgroup, and added a service to check that group using the script, and all my fans, power supplies and such are now monitored without any operating-system level overhead.

PostgreSQL scaling on 6.2 and 7.0

Posted by Kevin Way Wed, 11 Apr 2007 22:40:00 GMT

My previous posts have documented the utterly insane lack of scalability of PostgreSQL (at least according to that particular metric) on FreeBSD 6.2.

I ran the same test, on the same machine, using 7.0 with the 4BSD and the ULE schedulers. Postgres was tuned as in the previous posts, and WITNESS, INVARIANTS and all malloc debugging options were off.

Results:

Hopefully nobody will extrapolate this to mean more than it really does (it’s just one test, but it’s one that happens to matter to me.)

More is better?

Posted by Kevin Way Wed, 11 Apr 2007 07:55:00 GMT

Not always. I did some more tests, and confirmed that this particular workload actually gets slower as you add more than 2 cores.

sysbench --num-threads=${i} --test=oltp --pgsql-user=bench --pgsql-db=bench --db-driver=pgsql --max-time=60 --max-requests=0 --oltp-read-only=on run

Here’s the results, with the tests run on absolutely identical hardware. (HP DL360, 2.00 GHz E5335 Xeon, 10K RPM SAS drives, 10 GB of RAM)

I know 6.2 isn’t the greatest for SMP, but I was still surprised to learn that we could get significantly lower overall performance by adding more processors.

I tried to test 7.0, but the 7.0 installer didn’t seem to recognize the iLO virtual keyboard tonight. Maybe tomorrow will go better.

What did I do wrong? 2

Posted by Kevin Way Tue, 10 Apr 2007 01:09:00 GMT

I’m hoping that somebody will look at this and say “hey, dummy, you did this all wrong.” I’m hoping that, because the results are shocking.

There have been a buzz about FreeBSD’s SMP performance compared to Linux’s, particularly a particular 8-core SMP test, using FreeBSD 7.0, sysbench, and mysql. A number of people mentioned that there were known problems with this test and PostgreSQL, so I decided to see how bad the problems were.

The results are so shocking, that my first thought was that I must have done something wrong.

The hardware:

Machine 1:
HP DL360 G5
1 5140 (dual core) 2.33 GHz (1333 FSB) Xeon
4 GB RAM
2 15K RPM SAS hard drives, running as mirrors through an e400i drive array

Machine 2:
HP DL360 G5
2 E5335 (quad core) 2.33 GHz (1333 FSB) Xeons
10 GB RAM
2 10K RPM SAS hard drives, running as mirrors through an e400i drive array

Operating System: Both machines are running FreeBSD 6.2-p3. After all, I wanted to see what performance levels to expect with production-quality software, not the 7.0 sweetness.

Kernel: Both are running identically configured SMP-enabled amd64 kernels.

sysctl.conf: both have the following
kern.ipc.shmmax=2147483647
kern.ipc.shmall=524288
kern.ipc.semmsl=512
kern.ipc.semmap=256
kern.ipc.somaxconn=2048
kern.maxfiles=65536
vfs.read_max=32
loader.conf: both have the following
kern.ipc.semmni=256
kern.ipc.semmns=2048
postgresql.conf: both have the following settings changed
shared_buffers = 1GB                 
work_mem = 64MB
maintenance_work_mem = 32MB
max_fsm_pages = 204800   
random_page_cost = 3.0
effective_cache_size = 512MB
update_process_title = off

Both drive arrays are partitioned identically, and postgres was installed as the only running application in each case.

The test was run as follows:
sysbench --num-threads=${i} --test=oltp --pgsql-user=bench --pgsql-db=bench --db-driver=pgsql --max-time=120 --max-requests=0 --oltp-read-only=on run

And yet here are the results:

Yes, the 8-core, 10GB machine is massively slower than the 2-core 4GB machine.

This can’t be right, can it?

Followup: I ran this single-cpu test:
openssl speed rsa
1×2-Core 2.33GHz 5140 Xeon
                  sign    verify    sign/s verify/s
rsa  512 bits   0.0003s   0.0000s   3905.9  45028.8
rsa 1024 bits   0.0009s   0.0001s   1164.6  18787.2
rsa 2048 bits   0.0049s   0.0002s    205.2   6332.2
rsa 4096 bits   0.0320s   0.0005s     31.2   1917.9

2×4-Core 2.33 GHz E5335 Xeon
rsa  512 bits   0.0003s   0.0000s   3381.1  38972.0
rsa 1024 bits   0.0010s   0.0001s   1003.5  15991.7
rsa 2048 bits   0.0057s   0.0002s    176.7   5481.7
rsa 4096 bits   0.0373s   0.0006s     26.8   1638.6

I wanted to see if this was simply one algorithm, or if other algorithms showed the same sort of slowdown, so I ran the following:

openssl speed md5 sha1 aes-256-cbc
The 2-core machine:
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
md5              14172.75k    45816.42k   114118.73k   182293.03k   220175.36k
sha1             14901.45k    42795.69k    97387.27k   143547.69k   166264.67k
aes-256 cbc      91518.54k    94159.65k    93817.92k    92696.25k    93907.24k
The 8-core machine:
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
md5              12305.58k    39149.96k    97745.88k   155771.63k   188252.02k
sha1             12740.06k    36648.90k    83318.76k   122733.05k   142119.67k
aes-256 cbc      78109.56k    80589.95k    81222.08k    81406.47k    81401.94k

The 8-core is a little slower than the 2-core again, but not by the enormous levels we see on the pgsql sysbench.

When running RSA across multiple threads, the results are closer to what one would expect. The command I used to do this:


openssl speed rsa -multi N

And now I’m left with the question: Did I screw something up, or is this what Kris Kennaway was talking about when he wrote this message?

I must admit, that while I expected the performance to fall off, it never occurred to me that an 8-core system could be slower than a 2-core system right out of the gate.

Unscientific 15K v 10K SAS Drive Comparison

Posted by Kevin Way Mon, 09 Apr 2007 20:50:00 GMT

We’ve been buying a lot of new machines today, and almost all of them come with SAS drives. We bought 10K SAS drives for obviously performance-neutral applications, and 15K SAS drives for performance-critical applications, but what exactly did we get for an extra $200/drive?

I decided to do some wildly unscientific tests, using two HP DL360s, one with mirrored 10K RPM drives, the other with mirrored 15K RPM drives, both running FreeBSD 6.2/AMD64. Both machines are equipped with P400i drive controllers.

10K SAS drives
# diskinfo -t /dev/da0
/dev/da0
        512             # sectorsize
        73372631040     # mediasize in bytes (68G)
        143305920       # mediasize in sectors
        17562           # Cylinders according to firmware.
        255             # Heads according to firmware.
        32              # Sectors according to firmware.

Seek times:
        Full stroke:      250 iter in   1.016654 sec =    4.067 msec
        Half stroke:      250 iter in   0.955639 sec =    3.823 msec
        Quarter stroke:   500 iter in   1.960625 sec =    3.921 msec
        Short forward:    400 iter in   1.444117 sec =    3.610 msec
        Short backward:   400 iter in   1.457452 sec =    3.644 msec
        Seq outer:       2048 iter in   2.067837 sec =    1.010 msec
        Seq inner:       2048 iter in   2.069040 sec =    1.010 msec
Transfer rates:
        outside:       102400 kbytes in   2.078970 sec =    49255 kbytes/sec
        middle:        102400 kbytes in   2.598712 sec =    39404 kbytes/sec
        inside:        102400 kbytes in   3.115841 sec =    32864 kbytes/sec
15K SAS Drives
#  diskinfo -t /dev/da0
/dev/da0
        512             # sectorsize
        73372631040     # mediasize in bytes (68G)
        143305920       # mediasize in sectors
        17562           # Cylinders according to firmware.
        255             # Heads according to firmware.
        32              # Sectors according to firmware.

Seek times:
        Full stroke:      250 iter in   0.723955 sec =    2.896 msec
        Half stroke:      250 iter in   0.756001 sec =    3.024 msec
        Quarter stroke:   500 iter in   1.255116 sec =    2.510 msec
        Short forward:    400 iter in   0.680525 sec =    1.701 msec
        Short backward:   400 iter in   0.878597 sec =    2.196 msec
        Seq outer:       2048 iter in   2.055020 sec =    1.003 msec
        Seq inner:       2048 iter in   2.062749 sec =    1.007 msec
Transfer rates:
        outside:       102400 kbytes in   1.331560 sec =    76902 kbytes/sec
        middle:        102400 kbytes in   1.848040 sec =    55410 kbytes/sec
        inside:        102400 kbytes in   1.652754 sec =    61957 kbytes/sec

So far, so good. As expected, the results are faster for all the tasks that require seeking, and sequential transfer rates are considerably higher as well. I think I like these tiny lil drives.

Now let’s try some bonnie, with some big files. Let’s try bonnie++ -d /usr/home -u root -s 16g -n 256:65536:65536:16

10K SAS mirror
Version 1.93c       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
anemone.insides 16G   379  99 49303  13 14526   9   621  98 30751   7 423.1  94
Latency             22409us     394ms    1414ms   37000us     142ms     121ms
Version 1.93c       ------Sequential Create------ --------Random Create--------
anemone.insidesyste -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
 256:65536:65536/16   540  14   661  47 32500  63   588  15   117   8  8358  64
Latency              3045ms     175ms     506ms    2631ms    1841ms     946ms
1.93c,1.93c,anemone.insidesystems.net,1,1176110507,16G,,379,99,49303,13,14526,9,621,98,30751,7,423.1,94,256,65536,65536,,16,540,14,661,47,32500,63,588,15,117,8,8358,64,22409us,394ms,1414ms,37000us,142ms,121ms,3045ms,175ms,506ms,2631ms,1841ms,946ms
15k SAS mirror
Version 1.93c       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
noname.insidesy 16G   444  99 77857  19 22512   6   613  94 85113  15 490.1  15
Latency             20163us    2910ms     296ms     274ms     114ms   51627us
Version 1.93c       ------Sequential Create------ --------Random Create--------
noname.insidesystem -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
 256:65536:65536/16  1013  19   845  13 44078  86   959  18   103   3 13215  94
Latency              2763ms     800ms   57543us    2798ms    1805ms     259ms
1.93c,1.93c,noname.insidesystems.net,1,1176153578,16G,,444,99,77857,19,22512,6,613,94,85113,15,490.1,15,256,65536,65536,,16,1013,19,845,13,44078,86,959,18,103,3,13215,94,20163us,2910ms,296ms,274ms,114ms,51627us,2763ms,800ms,57543us,2798ms,1805ms,259ms

And with that, I was reasonably satisfied that we weren’t wasting money when we bought the 15K SAS drives.

DELETE FROM important WHERE overly broad condition;

Posted by Kevin Way Wed, 14 Mar 2007 03:20:00 GMT

A client had an interesting problem. They had a critical PostgreSQL database, and they accidentally ran an overly broad delete on a table.

And their newest backup was two weeks old.

Could we help? Of course we could help.

The procedure used was as follows:

1) Stop the database.

2) Backup the database files to a secure location.

3) Make a second copy of the database files, to a new, empty postgresql instance.

4) Now if you have a dump prior to the error, and xlogs for the entire time period, you can do a Point-In-Time Recovery, with a properly configured recovery.conf. Unfortunately we didn’t have that. We had a two-week old dump and 6 hours of xlogs.

5) Try to determine an approximate transaction ID of the mistake. We did this by looking at timestamps of files on the newest pg_clog file, finding one that happened before the mistake, and then multiplying it’s name by 1048576.

6) Then we used that xid with pg_resetxlog, on the secondary database that we had created. It worked, and we had a view of the data. We then played a bit, to find an xid that was acceptably close to the point of failure.

7) From there, we just did a straight SQL COPY, to get the data from the db server we had just setup onto the original database, and to get it loaded.

It wasn’t the most elegant solution, but it worked beautifully, and the client was pleased.

Here’s to hoping you never need this information!

Older posts: 1 2 3