We've had a bunch of new servers in place for around 3 months now. They seem to be working well and are performing just fine.
Then, out of the blue, our monitoring started throwing alerts on seemingly random servers. Our queues were building up – basically, database performance had dropped dramatically and our processing scripts couldn't stuff data into the DBs fast enough.
What could be causing it?
I took a quick glance at our monitoring and various statistics confirmed the problems. You can clearly see in the following graphs that CPU Wait and Load Average increased at around 14:30, and at the same time MySQL throughput and command counters dropped dramatically:




So, I've found the symptoms; what could be the cause?
Digging around in the system logs, I found the following lines (emphasis is mine):
Apr 26 12:55:31 b008 Server Administrator: Storage Service EventID: 2176 The controller battery Learn cycle has started.: Battery 0 Controller 0
Apr 26 12:56:36 b008 Server Administrator: Storage Service EventID: 2266 Controller log file entry: Battery is discharging: Battery 0 Controller 0
Apr 26 12:56:36 b008 Server Administrator: Storage Service EventID: 2248 The controller battery is executing a Learn cycle.: Battery 0 Controller 0
Apr 26 14:21:52 b008 Server Administrator: Storage Service EventID: 2278 The controller battery charge level is below a normal threshold.: Battery 0 Controller 0
Apr 26 14:21:53 b008 Server Administrator: Storage Service EventID: 2188 The controller write policy has been changed to Write Through.: Battery 0 Controller 0
Apr 26 14:21:53 b008 Server Administrator: Storage Service EventID: 2199 The virtual disk cache policy has changed.: Virtual Disk 0 (Virtual Disk 0) Controller 0 (PERC 6/i Adapter)
Apr 26 14:55:06 b008 Server Administrator: Storage Service EventID: 2177 The controller battery Learn cycle has completed.: Battery 0 Controller 0
Apr 26 14:55:21 b008 Server Administrator: Storage Service EventID: 2247 The controller battery is charging.: Battery 0 Controller 0
Apr 26 14:55:21 b008 Server Administrator: Storage Service EventID: 2278 The controller battery charge level is below a normal threshold.: Battery 0 Controller 0
Apr 26 15:25:41 b008 Server Administrator: Storage Service EventID: 2279 The controller battery charge level is operating within normal limits: Battery 0 Controller 0
Apr 26 15:25:41 b008 Server Administrator: Storage Service EventID: 2189 The controller write policy has been changed to Write Back.: Battery 0 Controller 0
Apr 26 15:25:42 b008 Server Administrator: Storage Service EventID: 2199 The virtual disk cache policy has changed.: Virtual Disk 0 (Virtual Disk 0) Controller 0 (PERC 6/i Adapter)
Apr 26 18:50:26 b008 Server Administrator: Storage Service EventID: 2358 The battery charge cycle is complete.: Battery 0 Controller 0
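The two entries that matter most in the log above are EventID 2188 (switch to Write Through) and EventID 2189 (switch back to Write Back). Here is a minimal sketch of pulling just those out of syslog. The function name and the log path in the usage comment are my own assumptions, not part of the Dell tooling.

```shell
#!/bin/sh
# Print the timestamp and new policy for every write-policy change logged
# by Dell OpenManage. Event IDs are taken from the log excerpt above:
#   2188 = changed to Write Through, 2189 = changed to Write Back.
# Reads syslog lines on stdin.
write_policy_changes() {
    awk '/EventID: 2188 / { print $1, $2, $3, "write-through" }
         /EventID: 2189 / { print $1, $2, $3, "write-back" }'
}

# Usage (log path is an assumption; adjust for your syslog setup):
#   write_policy_changes < /var/log/messages
```

Fed the excerpt above, this prints "Apr 26 14:21:53 write-through" followed by "Apr 26 15:25:41 write-back", i.e. roughly an hour without write-back caching.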
Bingo!
The RAID controller is running a battery learn cycle. When the battery charge drops below a certain level, the controller switches its write policy from Write Back to Write Through, and that kills disk performance. In the log above, Write Through was in force from 14:21 to 15:25, which matches the slowdown in the graphs. It seems this test is enabled by default and is configured to run every 90 days. Of course, Sod's law dictates that it has to trigger in our busy period!
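To see which write policy a controller is using right now, rather than waiting for it to show up in syslog, something like the following could work. It is a sketch only: it assumes that "MegaCli -LDInfo -Lall -aALL" prints a line of the form "Current Cache Policy: WriteBack, ..." for each virtual disk, so check that against your MegaCli version. The function names and the Nagios-style return codes are my own choices.

```shell
#!/bin/sh
# Reads `MegaCli -LDInfo -Lall -aALL` output on stdin and prints the write
# policy (first item of each "Current Cache Policy" line), one per line.
# ASSUMPTION: the line looks like
#   Current Cache Policy: WriteThrough, ReadAdaptive, Direct, ...
current_write_policy() {
    awk -F': *' '/Current Cache Policy/ { split($2, p, ","); print p[1] }'
}

# Nagios-style wrapper around the parser above:
#   0 = every virtual disk in write-back, 1 = at least one in
#   write-through, 3 = no policy line found in the input.
check_write_policy() {
    policies=$(current_write_policy)
    case "$policies" in
        *WriteThrough*) echo "WARNING: write-through active"; return 1 ;;
        *WriteBack*)    echo "OK: write-back active"; return 0 ;;
        *)              echo "UNKNOWN: no cache policy found"; return 3 ;;
    esac
}

# Usage:
#   MegaCli -LDInfo -Lall -aALL | check_write_policy
```

One caveat from the comments on this post: a reader reports that polling MegaCli frequently made their PERC 6/i stall, so a check like this is probably best run sparingly.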
The Dell tools are not able to turn off the automatic learn cycle, but the LSI MegaCli tool can (Dell PERC 6/i controllers are re-badged LSI cards). I've run the following script on all servers (thanks to burr86 on ##infra-talk @ Freenode):
#!/bin/sh
# Disable the BBU auto-learn cycle on adapter 0.
TMPFILE=$(mktemp -p /tmp bbu.relearn.off.XXXXXXXXXX) || exit 1
# autoLearnMode=1 disables auto-learn; use =0 to re-enable it
echo "autoLearnMode=1" > "$TMPFILE"
MegaCli -AdpBbuCmd -SetBbuProperties -f"$TMPFILE" -a0
rm -f "$TMPFILE"
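I wrote a Puppet class to run this script on all nodes:
class megacli::check {
  # Run the script unless MegaCli already reports auto-learn as disabled.
  # The exec title doubles as the command, so the script must be on the
  # path given below.
  exec { 'perc_bbu_autolearn.sh':
    require => Class['megacli::install'],
    cwd     => '/tmp',
    path    => '/usr/local/bin:/bin',
    unless  => 'MegaCli -AdpBbuCmd -GetBbuProperties -a0 | grep -q "Auto-Learn Mode: Disabled"',
  }
}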
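To confirm by hand that the change took effect on a node, the same GetBbuProperties query the class uses in its unless guard can be inspected directly. A small sketch follows. The function names are hypothetical, and the "Auto Learn Period: ... Sec" line format is an assumption about MegaCli's output that you should verify on your own hardware.

```shell
#!/bin/sh
# Both functions read `MegaCli -AdpBbuCmd -GetBbuProperties -a0` output
# on stdin.

# Succeeds if auto-learn is reported as disabled. This is the same string
# the Puppet class in this post greps for.
autolearn_disabled() {
    grep -q "Auto-Learn Mode: Disabled"
}

# Prints the auto-learn period in whole days.
# ASSUMPTION: the output contains a line such as
#   Auto Learn Period: 7776000 Sec
learn_period_days() {
    awk -F': *' '/Auto Learn Period/ {
        sub(/ *Sec.*/, "", $2)        # strip the trailing unit
        printf "%d\n", $2 / 86400     # seconds -> days
    }'
}

# Usage:
#   MegaCli -AdpBbuCmd -GetBbuProperties -a0 | learn_period_days
#   MegaCli -AdpBbuCmd -GetBbuProperties -a0 | autolearn_disabled && echo disabled
```

A period of 7776000 seconds works out to the 90 days mentioned earlier.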
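A learn cycle can also be started manually with the Dell omconfig tool, for example during a quiet period:
omconfig storage battery action=startlearn controller=0 battery=0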
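If you still want the battery recalibrated now and then once auto-learn is off, the omconfig startlearn command above could be wrapped so that it only fires outside busy hours. This is a rough sketch, not something I have run in production: the quiet window (02:00 to 04:59) and the function names are made up for illustration, and you would call it from cron or similar.

```shell
#!/bin/sh
# Hypothetical quiet window, in local-time hours: start inclusive, end
# exclusive. Adjust to match your own traffic pattern.
QUIET_START=2
QUIET_END=5

# Succeeds if the given hour (0-23) falls inside the quiet window.
in_quiet_window() {
    [ "$1" -ge "$QUIET_START" ] && [ "$1" -lt "$QUIET_END" ]
}

# Start a learn cycle on controller 0 only if we are inside the window.
start_learn_if_quiet() {
    hour=$(date +%H)
    hour=${hour#0}    # drop a leading zero, e.g. "08" -> "8"
    if in_quiet_window "$hour"; then
        omconfig storage battery action=startlearn controller=0 battery=0
    else
        echo "outside quiet window, not starting learn cycle" >&2
        return 1
    fi
}
```

Bear in mind that the learn cycle in the log above took about two hours to reach the low-charge threshold, so the write-through period will start well after the cycle itself is kicked off.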

Sean Kelly says:
Thanks for this. I had the exact same issue and Dell was no help.
Kudos on a very detail-oriented diagnosis and fix.
June 21, 2010, 1:51 pm
Fred says:
Hello,
That's not a problem with Dell but with the LSI card.
Search the web for more details; there is a good explanation on the NEC website:
ftp://ftp2.nec-computers.com/TID0612071320/TID0612071320.doc
But many thanks for the script.
August 13, 2010, 2:49 am
robin says:
Well, it's not actually a problem specifically with the card, whether it's the LSI version or the Dell OEM version. It's more an operational issue, i.e. making sure that the battery learn process does not run when the disk subsystem is under heavy load.
August 31, 2010, 1:16 pm
The perils of uniform hardware and RAID auto-learn cycles | PHP SPain says:
[...] text file and found the command we needed to disable auto-learn, (++text files and saving links to reference material), sent that over the Skype chat, and the customer tried that. We weren’t sure the auto-learn [...]
November 16, 2010, 6:02 pm
The perils of uniform hardware and RAID auto-learn cycles | Weez.com says:
[...] text file and found the command we needed to disable auto-learn, (++text files and saving links to reference material), sent that over the Skype chat, and the customer tried that. We weren’t sure the auto-learn [...]
November 17, 2010, 12:09 am
The perils of uniform hardware and RAID auto-learn cycles | Unix Linux Windows says:
[...] text file and found the command we needed to disable auto-learn, (++text files and saving links to reference material), sent that over the Skype chat, and the customer tried that. We weren’t sure the auto-learn [...]
November 19, 2010, 4:20 am
Kenny says:
About 6 years ago we had this same problem. I called Dell and they even sent replacement batteries instead of explaining what the learn cycle is and how to run it manually.
December 2, 2010, 9:05 am
shoki says:
We also faced big problems with PERC 6/i hangs. MySQL was getting too many connections because the RAID was hanging for 2-10s with 100% util in iostat but no reads or writes going on. The hangs were periodic.
The cause was Nagios, which was checking the RAID status via MegaCli every hour. When MegaCli commands are issued, the controller hangs for a random length of time. This seems to be a PERC 6/i problem, because we have several other LSI controllers in production (even Intel OEMs) that don't have this problem… even when using the same MegaCli version.
So Dell, please fix your controller firmware… this is a pain in the ass problem! Or at least document this so other admins don't waste time.
December 2, 2010, 9:47 am
The perils of uniform hardware and RAID auto-learn cycles | MySQL Performance Blog says:
[...] text file and found the command we needed to disable auto-learn, (++text files and saving links to reference material), sent that over the Skype chat, and the customer tried that. We weren’t sure the auto-learn [...]
December 13, 2010, 4:34 am
Battery Learning still problem many years after | Weez.com says:
[...] performance problems caused by battery auto learning go many years back. We wrote about it, other people from MySQL Community too. The situation did not get better, at least not with Dell RAID [...]
February 11, 2011, 2:50 am
Battery Learning still problem many years after | MySQL Performance Blog says:
[...] performance problems caused by battery auto learning go many years back. We wrote about it, other people from MySQL Community too. The situation did not get better, at least not with Dell RAID [...]
June 27, 2011, 1:16 pm
Battery Learning still problem many years after - MySQL Performance Blog says:
[...] performance problems caused by battery auto learning go many years back. We wrote about it, other people from MySQL Community too. The situation did not get better, at least not with Dell RAID [...]
July 20, 2011, 5:20 am
strowger says:
Thanks for this post – it's an issue I'm already aware of, but there aren't enough people writing about the pitfalls of Dell kit. We've been climbing a steep learning curve with it, even using Dell's supported Redhat versions.
Couple of points:
1 – It's not the DRAC doing these tests as per your article title, it's the RAID controller.
2 – It is, as you say, not possible to prevent them. It is, however, possible to prevent the RAID from slowing down while they're run by using the Dell tools. There's obviously an element of risk in doing so, as for a short while you'll be running with write-back caching enabled and a flat battery. To live dangerously,
omconfig storage vdisk action=changepolicy controller=0 vdisk=0 writepolicy=fwb
September 8, 2011, 9:14 am
Hard lesson from Dell’s PERC H700 Battery Write Cache | j0e.us says:
[...] some more info on this issue, I found a nice blog posting here: http://yo61.com/dell-drac-bbu-auto-learn-tests-kill-disk-performance.html [...]
October 20, 2011, 10:05 pm
Looking for RAID Controller without Battery Learning problems ? | Weez.com says:
[...] Battery Learning Cycle problems and its impact to MySQL Performance. Here are couple of links (1,2). It is good to see though there are some controllers coming out which solve this problem, namely [...]
January 5, 2012, 8:49 pm
Slow database? Check RAID battery! | The Agile Radar says:
[...] Google searches turned out other articles and blog posts talking about it. It turns out one of the most frequently cited posts belongs to Robin Bowes, my ex-coworker from RIS Technology/Reliam! It also turns out Percona [...]
January 16, 2012, 2:07 pm