Note: This post originally appeared on my former employer's site (inside.godaddy.com) and has since been removed. Reposting here to share the information.
We here at GoDaddy deploy our MySQL database servers with RAID 10 for performance and reliability. Supporting that, we use hardware RAID with Dell-branded PERC cards. These cards offer a write-back cache to boost write performance: writes are stored in memory on the RAID controller and flushed to disk later. This provides a noticeable improvement because, from the OS perspective, a write is complete when it hits the cache, not the actual disk. Since data in the cache is volatile (that is, susceptible to power loss), there is also a battery that preserves the cache in the event of a power failure. This eliminates the possibility of data loss while preserving the speed benefits of a write cache.
There is an inherent drawback to using a battery-backed write cache. Many RAID controllers, like our Dell PERC cards, go through a battery learning cycle that calibrates the capacity of the battery to ensure it does not fail unexpectedly. For us, this cycle occurs every 90 days. When a battery learning cycle begins, the battery will fully charge, discharge and charge again, realigning its true capacity. While this process is running, however, the battery can no longer be relied upon to sustain the cache through a power failure. To prevent any data loss, the RAID controller automatically switches its write policy from write-back to write-through, whereby writes bypass the cache and hit the disk directly. This incurs a performance hit because the drive must service each request one by one, usually as random writes, rather than performing optimized or grouped writes to the disk.
Let’s look at the difference between the two write policies in terms of IO utilization. In the following example, we look at sda2, which receives the bulk of the MySQL traffic, captured with iostat at one-second intervals.
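The capture itself can be reproduced with something like the following. The awk filter is just a convenience to keep the header and the sda2 rows; it was not part of the original capture:

```shell
# Extended per-device statistics (-x) at one-second intervals, five samples.
# Requires the sysstat package; skipped gracefully if iostat is not installed.
if command -v iostat >/dev/null 2>&1; then
    iostat -x 1 5 | awk '/Device|sda2/'
fi
```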
When the disk is in write-back, we see reasonably low IO utilization (%util column):
rrqm/s  wrqm/s   r/s     w/s     rsec/s   wsec/s    avgrq-sz  avgqu-sz  await  svctm  %util
  0.00  466.34  0.00   66.34      0.00    4261.39     64.24      0.01    0.21   0.21   1.39
  0.00  388.00  0.00  129.00      0.00   12144.00     94.14      0.03    0.23   0.16   2.00
  0.00  428.00  0.00  120.00      0.00    4384.00     36.53      0.10    0.86   0.10   1.20
  0.00  335.00  0.00  255.00      0.00   19240.00     75.45      0.05    0.21   0.15   3.80
  0.00  370.00  0.00   49.00      0.00    3944.00     80.49      0.01    0.24   0.24   1.20
  0.00  255.00  0.00   91.00      0.00   11280.00    123.96      0.03    0.35   0.21   1.90
  0.00  294.00  0.00   63.00      0.00    2856.00     45.33      0.01    0.13   0.13   0.80
  0.00  300.00  0.00  180.00      0.00   11688.00     64.93      0.03    0.19   0.13   2.40
  0.00  356.00  0.00   58.00      0.00    3312.00     57.10      0.01    0.12   0.12   0.70
When the write policy is set to write-through, avgqu-sz (average number of requests waiting for the IO device), await (average time for an IO request to be served), svctm (how long the drive spends servicing each request) and %util all rise significantly, because the writes are going directly to the disk instead of using the cache.
rrqm/s  wrqm/s   r/s     w/s     rsec/s   wsec/s    avgrq-sz  avgqu-sz  await  svctm  %util
  0.00  273.00  0.00  103.00      0.00    8920.00     86.60      0.52    5.04   3.73  38.40
  0.00  178.00  0.00   90.00      0.00    4144.00     46.04      0.59    6.58   5.69  51.20
  0.00  267.00  0.00   57.00      0.00    2592.00     45.47      0.27    4.72   4.40  25.10
  0.00  320.00  0.00  125.00      0.00    9928.00     79.42      0.67    5.29   3.80  47.50
  0.00  267.00  0.00   71.00      0.00    4320.00     60.85      0.35    4.93   4.25  30.20
  0.00  412.00  0.00   80.00      0.00    8056.00    100.70      0.41    5.09   4.04  32.30
  0.00  303.00  0.00  164.00      0.00    7336.00     44.73      0.79    4.84   3.28  53.80
  0.00  412.87  0.00  179.21      0.00   12015.84     67.05      0.97    5.40   3.51  62.87
  0.00  303.00  0.00  151.00      0.00   11376.00     75.34      0.76    5.05   3.69  55.70
In periods of low IO utilization, the additional hit from a write-through policy is not necessarily a problem, but during peak volume it can lead to a massive performance loss on the database. The learning cycle is scheduled relative to the installation and power-on of the RAID card: if the card was installed and powered on at 10AM, the learning cycle will begin 90 days later at 10AM and cause degraded IO, regardless of what the load happens to be at that time. From this, we have a few options:
1) Disable battery learning
This is possible with some interfaces, but presents a big risk of not knowing the health of the battery. If your battery were to fail, but its failure was not reported back up, you could inadvertently lose data.
2) Utilize different hardware (dual battery, or NVRAM)
There are RAID cards available with dual batteries, where the RAID controller intelligently switches between them as needed. Other cards, like the Dell H710, back their cache with non-volatile memory (NVRAM) and do not need a battery at all. These are an optimal choice when you do not have low periods of IO traffic.
3) Switch from write-through to force write-back
We tossed around the idea of writing a script that runs out of cron every minute, watches for the write-through policy, and changes the policy to force write-back so that the cache is used regardless of battery status. This would require that we accept the risk of data loss in the event of a power outage.
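A minimal sketch of that cron job might look like the following. The `action=changepolicy`/`writepolicy=fwb` arguments and the exact "Write Policy" label in the omreport output are assumptions that should be verified against your OMSA version:

```shell
#!/bin/sh
# Sketch of the option-3 cron job: if the vdisk has fallen back to
# write-through, force write-back regardless of battery state.
# ASSUMPTION: the changepolicy syntax and the "Write Policy" label vary
# between OMSA releases; verify both before deploying.
# WARNING: with forced write-back, a power loss during a learn cycle can
# lose whatever writes are sitting in the cache.

# Pure helper so the decision is testable: force only when the controller
# currently reports write-through.
needs_force() {
    [ "$1" = "Write Through" ]
}

if command -v omreport >/dev/null 2>&1; then
    policy=$(omreport storage vdisk controller=0 vdisk=0 \
        | awk -F': *' '/Write Policy/ {print $2; exit}')
    if needs_force "$policy"; then
        omconfig storage vdisk action=changepolicy \
            controller=0 vdisk=0 writepolicy=fwb
    fi
fi
```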
4) Control when we learn
This is perhaps the simplest of the options. Dell provides their OMSA toolset, which includes commands to initiate a learn cycle, or even delay an upcoming learn cycle for up to seven days. By initiating the cycle during a low period, we are able to minimize degraded service.
For Dell OMSA, we query the time until the next learn cycle using “omreport storage battery”. If the next learn is scheduled within the next 24 hours, we force the learn cycle to happen immediately using “omconfig storage battery action=startlearn controller=0 battery=0”. This script is then triggered from cron during off hours for the server’s location and workload. Lastly, this is all bundled up into an RPM package that we can deploy to all of our servers using Spacewalk.
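Stripped down, the script amounts to something like the sketch below. The omreport/omconfig battery commands are the ones described above; the "Next Learn Time" label and its date format differ between OMSA releases, so the parsing here is a placeholder to adapt:

```shell
#!/bin/sh
# Sketch of the scheduled-learn script: if the next learn cycle is due
# within 24 hours, kick it off now, during our chosen off-hours window.
# ASSUMPTION: the "Next Learn Time" label and its date format vary by
# OMSA release; the hours_left conversion is left as the adaptation point.

# Pure helper so the threshold logic is testable.
learn_due_soon() {
    # $1: hours until the next scheduled learn cycle
    [ "$1" -le 24 ]
}

if command -v omreport >/dev/null 2>&1; then
    next=$(omreport storage battery controller=0 \
        | awk -F': *' '/Next Learn Time/ {print $2; exit}')
    # Convert "$next" into hours remaining here (format is version-specific).
    hours_left=${hours_left:-999}
    if learn_due_soon "$hours_left"; then
        omconfig storage battery action=startlearn controller=0 battery=0
    fi
fi
```

Running this from cron at, say, 3AM local time means that whenever the 90-day window is about to expire, the learn cycle lands in the quiet hours instead of wherever the power-on clock happened to put it.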
In summary, battery learning on RAID controllers can be a tricky problem to solve. It requires that you understand the risks involved in each of the policy settings and the relative workload of your environment, and that you measure the cost/benefit for the hardware you choose.