IO, IO, It’s Off to Testing We Go

In my last post, I learned in disappointing fashion that sometimes you need to start small and work your way up, rather than trying to put together a finished product. This go-round, I’ll talk about my investigation into disk IO.

In an effort to better understand the hardware I have and its capacity, I started off by trying to get some basic information about the RAID controller and the disks. This particular machine is a Supermicro with an as-yet-unknown RAID controller and sixteen 4TB disks arranged in RAID 6. Finding out more about the disks and controller was the first step. “hdparm -i” wasn’t able to give me much, nor was “cat /sys/class/block/sdb/device/{model,vendor}”. “dmesg” got me to a list of hard disks, Hitachi 7200rpm, with a model number I could Google. It also gave me enough controller information to point to megaraid, which is LSI, which got me over to this MegaCli cheat sheet. “MegaCli -AdpAllInfo -aALL” actually got me a great deal of information. (In other news, I now think that Dell’s OMSA command line utility is a lot less terrible after trying to figure out MegaCli.)
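For reference, the discovery path boiled down to a handful of commands (a sketch; the MegaCli binary name and location vary by package, and it is often installed as /opt/MegaRAID/MegaCli/MegaCli64):

$ hdparm -i /dev/sdb                               # not much behind a RAID controller
$ cat /sys/class/block/sdb/device/{model,vendor}   # also sparse
$ dmesg | grep -i -e megaraid -e scsi              # surfaced the Hitachi disks and the megaraid driver
$ MegaCli -AdpAllInfo -aALL                        # full adapter details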

With this information, I can finally start answering some questions. First, the RAID controller has a 512MB battery-backed cache, so it should offer decent write performance. I can also see that all of the drives report good SMART health, and none are in a prefail or failed state, either of which could degrade performance.
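These are the targeted MegaCli queries for that health information, as best I can reconstruct them from the cheat sheet (treat the exact flag spellings as a sketch; MegaCli is notoriously picky about them):

$ MegaCli -AdpBbuCmd -aALL     # battery backup unit status and charge
$ MegaCli -PDList -aALL        # per-disk firmware state, media errors, predictive failure counts
$ MegaCli -LDInfo -Lall -aALL  # virtual drive cache policy (WriteBack vs WriteThrough)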

As a reminder from the last post, there is a single RAID controller presenting two 60TB RAID 6 virtual drives, combined with LVM into approximately 120TB and formatted with XFS using default settings. If I’m going to start making changes to file systems or remove LVM, I really need to benchmark things before I make changes.
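Before tearing anything down, it’s worth recording the current layout so there’s a known “before” state. The standard LVM inspection commands are enough:

$ pvs      # physical volumes backing the group (the two virtual drives)
$ vgs      # the volume group spanning them
$ lvs      # the ~120TB logical volume
$ df -hT   # confirms the XFS mount on top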

There are a lot of ways to do IO performance testing under Linux. Some common methods use hdparm, sysbench, or dd. There are also other packages out there like fio, bonnie++ and iozone. To keep things simple (i.e., I didn’t want to compile or put unfamiliar packages onto the system), I decided to stick with dd and sysbench.


As my workload is primarily writes, I wanted to start with dd for write testing. This is fairly straightforward: something like “dd if=/dev/zero of=/path/to/mountpoint/testfile bs=1G count=1 oflag=direct”. The oflag=direct is important because we want to bypass the Linux page cache and go directly to disk, just as MySQL does when we set innodb_flush_method=O_DIRECT. For the first test, I wanted to see how sequentially writing 1GB of data would perform.

dd if=/dev/zero of=/mnt/test1/testfile bs=1G count=1 oflag=direct
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 1.05494 s, 1.0 GB/s

Well, that wasn’t very helpful. What if we try a single 8G block?

dd if=/dev/zero of=/mnt/test1/testfile bs=8G count=1 oflag=direct
0+1 records in
0+1 records out
2147479552 bytes (2.1 GB) copied, 4.05025 s, 530 MB/s

That’s… worse, though note the “0+1 records in”: dd silently capped the single 8G block at about 2GB (the largest single write the kernel allows), so this run wrote only a quarter of the intended data. What about eight 1G blocks? That should give us a consistent, long-running write:

dd if=/dev/zero of=/mnt/test1/testfile bs=1G count=8 oflag=direct
8+0 records in
8+0 records out
8589934592 bytes (8.6 GB) copied, 7.34673 s, 1.2 GB/s

There we go. I ran this a few more times to see if it was consistent, and it was, generally sticking around 1.2 to 1.3 GB/s.
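A trivial shell loop makes the repeat runs less tedious; dd reports its statistics on stderr, so redirecting and tailing grabs just the throughput line:

$ for i in 1 2 3 4 5; do dd if=/dev/zero of=/mnt/test1/testfile bs=1G count=8 oflag=direct 2>&1 | tail -n 1; done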

Now I want to see what these numbers look like with small writes. With bs=512, dd still writes sequentially, so this isn’t truly random IO, but each tiny direct write does have to hit the storage on its own:

dd if=/dev/zero of=/mnt/test1/testfile bs=512 count=1000 oflag=direct
1000+0 records in
1000+0 records out
512000 bytes (512 kB) copied, 0.162107 s, 3.2 MB/s

Ouch, that’s terrible, but also very repeatable: 2.8 to 3.2 MB/s.
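It’s worth restating that number as operations rather than bytes (simple arithmetic from the dd output above): 1000 writes / 0.162107 s ≈ 6,170 writes/sec, and 6,170 writes/sec × 512 bytes ≈ 3.2 MB/s. The throughput only looks terrible because each write is tiny; the operation rate is actually high, presumably courtesy of the controller’s write-back cache.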

One more try with a larger block size:

dd if=/dev/zero of=/mnt/test1/testfile bs=1024 count=10000 oflag=direct
10000+0 records in
10000+0 records out
10240000 bytes (10 MB) copied, 1.94483 s, 5.3 MB/s

Very consistent, between 5.2 and 5.4 MB/s.


I had never used sysbench before, but I had heard of it in the context of MySQL performance testing. Percona has written several articles about it, including this really old article about IO performance for MySQL. In the end, I settled on this article, which seemed to fit what I was looking for: writes against a lot of large files. To test, you prepare the files and then run the test. The options below give me 150GB of files and test random reads and writes for 5 minutes. I also want a single thread (because that’s how replication works), and I want to be sure we’re using direct IO.

$ sysbench --test=fileio --file-total-size=150G prepare
sysbench 0.4.12: multi-threaded system evaluation benchmark

128 files, 1228800Kb each, 153600Mb total
Creating files for the test...

$ sysbench --test=fileio --file-total-size=150G --file-test-mode=rndrw --init-rng=on --file-extra-flags=direct --max-time=300 --max-requests=0 run
sysbench 0.4.12: multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1
Initializing random number generator from timer.


Extra file open flags: 16384
128 files, 1.1719Gb each
150Gb total file size
Block size 16Kb
Number of random requests for random IO: 0
Read/Write ratio for combined random IO test: 1.50
Periodic FSYNC enabled, calling fsync() each 100 requests.
Calling fsync() at the end of test, Enabled.
Using synchronous I/O mode
Doing random r/w test
Threads started!
Time limit exceeded, exiting...
Done.

Operations performed: 21033 Read, 14022 Write, 44800 Other = 79855 Total
Read 328.64Mb Written 219.09Mb Total transferred 547.73Mb (1.8257Mb/sec)
 116.85 Requests/sec executed

Test execution summary:
 total time: 300.0085s
 total number of events: 35055
 total time taken by event execution: 299.2567
 per-request statistics:
 min: 0.09ms
 avg: 8.54ms
 max: 206.39ms
 approx. 95 percentile: 18.74ms

Threads fairness:
 events (avg/stddev): 35055.0000/0.00
 execution time (avg/stddev): 299.2567/0.00

This is a copious amount of information, and after trying to digest it all, the numbers I cared most about were:

  • Total requests per second: 116.85
  • Total data read and written: 1.8257Mb/sec
  • Average and 95th percentile per-request latency: 8.54ms and 18.74ms
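One housekeeping note: sysbench’s fileio test has a third phase, cleanup, which removes the 128 test files between configurations:

$ sysbench --test=fileio --file-total-size=150G cleanup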

Now that I have some data, let’s tear down LVM (lvremove, vgremove, pvremove). For my goal of running as many instances as possible concurrently, I don’t actually need 120TB of contiguous space; I can partition this however I prefer, as long as each partition can hold a complete copy of the data with room for growth. For this test, I’ll format one of the two virtual drives with XFS and the other with EXT4 and see what that gets us.
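The teardown and reformat went roughly like this (the volume group, logical volume, and mount names are illustrative, not the real ones; /dev/sda4 and /dev/sdb are taken from the df output later in this post):

$ umount /mnt/data                 # hypothetical mount point
$ lvremove /dev/vg_data/lv_data    # hypothetical LV name
$ vgremove vg_data                 # hypothetical VG name
$ pvremove /dev/sda4 /dev/sdb      # assuming these were the PVs
$ mkfs.xfs /dev/sda4               # XFS on the first virtual drive
$ mkfs.ext4 /dev/sdb               # EXT4 on the second, which promptly failed: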

mkfs.ext4: Size of device /dev/sdb too big to be expressed in 32 bits
 using a blocksize of 4096.

Perhaps I simply can’t create a file system this large with EXT4. Using fdisk, I create a new partition:

Disk /dev/sdb: 64000.0 GB, 63999995543552 bytes
255 heads, 63 sectors/track, 7780889 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0xcc50f3d5

 Device Boot Start End Blocks Id System

Command (m for help): n
Command action
 e extended
 p primary partition (1-4)
p
Partition number (1-4): 1
First cylinder (1-7780889, default 1): 
Using default value 1
Last cylinder, +cylinders or +size{K,M,G} (1-267349, default 267349): 32000G

Command (m for help): p

Disk /dev/sdb: 64000.0 GB, 63999995543552 bytes
255 heads, 63 sectors/track, 7780889 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0xcc50f3d5

 Device Boot Start End Blocks Id System
/dev/sdb1 1 32000 257039968+ 83 Linux

Command (m for help): w
The partition table has been altered!

Calling ioctl() to re-read partition table.
Syncing disks.

IMPORTANT NOTE: at the time, I didn’t realize two things:

  1. fdisk doesn’t support GPT, which is plainly stated at the top of fdisk’s output (“WARNING: GPT (GUID Partition Table) detected on ‘/dev/sdb’! The util fdisk doesn’t support GPT. Use GNU Parted”), but I failed to notice it
  2. Sizes for partitions need to be prefixed with +, or fdisk interprets the value as an ending cylinder number (see the example after this list)
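For the record, the size entry should have looked like this (with the leading +, fdisk treats the value as a size rather than an ending cylinder):

Last cylinder, +cylinders or +size{K,M,G} (1-267349, default 267349): +32000G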

At this point, I have two mounts, /mnt/test1, XFS, and /mnt/test2, EXT4. Rerunning the same test as before and comparing relevant results (XFS first, then EXT4):

Operations performed: 21401 Read, 14267 Write, 45568 Other = 81236 Total
Read 334.39Mb Written 222.92Mb Total transferred 557.31Mb (1.8577Mb/sec)
 118.89 Requests/sec executed
Test execution summary:
 per-request statistics:
 min: 0.01ms
 avg: 8.32ms
 max: 225.19ms
 approx. 95 percentile: 18.97ms

Operations performed: 54195 Read, 36130 Write, 115584 Other = 205909 Total
Read 846.8Mb Written 564.53Mb Total transferred 1.3783Gb (4.7042Mb/sec)
 301.07 Requests/sec executed
Test execution summary:
 per-request statistics:
 min: 0.00ms
 avg: 3.22ms
 max: 233.93ms
 approx. 95 percentile: 12.45ms

Wow. At first glance, EXT4 is almost 3x faster than XFS, and LVM on top of XFS was adding minimal overhead. This seemed like it might be a fluke, so I ran several more tests, and they all showed similar results. Then, by chance, I ran df:

Filesystem Size Used Avail Use% Mounted on
/dev/sda4 59T 151G 59T 1% /mnt/test1
/dev/sdb1 242G 151G 79G 66% /mnt/test2

This is where I noticed the aforementioned IMPORTANT NOTE: thanks to my fdisk mistakes, the EXT4 partition was only 242GB instead of spanning the 64TB virtual drive.

It turns out that CentOS 6 ships an older version of e2fsprogs that can’t create an EXT4 file system larger than 16TB (the block count overflows a 32-bit integer). Thanks to this article explaining the problem, I opted to grab the source, build it locally (but not install it), and run the freshly built command. I also needed to redo the partition table using parted, which does support GPT.
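The build itself was a standard autotools affair; roughly this, with the version taken from the mke2fs output below (treat the exact steps as a sketch):

$ tar xzf e2fsprogs-1.42.12.tar.gz
$ cd e2fsprogs-1.42.12
$ ./configure
$ make
(then run the freshly built ./misc/mke2fs in place, with no make install)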

$ parted /dev/sdb
(parted) rm 1 
(parted) mkpart primary 0.00TB 64.00TB 
(parted) print 
Model: SMC SMC2108 (scsi)
Disk /dev/sdb: 64.0TB
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number Start End Size File system Name Flags
 1 0.00TB 64.0TB 64.0TB primary


$ ./misc/mke2fs -t ext4 /dev/sdb1
mke2fs 1.42.12 (29-Aug-2014)

Warning: the fs_type huge is not defined in mke2fs.conf

Creating filesystem with 15624998400 4k blocks and 3906256896 inodes
Filesystem UUID: 47dc165f-0bfb-45b0-ba36-ba23e2807cc7
Superblock backups stored on blocks: 
 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 
 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968, 
 102400000, 214990848, 512000000, 550731776, 644972544, 1934917632, 
 2560000000, 3855122432, 5804752896, 12800000000

Allocating group tables: done 
Writing inode tables: done 
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done 

(omitted mounting)

$ df -h /mnt/test{1,2}
Filesystem Size Used Avail Use% Mounted on
/dev/sda4 59T 151G 59T 1% /mnt/test1
/dev/sdb1 58T 129M 55T 1% /mnt/test2

Now that I have a properly sized EXT4 mounted on /mnt/test2, let’s redo our tests:

$ sysbench --test=fileio --file-total-size=150G prepare
sysbench 0.4.12: multi-threaded system evaluation benchmark

128 files, 1228800Kb each, 153600Mb total
Creating files for the test...
$ sysbench --test=fileio --file-total-size=150G --file-test-mode=rndrw --init-rng=on --max-time=300 --max-requests=0 run
sysbench 0.4.12: multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1
Initializing random number generator from timer.


Extra file open flags: 0
128 files, 1.1719Gb each
150Gb total file size
Block size 16Kb
Number of random requests for random IO: 0
Read/Write ratio for combined random IO test: 1.50
Periodic FSYNC enabled, calling fsync() each 100 requests.
Calling fsync() at the end of test, Enabled.
Using synchronous I/O mode
Doing random r/w test
Threads started!
Time limit exceeded, exiting...
Done.

Operations performed: 42119 Read, 28079 Write, 89728 Other = 159926 Total
Read 658.11Mb Written 438.73Mb Total transferred 1.0711Gb (3.6559Mb/sec)
 233.98 Requests/sec executed

Test execution summary:
 total time: 300.0171s
 total number of events: 70198
 total time taken by event execution: 293.5396
 per-request statistics:
 min: 0.00ms
 avg: 4.18ms
 max: 763.07ms
 approx. 95 percentile: 14.41ms

Threads fairness:
 events (avg/stddev): 70198.0000/0.00
 execution time (avg/stddev): 293.5396/0.00

Interesting. The size of the file system itself clearly matters to how fast the test performs. In this case, EXT4 is still about 2x faster than XFS (confirmed with repeated tests against /mnt/test1 and /mnt/test2), but it has dropped off from the earlier, smaller-volume result.

What about those earlier dd tests? Comparing XFS to EXT4 in the most recent setup:

$ dd if=/dev/zero of=/mnt/test1/testfile bs=1G count=8 oflag=direct
8589934592 bytes (8.6 GB) copied, 36.7825 s, 234 MB/s

$ dd if=/dev/zero of=/mnt/test1/testfile bs=512 count=10000 oflag=direct
5120000 bytes (5.1 MB) copied, 1.01881 s, 5.0 MB/s

$ dd if=/dev/zero of=/mnt/test1/testfile bs=1024 count=10000 oflag=direct
10240000 bytes (10 MB) copied, 1.16213 s, 8.8 MB/s

vs

$ dd if=/dev/zero of=/mnt/test2/testfile bs=1G count=8 oflag=direct
8589934592 bytes (8.6 GB) copied, 9.43267 s, 911 MB/s

$ dd if=/dev/zero of=/mnt/test2/testfile bs=512 count=10000 oflag=direct
5120000 bytes (5.1 MB) copied, 1.47903 s, 3.5 MB/s

$ dd if=/dev/zero of=/mnt/test2/testfile bs=1024 count=10000 oflag=direct
10240000 bytes (10 MB) copied, 1.5105 s, 6.8 MB/s

Inconclusive here; some results are better, some worse. Maybe the only conclusion to draw is that dd isn’t as good a testing method as I previously thought.


After all this (if you’re still reading, you’re a champ), I’ve come to realize a few things:

  • Jumping in with both feet to something as seemingly innocent as “IO performance testing” can drown you really fast.
  • Finding a repeatable testing method that reduces overall setup time can be huge. Loading multiple datasets and testing replication speeds would have taken several days; this round of performance testing took a single day.
  • Isolate what you’re trying to test or solve for, and focus on that. I’m still a bit unsure whether random read/write testing is the best method for determining the best setup.

Circling back to the open questions at the conclusion of the first post:

  • Are these 7200rpm disks simply too slow?
    • Inconclusive without testing actual MySQL replication further.
  • Am I hitting some sort of RAID controller bottleneck?
    • Again, inconclusive, but it doesn’t seem likely, especially when dropping LVM and using EXT4 produced much higher throughput.
  • Is the RAID controller misconfigured or do I have a bad disk?
    • Unlikely, and no.
  • Was the LVM stitching layer adding unnecessary overhead?
    • It adds a very small bit of latency but not enough to make a difference. For my end goal, it’s entirely unnecessary.
  • Am I badly using or (not) tuning XFS for the IO load I am generating?
    • Quite possibly. I’ve heard really good things about XFS and MySQL, so perhaps with default settings, on a volume this large, it simply is inefficient.
  • Is EXT4 a better option for what I am doing?
    • Again, possibly.
  • Would trying to use MySQL 5.6’s multi-threaded replication feature improve the catch-up time, especially if we can eliminate the single threaded nature of replication?
    • Untested, but my hypothesis is that this is less about the single-threaded nature of MySQL replication and more about needing more writes per second as a whole. If I can get past the high await/util problem, adding more threads could conceivably push me back to those limits with better results. That is to say, if await and util go down, I am limited only by the speed of a single thread rather than the disk, and more threads might push util back to 100% but with more throughput.

And lastly, new questions that this round of work brought up.

  • Are there better sysbench settings to reproduce the type of load that replication puts on a system, so I can get a better test profile before going full scale again? Perhaps testing random writes independently and focusing on those, given my earlier observations on write vs. read volume (see the sketch after this list).
  • Are the other tools (bonnie++, etc.) worth looking at for testing? Bonnie++, for example, is available on EPEL.
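For the first question, the mode I have in mind is sysbench’s random-write-only test (rndwr), which should isolate the write path; a sketch, reusing the earlier options:

$ sysbench --test=fileio --file-total-size=150G --file-test-mode=rndwr --file-extra-flags=direct --max-time=300 --max-requests=0 run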

To be continued in another post.


6 thoughts on “IO, IO, It’s Off to Testing We Go”

  1. Tibi

    Hi,

    First of all, thank you for your post; I liked it.
    I would like to ask what mount options you are using.
    If you are not already using them, you should consider the following three:

    noatime,nodiratime,nobarrier

    noatime tells the file system not to record the last-accessed time of files. It increases speed because, without it, every read of a file also triggers a metadata write to update the access time, and those extra writes take time.

    nodiratime does the same for directories.

    nobarrier: by default, XFS uses write barriers to ensure file system integrity even when power is lost to a device with write caches enabled. For devices without write caches, or with battery-backed write caches, you can disable barriers with the nobarrier option. If you have a battery-backed cache, barriers are useless.

    I would be really curious how big is the difference with these options… 🙂

    Thanks,
    Tibi

  1. jeremytinley (post author)

      Thanks for the insight. I tested these initially with no additional mount options. I am going to re-test with the ones you suggest, although some high-level googling says noatime is pretty much negligible on modern kernels and file systems.

  2. benbradley

    This is stuff I’ve been meaning to test for a long time, and the number of options for I/O benchmarking is pretty bewildering.

    Unfortunately I have no opinions/advice on your results to offer.

    But on the second sysbench runs, with the correctly sized EXT4 partition, it looks like you missed --file-extra-flags=direct. Hopefully that’s just an omission in your post and doesn’t invalidate your results 🙂

  3. Pingback: More EXT4 vs XFS IO Testing | InsideMySQL
