In my last post, I learned in disappointing fashion that sometimes you need to start small and work your way up, rather than trying to put together a finished product. This go-round, I’ll talk about my investigation into disk IO.
In an effort to better understand the hardware I have and its capacity, I started off by just trying to get some basic info about the RAID controller and the disks. This hardware in particular is a Supermicro with an as-yet-unknown RAID controller and sixteen 4TB disks arranged in RAID 6. Finding out more disk and controller information was the first step. “hdparm -i” wasn’t able to give me much, nor was “cat /sys/class/block/sdb/device/{model,vendor}”. “dmesg” got me to a list of hard disks, Hitachi 7200rpm, with a model number I could Google. It also got me enough controller information to point to megaraid, which is LSI, which got me over to this MegaCli cheat sheet. Using “MegaCli -AdpAllInfo -aALL” actually got me a great deal of information. (In other news, I now think that Dell’s OMSA command line utility is a lot less terrible after trying to figure out MegaCli.)
With this information, I can finally start answering some questions. First, the RAID controller has a 512MB battery-backed cache, so I know it should offer decent write performance. I can also see that all of the drives are reporting good SMART health and none are in prefail or fail states, which could otherwise lead to degraded performance.
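As a rough sketch (the binary path and the exact output strings to grep for vary by MegaCli version, so treat this as an outline rather than a copy-paste recipe), queries along these lines will pull the cache and per-disk details:

# controller summary, including cache memory size and BBU status
/opt/MegaRAID/MegaCli/MegaCli64 -AdpAllInfo -aALL | grep -iE 'memory size|bbu'
# per-disk state: look for "Online" firmware state and zero predictive failure counts
/opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL | grep -iE 'firmware state|predictive failure'
# virtual drive layout and cache policy (write back vs. write through)
/opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -Lall -aALL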
As a reminder from the last post, there is a single RAID controller with two 60TB RAID 6 virtual drives, combined with LVM into approximately 120TB and formatted with XFS using default options. If I’m going to start making changes to file systems or remove LVM, I really need to get a benchmark of things before I make changes.
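For context, the existing layout was put together along these lines; the device, volume group, and logical volume names here are placeholders, not the actual ones on the box:

# combine the two ~60TB virtual drives into one ~120TB logical volume
pvcreate /dev/sda4 /dev/sdb          # placeholder device names for the two virtual drives
vgcreate vg_data /dev/sda4 /dev/sdb
lvcreate -l 100%FREE -n lv_data vg_data
mkfs.xfs /dev/vg_data/lv_data        # default XFS options
mount /dev/vg_data/lv_data /mnt/data # placeholder mount point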
There are a lot of ways to do IO performance testing under Linux. Some common methods use hdparm, sysbench or dd. There are also other packages out there like fio, bonnie++ and iozone. To keep things simple (i.e., I didn’t want to compile or put any unfamiliar packages onto the system), I decided to stick with dd and sysbench.
As my workload seems to be primarily writes, I wanted to start with using dd to do write testing. This is fairly straightforward, something like “dd if=/dev/zero of=/path/to/mountpoint/testfile bs=1G count=1 oflag=direct”. The oflag is important because we want to bypass the Linux page cache and go directly to disk, just as MySQL does when innodb_flush_method=O_DIRECT is set. For the first test, I wanted to see how sequentially writing 1GB of data would perform.
dd if=/dev/zero of=/mnt/test1/testfile bs=1G count=1 oflag=direct
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 1.05494 s, 1.0 GB/s
Well that wasn’t very helpful. What if we try one 8GB file?
dd if=/dev/zero of=/mnt/test1/testfile bs=8G count=1 oflag=direct
0+1 records in
0+1 records out
2147479552 bytes (2.1 GB) copied, 4.05025 s, 530 MB/s
That’s… worse. Note, too, that dd silently truncated the single 8GB block to about 2.1GB (hence the “0+1 records”), since Linux caps a single write at just under 2GiB. What about writing 8GB as eight 1GB blocks? That should give us a consistent, longer-running sequential write:
dd if=/dev/zero of=/mnt/test1/testfile bs=1G count=8 oflag=direct
8+0 records in
8+0 records out
8589934592 bytes (8.6 GB) copied, 7.34673 s, 1.2 GB/s
There we go. I ran this a few more times to see if it would be fairly consistent and it was, generally sticking around 1.2 to 1.3 GB/s.
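If you want to repeat the run a few times without babysitting it, a quick loop like this does the trick (dd reports to stderr, so the redirect keeps just the throughput line):

for i in 1 2 3 4 5; do
  dd if=/dev/zero of=/mnt/test1/testfile bs=1G count=8 oflag=direct 2>&1 | tail -1
done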
Now I want to see what these numbers look like with small-block writes, which I hoped would be a rough stand-in for random IO:
dd if=/dev/zero of=/mnt/test1/testfile bs=512 count=1000 oflag=direct
1000+0 records in
1000+0 records out
512000 bytes (512 kB) copied, 0.162107 s, 3.2 MB/s
Ouch, that’s terrible but also very repeatable, 2.8 to 3.2 MB/s.
One more try with a larger block size:
dd if=/dev/zero of=/mnt/test1/testfile bs=1024 count=10000 oflag=direct
10000+0 records in
10000+0 records out
10240000 bytes (10 MB) copied, 1.94483 s, 5.3 MB/s
Very consistent, between 5.2 and 5.4 MB/s.
I have never used sysbench before, but I have heard it mentioned in the context of MySQL performance testing. Percona has written several articles about it, including this really old article about IO performance for MySQL. In the end, I settled on this article, which seemed to fit what I was looking for: writes against a lot of large files. To test, you prepare the files and then run the test. The options give me 150GB of files and test random reads and writes for 5 minutes. I also want a single thread (because that’s how replication applies writes) and I want to be sure we’re using direct IO.
$ sysbench --test=fileio --file-total-size=150G prepare
sysbench 0.4.12:  multi-threaded system evaluation benchmark

128 files, 1228800Kb each, 153600Mb total
Creating files for the test...

$ sysbench --test=fileio --file-total-size=150G --file-test-mode=rndrw --init-rng=on --file-extra-flags=direct --max-time=300 --max-requests=0 run
sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1
Initializing random number generator from timer.

Extra file open flags: 16384
128 files, 1.1719Gb each
150Gb total file size
Block size 16Kb
Number of random requests for random IO: 0
Read/Write ratio for combined random IO test: 1.50
Periodic FSYNC enabled, calling fsync() each 100 requests.
Calling fsync() at the end of test, Enabled.
Using synchronous I/O mode
Doing random r/w test
Threads started!
Time limit exceeded, exiting...
Done.

Operations performed:  21033 Read, 14022 Write, 44800 Other = 79855 Total
Read 328.64Mb  Written 219.09Mb  Total transferred 547.73Mb  (1.8257Mb/sec)
  116.85 Requests/sec executed

Test execution summary:
    total time:                          300.0085s
    total number of events:              35055
    total time taken by event execution: 299.2567
    per-request statistics:
         min:                                  0.09ms
         avg:                                  8.54ms
         max:                                206.39ms
         approx.  95 percentile:              18.74ms

Threads fairness:
    events (avg/stddev):           35055.0000/0.00
    execution time (avg/stddev):   299.2567/0.00
This is a gratuitous amount of information and after trying to digest it all, I seemed to care most about the following data:
- Total requests per second: 116.85
- Combined read/write throughput: 1.8257Mb/sec
- Average and approximate 95th percentile per-request latency: 8.54ms and 18.74ms
Now that I have some data, let’s tear down LVM (lvremove, vgremove, pvremove). For my goal of running as many instances as possible concurrently, I don’t actually need 120TB of contiguous space. I can partition this up however I prefer, as long as each partition can hold a complete copy of the data with room for growth. For this test, I’ll format one of the two virtual drives with XFS and the other with EXT4 and see what that gets us.
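Roughly, the teardown and reformat looked like this (the test mount points match what’s used below; the LVM and device names are placeholders):

umount /mnt/data                 # unmount the old ~120TB volume
lvremove /dev/vg_data/lv_data    # placeholder LVM names
vgremove vg_data
pvremove /dev/sda4 /dev/sdb
mkfs.xfs -f /dev/sda4            # first virtual drive: XFS, default options
mount /dev/sda4 /mnt/test1
mkfs.ext4 /dev/sdb               # second virtual drive: EXT4 -- this is the step that fails below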
mkfs.ext4: Size of device /dev/sdb too big to be expressed in 32 bits using a blocksize of 4096.
Perhaps I simply can’t create a file system this large with EXT4. Using fdisk, I create a new partition:
Disk /dev/sdb: 64000.0 GB, 63999995543552 bytes
255 heads, 63 sectors/track, 7780889 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0xcc50f3d5

   Device Boot      Start         End      Blocks   Id  System

Command (m for help): n
Command action
   e   extended
   p   primary partition (1-4)
p
Partition number (1-4): 1
First cylinder (1-7780889, default 1):
Using default value 1
Last cylinder, +cylinders or +size{K,M,G} (1-267349, default 267349): 32000G

Command (m for help): p

Disk /dev/sdb: 64000.0 GB, 63999995543552 bytes
255 heads, 63 sectors/track, 7780889 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0xcc50f3d5

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1               1       32000   257039968+  83  Linux

Command (m for help): w
The partition table has been altered!

Calling ioctl() to re-read partition table.
Syncing disks.
IMPORTANT NOTE: at the time, I didn’t realize two things:
- fdisk doesn’t support GPT, which is plainly stated at the top of fdisk’s output (“WARNING: GPT (GUID Partition Table) detected on ‘/dev/sdb’! The util fdisk doesn’t support GPT. Use GNU Parted”), but I failed to notice it
- When specifying a partition size, the value needs to be prefixed with +, or fdisk interprets the number as a cylinder count
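For the record, the two fixes would have looked roughly like this (the 32TB target size here is just the number I typed above):

# in fdisk, a size needs the leading +; without it the number is taken as a cylinder,
# which is why the partition above came out at a few hundred GB instead of 32TB:
#   Last cylinder, +cylinders or +size{K,M,G}: +32000G
# but since the disk carries a GPT label, parted is the right tool anyway:
#   parted /dev/sdb mkpart primary 0TB 32TB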
At this point, I have two mounts, /mnt/test1, XFS, and /mnt/test2, EXT4. Rerunning the same test as before and comparing relevant results (XFS first, then EXT4):
Operations performed:  21401 Read, 14267 Write, 45568 Other = 81236 Total
Read 334.39Mb  Written 222.92Mb  Total transferred 557.31Mb  (1.8577Mb/sec)
  118.89 Requests/sec executed

Test execution summary:
    per-request statistics:
         min:                                  0.01ms
         avg:                                  8.32ms
         max:                                225.19ms
         approx.  95 percentile:              18.97ms

Operations performed:  54195 Read, 36130 Write, 115584 Other = 205909 Total
Read 846.8Mb  Written 564.53Mb  Total transferred 1.3783Gb  (4.7042Mb/sec)
  301.07 Requests/sec executed

Test execution summary:
    per-request statistics:
         min:                                  0.00ms
         avg:                                  3.22ms
         max:                                233.93ms
         approx.  95 percentile:              12.45ms
Wow. At first glance, EXT4 is almost 3x faster than XFS, and LVM on top of XFS was adding minimal overhead. This seemed like it might be a fluke, so I ran several more tests, and they all showed similar results. Then, by chance, I ran df:
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda4              59T  151G   59T   1% /mnt/test1
/dev/sdb1             242G  151G   79G  66% /mnt/test2
This is where I noticed the aforementioned IMPORTANT NOTE: my “60TB” EXT4 partition was actually only 242GB.
It turns out that CentOS 6 ships an older version of e2fsprogs that won’t let me create an EXT4 file system with more than 2^32 blocks (16TB at a 4k block size). Thanks to this article explaining the problem, I opted to grab the source and build it locally (but not install it), then run the freshly built command. I also needed to edit the partition table using parted.
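The build itself is the standard configure/make dance, run in place without make install; a sketch, with the source tarball downloaded separately (1.42.12 matches the mke2fs banner below):

tar xzf e2fsprogs-1.42.12.tar.gz
cd e2fsprogs-1.42.12
./configure
make
# run the freshly built mke2fs straight out of the build tree:
./misc/mke2fs -t ext4 /dev/sdb1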
$ parted /dev/sdb
(parted) rm 1
(parted) mkpart primary 0.00TB 64.00TB
(parted) print
Model: SMC SMC2108 (scsi)
Disk /dev/sdb: 64.0TB
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number  Start   End     Size    File system  Name     Flags
 1      0.00TB  64.0TB  64.0TB               primary

$ ./misc/mke2fs -t ext4 /dev/sdb1
mke2fs 1.42.12 (29-Aug-2014)
Warning: the fs_type huge is not defined in mke2fs.conf
Creating filesystem with 15624998400 4k blocks and 3906256896 inodes
Filesystem UUID: 47dc165f-0bfb-45b0-ba36-ba23e2807cc7
Superblock backups stored on blocks:
    32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
    4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
    102400000, 214990848, 512000000, 550731776, 644972544, 1934917632,
    2560000000, 3855122432, 5804752896, 12800000000

Allocating group tables: done
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

(omitted mounting)

$ df -h /mnt/test{1,2}
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda4              59T  151G   59T   1% /mnt/test1
/dev/sdb1              58T  129M   55T   1% /mnt/test2
Now that I have a properly sized EXT4 mounted on /mnt/test2, let’s redo our tests:
$ sysbench --test=fileio --file-total-size=150G prepare
sysbench 0.4.12:  multi-threaded system evaluation benchmark

128 files, 1228800Kb each, 153600Mb total
Creating files for the test...

$ sysbench --test=fileio --file-total-size=150G --file-test-mode=rndrw --init-rng=on --max-time=300 --max-requests=0 run
sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1
Initializing random number generator from timer.

Extra file open flags: 0
128 files, 1.1719Gb each
150Gb total file size
Block size 16Kb
Number of random requests for random IO: 0
Read/Write ratio for combined random IO test: 1.50
Periodic FSYNC enabled, calling fsync() each 100 requests.
Calling fsync() at the end of test, Enabled.
Using synchronous I/O mode
Doing random r/w test
Threads started!
Time limit exceeded, exiting...
Done.

Operations performed:  42119 Read, 28079 Write, 89728 Other = 159926 Total
Read 658.11Mb  Written 438.73Mb  Total transferred 1.0711Gb  (3.6559Mb/sec)
  233.98 Requests/sec executed

Test execution summary:
    total time:                          300.0171s
    total number of events:              70198
    total time taken by event execution: 293.5396
    per-request statistics:
         min:                                  0.00ms
         avg:                                  4.18ms
         max:                                763.07ms
         approx.  95 percentile:              14.41ms

Threads fairness:
    events (avg/stddev):           70198.0000/0.00
    execution time (avg/stddev):   293.5396/0.00
Interesting. The size of the volume itself clearly has an impact on how fast the test performs. In this case, EXT4 is still about 2x faster than XFS (confirmed with repeated tests against /mnt/test1 and /mnt/test2), but throughput dropped compared to the earlier run against the undersized partition.
What about those earlier dd tests? Comparing XFS to EXT4 in the most recent setup:
$ dd if=/dev/zero of=/mnt/test1/testfile bs=1G count=8 oflag=direct
8589934592 bytes (8.6 GB) copied, 36.7825 s, 234 MB/s
$ dd if=/dev/zero of=/mnt/test1/testfile bs=512 count=10000 oflag=direct
5120000 bytes (5.1 MB) copied, 1.01881 s, 5.0 MB/s
$ dd if=/dev/zero of=/mnt/test1/testfile bs=1024 count=10000 oflag=direct
10240000 bytes (10 MB) copied, 1.16213 s, 8.8 MB/s

vs

$ dd if=/dev/zero of=/mnt/test2/testfile bs=1G count=8 oflag=direct
8589934592 bytes (8.6 GB) copied, 9.43267 s, 911 MB/s
$ dd if=/dev/zero of=/mnt/test2/testfile bs=512 count=10000 oflag=direct
5120000 bytes (5.1 MB) copied, 1.47903 s, 3.5 MB/s
$ dd if=/dev/zero of=/mnt/test2/testfile bs=1024 count=10000 oflag=direct
10240000 bytes (10 MB) copied, 1.5105 s, 6.8 MB/s
Inconclusive here. Some results are better, some worse. Maybe the only conclusion to be drawn is that dd isn’t as good a testing method as I previously thought.
After all this (and if you’re still reading, you’re a champ), I’ve come to realize a few things:
- Jumping with both feet into something as seemingly innocent as “IO Performance Testing” can make you drown really fast.
- Finding a repeatable testing method that reduces overall setup time is huge. Loading multiple datasets and testing replication speeds would have taken multiple days, whereas this round of performance testing took a single day.
- Isolate what you’re trying to test or solve for, and focus on that. I’m still a bit unsure whether random read/write testing is the best way to determine what the best setup is going to be.
Circling back to the open questions at the conclusion of the first post:
- Are these 7200rpm disks simply too slow?
- Inconclusive without testing actual MySQL replication further.
- Am I hitting some sort of RAID controller bottleneck?
- Again, inconclusive, but it doesn’t seem likely, especially since dropping LVM and using EXT4 produced much higher throughput.
- Is the RAID controller misconfigured or do I have a bad disk?
- Unlikely, and no.
- Was the LVM stitching layer adding unnecessary overhead?
- It adds a very small bit of latency but not enough to make a difference. For my end goal, it’s entirely unnecessary.
- Am I badly using or (not) tuning XFS for the IO load I am generating?
- Quite possibly. I’ve heard really good things about XFS and MySQL, so perhaps with default settings on a volume this large it is simply inefficient.
- Is EXT4 a better option for what I am doing?
- Again, possibly.
- Would trying to use MySQL 5.6’s multi-threaded replication feature improve the catch-up time, especially if we can eliminate the single threaded nature of replication?
- Untested, but my hypothesis is that this is less about the single-threaded nature of MySQL replication and more about needing to do more writes per second overall. If I can get past the high await/util problem, then adding more threads could conceivably push me back to those limits with better results. That is to say, if await and ioutil go down so that I am limited only by the speed of a single thread and not the disks, then more threads might push ioutil back to 100%, but with more total throughput. A rough way to check that with sysbench follows this list.
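As a sketch of that check (thread counts are arbitrary examples, and it assumes the 150GB of files from an earlier prepare step are still in place), scaling sysbench’s thread count should show whether throughput rises once more than one thread is issuing writes:

for t in 1 2 4 8; do
  sysbench --test=fileio --file-total-size=150G --file-test-mode=rndrw \
    --file-extra-flags=direct --init-rng=on --max-time=300 --max-requests=0 \
    --num-threads=$t run
done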
And lastly, the new questions that this round of work brought up:
- Are there better sysbench settings to reproduce the type of load that replication puts on a system, so that I can get a better test profile before going full scale again? Perhaps testing random writes independently and focusing on that, given my earlier observations about write vs. read volume (see the sketch after this list).
- Are the other tools (bonnie++, etc.) worth looking at for testing? Bonnie++, for example, is available on EPEL.
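For the first of those questions, the sketch I have in mind is simply switching the sysbench test mode to random writes only (rndwr), keeping everything else the same:

sysbench --test=fileio --file-total-size=150G prepare
sysbench --test=fileio --file-total-size=150G --file-test-mode=rndwr \
  --file-extra-flags=direct --init-rng=on --max-time=300 --max-requests=0 run
sysbench --test=fileio --file-total-size=150G cleanup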
To be continued in another post.
Hi,
First of all, thank you for your post, I liked it.
I would like to ask what kind of mount options you use?
If you are not already using them, you should consider the following three:
noatime,nodiratime,nobarrier
noatime tells the file system not to record the last-accessed time of files. It increases speed because, without it, every read of a file also triggers a small metadata write to update the access time, and that extra write costs time whether the file is being read or written.
nodiratime does the same for directories.
nobarrier: by default, XFS uses write barriers to ensure file system integrity even when power is lost to a device with write caches enabled. For devices without write caches, or with battery-backed write caches, you can disable the barriers with the nobarrier option. If we have a battery-backed cache, the barrier is useless.
I would be really curious how big is the difference with these options… 🙂
Thanks,
Tibi
Thanks for the insight. I tested these initially with no additional mount options. I am going to re-test with the ones you suggest, although high level googling says noatime is pretty much negligible on modern kernels and file systems.
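For reference, the re-test would be something along these lines (untested as of this writing; ext4 also accepts barrier=0 as an equivalent of nobarrier):

mount -o remount,noatime,nodiratime,nobarrier /mnt/test1   # XFS
mount -o remount,noatime,nodiratime,nobarrier /mnt/test2   # EXT4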
This is stuff I’ve been meaning to test for a long time, and the number of options for i/o benchmarking is pretty bewildering.
Unfortunately I have no opinions/advice on your results to offer.
But on the second sysbench runs, with the correctly sized EXT4 partition, it looks like you missed --file-extra-flags=direct. Hopefully that’s just an omission in your post and doesn’t invalidate your results 🙂
Great catch! I’m going to re-test regardless, so I’ll be extra sure I’m consistent.
Pingback: More EXT4 vs XFS IO Testing | InsideMySQL
Ran into the EXT 4 issue last week. Google search pulled up your blog. Like it or not, you’re still helping your old team. 🙂