Even If You Fail, You Can Still Learn

As many learning experiences do, this one also starts out “So I was working on a project at work and…”.  In this case, the end result is to try to run as many concurrent copies of MySQL on a single server as possible, maintaining real time replication each running differing data sets. To help with this, I sent out to do this on a server with 36 7200rpm 4GB SATA disks, giving me roughly 120TB of available space to work with.

This isn’t an abnormal type of machine for us. Sometimes you simply need a ton of disk space. There is a quirk with this particular machine that I’ve been told: the RAID controller has some issues with addressing very large virtual disks and I should create 2 60TB volumes and stitch them together with LVM. Easy enough: pvcreate both volumes, create a volume group and a logical volume out of it and viola: ~116TB of storage on a single mount point, with xfs as the file system (default options).

Overzealously, it began by copying 20 different database backups over and preparing them to be replicas. I say overzealous, because that was an immediate bust. Limited write IO on the destination server because it’s not a production class machine prevented me from even getting the data copied over in any timely fashion.

Not to be swayed, I reduced my test to 5 concurrent and that was definitely more successful, at least in terms of getting the data over. For the copy, the old netcat trick on one end with pv piped through xbstream worked really well. The backups are using Percona Xtrabackup, so a quick bit of bash to set up the listener (netcat), && innobackupex –decompress && innobackupex –apply-log would let me spend my weekend doing family things and not work things (Bonus: use mail to page yourself when a long running task is done).

Once done, I hacked up the mysql init script from Percona to look at its own filename to determine what data directory to use. I also moved the my.cnf file into the data directory so that per-instance settings would be easy to maintain, and ensured we only look at that one by passing –defaults-file to mysqld_safe. The trick here is to make symbolic links from the mysql init script to the name you want to use (ex. ln -s mysql mysql-shard1). Now, I run “mysql-shard1 start” as an init script (anywhere really, its just bash), and it will look at /var/lib/mysql/shard1/ for datadir, which has a copy of shard1’s data from my backup. Lastly, each my.cnf file had to be crafted a a bit to ensure that all the paths were unique, and that they weren’t running on overlapping ports.

With each mysqld successfully running, I was able to configure replication and each copy was happily pulling down data and sticking it in local relay logs. I added all of these to my local install of innotop and brought up dashboard view, which lets me see the critical bits, especially replication, in basically real time. After a little bit though, it was apparent that replication wasn’t catching up — it was falling behind. Still too overzealous, it seemed.

Given the earlier misstep with shipping 20 copies of data over at once, I suspected it was an IO capacity problem, after all, this class machine isn’t designed for the heavily lifting of production. My go-to tools for looking at this are sar and iostat. The former would show that I was running between 7-10% iowait, that is, the time the kernel is waiting on the IO subsystem to respond, while the latter would show that the disk was very busy with writes and always 100% util and await of 30-50ms (await is the average time requests take to be served).

Knowing that when you restore from a backup, catchup time involves a heavy amount of writes to backfill all the relay logs while the SQL thread is busy running the statements as fast as it can.

Looking over the my.cnf configuration file, there were a few options I could play with. First, increase the InnoDB buffer pool from 2G to 40G because databases love memory. Next, turn off log-slave-updates and log-bin, since we really don’t need to keep any local binary logs nor a copy of everything this replica applies to its own copy. Insert buffer (innodb_ibuf_max_size) was capped at 2G, so I moved that up to 10G (by default it’s half of your buffer pool).

I shut down all the existing MySQL instances and started just a single one up. After about an hour, it was still slipping behind, but not nearly as fast. This really surprised me. I was really sure that this box could sustain a single instance based on our write volume. This got me wondering about how well the IO subsystem was performing which spawned a series of questions:

  • Are these 7200rpm disks simply too slow?
  • Am I hitting some sort of RAID controller bottleneck?
  • Is the RAID controller misconfigured or do I have a bad disk?
  • Was the LVM stitching layer adding unnecessary overhead?
  • Am I badly using or (not) tuning XFS for the IO load I am generating?
  • Is EXT4 a better option for what I am doing?
  • Would trying to use MySQL 5.6’s multi-threaded replication feature improve the catch-up time, especially if we can eliminate the single threaded nature of replication?

I’ll continue this in a second post about investigating and benchmarking IO.


2 thoughts on “Even If You Fail, You Can Still Learn

  1. Eric Ayers

    Typo? These numbers don’t add up. “16 7200rpm 4GB SATA disks, giving me roughly 120TB of available space to work with.”


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s