System management can be a big deal. At Etsy, we DBAs have been feeling the pain of getting spread too thin. You get a nice glibc vulnerability and have to patch and reboot hundreds of servers. There goes your plans for the week.
We decided last year to embark on a 2016 mission to get better performance, easier management and reduced power utilization through a farm reduction in server count for our user generated, sharded data.
Scoping the Problems
Taking on a massive project to replace your entire fleet of hardware is not an easy sell, especially when on the surface it seems like everything is fine. As DBAs, we experienced challenges that were not readily visible to the rest of the organization or even to end users. The first challenge was to quantify what are the reasons for doing such a large uplift.
In the beginning, Etsy had one instance of mysqld per server, and one database inside that. A few years ago, we switched from having one database per server, to many databases. We called this our “logical shard migration”. Splitting our data up gave us the ability to split entire databases off when we reached disk capacity where previously individual rows were lifted out of tables. Splitting was easy – restore data, replicate, switch reads and writes to new server, drop old tables. Server capacity was defined by disk space available.
Through that migration, our sharded user data had grown to 120 physical servers – it’s not Facebook or Twitter level, but started to become significant for two DBAs to manage. The hardware itself was decent for a few years ago but starting to show age.
Most of the fleet were older Dell R720s with platter disks and the rest were even older HPs with the same. We recently had been having some trouble with the Dell servers as their drives were getting kicked out of the RAID-10 array. There was a firmware fix available but having to do a farm-wide PERC firmware upgrade to fix it is time consuming.
Our schema change process that we run weekly involves taking side A out, applying local schema changes, replacing it and then repeating on B side. Because of this, when A comes out, B takes 100% of our traffic and all that fresh traffic causes the buffer pool to go nuts evicting and warming up a fresh set of queries. Quickly trying to load lots of fresh data into the buffer pool spikes iowait and this translates directly to poor page performance.
Schema changes happen once a week and this blip happens twice, when you swing A to B and B to A. Fortunately, it doesn’t last for very long. Within 20 minutes, things are warmed up and everything continues about happily. 40 minutes a week of slightly degraded performance feels pretty small but it’s also done during the middle of the day.
Planning the Move
With a reasonable case for making a hardware change, our next step was to plan out what hardware, how would we architect and how do we get from here to there.
For other systems, we had recently introduced a new Dell spec into our database world – the R630 with Intel SSDs. Our early impressions had been great. We theorized that we could use this new spec to replace a great deal of these older platter systems by stacking more data on each one.
We set a goal on building a new shard farm that would reduce our count from 120 to 30. For us, 30 seems like a manageable number and although it was picked somewhat at random, it actually played out very well.
As I mentioned before, the “logical shard migration” had split our data up across many databases inside a single mysqld instance. The quantity of databases was directly related to storage available on the hardware. Newer servers had 2.2TB of space and they received 22 logical shards and 1.1TB space servers got 10 logical shards.
Compacting all this data onto fewer servers could only really go one of two ways. We could build a system to merge logical shards together or we could run multiple instances of mysqld. For the record, using any sort of virtualization (lxc/kvm) was not considered mostly due to not having enough tooling around setting up virtual hosts.
The sharded cluster runs Percona Server 5.5, so a feature like multi-source replication is not an option for us. Were it, we could setup a fresh server, replicate all databases from many sources into a single and then do a cutover to the new server.
We could use something like MariaDB (this was pre-GA for MySQL 5.7) in the middle to act as an aggregator of multiple sources and then attach the final downstream server with the merged data.
This still left us with the problem of getting all the data combined together. Transportable table spaces in MySQL 5.6 or a straight mysqldump were the two obvious answers here, both had either a long ramp up for time to upgrade or time to dump all the rows.
A much easier to implement architecture would be to just run multiple mysqld instances on the same hardware. Supporting multiple ports for a single host was already feasible in how our ORM accesses the database. Our DSN strings explicitly specify port so no defaults are assumed.
Our Chef cookbooks already supported multiple instances on a single host as well, due to work in another area. We could deploy data to
/var/lib/mysql/$NAME, where $NAME was the old name of one of the 120 servers. With this, we could even keep similar naming conventions that everyone was used to for identifying what data lived where.
In the end, we selected multiple instances purely due to time and simplicity to implement.
The simple math says with 120 old servers and 30 new servers, we could put 4 instances per host except for the logical shard count difference. Taking the difference into account, we decided that 10 of the 15 would run the larger, 22 logical shard servers and the remaining 5 new servers would take 6 instances. This means that per server, we’re running 60 or 66 logical shards total. Disk space, CPU, and network traffic is all very consistent per logical shard.
For a project of this size, I chose to write up a document outlining much of what is written above: the hardware, the plan, scaling, capacity planning and even risks. This document was provided to much of the technology organization so other engineers have an opportunity to review and evaluate the plan. No one of us is as smart as all of us and through this, I’m looking for perspectives that I can’t see – either by lack of technical expertise or bias on my own plan.
The document was published internally on Google Docs which allowed everyone to comment on specific sections and carry on sideline discussions and eventually help me formulate a more complete document. For example, concerns about the CPU usage from multiple mysqlds led met to document out scaling expectations by number of instances per host:
All of the work put into discussion and documentation builds confidence that this project will be a success. I had zero concern at the end of flipping the new servers live and the results show a magnificent improvement.
From the scaling document above, I took a linear approach and I was pleased to find that with the combination of more cores and newer generation processor, we actually did much, much better than our linear predictions.
A before and after look of CPU usage for a 24 hour period (sorry for the mixed graphing formats but we also rolled out VividCortex to the new shard cluster for higher resolution).
These are from servers running 6 instances / 10 logical shards each. At 6x the traffic, we see a jump from 10% use to 15% for user CPU.
This was the least of my concerns. We allocated 87GB for a single instance running 22 logical shards or 4GB per. With 22 logical shards across 3 instances on the new boxes, we’re allocating 130GB per instance or 6GB per logical shard. The net effect is that we get more memory per instance.
This is my favorite area to talk (brag?) about. Platters to SSDs is like going from walking to taking a Star Trek style transporter. First off, we have over 20x the IOPS to work with and SSDs are giving us multiple channels instead of a single actuator arm.
Deploying the first of the new servers, I was expecting to see a slight bit of slowdown as the servers warmed up. Watching CPU numbers, putting in an entirely cold server, the iowait went from 0.5%, to 1%, and back to 0.5% in a few seconds. Normally, I would have expected 20 minutes of warm up and iowait to sit in the 20% range.
Our weekly schema change process has improved execution time by 30% because we’re able to write data faster.
The SSDs end up allowing us to go more dense on storage too, boosting total available space from 215TB to 420TB of space. After having these in production for a month, the disk usage rate suggests we have thousands of days of space left.
We reduced our server count, put more data on each server and still ended up with more disk capacity overall.
One of the less obvious benefits of this change was in the reduction of power usage. The newer Dell servers are more efficient and not having to keep racks of 15k rpm disks spinning all the time leads to more power savings.
In the end, we went from about 24,000 watts of power down to 8,000 to run this farm.
One of the concerns I had going into this project was the net impact of putting 3x or 6x traffic onto a single server. Specifically, I was worried that the amount of concurrent requests would actually start fighting for time to run on the CPU. This was completely not true at our traffic levels. To verify this, I captured 60 seconds of SQL traffic to do a comparison using pt-query-digest. The captures were for the same time period, and captured traffic is showing platter first, then SSD. This was captured when we had A side on SSD and B side on platter still.
Average query latency – that is how long it takes to run the average query – went from 707usec to 359usec. 95th percentile also improved by ~300%. Even with 6x more traffic, we’re responding to each query twice as fast and our outlier tail is much shorter.
With a reduced server count, it’s much easier to put all of them onto 10gbit switches. We did have to do some relocation of destination backup servers, but we were able to significantly relax our throttle on our backup speed. Backups went from 9 hours to 1 hour with this change.
So far, I’ve yet to see a single bad thing that’s come out of this migration. There can be the perception that distributing your data over fewer servers means the surface area of impact to a single server outage is larger – 1/120th to 1/30th in our case. Conversely it can mean there’s less hardware that can fail. I would rather spend my time investing in making data available in the event of a failure than managing a farm designed to spread it so wide that it doesn’t matter.