Downsizing to SSDs

System management can be a big deal. At Etsy, we DBAs have been feeling the pain of being spread too thin. You get a nice glibc vulnerability and have to patch and reboot hundreds of servers. There go your plans for the week.

We decided last year to embark on a 2016 mission to get better performance, easier management, and reduced power utilization through a farm-wide reduction in server count for our user-generated, sharded data.

Scoping the Problems

Taking on a massive project to replace your entire fleet of hardware is not an easy sell, especially when on the surface everything seems fine. As DBAs, we experienced challenges that were not readily visible to the rest of the organization, or even to end users. The first challenge was to quantify the reasons for doing such a large uplift.

Server Quantity

In the beginning, Etsy had one instance of mysqld per server, and one database inside it. A few years ago, we switched from having one database per server to many databases. We called this our “logical shard migration”. Splitting our data up gave us the ability to split entire databases off when we reached disk capacity, where previously individual rows were lifted out of tables. Splitting was easy – restore data, replicate, switch reads and writes to the new server, drop the old tables. Server capacity was defined by available disk space.

Through that migration, our sharded user data had grown to 120 physical servers – not Facebook or Twitter scale, but significant for two DBAs to manage. The hardware itself was decent a few years ago but was starting to show its age.

Spinning Rust

Most of the fleet were older Dell R720s with platter disks; the rest were even older HPs, also on platters. We had recently been having trouble with the Dell servers: their drives kept getting kicked out of the RAID-10 array. A firmware fix was available, but a farm-wide PERC firmware upgrade to apply it is time-consuming.

Our weekly schema change process involves taking the A side out, applying schema changes locally, putting it back, and then repeating on the B side. While A is out, B takes 100% of our traffic, and all that fresh traffic sends the buffer pool into a frenzy of evicting pages and warming up a fresh set of queries. Trying to load lots of fresh data into the buffer pool quickly spikes iowait, and that translates directly into poor page performance.

[Graphs: iowait spike on the platter disks during the A/B schema change swing]


Schema changes happen once a week, and this blip happens twice: when you swing A to B, and again from B to A. Fortunately, it doesn't last very long. Within 20 minutes, things are warmed up and everything carries on happily. Forty minutes a week of slightly degraded performance feels pretty small, but it also happens in the middle of the day.


Planning the Move

With a reasonable case for making a hardware change, our next step was to plan out what hardware to buy, how to architect it, and how to get from here to there.

Hardware Selection

For other systems, we had recently introduced a new Dell spec into our database world – the R630 with Intel SSDs. Our early impressions had been great. We theorized that we could use this new spec to replace a great deal of these older platter systems by stacking more data on each one.



We set a goal of building a new shard farm that would reduce our count from 120 servers to 30. For us, 30 seemed like a manageable number, and although it was picked somewhat at random, it actually played out very well.

As I mentioned before, the “logical shard migration” had split our data up across many databases inside a single mysqld instance. The quantity of databases was directly related to the storage available on the hardware: newer servers with 2.2TB of space received 22 logical shards, and 1.1TB servers got 10 logical shards.
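As a quick back-of-the-envelope check (the figures come from the paragraph above; the per-shard takeaway is my own inference, not a stated rule), the implied disk footprint per logical shard works out to roughly 100-110GB for both server classes:

```python
# Disk footprint per logical shard implied by each server class.
# The (disk TB, logical shard count) pairs come from the post itself.
server_classes = [(2.2, 22), (1.1, 10)]

per_shard_gb = {shards: disk_tb * 1000 / shards for disk_tb, shards in server_classes}
print(per_shard_gb)  # roughly 100 GB and 110 GB per logical shard
```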

Compacting all this data onto fewer servers could really only go one of two ways: build a system to merge logical shards together, or run multiple instances of mysqld. For the record, virtualization (lxc/kvm) was never considered, mostly because we didn't have enough tooling for setting up virtual hosts.

Merging Data

The sharded cluster runs Percona Server 5.5, so a feature like multi-source replication is not an option for us. If it were, we could set up a fresh server, replicate all the databases from many sources into a single one, and then cut over to the new server.

We could use something like MariaDB (this was pre-GA for MySQL 5.7) in the middle to act as an aggregator of multiple sources and then attach the final downstream server with the merged data.

This still left us with the problem of getting all the data combined. Transportable tablespaces in MySQL 5.6 or a straight mysqldump were the two obvious answers here; both had a long ramp-up – either the time to upgrade or the time to dump all the rows.

Multiple Instances

A much easier architecture to implement would be to simply run multiple mysqld instances on the same hardware. Supporting multiple ports on a single host was already feasible in how our ORM accesses the database: our DSN strings explicitly specify the port, so no defaults are assumed.
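As a sketch of what port-explicit access looks like (hostnames, shard names, and ports here are invented for illustration, not Etsy's actual values), the port in the DSN is what selects an instance on a shared host:

```python
# Hypothetical shard map: the port, not just the host, identifies an
# instance, so several mysqld processes can live on one physical box.
shard_map = {
    "shard001": ("db-new-01.example.com", 3306),
    "shard002": ("db-new-01.example.com", 3307),  # same host, second mysqld
    "shard003": ("db-new-01.example.com", 3308),
}

def dsn(shard: str) -> str:
    """Build an explicit host:port DSN string for a logical shard."""
    host, port = shard_map[shard]
    return f"mysql://{host}:{port}/{shard}"

print(dsn("shard002"))  # mysql://db-new-01.example.com:3307/shard002
```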

Our Chef cookbooks already supported multiple instances on a single host as well, due to work in another area. We could deploy data to /var/lib/mysql/$NAME, where $NAME was the old name of one of the 120 servers. With this, we could even keep similar naming conventions that everyone was used to for identifying what data lived where.

In the end, we selected multiple instances purely due to time and simplicity to implement.

The simple math says that with 120 old servers and 30 new servers, we could put 4 instances per host, were it not for the difference in logical shard counts. Taking that difference into account, we decided that 25 of the 30 new servers would run 3 instances of the larger, 22-logical-shard class, and the remaining 5 new servers would take 6 instances of the 10-logical-shard class. This means that per server, we're running 60 or 66 logical shards total. Disk space, CPU, and network traffic are all very consistent per logical shard.
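The per-host totals behind that layout are easy to sanity-check:

```python
# Logical shards per new host under the two layouts described above.
layouts = {
    "3 instances x 22 shards": 3 * 22,
    "6 instances x 10 shards": 6 * 10,
}
print(layouts)  # 66 and 60 logical shards per host
```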

The Plan

For a project of this size, I chose to write a document outlining much of what is written above: the hardware, the plan, scaling, capacity planning, and even the risks. This document was provided to much of the technology organization so that other engineers had an opportunity to review and evaluate the plan. No one of us is as smart as all of us, and through this I was looking for perspectives I couldn't see myself – whether due to a lack of technical expertise or bias toward my own plan.

The document was published internally on Google Docs, which allowed everyone to comment on specific sections, carry on sideline discussions, and eventually help me formulate a more complete document. For example, concerns about CPU usage from multiple mysqlds led me to document scaling expectations by number of instances per host:

[Image: scaling expectations for CPU usage by number of mysqld instances per host]

Going Live

All of the work put into discussion and documentation built confidence that this project would be a success. By the end, I had zero concerns about flipping the new servers live, and the results show a magnificent improvement.

CPU Usage

From the scaling document above, I took a linear approach, and I was pleased to find that with the combination of more cores and a newer-generation processor, we actually did much, much better than our linear predictions.

Here is a before-and-after look at CPU usage for a 24-hour period (sorry for the mixed graphing formats, but we also rolled out VividCortex to the new shard cluster for higher resolution).


These are from servers running 6 instances / 10 logical shards each. At 6x the traffic, we see user CPU jump only from 10% to 15%.
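To put the better-than-linear result in perspective (this toy model is my reconstruction of the approach, not the exact one from the scaling doc): scaling one instance's ~10% user CPU proportionally out to six instances would predict ~60%, while we observed 15%:

```python
# Naive linear CPU scaling prediction vs. the observed number.
base_user_cpu_pct = 10.0   # one instance, from the old graphs
instances = 6

predicted_pct = base_user_cpu_pct * instances  # assumes perfect proportionality
observed_pct = 15.0                            # reported above

print(predicted_pct, observed_pct)  # 60.0 predicted, 15.0 observed
```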


Memory Usage

This was the least of my concerns. On the old boxes, we allocated 87GB for a single instance running 22 logical shards, or about 4GB per shard. On the new boxes, with 22 logical shards per instance across 3 instances, we're allocating 130GB per instance, or about 6GB per logical shard. The net effect is that we get more memory per logical shard.
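The buffer pool arithmetic above can be checked directly:

```python
# GB of memory per logical shard, before and after the move.
old_per_shard = 87 / 22    # 87 GB instance serving 22 shards (old boxes)
new_per_shard = 130 / 22   # 130 GB per instance, 22 shards each (new boxes)

print(round(old_per_shard, 1), round(new_per_shard, 1))  # ~4 GB vs ~6 GB
```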


IOPS

This is my favorite area to talk (brag?) about. Going from platters to SSDs is like going from walking to taking a Star Trek-style transporter. First off, we have over 20x the IOPS to work with, and SSDs give us multiple channels to the data instead of a single actuator arm.

Deploying the first of the new servers, I expected to see a bit of slowdown as they warmed up. Watching the CPU numbers on an entirely cold server, iowait went from 0.5% to 1% and back to 0.5% in a few seconds. Normally, I would have expected 20 minutes of warm-up with iowait sitting in the 20% range.

Our weekly schema change process has improved execution time by 30% because we’re able to write data faster.

The SSDs also allow us to go denser on storage, boosting total available space from 215TB to 420TB. After having these in production for a month, the disk usage rate suggests we have thousands of days of space left.
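The per-server density change behind those totals is worth spelling out:

```python
# Total and per-server disk capacity, before and after, from the post.
old_total_tb, old_servers = 215, 120
new_total_tb, new_servers = 420, 30

print(old_total_tb / old_servers)  # ~1.8 TB per old server
print(new_total_tb / new_servers)  # 14.0 TB per new server
```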

We reduced our server count, put more data on each server and still ended up with more disk capacity overall.

Power Usage

One of the less obvious benefits of this change was the reduction in power usage. The newer Dell servers are more efficient, and not having to keep racks of 15k RPM disks spinning all the time adds further savings.

In the end, we went from about 24,000 watts of power down to 8,000 to run this farm.
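Annualized, that's a substantial amount of energy (my own arithmetic, assuming constant draw around the clock):

```python
# Energy saved per year implied by the wattage drop above.
watts_before, watts_after = 24_000, 8_000
kwh_saved_per_year = (watts_before - watts_after) / 1000 * 24 * 365

print(kwh_saved_per_year)  # 140160.0 kWh/year
```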

Query Performance

One of the concerns I had going into this project was the net impact of putting 3x or 6x the traffic onto a single server. Specifically, I was worried that concurrent requests would start fighting for time on the CPU. At our traffic levels, this turned out not to be true at all. To verify it, I captured 60 seconds of SQL traffic for comparison with pt-query-digest. The captures cover the same time period, taken while the A side was on SSD and the B side was still on platter, and show platter first, then SSD.


Average query latency – that is, how long it takes to run the average query – went from 707usec to 359usec. The 95th percentile also improved by ~300%. Even with 6x more traffic, we're responding to each query twice as fast, and our outlier tail is much shorter.
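The "twice as fast" claim follows directly from those two numbers:

```python
# Average-query speedup implied by the digest figures above.
avg_before_us, avg_after_us = 707, 359
speedup = avg_before_us / avg_after_us

print(round(speedup, 2))  # ~1.97x faster per query
```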

Speedy Backups

With a reduced server count, it's much easier to put all of the servers onto 10Gbit switches. We did have to relocate some destination backup servers, but we were able to significantly relax the throttle on our backup speed. Backups went from 9 hours to 1 hour with this change.


Wrapping Up

So far, I've yet to see a single bad thing come out of this migration. There can be a perception that distributing your data over fewer servers enlarges the surface area of impact from a single server outage – from 1/120th to 1/30th in our case. Conversely, it also means there's less hardware that can fail. I would rather spend my time making data available in the event of a failure than managing a farm designed to spread data so wide that it doesn't matter.


4 thoughts on “Downsizing to SSDs”

  1. Andy

    70,000 IOPS seems really low for a 24 disk SSD array. You can get that much from just a single SSD. Why is it that low?

    1. jeremytinley Post author

      Probably badly tested. I ran bonnie++ on them and used iostat to compare throughput. I’m not sure if there’s a better way to push the disks any faster. I could have been bottlenecked on the CPU for the process.

      The array is across 2 controllers, each RAID-6, and using LVM to stripe them together.

  2. aw

    No surprise that SSDs kick ass. Unless your business is archiving images or video there’s really no reason to have spinning platters anymore.

    1. jeremytinley Post author

      Depending upon the array size, it can be pretty expensive. I think the bulk of the cost of our servers is in the storage array. We still run things on platters simply due to the cost of replacing them now. They’ll eventually get moved over.

