Source of Truth or Source of Madness?

This year at Etsy, we spun up a “Database Working Group” that talks about all things data. It’s made up of members from many teams: DBA, core development, development tools and data engineering (Hadoop/Vertica). At our last two meetings, we started talking about how many “sources of information” we have in our environment. I hesitate to call them “sources of truth” because in many cases we just report information to them rather than act on data from them. We spent a session whiteboarding all of these sources and drawing the relationships between them. It was a bit overwhelming to actually visualize the madness.

A few examples:

  • We use Chef for configuration management, and Chef knows about all of our database servers. It made sense for us to build out our monitoring to generate Nagios configuration from that Chef data. When we deploy our configuration, we iterate over all of our hosts to generate host definitions and then, based on a hostname regex pattern or Chef role, determine which service definitions to attach. If a database server is built with Chef, we know it will get monitoring this way.
  • When a database backup or a restore test is completed, we log them to a utility database and can access the information with HTTP. This gives us a simple way to monitor for backup freshness, size and trend backup size over time.
  • Our web servers know what DSNs are available through a PHP array. When we add new servers, we update the array manually, commit to git and push it out using a special deployinator.

Starting Small

At our second meeting, we started to look at one specific and seemingly simple example:

What would it take to ensure that we can generate a DDL schema (mysqldump -d) of all datasets in our environment in a consistent, complete manner?

I’ll cheat a bit and say we already mostly do this. A simple bash script has a list of servers, and we do a (more complicated version of) mysqldump -d -h $FQDN1 || mysqldump -d -h $FQDN2. This dumps to a directory, and at the completion of the dumps, we upload these to a local git repository.
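A minimal sketch of what such a script looks like — the hostnames, paths and dataset names here are illustrative placeholders, not our real ones:

```shell
#!/bin/bash
# Hard-coded list of datasets, each with a primary and fallback host.
# (Placeholder values; the real script is more complicated.)
DATASETS="
shard1 db1001.example.com db1002.example.com
shard2 db2001.example.com db2002.example.com
"

OUTDIR=/var/schemas

while read -r dataset fqdn1 fqdn2; do
    [ -z "$dataset" ] && continue
    # Try the first host; fall back to the second if the dump fails.
    mysqldump -d -h "$fqdn1" --all-databases > "$OUTDIR/$dataset.sql" \
        || mysqldump -d -h "$fqdn2" --all-databases > "$OUTDIR/$dataset.sql"
done <<< "$DATASETS"

# Commit the fresh dumps to the schema repository.
cd "$OUTDIR" && git add -A && git commit -m "schema dump $(date +%F)" && git push
```

The `||` fallback is the whole failover strategy — which is exactly why a dead $FQDN pair silently produces an empty dump.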

This leads to two infrequent but annoying problems:

  1. If we add or remove a dataset, the script needs to be manually updated to match.
  2. If we rename or replace a server, $FQDN1 or $FQDN2 become invalid. If both go away, we lose the schema dump, and write and git push a zero-byte file.

We need magic. This magic needs to be self-updating, easily queried and distributed. If I want to build this magic, I need something like:

  1. Fetch a list of all datasets in the environment
  2. For each dataset, fetch a list of healthy servers to do the mysqldump against.
  3. Run the mysqldump command.
  4. Upload to git.

If the list of datasets is dynamic and always up to date, and if the list of healthy servers is the same way, the simple process can be magic.

Enter Service Discovery

It’s been a few weeks since the Percona Live 2016 conference wrapped up and the bar track is always my favorite. Getting an opportunity to meet new people and learn about their infrastructure can be really exciting. One of the many topics I chatted with many people about was service discovery and Consul came up as being evaluated by several companies.

Consul aims to be a service discovery platform, and I think its intent is more in the microservices world. Etsy is very monolithic, but I don’t think that makes it all that different for my use-case. It provides an HTTP interface for getting service information, health and a key-value store.

Is Consul the Right Choice?

As I began to do a proof of concept, some coworkers saw commits to a new Consul cookbook and asked me “Is Consul the right choice here?” and “Why not Zookeeper?” These questions come from the good intention of not wanting too many tools doing the same thing. Etsy already uses Zookeeper for Kafka, though Kafka implements its own service discovery and uses ZK as a data store.

That being said, Zookeeper is not service discovery – it is a key-value store. Helix, however, is a service discovery framework built on the ZK data store, so we might get some wins by re-using our existing ZK infrastructure. In either case – Consul, Helix or stitching something together ourselves – we’re going to have to add a tool.

For my purposes, Consul was selected to prove the value of having service discovery. If, in the end, Consul is not the right choice for us, the process here at least helps identify the right questions to ask about what technology we need.

Getting It Running

How can Consul solve my question above? Well, first I have to get it running. For brevity, I’ll gloss over some heavy detail here and just touch on the basics. Consul ships as a Go binary, but I really prefer to deploy an RPM, so I re-used an excellent repo of spec files and the build was trivial. Once cheffed and installed, I bootstrapped the cluster in “server” mode and then set up a second Chef recipe to install the agent in “client” mode on some development databases.

Publishing Services

Now I need all my databases to be self-registering. It seems sane to make the “service” being discovered the database name itself. Consul has the ability to accept registration of services. It’s pretty simple: I start off with a mysql.json file containing the following:

  "ID": "$FQDN:3306:$DBNAME",
  "Name": "$DBNAME",
  "Address": "10.xx.xx.xx",
  "Port": 3306,
  "Check": {
    "ID": "mysql:$DBNAME",
    "Name": "MySQL - $DBNAME",
    "Script": "/usr/lib64/nagios/plugins/check_mysql -d $DBNAME",
    "Interval": "10s"

ID must be unique, and the most unique combination I can come up with is FQDN+PORT+DBNAME. Using colons has the caveat that I can’t discover this service via DNS, but I don’t plan to right now anyway. Name is the name of the service, so making that just the DBNAME also makes sense. Funny enough, since we use Nagios, I can just reuse one of our plugins to verify that the DBNAME exists. If for some reason we remove that database, the check will fail and Consul will know that the server no longer offers that database.

Then I send that in using cat mysql.json | curl -H "Content-Type: application/json" -X POST -d @- http://localhost:8500/v1/agent/service/register. The client agent on the database knows who the leader of the servers is, and this greatly simplifies where I post my information: I just post locally and the agent takes care of the rest. From here, it’s trivial to make a loop that iterates over all instances of MySQL, and then all databases inside each, to register all of them in this fashion.
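That per-host loop could look something like the sketch below. The system-schema filter, the hostname commands and the local agent address (Consul’s default of 127.0.0.1:8500) are my assumptions, not the actual production script:

```shell
#!/bin/bash
# Hypothetical sketch: register every database on this MySQL host as a
# service with the local Consul agent (assumed at the default 127.0.0.1:8500).
FQDN=$(hostname -f)
ADDR=$(hostname -i)
PORT=3306

register() {
    dbname=$1
    curl -s -X POST -H 'Content-Type: application/json' --data @- \
        http://127.0.0.1:8500/v1/agent/service/register <<EOF
{
  "ID": "$FQDN:$PORT:$dbname",
  "Name": "$dbname",
  "Address": "$ADDR",
  "Port": $PORT,
  "Check": {
    "ID": "mysql:$dbname",
    "Name": "MySQL - $dbname",
    "Script": "/usr/lib64/nagios/plugins/check_mysql -d $dbname",
    "Interval": "10s"
  }
}
EOF
}

# Register each non-system database on the local instance.
mysql -h 127.0.0.1 -P "$PORT" -N -e 'SHOW DATABASES' \
  | grep -Ev '^(mysql|information_schema|performance_schema)$' \
  | while read -r db; do register "$db"; done
```

Run from cron or a Chef-managed timer, this keeps registrations current as databases come and go.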

Now when I visit the Consul UI, I can see my new DBNAME and the two nodes it’s running on. The first node is a fresh build running the dataset properly. The second has MySQL online, but the database has not been restored yet – it’s running with no data. Because of this, only one of the two nodes shows as “online”, since the second node is failing authentication.

Getting Service Information

Consul provides a lot of HTTP endpoints to query. Looking back at my simple-but-flawed bash script, I can replace my hard-coded list of datasets with a call to the /v1/catalog/services endpoint to get a list of all datasets known to Consul. Upon reflection, maybe I should tag these services with a keyword like mysql so that I could filter to just the MySQL datasets and skip ‘consul’ or anything else Consul might store for us in the future.
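If I did add a “Tags”: [“mysql”] field to each registration (hypothetical here — the registrations above don’t carry tags yet), the catalog response maps each service name to its tag list, so jq could filter down to just the MySQL datasets:

```shell
# Hypothetical: keep only services carrying a "mysql" tag.
# /v1/catalog/services returns {"service-name": ["tag", ...], ...}
curl --silent http://localhost:8500/v1/catalog/services \
  | jq -r 'to_entries[] | select(.value | index("mysql")) | .key'
```

This drops the built-in consul service (and anything else untagged) without relying on name-based grep filtering.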

curl --silent http://localhost:8500/v1/catalog/services | jq 'keys[]' will give me the list of services, including the consul service itself (easily dropped with | grep -v consul), but that’s good enough to get the list:


With a list of databases in hand, I can find a healthy server to fetch the schema dump from. There is a /v1/health/service endpoint I can use with the ?passing flag to retrieve only healthy servers. To make it simple, I’ll throw in jq to get something easier to work with in bash.

curl --silent http://localhost:8500/v1/health/service/$DBNAME?passing | jq '.[].Node.Node' gives me the list of nodes passing the health check – currently one node.
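Putting the two endpoints together, the hard-coded script from earlier collapses into a short loop. This is a sketch under my assumptions — Consul on localhost, placeholder output paths, and taking the first passing node rather than anything smarter:

```shell
#!/bin/bash
set -e
CONSUL=http://localhost:8500
OUTDIR=/var/schemas

# 1. All datasets currently known to Consul (dropping the consul service itself).
for dbname in $(curl --silent "$CONSUL/v1/catalog/services" \
                  | jq -r 'keys[]' | grep -v '^consul$'); do
    # 2. First node whose health check is passing for this dataset.
    node=$(curl --silent "$CONSUL/v1/health/service/$dbname?passing" \
             | jq -r '.[0].Node.Node // empty')
    if [ -z "$node" ]; then
        echo "no healthy node for $dbname, skipping" >&2
        continue
    fi
    # 3. Dump the schema from the healthy node.
    mysqldump -d -h "$node" "$dbname" > "$OUTDIR/$dbname.sql"
done

# 4. Commit whatever changed.
cd "$OUTDIR" && git add -A && git commit -m "schema dump $(date +%F)" && git push
```

No hard-coded dataset list, no stale $FQDNs: adding or retiring a server changes nothing here.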


Making It Better

I have magic! I then got to wondering – how can I make this even better?

  1. If I could scan the frm files, I could find the latest modification timestamp and add it as an entry in the Consul key-value store. This would tell me which node in the service has the most recent DDL applied – the best candidate to take a schema dump from.
  2. Likewise, if I tracked the same timestamp per dataset, I could trigger a schema dump whenever that value changes. No more running a cron on a fixed interval and blindly dumping schemas.
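A rough sketch of the first idea, using Consul’s KV store (written via PUT to /v1/kv/<key>). The datadir path and the key naming scheme are illustrative assumptions:

```shell
#!/bin/bash
# Sketch: record the newest .frm modification time per database in Consul's
# KV store, so we can pick the node with the freshest DDL for schema dumps.
DATADIR=/var/lib/mysql
FQDN=$(hostname -f)

for dbdir in "$DATADIR"/*/; do
    dbname=$(basename "$dbdir")
    # Newest .frm mtime in this schema, as a unix timestamp.
    latest=$(find "$dbdir" -name '*.frm' -printf '%T@\n' 2>/dev/null \
               | sort -n | tail -1)
    [ -z "$latest" ] && continue
    # Hypothetical key layout: schema-mtime/<db>/<host> -> timestamp.
    curl -s -X PUT --data "$latest" \
        "http://127.0.0.1:8500/v1/kv/schema-mtime/$dbname/$FQDN" > /dev/null
done
```

A watcher comparing these values across nodes (or against the last dumped value) could then trigger dumps on change instead of on a timer.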

This is all the tip of the iceberg. I can foresee storing much more information about our environment and starting to pull all of our sources of information together into one. Adding a Chef helper to fetch data from Consul seems straightforward. With that, we could fetch attributes about servers – is it time-delayed? Is it a master? All of this would help us build monitoring service checks. If the information changes, redeploying Nagios configuration will bring things back in line without ever having to change a service definition by hand.

Likewise, the PHP array for our web servers can be generated by information in Consul. Replication relationships, roles (master vs replica) and databases available would all be available. For added safety, the deployinator for this can generate a new array and diff against current allowing a human the opportunity to sanity check before it goes live.

It’s clear that software like Consul can simplify our life at Etsy. The next challenge is looking at other options and ensuring that we pick the right way to simplify.


