Sunday, October 26, 2014

Service discovery and getting started with etcd

After playing with some Twitter open-source components recently (mostly Finagle), I became very interested in the concept of service discovery as a way to implement load balancing and failure recovery in the interconnection between the internal services of your infrastructure.  This is especially critical if you have a microservices architecture.

Basically, the idea behind service discovery solutions is to have a shared repository with an up-to-date list of the existing instances of a given service A, together with mechanisms to retrieve, update and subscribe to that list, allowing other components to distribute their requests to service A in an automated and reliable way.

The traditional solution is ZooKeeper (based on Zab, a Paxos-like consensus protocol, originally open-sourced by Yahoo and maintained as part of the Hadoop project), but other alternatives have appeared recently and look very promising.  This post summarizes the available alternatives very well.

One of the most interesting alternatives is etcd (simpler than ZooKeeper, implemented in Go on top of the Raft consensus algorithm, and backed by the CoreOS project).  In this post I explain how to do some basic testing with it.

etcd is a simple key/value store with support for key expiration (TTLs) and for watching keys, which makes it ideal for service discovery.  You can think of it as a Redis server, but distributed (with consistency and partition tolerance) and with a simple HTTP interface supporting the usual get, set, delete and list operations.
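To make those semantics concrete, here is a toy in-memory model of the three operations that matter for service discovery: setting a key with a TTL, listing keys by prefix, and watching for changes.  This is just an illustration in plain JavaScript (all names are invented), not etcd itself:

```javascript
// Toy in-memory model of the etcd operations used for service discovery:
// set with a TTL, list by key prefix, and watch for changes.
// Illustration of the semantics only -- this is NOT etcd.
function TinyStore() {
  this.data = {};      // key -> value
  this.timers = {};    // key -> pending expiration timer
  this.watchers = [];  // callbacks invoked on every change
}

TinyStore.prototype.set = function (key, value, ttlMs) {
  if (this.timers[key]) clearTimeout(this.timers[key]); // refreshing resets the TTL
  this.data[key] = value;
  this.notify({ action: 'set', key: key, value: value });
  if (ttlMs) {
    var self = this;
    this.timers[key] = setTimeout(function () {
      // The key was not refreshed in time: drop it and tell the watchers.
      delete self.data[key];
      self.notify({ action: 'expire', key: key });
    }, ttlMs);
  }
};

TinyStore.prototype.list = function (prefix) {
  return Object.keys(this.data).filter(function (k) {
    return k.indexOf(prefix) === 0;
  });
};

TinyStore.prototype.watch = function (cb) {
  this.watchers.push(cb);
};

TinyStore.prototype.notify = function (event) {
  this.watchers.forEach(function (cb) { cb(event); });
};
```

The register/discover snippets below do exactly this against a real etcd server: the producer keeps calling set with a TTL, and the consumers combine list and watch to keep a local view of the live instances.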


The first step is to install etcd and the command-line tool etcdctl.
You can easily download and install them from here, or if you are using a Mac you can just run "brew install etcd etcdctl".

Registering a service instance

When a new service instance in your infrastructure starts, it should register itself in etcd by sending a SET request with all the information that you want to store for that instance.

In this example we store the hostname and port of the service instance, and we use a key schema like /services/SERVICE/DATACENTER/INSTANCE_ID.  In addition, we set a TTL of 10 seconds to make sure the information expires if it stops being refreshed because the instance is no longer available.

var path = require('path'),
    os   = require('os'),
    uuid = require('node-uuid'),
    Etcd = require('node-etcd');

var etcd = new Etcd(),
    p = path.join('/', 'services', 'service_a', 'datacenter_x', uuid.v4());

function register() {
  // Store the instance data as JSON under a unique key. The ttl makes
  // the key expire automatically if this process stops refreshing it.
  etcd.set(p, JSON.stringify({
      hostname: os.hostname(),
      port: '3000'
    }), {
      ttl: 10
    });
  console.log('Registered with etcd as ' + p);
}

// Refresh the registration well before the 10-second TTL expires.
register();
setInterval(register, 5000);

Discovering service instances

When a service in your infrastructure needs to use another service, it has to send a GET request to retrieve all the available instances, and subscribe (WATCH) to receive notifications when nodes go down or new nodes come up.

var path = require('path'),
    Etcd = require('node-etcd');

var etcd = new Etcd();
var p = path.join('/', 'services', 'service_a', 'datacenter_x');

var instances = {};

// Keep the local list of instances in sync with etcd events.
function processData(data) {
  if (data.action == 'set') {
    instances[data.node.key] = data.node.value;
  } else if (data.action == 'expire' || data.action == 'delete') {
    delete instances[data.node.key];
  }
}

// Subscribe to changes under the whole service directory.
var watcher = etcd.watcher(p, null, {recursive: true});
watcher.on('change', processData);

// Initial load of the currently registered instances.
etcd.get(p, {recursive: true}, function(err, res) {
  if (err) throw err;
  res.node.nodes.forEach(function(node) {
    instances[node.key] = node.value;
  });
});

Service discovery solutions are becoming a central piece of many server infrastructures because of their increasing complexity, especially with the rise of microservices-like architectures.  etcd is a very simple solution that you can understand, deploy and start using in less than an hour, and it looks more actively maintained and future-proof than ZooKeeper.

I tend to think that if Redis gets a good clustering solution soon, it could replace specialized service discovery/configuration solutions in some cases (but I'm far from an expert in this domain).

The other thing that I found missing is good frameworks that make use of these technologies, integrated with connection pool management, load balancing strategies, failure detection, retries...  Kind of what Finagle does for Twitter; maybe that can be my next project :)