Monday, December 8, 2014

Static type checking for JavaScript (TypeScript vs Flow)

I've never been a big fan of JavaScript for large applications (nothing beyond proxies and simple services), partially because in my experience the lack of static typing makes it very easy to make mistakes and very difficult to refactor code.

Because of that I was very excited when I discovered TypeScript some months ago (disclaimer: I'm not a JS expert), and I was very curious about the differences between TypeScript and Flow when a colleague pointed me to it today.   So I tried to play spot-the-seven-differences, but I'm lazy and I stopped after finding one.

Apart from cosmetic differences and tool availability, both TypeScript and Flow support annotation-based type definitions, type inference, and classes/modules based on ECMAScript 6 syntax.    The relevant difference I found after reading and playing with them (for half an hour) is that, because of the way it implements type inference, Flow can track type changes of a variable after its initial declaration, making it more appropriate for legacy JavaScript code where adding annotations may not be possible.

This is some code I used to play with it, with some inline comments:

var s = "hello";
s.length  // Both TS and Flow know that s is a string and check that it has a length property


var s: string = null;
s = "hello";
s.length  // Both TS and Flow know that s is a string and check that it has a length property

var s = null;
s = "hello";

s.length // TS doesn't know that this is a string, but Flow does and can check that it has a length property

Sunday, October 26, 2014

Service discovery and getting started with etcd

After playing with some Twitter open source components recently (mostly Finagle), I became very interested in the concept of service discovery as a way to implement load balancing and failure recovery in the interconnections between the internal services of your infrastructure.   This is especially critical if you have a microservices architecture.

Basically, the idea behind service discovery solutions is to have a shared repository with an up-to-date list of the existing instances of a service A, plus mechanisms to retrieve, update and subscribe to that list, allowing other components to distribute requests to service A in an automated and reliable way.


The traditional solution is ZooKeeper (based on a Paxos-like consensus protocol, open-sourced by Yahoo and now a top-level Apache project), but other promising alternatives have appeared recently.  This post summarizes the available alternatives very well.

One of the most interesting solutions is etcd (simpler than ZooKeeper, implemented in Go and backed by the CoreOS project).  In this post I explain how to do some basic testing with it.

etcd is a simple key/value store with support for key expiration and watches, which makes it ideal for service discovery.   You can think of it as a Redis server, but distributed (with consistency and partition tolerance) and with a simple HTTP interface supporting GET, SET, DEL and LIST operations.

Installation

The first step is to install etcd and the command line tool etcdctl.
You can easily download and install them from here, or if you are using a Mac you can just run "brew install etcd etcdctl".

Registering a service instance

When a new service instance in your infrastructure starts, it should register itself in etcd by sending a SET request with all the information that you want to store for that instance.

In this example we store the hostname and port of the service instance, and we use a URL schema like /services/SERVICE/DATACENTER/INSTANCE_ID.   In addition we set a TTL (60 seconds in the code below, refreshed every 10 seconds) to make sure the information expires if the instance stops refreshing it because it is no longer available.

var path = require('path'),
    uuid = require('node-uuid'),
    Etcd = require('node-etcd');

var etcd = new Etcd(),
    p = path.join('/', 'services', 'service_a', 'datacenter_x', uuid.v4());


function register() {
  etcd.set(p,
    JSON.stringify({
      hostname: '127.0.0.1',
      port: '3000'
    }), {
        ttl: 60
    });


  console.log('Registered with etcd as ' + p);

}
setInterval(register, 10000);
register();

Discovering service instances

When a service in your infrastructure requires another service, it has to send a GET request to retrieve all the available instances and subscribe (WATCH) to receive notifications when nodes go down or new nodes come up.

var path = require('path'),
    uuid = require('node-uuid'),
    Etcd = require('node-etcd');

var etcd = new Etcd();
var p = path.join('/', 'services', 'service_a', 'datacenter_x');

var instances = {};
function processData(data) {
  if (data.action == 'set') {
    instances[data.node.key] = data.node.value;
  } else if (data.action == 'expire') {
    delete instances[data.node.key];
  }
  console.log(instances);
}


var watcher = etcd.watcher(p, null, {recursive: true});
watcher.on("change", processData);

etcd.get(p, {recursive: true}, function(err, data) {
  data.node.nodes.forEach(function(node) {
    instances[node.key] = node.value;
  });
  console.log(instances);
});
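With the instances map in place, a client can do very simple client-side load balancing by picking a random entry for each request. A minimal sketch of the idea (pickInstance is a name I made up; a real framework would add retries and health checks):

```javascript
// Pick a random service instance from the discovered map.
// Values are the JSON strings stored in etcd by the registration script.
// Returns the parsed { hostname, port } object, or null if none are up.
function pickInstance(instances) {
  var keys = Object.keys(instances);
  if (keys.length === 0) {
    return null;
  }
  var key = keys[Math.floor(Math.random() * keys.length)];
  return JSON.parse(instances[key]);
}

// Example: two instances registered under their etcd keys.
var instances = {
  '/services/service_a/datacenter_x/id-1': '{"hostname":"127.0.0.1","port":"3000"}',
  '/services/service_a/datacenter_x/id-2': '{"hostname":"127.0.0.1","port":"3001"}'
};
var target = pickInstance(instances);
console.log(target.hostname + ':' + target.port);
```

Because the watcher keeps the map updated, expired instances stop receiving traffic automatically.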


Conclusions

Service discovery solutions are becoming a central piece of many server infrastructures because of their increasing complexity, especially with the rise of microservices-style architectures.  etcd is a very simple approach that you can understand, deploy and start using in less than an hour, and it looks more actively maintained and future-proof than ZooKeeper.

I tend to think that if Redis gets a good clustering solution soon, it could replace specialized service discovery/configuration solutions in some cases (but I'm far from an expert in this domain).

The other thing I found missing is good frameworks that build on these technologies and integrate connection pool management, load balancing strategies, failure detection, retries...    Kind of what Finagle does for Twitter; maybe that can be my next project :)

Thursday, March 20, 2014

Actor Model

The more I write concurrent applications, the more I hate it.    Typically you end up with code full of locks, queues, threads and thread pools, where it is somewhere between difficult and impossible to know whether it is correct or only apparently works.

Because of that I decided to do a little research on the Actor Model, which powers platforms like Erlang and makes them a very good fit for highly concurrent communication platforms (like Facebook Chat or WhatsApp).

These are the slides I prepared; there is not much explanation in them, so feel free to ask me any questions, and give it a try!   The Actor Model is fun and will simplify your life, whether you use a framework for it or just keep the concept in mind in your future designs.




Wednesday, February 12, 2014

Scientific way of estimating the cost of a feature in your project



I'm a fan of estimations as long as they are not used to try to figure out when a feature will be done.    I like estimations and I think they are critical when they are used to decide which features should be done and which ones shouldn't.

So, if they are so important, what is the best way to make estimations?   I'm going to share my secret formula, based on the things I have read and my personal experience in my professional career (where I have to admit that my estimations are now completely different from 15 years ago).

There are two key concepts that we need to understand before digging into the actual formula:
  • One feature working doesn't mean the feature is complete or ready.   Instrumentation, thread safety, unit tests, error handling, documentation, automation, unexpected problems, bug fixing...  most of the time these take much more effort than implementing the basic functionality.
  • Once you write something you usually have to maintain it, and not break it, forever.   Making sure that new features, refactors or any minor change don't break existing code is a really big deal in any project of sufficient complexity.

Based on those key concepts we can split the cost of a feature in 3 buckets:
  • Cost to have something working (the usual engineer's initial estimate): X
  • Cost to have something ready to be shipped: Y
  • Cost to keep it working for the life of the product: Z
For a total cost of adding a feature to a product of X + Y + Z.

And now is when the scientific part comes in.   Based on my experience and the thousands (well, maybe 3 or 4) of articles I have read, I think the Pareto Principle applies perfectly in this case.

In any project, the cost of implementing the basic functionality (X) is the 20% versus the 80% of implementing the rest of the functionality needed to ship the product (Y).   So Y = 4 * X.

The Circular Estimation Conjecture: You should always multiply your estimates by pi.
I've seen a similar estimate of X + Y = PI * X, which is a bit optimistic in my opinion.  I recommend reading the visual demonstration of what is called the circular estimation conjecture.







For the second part (the maintenance cost Z) we can apply the same Pareto Principle to get Z = 4 * (X + Y).

With all those numbers in place the conclusion is easy.   The total cost of having a feature in a product is X + 4 * X  + 4 * (X + 4 * X) = 25 * X

Take your initial guess (or ask any engineer) to get X; then the cost of the feature, which you should use to decide whether it is worth spending your time implementing it, is exactly 25 * X.
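The whole calculation fits in a few lines. A toy sketch of the formula above (totalCost is just my name for it):

```javascript
// Total cost of a feature from the initial "it works" estimate X,
// applying the 20/80 Pareto split twice as described above.
function totalCost(x) {
  var y = 4 * x;        // cost to make it shippable: the 80% vs the 20% of X
  var z = 4 * (x + y);  // cost to keep it working for the life of the product
  return x + y + z;     // = 25 * x
}

console.log(totalCost(1));  // 25
console.log(totalCost(5));  // 125: "5 minutes" of work becomes about 2 hours
```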

As a corollary and final demonstration of the theorem: I thought this post was going to take me 5 minutes to write, it took me 25 minutes, and I suspect I will have to spend more than an hour discussing it with other people.






Tuesday, January 21, 2014

Writing sequential test scripts with node

Today I was trying to create a node.js script to test an HTTP service, but the test required multiple steps.   I gave it a try using the async module to "simplify" that code, and that's the ugly code I came up with.

I'm not an expert in js/node, feel free to comment if I'm doing something wrong, I'm more than happy to learn.

(inflightSession and create are two helper functions that I have)

Test using node + jasmine: 

it("should accept valid sessionId", function(done) {
      async.waterfall([
          inflightSession,

          function(sessionId, callback) {
            create({ 'sessionId':sessionId }, callback);
          }
      ], function(error, response) {
          expect(response.statusCode).to.equal(200);
          done();
      });
  });


Same test using python + unittest:

def create_ok_test():
    session_id = inflight_session()
    response = create({ 'sessionId': session_id })

    assert_equals(200, response.status_code)


Same test using node ES6 generators (yield keyword):

it("should accept valid sessionId", function*() {
      var sessionId = yield inflightSession();

      var response = yield create({ 'sessionId': sessionId });
      expect(response.statusCode).to.equal(200);
  });


Honestly, the python code is way more readable than the existing node code, and still better even when compared with the new node generators.    Anyway, generators definitely look like a promising way forward for the node community.   Some comments:

ES6 generators are available behind a flag in node 0.11 and are expected to be included in 0.12.

yield is a common keyword in other languages (e.g. Python, C#) used to exit a function while keeping its state, so that you can resume the execution later.

function* is the syntax to define a generator function (a function using yield inside).

You need a runner that supports generator functions (in this example jasmine would need to add support for it): basically it calls generator.next() and waits for the result (which should be a promise or similar object) before calling generator.next() again.
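Such a runner is only a few lines. This is a toy sketch of the idea, assuming every yielded value is a promise; a real runner would also forward rejections into the generator with generator.throw(), which this sketch skips:

```javascript
// Minimal generator runner: drives a generator function whose yields
// are promises, resuming it with each resolved value.
function run(genFn) {
  var gen = genFn();
  function step(value) {
    var result = gen.next(value);
    if (result.done) {
      return Promise.resolve(result.value);
    }
    // Wait for the yielded promise, then resume the generator with its value.
    return Promise.resolve(result.value).then(step);
  }
  return step();
}

// Usage: sequential-looking asynchronous steps.
function delay(value) {
  return new Promise(function(resolve) {
    setTimeout(function() { resolve(value); }, 10);
  });
}

run(function*() {
  var a = yield delay(1);
  var b = yield delay(a + 1);
  return a + b; // 3
}).then(function(total) {
  console.log(total);
});
```

This is essentially what libraries like co (and later async/await) do under the hood.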

UPDATE: As I'm somehow forced to use node, I ended up creating a helper function and my tests now look like this:

itx("should accept valid sessionId", inflightSession, function(sessionId, done) {
      create({ 'sessionId':sessionId }, function(error, response) {
          expect(response.statusCode).to.equal(200);
          done();
      });
});


Friday, January 17, 2014

Distributed Load Testing: Conclusions (5/5)

Let's recap what we have done in this series and try to draw some conclusions.   The steps or achievements are these:
  1. Find and test a distributed load testing tool in python: locust.
  2. Extend locust for custom (non-HTTP) protocol testing.
  3. Use Instant Servers to run the locust master and slaves.
  4. Implement a simple way to autostop the machines when they are not being used based on the locust logs and instant servers stop API.
  5. Create a template for the slaves to be easily cloned.   Use instant servers tags to define groups.
  6. Fix the python Instant Server SDK and extend it with new authentication and clone features.
  7. Extend locust interface adding a button to spawn machines in instant servers directly from the locust web interface.
Today I like python even more, along with testing tools based on scripting instead of complex UIs.  This project gave me the opportunity to discover locust and Instant Servers, and I highly recommend using them for this kind of use case; it was very easy and a lot of fun using and combining those technologies.  Hopefully I can get more time for a deeper integration of virtual machines in locust (with a good control UI and perhaps support for other providers).


Wednesday, January 15, 2014

Distributed Load Testing: py-smartdc and spawning slaves from locust (4/5)

After all the previous work I was able to spawn slaves easily from the Instant Servers interface (just clicking Clone) or with the command line tools, but I wanted to go further and explore the extension capabilities of locust to add some very simple support for creating the slaves from the locust web page.

How to clone and tag an Instant Servers machine with python

I have to admit that my first instinct was to try to create a python Instant Servers SDK.  I even created the github repo and built the skeleton of the SDK, but 10 minutes later somebody told me how stupid I was because there was already an official python SDK :(

Ok, that's great, I thought, but when I tried to use it I realized that it was not compatible with Instant Servers.   The "problem" is that the SDKs are maintained by Joyent, and even if Instant Servers is the same infrastructure, the API version is not exactly the same, so the existing python SDK doesn't work with Telefonica's infrastructure.

The solution was easy: I created my fork and submitted a pull request with a patch [1] that still hasn't been merged.   Feel free to use my fork! [2]  In addition I added another patch to support username & password authentication [3], and another one to add support for cloning machines [4].

Once that's solved, creating a machine by cloning the template instance we built in the previous post is very easy:

from smartdc import DataCenter, TELEFONICA_LOCATIONS

mad = DataCenter(location='eu-mad-1',
              known_locations=TELEFONICA_LOCATIONS,
              login='YOUR_LOGIN', password='YOUR_PASSWORD', api_version='6.5')

template_found = False
for machine in mad.machines():
        tags = machine.get_tags()

        if tags.get('locust') == 'slave-template':
                new_machine = machine.clone()
                new_machine.add_tags(locust='slave')
                print new_machine.name + ' ' + str(new_machine.get_tags())
                template_found = True

if not template_found:
        print 'slave-template instance not found'


How to integrate spawning machines in locust

Locust is easily extensible, not only in the testing scripts, as we saw in the previous post, but also in the UI.   It is easy to add functionality to the website by modifying the template and adding more HTTP routes to process new API requests.  In my case I added a new button to spawn an Instant Servers machine and a route to process that request:

In locust/templates/index.html: 

                <div class="top_box box_stop box_running" id="box_reset">
                    <a href="/stats/reset">Reset Stats</a><br/>
                    {% if is_distributed %}
                        <a href="/cloud/create">Spawn Slave</a>
                    {% endif %}

                </div>



In a new file locust/cloud.py:

from smartdc import DataCenter, TELEFONICA_LOCATIONS
from locust import web

mad = DataCenter(location='eu-mad-1',
              known_locations=TELEFONICA_LOCATIONS,
              login='', password='', api_version='6.5')

@web.app.route("/cloud/create")
def cloud_create():

    for machine in mad.machines():
        tags = machine.get_tags()

        if tags.get('locust') == 'slave-template':
            new_machine = machine.clone()
            new_machine.add_tags(locust='slave')
            return new_machine

    return None


You can find my locust fork in [5]; don't forget to change your login and password in the cloud.py file.

The result is this simple new button with the functionality required to spawn new machines automatically configured as locust slaves connected to the master for distributed testing.



[1] https://github.com/atl/py-smartdc/pull/9
[2] https://github.com/ggarber/py-smartdc/
[3] https://github.com/atl/py-smartdc/pull/10
[4] https://github.com/atl/py-smartdc/pull/11
[5] https://github.com/ggarber/locust

Friday, January 3, 2014

Distributed Load Testing: Using Instant Servers for semi-automated slave spawning (3/5)

As I mentioned in the first post of this series, virtual machines provide a very dynamic and cheap way to create slave nodes for load testing.  There are different providers out there, but I was interested in using Instant Servers (based on Joyent technology) because it is really simple to start with and it is fun to explore new solutions instead of always using the boring Amazon infrastructure.

The three small features I was interested in implementing with Instant Servers were:
  • Simplifying the creation of machines preconfigured to be used as locust slaves for my load testing
  • Starting the slaves automatically on machine startup so that the master detects them and is able to schedule jobs
  • Stopping a machine automatically when it has not been used for any test for some time, to make sure we don't waste money when we forget to stop unused machines

Creation of machines

To simplify the creation of machines I decided to build an instance with all the required packages and use it as a template to clone the actual slave nodes.    To be able to find this instance automatically later I used Instant Servers tags.  Unfortunately, as far as I know there is no UI for tagging in Instant Servers, but you can use a script similar to mine [1].

This machine needs to have python, locust, the test file (locustfile.py), all the basic packages (make, gcc) and the packages required to start locust automatically (see next point).

Auto starting slaves

To make sure that the slaves are started during machine startup, and that we start multiple instances on every box (because locust is single-threaded and I want to use multicore machines), I used supervisord with the following (hopefully self-explanatory) configuration.   Tune the numprocs parameter depending on the machine you are using.

/etc/supervisor/conf.d/locust.conf

[program:locust]
command=locust -f /root/locustfile.py --master-host=81.45.23.221 --slave
stderr_logfile = /var/log/supervisord/locust-stderr.log
stdout_logfile = /var/log/supervisord/locust-stdout.log
process_name=%(program_name)s_%(process_num)02d
numprocs=4

Auto stopping machines


To monitor the usage of a machine I could have used the extensive Analytics API in Instant Servers, but I decided to use the locust log file for simplicity.   I created a python script [2] that monitors the log files for activity; if there is no activity for a few minutes, it invokes the Instant Servers stop API to shut that instance down.

Note: It was not possible to use the existing python SDK with Instant Servers, so I had to fork it and make a small modification.  More details in the next post.


PS: If I ever mention the word "cloud" in any of these posts, please feel free to insult me, I will deserve it.

[1] https://gist.github.com/ggarber/8381238
[2] https://gist.github.com/ggarber/8381263