Saturday, July 26, 2014

How to deal with blocking operations

In a distributed environment even simple things become complicated. One of the most trivial examples is a counter. If you have only one instance of an application running, you can simply store the counter in a database table with the counter's id and value, for example:
| ID | VAL |
------------
| 0  |  1  |
and increase its value with a simple query:
UPDATE counters SET val=val+1 WHERE id=0
However, when there are many clients trying to increase the counter simultaneously (for example, multiple frontends displaying the number of visitors to the website), you may experience some serious slowdowns. This is because updating a row in a database is a blocking operation: during the update the row is locked to prevent multiple processes from modifying it at the same time. There are two basic types of locks: write locks and read locks. A write lock is the basic and most commonly used type; it does not allow modifying a value while another change is in progress. You can still read the row's value, but then two subsequent reads can return two different values, because a change may have taken place between them. When this becomes an issue, you can use a read lock, which prevents other processes from reading the row until the change is done. Depending on how your database is configured, locking the counter row will be more or less painful, but the updates still need to happen sequentially to keep the value correct. This way, the simple operation of increasing a visitor counter can become a serious bottleneck for the whole website.

I have prepared an example application which simulates such a case. You can download a zipped archive from here. It was prepared on Linux, but if you work on Windows you should also be able to compile it with Cygwin - just make sure to install the MySQL development libraries (you need the mysql_config.exe binary in your PATH).
First, prepare the database and set up the tables. All examples assume that you have a MySQL database running on localhost, with a database called "test" and the password "root" for the root user. If your environment differs, modify the respective variables:
mysql -u root -proot test < counter.sql
Now, compile the example code which works on the counter:
make sync
This example simulates ten simultaneous blocking counter updates. Each transaction starts one second after the previous one, but it takes two seconds to complete, so the transactions overlap and block one another.
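If you are curious what the sync variant does without digging through the archive, each simulated client performs roughly the following steps (a simplified sketch based on the MySQL C API; connection setup, error handling and the exact timing are omitted, and the function name is mine, not the one used in the archive):
#include <mysql/mysql.h>
#include <unistd.h>

// One simulated client: a transaction that holds the row lock for ~2 seconds.
void increase_counter(MYSQL *conn) {
    mysql_query(conn, "BEGIN");
    // The UPDATE takes a write lock on the counter row...
    mysql_query(conn, "UPDATE counters SET val = val + 1 WHERE id = 0");
    // ...and the lock is held until COMMIT, so the other clients must wait.
    sleep(2); // simulate a slow transaction
    mysql_query(conn, "COMMIT");
}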
When you run the resulting binary you should see output similar to the following:
Begin #1
Update #1
Begin #2
Update #2
Begin #3
Update #3
Commit #1
Begin #4
Update #4
Begin #5
Update #5
Commit #2
Begin #6
Update #6
Begin #7
Update #7
Commit #3
Begin #8
Update #8
Begin #9
Update #9
Commit #4
Begin #10
Update #10
Commit #5
Commit #6
Commit #7
Commit #8
Commit #9
Commit #10
Counter value: 10, time elapsed: 20.015499s
You can notice that commits do not take place immediately after updates. This is because each transaction must wait for all previous ones to complete. As a result the test takes around 20 seconds, which is the sum of the time needed to finish all the transactions. In this scenario, the more clients try to modify the counter, the longer it takes to complete. Eventually the system becomes overloaded, some of the operations time out, client processes fail to set the correct counter value, and the user experience is a disaster.

There are a couple of ways to solve the problem. Some of them involve spending time and money on building a NoSQL cluster based on some fashionable technology, like Hadoop or Cassandra. A cheaper and smarter way is to implement so-called "optimistic locking". The main idea behind this technique is to reduce to a minimum the time you spend holding a lock. An operation based on optimistic locking has three phases: first you read the value you want to modify, then you compute the new value, and finally you write the new value back, but only if the stored value has not changed since the first read. The last phase is the only one that needs to be atomic; if the comparison fails, it means that another process has already modified the value, and we need to roll back.
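Applied to our counter, an optimistic update could look roughly like this (again a sketch using the MySQL C API, with error handling stripped; the function name and the retry policy are left to the reader):
#include <mysql/mysql.h>
#include <cstdio>
#include <cstdlib>

// Try to increase the counter once; returns true on success, false when
// another process modified the row between our read and our write.
bool try_increase_counter(MYSQL *conn) {
    // Phase 1: read the current value (no lock is held afterwards).
    mysql_query(conn, "SELECT val FROM counters WHERE id = 0");
    MYSQL_RES *res = mysql_store_result(conn);
    MYSQL_ROW row = mysql_fetch_row(res);
    long old_val = atol(row[0]);
    mysql_free_result(res);

    // Phase 2: compute the new value locally.
    long new_val = old_val + 1;

    // Phase 3: atomic compare-and-swap - the UPDATE matches the row only
    // if the stored value is still the one we read.
    char query[128];
    snprintf(query, sizeof(query),
             "UPDATE counters SET val = %ld WHERE id = 0 AND val = %ld",
             new_val, old_val);
    mysql_query(conn, query);
    return mysql_affected_rows(conn) == 1;
}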
Optimistic locking is a good solution, but it can still lead to failures while updating the counter value, and it also requires clients to implement some strategy for dealing with update failures: "Should I repeat the operation?", "At what intervals?", "How many times before giving up?", and so on.

Another way is to get rid of blocking operations altogether and replace them with non-blocking ones. This can be done easily by introducing queues. Putting a new value into a queue does not affect any existing values, so it doesn't need any read or write locks. With every new request you just insert a new row into the queue table, and a single, separate worker process updates the counter and removes the rows it has read from the queue. Because there are no concurrent processes working on the counter, there is no blocking problem. Also, the worker can read new inserts in batches instead of single records, which can give a huge performance boost with a fast-growing queue (if there are a hundred new tasks waiting in the queue, you can increase the counter by a hundred in one step). The queue is safe, because each insert has its own unique id, so deleting records from the queue does not interfere with inserting new ones. Also, when an update fails, the inserts are not lost; they remain in the queue for later processing. The only drawback is that the counter value you read is always a little bit behind, but for purposes such as counting visitors to a website this is easy to accept.
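A sketch of this approach could look as follows (the counter_queue table with an auto-increment id column is my own invention for illustration; the async variant in the archive may differ in details):
#include <mysql/mysql.h>
#include <cstdio>
#include <cstdlib>

// Assumed schema: counter_queue (id INT AUTO_INCREMENT PRIMARY KEY).

// Producer: called by every client; a plain INSERT never blocks on the counter row.
void enqueue_increase(MYSQL *conn) {
    mysql_query(conn, "INSERT INTO counter_queue () VALUES ()");
}

// Consumer: a single worker that periodically drains the queue in batches.
void process_queue(MYSQL *conn) {
    mysql_query(conn, "BEGIN");

    // Count the pending increments and remember where the batch ends.
    mysql_query(conn, "SELECT COUNT(*), COALESCE(MAX(id), 0) FROM counter_queue");
    MYSQL_RES *res = mysql_store_result(conn);
    MYSQL_ROW row = mysql_fetch_row(res);
    long pending = atol(row[0]);
    long last_id = atol(row[1]);
    mysql_free_result(res);

    if (pending > 0) {
        char query[128];
        // Apply the whole batch to the counter in a single step...
        snprintf(query, sizeof(query),
                 "UPDATE counters SET val = val + %ld WHERE id = 0", pending);
        mysql_query(conn, query);
        // ...and remove only the rows we have just accounted for.
        snprintf(query, sizeof(query),
                 "DELETE FROM counter_queue WHERE id <= %ld", last_id);
        mysql_query(conn, query);
    }

    mysql_query(conn, "COMMIT");
}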

Going back to the example. Compile the code with:
make async
This will rebuild the code to use a queue instead of changing the counter directly. When you run it now, you should see output similar to the following:
Begin #1
Insert #1
Begin #2
Insert #2
Commit #1
Begin #3
Insert #3
Commit #2
Begin #4
Insert #4
Begin #5
Insert #5
Commit #3
Begin #6
Insert #6
Commit #4
Commit #5
Begin #7
Insert #7
Commit #6
Begin #8
Insert #8
Commit #7
Begin #9
Insert #9
Commit #8
Begin #10
Insert #10
Commit #9
Commit #10
Counter value: 20, time elapsed: 11.023697s
As you can see, the whole operation completed much faster, because the new transactions didn't have to wait for the previous ones to finish (the counter reads 20 because it still holds the 10 increments from the first run).

Saturday, March 15, 2014

Hacking Linux executables with LD_PRELOAD code injection

Executable code can be stored on disk in many different formats. In MS-DOS, programs which didn't use more than one 64 kB memory segment could simply be saved in a COM file, which contained only machine code, with no additional metadata. When instructed to run such a file, the operating system loaded it into the first free memory segment, at offset 0x100 (just after the Program Segment Prefix), initialized the code segment register (CS), the data segment registers (DS, ES) and the stack segment register (SS) to point to that segment, and then passed control to the first instruction located at the beginning of the file. More complicated executables contained an EXE header with additional information required by the operating system, like the number of memory blocks the program needs or the initial values of the segment registers.

Nowadays, applications compiled for multitasking operating systems often use dynamically linked shared libraries, so the executable file headers also contain information about the required libraries and the functions imported from them. In Linux, you can inspect that data with readelf. For example, the command:
readelf -s /usr/bin/users
gives the following information (I stripped some of the output to increase readability):
Symbol table '.dynsym' contains 58 entries:
  Num: Type Bind   Vis     Name
    1: FUNC GLOBAL DEFAULT utmpxname@GLIBC_2.2.5 (2)
    2: FUNC GLOBAL DEFAULT free@GLIBC_2.2.5 (2)
    3: FUNC GLOBAL DEFAULT abort@GLIBC_2.2.5 (2)
    4: FUNC GLOBAL DEFAULT __errno_location@GLIBC_2.2.5 (2)
    5: FUNC GLOBAL DEFAULT strncpy@GLIBC_2.2.5 (2)
    6: FUNC GLOBAL DEFAULT strncmp@GLIBC_2.2.5 (2)
    7: FUNC GLOBAL DEFAULT _exit@GLIBC_2.2.5 (2)
    8: FUNC GLOBAL DEFAULT __fpending@GLIBC_2.2.5 (2)
    9: FUNC GLOBAL DEFAULT qsort@GLIBC_2.2.5 (2)
   10: FUNC GLOBAL DEFAULT textdomain@GLIBC_2.2.5 (2)
   11: FUNC GLOBAL DEFAULT endutxent@GLIBC_2.2.5 (2)
   12: FUNC GLOBAL DEFAULT fclose@GLIBC_2.2.5 (2)
   13: FUNC GLOBAL DEFAULT bindtextdomain@GLIBC_2.2.5 (2)
   14: FUNC GLOBAL DEFAULT dcgettext@GLIBC_2.2.5 (2)
   15: FUNC GLOBAL DEFAULT __ctype_get_mb_cur_max@GLIBC_2.2.5 (2)
   16: FUNC GLOBAL DEFAULT strlen@GLIBC_2.2.5 (2)
   17: FUNC GLOBAL DEFAULT __stack_chk_fail@GLIBC_2.4 (3)
   18: FUNC GLOBAL DEFAULT getopt_long@GLIBC_2.2.5 (2)
   19: FUNC GLOBAL DEFAULT mbrtowc@GLIBC_2.2.5 (2)
   20: FUNC GLOBAL DEFAULT __overflow@GLIBC_2.2.5 (2)
   21: FUNC GLOBAL DEFAULT strrchr@GLIBC_2.2.5 (2)
   22: FUNC GLOBAL DEFAULT lseek@GLIBC_2.2.5 (2)
   23: FUNC GLOBAL DEFAULT memset@GLIBC_2.2.5 (2)
   24: FUNC GLOBAL DEFAULT __libc_start_main@GLIBC_2.2.5 (2)
   25: FUNC GLOBAL DEFAULT memcmp@GLIBC_2.2.5 (2)
   26: FUNC GLOBAL DEFAULT fputs_unlocked@GLIBC_2.2.5 (2)
   27: FUNC GLOBAL DEFAULT calloc@GLIBC_2.2.5 (2)
   28: FUNC GLOBAL DEFAULT strcmp@GLIBC_2.2.5 (2)
As you can see in the last line, the users executable calls the strcmp function from the GNU C Library (glibc). This function compares the two strings given as its arguments and returns 0 if the strings are equal, a positive value if the first string is greater than the second one, and a negative value if it is smaller.

You can take advantage of the fact that the program uses a function from an external library to modify its behaviour without touching its code. All you have to do is substitute strcmp with your own function. You can do this by writing your own dynamic library with a custom strcmp and loading it before the code of the users executable is initialized.

First, let's create a very simple library mylib.c with this code:
int strcmp(const char* s1, const char* s2) {
  for (; *s1 == *s2; s1++, s2++) {
    if (*s1 == '\0') {
      return 0;
    }
  }
  return ((*(unsigned char*)s1 > *(unsigned char*)s2) ? -1 : 1);
}
and compile it with the following command:
gcc mylib.c -shared -Wl,-soname,mylib -o mylib.so -fPIC
As you probably noticed, it behaves in the opposite way to the original strcmp function: it returns a negative value when the first string is greater than the second one, and a positive value otherwise.

Now all you have to do is inject the custom library into the program at startup. Fortunately, Linux allows you to load your own dynamic libraries before starting any executable, using the LD_PRELOAD environment variable. We can take advantage of this feature - just compare the results of the two following commands:
[adam@ubuntu:~]$ users
adam eve

[adam@ubuntu:~]$ LD_PRELOAD="./mylib.so" users
eve adam
Because we reversed the behaviour of the strcmp function, which the users executable relies on, we managed to change the result of the command (the user logins are printed in reverse order in the second example) without even touching the binary.

The same mechanism lets you replace the standard glibc malloc with Google's TCMalloc, a faster drop-in implementation, without recompiling your programs.

Monday, January 13, 2014

OSV - operating system for the cloud

Cloud development requires designing for environments with reduced performance. With many virtual servers occupying one physical machine, a lot of processing power is consumed by context switching, reducing the overall performance of all the virtual instances.

OSV is a new operating system designed specifically for instances running in the cloud, written in C++ (which, as a side note, proves that Linus Torvalds was wrong when he claimed that C++ is not suitable for writing a system kernel). OSV eliminates a lot of the overhead found in traditional operating systems, such as kernel memory protection. If you have ever worked with an operating system which does not use such protection (like AROS or MorphOS), especially on x86 hardware, you must have noticed the huge performance gain on those systems. Of course, the biggest drawback of this approach is that a badly written application can crash the whole machine. However, since OSV does not run directly on hardware, but on top of a hypervisor (such as Xen or KVM), such a crash affects only a single virtual cloud instance, not the whole server. Moreover, OSV runs a standard Java Virtual Machine, which provides automatic memory management and the necessary level of protection by itself, so no extra effort from the operating system is needed to ensure software stability.

Will OSV become successful? It's hard to say at the moment, but it surely shows a new trend, in which not only hardware and applications, but also operating systems evolve to fit the new reality that cloud computing creates. If you are interested in trying out OSV yourself, here are some instructions on how to run an Amazon EC2 instance with OSV on board.

Saturday, January 11, 2014

Waiting for the seamless cloud

You may not have heard about the memristor, the long-sought fourth fundamental circuit element (next to the resistor, the capacitor and the inductor), but this invention may be the greatest revolution in the computer industry since the introduction of the transistor. Memristors have a simpler construction than transistors and don't require an external power source to retain information. Last year Crossbar Inc. unveiled their RRAM non-volatile memory technology with shockingly impressive characteristics: 1 terabyte of data on a single chip, with access 20 times faster than traditional NAND flash.

But this is just the beginning. As the technology evolves, it may be able to replace not only flash drives, but also DRAM, which will have a huge impact on the whole computer industry. Imagine a device with only one memory space, used both as data storage and for running applications. Moreover, applications are loaded into memory only once, and they stay there even when you turn the power off. You could take the battery out of your smartphone and replace it with a new one, and after turning the power back on you would instantly get your device just as you left it.

Sound familiar? If you have ever worked with Smalltalk, you already know the whole idea. If you haven't, I strongly encourage you to try out Squeak or Pharo. They are both open source Smalltalk implementations, and they provide a complete, object-oriented working environment. What is unique about Smalltalk is that you can start applications, create new workspaces, build classes and objects, and when you quit, the whole environment is saved in an image and restored in the next session. At first sight it looks like simple hibernation, but it isn't. In Smalltalk you can build applications without writing source code in the traditional way. You just create an object, dynamically add properties and methods to it, and it becomes a part of the current environment. When you want to distribute the application, you just distribute the image, which contains all the application code, libraries and necessary tools.

RRAM is a technology which may eventually allow the old Smalltalk idea to be implemented at the hardware level and finally separate virtual machines from the underlying hardware. Currently, every time a virtual machine crashes, a new instance needs to be started from scratch and configured. Although this process can be automated with tools like Puppet, it still takes time and makes applications start from scratch. With an RRAM-based cloud this problem will no longer exist: you will be able to save the state of a virtual machine at any moment and then spawn it or move it to another physical location (server, rack unit, or even datacenter) within seconds. More importantly, the state of all the applications will be preserved, which means, for example, that you will not lose user sessions even if you move a virtual machine to another location.

In my opinion this will be the next big step in the evolution of cloud computing. Today cloud instances are mostly stateless and require external storage or a cache to keep user data. With the new memory chips they will be able to preserve their state even while moving across different hardware, and this process will become absolutely seamless for the end user - just the same way GSM networks work today, keeping your call running even when you switch between base stations while talking.

Moving to the cloud

Cloud computing has developed rapidly during the last few years. As shown by a Gartner study, the cloud market grew by 18% in 2013 and is now worth 131 billion (i.e. thousand million) US dollars. By 2015 it is expected to reach 180 billion. Despite some well-known security and privacy concerns regarding public cloud storage, and even despite the global PRISM scandal, a future in which most (if not all) of our data is stored in the cloud seems inevitable. This requires a substantial shift in the approach to software architecture and creates new challenges for software developers.

First, you need to abandon huge, complicated, monolithic applications in favour of platforms built from many small, interconnected services. Such services require far fewer resources and can be easily replicated throughout the cloud, increasing the overall stability and safety of the applications. One of the most famous examples of a successful SOA revolution is Amazon, which radically changed its software platform.

Second, you need to change the underlying technologies. For example, Java, with its extremely resource-hungry virtual machine, bloated frameworks and heavy threads, seems a rather bad choice. On the other hand, Node.js, with its small memory footprint and single-threaded, event-driven, non-blocking I/O model, fits the cloud perfectly.
If you have ever had doubts about JavaScript being the right choice for an enterprise application, then you should read about PayPal moving their backend from Java to JavaScript. Apart from huge cost savings (PayPal had been extensively using very expensive Java solutions like Terracotta BigMemory to scale its backend vertically) and increased productivity (engineers claim to be able to develop software twice as fast as in Java), PayPal also benefits from the new skills of its software developers, who can now work on both the frontend and the backend using the same programming language.

Third, you may need to redesign the application flow and the algorithms used. You will often get better results processing huge amounts of data in small chunks on many small cloud instances than all at once on a few huge mainframes. This means not only using MapReduce to speed up your tasks, but also avoiding complex sequential algorithms. For example, in typical situations mergesort may perform worse than heapsort, but it is much simpler to parallelize, and with enough cloud instances it will sort your data much faster, as the sketch below illustrates.
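To make the parallelization argument concrete, here is a toy sketch (an illustration only, not production code): the two halves of an array are completely independent, so they can be sorted by separate workers and only the final merge is sequential - something that has no direct equivalent in the inherently sequential heapsort:
#include <algorithm>
#include <thread>
#include <vector>

// Toy parallel mergesort: sort each half in its own thread, then merge.
// In a cloud setting the two halves could just as well live on two instances.
void parallel_mergesort(std::vector<int> &data) {
    if (data.size() < 2) {
        return;
    }
    auto middle = data.begin() + data.size() / 2;

    // The halves are disjoint, so sorting them concurrently is safe.
    std::thread left([&] { std::sort(data.begin(), middle); });
    std::thread right([&] { std::sort(middle, data.end()); });
    left.join();
    right.join();

    // The only inherently sequential step is the final merge.
    std::inplace_merge(data.begin(), middle, data.end());
}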
But there is more to it than that. AMD has just announced a new line of 64-bit ARM server CPUs, which is widely regarded as the end of its race against Intel in high-performance computing. ARM processors are less complicated and more power-efficient than their Intel counterparts, but they are also slower - which means they will be better suited for clouds with many small instances.

Finally, there is a whole new class of problems in cloud computing which are either not present or long solved in traditional applications, like data consistency or dealing with transactions. Some of them cannot be solved by software alone and require a change in your business model: for example, you may have to trade real-time consistency for the amount of traffic your application can handle. Also, most cloud installations suffer from poor I/O performance, because there are usually many virtual instances simultaneously trying to access one physical device, like a hard drive or a network interface.

Monday, December 23, 2013

Sorting contact groups in Pidgin

Pidgin is a great open source messenger application, and the only one I know of which can handle thousands of contacts without choking itself to death. The only feature I miss is the ability to easily sort contact groups by name. The company I work for has its employees arranged in over two hundred groups, and without some kind of automatic ordering it is really hard to find the right one. There is a seven-year-old ticket requesting alphabetical group sorting in Pidgin, and it still remains unresolved.

Fortunately, Pidgin stores the contact list (sometimes called the roster) in an XML file, blist.xml, and there are plenty of libraries for different programming languages which make it easy to manipulate XML structures. One of the most interesting scripts I have found was written in Python: it sorts the contact list alphabetically by name, and it's only 12 lines of code. I prepared an enhanced version, which allows you to define a list of group priorities, for example:
roster = { 'Family': 1, 'Friends': 2 }
In this case the groups Family and Friends will be moved to the top of the list, with Family first, Friends second, and the remaining groups sorted alphabetically in ascending order.

You can download the script here. Some important notes:

  • You have to run it in the same directory Pidgin keeps blist.xml in.
  • Also, close the messenger before you run the script, because Pidgin keeps the whole contact list in memory and will overwrite the version stored on disk when you shut it down.
  • Don't forget to make a backup of the original blist.xml file in case something goes wrong.

Friday, September 13, 2013

Design by contract

Design by contract is a software design approach (the term is actually a trademark), first introduced by Bertrand Meyer in his programming language Eiffel (a free GNU implementation of the language is called SmartEiffel). The concept is built around the idea of a contract between software components, which must be fulfilled. Each module defines a set of valid input conditions and guarantees a certain output when those conditions are met. Simply put, design by contract requires you to define a set of preconditions (input constraints) and postconditions (output constraints) for each function or method. Eiffel also supports invariants, which are constraints put on an object's fields.

If you think about it for a while, it turns out that you don't need a dedicated programming language to implement most of the design by contract principles in your software. In fact, many programmers already use some kind of preconditions in their code to ensure input data correctness, but without specifying them formally. Consider this simple C++ function to calculate the greatest common divisor of two integers:
int gcd(int a, int b)
{
    if (b == 0) {
        return a;
    } else {
        return gcd(b, a % b);
    }
}
According to its definition, "the greatest common divisor (gcd) of two or more integers (at least one of which is not zero), is the largest positive integer that divides the numbers without a remainder". To be sure that the function's parameters meet the criteria stated in the definition, you need to validate both variables:
int gcd(int a, int b)
{
    // at least one of the integers is not zero
    if (a != 0 || b != 0) {
        if (b == 0) {
            return a;
        } else {
            return gcd(b, a % b);
        }
    }
    // report error
}
As you can see, the code gets unnecessarily complicated, because it mixes computation with validation logic. And it's still only half of the contract. To ensure that the return value is also correct, you need to add further conditions:
int gcd(int a, int b)
{
    int result;

    // at least one of the integers is not zero
    if (a != 0 || b != 0) {
        if (b == 0) {
            result = a;
        } else {
            result = gcd(b, a % b);
        }
        // resulting number divides the input numbers without a remainder
        if ((a % result == 0) && (b % result == 0)) {
            return result;
        }
    }
    // report error
}
These conditions complicate the code even more. To solve this problem, the design by contract approach introduces two dedicated clauses: requires for preconditions and ensures for postconditions. There are libraries and compilers for many programming languages which support these directives. For example, if you write code in Java, you can use cofoja, which supports contracts through annotations. In C and C++ you can use the standard assert macro to achieve a similar effect:
#include <cassert>

#define REQUIRES(condition) assert(condition)
#define ENSURES(condition) assert(condition)

int gcd(int a, int b)
{
    // at least one of the integers is not zero
    REQUIRES(a != 0 || b != 0);

    int result;

    if (b == 0) {
        result = a;
    } else {
        result = gcd(b, a % b);
    }

    // resulting number divides the input numbers without a remainder
    ENSURES((a % result == 0) && (b % result == 0));

    return result;
}
Now the code looks much cleaner, and it is easy to distinguish the computation logic from the preconditions and postconditions. Also, you can now quickly spot the function's input requirements and the conditions met by its result.

Of course, you can replace assert with your own macros in order not to stop program execution when the contract rules are broken (which assert does by default). If you are interested, I have prepared a set of macros which throw a dedicated exception instead. You can find them, together with usage examples, on the project page on GitHub.
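The idea behind such macros is roughly the following (a simplified sketch, not the exact code from the repository):
#include <stdexcept>
#include <string>

// A dedicated exception type, so broken contracts can be caught and handled.
class contract_violation : public std::runtime_error {
public:
    explicit contract_violation(const std::string &message)
        : std::runtime_error(message) {}
};

// Throw instead of aborting, unlike the plain assert-based macros above.
#define REQUIRES(condition) \
    do { if (!(condition)) throw contract_violation("precondition failed: " #condition); } while (0)
#define ENSURES(condition) \
    do { if (!(condition)) throw contract_violation("postcondition failed: " #condition); } while (0)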

Design by contract is a very useful technique if you want to ensure a high level of software correctness, and it should be considered complementary, not competitive, to unit testing. Unit tests are usually run only once before software deployment and may not be able to find all errors, because in most cases it is impossible to test code against all possible inputs. The contract rules, however, are verified every time the code is executed. On the other hand, using preconditions and postconditions may in some cases considerably slow the program down. Although this should rarely be a problem, since CPU usage is very seldom the bottleneck, many systems allow you to turn off constraint checking in release mode in order to improve performance.