Friday, September 16, 2011

Human factor

Having recently moved to a netbook, I realized again how important performance is for software development. I realized it even more when my Android phone became almost unusable after yet another software upgrade. As a kid I was a big fan of the computer demoscene, where gifted coders showed the world how to make hardware do things unimaginable even by its creators (like the famous Commodore 64 FLI graphics mode). Now that the number of small devices connected to the Internet exceeds the number of computers and laptops wired to the global network, writing efficient software is a must: Facebook develops HipHop for PHP (written in C++), Google releases the Native Development Kit for Android and makes Go the first compiled programming language available on the Google App Engine platform.

It is commonly argued that ANSI C is the "fastest programming language" in existence. No, it's not. A language itself is neither slow nor fast; it's the implementations that are (compare Ruby and JRuby, for example). However, it's true that software written in C can usually be compiled to the most efficient executable code (excluding pure assembly). No wonder the most performance-critical pieces of software, like operating system kernels (Linux, BSD, Solaris, Windows, etc.) or programming language VMs (Common Lisp, Python, PHP, Ruby, and even Java), are written almost exclusively in C (not even C++).

Out of curiosity I started to look around for web software written in C and found a wonderful web server called G-Wan. It's very small, and its performance looks really amazing compared to other web servers - some enthusiastic reviews even call it The Future of the Internet and wonder why such brilliant software is not popular. Well, G-Wan is not popular because there is a problem. A big problem. And the name of the problem is Pierre Gauthier, its author. If you browse through the G-Wan website and forum, you can learn that he is a man with a really huge ego (he calls himself one of the best engineers available) who loves to criticize other people's work (like that of Poul-Henning Kamp, the author of Varnish). But at the same time he does not want to publish his own source code or make it open source, because he perceives other developers as inferior idiots who would surely break it, or at least bring nothing interesting to the existing code base. Moreover, evil government agents will try to introduce backdoors into his software, so you are more secure when Pierre hides his code deep under his bed and tells you that you should trust him. And of course you do, don't you? And if you invest your time and money in G-Wan, you don't have to worry at all about future development and maintenance, because even though the source code is closed, Pierre gives you his word that he will not drop the software and that he will never get hit by a bus (I'm not exaggerating, just read this post). I will not go deeper into his paranoia about his website being constantly attacked by Microsoft and NSA servers, because by now you should have an idea of the way the guy thinks.

Edit: Contrary to the information provided by its author, some security problems have actually been found in G-Wan. The affected version 2.10.6 is no longer available, and the issue was addressed by Pierre in his usual manner (one of the funniest claims is that nobody ever tried to confirm the bugs, while there is no archive of older G-Wan versions to verify them). Also, any links leading to the report are being actively removed from Wiki pages (see the comments to this post for details).

At the other pole there is a true pearl called the Tiny C Compiler, created by Fabrice Bellard. It's so insanely fast (it can compile a typical Linux kernel in less than 15 seconds!) that it can be used to write scripts or servlets in ANSI C (guess what G-Wan uses internally to compile its servlets). Together with another open source project, libmicrohttpd, it can become a good alternative if you want to build a small, fast web server that runs ANSI C servlets (for example, as embedded router software). Libmicrohttpd is fully HTTP 1.0 and 1.1 compliant, and it offers several threading models, so you can tune your software and choose the one that suits your environment best. If I were to build a tiny embeddable web server, I would definitely choose open source libmicrohttpd + tcc + some personal coding over G-Wan.
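To give you a taste of it, here is a minimal sketch of a libmicrohttpd server, modeled on the library's own hello-browser tutorial example. Treat it as an illustration rather than a drop-in implementation: the handler's exact signature has varied between library releases, so check it against the version you have installed.
#include <microhttpd.h>
#include <stdio.h>
#include <string.h>

#define PORT 8080

/* Request handler: answer every request with the same static page.
   Older releases declare this callback as returning int; newer
   ones use enum MHD_Result - adjust to your version. */
static int handle_request(void *cls, struct MHD_Connection *connection,
                          const char *url, const char *method,
                          const char *version, const char *upload_data,
                          size_t *upload_data_size, void **con_cls)
{
  const char *page = "<html><body>Hello from libmicrohttpd!</body></html>";
  struct MHD_Response *response = MHD_create_response_from_buffer(
      strlen(page), (void *)page, MHD_RESPMEM_PERSISTENT);
  int ret = MHD_queue_response(connection, MHD_HTTP_OK, response);
  MHD_destroy_response(response);
  return ret;
}

int main(void)
{
  /* One of the several threading models: a select()-based internal thread. */
  struct MHD_Daemon *daemon = MHD_start_daemon(
      MHD_USE_SELECT_INTERNALLY, PORT, NULL, NULL,
      &handle_request, NULL, MHD_OPTION_END);
  if (daemon == NULL)
    return 1;
  getchar(); /* serve requests until a key is pressed */
  MHD_stop_daemon(daemon);
  return 0;
}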

P.S. G-Wan's author deleted the forum from his website, also ensuring (with a carefully crafted robots.txt file) that none of its contents is archived by search engines. The forum contained a lot of important information for G-Wan users, but user support is obviously less important than invalidating links in this and other unfavourable posts. Of course, you are still welcome to believe in "source code insurance", and that the same will never happen to G-Wan.

Friday, September 2, 2011

Another one bites the dust

A few days ago I found an interesting post on Ken Shirriff's blog, entitled Why your favourite language is unpopular. It summarizes a talk by Erik Meijer, who provided a simple, yet very accurate, explanation of why some programming languages gain popularity while others (often better in many respects) don't. I thought about this formula in the context of Reia, a very promising language for the Erlang platform that I mentioned back in 2008 in my post Scripting Erlang. The author of Reia has just announced that he is dropping development of Reia in favour of Elixir. It seems that with the amount of software existing today there is very little room for another programming language, even if it addresses problems that are hard to solve with mainstream languages. With an existing army of programmers at its disposal, the IT industry can deal with most of its problems using existing tools, without the need to invent another language to rule them all. A good example of this philosophy is Cilk Plus, which originates from Cilk and makes it easy to scale C/C++ code across multicore CPUs, as sketched below. With Cilk you can improve your existing software without rewriting it from scratch in a new language, which in the long run could create more problems than it solves.
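For illustration, here is a minimal sketch of the Cilk Plus style (it assumes a Cilk-enabled compiler, such as Intel's; the two keywords below are the core of the extension):
#include <cilk/cilk.h>
#include <stdio.h>

/* Plain recursive Fibonacci, parallelized with two Cilk Plus keywords:
   cilk_spawn lets a call run in parallel with the caller,
   cilk_sync waits until all spawned calls have finished. */
long fib(long n) {
  if (n < 2) return n;
  long a = cilk_spawn fib(n - 1); /* may run on another core */
  long b = fib(n - 2);            /* runs in the current strand */
  cilk_sync;                      /* join before using a */
  return a + b;
}

int main(void) {
  printf("%ld\n", fib(30));
  return 0;
}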

Tuesday, August 30, 2011

Running Scala via Nailgun

Recently a friend of mine got me interested in Scala, a modern programming language for the JVM. Scala reached a wide audience in 2009 when Twitter announced that they use it in the most critical elements of their backend. It became popular for its efficiency and scalability, and for joining the concepts of functional and object-oriented programming in a very pragmatic manner. It also implements some very powerful constructs, like foldLeft, which lets you compose your own reduction methods operating on lists (see this page for some very interesting examples, and the short sketch below).
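For a quick flavour, here is a tiny sketch of my own (not taken from that page) that re-derives two standard reductions with foldLeft:
// sum and reverse already exist in the standard library;
// rebuilding them shows how foldLeft threads an accumulator.
val xs = List(1, 2, 3, 4, 5)
val sum = xs.foldLeft(0)(_ + _)                               // 15
val rev = xs.foldLeft(List.empty[Int])((acc, x) => x :: acc)  // List(5, 4, 3, 2, 1)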

I decided to give it a try on my Acer Aspire One netbook, which I often use for daily tasks. Unfortunately, with a 1.6 GHz Intel Atom and 1 GB of RAM running Fedora 14, it took the Scala REPL almost 20 seconds to start. I was wondering if I could make it faster and found out that some people recommended Nailgun for this task. In short, Nailgun preloads the whole JVM into memory and then uses its own bootstrap loader to connect your programs to it, so that they start much faster, without the JVM startup overhead. However, besides some good advice I couldn't find any working example of how to do it, so I decided to work it out myself. Fortunately, it turned out to be quite an easy task. First, I copied nailgun-0.7.1.jar to Scala's lib directory, then I took the original bin/scala script and created a modified version. I changed line 161 from
scala.tools.nsc.MainGenericRunner  "$@"
to
com.martiansoftware.nailgun.NGServer  "$@"
and line 113 from
CPSELECT="-Xbootclasspath/a:"
to
CPSELECT="-cp "
The last change was necessary for Nailgun to load properly. Then I started the regular Scala REPL through
ng scala.tools.nsc.MainGenericRunner
just to learn that it still takes 20 seconds to start. Conclusion? The JVM startup overhead is only a tiny fraction of the Scala REPL's startup time, so eliminating it makes little difference, even on slow machines.
Lessons learned:
1) Using Nailgun to speed up Scala won't get you anywhere.
2) Don't listen to forum "experts" who recommend solutions which they have obviously never tried in practice.

Saturday, October 30, 2010

Scale up your skills, not the hardware

It is not a big secret that good programmers keep their saws sharp all the time. If you want to improve your mental skills, a good place to start is Project Euler. I found it while browsing other programmers' blogs and immediately got hooked. Project Euler is a place where you can challenge yourself (and others) by solving various mathematical and computer programming problems. The collection of problems is constantly growing, and they range from very easy to really difficult. When you provide the correct solution to a problem, a forum dedicated to it is revealed and you can see how other people managed to solve it. Sometimes you also get a bonus in the form of a paper explaining the problem and the best known solutions in detail. You can learn a lot from these forums, since people share both their ideas and source code written in different programming languages; some of them don't even use a computer - all they need to find a solution is pencil and paper.

What I particularly like about Project Euler is that most of the problems cannot be solved by brute force. They may make you check hundreds or thousands of possible combinations, but they still require a more clever approach. A classic example is the pair of problems 18 and 67, where you need to find a path from the top of a number pyramid to the bottom such that the total sum of the numbers on the path is the largest. While the pyramid in problem 18 is rather small and you can find the solution by trying each of the 16384 possible paths, it is impossible to solve problem 67 the same way. To solve it you need to turn the problem upside down - literally, since you can easily get the correct solution if you start searching for the best path from the bottom of the pyramid to the top, reducing the number of combinations with every step, as the sketch below shows.
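Here is a minimal sketch of that bottom-up reduction in C, using the small example triangle from the problem statement (not the actual puzzle data):
#include <stdio.h>

int main(void) {
  /* Starting from the next-to-last row, replace each cell with
     cell + max(left child, right child); the apex then holds
     the maximum path sum. */
  int tri[4][4] = {
    {3},
    {7, 4},
    {2, 4, 6},
    {8, 5, 9, 3}
  };
  for (int row = 2; row >= 0; row--)
    for (int col = 0; col <= row; col++) {
      int left = tri[row + 1][col], right = tri[row + 1][col + 1];
      tri[row][col] += left > right ? left : right;
    }
  printf("%d\n", tri[0][0]); /* prints 23 for this triangle */
  return 0;
}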

You may wonder how mathematics can help you face everyday challenges as a programmer. Imagine you work for an on-line store with millions of CD records in stock, and you provide customers with a dynamic, browsable list of products based on different criteria, such as genre, year and price. Now you want to show your customers the most popular product within the range they choose. Fetching all items satisfying the selected criteria from a database and finding the most popular one every time someone browses the list will quickly kill your back-end. On the other hand, it is impossible to generate such data in advance for every possible combination of search parameters. You can either expand your server farm (and possibly move to the cloud to reduce costs) or you can use some statistical tools to do the job. The trick is to find a representative sample size for a given population of products. Assuming that N is the total population size, z is the statistical factor for the desired confidence level, p is the estimated proportion of a given variable within the population, and d is the admissible error level, we can get the sample size n with the following formula:
x = z^2 * (p * (1 - p))
n = N * x / ((N - 1) * d^2 + x)
The z factor is constant and equals 1.96 for the 95% confidence level and 2.58 for the 99% confidence level. Assuming that a selected range of CD records contains one million uniformly distributed products and you can tolerate a 5% error, it turns out that in 95% of cases you can correctly identify the most popular one by analysing only 385 randomly chosen items from the list, since:
x = 1.96^2 * (0.5 * (1 - 0.5)) = 0.9604
n = 1000000 * 0.9604 / ((1000000 - 1) * 0.05^2 + 0.9604) =~ 384.013
To be 99% sure that no more than 1% of results are incorrect, you'd need to analyse 16369 records.
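The same calculation wrapped into a small C program, so you can play with other confidence levels and error margins (the function name is mine):
#include <stdio.h>

/* Sample size for estimating a proportion in a finite population:
   N - population size, z - confidence factor (1.96 for 95%, 2.58 for 99%),
   p - estimated proportion (0.5 is the worst case), d - admissible error. */
double sample_size(double N, double z, double p, double d)
{
  double x = z * z * p * (1.0 - p);
  return N * x / ((N - 1.0) * d * d + x);
}

int main(void)
{
  printf("%.3f\n", sample_size(1000000, 1.96, 0.5, 0.05)); /* ~384.013 */
  printf("%.0f\n",  sample_size(1000000, 2.58, 0.5, 0.01)); /* ~16369 */
  return 0;
}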

You can learn more about calculating sample sizes from this article at Quirks - I took the above formula from it. And remember: while it's true that machine cycles are cheap and people cycles are not, never forget that people cycles are worth billions of machine ones.

Saturday, July 31, 2010

MapReduce with Gambit Scheme & Termite

Termite is a library that runs on top of Gambit Scheme and implements a concurrency model inspired by Erlang (there is also a Chicken Scheme port, but so far it is incomplete). Since I have already posted about Termite some time ago, I will not go into the details of writing distributed applications with it. Instead, I want to show you how to use Termite to implement a simple MapReduce algorithm.

A trivial implementation uses a simple worker process that executes some code, and a pmap function which spawns multiple workers and collects the results:
(define (worker)
  (let* ((msg (?)) (pid (car msg)) (fun (cadr msg)) (arg (caddr msg)))
    (! pid (fun arg))))

(define (pmap fun lst)
  ; spawn workers
  (for-each 
    (lambda (x) (let ((p (spawn worker))) (! p (list (self) fun x))))
    lst)
  ; collect results
  (let loop ((i (length lst)) (result '()))
    (if (= 0 i) (reverse result) (loop (- i 1) (cons (?) result)))))
So far it works similarly to its Erlang equivalent. However, this version does not guarantee (just like the Erlang one) that you get the results in the same order as the arguments passed to pmap. Let's consider a function f, which suspends execution of the current thread for a given number of seconds:
(define (f x) (thread-sleep! x) x)
If you run the following code:
(pmap f (list 1 2 3 2 1))
The result will be:
(1 1 2 2 3)
The first results on the list come from the processes that finish first.

A solution to this problem is to use a single worker process that marks spawned threads with numbers from a sequencer. The numbers generated by the sequencer are then used to sort the results into the same order as the argument list:
(define (worker)
  ; init sequencer
  (let loop ((n 0))
    (let ((msg (?)))
      ; terminate worker on request
      (if (not (eqv? 'exit (car msg)))
        (let ((pid (car msg)) (fun (cadr msg)) (arg (caddr msg)))
          ; start a new thread
          (spawn (lambda () (! pid (cons n (fun arg)))))
          ; increase sequencer
          (loop (+ n 1)))))))

(define (pmap fun args)
  ; start worker
  (let ((p (spawn worker)) (result '()))
    ; send data to the worker
    (for-each (lambda (x) (! p (list (self) fun x))) args)
    ; collect results
    (let loop ((i (length args)) (lst '()))
      (if (= 0 i)
        (set! result lst)
        (loop (- i 1) (cons (?) lst))))
    ; terminate worker
    (! p (list 'exit))
    ; sort results
    (let loop ((in (qsort result <)) (out '()))
      (if (null? in)
        out
        (loop (cdr in) (cons (cdar in) out))))))
You can use any function you wish to sort the results. I used a modified version of quicksort found at Rosetta Code:
(define (qsort l gt?)
  (let q ((l l))
    (if (null? l)
      l
      (let ((s (split-by (cdr l) (lambda (x) (gt? (car x) (caar l))))))
        (append (q (car s)) (list (car l)) (q (cdr s)))))))

(define (split-by l p)
  (let loop ((low (list)) (high (list)) (l l))
    (if (null? l)
     (cons low high)
     (if (p (car l))
       (loop low (cons (car l) high) (cdr l))
       (loop (cons (car l) low) high (cdr l))))))
Now when you run the test again, you get the result list in the same order as the argument list, no matter what the execution times of the individual processes were:
(pmap f (list 1 2 3 2 1))

(1 2 3 2 1)
You can download the pmap library here.

Monday, July 19, 2010

Using trampolines in C

In my previous post I presented my implementation of trampoline code in newLISP. Since this technique is also used in non-functional programming languages, I prepared an example in C.

Let's assume we have two mutually recursive functions f1 and f2:
#include <stdio.h>

void f1(int n);
void f2(int n);

void f1(int n) {
  printf("%d\n", n);
  if (n == 0)
    printf("Blastoff!\n");
  else
    f2(n);
}

void f2(int n) {
  f1(n-1);
}

int main() {
  f1(1000000);
}
When you compile this code without any compiler optimizations and try to run it, you will most probably get a segmentation fault (or a similar error, depending on your platform) rather quickly. What happens here is that the application runs out of stack space for storing function call frames.

In Lisp you can solve this problem by rewriting the functions to return continuations instead of values. In C you can simulate continuations by the following means:
1. Making the functions operate on pointers to variables instead of the actual variables, which preserves their values between function calls.
2. Returning a pointer to the next function to be called instead of calling it directly, which avoids growing the application stack.

The rewritten f1 and f2 then take the following form:
#include <stdio.h>

typedef void *(*Fun)(int *n);

void *f1(int *n);
void *f2(int *n);

void *f1(int *n) {
  printf("%d\n", *n);
  if (*n == 0) {
    printf("Blastoff!\n");
    return NULL;
  } else {
    return f2;
  }
}

void *f2(int *n) {
  *n = *n - 1;
  return f1;
}

int main() {
  int n = 1000000;
  Fun f = f1;
  while (f != NULL) {
    f = (Fun)f(&n);
  }
  return n;
}
Now the main function iteratively steps through the subsequent function calls until the halting condition (NULL) is met. At the end of the iteration, the variable n holds 0.

As a side note, gcc is able to optimize mutually recursive functions quite effectively. If you compile the first example with gcc -O2, the resulting code will not crash, because the compiler performs tail-call elimination, replacing the recursive call instructions with jmp.

P.S. I would like to thank Johnny, my friend and a true programming expert, who inspired me to write this post. I was also inspired by Carlos Oliveira's post on trampolines.

Thursday, July 15, 2010

Advanced recursion in newLISP

In my recent post about newLISP I mentioned that it does not support tail call optimization. In fact, many Lisps don't. As Bill Six pointed out, even the ANSI Common Lisp standard does not mandate (unlike Scheme) tail-call elimination, although it seems that all well-known ANSI Common Lisp compilers perform it anyway.

I was wondering if there is a way to work around this problem, and the first solution I found was the memoize macro described in the excellent online documentation for newLISP, Code Patterns in newLISP:
(define-macro (memoize mem-func func)
  (set (sym mem-func mem-func)
    (letex (f func  c mem-func)
      (lambda ()
        (or (context c (string (args)))
        (context c (string (args)) (apply f (args))))))))
You can apply this macro to any function with any number of arguments. The trick is that each time the function is called, its result is cached in memory for subsequent calls. This can speed up your application tremendously, which can be observed by comparing the execution times of these example Fibonacci functions:
(define (fibo n)
  (if (< n 2) 1
    (+ (fibo (- n 1)) (fibo (- n 2)))))

(memoize fibo-m
  (lambda (n)
    (if (< n 2) 1
      (+ (fibo-m (- n 1)) (fibo-m (- n 2))))))

(time (fibo 35))
(time (fibo-m 35))
On my laptop (fibo 35) took 12.98 seconds to execute, while (fibo-m 35) executed in 0.016 milliseconds.

Unfortunately, the memoize macro cannot help you handle mutual recursion. A classic example of such recursion looks as follows:
(define (f1 n)
  (println n)
    (if (= n 0)
      (println "Blastoff!")
      (f2 n)))

(define (f2 n)
  (f1 (- n 1)))
You can easily make newLISP run out of stack by running (f1 1000), not to mention bigger numbers. What happens if we define "memoized" versions of f1 and f2? Let's see:
(memoize f1
  (lambda (n)
    (println n)
    (if (= n 0)
      (println "Blastoff!")
      (f2 n))))

(memoize f2
  (lambda (n)
    (f1 (- n 1))))
Again, running (f1 1000) immediately exhausts newLISP's stack.

A solution to this problem is a technique called trampolining. On his blog, Bill Clementson not only gave an excellent explanation of the concept of trampolines, but also provided a Common Lisp implementation, which inspired me to write a newLISP version:
(define (trampoline fun arg)
  (catch
    (while true
      (let ((run (apply fun arg)))
        (setf fun (first run) arg (rest run)))) 'result)
  result)
A trampoline iteratively executes the code thunks returned by a function, and this way avoids blowing up the application stack. However, in order to use a trampoline, each function must return a continuation (a pointer to the next step) instead of a value. Below is a version of the aforementioned functions modified to use the trampoline:
(define (f1 n)
  (println n)
  (if (= n 0)
    (throw "Blastoff!")
    (list f2 n)))

(define (f2 n)
  (list f1 (- n 1)))
Now you can test it with:
(trampoline f1 '(1000))
(trampoline f1 '(10000))
(trampoline f1 '(100000))
...
Have fun!