Saturday, September 27, 2008

Cilk: C that scales

Do you like C? I do. I used to write software for embedded devices, where resources are priceless and each CPU cycle and every byte of RAM counts. I enjoyed it a lot, mostly because it was a real challenge and it reminded me of good old 8-bit computer times, when if your program was not fast enough you had to optimize its code or find a smarter solution instead of buying a better CPU, expanding memory, or building a home data center. Even today few people question using C to speed up the most critical parts of software, not to mention some obvious examples like Linux kernel, written almost entirely in ANSI C. Two years ago NVidia announced its CUDA platform, which allows developers to write software in C that can run concurrently on many GPUs, outperforming applications using standard CPUs.

For those who don't own NVidia hardware, there is an alternative called Cilk. Cilk is a language for multithreaded parallel programming based on ANSI C. It extends standard C with new keywords like cilk (identifying parallel function), spawn (which spawns a new computation thread) and sync (which gathers job results from all threads started with spawn).

I ran an example Cilk application that finds Fibonacci numbers of a given value:
#include <cilk-lib.cilkh>
#include <stdlib.h>
#include <stdio.h>

cilk int fib(int n)
{
     if (n < 2)
          return (n);
     else {
          int x, y;
          x = spawn fib(n - 1);
          y = spawn fib(n - 2);
          sync;
          return (x + y);
     }
}

cilk int main(int argc, char *argv[])
{
     int n, result;

     if (argc != 2) {
          fprintf(stderr, "Usage: fib [] <n>\n");
          Cilk_exit(1);
     }
     n = atoi(argv[1]);
     result = spawn fib(n);
     sync;

     printf("Result: %d\n", result);
     return 0;
}
As you can see it looks almost like an ordinary C program. You can run it with --nproc argument to determine the number of CPUs (or CPU cores) that are supposed to be used by the application.

As I expected, the Cilk program turned out to be faster than its Erlang equivalent. The speedup was from 5.110972 to 0.689607 seconds on dual Xeon E5430 2.66 Ghz (8 cores total), comparing to 25.635218 and 10.017358 seconds of Erlang version respectively. But what struck me the most about the results was that the Cilk code scaled in almost linear manner:
Wall-clock running time on 1 processor: 5.110972 s
Wall-clock running time on 2 processors: 2.696646 s
Wall-clock running time on 4 processors: 1.355353 s
Wall-clock running time on 8 processors: 689.607000 ms
Cilk code performance boost was approximately 740%, comparing to Erlang 250%. On 8 cores Cilk code outperformed Erlang code by more than an order of magnitude. This is something that should make Erlang developers feel respect for.

1 comment:

ilya said...

In fact, MIT's Cilk is being commercialized, due to ship in several months. Expanded capabilities include support for loops, C++, and a solution for the non-local variable problem.

www.cilk.com