Friday, August 22, 2014

Pushing the limits

Today I'm not going to write about ultra fast Hadoop clusters or insanely scalable, globally-distributed databases. This time I want to take you on a journey in the totally opposite direction, straight down the rabbit hole.

Let's say you want to display a nice looking rainbow on the screen of your TV, like this one:
Because a static picture is a little boring, you also want to animate it, so that the rainbow moves up (or down) the screen. The question is: what hardware do you need for this and what will be the software requirements?

What if I told you that all you need is a computer with a 1.8 MHz CPU, 48 kilobytes of RAM and a program that takes 29 (twenty-nine) bytes? Yes, bytes. Less than the number of letters in this sentence. If you don't believe me, you can download a zip archive containing the software from this location. Inside you will find a file rainbow.xex, which you can run on any Atari 800/800XL emulator, or even on the real hardware if you still have one (you can use a device like SIO2SD to transfer files to the Atari). I have run it on a real Atari 800XL connected to a plasma TV through a composite video lead, and it works perfectly.

How is that possible? The secret lies in talking directly to the hardware. 8-bit computers have very limited resources, so any software layer which is not absolutely necessary is just a waste of memory and CPU cycles. A simple example is a boolean value, which can have one of two states: true or false. Any computer in the world, no matter how simple or complicated, needs only one bit to represent a boolean: 1 for true and 0 for false. Since a byte consists of 8 bits, you can store 8 different boolean values in one byte. Yet in today's software world, the Java boolean primitive takes an entire byte, and the simplest Boolean class:
public final class Boolean
{
    public static final Boolean TRUE = new Boolean(true);
    public static final Boolean FALSE = new Boolean(false);
    private final boolean value;

    public Boolean(boolean value) {
        this.value = value;
    }

    public boolean booleanValue() {
        return value;
    }
}
compiles under Java 7 to 451 bytes of bytecode, not to mention that it requires a runtime environment of a few hundred megabytes to load. This is the cost of having programmer-friendly, enterprise-grade programming tools and languages.

Going back to the example, the algorithm to make the Atari display the rainbow is quite simple:

1. Initialize the graphics mode.
2. Read the current scanline from the Atari display chip - Antic. Antic creates the image on a TV screen line by line, from top to bottom, 50 times (PAL) or 60 times (NTSC) per second. The vertical resolution of an Atari screen is 240 lines (including borders), which means that the scanline number can have values from 0 (first line) to 239 (last line).
3. If the line number is 0, increase the counter, which is used to calculate the colour palette shift. Increasing the counter on every new frame adds the animation effect to the rainbow picture.
4. Wait for Antic to finish drawing the current line on the screen (we don't want to change the colour in the middle of a line).
5. Change the background colour to the current line number (Atari has a total palette of 256 colours) summed with the current shift value.
6. Repeat from point 2.

This algorithm can be easily translated into a C program:
#include <atari.h>

#define SDMCTL (*((unsigned char*)0x22F))

int main() {
  unsigned char line, counter = 0;

  SDMCTL = 0; // disable display in BASIC screen area
  while (1) {
    line = ANTIC.vcount;  // read current screen line
    if (!line) counter++; // increase color shift on new frame
    ANTIC.wsync = line;   // block CPU until vertical sync
    GTIA_WRITE.colbk = line+counter; //change background color
  }
  return 0;
}
To compile this program on a PC, you can use a cross-platform cc65 compiler:
cc65 -t atari -o rainbow.xex rainbow.c
However, the resulting executable is 1104 bytes long. Let's make the compiler output the assembly code, to see the CPU instruction list it produced:
cc65 -t atari rainbow.c
The assembly source code in the resulting rainbow.s file looks like this:
    jsr     decsp1
    lda     #$00
    jsr     pusha
    ldx     #$00
    lda     #$00
    sta     $022F
L0008:  ldx     #$00
    lda     $D40B
    ldy     #$01
    sta     (sp),y
    ldy     #$01
    ldx     #$00
    lda     (sp),y
    jsr     bnega
    jeq     L000E
    ldy     #$00
    ldx     #$00
    lda     (sp),y
    pha
    clc
    adc     #$01
    ldy     #$00
    sta     (sp),y
    pla
L000E:  ldy     #$01
    ldx     #$00
    lda     (sp),y
    sta     $D40A
    ldy     #$01
    ldx     #$00
    lda     (sp),y
    jsr     pushax
    ldy     #$02
    ldx     #$00
    lda     (sp),y
    jsr     tosaddax
    sta     $D01A
    jmp     L0008
    ldx     #$00
    lda     #$00
    jmp     L0002
L0002:  jsr     incsp2
    rts
Quite long, isn't it? If you understand assembly language, you will probably notice a few problems here. First, the program jumps a few times to some strange subroutines (like pusha, bnega, tosaddax, etc.), which do not seem necessary. Second, it uses a lot of reads and writes to the stack, even though the algorithm uses only two local variables (line and counter). Third, some instructions are completely unnecessary (like LDX, which loads data into the never-used CPU index register X).

Fortunately, the cc65 compiler provides some optimizations: you can use the "-Oi" and "-Os" command line switches to inline all subroutines and get rid of the jsr instructions, and you can make local variables live at fixed memory addresses instead of on the stack with "-Cl". Also, the "-Or" switch enables register variables, which is particularly interesting: it does not make the CPU use physical registers (the MOS 6502 processor used in the 8-bit Atari has only one general-purpose register, called the accumulator), but stores the variables in the first 256 bytes of physical memory, called the zero page. The advantage of the zero page is that accessing it takes fewer CPU cycles: 8-bit memory addresses (from 0 to 255) are decoded faster by an 8-bit processing unit than 16-bit addresses (from 0 to 65535 - the memory limit of the Atari 800XL is 64 KB).
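The difference is visible directly in the instruction encoding. A purely illustrative comparison of the same load with the two addressing modes:

```
    lda $cb    ; zero page addressing: 2-byte instruction, 3 CPU cycles
    lda $02cb  ; absolute addressing:  3-byte instruction, 4 CPU cycles
```

One byte and one cycle saved per access adds up quickly in a loop that runs on every scanline.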

Let's see what happens if we turn the optimizations on:
cc65 -Cl -Osir -t atari rainbow.c
The resulting assembly code is now much smaller:
    lda     #$00
    sta     L0004
    sta     $022F
L000A:  lda     $D40B
    sta     L0003
    lda     L0003
    bne     L0010
    lda     L0004
    clc
    adc     #$01
    sta     L0004
L0010:  lda     L0003
    sta     $D40A
    lda     L0003
    clc
    adc     L0004
    sta     $D01A
    jmp     L000A
To produce a working executable, we have to assemble and link it:
ca65 rainbow.s && cl65 -t atari -o rainbow.xex rainbow.o
Still, the resulting code is 630 bytes long. This is because the compiler still includes some Atari-specific code from the standard library. We can work around this by using an assembler (like xasm or MADS) directly. But first, we need to clean up the assembly code a bit:
    org $2000  ; start address

    lda #$00  ; load accumulator with 0
    sta $cb   ; store it in memory
    sta $022f ; disable display in BASIC screen area
loop
    lda $d40b ; read current screen line into accumulator
    sta $cc   ; store it in zero page
    lda $cc   ; load current line number from memory
    bne skip  ; if it's not zero jump to code at 'skip' label
    lda $cb   ; load counter from memory into accumulator
    clc       ; clear carry flag
    adc #$01  ; add 1 to accumulator
    sta $cb   ; store counter in memory
skip
    lda $cc   ; load current line number from memory
    sta $d40a ; block CPU until vertical sync
    lda $cc   ; load current line number from memory
    clc       ; clear carry flag
    adc $cb   ; add counter to the line number
    sta $d01a ; change background colour
    jmp loop  ; repeat
I added some comments to make it clear what's going on in the code. So, let's assemble it:
xasm rainbow.s /o:rainbow.xex
Now rainbow.xex is only 45 bytes long! The first two bytes are the executable file header ($FFFF), and the next two are the memory address the program should be loaded at ($2000). The rest is a pure CPU instruction list.

But we can still make it smaller, using some knowledge the compiler does not have. First, we don't need to initialize the counter stored at the $cb memory address, because we don't care about its initial value (it gets increased infinitely anyway), so we can get rid of the "sta $cb" instruction. Second, we don't need "lda $cc" right after "sta $cc" and "sta $d40a", because the "sta" instruction doesn't change the accumulator state, so there's no need to reload it. We can also replace the whole procedure of loading the counter from memory into the accumulator, adding 1, and storing it back in memory with a single instruction - "inc $cb" - which increases the memory value by 1. Finally, we don't need "clc" to clear the carry flag (the flag indicating that an arithmetic operation produced a value which does not fit in 8 bits) before adding the line number and the counter, because the Atari colour palette has only 256 entries, so it doesn't matter whether the addition result fits in 8 bits or not.

After making the changes, the code now looks as follows:
    org $2000 ; start address

    lda #$00  ; load accumulator with 0
    sta $022f ; disable display in BASIC screen area
loop
    lda $d40b ; read current screen line into accumulator
    sta $cc   ; store it in zero page
    bne skip  ; if it's not zero jump to code at 'skip' label
    inc $cb   ; increase counter in memory by 1
skip
    lda $cc   ; load current line number from memory
    sta $d40a ; block CPU until vertical sync
    adc $cb   ; add counter to the line number
    sta $d01a ; change background colour
    jmp loop  ; repeat
After assembling it with xasm, the resulting executable is now only 33 bytes long. However, we can still cut the instruction list down. You may notice that there is no need to load and store accumulator at $cc memory address, because its state doesn't change between "sta $cc" and "lda $cc" - so we can safely remove those instructions:
    org $2000 ; start address

    lda #$00  ; load accumulator with 0
    sta $022f ; disable display in BASIC screen area
loop
    lda $d40b ; read current screen line into accumulator
    bne skip  ; if it's not zero jump to code at 'skip' label
    inc $cc   ; increase counter in memory by 1
skip
    sta $d40a ; block CPU until vertical sync
    adc $cc   ; add counter to the line number
    sta $d01a ; change background colour
    jmp loop  ; repeat
As we got rid of another couple of bytes, rainbow.xex is now exactly 29 bytes after assembling.

Lessons learned:
1. Compilation with default compiler options sucks.
2. A smart compiler can do a really great job optimizing code.
3. There is always a way for a programmer to optimize it even more.
4. Programming old computers is fun.

Saturday, July 26, 2014

How to deal with blocking operations

In a distributed environment, even simple things become complicated. One of the most trivial examples is a counter. If you have only one instance of an application running, you can simply store the counter in a database table with the counter's id and value, for example:
| ID | VAL |
------------
| 0  |  1  |
and increase its value with a simple query:
UPDATE counters SET val=val+1 WHERE id=0
However, when there are many clients trying to increase the counter simultaneously (for example, multiple frontends displaying the number of visitors to the website), you may experience some serious slowdowns. That's because updating a row in a database is a blocking operation: during the update the row is locked, to prevent multiple processes from modifying it at the same time. There are two basic types of locks: write locks and read locks. A write lock is the basic and most often used type of lock, which does not allow modifying a value while another change is in progress. You can still read the row's value, but then two subsequent reads can return two different values, because a change could have taken place just between them. When this becomes an issue, you can use a read lock, which prevents other processes from reading the row until the change is done. Depending on how your database is configured, locking the counter row will be more or less painful, but the updates still need to happen sequentially to ensure the counter's value is correct. This way, the simple operation of increasing a visitor counter can become a serious bottleneck for the whole website.
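As a sketch (table and column names follow the counter example above), this is what an explicit locking read looks like in MySQL - any other transaction trying to lock the same row will block until the first one commits:

```sql
START TRANSACTION;
-- acquire an exclusive lock on the counter row; concurrent
-- transactions doing the same locking read will now wait
SELECT val FROM counters WHERE id = 0 FOR UPDATE;
UPDATE counters SET val = val + 1 WHERE id = 0;
COMMIT;  -- the lock is released here
```

Every client runs this sequence in turn, so the total time grows linearly with the number of clients.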

I have prepared an example application which simulates such a case. You can download a zipped archive from here. It was prepared on Linux, but if you work on Windows you should also be able to compile it with Cygwin - just make sure to install the MySQL development libraries (you need the mysql_config.exe binary in your PATH).
First, prepare the database and set up the tables. All examples assume that you have a MySQL database running on localhost, with a test database called "test" and the password "root" for the root user. If your environment differs, modify the respective variables:
mysql -u root -proot test < counter.sql
Now, compile the example code which works on the counter:
make sync
This example simulates ten simultaneous blocking counter updates. Each transaction starts one second after the previous one, but it takes two seconds to complete, so the transactions overlap and block one another.
When you run the resulting binary, you should see output similar to the following:
Begin #1
Update #1
Begin #2
Update #2
Begin #3
Update #3
Commit #1
Begin #4
Update #4
Begin #5
Update #5
Commit #2
Begin #6
Update #6
Begin #7
Update #7
Commit #3
Begin #8
Update #8
Begin #9
Update #9
Commit #4
Begin #10
Update #10
Commit #5
Commit #6
Commit #7
Commit #8
Commit #9
Commit #10
Counter value: 10, time elapsed: 20.015499s
You can notice that the commits do not take place immediately after the updates. This is because each transaction must wait for all previous ones to complete. As a result, the test takes around 20 seconds, which is the sum of the time needed to finish all transactions. In this scenario, the more clients try to modify the counter, the longer it takes to complete. Eventually the system becomes overloaded, some of the operations time out, client processes fail to set the correct counter value, and the user experience is a disaster.

There are a couple of ways to solve the problem. Some of them involve spending time and money to build a NoSQL cluster based on some fashionable technology, like Hadoop or Cassandra. A cheaper and smarter way is to implement so-called "optimistic locking". The main idea behind this technique is to keep the time spent holding a lock to a minimum. An operation based on optimistic locking has three phases: read the current value, compute the new value, and atomically compare and swap. The last phase is the only one that needs to be atomic. In short, it re-reads the value you want to modify, and if it didn't change since the first read, it replaces the old value with the new one. If the comparison fails, it means that another process has already modified the value, and we need to roll back.
Optimistic locking is a good solution, but it can still lead to failures while updating the counter value, and it also requires clients to implement some strategy for dealing with update failures: "Should I repeat the operation?", "At what intervals?", "How many times before giving up?", etc.

Another way is to get rid of blocking operations altogether and replace them with non-blocking ones. This can be done easily by introducing queues. Putting a new value into a queue does not affect any existing values, so it doesn't need any read or write locks. With every new request you just insert a new row into the queue table, and a single, separate process updates the counter and removes the rows it has read from the queue. Because there are no concurrent processes working on the counter, there is no blocking problem. Also, the worker can read new inserts in batches instead of single records, which can give a huge performance boost with a fast-growing queue (if there are a hundred new tasks waiting in the queue, you can increase the counter by a hundred in one step). The queue is safe, because each insert has its own unique id, so deleting records from the queue does not interfere with inserting new ones. Also, when an update fails, the inserts are not lost, but remain in the queue for later processing. The only drawback is that the counter value you read is always a little behind, but for purposes such as counting visitors to a website this is easily accepted.
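Assuming a hypothetical queue table (the bundled example may name things differently), the non-blocking flow could look like this in MySQL:

```sql
-- each client only inserts; nothing is contended
INSERT INTO counter_queue (delta) VALUES (1);

-- the single worker process, running periodically, drains the queue in one batch:
START TRANSACTION;
SELECT COALESCE(SUM(delta), 0), COALESCE(MAX(id), 0)
  INTO @total, @last FROM counter_queue;
UPDATE counters SET val = val + @total WHERE id = 0;
DELETE FROM counter_queue WHERE id <= @last;
COMMIT;
```

Remembering the highest processed id means rows inserted while the worker runs are simply picked up by the next batch.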

Going back to the example, compile the code with:
make async
This will rebuild the code to use a queue instead of changing the counter directly. When you now run it, you should see output similar to the following:
Begin #1
Insert #1
Begin #2
Insert #2
Commit #1
Begin #3
Insert #3
Commit #2
Begin #4
Insert #4
Begin #5
Insert #5
Commit #3
Begin #6
Insert #6
Commit #4
Commit #5
Begin #7
Insert #7
Commit #6
Begin #8
Insert #8
Commit #7
Begin #9
Insert #9
Commit #8
Begin #10
Insert #10
Commit #9
Commit #10
Counter value: 20, time elapsed: 11.023697s
As you can see, the whole operation completed much faster, because the new transactions didn't have to wait for the previous ones to finish.

Saturday, March 15, 2014

Hacking Linux executables with LD_PRELOAD code injection

Executable code can be stored on disk in many different formats. In MS-DOS, programs which didn't use more than one 64 kB memory segment could simply be saved in a COM file, which contained only machine code, with no additional metadata. When instructed to run such a file, the operating system loaded it into the first free memory segment at address 0x100, initialized the code segment register (CS), the data segment registers (DS, ES) and the stack segment register (SS) to point to that segment, and then passed control to the first instruction located at the beginning of the file. More complicated executables contained an EXE header with additional information required by the operating system, like the number of memory blocks the program needs or the initial values of the segment registers.

Nowadays, applications compiled for multitasking operating systems often use dynamically shared libraries, so the executable file headers also contain information about required libraries and functions included from those libraries. In Linux, you can get that data with readelf. For example, using command:
readelf -s /usr/bin/users
gives the following information (I stripped some of the output to increase readability):
Symbol table '.dynsym' contains 58 entries:
  Num: Type Bind   Vis     Name
    1: FUNC GLOBAL DEFAULT utmpxname@GLIBC_2.2.5 (2)
    2: FUNC GLOBAL DEFAULT free@GLIBC_2.2.5 (2)
    3: FUNC GLOBAL DEFAULT abort@GLIBC_2.2.5 (2)
    4: FUNC GLOBAL DEFAULT __errno_location@GLIBC_2.2.5 (2)
    5: FUNC GLOBAL DEFAULT strncpy@GLIBC_2.2.5 (2)
    6: FUNC GLOBAL DEFAULT strncmp@GLIBC_2.2.5 (2)
    7: FUNC GLOBAL DEFAULT _exit@GLIBC_2.2.5 (2)
    8: FUNC GLOBAL DEFAULT __fpending@GLIBC_2.2.5 (2)
    9: FUNC GLOBAL DEFAULT qsort@GLIBC_2.2.5 (2)
   10: FUNC GLOBAL DEFAULT textdomain@GLIBC_2.2.5 (2)
   11: FUNC GLOBAL DEFAULT endutxent@GLIBC_2.2.5 (2)
   12: FUNC GLOBAL DEFAULT fclose@GLIBC_2.2.5 (2)
   13: FUNC GLOBAL DEFAULT bindtextdomain@GLIBC_2.2.5 (2)
   14: FUNC GLOBAL DEFAULT dcgettext@GLIBC_2.2.5 (2)
   15: FUNC GLOBAL DEFAULT __ctype_get_mb_cur_max@GLIBC_2.2.5 (2)
   16: FUNC GLOBAL DEFAULT strlen@GLIBC_2.2.5 (2)
   17: FUNC GLOBAL DEFAULT __stack_chk_fail@GLIBC_2.4 (3)
   18: FUNC GLOBAL DEFAULT getopt_long@GLIBC_2.2.5 (2)
   19: FUNC GLOBAL DEFAULT mbrtowc@GLIBC_2.2.5 (2)
   20: FUNC GLOBAL DEFAULT __overflow@GLIBC_2.2.5 (2)
   21: FUNC GLOBAL DEFAULT strrchr@GLIBC_2.2.5 (2)
   22: FUNC GLOBAL DEFAULT lseek@GLIBC_2.2.5 (2)
   23: FUNC GLOBAL DEFAULT memset@GLIBC_2.2.5 (2)
   24: FUNC GLOBAL DEFAULT __libc_start_main@GLIBC_2.2.5 (2)
   25: FUNC GLOBAL DEFAULT memcmp@GLIBC_2.2.5 (2)
   26: FUNC GLOBAL DEFAULT fputs_unlocked@GLIBC_2.2.5 (2)
   27: FUNC GLOBAL DEFAULT calloc@GLIBC_2.2.5 (2)
   28: FUNC GLOBAL DEFAULT strcmp@GLIBC_2.2.5 (2)
As you can see in the last line, the users executable calls the strcmp function from the GNU C Library (glibc). This function compares the two strings given as its arguments and returns 0 if the strings are equal, a positive value if the first string is greater than the second one, and a negative value otherwise.

You can take advantage of the fact that the program uses a function from an external library to modify its behaviour without touching its code. All you have to do is substitute strcmp with your own function. You can do this by writing your own dynamic library with a custom strcmp and loading it before the code of the users executable is initialized.

First, let's create a very simple library mylib.c with this code:
int strcmp(const char* s1, const char* s2) {
  for (; *s1 == *s2; s1++, s2++) {
    if (*s1 == '\0') {
      return 0;
    }
  }
  return ((*(unsigned char*)s1 > *(unsigned char*)s2) ? -1 : 1);
}
and compile it with the following command:
gcc mylib.c -shared -Wl,-soname,mylib -o mylib.so -fPIC
As you probably noticed, it behaves in exactly the opposite way to the original strcmp function: it returns a negative value when the first string is greater than the second one, and a positive value otherwise.

Now all you have to do is inject the custom library code into the running program. Fortunately, Linux allows you to load your own dynamic libraries before starting any executable, using the LD_PRELOAD environment variable. We can take advantage of this feature - just compare the results of the two following commands:
[adam@ubuntu:~]$ users
adam eve

[adam@ubuntu:~]$ LD_PRELOAD="./mylib.so" users
eve adam
Because we reversed the behaviour of the strcmp function, and the users executable relies on it, we managed to change the result of the command (the user logins are sorted in reverse order in the second example) without even touching the binary.

Google uses the same trick with TCMalloc, its faster drop-in replacement for the standard Linux malloc.

Monday, January 13, 2014

OSV - operating system for the cloud

Cloud development requires designing for an environment with reduced performance. With many virtual servers occupying one physical machine, a lot of processing power is consumed by context switching, reducing the overall performance of all virtual instances.

OSV is a new operating system designed specifically for instances running in the cloud, written in C++ (which, as a side note, proves that Linus Torvalds was wrong when he claimed that C++ is not suitable for writing a system kernel). OSV removes a lot of the overhead found in traditional operating systems, like kernel memory protection. If you have ever worked with an operating system which does not use such protection (like AROS or MorphOS), especially on x86 hardware, you must have observed a huge performance gain on those systems. Of course, the biggest drawback of this approach is that a badly written application can crash the whole machine. However, since OSV does not run directly on hardware, but on top of a hypervisor (such as Xen or KVM), such a crash affects only a virtual cloud instance, not the whole server. Moreover, OSV runs a standard Java Virtual Machine, which provides automatic memory management and the necessary level of protection by itself, so no extra effort from the operating system is needed to ensure software stability.

Will OSV become successful? It's hard to say at the moment, but it surely shows a new trend, where not only hardware and applications, but also operating systems evolve to fit the new reality which cloud computing creates. If you are interested in trying out OSV yourself, here are some instructions on how to run an Amazon EC2 instance with OSV on board.

Saturday, January 11, 2014

Waiting for the seamless cloud

You may not have heard about the memristor, the long-sought fourth fundamental circuit element (next to the resistor, capacitor and inductor), but this invention could be the greatest revolution in the computer industry since the introduction of the transistor. Memristors are simpler in construction than transistors and don't require an external power source to retain information. Last year Crossbar Inc. unveiled their RRAM non-volatile memory technology with shockingly impressive characteristics: 1 terabyte of data on a single chip, with access 20 times faster than traditional NAND flash offers.

But this is just the beginning. As the technology evolves, it may be able to replace not only flash drives but also DRAM, which would have a huge impact on the whole computer industry. Imagine a device with only one memory space, used both as data storage and for running applications. Moreover, applications are loaded into memory only once, and they stay there even when you turn the power off. You could take the battery out of your smartphone, replace it with a new one, and after turning the power on you would instantly get your device back just as you left it.

Sound familiar? If you have ever worked with Smalltalk, you already know the whole idea. If you haven't, I strongly encourage you to try out Squeak or Pharo. They are both open source Smalltalk implementations, and they provide a complete, object-oriented working environment. What is unique about Smalltalk is that you can start applications, create new workspaces, build classes and objects, and when you quit, the whole environment is saved in an image and restored on the next session. At first sight it looks like simple hibernation, but it isn't. In Smalltalk you can build applications without writing source code in the traditional way. You just create an object, dynamically add properties and methods to it, and it becomes a part of the current environment. When you want to distribute the application, you just distribute the image, which contains all the application code, libraries, and necessary tools.

RRAM is a technology which may eventually allow the old Smalltalk idea to be implemented at the hardware level and finally separate virtual machines from the underlying hardware. Currently, every time a virtual machine crashes, a new instance needs to be started from scratch and configured. Although this process can be automated with tools like Puppet, it still takes time and makes applications start from scratch. With a RRAM-based cloud this problem will no longer exist: you will be able to save the state of a virtual machine at any moment and then spawn it or move it to another physical location (server, rack unit, or even datacenter) within seconds. More importantly, the state of all the applications will be preserved, which means that, for example, you will not lose user sessions even if you move a virtual machine to another location.

In my opinion this will be the next big step in the evolution of cloud computing. Today, cloud instances are mostly stateless and require external storage or cache to keep user data. With the new memory chips they will be able to preserve their state even while moving across different hardware, and this process will become absolutely seamless for the end user - just the same way GSM networks work today, keeping your call running even when you switch between base transceiver stations while talking.

Moving to the cloud

Cloud computing has developed rapidly during the last few years. As shown by a Gartner study, the cloud market grew by 18% in 2013 and is now worth 131 billion US dollars. By 2015 it is expected to hit 180 billion. Despite some well-known security and privacy concerns regarding public cloud storage, and even despite the global PRISM scandal, a future in which most (if not all) of our data is stored in the cloud seems inevitable. This requires a substantial shift in the approach to software architecture and creates new challenges for software developers.

First, you need to abandon huge, complicated, monolithic applications in favour of platforms built from many small, interconnected services. Such services require far fewer resources and can be easily replicated throughout the cloud, increasing the overall stability and safety of the applications. One of the most famous examples of a successful SOA revolution is Amazon, which radically changed its software platform.

Second, you need to change the underlying technologies. For example, Java, with its extremely resource-hungry virtual machine, bloated frameworks and heavy threads, seems a rather bad choice. On the other hand, Node.js, with its small memory footprint and single-threaded, event-driven, non-blocking I/O model, fits the cloud perfectly.
If you have ever had doubts about Javascript being the right choice for an enterprise application, then you should read about Paypal moving their backend from Java to Javascript. Apart from huge cost savings (Paypal has been extensively using very expensive Java solutions, like Terracotta BigMemory, to scale its backend vertically) and increased productivity (engineers claim to be able to develop software twice as fast as in Java), Paypal also benefits from the new skills of its software developers, who can now work on both the frontend and the backend using the same programming language.

Third, you may need to redesign the application flow and the algorithms used. You may often get better results processing huge amounts of data in small chunks on many small cloud instances than all at once on a few huge mainframes. This means not only using MapReduce to speed up your tasks, but also avoiding complex sequential algorithms. For example, in typical situations mergesort may perform worse than heapsort, but it is much simpler to parallelize, and with enough cloud instances it will sort your data much faster.
But there is more to it than that. AMD has just announced a new line of 64-bit ARM server CPUs, which is widely regarded as the end of its race against Intel in high performance computing. ARM processors are less complicated and more power efficient than their Intel counterparts, but are also slower - which means that they will be more suited for clouds with many small instances.

Finally, there is a whole new class of problems in cloud computing which are either not present or long solved in traditional applications, like data consistency or dealing with transactions. Some of them cannot be solved by software and require a change in your business model: for example, you may have to trade real-time consistency for the amount of traffic your application can handle. Also, most cloud installations suffer from poor I/O performance, because there are usually many virtual instances simultaneously trying to access one physical device, like a hard drive or a network interface.