Saturday, October 30, 2010

Scale up your skills, not the hardware

It is not a big secret that good programmers keep their saws sharp all the time. If you want to improve your mental skills a good place to start is Project Euler. I found it while browsing other programmers' blogs and immediately got hooked up. Project Euler is a place where you can challenge yourself (and others) in solving various mathematical and computer programming problems. The collection of problems is constantly growing and you can find them ranging from very easy to really difficult. When you provide the correct solution to a problem, a forum dedicated to it becomes revealed and you can see how other people managed to solve it. Sometimes you can also get a bonus in a form of paper explaining the problem and the best known solutions in details. You can learn a lot from these forums since people share there both their ideas and source code written in different programming languages; some of them don't even use a computer, all they need to find a solution is paper and pencil.

What I particularly like about Project Euler is that most of the problems cannot be solved by brute force. They can make you check hundreds or thousands of possible combinations, but still require a more clever approach. A classic example are problems 18 and 67, where you need to find a path from the top of the number pyramid to the bottom, so that the total sum of numbers on the path is the largest. While the pyramid in problem 18 is rather small and you can find the solution by trying every of 16384 possible paths, it is impossible to solve problem 67 the same way. To solve it you need to turn the problem upside down - literally, since you can easily get the correct solution if you start searching for the best path from the bottom of the pyramid to the top, reducing the number of combinations with every step.

You may wonder how mathematics can help you in facing every day challenges as a programmer. Imagine you work for an on-line store with millions of CD records on stock, and you provide a browser with a dynamic list of products based on different criteria, such as genre, year and price. Now you want to show to your customers the most popular product within the range they choose. Getting all items satisfying the selected criteria from a database and finding the most popular one every time someone browses the list will quickly kill your back-end. On the other hand, it is impossible to generate such data in advance for every possible combination of search parameters. You can either expand your server farm (and possibly move to the cloud to reduce costs) or you can use some statistical tools to help you do the job. The trick is to find a representative sample size for a given population of products. Assuming that N is a total population, z is a statistical factor for desired confidence level, p is an estimated proportion of a given variable within the population, and d is admissible error level, we can get sample size n with the following formula:
x = z^2 * (p * (1 - p))
n = N * x / ((N - 1) * d^2 + x)
The z factor is constant and equals 1.96 for the 95% confidence level and 2.58 for the 99% confidence level. Assuming that a selected range of CD records contains one million uniformly distributed products and you can tolerate a 5% error, it turns out that in 95% cases you get the most popular one correctly by analysing only 385 randomly chosen items from the list, since:
x = 1.96^2 * (0.5 * (1 - 0.5)) = 0.9604
n = 1000000 * 0.9604 / ((1000000 - 1) * 0.05^2 + 0.9604) =~ 384.013
To be 99% sure that no more than 1% results are incorrect, you'd need to analyse 16369 records.

You can learn more about calculating sample sizes from this article at Quirks - I took the above formula from it. And remember: while it's true that machine cycles are cheap and people cycles are not, never forget that people cycles are worth billions of machine ones.