What I particularly like about Project Euler is that most of the problems cannot be solved by brute force. They can make you check hundreds or thousands of possible combinations, but still require a more clever approach. A classic example are problems 18 and 67, where you need to find a path from the top of the number pyramid to the bottom, so that the total sum of numbers on the path is the largest. While the pyramid in problem 18 is rather small and you can find the solution by trying every of 16384 possible paths, it is impossible to solve problem 67 the same way. To solve it you need to turn the problem upside down - literally, since you can easily get the correct solution if you start searching for the best path from the bottom of the pyramid to the top, reducing the number of combinations with every step.

You may wonder how mathematics can help you in facing every day challenges as a programmer. Imagine you work for an on-line store with millions of CD records on stock, and you provide a browser with a dynamic list of products based on different criteria, such as genre, year and price. Now you want to show to your customers the most popular product within the range they choose. Getting all items satisfying the selected criteria from a database and finding the most popular one every time someone browses the list will quickly kill your back-end. On the other hand, it is impossible to generate such data in advance for every possible combination of search parameters. You can either expand your server farm (and possibly move to the cloud to reduce costs) or you can use some statistical tools to help you do the job. The trick is to find a representative sample size for a given population of products. Assuming that N is a total population, z is a statistical factor for desired confidence level, p is an estimated proportion of a given variable within the population, and d is admissible error level, we can get sample size n with the following formula:

x = z^2 * (p * (1 - p)) n = N * x / ((N - 1) * d^2 + x)The z factor is constant and equals 1.96 for the 95% confidence level and 2.58 for the 99% confidence level. Assuming that a selected range of CD records contains one million uniformly distributed products and you can tolerate a 5% error, it turns out that in 95% cases you get the most popular one correctly by analysing only 385 randomly chosen items from the list, since:

x = 1.96^2 * (0.5 * (1 - 0.5)) = 0.9604 n = 1000000 * 0.9604 / ((1000000 - 1) * 0.05^2 + 0.9604) =~ 384.013To be 99% sure that no more than 1% results are incorrect, you'd need to analyse 16369 records.

You can learn more about calculating sample sizes from this article at Quirks - I took the above formula from it. And remember: while it's true that machine cycles are cheap and people cycles are not, never forget that people cycles are worth billions of machine ones.

## No comments:

Post a Comment