Sunday, September 21, 2008

Sky is the limit for Mnesia

Until recently the biggest flaw of Mnesia was its storage limit. This is not the fault of the database system, since Mnesia itself can handle data of virtually unlimited size. The main problem lies in an outdated Erlang term storage engine called DETS, which is slow and uses 32-bit offsets (which limits the size of a single file to 2 GB). DETS may have been a perfect fit for the famous AXD301 ATM switch, but it is certainly not one for modern web development. A few times I had to drop Mnesia in favour of HBase to serve data, and use Jinterface to make it communicate with the Erlang code that served the application logic.

But those times are finally over. Thanks to Joel Reymont and the Dukes of Erl you can now use the Tokyocabinet engine as storage for Mnesia. What is most exciting about this solution is that it is completely transparent to your application - only creating a table looks a bit different:
Table = testtab,
mnesia:create_table(Table,
    [{type, {external, ordered_set, tcbdbtab}},
     {external_copies, [node()]},
     {user_properties, [{deflate, true}, {bucket_array_size, 10000}]}]).
You also need to start tcerl before running Mnesia, and synchronize table data with disk before closing Erlang if you don't want to lose some of your data:
Port = mnesia_lib:val({Table, tcbdb_port}),
To sync all existing Mnesia tables you can use the following function:
F = fun(T) ->
            case catch mnesia_lib:val({T, tcbdb_port}) of
                {'EXIT', _} ->
                Port ->
Tabs = mnesia_lib:val({schema, tables}),
lists:foreach(F, Tabs).
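The shape of that function can be sketched as a small module. This is only a sketch under stated assumptions: SyncFun is a hypothetical parameter standing in for whatever tcerl call flushes a single port to disk (that call is not shown here), and mnesia_lib is an internal, undocumented Mnesia module, so its behaviour may differ between releases.

```erlang
-module(sync_sketch).
-export([sync_all/1, sync_all/2]).

%% Sync every table known to the local Mnesia schema.
%% SyncFun is a hypothetical fun(Port) supplied by the caller.
sync_all(SyncFun) ->
    Tabs = mnesia_lib:val({schema, tables}),
    sync_all(Tabs, SyncFun).

%% Sync only the given tables. Tables without a tcbdb_port entry
%% (for example ordinary ram_copies tables) are skipped.
sync_all(Tabs, SyncFun) ->
    F = fun(T) ->
                case catch mnesia_lib:val({T, tcbdb_port}) of
                    {'EXIT', _} ->
                        ok;              % no Tokyocabinet port for this table
                    Port ->
                        SyncFun(Port)    % flush this table's port
                end
        end,
    lists:foreach(F, Tabs).
```

Passing the sync call in as a fun keeps the sketch independent of the exact tcerl API; in a real shutdown hook you would pass the library's own sync function.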
All other common operations, like reading and writing data, transactions, etc., don't need to be changed. You can learn more by looking at the example provided with the library.
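To illustrate what "no changes" means, here is a minimal sketch that uses only the standard Mnesia API. The module and function names (kv_sketch, store/3, fetch/2) are made up for this example, and it assumes a table created with the default attributes [key, val], so that records are 3-tuples of the form {Tab, Key, Value}.

```erlang
-module(kv_sketch).
-export([store/3, fetch/2]).

%% Write one {Tab, Key, Value} record inside a transaction.
store(Tab, Key, Value) ->
    mnesia:transaction(fun() -> mnesia:write({Tab, Key, Value}) end).

%% Read all records stored under Key inside a transaction.
fetch(Tab, Key) ->
    mnesia:transaction(fun() -> mnesia:read(Tab, Key) end).
```

Nothing in these calls names the storage engine, so kv_sketch:store(testtab, k1, <<"v1">>) should be written the same way whichever backend holds the table.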

I managed to make ejabberd (one of the best Jabber/XMPP servers around) run on Tokyocabinet by changing only a few lines of its code. Now I'm on my way to making Mnapi work with it. Thank you guys!!!


Jay said...

How reliable do you think this is, though? Does it meet the Erlang community's expectations of total rock-solidness?

Krzysztof (Christopher) Kliś said...

Erlang reliability arises mostly from the fact that it depends only on itself and nothing more. Introducing external mechanisms always creates some risk of side effects.
As far as the Erlang part is concerned - yes, I believe it is reliable, since all database logic depends on Mnesia and it has proven reliability. So you can be sure that at least all database operations are correct.
On the other hand, the storage itself is provided by Tokyocabinet, and you have to trust it for data consistency and correctness. However, I ran several stress tests, concurrently pumping millions of records into Mnesia running on tcerl, and didn't run into any problems with data consistency. I also managed to run a patched ejabberd on a production system, where it handles inter-process communication, and didn't run into any problems either.

Ulf Wiger said...

Dets is certainly good enough even for the modern successors of the AXD 301, the main reason being that these are signaling nodes with very modest requirements on persistent storage - mnesia disc copies (i.e. ram+disk and not dets at all) are used only for configuration data. But sure, there are lots of other application domains where an upper limit of 2 GB per table fragment is nothing to brag about. (:

The most interesting tests of tcerl would be restart and crash scenarios under high load. Turning the power off and on again is an old favourite. Tcerl hooks into mnesia's commit logic, and that means that there is no means of rollback if something goes wrong (I believe mnesia will simply "dump core" instead) - the Tokyocabinet stuff simply has to work. Also, mnesiaext modifies mnesia's table loading logic, so trouble during distributed table load would be something to watch out for.

I'm not saying that I expect it not to work - just pointing out which types of robustness tests would be most convincing.

Krzysztof (Christopher) Kliś said...

Erlang & Mnesia are usually used in distributed environments anyway, so a failure or crash of a single instance should be no problem.
But I would be curious to see some independent and reliable tests myself. Tsung looks like a good candidate for the job.