Enable MySQL Streaming in Cascading / Scalding

Last week I ran into a an ugly problem of Scalding:
I needed to read a really large table from MySQL to process it in a certain job. In generall this is trivial: just use a JDBC Source, select your columns and that’s it.

Usually we do this by using 1-3 parallel connections to the SQL-server. This time I started running out of memory because scalding didn’t (more precicely: couldn’t) swap/spill to disk. The problem here is the default behaviour of the mysql-connector. The api docs says:

By default, ResultSets are completely retrieved and stored in memory. In most cases this is the most efficient way to operate, and due to the design of the MySQL network protocol is easier to implement. If you are working with ResultSets that have a large number of rows or large values, and can not allocate heap space in your JVM for the memory required, you can tell the driver to stream the results back one row at a time.

So, what does this mean: If you query a 10 GB table, you get all the data and the connector tries to buffer it in memory – which is a bad idea if you just want to process tuple by tuple. You can then split this large query into 10 smaller ones: SELECT ... FROM ... LIMIT 0, x, SELECT ... FROM ... LIMIT x+1, y, … etc. This works – but partitioning a large result this way is not very efficient because starting from the second query, MySQL has to iterate over x rows until it can start gathering and returning results. So you partition the big query into 10 smaller results but you put quite a lot of load to the server. And over all you still have to keep a lot of results in RAM.

Continue reading Enable MySQL Streaming in Cascading / Scalding

Compiling Cascading: FAILURE: Build failed with an exception.

Today I ran into a really stupid error message when I tried to recompile cascading-jdbc:

Evaluating root project ‘cascading-jdbc’ using build file ‘/home/…/cascading-jdbc/build.gradle’.

FAILURE: Build failed with an exception.

* Where:
Build file ‘/home/…/cascading-jdbc/build.gradle’ line: 68

* What went wrong:
A problem occurred evaluating root project ‘cascading-jdbc’.
> Could not find method create() for arguments [fatJarPrepareFiles, class eu.appsatori.gradle.fatjar.tasks.PrepareFiles] on task set.

* Try:
Run with –stacktrace option to get the stack trace. Run with –debug option to get more log output.

BUILD FAILED

Total time: 5.355 secs

Solution

Check your gradle version … I ran a brand new Ubuntu with the shipped gradle version 1.4. Well the cascading readme states that gradle 1.8 is required … and it really is.