Enable output compression in Scalding

I just wanted to enable final output compression in one of my Scalding jobs (because I needed to reorganize a data set of several TB).

Unfortunately, Scalding always produced uncompressed files. After some googling, I came across a GitHub issue that addressed exactly this problem. Following some links, I got the sample code from this repo, which can be used to write compressed TSVs.


Scalding Exception: diverging implicit expansion for type com.twitter.algebird.Semigroup[T]

I was once again working on some Scalding jobs and once again got an interesting exception:

In a groupBy operation, I wanted to sum something up using:

.groupBy('a) {
  _.sum('a -> 'c)
}

And was rewarded with this one:

[error] example.scala:20: diverging implicit expansion for type com.twitter.algebird.Semigroup[T]
[error] starting with method eitherSemigroup in object Semigroup
[error]       _.sum('a -> 'c)
[error]            ^
[error] one error found
[error] (compile:compile) Compilation failed



Spot the mistake? It’s the missing type hint at sum:

.groupBy('a) {
  _.sum[Int]('a -> 'c)  // <-- [Int]
}

Scalding: unable to compare stream elements in position: 0

I’m currently working quite a bit with Twitter’s Scalding.
Recently I split up a job into sub-jobs and suddenly got an Exception in my join:

Caused by: cascading.CascadingException: unable to compare stream elements in position: 0

If I had remembered the Fields API in detail, I would have thought about this paragraph (it’s about sorting, but the consequence is the same):

Note: When reading from a CSV, the data types are set to String, hence the sorting will be alphabetically, therefore to sort by age, an int, you need to convert it to an integer. For example …


Make sure the fields you are joining on have the same data type, and convert them beforehand if necessary. For example:

.map('myField -> 'myField) { x: Int => x }
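
The effect of the field type can be reproduced outside of Scalding. The following plain Java sketch is only an illustration (the class and method names are made up, and it does not use any Cascading code): it shows that values read as Strings sort alphabetically, and that comparing a String with an Integer fails outright, which is roughly the situation behind "unable to compare stream elements".

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical demo class, not part of Scalding or Cascading.
final class FieldTypeDemo {

    // Sorts the raw String values, the way an unconverted CSV field would be sorted.
    static List<String> sortAsStrings(List<String> raw) {
        List<String> copy = new ArrayList<>(raw);
        Collections.sort(copy);
        return copy;
    }

    // Converts every value to an Integer first and then sorts numerically.
    static List<Integer> sortAsInts(List<String> raw) {
        List<Integer> ints = new ArrayList<>();
        for (String s : raw) {
            ints.add(Integer.valueOf(s.trim()));
        }
        Collections.sort(ints);
        return ints;
    }

    // Returns false if the two values cannot be compared with each other,
    // e.g. a String key on one side of a join and an Integer key on the other.
    @SuppressWarnings({"unchecked", "rawtypes"})
    static boolean canCompare(Object left, Object right) {
        try {
            ((Comparable) left).compareTo(right);
            return true;
        } catch (ClassCastException e) {
            return false;
        }
    }
}
```

With the input "10", "9", "100", the String sort yields 10, 100, 9, while the Integer sort yields 9, 10, 100.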

Enable MySQL Streaming in Cascading / Scalding

Last week I ran into an ugly problem with Scalding:
I needed to read a really large table from MySQL to process it in a certain job. In general this is trivial: just use a JDBC source, select your columns and that’s it.

Usually we do this by using 1-3 parallel connections to the SQL server. This time I started running out of memory because Scalding didn’t (more precisely: couldn’t) swap/spill to disk. The problem here is the default behaviour of the MySQL connector. The API docs say:

By default, ResultSets are completely retrieved and stored in memory. In most cases this is the most efficient way to operate, and due to the design of the MySQL network protocol is easier to implement. If you are working with ResultSets that have a large number of rows or large values, and can not allocate heap space in your JVM for the memory required, you can tell the driver to stream the results back one row at a time.

So what does this mean? If you query a 10 GB table, you get all the data and the connector tries to buffer it in memory, which is a bad idea if you just want to process tuple by tuple. You can then split this large query into 10 smaller ones: SELECT ... FROM ... LIMIT 0, x, SELECT ... FROM ... LIMIT x, x, … etc. This works, but partitioning a large result this way is not very efficient because, starting from the second query, MySQL has to iterate over all skipped rows before it can start gathering and returning results. So you partition the big query into 10 smaller results, but you put quite a lot of load on the server. And overall you still have to keep a lot of results in RAM.
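
To get a feeling for the extra load, here is a small back-of-the-envelope sketch in Java. It assumes a simplified cost model in which MySQL walks over every skipped row of a LIMIT offset, count query before returning the requested rows; the class name and the numbers are purely illustrative.

```java
// Hypothetical helper for a rough estimate, not a benchmark.
final class OffsetCost {

    // Rows the server has to walk over when a table of totalRows rows is read
    // in `chunks` equally sized "LIMIT offset, count" queries.
    static long rowsTouchedWithOffsets(long totalRows, int chunks) {
        long chunkSize = totalRows / chunks; // assumes totalRows is divisible by chunks
        long touched = 0;
        for (int i = 0; i < chunks; i++) {
            long offset = i * chunkSize;
            touched += offset + chunkSize; // skipped rows plus returned rows
        }
        return touched;
    }
}
```

Under this model, reading 10,000,000 rows in 10 chunks touches 55,000,000 rows on the server, compared to 10,000,000 for a single streamed query.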

The better solution would be to use just one or two connections that stream the rows directly into the cascading/scalding job. The framework can then decide whether it can process the data or if it needs to spill to disk.

The solution seems dead easy: simply turn on streaming! The API docs even show how to do it:

stmt = conn.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY, java.sql.ResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(Integer.MIN_VALUE);

Looking at the cascading source, the statement is already created in the right way. But it’s not that easy to get the statement before the query is being submitted (at least if you want to avoid dirty hacks!).

So I invested a couple of hours fiddling around with Gradle, Cascading, Scalding and sbt to get it done the right way. As I saw this other issue describing exactly the same problem, I made a pull request, which has been accepted and merged into the 2.5.5 branch. The tests in my local jobs already work like a charm!

If you’re using Scalding, you can easily use the modified MySqlScheme by doing the following steps:

  1. Clone the official 2.5.5 branch.
  2. Ensure you have Gradle 1.x (>= 1.8) installed.
  3. Build cascading-jdbc and install it into your local repo:
    gradle install -Dcascading.jdbc.url.mysql="jdbc:mysql://a-host/a-db?user=&password=" -i -x test
    -x test disables the tests (which saves ~10min).
  4. Add your local repo to the source repositories in your project.
  5. Add the dependency to your project: cascading cascading-jdbc-mysql 2.5.5. If you want a different version number, just change version.properties accordingly.

To really use the streaming, you have to initialize the MySqlScheme on your own. In Scalding, this can be done for example using this code (Scalding 0.9):

abstract class StreamingJDBCSource extends JDBCSource {
  override val maxConcurrentReads = 1

  override protected def getJDBCScheme = new MySqlScheme(
    classOf[MySqlDBInputFormat[DBWritable]],  // inputFormatClass
    null,  // orderBy
    false  // replace on insert
  )
}

That’s it. Your usual JDBC sources were probably extending JDBCSource. Just change them to extend StreamingJDBCSource and you are done. Now your mappers require less RAM and you can cut down the number of parallel connections for each single source.

Compiling Cascading: FAILURE: Build failed with an exception.

Today I ran into a really stupid error message when I tried to recompile cascading-jdbc:

Evaluating root project ‘cascading-jdbc’ using build file ‘/home/…/cascading-jdbc/build.gradle’.

FAILURE: Build failed with an exception.

* Where:
Build file ‘/home/…/cascading-jdbc/build.gradle’ line: 68

* What went wrong:
A problem occurred evaluating root project ‘cascading-jdbc’.
> Could not find method create() for arguments [fatJarPrepareFiles, class eu.appsatori.gradle.fatjar.tasks.PrepareFiles] on task set.

* Try:
Run with --stacktrace option to get the stack trace. Run with --debug option to get more log output.


Total time: 5.355 secs


Check your Gradle version! I was running a brand new Ubuntu with the shipped Gradle version 1.4. Well, the cascading readme states that Gradle 1.8 is required … and it really is.

RaspberryPi Desktop Sharing via VNC

The problem: Raspberry connected to a TV / share the desktop

When connecting a Raspberry Pi to a screen or TV to show something, you surely would like to remote control the RasPi not only via shell (SSH) but also via VNC to see exactly what is displayed on the remote screen. So what we are looking for is not “a new remote desktop” (as provided by a lot of tools) but “desktop sharing”. And of course you want the desktop shared automatically when the Raspberry Pi boots.

Thanks to the great and large Raspberry Pi community, this is a pretty easy task, if you know how to do it.

Solution: x11vnc

Use raspi-config to configure the raspi to “Enable Boot to Desktop”.

Install x11vnc:

sudo apt-get install x11vnc

Configure x11vnc to set a password automatically when it is started:

sudo x11vnc -storepasswd

Create a file /etc/init.d/x11vnc which is used on startup (mind the line breaks! Or just get it from GitHub). And set the permissions to 755 (sudo chmod 755 /etc/init.d/x11vnc).

#!/bin/sh
# /etc/init.d/x11vnc
### BEGIN INIT INFO
# Provides:          x11vnc
# Required-Start:    lightdm
# Required-Stop:
# Default-Start:     2 3 4 5
# Default-Stop:      0 1 6
# Short-Description: x11vnc
# Description:
### END INIT INFO

case "$1" in
  start)
    echo "Starting x11vnc ... "
    x11vnc -auth /var/run/lightdm/root/:0 -rfbauth /root/.vnc/passwd -no6 -noipv6 -reopen -forever -shared
    ;;
  stop)
    echo "Killing x11vnc ..."
    killall x11vnc
    ;;
  *)
    echo "Usage: /etc/init.d/x11vnc {start|stop}"
    exit 1
    ;;
esac
exit 0

You can already test the script via:

sudo service x11vnc start

Tell the RasPi to start x11vnc automatically after the graphical environment has started:

sudo update-rc.d x11vnc defaults

Done. Reboot your RasPi to test whether everything works. When it has come up again, you should be able to connect to the desktop from your Linux/Mac/Windows machine with a VNC client of your choice, for example TightVNC. Simply enter the host name of the RasPi in the connect dialogue and hit OK. That’s it.

Feedly being DDoS’ed – Open communication leads to a boost in Pro Accounts

You might have heard that the company behind Feedly (a news aggregator application) is currently being extorted: attackers are DDoS’ing the website to make it and the complete web service unavailable until Feedly pays money.

Feedly still refuses to pay the ransom and has now been battling the attacks for the third day. While DDoS attacks are nothing new on the battlefield called the internet, Feedly decided to communicate openly about the situation.

The Feedly community obviously rewards the open communication: In a tweet, Feedly announced that “the rate of Pro sign ups has doubled over the last 3 days”. Feedly: Keep up and stand your ground!

IntelliJ IDEA and Scala being awfully slow on Windows 8.1

At work we are working mostly in Scala and most of us are using IntelliJ IDEA for coding. The choice of the operating system is up to the developer. As I am quite comfortable with Windows (and use MS Office quite often), I am a happy Windows 8.1 user (btw: who the hell needs a start button when you have the Windows key!? Anyway … different story).

The Problem

Some time after I started a Scalding project, IntelliJ became very slow and turned unresponsive for some seconds about once per minute. Overall it was a very inconvenient and unproductive situation.

Of course I followed all kinds of IntelliJ tuning tips, tuned vmoptions, etc. Yet the compulsory breaks remained. In the end I turned off the Scala compile server, which eliminated most of the breaks. Yet I also lost a lot of the convenience of the IDE, like compile errors being marked immediately after typing. My machine was already equipped with an SSD, so upgrading to an SSD wouldn’t lead to any improvement.

Solution 1: Tortoise git status cache

So I decided to track the problems one step further. Windows’ Resource Monitor became a permanent companion. First I only monitored CPU. This quickly showed me that TGitCache repeatedly consumed 100% CPU. A quick search led to TortoiseGit Issue 980. Disabling the icon overlays quickly eliminated this problem: Settings > Icon Overlays > Status Cache = None!

Solution 2: File History / Dateiversionsverlauf

I got fewer breaks now, but still got them. Monitoring RAM didn’t add any insights: I always had more RAM free than was required by any program. So I moved on to monitoring IO. IO activity sorted by total bytes then showed me that whenever I experienced an IntelliJ hang, a certain Windows process had quite some IO activity. A quick search revealed that it was the File History process (German: Dateiversionsverlauf). I then remembered that I had turned this feature on some time ago, as I usually like the on-the-fly backups and shadow copies. Unfortunately a lot of files seem to get touched during a Scala/SBT/Play/Scalding project. So this might lead Windows to permanently shadow copy a large number of files.

Turning off File History suddenly made IntelliJ blazingly fast again. Hooray! The quickest fix is to disable the feature completely. Yet I don’t want to lose the feature completely. The only way to configure File History on a per-directory level seems to be editing the registry directly, as described in an article on MSDN.
But at least my machine is now really nice and usable again!


If you’re getting into the same trouble:
Disable Windows’ File History first and see if this solves your issue. Maybe also disable TortoiseGit’s status cache.

But the most important lesson: Learn how to debug your problems not only in the programming language you are using but also with the tools your operating system provides. They can be lifesavers.

RaspberryPi Weatherstation – The TV Station (Part III)

The third and last post of this series describes the Raspberry Pi that is connected to the TV and displays the sensor data in a visually appealing way.

Part 1 | Part 2 | Part 3

In the past blog post, the first Raspberry Pi was connected to several sensors. Tiny Python scripts poll the data from the sensors regularly and save them to simple text files that can be copied via SSH/SCP. Current data can be obtained from the sensors by directly connecting to the Brick Daemon which runs on this RasPi.

The main focus of this post is the visualization of the data via a JavaFX application and how to control the FX application by using the regular remote control of the TV. But before we dive into the details, I want to tease you with a screenshot of the final result ;-)

[Screenshot: Weather station, blue edition]

Nice, isn’t it? So let’s get started.

Things to discuss

The most important question first: What do I want to see and do?
I had a pretty clear intention already: I want to see the current values of all sensors in a small overview and I want to be able to toggle through time series of the past days.

Toggling should be done with the TV’s remote control, as I would have the remote at hand already when I switch to the weather data display. I’ve seen several guides where an IR receiver and an additional remote were used to control the RasPi. Yet I simply do not like the idea of having another remote control on the table, right next to the other ones. Controlling the app via mobile phone wasn’t what I wanted either, as we usually still simply use the TV’s own remote to control the TV. Also, I knew from Raspbmc that it is possible to accomplish this by using libCEC, somehow.

Next question to clarify: How should the data be visualized on the TV? This was easy: JavaFX.
Colleagues asked me why I chose FX instead of HTML5. Well, mainly there were three reasons:

  • I simply wanted to do something in JavaFX.
  • I’m not a fan of fiddling around with JavaScript and CSS. And the standard way of displaying web apps on the RasPi seems to be using Midori, a browser I had never worked with before. Even thinking about possible CSS/JS incompatibilities totally turned me off.
  • I wanted to control the UI with the TV’s remote control. That has to be done somehow using the CEC commands that are sent through HDMI. I had never done that before and rated my chances of accomplishing this in pure Java higher than somehow creating input events that I would have to redirect to the browser.

Checking the Hardware

Prior to coding, I did some research if and how the CEC commands are sent through HDMI. And – unfortunately – besides all software problems, a lot of people reported problems to even get the signal to the RasPi! Possible problems mentioned throughout several posts included the TV software having disabled the CEC functionality or HDMI cables that were blocking / not forwarding the CEC signals.

What I found the easiest way to test the CEC compatibility of the setup was Raspbmc. Raspbmc is a mediacenter for Raspberry Pis, which is easy to install and supports CEC. So: download the image, flash it to an SD card, connect the Pi to your TV with the HDMI cable that you want to use later as well and start the Pi. If you see your TV’s remote working: Great! Otherwise: you have my sympathy. Good luck in finding out what is wrong with your technical setup.

Install libCEC

So the CEC signals are technically consumable by the Pi. Let’s install libCEC.
DO NOT install the packages from the pulse-eight website. And also DO NOT simply recompile and install libCEC following the guide on the GitHub page. It’s important to compile the lib for the Pi! Just follow one of the step by step guides to compile and install:

$ cd /tmp
$ sudo apt-get install build-essential autoconf liblockdev1-dev libudev-dev git  libtool pkg-config
$ git clone git://github.com/Pulse-Eight/libcec.git
$ cd libcec
$ sudo ./bootstrap
$ sudo ./configure --with-rpi-include-path=/opt/vc/include --with-rpi-lib-path=/opt/vc/lib --enable-rpi
$ sudo make
$ sudo make install
$ sudo ldconfig
$ cec-client -l

LibCEC is now installed including cec-client (seen in the last call). Play around a bit with cec-client. Up to now everything was fine on our Samsung TV. But beware, I’ve seen posts where people with TVs from other vendors faced serious problems … If you want to play around with cec-client, CEC-O-MATIC might also be a reference you should have a look at.

Install Java 8

At the time of writing, Java 8 has already been released. At the time I was writing the code, it was still an EA release (Early Access). I read several posts that recommended installing the ARM version of Java 8. So I simply followed the OpenJDK guide to install OpenJDK 8:

  1. download Java 8 ARM from Oracle
  2. unpack the file: sudo tar zxvf jdk-8-linux-arm-vfp-hflt.gz -C /opt
  3. and check if Java 8 got installed: /opt/jdk1.8.0/bin/java -version
  4. Set default java and javac to JDK 8:
    $ sudo update-alternatives --install /usr/bin/javac javac /opt/jdk1.8.0/bin/javac 1
    $ sudo update-alternatives --install /usr/bin/java java /opt/jdk1.8.0/bin/java 1
    $ sudo update-alternatives --config javac
    $ sudo update-alternatives --config java
    $ java -version
    $ javac -version

    java and javac should link to 1.8.0 now.

Next, adjust the memory split option to 256 MB as mentioned on the page (gpu_mem=256 in /boot/config.txt). It’s useful to read the page. The page also gives a note which is pretty important to avoid frustration:

Note that the default configuration of JavaFX on the Raspberry Pi does not use X11. Instead JavaFX works directly with the display framebuffer and input devices. So you should not have the X11 desktop running when starting JavaFX.

JDK 8 EA builds for the Raspberry Pi include full support for hardware accelerated graphics, with everything from the base, graphics, controls and FXML modules. Media and Web modules are not included.

So configure the Pi to boot just to the shell (via sudo raspi-config) and do not try to see the JavaFX output via VNC!

Building the GUI

Designing the UI (or: how I started to love JavaFX)

Before I even started to code, I wanted to set up the layout. Coming from a Java-Swing background I was really excited to try this JavaFX thingy that was said to be so much cooler than Swing.

After setting up the project in NetBeans and installing JavaFX Scene Builder I was very positively surprised! By the way, I was only working with Scene Builder 1.1; the current version 2 is said to be even better. After initializing the project, my focus quickly went to the src/main/resources folder. At this location you can find fxml/FXMLDocument.fxml and styles/base.css (later renamed to metro.fxml/.css). Designing the GUI was about 90% done just in Scene Builder (for the .fxml) and NetBeans (for the .css). The really nice thing is that it is a true WYSIWYG editor. Also, whenever the CSS file is edited, the effect is immediately visible in Scene Builder. Simply great compared to what I was used to from my Swing experience.

After the main layouting was done, I started to write some Java code in FXMLDocumentController.java. Accessing UI components is simply done by annotating the according fields. The fields themselves are injected automatically. A nice decoupling of view and controller. Also I was able to compare two very different layouts by just switching one single line (the one referencing the fxml file) without further refactoring.

Time for some user tests! Honestly, I was a bit afraid of what others would say. But I wanted to make a NICE UI, so I asked my wife and friends on Facebook and Google+ for feedback. And this decision turned out to be damn right! I got very valuable feedback about colors and layout. I definitely recommend asking users, and valuing their feedback!

Afterwards I implemented the logic for reading the CSV files, including filling the model classes. These model classes were then rendered into the graphs. So I also had some test data and continued styling the graph lines via CSS. At this stage I remembered the talks of Gerrit Grunwald. Gerrit is pretty experienced in JavaFX on the Pi and mentioned in some talks that animations on the Pi can be really slow in the beginning until everything gets compiled by HotSpot. I can definitely confirm this! The trick was to simply set animated="false" on the line chart, which would only be animated at the beginning anyway (and I didn’t really need or want that).
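
Reading the archived values into model objects can be sketched roughly like this. Note that the file format is an assumption made for this example (one reading per line, "epochMillis;value"), and the class names are hypothetical; the real project may store and model the data differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical parser for sensor archive files; the line format is assumed.
final class SensorCsv {

    // One measurement: when it was taken and what the sensor reported.
    static final class Reading {
        final long timestampMillis;
        final double value;

        Reading(long timestampMillis, double value) {
            this.timestampMillis = timestampMillis;
            this.value = value;
        }
    }

    // Parses lines of the assumed form "epochMillis;value" and skips
    // empty or malformed lines instead of failing the whole import.
    static List<Reading> parse(List<String> lines) {
        List<Reading> readings = new ArrayList<>();
        for (String line : lines) {
            String[] parts = line.trim().split(";");
            if (parts.length != 2) {
                continue; // not a "timestamp;value" pair
            }
            try {
                long timestamp = Long.parseLong(parts[0].trim());
                double value = Double.parseDouble(parts[1].trim());
                readings.add(new Reading(timestamp, value));
            } catch (NumberFormatException e) {
                // ignore lines whose fields are not numeric
            }
        }
        return readings;
    }
}
```

The resulting list can then be handed to whatever model class backs the line chart.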

Show the UI on the TV

Time for a first test on the TV! At this point, remember the last part from “Install Java 8” above: We do not need an X server, as JavaFX on the Pi paints directly to the framebuffer! So just compile and build the JAR, copy it over to the Pi and start it with java -jar ./Weatherstation-1.0-SNAPSHOT.jar.

I learned that my regular monitor is not the same as the display of the TV. So I spent some time adjusting CSS font sizes until I was content with the way everything was displayed on the TV. It feels a bit like web design and testing with different browsers, just a bit less painful.

Connecting the sensors

With the GUI being done so far, the next step was due. When the program is started, archived data is read from files. Current data should be fetched directly from the remote sensors. This was actually plain easy. With the API provided by Tinkerforge, all I had to do was start a thread off the event dispatching thread that polled new sensor values periodically and pushed the data to the models. The models then simply updated their UI elements.
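
The polling approach can be sketched as follows. This is a simplified illustration, not the project’s actual code: the sensor is represented by a plain DoubleSupplier instead of the Tinkerforge API, and the class name is made up. In a JavaFX application, the model update would typically be wrapped in Platform.runLater so that UI elements are only touched on the JavaFX application thread.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;
import java.util.function.DoubleSupplier;

// Hypothetical poller; the sensor and the model are passed in as functions.
final class SensorPoller {
    private final DoubleSupplier sensor;
    private final Consumer<Double> modelUpdate;

    SensorPoller(DoubleSupplier sensor, Consumer<Double> modelUpdate) {
        this.sensor = sensor;
        this.modelUpdate = modelUpdate;
    }

    // Reads the sensor once and hands the value to the model.
    void pollOnce() {
        modelUpdate.accept(sensor.getAsDouble());
    }

    // Starts a background daemon thread that calls pollOnce() at a fixed rate.
    // The caller can stop polling via shutdownNow() on the returned scheduler.
    ScheduledExecutorService start(long periodSeconds) {
        ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor(runnable -> {
                    Thread thread = new Thread(runnable, "sensor-poller");
                    thread.setDaemon(true); // do not keep the JVM alive
                    return thread;
                });
        scheduler.scheduleAtFixedRate(this::pollOnce, 0, periodSeconds, TimeUnit.SECONDS);
        return scheduler;
    }
}
```

Keeping the single read in pollOnce() separate from the scheduling makes the logic easy to try out without waiting for a timer.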

Remote control Java

The key strokes should be used to toggle through the values of humidity, temperature, ambient light and air pressure.

All that was left now was the remote control part. Unfortunately, libCEC is a C library only. So I had the choice of either trying some JNI and talking to the lib directly, or starting a Java thread wrapping a process that just calls the cec-client program:
/usr/local/bin/cec-client -d 8 -t prta -o Wetter. The d parameter defines the verbosity of the output, t defines the type of device the cec-client simulates (recorder, …), and o finally defines the string that is shown in the on-screen display of the TV when browsing through the HDMI input devices. When a cec-client listens with this command, just switch the TV to the according “device” and press some buttons on the remote control. The key presses are shown on the console.

The console output in turn is parsed directly by the Java process, which calls the appropriate methods in the controller. This works brilliantly in my case. Yet Pulse-Eight do not recommend using cec-client in a production environment, as it is intended for tests only. But well, I didn’t want to invest even more time into coding a perfect CEC-Java bridge, especially as my C knowledge is very basic. So chances are high that my implementation would be far from “production ready”, too.
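
A minimal sketch of such a wrapper could look like this. The shape of the log line ("key pressed: <name> (<code>)") is an assumption based on typical cec-client debug output and may differ between libCEC versions; the class and method names are hypothetical and not taken from the project.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Optional;
import java.util.function.Consumer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper around the cec-client command line tool.
final class CecKeyListener {

    // Assumed log format, e.g. "DEBUG: [ 1234] key pressed: left (3)".
    private static final Pattern KEY_PRESSED =
            Pattern.compile("key pressed: (.+) \\(([0-9a-fA-F]+)\\)");

    // Extracts the key name from one line of console output, if there is one.
    static Optional<String> parseKey(String line) {
        Matcher matcher = KEY_PRESSED.matcher(line);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    // Starts cec-client with the parameters from the text and reports every
    // recognized key press to the callback until the process ends.
    static void listen(Consumer<String> onKey) throws IOException {
        Process process = new ProcessBuilder(
                "/usr/local/bin/cec-client", "-d", "8", "-t", "prta", "-o", "Wetter")
                .redirectErrorStream(true)
                .start();
        try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(process.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                parseKey(line).ifPresent(onKey);
            }
        } finally {
            process.destroy();
        }
    }
}
```

listen blocks until cec-client exits, so it would be run on its own thread, with the callback mapping key names to controller methods.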


That’s it! And of course, you can fork the complete project on GitHub!


Take care when logging Exceptions!

Today I was facing some weird NullPointerExceptions (NPEs) in my Android app (beta phase, luckily). Usually I catch exceptions like this:

try { .. } catch (SomeException e) {
   logger.info("A SomeException occurred, but I got it.", e);
}

Well, in this part of the code I broke with my habit and wrote:

try { .. } catch (SomeException e) {
   logger.info("A SomeException occurred, but I got it: " + e.getMessage(), e);
}

And guess what: I experienced the expected exception, caught it (great), and got an NPE somewhere in the logging framework. WTF?
After a quick look at the Throwable API, I realized that getMessage() can indeed return null. My first assumption was that String + null produces null, so that I passed a null message right into the logging call. Strictly speaking, Java string concatenation turns a null operand into the text "null", so the concatenated message itself is never null; whatever the exact path inside the logging framework was, the null returned by getMessage() was the trigger. This was very annoying, as I successfully caught the first exception just to produce a subsequent error while handling the first one.
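
A defensive way to build such a log message is sketched below. The helper is hypothetical and not part of any logging framework; it only relies on the documented behaviour that Throwable.getMessage() may return null, and it falls back to the exception’s class name in that case.

```java
// Hypothetical helper for null-safe log messages.
final class SafeLogMessage {

    // Builds a message that is never null and never ends in the text "null":
    // if the exception carries no detail message, its class name is used.
    static String describe(String prefix, Throwable throwable) {
        String detail = throwable.getMessage();
        if (detail == null) {
            detail = throwable.getClass().getSimpleName();
        }
        return prefix + detail;
    }
}
```

In the catch block above, the call would then read logger.info(SafeLogMessage.describe("A SomeException occurred, but I got it: ", e), e).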

Well, I immediately grepped my whole project for any strings like .getMessage() and checked for any other NPE traps.
Lesson learned today: carefully check the API docs and be even more paranoid about NPEs.
Yet one question remains unanswered: Why on earth would anyone want to return null references in exceptions instead of empty strings?!