What to do in case of org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes

I’m currently gathering my first experiences with Apache Spark and in particular Spark SQL.

While I was playing a bit with Spark SQL joins, I suddenly faced an exception like Exception in thread "main" org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: foo, followed by the parsed SQL statement etc.

Well, in MySQL the error message would have been
"Unknown column 'foo' in field list"
In other words: you are accessing a column/field foo that does not exist.
I was already a bit too close to the problem to see it at once, and I only found descriptions dealing with nested structures etc. (which wasn’t the case in my situation). So it took me a couple of minutes to realize what Spark wanted to tell me.

Maybe this helps someone else, too.


How to ignore Maven build errors due to JavaDoc with Java 8

Java 8 is a bit more strict in JavaDoc parsing. This can lead to build failures in Maven when building the repo, with messages like:

Failed to execute goal org.apache.maven.plugins:maven-javadoc-plugin:2.7:jar (attach-javadocs) on project [projectname]: MavenReportException: Error while creating archive:
Exit code: 1 - [path-to-file]:[linenumber]: warning: no description for @param

Sure, the good solution would be to fix the JavaDocs. But in cases where you just clone a foreign repo, you probably just want to get it to run and not start fixing it.

To ignore the errors, just turn off doclint by adding the following <configuration> tag to your pom.xml:

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-javadoc-plugin</artifactId>
    <version>2.10.2</version>
    <executions>
        <execution>
            <id>attach-javadocs</id>
            <goals>
                <goal>jar</goal>
            </goals>
            <configuration> <!-- add this to disable checking -->
                <additionalparam>-Xdoclint:none</additionalparam>
            </configuration>
        </execution>
    </executions>
</plugin>

Some more solutions can be found in this StackOverflow thread.


How to get List of Objects from deeper level in Json via GSON

Sometimes you get a deeply nested JSON response, but the only thing you need is a list of objects in a certain branch of the JSON document (like a response of Yahoo’s YQL query).

Assume just the following JSON document:

{
"fieldA": {
    "fieldB": {
        "fields": [
            { "foo": "test1", "bar": "test2"},
            { "foo": "test11", "bar": "test22"}
         ]
      }
   }
}

And the only thing you need is the fields array.
A Java 8 way to get the fields as a list would be:

List<FooBar> fooBars = Stream.of(gson.fromJson(json, JsonObject.class)
	.getAsJsonObject("fieldA")
	.getAsJsonObject("fieldB")
	.getAsJsonArray("fields"))
	.flatMap(e -> Stream.of(gson.fromJson(e, FooBar[].class)))
	.collect(Collectors.toList());

But that’s quite some code. Okay if you only need it once, but as soon as you need this several times it clearly violates the DRY principle. Gson (which I am using a lot) doesn’t seem to provide a simple way of doing this, except creating the whole hierarchy as Java classes, which might just be overkill.

Solving the problem in a more generic way is the way to go, but it also requires creating generic arrays:

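import com.google.gson.Gson;
import com.google.gson.JsonArray;
import com.google.gson.JsonObject;

import java.lang.reflect.Array;
import java.util.Arrays;
import java.util.List;

class Gsons {
    // path is a dot-separated list of field names; the last one must name a JSON array
    @SuppressWarnings("unchecked")
    public static <T> List<T> asList(String json, String path, Class<T> clazz) {
        Gson gson = new Gson();
        String[] paths = path.split("\\.");
        JsonObject o = gson.fromJson(json, JsonObject.class);
        // walk down the nested objects, stopping before the last path element
        for (int i = 0; i < paths.length - 1; i++) {
            o = o.getAsJsonObject(paths[i]);
        }
        JsonArray jsonArray = o.getAsJsonArray(paths[paths.length - 1]);
        // build the Class object of T[] from an empty array instance
        Class<T[]> clazzArray = (Class<T[]>) ((T[]) Array.newInstance(clazz, 0)).getClass();
        T[] objectArray = gson.fromJson(jsonArray, clazzArray);
        return Arrays.asList(objectArray);
    }
}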

The only things to do are creating a class for the entities and calling the method:

List<FooBar> fooBars = Gsons.asList(json, "fieldA.fieldB.fields", FooBar.class);
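
The least obvious line in asList is the one that obtains the array class. As a minimal sketch using only the JDK (the class and method names GenericArrays and arrayClassOf are made up for illustration and are not part of Gson), the trick can be isolated like this:

```java
import java.lang.reflect.Array;

// Hypothetical helper, only for illustration: derives the Class object of T[]
// from the Class object of T by creating an empty array and asking for its class.
class GenericArrays {
    @SuppressWarnings("unchecked")
    static <T> Class<T[]> arrayClassOf(Class<T> elementType) {
        // Array.newInstance returns Object, so the cast is unchecked but safe here
        return (Class<T[]>) Array.newInstance(elementType, 0).getClass();
    }
}
```

The resulting class object is what gets passed to gson.fromJson in place of a hard-coded FooBar[].class.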


How to (re)schedule an alarm after an App upgrade in Android

In one of my apps I am using alarms to schedule notifications.
Of course I also want to (re)schedule the alarm when the device is rebooted. Easy: just set a BOOT_COMPLETED action in the intent-filter of the corresponding schedule receiver:

<receiver android:name=".AlarmScheduleReceiver" android:enabled="true">
<intent-filter>
<action android:name="android.intent.action.BOOT_COMPLETED" />
<category android:name="android.intent.category.DEFAULT" />
</intent-filter>
</receiver>

The problem is that when the app is upgraded, your alarm will not be rescheduled! Not too much of a problem, if you know about it! Just add another action to the intent-filter:

<action android:name="android.intent.action.PACKAGE_REPLACED" />

I was really lucky that a friend pointed that out when I added that feature to my app! Figuring this out just by getting user complaints that “the alarm sometimes doesn’t work” would not have been much fun!

I would have been pretty glad if the API docs mentioned something like “hey, when you listen for BOOT_COMPLETED, you might consider listening for PACKAGE_REPLACED, too”. Well, that’s life.


How to do automatic tagging of articles using Feedly

In this post I will describe a first proof-of-concept approach to implementing a supervised learning system that automatically tags RSS posts in Feedly.

Motivation

Everyone using an RSS reader to read daily news surely knows the situation that certain topics keep (re)occurring in the news. Yet most people have topics that they are simply not interested in. Just think about certain sports, political topics or world events. But of course they keep showing up in the daily news stream.

Therefore a system is needed that automatically assigns predefined tags to the corresponding news entities and (maybe) also marks them as read.

A critical point is that the system must integrate into an RSS reader application. A system not being able to attach to an existing system would not be applicable as one still wants to use a mobile / desktop app to read the news and also to (un)tag articles. Implementing the complete value chain comprising fetching RSS, parsing, classifying, providing an aggregated stream AND an application for reading the news is definitely out of scope for a proof of concept.

I wanted to write such a classifier for quite some time but didn’t find a system that provided a convenient API to plug in a tool for reading, classifying and pushing back the results. Until I discovered the Feedly API. Unfortunately the Feedly API is not (yet) fully open, so one has to obtain a time-limited API token by hand. Yet, for a proof of concept, this is totally acceptable.

The Learning System

So much for the introduction. Let us go in medias res:

The first thought was to start with some clustering using Elasticsearch (for similarity search). But let’s recall the base facts and requirements:

  • Only a handful of tags are needed
  • Start with the simplest approach first
  • It should be able to run either on OpenShift or on my Raspberry Pi

So the choice was to start with a simple Naive Bayes classifier. Instead of giving an in-depth explanation of the Bayes classifier (I recommend Paul Graham’s A Plan for Spam and the page about combined probability), just recall: a Bayes classifier is just a 0-1 classifier, so a single classifier is required for each tag. This of course makes it unusable for a very large number of tags! But the big advantage is that the Bayes classifier is dead easy: just count how often a word occurs in class A (the tag) and in class B (not the tag).
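
To make the counting idea concrete, here is a minimal sketch of such a per-tag 0-1 classifier. It is not the code from the repository: the class name TagClassifier, the clamping to 0.01 and 0.99, and the use of all known tokens (Graham only uses the most significant ones) are assumptions made for illustration. The per-token probabilities are combined with the combined-probability formula, computed in log space to avoid underflow on long articles.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of one 0-1 Bayes classifier for a single tag.
class TagClassifier {
    private final Map<String, Integer> tagCounts = new HashMap<>();   // class A: articles carrying the tag
    private final Map<String, Integer> otherCounts = new HashMap<>(); // class B: articles without the tag
    private int tagDocs = 0;
    private int otherDocs = 0;

    // Counts every token occurrence for the class the article belongs to.
    void train(List<String> tokens, boolean hasTag) {
        Map<String, Integer> counts = hasTag ? tagCounts : otherCounts;
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        if (hasTag) tagDocs++; else otherDocs++;
    }

    // Returns the combined probability that an article with these tokens belongs to the tag.
    double probability(List<String> tokens) {
        if (tagDocs == 0 || otherDocs == 0) return 0.5; // nothing to compare against yet
        double logTag = 0.0;
        double logOther = 0.0;
        for (String token : tokens) {
            int inTag = tagCounts.getOrDefault(token, 0);
            int inOther = otherCounts.getOrDefault(token, 0);
            if (inTag + inOther == 0) continue; // unknown tokens carry no evidence
            double a = Math.min(1.0, (double) inTag / tagDocs);
            double b = Math.min(1.0, (double) inOther / otherDocs);
            // clamp so that a single token can never force the result to exactly 0 or 1
            double p = Math.max(0.01, Math.min(0.99, a / (a + b)));
            logTag += Math.log(p);
            logOther += Math.log(1.0 - p);
        }
        // equivalent to prod(p) / (prod(p) + prod(1 - p))
        return 1.0 / (1.0 + Math.exp(logOther - logTag));
    }
}
```

One instance of this class would be needed per tag, which is exactly why the approach does not scale to a very large number of tags.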

How to train / apply the classifier(s)

The classifier should be trained periodically and the user must have the opportunity to correct classification errors. Before dealing with synchronizing & updating entries, the classification workflow for each tag is:

  1. get all entities for the tag and use them as positive samples
  2. get all read and untagged entities and use them as negative samples
  3. get all unread and untagged entries and compute P(tag)
  4. if P(tag) > 0.95, mark the entity with the tag and probably also mark it as read

As input, all kinds of properties are used that could distinguish between tags, including the source URL, site keywords, categories etc. Then the content is tokenized / split by all non-word characters. Graham writes about some optimizations for spam detection, yet the results were pretty convincing without further optimization.

In order to have some positive samples, this of course requires that some entities have been tagged already. In this case I started tagging quite some time ago, as I already assumed that I would need some ground truth.
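
The tokenization and the threshold from step 4 can be sketched as follows. This is only an illustration, not the repository code: the names ArticleTokenizer, tokenize and shouldTag are made up, and lower-casing the input is an assumption. Note that \W in Java only treats ASCII letters, digits and the underscore as word characters by default, so umlauts would split words unless Unicode matching is enabled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: turns article properties into tokens and applies the 0.95 threshold.
class ArticleTokenizer {
    // Splits every given field (URL, keywords, content, ...) on runs of non-word characters.
    static List<String> tokenize(String... fields) {
        List<String> tokens = new ArrayList<>();
        for (String field : fields) {
            if (field == null) continue; // a missing property simply contributes no tokens
            for (String token : field.toLowerCase(Locale.ROOT).split("\\W+")) {
                if (!token.isEmpty()) tokens.add(token);
            }
        }
        return tokens;
    }

    // Step 4 of the workflow: only tag when the classifier is very sure.
    static boolean shouldTag(double probability) {
        return probability > 0.95;
    }
}
```

Feeding the source URL through the same tokenizer as the content keeps the classifier simple, since every property just ends up as more words to count.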

Raspberry Pi: Boon and Bane

Raspberry Pis are great as little home servers. The drawback is that the RasPi has just a single-core 700 MHz ARM CPU and 512 MB RAM, which is shared between GPU and system. So it is a bit slow and a bit low on resources, especially if the RasPi is also used for other purposes at the same time that also consume some RAM. In case of very large RSS streams, this could indeed pose a problem: running low on CPU is inconvenient (= slow), but running low on RAM is deadly (OOME). Therefore it might be required to replace the HashMap in the Bayes class with a DB layer like MapDB.

Status Quo

The quick test with the Bayes classifier already showed some really fine results! On the RasPi, each tag is classified within 200 – 230s (14 – 18s on my notebook). The mission “Reduce the amount of information that I am not interested in” can thus be regarded as “successfully tested”!

Also, there have hardly been any misclassifications, and the ones I experienced were quite understandable. In contrast to scientific publications I didn’t do extensive accuracy tests: the first attempts were so promising that I simply saved the time and thought about what to try out next that could make my life easier.

If this approach is to be followed any further, there are of course (as always) some open issues: besides code cleaning, one could try to filter by TF-IDF, filter certain tokens, adjust thresholds, etc. But I doubt that the results would get much better.

And of course, the complete code is available at GitHub. Feel free to fork it and play around with it! Beware: The code can change dramatically from one commit to another. For example if I just want to test a new idea.


Java 8 Streams: Collecting items into a Map of (Key, Item)

Once in a while I come across the task where I have a list of Items that I want to filter and afterwards store in a map. Usually the key is a property of the Item: anItem.name -> anItem

In the usual Java way this looked like the following:

Map<String, Item> map = new HashMap<>();
for (Item s : list) {
    if (!s.name.equals("a")){
        map.put(s.name, s);
    }
}

Nothing really special, but it somehow doesn’t look too nice. Yesterday I thought: in Scala I would emit tuples and call .toMap. Isn’t that also possible with Java 8 Streams? And indeed it is:

Map<String, Item> map = list.stream()
    .filter(s -> !s.name.equals("a"))
    .collect(toMap(s -> s.name, s -> s)); // toMap is a static import of Collectors.toMap(...)

This looks compact and readable!

If you don’t like s -> s, just use the identity() function of the Function class. Actually I do not like static imports very much as they make the code less readable, but in this case I would go for static imports.

Map<String, Item> map = list.stream()
    .filter(s -> !s.name.equals("a"))
    .collect(toMap(s -> s.name, identity())); // toMap and identity are imported statically
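
One difference to the loop version is worth a small self-contained sketch: map.put silently overwrites an entry when two items share a name, whereas the two-argument Collectors.toMap throws an IllegalStateException on duplicate keys. The Item class and the method names below are stand-ins made up for this example; the three-argument toMap with a merge function that keeps the later item reproduces the behavior of the loop.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

class ToMapExample {
    // Minimal stand-in for the Item class used above (hypothetical).
    static class Item {
        final String name;
        final int value;
        Item(String name, int value) { this.name = name; this.value = value; }
    }

    // Two-argument toMap: throws IllegalStateException if two items share a name.
    static Map<String, Item> byNameStrict(List<Item> items) {
        return items.stream()
            .filter(i -> !i.name.equals("a"))
            .collect(Collectors.toMap(i -> i.name, Function.identity()));
    }

    // Three-argument toMap: on duplicate names the later item wins, like map.put in the loop.
    static Map<String, Item> byName(List<Item> items) {
        return items.stream()
            .filter(i -> !i.name.equals("a"))
            .collect(Collectors.toMap(i -> i.name, Function.identity(), (first, second) -> second));
    }
}
```

If duplicate names cannot occur in your data, the shorter two-argument form is fine and even acts as a cheap sanity check.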


How to rename a GIT tag

Once in a while (yet often enough) it happens that I have to change an already pushed git tag. Usually because I violated my own naming scheme.
Yet somehow I can’t keep the necessary commands in mind:

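git tag newTag oldTag
git tag -d oldTag
git push origin :refs/tags/oldTag
git push origin newTag

Basically it is just: create newTag pointing to the same commit as oldTag, remove the local oldTag, delete the remote oldTag, and push newTag so that it also exists on the remote.
Also see the git man page for further parameters.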


Windows Tomcat start failed command 127.0.0.1 could not be found

I just installed Tomcat 7 on my Windows machine and tried to fire it up through NetBeans. But instead of a running server, I just got an error message that the command 127.0.0.1 could not be found (localized error message: Der Befehl “127.0.0.1” ist entweder falsch geschrieben oder konnte nicht gefunden werden.).

I remember that I read about it in a Tomcat bug tracker (but can’t find it any more). Well, the solution is pretty simple:
Just open [tomcat home]\bin\catalina.bat and remove the quotation marks (") from lines 196 and 201 (in the code snippet below these are lines 1 and 6):

set JAVA_OPTS=%JAVA_OPTS% %LOGGING_CONFIG%

if not "%LOGGING_MANAGER%" == "" goto noJuliManager
set LOGGING_MANAGER=-Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager
:noJuliManager
set JAVA_OPTS=%JAVA_OPTS% %LOGGING_MANAGER%


Firefox Sync not syncing Tabs?

Just recently I tried Firefox including the Sync feature, and was pretty disappointed as tabs didn’t seem to be synced. But they were synced! It’s just that the tabs aren’t opened automatically.

Just open about:sync-tabs and there they are,
listed / grouped by device. Just a bit … hidden.


Epson XP-205 Scanner in Windows 8.1 – How to get it back to work

Now it’s time to get an Epson XP-205 scanner to work with Windows 8.1.

It took me a while to get the XP-205 scanner component to run on my new Windows 8.1 via WiFi:

The problem is that when you install the scan software, you might get a communication error telling you that the scanner is probably not connected or turned off. Neither, of course, is the actual problem.

So here is how to get it online in Windows 8.1:

  1. First install the Epson Scan software that you can obtain directly from Epson (German version link).
  2. Open the scan software and verify that you (still) get the error.
  3. Then change to the Windows tile screen (just hit the Windows key on your keyboard) and type “Epson Scan”. A list should appear that shows the Epson Scan entries. One of them should be “Epson Scan Settings” (German: Epson Scan-Einstellungen).
  4. Now comes the tricky part: If you work as an administrator (which you usually shouldn’t do), just open the Epson Scan Settings tool. Otherwise right-click the Epson Scan Settings tool and start it with administrator privileges. This is very important! Otherwise you won’t be able to do the next step, as an important part is greyed out and thus unavailable!
  5. You should see the settings window with “Connection local” selected.
  6. Hit the “Network” button and then hit the “add” button below the network scanner address field.
    If this button is disabled, you are probably not running the program with admin privileges.
  7. Add the scanner’s IP and give it a name.
  8. Hit the Test button – it should work now!

The scan software should work now. At first I had some issues, so that I also had to run it with admin privileges. Sometimes I could start the scanning software and it showed the correct user interface, but it only scanned a very tiny part of the physical scan area (VERY strange).

So if you still experience some issues: try restarting the computer first! If this doesn’t work, try starting as administrator.

Update: The scan software surprised me today with a message similar to “ah no – I won’t scan today”. So I

  • uninstalled the scan software. Well, I tried; the files were still in the Windows directory.
  • let the software updater update the firmware of the device
  • reinstalled Epson Scan

And – suddenly everything is just fine.
