How to do automatic tagging of articles using Feedly

In this post I will describe a first proof of concept approach about how to implement a supervised learning system to automatically tag RSS posts in Feedly.

Motivation

Everyone using an RSS reader to read daily news surely knows the situation that certain topics keep (re)occuring in the news. Yet most people have topics that they are simply not interested in. Just think about certain sports, political topics or world events. But of course they keep showing up in the daily news stream.

Therefore a system is needed that automatically assigns predefined tags to the corresponding news entities and (maybe) also marks them as read.

A critical point is that the system must integrate into an RSS reader application. A system not being able to attach to an existing system would not be applicable as one still wants to use a mobile / desktop app to read the news and also to (un)tag articles. Implementing the complete value chain comprising fetching RSS, parsing, classifying, providing an aggregated stream AND an application for reading the news is definitely out of scope for a proof of concept.

I wanted to write such a classifier for quite some time but didn’t find a system that provided a convenient API to plug in  a tool for reading, classifying and pushing back the results. Unless I discovered the Feedly API. Unfortunately the Feedly API is not (yet) fully open, so that one has to obtain a time limited API token by hand. Yet, for a proof-of-concept, this is totally acceptable.

The Learning System

So much for the introduction. Let us go in medias res:

The first thought was to start with some clustering using Elasticsearch (for similarity search). But let’s recall the base facts and requirements:

  • Only a hand full of tags are needed
  • start with the simplest approach first
  • it should be able to run either on OpenShift or on my Raspberry Pi

So the choice was to start with a simple Naive Bayes Classifier. Instead of doing an in depth explanation of the Bayes classifier (I recommend Paul Graham’s A Plan for Spam and the page about combined probability), just recall: a Bayes Classifier is just a 0-1 classifier. So a single classifier is required for each tag. This makes it of course unusable for a very large amount of tags! But the big advantage is that the Bayes classifier is just dead easy. Just count how often a word occurs in the desired in class A (the Tag) and class B.

How to train / apply the classifier(s)

The classifier should be trained perdiodically and the user must have the opportunity to correct classification errors. Before dealing with synchronizing & updating entries, the classification workflow for each tag is:

  1. get all entities for the tag and use them as positive samples
  2. get all read and untagged entities and use them as negative samples
  3. get all unread and untagged entries and compute P(tag)
  4. if P(tag) > 0.95, mark the entity with the tag and probably also mark it as read

As input, the all kinds of properties are used that could distinguish between tags. Including the source URL, site keywords, categories etc. Then the content is tokenized / split by all non word characters. Graham writes about some optimizations for spam detection – yet results were pretty convincing without further optimization.

in order to have some positive samples, this of course requires the presence of some entities being tagged already. In this case I started tagging already quite some time ago as I already assumed that I needed some ground truth.

Raspberry PI: Boon and Bane

Raspberry PIs are great as little home servers. The drawback is that the RaspPi has just a single core, 700 MHz ARM CPU and 512 Mb ram which is shared between GPU and system. So, it is a bit slow and is a bit low on resources. Especially if the RasPi is also used for other purposes at the same time that also consume some RAM. In case of very large RSS streams, this could indeed raise a  problem: Running low on CPU is unconvenient (=slow), but running low on RAM is deadly (OOME). Therefore it might be required to replace the HashMap in the Bayes class with a DB layer like MapDB.

Status Quo

The quick test with the Bayes classifier showed already some really fine results! On the RasPi, each Tag is classified within 200 – 230s (14 – 18s on my notebook). The mission “Reduce the amount of information that I am not interested in” can thus be regarded as “successfully tested“!

Also there have hardly been any misclassifications. And the ones I experienced were quite understandable. In contrast to scientific publications I didn’t do extensive accuracy tests – the first attempts were so promising that I simply saved the time and thought about what to try out next that could make my life easier.

If this approach should be followed any further there are of course (as always) some open issues: Besides code cleaning, one could try to filter by TF-IDF, filter certain tokens, adjusting thresholds, etc. But I doupt that the results would get much better.

And of course, the complete code is available at GitHub. Feel free to fork it and play around with it! Beware: The code can change dramatically from one commit to another. For example if I just want to test a new idea.

Research Idea: Evaluation of Traffic Lane Detection with OpenStreetMap GPS Data

I am soon leaving University and thus the time for pure research will soon be over. Unfortunately I still have some ideas for possible research. I’ve tried getting them out of my head as this has not yet worked out, I’ll try to write them down – maybe somewone finds them interesting enough for a Bachelor-/Masterthesis or something like that …

Introduction

OpenStreetMap creates and provides free geographic data such as street maps to anyone who wants them. The project was started because most maps you think of as free actually have legal or technical restrictions on their use, holding back people from using them in creative, productive, or unexpected ways. The OpenStreetMap approach is comparable to Wikipedia where everyone can contribute content. In openStreetMap, registered users can edit the map directly by using different editors or indirectly by providing ground truth data in terms of GPS tracks following pathes or roads. A recent study shows, that the difference between OpenStreetMap’s street network coverage for car navigation in Germany and a comparable proprietary dataset was only 9% in June 2011.

In 2010, Yihua Chen and John Krumm have published a paper at ACM GIS about “Probabilistic Modeling of Traffic Lanes from GPS Traces“. Chen and Krum apply Gaussian micture Models (GMM) on a data set of 55 shuttle vehicles driving between the Microsoft corporate buildings in the Seattle area. The vehicles were tracked for an average of 12.7 days resulting in about 20 million GPS points. By applying their algorithm to this data, they were able to infer lane structures from the given GPS tracks.

Adding and validating lane attributes completely manually is a rather tedious task for humans – especially in cases of data sets like OpenStreetMap. Therefore it should be evaluated if the proposed algorithm could be applied to OpenStreetMap data in order to infer and/or validate lane attributes on existing data in an automatic or semiautomatic way.

Continue reading Research Idea: Evaluation of Traffic Lane Detection with OpenStreetMap GPS Data