Enable output compression in Scalding

I just wanted to enable final output compression in one of my Scalding jobs, because I needed to reorganize a multi-terabyte data set.

Unfortunately Scalding always produced uncompressed files. After some googling, I came across a GitHub issue that addressed exactly this problem. Following a few links, I found sample code in this repo that can be used to write compressed TSVs.

Solution:

  1. Set the parameters correctly as stated in the docs. Beware of your Hadoop version (YARN vs. MR1), since the property names differ; the MR1 names are shown below:
    // http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/4.5.0/CDH4-Installation-Guide/cdh4ig_topic_23_3.html
    // MR1
    // Compress Map output
    set("mapred.compress.map.output", "true")
    set("mapred.map.output.compression.codec", "org.apache.hadoop.io.compress.SnappyCodec")
    // compress final output
    set("mapred.output.compress", "true")
    set("mapred.output.compression.codec", "org.apache.hadoop.io.compress.SnappyCodec")
    
  2. Get the CompressedDelimitedScheme and CompressedTsv from https://github.com/morazow/WordCount-Compressed
  3. Pipe your output to a compressed TSV:
    myPipe.write(CompressedTsv("/tmp/foo"))
    
  4. Check your output and the content:
     hadoop fs -ls /tmp/foo

    It should list part files like /tmp/foo/part-00000.snappy

    hadoop fs -text /tmp/foo/part-00000.snappy
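
Putting the steps above together, a minimal Scalding job could look like the following sketch. The job name, input path, and argument name are hypothetical; CompressedTsv comes from the WordCount-Compressed repo linked above, and the compression properties are merged into the job configuration via Scalding's `config` override (MR1 property names, as in step 1):

```scala
import com.twitter.scalding._

// Hypothetical job; assumes CompressedTsv from the
// WordCount-Compressed repo is on the classpath.
class ReorganizeJob(args: Args) extends Job(args) {

  // Merge the MR1 compression properties into the Hadoop job config.
  override def config: Map[AnyRef, AnyRef] =
    super.config ++ Map(
      // Compress intermediate map output
      "mapred.compress.map.output" -> "true",
      "mapred.map.output.compression.codec" ->
        "org.apache.hadoop.io.compress.SnappyCodec",
      // Compress the final job output
      "mapred.output.compress" -> "true",
      "mapred.output.compression.codec" ->
        "org.apache.hadoop.io.compress.SnappyCodec"
    )

  Tsv(args("input"))
    .read
    .write(CompressedTsv("/tmp/foo")) // Snappy-compressed TSV output
}
```

This is a sketch of the wiring only, not a drop-in job; on a YARN (MRv2) cluster the property names in the map would need to be swapped for their MRv2 equivalents.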