Enable output compression in Scalding

I wanted to enable final output compression in one of my Scalding jobs, because I needed to reorganize a multi-terabyte data set.

Unfortunately, Scalding always produced uncompressed files. After some googling, I came across a GitHub issue that addressed exactly this problem. Following a few links, I found sample code in this repo which can be used to write compressed TSVs.


  1. Set the parameters correctly as stated in the docs. Beware of your Hadoop version (Yarn vs MR1):
    // http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/4.5.0/CDH4-Installation-Guide/cdh4ig_topic_23_3.html
    // MR1
    // Compress Map output
    set("mapred.compress.map.output", "true")
    set("mapred.map.output.compression.codec", "org.apache.hadoop.io.compress.SnappyCodec")
    // compress final output
    set("mapred.output.compress", "true")
    set("mapred.output.compression.codec", "org.apache.hadoop.io.compress.SnappyCodec")
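    In a Scalding job, these properties can also be merged into the Hadoop configuration by overriding the job's config map. This is a sketch (the job name is hypothetical; the property names are the MR1 ones from above):

      import com.twitter.scalding._

      class ReorganizeJob(args: Args) extends Job(args) {
        // Merge the compression settings into the job's Hadoop configuration.
        // Note: these are the MR1 property names; on YARN/MR2 use the
        // mapreduce.* variants instead.
        override def config: Map[AnyRef, AnyRef] =
          super.config ++ Map(
            "mapred.compress.map.output" -> "true",
            "mapred.map.output.compression.codec" -> "org.apache.hadoop.io.compress.SnappyCodec",
            "mapred.output.compress" -> "true",
            "mapred.output.compression.codec" -> "org.apache.hadoop.io.compress.SnappyCodec"
          )
      }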
  2. Get the CompressedDelimitedScheme and CompressedTsv from https://github.com/morazow/WordCount-Compressed
  3. Pipe your job's output to a CompressedTsv instead of a plain Tsv.
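    For example (a sketch; CompressedTsv comes from the repo above, and the input path, field names, and output path are placeholders):

      Tsv("/tmp/input", ('key, 'value))
        .read
        .write(CompressedTsv("/tmp/foo"))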
  4. Check your output and the content:
     hadoop fs -ls /tmp/foo

    It should list a file such as /tmp/foo/part-00000.snappy.

     hadoop fs -text /tmp/foo/part-00000.snappy
