Tag: Scala

  • Scalding hiding NPEs in “operator Each failed executing operation”

    Yesterday I was surprised by a failing Scalding task. Everything worked fine locally and all I git was like “job failed, see cluster log”. In the cluster log I saw the following:

    2014-10-24 14:38:41,222 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201410101555_2230_m_000005_3: cascading.pipe.OperatorException: [com.twitter.scalding.T…][com.twitter.scalding.RichPipe.each(RichPipe.scala:471)] operator Each failed executing operation
    at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:107)
    at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:39)
    at cascading.flow.stream.FunctionEachStage$1.collect(FunctionEachStage.java:80)
    at cascading.tuple.TupleEntryCollector.safeCollect(TupleEntryCollector.java:145)
    at cascading.tuple.TupleEntryCollector.add(TupleEntryCollector.java:133)
    at cascading.operation.Identity$2.operate(Identity.java:137)
    at cascading.operation.Identity.operate(Identity.java:150)
    at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:99)
    at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:39)
    at cascading.flow.stream.SourceStage.map(SourceStage.java:102)
    at cascading.flow.stream.SourceStage.run(SourceStage.java:58)
    at cascading.flow.hadoop.FlowMapper.run(FlowMapper.java:130)

    (more…)

  • Enable output compression in Scalding

    I just wanted to enable final output compression in one of my Scalding jobs (because I needed to reorganize a some-TB-data set).

    Unfortunately scalding always produced uncompressed files. After some googling, I came across a github issue that adressed exactly this problem. Via some links I got the sample code from this repo which can be used to write compressed TSVs.

    (more…)

  • Scalding Exception: diverging implicit expansion for type com.twitter.algebird.Semigroup[T]

    I was just doing a again some scalding jobs and again got an .. interesting exception:

    In a groupBy operation, I wanted to sum something up using:

    .groupBy('a) {
      _.sum('a -> 'c)
    }

    And was rewarded with this one:

    [error] example.scala:20: diverging implicit expansion for type com.twitter.algebird.Semigroup[T]
    [error] starting with method eitherSemigroup in object Semigroup
    [error]       _.sum('a -> 'c)
    [error]            ^
    [error] one error found
    [error] (compile:compile) Compilation failed

    WTF??

    Solution:

    Spot the mistake? It’s the missing type hint at sum:

    .groupBy('a) {
      _.sum<strong>[Int]</strong>('a -> 'c)  //  <-- [Int]
    }
  • Scalding: unable to compare stream elements in position: 0

    I’m currently working quite a bit with Twitter’s Scalding.
    Recently I split up a job into sub-jobs and suddenly got an Exception in my join:

    Caused by: cascading.CascadingException: unable to compare stream elements in position: 0

    If I had remembered the Fields API in detail, I would have thought about this paragraph (it’s about sorting, but the consequence is the same):

    Note: When reading from a CSV, the data types are set to String,hence the sorting will be alphabetically, therefore to sort by age, an int, you need to convert it to an integer. For example …

    Solution:

    Ensure you are joining the correct data types and possibly convert them before. For example:

    .map ('myField-> 'myField) {x:Int => x}