Friday, March 08, 2013

zipstream - Zip File InputFormat for Hadoop Streaming

At work, we store logs as a single CSV inside a zip file in HDFS (for historical reasons :).
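
To make that layout concrete, here's a rough sketch of reading such a file locally in Python, outside Hadoop. It has nothing to do with zipstream itself; the helper name `iter_zipped_csv` and the UTF-8 encoding are assumptions made up for illustration.

```python
import csv
import io
import zipfile


def iter_zipped_csv(path):
    """Yield rows from the single CSV stored inside the zip file at `path`.

    Hypothetical helper: assumes the archive holds exactly one member
    and that the member is UTF-8 encoded CSV.
    """
    with zipfile.ZipFile(path) as zf:
        names = zf.namelist()
        if len(names) != 1:
            raise ValueError(
                "expected exactly one file in %s, found %d" % (path, len(names))
            )
        # zf.open gives a binary stream; wrap it so csv.reader sees text.
        with zf.open(names[0]) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            for row in csv.reader(text):
                yield row
```

Something like `for row in iter_zipped_csv("logs.zip"): print(row)` would then walk the log one row at a time without unpacking the archive to disk.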

Looking around, I couldn't find an InputFormat library for zip files that works with Hadoop streaming on CDH4 (the version we're using).

So I wrote one. I hope you'll find it useful (you can download the jar directly from here).

Here's an example of how to use it:

