One of the way to use Python with Hadoop is via Hadoop Streaming. However it's geared mostly toward text based format and at work we use mostly Avro.
Took me a while to figure the magic, but here it is. Note that the input to the mapper is one JSON object per line.
Note it's a bit old (Avro is now at 1.7.4), originally from here.
If it won't be simple, it simply won't be. [Hire me, source code] by Miki Tebeka, CEO, 353Solutions
Friday, September 14, 2012
Subscribe to:
Post Comments (Atom)
2 comments:
I've been trying to do something like this. But the JSON that comes out is 10% of the time mangled. Have you had this experience?
No I haven't. Probably in the near future I'll have more experience with this method and we'll see.
Did you try a newer avro release? (it's currently at 1.7.4)
Post a Comment