If it won't be simple, it simply won't be. [Hire me, source code] by Miki Tebeka, CEO, 353Solutions

Wednesday, March 14, 2012

Reading Avro Files Faster than Java

At work, we use a lot of Avro. One of the problems we faced was that the Python Avro package is very slow comparing to the Java one. The goal then was to write fastavro which is a subset of the avro package and should be at least as fast as Java. In this post I'll show how fastavro became faster than Java and also Python 3 compatible.

Going Fast
The Python avro package uses classes and properties heavily. This might allow for nice design but since function calls in Python are expensive it has a cost. The approach was to strip down most of the code in the avro package to one simple module, eliminating as many function calls as possible along the way and using only built in types. After some tweaking, fastavro was churning through the 10K records benchmark in about 2.6seconds (comparing to 13.9 seconds of the avro package). It was a nice speedup but the goal was to be as fast as Java (which was doing about 1.8sec).


Going Faster
Enter Cython. fastavro compiles the Python code without any specific Cython code. This way on machines that do not have a compiler users can still use fastavro. This complicated the build process a bit since now the C extension is generated using Cython in external Makefile. The code in fastavro first tries to import the C extension and if it fails imports the pure Python one.

This approach gave a 2x speedup (benchmark of 10K records done in 1.5seconds). Again, this is without any Cython specific code.

Python 3 Support
The initial Python 3 support was written on the first day of PyCon. However after hearing Robert Brewer's excellent talk I decided to take his advice and write a small compatibility layer (six was not used for various reasons).

As Robert said, this approach made fastavro better with strings, unicode and other things which were glossed over the 2.X only code. The build system was simplified a lot comparing to the one with the initial Python 3 support.

End Result
The end result is a package that reads Avro faster than Java and supports both Python 2 and Python 3. Using Cython and a little bit of work the was achieved without too much effort.

As usual, the code can be found on bitbucket.

No comments:

Blog Archive