If it won't be simple, it simply won't be. By Miki Tebeka, CEO, 353Solutions

Saturday, April 20, 2013

Serving Dynamic Images with matplotlib

Here's an example of generating dynamic images using matplotlib in a web server (flask this time).

Thursday, April 11, 2013

Show Location Of Hive Table in HDFS

Here's a script I use to find out where Hive is storing its data on HDFS (I call it hiveloc).




Tuesday, April 09, 2013

Quickly Plotting Labeled Data

Here's a quick way to view some labeled data you have (taken from An Introduction to scikit-learn). It will reduce the data to two dimensions using PCA and then scatter plot it with different colors for each label.

Script to close a branch in mercurial (hg)

Mercurial (hg) does not let you delete branches (or alter history in any way). You can, however, close branches so that they no longer show up in the output of the hg branches command.

Here's a script I use to close branches (we work with feature branches at work, and close them when work on the feature is done).
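A minimal sketch of what such a script might do, assuming the usual three-step sequence (switch to the branch, commit with --close-branch, switch back). The function names and the default commit message are made up for illustration.

```python
import subprocess


def close_branch_commands(branch, message=None, back_to="default"):
    """Return the hg commands that close `branch` and then return to `back_to`."""
    message = message or "Closing branch {}".format(branch)
    return [
        ["hg", "update", branch],
        ["hg", "commit", "--close-branch", "-m", message],
        ["hg", "update", back_to],
    ]


def close_branch(branch, dry_run=False):
    """Run the commands, or only print them when dry_run is True."""
    for cmd in close_branch_commands(branch):
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.check_call(cmd)  # stops at the first failing command
```

Calling close_branch("my-feature", dry_run=True) prints the commands without touching the repository, which is a cheap way to check them first.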

Sunday, March 31, 2013

gittip on bitbucket/github

gittip is a cool idea; however, there's currently no built-in way to add it to bitbucket/github projects.

One option I found that works is to add a clickable image to your README.md or README.rst.

See example here.

Markdown:

[![gittip](http://i.imgur.com/lg9rx9w.png)](https://www.gittip.com/Miki%20Tebeka/)


reStructuredText:

.. image:: http://i.imgur.com/lg9rx9w.png
   :alt: gittip
   :target: https://www.gittip.com/Miki%20Tebeka/



Notes:

  1. You'll probably want to change the gittip user id :)
  2. There's a discussion on the gittip bug tracker about the right way to do this.
  3. Unofficial gittip image generated using cooltext.

Thursday, March 28, 2013

import "C" slides

Last night I gave a talk about using C from Go at the L.A. Gophers meetup.

You can view the slides here. (Note that "run" won't work due to security restrictions; you can download the slides here and run them locally using the present tool.)

Wednesday, March 13, 2013

Investigating Hash Distribution

A colleague asked me for a hash function on strings that returns an integer between 0 and N. Before diving in, I decided to take the lazy path and check if Python's hash function is good enough.

Luckily, ipython notebook --pylab=inline makes that a breeze.
Check out the notebook here.

And yes, we decided to stick with this solution. I guess we're at least 1/3 programmers.
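A rough sketch of that kind of check, assuming the bucket is simply the built-in hash modulo N. The function names and the synthetic keys are illustrative, not taken from the notebook.

```python
from collections import Counter


def bucket(text, n):
    """Map a string to an integer in [0, n) using the built-in hash."""
    return hash(text) % n  # Python's % is non-negative for a positive n


def distribution(words, n):
    """Count how many words land in each of the n buckets."""
    counts = Counter(bucket(word, n) for word in words)
    return [counts.get(i, 0) for i in range(n)]


# Synthetic keys, only to have something to hash.
sample = ["word{}".format(i) for i in range(10000)]
counts = distribution(sample, 10)
```

If the buckets come out roughly equal in size, the built-in hash is probably good enough. One caveat: in Python 3, string hashes are randomized per process unless PYTHONHASHSEED is set, so the bucket assigned to a given string is not stable across runs.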

Friday, March 08, 2013

zipstream - Zip File InputFormat for Hadoop Streaming

At work, we store logs as a single CSV inside a zip file in HDFS (history, that's why :).

Looking around, I couldn't find a FileInput library that works with Hadoop streaming on CDH4 (the version we're using).

So I wrote one; I hope you'll find it useful (you can download the jar directly from here).

Here's an example of how to use it:

Thursday, February 21, 2013

Abusing namedtuple - Yet Another Enum

There's a discussion over at python-ideas about enum. This prompted me to write yet another implementation of enum, this time abusing namedtuple.
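One way this kind of abuse could look, as a sketch: build a namedtuple class whose fields are the member names, then instantiate it once with the values 0, 1, 2, and so on. The enum helper and the Color example are illustrative names.

```python
from collections import namedtuple


def enum(name, *members):
    """Create an enum-like object: each member is a field whose value is its index."""
    cls = namedtuple(name, members)
    return cls(*range(len(members)))


Color = enum("Color", "red", "green", "blue")
```

Color.red is 0, Color._fields[Color.green] gives back the name "green", and since namedtuple instances are immutable, assigning to Color.red raises AttributeError.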

Friday, February 15, 2013

try lock

At work, we have several functions that can run only one at a time. We call this "try lock" (or trylock), and have had it forever in the Java code.

When we started a Python project, we wanted this functionality. A decorator seems like the right solution. The try_lock decorator below takes an optional function (keyfn) that gives you finer-grained control over what to lock: it gets the function arguments and returns a key to lock on. If you don't specify keyfn, then there will be just one lock for the function.
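A minimal sketch of a decorator along these lines, assuming "try" means failing immediately instead of waiting when the lock is taken. The AlreadyRunning exception is a made-up name, and the details may well differ from the version we use.

```python
import threading
from functools import wraps


class AlreadyRunning(Exception):
    """Raised when the lock for a call is already held (hypothetical name)."""


def try_lock(keyfn=None):
    """Decorator: fail fast instead of waiting when the function is running."""
    locks = {}                      # key -> threading.Lock
    locks_guard = threading.Lock()  # protects the `locks` dict itself

    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kw):
            # Without keyfn every call shares the single key None.
            key = keyfn(*args, **kw) if keyfn else None
            with locks_guard:
                lock = locks.setdefault(key, threading.Lock())
            if not lock.acquire(False):  # non-blocking acquire
                raise AlreadyRunning(
                    "{} is locked for key {!r}".format(fn.__name__, key))
            try:
                return fn(*args, **kw)
            finally:
                lock.release()
        return wrapper
    return decorator
```

Note that it is used as @try_lock() or @try_lock(keyfn=...), with parentheses. The locks dict grows by one entry per distinct key and is never pruned, which is fine for a small key space but worth revisiting otherwise.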

Thursday, January 24, 2013

whoops - A WebHDFS Library and Client

Just released whoops 0.1.0 which is a WebHDFS library and a command line client for Python.

Wednesday, December 19, 2012

Timing Your Code

It's a good idea to time portions of your code and have some metric you monitor. This way you can see trends and solve bottlenecks before someone notices (hopefully). Timing functions is easy with decorators, but sometimes you want to time a portion of a function. For this you can use a context manager.
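A small sketch of such a context manager. The timings dict stands in for whatever metrics system you report to, and all the names are illustrative.

```python
import time
from contextlib import contextmanager

timings = {}  # metric name -> list of durations in seconds (stand-in store)


@contextmanager
def timed(name):
    """Record how long the enclosed block takes under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the block raises, so failures are timed too.
        timings.setdefault(name, []).append(time.perf_counter() - start)


def handler(items):
    """Example function where only the sorting step is timed."""
    with timed("sort"):
        result = sorted(items)
    return result
```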

Tuesday, December 11, 2012

Tuesday, November 20, 2012

Last Letter Frequency

I was playing a game with my child where you say a word, then the other person needs to say a word which starts with the last letter of the word you said, then you need to say a word with their last letter ...

We noticed that many words end with S and E, which made me curious about the frequency of the last letter in English words. matplotlib makes it super easy to visualize the results.
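The counting part can be sketched in a few lines. The function name is illustrative, and the word list is whatever you have at hand (a system dictionary file would be one option, if your machine has one).

```python
from collections import Counter


def last_letter_freq(words):
    """Return {letter: fraction of words ending with that letter}."""
    letters = [w.strip().lower()[-1] for w in words if w.strip()]
    counts = Counter(c for c in letters if c.isalpha())
    total = sum(counts.values())
    return {letter: count / total for letter, count in counts.items()}
```

The resulting dict maps each letter to its share of the words, which is the shape a bar chart needs.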

Friday, November 16, 2012

Python For Data Analysis

Just finished reading Python For Data Analysis, it's a great book with lots of practical examples. Highly recommended.

Thursday, October 25, 2012

Mocking HTTP Servers

Sometimes httpbin is not enough, and you need your own custom HTTP server for testing.
Here's a small example of how to do that using the built-in SimpleHTTPServer (thanks @noahsussman for reminding me).
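A sketch of the idea for Python 3, where SimpleHTTPServer became part of http.server. The handler, its canned JSON payload and the helper names are all made up for illustration.

```python
import json
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer


class MockHandler(BaseHTTPRequestHandler):
    """Answer every GET with a canned JSON reply (hypothetical payload)."""

    def do_GET(self):
        body = json.dumps({"path": self.path, "ok": True}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep test output quiet


def start_mock_server():
    """Start the server on a free port in a daemon thread and return it."""
    server = HTTPServer(("127.0.0.1", 0), MockHandler)  # port 0: OS picks one
    thread = threading.Thread(target=server.serve_forever)
    thread.daemon = True
    thread.start()
    return server


def fetch(server, path):
    """GET `path` from the mock server and decode the JSON reply."""
    conn = HTTPConnection("127.0.0.1", server.server_port, timeout=5)
    try:
        conn.request("GET", path)
        return json.loads(conn.getresponse().read().decode("utf-8"))
    finally:
        conn.close()
```

In a test, start the server in setup, point the client under test at server.server_port, and call server.shutdown() in teardown.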

Monday, October 15, 2012

http://httpbin.org

Sometimes you need to write an HTTP server to debug the client you are writing.

One quick way to avoid this is to use http://httpbin.org/. It supports most of the common HTTP verbs and mostly returns the variables you send in.

For example (note the args field in the reply):

$ curl -i 'http://httpbin.org/get?x=1&y=2'
HTTP/1.1 200 OK
Content-Type: application/json
Date: Mon, 15 Oct 2012 21:50:27 GMT
Server: gunicorn/0.13.4
Content-Length: 386
Connection: keep-alive

{
  "url": "http://httpbin.org/get?x=1&y=2",
  "headers": {
    "Content-Length": "",
    "Connection": "keep-alive",
    "Accept": "*/*",
    "User-Agent": "curl/7.22.0 (x86_64-pc-linux-gnu) libcurl/7.22.0 OpenSSL/1.0.1 zlib/1.2.3.4 libidn/1.23 librtmp/2.3",
    "Host": "httpbin.org",
    "Content-Type": ""
  },
  "args": {
    "y": "2",
    "x": "1"
  },
  "origin": "75.82.8.111"
}

Friday, October 05, 2012

Cleanup After Your Tests - But Be Lazy

It's good practice to clean up after your tests, for various reasons such as disk space, a "pure" execution environment and others.

However, if you clean up too eagerly, it'll make your debugging much harder. The data just won't be there to see what went wrong.

The solution we found is pretty simple:
  • Try to place all your test output in one location
  • Nuke this location when starting the tests
This way all the information is available after an error, and you don't accumulate too much junk (just one test run junk at a time).
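In code, the "nuke at start" step might look like this. The directory name is a placeholder, and the function would be called once from whatever runs before your test suite.

```python
import os
import shutil
import tempfile

# Placeholder location: one directory that holds all test output.
TEST_OUTPUT = os.path.join(tempfile.gettempdir(), "myproject-test-output")


def reset_test_output(path=TEST_OUTPUT):
    """Delete output left by the previous run, then recreate the directory."""
    shutil.rmtree(path, ignore_errors=True)  # fine if it does not exist yet
    os.makedirs(path)
    return path
```

There is deliberately no matching cleanup at the end of the run: whatever the tests wrote stays on disk until the next run starts.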

Thursday, September 20, 2012

Data Wrangling With Python

I just gave a talk at work called "Data Wrangling With Python" which gives an overview of the scientific Python ecosystem. You can view it here.

Friday, September 14, 2012

Using Hadoop Streaming With Avro

One of the ways to use Python with Hadoop is via Hadoop Streaming. However, it's geared mostly toward text-based formats, and at work we mostly use Avro.

It took me a while to figure out the magic, but here it is. Note that the input to the mapper is one JSON object per line.

Note it's a bit old (Avro is now at 1.7.4), originally from here.
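On the Python side, a mapper for this setup only has to parse one JSON object per line. A minimal sketch, assuming each record has a "user" field (a made-up name) and that we emit tab-separated key/count pairs:

```python
import json


def map_lines(lines, field="user"):
    """Yield "key<TAB>1" strings, one per JSON record in `lines`."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        yield "{}\t1".format(record.get(field))
```

In the actual mapper script you would loop over map_lines(sys.stdin) and print each result.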
