If it won't be simple, it simply won't be. [source code] by Miki Tebeka, CEO, 353Solutions

Thursday, September 20, 2012

Data Wrangling With Python

I just gave a talk at work called "Data Wrangling With Python", which gives an overview of the scientific Python ecosystem. You can view it here.

Friday, September 14, 2012

Using Hadoop Streaming With Avro

One of the ways to use Python with Hadoop is via Hadoop Streaming. However, it's geared mostly toward text-based formats, and at work we mostly use Avro.

It took me a while to figure out the magic, but here it is. Note that the input to the mapper is one JSON object per line.

Note that it's a bit old (Avro is now at 1.7.4); it's originally from here.
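
For the mapper side, a minimal sketch (not the original script) might look like the following. It only assumes what is stated above: each line on stdin is one JSON object. The `url` field and the `map_lines` and `main` helpers are hypothetical names picked for illustration.

```python
"""Sketch of a Hadoop Streaming mapper fed one JSON object per line."""
import json
import sys


def map_lines(lines, field="url"):
    """Yield 'key<TAB>1' for each record that has `field` (a made-up field name)."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        key = record.get(field)
        if key is not None:
            yield "%s\t1" % key


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Read records from stdin and write tab-separated key/value pairs."""
    for out in map_lines(stdin):
        stdout.write(out + "\n")

# A real mapper script would end with a call to main();
# Hadoop Streaming pipes the records to the script's stdin.
```

Keeping the parsing in `map_lines` means you can try it on a plain list of strings before running it on the cluster.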

Friday, September 07, 2012

Setting Up Machine Learning on OSX

Setting up machine learning tools (numpy, scipy, matplotlib, scikit-learn, ...) can be a pain (why can't they just use a decent OS? :).

We are lucky to have Ben Kim now with us at Adconion, and he posted the following:


Mac OS X Lion Software Installs
  1. Install compilers
    1. Install XCode 4.x from the App Store
      1. Install Command Line Tools in Preferences/Download
    2. Install gcc, g++, and gfortran compilers
      1. Download tar file
      2. Extract to /
        1. tar -xvf abc.tar -C /
    3. Reference http://sites.google.com/site/dwhipp/tutorials/mac_compilers
  2. Install Homebrew
    1. Run the install command using ruby
      1. ruby <(curl -fsSkL raw.github.com/mxcl/homebrew/go)
    2. brew doctor
      1. chown /usr/local folders listed
      2. Place /usr/local/bin before /usr/bin in path
    3. Reference https://github.com/mxcl/homebrew/wiki/installation
  3. Install python using brew
    1. brew install readline sqlite gdbm pkg-config
    2. brew install python
    3. Note: Mac OS X Lion comes with an old version of python, 2.7.1 (check with python --version)
  4. Set PATH in .bash_profile
    1. vim ~/.bash_profile
      1. export PATH=/usr/local/share/python:/usr/local/bin:$PATH
  5. Create symlinks
    1. Within /(System/)?Library/Frameworks/Python.framework/Versions, sudo rm Current
    2. Within the above directories, ln -s /usr/local/Cellar/python/2.7.3 Current
  6. Install pip, if necessary, using easy_install
    1. sudo easy_install pip
  7. Using pip (sudo pip install [--upgrade] abc)
    1. Install nose
    2. Install numpy
    3. Install scipy with environment variable settings
      1. sudo CC=clang CXX=clang FFLAGS=-ff2c pip install [--upgrade] scipy
    4. Install scikit-learn
    5. Install pandas
  8. Install matplotlib
    1. Download source from repo: https://github.com/matplotlib/matplotlib
    2. cd ~/Downloads/matplotlib-*
    3. python setup.py build
    4. python setup.py install
  9. Install VW (Vowpal Wabbit)
    1. Install boost
      1. Download tar file
      2. mv boost extracted folder to /usr/local
      3. export BOOST_ROOT environment variable
      4. cd to boost directory
      5. make and install
        1. sudo ./bootstrap.sh
        2. sudo ./bjam install
      6. Download bjam
      7. mv to directory in PATH
        1. mv bjam /usr/local/bin
      8. Set bjam toolset to darwin
        1. bjam toolset=darwin stage
      9. Reference http://www.boost.org/doc/libs/1_41_0/more/getting_started/unix-variants.html#expected-build-output
    2. cd to VW directory
      1. make and test
        1. make
        2. make test
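
Once the steps above are done, a quick sanity check can save some head scratching. The script below is a rough sketch and not part of Ben's instructions: it tries to import each package from the list and prints its version, and it checks that /usr/local/bin comes before /usr/bin in PATH. The helper names (`installed_versions`, `comes_before`, `report`) are made up for this example, and `sklearn` is the import name for scikit-learn.

```python
import importlib
import os


def installed_versions(names):
    """Map each module name to its __version__, or None if it fails to import."""
    versions = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except Exception:  # any import failure counts as "not usable"
            versions[name] = None
        else:
            versions[name] = getattr(module, "__version__", "unknown")
    return versions


def comes_before(first, second, path=None):
    """True if directory `first` precedes `second` in a PATH-style string."""
    if path is None:
        path = os.environ.get("PATH", "")
    entries = path.split(os.pathsep)
    if first not in entries:
        return False
    if second not in entries:
        return True
    return entries.index(first) < entries.index(second)


def report():
    """Print package versions and the PATH ordering check."""
    packages = ["nose", "numpy", "scipy", "sklearn", "pandas", "matplotlib"]
    for name, version in sorted(installed_versions(packages).items()):
        print("%-12s %s" % (name, version or "MISSING"))
    ok = comes_before("/usr/local/bin", "/usr/bin")
    print("/usr/local/bin before /usr/bin: %s" % ok)


if __name__ == "__main__":
    report()
```

Any package reported as MISSING points back to the matching pip step in the list.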
