LOCF and Linear Imputation with PostgreSQL

This tutorial introduces various tools offered by PostgreSQL, and SQL in general, such as custom functions, window functions, aggregate functions and the WITH clause (or CTE, for Common Table Expression). The goal is to implement a program which imputes numeric observations within a column, applying linear interpolation where possible and forward (LOCF) and backward padding where necessary. I’m going to add and explain those constructs step by step, so no problem if you are new to the scene. I am very much interested in input regarding potential downsides of the implementation and possible improvements.
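
The imputation rule itself can be sketched outside the database. The following is a minimal Python illustration, not the PostgreSQL implementation the tutorial develops; the function name `impute` and the use of `None` for missing observations are assumptions made for this example. Gaps between two known values are filled by linear interpolation, while gaps at either end are padded with the nearest known value.

```python
from typing import List, Optional


def impute(values: List[Optional[float]]) -> List[Optional[float]]:
    """Fill None entries: interpolate linearly between known neighbours,
    pad leading and trailing gaps with the nearest known value.
    (Hypothetical helper, assuming equally spaced observations.)"""
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        # Nothing to interpolate from; return an unchanged copy.
        return list(values)

    out = list(values)
    first, last = known[0], known[-1]

    # Backward padding: leading gap takes the first known value.
    for i in range(first):
        out[i] = values[first]

    # Forward padding (LOCF): trailing gap takes the last known value.
    for i in range(last + 1, len(values)):
        out[i] = values[last]

    # Linear interpolation inside each gap between two known values.
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            out[i] = values[left] + step * (i - left)

    return out


series = [None, 1.0, None, None, 4.0, None]
print(impute(series))
```

The sketch assumes equally spaced observations; with irregular timestamps the interpolation weight would have to be derived from the time differences instead of the row positions.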

Continue reading

Illustrated Guide to ROC and AUC

(In a past job interview I failed at explaining how to calculate and interpret ROC curves, so here goes my attempt to fill this knowledge gap.) Think of a regression model mapping a number of features onto a real number (potentially a probability). The resulting real number can then be mapped onto one of two classes, depending on whether the predicted number is greater or lower than some choosable threshold. Let’s take for example a logistic regression and data on who survived the Titanic accident to introduce the relevant concepts which will lead naturally to the ROC (Receiver Operating Characteristic) and its AUC or AUROC (Area Under ROC Curve).
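
As a rough illustration of the thresholding idea, the sketch below computes ROC points and the area under them in plain Python. The function names `roc_points` and `auc` and the toy scores are assumptions for this example and are not taken from the article; the sketch also assumes that both classes occur at least once in the labels.

```python
from typing import List, Tuple


def roc_points(scores: List[float], labels: List[int]) -> List[Tuple[float, float]]:
    """Return (false positive rate, true positive rate) pairs, one per
    distinct threshold, starting at (0, 0). Labels are 1 (positive) or 0."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    # Lower the threshold step by step; a case is predicted positive
    # when its score is at least the threshold.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points: List[Tuple[float, float]]) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


scores = [0.1, 0.4, 0.35, 0.8]   # predicted probabilities (toy data)
labels = [0, 0, 1, 1]            # true classes
print(auc(roc_points(scores, labels)))
```

Each distinct score serves as one threshold, so the curve has at most one point per distinct predicted value plus the origin.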

Continue reading

Neural Nets with Caffe Utilizing the GPU

Caffe is an open-source deep learning framework originally created by Yangqing Jia which allows you to leverage your GPU for training neural networks. As opposed to other deep learning frameworks like Theano or Torch you don’t have to program the algorithms yourself; instead you specify your network by means of configuration files. Obviously this approach is less time consuming than programming everything on your own, but it also forces you to stay within the boundaries of the framework. Practically though this won’t matter most of the time, as the framework Caffe provides is quite powerful and continuously being developed.

Continue reading

GPU Powered DeepLearning with NVIDIA DIGITS on EC2

In this tutorial I am going to show you how to set up CUDA 7, cuDNN, Caffe and DIGITS on a g2.2xlarge EC2 instance (running Ubuntu 14.04 64 bit) and how to get started with DIGITS. To illustrate the use of DIGITS I take a current Kaggle competition about detecting diabetic retinopathy and its state from fluorescein angiography.

Convolutional Deep Neural Networks for Image Classification

For classification or regression on images you have two choices:

  • Feature engineering, and on that basis translating an image into a vector
  • Relying on a convolutional DNN to figure out the features

Continue reading

Guide to EC2 from the Command Line

This tutorial aims at guiding your first steps at controlling your EC2 instances from the command line. It is by no means even remotely complete but it will give you an impression of the basic structure and concepts, so you can quickly fill in the gaps for your personal use case. The tutorial starts with setting up your account and builds a bridge from requesting a Spot instance, through exchanging files with it and hooking up additional storage, to finally terminating it. I am not going to explain the AWS web console, though; we’ll only resort to it for some initial configuration. As usual the target audience is Linux users, but the AWS CLI tools are pretty much identical for Windows.

Continue reading

A Guide on OCR with tesseract 3.03

Tesseract is tough … so tough indeed, even Chuck Norris would have to check the manual twice. Not kidding you. Okay, so this article aims at structuring what I needed to learn about tesseract to OCR-convert PDFs to text and how to train tesseract for application to new fonts. Let me dampen your expectations: you *will* have to read further texts (esp. the official documentation) to actually perform successful training! This text describes the usage of tesseract 3.03 RC on Ubuntu 14.04. Tesseract is also available for other Linuxes and Windows, and the workflow will be mostly the same across OSes, though of course some commands I use are specific to Ubuntu. Also mind that tesseract 3.03 is considerably different from 3.02, which in turn differs from 3.01; the changes are partially more fundamental than what you might expect from the version numbers.

Continue reading

Introduction to OpenCPU for R on EC2 with Python

OpenCPU is (simply put) a server implementing a RESTful web API for remotely executing R functions and retrieving results. In this tutorial I am going to showcase how OpenCPU can be installed on an EC2 instance running Ubuntu 14.04. Python and its requests package come into play for the purpose of conveniently handling HTTP communication. First and foremost thanks to the effort Jeroen Ooms put into developing OpenCPU and composing its documentation, the whole process is comparatively easy and painless.

Continue reading

OAuth 2.0 for Google (Analytics) API with Python Explained

In this tutorial I am going to explain how OAuth 2.0 works and how to apply it for interacting with Google Analytics API using Python. Google provides for that purpose a Python package, which so far only supports Python 2 though … well.

OAuth2 seems to be quite a mess at first and Google’s documentation on this subject is not that well organized in my opinion. So with this article I do my best to save you the sweat I had to invest. After all it’s not that complicated anyway, as you will probably agree.
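
To give a first impression of the protocol, here is a minimal sketch of the first step of the OAuth 2.0 authorization code flow: building the URL the user is sent to in order to grant access. It uses only the Python 3 standard library rather than Google’s package. The helper name `build_authorization_url`, the client ID, the redirect URI and the state value are placeholders assumed for this example, and the endpoint and scope should be checked against Google’s current documentation.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Google's OAuth 2.0 authorization endpoint (verify against current docs).
AUTH_ENDPOINT = "https://accounts.google.com/o/oauth2/v2/auth"
# Read-only scope for the Google Analytics API.
ANALYTICS_SCOPE = "https://www.googleapis.com/auth/analytics.readonly"


def build_authorization_url(client_id: str, redirect_uri: str,
                            scope: str, state: str) -> str:
    """Hypothetical helper: compose the URL for step 1 of the
    authorization code flow (user consent)."""
    params = {
        "client_id": client_id,        # issued when registering the app
        "redirect_uri": redirect_uri,  # where the auth code is sent back
        "response_type": "code",       # ask for an authorization code
        "scope": scope,                # what the app wants to access
        "state": state,                # opaque value to guard against CSRF
    }
    return AUTH_ENDPOINT + "?" + urlencode(params)


url = build_authorization_url(
    client_id="YOUR_CLIENT_ID",              # placeholder
    redirect_uri="http://localhost:8080/",   # placeholder
    scope=ANALYTICS_SCOPE,
    state="random-state-token",              # placeholder
)
print(url)
```

After the user consents, the authorization server redirects to the given URI with a code, which the application then exchanges for an access token in a second request.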

Continue reading