Impact of Dimensionality on Data in Pictures

I am excited to announce that this is going to be my first article also published on :)

The processing of data needs to take dimensionality into account, because the usual metrics change their behaviour in subtle ways, which impacts the efficiency of algorithms and methods that are based on distances / similarities of data points. This has been tagged the “curse of dimensionality”. Yet in some cases high dimensionality can actually help us when investigating data – the “blessing of dimensionality”. In general it is good to know what’s going on, so let’s have a look at what dimensionality does to data.
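One facet of this curse is easy to demonstrate: as the number of dimensions grows, the distances from a point to its nearest and farthest neighbours become nearly indistinguishable, so distance-based methods lose their discriminating power. A quick Python simulation of this effect (the dimensions and sample size are arbitrary choices for illustration):

```python
import math
import random

def distance_contrast(dim, n_points=200, seed=42):
    """Relative gap (max - min) / min of Euclidean distances from the origin
    to random points in the unit hypercube of the given dimension."""
    rng = random.Random(seed)
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.sqrt(sum(x * x for x in point)))
    return (max(dists) - min(dists)) / min(dists)

for dim in (2, 10, 100, 1000):
    print(dim, round(distance_contrast(dim), 3))
```

The printed contrast shrinks steadily as the dimension grows: in high dimensions all points sit at almost the same distance.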

Continue reading

Titanic challenge on Kaggle with decision trees (party) and SVMs (kernlab)

The Titanic challenge on Kaggle is about inferring from a number of personal details whether a passenger survived the disaster or not. I gave two algorithms a try: decision trees using the R package party and SVMs using the R package kernlab. I chose party for the decision trees over the more prominent rpart because the authors of party make a very good case for why their approach is likely to outperform it and other approaches in terms of generalization.

Continue reading

Using the Linux Shell for Web Scraping

Let’s assume we want to scrape the “Most Popular in News” box from the BBC News website. What we need first is a CSS selector to locate what we are interested in. In this case it is simply a div tag with the ID “mostPopular”, and you can figure this out using the Developer Tools of your favorite web browser. And now we are going to apply a chain of command line tools – each feeding its output to the next tool (that is called piping, by the way) – and in the end we have a laid-out text representation of the box’s content:
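Such a chain could look roughly like this. This is a sketch, not necessarily the article’s actual command line: the URL is assumed, and sed is a crude stand-in for a dedicated CSS-selector tool.

```shell
# Fetch the page, cut out the div with id "mostPopular", strip the tags.
# A sed range like this only works if the closing </div> sits on its own line.
curl -s 'https://www.bbc.com/news' \
  | sed -n '/<div id="mostPopular"/,/<\/div>/p' \
  | sed -e 's/<[^>]*>/ /g' -e 's/  */ /g'
```

Each `|` hands the previous tool’s output to the next one; swapping any stage (say, a proper HTML parser for the sed extraction) leaves the rest of the chain untouched.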

Continue reading

The tf-idf-Statistic For Keyword Extraction

The tf-idf statistic (“term frequency – inverse document frequency”) is a common tool for extracting keywords from a document by considering not just that single document but all documents of the corpus. In terms of tf-idf, a word is important for a specific document if it shows up relatively often within that document and rarely in other documents of the corpus. I used tf-idf for extracting keywords from protocols of sessions of the German Bundestag and am quite happy with the results. Given that I was dealing with (so far) 18 documents, together containing more than one million words, which had to be aggregated for the term frequencies, then outer-joined and then fed to the formula, I was at first a bit worried about how R would perform. To my surprise the whole processing, from reading the files from disk to the final table of tf-idf values, took about 8 seconds. That’s not bad at all.
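The core of the computation is small. A language-agnostic sketch in Python using the plain tf × log(N/df) weighting (the article’s exact variant may differ, and the example documents are made up):

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one dict of tf-idf scores per document."""
    n = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()})
    return scores

docs = [["debate", "budget", "budget"], ["debate", "health"], ["debate", "budget"]]
scores = tf_idf(docs)
# "debate" occurs in every document, so its idf is log(3/3) = 0 and it is
# never proposed as a keyword; "budget" gets a positive score.
```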

Continue reading

“Digit Recognizer” Challenge on Kaggle using SVM Classification

This article is about the “Digit Recognizer” challenge on Kaggle. You are provided with two data sets: one for training, consisting of 42’000 labeled pixel vectors, and one for the final benchmark, consisting of 28’000 vectors whose labels are unknown. The vectors are of length 784 (a 28×28 matrix) with values from 0 to 255 (to be interpreted as gray values), and each is to be classified as to which digit (from 0 to 9) it represents. The classification is realized using SVMs, which I implement with the kernlab package in R.

Three representations of the data set

The pre- and post-processing in all cases consists of removing the unused (white, pixel value = 0) frame (rows and columns) of every matrix and finally scaling the computed feature vectors (feature-wise) to mean 0 and standard deviation 1. I gave three representations a try:
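The two bracketing steps, frame removal and feature-wise scaling, can be sketched in a few lines (pure Python on nested lists standing in for the 28×28 matrices; the article does this in R):

```python
def crop_white_frame(img):
    """Drop the unused white frame: leading and trailing all-zero rows/columns."""
    used_rows = [i for i, row in enumerate(img) if any(row)]
    used_cols = [j for j in range(len(img[0])) if any(row[j] for row in img)]
    r0, r1 = used_rows[0], used_rows[-1]
    c0, c1 = used_cols[0], used_cols[-1]
    return [row[c0:c1 + 1] for row in img[r0:r1 + 1]]

def standardize(feature):
    """Scale one feature column to mean 0 and standard deviation 1."""
    m = sum(feature) / len(feature)
    sd = (sum((x - m) ** 2 for x in feature) / len(feature)) ** 0.5
    return [(x - m) / sd for x in feature]

digit = [[0, 0, 0, 0],
         [0, 5, 0, 0],
         [0, 0, 7, 0],
         [0, 0, 0, 0]]
cropped = crop_white_frame(digit)   # [[5, 0], [0, 7]]
```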

Continue reading

Segmenting a Text Document using the Idea of a Cellular Automaton

The German parliament publishes protocols of each of its sessions – a lot of data waiting to be processed. The protocols are published in the form of text files and PDFs. The published text files are not to my liking, but xpdf manages to produce decent text versions from the PDFs. The layout is preserved quite well, which is good because it makes the whole journey from there more deterministic. Processing the layout, though, is not trivial, because the text flow is not trivial:

  • Most of the text – the actual parts holding the transcript – is split into two columns.
  • Lists with names are usually separated into four columns.
  • Headlines and titles occupy mostly a full line.
  • Tables can, programmatically, look similar to any of those three styles.
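Before any automaton can run over the text, each line needs a local guess at its layout type. A crude stand-in classifier in Python (the gap threshold and labels are invented for illustration; the article’s approach then refines such local guesses):

```python
import re

def classify_line(line):
    """Guess the layout type of one line of xpdf's layout-preserving text output
    from its gap structure, where a gap is a run of three or more spaces."""
    stripped = line.strip()
    if not stripped:
        return "blank"
    gaps = len(re.findall(r" {3,}", stripped))
    if gaps == 0:
        return "full-line"      # headline, title, or a single-column line
    if gaps == 1:
        return "two-column"     # the usual transcript layout
    return "multi-column"       # e.g. four-column name lists
```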

Continue reading

Tool for visualization of connections between agents and entities in the context of Redtubegate

Early in December 2013 a law firm began to send out approximately 10 to 40 thousand cease-and-desist letters on behalf of the rights holder of a bunch of porn flicks, for streaming those films on Redtube. So far, so good. Now a lot of people didn’t like receiving bills ranging from 250 to more than a thousand euros for streaming erotica just before Christmas, especially when being pretty sure that they didn’t even do so. Given the magnitude of this case, a lot of these people turned sour and started to dig a bit deeper. And what was brought to light is a shady network of companies with links where there should be none, and a bunch of business partners who likewise turned out to have more in common than could be seen at first glance.

Continue reading

“Learning From Data” by Yaser Abu-Mostafa (Caltech) on edX

To deepen my knowledge about Machine Learning I decided last year to attend “Learning From Data” on edX. This online course was designed by Yaser Abu-Mostafa – a renowned expert on the subject and professor of Electrical Engineering and Computer Science at the California Institute of Technology (Caltech). I can say without the slightest hesitation that this course was a wonderful intellectual experience. Prof. Abu-Mostafa conceived the course so skilfully that it was as much a joy to attend as it was challenging. And that is by no means a given, especially since the syllabus took a path through quite theoretical terrain.

Continue reading

Pivoting Data in R Excel-style

(This article refers to an initial proof-of-concept version of r-big-pivot)

I have to admit that I very much enjoy pivoting through data using Excel. Its pivoting tool is great for getting a quick insight into a data set’s structure and for discovering interesting anomalies (the sudden rise of deaths due to viral hepatitis serves as a nice example). Unfortunately Excel itself is handicapped by several restrictions:

  • a maximum of 1’048’576 (about one million) rows per data set (which is crucial because the data needs to be formatted long)
  • Excel is slow and often unstable when you go beyond 100’000 rows
  • cumbersome integration into available data infrastructures
  • restricted charting and analysis capabilities
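What a pivot does under the hood is plain aggregation of long-format records over two key columns. A minimal Python sketch of the idea (column names and data are invented for illustration):

```python
from collections import defaultdict

def pivot(records, row_key, col_key, value_key):
    """Aggregate long-format records into a nested dict: row -> column -> sum."""
    table = defaultdict(lambda: defaultdict(float))
    for rec in records:
        table[rec[row_key]][rec[col_key]] += rec[value_key]
    return {row: dict(cols) for row, cols in table.items()}

deaths = [
    {"year": 2001, "cause": "viral hepatitis", "n": 12},
    {"year": 2001, "cause": "viral hepatitis", "n": 5},
    {"year": 2002, "cause": "viral hepatitis", "n": 40},
]
result = pivot(deaths, "year", "cause", "n")
# result[2001]["viral hepatitis"] -> 17.0
```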

Continue reading

Social Network Analysis by Lada Adamic on coursera

Did you know that top researchers and universities from all over the world offer their knowledge in structured and partly certified online courses? Well, now you know! Those courses are referred to as MOOCs, short for “Massive Open Online Courses”, and they are for me one of THE digital discoveries of the year 2013. The three biggest platforms are currently edX, coursera and Udacity. I am following courses on all three of them and I sometimes really can’t believe how awesome this opportunity is.

Continue reading

Pentaho Kettle 5 comes with improved database repository performance

Two weeks ago I wrote about Kettle 4.4 and its database repository, and how working with it is truly no fun due to the excessive latency involved in loading and saving jobs and transformations. Logging the queries sent to a MySQL database made the reason obvious: a tornado of thousands of database commands and replies sent back and forth.

Did Kettle 5 learn something new?

Continue reading

An intuitive interpretation of the beta distribution

First of all, this text is not just about an intuitive perspective on the beta distribution but at least as much about the idea of looking behind a measured empirical probability and thinking of it as a product of chance itself. Credits go to David Robinson for approaching the subject from a baseball angle and to John D. Cook for establishing the connection to Bayesian statistics. In this article I want to add a simulation that stochastically demonstrates the interpretation.

Ratings of sellers on an e-commerce platform

Let’s assume you would like to buy a product and can choose between two sellers, A and B. The only relevant difference is their ratings, which a customer can give to a seller after a purchase, choosing between ‘good’ and ‘bad’. Seller A has received 10 ratings of which 9 (90%) are ‘good’; seller B has so far received 500 ratings of which 400 (80%) happen to be ‘good’.
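The question of which seller is actually better can be made concrete with a small simulation: treat each seller’s unknown true rate of good ratings as a draw from the Beta(good + 1, bad + 1) posterior implied by a uniform prior, and count how often A beats B. A Python sketch of that idea:

```python
import random

def p_a_better(good_a, bad_a, good_b, bad_b, n=50_000, seed=1):
    """Estimate P(true rate of A > true rate of B) by sampling each seller's
    rate from the Beta(good + 1, bad + 1) posterior of a uniform prior."""
    rng = random.Random(seed)
    wins = sum(
        rng.betavariate(good_a + 1, bad_a + 1) > rng.betavariate(good_b + 1, bad_b + 1)
        for _ in range(n)
    )
    return wins / n

# Seller A: 9 of 10 good; seller B: 400 of 500 good.
print(p_a_better(9, 1, 400, 100))
```

A’s 90% looks better than B’s 80%, but with only 10 ratings its posterior is wide, so the estimated probability stays well below certainty.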

Continue reading

Pentaho Kettle 4.4 database repository under the hood

We are using Pentaho Kettle 4.4 for realizing ETL processes, and to make the involved jobs centrally accessible they are stored in a database repository. Given that several proxies separate my VPN access from the database holding the repo, some latency was to be expected and in the beginning not too bothersome. But the time involved in loading and storing jobs and transformations grew in proportion to the complexity of the processes, until a point was reached where it was no longer acceptable (up to a minute for storing an edited job, for example) and a solution had to be found.

Continue reading

Parallel computing in R on Windows using doSNOW and foreach

"Sarah and Patti Chasing Me 2" by Martin Male [CC BY 2.0]


Who would’ve thought that parallel processing in R is THAT simple? Thanks to a collection of packages, having a task use all available cores is a cinch. I call an Intel i7-3632QM my own – which means 4 physical cores, each providing 2 virtual cores, running at around 3 GHz. So let’s have a look at how this feat is accomplished.

The method of parallelization applied here works by starting a specified number of regular R processes and having those run on the available cores. The launching and stopping of those processes, as well as the assignment of tasks, is controlled from the original R process; the assignment of the processes to cores is handled by the OS. Because this method can be extended to cores beyond one CPU or even beyond one computer, the packages use the acronym SNOW, for “Simple Network Of Workstations”.
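This master/worker pattern looks much the same in any language. For illustration, a sketch of the same idea in Python’s multiprocessing module (the task function is a made-up placeholder for an expensive computation):

```python
from multiprocessing import Pool

def task(x):
    """Stand-in for an expensive, independent unit of work."""
    return x * x

if __name__ == "__main__":
    # The master process launches the worker processes and hands out the tasks;
    # which core each worker runs on is left to the OS.
    with Pool(processes=4) as pool:
        print(pool.map(task, range(10)))
```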

Continue reading

Comparison of String Distance Algorithms


For the visualization of votings in the Bundestag I had to read in handwritten protocols of the sessions. These are unfortunately studded with typos, which is why I had to deal with different versions of one name. Because I wanted a quick solution and the effort was reasonable, I just took care of it manually. But for larger projects this approach would not be feasible. So I had a look at what R offers for fuzzy string matching beyond good ol’ Levenshtein distance and came across a rather new package answering to the name of stringdist, maintained by Mark van der Loo. To my pleasant surprise it offers not two, not three, but a whole variety of configurable algorithms for that purpose. But I had no idea what, for example, the effective difference between a Jaccard distance and a cosine distance is. So I played around with them a bit and finally came up with the idea of something like a slope graph showing the distances for alterations of one string – in this case “Cosmo Kramer” – just to get started and to get an idea of what’s going on and how the different algorithms are affected by certain alterations.
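To make the difference concrete, here is what two of those measures actually compute, sketched in Python (stringdist’s implementations are configurable, e.g. in the q-gram size, so treat these as the plain textbook variants):

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def jaccard_distance(a, b, q=2):
    """1 minus the Jaccard similarity of the two strings' sets of q-grams."""
    qa = {a[i:i + q] for i in range(len(a) - q + 1)}
    qb = {b[i:i + q] for i in range(len(b) - q + 1)}
    return 1 - len(qa & qb) / len(qa | qb)

print(levenshtein("Cosmo Kramer", "Kosmo Kramer"))                  # 1
print(round(jaccard_distance("Cosmo Kramer", "Kosmo Kramer"), 3))   # 0.167
```

The edit distance counts single-character operations, while the q-gram-based Jaccard distance only cares about which character pairs occur at all, which is why the two react quite differently to, say, transpositions versus substitutions.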

Continue reading