Let’s assume we want to scrape the “Most Popular in News” box from bbc.co.uk. What we need first is a CSS selector to locate what we are interested in. In this case it is simply a div tag with the ID “mostPopular” and you can figure this out using the Developer Tools of your favorite web browser. And now we are going to apply a chain of command line tools – each feeding their output to the next tool (that is called piping btw) and in the end we have a layouted text representation of the box’ content:
The tf-idf-statistic (“term frequency – inverse document frequency”) is a common tool for the purpose of extracting keywords from a document by not just considering a single document but all documents from the corpus. In terms of tf-idf a word is important for a specific document if it shows up relatively often within that document and rarely in other documents of the corpus. I used tf-idf for extracting keywords from protocols of sessions of the German Bundestag and am quite happy with the results. Given that I was dealing with (so far) 18 documents, together containing more than one million words which would have to be aggregated for the term frequency, then outer joined and then fed to the formula I was first a bit worried about how R would perform. To my surprise the whole processing from reading the files from disk to the final table of tf-idf-values took about 8 seconds. That’s not bad at all.
This article is about the “Digit Recognizer” challenge on Kaggle. You are provided with two data sets. One for training: consisting of 42’000 labeled pixel vectors and one for the final benchmark: consisting of 28’000 vectors while labels are not known. The vectors are of length 784 (28×28 matrix) with values from 0 to 255 (to be interpreted as gray values) and are supposed to be classified as to what number (from 0 to 9) it represents. The classification is realized using SVMs which I implement with kernlab package in R.
Three representation of the data set
The pre- and post-processing in all cases consists of removing the unused (white – pixel value = 0) frame (rows and columns) of every matrix and finally to scale the computed feature vectors (feature-wise) to mean 0 and standard deviation 1. I gave three representations a try:
The German parliament publishes protocols for each of their sessions. A lot of data waiting to be processed. The protocols are published in the form of text files and PDFs. The published text files are not of my liking but xpdf manages to produce decent text versions from the PDFs. The layout is preserved quite well, which is good because it makes the whole journey from there more deterministic. Processing the layout though is not trivial because the text flow is not trivial.
- Most of the text – the actual parts holding the transcript – is split into two columns.
- Lists with names are usually separated into four columns.
- Headlines and titles occupy mostly a full line.
- Tables can look programmatically similar to all of those three styles.
Early in December 2013 a lawfirm began to send out approximately 10 to 40 thousand cease-and-desist letters on behalf of the rightholder of a bunch of porn flicks for streaming those films on redtube. So far, so good. Now a lot of people didn’t like to receive bills ranging from 250 to more than a thousand Euro for streaming erotica just before christmas especially when being pretty sure that they didn’t even do so. Now given the magnitude of this case a lot of these people turned sour and started to dig a bit deeper. And what was brought to light is a shady network of companies with links where there should be none and a bunch of business partners who as well turned out to have more in common than what was to be seen at first glance.
To deepen my knowledge about Machine Learning I decided last year to attend “Learning From Data” on edX. This online course was designed by Yaser Abu-Mostafa – a renowned expert on the subject and professor of Electrical Engineering and Computer Science at California Institute of Technology (Caltech). I can say without the slightest hesitation that this course was a wonderful intellectual experience. Prof. Abu-Mostafa conceived the course so skilfully that it was as much a joy to attend, as it was challenging. And this finding couldn’t be further from a naturalness, especially given that the syllabus took a path through quite theoretical terrain.
(This article is referring to an initial proof-of-concept version of r-big-pivot)
I have to admit that I very much enjoy pivoting through data using Excel. Its pivoting tool is great for getting a quick insight into a data set’s structure and for discovering interesting anomalies (the sudden rise of deaths due to viral hepatitis serves as a nice example). Unfortunately Excel itself is handicapped by several restrictions:
- a maximum number of one point something million rows per data set (which is crucial because the data needs to be formatted long)
- Excel is slow and often instable when you go beyond 100’000 rows
- cumbersome integration into available data infrastructures
- restricted charting and analysis capabilities
Did you know that top researchers and universties from all over the world offer their knowledge in structured and partly certified online courses? Well now you know! Those coures are refered to as MOOC which stands for “Massive Open Online Courses” and is for me one of THE digital discoveries of the year 2013. The three biggest platforms are currently edX, coursera and Udacity. I am following courses on all of those three and I really sometimes can’t believe how awesome this opportunity is.
Two weeks ago I wrote about Kettle 4.4 and its database repository and how working with it is truly no fun due to excessive latency connected with loading and saving of jobs and transformations. Logging the queries sent to a MySQL database made it obvious that the reason is a tornado of thousands of database commands and replies sent back and forth.
Did Kettle 5 learn something new?
First of all this text is not just about an intuitive perspective on the beta distribution but at least as much about the idea of looking behind a measured empirical probability and thinking of it as a product of chance itself. Credits go to David Robinson for approaching the subject from a baseball angle and to John D. Cook for establishing the connection to Bayesian statistics. In this article I want to add a simulation which stochastically prooves the interpretation.
Ratings of sellers on an e-commerce platform
Let’s assume you would like to buy a product and you can choose from two sellers A and B. The only relevant difference is the ratings – which a customer can give to a seller after a purchase choosing between ‘good’ and ‘bad’. Seller A received 10 ratings of which 9 are good and seller B received so far 500 ratings of which 400 (80%) happen to be ‘good’.
We are using Pentaho Kettle 4.4 for realizing ETL processes and to make the involved jobs centrally accessible those are stored in a database repository. Given that several proxies separate my VPN access to the database holding the repo a latency was to be expected and in the beginning not too bothersome. But proportional to the complexity of the processes grew the time involved in loading and storing of jobs and transformations until a point was reached where this was no longer acceptable (up to a minute for storing an edited job f.x.) and a solution to this had to be found.
Who would’ve thought that parallel processing in R is THAT simple? Thanks to a collection of packages having a task use all available cores is a cinch. I call an Intel i7-3632QM my own – which means 4 physical cores each providing 2 virtual cores running at something around 3 GHz. So let’s have a look at how this feat is accomplished.
The method of parallelization applied here works by starting a specified number of regular R processes and have those run on the available cores. The launch, stop of the those processes and the assignment of tasks is controlled from the original R process. The assignment of those processes to the cores is controlled by the OS. Because this method can be extended to cores beyond one CPU or computer the packages use the acronym SNOW for “Simple Network Of Workstations”.
For the visualization of votings in the Bundestag I had to read in handwritten protocols of the sessions. These are unfortunately studded with typos, which is why I had to deal with different versions of one name. Because I wanted a quick solution and the effort was reasonable I just took care of it manually. But for larger projects this approach would not be feasible. So I had a look at what R would offer me for fuzzy string matching beyond good ol’ Levenshtein distance and came across a rather new package answering to the name of “stringdist” maintained by Mark van der Loo. To my pleasant surprise it offers not two, not three , but a variety of configurable algorithms for that purpose. But I have no idea what is for example the effective difference between a Jaccard distance and a cosine distance. So I played around a bit with them and finally came up with the idea of something like a slope graph showing the distances for alternations of one string – in this case “Cosmo Kramer” – just to get started and an idea about what’s going on and how different algorithms are affected by certain alternations.
Why (a) PCA?
A principal component analysis is a way to reduce dimensionality of a data set consisting of numeric vectors to a lower dimensionality. Then it is possible to visualize the data set in three or less dimensions. Have a look at this use case. I’ll try to explain the motivation using a simple example.
Think of a very flat square (e.g. 1x1x.05) in a three dimensional space. What you see of this cuboid when looking at it from a specific angle while assuming perspectivic visual indicators away is basically a 2 dimensional projection of it. From one angle it looks like a rhombus, looking straight from the edge side all you see is a line-like shape. So obviously the representation offering the most information about this cuboid is the one you see when looking at it perpendicular to the main surface. Then you see a square. Given that the square is very flat you have lost information (two separate points might lie exactly on top of each other now) – but as little as possible for a projection in that case. The same idea works analogously for higher dimensions. Just that then you wont’t be able to look at the real spacial arrangement and make a conscious decision about how to project it into a lower-dimensional space while keeping the relevant information. This is where PCA comes into play because as you will see this projection is often calculateable. I will exemplify this concept using a flat ellipsoid. The first part assumes basic knowledgle in linear alebra and statistics.
(Attention: The calculations and analysis are not biased by my political views – but the interpretation of the results might be and their verbal formulation certainly is … ;)
About a week ago I came across an article titled “How divided is the Senate?” by Vik Paruchuri where he uses a method called principal component analysis (PCA) to visualize the closeness of votings given by senators of the 113th Congress of the USA. I immediately fell in love with the idea behind this article as well as the method applied – which was a great opportunity to revise some statistics and alebra basics. And because (pretending) transparency is a major foundation of a modern democracy, full detailed word by word protocols of every meeting of the Bundestag are published as PDFs and text files on their website. So I downloaded all those protocols for the 17th Bundestag, extracted the votings and loaded the votes into a data frame. That was quite a drag because judging from typos (Sevim Dadelen, Sevim Dagelen, Sevim Dagdelen, …), different name versions (Erwin Josef Rüddel, Erwin Rüddel) and line breaks within the longer names like Dr. Karl-Theodor Freiherr von und zu Guttenberg (his title is gone, so the name became a tad handier by now) those text files where manually sanitized PDF convertions of live transcripts. I’ll spare you the details – but getting the data finally right took quite some effort.