For the previous article I needed a quick way to figure out if two sets of points are linearly separable. But for crying out loud I could not find a simple and efficient implementation for this task. Except for the perceptron and SVM – both are sub-optimal when you just want to test for linear separability. The perceptron is guaranteed to finish off with a happy ending – if feasible – but it can take quite a while. And SVMs are designed for soft-margin classification which means that they might settle for a not separating plane and also they maximize the margin around the separating plane – which is wasted computational effort.
The efficient way to get the job done is by applying linear programming (LP). That means representing the question “Is it possible to fit a hyper-plane between two sets of points?” with a number of inequalities (that make up a convex area). I’m going to give a quick walk through for the math to make the idea plausible – but this text is more describing an introductory example and not an introduction to LP itself. For solving the linear program I will use Rglpk which provides a high level interface to the GNU Linear Programming Kit (GLPK) – and of course has been co-crafted by the man himself – Kurt Hornik – who is also involved with kernlab and party – thank you, Prof. Hornik and keep up the good work!
I am excited to announce that this is supposed to be my first article published also on r-bloggers.com :)
The processing of data needs to take dimensionality into account as usual metrics change their behaviour in subtle ways, which impacts the efficiency of algorithms and methods that are based on distances / similarities of data points. This has been tagged the “curse of dimensionality“. Just as well in some cases high dimensionality can make help us when investigating data – “blessing of dimensionality”. In general it is good to know what’s going on and so let’s have a look at what dimensionality does to data.
The Titanic challenge on Kaggle is about inferring from a number of personal details whether a passenger survived the disaster or did not. I gave two algorithms a try, which are decision trees using R package party and SVMs using R package kernlab. I chose to use party for the decision trees over the more prominent rpart because the authors of party make a very good point why their approach is likely to outperform it and other approaches in terms of generalization.
Let’s assume we want to scrape the “Most Popular in News” box from bbc.co.uk. What we need first is a CSS selector to locate what we are interested in. In this case it is simply a div tag with the ID “mostPopular” and you can figure this out using the Developer Tools of your favorite web browser. And now we are going to apply a chain of command line tools – each feeding their output to the next tool (that is called piping btw) and in the end we have a layouted text representation of the box’ content:
The tf-idf-statistic (“term frequency – inverse document frequency”) is a common tool for the purpose of extracting keywords from a document by not just considering a single document but all documents from the corpus. In terms of tf-idf a word is important for a specific document if it shows up relatively often within that document and rarely in other documents of the corpus. I used tf-idf for extracting keywords from protocols of sessions of the German Bundestag and am quite happy with the results. Given that I was dealing with (so far) 18 documents, together containing more than one million words which would have to be aggregated for the term frequency, then outer joined and then fed to the formula I was first a bit worried about how R would perform. To my surprise the whole processing from reading the files from disk to the final table of tf-idf-values took about 8 seconds. That’s not bad at all.
This article is about the “Digit Recognizer” challenge on Kaggle. You are provided with two data sets. One for training: consisting of 42’000 labeled pixel vectors and one for the final benchmark: consisting of 28’000 vectors while labels are not known. The vectors are of length 784 (28×28 matrix) with values from 0 to 255 (to be interpreted as gray values) and are supposed to be classified as to what number (from 0 to 9) it represents. The classification is realized using SVMs which I implement with kernlab package in R.
Three representation of the data set
The pre- and post-processing in all cases consists of removing the unused (white – pixel value = 0) frame (rows and columns) of every matrix and finally to scale the computed feature vectors (feature-wise) to mean 0 and standard deviation 1. I gave three representations a try:
The German parliament publishes protocols for each of their sessions. A lot of data waiting to be processed. The protocols are published in the form of text files and PDFs. The published text files are not of my liking but xpdf manages to produce decent text versions from the PDFs. The layout is preserved quite well, which is good because it makes the whole journey from there more deterministic. Processing the layout though is not trivial because the text flow is not trivial.
- Most of the text – the actual parts holding the transcript – is split into two columns.
- Lists with names are usually separated into four columns.
- Headlines and titles occupy mostly a full line.
- Tables can look programmatically similar to all of those three styles.
Early in December 2013 a lawfirm began to send out approximately 10 to 40 thousand cease-and-desist letters on behalf of the rightholder of a bunch of porn flicks for streaming those films on redtube. So far, so good. Now a lot of people didn’t like to receive bills ranging from 250 to more than a thousand Euro for streaming erotica just before christmas especially when being pretty sure that they didn’t even do so. Now given the magnitude of this case a lot of these people turned sour and started to dig a bit deeper. And what was brought to light is a shady network of companies with links where there should be none and a bunch of business partners who as well turned out to have more in common than what was to be seen at first glance.
To deepen my knowledge about Machine Learning I decided last year to attend “Learning From Data” on edX. This online course was designed by Yaser Abu-Mostafa – a renowned expert on the subject and professor of Electrical Engineering and Computer Science at California Institute of Technology (Caltech). I can say without the slightest hesitation that this course was a wonderful intellectual experience. Prof. Abu-Mostafa conceived the course so skilfully that it was as much a joy to attend, as it was challenging. And this finding couldn’t be further from a naturalness, especially given that the syllabus took a path through quite theoretical terrain.
(This article is referring to an initial proof-of-concept version of r-big-pivot)
I have to admit that I very much enjoy pivoting through data using Excel. Its pivoting tool is great for getting a quick insight into a data set’s structure and for discovering interesting anomalies (the sudden rise of deaths due to viral hepatitis serves as a nice example). Unfortunately Excel itself is handicapped by several restrictions:
- a maximum number of one point something million rows per data set (which is crucial because the data needs to be formatted long)
- Excel is slow and often instable when you go beyond 100’000 rows
- cumbersome integration into available data infrastructures
- restricted charting and analysis capabilities
Did you know that top researchers and universties from all over the world offer their knowledge in structured and partly certified online courses? Well now you know! Those coures are refered to as MOOC which stands for “Massive Open Online Courses” and is for me one of THE digital discoveries of the year 2013. The three biggest platforms are currently edX, coursera and Udacity. I am following courses on all of those three and I really sometimes can’t believe how awesome this opportunity is.
Two weeks ago I wrote about Kettle 4.4 and its database repository and how working with it is truly no fun due to excessive latency connected with loading and saving of jobs and transformations. Logging the queries sent to a MySQL database made it obvious that the reason is a tornado of thousands of database commands and replies sent back and forth.
Did Kettle 5 learn something new?
First of all this text is not just about an intuitive perspective on the beta distribution but at least as much about the idea of looking behind a measured empirical probability and thinking of it as a product of chance itself. Credits go to David Robinson for approaching the subject from a baseball angle and to John D. Cook for establishing the connection to Bayesian statistics. In this article I want to add a simulation which stochastically prooves the interpretation.
Ratings of sellers on an e-commerce platform
Let’s assume you would like to buy a product and you can choose from two sellers A and B. The only relevant difference is the ratings – which a customer can give to a seller after a purchase choosing between ‘good’ and ‘bad’. Seller A received 10 ratings of which 9 are good and seller B received so far 500 ratings of which 400 (80%) happen to be ‘good’.
We are using Pentaho Kettle 4.4 for realizing ETL processes and to make the involved jobs centrally accessible those are stored in a database repository. Given that several proxies separate my VPN access to the database holding the repo a latency was to be expected and in the beginning not too bothersome. But proportional to the complexity of the processes grew the time involved in loading and storing of jobs and transformations until a point was reached where this was no longer acceptable (up to a minute for storing an edited job f.x.) and a solution to this had to be found.
Who would’ve thought that parallel processing in R is THAT simple? Thanks to a collection of packages having a task use all available cores is a cinch. I call an Intel i7-3632QM my own – which means 4 physical cores each providing 2 virtual cores running at something around 3 GHz. So let’s have a look at how this feat is accomplished.
The method of parallelization applied here works by starting a specified number of regular R processes and have those run on the available cores. The launch, stop of the those processes and the assignment of tasks is controlled from the original R process. The assignment of those processes to the cores is controlled by the OS. Because this method can be extended to cores beyond one CPU or computer the packages use the acronym SNOW for “Simple Network Of Workstations”.