In this little tutorial I am going to describe a handy tool for transforming an XML document into a more easily processable CSV format. There are many ways of getting this job done - but most are more tedious than necessary (like writing a custom made RegEx parser – yuck!). Using XMLStarlet and XPath expressions this is going to be cinch. Let’s evaluate a number of typical XML data configurations and turn them into a flat CSV structure.
In case the desired JSON objects structure is just a set of simple attributes this can be achieved by using mongoimport directly. But in case some of the fields are supposed to be combined into an array or a sub-document, mongoimport won’t help you. In this tutorial I will show you how to transform a CSV into a collection of GeoJSON objects and in the course of that teach you the basics of AWK.
In this text I am going to describe a very straightforward way of how to make use of Twitter’s REST API v1.1. I put some code together for that purpose, so that requesting data just needs the API URL, the API parameters and a vector containing the OAuth parameters.
Before you can get started you have to login to your Twitter account on dev.twitter.com, create an application and generate an “Access Token” for it. So let’s jump right in and fetch IDs of 10 followers of @hrw (Human Rights Watch). The necessary code is located on GitHub – download all three files and then you can just edit the example below as suggested.
What I am going to showcase in this tutorial is how to load web stats from Google Analytics into a fact table with Penthao Kettle/PDI. And then how to represent that fact table with Mondrian 3.6 schema so we can visualize the data with Saiku on Pentaho BI Server. In the end I’ll give my two cents on Saiku Analytics and possible options involving d3.js and Roland Bouman‘s xmla4js.
In case you are new to this I recommend reading my articles on the following topics involved here:
- Getting Started With Pentaho BI Server 5, Mondrian and Saiku
- Setting Up a Time Dimension Table in MySQL
- Using the Dimension Lookup/Update Step in Pentaho Kettle
- Google Analytics API with Pentaho Kettle
In a traditional star schema the dimensions are located within specialized tables which are referred to by numeric keys from the fact table. A dimension can represent anything from the gender (“male”, “female”, “intersex”) over a hierarchy representing a location (“Germany”, “RLP“, “Mainz“) to an individual user’s profile (name, address, date of birth, …). Now thanks to Mr. Kimball we know there are different types of what he refers to as Slow Changing Dimensions (SCD – “slow” because they are expected to change only infrequently):
A gamified, high-speed, high-quality Q&A-site for topics revolving around making professionally sense of a company’s data - a.k.a. “Business Intelligence” – wouldn’t that be awesome? And let’s face it – asking a question on how to configure a step in Pentaho Kettle does not fit any StackExchange site’s realm yet. Usually this type of question is asked on StackOverflow but the feedback-latency is quite high to say the least. Or let’s take a question on how to design a KPI – this one usually ends up on CrossValidated but will often be greeted with disdain given the statistical triviality - plus most people in statistics are not working with BI and won’t be open for the subject’s specific intricacies. And finally you are wondering about how to configure a MySQL RDMS for a data warehouse – where to ask that? On dba.SE … I guess. And suddenly you get weird issues with TomCat which you need for Pentaho BI Server – hmmm, SuperUser? Or ServerFault?
It’s just too distributed!
You need a table keeping the time dimension for your MySQL-based data warehouse?
Here’s how you do it:
This article serves the purpose of illustrating that signal processing with R is possible – thanks to the signal package – and to keep a reference of some of the stuff that I learned at my last edX course. Anyway, I am by no means an expert on signal processing so I’d prefer to let the pictures and the code speak for themselves. But to give you the idea – I show case the creation and application of an FIR band pass filter (Chebyshev Type 1 in this case) and of an FIR filter created using the Parks-McClellan method with the Remez exchange algorithm. The code snippets are taken from a larger R script which you can find on GitHub. I aim to focus on the essential parts. You’re welcome to share your knowledge and corrections by leaving a comment.
Attending “Discrete Time Signals and Systems” by Richard Baraniuk from Rice University was an awesome experience on many levels. Right after “Learning from Data” my second favorite MOOC so far. First of all the subject of extracting a signal from a discrete time series in terms of frequency composition is interesting by itself and provided a smooth opportunity for me to revise some of the math I studied many years ago. But this by itself wouldn’t make a learning experience that superb – what it takes for that is a teacher who knows how to get the knowledge across to the student. And in that regard – apart from the science itself – Richard is a master! It was obvious how much effort Baraniuk and his team put into designing the course. Every detail about the lectures and the exercises seemed superbially well crafted. And this is apparently not by chance – as after googling the professor’s name I found that he is actually something like a MOOC evangelist and very passionate about offering such an opportunity to the learners around the world.
Pentaho’s BI Server or BA platform allows you to access business data in the form of dashboards, reports or OLAP cubes via a convient web interface. Additionally it provides an interface to administer your BI setup and schedule processes. The aim of this tutorial is to illustrate how to get started with the BI Server and for that purpose I am going to use a small and artifical data set – as otherwise I would have to get deeper into further technologies – so I can keep this text lean. I am going to elaborate on Mondrian schemas, data warehouse design, MDX and further related concepts in separate articles.