(This article expects from the reader basic experience with PowerPivot. You should know how to access a PowerPivot table from Excel and how to turn it into a pivot table or a chart.)
KPIs for a grocery store
I think this topic is introduced best using an example. Let’s assume an offline store and we are responsible for the data analysis. The boss wants us to count the people visiting the shop considering their age group and also how many of those visits lead to a purchase.
Usual administrative units are too heterogenous for regional statistics. To make regions comparable, territorial units of similar population size are required. For the European Union and further states being associated in some way or another the NUTS (Nomenclature des unités territoriales statistiques) classification has been developed in 1980 and is being updated triennially.
There are four NUTS levels 0,1,2 and 3. Every region is designated a code consisting of two to five characters. The first two characters denote the state (the usual ISO-3166 two letter code – Greece being an exception as it is referred to with EL instead of GR). The characters following it in case of NUTS 1,2 and 3 form a hierachical system. So for example DE21H (Munich) belongs to DE21 (Oberbayern) belongs DE2 (Bayern / Bavaria) belongs to DE (Germany).
Eurostat is the institution within the European Union that organizes statistics from the 27 EU member states (f.x. from the German Federal Office of Statistics who also maintain a web-access to their data). Their web-site offers a wealth of statistics, reports, documents and visulization tools. It is pretty huge and I still get lost easily on it or discover new things. So this article doesn’t even try to show you around. I’ll just exemplify here one aspect of their site – the statistics database in context of a concrete question:
The question we’ll investigate
How regularly did people – differentiated in younger than or at least 65 years of age - in recent past die from a cause categorized as “Ill-defined and unknown causes of mortality”? We will be looking at national level (NUTS 0).
As you might know, I am working as the Data Analyst for carpooling.com in Munich. carpooling.com is the company maintaining the leading web platform for organizing carpoolings (in the world, actually). Many people don’t know what “carpooling” means, so let me explain it you:
Tanja lives in Stuttgart and wants to visit her family in Hamburg next weekend. This is quite a long distance and hence pretty expensive – and also kind of dull sitting in a car alone for several hours. Having three free seats in her car left, she thinks to herself … “Why not offer those seats to other people and share the expenses with them?!”. So she advertises her planned lift on www.mitfahrgelegenheit.de – the biggest German web-site for carpoolings. Peter who also wants to travel to Hamburg next weekend finds her ad and gives her a call to seal the deal. Okay, so far so good, but few passengers aren’t as reliable as Peter and might just forget about the ride and Tanja would then be left with a free seat and no money. So carpooling.com came up with the idea of a “booking system” to make carpooling agreements more binding. Next weekend Tanja, Peter and further passengers meet and drive to Hamburg together.
I was curious how gender-ratios of young women and men are distribute geographically in Europe. Eurostat offers absolute numbers for all NUTS2 regions in Europe. The most recent available figures were referring to January 2012 – in few cases like Turkey I was falling back to January 2011 due to missing values.
The figures are drawn from table “demo_r_d2jan” on Eurostat.
PowerPivot itself cannot be automated with VBA. But with a few tricks we can control how the data drawn from PowerPivot is accessed within Excel. This article assumes that you know how to import data with PowerPivot and how to integrate it into an Excel file as a pivot table. Also you should have activated the developer tools in Excel. There you find all the VBA relevant stuff like the macro recorder and the dialogue controls you want to add.
Let’s get it started with an example
We start with a PowerPivot table (named “data”) keeping the values for two variables A and B for every day in 2012 imported from a MySQL-DB in this case. So in the PowerPivot window this would look something like that:
So let’s assume some manager is asking for a dashboard displaying those two values in a chart but only one at a time and s/he is not inclined to deal with pivot table field lists but wants to switch between A and B using buttons.
“The operation has been cancelled because there is not enough memory available for the application.”
Recently I experienced occasional errors regarding lack of memory when trying to import big data sets with PowerPivot. I am using Excel 32bit so the maxium theoretical in-memory size of an Excel file is 4GB. But judging from the memory usage displayed in the task manager it quickly became clear that I wasn’t even getting close to this boundary.
Pivot tables are in contrast to popular belief not just an Excel feature but present in a lot of applications dealing with tabularly structured numerical data – pivot tables are the visual and interactive result of a general concept of data aggregation depending on categories. LibreOffice (check out the article on it), OpenOffice and R by libraries like reshape offer this technology targeting at a different audience. If you understand pivoting in one of these tools, you pretty much got it for all of them.
At my work I am offering a workshop on pivoting with Excel 2010 and so I came up with a chart boiling down the concept of using pivot tables so it fits on a sheet of PDF.
The scatterplot shows the frequency of occuring words for two sets of texts. You click on one circle and you see the words for it on the left hand side. The app is built on d3.js (my second small project using it) and I am planning to write an introductory article on it soon. Apart from a few issues it is fun to work with d3.
When you are responsible for setting up or extending an automized reporting of data stored in a warehouse with Excel, I would always vote for using PowerPivot for fetching the data and displaying it in pivot charts and tables. But there are use cases and situations where PowerPivot is not an option. Maybe you just cannot use PowerPivot for whatever reason. But first and foremost I met / suffered from the case where you already have an Excel reporting using cell references to other files and macros to fill the cells. If this sort of reporting reaches a certain extent and importance you cannot just set it up new again. Now if you need to extend such a report f.x. with a new column and don’t want to use cell references to other files – or worse VBA – because this is very fragile and error prone then I have a solution for you.
And this is how it works
In this article I am going to show you how to convert a sequence of images (PNGs in this case) into a clip that can be uploaded to and watched on Youtube or Vimeo in HD. ffmpeg - the tool we are going to use – has to be dealt with through the command line. The large number of different parameters and settings which often depend on each other or are mutually exclusive are quite intimidating at first sight (at second as well). Actually it took me quite a while to produce a clip that did not weigh more than 100 MB while keeping a good quality and didn’t produce obscure issues. But in the course of this odyssey I learned the basics of digital videos and also, at the end of the day, the settings which led to my desired outcome even started to make (some) sense!
First of all I created a thousand PNGs (named image000.png, image001.png, …, image999.png) using ggplot2 in R. If you don’t know ggplot2 yet, check out my introductory article on it and the code I used for the article (about stock quote scatter plots changing in time).
Okay, here we go (the new lines are added just for clarity):
-framerate 10 # input frame rate
-i image%03d.png # image names (image000.png, ..., image999.png)
-s:v 1280x720 # video size
-c:v libx264 # encoder
-profile:v high # H.264 profile for video
-crf 20 # constant rate factor
-pix_fmt yuv420p # pixel format
-r 30 # output frame rate
daimler_man.mp4 # clip file name
ggplot2 is for plotting in R, very flexible and ably designed by Hadley Wickham following a concept called “grammar of graphics” and anyway pretty awesome – so let’s jump right in with some simple examples that should help you get it going.
This is the R code I used to create the PNGs which are afterwards put together with ffmpeg into a clip (check ‘em out). I am just commenting on the programmatic aspects. In a separate article I will write about how ggplot2 is used and how ffmpeg turns the PNGs into a clip.
Originally I had the idea for this little project (still can’t find a name or description for it) when dealing with the stock quotes correlations. The tool I came up with shows the scatterplot for two stock quotes charts and the respective Pearson correlation coefficient. I wanted to see if one can tell from the scatterplot and the coefficient how two stocks relate to each other. I didn’t take this investigation much further than the visualization and some pondering about patterns shown in the scatterplots.