April 27, 2011
Data Wrangler is a web-based service from Stanford University’s Visualization Group. Like Google Refine, it lets you interactively clean and transform messy, real-world data for further analysis in tools such as Excel, R or Protovis.
To learn how to use Wrangler, read this research paper or this blog post.
Wrangler is still in alpha, so if you want to give the developers feedback, just go for it.
Wrangler Demo Video from Stanford Visualization Group on Vimeo.
April 27, 2011
Welcome to the best network – Vodafone has a new commercial and I kind of like it
The song “Way Back Home” is by the Australian duo Bag Raiders.
April 5, 2011
I will attend the Spring Meeting of Young Economists 2011, which takes place in Groningen from 14 to 16 April. If you are there as well, you are welcome to contact me.
March 16, 2011
How often do you have to deal with non-normal data? Do you know what to do with it? In his article “Dealing with Non-normal Data: Strategies and Tools”, Arne Buthmann explains the common reasons for non-normal data and how to handle them.
Addressing Reasons for Non-normality
Reason 1: Extreme Values
Reason 2: Overlap of Two or More Processes
Reason 3: Insufficient Data Discrimination
Reason 4: Sorted Data
Reason 5: Values Close to Zero or a Natural Limit
Reason 6: Data Follows a Different Distribution
No Normality Required
He states that “Some statistical tools do not require normally distributed data. To help practitioners understand when and how these tools can be used, the table below shows a comparison of tools that do not require normal distribution with their normal-distribution equivalents.”
Comparison of Statistical Analysis Tools for Normally and Non-Normally Distributed Data
| Tools for Normally Distributed Data | Equivalent Tools for Non-Normally Distributed Data |
| | Mann-Whitney test; Mood’s median test; Kruskal-Wallis test |
| | Mood’s median test; Kruskal-Wallis test |
| | One-sample sign test |
| F-test; Bartlett’s test | |
| Individuals control chart | |
| | Weibull; log-normal; largest extreme value; Poisson; exponential; binomial |
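To make one of the listed tools concrete, here is a minimal Python sketch of the one-sample sign test. It is my own illustration, not code from Buthmann's article, and the function name `sign_test_p_value` is made up for this example. The test only counts how many observations fall above and below a hypothesized median, so it makes no assumption about normality.

```python
from math import comb


def sign_test_p_value(sample, hypothesized_median):
    """Two-sided p-value of a one-sample sign test (illustrative sketch)."""
    above = sum(1 for x in sample if x > hypothesized_median)
    below = sum(1 for x in sample if x < hypothesized_median)
    n = above + below  # observations equal to the hypothesized median are dropped
    if n == 0:
        return 1.0  # nothing left to test
    k = min(above, below)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

A small p-value suggests that the true median differs from the hypothesized one. For real work, a statistics package is the better choice; this sketch only shows the idea.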
March 16, 2011
In the real world we will almost always come across missing values in data, for many reasons. This problem must be addressed to produce reliable statistical results. First of all, we need to identify what is missing. Then ask yourself why the data is missing and what that means. After you have answered those questions, you need to deal with the missing values. These are the options:
1. When missing values are few and lie far apart, do nothing.
2. When a column has a significant number of missing values, create an indicator variable (0/1) that marks each value as missing or present.
3. When a column has a significant number of missing values, replace the missing values with a constant, e.g. the mean, median or mode.
4. When a column and its values are essential to producing accurate predictions, estimate the missing values from other, non-missing data elements.
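Steps 2 to 4 can be sketched in a few lines of plain Python. This is only an illustration under my own assumptions: missing values are represented as `None`, all function names are made up for this example, and step 4 uses a simple least-squares line on one other column as one possible way to "estimate from other data".

```python
from statistics import mean, median, mode


def missing_indicator(values):
    """Step 2: 1 where the value is missing, 0 where it is present."""
    return [1 if v is None else 0 for v in values]


def impute_constant(values, strategy="mean"):
    """Step 3: replace missing values with the mean, median or mode."""
    present = [v for v in values if v is not None]
    strategies = {"mean": mean, "median": median, "mode": mode}
    if strategy not in strategies:
        raise ValueError(f"unknown strategy: {strategy}")
    fill = strategies[strategy](present)
    return [fill if v is None else v for v in values]


def impute_from_predictor(target, predictor):
    """Step 4: estimate missing target values from another column
    with a least-squares line fitted on the complete rows."""
    pairs = [(x, y) for x, y in zip(predictor, target)
             if x is not None and y is not None]
    x_bar = mean(x for x, _ in pairs)
    y_bar = mean(y for _, y in pairs)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in pairs)
             / sum((x - x_bar) ** 2 for x, _ in pairs))
    intercept = y_bar - slope * x_bar
    # Rows where the predictor is missing too are left untouched
    return [intercept + slope * x if y is None and x is not None else y
            for x, y in zip(predictor, target)]
```

The indicator from step 2 is worth keeping even after imputing, because the fact that a value was missing can itself carry information.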
January 16, 2011
From time to time I find software that amazes me. Google Refine falls into this category.
When we deal with real data, we find missing values, inconsistencies etc. that need to be cleaned before conducting any analysis. Fixing these problems manually is time-consuming and annoying, so use Google Refine instead. It not only cleans data in a powerful way but also transforms data sets from one format into another or extends them with other data sets.
It is a desktop application which runs on your computer. It is therefore not a web service, and there is no need to upload your super sensitive data to some other server. Of course Refine is free.