Data Cleaning with Data Wrangler

April 27, 2011

Data Wrangler is a web-based service from Stanford University’s Visualization Group. Similar to Google Refine, it allows interactive cleaning and transformation of messy, real-world data for further analysis in tools such as Excel, R or Protovis.

To learn how to use Wrangler, read this research paper or this blog post.

Wrangler is still in alpha, so if you want to give feedback to the developers, go for it.

Wrangler Demo Video from Stanford Visualization Group on Vimeo.

Willkommen im besten Netz

April 27, 2011

Welcome to the best network – Vodafone has a new commercial and I kind of like it 🙂

The song “Way Back Home” is by the Australian duo Bag Raiders.

SMYE 2011

April 5, 2011

I will attend the Spring Meeting of Young Economists 2011, which takes place in Groningen from 14th to 16th April. If you are there as well, you are welcome to contact me.

Dealing with Non-Normal Data

March 16, 2011

How often do you have to deal with non-normal data? Do you know what to do with it? In his article “Dealing with Non-normal Data: Strategies and Tools”, Arne Buthmann explains the common reasons for non-normal data and how to handle them.

Addressing Reasons for Non-normality

Reason 1: Extreme Values

Reason 2: Overlap of Two or More Processes

Reason 3: Insufficient Data Discrimination

Reason 4: Sorted Data

Reason 5: Values Close to Zero or a Natural Limit

Reason 6: Data Follows a Different Distribution

No Normality Required

He states that “Some statistical tools do not require normally distributed data. To help practitioners understand when and how these tools can be used, the table below shows a comparison of tools that do not require normal distribution with their normal-distribution equivalents.”

Comparison of Statistical Analysis Tools for Normally and Non-Normally Distributed Data
Tools for Normally Distributed Data | Equivalent Tools for Non-Normally Distributed Data | Distribution Required
--- | --- | ---
T-test | Mann-Whitney test; Mood’s median test; Kruskal-Wallis test | Any
ANOVA | Mood’s median test; Kruskal-Wallis test | Any
Paired t-test | One-sample sign test | Any
F-test; Bartlett’s test | Levene’s test | Any
Individuals control chart | Run chart | Any
Cp/Cpk analysis | Cp/Cpk analysis | Weibull; log-normal; largest extreme value; Poisson; exponential; binomial
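
To make one row of the table concrete, here is a rough sketch of a sign test applied to paired differences, the distribution-free counterpart to the paired t-test. The function name `sign_test` and the sample measurements are made up for illustration and do not come from Buthmann's article. The p-value is the exact two-sided binomial tail, and tied pairs are dropped.

```python
from math import comb


def sign_test(before, after):
    """Two-sided exact sign test on paired samples.

    Returns (n_positive, n_negative, p_value). Tied pairs are dropped.
    """
    diffs = [a - b for b, a in zip(before, after) if a != b]
    n = len(diffs)
    if n == 0:
        return 0, 0, 1.0
    n_pos = sum(1 for d in diffs if d > 0)
    n_neg = n - n_pos
    k = min(n_pos, n_neg)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return n_pos, n_neg, min(1.0, 2 * tail)


# Hypothetical paired measurements (e.g. before and after a process change)
before = [12.1, 9.8, 15.3, 11.0, 14.2, 10.5, 13.7, 12.9]
after = [11.4, 9.9, 13.8, 10.2, 13.1, 10.5, 12.5, 12.0]

n_pos, n_neg, p_value = sign_test(before, after)
print(n_pos, n_neg, p_value)
```

The test only uses the direction of each difference, not its size, which is why it makes no assumption about the shape of the distribution. The price is lower power than the paired t-test when the data really are normal.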

Application: Missing Data

March 16, 2011

In the real world we will almost always come across missing values in data, for many reasons. This problem must be addressed to produce reliable statistical results. First of all, we need to identify what is missing. Then ask yourself why the data is missing and what that means. Once you have answered those questions, you need to deal with the missing values. These are the options, depending on the situation:

1. When missing values are few and lie far apart, do nothing.

2. When a column has a significant number of missing values, create an indicator variable (0/1) that flags whether each value is missing or present.

3. When a column has a significant number of missing values, replace each missing value with a constant, e.g. the mean, median or mode.

4. When a column and its values are essential to producing accurate predictions then estimate the missing value based on other, non-missing data elements.
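
Options 2 to 4 above can be sketched in a few lines of plain Python. This is only a minimal illustration: the function names and the sample data are hypothetical, missing values are represented as `None`, and the straight-line fit used for option 4 is just one of many possible ways to estimate a value from other, non-missing data.

```python
from statistics import mean, median


def missing_indicator(values):
    """Option 2: 1 where the value is missing (None), 0 where it is present."""
    return [1 if v is None else 0 for v in values]


def impute_constant(values, strategy="median"):
    """Option 3: fill missing entries with the mean or median of the present ones."""
    present = [v for v in values if v is not None]
    fill = mean(present) if strategy == "mean" else median(present)
    return [fill if v is None else v for v in values]


def impute_from_predictor(x, y):
    """Option 4: estimate missing y values from x via a least-squares line
    fitted on the complete (x, y) pairs. Assumes x has no missing values."""
    pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    mx = mean(xi for xi, _ in pairs)
    my = mean(yi for _, yi in pairs)
    sxx = sum((xi - mx) ** 2 for xi, _ in pairs)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in pairs)
    slope = sxy / sxx
    intercept = my - slope * mx
    return [intercept + slope * xi if yi is None else yi for xi, yi in zip(x, y)]


# Hypothetical column with two missing entries
income = [30, None, 50, 40, None, 60]
flags = missing_indicator(income)
filled = impute_constant(income, strategy="median")

# Hypothetical predictor/target pair with one missing target value
x = [1, 2, 3, 4, 5]
y = [2, 4, None, 8, 10]
estimated = impute_from_predictor(x, y)

print(flags)
print(filled)
print(estimated)
```

Keeping the indicator from option 2 alongside an imputed column is often useful, because it lets a later model see that a value was filled in rather than observed.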

Any question? Quora gives the answer

March 16, 2011

Need some hints about useful statistics books or free public data sets, or have any question about statistics? Go to Quora.

Google Refine 2.0

January 16, 2011

From time to time I find software that amazes me. Google Refine falls into this category.
When we deal with real data, we find missing values, inconsistencies and other problems that need to be cleaned up before conducting any analysis. Fixing these problems manually is time-consuming and annoying, so use Google Refine instead. It not only cleans data in a powerful way but also transforms data sets from one format into another or extends them with other data sets.

It is a desktop application that runs on your own computer. It is not a web service, so there is no need to upload your super-sensitive data to some other server. Of course, Refine is free.

Hans Rosling - The Joy of Stats

December 26, 2010

“I kid you not, statistics are now the sexiest subject on the planet.”

Hans Rosling

Hans Rosling is my idol. His online lectures are eye-opening, mind-expanding and funny. He is not only the developer of Gapminder, a really cool information visualization tool for animating statistics, but also an internationally known medical doctor, academic, statistician and public speaker.
In his videos ‘The Joy of Stats’ he shows exactly why some people (including me) have a passion for statistics: it is exciting and fun to find the story behind the data, tell it in a visually appealing way and make sense of the world.

You can find many more great lectures on TED or in The Joy of Stats series.

UPDATE: The Joy of Stats is now available in its entirety on Gapminder


Online book: Introduction to data mining

December 26, 2010

What a great online book about Data Mining!!! Thanks to the authors for providing this book for free, but honestly I would buy it if it were available in print (now I need to do a lot of printing work).

What is it that makes this book so great? The structure and visualization are what caught my attention. Have a look at the table of contents: it is actually a Data Mining Map. What a great idea to use a structured map.

The authors’ approach is practical and does not go too deep into explanations, which is good if you are not interested in the theoretical equations behind the stats. You will find a lot of pictures, tables, definitions and exercises but fewer formulas. This online book was created by Dr. Saed Sayad in collaboration with Professor Stephen T. Balke in the Department of Chemical Engineering and Applied Chemistry at the University of Toronto.

The Beginner’s Guide For Web Data Analysis

November 19, 2010

On his web analytics blog Occam’s Razor, Avinash Kaushik wrote a post titled “Beginner’s guide to web analytics: Ten steps to love & success”. As an expert in web analytics, he gives a practical introduction to the field. Below is an overview of the ten steps he recommends. The whole post gives a useful outline of how to get started with web analytics, so check it out.

Step 1: Visit the website. Note objectives, customer experience, suckiness.

Step 2: How good is the acquisition strategy? Traffic Sources Report.

Step 3: How strongly do Visitors orbit the website? Visitor Loyalty & Recency.

Step 4: What can I find that is broken and quickly fixable? Top Landing Pages.

Step 5: What content makes us most money? $Index Value Metric.

Step 6: How Sophisticated Is Their Search Strategy? Keyword Tag Clouds.

Step 7: Are they making money or making noise? Goals & Goal Values.

Step 8: Can the Marketing Budget be optimized? Campaign Conversions/Outcomes.

Step 9: Are we helping the already convinced buyers? Funnel Visualization.

Step 10: What are the unknown unknowns I am blind to? Analytics Intelligence.

Click here for full article.