Better web statistics analysis using Statistical Process Control (SPC) – Part 1 An Overview

March 7, 2010

I’ll bet that the mention of statistics in the title has already got some of you reaching for the ‘back’ button, but please stick with it. Although the detailed theory behind Statistical Process Control (SPC) can be somewhat complex, using it as an analytical tool is really easy. In over thirty years in the automotive industry it has proved itself over and over again as one of the most effective tools I have ever used. These posts are about how you can apply this approach to improve the analysis of your web statistics.

OK, you’ve improved your website or intranet and let it loose on your users. How can you tell how successful the improvements have been? Will it be more like the iPhone or more like New Coke? Talking directly to your users is always going to be the best way to get robust data, but there are drawbacks: it can often be too time consuming and expensive to interview a large enough sample, or just plain impossible if your users are spread all over the world. In any case, even if you could interview a large enough sample, it is always good practice to get confirmation by using more than one methodology.

Another obvious source of data is web statistics. Web data by itself can only tell us so much, and even then we have to be very careful to interpret it correctly in order to get the maximum benefit. Karl Groves’s article on The Limitations of Server Log Files for Usability Analysis gives us some very big caveats when it comes to web statistics. Yet we need metrics in order to identify the improvements that really make a difference to our users. One way of improving our analysis is to use Statistical Process Control (SPC), a technique which has been used successfully in manufacturing for many decades. For the purposes of this post I am only going to talk about a single metric, the number of visits to a web site or intranet. However, SPC can be applied to any web statistic that produces a regular stream of numbers.

So how does SPC work?

SPC is based on a phenomenon that is all around us. It can describe everything from the heights of blades of grass in a field to the distribution of galaxies across the universe. A great many things in nature conform, at least approximately, to this distribution model which, in itself, is quite amazing. That is why it is called the ‘normal distribution’ (it is also referred to as the ‘bell curve’ for obvious reasons as you can see below). As you can see, the normal distribution is symmetrical and it can be fully described using just two numbers: the average (the zero point) and the standard deviation or SD, usually written as the Greek letter sigma (the plus and minus numbers either side of the average). I won’t say too much here about the SD except to say that mathematically we know what percentage falls in each ‘slice’ of the SD (see below).
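The percentage in each ‘slice’ can be worked out directly from the mathematics of the normal distribution. The following is a minimal sketch in Python using only the standard library; the function names `fraction_within` and `slice_fraction` are my own illustrative choices, and the figures assume a perfectly normal distribution, which real data only approximates.

```python
import math

def fraction_within(k: float) -> float:
    """Fraction of a normal distribution lying within k SDs of the average."""
    return math.erf(k / math.sqrt(2))

def slice_fraction(lo: float, hi: float) -> float:
    """Fraction lying between lo and hi SDs on ONE side of the average."""
    return (fraction_within(hi) - fraction_within(lo)) / 2

for k in (1, 2, 3):
    print(f"within ±{k} SD: {fraction_within(k):.1%}   "
          f"slice from {k - 1} to {k} SD (one side): {slice_fraction(k - 1, k):.1%}")
```

Running it shows roughly 68% of values within one SD of the average, roughly 95% within two and roughly 99.7% within three.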

A good example of what the curve represents is the heights of human beings. I saw a photo recently of the smallest man and tallest man in the world taken together and it was really mind boggling. One was just over two feet tall while the other was just under eight feet tall. Yet they were both male human beings. The normal distribution allows for this. They belong in what has become a fashionable (and often misrepresented) paradigm – the long tail. At the extreme ends of the curve the line gets closer and closer to zero but never touches it. This allows for extreme phenomena to happen, but only very rarely. However, the extent of this tail can be estimated, so we have a good idea of the likely limits of human height. Therefore if we came across an adult male human being that was, say, two inches tall or twenty feet tall we would not be surprised if they were green and said ‘Take me to your leader’.
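To put a number on ‘only very rarely’, the chance of a value landing beyond a given number of SDs from the average can be calculated. This is a small sketch under the assumption that the data is exactly normal (real measurements, heights included, only approximate this in the far tails); `tail_probability` is a hypothetical helper name.

```python
import math

def tail_probability(k: float) -> float:
    """Chance that a normally distributed value falls more than
    k SDs from the average, counting both tails together."""
    return math.erfc(k / math.sqrt(2))

for k in (3, 4, 5, 6):
    p = tail_probability(k)
    print(f"beyond ±{k} SD: about 1 in {1 / p:,.0f}")
```

The probability never reaches zero, but it shrinks very quickly as you move away from the average.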

What really interests us is the huge bulk that is contained in the first three slices either side of the average (referred to as plus and minus 3 SD or 3 sigma), which contains about 99.7% of the total possible results, with the majority being closer to the average value. Look around you on a busy street and nearly everyone will be near enough the same height give or take a few inches. If we were a large clothing company we would be interested in the sizes of the majority of people, not the rare outliers, so that we could make clothes in sizes and quantities that reflect reality. SPC can tell us this, and the amazing thing is that we don’t need to measure everyone in the world, as SPC can give us very close answers based solely on small samples. These sample sizes have been mathematically validated to ensure that the estimates of the average value and SD will be very close to the true values. From this we will be able to forecast, for example, the minimum and maximum heights that individual blades of grass in a field are likely to reach without having to measure every one.
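Here is a rough illustration of how far a small sample can take you. Everything in it is made up for demonstration: the ‘field’ is 100,000 simulated blade heights drawn from a normal distribution with an assumed average of 50 mm and SD of 5 mm, and the sample size of 50 is an arbitrary choice rather than a recommended figure.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch gives the same result each run

# Hypothetical population: simulated grass-blade heights in mm.
population = [random.gauss(50.0, 5.0) for _ in range(100_000)]

# Measure only a small random sample of the field.
sample = random.sample(population, 50)
sample_mean = statistics.mean(sample)
sample_sd = statistics.stdev(sample)  # sample SD (n - 1 denominator)

# Forecast the range of heights as average plus and minus 3 SD.
lower = sample_mean - 3 * sample_sd
upper = sample_mean + 3 * sample_sd

# Check the forecast against the whole (simulated) population.
inside = sum(lower <= h <= upper for h in population) / len(population)

print(f"sample average {sample_mean:.1f} mm, sample SD {sample_sd:.1f} mm")
print(f"forecast range {lower:.1f} to {upper:.1f} mm")
print(f"share of all blades inside that range: {inside:.2%}")
```

The estimates from 50 blades will not match the true values exactly, but the forecast range should still cover nearly all of the 100,000 blades.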

All we need to establish at the outset is the average and the SD of a sample taken from any process, whether it be growing grass or web statistics, and this will allow us to estimate with good accuracy what the average and SD of the total ‘population’ will be (‘population’ is a term used in statistics that refers to the totality of all the things you are measuring).

SPC as an indicator of real change

How does SPC work in real life? Imagine a simple process such as turning the diameter of a metal bar down to a required value on a lathe. There are multiple factors that can affect the final outcome – how consistent the speed of the lathe is, how centrally the spindle runs, coolant flow rate etc. In this instance it is possible to combine all of these factors using SPC to produce a picture of what the machine is capable of achieving. Once this baseline is set it then becomes possible over time to build up a picture of the variance in the operation that is caused by other factors – operator, tool wear, materials, coolant failure etc. It becomes possible to assign causes and, over time, really get to know something about the process.

A word of warning though – don’t mix apples and oranges. To get really accurate results you need to define what it is you are measuring. Bamboo is a grass but including the height of bamboo in a study of the heights of lawn grass would clearly be a nonsense.

Also if you are a manufacturer of women’s clothes selling only in the USA then a study of the sizes of human beings from all over the world will be misleading as it may also include men and, on average, men are taller than women. Also nutrition can affect size so the heights of people in poorer parts of the world may also be less than their counterparts in more affluent countries.

This caveat applies to web statistics too. If, for instance, your product is seasonal or site traffic increases at weekends you may get more accurate results by treating some data separately. Another good example is intranets. If the huge majority of your staff only work Monday-Friday then only use that data and don’t throw in weekend data which may only muddy the waters. You really need to think about what it is you are trying to measure.
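As a small illustration of why this matters, the sketch below compares the average and SD of an intranet’s daily visits with and without the weekend days. The visit counts are invented purely for the example; only the technique of filtering by day of the week is the point.

```python
import statistics
from datetime import date, timedelta

# Hypothetical daily visit counts for two weeks, starting on a Monday.
start = date(2010, 3, 1)  # 1 March 2010 was a Monday
visits = [1200, 1180, 1250, 1220, 1150, 90, 60,
          1230, 1190, 1210, 1260, 1170, 80, 70]
daily = [(start + timedelta(days=i), v) for i, v in enumerate(visits)]

# Keep Monday to Friday only (weekday() returns 0 for Monday, 6 for Sunday).
weekday_visits = [v for d, v in daily if d.weekday() < 5]

print(f"all days:      average {statistics.mean(visits):.0f}, "
      f"SD {statistics.stdev(visits):.0f}")
print(f"weekdays only: average {statistics.mean(weekday_visits):.0f}, "
      f"SD {statistics.stdev(weekday_visits):.0f}")
```

With the weekends mixed in, the SD is inflated by the weekly dip, so a genuine change in weekday traffic would be much harder to spot.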

The benefits

The methodology is very simple to use and the benefits are –

– The ‘noise’ caused by many different factors can be taken into account where it contributes to the normal process variation

– Real, significant changes can be identified out of the ‘noise’ of the normal process variation

– The data is displayed graphically so patterns can be easily identified

– The data is usually recorded on a single chart that can cover up to 6 months of data so variations over time can be easily picked up

– The chart can be annotated so that ‘events’ such as improvement initiatives, technical problems etc that can affect the data can be noted against time

– Once the SPC process is up and running it takes virtually no time to do and the practicalities of how to do it can be explained to someone in less than 15 minutes
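To give a flavour of how real changes can be separated from the ‘noise’, here is a simplified sketch. It sets limits at the average plus and minus 3 SD of a baseline period and flags any later value that falls outside them. The visit counts are invented, the function names are my own, and using the plain sample SD is an assumption made for brevity: control charts for individual values often estimate sigma from moving ranges instead, so treat this as an illustration rather than the full method.

```python
import statistics

def control_limits(baseline):
    """Return (lower, average, upper) using average plus and minus 3 SD."""
    mean = statistics.mean(baseline)
    sd = statistics.stdev(baseline)
    return mean - 3 * sd, mean, mean + 3 * sd

def flag_signals(values, lower, upper):
    """Return (index, value) pairs that fall outside the limits."""
    return [(i, v) for i, v in enumerate(values) if v < lower or v > upper]

# Hypothetical weekday visit counts: a baseline period, then new data.
baseline = [1200, 1180, 1250, 1220, 1150, 1230, 1190, 1210, 1260, 1170]
new_data = [1215, 1185, 1240, 1420, 1205]

lower, mean, upper = control_limits(baseline)
print(f"average {mean:.0f}, limits {lower:.0f} to {upper:.0f}")
for i, v in flag_signals(new_data, lower, upper):
    print(f"day {i + 1}: {v} visits is outside the limits and worth investigating")
```

In this made-up data only the fourth new day (1420 visits) falls outside the limits; the other days are within normal variation.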

In conclusion I can only say that this technique can be incredibly powerful and accurate in its predictions and it can take a lot of the arguments out of web statistics because when things change significantly you’re not just guessing any more – you can really prove it.

In Part 2 I’ll be explaining the SPC methodology and how you can start applying it straight away to your web statistics by using control charts.

(Thanks to grapho for the stock.xchng bar chart and to Paul Vlaar for his CC photo of bamboo)

3 Responses to “Better web statistics analysis using Statistical Process Control (SPC) – Part 1 An Overview”

  1. I wish I had seen this blog before I wrote my own on a similar subject!

  3. Very informative article! Thanks Patrick for sharing it! SPC has been a big help for my team, and I have you to thank for that, since I discovered this article and began to research SPC further.

    Thanks again! 🙂
