Better web statistics analysis using Statistical Process Control (SPC) – Part 2 a methodology

May 16, 2010

In Part 1 I gave an overview of SPC. Statistical Process Control (SPC) is a way of accurately predicting what an entire data group will look like based on small samples. This is important because everything varies over time and, if the data group happens to be web statistics, it may be vital to know whether a rise or fall in, for instance, the number of visitors your web site or intranet has had in a week is just part of the normal variation or is due to some significant change.

In this post I’ll show you how you can define the ‘normal’ variation for any web statistic. I’ll also give examples and show you how easy SPC is to use and how powerful a tool it can be in aiding analysis of your web data.

Recording data using control charts

First of all I’ll take you through the nuts and bolts of recording your data. In SPC we use control charts to record our data. In the automotive sector we would normally use a pro-forma Excel-based form specified by the customer, but to show the steps more clearly I have created a simple control chart pro-forma in Word (Control Chart Blank at foot of page). I also give two examples: a completed chart (Control Chart Example 1), which shows what a chart looks like when all of the data is entered, and a partly completed chart (Control Chart Example 2), which follows on from Example 1 and shows how significant changes are identified.

1) Create your scales – You will need to look back over some historical data to see roughly how much the data varies and come up with a suitable scale for the average value and another scale for the range. Don’t worry if you get this a little wrong at this stage; you can always do another one. Scales are entered on the left-hand side of the chart for both the averages and the range.

2) Start adding your data – If you want to get a head start and you have the data available you can create your first control chart from historic data. This means that the control chart for the current data will already have the control lines inserted which will greatly help your analysis of current data.

The average is calculated by adding the current value to the two preceding values and then dividing by 3. The range is simply the difference between the highest and lowest of the three values. As you need three data points, you can see that whether you start with historic or current data you must wait until your third entry before you can calculate an average or range. Enter the average and range as a cross. The intersection of the cross pinpoints the value. Draw a line to the cross from the previous data entry. Drawing lines like this makes it much easier to detect trends. Then just keep doing this until all 26 data points have been entered and the chart is complete.
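If you would rather check your pen-and-paper arithmetic on a computer, the averaging step can be sketched in a few lines of Python. This is only a minimal sketch: the function name `rolling_average_and_range` and the visitor numbers are made up for illustration and are not taken from the example charts.

```python
def rolling_average_and_range(values, window=3):
    """Return (averages, ranges) for each run of `window` consecutive values.

    The first result is only available once `window` values have been
    entered, which is why the chart has no cross for the first two entries.
    """
    averages, ranges = [], []
    for i in range(window - 1, len(values)):
        group = values[i - window + 1 : i + 1]  # current value plus the two before it
        averages.append(sum(group) / window)
        ranges.append(max(group) - min(group))
    return averages, ranges


# Hypothetical weekly visitor counts, purely for illustration.
weekly_visitors = [120, 135, 128, 140, 150]
averages, ranges = rolling_average_and_range(weekly_visitors)
print(averages)
print(ranges)
```

Five weekly values give three averages and three ranges, one for each week from the third onwards.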

3) Analysis is something you do all the time – As you add data keep thinking about what you are looking at. Make a note on the chart of any event that might be relevant as it happens. This is really important for spotting patterns and causes and effects in the future. We always used to pin our charts to a wall in the office so everyone could see them and make comments.

4) Calculating the control lines – The control lines are the ‘red lines’ beyond which a change can be seen as significant, in other words a real change and not just the normal variation.

When you have completed the chart it is time to do some simple calculations. These are done as described in the text boxes at the top of the control chart, but I’ll go over them again here in a little more detail:

Control lines for the average value – First add up all of the averages you have worked out for each data entry and divide by the number of entries to find the average of the averages. Next do the same for the range values so that you end up with the average range. To calculate the Upper Control Limit (UCL) you multiply the average range by a constant called ‘A2’. This constant represents the proportionality of the range to the standard deviation (SD), which I discussed in Part 1. The constant changes depending on the sample size and, as our sample size is three, its value is always 1.023. So all we need to do is multiply the average range by 1.023 and add the result to the average of averages, and we have the UCL. To get the Lower Control Limit (LCL) you again multiply the average range by 1.023, but this time you subtract the result from the average of averages. The UCL and LCL are normally represented on your next chart as solid red lines, while the average of averages is a dotted red line.
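As a rough illustration, the same calculation can be written as a short Python function. The constant 1.023 comes from the text above; the function name `average_control_limits` and the sample figures are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
A2 = 1.023  # constant for a sample size of three, as given in the text


def average_control_limits(averages, ranges, a2=A2):
    """Return (LCL, average of averages, UCL) for the averages section of the chart."""
    grand_average = sum(averages) / len(averages)  # the 'average of averages'
    average_range = sum(ranges) / len(ranges)
    offset = a2 * average_range
    return grand_average - offset, grand_average, grand_average + offset


# Hypothetical values: average of averages = 100, average range = 20.
lcl, centre, ucl = average_control_limits([100.0, 110.0, 90.0], [10.0, 20.0, 30.0])
print(round(lcl, 2), centre, round(ucl, 2))
```

With an average of averages of 100 and an average range of 20, the control lines sit 1.023 × 20 = 20.46 either side of 100.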

Control line for the range value – It is not enough to control the average value; we must also control the range. Look at the following two sets of values:

1, 20, 39

19, 20, 21

They both have the same average value, 20, but you can see that the range for the first set of values (38) is far higher than the range for the second set (2). As control lines are always symmetrical about the average value, the lower control line for the range could theoretically be a negative number, but the lowest value a range can actually take is zero. For that reason we generally calculate the Upper Control Limit Range (UCLR) only. You do this by multiplying the average range by another constant, ‘D4’, which for a sample size of three is always 2.574. The UCLR is represented in the range section of the control chart as a solid red line, with the average range again as a dotted red line.
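Here is a minimal Python sketch of the range calculation, using the two sets of values above as if they were the only two entries on a chart (an assumption made purely to keep the example small). The function name `range_control_limit` is hypothetical; the constant 2.574 is the one given in the text.

```python
D4 = 2.574  # constant for a sample size of three, as given in the text


def range_control_limit(ranges, d4=D4):
    """Return (average range, UCLR) for the range section of the chart."""
    average_range = sum(ranges) / len(ranges)
    return average_range, d4 * average_range


# Ranges of the two sets of values above: (1, 20, 39) and (19, 20, 21).
set_ranges = [max(s) - min(s) for s in [(1, 20, 39), (19, 20, 21)]]
average_range, uclr = range_control_limit(set_ranges)
print(set_ranges, average_range, round(uclr, 2))
```

The two ranges are 38 and 2, so the average range is 20 and the UCLR is 2.574 × 20 = 51.48.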

You can see from Example 1 that I have put the control limits on the completed chart. This can be a useful exercise to see how the data fits and where it has gone beyond the control limits. However, the control lines are really for the next chart. Get a blank chart and draw in the control lines as calculated.

5) You now have your warning device – You will now have a brand new control chart with sets of red lines on it. These control limits are an early warning system that will alert you when things have changed, really changed. Any value that falls outside these control limits means that something new has happened and that you will need to try to assign a cause to it. This is easier if you keep up to date with everything that might affect the particular metric you are controlling and note it on the chart. If your visitor numbers fall below the LCL and it is also a national holiday, that might not be a coincidence. If your visitor numbers rise above the UCL after a PR campaign, you might be able to tell the PR people they’ve done a good job.
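The check itself is simple enough to sketch in Python. This is an illustration only: the function name `out_of_control`, the control limits and the averages below are all hypothetical and do not come from the example charts.

```python
def out_of_control(averages, lcl, ucl):
    """Return (position, value) pairs for averages that lie outside the control lines."""
    return [(i, value) for i, value in enumerate(averages)
            if value < lcl or value > ucl]


# Hypothetical control lines carried over from a previous chart,
# and hypothetical new averages entered on the current chart.
flagged = out_of_control([101.0, 125.0, 98.0, 70.0], lcl=79.54, ucl=120.46)
print(flagged)
```

Each flagged entry is a prompt to look at your notes for that week and ask what changed; the code only finds the points, it cannot tell you the cause.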

6) Keep and review all your charts – If you store all charts it is a useful exercise at times to look over the results for a long time period in order to discover patterns and parallels. These charts will constitute an accurate picture of a metric over time. If you are maintaining charts for more than one metric see if any patterns occur affecting more than one metric.

Notes on analysing the chart

I have completed Control Chart Example 1 and have calculated and inserted the control lines. You can see that all the values sit inside the control lines. In Control Chart Example 2 I have inserted the control lines from Chart 1 and have carried on inserting data. You can see that there are two occasions when the average values lie outside the control lines, in other words the values were ‘out of control’. I have annotated the chart with events that happened at the time the data was inserted. The first occasion was when a blog post was advertised on a peer discussion list drawing more views than normal. The second occasion was when views fell dramatically and it probably wasn’t a co-incidence that it was Christmas.

In conclusion

I have attached both a blank control chart you can start using straight away and two example charts so you can better see how it all works. I have completed the entries on my computer (even I can’t read my writing!) but I feel it is generally better to print the chart off as A3 and enter data by pen. You can do this while it is on display on your wall.

At first glance this all might seem a tad complicated but, believe me, after you’ve done it a couple of times you won’t even have to think about it. What you’ll end up with is not just an accurate record of a metric but also a record of the events that have affected it over time, as well as an early warning system for when things may be going wrong and, just as importantly, when you are getting them right.

You can find a good general introduction to SPC here thanks to the NHS in Scotland. Also Andy Parkinson has posted on the subject here.

If you want any advice on SPC or if you can think of any improvements that might apply to this technique please leave a comment.

Control_Chart_Blank

Control_Chart_Example1

Control_Chart_Example2

(Thanks to grapho for the stock.xchng bar chart)

 


4 Responses to “Better web statistics analysis using Statistical Process Control (SPC) – Part 2 a methodology”

  1. bruno amaral Says:

    It’s been quite an interesting read, thank you for taking the time to write this series :)


  2. […] Better web statistics analysis using Statistical Process Control (SPC) – Part 2 a methodology … […]


  3. […] (More on how to use statistical process control for web site analysis here) […]


  4. Another great article on SPC! Great insight! Thank you for sharing Part 2: “A Methodology.” I hope that there will be a Part 3 in the near future.

    Thanks again :-)

