Variability in Excel

A primer on measuring data variance in Excel.

by David Goodmanson

Let’s continue our look at the basic statistics functions in Excel.

In our first article we looked at statistical functions that measure central tendency, or averages. Most people understand Average or Mean, though Median is often misused.

Now let's look at a more difficult but important area that experts call measures of variability, spread or dispersion.


Variability or Range

Averages tell us what is at the center of a set of data, but they don't tell you the whole story.

For example – you're told that two restaurants have main courses with an average price of $10 each, so they sound like they're of similar value. But one restaurant has main courses of $9, $10 and $11, while the other has a kids' meal for only $1 plus mains for $10 and $19. The second eatery has a much wider variation in prices, which the average alone doesn't disclose.

Variability tells us how widely or narrowly values are spread. In one sense it tells us how reliable our average may be (i.e. judging by how widely the data is distributed). Would you like to eat at a restaurant where the variation of customer experience is widely spread? Or one that is narrowly focused around a high average? Which one would you be confident about eating at?

The great thing about modern software, like Excel, is that it has a fantastic range of tools to analyse data. Tasks that took people days of painstaking tabulation now take seconds.

What Excel can’t do is make you understand what the functions do and what the results mean.

In this article we’ll focus on the Excel functions with a short attempt to explain Standard Deviation. But teaching “Stats 101” isn’t our job, especially when plenty of others have done it better so we’ve put a link to a few web sites that you can check out.


Excel’s Variation Functions

The functions we will be covering are Maximum (Max, MaxA), Minimum (Min, MinA), Standard Deviation (StDev, StDevA) and Variance (Var and VarA) plus one ‘function’ not supplied by Microsoft but easily done.

The basic use of these functions is as follows:



  • Maximum uses the Max(a range of numbers) function which calculates the maximum value contained in the specified range of numbers.
    Note: a range of numbers is a flexible concept. It can refer to a set of individual cell addresses (B3, C6, Z48), a contiguous range of cells (A2:A52), an array, or any mixed grouping of individual cells and ranges (A2:A52, C6, B3, Z48). Excel will handle all of these. Each statistical function used in this article takes a range of numbers as its input.
  • Minimum uses the Min(a range of numbers) function which calculates the minimum value found in that range of numbers.
  • Remember, if you are going to copy the formula to other cells and you want it to keep referring to the same cells, the cell addresses in the example above (B3, C6, Z48) need to use absolute addressing: ($B$3, $C$6, $Z$48). The dollar signs tell Excel to use exactly those addresses and not adjust them when the formula is copied, as it would otherwise. This applies to all Excel functionality.
  • ***Tip: For absolute (or mixed) addressing, try the F4 key (in the top row of keys). Put your cursor on a cell reference in the formula and each press of F4 cycles through the addressing types for that reference.
  • When an “A” is added to the end of a statistical function, it indicates that any cells with non-numeric values will be treated as if they were the number zero. Without the “A” suffix, the function only calculates over cells with numeric values. (Both function types ignore blank or empty cells.)
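For readers who like to see the rule in code, here's a minimal Python sketch of the difference (an illustration only, not Excel's actual implementation): the plain function considers numeric cells only, while the "A" form counts text as zero; both skip blank cells.

```python
# Sketch: how Excel's plain vs "A"-suffixed functions treat a range
# containing text and blank cells. Text counts as 0 for the "A" form;
# the plain form ignores it; blanks (None here) are ignored by both.

def plain_min(cells):
    """Like MIN(): consider numeric cells only."""
    return min(c for c in cells if isinstance(c, (int, float)))

def min_a(cells):
    """Like MINA(): treat text as 0; skip blank (None) cells."""
    return min(c if isinstance(c, (int, float)) else 0
               for c in cells if c is not None)

orders = [4, 2, "None", 6, 3]   # Wednesday's cell holds the text "None"
print(plain_min(orders))  # 2 -- the text is ignored
print(min_a(orders))      # 0 -- the text is treated as zero
```

The same pattern applies to MAX/MAXA, STDEV/STDEVA and so on.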

Observe in the Excel screen that follows how using the “A” suffix on a statistical function can produce different results. The screen uses the same data as the first statistics article. In this example, the differences arise from the way these functions handle Wednesday's data, which contains text or blanks. If we assume the business wishes to count days when orders were “0”, “None” or empty, each calculation behaves differently for that day's orders. Let's look at the “Orders” columns.

We are testing to see the difference in how each function (in Column “B”) handles normal values and unexpected values, e.g. Wednesday.

· Orders 1 shows that both function types treat the value “0” in the same way, hence the same (and correct) results for both.

· Orders 2 shows that functions without the “A” suffix ignore the value “None” and therefore calculate incorrect values.

· Orders 3 shows that both “A” and non-“A” functions ignore Wednesday's value (an empty cell) and thus give the wrong result.

The exception in these examples is the MAX() and MAXA() pair, because they are not affected by Wednesday's values (but MIN() is!).

[Image: Excel variance table and sample results, from Variability in Excel at Office-Watch.com]

The syntax is the same for each of the other functions. The function names in Column “A” are spelt exactly as the function is used. Simply type an equals sign, followed immediately by the function name, an opening bracket, the address of the cells of interest, and a closing bracket, then press Enter. For example: =MINA(C2:C6)


RANGE – not in Excel but useful

The only calculations shown that are not native Excel functions are Range and RangeA, which I created using the MIN and MAX functions. Basically, Range = MAX() – MIN().

Range is a good statistic to derive because it gives an initial, easy-to-understand picture of the spread of the data.
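As a sketch, the same calculation in Python is just max minus min, mirroring the MAX() – MIN() worksheet formula:

```python
# Range = MAX - MIN: the simplest measure of spread.
def data_range(values):
    return max(values) - min(values)

print(data_range([9, 10, 11]))  # 2  (the first restaurant's menu)
print(data_range([1, 10, 19]))  # 18 (the second restaurant's menu)
```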

An interesting and fun web page that gives a nice, easy explanation of Range is at MathsIsFun. Check it out; it's good for kids too.


Standard Deviation and Variance

The most common measurements of spread and variation are Standard Deviation and Variance. This article uses the Sample form of these functions because, as it turns out, that is what we are most likely to be measuring. There are also Population forms, denoted by a “P” at the end of the function name: STDEVP(), STDEVPA(), VARP() and VARPA().


Sample and Population

What’s the difference between the sample functions (eg STDEV) and Population version (eg STDEVP)?

The more common ‘Sample’ function applies when you’re analysing a sample or poll of a larger group. The Population version only applies when you have all the data.

OK, let's imagine we are interested in knowing the most popular brand of motor vehicle in the U.S. To keep things simple, let's say there are Ford, GM and Chrysler. How will I measure this? I could stand on a street corner in Los Angeles, CA and record how many of each type I observe. Am I measuring the population or a sample of the population? If you said sample, you'd be right. It would be very difficult to take a census of everyone (i.e. the population) who owned a vehicle at a particular time. You could go to the motor registry and ask for access to their computer systems and derive a measure from them, but that would cover only one State. So we could then go to all 50 registries, and so on, but I think you see my point.

Another example: before an election we have surveys of a small number of people, those surveys are used to make an educated guess about who will win the vote. Analysing the survey results might use the ‘STDEV’ or Sample function. After the election you could get all the actual votes and analyse them using the STDEVP or ‘Population’ version because you’re counting all the votes, not just a small sample.
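The sample/population distinction is easy to see in code. Here's a sketch using Python's statistics module as a stand-in for Excel's functions: the sample forms divide by n − 1 (like STDEV() and VAR()), the population forms divide by n (like STDEVP() and VARP()).

```python
import statistics

data = [4, 2, 0, 6, 3]  # a small "orders" sample (mean is 3)

# Sample forms (divide by n - 1), like Excel's STDEV() and VAR()
print(statistics.stdev(data))      # about 2.236
print(statistics.variance(data))   # 5.0

# Population forms (divide by n), like Excel's STDEVP() and VARP()
print(statistics.pstdev(data))     # 2.0
print(statistics.pvariance(data))  # 4.0
```

With only five values the two answers differ noticeably; as the sample grows, dividing by n − 1 and by n give ever closer results.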

It is not always possible to gain access to a whole Population. Does this make samples unreliable and second-rate? Definitely not! Statistics and statistical methods are very clever at making interesting inferences from well-crafted samples, though the details go beyond the scope of this article. Take it as given that you'll most commonly be dealing with samples.


Bell Shaped Curve

So, the combination of Variation measures (Standard Deviation), and Central measures (Average) brings us to a point where we can begin to understand the famous bell-shaped curve. See below courtesy of Wikipedia.

[Image: bell-shaped curve (normal distribution), courtesy of Wikipedia, from Variability in Excel at Office-Watch.com]

I would also recommend the first paragraph of the Wikipedia article that accompanies this graph.

This graph is commonly referred to as the normal distribution: “normal” because values governed by this curve (or distribution) occur widely in our daily life.

Why is it important to understand the bell curve? Misuses of the bell curve have been used to fool people into all manner of silly and dangerous ideas. Even a basic understanding of it will help you avoid the tricks of shysters.

Taking a brief look at the curve, it is color-coded in a particular way. On the horizontal axis, at the centre, is a letter that looks like a “u” with a tail. It is the Greek letter “mu” (μ) and marks the average value of the distribution. Further along the horizontal axis are values marked with a symbol like an “o” with a tail: that is “sigma” (σ), representing the standard deviation of the distribution.

Now, looking at the dark blue sections, we can see that 34.1% of values occur within one standard deviation to the left of mu, and, by symmetry, another 34.1% occur within one standard deviation to the right. Putting this together, we can infer that 68.2% of values in a normal distribution fall within plus or minus one standard deviation of the mean. Heights are generally normally distributed. For a given population we could calculate an average height and a standard deviation; if the population is normally distributed, we can infer that 68.2% of measured heights will fall within one standard deviation either side of the mean.

Applying Excel's statistical functions to the above example, the probability of a value falling within one standard deviation to the right of mu can be derived using NORMSDIST().

Entering the formula =NORMSDIST(1) - 0.5 into any cell gives the probability 0.3413, or 34.1%. We subtract 0.5 because we are only interested in the values to the right of mu: effectively half (0.5) of the distribution is not of interest, since we are measuring one of the two halves of the symmetrical distribution.
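Outside Excel, NORMSDIST is just the standard normal cumulative distribution function. A small Python sketch, building the CDF from the error function (a standard identity), reproduces the same figure:

```python
import math

def norm_s_dist(z):
    # Standard normal CDF: the same quantity Excel's NORMSDIST() returns.
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

print(round(norm_s_dist(1) - 0.5, 4))  # 0.3413
```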

Don't worry if the last part of this article is a bit too “mathsy”; it is just a quick look at the bell curve, what it means and how it may be used. The bell curve and the normal distribution are big subjects that go beyond the scope of this series of articles.

Going back to our first restaurant meal price example for a moment: the first restaurant's STDEV result is 1 and the second's is 9. At a glance you can tell that the second restaurant has a much greater spread of meal prices than the first. Of course, the Range results (2 and 18 respectively) tell much the same story in this simple example.

This is also an example of a good general rule – the higher the Standard Deviation, the greater the spread of results around the average. A low STDEV value indicates that most of the results are near the average.
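The restaurant figures quoted above are easy to verify. A quick Python check, using statistics.stdev as the sample standard deviation (the equivalent of Excel's STDEV):

```python
import statistics

menu_1 = [9, 10, 11]  # prices at the first restaurant
menu_2 = [1, 10, 19]  # prices at the second restaurant

print(statistics.stdev(menu_1))   # 1.0
print(statistics.stdev(menu_2))   # 9.0
print(max(menu_1) - min(menu_1))  # 2  (Range)
print(max(menu_2) - min(menu_2))  # 18 (Range)
```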

Finally, here’s the table of data as it appears in the screen shot above. Why not print off the article, copy and paste the table data into Excel and practice using the functions as shown, step-by-step?

Day        Orders 1   Orders 2   Orders 3
Monday     4          4          4
Tuesday    2          2          2
Wednesday  0          None       (empty cell)
Thursday   6          6          6
Friday     3          3          3


More on Standard Deviation

If you'd like to know more about Standard Deviation, or want a reminder of that first-year college course, here are some websites to check out.

Wikipedia gets a lot of bad press, but for general science and maths it's a good place to start. Thanks to Excel, you can skip over the formulas and focus on understanding what the results mean.

TechBookReport has an explanation without maths.

Robert Niles has a good summary for journalists; now if only we could get more journalists to read it before parroting the latest bogus ‘survey’ that crosses their desk.

In the third Stats article we will look at quartiles, measurement in percentiles, and how standardizing works. Until then, good luck.