Descriptive Statistics - Parte Deux
16 Jul 2008 Rob Slazas
This is the second of three posts in Descriptive Statistics. Click here to see the full list of statistics posts.
We left off last time with central tendency, spread, and trying to beat Quan in a breath-holding contest. In this post we will continue the topic of descriptive statistics by covering shape.
Platy-kur-what?
When it comes to describing the shape of a dataset, we are almost always visualizing it in the form of a histogram. Fortunately, the hist command is quite capable and flexible, which makes this work much easier. If we recall my breathholding data from last time, which we examined with a scatter / dot-plot and a boxplot, we can look at its shape this time in histogram form (You can download the .mat file here).
load RobPracticeHolds.mat;
h1 = figure('Position',[100 100 600 400],'Color','w');
subplot(4,1,1:2);
hist(breathholds);
title('Practice Holds');
showx = get(gca,'Xlim');
subplot(4,1,3);
boxplot(breathholds,'orientation','horizontal','widths',.5);
set(gca,'Yticklabel',[],'Xlim',showx);
xlabel('');
ylabel('');
subplot(4,1,4);
scatter(breathholds,ones(size(breathholds)));
set(gca,'Yticklabel',[],'Xlim',showx);
xlabel('seconds');

The hist function (top graph) does a few things: first it fits 10 “bins” equispaced within the range of the data (default is 10 bins, but using the nbins input arg changes that); then it counts the number of points that land inside of each of those bins; and finally it draws a vertical bar graph of that count, using bar widths to fit those bins.
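To make those three steps concrete, here is a minimal sketch of the binning done by hand. The data vector x is made up for illustration (it is not the breath-hold data), and the handling of points that land exactly on a bin edge is a choice made here; hist may treat such points differently.

```matlab
% Hypothetical data, NOT the breath-hold measurements from the .mat file
x = [41 44 47 52 55 55 58 60 63 90];
nbins = 10;

% Step 1: equispaced bin edges spanning the range of the data
edges = linspace(min(x), max(x), nbins+1);

% Step 2: count the points that land inside each bin
counts = zeros(1, nbins);
for k = 1:nbins
    if k < nbins
        inBin = x >= edges(k) & x < edges(k+1);    % left edge included
    else
        inBin = x >= edges(k) & x <= edges(k+1);   % last bin also keeps the maximum
    end
    counts(k) = sum(inBin);
end

% Step 3: draw the counts as touching vertical bars at the bin centers
centers = (edges(1:end-1) + edges(2:end)) / 2;
bar(centers, counts, 1);
```

Since none of these made-up points sits on an interior edge, hist(x) should produce the same counts.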
The data takes on a “toothier” look using hist, where the main features are the 2 peak bins with 8 points each, and an empty bin to the right, sitting between the outlier and the rest of the data. In general, I would “eyeball” this shape to be slightly asymmetric, with the longer tail going out to the right (called skewed to the right). It also looks like a somewhat flat / wide shape since the data is not tightly grouped (called platykurtosis).
These visual observations of shape are nice, but they sound too much like art critique to be useful. The next questions you might have are: what do they mean, and is there a way to quantify them so we don’t have to rely on the subjectiveness of the “eyeball” method?
Quantifying Shape
When looking at histograms, the two basic ways to quantify their shape answer two questions about them:
- Is the data asymmetric, and if so - which way does it tilt?
- How is the data spread out - wide and flat or skinny and tall?
Skewness answers the first question about symmetry. For a given dataset, it compares the tails on the left and right sides of the mean against each other. Perfectly symmetrical tails (such as the Normal Distribution has) return a skewness value of zero. Datasets with a heavier left tail return negative numbers (skewed to the left), while heavier right tails return positive numbers (skewed to the right). The greater the magnitude of skewness, the less balanced the tails.
myskew = skewness(breathholds);
disp(['My practice holds have a skewness of ',num2str(myskew)]);
if myskew+.1<0
    disp('So they are skewed to the left.');
elseif myskew-.1>0
    disp('So they are skewed to the right.');
else
    disp('So they are approximately symmetrically distributed.');
end
My practice holds have a skewness of 0.72107
So they are skewed to the right.
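For reference, the number that skewness reports can be reproduced from moments. The sketch below assumes the function's default definition (no bias correction): the third central moment divided by the second central moment raised to the power 1.5. The vector x and the variable names are made up for illustration and are not the breath-hold data.

```matlab
x  = [1 2 3 4 10];          % hypothetical data with one large value on the right
d  = x - mean(x);           % deviations from the mean
m2 = mean(d.^2);            % second central moment (divides by n, not n-1)
m3 = mean(d.^3);            % third central moment
skewByHand = m3 / m2^1.5;   % standardized third moment

disp(['Skewness by hand: ', num2str(skewByHand)]);
% With the Statistics Toolbox installed, skewness(x) should agree with skewByHand.
```

The cube keeps the sign of each deviation, so the single large value on the right dominates and the result is positive.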
Kurtosis answers the second question about how widely or narrowly the data is spread, compared to a Normal Distribution. This function sets the kurtosis of a perfect Normal Distribution equal to 3 (some other software packages subtract 3 so that a perfect Normal equals zero). If the given distribution is flatter / wider than Normal, then kurtosis returns values greater than 3 (indicating that more outliers could be present). For a narrower / taller than Normal distribution, the kurtosis is less than 3 (indicating outliers are less likely).
mykurt = kurtosis(breathholds);
disp(['My practice holds have a kurtosis of ',num2str(mykurt)]);
if mykurt+.1<3
    disp('So they are narrower than Normal.');
elseif mykurt-.1>3
    disp('So they are flatter than Normal.');
else
    disp('So they are spread about the same as Normal.');
end
My practice holds have a kurtosis of 3.1865
So they are flatter than Normal.
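The kurtosis value can be reproduced the same way. This sketch again assumes the default (no bias correction) definition, the fourth central moment divided by the squared second central moment, and uses a made-up vector rather than the breath-hold data.

```matlab
x  = [1 2 3 4 10];              % hypothetical data, for illustration only
d  = x - mean(x);               % deviations from the mean
m2 = mean(d.^2);                % second central moment (divides by n)
m4 = mean(d.^4);                % fourth central moment
kurtByHand = m4 / m2^2;         % a Normal distribution would give 3
excessKurt = kurtByHand - 3;    % the "subtract 3" convention used by some packages

disp(['Kurtosis by hand: ', num2str(kurtByHand)]);
% With the Statistics Toolbox installed, kurtosis(x) should agree with kurtByHand.
```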
A Real-Life Use of Shape
As with a lot of things in math, you might be thinking “OK, that’s nice. But when would I use this?” Shape parameters are used most often when checking whether certain inferential statistics are appropriate for your data, particularly statistical tests that depend on the shape being approximately Normal or that focus on what the tails are doing.
For example, an industry standard test to determine if a group of values meets a specification based on a small sample (i.e. only testing a few out of the group) is found in ANSI Z1.9. This test procedure makes assumptions about the skewness and kurtosis (symmetry and flatness) of the dataset being close to that of the Normal Distribution. If you have a one-sided specification, let’s say it is higher-is-better, then data that is skewed to the right would be more likely to fail this test unnecessarily. In fact, you would be penalized for better performance (higher values)!
h2 = figure('Position',[100 100 600 400],'Color','w');
histfit(breathholds);
title('Practice Holds Compared to ANSI Z1.9');
legend('My data','ANSI Z1.9 thinks');
annotation('textarrow',[.25 .32],[.6 .13],...
    'string',[{'Overestimate of'};{'the left tail'}]);
annotation('line',[.35 .35],[.11 .92],'color','r','linewidth',2);
So, let’s assume that I cannot accept a breath hold time of less than 40 seconds. If we were to run the ANSI Z1.9 test on my practice breathholding data, the higher times cause the test to assume the variation is symmetric (like the Normal distribution). It therefore concludes that the left tail (low values) extends below our lower limit of 40 by an unacceptable amount (we fail the test).
So, when testing to see if I can hold my breath for an acceptable amount of time, I should choose something other than ANSI Z1.9 since my data is skewed. Continuing to use this type of test would be overly conservative and reject some good results.
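The overestimate of the left tail can be illustrated with a rough calculation. This is only a sketch of the underlying Normal assumption, not the ANSI Z1.9 procedure itself; the data vector, the limit, and the variable names are invented for the example.

```matlab
% Hypothetical right-skewed hold times in seconds (NOT the author's data)
x = [45 48 50 52 53 55 56 58 60 62 65 70 78 95];
lowerLimit = 40;                 % assumed lower specification limit

m = mean(x);
s = std(x);                      % sample standard deviation
z = (lowerLimit - m) / s;        % limit expressed in standard deviations from the mean

% Fraction a symmetric Normal model places below the limit.
% The Normal CDF is written with erfc so no toolbox is needed.
pNormal = 0.5 * erfc(-z / sqrt(2));

% Fraction of the observed values that actually fall below the limit
pEmpirical = mean(x < lowerLimit);

fprintf('Normal model: %.1f%% below limit, observed: %.1f%%\n', ...
    100*pNormal, 100*pEmpirical);
```

In this made-up sample no value is below 40, yet the long right tail inflates the standard deviation enough that the Normal model still places several percent of the distribution below the limit.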
Wrapping up
That’s it for our introduction to descriptive statistics. In the next post we’ll tidy up with a few of my favorite visualizations for descriptive stats. Then, from there on, we’re going to be talking about inferential statistics, how they’re used, and what visualizations go along with them. As usual, questions and comments are welcome below.
7 Responses to “Descriptive Statistics - Parte Deux”
That was a good read. keep up the good work Rob!
“* How is the data spread out - wide and flat or skinny and tall?”
As written, that’s a question about variance, not kurtosis. Be careful to distinguish between them.
“Skewness answers the first question about symmetry. For a given dataset, it compares the tails on the left and right sides of the mean against each other.
…
The greater the magnitude of skewness, the less balanced the tails.”
Well, actually, the skewness measure you’re using is a standardized third moment, which does not literally “compare the tails”, except in a very particular (and quite restricted) sense.
In particular, note that while symmetry can imply zero skewness (not always though; consider a t-distribution with three degrees of freedom, t(3) - which is symmetric but its 3rd-moment-based “skewness” isn’t zero, since the integral doesn’t exist), the implication does not go the other way. The distribution may be asymmetric, yet the distribution could still have a standardized third moment of zero. (there’s a simple example given here - a fair six-sided die labelled 0, 0, 5, 5, 5, 9 has a distribution that is not symmetric, but has zero third moment, and so zero skewness).
Consequently this:
disp('So they are approximately symmetrically distributed.');
is not really a correct conclusion on finding a near-zero standardized third moment.
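The die example above is easy to check numerically. A minimal sketch (the variable names are made up here):

```matlab
faces = [0 0 5 5 5 9];        % labels on the six equally likely faces
mu = mean(faces);             % population mean of the die
d  = faces - mu;              % deviations from the mean
m3 = mean(d.^3);              % third central moment

disp(['Mean: ', num2str(mu), ', median: ', num2str(median(faces))]);
disp(['Third central moment: ', num2str(m3)]);
```

The third central moment comes out as zero even though the mean and median differ and the deviations on the two sides of the mean do not mirror each other.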
“If the given distribution is flatter / wider than Normal, then kurtosis returns values greater than 3 (indicating that more outliers could be present). For a narrower / taller than Normal distribution, the kurtosis is less than 3 (indicating outliers are less likely).”
I am sorry, but this is an incorrect characterization of the situation.
If you were to compare a t(5) distribution with a normal (kurtosis 9 vs kurtosis 3) your characterization would say that a t(5) is flatter, and if you draw the raw densities it looks like you’d be right… but the t(5) has a larger variance (5/3 vs 1 for the standard versions of both). Once you adjust the t(5)’s variance to match, it is higher in the middle than the normal, while the normal is higher in the area around a standard deviation either side of the mean, and then the t gets higher again in the tails. (This “more peaked in the middle” with higher kurtosis is common but not universal).
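The comparison described above can be sketched without any toolbox by writing out the two densities. The helper names tpdf5, tUnit and npdf are invented for this example; the t density is the textbook formula, rescaled so that its variance is 1.

```matlab
nu = 5;                              % degrees of freedom
c  = sqrt(nu/(nu-2));                % standard deviation of t(5), sqrt(5/3)

% Density of a standard t distribution with nu degrees of freedom
tpdf5 = @(t) gamma((nu+1)/2) ./ (sqrt(nu*pi)*gamma(nu/2)) ...
             .* (1 + t.^2/nu).^(-(nu+1)/2);

% Density of t(5) after dividing by its standard deviation (variance 1)
tUnit = @(y) c * tpdf5(c*y);

% Standard Normal density
npdf = @(y) exp(-y.^2/2) / sqrt(2*pi);

y     = [0 1 4];                     % middle, one sigma out, far tail
tVals = tUnit(y);
nVals = npdf(y);
disp([y; tVals; nVals]);             % rows: y, rescaled t(5), Normal
```

With the variances matched, the t(5) density is the larger of the two at 0 and at 4, and the smaller of the two at 1.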
Kurtosis is related to the variance about mu+/- sigma (sometimes called “the shoulders”). As the data becomes less concentrated about those two points (putting more data both toward the middle and out in the tails), the value becomes larger. So there’s (roughly speaking) a tendency to become both heavier tailed and more concentrated in the centre as kurtosis goes up (at a given variance), but it’s always possible to find counterexamples.
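One way to see the link to the shoulders: for standardized values z, the kurtosis equals 1 plus the variance of z squared, and z squared equals 1 exactly at mu +/- sigma. A small numerical check on made-up data (names are hypothetical, moments divide by n):

```matlab
x  = [1 2 3 4 10];                % hypothetical data, for illustration only
d  = x - mean(x);
m2 = mean(d.^2);                  % second central moment (divides by n)
z  = d / sqrt(m2);                % standardized values: mean 0, variance 1

kurtFromMoments = mean(d.^4) / m2^2;
kurtFromSpread  = 1 + var(z.^2, 1);   % var(...,1) also divides by n

disp([kurtFromMoments, kurtFromSpread]);
```

The more the squared standardized values scatter away from 1 (toward the centre or out into the tails), the larger the kurtosis.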
Start with looking at Wikipedia on kurtosis, which has the right general idea, but note that it is also not quite right, so I’d highly recommend that you read Kendall and Stuart (“Advanced Theory of Statistics”, Vol I, 3rd or 4th ed should have it) on how kurtosis is often misinterpreted (especially see their exercises), which explains why the Wikipedia discussion (e.g. where it says “A high kurtosis distribution has a sharper ‘peak’ and fatter ‘tails’, while a low kurtosis distribution has a more rounded peak with wider ‘shoulders’”) is often but not always true.
Also, Kendall and Stuart have numerous examples of why zero third moment doesn’t imply symmetry.
“This test procedure makes assumptions about the skewness and kurtosis (symmetry and flatness) of the dataset being close to that of the Normal Distribution.”
Actually, if I understand correctly, ANSI Z1.9 assumes normality, not just the same skewness and kurtosis as a normal distribution. You can be far from normal and still have those moment-based quantities being exactly right, so they may not be ideal ways to pick up problems with applying ANSI Z1.9. Applying separate tests of skewness and kurtosis (the so-called ‘rectangle test’) is not a bad test for normality, though it’s possible to do much better. On the other hand, I don’t think tests are what is required here - it’s generally more important to try to quantify the costs of assuming normality in a given situation (in this case, whether the standard is relevant in spite of the fact that you know before you even see any data that the distribution isn’t going to be exactly normal).
Sorry this reply sounds so negative. This stuff is very hard to write about at an elementary level (I am quite familiar with the difficulties) - you’re pretty much stuck with either making incorrect generalizations or you’re spending more text dealing with the exceptions than the original point, and losing the audience you were hoping to reach.
I have recently written a cautionary post on histograms that might be of interest to people who read this post. It’s got some information that’s useful to keep in mind when looking at histograms to assess distributional shape.
@ Quan, thanks.
@ efrique,
Thanks for such a detailed reading and response to the post. The discussions that spawn from great exchanges like these are usually quite valuable.
I emphatically agree with your last comment, that a survey article about the statistics toolbox is an exercise in compromise. For most readers who occasionally use the toolbox, I run the risk of over-generalizing to get the basic point across (as it appears I have done here). For frequent toolbox users and experts such as yourself, this is clearly not news. My plan is to briefly mention the various functions of these tools, giving examples and some definition. Then, based upon what readers respond to, go into more detailed posts. Circling back to shape in depth might be worthwhile. It sounds as if you may have previously written on the topic.
You bring up some interesting points with respect to the t-family of distributions and skewness / kurtosis. I thought it would be neat to do some quick Monte Carlo sims with the t(3) and t(5) pdf’s to investigate this. I’ll save the output graphics for a later post, but here’s some code you can run and look at with me:
On the t(3) skewness plot - look at how small the iqr is and that it is centered on zero. But wow, even with a pretty big sample size per sim there are some wildly behaving points, sometimes out to +/- 30 or so depending how many times you run the MC. This looks like it would be a really cool area of study.
On the t(5) kurtosis plot - a couple of things. As you might expect it piles heavily into the 4 to 8 range of values (the iqr is tiny). But again there are occasional wild values even in the face of high sample size. Also, I get your point about comparing it to a mismatched variance normal pdf. I’ll be more specific in the future about pointing out the “peakiness” shape difference instead of just the wide/flat versus narrow/tall. Good catch.
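A minimal sketch of a Monte Carlo simulation along these lines is shown below. It is not the original listing: the number of simulations, the sample size per simulation, the seed, and the variable names are all assumptions, and it relies on trnd, skewness, kurtosis and boxplot from the Statistics Toolbox.

```matlab
rng(0);            % fixed seed so reruns match (rng assumes R2011a or later)
nSims = 500;       % number of simulated datasets (assumed value)
n     = 1000;      % points per simulated dataset (assumed value)

% Each column is one simulated dataset; skewness and kurtosis work column-wise
skews = skewness(trnd(3, n, nSims));   % sample skewness of t(3) draws
kurts = kurtosis(trnd(5, n, nSims));   % sample kurtosis of t(5) draws

figure('Color','w');
subplot(1,2,1);
boxplot(skews(:));
title('Sample skewness, t(3)');
subplot(1,2,2);
boxplot(kurts(:));
title('Sample kurtosis, t(5)');
```

Changing the seed or increasing nSims is a quick way to see how often the wild values show up.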
Yes, Z1.9 assumes normality. In application of it, I have come across cases where the thing being measured by it appeared to deviate from normality sometimes (sample sizes were too low to be conclusive). We found through practice that a skewness / kurtosis test of data at the entrance was very good at predicting if we would overestimate the non-conformances, which were later confirmed with greater sampling. So these comments draw upon the adherence of normally distributed data to some bounds of s and k over time. But ultimately you have said it right that Z1.9 doesn’t mention them directly.
Lastly (this is the longest comment I’ve ever written!), I like your cautionary post about histograms at the link above. Look for a post here in the future that investigates MATLAB’s robustness (or vulnerability, we’ll see) to those types of perils. Once again, thank you for a great commentary.
Best,
Rob
Hi Rob.
While it would often be the case that having the right skewness and kurtosis would give you reasonably close to the right tail probabilities (which would correspond to the conformances you mention), it’s worth pointing out that there are circumstances where this won’t happen so well.
As an example, consider a symmetric continuous distribution - the distribution of Y = SX, where X has a gamma distribution with shape parameter a* and unit scale parameter, and S is a random sign (i.e. S = +1 or -1 with equal probability) independent of X. By analogy with the double exponential we might refer to the distribution of Y as a double gamma (it’s sometimes called a reflected gamma).
*(I’d say alpha but it’s easier to type ‘a’)
If I have done the integrations correctly (which I may well not have, I did them in a hurry), the kurtosis of this distribution will be (a+3)(a+2)/[(a+1)a].
That is equal to 3 when the shape parameter, a, is about 2.3027.
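The formula and the quoted value of a can be checked quickly. The sketch assumes the standard raw moments of a unit-scale gamma, E[X^2] = a(a+1) and E[X^4] = a(a+1)(a+2)(a+3), which carry over to Y because the random sign drops out of even powers. Setting the kurtosis equal to 3 then reduces to a^2 - a - 3 = 0. The names kurtDG and aStar are made up here.

```matlab
% Kurtosis of the double (reflected) gamma as a function of the shape a
kurtDG = @(a) (a+3).*(a+2) ./ ((a+1).*a);

% Positive root of a^2 - a - 3 = 0
aStar = (1 + sqrt(13)) / 2;

fprintf('a = %.4f gives kurtosis %.4f\n', aStar, kurtDG(aStar));
```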
The lower tail probability of this distribution is half the upper tail probability of a gamma with the same shape parameter (the probability in both tails combined is equal to the upper tail probability of that corresponding gamma distribution). The distribution of Y is bimodal with substantially heavier tails than the normal, particularly in the extreme tail.
This has skewness zero and kurtosis very close to 3.
(The reflected gamma or double gamma distribution actually is used in a variety of application areas)
It’s easy to construct distributions whose third and fourth moments correspond to the normal but which can be quite far from normal; in particular some degree of caution is required if there’s a possibility that the distribution is not unimodal.
(Once the distribution is symmetric, continuous and unimodal, having the third and fourth moments equal to that of the normal does start to pin down the tail probabilities somewhat better, though it might still surprise you how far they can be pushed.)
I should add that last I worked with this (20 years ago I think), for small samples I found a slightly larger value of a (2.5-3 depending on sample size) typically gave sample kurtosis closer to 3.
This sort of thing is typical - even though the population value is right at smaller a, the sample estimate is not unbiased for this distribution, and the distribution of the sample kurtosis is pretty skewed in small samples. Either effect might lead to slightly larger values producing sample kurtosis that’s typically nearer to right.