This is the first of three posts in Descriptive Statistics.

Our first dive into the Statistics Toolbox will cover descriptive statistics. These are the functions that help us understand what data we have on hand. As we said in the introductory post, descriptive statistics stop there, and do not make assumptions or reach conclusions about where your data came from, or how it compares to other datasets.

Holding Our Breath

So, it turns out that one of Quan’s many talents is the ability to hold his breath for long periods of time. I’m wondering if I should challenge him to a breath-holding contest: whoever holds it longer wins. Since I’m not sure if I have a chance at beating him, I took some practice breath holds and timed myself. Here are my results after 40 holds:

load RobPracticeHolds.mat; y = ones(size(breathholds));
h1 = figure('Position',[100 100 400 100],'Color','w');
scatter(breathholds,y);
% using a scatter plot like a dot plot

xlabel('Seconds');
set(gca,'Yticklabel',[]);
title('Practice Holds');

[Figure: "Practice Holds" dot plot of the 40 breath-hold times, x-axis in seconds]

You can see that my times go from 40-something to over 120 seconds. I’m not very consistent. The middle of the dense cluster of times seems to be around 70 seconds, so let’s call that my “expected” time for now. As you can see, viewing these results as a bunch of points has its limitations, so what tools can we employ here to get more precise?

When it comes to describing a dataset, there are some standardized things to look at before getting “application specific.” Descriptors of central tendency seek to locate the data group, typically by finding an estimate of where the middle lies. Next, descriptors of spread explain how widely dispersed or tightly grouped the data is. And lastly, the shape measures how the data looks as a body, usually in terms of a histogram.

Central Tendency

For the central tendency of my data, we will look to the mean, median, and mode. All three may not necessarily apply to our data, so let’s examine what each of them will return.

Mean returns the arithmetic mean, also called the average, of the data. This is probably the most commonly used statistic. It simply sums up all the values and divides by the number of values (giving equal weight to each value in the dataset). For a vector x, calling mean(x) is the same as calling sum(x)/numel(x).
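
To make that equivalence concrete, here is a minimal sketch. The values in x are made up for illustration and are not my breath-hold data.

```matlab
% Hypothetical times in seconds, made up for illustration (not the post's data)
x = [60 70 80 90];

m_builtin = mean(x);            % built-in arithmetic mean
m_manual  = sum(x) / numel(x);  % sum of the values divided by their count

fprintf('mean(x) = %g, sum(x)/numel(x) = %g\n', m_builtin, m_manual);
% prints: mean(x) = 75, sum(x)/numel(x) = 75
```

Note that this only holds for vectors. Given a matrix, mean works down each column, while sum(x)/numel(x) would not give the same thing.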

Median returns the central value in the dataset. This is simpler than mean in that it finds the middle of a sorted list of values. If there is an odd number of values, then the center value is the median. If there is an even number of values, then the median is the value halfway between the two center values. The median usually differs from the mean, since asymmetry in the data (a few extreme values on one side) pulls the mean toward it while barely moving the median.
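
A small sketch of the odd and even cases, plus the effect of asymmetry. All of the numbers here are made up for illustration.

```matlab
% Hypothetical values, made up for illustration (not the post's data)
odd_set  = [3 1 2];           % sorted: 1 2 3
even_set = [4 1 3 2];         % sorted: 1 2 3 4

med_odd  = median(odd_set);   % the middle value, 2
med_even = median(even_set);  % halfway between 2 and 3, so 2.5

% One extreme value drags the mean upward but leaves the median alone
skewed = [1 2 3 4 100];
fprintf('mean = %g, median = %g\n', mean(skewed), median(skewed));
% prints: mean = 22, median = 3
```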

Mode returns the most frequently occurring value. If your dataset has more than one most frequently occurring value, then the lowest of them is returned. For small datasets such as mine, it will be unlikely that any values occur more than once. For very large datasets, or data that has been rounded, filtered, or transformed in some way, mode can be more meaningful (such as the peak of a histogram).
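
Here is a quick sketch of how mode behaves with ties, with all-unique values, and after rounding. The vectors are hypothetical, chosen only to show each case.

```matlab
% Hypothetical values, made up for illustration (not the post's data)
tied = [5 3 3 5 7];
[m_tied, f_tied] = mode(tied);          % 3 and 5 both occur twice; the lowest, 3, is returned

all_unique = [1.2 3.4 5.6];
[m_unique, f_unique] = mode(all_unique); % every value occurs once, so the lowest, 1.2, is returned

% Rounding to whole seconds creates repeats, which makes the mode more informative
times = [70.2 69.8 71.4 70.4 88.9];
[m_round, f_round] = mode(round(times)); % rounded values are 70 70 71 70 89

fprintf('tied: %g (x%d), unique: %g (x%d), rounded: %g (x%d)\n', ...
    m_tied, f_tied, m_unique, f_unique, m_round, f_round);
% prints: tied: 3 (x2), unique: 1.2 (x1), rounded: 70 (x3)
```

The second output of mode is the number of times the returned value occurs, which is a handy check on whether the mode means anything for your data.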

So for my practice breath hold data, I select the mean and median to describe the central tendency of the dataset.

disp(['The mean is ',num2str(mean(breathholds)),' seconds (green line).']);
disp(['The median is ',num2str(median(breathholds)),' seconds (red line).']);
hold on; % keep the existing dot plot while adding lines
line([mean(breathholds) mean(breathholds)],[0.5 1.5],'color','g');
line([median(breathholds) median(breathholds)],[0.5 1.5],'color','r');
The mean is 74.1834 seconds (green line).
The median is 73.0013 seconds (red line).

[Figure: the dot plot with the mean (green) and median (red) lines added]

These are better estimates than my previous “eyeball” guess of 70 seconds. So, when I tell Quan how long I tend to hold my breath, I can say something like “I expect to hold it around 74 seconds” and see if he laughs. If he doesn’t laugh, then maybe I have a shot at beating him.

Spread

So let’s suppose that Quan doesn’t laugh at me, and maybe I have a chance at holding my breath longer than him. My next concern is whether or not I am consistent enough to have a winning time in our contest. I will only get one try, so how much higher or lower than my center of about 74 seconds should I expect? From my first “eyeball” estimate, my times went from 40-something to over 120 seconds. How can we measure the spread more precisely? There are several methods available, so let’s see what each will return when called:

Std returns the standard deviation of the dataset. This is probably the second-most used statistic, behind the mean. An overly simplified explanation of std is that it takes the average difference between each point and the mean. Higher values of std indicate that the data is (on average) farther away from the mean, and thus more spread out. Low values of std indicate tightly grouped data.

The std equation is much more impressive to look at though, since it squares each point’s difference from the mean (deviation), sums them up, divides by (n-1), and then takes the square root to “standardize” the result.

h2 = figure('Position',[200 200 400 100],'Color','w'); axis off;
text(.5,.45,'$$std(x) = \sqrt{\frac{\sum_{i=1}^n(x_i-\bar{x})^2}{n-1}}$$',...
'Interpreter','latex','HorizontalAlignment','Center','FontSize',16);

[Figure: the standard deviation formula rendered in a figure window]

disp(['The standard deviation of my practice holds is ',...
num2str(std(breathholds)),' seconds.']);
The standard deviation of my practice holds is 19.0874 seconds.
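
To see that the formula and the function agree, the steps described above can be written out by hand. This is a minimal sketch on a made-up vector, not my practice data. It also computes the plain average absolute difference from the mean, to show why the "average difference" description is only a simplification.

```matlab
% Hypothetical values, made up for illustration (not the post's data)
x = [2 4 4 4 5 5 7 9];
n = numel(x);

dev       = x - mean(x);                   % each point's deviation from the mean
s_manual  = sqrt(sum(dev.^2) / (n - 1));   % square, sum, divide by n-1, square root
s_builtin = std(x);                        % built-in, uses the same n-1 normalization by default

avg_abs_dev = mean(abs(dev));              % the "overly simplified" average difference

fprintf('manual = %.4f, std(x) = %.4f, average |deviation| = %.4f\n', ...
    s_manual, s_builtin, avg_abs_dev);
% prints: manual = 2.1381, std(x) = 2.1381, average |deviation| = 1.5000
```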

Min, Max, and Range tell you the extents of your data. If your data is representative of the phenomenon being measured, then looking at min and max gives you outer boundaries around what to expect, while range describes the width of your data. One weakness of these three functions is that they are vulnerable to outliers. If you have a wild value in the dataset that is not really representative of what you are studying, they will not screen it out, and that single value will set the min, max, or range. By the way, for a vector x, range(x) is the same as calling max(x)-min(x).

figure(h1);
disp(['My times go from ',num2str(min(breathholds)),' to ',...
    num2str(max(breathholds)),' seconds (black lines),']);
disp(['with a total range of ',num2str(range(breathholds)),' seconds.']);
line([min(breathholds) min(breathholds)],[0.5 1.5],'color','k');
line([max(breathholds) max(breathholds)],[0.5 1.5],'color','k');
My times go from 44.8692 to 128.2181 seconds (black lines),
with a total range of 83.3489 seconds.

[Figure: the dot plot with the min and max lines (black) added]

IQR, inter-quartile range, is a better estimate of the spread than the range. It measures the width of the 2 inner quarters of the points in the dataset, telling you the bounds which hold half of your data. By ignoring the outer quarters of the dataset, iqr is not sensitive to outliers. To find the iqr of a dataset, sort it and then select the middle 50% of the points (same as finding the 25% and 75% quantiles). The difference between the values at either end of the middle 50% of points is the inter-quartile range.

quantiles = quantile(breathholds,[.25 .75]);
disp(['The middle 50% of my times are between ' ,num2str(quantiles(1)),...
    ' and ',num2str(quantiles(2))]);
disp(['seconds (blue lines), with an IQR of ',num2str(iqr(breathholds))]);
line([quantiles(1) quantiles(1)],[0.5 1.5],'color','b');
line([quantiles(2) quantiles(2)],[0.5 1.5],'color','b');
The middle 50% of my times are between 60.6648 and 84.6078
seconds (blue lines), with an IQR of 23.9431
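
To illustrate the point about outliers, here is a small sketch comparing range and iqr on two made-up vectors that differ only in their largest value. The numbers are hypothetical and are not my practice holds.

```matlab
% Hypothetical times in seconds, made up for illustration (not the post's data)
x     = [60 62 65 70 72 75 78 80];
x_out = [60 62 65 70 72 75 78 300];   % same data, but the largest value is now a wild one

% iqr is the distance between the 25% and 75% quantiles
iqr_manual = diff(quantile(x, [.25 .75]));

fprintf('range: %g -> %g\n', range(x), range(x_out));
fprintf('iqr:   %g -> %g\n', iqr(x),   iqr(x_out));
% The range jumps from 20 to 240, while the iqr is unchanged,
% because the wild value sits in the outer quarter that iqr ignores.
```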

[Figure: the dot plot with the 25% and 75% quantile lines (blue) added]

So now I’m getting a better understanding of what to expect during a breath-holding contest. It seems most likely that I will hold it for a minute to almost a minute and a half. I wonder if this is competitive?

The 5-in-1 tool

Before we move on, let’s look at the “Swiss Army knife” of descriptive statistics: boxplot. Most of the things we just reviewed above are contained in a boxplot, plus a bonus feature of outlier protection. That’s right, boxplot includes a simple guarding scheme that flags as outliers any points that are more than 1.5 times the iqr away from the box. Pretty cool. Let’s take a look at what we have constructed on our scatter plot compared to a real boxplot.

title('Scatter with Min, 25%iqr, Median, Mean, 75%iqr, & Max lines');
xlabel('');
h3 = figure('Position',[100 100 400 100],'Color','w');
boxplot(breathholds,'orientation','horizontal','widths',.5);
set(gca,'XLim',[40 140]); title('A Boxplot of the same data');
xlabel(''); set(gca,'Yticklabel',[]); ylabel('');

[Figure: the annotated scatter plot, followed by a horizontal boxplot of the same data]

When we line up the two plots, see how the boxplot automatically contains visual representations of the min (left whisker), iqr (blue box), median (red line), and max (right whisker)? But wait! Something is different with the right whisker: our max from the scatter plot was identified by the boxplot as an outlier, since it sits more than 1.5 x iqr away from the box. So boxplot marked it with a red ‘+’ and then pulled the ‘whisker’ in to the next lower point, the largest value that is not an outlier.
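
We can check that outlier call with a little arithmetic. This sketch plugs in the quartiles, min, and max printed earlier in this post, and assumes the 1.5 x iqr rule described above. The variable names are my own.

```matlab
% Values printed earlier in this post
q1 = 60.6648;  q3 = 84.6078;           % 25% and 75% quantiles
min_time = 44.8692;  max_time = 128.2181;

box_width   = q3 - q1;                 % the iqr, about 23.94
upper_fence = q3 + 1.5 * box_width;    % about 120.52
lower_fence = q1 - 1.5 * box_width;    % about 24.75

max_is_outlier = max_time > upper_fence;   % true: 128.2 is beyond 120.5
min_is_outlier = min_time < lower_fence;   % false: 44.9 is well inside 24.8

fprintf('fences: %.2f to %.2f, max outlier: %d, min outlier: %d\n', ...
    lower_fence, upper_fence, max_is_outlier, min_is_outlier);
% prints: fences: 24.75 to 120.52, max outlier: 1, min outlier: 0
```

For a full data vector x, the same test would be something like x(x > upper_fence | x < lower_fence), which returns every point outside the fences.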

When you know how to read a boxplot, it is a very efficient way to summarize the central tendency and spread of your dataset. It lacks only the mean and std; in this view, the median and iqr stand in for them.

Wrapping Up

So after all the analysis of my practice data, I’m not very confident about beating Quan in a breath-holding contest. Maybe after more practice and the proper coaching, I’ll have a chance. ;-)

In our next segment, we will finish up descriptive stats with a look at shape and some more visualization techniques. If there are specific topics you would like to see covered in future posts, please leave a comment below. And, as always, questions are quite welcome.