Open Mind Tree: Statistics chapter 2

Chapter 2:

Describing and Exploring Data

Once data have been collected, the raw numbers must be organized or summarized in some fashion to make them more informative. Several options are available, including plotting the data and calculating descriptive statistics.

Plotting Data

Often, the first thing one does with a set of raw data is to plot a frequency distribution. Usually this is done by first creating a table of frequencies broken down by the values of the relevant variable, and then plotting the frequencies from the table in a histogram.

Example: Your age as estimated by the questionnaire from the first class

Table 2.1
Age	Frequency

18	3
19	10
20	14
21	10
22	5
23	2
24	1
25	1
26	2

Note: The frequencies in the above table were calculated by simply counting the number of subjects having the specified value for the age variable
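
The counting step can be sketched in a few lines of Python. The raw questionnaire responses are not reproduced in these notes, so the list `ages` below is rebuilt from Table 2.1 purely for illustration, and all variable names are assumptions of this sketch.

```python
from collections import Counter

# Table 2.1 expanded back into raw scores: each age repeated by its frequency.
table_2_1 = {18: 3, 19: 10, 20: 14, 21: 10, 22: 5, 23: 2, 24: 1, 25: 1, 26: 2}
ages = [age for age, count in table_2_1.items() for _ in range(count)]

# Counting how many subjects share each value gives the frequency table.
frequencies = Counter(ages)

# A crude text histogram: one asterisk per subject.
for age in sorted(frequencies):
    print(f"{age}\t{frequencies[age]:2d}\t{'*' * frequencies[age]}")
```

Printing one asterisk per subject gives a sideways version of the histogram that the table would normally be plotted as.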

Histogram

Grouping Data:

Plotting is easy when the variable of interest has a relatively small number of values (as our age variable did). However, the values of a variable are sometimes more continuous, and frequency plots made in the above manner become uninformative. For example, our weight variable ranges from 100 lb. to 200 lb. If we used the previously described technique, we would end up with about 100 bars, most of them with a frequency of less than 2 or 3 (and many with a frequency of zero). We can get around this problem by grouping our values into bins. As a rule of thumb, aim for around 10 bins with natural split points.

Example: Binning our weight variable

Table 2.2
Weight Bin	Midpoint	Frequency

100-109	104.5	6
110-119	114.5	10
120-129	124.5	6
130-139	134.5	10
140-149	144.5	5
150-159	154.5	3
160-169	164.5	4
170-179	174.5	1
180-189	184.5	0
190-199	194.5	2
200-209	204.5	1
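
As a rough sketch of how the binning could be done in Python, the example below uses the individual weights as they can be read off the stem & leaf plot of the same data later in this chapter. The bin width of 10 lb. matches Table 2.2; the variable names are illustrative assumptions.

```python
from collections import Counter

# Weights (lb.) as read off the stem & leaf plot of the same data.
weights = [
    100, 105, 107, 107, 108, 108,
    110, 110, 110, 111, 112, 113, 115, 115, 115, 118,
    120, 120, 121, 125, 125, 125,
    130, 130, 130, 132, 132, 134, 134, 135, 135, 135,
    140, 140, 140, 140, 145,
    150, 150, 155,
    162, 162, 165, 165,
    170,
    190, 195,
    200,
]

BIN_WIDTH = 10  # bins of 10 lb. give 11 bins over 100-209

# Integer division maps each weight to the lower edge of its bin.
bin_counts = Counter((w // BIN_WIDTH) * BIN_WIDTH for w in weights)

for low in range(100, 210, BIN_WIDTH):
    high = low + BIN_WIDTH - 1
    midpoint = (low + high) / 2
    print(f"{low}-{high}\t{midpoint}\t{bin_counts[low]}")
```

Looping over every bin edge, rather than only over the bins that occur, keeps the empty 180-189 bin in the output.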

Histogram


See the section in the text on cumulative frequency distributions

Stem & Leaf Plots

If the values of a variable must be grouped before creating a frequency plot, the information about the specific values is lost in the process (i.e., the resulting graph shows only the frequencies associated with the groups). However, it is possible to obtain the graphical advantage of grouping and still keep all of the information by using stem & leaf plots.

These plots are created by splitting each data point into the part associated with the "group" (the stem) and the part associated with the individual point (the leaf). For example, the numbers 180, 180, 181, 182, 185, 186, 187, 187, 189 could be represented as:

18 001256779

Thus, we could represent our weight data in the following stem & leaf plot:

Stem & Leaf

      10 057788
      11 0001235558
      12 001555
      13 0002244555
      14 00005
      15 005
      16 2255
      17 0
      18
      19 05
      20 0
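
A stem & leaf plot is simple enough to build by hand, but the splitting rule can also be sketched in Python. This is a minimal illustration that assumes positive whole numbers with the last digit as the leaf; the function name `stem_and_leaf` is hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return a stem & leaf plot as lines of text, one line per stem.

    The stem is everything but the last digit; the leaf is the last digit.
    """
    leaves = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        leaves[stem].append(str(leaf))
    # Include empty stems so that gaps in the distribution stay visible.
    return [
        f"{stem} {''.join(leaves[stem])}".rstrip()
        for stem in range(min(leaves), max(leaves) + 1)
    ]

# The heaviest weights from the plot above (stems 16 to 20).
sample = [162, 162, 165, 165, 170, 190, 195, 200]
for line in stem_and_leaf(sample):
    print(line)
```

Because every leaf is kept, the original values can always be recovered from the plot, which is exactly what a grouped histogram cannot offer.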

Stem & leaf plots are especially nice for comparing distributions:

Males Stem Females

    8 10 05778
      11 0001235558
      12 001555
 5440 13 002255
   00 14 005
   00 15 5
  522 16 5
    0 17
      18
   50 19
    0 20

Terminology Related to Distributions:

Often, frequency histograms have a roughly symmetrical bell shape; such distributions are called normal or Gaussian

Example: Our height distribution

Sometimes, the bell shape is not symmetrical

The term positive skew refers to the situation where the "tail" of the distribution is to the right; negative skew is when the "tail" is to the left

Example: Pizza Data

See the text for other terminology

Notation

Variables

When we describe a set of data corresponding to the values of some variable, we will refer to that set using an uppercase letter such as X or Y

When we want to talk about specific data points within that set, we specify those points by adding a subscript to the uppercase letter like X₁

For example:

5, 8, 12, 3, 6, 8, 7

X₁, X₂, X₃, X₄, X₅, X₆, X₇

Summation

The Greek letter sigma, which looks like Σ, means "add up" or "sum" whatever follows it. Thus, $\sum X_i$ means "add up all the X_i's". If we use the X_i's from the previous example, $\sum X_i = 49$ (or just $\sum X$)

Note that sometimes the Σ has numbers above and below it. These numbers specify the range over which to sum. For example, if we again use the X_i's from the previous example, but now limit the summation to the first five values: $\sum_{i=1}^{5} X_i = 34$
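
In code, summation notation corresponds to summing a list, and the limits correspond to a slice. The sketch below uses the seven values from the example; the limited range i = 1 to 5 is the one that reproduces the total of 34, and the variable names are illustrative.

```python
X = [5, 8, 12, 3, 6, 8, 7]  # X_1 ... X_7 from the example above

total = sum(X)           # sum over every X_i
first_five = sum(X[:5])  # sum for i = 1 to 5 (Python indexes from 0)

print(total, first_five)
```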

Nasty Example:

Antic Real

Table 2.3
Student	Mark #1	Mark #2
-	X	Y

1	82	84
2	66	51
3	70	72
4	81	56
5	61	73

Double Subscripts

Things become more complicated when a capital letter (e.g., X) refers to an entire dataset rather than a single variable; multiple subscripts are then used to identify specific data points. The first subscript gives the row (here, the student) and the second gives the column (here, the week)

Table 2.4
Student	Week 1	Week 2	Week 3	Week 4	Week 5

1	7	6	4	2	2
2	3	4	4	3	4
3	3	4	5	4	6

X₂₄ = 3

$\sum X$ or $\sum \sum X_{ij} = 61$
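
Double subscripts map naturally onto a list of rows. The sketch below stores Table 2.4 that way, assuming the first subscript is the student and the second is the week; the names are illustrative.

```python
# Table 2.4: one row per student, one column per week.
X = [
    [7, 6, 4, 2, 2],  # student 1
    [3, 4, 4, 3, 4],  # student 2
    [3, 4, 5, 4, 6],  # student 3
]

# X_24 is student 2, week 4; Python counts from 0, so subtract 1 from each.
x_24 = X[2 - 1][4 - 1]

# The double sum adds every entry: sum within each row, then across rows.
grand_total = sum(sum(row) for row in X)

print(x_24, grand_total)
```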

Measures of Central Tendency

While distributions provide an overall picture of some dataset, it is sometimes desirable to represent the entire dataset using descriptive statistics

The first descriptive statistics we will discuss are those used to indicate where the centre of the distribution lies

There are, in fact, three different measures of central tendency

The first of these is called the mode

The mode is simply the value of the relevant variable that occurs most often (i.e., has the highest frequency) in the sample

Note that if you have done a frequency histogram, you can often identify the mode simply by finding the value with the highest bar

However, that will not work when grouping was performed prior to plotting the histogram (although you can still use the histogram to identify the modal group, just not the modal value).

Finding the mode:

Create a nongrouped frequency table as described previously, then identify the value with the greatest frequency

For Example: Class Height

Table 2.5
Value	Frequency	Value	Frequency

61	3	69	3
62	4	70	2
63	4	71	4
64	4	72	4
65	3	73	0
66	7	74	0
67	5	75	0
68	4	76	1
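
Finding the mode from a frequency table amounts to picking the value with the largest count. A minimal sketch using Table 2.5, with illustrative names:

```python
# Table 2.5: height -> frequency.
height_freq = {
    61: 3, 62: 4, 63: 4, 64: 4, 65: 3, 66: 7, 67: 5, 68: 4,
    69: 3, 70: 2, 71: 4, 72: 4, 73: 0, 74: 0, 75: 0, 76: 1,
}

# The mode is the value whose frequency is largest.
mode = max(height_freq, key=height_freq.get)

print(mode, height_freq[mode])
```

Note that `max` returns only the first value it meets with the top frequency, so a distribution with two equally common values would need a different approach.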

A second measure of central tendency is called the median

The median is the point corresponding to the score that lies in the middle of the distribution (i.e., there are as many data points above the median as there are below the median)

To find the median, the data points must first be sorted into either ascending or descending numerical order

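The position of the median value can then be calculated using the following formula:

$\text{median position} = \frac{N + 1}{2}$

For example:

If there is an odd number of data points:

(1, 3, 3, 4, 4, 5, 6, 7, 12)

Here N = 9, so the median position is (9 + 1)/2 = 5. The median is the item in the fifth position of the ordered dataset, therefore the median is 4

If there is an even number of data points:

The median position falls halfway between two positions, and the median is the average of the two data points on either side of it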
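
The position rule can be sketched as a small Python function. The odd-sized dataset is the one from the example; the even-sized dataset is hypothetical (the same numbers with one 4 removed) and is there only to show the averaging of the two middle values.

```python
def median(values):
    """Median via the (N + 1) / 2 position rule on the sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2  # 1-based position; ends in .5 when n is even
    if n % 2 == 1:
        return ordered[int(position) - 1]
    # Even n: average the two values on either side of the position.
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return (lower + upper) / 2

odd_data = [1, 3, 3, 4, 4, 5, 6, 7, 12]
even_data = [1, 3, 3, 4, 5, 6, 7, 12]  # hypothetical: one 4 removed

print(median(odd_data), median(even_data))
```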

Finally, the most commonly used measure of central tendency is called the mean (denoted X̄ for a sample, and μ for a population)

The mean is the same as what most of us call the average, and it is calculated in the following manner:

$\bar{X} = \frac{\sum X}{N}$

For example, given the dataset that we used to calculate the median (odd number example), the corresponding mean would be:

$\bar{X} = \frac{1 + 3 + 3 + 4 + 4 + 5 + 6 + 7 + 12}{9} = \frac{45}{9} = 5$

Similarly, the mean height of our class, as indicated by our sample (Table 2.5), is:

$\bar{X} = \frac{3198}{48} \approx 66.6$
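
Both calculations can be checked with a short Python sketch. The height figures are taken from the frequency table for class height (Table 2.5); weighting each value by its frequency is assumed to be equivalent to listing every score, and the names are illustrative.

```python
def mean(values):
    """Arithmetic mean: the sum of the scores divided by how many there are."""
    return sum(values) / len(values)

odd_data = [1, 3, 3, 4, 4, 5, 6, 7, 12]
print(mean(odd_data))

# Table 2.5 (height -> frequency); multiplying each value by its frequency
# is the same as listing every score and summing.
height_freq = {
    61: 3, 62: 4, 63: 4, 64: 4, 65: 3, 66: 7, 67: 5, 68: 4,
    69: 3, 70: 2, 71: 4, 72: 4, 73: 0, 74: 0, 75: 0, 76: 1,
}
n = sum(height_freq.values())
height_total = sum(value * count for value, count in height_freq.items())
height_mean = height_total / n

print(n, height_total, round(height_mean, 1))
```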

Mode vs. Median vs. Mean

In our height example, the mode and median were the same, and the mean was fairly close to them. This was the case because the height distribution was fairly symmetrical. However, when the underlying distribution is not symmetrical, the three measures of central tendency can be quite different

This raises the question of which measure is best

Example: Pizza Eating

Table 2.6
Value	Frequency	Value	Frequency

0	4	8	5
1	2	10	2
2	8	15	1
3	6	16	1
4	6	20	1
5	6	40	1
6	5	-	-

Mode = 2 slices per week

Median = 4 slices per week

Mean = 5.7 slices per week

Note that if you were calculating these values, you would show all your steps (it's good to be prof!)

Measures of Variability

In addition to knowing where the centre of the distribution is, it is often helpful to know the degree to which individual values cluster around the centre. This is known as variability. There are various measures of variability, the most straightforward being the range of the sample:

Highest value minus lowest value

While the range provides a good first pass at variability, it is not the best measure because of its sensitivity to extreme scores (see section in text)

The Average Deviation

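Another approach to estimating variability is to directly measure the degree to which individual data points differ from the mean and then average those deviations

That is:

$\text{average deviation} = \frac{\sum (X - \bar{X})}{N}$

However, if we try to do this with real data, the result will always be zero:

Example: (2,3,4,4,4,5,6,12)

$\bar{X} = \frac{40}{8} = 5$

$\frac{\sum (X - \bar{X})}{N} = \frac{(-3) + (-2) + (-1) + (-1) + (-1) + 0 + 1 + 7}{8} = \frac{0}{8} = 0$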
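
The cancellation is easy to see numerically. A minimal Python sketch on the example dataset, with illustrative names:

```python
data = [2, 3, 4, 4, 4, 5, 6, 12]
mean = sum(data) / len(data)

# Signed deviations from the mean: negatives and positives cancel exactly.
deviations = [x - mean for x in data]
average_deviation = sum(deviations) / len(data)

print(mean, deviations, average_deviation)
```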

The Mean Absolute Deviation (MAD)

One way to get around the problem with the average deviation is to use the absolute value of the differences, instead of the differences themselves

The absolute value of some number is just the number without any sign:

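For example, $|3| = 3$ and $|-3| = 3$

Thus, we could re-write and solve our average deviation question as follows:

$\text{MAD} = \frac{\sum |X - \bar{X}|}{N} = \frac{3 + 2 + 1 + 1 + 1 + 0 + 1 + 7}{8} = \frac{16}{8} = 2$

The dataset in question has a mean of 5 and a mean absolute deviation of 2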
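
The same calculation as a short Python sketch, using the built-in `abs` to drop the sign of each deviation (variable names are illustrative):

```python
data = [2, 3, 4, 4, 4, 5, 6, 12]
mean = sum(data) / len(data)

# abs() drops the sign, so deviations below the mean no longer cancel
# those above it.
mad = sum(abs(x - mean) for x in data) / len(data)

print(mean, mad)
```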

The Variance

Although the MAD is an acceptable measure of variability, the most commonly used measure is the variance (denoted s² for a sample and σ² for a population) and its square root, termed the standard deviation (denoted s for a sample and σ for a population)

The computation of variance is also based on the basic notion of the average deviation; however, instead of getting around the "zero problem" by using absolute deviations (as in the MAD), the "zero problem" is eliminated by squaring the differences from the mean

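Specifically:

$s^2 = \frac{\sum (X - \bar{X})^2}{N - 1}$

Example: Same old numbers

(2,3,4,4,4,5,6,12)

$s^2 = \frac{9 + 4 + 1 + 1 + 1 + 0 + 1 + 49}{8 - 1} = \frac{66}{7} \approx 9.43$

$s = \sqrt{9.43} \approx 3.07$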
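
A sketch of the definitional calculation in Python, cross-checked against the standard library's `statistics.variance` and `statistics.stdev`, which also divide by N - 1. Variable names are illustrative.

```python
import math
import statistics

data = [2, 3, 4, 4, 4, 5, 6, 12]
n = len(data)
mean = sum(data) / n

# Definitional formula: squared deviations summed, divided by N - 1.
sum_of_squares = sum((x - mean) ** 2 for x in data)
variance = sum_of_squares / (n - 1)
std_dev = math.sqrt(variance)

print(sum_of_squares, round(variance, 2), round(std_dev, 2))

# Cross-check against the standard library (sample variance and sd).
print(statistics.variance(data), statistics.stdev(data))
```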

Alternate formula for s² and s

The definitional formula of variance just presented was:

$s^2 = \frac{\sum (X - \bar{X})^2}{N - 1}$

An equivalent formula that is easier to work with when calculating variances by hand is:

$s^2 = \frac{\sum X^2 - \frac{(\sum X)^2}{N}}{N - 1}$

Although this second formula may look more intimidating, a few examples will show you that it is actually easier to work with (as you'll see in assignment 2)
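
To see that the two formulas agree, the sketch below applies both to the same old numbers. It assumes the computational form in which the sum of squared scores has the squared sum divided by N subtracted from it, all over N - 1; names are illustrative.

```python
data = [2, 3, 4, 4, 4, 5, 6, 12]
n = len(data)

sum_x = sum(data)                         # sum of X   = 40
sum_x_squared = sum(x * x for x in data)  # sum of X^2 = 266

# Computational formula: no need to subtract the mean from every score.
variance = (sum_x_squared - sum_x ** 2 / n) / (n - 1)

# Definitional formula for comparison.
mean = sum_x / n
definitional = sum((x - mean) ** 2 for x in data) / (n - 1)

print(sum_x, sum_x_squared, variance, definitional)
```

Only two running totals are needed, which is why this form is convenient for hand calculation.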

Estimating Population Parameters

So, the mean (X̄) and variance (s²) are the descriptive statistics that are most commonly used to represent the data points of some sample

The real reason that they are the preferred measures of central tendency and variability is that they have certain properties as estimators of their corresponding population parameters, μ and σ²

Four properties are considered desirable in a population estimator: sufficiency, unbiasedness, efficiency, and resistance

Both the mean and the variance are the best estimators in their class in terms of the first three of these four properties

Sufficiency

A sufficient statistic is one that makes use of all of the information in the sample to estimate its corresponding parameter

Unbiasedness

A statistic is said to be an unbiased estimator if its expected value (i.e., the mean of the statistic across a large number of samples) is equal to the population parameter

- Explanation of N-1 in s² formula

Efficiency

The efficiency of a statistic is reflected in how much it varies from sample to sample: the smaller the variance of the statistic across a set of independently chosen samples, the more efficient it is

Assessing the Bias of an Estimator

The bias of a statistic as an estimator of some population parameter can be assessed by

defining some population with measurable parameters
taking a number of independent samples from that population
calculating the relevant statistic for each sample
averaging that statistic across the samples
comparing the "average of the sample statistics" with the population parameter

Using this procedure, the mean can be shown to be an unbiased estimator (see p. 47)

However, if the more intuitive formula for s² is used:

$s^2 = \frac{\sum (X - \bar{X})^2}{N}$

it turns out to underestimate σ²

This bias to underestimate is caused by the act of sampling, and it can be shown that this bias is eliminated if N-1 is used in the denominator instead of N

Note that this is only true when calculating s²; if you have a measurable population and you want to calculate σ², you use N in the denominator, not N-1
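
The bias-assessment procedure can be tried out in Python. The sketch below departs from the procedure in one respect: instead of drawing a number of random samples, it enumerates every possible sample of size 3 (drawn with replacement), so the averages are exact rather than approximate. The population (the six faces of a die), the sample size, and all names are hypothetical choices made for illustration.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3, 4, 5, 6]  # hypothetical population: faces of a die
SAMPLE_SIZE = 3

def mean(values):
    return Fraction(sum(values), len(values))

def variance(values, denominator):
    m = mean(values)
    return sum((x - m) ** 2 for x in values) / denominator

# Population parameters (N in the denominator for a full population).
mu = mean(population)
sigma_squared = variance(population, len(population))

# Every possible sample of size 3 drawn with replacement (6**3 = 216).
samples = list(product(population, repeat=SAMPLE_SIZE))

# Average each statistic across all samples.
avg_sample_mean = mean([mean(s) for s in samples])
avg_var_n = mean([variance(s, SAMPLE_SIZE) for s in samples])
avg_var_n_minus_1 = mean([variance(s, SAMPLE_SIZE - 1) for s in samples])

print("mu =", mu, " average sample mean =", avg_sample_mean)
print("sigma^2 =", sigma_squared)
print("average s^2 with N:  ", avg_var_n)
print("average s^2 with N-1:", avg_var_n_minus_1)
```

Using `Fraction` keeps the arithmetic exact, so the comparison between the averaged statistics and the population parameters is not blurred by rounding.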

Degrees of Freedom

The mean of 6, 8, & 10 is 8

If I allow you to change as many of these numbers as you want BUT the mean must stay 8, how many of the numbers are you free to vary?

The point of this exercise is that when the mean is fixed, only two of the three numbers are free to vary; once two have been chosen, the third is determined. Fixing the mean removes a degree of freedom from your sample, which is like subtracting 1 from the number of observations

It is for exactly this reason that we use N-1 in the denominator when we calculate s² (i.e., the calculation requires that the mean be fixed first, which effectively removes, or fixes, one of the data points)
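
The exercise can be made concrete with a tiny Python sketch: pick any two numbers freely, and the third is forced by the requirement that the mean stay at 8. The function name `third_value` is hypothetical.

```python
REQUIRED_MEAN = 8
N = 3

def third_value(first, second):
    """Given two freely chosen values, return the only third value that
    keeps the mean of the three numbers at REQUIRED_MEAN."""
    return REQUIRED_MEAN * N - first - second

print(third_value(6, 8))
print(third_value(1, 100))
```

However extreme the first two choices are, there is exactly one admissible third value, which is the sense in which one degree of freedom has been used up.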

Resistance

The resistance of an estimator refers to the degree to which the estimate is affected by extreme values

As mentioned previously, both X̄ and s² are highly sensitive to extreme values

Despite this, they are still the most commonly used estimates of the corresponding population parameters, mostly because of their superiority over other measures in terms of sufficiency, unbiasedness, and efficiency
