Chapter 2: Describing and Exploring Data


Once data have been collected, the raw numbers must be organized or summarized in some way to make them more informative. Several options are available, including plotting the data and calculating descriptive statistics.

Plotting Data


Example: Your age as estimated by the questionnaire from the first class

TABLE 2.1

Age   Frequency
18     3
19    10
20    14
21    10
22     5
23     2
24     1
25     1
26     2
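As a rough sketch, the frequencies in Table 2.1 can be turned into a text histogram with a few lines of Python. The names `age_freq` and `text_histogram`, and the choice of `*` as the bar character, are illustrative assumptions and not part of the original notes.

```python
# Frequencies copied from Table 2.1 (age -> number of students).
age_freq = {18: 3, 19: 10, 20: 14, 21: 10, 22: 5, 23: 2, 24: 1, 25: 1, 26: 2}

def text_histogram(freq):
    """Return one line per value, with a bar whose length equals the frequency."""
    return [f"{value:>3} | {'*' * count} ({count})"
            for value, count in sorted(freq.items())]

n_students = sum(age_freq.values())   # total number of observations
lines = text_histogram(age_freq)
print("\n".join(lines))
print("N =", n_students)
```

The longest bar marks the most frequent age, which is the same information a drawn histogram conveys with bar heights.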


Histogram



Grouping Data:

Example: Binning our weight variable

TABLE 2.2

Weight Bin   Midpoint   Frequency
100-109      104.5       6
110-119      114.5      10
120-129      124.5       6
130-139      134.5      10
140-149      144.5       5
150-159      154.5       3
160-169      164.5       4
170-179      174.5       1
180-189      184.5       0
190-199      194.5       2
200-209      204.5       1
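The binning in Table 2.2 can be sketched in Python as follows. The raw weights are read off the stem-and-leaf display given later in these notes; the function name `bin_weights` and the constant `BIN_WIDTH` are assumptions made for illustration.

```python
from collections import Counter

# Weights read off the stem-and-leaf display in these notes
# (stem = hundreds and tens, leaf = units).
weights = [
    100, 105, 107, 107, 108, 108,
    110, 110, 110, 111, 112, 113, 115, 115, 115, 118,
    120, 120, 121, 125, 125, 125,
    130, 130, 130, 132, 132, 134, 134, 135, 135, 135,
    140, 140, 140, 140, 145,
    150, 150, 155,
    162, 162, 165, 165,
    170,
    190, 195,
    200,
]

BIN_WIDTH = 10  # each bin covers ten whole units, e.g. 100-109

def bin_weights(values, width=BIN_WIDTH):
    """Count values per bin; return rows of (low, high, midpoint, frequency)."""
    counts = Counter((v // width) * width for v in values)
    rows = []
    # Walk every bin between the lowest and highest, so empty bins are kept.
    for low in range(min(counts), max(counts) + width, width):
        high = low + width - 1
        rows.append((low, high, (low + high) / 2, counts.get(low, 0)))
    return rows

table = bin_weights(weights)
for low, high, mid, freq in table:
    print(f"{low}-{high}  {mid:6.1f}  {freq:2d}")
```

Keeping the empty 180-189 bin matters: dropping it would hide the gap in the distribution when the histogram is drawn.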


Histogram



Stem & Leaf

      10 057788
      11 0001235558
      12 001555
      13 0002244555
      14 00005
      15 005
      16 2255
      17 0
      18
      19 05
      20 0
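A minimal sketch of how such a display could be built: each value is split into a stem (everything but the last digit) and a leaf (the last digit). Here the display above is first decoded back into raw values and then rebuilt as a check. The names `display` and `stem_and_leaf` are hypothetical.

```python
# The display above, stored as stem -> string of leaves.
display = {
    10: "057788", 11: "0001235558", 12: "001555", 13: "0002244555",
    14: "00005", 15: "005", 16: "2255", 17: "0", 18: "", 19: "05", 20: "0",
}

# Decode: stem 13 with leaf 4 is the raw value 134.
weights = [stem * 10 + int(leaf)
           for stem, leaves in display.items() for leaf in leaves]

def stem_and_leaf(values):
    """Return {stem: leaves}, keeping empty stems between the min and max."""
    stems = {s: "" for s in range(min(values) // 10, max(values) // 10 + 1)}
    for v in sorted(values):
        stems[v // 10] += str(v % 10)   # append the unit digit as a leaf
    return stems

rebuilt = stem_and_leaf(weights)
for stem, leaves in rebuilt.items():
    print(f"{stem:>3} {leaves}")
```

Unlike a histogram, the display keeps every raw value recoverable, which is why the decoding step above is possible at all.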

Stem & leaf plots are especially nice for comparing distributions:


 Males  Stem  Females

     8   10   05778
         11   0001235558
         12   001555
  5440   13   002255
    00   14   005
    00   15   5
   522   16   5
     0   17
         18
    50   19
     0   20


Terminology Related to Distributions:

Notation

Variables

  5,   8,  12,   3,   6,   8,   7
X_1, X_2, X_3, X_4, X_5, X_6, X_7

Summation

Nasty Example:

TABLE 2.3

Student   Mark #1 (X)   Mark #2 (Y)
1         82            84
2         66            51
3         70            72
4         81            56
5         61            73
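Summation notation is easy to misread because the order of operations matters. The sketch below evaluates several common expressions on the marks in Table 2.3. Which expressions the original example worked through is not recoverable from these notes, so the selection here is an assumption.

```python
# Marks copied from Table 2.3.
X = [82, 66, 70, 81, 61]   # Mark #1
Y = [84, 51, 72, 56, 73]   # Mark #2

sum_x = sum(X)                               # sum of X
sum_y = sum(Y)                               # sum of Y
sum_x_squared = sum(x ** 2 for x in X)       # sum of X^2: square first, then add
squared_sum_x = sum(X) ** 2                  # (sum of X)^2: add first, then square
sum_xy = sum(x * y for x, y in zip(X, Y))    # sum of XY: multiply pairs, then add
sum_x_sum_y = sum_x * sum_y                  # (sum of X)(sum of Y): add, then multiply

print(sum_x, sum_y, sum_x_squared, squared_sum_x, sum_xy, sum_x_sum_y)
```

Note that the sum of squares (26262) and the squared sum (129600) differ greatly, as do the sum of products and the product of sums.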



Double Subscripts

TABLE 2.4

Student   Week 1   Week 2   Week 3   Week 4   Week 5
1         7        6        4        2        2
2         3        4        4        3        4
3         3        4        5        4        6
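With double subscripts, X_ij conventionally denotes the entry for row i and column j; the sketch below assumes i indexes students and j indexes weeks in Table 2.4. The variable name `scores` and the helper `x` are hypothetical, since the notes do not say what was measured.

```python
# Rows are students (i = 1..3); columns are weeks (j = 1..5), as in Table 2.4.
scores = [
    [7, 6, 4, 2, 2],
    [3, 4, 4, 3, 4],
    [3, 4, 5, 4, 6],
]

def x(i, j):
    """Return X_ij using 1-based subscripts (Python lists are 0-based)."""
    return scores[i - 1][j - 1]

row_totals = [sum(row) for row in scores]         # sum over j, one total per student
col_totals = [sum(col) for col in zip(*scores)]   # sum over i, one total per week
grand_total = sum(row_totals)                     # double sum over i and j

print(x(2, 3), row_totals, col_totals, grand_total)
```

Summing the row totals and summing the column totals give the same grand total, which is a quick check on any double summation.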

Measures of Central Tendency

Finding the mode:

TABLE 2.5

Value   Frequency   Value   Frequency
61      3           69      3
62      4           70      2
63      4           71      4
64      4           72      4
65      3           73      0
66      7           74      0
67      5           75      0
68      4           76      1
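The mode is simply the value with the highest frequency, so it can be read directly from a frequency table. A minimal sketch using the values and frequencies tabulated above; the names `value_freq` and `modes` are illustrative.

```python
# Value -> frequency, copied from the table above.
value_freq = {61: 3, 62: 4, 63: 4, 64: 4, 65: 3, 66: 7, 67: 5, 68: 4,
              69: 3, 70: 2, 71: 4, 72: 4, 73: 0, 74: 0, 75: 0, 76: 1}

def modes(freq):
    """Return every value tied for the highest frequency (usually just one)."""
    top = max(freq.values())
    return [value for value, count in sorted(freq.items()) if count == top]

print(modes(value_freq))
```

Returning a list rather than a single number allows for distributions with two or more modes.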

For example:

(1, 3, 3, 4, 4, 5, 6, 7, 12)
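As a sketch, all three measures of central tendency can be computed for this small data set with Python's standard `statistics` module. Which measure the original example illustrated is not clear from the notes, so computing all three is an assumption.

```python
import statistics

data = [1, 3, 3, 4, 4, 5, 6, 7, 12]

mean = statistics.mean(data)         # sum of the values (45) divided by N (9)
median = statistics.median(data)     # middle (5th) value of the 9 sorted values
modes = statistics.multimode(data)   # every value tied for most frequent

print(mean, median, modes)
```

Here 3 and 4 each occur twice, so the data have two modes, while the single large value (12) pulls the mean above the median.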



Example: Pizza Eating

TABLE 2.6

Value   Frequency   Value   Frequency
0       4           8       5
1       2           10      2
2       8           15      1
3       6           16      1
4       6           20      1
5       6           40      1
6       5
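The pizza data are strongly skewed by a few large values, which makes them a useful test of how the three measures of central tendency behave. The sketch below expands the frequency table into individual observations; the names `pizza_freq` and `observations` are hypothetical, and dropping the largest value is an added illustration, not part of the original example.

```python
import statistics

# Value -> frequency, copied from the pizza-eating table.
pizza_freq = {0: 4, 1: 2, 2: 8, 3: 6, 4: 6, 5: 6, 6: 5,
              8: 5, 10: 2, 15: 1, 16: 1, 20: 1, 40: 1}

# Expand the frequency table into one entry per respondent.
observations = [value for value, count in pizza_freq.items()
                for _ in range(count)]

mean = statistics.mean(observations)
median = statistics.median(observations)
mode = statistics.mode(observations)

# Illustration: drop the single largest value (40) and recompute.
trimmed = sorted(observations)[:-1]
mean_without_max = statistics.mean(trimmed)
median_without_max = statistics.median(trimmed)

print(round(mean, 2), median, mode)
print(round(mean_without_max, 2), median_without_max)
```

Removing one observation shifts the mean noticeably but leaves the median unchanged, which is the sense in which the median resists extreme scores.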

Measures of Variability

The Average Deviation

That is:

Example: (2,3,4,4,4,5,6,12)
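A short sketch, assuming the average deviation is defined as the sum of the signed deviations from the mean divided by N, shows why it is not a useful measure of variability: the positive and negative deviations always cancel.

```python
data = [2, 3, 4, 4, 4, 5, 6, 12]

mean = sum(data) / len(data)                      # 40 / 8
deviations = [x - mean for x in data]             # signed distance from the mean
average_deviation = sum(deviations) / len(data)   # positives and negatives cancel

print(mean, deviations, average_deviation)
```

The result is zero for any data set, regardless of how spread out the values are.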



The Mean Absolute Deviation (MAD)


For example:
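Taking absolute values before averaging stops the deviations from cancelling. A minimal sketch, assuming the mean absolute deviation is the sum of the absolute deviations from the mean divided by N, and reusing the data set (2,3,4,4,4,5,6,12) from the previous example:

```python
data = [2, 3, 4, 4, 4, 5, 6, 12]

mean = sum(data) / len(data)                         # 40 / 8
abs_deviations = [abs(x - mean) for x in data]       # unsigned distance from the mean
mad = sum(abs_deviations) / len(data)                # 16 / 8

print(abs_deviations, mad)
```

On average, then, a score in this set lies 2 units away from the mean.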


The Variance

Specifically:



Example: Same old numbers

(2,3,4,4,4,5,6,12)
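The calculation can be sketched as follows, assuming the sample variance is defined as the sum of squared deviations from the mean divided by N-1 and the standard deviation as its square root. The result is cross-checked against `statistics.variance`, which uses the same N-1 definition.

```python
import math
import statistics

data = [2, 3, 4, 4, 4, 5, 6, 12]
n = len(data)

mean = sum(data) / n                                # 40 / 8
sum_sq_dev = sum((x - mean) ** 2 for x in data)     # squared deviations, summed
s2 = sum_sq_dev / (n - 1)                           # sample variance
s = math.sqrt(s2)                                   # sample standard deviation

print(sum_sq_dev, round(s2, 4), round(s, 4))
print(math.isclose(s2, statistics.variance(data)))
```

Squaring gives the single large deviation (12 - 5 = 7) far more weight than the small ones: it contributes 49 of the 66 units in the sum of squares.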







Alternate formula for s^2 and s
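The formula itself did not survive in these notes; the sketch below assumes it is the usual computational form, in which the sum of squared deviations is obtained as the sum of X^2 minus (sum of X)^2 / N, so the mean never has to be subtracted from each score. It is checked against the definitional form on the same numbers.

```python
import math

data = [2, 3, 4, 4, 4, 5, 6, 12]
n = len(data)

sum_x = sum(data)                       # sum of X
sum_x2 = sum(x * x for x in data)       # sum of X^2

# Computational form: no deviations from the mean are needed.
s2_computational = (sum_x2 - sum_x ** 2 / n) / (n - 1)

# Definitional form, for comparison.
mean = sum_x / n
s2_definitional = sum((x - mean) ** 2 for x in data) / (n - 1)

s = math.sqrt(s2_computational)
print(sum_x, sum_x2, round(s2_computational, 4), round(s, 4))
```

Both forms give the same variance; the computational one is mainly a convenience for hand calculation.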



Estimating Population Parameters

Sufficiency

Unbiasedness

Efficiency

Assessing the Bias of an Estimator



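When the sample variance is computed with N in the denominator, it turns out to underestimate the population variance.

This bias toward underestimation arises from sampling: deviations are taken from the sample mean, which is never farther from the sample's own scores (in squared distance) than the population mean is. It can be shown that the bias is eliminated if N-1 is used in the denominator instead of N.

Note that this applies only when calculating s^2 as an estimate from a sample. If you can measure the entire population and want to calculate σ^2, use N in the denominator, not N-1.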
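The bias can be demonstrated exactly on a small scale. The sketch below treats the eight numbers used earlier as a complete population (an assumption made purely for illustration), enumerates every possible ordered sample of size 2 drawn with replacement, and averages the two versions of the variance over all samples. Exact fractions are used to avoid rounding error.

```python
from fractions import Fraction
from itertools import product

population = [2, 3, 4, 4, 4, 5, 6, 12]   # treated here as a complete population

def variance(values, denominator):
    """Sum of squared deviations from the mean, divided by the given denominator."""
    mean = Fraction(sum(values), len(values))
    return sum((x - mean) ** 2 for x in values) / denominator

# Population variance: N in the denominator.
sigma2 = variance(population, len(population))

n = 2
samples = list(product(population, repeat=n))   # all 64 ordered samples

# Average each estimator over every possible sample.
mean_biased = sum(variance(s, n) for s in samples) / len(samples)
mean_unbiased = sum(variance(s, n - 1) for s in samples) / len(samples)

print(float(sigma2), float(mean_biased), float(mean_unbiased))
```

Averaged over all samples, the N-1 version equals the population variance exactly, while the N version comes out too small by a factor of (N-1)/N, which is one half when N = 2.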



Degrees of Freedom

Resistance