Analysis of body heights
The problem in a nutshell:
given two data sets of German female body heights containing N=300 and N=3000 entries
histogram and Q-Q plot are consistent with a normal distribution for N=300, and similarly for N=3000
test for normality with the standard χ² goodness-of-fit test
for concreteness: with Matlab's chi2gof
results for N = 300
p-value = 0.5884
→ consistent with normal distribution
results for N = 3000
p-value = 6.0958e-10
→ normal distribution is highly unlikely!
the usual rules of thumb are fulfilled in both cases
all O_i ≥ 1
at least 80% of the O_i > 5
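For reference, the statistic behind the test can be written out. This is the textbook form, not necessarily the exact variant chi2gof evaluates. The symbols E_i (expected count in bin i), k (number of bins), b_i (bin edges) and m (number of parameters estimated from the data) are assumptions introduced here; O_i is taken to be the observed count in bin i and Φ is the standard normal distribution function.

```latex
\chi^{2} = \sum_{i=1}^{k} \frac{(O_i - E_i)^{2}}{E_i},
\qquad
E_i = N \left[ \Phi\!\left(\frac{b_i - \mu}{\sigma}\right)
             - \Phi\!\left(\frac{b_{i-1} - \mu}{\sigma}\right) \right]
```

Under the null hypothesis the statistic is approximately χ²-distributed with k − 1 − m degrees of freedom (m = 2 if μ and σ are both estimated from the data), and the p-value is the upper tail probability of that distribution.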
The explanation:
numerical experiment
create your own data sets with N data points
generate values with standard random number generator using normal distribution (μ = 165, σ = 6.9)
round the values to centimeters
results of the standard χ² analysis
p-value (300) = 0.470
p-value (3000) = 0.266e-3
obviously the rounding is the problem
measured (or rounded) values come from a discrete distribution N_g
N_g is the "rounded version" of the normal distribution N
using suitable units → measured values are integers
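The rounded distribution N_g can be made concrete with a small Python sketch (standard library only). It assumes rounding to the nearest integer, so that the probability of the value k is the normal probability of the interval from k − 0.5 to k + 0.5. The function names rounded_pmf and rounded_sample are illustrative, not taken from the original.

```python
import random
from statistics import NormalDist

MU, SIGMA = 165.0, 6.9  # parameters of the numerical experiment


def rounded_pmf(k, mu=MU, sigma=SIGMA):
    """P(N_g = k): normal probability of the interval from k - 0.5 to k + 0.5."""
    normal = NormalDist(mu, sigma)
    return normal.cdf(k + 0.5) - normal.cdf(k - 0.5)


def rounded_sample(n, mu=MU, sigma=SIGMA, seed=None):
    """Draw n normally distributed values and round them to whole centimeters."""
    rng = random.Random(seed)
    return [round(rng.gauss(mu, sigma)) for _ in range(n)]


if __name__ == "__main__":
    sample = rounded_sample(3000, seed=0)
    for k in (158, 165, 172):
        observed = sample.count(k) / len(sample)
        print(f"{k} cm: observed {observed:.4f}, N_g predicts {rounded_pmf(k):.4f}")
```

The relative frequencies of a rounded sample can be compared directly with rounded_pmf, which is the comparison a test against N_g instead of N would have to make.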
Studying behaviour with varying N:
create rounded values as before
use chi2gof to compute p-values
semi-logarithmic plot, using mean values of three runs each
findings
p-value falls drastically (exponentially) with rising N
but: large fluctuations between individual runs, e.g.
N       p-values (three runs)             mean
1000    1.67e-01 / 6.48e-01 / 4.88e-03    2.73e-01
3000    4.30e-05 / 7.40e-03 / 5.36e-04    2.66e-03
strange behaviour at N=30000
needs further analysis (cf. below)
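The experiment with varying N can be sketched in plain Python. This is a rough stand-in under stated assumptions and does not reproduce chi2gof's binning, so its p-values will differ from those quoted above. Assumptions: 11 equiprobable bins under the fitted normal, μ and σ estimated from the sample, hence 11 − 1 − 2 = 8 degrees of freedom. An even number of degrees of freedom lets the upper tail probability be evaluated in closed form. All function names are illustrative.

```python
import math
import random
from statistics import NormalDist, fmean, stdev

N_BINS = 11  # assumption: 11 equiprobable bins, so dof = 11 - 1 - 2 = 8 (even)


def chi2_sf_even(x, dof):
    """Upper tail probability of a chi-square variable; valid for even dof only."""
    if dof <= 0 or dof % 2:
        raise ValueError("closed form requires a positive even dof")
    half = x / 2.0
    term, total = 1.0, 1.0
    for j in range(1, dof // 2):
        term *= half / j  # term is now half**j / j!
        total += term
    return math.exp(-half) * total


def chi2_normality_pvalue(data, n_bins=N_BINS):
    """Chi-square goodness-of-fit p-value against a normal fitted to the data."""
    fitted = NormalDist(fmean(data), stdev(data))
    # Interior bin edges at the quantiles i / n_bins of the fitted normal.
    edges = [fitted.inv_cdf(i / n_bins) for i in range(1, n_bins)]
    observed = [0] * n_bins
    for value in data:
        observed[sum(value > edge for edge in edges)] += 1
    expected = len(data) / n_bins  # the same for every bin by construction
    statistic = sum((o - expected) ** 2 / expected for o in observed)
    return chi2_sf_even(statistic, n_bins - 3)  # two fitted parameters


def rounded_sample(n, rng, mu=165.0, sigma=6.9):
    """Normally distributed values rounded to whole centimeters."""
    return [round(rng.gauss(mu, sigma)) for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)
    for n in (300, 1000, 3000, 10000, 30000):
        runs = [chi2_normality_pvalue(rounded_sample(n, rng)) for _ in range(3)]
        print(n, " / ".join(f"{p:.2e}" for p in runs), f"mean {fmean(runs):.2e}")
```

Changing the seed shows how strongly individual runs scatter. Changing N_BINS (keeping N_BINS − 3 even) shows how much the outcome depends on the binning.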
Use the Kolmogorov-Smirnov test for comparison:
reminder
directly compares empirical and given distribution functions
unusual but well-known test statistic
has no additional parameters
result of tests
much smaller fluctuations of individual p-values
p-value decreases very fast (exponentially) with rising N
KS test discriminates better between continuous and rounded values
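A minimal Python sketch of the KS comparison follows. It is not the Matlab routine. Assumptions: the reference distribution is the normal with the known parameters μ = 165 and σ = 6.9, and the p-value comes from the asymptotic Kolmogorov series, which is only an approximation for finite N and is not strictly justified for data with ties. The names ks_statistic and ks_pvalue are illustrative.

```python
import math
import random
from statistics import NormalDist


def ks_statistic(data, cdf):
    """Largest vertical distance between the empirical CDF of data and cdf."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF steps from i/n to (i + 1)/n at x; check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def ks_pvalue(d, n, terms=100):
    """Asymptotic p-value from the Kolmogorov distribution (large-n approximation)."""
    lam = math.sqrt(n) * d
    if lam < 0.2:
        return 1.0  # the tail probability is numerically indistinguishable from 1 here
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))


if __name__ == "__main__":
    normal = NormalDist(165.0, 6.9)
    rng = random.Random(0)
    for n in (300, 3000, 30000):
        exact = [rng.gauss(165.0, 6.9) for _ in range(n)]
        rounded = [round(v) for v in exact]
        p_exact = ks_pvalue(ks_statistic(exact, normal.cdf), n)
        p_rounded = ks_pvalue(ks_statistic(rounded, normal.cdf), n)
        print(f"N={n}: exact {p_exact:.2e}, rounded {p_rounded:.2e}")
```

Running the same draws once unrounded and once rounded isolates the effect of rounding from ordinary sampling noise.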
What exactly is our question?
every measurement has a certain precision → the rounding effect is inevitable!
but
we still assume that real body heights are normally distributed
we are not interested in the rounding problem
we want to test for the real (exact) distribution
how?