Previous: 5 ESTIMATING THE VARIATIONS Up: SAMPLING DISTRIBUTION OF THE Next: 7 CONCLUSION


6 USEFUL AND USELESS STATISTICS


1 Uniform Distribution as an Example

When $ \xi $ is uniform over $ \left[-10,\,+10\right]$, then $ n\times var_{\Phi }\left(m_{2}\right)=8000/9+20000/9/\left(n-1\right)$. An unbiased statistic for this quantity is $ n\,V$, where $ V$ is the statistic given in 5.6. In order to estimate the quality of this statistic, we have simulated four sets of $ N=200000$ samples, using respectively $ n=5,\,8,\,12,\,50$, and plotted the results in Figure 7. In all cases the average of $ n\,V$ is as expected (the dashed line). But only the greatest value of $ n$ gives a well-shaped curve. For smaller values of $ n$ the distribution is strongly skewed, and for very small $ n$ a noticeable part of the experimental values of $ V$ is negative ($ 20\%$ in Figure 7(a), where $ n=5$).
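The closed form quoted above can be checked with exact rational arithmetic. The following is a minimal sketch, assuming the standard expression $ n\times var\left(m_{2}\right)=\mu_{4}-\mu_{2}^{2}\left(n-3\right)/\left(n-1\right)$ for the unbiased sample variance and the central moments $ \mu_{2}=a^{2}/3$, $ \mu_{4}=a^{4}/5$ of a uniform law on $ \left[-a,\,+a\right]$; the function name `n_var_m2` is only illustrative.

```python
from fractions import Fraction


def n_var_m2(n, mu2, mu4):
    """n * var(m2) for the unbiased sample variance m2 of n i.i.d. draws.

    Assumes the standard result var(m2) = mu4/n - (n-3)/(n(n-1)) * mu2^2.
    """
    return mu4 - Fraction(n - 3, n - 1) * mu2 ** 2


# Central moments of the uniform law on [-a, +a]: mu2 = a^2/3, mu4 = a^4/5.
a = 10
mu2 = Fraction(a ** 2, 3)
mu4 = Fraction(a ** 4, 5)

for n in (5, 8, 12, 50):
    exact = n_var_m2(n, mu2, mu4)
    quoted = Fraction(8000, 9) + Fraction(20000, 9) / (n - 1)
    # The last column is True when the general formula agrees with the text.
    print(n, exact, round(float(exact), 2), exact == quoted)
```

The target value decreases from about 1444 at $ n=5$ towards $ 8000/9\approx 889$ as $ n$ grows, which is the level the dashed line is expected to mark in each panel.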

Figure 7: Experimental distribution of $ n\,V$ (when $ \xi $ is uniformly distributed). Panels: (a) $ n=5$, (b) $ n=8$, (c) $ n=12$, (d) $ n=50$.

2 Usefulness of a Statistic

The situation described in Subsection 6.1 shows that an "unbiased statistic" can be absolutely useless when dealing with small samples. In order to explore this question, we have to specify a border value beyond which noise will be considered louder than signal.

Definition 6.1   A (positive) statistic $ \alpha$ is useless (resp. useful) when its coefficient of variation is known to be greater (resp. lower) than $ 1/3$.

The idea behind this definition is related to the notion of probable error. The $ PE$ is a deviation from the mean such that 50% of the population may be expected to lie between $ \mu -PE$ and $ \mu +PE$. This $ PE$ gives a rough perception of what happens, and leads to the following rule of thumb: below $ PE$, don't discuss; above $ PE$, begin to discuss.

In order to provide a similar criterion when the $ pdf$ is not easy to obtain, we have to select a threshold value for the coefficient of variation. Our choice of $ 1/3$ is based on the following reasoning. Probability distributions can be built such that almost all of the population lies inside the "one sigma" range. But, outside the classroom, these distributions describe situations where rare events are a dominant feature, so that mean values no longer have a clear factual meaning.

In the other situations, a large part of the population lies outside the "one sigma" range. With our choice of factor $ 1/3$, this means that an important part of the population lies outside $ \left[2\bar{\alpha}/3,\,4\bar{\alpha}/3\right]$ when the nominal value is $ \bar{\alpha}$. Our feeling is that a statistic that is not known better than this should be discarded in any situation. Obviously, another choice of the factor, or an asymmetric interval (to take into account the unavoidable skewness of a positive variable), would be possible. But this would not change the main lines of the argument.

Theorem 6.2   Depending upon the distribution of $ \xi\in\Omega $, the following statistics are useless when the sample size $ n$ is below the following values:

pdf $ m_{2}$ spec $ m_{2}^{2}$ $ m_{4}$ $ V$ spec
uniform

These values follow from the variances obtained in Subsection 5.2. Let us consider the Gaussian distribution. When $ n=19$, the statistic $ \alpha=18\,m_{2}/\mu_{2}$ is a $ \chi_{18}^{2}$ random variable and $ \bar{\alpha}\doteq E\left(\alpha\right)=18$. The one-sigma range around $ \bar{\alpha}$ is $ \left[18\pm\sqrt{36}\right]=\left[12,\,24\right]$, and therefore the diameter of the one-sigma range for $ \alpha$ is equal to $ 2\,\bar{\alpha}/3$. In this special case, the probability that $ \alpha$ falls outside this range is easy to compute and amounts to $ 31\%$.
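The $ 31\%$ figure can be reproduced without any statistical library. The sketch below assumes only the closed form of the chi-square survival function for an even number of degrees of freedom, $ P\left(X>x\right)=e^{-x/2}\sum_{j<\nu/2}\left(x/2\right)^{j}/j!$; the helper name `chi2_sf_even` is an illustrative choice.

```python
import math


def chi2_sf_even(x, dof):
    """P(X > x) for a chi-square variable with an even number of degrees of freedom."""
    if dof % 2 != 0:
        raise ValueError("closed form used here requires an even dof")
    half = x / 2.0
    # Poisson-type finite sum with dof/2 terms.
    return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(dof // 2))


dof = 18                                  # n = 19 Gaussian observations
mean, sigma = dof, math.sqrt(2 * dof)     # E = 18, sigma = 6
low, high = mean - sigma, mean + sigma    # one-sigma range [12, 24]

p_below = 1.0 - chi2_sf_even(low, dof)
p_above = chi2_sf_even(high, dof)
p_outside = p_below + p_above
print(low, high, round(p_below, 3), round(p_above, 3), round(p_outside, 3))
```

Both tails contribute roughly equally, and their sum is close to the $ 31\%$ stated above.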

It should be noted that $ s^{4}$ can be "useless" even if $ s^{2}$ is "useful". This is partly related to $ \mathrm{d}\left(x^{2}\right)/x^{2}=2\,\mathrm{d}x/x$ and partly related to the statistical nature of the quantities involved. Moreover, using explicitly that a variable is Gaussian results in $ \mu_{4}=3\mu_{2}^{2}$, so that a useful statistic for $ var_{\Phi }\left(m_{2}\right)$ is obtained as soon as $ n>74$, instead of the $ n>128$ obtained by ignoring this relation.
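A small exact computation illustrates how squaring a statistic moves the border. It is only a sketch under stated assumptions: the population is Gaussian, so that $ \left(n-1\right)m_{2}/\mu_{2}$ is chi-square with $ \nu=n-1$ degrees of freedom, and the statistic built from $ \mu_{4}=3\mu_{2}^{2}$ is taken to be proportional to $ m_{2}^{2}$ (an assumption, since the statistic is not written out here). The raw moments $ E\left(X^{k}\right)=\nu\left(\nu+2\right)\cdots\left(\nu+2k-2\right)$ of a chi-square variable are used; the function names are hypothetical.

```python
from fractions import Fraction


def chi2_raw_moment(k, nu):
    """E(X^k) for X chi-square with nu degrees of freedom."""
    result = 1
    for j in range(k):
        result *= nu + 2 * j
    return result


def cv2_of_power(p, n):
    """Squared coefficient of variation of m2^p for a Gaussian sample of size n."""
    nu = n - 1
    return Fraction(chi2_raw_moment(2 * p, nu), chi2_raw_moment(p, nu) ** 2) - 1


def first_useful_n(p, threshold=Fraction(1, 9)):
    """Smallest n whose coefficient of variation is strictly below 1/3."""
    n = 2
    while cv2_of_power(p, n) >= threshold:
        n += 1
    return n


print("m2   :", first_useful_n(1))   # border of the variance estimate itself
print("m2^2 :", first_useful_n(2))   # border after squaring
```

Under these assumptions the coefficient of variation of $ m_{2}$ equals exactly $ 1/3$ at $ n=19$, while that of $ m_{2}^{2}$ only drops below $ 1/3$ near $ n=74$, in line with the values quoted above.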

For a chi-square distribution, a similar situation occurs. The only significant change is that the border values increase.

3 Exponential Distribution

We will now examine in detail what happens when $ \xi\in\Omega $ is known to be exponentially distributed. This is a very strong hypothesis, since it asserts that only one parameter is required to specify the population. If we are really sure of the validity of this hypothesis, we can lower the border of usability by a huge factor.

The statistic $ m$ is "useful" for estimating $ \mu $ as soon as $ n\geq3$. Estimating $ \mu_{2}$ by $ m_{2}$ would be unwise, since a better statistic can be obtained via $ m^{2}$. It can be seen that:

$\displaystyle E_{\Phi }\left(m^{k+1}\right)=\frac{1}{\lambda^{k+1}}\,\frac{\left(n+k\right)!}{n!\,n^{k}}$

so that $ \alpha=m^{2}\,n/\left(n+1\right)$ is an unbiased statistic for $ \mu_{2}$, while $ var\left(\alpha\right)=2\left(2n+3\right)/n/\left(n+1\right)/\lambda^{4}$. Therefore $ \alpha$ is a "useful" statistic for $ \mu_{2}$ when $ n>36$ (column spec in the table). The same argument holds for $ var_{\Phi }\left(m_{2}\right)$ (second column spec).
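The variance of $ \alpha$ and the border $ n>36$ follow directly from the moment formula for $ m$. The sketch below works in units where $ \lambda=1$ (the coefficient of variation does not depend on $ \lambda$) and uses exact fractions; the function names are illustrative.

```python
from fractions import Fraction
from math import factorial


def moment_of_mean(power, n):
    """E(m^power) for an exponential population, in units of 1/lambda^power.

    Implements E(m^(k+1)) = (n+k)! / (n! n^k) with power = k + 1.
    """
    k = power - 1
    return Fraction(factorial(n + k), factorial(n) * n ** k)


def cv2_alpha(n):
    """Squared coefficient of variation of alpha = m^2 * n / (n + 1)."""
    scale = Fraction(n, n + 1)
    mean = scale * moment_of_mean(2, n)            # equals 1: alpha is unbiased
    second = scale ** 2 * moment_of_mean(4, n)
    return (second - mean ** 2) / mean ** 2


n = 2
while cv2_alpha(n) >= Fraction(1, 9):              # "useful" means CV < 1/3
    n += 1
first_useful = n
print(first_useful, cv2_alpha(36), cv2_alpha(37))
```

The loop stops at the first sample size above 36, and `cv2_alpha(n)` coincides with $ 2\left(2n+3\right)/n/\left(n+1\right)$ for every $ n$.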

Among $ N=40000$ samples of size $ n=36$ drawn at random from an exponential population with $ 1/\lambda=10$, the following values have been obtained. When using $ \alpha=m$ as statistic for $ \mu $, $ 542+1240$ values fall outside $ \left[2\bar{\alpha}/3,\,4\bar{\alpha}/3\right]$, i.e. a proportion of $ 4.5\%$ (since $ \sqrt{var\left(m\right)}=\mu /6$, this is a two-sigma interval for a variate not far from normality). When using $ \alpha=m^{2}$ as statistic for $ \mu_{2}$, $ 5869+6104$ values fall outside $ \left[2\bar{\alpha}/3,\,4\bar{\alpha}/3\right]$, i.e. a proportion of $ 30\%$, and $ 389$ values fall outside $ \left[\bar{\alpha}\pm\bar{\alpha}\right]$, i.e. around $ 1\%$. When using $ \alpha=m_{2}$ as statistic for $ \mu_{2}$, $ 9812+7692$ values fall outside $ \left[2\bar{\alpha}/3,\,4\bar{\alpha}/3\right]$, i.e. a proportion of $ 44\%$, and $ 1492$ values fall outside $ \left[\bar{\alpha}\pm\bar{\alpha}\right]$, i.e. around $ 4\%$.
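An experiment of this kind can be sketched in a few lines. This is not the program used for the figures above: the random generator, the seed, and the choice of the unbiased forms $ m^{2}\,n/\left(n+1\right)$ and $ m_{2}$ (divisor $ n-1$) with nominal value $ \mu_{2}=1/\lambda^{2}$ are assumptions made for illustration, so individual counts will differ while the proportions should be comparable.

```python
import random


def outside_fractions(n=36, n_samples=40000, scale=10.0, seed=0):
    """Fraction of samples whose statistic falls outside [2/3, 4/3] of its nominal value."""
    rng = random.Random(seed)
    mu, mu2 = scale, scale ** 2          # exponential law: mean 1/lambda, variance 1/lambda^2
    counts = {"m": 0, "m^2": 0, "m_2": 0}
    for _ in range(n_samples):
        xs = [rng.expovariate(1.0 / scale) for _ in range(n)]
        m = sum(xs) / n
        m2 = sum((x - m) ** 2 for x in xs) / (n - 1)   # unbiased sample variance
        alpha = m * m * n / (n + 1)                    # unbiased for mu2 via the mean
        for name, value, nominal in (("m", m, mu), ("m^2", alpha, mu2), ("m_2", m2, mu2)):
            if not (2 * nominal / 3 <= value <= 4 * nominal / 3):
                counts[name] += 1
    return {name: count / n_samples for name, count in counts.items()}


result = outside_fractions()
for name, fraction in result.items():
    print(f"{name:4s} outside: {100 * fraction:.1f}%")
```

The ordering of the three proportions is the point of the comparison: the estimate of $ \mu_{2}$ built from $ m^{2}$ is noticeably tighter than the one built from $ m_{2}$.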




douillet@ensait.fr
2009-09-09