The Jackknife and Bootstrap


Define a sample from an unknown population with $n$ independent observations as

$$x = (x_1, x_2, \ldots, x_n)'.$$

A statistic $\hat{\theta}$ that we are interested in is a function of the sample $x$:
$$\hat{\theta} = s(x).$$
The task is to estimate the standard deviation of $\hat{\theta}$ using new samples generated from the original sample $x$.

1 The Jackknife Estimate of Standard Error

The basic idea of the jackknife is to create new samples from the observed sample by removing one observation at a time.

Define the $i$th jackknife sample as the sample that remains after dropping the $i$th observation, that is,
$$x_{(i)} = (x_1, x_2, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n)' \qquad \text{for } i = 1, 2, \ldots, n.$$
The corresponding value of the statistic of interest can be obtained from these jackknife samples as
$$\hat{\theta}_{(i)} = s(x_{(i)}) \qquad \text{for } i = 1, 2, \ldots, n.$$
From these $n$ new statistics, the jackknife estimate of standard error for $\hat{\theta}$ can be calculated as
$$\hat{se}_{jack} = \sqrt{\frac{n-1}{n} \sum_{i=1}^{n} \left( \hat{\theta}_{(i)} - \frac{1}{n} \sum_{j=1}^{n} \hat{\theta}_{(j)} \right)^2}.$$
Take the sample mean $\bar{x}$ as an example. To get the jackknife estimate of standard error for $\bar{x}$, we calculate the sample means of the $n$ jackknife samples
$$\bar{x}_{(i)} = \frac{x_1 + x_2 + \cdots + x_{i-1} + x_{i+1} + \cdots + x_n}{n-1} \qquad \text{for } i = 1, 2, \ldots, n.$$
Then the jackknife estimate of standard error for $\bar{x}$ is
$$\hat{se}_{jack} = \sqrt{\frac{n-1}{n} \sum_{i=1}^{n} \left( \bar{x}_{(i)} - \bar{x} \right)^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n(n-1)}},$$
which is exactly the same form as the classic formula.
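As a quick numerical check of this equality, here is a minimal NumPy sketch (the data, seed, and sample size are arbitrary choices, not from the source) that computes $\hat{se}_{jack}$ for the sample mean and compares it with the classic formula:

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed; any sample works here
x = rng.normal(size=30)
n = len(x)

# Jackknife: recompute the mean with each observation left out in turn.
theta = np.array([np.delete(x, i).mean() for i in range(n)])
se_jack = np.sqrt((n - 1) / n * np.sum((theta - theta.mean()) ** 2))

# Classic formula for the standard error of the mean.
se_classic = x.std(ddof=1) / np.sqrt(n)

print(se_jack, se_classic)       # agree up to floating-point error
```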

2 The Nonparametric Bootstrap Estimate of Standard Error

The jackknife approach does not work for non-smooth statistics, while the bootstrap does not share this drawback. Bootstrap samples differ from jackknife samples in that they are obtained by random sampling with replacement. Moreover, instead of exactly $n$ new samples, there can be any number of bootstrap samples; around 200 bootstrap samples are usually sufficient for estimating a standard error (around 2,000 for a confidence interval). The size of each bootstrap sample is $n$, the same as the original sample.

Let the number of bootstrap samples be $N$, and denote the $i$th bootstrap sample as
$$x^*_{(i)} = (x^*_1, x^*_2, \ldots, x^*_n) \qquad \text{for } i = 1, 2, \ldots, N.$$
Similarly, the corresponding value of the statistic of interest can be obtained as
$$\hat{\theta}^*_{(i)} = s(x^*_{(i)}) \qquad \text{for } i = 1, 2, \ldots, N.$$
The bootstrap estimate of standard error for $\hat{\theta}$ is then the empirical standard deviation of these $N$ values,
$$\hat{se}_{boot} = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} \left( \hat{\theta}^*_{(i)} - \frac{1}{N} \sum_{j=1}^{N} \hat{\theta}^*_{(j)} \right)^2}.$$
As an example, take a dataset of 50 independent draws from a Poisson distribution with mean 5.

As shown in Figure 1:

  • Randomly draw 50 data points with replacement from the dataset to form Sample 1.

  • Repeat this procedure $N$ times ($N = 200$ is used in this example) to get Sample 2, Sample 3, …

  • Calculate the corresponding statistic of interest for each of these samples; the sample mean is used in this example.

  • The bootstrap estimate of standard error for this statistic can then be calculated, as in the sketch following Figure 1.

Figure 1: The process of the bootstrap applied to the example data set
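The steps above translate directly into code. A minimal sketch, assuming NumPy and an arbitrary seed (the data are simulated, not the figure's exact values):

```python
import numpy as np

rng = np.random.default_rng(42)   # arbitrary seed
x = rng.poisson(lam=5, size=50)   # original sample: 50 draws from Poisson(5)
N = 200                           # number of bootstrap samples

# Draw N bootstrap samples of size 50 with replacement; record each sample mean.
boot_means = np.array([rng.choice(x, size=x.size, replace=True).mean()
                       for _ in range(N)])

# The bootstrap estimate of standard error is the empirical standard deviation.
se_boot = boot_means.std(ddof=1)
print(se_boot)                    # close to the theoretical sqrt(5/50) ≈ 0.32
```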

3 Resampling Plans

The jackknife and the bootstrap are both special cases of resampling from the original sample $x = (x_1, x_2, \ldots, x_n)'$.

3.1 Resampling Vector

Define a resampling vector $\mathbf{P} = (P_1, P_2, \ldots, P_n)$, in which $P_i \geq 0$ and $\sum_{i=1}^{n} P_i = 1$.

Each component $P_i$ of $\mathbf{P}$ gives the weight of $x_i$ in the new sample:
$$P_i = \frac{\text{the number of copies of } x_i \text{ in the new sample}}{\text{the sample size of the new sample}}.$$
Therefore, the information in a new sample can be recovered entirely from $\mathbf{P}$ and the original sample vector $x = (x_1, x_2, \ldots, x_n)'$.

The set of all possible values of $\mathbf{P}$ is a simplex¹, denoted by $S_n$.

Since the new sample is determined entirely by the resampling vector $\mathbf{P}$ once the original sample $x$ is fixed, the statistic of interest $\hat{\theta}^*$ for a specific new sample can be written as a function of $\mathbf{P}$:
$$\hat{\theta}^* = S(\mathbf{P}).$$
A resampling plan seeks to find out how the statistic of interest $\hat{\theta}^*$ changes as the weight vector $\mathbf{P}$ varies across its value space $S_n$, holding the original data set $x$ fixed.

3.2 Resampling Vectors of Jackknife Samples and Bootstrap Samples

Let the resampling vector with equal weight on each $x_i$ be
$$\mathbf{P}_0 = \frac{1}{n} (\overbrace{1, 1, \ldots, 1}^{n \text{ in total}})'.$$
The sample corresponding to $\mathbf{P}_0$ is exactly the same as the original sample $x$. Then we have
$$S(\mathbf{P}_0) = s(x) = \hat{\theta},$$
which is the original estimate of the statistic of interest.

The resampling vector of the $i$th jackknife sample $x_{(i)}$ is
$$\mathbf{P}_{(i)} = \frac{1}{n-1} (\overbrace{1, 1, \ldots, 1}^{i-1}, 0, \overbrace{1, 1, \ldots, 1}^{n-i})'.$$
For the $i$th bootstrap sample $x^*_{(i)}$, the resampling vector is
$$\mathbf{P}^*_{(i)} = \frac{1}{n} (N_1, N_2, \ldots, N_n)',$$
where $N_j$ denotes the number of copies of $x_j$ (the $j$th observation from the original sample) in the new sample, so that $\sum_{j=1}^{n} N_j = n$.

To get a bootstrap sample, we draw $n$ times from the $n$ observations, each observation having probability $1/n$ of being drawn each time. Thus the outcome $(N_1, N_2, \ldots, N_n)'$ follows a multinomial distribution. Therefore, for each specific value of $\mathbf{P}^*_{(i)}$, the probability of obtaining its corresponding sample, called the bootstrap probability, can be calculated.
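Concretely, since the counts come from $n$ equally likely draws, the bootstrap probability of a particular count vector $(N_1, N_2, \ldots, N_n)$ is the multinomial probability
$$\Pr\{(N_1, N_2, \ldots, N_n)\} = \frac{n!}{N_1! \, N_2! \cdots N_n!} \left( \frac{1}{n} \right)^n.$$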

3.3 An Example of Resampling Simplex for Sample of Size 3

Figure 2 demonstrates a resampling simplex for an original sample with 3 observations, $x = (x_1, x_2, x_3)'$.

The centre point is $\mathbf{P}_0$, with the same weight on each of the 3 observations. $\mathbf{P}_{(1)}$, $\mathbf{P}_{(2)}$ and $\mathbf{P}_{(3)}$ are the jackknife points. For each jackknife point, the removed observation has weight 0 while each of the two remaining observations has weight 1/2.

As for the bootstrap, there are 10 possible resampling vectors for its new samples: $\mathbf{P}_0$, the three corner points and the six red starred points. The probabilities of obtaining these samples differ: out of every 27 equally likely draws, on average 6 give the centre point ($\mathbf{P}_0$), 1 gives each of the 3 corner points, and 3 give each of the 6 red starred points.
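These counts are exactly the multinomial probabilities above applied to $n = 3$:
$$\frac{3!}{1!\,1!\,1!} = 6, \qquad \frac{3!}{3!\,0!\,0!} = 1, \qquad \frac{3!}{2!\,1!\,0!} = 3, \qquad 6 + 3 \times 1 + 6 \times 3 = 27 = 3^3.$$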

Figure 2: An example of a resampling simplex for sample size $n = 3$

3.4 The Statistic of Interest Corresponding to the Resampling Vector

Exploring the relationship between the statistic of interest $\hat{\theta}^*$ and the resampling vector $\mathbf{P}$ amounts to studying the function $\hat{\theta}^* = S(\mathbf{P})$.

The Euclidean distance between a jackknife vector $\mathbf{P}_{(i)}$ and $\mathbf{P}_0$ is
$$\| \mathbf{P}_{(i)} - \mathbf{P}_0 \| = \frac{1}{\sqrt{n(n-1)}}.$$
With the bootstrap probabilities known, the expected root mean square distance between a bootstrap vector $\mathbf{P}^*_{(i)}$ and $\mathbf{P}_0$ can be calculated as well:
$$\left( E \| \mathbf{P}^*_{(i)} - \mathbf{P}_0 \|^2 \right)^{1/2} = \sqrt{\frac{n-1}{n^2}}.$$
From the above we see that the expected distance between a bootstrap vector and the centre is roughly $\sqrt{n}$ times the distance of a jackknife vector from the centre.
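For instance, with $n = 20$ (a worked instance of the two formulas above),
$$\frac{1}{\sqrt{20 \times 19}} \approx 0.051, \qquad \sqrt{\frac{19}{400}} \approx 0.218,$$
a ratio of about $4.2 \approx \sqrt{20}$.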

The approximate directional derivative of $S(\mathbf{P})$ in the direction from $\mathbf{P}_0$ to $\mathbf{P}_{(i)}$ is
$$D_i = \frac{S(\mathbf{P}_{(i)}) - S(\mathbf{P}_0)}{\| \mathbf{P}_{(i)} - \mathbf{P}_0 \|} = \left( \hat{\theta}_{(i)} - \frac{1}{n} \sum_{j=1}^{n} \hat{\theta}_{(j)} \right) \sqrt{n(n-1)},$$
which is the slope of the function $S(\mathbf{P})$ at $\mathbf{P}_0$ in the direction of $\mathbf{P}_{(i)}$.

Comparing this with the formula for the jackknife estimate of standard error $\hat{se}_{jack}$, we find that $\hat{se}_{jack}$ is proportional to the root mean square of the slopes $D_i$.
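Explicitly, substituting $D_i$ into the jackknife formula gives
$$\hat{se}_{jack} = \sqrt{\frac{n-1}{n} \sum_{i=1}^{n} \frac{D_i^2}{n(n-1)}} = \frac{1}{\sqrt{n}} \sqrt{\frac{1}{n} \sum_{i=1}^{n} D_i^2},$$
that is, $1/\sqrt{n}$ times the root mean square of the $D_i$.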

When $S(\mathbf{P})$ is a linear function of $\mathbf{P}$, as for the sample mean, $\hat{se}_{jack}$ is equal to $\hat{se}_{boot}$. When $S(\mathbf{P})$ is not linear, $\hat{se}_{jack}$ is only an approximation, as the sketch below illustrates.
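A small sketch illustrating this (the data, seed, and the choice of the sample standard deviation as the nonlinear statistic are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)    # arbitrary seed
x = rng.poisson(lam=5, size=50)   # arbitrary sample
N = 2000                          # bootstrap replications

def se_jack(x, stat):
    # Jackknife: leave one observation out at a time.
    t = np.array([stat(np.delete(x, i)) for i in range(len(x))])
    return np.sqrt((len(x) - 1) / len(x) * np.sum((t - t.mean()) ** 2))

def se_boot(x, stat, N):
    # Bootstrap: N resamples of size n with replacement.
    t = np.array([stat(rng.choice(x, size=x.size, replace=True)) for _ in range(N)])
    return t.std(ddof=1)

# Linear statistic (the mean): the two estimates nearly coincide.
print(se_jack(x, np.mean), se_boot(x, np.mean, N))

# Nonlinear statistic (the sample standard deviation): only approximate agreement.
sd = lambda s: s.std(ddof=1)
print(se_jack(x, sd), se_boot(x, sd, N))
```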

3.5 Compromise of the Bootstrap Algorithm

Theoretically, for a specific original sample, with the distribution of all possible bootstrap vectors known, the ideal bootstrap standard error estimate can be evaluated.

For a sample with $K$ possible resampling points, the ideal bootstrap standard error estimate is
$$\hat{se}_{boot} = \sqrt{\sum_{k=1}^{K} p_k \left( S(\mathbf{P}^*_{(k)}) - S(\mathbf{P}_0) \right)^2},$$
where $p_k$ is the bootstrap probability corresponding to $\mathbf{P}^*_{(k)}$. Here the weighted mean of these $K$ resampling points is taken to be $S(\mathbf{P}_0)$.

However, the number of possible resampling points $K$ increases dramatically with the sample size $n$. For a sample of size $n$,
$$K = \binom{2n-1}{n}.$$
For $n = 20$, $K$ already reaches $6.9 \times 10^{10}$. Thus it is unrealistic to calculate the ideal bootstrap standard error estimate in practice. What we do instead is choose $N$ resampling vectors at random, as described earlier, to get a sufficiently accurate estimate.
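A quick computation (a sketch using only Python's standard library) shows how fast $K$ grows:

```python
from math import comb

# K = C(2n-1, n): the number of distinct resampling vectors for sample size n.
for n in (5, 10, 15, 20):
    print(n, comb(2 * n - 1, n))
# n = 20 yields 68,923,264,410, i.e. about 6.9e10, matching the text.
```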

References

Efron, B. and Hastie, T. (2016). 'The Jackknife and the Bootstrap', in Computer Age Statistical Inference: Algorithms, Evidence, and Data Science. Cambridge: Cambridge University Press, pp. 155–166.


  1. An $(n-1)$-dimensional simplex ($S_n$) is the set of nonnegative vectors whose components sum to 1. ↩︎