Coupon collector's problem

Figure: expected number of tries needed to collect all coupons, E(T), plotted against the number of coupons, n.

In probability theory, the coupon collector's problem describes "collect all coupons and win" contests. It asks the following question: suppose that there is an urn of n different coupons, from which coupons are drawn with replacement, each coupon being equally likely on every draw. What is the probability that more than t trials are needed to collect all n coupons? An alternative statement is: given n coupons, how many draws with replacement do you expect to need before having drawn each coupon at least once? The mathematical analysis of the problem reveals that the expected number of trials needed grows as $\Theta(n\log(n))$.^[1] For example, when n = 50 it takes about 225^[2] trials on average to collect all 50 coupons.

Understanding the problem

The key to solving the problem is understanding that it takes very little time to collect the first few coupons, whereas it takes a long time to collect the last few. In fact, for 50 coupons, it takes on average 50 trials to collect the very last coupon after the other 49 coupons have been collected. This is why the expected time to collect all coupons is much longer than 50. The idea is to split the total time into 50 intervals, one for each newly collected coupon, so that the expected length of each interval can be calculated separately.

Answer

The following table gives the harmonic number H_n and the expected number of tries, rounded up to the next integer, needed to collect sets of 1 to 100 coupons.

Number of coupons, n	Expected number of tries per coupon, H_n	Expected total number of tries, rounded up, ⌈E(T)⌉ = ⌈n H_n⌉	Number of coupons, n	Expected number of tries per coupon, H_n	Expected total number of tries, rounded up, ⌈E(T)⌉ = ⌈n H_n⌉
1	1	1	51	4.51881	231
2	1.5	3	52	4.53804	236
3	1.83333	6	53	4.55691	242
4	2.08333	9	54	4.57543	248
5	2.28333	12	55	4.59361	253
6	2.45	15	56	4.61147	259
7	2.59286	19	57	4.62901	264
8	2.71786	22	58	4.64625	270
9	2.82897	26	59	4.66320	276
10	2.92897	30	60	4.67987	281
11	3.01988	34	61	4.69626	287
12	3.10321	38	62	4.71239	293
13	3.18013	42	63	4.72827	298
14	3.25156	46	64	4.74389	304
15	3.31823	50	65	4.75928	310
16	3.38073	55	66	4.77443	316
17	3.43955	59	67	4.78935	321
18	3.49511	63	68	4.80406	327
19	3.54774	68	69	4.81855	333
20	3.59774	72	70	4.83284	339
21	3.64536	77	71	4.84692	345
22	3.69081	82	72	4.86081	350
23	3.73429	86	73	4.87451	356
24	3.77596	91	74	4.88802	362
25	3.81596	96	75	4.90136	368
26	3.85442	101	76	4.91451	374
27	3.89146	106	77	4.92750	380
28	3.92717	110	78	4.94032	386
29	3.96165	115	79	4.95298	392
30	3.99499	120	80	4.96548	398
31	4.02725	125	81	4.97782	404
32	4.05850	130	82	4.99002	410
33	4.08880	135	83	5.00207	416
34	4.11821	141	84	5.01397	422
35	4.14678	146	85	5.02574	428
36	4.17456	151	86	5.03737	434
37	4.20159	156	87	5.04886	440
38	4.22790	161	88	5.06022	446
39	4.25354	166	89	5.07146	452
40	4.27854	172	90	5.08257	458
41	4.30293	177	91	5.09356	464
42	4.32674	182	92	5.10443	470
43	4.35000	188	93	5.11518	476
44	4.37273	193	94	5.12582	482
45	4.39495	198	95	5.13635	488
46	4.41669	204	96	5.14676	495
47	4.43796	209	97	5.15707	501
48	4.45880	215	98	5.16728	507
49	4.47921	220	99	5.17738	513
50	4.49921	225	100	5.18738	519
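The table entries can be reproduced with exact rational arithmetic. The following is a minimal Python sketch; the function names `harmonic` and `expected_tries` are illustrative choices and do not come from the article.

```python
from fractions import Fraction
from math import ceil


def harmonic(n: int) -> Fraction:
    """Exact n-th harmonic number H_n = 1 + 1/2 + ... + 1/n."""
    return sum((Fraction(1, k) for k in range(1, n + 1)), Fraction(0))


def expected_tries(n: int) -> Fraction:
    """Exact expected number of tries, n * H_n, to collect all n coupons."""
    return n * harmonic(n)


if __name__ == "__main__":
    # Print a few rows in the same layout as the table: n, H_n, ceil(n * H_n).
    for n in (1, 2, 3, 10, 50, 100):
        print(n, f"{float(harmonic(n)):.5f}", ceil(expected_tries(n)))
```

Using `Fraction` avoids any rounding doubt when taking the ceiling; floating-point sums would also be adequate for n up to 100.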

Solution

Calculating the expectation

Let T be the time to collect all n coupons, and let t_i be the time to collect the i-th coupon after i − 1 coupons have been collected, so that T = t_1 + t_2 + ⋯ + t_n. Think of T and t_i as random variables. Observe that, once i − 1 coupons have been collected, the probability that a draw yields a new coupon is p_i = (n − (i − 1))/n. Therefore, t_i has a geometric distribution with expectation 1/p_i. By the linearity of expectations we have:

\begin{align} \operatorname{E}(T)& = \operatorname{E}(t_1) + \operatorname{E}(t_2) + \cdots + \operatorname{E}(t_n) = \frac{1}{p_1} + \frac{1}{p_2} + \cdots + \frac{1}{p_n} \\ & = \frac{n}{n} + \frac{n}{n-1} + \cdots + \frac{n}{1} = n \cdot \left(\frac{1}{1} + \frac{1}{2} + \cdots + \frac{1}{n}\right) \, = \, n \cdot H_n. \end{align}

Here H_n is the n-th harmonic number. Using the asymptotics of the harmonic numbers, we obtain:

\operatorname{E}(T) = n \cdot H_n = n \log n + \gamma n + \frac1{2} + o(1), \ \text{as} \ n \to \infty,

where $\gamma \approx 0.5772156649$ is the Euler–Mascheroni constant.
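As a quick numerical check of this asymptotic formula, the sketch below compares the exact value n H_n with n log n + γn + 1/2. The function names are illustrative, and the value of γ is the truncated one quoted in the text.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant, truncated as in the text


def exact_expectation(n: int) -> float:
    """Exact expectation n * H_n, summed in floating point."""
    return n * math.fsum(1.0 / k for k in range(1, n + 1))


def asymptotic_expectation(n: int) -> float:
    """Approximation n log n + gamma * n + 1/2 (natural logarithm)."""
    return n * math.log(n) + EULER_GAMMA * n + 0.5


if __name__ == "__main__":
    for n in (10, 50, 1000):
        exact = exact_expectation(n)
        approx = asymptotic_expectation(n)
        print(n, f"{exact:.4f}", f"{approx:.4f}", f"{approx - exact:.6f}")
```

The gap between the two values shrinks as n grows, consistent with the o(1) remainder.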

Now one can use the Markov inequality to bound the desired probability:

\operatorname{P}(T \geq c \, n H_n) \le \frac1c.
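The expectation and the Markov bound can be checked empirically with a small Monte Carlo simulation. This is only a sketch: the helper names, the seed, and the number of runs are arbitrary assumptions, and the empirical tail is an estimate rather than an exact probability.

```python
import random


def collect_all(n: int, rng: random.Random) -> int:
    """Draw uniformly with replacement until all n coupons are seen.

    Returns the number of draws that were needed.
    """
    seen = set()
    draws = 0
    while len(seen) < n:
        seen.add(rng.randrange(n))
        draws += 1
    return draws


def simulate(n: int, runs: int, seed: int = 0) -> list:
    """Return `runs` independent samples of T for n coupons."""
    rng = random.Random(seed)
    return [collect_all(n, rng) for _ in range(runs)]


def harmonic(n: int) -> float:
    """n-th harmonic number in floating point."""
    return sum(1.0 / k for k in range(1, n + 1))


if __name__ == "__main__":
    n, runs, c = 50, 2000, 2.0
    samples = simulate(n, runs)
    mean = sum(samples) / runs
    threshold = c * n * harmonic(n)
    tail = sum(t >= threshold for t in samples) / runs
    print("empirical mean:", mean, "exact mean:", n * harmonic(n))
    print("empirical P(T >= c n H_n):", tail, "Markov bound:", 1 / c)
```

In such runs the empirical tail is typically far below 1/c, which illustrates how loose the Markov bound is for this problem.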

Calculating the variance

Using the independence of random variables t_i, we obtain:

\begin{align} \operatorname{Var}(T)& = \operatorname{Var}(t_1) + \operatorname{Var}(t_2) + \cdots + \operatorname{Var}(t_n) \\ & = \frac{1-p_1}{p_1^2} + \frac{1-p_2}{p_2^2} + \cdots + \frac{1-p_n}{p_n^2} \\ & < \left(\frac{n^2}{n^2} + \frac{n^2}{(n-1)^2} + \cdots + \frac{n^2}{1^2}\right) \\ & = n^2 \cdot \left(\frac{1}{1^2} + \frac{1}{2^2} + \cdots + \frac{1}{n^2} \right) \\ & < n^2 \cdot \left(\frac{\pi^2}{6} - \frac{1}{n} + \frac{1}{2n^2}\right) \\ & < \frac{\pi^2}{6} n^2. \end{align}

The $\pi^2/6$ is a value of the Riemann zeta function (see Basel problem).

Now one can use the Chebyshev inequality to bound the desired probability:

\operatorname{P}\left(|T- n H_n| \geq c\, n\right) \le \frac{\pi^2}{6c^2}.
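The variance sum and the two bounds can be evaluated directly. The sketch below sums (1 − p_i)/p_i² term by term and compares the result with π²n²/6; the function names are illustrative.

```python
import math


def exact_variance(n: int) -> float:
    """Var(T) as the sum of the geometric variances (1 - p_i) / p_i^2."""
    total = 0.0
    for i in range(1, n + 1):
        p = (n - (i - 1)) / n  # probability of a new coupon at stage i
        total += (1.0 - p) / (p * p)
    return total


def variance_bound(n: int) -> float:
    """Upper bound (pi^2 / 6) * n^2 on Var(T)."""
    return math.pi ** 2 / 6 * n * n


def chebyshev_bound(c: float) -> float:
    """Chebyshev bound pi^2 / (6 c^2) on P(|T - n H_n| >= c n)."""
    return math.pi ** 2 / (6 * c * c)


if __name__ == "__main__":
    for n in (2, 10, 50):
        print(n, exact_variance(n), variance_bound(n))
    print("Chebyshev bound for c = 2:", chebyshev_bound(2.0))
```

Note that the Chebyshev bound is only informative for c larger than π/√6 ≈ 1.28, since below that value it exceeds 1.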

Tail estimates

A different upper bound can be derived from the following observation. Let ${Z}_i^r$ denote the event that the $i$ -th coupon was not picked in the first $r$ trials. Then:

\begin{align} P\left [ {Z}_i^r \right ] = \left(1-\frac{1}{n}\right)^r \le e^{-r / n} \end{align}

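Thus, for $r = \beta n \log n$, we have $P\left[ {Z}_i^r \right] \le e^{(-\beta n \log n) / n} = n^{-\beta}$. Since $T > r$ exactly when at least one coupon is still missing after $r$ trials, the union bound over the $n$ coupons gives: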

\begin{align} P\left [ T > \beta n \log n \right ] = P \left [ \bigcup_i {Z}_i^{\beta n \log n} \right ] \le n \cdot P [ {Z}_1^{\beta n \log n} ] \le n^{-\beta + 1} \end{align}
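To see how tight this bound is, the tail can be computed exactly for small n. The sketch below uses the standard inclusion-exclusion formula for the probability of a union, P(T > r) = Σ_{k≥1} (−1)^{k+1} C(n,k) (1 − k/n)^r, which is not stated in the article and is brought in here as an assumption of the example. Because the number of trials is an integer, r is taken as βn log n rounded up, for which the bound n^{1−β} still holds.

```python
from fractions import Fraction
from math import ceil, comb, log


def tail_exact(n: int, r: int) -> Fraction:
    """Exact P(T > r): some coupon is still missing after r draws.

    Inclusion-exclusion over the events Z_i^r, in exact rational arithmetic
    to avoid cancellation errors between the alternating terms.
    """
    return sum(
        (-1) ** (k + 1) * comb(n, k) * Fraction(n - k, n) ** r
        for k in range(1, n + 1)
    )


def union_bound(n: int, beta: float) -> float:
    """Upper bound n^(1 - beta) from the text."""
    return n ** (1.0 - beta)


if __name__ == "__main__":
    n = 20
    for beta in (1.5, 2.0, 3.0):
        r = ceil(beta * n * log(n))
        print(beta, r, float(tail_exact(n, r)), union_bound(n, beta))
```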

Extensions and generalizations

Paul Erdős and Alfréd Rényi proved a limit theorem for the distribution of T, which sharpens the previous bounds:

\operatorname{P}(T < n\log n + cn) \to e^{-e^{-c}}, \ \ \text{as} \ n \to \infty.
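The limit can be compared with simulated data for a moderate n. The following sketch estimates P(T < n log n + cn) empirically and prints it next to e^{−e^{−c}}; the helper names, seed, and run count are arbitrary assumptions, and at finite n only rough agreement should be expected.

```python
import math
import random


def collect_all(n: int, rng: random.Random) -> int:
    """Number of uniform draws with replacement until all n coupons are seen."""
    seen = set()
    draws = 0
    while len(seen) < n:
        seen.add(rng.randrange(n))
        draws += 1
    return draws


def empirical_cdf(n: int, c: float, runs: int, seed: int = 0) -> float:
    """Fraction of simulated runs with T < n log n + c n."""
    rng = random.Random(seed)
    threshold = n * math.log(n) + c * n
    hits = sum(collect_all(n, rng) < threshold for _ in range(runs))
    return hits / runs


def gumbel_limit(c: float) -> float:
    """Limiting value exp(-exp(-c)) from the Erdős–Rényi theorem."""
    return math.exp(-math.exp(-c))


if __name__ == "__main__":
    n, runs = 100, 2000
    for c in (-1.0, 0.0, 1.0, 2.0):
        print(c, empirical_cdf(n, c, runs), gumbel_limit(c))
```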

Donald J. Newman and Lawrence Shepp found a generalization of the coupon collector's problem in which m copies of each coupon need to be collected. Let T_m be the first time m copies of each coupon are collected. They showed that the expectation in this case satisfies:

\operatorname{E}(T_m) = n \log n + (m-1) n \log\log n + O(n), \ \text{as} \ n \to \infty.

Here m is fixed. When m = 1 we get the earlier formula for the expectation.
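The quantity T_m is easy to simulate. The sketch below estimates E(T_m) by Monte Carlo and prints it next to the two leading terms n log n + (m − 1) n log log n. Because the O(n) term is not specified, the two numbers are not expected to match closely for small n; the helper names, seed, and run count are assumptions of this example.

```python
import math
import random


def collect_m_copies(n: int, m: int, rng: random.Random) -> int:
    """Draw uniformly with replacement until every coupon has m copies."""
    counts = [0] * n
    complete = 0  # number of coupons that already have m copies
    draws = 0
    while complete < n:
        i = rng.randrange(n)
        counts[i] += 1
        if counts[i] == m:
            complete += 1
        draws += 1
    return draws


def mean_tm(n: int, m: int, runs: int, seed: int = 0) -> float:
    """Monte Carlo estimate of E(T_m)."""
    rng = random.Random(seed)
    return sum(collect_m_copies(n, m, rng) for _ in range(runs)) / runs


def leading_terms(n: int, m: int) -> float:
    """n log n + (m - 1) n log log n, without the O(n) term (needs n >= 2)."""
    return n * math.log(n) + (m - 1) * n * math.log(math.log(n))


if __name__ == "__main__":
    n, runs = 30, 2000
    for m in (1, 2, 3):
        print(m, mean_tm(n, m, runs), leading_terms(n, m))
```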

A common generalization, also due to Erdős and Rényi, is:

\operatorname{P}(T_m < n\log n + (m-1) n \log\log n + cn) \to e^{-e^{-c}/(m-1)!}, \ \ \text{as} \ n \to \infty.

Wolfgang Stadje solved the case in which the stickers are bought in packets that contain no duplicates.^[3] The results show that for practical applications, say packets of five stickers, the effect of packets is negligible.

Sylvain Sardy and Yvan Velenik derived an optimized strategy, including swapping and purchasing missing stickers, by Monte Carlo simulation.^[4]

In the general case of a nonuniform probability distribution, where p_i now denotes the probability of drawing coupon i on each trial, according to Philippe Flajolet,

E(T)=\int_0^\infty \big(1-\prod_{i=1}^n(1-e^{-p_it})\big)dt
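This integral can be evaluated numerically for a given list of coupon probabilities. The sketch below uses composite Simpson integration on a truncated interval and cross-checks it against the finite sum Σ_{S≠∅} (−1)^{|S|+1} / Σ_{i∈S} p_i, which follows from expanding the product and integrating term by term. Both function names, the truncation tolerance, and the step count are assumptions of this example, and the subset sum is only practical for small n.

```python
import math
from itertools import combinations


def expected_by_integral(p, eps=1e-9, steps=20000):
    """Approximate the integral of 1 - prod(1 - exp(-p_i t)) over t >= 0.

    The integrand is at most n * exp(-p_min * t), so the interval is cut at
    the point where the remaining tail contributes less than eps.
    `steps` must be even for Simpson's rule.
    """
    n = len(p)
    p_min = min(p)
    t_max = math.log(n / (p_min * eps)) / p_min
    h = t_max / steps

    def f(t):
        return 1.0 - math.prod(1.0 - math.exp(-pi * t) for pi in p)

    total = f(0.0) + f(t_max)
    for j in range(1, steps):
        total += (4 if j % 2 else 2) * f(j * h)
    return total * h / 3


def expected_by_subsets(p):
    """Same expectation via the alternating sum over nonempty subsets."""
    total = 0.0
    for size in range(1, len(p) + 1):
        sign = 1.0 if size % 2 else -1.0
        for subset in combinations(p, size):
            total += sign / sum(subset)
    return total


if __name__ == "__main__":
    for probs in ([0.2] * 5, [0.5, 0.3, 0.2]):
        print(probs, expected_by_integral(probs), expected_by_subsets(probs))
```

For uniform probabilities p_i = 1/n both functions should return approximately n H_n, which gives a simple sanity check.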

Notes

[1] Here and throughout this article, "log" refers to the natural logarithm rather than a logarithm to some other base. The use of Θ here invokes big O notation.
[2] For n = 50, E(T) = 50(1 + 1/2 + 1/3 + ... + 1/50) = 224.9603 is the expected number of trials to collect all 50 coupons. The approximation $n\log n+\gamma n+1/2$ for this expected number gives in this case $50\log 50+50\gamma+1/2 \approx 195.6011+28.8608+0.5\approx 224.9619$.
[3] Wolfgang Stadje: The Collector's Problem with Group Drawings, Advances in Applied Probability, Vol. 22, No. 4 (Dec., 1990), pp. 866–882.
[4] Sylvain Sardy and Yvan Velenik: Paninimania: sticker rarity and cost-effective strategy, Swiss Statistical Society, Bulletin nr. 66 (2010), 2–6.

References

Blom, Gunnar; Holst, Lars; Sandell, Dennis (1994), "7.5 Coupon collecting I, 7.6 Coupon collecting II, and 15.4 Coupon collecting III", Problems and Snapshots from the World of Probability, New York: Springer-Verlag, pp. 85–87, 191, ISBN 0-387-94161-4, MR 1265713.
Dawkins, Brian (1991), "Siobhan's problem: the coupon collector revisited", The American Statistician 45 (1): 76–82, doi:10.2307/2685247, JSTOR 2685247.
Erdős, Paul; Rényi, Alfréd (1961), "On a classical problem of probability theory" (PDF), Magyar Tudományos Akadémia Matematikai Kutató Intézetének Közleményei 6: 215–220, MR 0150807.
Newman, Donald J.; Shepp, Lawrence (1960), "The double dixie cup problem", American Mathematical Monthly 67: 58–61, doi:10.2307/2308930, MR 0120672.
Flajolet, Philippe; Gardy, Danièle; Thimonier, Loÿs (1992), "Birthday paradox, coupon collectors, caching algorithms and self-organizing search", Discrete Applied Mathematics 39 (3): 207–229, doi:10.1016/0166-218X(92)90177-C, MR 1189469.
Isaac, Richard (1995), "8.4 The coupon collector's problem solved", The Pleasures of Probability, Undergraduate Texts in Mathematics, New York: Springer-Verlag, pp. 80–82, ISBN 0-387-94415-X, MR 1329545.
Motwani, Rajeev; Raghavan, Prabhakar (1995), "3.6. The Coupon Collector's Problem", Randomized algorithms, Cambridge: Cambridge University Press, pp. 57–63, MR 1344451.

External links

"Coupon Collector Problem" by Ed Pegg, Jr., the Wolfram Demonstrations Project. Mathematica package.
Coupon Collector Problem, a simple Java applet.
How Many Singles, Doubles, Triples, Etc., Should The Coupon Collector Expect?, a short note by Doron Zeilberger.

This article is issued from Wikipedia, version of Wednesday, May 04, 2016. The text is available under the Creative Commons Attribution/Share Alike license, but additional terms may apply for the media files.