Approximations to the Inverse Hypergeometric distribution | Vose Software

Approximations to the Inverse Hypergeometric distribution

The InvHypergeo(s, D, M) distribution describes the number of failures one may have before achieving s successes, where each trial is a sample without replacement from a population of size M, and a success is defined as picking one of the D items in that population that have some particular characteristic. For example, the number of non-infected animals selected at random before s infected animals have been found, from a population of M animals of which D are known to be infected, is described by an InvHypergeo(s, D, M) distribution. The probability mass function of the InvHypergeo distribution is a mass of factorial calculations, which is laborious to evaluate and leads us to look for suitable approximations.
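
As a rough sketch of what those factorial calculations involve, the probability mass function can be written with binomial coefficients. The function name and the example values below are illustrative and not from the original text; the combinatorial form assumes the parameterisation described above (x failures before the s-th success, with 1 <= s <= D).

```python
from math import comb

def inv_hypergeo_pmf(x, s, D, M):
    """P(X = x): probability of exactly x failures before the s-th success."""
    if x < 0 or x > M - D:
        return 0.0
    # The first x+s-1 draws contain s-1 successes, draw x+s is a success,
    # and the remaining D-s successes lie among the last M-x-s draws.
    return comb(x + s - 1, s - 1) * comb(M - x - s, D - s) / comb(M, D)

# Hypothetical example: 3 animals, 2 infected, wait for the first infected one.
probs = [inv_hypergeo_pmf(x, s=1, D=2, M=3) for x in range(2)]
print(probs)
```

Even this compact form needs three large binomial coefficients per value of x, which is why the approximations are attractive for large M and D.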

NegBin approximation to the InvHypergeo

A hypergeometric process is sampling from a finite population without replacement, so that the result of a sample is dependent on the samples that have gone before it. If the population is very large, so that removing a sample of size n has no discernible effect on the population, then the probability that an individual sample will have the characteristic of interest is essentially constant and has the value D/M, because the probability of resampling an item in the population, were one to replace items after sampling, would be very small. In such cases, the hypergeometric process can be approximated by a binomial process.

The rule most often quoted is that this approximation works well when n < 0.1 M. The expected number of trials needed to achieve s successes is s(M+1)/(D+1). Recognizing that M and D are large, so that M+1 ≈ M and D+1 ≈ D, the expected number of trials is approximately sM/D. Substituting this for n gives sM/D < 0.1 M, and hence the condition:

s < 0.1 D

For a binomial process, the number of failures before achieving s successes, where p is the probability of success, follows a NegBin(s, p) distribution. Thus:

InvHypergeo(s, D, M) ≈ NegBin(s, D/M)                       when s < 0.1 D


Figure: Example of a NegBin approximation to an InvHypergeo distribution.
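
The quality of the NegBin approximation can be checked numerically. The following is a minimal sketch, assuming the combinatorial pmf for the InvHypergeo and the usual NegBin pmf for the number of failures; the function names and the parameter values (s = 5, D = 100, M = 1000) are illustrative choices that satisfy s < 0.1 D, not values from the original text.

```python
from math import comb

def inv_hypergeo_pmf(x, s, D, M):
    """Exact probability of x failures before the s-th success."""
    return comb(x + s - 1, s - 1) * comb(M - x - s, D - s) / comb(M, D)

def negbin_pmf(x, s, p):
    """NegBin(s, p) probability of x failures before the s-th success."""
    return comb(x + s - 1, x) * p**s * (1 - p)**x

s, D, M = 5, 100, 1000  # illustrative values with s < 0.1 D
support = range(M - D + 1)  # at most M - D failures are possible
exact = [inv_hypergeo_pmf(x, s, D, M) for x in support]
approx = [negbin_pmf(x, s, D / M) for x in support]

# Largest pointwise difference between the two probability mass functions.
max_gap = max(abs(e - a) for e, a in zip(exact, approx))
print(f"largest pmf difference: {max_gap:.5f}")
```

Repeating the comparison with a larger s for the same D and M should show the gap widening as the condition s < 0.1 D is violated.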


Gamma approximation to the InvHypergeo

We have just seen how the InvHypergeo distribution can be approximated by the NegBin distribution, provided s < 0.1 D, by approximating the hypergeometric process with a binomial process. We have also shown elsewhere that the binomial process can be approximated by a Poisson process, provided n is large and p is small. Thus, provided n is large and D/M is small, a hypergeometric process is well approximated by a Poisson process. In a Poisson process, the Gamma(a, b) distribution models the 'time' until a events have been observed, where b is the mean time between events. The InvHypergeo distribution is the hypergeometric equivalent, modeling the number of failures before s successes are achieved, where (M-D)/(D+1) is the mean number of failures per success. The InvHypergeo excludes the s successes, which in terms of a Poisson process contribute nothing to the waiting time because each event is assumed to be instantaneous. To make the two approaches comparable, we should therefore work with the mean number of trials per success, equal to (M-D)/(D+1) + 1 = (M+1)/(D+1). Then we can make the following approximation:

InvHypergeo(s, D, M) ≈ Gamma(s, (M+1)/(D+1))               when s < 0.1 D,  D/M ≈ 0

or more approximately:

InvHypergeo(s, D, M) ≈ Gamma(s, M/D)               when s < 0.1 D,  D/M ≈ 0

Figure: Two examples of a Gamma approximation to an InvHypergeo. Note that the approximation improves as s becomes smaller relative to D and as D/M gets smaller.
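
The Gamma approximation can be examined in the same spirit by comparing cumulative probabilities. This is a sketch under stated assumptions: the exact cdf is built from the ratio of successive InvHypergeo probabilities, the Gamma cdf uses the closed form available when the shape parameter is an integer, and the function names and parameter values (s = 5, D = 100, M = 10000) are illustrative only.

```python
from math import exp

def inv_hypergeo_cdf_table(s, D, M):
    """Return [P(X <= x) for x = 0..M-D] via the pmf ratio recurrence."""
    # P(X = 0): the first s draws are all successes.
    p = 1.0
    for i in range(s):
        p *= (D - i) / (M - i)
    cdf, total = [], 0.0
    for x in range(M - D + 1):
        total += p
        cdf.append(total)
        if x < M - D:
            # P(X = x+1) / P(X = x) = (x+s)(M-D-x) / ((x+1)(M-x-s))
            p *= (x + s) * (M - D - x) / ((x + 1) * (M - x - s))
    return cdf

def gamma_cdf_int_shape(t, a, b):
    """CDF at t of a Gamma with integer shape a and scale (mean gap) b."""
    z = t / b
    term, acc = 1.0, 0.0
    for k in range(a):
        acc += term          # adds z**k / k!
        term *= z / (k + 1)
    return 1.0 - exp(-z) * acc

s, D, M = 5, 100, 10000          # illustrative: s < 0.1 D and D/M = 0.01
b = (M + 1) / (D + 1)            # mean number of trials per success
exact_cdf = inv_hypergeo_cdf_table(s, D, M)
max_gap = max(abs(c - gamma_cdf_int_shape(x, s, b))
              for x, c in enumerate(exact_cdf))
print(f"largest cdf difference: {max_gap:.4f}")
```

Replacing b with M / D gives the cruder form of the approximation for comparison.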

Normal approximation to the InvHypergeo

The Central Limit Theorem tells us that the sum of a large number (n) of independent, identically distributed random variables will approach a Normal distribution:

Sum ≈ Normal(nμ, σ√n)

where μ and σ are the mean and standard deviation of the individual random variables.

In a hypergeometric process, the number of failures before the first success is given by InvHypergeo(1, D, M), which has mean and variance:

Mean = (M-D)/(D+1)

Variance = D(M-D)(M+1) / ((D+1)²(D+2))

For each success to have the same distribution of failures, the number of individuals taken from the population must be a very small fraction of that population, i.e.:

s(M+1)/(D+1) < 0.1 M, or approximately s < 0.1 D

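At the same time, s must be large, so that a large number of these distributions are being added together. Under these conditions, the InvHypergeo distribution can be approximated by a Normal distribution:

InvHypergeo(s, D, M) ≈ Normal(s(M-D)/(D+1), √(s D(M-D)(M+1) / ((D+1)²(D+2))))

or, more approximately:

InvHypergeo(s, D, M) ≈ Normal(s(M-D)/D, √(s M(M-D)) / D)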
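
The Normal approximation can be sanity-checked numerically. The sketch below makes several assumptions: it takes the mean and variance of InvHypergeo(1, D, M) to be (M-D)/(D+1) and D(M-D)(M+1)/((D+1)²(D+2)), scales them by s and √s as the Central Limit Theorem argument suggests, and compares the result with exact probabilities using a continuity correction of 0.5. The function names and parameter values (s = 50, D = 1000, M = 20000) are illustrative only.

```python
from math import erf, sqrt

def inv_hypergeo_pmf_table(s, D, M):
    """Return [P(X = x) for x = 0..M-D] via the pmf ratio recurrence."""
    p = 1.0
    for i in range(s):              # P(X = 0): first s draws all succeed
        p *= (D - i) / (M - i)
    pmf = []
    for x in range(M - D + 1):
        pmf.append(p)
        if x < M - D:
            p *= (x + s) * (M - D - x) / ((x + 1) * (M - x - s))
    return pmf

def normal_cdf(t, mu, sd):
    return 0.5 * (1.0 + erf((t - mu) / (sd * sqrt(2.0))))

D, M = 1000, 20000

# Check the assumed single-success moments against the exact pmf.
pmf1 = inv_hypergeo_pmf_table(1, D, M)
mean1 = sum(x * p for x, p in enumerate(pmf1))
var1 = sum((x - mean1) ** 2 * p for x, p in enumerate(pmf1))
mu1 = (M - D) / (D + 1)
sigma1_sq = D * (M - D) * (M + 1) / ((D + 1) ** 2 * (D + 2))

# Central Limit Theorem scaling for s successes (s large, s < 0.1 D).
s = 50
mu, sd = s * mu1, sqrt(s * sigma1_sq)
total, max_gap = 0.0, 0.0
for x, p in enumerate(inv_hypergeo_pmf_table(s, D, M)):
    total += p                       # exact P(X <= x)
    max_gap = max(max_gap, abs(total - normal_cdf(x + 0.5, mu, sd)))
print(f"largest cdf difference: {max_gap:.4f}")
```

Because an InvHypergeo with modest s is right-skewed, the largest discrepancy tends to sit near the centre of the distribution and should shrink as s grows.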
