Selecting the appropriate distributions for your model

The precision of a risk analysis relies very heavily on the appropriate use of probability distributions to accurately represent the uncertainty, randomness and variability of the problem. In our experience, inappropriate use of probability distributions has proven to be a very common failure of risk analysis models. It stems, in part, from an inadequate understanding of the theory behind probability distribution functions and, in part, from failing to appreciate the knock-on effects of using inappropriate distributions.

In this section we discuss five basic properties of distributions and how these properties should be used to select the distributions in your model. The five properties are:

Discrete or continuous
Bounded or unbounded
Parametric or non-parametric
Univariate or multivariate
First and second order

Finally, we have put together tables with links for each distribution as it fits into each category for both univariate and multivariate distributions.

Discrete and continuous distributions

The most basic distinguishing property between probability distributions is whether they are continuous or discrete. It is extraordinary how often the discrete or continuous nature of a variable is overlooked when selecting the distribution that will be used to model it.

Discrete distributions

A discrete distribution may take one of a set of identifiable values, each of which has a calculable probability of occurrence. Discrete distributions are used to model parameters like the number of bridges a roading scheme may need, the number of key personnel to be employed or the number of customers that will arrive at a service station in an hour. Clearly, variables such as these can only take specific values: one cannot build half a bridge, employ 2.7 people or serve 13.6 customers.

Continuous distributions

A continuous distribution is used to represent a continuous variable, i.e. a variable that can take any value within a defined range (domain). For example, the height of an adult English male picked at random will have a continuous distribution because the height of a person is essentially infinitely divisible. We could measure his height to the nearest centimeter, millimeter, tenth of a millimeter, etc. The scale can be repeatedly divided up generating more and more possible values.

Properties like time, mass and distance, that are infinitely divisible, are modelled using continuous distributions. In practice, we also use continuous distributions to model variables that are, in truth, discrete but where the gap between allowable values is insignificant: for example, project cost (which is discrete with steps of one penny, one cent, etc.), exchange rate (which is only quoted to a few significant figures), number of employees in a large organization, etc.
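The distinction between the two kinds of variable can be sketched in a few lines of Python. This is only an illustration using the standard library, not ModelRisk code: the arrival rate of 13.6 customers per hour and the height parameters (mean 175 cm, standard deviation 7 cm) are assumed values, and `sample_poisson` is a hypothetical helper written for this example.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw one Poisson(rate) variate using Knuth's multiplication method."""
    threshold = math.exp(-rate)
    count = 0
    product = rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


rng = random.Random(1)

# Discrete: customers arriving in an hour (assumed mean of 13.6 per hour).
# The mean may be fractional, but every sampled value is a whole number.
customers = [sample_poisson(13.6, rng) for _ in range(5)]

# Continuous: adult height in cm (assumed mean 175, standard deviation 7).
# Any real value is possible.
heights = [rng.gauss(175.0, 7.0) for _ in range(5)]

print(customers)
print(heights)
```

Note that the discrete model can have a mean of 13.6 customers even though no single hour can ever produce 13.6 customers.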

Bounded and unbounded distributions

A distribution that is confined to lie between two determined values is said to be bounded. A distribution that is unbounded theoretically extends from minus infinity to plus infinity. A distribution that is constrained at only one end is said to be partially bounded. Unbounded and partially bounded distributions may, at times, need to be constrained to remove the tail of the distribution so that nonsensical values are avoided. For example, using a Normal distribution to model sales volume opens up the chance of generating a negative value. If the probability of generating a negative value is significant, and we want to stick to using a Normal distribution, we must constrain the model in some way to prevent any negative sales volume figure from being generated. ModelRisk provides the function VoseXBounds( ) for this purpose. For example, =VoseNormal(10, 3, VoseXBounds(5, 17)) will truncate a Normal(10, 3) distribution to lie between 5 and 17, as shown below. =VoseNormal(10, 3, VoseXBounds(, 15)) would just cut off the right tail at 15.
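The effect of truncation can be reproduced outside a spreadsheet. The following Python sketch uses simple rejection sampling (redraw until the value falls inside the bounds). It is one possible way to truncate a distribution and is not a description of how VoseXBounds is implemented; `truncated_normal` is a hypothetical helper named for this example.

```python
import random
from statistics import NormalDist


def truncated_normal(mu, sigma, rng, lower=float("-inf"), upper=float("inf")):
    """Rejection sampling: redraw until the value lies inside [lower, upper]."""
    while True:
        x = rng.gauss(mu, sigma)
        if lower <= x <= upper:
            return x


rng = random.Random(42)

# Probability that an untruncated Normal(10, 3) produces a negative value:
# small, but not zero.
p_negative = NormalDist(10.0, 3.0).cdf(0.0)

# Counterpart of truncating Normal(10, 3) to lie between 5 and 17.
both_tails = [truncated_normal(10.0, 3.0, rng, lower=5.0, upper=17.0)
              for _ in range(10_000)]

# Counterpart of cutting off only the right tail at 15.
right_only = [truncated_normal(10.0, 3.0, rng, upper=15.0)
              for _ in range(10_000)]

print(p_negative)
print(min(both_tails), max(both_tails))
print(max(right_only))
```

Rejection sampling is easy to reason about and works well when the retained range holds most of the probability. If the bounds cut away most of the distribution, many draws are discarded and an inverse-CDF approach would be more efficient.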

Right-bounded distributions

You will notice from the table below that none of the distributions are bounded only on the right extreme. However, if you require a right bounded distribution for some reason, you need simply invert a left bounded distribution. For example: =-VoseWeibull(2,5) produces a left-skewed distribution with an unbounded minimum and a maximum of 0; =10-VoseGamma(2,1.5) produces a left-skewed distribution with an unbounded minimum and a maximum of 10, as shown in the Figures below:
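The inversion trick can be checked numerically. The sketch below uses Python's standard library and assumes that VoseWeibull(2,5) means shape 2 and scale 5, and that VoseGamma(2,1.5) means shape 2 and scale 1.5; these parameter conventions are assumptions and should be checked against the ModelRisk documentation.

```python
import random
import statistics

rng = random.Random(7)
n = 20_000

# Assumed mapping: VoseWeibull(2, 5) = shape 2, scale 5.
# random.weibullvariate takes (scale, shape), in that order.
neg_weibull = [-rng.weibullvariate(5.0, 2.0) for _ in range(n)]

# Assumed mapping: VoseGamma(2, 1.5) = shape 2, scale 1.5.
# random.gammavariate takes (shape, scale), in that order.
shifted_gamma = [10.0 - rng.gammavariate(2.0, 1.5) for _ in range(n)]

# The maxima never exceed 0 and 10 respectively; the minima are unbounded.
print(max(neg_weibull), max(shifted_gamma))

# Left skew pulls the mean below the median in both mirrored distributions.
print(statistics.mean(neg_weibull), statistics.median(neg_weibull))
print(statistics.mean(shifted_gamma), statistics.median(shifted_gamma))
```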

Parametric and non-parametric distributions

There is a very useful distinction to be made between model-based parametric and empirical non-parametric distributions. By 'model-based', we mean a distribution whose shape is born of the mathematics describing a conceptual probability model. By 'empirical' or 'non-parametric' we mean a distribution whose mathematics is defined by the shape that is required. For example, a Triangle distribution is defined by its minimum, mode and maximum values. The defining parameters are features of the graph shape.

Those distributions that fall under the 'empirical' or non-parametric class are intuitively easy to understand, extremely flexible and are therefore very useful. Model-based or parametric distributions require a greater knowledge of the underlying assumptions if they are to be used properly.
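To make the contrast concrete, here is a small Python sketch using the standard library. The Triangle values (2, 5, 11) and the arrival rate (0.5 per unit time) are assumed for illustration only. The Triangle is specified by reading three points off the desired shape, whereas the Exponential is specified by a rate that only has meaning if the underlying model (events occurring at a constant average rate) applies.

```python
import random
import statistics

rng = random.Random(3)
n = 50_000

# Non-parametric: the parameters are features of the shape itself.
minimum, mode, maximum = 2.0, 5.0, 11.0
# random.triangular takes (low, high, mode), in that order.
triangle = [rng.triangular(minimum, maximum, mode) for _ in range(n)]

# Parametric: an Exponential waiting time follows from a model in which
# events occur at a constant average rate (assumed here to be 0.5 per unit time).
rate = 0.5
waits = [rng.expovariate(rate) for _ in range(n)]

# Triangle mean is (minimum + mode + maximum) / 3; Exponential mean is 1 / rate.
print(statistics.mean(triangle), (minimum + mode + maximum) / 3)
print(statistics.mean(waits), 1 / rate)
```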

Parametric distributions should only be selected if at least one of the following applies:

The theory underpinning the distribution applies to the particular problem;
It is generally accepted that a particular distribution has proven to be very accurate for modeling a specific variable, even without any theory to support the observation;
The distribution matches the observed data very well indeed; or
One wishes to use a distribution that has a long tail extending beyond the observed minimum or maximum.

These issues are discussed in more detail in the optional module on fitting distributions to data.

Univariate and multivariate distributions

Univariate distributions describe a single parameter or variable and are used to model a parameter or variable that is not probabilistically linked to any other in the model. Multivariate distributions describe several parameters whose values are probabilistically linked in some way. In most cases, we create the probabilistic links via one of several correlation methods. However, there are a few specific multivariate distributions that have specific, very useful purposes and are therefore worth studying more.
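As an illustration of a probabilistic link, the following Python sketch builds pairs of standard Normal variables with a target correlation by mixing a shared component with an independent one. This is a minimal bivariate Normal construction using the standard library; it is only one of many ways to induce correlation and is not presented as the method ModelRisk uses. The helper names and the correlation of 0.8 are assumptions for the example.

```python
import math
import random


def correlated_normal_pairs(rho, n, rng):
    """Return n (x, y) pairs of standard Normals with correlation rho."""
    scale = math.sqrt(1.0 - rho * rho)
    pairs = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        z = rng.gauss(0.0, 1.0)  # independent of x
        pairs.append((x, rho * x + scale * z))
    return pairs


def pearson(pairs):
    """Sample Pearson correlation coefficient of a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(11)
linked = correlated_normal_pairs(0.8, 20_000, rng)
independent = correlated_normal_pairs(0.0, 20_000, rng)

print(pearson(linked))       # close to the target of 0.8
print(pearson(independent))  # close to 0
```

Each variable on its own is still a standard Normal; only the joint behaviour differs, which is exactly what a univariate description cannot capture.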

First- and second-order distributions

A probability or inter-individual variability distribution for which the parameters are precisely known is called a first-order distribution. A probability or inter-individual variability distribution for which there is some uncertainty about the parameters is called a second-order distribution. Thus, for example, Normal(100,10) is a first-order distribution, whereas Normal(m,s) is a second-order distribution if m and s are estimated and thus themselves carry uncertainty distributions. You cannot have a second-order distribution of uncertainty, because uncertainty about uncertainty collapses down to a single distribution of uncertainty, in the same way that a probability distribution of a probability distribution collapses down to a single probability distribution.
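The difference can be sketched in Python with the standard library. In this illustration the uncertainty distributions for m and s (a Normal(100, 2) for m and a Uniform(8, 12) for s) are assumed purely for the example; `candidate_distribution` is a hypothetical helper. Each draw of (m, s) gives one possible "true" variability distribution, and the collection of cumulative curves is the raw material for a spaghetti plot.

```python
import random
from statistics import NormalDist

rng = random.Random(5)

# First order: the parameters are known exactly.
first_order = NormalDist(100.0, 10.0)


def candidate_distribution(rng):
    """Draw one possible true distribution from assumed parameter uncertainty."""
    m = rng.gauss(100.0, 2.0)    # assumed uncertainty about the mean
    s = rng.uniform(8.0, 12.0)   # assumed uncertainty about the standard deviation
    return NormalDist(m, s)


# Second order: a family of candidate distributions, one per (m, s) draw.
candidates = [candidate_distribution(rng) for _ in range(20)]

# Evaluate each candidate's cumulative curve on a common grid from 70 to 130.
grid = [70.0 + 5.0 * i for i in range(13)]
curves = [[d.cdf(x) for x in grid] for d in candidates]

# The first-order case gives a single cumulative probability at x = 100;
# the second-order case gives a range of values.
at_100 = [d.cdf(100.0) for d in candidates]
print(first_order.cdf(100.0))  # 0.5
print(min(at_100), max(at_100))
```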

A plot of a first-order distribution is easy to understand. For example:

It is somewhat more difficult to illustrate a second-order distribution. One needs to account for the uncertainty about the distribution, which is usually done by using a number of lines to reflect possible true distributions (sometimes called candy-floss or spaghetti plots):

The second-order cumulative plot is generally much clearer than its corresponding density plot.

Table of distributions

The table below gives an overview of the various distributions available in ModelRisk, so that you can most easily focus on which ones might be most appropriate for your modeling needs. Follow the links for an in-depth explanation of each. We have used the most common name for each distribution. If you are interested in a particular distribution whose name does not appear here, try using the search facility because many distributions have several names, or are recognized as simply special cases of other, more common distributions.

Univariate Distributions

	*Continuous*	*Discrete*
*Unbounded*	Cauchy, Error function, Error, Extreme Value Max, Extreme Value Min, Generalized Extreme Value, GLogistic, Hyperbolic Secant, JohnsonU, KernelCU, Laplace, Logistic, Mixed Normal, Normal, Skew Normal, Slash, Student-t, Student3	Skellam
*Left or right bounded*	Bradford, Burr, Chi, Chi Squared, Dagum, Erlang, Exponential, F, Fatigue Life, Gamma, Inverse Gaussian, Lifetime2, Lifetime3, LifetimeExp, Levy, LogGamma, LogLaplace, LogLogistic, Lognormal, Lognormal (base B), Lognormal (base E), Maxwell, NCChiSq, NCF, Pareto (1st kind), Pareto (2nd kind), Pearson 5, Pearson 6, Rayleigh, Weibull, Weibull3	Beta-geometric, Beta-Negative Binomial, Burnt Finger Poisson, Delaporte, Geometric, HypergeoM, Inverse Hypergeometric, Logarithmic, Negative Binomial, Poisson, Poisson Uniform, Polya
*Left and right bounded*	Beta, BetaSubj, Beta4, Bradford, Cumulative ascending, Cumulative descending, GTU, Histogram, JohnsonB, Kumaraswamy, Kumaraswamy4, LogTriangle, LogUniform, Modified PERT, Ogive, PERT, PERTAlt, Reciprocal, Relative, Split Triangle, Triangle, TriangleAlt, Uniform	Bernoulli, Beta-binomial, Binomial, Discrete, Discrete Uniform, Hypergeometric, HypergeoM, HypergeoD, InvHyperGeo, Step Uniform

Italics indicate non-parametric distributions

Multivariate Distributions

	*Continuous*	*Discrete*
*Unbounded*	Multivariate Normal
*Left bounded*		Negative Multinomial 1, Negative Multinomial 2
*Left and right bounded*	Dirichlet	Multinomial, Multivariate Hypergeometric, Multivariate Inv Hypergeometric 1, Multivariate Inv Hypergeometric 2