Generating your own distribution when you know the cdf, pdf or pmf

Situation:

You wish to use a parametric probability distribution that is not provided by ModelRisk, and you know:

The cumulative distribution function (continuous variable);

The probability density function (continuous variable); or

The probability mass function (discrete variable).

This section describes the technique for each situation.

Known cumulative distribution function (cdf)

This method applies when you know the cdf of a continuous probability distribution. The algebraic equation of the cdf can often be inverted to make x the subject of the equation. For example, the cdf of the Exponential probability distribution is:

(1)

Where b is the mean of the Exponential distribution. Inverting equation 1 to make x its subject gives:

(2)

A random sample from any continuous probability distribution has equal probability of lying in any equally sized range of F(x) between zero and one. For example, the variable X has a 10% probability of producing a value between x1 and x2 (x2>x1) where F(x2)-F(x1) = 0.1. Looking at it the other way round, F(x) can be thought of as being a Uniform(0,1) random variable. Thus, we can write Equation 2 as an EXCEL/ModelRisk formula to generate values from an Exponential distribution with a mean of 23 as follows:

= -23*LN(1-VoseUniform(0,1)) (3)

or [can you see why?]:

= -23*LN(VoseUniform(0,1)) (4)

This is exactly equivalent to writing:

=VoseExpon(23)

in a cell. [Try it if you like]. Since we have used one ModelRisk distribution to generate another, if we force the Uniform distribution to sample from its lower end, Equation 3 (but not Equation 4) will generate values from the lower end of the Exponential distribution.

Disadvantage

The method hinges on being able to perform the algebra that inverts the cdf. If that proves too difficult, it may be easier to construct the cdf with a Cumulative distribution.

Known probability density function (pdf)

We may sometimes start with the pdf for a variable, determine the cdf by integration, and then use the first method although this generally requires good maths skills. Another reason one might like to determine the cdf could be an interest in the probability of the variable being below, between or above certain values. Integration of the pdf will give the cdf which can give you these probabilities. The example below shows you how to determine a cdf, starting with a pdf, in this case of a sine curve distribution.

Imagine that you want to design a distribution that followed the shape of a sine curve from 0 to a, where a is an input to the distribution. This distribution shape is shown in Figure 1.

Figure 1: The sine curve distribution we wish to create in our example

The probability density function f(x) is given by:

where b is a constant to be determined so that the area under the curve equals one, as required for a probability distribution.

The cdf F(x) is then (0<x<a):

(1)

For the area under the curve to equal one, b must be determined such that F(a) = 1, i.e.:

Therefore,

and, from Equation 1, F(x) becomes:

We now need to find the inverse function to F(x). So, rearranging the equation above for x:

To generate this distribution, we put a Uniform(0,1) distribution in cell A1 (say), the value for a in cell B1 (say) and, in the cell that generates values of x, we write:

= B1/PI() * ACOS (1-2*A1) (The Excel function ACOS(y) returns cos-1(y) )

Known probability mass function (pmf)

If you know the pmf of a distribution, it is a simple matter (in principle) to create the distribution in ModelRisk. The techniques above are not applicable because a cdf for a discrete variable is just the sum of the discrete probabilities, and it is thus not possible to construct an inverse transformation.

The method requires that you construct two arrays in Excel:

The first array is a set of values that the variable might take, e.g. {0,1,...99,100}

The second array is a set of probabilities using the pmf calculated for each of these values {p(0),p(1),...p(99),p(100)}

These two arrays are then used to construct the required distribution using ModelRisk's Discrete distribution:

=VoseDiscrete({x}, {p(x})

Of course, this method can become cumbersome if the {x} array is very large. In which case:

Make the {x} list with spacing larger than 1, e.g. {0,5,10,..495,500};
Calculate the associated probabilities {p(0),p(5),...p(495),p(500)}
Construct a VoseRelative(min,max,{x},{p(x)}) distribution, e.g. VoseRelative(-0.5,500.5,{5,10,...490,495},{{p(5),p(10),...p(490),p(495)})
Wrap a ROUND function around the Relative; =ROUND(VoseRelative(...),0) to return discrete value
Note that using a minimum = -0.5 (min-0.5) and a maximum 500.5 (max=0.5) will allocate a more accurate probability to the end values.