Adding correlation in aggregate calculations | Vose Software

Adding correlation in aggregate calculations

See also: Aggregate distributions introduction, Modeling correlation introduction, Vose Stop Sum

The most common method to determine the aggregate distribution of a number of correlated random variables is to simulate each random variable in its own spreadsheet Cell, using one of the correlation methods described here, and then sum them up in another cell. For example, the following model adds together Poisson(100) random variables each following a Lognormal(2,5) distribution but where these variables are correlated through a Clayton(10) copula.

Example model AggregateCorrelatedSimulation for simulating the aggregate distribution of correlated random variables

Cell C7 determines the 99.99th percentile of the Poisson(100) distribution - a value of 139 - which is used as a guide to set the maximum number of rows in the table. The Clayton copula values are used as 'U-parameter' inputs into the Lognormal distributions, meaning that they make the Lognormal distributions return the percentile equating to the copula value: for example, cell D12 returns a value of 2.5539..., which is the percentile of the Lognormal(2,5) distribution.

A Clayton copula provides a particularly high level of correlation of the variables at their low end. For example, the following plot shows the level of correlation of two variables with a Clayton(10):

Correlation of two variables with a Clayton(10)

Thus, the model will produce a wider range for the sum than an uncorrelated set of variables but in particular will produce more extreme low end values from a probabilistic view (the correlated sum has about a 70% probability of taking a lower value than the uncorrelated sum). The use of one of the Archimedean copulas is an appropriate tool here because we are adding up a random number of these variables but the number being summed does not affect the copula's behaviour - all variables will be related to the same degree no matter how many are being summed. The effect of the correlation is readily observed by repeating the model without any correlation. The following plot compares the two cumulative distributions:

Comparison of correlated and uncorrelated sums

Complete correlation

In the situation where the source of the randomness or uncertainty of the distribution associated with a random variable is the same for the whole group you are adding up, then there is really just one random variable. For example, imagine that a railway network company that must purchase 127,000 sleepers (the beams under the rails) next year. The sleepers will be made of wood, but the price is uncertain because the cost of timber may fluctuate. It is estimated that the cost will be $PERT(22.1, 22.7, 33.4) each. If all the timber is being purchased at the same time, it might be reasonable to believe that all the sleepers will have the same price. In that case, the total cost can be modelled simply: =127000*VosePert(22.1 ,22.7, 33.4).

Using covariance

If there are a large number n of random variables Xi (i=1 to n) being summed and the uncertainty of the sum is not dominated by a few of these distributions, the sum is approximately Normally distributed according to the Central Limit Theorem, as follows:

The equation says that the aggregate sum takes a Normal distribution with a mean equal to the sum of the means for the individual distributions being added together. It also says that the variance (the square of the standard deviation in the formula) of the Normal distribution is equal to the square of the covariance terms between each variable. The covariance terms are calculated as follows:


 where  ,  are the standard deviations of variables i and j,  is the correlation coefficient, and E[·] means 'the expected value of' the thing in the brackets. 

If we have data sets for the variables being modelled, Excel can calculate the covariance and correlation coefficients using the functions COVAR( ) and CORREL( ) respectively.

Correlating partial sums

We will sometimes be in the situation of having two or more sums of random variables that have some correlation relationship between them. For example, imagine that you are a hospital trying to forecast the number of patient-days you will need to provide next year, and you split the patients into three groups: surgery, maternity and chronic illness (e.g. cancer). Let's say that the distribution of days that a person will spend in hospital under each category is independent of the other categories, but the number of individuals being treated is correlated with the number of people in the catchment area, which is uncertain because hospital catchments are being redefined in your area.

There are plenty of ways to model this problem, but perhaps the most convenient is to start with the uncertainty of size of the number of people in the catchment area and derive what the demand will be for each type of care as a consequence, then make a projection of what the total patient-days might be as a result, as shown in the following model:

Example model Hospital_bed_days - forecasting the number of patient-days in a hospital

In this model the uncertainty about the catchment area population is modelled with a PERT distribution, the bed-days for each category of health care modelled by Lognormal distributions with different parameters and the number of patients in each category modelled with a Poisson distribution with mean equal to (population size-000s)*(expected cases/year/1000 people).

In the example model, three different methods are shown for simulating the aggregate distribution in each class: pure Monte Carlo for Surgery; FFT for Maternity, and Panjer's recursive method for Chronic. Any of the three could be used to model each category.

You'll notice that the Monte Carlo method is slightly different than the others in that I've used VosePoisson(...) instead of VosePoissonObject(...) because the VoseAggregateMC function requires a numerical input for how many variables to sum (allowing the flexibility that this could be a calculated value) whereas the FFT and Panjer methods perform calculations on the Poisson distribution and therefore need it to be defined as an Object.

Note that the same model could be achieved with other Monte Carlo simulation software by making randomly varying arrays for each category, the technique illustrated in the introduction topic, but the numbers in his problem would require very long arrays.

Using the same basic problem, let us now consider the situation where the frequency distribution for each category is correlated in some fashion, as we had before, but not because of their direct relationship to any observable variable. Imagine that the population size is known, but we want to model the effects of increased pollution in the area, so we want the Surgery and Chronic Poisson variables be positively correlated with each other, but negatively correlated with Maternity.

The following model uses a Normal copula to correlate the Poisson frequency distributions.

Example model Hospital_bed_days_2 for using a Normal copula to correlate the Poisson frequency distributions

There is in fact a FFT method to achieve this correlation between frequency distributions, but the algorithm is not particularly stable.

Turning now to the severity (length of hospital stay) distributions, we may wish to correlate the length of stay for all individuals in a certain category. In the above model, this can be achieved by creating a separate scaling variable for each Lognormal distribution which has a mean of 1, for example a Gamma() distribution which has the required mean and a standard deviation of h. Note that this means that the Lognormal distributions will no longer have the standard deviations they were given before.

Example model Hospital_bed_days_3 - a separate scaling variable for each Lognormal distribution

Finally, let's consider how to correlate the aggregate distributions themselves. We can construct the distribution of the number of bed days required for each type of health care using either the FFT or Panjer method. Since the distribution is constructed rather than simulated we can easily correlate the aggregate distributions by controlling how they are sampled. In the following example the model uses the FFT method to construct the aggregate variables and correlates them together using a Frank copula.

Example model Hospital_bed_days_4 for using the FFT method to construct the aggregate variables

Read on: How many random variables add up to a fixed total