 Rank order correlation and correlation matrices | Vose Software

# Rank order correlation and correlation matrices

### Introduction

Distributions in a model will often have to be correlated to ensure that only meaningful scenarios are generated for each iteration of the model.

For example, mortgage and interest rates could both play a part in a model, both represented by distributions that describe their uncertainty. However, the sampling from these distributions must be constrained because a low mortgage rate will only occur with a low interest rate, etc. In other words, the distributions need to be correlated.

The programming technique used to generate rank order correlated input distributions was invented by Iman and Conover (1982). The mathematical details of the technique are too involved to reproduce here but it is worth noting the benefits that their technique offers:

• The input distributions are correlated according to the rank of the values generated for each distribution. This means that all correlated distributions preserve their shape and the properties of the sampling method being used (e.g. Monte Carlo or Latin Hypercube sampling (LHS)).

• The technique is therefore also distribution-free, meaning that it can be equally applied to any type of distribution.

• The technique is based on defining a correlation matrix, which means that any number of distributions can be correlated together.

The Iman and Conover correlation technique is a two-step process. First, a set of n 'scores' is generated for each distribution to be correlated, where n is the number of iterations to be run. These 'scores' are then rearranged together so that their ranks produce the desired rank correlation. In the second step, the distributions to be correlated are each sampled n times and the sampled values are ranked. The ranks are then matched to the 'score' ranks from the first step to produce the sets of values that will be used for each iteration of the simulation.

At first sight, this process seems over-complicated. Why not just correlate the ranks between the values sampled from each distribution rather than go through the intermediary step of using 'scores'? Well, the van der Waerden scores that are used are based on the inverse function of the Normal distribution. Iman and Conover found that these scores produced 'natural-looking' correlations: variables correlated using van der Waerden scores produced elliptical-shaped scatter plots while using the ranking of the variables directly produced scatter patterns that were pinched in the middle and fanned out at each end.
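The two-step process can be sketched for a pair of variables. The following is a simplified illustration of the Iman and Conover idea only, not their full algorithm (which uses a Cholesky factorisation of the whole target correlation matrix); numpy is assumed, and the function names are our own:

```python
import numpy as np
from statistics import NormalDist

def van_der_waerden_scores(n):
    # Van der Waerden scores: inverse Normal CDF of i/(n+1) for i = 1..n
    return np.array([NormalDist().inv_cdf(i / (n + 1)) for i in range(1, n + 1)])

def spearman(a, b):
    # Rank correlation = Pearson correlation applied to the ranks (no ties assumed)
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    return np.corrcoef(ra, rb)[0, 1]

def iman_conover_pair(x, y, target_rho, seed=0):
    """Reorder the samples x and y so that their rank correlation is close
    to target_rho, without changing either marginal distribution."""
    rng = np.random.default_rng(seed)
    n = len(x)
    scores = van_der_waerden_scores(n)
    # Step 1: build two score columns whose correlation is roughly target_rho
    s1 = rng.permutation(scores)
    s2 = target_rho * s1 + np.sqrt(1 - target_rho**2) * rng.permutation(scores)
    # Step 2: reorder each sample so its ranks match the corresponding score ranks
    x_out = np.sort(x)[np.argsort(np.argsort(s1))]
    y_out = np.sort(y)[np.argsort(np.argsort(s2))]
    return x_out, y_out

rng = np.random.default_rng(1)
x = rng.lognormal(size=2000)   # any marginal works: the method is distribution-free
y = rng.uniform(size=2000)
xc, yc = iman_conover_pair(x, y, 0.8)
print(round(spearman(xc, yc), 2))   # close to 0.8; marginals of x and y unchanged
```

Note that the output arrays contain exactly the same values as the inputs, only in a different order: this is why the technique preserves the shape of each distribution.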

Most risk analysis software products now offer a facility to correlate probability distributions within a risk analysis model using rank order correlation. The technique is very simple to use, requiring only that the analyst nominate the distributions to be correlated and a correlation value between -1 and +1. This coefficient is Spearman's rank order correlation coefficient:

• A correlation value of -1 forces the two probability distributions to be exactly negatively correlated, i.e. the Xth percentile value of one distribution will appear in the same iteration as the (100-X)th percentile value of the other distribution.

• A correlation value of +1 forces the two probability distributions to be exactly positively correlated, i.e. the Xth percentile value of one distribution will appear in the same iteration as the Xth percentile value of the other distribution. In practice, one rarely uses correlation values of -1 and +1.

• Negative correlation values between 0 and -1 produce varying degrees of inverse correlation, i.e. a low value from one distribution will correspond to a high value in the other distribution and vice versa. The closer the correlation is to zero, the looser the relationship between the two distributions.

• Positive correlation values between 0 and +1 produce varying degrees of positive correlation, i.e. a low value from one distribution will correspond to a low value in the other distribution and a high value from one to a high value from the other.

• A correlation value of 0 means that there is no relationship between the two distributions.
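The extreme cases above can be demonstrated directly: pairing two samples percentile-for-percentile gives a rank correlation of exactly +1, and pairing them in opposite order gives exactly -1. A minimal sketch, assuming numpy:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(10, 2, size=1000)   # the marginal shapes are irrelevant
b = rng.gamma(3.0, size=1000)

# Rank correlation +1: pair the k-th smallest of a with the k-th smallest of b
pos_a, pos_b = np.sort(a), np.sort(b)
# Rank correlation -1: pair the k-th smallest of a with the k-th largest of b
neg_a, neg_b = np.sort(a), np.sort(b)[::-1]

def spearman(u, v):
    # Pearson correlation of the ranks (no ties assumed)
    ru = np.argsort(np.argsort(u))
    rv = np.argsort(np.argsort(v))
    return np.corrcoef(ru, rv)[0, 1]

print(spearman(pos_a, pos_b))   # 1.0 (up to floating-point precision)
print(spearman(neg_a, neg_b))   # -1.0 (up to floating-point precision)
```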

### What is rank order correlation?

The rank correlation coefficient

The rank correlation coefficient, also known as "Spearman's rho", is defined as:

$$\rho = \frac{\sum_{i=1}^{n}(u_i - \bar{u})(v_i - \bar{v})}{\sqrt{\sum_{i=1}^{n}(u_i - \bar{u})^2 \sum_{i=1}^{n}(v_i - \bar{v})^2}}$$

where $\bar{u} = \bar{v} = \frac{n+1}{2}$ and where $u_i$, $v_i$ are the ranks of the ith observation in samples 1 and 2 respectively.

The correlation coefficient is symmetric, i.e. only the difference between ranks is important and not whether distribution A is being correlated with distribution B or the other way round.

The rank order correlation coefficient uses the ranking of the data, i.e. what position (rank) each data point takes in an ordered list from the minimum to the maximum value, rather than the actual data values themselves. It is therefore independent of the distribution shapes of the data sets and allows the integrity of the input distributions to be maintained. Spearman's rho is calculated as:

$$\rho = 1 - \frac{6\sum D_R^2}{n(n^2 - 1)}$$

where n is the number of data pairs and $D_R$ is the difference in the ranks between the data values in the same pair.

This is in fact a shortcut formula for the case where there are few or no ties; the exact formula is the rank-based definition given above.
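The two ways of computing Spearman's rho can be checked against each other: the exact definition (Pearson correlation applied to the ranks) and the no-ties shortcut $1 - 6\sum D_R^2 / (n(n^2-1))$ agree exactly when there are no tied values. A sketch, assuming numpy:

```python
import numpy as np

def ranks(values):
    # 1-based ranks; assumes no tied values
    return np.argsort(np.argsort(values)) + 1

def spearman_exact(x, y):
    # Exact definition: Pearson correlation coefficient of the ranks
    u, v = ranks(x), ranks(y)
    return np.corrcoef(u, v)[0, 1]

def spearman_shortcut(x, y):
    # Shortcut: rho = 1 - 6 * sum(D_R^2) / (n * (n^2 - 1)), valid with no ties
    d = (ranks(x) - ranks(y)).astype(float)
    n = len(x)
    return 1 - 6 * np.sum(d**2) / (n * (n**2 - 1))

rng = np.random.default_rng(42)
x = rng.normal(size=500)
y = x + rng.normal(size=500)   # positively related, so rho will be positive
print(np.isclose(spearman_exact(x, y), spearman_shortcut(x, y)))   # True
```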

Rank order correlation provides a very quick and easy to use method of modeling correlation between probability distributions. The technique is 'distribution independent', i.e. it has no effect on the shape of the correlated distributions. One is therefore guaranteed that the distributions used to model the correlated variables will still be replicated.

The primary disadvantage of rank order correlation is the difficulty in selecting the appropriate correlation coefficient. If one is simply seeking to reproduce a correlation that has been observed in previous data, the correlation coefficient can be calculated directly from the data using the formula in the previous section (Excel does not have a simple function for this, although the function CORREL( ) is usually a good approximation for roughly symmetric distributions). The difficulty appears when attempting to model an expert's opinion of the degree of correlation between distributions. A rank order correlation lacks intuitive appeal and it is therefore very difficult for the expert to decide which level of correlation best represents her opinion.

This difficulty is compounded by the fact that the same degree of correlation will look quite different on a scatter plot for different distribution types, e.g. two Lognormals with a 0.7 correlation will produce a different scatter pattern to two Uniform distributions with the same correlation. Determining the appropriate correlation coefficient is more difficult still if the two distributions do not share the same geometry, e.g. one is Normal and the other Uniform or one is a negatively skewed Triangle and the other a positively skewed Triangle. In such cases, the scatter plot will often show quite surprising results.

To help an expert estimate the degree of correlation to apply between two variables, it is useful to produce plots of various levels of correlation and let the expert see which pattern (if any) best corresponds to his or her view. We generally plot correlations of 0.5 upwards (or -0.5 downwards), as lesser correlations are not very strong. The figure below gives an example.

A second disadvantage of rank order correlation is that it ignores any causal relationship between the two distributions. It is usually more logical to think of a dependency relationship along the lines described in the other correlation modeling methods, and therefore easier to estimate correlation relationships that way.

Despite the inherent disadvantages of rank order correlation, its ease of use and speed make it a very practical technique. In summary, the following guidelines for using rank order correlation will help ensure that you avoid problems:

• Use rank order correlation to model dependencies that only have a small impact on your model's results. If you are unsure of its impact, run two simulations: one with the selected correlation coefficient and one with zero correlation. If there is a substantial difference between the model's final results, you should choose one of the other more precise techniques;

• Wherever possible, restrict its use to pairs of similarly shaped distributions;

• If different shaped distributions are being correlated, preview the correlation using a scatter plot before accepting it into the model;

• Use charts similar to those above to help the expert determine the appropriate level of correlation; and

• Avoid modeling a correlation where there is neither a logical reason nor evidence for its existence. This last point is a contentious issue, since many would argue that it is safer to assume a 100% positive or negative correlation (whichever increases the spread of the model output) rather than zero. In our view, if there is neither a logical reason to believe that the variables are related in some way, nor any statistical evidence to suggest that they are, one would be unjustified in assuming high levels of correlation. On the other hand, running the model with the correlation levels that maximise the spread of the output, and again with the levels that minimise it, does provide us with bounds within which we know the true distribution must lie (assuming the model is otherwise correct).

This technique is sometimes used in project risk analysis, for example, where for the sake of reassurance one would like to see the most widely spread output feasible given the available data and expert estimates. Using such pessimistic correlation coefficients may be helpful because it compensates, in a general way, for the tendency we all have to be over-confident about our estimates (of the time to complete the project's tasks, for example), which artificially narrows the distribution of possible outcomes for model outputs like the completion date. It also quietly recognizes that there are elements running through a whole project, such as management competence, team efficiency, and the quality of the initial planning: factors that it would be uncomfortable to model explicitly.

### Correlating several variables together - correlation matrices

Correlation matrices are an extension of the rank order correlation coefficient method above. A correlation matrix enables the analyst to correlate several probability distributions together. The rank order correlation coefficients are input into the cross-referenced positions in the matrix. Each distribution must clearly have a correlation of 1.0 with itself, so the elements on the diagonal from top left to bottom right are all 1.0. Furthermore, because the formula for the rank order correlation coefficient is symmetric, as explained above, the matrix elements are also symmetric about this diagonal. The table below is an example of a correlation matrix.

There are some restrictions on the correlation coefficients that may be used within the matrix. For example, if A and B are highly positively correlated and B and C are also highly positively correlated, A and C cannot be highly negatively correlated. For the mathematically minded, the restriction is that the matrix can have no negative eigenvalues, i.e. it must be positive semi-definite. The ModelRisk function VoseValidCorrmat tests whether a matrix is valid and, if not, returns the nearest valid matrix.
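The eigenvalue restriction can be checked directly. The sketch below tests validity and applies a crude repair by clipping negative eigenvalues to zero and rescaling the diagonal back to 1; this only illustrates the idea, and is not the algorithm used by VoseValidCorrmat, which computes the nearest valid matrix more carefully. Assumes numpy; the function names are our own:

```python
import numpy as np

def is_valid_corrmat(m):
    # A correlation matrix must be symmetric, have a unit diagonal,
    # and have no negative eigenvalues (positive semi-definite)
    m = np.asarray(m, dtype=float)
    return (np.allclose(m, m.T)
            and np.allclose(np.diag(m), 1.0)
            and np.min(np.linalg.eigvalsh(m)) >= -1e-10)

def repair_corrmat(m):
    """Crude repair: clip negative eigenvalues to zero, then rescale
    so the diagonal is 1 again. Illustrative only."""
    m = np.asarray(m, dtype=float)
    w, v = np.linalg.eigh(m)
    w = np.clip(w, 0.0, None)          # remove the negative eigenvalues
    fixed = v @ np.diag(w) @ v.T
    d = np.sqrt(np.diag(fixed))
    return fixed / np.outer(d, d)      # restore the unit diagonal

# Highly positive A-B and B-C correlations are inconsistent with a
# highly negative A-C correlation:
bad = np.array([[ 1.0, 0.9, -0.9],
                [ 0.9, 1.0,  0.9],
                [-0.9, 0.9,  1.0]])
print(is_valid_corrmat(bad))                  # False
print(is_valid_corrmat(repair_corrmat(bad)))  # True
```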

Whilst correlation matrices suffer from the same drawbacks as those outlined for simple rank order correlation, they are nonetheless an excellent way of producing a complex multiple correlation that would be laborious and generally very difficult to achieve otherwise.

ModelRisk provides the function VoseCorrMatrix, which calculates the rank order correlation matrix for a set of observations. With a small data set there will be some uncertainty about whether the calculated correlation coefficients in the matrix are truly representative of the underlying reality. The ModelRisk function VoseCorrMatrixU simulates values from the joint uncertainty distribution of the matrix.
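Computing a rank order correlation matrix from a set of observations is straightforward: rank each column, then take the ordinary Pearson correlation matrix of the ranks. A generic sketch of this calculation (not ModelRisk's API), assuming numpy and no tied values:

```python
import numpy as np

def rank_corr_matrix(data):
    """Spearman rank correlation matrix of an (n_obs, n_vars) data set:
    replace each column by its ranks, then take the ordinary Pearson
    correlation matrix of the ranks. Assumes no ties within a column."""
    data = np.asarray(data, dtype=float)
    ranked = np.argsort(np.argsort(data, axis=0), axis=0)
    return np.corrcoef(ranked, rowvar=False)

rng = np.random.default_rng(7)
n = 300
a = rng.normal(size=n)
b = a + rng.normal(size=n)   # positively related to a
c = rng.uniform(size=n)      # unrelated to both

m = rank_corr_matrix(np.column_stack([a, b, c]))
print(m.shape)               # (3, 3): unit diagonal, symmetric off-diagonals
print(m[0, 1] > 0.4)         # True: a and b are strongly rank correlated
```

With only 300 observations the off-diagonal coefficients carry sampling error, which is exactly the uncertainty that VoseCorrMatrixU is designed to represent.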