Check the quality of your data | Vose Software

See also: Fitting distributions to data, Fitting in ModelRisk, Analyzing and using data introduction

The data you use to quantify parameters in your model may come from a variety of sources: scientific experiments; surveys; computer databases; literature searches; even computer simulations. Before committing to using a data set, you should convince yourself of the quality of the data. We offer the checklist below to help you:

Is the past relevant?

If the data you are using come from historic experience, you implicitly assume that the world in which the data were observed is the same as it is now. Are there any tests that you can perform to see whether such an assumption is correct? If you know that things have changed, perhaps you could estimate some correction factor. For example, maybe you are running a railway network and are looking at historic delay events (when trains are held up). If there will be a strong increase in the number of services you offer next year, or if you plan to work on a main line for a significant period, historic rates would need to be modified. Sometimes there is a proportional relationship that you could use, but often the relationship is non-linear, in which case you may find that historic data help you estimate it.
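As a rough illustration of such a correction factor, the sketch below scales a historic count of delay events to a planned service level. The function name, the power-law form used for the non-linear case, and all the numbers are assumptions made for illustration; in practice the form of the relationship would have to be estimated from the historic data.

```python
def project_delay_events(historic_events, historic_services,
                         planned_services, exponent=1.0):
    """Scale a historic event count to a new service level.

    exponent=1.0 gives the simple proportional correction.
    Any other exponent is a hypothetical power-law stand-in for a
    non-linear relationship that would need to be fitted to data.
    """
    if historic_services <= 0:
        raise ValueError("historic_services must be positive")
    correction = (planned_services / historic_services) ** exponent
    return historic_events * correction


# Hypothetical figures: 400 delay events over 20,000 services last year,
# with 25,000 services planned for next year.
proportional = project_delay_events(400, 20_000, 25_000)
non_linear = project_delay_events(400, 20_000, 25_000, exponent=1.5)

print(proportional)  # 500.0
print(non_linear)    # larger than 500 because the exponent exceeds 1
```

With an exponent above 1 the projected number of delays grows faster than the number of services, which is one way congestion effects might show up.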

Are the data a representative and random sample?

'Representative' and 'random' are separate properties, and both need to be thought about. Samples can be representative, but also deliberately non-random. For example, if your country has 10 departments, you might deliberately take a survey of 100 people in each department, irrespective of its size. Samples can also be random but not representative: for example, a random survey within one department may well not be representative of the whole population.
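When equal-sized samples are taken from departments of very different sizes, a pooled average over-represents the small departments. A minimal sketch of one common remedy, weighting each department's sample mean by its population, is shown below. The department sizes and survey values are invented, and the function name is hypothetical.

```python
def population_weighted_mean(strata):
    """Estimate a population mean from per-department samples.

    strata is a list of (population_size, sample_values) pairs.
    Each department's sample mean is weighted by its population.
    """
    total_population = sum(pop for pop, _ in strata)
    weighted_sum = sum(pop * (sum(vals) / len(vals)) for pop, vals in strata)
    return weighted_sum / total_population


# Invented example: a large and a small department, same sample size.
strata = [
    (9_000, [8.0, 10.0, 12.0]),   # large department, sample mean 10
    (1_000, [18.0, 20.0, 22.0]),  # small department, sample mean 20
]

all_values = [v for _, vals in strata for v in vals]
pooled = sum(all_values) / len(all_values)
weighted = population_weighted_mean(strata)

print(pooled)    # 15.0, treats both departments as equally large
print(weighted)  # 11.0, reflects the 9:1 population split
```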

Are the data relevant to the current problem?

A problem we often come across is that people have worked so hard to collect a set of data that they are quite determined to use them. You need to be objective about what model can be constructed to inform decision-makers, and that model may not use the hard-won data. In some situations, there can be a lot of pressure to use certain types of data to validate, for example, the existence of a research program.

Is the parameter independent of others in the model?

If you are using data to estimate some model parameter, you will need to ask whether that parameter is independent of others in your model, and whether it would be possible to test that independence by analyzing, or collecting further, data.
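One simple way to probe independence, assuming you have paired observations of two quantities, is a rank correlation with a permutation test. The sketch below uses only the Python standard library. The data are invented, and a low correlation would not by itself prove independence, since the test only detects monotonic association.

```python
import random


def ranks(values):
    """Return 1-based ranks, averaging the ranks of tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def permutation_p_value(x, y, trials=2000, seed=0):
    """Two-sided p-value for 'no association', by shuffling y."""
    rng = random.Random(seed)
    observed = abs(spearman(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if abs(spearman(x, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)


# Invented paired observations of two model inputs.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 1, 4, 3, 6, 5, 8, 7]

rho = spearman(x, y)
p_value = permutation_p_value(x, y)
print(round(rho, 3), p_value)
```

A small p-value here suggests the two inputs should not be sampled independently in the model.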

What quality checks can you do?

This is a big question, but the answers are often fairly apparent once you pose it. Can you think of ways in which the data could be inaccurate? Rounding errors, and biases towards numbers like 1 to 10, 20, 30, etc. rather than, say, 17, can be important.
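A bias towards round numbers can sometimes be detected by looking at the final digit of each reported value. The sketch below computes a chi-square statistic against the assumption that final digits are equally likely, which is only reasonable when the values span a range much wider than 10. The data are constructed for illustration, and the constant is the standard 5% critical value for 9 degrees of freedom.

```python
from collections import Counter

# 5% critical value of the chi-square distribution with 9 degrees of freedom.
CHI2_CRITICAL_DF9_5PCT = 16.919


def last_digit_chi_square(values):
    """Chi-square statistic for uniformity of the final digit.

    Values are truncated to integers first. The statistic is compared
    against an expected count of len(values) / 10 for each digit.
    """
    digits = [abs(int(v)) % 10 for v in values]
    counts = Counter(digits)
    expected = len(digits) / 10
    return sum((counts.get(d, 0) - expected) ** 2 / expected
               for d in range(10))


# Constructed example: 80 reports heaped on multiples of 10,
# plus 20 reports spread over 11..30.
heaped = [10, 20, 30, 40, 50] * 16 + list(range(11, 31))
even = list(range(100))  # every final digit appears exactly 10 times

heaped_stat = last_digit_chi_square(heaped)
even_stat = last_digit_chi_square(even)

print(heaped_stat > CHI2_CRITICAL_DF9_5PCT)  # True: strong digit preference
print(even_stat > CHI2_CRITICAL_DF9_5PCT)    # False
```

The chi-square approximation is unreliable when the expected count per digit is small, so this check suits data sets of at least several dozen values.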

Are there incentives for the correct or incorrect reporting of data?

In our experience, this is the most important question because the problem is quite pervasive and the most difficult to recognise. We offer some examples from our experience, which might help you develop the required cynicism:

  • National statistics of oil imports that showed a 24000 ton import in one cargo of an oil used in electronics, because the ship's captain had some 120 different oil categories to choose from when making his customs declaration;

  • Companies (and researchers) that submitted to a regulator (and for publication) only data from experiments that supported their claim, and not those that did not;

  • A company that claimed difficulty in finding/collating last year's (really bad) sales data for its operations until the purchaser was too committed to back out;

  • A fellow risk analyst who quoted parts of papers (even parts of sentences) that supported his client's position, and produced bogus but highly complicated models, then presented the results of those models as if they were fact rather than the result of (biased) conjecture;

  • Not our own experience: employees in a nuclear fuels reprocessing company copy/pasted columns of measurements rather than go through the tedious practice of measuring each reprocessed batch;

  • A diamond mining company that got low-paid workers to go deep into the mine shaft and bring back rock samples, to help better correlate the geology with models. The workers broke off rock from near the entrance rather than bother making the long trip;

  • Students doing surveys being paid for each completed form. They get friends round, have some beers, and fill out a few dozen each. Much better (for them) than being in the rain, and they earn more money.
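Some of these problems leave traces that a simple screen can pick up. As one illustration, copied columns of measurements can be flagged by looking for columns whose values are identical. The sketch below is a minimal version of that idea; the column names and values are invented, and an exact match is only a prompt for further questions, not proof of copying.

```python
def find_duplicate_columns(columns):
    """Return (first_name, later_name) pairs of columns with identical values.

    columns maps a column name to its list of measurements.
    """
    seen = {}
    duplicates = []
    for name, values in columns.items():
        key = tuple(values)
        if key in seen:
            duplicates.append((seen[key], name))
        else:
            seen[key] = name
    return duplicates


# Invented batch measurements: batch_3 repeats batch_1 exactly.
columns = {
    "batch_1": [1.02, 0.98, 1.05],
    "batch_2": [1.01, 0.97, 1.03],
    "batch_3": [1.02, 0.98, 1.05],
}

suspect_pairs = find_duplicate_columns(columns)
print(suspect_pairs)  # [('batch_1', 'batch_3')]
```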

Systematic and non-systematic errors

The collected data will at times have measurement errors that add another level of uncertainty. In most scientific data collection, the random error is well understood and can be quantified, usually by simply repeating the same measurement and reviewing the distribution of results. Such random errors are described as non-systematic. Systematic errors, on the other hand, are those in which measurements deviate from the true value in a consistent direction, either over- or under-estimating it. This type of error is often very difficult to identify and quantify. You may be able to estimate the degree of suspected systematic measurement error by comparing with measurements using another technique that is known (or believed) to have little or no systematic error.
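The two kinds of error can be estimated in different ways, as the sketch below illustrates: the spread of repeated measurements of the same item gives the random error, and the mean difference between paired measurements from the technique in question and a reference technique gives an estimate of the systematic error. All values and function names are invented, and the reference technique is assumed to have negligible bias.

```python
import statistics


def random_error(repeated_measurements):
    """Sample standard deviation of repeated measurements of one item."""
    return statistics.stdev(repeated_measurements)


def systematic_bias(measured, reference):
    """Mean paired difference and its standard error.

    measured and reference are measurements of the same items by the
    technique under test and by the (assumed unbiased) reference.
    """
    differences = [m - r for m, r in zip(measured, reference)]
    mean_difference = statistics.mean(differences)
    standard_error = statistics.stdev(differences) / len(differences) ** 0.5
    return mean_difference, standard_error


# Invented data: five repeats of one item, then four items measured both ways.
repeats = [10.1, 9.9, 10.0, 10.2, 9.8]
measured = [10.5, 12.4, 8.6, 11.5]
reference = [10.0, 12.0, 8.0, 11.0]

spread = random_error(repeats)
bias, bias_se = systematic_bias(measured, reference)

print(round(spread, 3))                  # 0.158
print(round(bias, 3), round(bias_se, 3))  # 0.5 0.041
```

A mean difference that is large relative to its standard error, as in this invented example, points to a systematic over-estimate by the technique under test.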