A test with 30% reliability

These days, we have all heard that the Spanish government bought, from a certain Chinese company, \(640\,000\) rapid coronavirus detection tests whose reliability was only \(30\%\), and so had to return them.

A \(30\%\) reliability? What does that mean? Because, of course, we can all think of a very simple, very cheap, and even reusable test that you don’t have to buy in China and which, in principle, looks like it will have a reliability of \(50\%\): flip a coin. The only difficulty of the method is finding a coin on which it is easy to tell heads from tails; with current coins, that is not always possible.

However much the Chinese government has said that this company was not licensed to sell these tests, that their quality was probably poor, and that we were deceived (the British too: misery loves company), there must be something meaningful in a test with \(30\%\) reliability, so that it is not worse than pure chance. Anything else would be inconceivable.

Moreover, one tends to think that if a test that answers “yes” or “no” is right \(30\%\) of the time, it would be enough to take the opposite of whatever the test answers, and we would be right \(70\%\) of the time. A miracle! (In fact, since tests with a reliability of \(1\%\) must be so bad that they are practically given away, the same method would make us right \(99\%\) of the time almost for free.)

The reader, if it wasn’t clear before, will by now be convinced that things can’t be that silly, and that we need to reason better. The key is that, when talking about percentages, you have to know what they are percentages of, i.e. what is in the numerator and what is in the denominator. What are these in the case of the \(30\%\) reliability we are talking about?

Before I go on, I must confess that I am almost embarrassed to write this, on a subject in which my knowledge is almost nil, and bearing in mind that people much more knowledgeable than I am may read me (I apologise for my audacity). The above reflections were not pedagogical devices, but the result of my astonishment at the news of the \(30\%\) reliability, which made me start looking into what that meant.

We are not going into the biological or medical descriptions of what the test consists of, how it is carried out, why it works better or worse, whether it is more expensive or cheaper, slower or faster, more or less invasive… We will only look at the few mathematical concepts behind it, all of which are really elementary.

Diagnostic tests for diseases

Suppose we are faced with a disease in which there are only two types of individuals, sick and healthy. If the population is \(N\), there will be \(E\) sick and \(S\) healthy, with \(N=E+S\).

Suppose also that we have a test which, when applied to any individual, tells us whether that person is sick or healthy. The test is not perfect, however, and will not always be right. It is designed to look for sick people, so we say that the test is positive if it says that the individual is sick, and negative if it says that the individual is healthy.

When we test an individual, there are four possibilities:

  1. He/she is sick and the test shows him/her as sick.
  2. He/she is healthy and the test shows that he/she is healthy.
  3. He/she is sick but the test shows him/her to be healthy (false negative).
  4. He/she is healthy but the test indicates that he/she is sick (false positive).
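As a minimal sketch, the four cases can be written as a small function. The name `classify` and its two boolean arguments are illustrative choices made here, not part of any testing protocol.

```python
def classify(is_sick: bool, test_positive: bool) -> str:
    """Return which of the four cases a tested individual falls into."""
    if is_sick and test_positive:
        return "true positive"
    if not is_sick and not test_positive:
        return "true negative"
    if is_sick and not test_positive:
        return "false negative"
    # The only remaining case: healthy, but the test says sick.
    return "false positive"


# The four cases, in the same order as the list.
for sick, positive in [(True, True), (False, False), (True, False), (False, True)]:
    print(f"sick={sick}, positive={positive} -> {classify(sick, positive)}")
```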

The test fails in the third and fourth cases. False negatives are usually considered the more problematic, because the health system does not take care of those people, which can be serious both for them and for the people they may go on to infect. (False positives, by contrast, are followed up or even treated unnecessarily, and subsequent tests easily uncover the error.)

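If we test the whole population (it would be the same if we did not test everyone), the individuals are distributed among four compartments: \(V_{+}\) true positives and \(F_{-}\) false negatives among the \(E\) sick, and \(F_{+}\) false positives and \(V_{-}\) true negatives among the \(S\) healthy, so that \(E = V_{+}+F_{-}\) and \(S = F_{+}+V_{-}\).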

There are many possibilities for ratios here (and, if multiplied by \(100\), we will of course get percentages). For example, the proportions of sick people (also called disease prevalence) and healthy people would be, respectively,

$$ \frac{E}{N} = \frac{V_{+} + F_{-}}{N} \qquad\mbox{and}\qquad \frac{S}{N} = \frac{F_{+} + V_{-}}{N}. $$

As mentioned above, the test can fail in two ways, either by giving false positives or by giving false negatives. Respectively, the probability that a positive result is wrong and the probability that a negative result is wrong are

$$ \frac{F_{+}}{V_{+} + F_{+}} \qquad\mbox{and}\qquad \frac{F_{-}}{F_{-} + V_{-}}.$$

There are four named indicators of test quality, and they are as follows:

  1. Sensitivity \(= V_{+}/E = V_{+}/(V_{+}+F_{-})\).
  2. Specificity \(= V_{-}/S = V_{-}/(F_{+}+V_{-})\).
  3. Positive predictive value \(= V_{+}/(V_{+}+F_{+})\).
  4. Negative predictive value \(= V_{-}/(V_{-}+F_{-})\).

The most important are the first two (of course, there are more indicators, but for our purposes here, these are more than sufficient).
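The four definitions translate directly into code. The following is only a sketch: the function name `indicators` and the argument names `vp`, `fn`, `fp`, `vn` (standing for \(V_{+}\), \(F_{-}\), \(F_{+}\), \(V_{-}\)) are chosen here for illustration, and the counts in the usage example are made up. It assumes every denominator is non-zero.

```python
def indicators(vp: int, fn: int, fp: int, vn: int) -> dict:
    """Compute the four quality indicators from the four counts.

    vp, fn, fp, vn are the numbers of true positives, false negatives,
    false positives and true negatives. Assumes no denominator is zero.
    """
    return {
        "sensitivity": vp / (vp + fn),  # V+ / E
        "specificity": vn / (fp + vn),  # V- / S
        "positive predictive value": vp / (vp + fp),
        "negative predictive value": vn / (vn + fn),
    }


# Made-up counts, only to show the call: 100 sick and 100 healthy people.
for name, value in indicators(vp=90, fn=10, fp=20, vn=80).items():
    print(f"{name}: {value:.1%}")
```

Note that sensitivity and specificity each use only one group (the sick or the healthy), while the two predictive values mix both groups and therefore depend on how many sick people there are in the population.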

The first, sensitivity, is the probability of correctly classifying a diseased individual, i.e. the probability that a diseased individual will have a positive test result. It is, therefore, the ability of the test to detect the disease. (A sensitive test detects, for example, individuals who are “a little bit sick”, with a low viral load.)

The second, specificity, is the probability of correctly classifying a healthy individual, i.e. the probability that a healthy individual will get a negative test result. It is, therefore, the ability of the test to detect the absence of disease. (A specific test, for example, does not flag those infected with a virus other than the one of interest.)

It would be most useful to have tests that are both very sensitive and very specific, but such tests often do not exist (or are too expensive or too slow). We will have to decide, in each case and for each disease, whether we are more interested in a sensitive test or a specific one, because the tests available do not always have a high value for both indicators.

The fact is that, as we can see, there is no indicator called reliability, which is the word usually used in the media when referring to coronavirus tests. Actually, when they were talking about reliability they were using the usual meaning of the word, not a technical term. What they are referring to is sensitivity. As far as I could find, the specificity of the tests available was always very high, \(90\%\) or even \(100\%\), so the media didn’t bother to highlight it (we all think it’s normal for the things we buy to work well, so no one will bother to mention it).
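Reading “reliability” as sensitivity also settles the earlier idea of taking the opposite of what the test says. Reversing every answer turns false negatives into true positives, but it also turns true negatives into false positives, so both indicators are replaced by their complements. A small sketch, assuming a specificity of \(90\%\) (a value in the range mentioned above, not a figure reported for the returned tests) and using the illustrative name `invert`:

```python
from fractions import Fraction


def invert(sensitivity: Fraction, specificity: Fraction) -> tuple:
    """Indicators of the test that always reports the opposite result.

    Every sick person the original test missed is now detected, and
    every healthy person it cleared is now flagged as sick.
    """
    return 1 - sensitivity, 1 - specificity


# Assumed values: 30% sensitivity, 90% specificity.
new_sensitivity, new_specificity = invert(Fraction(30, 100), Fraction(90, 100))
print(f"sensitivity: {float(new_sensitivity):.0%}")  # 70%
print(f"specificity: {float(new_specificity):.0%}")  # 10%
```

Under that assumption, the reversed test would catch \(70\%\) of the sick, but it would also declare \(90\%\) of the healthy to be sick, so there is no miracle.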

Some examples

Let us assume a population of \(N=2000\) people, with \(E=630\) sick and \(S=1370\) healthy (the numbers are chosen so that the percentages we are going to apply give us integers), on which we are going to run a test to try to find out who is sick and who is not.

Of course, the indicators that measure the quality of the test (sensitivity and specificity) must be thought of as having been previously evaluated on a population that adequately represents the population in which it will be used in real life. If, for example, the manufacturer has cheated and tested the test only on hospitalised patients with a very high viral load, they are telling us that their test has a much higher sensitivity than what we will actually get if it is applied to the general population.

If we have a test with a sensitivity of \(30\%\), and we apply it to the entire population, there will be \(V_{+} = 0.3\cdot 630 = 189\) true positives; the rest of the patients will be \(F_{-} = 630-189 = 441\) false negatives, and consequently many patients will remain undetected. Without knowing the specificity we cannot know \(V_{-}\) or \(F_{+}\); let’s assume it is \(90\%\). Then there will be \(V_{-} = 0.9 \cdot 1370 = 1233\) true negatives and \(F_{+} = 1370-1233=137\) false positives.
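The arithmetic of this example can be checked with a short sketch. The function name `expected_counts` and its integer-percentage arguments are illustrative choices; the code assumes, as the text does, that the percentages give whole numbers of people.

```python
def expected_counts(sick: int, healthy: int,
                    sensitivity_pct: int, specificity_pct: int) -> tuple:
    """Split a tested population into (V+, F-, F+, V-).

    Percentages are integers so the arithmetic is exact; this assumes
    sick * sensitivity_pct and healthy * specificity_pct are multiples of 100.
    """
    vp = sick * sensitivity_pct // 100      # sick people detected
    fn = sick - vp                          # sick people missed
    vn = healthy * specificity_pct // 100   # healthy people cleared
    fp = healthy - vn                       # healthy people flagged
    return vp, fn, fp, vn


vp, fn, fp, vn = expected_counts(630, 1370, 30, 90)
print(vp, fn, fp, vn)  # 189 441 137 1233
print(f"positive predictive value: {vp / (vp + fp):.0%}")
print(f"negative predictive value: {vn / (vn + fn):.0%}")
```

With these numbers, only about \(58\%\) of the positive results correspond to sick people, and about \(74\%\) of the negative results to healthy ones.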

Once we understand what we are talking about, if we try to “decide” who is sick using the expedient method of tossing a coin, we would obtain, on average, \(V_{+} = F_{-} = 315\) and \(V_{-} = F_{+} = 685\). It is true that there will be fewer false negatives than with the \(30\%\) test, but at the cost of a huge increase in false positives.
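The coin-toss figures are averages; an actual run will land near them rather than exactly on them. The following simulation is only a sketch, with a fair coin modelled by Python's standard `random` module and an arbitrary seed; the name `coin_toss_counts` is an illustrative choice.

```python
import random


def coin_toss_counts(sick: int, healthy: int, seed: int = 0) -> tuple:
    """Simulate "diagnosing" everyone with a fair coin; return (V+, F-, F+, V-)."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    # Heads (probability 1/2) means the coin declares the person sick.
    vp = sum(rng.random() < 0.5 for _ in range(sick))
    fn = sick - vp
    fp = sum(rng.random() < 0.5 for _ in range(healthy))
    vn = healthy - fp
    return vp, fn, fp, vn


vp, fn, fp, vn = coin_toss_counts(630, 1370)
print(f"true positives:  {vp} (expected about 315)")
print(f"false negatives: {fn} (expected about 315)")
print(f"false positives: {fp} (expected about 685)")
print(f"true negatives:  {vn} (expected about 685)")
```

Changing the seed changes the individual counts, but they stay close to half of each group.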

We have also read in the papers that, for a coronavirus test to be acceptable, its reliability should be at least \(70\%\); of course, again they are referring to sensitivity.

Even with a sensitivity of \(80\%\), when applying such a test, we would have \(V_{+} = 0.8\cdot 630 = 504\) true positives and \(F_{-} = 126\) false negatives. This is still a much higher number than one would like, which is why, if a person has symptoms of the disease but tests negative, it is reasonable to repeat the test a couple of days later to try to confirm the diagnosis (in that couple of days, and if the person really is sick, the viral load can be expected to have increased and the test will find it easier to detect the positive).
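To see how the number of undetected patients shrinks as sensitivity grows, here is a short sketch over the \(E=630\) sick people of the example. The function name `false_negatives` is illustrative, and the sensitivities listed are the ones mentioned in this article plus the \(50\%\) of the coin.

```python
def false_negatives(sick: int, sensitivity_pct: int) -> int:
    """Number of sick people the test misses (F-).

    Assumes sick * sensitivity_pct is a multiple of 100, as in the example.
    """
    true_positives = sick * sensitivity_pct // 100
    return sick - true_positives


for pct in (30, 50, 70, 80):
    print(f"sensitivity {pct}%: {false_negatives(630, pct)} false negatives")
# sensitivity 30%: 441 false negatives
# sensitivity 50%: 315 false negatives
# sensitivity 70%: 189 false negatives
# sensitivity 80%: 126 false negatives
```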

About Juan Luis Varona
Mathematician, alfareño (from Alfaro) born in Tudela. Professor at the Universidad de La Rioja (Logroño)
