J Manipulative Physiol Ther 2004 (Jan); 27 (1): 26–35
Hugh Hurst and Jennifer Bolton
Anglo-European College of Chiropractic,
BACKGROUND: To date, clinical trials have relied almost exclusively on the statistical significance of changes in scores from outcome measures in interpreting the effectiveness of treatment interventions. It is becoming increasingly important, however, to determine the clinical rather than statistical significance of these change scores.
OBJECTIVE: To determine cutoff values for change scores that distinguish patients who have clinically improved from those who have not.
METHOD: Data were obtained from 165 back and 100 neck patients undergoing chiropractic treatment. Patients completed the Bournemouth Questionnaire (BQ) before treatment and the BQ and Patient's Global Impression of Change (PGIC) scale after treatment. Three statistical methods were applied to individual change scores on the BQ. These were (1) the Reliable Change Index (RCI); (2) the effect size (ES); and (3) the raw and percentage change scores. The PGIC scale was used as the "gold standard" of clinically significant change.
RESULTS: The RCI, using the cutoff value of >1.96, appropriately identified clinical improvement in back patients but not in neck patients. An individual ES of approximately 0.5 had the highest sensitivity and specificity in distinguishing back and neck patients who had undergone clinically significant improvement from those who had not. In terms of raw score changes, percentage BQ change scores [(raw change score/baseline score) × 100] of 47% and 34% were identified as having the highest sensitivity and specificity in distinguishing clinically significant improvement from nonimprovement in back and neck patients, respectively.
CONCLUSIONS: This study provides a methodological framework for identifying clinically significant change in patients. This approach has important implications in providing clinically relevant information about the effect of a treatment intervention in an individual patient.
From the Full-Text Article:
Evidence-based medicine advocates the application of findings from clinical trials in the treatment of individual patients. However, results from research studies are usually given as group mean values and the statistical significance of their differences. Data analyzed in this way give no indication of the proportion of patients in the group achieving a clinically important benefit from the treatment intervention. The information is therefore of limited clinical relevance, since there is no indication of the likelihood of a good response in a single patient. To counteract this, treatments are now being evaluated in terms of numbers needed to treat (NNT). NNT is an easily interpreted statistic informing the clinician of the number of patients that must be treated for a single patient to improve. [1, 2] To calculate the NNT statistic, it is necessary to identify those patients in the group who have undergone a clinically important improvement.
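As a rough illustration of how the NNT follows once improved patients have been counted, the sketch below uses the standard definition, the reciprocal of the difference in improvement proportions between a treated and a comparison group. The function name and the patient counts are hypothetical and are not data from this study, which had no control group.

```python
def number_needed_to_treat(improved_treated, n_treated, improved_control, n_control):
    """NNT = 1 / (proportion improved with treatment - proportion improved without).

    All four arguments are counts of patients. The counts used below are
    hypothetical illustrations, not results from the study.
    """
    p_treated = improved_treated / n_treated
    p_control = improved_control / n_control
    absolute_benefit = p_treated - p_control
    if absolute_benefit <= 0:
        # With no absolute benefit the NNT is undefined.
        raise ValueError("treatment shows no absolute benefit; NNT is undefined")
    return 1.0 / absolute_benefit


# Hypothetical trial: 60 of 100 treated patients and 40 of 100 controls improve.
nnt = number_needed_to_treat(60, 100, 40, 100)
print(f"NNT is approximately {nnt:.1f}")
```

In practice the NNT is usually rounded up to the next whole patient when it is reported.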
Defining the proportion of patients who have clinically improved is problematic, however, when the outcome of interest is subjective and there are no directly measurable end points to indicate that the patient's condition has resolved. An example is in evaluating the effect of treatment in nonspecific back and neck pain where the outcomes of most interest are changes in patients' self-reported levels of pain and disability. In such cases, it is necessary to distinguish those individual change scores on pain and disability scales that represent clinically important change from those that do not.
There are now a number of methods available for identifying clinically important intra-individual change in subjective outcome measures. [3, 4] These fall into one of two camps: the statistical or distribution-based methods on the one hand and the global ratings or anchor-based methods on the other. The most common of the statistical methods are the effect size (ES) statistic and the Reliable Change Index (RCI), as well as simple change scores on the outcome measure itself.
The ES statistic is a method whereby mean differences between pre-treatment and post-treatment scores can be standardized to quantify an intervention's effect in units of standard deviation (SD). It is therefore independent of measuring units and can be used to compare outcomes measured on different scales. ES statistics are widely used to assess the magnitude of treatment-related changes over time and can be applied both to group data and to data recorded from a single patient. Using threshold values put forward by Cohen and Testa, ES values for group mean changes and individual changes, respectively, can be interpreted as small, medium, or large treatment effects. The question remains, however, as to how effect sizes relate to patients' own perceptions of change in their condition and how effect sizes can be interpreted as clinically important effects. For example, thresholds for individual effect sizes in terms of clinically important change would enable patients to be identified as improved or not.
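A minimal sketch of an individual ES is given below. It assumes the common convention of dividing one patient's change score by the SD of the sample's baseline scores; the text does not spell out the exact denominator used, so that choice, the function name, and the scores are illustrative assumptions only.

```python
from statistics import stdev


def individual_effect_size(baseline, follow_up, baseline_sd):
    """One patient's change expressed in units of the baseline SD.

    Assumes lower scores mean less pain and disability, so a positive
    value indicates improvement.
    """
    if baseline_sd <= 0:
        raise ValueError("baseline_sd must be positive")
    return (baseline - follow_up) / baseline_sd


# Hypothetical baseline scores for a small sample (not study data).
sample_baseline = [40, 35, 50, 28, 45, 38]
sd = stdev(sample_baseline)

# Hypothetical patient who moves from 40 at baseline to 30 after treatment.
es = individual_effect_size(40, 30, sd)
print(f"baseline SD = {sd:.2f}, individual ES = {es:.2f}")
```

Because the result is unit-free, the same calculation can be applied to any outcome measure for which a baseline SD is available.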
The RCI, originally proposed by Jacobsen et al and later modified by Christensen and Mendoza, is similar to the ES statistic in that it calculates differences between pre-treatment and post-treatment scores but divides the difference by a standard error of measurement that includes not only the SD of the measure but also its reliability coefficient. RCI values can be referenced to the normal distribution, and values that exceed 1.96 are unlikely (P < .05) unless an actual and reliable change has occurred. Again, the question arises as to how this statistical method of arriving at a clinically important change compares with patients' own perceptions of a real and worthwhile change in their condition following treatment.
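The calculation can be sketched as follows, assuming the widely cited form in which the standard error of measurement is SD × √(1 − r) and the denominator is the standard error of the difference, √2 × SEM. Whether this matches the exact computation applied to the BQ is an assumption, and the SD and reliability values are hypothetical.

```python
import math


def reliable_change_index(baseline, follow_up, baseline_sd, reliability):
    """RCI = change / standard error of the difference between two scores.

    sem    = baseline_sd * sqrt(1 - reliability)   (standard error of measurement)
    s_diff = sqrt(2) * sem                         (standard error of the difference)
    """
    if not 0.0 <= reliability < 1.0:
        raise ValueError("reliability must be in [0, 1)")
    sem = baseline_sd * math.sqrt(1.0 - reliability)
    s_diff = math.sqrt(2.0) * sem
    return (baseline - follow_up) / s_diff


# The same 10-point change under two hypothetical reliability coefficients.
for r in (0.90, 0.65):
    rci = reliable_change_index(40, 30, baseline_sd=12, reliability=r)
    print(f"reliability {r:.2f}: RCI = {rci:.2f}, exceeds 1.96: {rci > 1.96}")
```

The lower the reliability coefficient, the larger the denominator, so a given raw change yields a smaller RCI and the 1.96 threshold becomes harder to reach.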
To assess patients' own impressions of change, a global scale from “much better” through “no change” to “much worse” is commonly used. [5, 10, 11] Since patients themselves make a subjective judgment about the meaning of the change to them following treatment, this scale is often taken as the external criterion or “gold standard” of clinically important change. This makes intuitive sense and underlies current debates on statistical versus clinical significance. Hence, in clinical trials in which end points cannot be directly measured, for example in pain conditions, assessing patients' experiences and what makes a difference to them in terms of a worthwhile and meaningful improvement is pivotal. Moreover, it is worth noting that statistical significance of change scores is derived from outcome measures that again rely on patients' interpretations and subjective judgments colored by their experiences of their condition.
The study reported in this article uses a patient self-report global change questionnaire based on a 7-point numerical rating scale (NRS) to determine from the patients' own perspective the degree of change (improvement) following treatment. This change was judged for its clinical importance by asking patients just how noticeable the change was. Using this as the “gold standard” of clinically significant improvement, the objectives of the study were to determine the sensitivity and specificity of statistical methods of determining clinically significant improvement, namely: (1) the RCI; (2) the ES statistic; and (3) the outcome measure's raw score and percentage score changes.
Deyo and Centor highlighted the importance of a measure not only in its ability to detect a clinically important change when it has occurred but equally in its ability to detect when a clinically important change has not occurred. The issue is therefore not merely one of sensitivity to change but also the ability of a measure to distinguish between those patients who do improve and those who do not. All the statistical methods under test in this study were based on individual change scores before and after treatment recorded on the Bournemouth Questionnaire (BQ), a multidimensional outcome measure based on the biopsychosocial model of musculoskeletal pain and validated for use in back and neck pain patients.
In this study, 3 statistical methods derived from different computations of change scores on the BQ were investigated for their ability to distinguish patients who had undergone a clinically significant change from those who had not. The a priori definition of clinically significant improvement was a score of 6 or more on a 7-point NRS based on patients' global impression of change in their condition following treatment. This equated to feeling better or much better and a noticeable, worthwhile, and meaningful change. This anchor-based method has been used in many other studies to determine clinically significant change. [11, 21, 22, 23] In the absence of a true gold standard, asking patients themselves what constitutes a meaningful change to them, with all the attendant internal and external factors that might influence such judgment, seems intuitively the best that can be done when investigating issues of clinically important change.
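Treating a cutoff on a change statistic as a diagnostic test against the global rating can be sketched as below. The criterion of 6 or more on the 7-point scale is taken from the text; the function name, the candidate cutoff, and the patient data are hypothetical.

```python
def sensitivity_specificity(change_values, global_ratings, cutoff, rating_threshold=6):
    """Compare 'change >= cutoff' against 'global rating >= rating_threshold'.

    Returns (sensitivity, specificity), where the global rating acts as the
    external criterion of clinically significant improvement.
    """
    tp = fp = tn = fn = 0
    for change, rating in zip(change_values, global_ratings):
        criterion_improved = rating >= rating_threshold
        test_positive = change >= cutoff
        if criterion_improved and test_positive:
            tp += 1
        elif criterion_improved:
            fn += 1
        elif test_positive:
            fp += 1
        else:
            tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one improved and one non-improved patient")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical percentage change scores and global ratings for 8 patients.
changes = [60, 50, 20, 55, 10, 40, 48, 5]
ratings = [7, 6, 6, 7, 3, 4, 6, 2]

sens, spec = sensitivity_specificity(changes, ratings, cutoff=47)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

Repeating the calculation over a range of candidate cutoffs shows which value best separates improved from non-improved patients.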
This study identified from 70% to 80% agreement in categorizing patients as improved or not improved between asking patients directly on a PGIC scale and indirectly using cutoff values with high sensitivity and specificity on outcome measures. Since both methods rely on patients' own subjective judgments about change in their condition, this is reassuring. Many agreement studies rule out agreement that occurs by chance by using the kappa (κ) statistic in data analyses instead of simple percent agreement. However, in this case, since the data were not recorded as binary variables, the κ statistic was not considered to be an appropriate method of analysis.
One of the 3 statistical methods used to categorize patients as improved and not improved, the RCI, gave anomalous results both in identifying the proportion of neck patients in the sample who improved and in calculations involving the PGIC scale. Neither of these findings was apparent when the RCI was used in back patients. The reliability coefficient of the neck BQ was relatively low, and this may have resulted in an overrigorous threshold for identifying patients who improved. Caution is therefore indicated when identifying clinically important improvement using the RCI for outcome measures in which reliability is moderate to poor.
The results of the sensitivity and specificity analyses showed that the second statistical method used in this study, the individual ES statistic, can be used to distinguish patients who improve from those who do not using the a priori definition of clinically important improvement from the PGIC scale. The findings of this study show that clinically significant improvement is indicated for individual back patients with an ES statistic of 0.4 or more and individual neck patients with an ES statistic of 0.5 or more. The similarity of these 2 values suggests that an overall individual ES cutoff of 0.5 for both types of patients rather than the exact values would be more convenient for use in a clinical setting and in the design of clinical trials.
The study has shown that the third statistical method under test can also be used to distinguish patients who have improved from those who have not. Raw change scores of 14 or more and percentage change scores of 47% or more were best associated with the a priori definition of clinical change in back pain patients. Corresponding cutoff values in neck pain patients were lower at 9 or more for raw change scores and 34% or more for percentage change scores. Using a similar definition of clinically important improvement, Farrar et al [11] showed that a percentage change score of approximately 30% on an 11-point pain intensity NRS best distinguished chronic pain patients who had improved from those who had not. In an accompanying study to this one, using the BQ in a different sample of neck pain patients and using the RCI (but without the correction factor proposed by Christensen and Mendoza [9]) to identify clinically improved patients, corresponding cutoff values were raw score changes of 13 or more and percentage change scores of 33% (Bolton, submitted for publication). The similarity of the cutoff percentage change score value in both studies suggests this might be more appropriate as a clinical tool in identifying patients who have improved. Moreover, percentage change score is a standardized measure that is more easily interpretable, particularly when different outcome measures with different scales are in use. Farrar et al concluded that in studies in which there is high variability in baseline pain levels, the relationship between percentage change and clinical improvement will be more consistent than the relationship between raw change and clinical improvement.
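Applying the percentage change cutoffs to a single patient can be sketched as follows. The formula, (raw change score/baseline score) × 100, and the 47% and 34% cutoffs come from the study; the function names and the example scores are hypothetical.

```python
# Percentage change cutoffs reported in the study for back and neck patients.
PERCENT_CUTOFFS = {"back": 47.0, "neck": 34.0}


def percentage_change(baseline, follow_up):
    """(raw change score / baseline score) x 100, with improvement positive."""
    if baseline <= 0:
        raise ValueError("baseline score must be positive")
    return (baseline - follow_up) / baseline * 100.0


def meets_cutoff(baseline, follow_up, region):
    """True if the patient's percentage change reaches the region's cutoff."""
    return percentage_change(baseline, follow_up) >= PERCENT_CUTOFFS[region]


# Hypothetical patient scoring 40 before and 26 after treatment (35% change).
for region in ("back", "neck"):
    print(region, meets_cutoff(40, 26, region))
```

The same 35% change falls below the back cutoff but above the neck cutoff, which is why the region has to be specified when the cutoffs are applied.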
This article provides a methodological framework for interpreting statistical computations from outcome measures in terms of their clinical significance. In essence, it treats these computations as diagnostic tests in determining the presence or absence of a clinically significant change. There is a considerable amount of potential bias in the evaluation of diagnostic tests, and a strength of this study was that it avoided selection bias by recruiting patients in a consecutive manner. However, the study only looks at scores from 1 outcome measure in a limited patient group and change that occurs over a relatively short period of time. Moreover, the modified PGIC scale has not been tested for reliability or validity, nor has it been shown to be a valid external criterion for clinically significant change, even though we used it as such. In an area where there is an array of methods to define minimal important difference (anchor-based and statistical), more work is required to identify just what does constitute a clinically important difference, so that it can be used with confidence as a valid external criterion in future studies. Further work is also required into other outcome measures and other conditions. In particular, the reliability of the cutoff values reported in this study should be investigated by repeating the work in different samples of patients. In conditions such as back and neck pain, which are notoriously unpredictable and heterogeneous, issues of reliability are of paramount importance when cutoff values are being proposed for use in other settings. It is also the case that since this study's design did not include a control group, no conclusions have been drawn on the cause of the improvement observed in these patients and therefore the effect of the treatment intervention.
This study presents a number of threshold values on statistical computations from change scores that best identify patients undergoing clinically significant change from those who have not. This work is based, however, on the PGIC as an external criterion of clinically significant change, and while this may be both conceptually reasonable and clinically relevant, it remains to be seen whether or not this is a valid assumption. By identifying proportions of patients who have undergone clinically important change, calculations can be made of the NNT and thus facilitate the application of group results from clinical trials to an individual patient. This transition from research setting to clinical setting underpins the principles of the practice of evidence-based health care.