Making (Common) Sense
of Outcome Measures


FROM:   Manual Therapy 2015 (Dec); 20 (6): 723–726 ~ FULL TEXT

David M. Walton, PT PhD

School of Physical Therapy,
Western University,
1201 Western Rd.,
London, ON, N6G 1H1, Canada.
dwalton5@uwo.ca


Sound measurement is a cornerstone of good quantitative research. No matter how well defined the research question, how informed the hypotheses, or how good the design, a study is doomed to failure without a sound means of measuring the effect. Good clinical practice should also seek to quantify the effects of treatment for individual patients using well-conceived measurement tools with sound properties. In lab-based basic science research, where the subject under study is often a tissue or cell, measurement tools can be quite precise and grounded in concrete observations (e.g. tissue X stretched 45 µm before complete rupture). Working with humans, however, is a far more nebulous endeavor; while some ‘hard’ objective indicators of health can be explored (e.g. blood glucose, heart rate, body temperature), the outcomes of most interest to rehabilitation professionals have no easily identifiable biological markers. For most outcomes of clinical interest, such as pain or perceived disability, we are dependent on the patient's report. This is the impetus behind the recent push towards greater adoption of Patient-Reported Outcomes (PROs).

PROs are defined as

“any report of the status of a patient's health condition that comes directly from the patient, without interpretation of the patient's response by a clinician or anyone else.” (FDA, 2009)

While a PRO could be as simple as recording a patient's response to the question ‘How are you doing today?’, far more elegant and precise methods exist that are intended to quantify a patient's perception of his/her health status. While not yet universal, these appear to be seeing increased use in routine clinical practice (Macdermid et al., 2013).

There are hundreds of PROs that have been developed in an attempt to quantify patients' perceptions, beliefs, opinions, values, or experiences that may have relevance to rehabilitation professionals. Good PROs have arguably become the gold standard for evaluating clinical outcomes, recognizing that it is most often what the patient feels, or more accurately, what they believe they feel, that drives health care usage and patient satisfaction. In other words, no matter how physiologically ‘normal’ the patient appears to the clinician, if he/she believes that his/her health remains unsatisfactory then that patient is likely to continue to consume health care resources or experience reduced daily function and productivity. Most rehabilitation research uses PROs as a primary outcome, but a poorly designed or conceptualized PRO may lead to spurious results that are not reflective of true clinical practice. This presents clinicians and clinical researchers with a unique challenge for which many are not adequately prepared: the proper selection, administration, and interpretation of an appropriate PRO for their specific context and purpose.

Many of the articles published in this issue of Manual Therapy include some form of outcome measure on which statistical analyses have been conducted for the purpose of supporting, refuting, or developing one or more hypotheses. Yet measurement science is a full-fledged field that even seasoned academics find difficult to master. Terms like ‘reliability’ and ‘validity’ are common vernacular for most graduates of academic programs but are difficult concepts to fully grasp. Validity is particularly difficult to define and nigh impossible to prove. Most measurement scientists recognize that validity is a process rather than an end-point, and that a PRO can never be truly described as ‘valid’. In simplest terms, it is impossible to truly ‘validate’ a PRO because the gold standard required to do so resides somewhere in the infinite space of the human psyche. This represents a challenge for those of us who develop and evaluate PROs, and an even greater challenge for clinicians and researchers. The threshold for designating a PRO ‘adequately valid’ for clinical or research use is inexact, often relying on context as much as good science.

While understanding the statistical and philosophical sciences behind measurement is valuable, this editorial presents a largely non-statistical common sense approach to evaluating PROs that aims to help readers quickly pick through the myriad of PROs available to identify those most appropriate for their context. PRO users should consider each question below when selecting, applying and interpreting a PRO.



Question 1:   What is the intended purpose of the tool?

Patient-Reported Outcomes (PROs) can be broadly classed in three ways:

evaluating change over time (evaluative),

discriminating between two or more groups (discriminative), or

predicting some future event (predictive).
Table 1

While all 3 types should possess sound measurement properties from a variety of statistical perspectives, there are key characteristics that each should demonstrate (Table 1). Understanding the intended purpose of the tool allows users to quickly identify the most relevant statistics to interpret their results.



Question 2:   What is the ‘latent construct’ that the tool is intended to measure?

PROs are en vogue because they are the best way to find out how someone thinks, feels, or perceives their situation. Constructs for which there are no concrete observable standards are referred to as ‘latent constructs’, such as pain, sadness, love, expectation, and perceived disability. As measurement theory goes, the response a person gives to any item on a scale is a small reflection of their current status on that latent construct. In other words, if we imagine the emotion of joy on a continuum from no joy to extreme joy, then those who occupy a location on that continuum close to the ‘extreme joy’ end should respond to a scale question like ‘I love to sing and dance’ with greater agreement than someone who occupies a lower location, assuming singing and dancing is a reflection of joy. In order for this to make sense, it is necessary to understand the latent construct of ‘joy’, and what different sub-domains of joy may be observable. Let's say that joy itself is comprised of happiness, vitality, and social connectedness. In that case a tool meant to measure a person's location on the joy continuum should include questions that draw from all 3 sub-domains. When evaluating available tools, it is wise to understand the theoretical underpinnings of that mysterious ‘latent construct’, and make sure that the tool appears to include items that reflect it.



Question 3:   Does this tool make sense with a single overall score,
or should it be interpreted as more than one sub-score?


Closely related to Question 2, this question influences the way a user should interpret individual responses on a PRO. In the majority of cases, scores from each item on a PRO are arithmetically summed and interpreted as a single score. The problem arises when the tool should instead be interpreted as separate sub-scores. There are both conceptual and statistical reasons for this, but we'll focus on the former. Using the hypothetical example of a ‘joy’ scale, let's assume it includes 3 subscales (happiness, vitality, social connections), each of which is represented by 5 items. Imagine a patient who scores low on the joy scale, suggesting a person who might benefit from an appropriate joy-based intervention. Unknown here is whether the problem is his sense of vitality, happiness, or social connectedness, making it difficult to plan treatments. Furthermore, imagine the prescribed treatment is a powerful antidepressant medication that has a side effect of cardiovascular depression. As a result, the patient's sense of happiness increases, but vitality decreases. The net effect on a single summary score in this case might indicate no change (one goes up, one goes down), but in reality two potentially very important changes have occurred. For this reason, it is wise to understand the subscales, or ‘factors’, within a scale and interpret the score accordingly.
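The masking effect described above can be sketched in a few lines of code. The joy scale, its subscales, and all the scores below are hypothetical illustrations invented for this example, not a real instrument:

```python
# Hypothetical 15-item 'joy' scale: three 5-item subscales, items scored 1-5.
# All names and numbers are illustrative only.
def score(responses):
    """Return (total score, per-subscale scores) for 15 item responses."""
    subscales = {
        "happiness": responses[0:5],
        "vitality": responses[5:10],
        "social": responses[10:15],
    }
    sub_scores = {name: sum(items) for name, items in subscales.items()}
    return sum(responses), sub_scores

# Before treatment: low happiness, good vitality, moderate social connection
before = [2] * 5 + [4] * 5 + [3] * 5
# After treatment: happiness rises, vitality falls by the same amount
after = [4] * 5 + [2] * 5 + [3] * 5

total_before, subs_before = score(before)
total_after, subs_after = score(after)
print(total_before, total_after)  # 45 45 -- the summary score shows no change
print(subs_before, subs_after)    # yet two subscales changed substantially
```

The identical totals (45 before and after) hide a 10-point improvement in happiness and a 10-point decline in vitality, which is exactly why interpreting sub-scores matters.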



Question 4:   Are all items equally important to all people?

Related to Question 3 is the question of whether all items should contribute equally to the scores. As an illustrative example, consider a hypothetical frequency-based lower extremity function PRO comprised of items that span the ‘function’ continuum from very easy (e.g. walking 20 m) to very difficult (e.g. jumping on one leg). It may be that hopping on one leg is an activity the respondent neither aspires to nor particularly cares about achieving. As a result, that item is unlikely to change in frequency no matter how good the intervention. In that case, does it make sense to include ‘jumping on one leg’ in your calculation of this patient's perceived function to the same extent as an activity that may be very important? This is the basic concept behind computer-adaptive testing (CAT), and it also features in a more rudimentary way in some newer PROs such as the Canadian Occupational Performance Measure (COPM; Law et al., 1990) and the Satisfaction and Recovery Index (SRI; Walton et al., 2014). For clinicians, however, the solution may be easier: consider interpreting results of a PRO both as an overall score (in a way congruent with the literature) and by exploring responses to each item individually. Doing so may provide important insights into treatment planning.
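One minimal sketch of importance-weighting, loosely in the spirit of patient-rated importance as used in tools like the COPM (this is not the COPM's actual scoring algorithm; the function, items, and numbers are all hypothetical):

```python
# Illustrative only: weight each item's frequency score (1-5) by a
# patient-rated importance (0-10), normalised by total importance.
def weighted_score(frequencies, importances):
    """Importance-weighted mean of item scores; items rated 0 drop out."""
    numerator = sum(f * w for f, w in zip(frequencies, importances))
    denominator = sum(importances)
    return numerator / denominator

# Hypothetical lower-extremity items: walk 20 m, climb stairs, hop on one leg
freq = [4, 3, 1]
importance = [10, 8, 0]  # this patient does not care about hopping

print(round(weighted_score(freq, importance), 2))
# The zero-importance 'hop on one leg' item no longer drags the score down
```

Under this weighting, the unattainable but unimportant item contributes nothing, so the score better reflects what the patient actually values.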



Question 5:   Does the response structure make sense?

Scale responses can also be broadly classed into 3 categories: magnitude-, frequency-, or opinion-based. Other less frequent types exist, but these are the most common for rehabilitation-focused PROs. Magnitude-based scales measure the magnitude or severity of respondents' perceptions of their status on each item (e.g. none, slight, moderate, extreme). Frequency-based scales offer respondents the opportunity to indicate the temporal characteristics of their experiences or behaviors (e.g. never, rarely, sometimes, always). Opinion-based scales are intended to tap respondent opinions regarding statements that may either be personal or present their perception of people in general. Likert (1932) popularized opinion-based scales that commonly include options ranging from strong opinion (e.g. strongly agree or disagree) to neutral (e.g. neither agree nor disagree). While useful, users should consider the impact of bivalent scales on interpreting scores. Imagine an opinion scale with 10 items, each scored 1, 2, 3, 4, or 5, where ‘3’ represents the neutral opinion. A score of 30 therefore represents a complete absence of opinion (e.g. 3's for all 10 items), conceptually equal to a zero (0) score. In this structure, scores from 10 to 29 would indicate general disagreement, while 31–50 would indicate general agreement. In contrast, a complete absence of magnitude or frequency scored on the same 1, 2, 3, 4, 5 scale would be represented at the far low end, or a ‘10’ in this case. This consideration is relevant for clinicians (knowing where the ‘zero’ point is) and may also have an impact on the choice of statistical analyses in research (van Schuur and Kiers, 1994; Jamieson, 2004).
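The arithmetic behind the two different ‘zero’ points can be made explicit in a few lines (the 10-item, 1–5 scale here is the same hypothetical one as in the text):

```python
# Locating the conceptual 'zero' of a 10-item scale scored 1-5
# under two different response structures (illustrative only).
n_items = 10
neutral_option = 3  # bivalent opinion scale: 'neither agree nor disagree'
lowest_option = 1   # magnitude/frequency scale: 'none' / 'never'

# Bivalent opinion scale: the no-opinion point sits mid-range.
opinion_zero = n_items * neutral_option   # 30

# Magnitude/frequency scale: the absence point sits at the floor.
magnitude_zero = n_items * lowest_option  # 10

print(opinion_zero, magnitude_zero)  # 30 10
```

So two patients scoring ‘30’ on structurally different 10-item tools occupy very different conceptual positions, which is why knowing the response structure matters before interpreting any number.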

Another problem arises when constructs are mixed in a response structure. Consider the joy scale again, with the item ‘I love to sing and dance’. Now imagine the response options are: ‘not at all’, ‘a little bit’, ‘moderately’, or ‘all the time’. Is this a frequency or magnitude-based scale? ‘A little bit’ or ‘moderately’ are most aligned with magnitude, while ‘all the time’ is clearly frequency. This is an example of mixed constructs, in that it's possible for a respondent to indicate, for example, that they moderately love to sing and dance all the time (that is, they do it all the time, but they only moderately love it), forcing the respondent to choose between reporting either magnitude or frequency, but not both.



Question 6:   Do the items make sense as written?

A good practice to adopt when considering implementation of a new PRO is to first attempt to complete the tool personally. If a clinician finds the items difficult to interpret or answer, then patients will as well. In most cases it should be possible to make a coherent sentence out of a combination of the item and the response to that item. For example, revisiting the imaginary joy scale that includes the item ‘I love to sing and dance’, inclusion of the word ‘love’ suggests a strong emotional valence that might make it appropriate for opinion- or frequency-based scaling but inappropriate for a magnitude-based scale. Would it make sense for someone to state ‘I slightly love to sing and dance’? Further, this is an example of a double-barreled item: someone may love to sing, but not dance. In that case, how does one respond? For good measurement, each item should tap a single domain of the latent variable. If singing and dancing were in fact important representations of someone's level of joy, then they should be two separate items. Table 2 provides common examples of item problems in PROs that readers can watch for when selecting the best tool for their purpose.


Table 2.   Common PRO item errors and why they are problematic.

Double-barreled question
Example:   ‘I love to sing and dance’
Reason:   As a general rule, the word ‘and’ should not appear in an item. It usually means there are two constructs in one item, forcing respondents to answer two questions with a single response. As a result, their response may not be an accurate reflection of their position on either.

Double negatives
Example:   ‘I don't smoke’ (not at all true, a little bit true, moderately true, completely true)
Reason:   Here, people who do smoke would choose ‘not at all true’, the interpretation being ‘it's not at all true that I don't smoke’. This adds unnecessary cognitive load for the respondent and increases the risk of error.

Advanced language
Example:   ‘I am perpetually late for my exercise class’
Reason:   The language used in a scale should be easily understandable to someone with about a grade 6 equivalent reading level. Here, not everyone will understand the concept of perpetuity.

Ambiguous items
Example:   ‘I generally enjoy exercise’ (strongly disagree, slightly disagree, slightly agree, strongly agree)
Reason:   It is not clear what ‘generally’ means, so the item is too open to interpretation, which adversely affects both validity and reliability.

Location dependence
Example:   How much difficulty do you have: Q1: Jumping 6 inches high; Q2: Jumping 12 inches high (no difficulty, slight difficulty, moderate difficulty, extreme difficulty)
Reason:   Location dependence is the term given to a phenomenon in which the answer to one item is not independent of the answer to a previous item, violating a statistical assumption of independent observations. In this case, if someone is unable to jump 6 inches, then the question on jumping 12 inches is irrelevant. A better option would be to ask how high the respondent can jump.

Conflated constructs
Example:   Indicate how much pain you experience: 1. I rarely have mild pain; 2. I occasionally have moderate pain; 3. I often have severe pain
Reason:   Each response option combines intensity (mild, moderate, severe) with frequency (rarely, occasionally, often). Not only does this provide uninterpretable information on either construct, it leaves those with, for example, rare but severe pain unable to provide an accurate response.

Reverse-scored items
Example:   Q1: Other people would say I am a happy person; Q2: Other people enjoy being around me; Q3: Other people find me difficult to talk to
Reason:   The first two questions are oriented so that a higher response indicates greater social connectivity, while the third is oriented the other way round. While not inherently problematic, users should be aware of the presence of reverse-scored items and make sure they are accounted for in the final score. Reverse-scored items are also often identified as problematic when such scales are subject to deeper analyses of their measurement properties, and are often removed in later iterations (for a concrete example, see the Tampa Scale for Kinesiophobia; Woby et al., 2005).
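‘Accounting for’ a reverse-scored item before summing is usually a simple transformation. A minimal sketch, assuming items scored 1–5 and using the three social-connectivity questions above (the specific scoring values are illustrative, not from any published scale):

```python
# Illustrative reverse-scoring before summing, assuming items scored 1-5
# and Q3 ('Other people find me difficult to talk to') oriented opposite
# to Q1 and Q2.
MIN_SCORE, MAX_SCORE = 1, 5
REVERSED_ITEMS = {2}  # zero-based index of Q3

def total(responses):
    """Sum responses after flipping reverse-scored items (new = max+min-old)."""
    adjusted = [
        MAX_SCORE + MIN_SCORE - r if i in REVERSED_ITEMS else r
        for i, r in enumerate(responses)
    ]
    return sum(adjusted)

# A highly connected respondent: agrees strongly with Q1/Q2 (5),
# disagrees strongly with Q3 (1).
print(total([5, 5, 1]))  # 15, not the naive sum of 11
```

Without the flip, the naive sum of 11 would understate this respondent's social connectivity; after reversal, all three items point the same way and the total behaves as intended.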

Summary and recommendations

PROs are a valuable addition to clinical practice and a necessity in quantitative clinical research, whether to identify important groups, make decisions about the need for and nature of intervention, or quantify the effect of that intervention. Proper selection, administration, and interpretation of PROs are sciences unto themselves, and an incorrect choice of PRO can lead to clinical or research findings that are inaccurate reflections of the truth. Even without detailed knowledge of psychometrics, clinicians and non-measurement researchers can at least make common sense decisions by taking the time to carefully read scales prior to adoption and answering the 6 questions posed here. By evaluating a tool's purpose, factors, scoring, response structure, and any individual item issues, more informed decisions about PRO selection can be made, ensuring that the meaning and interpretation of clinical and research outcomes are at least not hampered by inappropriate measurement.



References:

  1. Food and Drug Administration.
    Guidance for Industry: Patient-Reported Outcome Measures:
    Use in Medical Product Development to Support Labeling Claims.
    December 2009: 1–39. Available from:
    http://www.fda.gov/downloads/Drugs/
    GuidanceComplianceRegulatoryInformation/Guidances/UCM193282.pdf

  2. Jamieson, S.
    Likert scales: how to (ab)use them.
    Med Educ. 2004 Dec; 38: 1217–1218

  3. Law, M., Baptiste, S., McColl, M.A., Opzoomer, A.,
    Polatajko, H., and Pollock, N.
    The Canadian occupational performance measure: an outcome measure for occupational therapy.
    Can J Occup Ther. 1990; 57: 82–87

  4. Likert, R.
    A technique for the measurement of attitudes.
    Arch Psychol. 1932; 140: 1–55

  5. Macdermid, J.C., Walton, D.M., Cote, P., Santaguida, P.L.,
    Gross, A., Carlesso, L. et al.
    Use of outcome measures in managing neck pain: an international multidisciplinary survey.
    Open Orthop J. 2013 Sep 20; 7: 506–520

  6. van Schuur, W.H. and Kiers, H.A.
    Why factor analysis often is the incorrect model for analyzing bipolar concepts,
    and what model to use instead.
    Appl Psychol Meas. 1994; 18: 97–110. Available from:

  7. Walton, D.M., MacDermid, J.C., Pulickal, M.,
    Rollack, A., and Veitch, J.
    Development and initial validation of the Satisfaction and Recovery Index (SRI)
    for measurement of recovery from musculoskeletal trauma.
    Open Orthop J. 2014 Sep 30; 8: 316–325

  8. Woby, S.R., Roach, N.K., Urmston, M., and Watson, P.J.
    Psychometric properties of the TSK-11: a shortened version of the Tampa Scale for Kinesiophobia.
    Pain. 2005 Sep; 117: 137–144
