J Manipulative Physiol Ther 2006 (Jul); 29 (6): 475–485
Mette Jensen Stochkendahl, Henrik Wulff Christensen, Jan Hartvigsen, Werner Vach, Mitchell Haas, Lise Hestbaek, Alan Adams, Gert Bronfort
Nordic Institute of Chiropractic and Clinical Biomechanics Research Department,
Part of Clinical Locomotion Science,
Objective: Poor reproducibility of spinal palpation has been reported in previously published literature, and authors of recent reviews have posted criticism on study quality. This article critically analyzes the literature pertaining to the inter- and intraobserver reproducibility of spinal palpation to investigate the consistency of study results and assess the level of evidence for reproducibility.
Methods: Systematic review and meta-analysis were performed on relevant literature published from 1965 to 2005, identified using the electronic databases MEDLINE, MANTIS, and CINAHL and checking of reference lists. Descriptive data from included articles were extracted independently by 2 reviewers. A 6-point scale was constructed to assess the methodological quality of original studies. A meta-analysis was conducted among the high-quality studies to investigate the consistency of data, separately on motion palpation, static palpation, osseous pain, soft tissue pain, soft tissue changes, and global assessment. A standardized method was used to determine the level of evidence.
Results: The quality scores of the 48 included studies ranged from 0% to 100%. There was strong evidence that the interobserver reproducibility of osseous and soft tissue pain is clinically acceptable (κ ≥ 0.4) and that the intraobserver reproducibility of soft tissue pain and global assessment is clinically acceptable. Other spinal procedures are either not reproducible or the evidence is conflicting or preliminary.
Key Indexing Terms Reproducibility of Results, Palpation, Literature Review, Diagnostic Tests, Spine,
From the FULL TEXT Article:
Biomechanical dysfunction is thought to be an important contributor to spinal pain, and manual palpation is a widely used procedure for the diagnosis of such dysfunctions among providers of manual medicine. [1-3] Contrary to the expectations of many clinicians, unacceptable levels of reproducibility have been shown in the majority of the previously published literature, and authors of newer reviews have questioned the utility of manual examination procedures in spinal diagnosis altogether. [4-7]
Severe criticism has been posted on the design of the original studies, including
the use of asymptomatic subjects, [4, 5]
inexperienced observers, 
parallel testing, 
unclear definitions of positive findings and rating scales, [4, 6]
weak description of study results, [4, 5, 7] and
the need for improvement in overall study quality. [4, 7]
Furthermore, the dependence of Cohen's κ (the most widely used statistical method in studies on reproducibility) on the prevalence of positive findings, and the composition of the study population has been the subject of discussion. [8, 9]
Unfortunately, these reviews themselves have important limitations. For instance, some deal with only a subset of manual examination procedures, such as chiropractic procedures only, 1 spinal region, [4, 6, 10] or motion palpation only. In only 3 reviews was a predefined quality system applied to assess study quality, [4, 6, 7] and in none of the reviews were the number of studies, the methodological quality, and the consistency of the outcomes all considered, as recommended by van Tulder and others. [11-13] Finally, in none of these reviews was the impact of the predefined criteria on the conclusions tested. Therefore, the value of palpation as a diagnostic tool is, at present, still unknown, and so is the ability of practitioners of manual therapy to reliably diagnose spinal dysfunctions using palpation.
We therefore decided that another systematic review taking into account the above issues was warranted. Furthermore, a meta-analysis including comparable studies of adequate methodological standard and assessment of the consistency of study outcomes would be highly useful. The purpose of this paper is therefore to systematically review and critically assess the design and statistical methodology of the literature pertaining to reproducibility of spinal palpation adopting standardized criteria for judging diagnostic studies. A meta-analysis was conducted to evaluate consistency of study outcomes. Finally, the level of evidence for the reproducibility of spinal palpation was determined.
Summary of Results
After reviewing studies dealing with reproducibility of manual palpation of the entire spine, including the SI joints, we found strong evidence for clinically acceptable reproducibility both within and between observers for palpation of osseous and soft tissue pain (STP) and within the same observer for global assessment (GA). Strong evidence for clinically unacceptable levels of reproducibility for intra- and interobserver global assessment, motion palpation (MP) and soft tissue changes (STC) was found. Intraobserver reproducibility was consistently higher than interobserver reproducibility, and reproducibility of palpation for pain response was consistently higher than reproducibility of palpation for motion.
The most recent and comprehensive review evaluating the reproducibility of spinal palpation, by Seffinger et al, applied different inclusion and general review criteria, and thus, only 27 of 44 articles and 9 of 19 high-quality articles included in this review were evaluated. Furthermore, we included several more recent publications and articles dealing with the SI joints and GA, and evaluated single results from multiple test regimens. Our conclusions are based on predefined criteria and an evaluation of consistency of high-quality studies, a method not previously applied, whereas the conclusions by Seffinger et al were based on both high- and low-quality studies without an evaluation of consistency. Seffinger et al concluded that pain provocation tests are most reliable and that soft tissue paraspinal palpatory diagnostic tests are not reliable. Among the 12 highest-quality articles, pain provocation, motion, and landmark location tests were reliable within the same observer, but not always among observers under similar conditions. Overall, examiners' discipline, experience level, consensus on procedures used, training, or the use of symptomatic subjects did not improve reliability. This is in agreement with our findings. Furthermore, we conclude that palpation of pain is reproducible both within and among observers, whereas MP may be reproducible within the same observer.
Methodological and Clinical Considerations
The experimental design of reproducibility studies has been criticized in previous reviews, [4-7, 68-71] and we found that 26 of 48 articles were of low methodological quality, had invalid statistical methods, or insufficient reporting of palpation procedures or test results.
Comparability of the studies included in a review is an important requirement to ensure valid generalizations. We ensured comparability with respect to the palpation procedures used, but the studies were rather heterogeneous with respect to characteristics such as definition of positive findings, segmental region, standardization, occupation, experience, symptomatic status of test population, and parallel testing. However, our investigation showed that most study characteristics had little influence on the study results, with the exception of the application condition. In particular, standing palpation was associated with very low κ values. Among the reviewed studies, standing palpation is used solely in the “Gillet test” of SI biomechanical dysfunction, and only 2 studies reporting this condition were included in our analysis. [39, 59] However, both contributed to the evaluation of the inter- and intraobserver agreement of MP. If we remove these 2 studies, then the average κ for the interobserver agreement increases to 0.19 (0.13-0.26), and the intraobserver agreement increases to 0.44 (0.14-0.73), such that the intraobserver agreement of MP can be regarded as acceptable.
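The kind of sensitivity check described above (recomputing an average κ after dropping studies with a given testing condition) can be sketched as follows. This is only an illustration: the study records are invented, the field names are hypothetical, and the sample-size-weighted mean is an assumed pooling rule, since the pooling method is not stated in this excerpt.

```python
from typing import Iterable

# Hypothetical study records, NOT data from the reviewed articles.
STUDIES = [
    {"id": "A", "condition": "sitting", "kappa": 0.20, "n": 50},
    {"id": "B", "condition": "prone", "kappa": 0.25, "n": 50},
    {"id": "C", "condition": "standing", "kappa": -0.05, "n": 50},
    {"id": "D", "condition": "standing", "kappa": 0.00, "n": 50},
]


def weighted_mean_kappa(studies: Iterable[dict], exclude_conditions=frozenset()) -> float:
    """Sample-size-weighted mean kappa, skipping studies whose testing
    condition is listed in exclude_conditions (assumed pooling rule)."""
    kept = [s for s in studies if s["condition"] not in exclude_conditions]
    if not kept:
        raise ValueError("no studies left after exclusion")
    total_n = sum(s["n"] for s in kept)
    return sum(s["kappa"] * s["n"] for s in kept) / total_n


all_studies = weighted_mean_kappa(STUDIES)
without_standing = weighted_mean_kappa(STUDIES, exclude_conditions={"standing"})
print(f"all studies: {all_studies:.3f}")
print(f"standing palpation excluded: {without_standing:.3f}")
```

With these made-up inputs the pooled value rises once the standing-palpation studies are dropped, which mirrors the direction of the change reported above but says nothing about its actual size.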
Poor reproducibility of MP may reflect the design of reproducibility studies, rather than the quality of the palpation procedure. [29, 30, 72] Greater reproducibility may be attained by allowing positive findings in a neighboring spinal segment to count in assessing agreement.  However, this implies that we define a new, different diagnostic test which, then, requires a clinical rationale of test meaningfulness, beyond just an increase in κ values.  Further, parallel testing (test regimens) seems to aid the observer in making the clinical decision, thus enhancing reproducibility; [30, 42] a tendency we could also observe in our data. The acceptable intraobserver reproducibility for GA is also in line with this finding. However, when evaluating a combination of tests, information is only given about the reproducibility of the single test as part of this exact combination of tests. [14, 73] Moreover, we must be aware that conclusions on a single test from a study involving several tests may be only valid if the test is applied as part of this exact combination of tests. From a clinical perspective, increased reproducibility with parallel testing indicates that at this point, clinicians should not base their diagnosis on a single clinical examination finding such as palpation but, rather, conduct a range of tests. It is, however, premature to make clinical guidelines on how to use palpation because many aspects of palpation, such as the validity, still need to be investigated.
The reproducibility of palpation for pain response is consistently higher than that of palpation for motion, and reproducibility is consistently and substantially higher within an observer than among different observers. However, both palpatory pain studies and intraobserver studies in general have inherent problems with blinding of observers. In intraobserver studies, conscious and unconscious cues may render blinding of the observers impossible, and the independence of measures cannot be guaranteed. In palpatory pain studies, blinding of subjects is impossible. Both situations imply the risk of overestimating reproducibility. It should also be noted that intraobserver reproducibility is somewhat higher than interobserver reproducibility by definition (depending on the magnitude of observer by subject interaction).
A dilemma between high internal validity and clinical applicability arises when designing studies of reproducibility. For example, training studies contrast maximal (ideal) reproducibility with actual reproducibility in practice. To enhance the internal validity, rigid testing conditions should be set up with consideration of blinding, randomization, standardization and training, and parallel testing. However, rigid enforcement of testing conditions often diverges from the clinical situation and, hence, may reduce the external validity. In a clinical situation, a mix of both asymptomatic and symptomatic patients will most likely present to practitioners of manual medicine. Therefore, the study population should consist of a mix of both symptomatic and asymptomatic subjects so that the reproducibility of the testing procedure has a relation to the characteristics of the study population. Finally, in spite of their use in everyday clinical routines, test procedures do not necessarily evaluate the clinical entity they are intended to evaluate, and it is therefore important to discuss the content of the test procedure. [14, 75]
κ is widely accepted as the statistical method of choice for evaluating agreement between 2 observers for a binary classification. It is, however, not without problems to use κ as the sole measure of observer agreement because information is lost when a 4-fold table is summarized into 1 number. Consequently, when a moderate κ value is obtained in a study of reproducibility, we do not know whether it reflects a difference in prevalence estimates between observers or a lack of agreement in spite of similar prevalence.
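To make the information loss concrete, the following minimal sketch computes Cohen's κ from a 4-fold table using the standard definition κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e the agreement expected by chance from the observers' marginal frequencies. The class name, field names, and both example tables are assumptions made for illustration; they are not data from any reviewed study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FourFoldTable:
    """2 x 2 agreement table for two observers and a binary finding."""
    both_pos: int       # both observers rate positive
    only_obs1_pos: int  # observer 1 positive, observer 2 negative
    only_obs2_pos: int  # observer 1 negative, observer 2 positive
    both_neg: int       # both observers rate negative

    @property
    def n(self) -> int:
        return self.both_pos + self.only_obs1_pos + self.only_obs2_pos + self.both_neg


def cohen_kappa(t: FourFoldTable) -> float:
    """Cohen's kappa: chance-corrected agreement for a 4-fold table."""
    p_o = (t.both_pos + t.both_neg) / t.n
    p1 = (t.both_pos + t.only_obs1_pos) / t.n  # prevalence per observer 1
    p2 = (t.both_pos + t.only_obs2_pos) / t.n  # prevalence per observer 2
    p_e = p1 * p2 + (1 - p1) * (1 - p2)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1 - p_e)


def prevalence_by_observer(t: FourFoldTable) -> tuple[float, float]:
    """Proportion of positive findings recorded by each observer."""
    return ((t.both_pos + t.only_obs1_pos) / t.n,
            (t.both_pos + t.only_obs2_pos) / t.n)


# Invented tables: same observed agreement (70%), different disagreement pattern.
similar_prevalence = FourFoldTable(30, 15, 15, 40)
different_prevalence = FourFoldTable(30, 30, 0, 40)

for label, table in [("similar", similar_prevalence), ("different", different_prevalence)]:
    print(label, round(cohen_kappa(table), 3), prevalence_by_observer(table))
```

In both invented tables κ is of moderate size, yet only the per-observer prevalences reveal that in the second table one observer records twice as many positive findings as the other. The single κ value does not distinguish the two situations.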
κ has been criticized for its dependence on the prevalence of positive findings, which limits its usefulness in meta-analyses, because studies with varying prevalence are typically compared. However, the composition of the study population may have greater impact on κ than the prevalence of positive findings.  Both a binary outcome and a reported κ value were required for studies to be part of our meta-analysis. However, binary outcomes may vary according to the definition of positive findings (ie, prevalence is directly dependent on the definition of positive findings). For example, if the observer is asked to identify any hypomobile segment(s) in a spinal region, the prevalence can vary from 0% to 100%, depending on the study population. If the observer is to identify the most hypomobile segment, the overall prevalence of positive findings will be 100%, but at any particular segment under investigation, the prevalence of the most hypomobile can be 0% to 100%. However, we found no association between the prevalence of positive findings and κ values. This supports that the composition of the study populations is probably of greater importance than the prevalence of positive findings, as suggested by Vach. 
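The prevalence dependence referred to above can be illustrated with a short sketch that holds the observed agreement fixed at 90% while the proportion of positive findings changes. The function name and all counts are hypothetical and chosen only to show the arithmetic of the standard κ formula; they do not come from the reviewed studies and do not bear on whether prevalence or population composition matters more in practice.

```python
def kappa_from_counts(both_pos: int, only_obs1_pos: int,
                      only_obs2_pos: int, both_neg: int) -> float:
    """Cohen's kappa from the four cells of a 2 x 2 agreement table."""
    n = both_pos + only_obs1_pos + only_obs2_pos + both_neg
    p_o = (both_pos + both_neg) / n               # observed agreement
    p1 = (both_pos + only_obs1_pos) / n           # observer 1 positive rate
    p2 = (both_pos + only_obs2_pos) / n           # observer 2 positive rate
    p_e = p1 * p2 + (1 - p1) * (1 - p2)           # chance agreement
    return (p_o - p_e) / (1 - p_e)


# Invented examples, each with 100 subjects and 90% observed agreement.
balanced = kappa_from_counts(45, 5, 5, 45)   # about half of findings positive
rare = kappa_from_counts(5, 5, 5, 85)        # about one in ten findings positive

print(f"balanced prevalence: kappa = {balanced:.3f}")
print(f"rare positive finding: kappa = {rare:.3f}")
```

Although the observers agree on 90 of 100 subjects in both invented tables, κ is much lower when positive findings are rare, because chance agreement on negative findings is then high.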
Different words and schemes have been used to evaluate the strength of reproducibility, but there are no definitive guidelines for interpreting good concordance. [8, 76] Moreover, little research has been done to establish minimal, clinically acceptable reproducibility, and perhaps more important than qualifying the strength of concordance, the quantitative reproducibility indices need to be evaluated in terms of their clinical application. 
Limitations of this Review
Different methodologies have been advocated for systematic reviews of trials addressing therapeutic efficacy,  but little consensus exists when it comes to assessing the quality of reproducibility studies. We have chosen to evaluate the strength of evidence based on a best-evidence synthesis method, and this is one of the main differences between this review and previously published reviews on the same topic. Heterogeneity across studies, in terms of test procedures, inclusion criteria, study design and presentation of results, may be masked by the best-evidence approach. Considerable heterogeneity in study characteristics was noted across studies included in this review. However, despite this heterogeneity, the meta-analysis showed very consistent overall findings and only moderate impact of the specific design characteristics on the study outcomes.
The exclusion from the meta-analysis of studies that did not report a binary outcome is another important difference between this and previous reviews. To compare studies of reproducibility, the same type of outcome and method of statistics must be applied. On this account, we had to exclude 5 high-quality studies from the meta-analysis. Results from these studies are not directly comparable to the included studies, but all 5 articles show results with similar trends of low interobserver agreement on MP and higher interobserver agreement on evaluation of pain; they were included in the level of evidence assessment. The restricted number of articles causes the strength of evidence to be preliminary or nonexistent in 3 categories. In return, the power of the conclusions with respect to pain and motion testing is compelling. However, results were, in some categories, based on a relatively small number of original studies, making the conclusions very sensitive to just a few future high-quality studies with different results.
A κ value was reported in all high-quality studies using a binary classification. Hence, there was no need to calculate these from a published 4-fold table. No attempts were made to retrieve additional, original results or materials from the primary authors.
Although every effort was made to find all published reproducibility studies, selection bias may have occurred because we included only English-language articles. Publication bias may have resulted in an overestimation of test reproducibility because studies arriving at positive conclusions are more likely to get published. [77, 78] Furthermore, reviewer bias is also a possible limitation of this review. Reviewers were not blinded to the authors or the results of the individual trials when the methodological scoring was performed because of our familiarity with the literature.
Despite acceptable study quality according to our criteria, many trials still had methodological limitations or, at best, inadequate reporting of methods. Nonetheless, reproducibility of spinal manual palpation has been very thoroughly investigated, and more than 40 original articles have been evaluated in this review. However, to shed light on the clinical usefulness of palpation, the validity needs to be investigated, and innovative research that addresses the concomitant problems of selecting a gold standard in motion testing is warranted. Future research should also address the role of palpation in the overall assessment of patients with neck and back pain and the importance of palpation as part of the complete clinical evaluation of patients.
Palpation for pain is reproducible at a clinically acceptable level, both within the same observer and among observers. Palpation for global assessment (GA) is reproducible within the same observer but not among different observers. The level of evidence to support these conclusions is strong. The reproducibility of motion palpation (MP), soft tissue changes (STC) and static palpation (SP) is not clinically acceptable.
The level of evidence is strong for interobserver reproducibility of MP and STC, whereas no evidence or conflicting evidence exists for SP and intraobserver reproducibility of STC. Results are overall robust with respect to the predefined levels of acceptable quality. However, the results are sensitive to changes in the preset level of clinically acceptable reproducibility and to the number of included studies.
Joint principles and procedures.
in: Bergmann TF Petersen DH Lawrence DJ Chiropractic technique: principles and procedures.
Churchill Livingstone Inc, New York 1993: 51-121
Schafer RC, Faye LJ
Introduction To the Dynamic Chiropractic Paradigm
in: Schafer RC Faye LJ Motion palpation and chiropractic technique. 1st ed.
The motion palpation institute, Huntington Beach, CA: 1-41
3rd ed. Butterworths, London 1977
Are chiropractic tests for the lumbo-pelvic spine reliable
and valid? A systematic critical literature review.
J Manipulative Physiol Ther. 2000; 23: 258-275
Spinal motion palpation: a review of reliability studies.
J Man Manip Ther. 2002; 10: 24-39
van der Wurff P
Clinical tests of the sacroiliac joint. A systematic methodological review.
Part 1: reliability.
Man Ther. 2000; 5: 30-36
Reliability of spinal palpation for diagnosis of back and neck pain:
a systematic review of the literature.
Spine. 2004; 29: E413-E425
Statistical methodology for reliability studies.
J Manipulative Physiol Ther. 1991; 14: 119-132
The dependence of Cohen's kappa on the prevalence does not matter.
J Clin Epidemiol. 2005; 58: 655-661
Inter-examiner reliability in detecting cervical spine dysfunction: a short review.
J Osteopath Med. 2002; 5: 24-27
van Tulder MW
Method guidelines for systematic reviews in the
Cochrane collaboration back review group for spinal disorders.
Spine. 1997; 22: 2323-2330
Cochrane reviewers' handbook 4.2.0.
Cochrane Collaboration, Oxford 2003 (cited 2004 Jun 1)
van Poppel MN
Systematic review of psychosocial factors at work and private life
as risk factors for back pain.
Spine. 2000; 25: 2114-2125
Reproducibility and validity studies of diagnostic procedures in manual/
in: International Federation for Manual/Musculoskeletal Medicine Scientific committee.
Protocol Formats 2004
Systematic reviews in health care:
systematic reviews of evaluations of diagnostic and screening tests.
BMJ. 2001; 323: 157-162
Meta-analytic methods for diagnostic test accuracy.
J Clin Epidemiol. 1995; 48: 119-130
Some common problems in medical research.
in: Altman DG Practical statistics for medical research.
Chapman & Hall, London 1991: 396-439
Bigos S, Bower O, Braen G, et al.
Acute Lower Back Problems in Adults. Clinical Practice Guideline No. 14.
Rockville, MD: Agency for Health Care Policy and Research,
Public Health Service, U.S. Department of Health and Human Services; 1994
Psychosocial factors at work in relation to low back pain and consequences
of low back pain; a systematic, critical review of prospective cohort studies.
Occup Environ Med. 2004; 61: e2
De Vet HC
van Mameren H
The interexaminer reproducibility of physical examination of the cervical spine.
J Manipulative Physiol Ther. 2004; 27: 84-90
Interexaminer reliability in physical examination of the cervical spine.
J Manipulative Physiol Ther. 1999; 22: 511-516
Interexaminer reliability in physical examination of the neck.
J Manipulative Physiol Ther. 1997; 20: 516-520
Interexaminer reliability in physical examination of patients with low back pain.
Spine. 1997; 22: 814-820
Interexaminer reliability of eight evaluative dimensions of lumbar segmental abnormality.
J Manipulative Physiol Ther. 1990; 13: 463-470
Interexaminer reliability of observations in physical examinations of the neck.
Phys Ther. 1987; 67: 1526-1532
Reliability of palpation assessment in non-neutral dysfunctions of the lumbar spine.
Orthop Phys Ther Pract. 2004; 16: 23-26
Interrater reliability of clinical examination measures for
identification of lumbar segmental instability.
Arch Phys Med Rehabil. 2003; 84: 1858-1864
Can manipulative physiotherapists agree on which lumbar level to treat based on palpation?
Physiotherapy. 2003; 89: 74-81
Palpation of the upper thoracic spine—an observer reliability study.
J Manipulative Physiol Ther. 2002; 25: 285-292
Clinical tests on impairment level related to low back pain: a study of test reliability.
J Rehabil Med. 2002; 34: 176-182
The kinematics of motion palpation and its effect on
the reliability for cervical spine rotation.
J Manipulative Physiol Ther. 2002; 25: E7
Measurement challenges in physical diagnosis:
refining interrater palpation, perception and communication.
J Bodyw Mov Ther. 2001; 5: 245-253
Inter-examiner reliability of the Johnson and Friedman percussion scan of the thoracic spine.
J Osteopath Med. 2001; 4: 15-20
Reliability of chiropractic methods commonly used to detect
manipulable lesions in patients with chronic low-back pain.
J Manipulative Physiol Ther. 2000; 23: 231-238
Inter-examiner reliability in assessing passive intervertebral motion of the cervical spine.
Man Ther. 2000; 5: 97-101
van Suijlekom HA
de Vet HC
van den Berg SG
Interobserver reliability in physical examination of
the cervical spine in patients with headache.
Headache. 2000; 40: 581-586
Inter-examiner and intra-examiner reliability of standing flexion test.
Man Ther. 1999; 4: 87-93
Preliminary study of the reliability of assessment procedures for
indications for chiropractic adjustments of the lumbar spine.
J Manipulative Physiol Ther. 1999; 22: 382-389
van Neerbos K
van der Wurff P
Intraexaminer and interexaminer reliability of the Gillet test.
J Manipulative Physiol Ther. 1999; 22: 4-9
The relationships between spinal sagittal configuration, joint mobility,
general low back mobility and segmental mobility in female homecare personnel.
Scand J Rehabil Med. 1999; 31: 197-206
Upper cervical instability: are clinical tests reliable?
Man Ther. 1997; 2: 91-97
Inter-examiner reliability to detect painful upper cervical joint dysfunction.
Aust J Physiother. 1997; 43: 125-129
Counterstrain and traditional osteopathic examination
of the cervical spine compared.
J Bodyw Mov Ther. 1997; 1: 173-178
Interexaminer reliability of chiropractic evaluation for
cervical spine problems—a pilot study.
Chiropr J Aust. 1996; 5: 23-29
Reliability of manual end-play palpation of the thoracic spine.
Chiropr Tech. 1995; 7: 120-124
Interrater reliability of manual therapy assessment techniques.
Phys Ther Can. 1995; 47: 173-180
Interrater reliability of lumbar accessory motion mobility testing.
Phys Ther. 1995; 75: 786-792
Reliability in evaluating passive intervertebral motion of the lumbar spine.
J Man Manip Ther. 1995; 3: 135-143
Reliability of pain and stiffness assessments in clinical manual lumbar spine examination.
Phys Ther. 1994; 74: 801-809
Interexaminer reliability of palpation for cervical spine tenderness.
J Manipulative Physiol Ther. 1994; 17: 591-595
Intra- and interexaminer reliability of certain pelvic palpatory procedures
and the sitting flexion test for sacroiliac joint mobility and dysfunction.
J Neuromusculoskel Syst. 1994; 2: 65-69
Interexaminer reliability of eight evaluative dimensions of lumbar segmental abnormality: part II.
J Manipulative Physiol Ther. 1993; 16: 363-374
The role of experience in clinical accuracy.
J Manipulative Physiol Ther. 1990; 13: 68-71
Chiropractic examination procedures: a reliability and consistency study.
J Aust Chiropr Assoc. 1989; 19: 101-104
Reliability of motion palpation procedures to detect sacroiliac joint fixations.
J Manipulative Physiol Ther. 1989; 12: 86-92
Interexaminer concordance in detecting joint-play asymmetries in
the cervical spines of otherwise asymptomatic subjects.
J Manipulative Physiol Ther. 1989; 12: 428-433
Intra- and interobserver reliability of passive motion palpation of the lumbar spine.
J Manipulative Physiol Ther. 1989; 12: 440-445
Interexaminer reliability of palpatory evaluations of the lumbar spine.
Am J Chiropr Med. 1988; 1: 5-11
Inter- and intra-examiner reliability of palpation for sacroiliac joint dysfunction.
J Manipulative Physiol Ther. 1987; 10: 164-171
Inter- and intra-examiner reliability of motion palpation for the thoracolumbar spine.
J Manipulative Physiol Ther. 1987; 10: 1-4
An inter- and intra-examiner reliability study of motion palpation
of the lumbar spine in lateral flexion in the seated position.
Eur J Chiropr. 1986; 34: 121-141
Intra and interexaminer reliability of motion palpation in the cervical spine.
J Can Chiropr Assoc. 1985; 29: 195-199
Reliability study of detection of somatic dysfunctions in the cervical spine.
J Manipulative Physiol Ther. 1985; 8: 9-16
Intertester reliability for selected clinical tests of the sacroiliac joint.
Phys Ther. 1985; 65: 1671-1675
Interexaminer study of palpation in detecting location of spinal segmental dysfunction.
J Am Osteopath Assoc. 1983; 82: 839-845
Reliability in evaluating passive intervertebral motion.
Phys Ther. 1982; 62: 436-444
Reproducibility and interexaminer correlation of motion palpation
findings of the sacroiliac joints.
J Can Chiropr Assoc. 1980; 24: 59-69
Manual therapy rounds. A critical review of the literature on tests of the sacroiliac joint.
J Man Manip Ther. 1995; 3: 157-161
Inter-examiner reliability of motion palpation of the lumbar spine:
a review of quantitative literature.
Am J Chiropr Med. 1989; 2: 107-110
The reliability of lumbar motion palpation.
J Manipulative Physiol Ther. 1992; 15: 518-524
The reliability of reliability.
J Manipulative Physiol Ther. 1991; 14: 199-208
Humphreys BK, Delahaye M, Peterson CK:
An Investigation into the Validity of Cervical Spine Motion Palpation Using
Subjects with Congenital Block Vertebrae as a 'Gold Standard'
BMC Musculoskelet Disord 2004 (Jun 15); 5 (1): 19
van Deursen L
The value of some clinical tests of the sacro-iliac joint.
Man Med. 1990; 5: 96-99
Estimation of the reliability of skill tests.
Res Q. 1958; 29: 279-293
Efficacy of cervical endplay assessment as an indicator for spinal manipulation.
Spine. 2003; 28: 1091-1096
The measurement of observer agreement for categorical data.
Biometrics. 1977; 33: 159-174
Unravelling the fetal origins hypothesis: is there really an
inverse association between birthweight and subsequent blood pressure?
Lancet. 2002; 360: 659-665
Empirical evidence for selective reporting of outcomes in
randomized trials: comparison of protocols to published articles.
JAMA. 2004; 291: 2457-2465