Despite the expertise of the Task Force and our Scientific Advisors, opinion had to take a back seat to evidence. Thus, all conclusions and recommendations were based on scientifically admissible studies, when they were available. The general rules of evaluation of evidence were adopted in advance and fine-tuned to adapt them to the body of evidence as the work of the Task Force progressed. Experience in science, clinical judgment and well-reasoned opinion were not totally disregarded, but were always subordinate when admissible scientific evidence was available.

When confronting controversies of clinical diagnosis and management, a special challenge is to enable a process whereby methodologists are educated about the clinical issues and clinicians are trained in the design and analysis of experimental and non-experimental studies. Understanding the important issues is crucial when decisions about public health policy are under consideration. The strategy used by this Task Force to assemble valid data from the many published original studies evolved over two decades. First, it required adoption of criteria of eligibility for the type of publication to be considered. For instance, in this effort we eschewed review articles and reports of secondary analyses, except as background reading or as sources of references to primary reports. Only original research was eligible to be considered as scientific evidence.

As described in detail in later Sections, we standardized the process of evaluating original articles screened as eligible. This ensured that all important features were weighed carefully each time by reviewers. The various types of experimental and non-experimental studies required specific variants of the abstraction forms. The specific tactics and procedures were developed initially during the Canadian Task Force on the Periodic Health Examination, 26, 101 and the methods of selecting, weighing and synthesizing original data from multiple sources were refined during successive Task Forces (e.g., the New Brunswick Task Force on Reye's Syndrome and Environmental Risk Factors, 102 the Inter-University Task Force on Passive Smoking, 103 the Working Group on Low Osmolality/High Osmolality Contrast Media 55 and the Quebec Task Force on Vertebral Column Disorders in the Workplace 104). In the current effort, we further refined our methods, including extensive modifications of the forms for critical appraisal of articles.

In the educational field, Slavin coined the descriptive phrase "best evidence synthesis" and argued that a method of aggregating data is needed that avoids the constraints and pitfalls of meta-analysis on the one hand and the haphazardness of unstructured literature review on the other. 97 The key features of best evidence synthesis are: predetermined explicit criteria of quality for articles and for the type of data used in the aggregating process; a diligent search for relevant unpublished material; and presentation of the results as ranges of estimates of effect with probability statements linked to the boundaries of the ranges, if necessary. Meta-analysis, in contrast, seeks a single estimate of effect. Since both meta-analysis and best evidence synthesis are vulnerable to publication bias, 100 the search for unpublished material mentioned above is important in both types of undertaking.
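The contrast between a single pooled estimate and a range of estimates can be illustrated with a minimal sketch. It assumes a fixed-effect inverse-variance pooling rule as a stand-in for meta-analysis and reports the lowest and highest study estimates, each with its own 95% interval, as a stand-in for best evidence synthesis. The study values, the function names and the choice of a log relative-risk scale are hypothetical and are not drawn from Slavin's description or from the Task Force's data.

```python
import math

# Hypothetical (estimate, standard error) pairs on a log relative-risk scale.
# These numbers are invented for illustration only; they come from no study.
studies = [(-0.30, 0.15), (-0.10, 0.20), (-0.45, 0.25)]


def pooled_estimate(studies):
    """Fixed-effect inverse-variance pooled estimate and its standard error."""
    weights = [1.0 / se ** 2 for _, se in studies]
    total = sum(weights)
    pooled = sum(w * est for w, (est, _) in zip(weights, studies)) / total
    return pooled, math.sqrt(1.0 / total)


def estimate_range(studies, z=1.96):
    """Lowest and highest point estimates, each with its own 95% interval."""
    def interval(study):
        est, se = study
        return (est - z * se, est + z * se)

    low = min(studies, key=lambda s: s[0])
    high = max(studies, key=lambda s: s[0])
    return {"low": (low[0], interval(low)), "high": (high[0], interval(high))}


pooled, pooled_se = pooled_estimate(studies)
summary = estimate_range(studies)
print(f"single pooled estimate: {pooled:.3f} (SE {pooled_se:.3f})")
print(f"range of estimates: {summary['low'][0]:.2f} to {summary['high'][0]:.2f}")
```

The pooled figure compresses the studies into one number, whereas the range keeps the spread between studies visible, which is the property the best evidence synthesis approach is intended to preserve.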

Over the three years of deliberations of the Task Force, the evidence was found to be sparse and generally of unacceptable quality. The original research articles in the literature strained our capacity to adhere strictly to best evidence synthesis methodology. However, the following important elements of the method were retained: (a) guidelines on the type of research papers that could be considered, (b) a diligent search for relevant unpublished articles, (c) a structured critical appraisal with predetermined checklists and rating scales and (d) an unwillingness to overinterpret the synthesized evidence with single estimates of effect. At the same time, applying a priori operational criteria of quality when accepting or rejecting studies could have resulted in rejection of virtually all articles considered. We therefore used judgment to identify valid and useful components of published reports which, taken as a whole, would not have met conventional standards. For this reason, this Scientific Monograph presents qualitative descriptions of the aggregate data, rather than ranges of estimates of effect.