Note from the National Guideline Clearinghouse (NGC): This guideline was developed by the National Clinical Guideline Centre (NCGC) on behalf of the National Institute for Health and Care Excellence (NICE). See the "Availability of Companion Documents" field for the full version of this guidance.
Evidence of Effectiveness
The evidence was reviewed following the steps shown schematically in Figure 1 in the full version of the guideline:
- Relevant studies were critically appraised using the appropriate checklist as specified in the Guidelines Manual. For diagnostic questions, the QUADAS-2 checklist was followed (http://www.bris.ac.uk/quadas/quadas-2/).
- Key information was extracted on the study's methods, PICO (patients, intervention, comparison and outcome) factors and results. These were presented in summary tables (in each review chapter of the full guideline) and evidence tables (in Appendix G in the full version of the guideline [see the "Availability of Companion Documents" field]).
- Summaries of evidence were generated by outcome (included in the relevant review protocols) and were presented in Guideline Development Group (GDG) meetings:
- Randomised studies: data were meta-analysed where appropriate and reported in Grading of Recommendations, Assessment, Development and Evaluation (GRADE) profiles (for intervention reviews).
- Observational studies: data were presented as a range of values in GRADE profiles.
- Prognostic studies: data were presented as a range of values, usually in terms of the relative effect as reported by the authors.
- Diagnostic studies were presented as measures of diagnostic test accuracy (sensitivity, specificity, positive and negative predictive value). Coupled values of sensitivity and specificity were summarised in a receiver operating characteristic (ROC) curve to allow visual comparison between different index tests (plotting data at different thresholds) and to investigate heterogeneity more effectively where data were reported at the same thresholds. Diagnostic meta-analyses were carried out whenever data from at least 5 studies were available. See Chapter 1 and Appendix J in the full version of the guideline for details.
A 20% sample of each of the above stages of the reviewing process was quality assured by a second reviewer to eliminate any potential reviewer bias or error.
Methods of Combining Clinical Studies
Data Synthesis for Intervention Reviews
Where possible, meta-analyses were conducted to combine the results of studies for each review question using Cochrane Review Manager (RevMan5) software. Fixed-effects (Mantel-Haenszel) techniques were used to calculate risk ratios (relative risk) for the binary outcomes.
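As an illustration of the calculation this software performs for binary outcomes, the sketch below pools risk ratios using Mantel-Haenszel fixed-effect weights. The function name and the trial counts are invented for the example; this is a minimal sketch, not the RevMan5 implementation.

```python
# Minimal sketch of Mantel-Haenszel fixed-effect pooling of risk ratios
# for binary outcomes. Study data below are illustrative only.

def mantel_haenszel_rr(studies):
    """Pool risk ratios across studies with the Mantel-Haenszel method.

    Each study is a tuple (events_tx, total_tx, events_ctrl, total_ctrl).
    Returns the pooled risk ratio (treatment vs control).
    """
    numerator = 0.0
    denominator = 0.0
    for a, n1, c, n2 in studies:
        n = n1 + n2                      # total participants in the study
        numerator += a * n2 / n          # weighted events in the treatment arm
        denominator += c * n1 / n        # weighted events in the control arm
    return numerator / denominator

# Hypothetical example: three trials of an intervention vs control
trials = [(12, 100, 20, 100), (8, 150, 15, 148), (30, 200, 45, 210)]
print(f"Pooled risk ratio (MH, fixed effect): {mantel_haenszel_rr(trials):.2f}")
```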
For continuous outcomes, measures of central tendency (mean) and variation (standard deviation) were required for meta-analysis. Data for continuous outcomes were analysed using an inverse variance method for pooling weighted mean differences and, where the studies used different scales, standardised mean differences were used. The generic inverse variance option in RevMan5 was used if any studies reported solely summary statistics and 95% confidence intervals (95% CIs) or standard errors; this included any hazard ratios reported. When the only evidence available was from studies that summarised results as medians (and interquartile ranges), or gave only p values, this information was assessed in terms of the study's sample size and was included in the GRADE tables without calculating relative or absolute effects. Consequently, aspects of quality assessment such as imprecision of effect could not be assessed for evidence of this type. Where reported, time-to-event data were presented as hazard ratios.
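The following sketch illustrates fixed-effect inverse-variance pooling of mean differences for a continuous outcome. The summary statistics are invented, and the simple normal-approximation 95% CI (±1.96 standard errors) is an assumption of the example rather than a description of the RevMan5 implementation.

```python
# Minimal sketch of fixed-effect inverse-variance pooling of mean differences
# for continuous outcomes; all values are illustrative.
import math

def pooled_mean_difference(studies):
    """Each study: (mean_tx, sd_tx, n_tx, mean_ctrl, sd_ctrl, n_ctrl).
    Returns the pooled mean difference and its 95% confidence interval."""
    weights, mds = [], []
    for m1, sd1, n1, m2, sd2, n2 in studies:
        md = m1 - m2
        se = math.sqrt(sd1**2 / n1 + sd2**2 / n2)   # standard error of the MD
        weights.append(1.0 / se**2)                  # inverse-variance weight
        mds.append(md)
    pooled = sum(w * md for w, md in zip(weights, mds)) / sum(weights)
    se_pooled = math.sqrt(1.0 / sum(weights))
    return pooled, (pooled - 1.96 * se_pooled, pooled + 1.96 * se_pooled)

md, ci = pooled_mean_difference([(5.1, 2.0, 60, 6.3, 2.2, 58),
                                 (4.8, 1.8, 40, 5.9, 2.1, 42)])
print(f"Pooled MD: {md:.2f}, 95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")
```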
Stratified analyses were predefined for some review questions at the protocol stage, where the GDG identified that the strata differed in their biological and clinical characteristics and the interventions were expected to have a different effect in these subpopulations. It was decided at the outset that acute heart failure refers to distinct subpopulations, i.e., acute heart failure with pulmonary oedema, cardiogenic shock, acute right-sided heart failure, and acute decompensated chronic heart failure.
Statistical heterogeneity was assessed by visually examining the forest plots, and by considering the chi-squared test for significance at p<0.1 or an I-squared inconsistency statistic (with an I-squared value of more than 50% indicating considerable heterogeneity). Where considerable heterogeneity was present, the authors carried out predefined subgroup analyses – see protocols in Appendix C in the full version of the guideline.
Assessments of potential differences in effect between subgroups were based on the chi-squared test for subgroup heterogeneity. If no sensitivity analysis was found to completely resolve statistical heterogeneity, then a random-effects (DerSimonian and Laird) model was employed to provide a more conservative estimate of the effect.
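As a worked illustration of the heterogeneity statistics and the DerSimonian and Laird estimator described above, the sketch below computes Q, I-squared and a random-effects pooled estimate from hypothetical study log risk ratios and standard errors; it is a minimal sketch, not the RevMan5 code.

```python
# Minimal sketch of heterogeneity statistics (Q, I-squared) and the
# DerSimonian-Laird random-effects model; study estimates are illustrative.
import math

def dersimonian_laird(effects, ses):
    """effects: study effect estimates (e.g. log risk ratios); ses: their SEs."""
    w = [1.0 / se**2 for se in ses]                      # fixed-effect weights
    fixed = sum(wi * ei for wi, ei in zip(w, effects)) / sum(w)
    q = sum(wi * (ei - fixed) ** 2 for wi, ei in zip(w, effects))
    df = len(effects) - 1
    i_squared = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    tau2 = max(0.0, (q - df) / (sum(w) - sum(wi**2 for wi in w) / sum(w)))
    w_re = [1.0 / (se**2 + tau2) for se in ses]          # random-effects weights
    pooled = sum(wi * ei for wi, ei in zip(w_re, effects)) / sum(w_re)
    se_pooled = math.sqrt(1.0 / sum(w_re))
    return pooled, se_pooled, i_squared

log_rrs = [math.log(0.80), math.log(0.65), math.log(1.05)]
ses = [0.15, 0.20, 0.18]
pooled, se, i2 = dersimonian_laird(log_rrs, ses)
print(f"Random-effects RR: {math.exp(pooled):.2f}, I-squared: {i2:.0f}%")
```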
For interpretation of the binary outcome results, differences in the absolute event rate were calculated using the GRADEpro software, for the median event rate across the control arms of the individual studies in the meta-analysis. Absolute risk differences were presented in the GRADE profiles and in clinical summary of findings tables, for discussion with the GDG.
For binary outcomes, absolute event rates were also calculated using the GRADEpro software using event rate in the control arm of the pooled results.
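The sketch below illustrates how an absolute risk difference can be derived from a pooled risk ratio and the median control-arm event rate, mirroring the GRADEpro calculation described above. The event rates, pooled risk ratio and confidence interval are invented for the example.

```python
# Minimal sketch of an absolute risk difference derived from a pooled risk
# ratio and the median control-arm event rate; numbers are illustrative.
from statistics import median

control_rates = [45/200, 30/150, 52/210]         # event rates in each control arm
pooled_rr = 0.75                                  # pooled risk ratio (illustrative)
rr_ci = (0.60, 0.94)                              # its 95% confidence interval

baseline = median(control_rates)                  # median control group risk
ard = baseline * (1 - pooled_rr)                  # absolute risk reduction
ard_ci = tuple(baseline * (1 - r) for r in rr_ci)

print(f"About {round(1000 * ard)} fewer events per 1000 "
      f"(from {round(1000 * max(ard_ci))} fewer to {round(1000 * min(ard_ci))} fewer)")
```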
Data Synthesis for Prognostic Factor Reviews
Odds ratios (ORs), risk ratios (RRs) or hazard ratios (HRs), with their 95% confidence intervals (95% CIs), for the effect of the pre-specified prognostic factors were extracted from the papers. Studies at lower risk of bias were preferred, taking into account the analysis and the study design. In particular, prospective cohort studies that reported multivariable analyses, including the key confounders identified by the GDG at the protocol stage for that outcome, were preferred. A narrative summary of results from univariate analyses was also given, highlighting their very high risk of bias, since the true effect remains uncertain when potential confounders are not controlled for. Data were not combined in meta-analyses for prognostic studies.
Data Synthesis for Diagnostic Test Accuracy Review
Data and Outcomes
For the reviews of diagnostic test accuracy, a positive result on the index test was recorded if the patient had values of the measured quantity above a threshold value, and different thresholds could be used. Diagnostic test accuracy measures used in the analysis were: sensitivity, specificity, positive and negative predictive value, and area under the ROC curve. The threshold of a diagnostic test is defined as the value at which the test can best differentiate between those with and without the target condition; in practice, the thresholds used varied amongst studies. In the one diagnostic review for this guideline, sensitivity was given more importance than specificity since natriuretic peptide testing is used as a 'rule out' test. This means that the test is carried out to minimise false negative results. The GDG defined the clinically relevant natriuretic peptide thresholds to be used in the analysis based on the thresholds described in the current European heart failure guideline (see Chapter 5.1 in the full version of the guideline for details).
Coupled forest plots of sensitivity and specificity with their 95% CIs across studies (at various thresholds) were produced for each test, using RevMan5. In order to do this, 2x2 tables (the number of true positives, false positives, true negatives and false negatives) were directly taken from the study if given, or else were derived from raw data or calculated from the set of test accuracy statistics (calculated 2x2 tables can be found in Appendix I in the full version of the guideline).
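For illustration, the sketch below derives the accuracy measures listed above from a single hypothetical 2x2 table; the counts are invented for the example.

```python
# Minimal sketch of diagnostic accuracy measures derived from a 2x2 table
# (true/false positives and negatives); counts below are illustrative.

def accuracy_measures(tp, fp, fn, tn):
    return {
        "sensitivity": tp / (tp + fn),    # true positive rate
        "specificity": tn / (tn + fp),    # true negative rate
        "ppv": tp / (tp + fp),            # positive predictive value
        "npv": tn / (tn + fn),            # negative predictive value
    }

# Hypothetical study: 90 true positives, 40 false positives,
# 10 false negatives, 160 true negatives at a given threshold
for name, value in accuracy_measures(90, 40, 10, 160).items():
    print(f"{name}: {value:.2f}")
```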
To allow comparison between tests, summary ROC curves (by type of natriuretic peptide and by threshold level) were generated for each diagnostic test from the pairs of sensitivity and specificity calculated from the 2x2 tables, selecting one threshold per study. A ROC plot shows the true positive rate (sensitivity) as a function of the false positive rate (1 minus specificity). Data were entered into RevMan5 and ROC curves were fitted using the Moses-Littenberg approach. In order to compare diagnostic tests, 2 or more tests were plotted on the same graph. The performance of the different diagnostic tests was then assessed by examining the summary ROC curves visually: the test with a curve lying closest to the upper left corner (100% sensitivity and 100% specificity) was interpreted as the best test.
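A minimal sketch of the Moses-Littenberg approach is given below: for each study, D = logit(TPR) minus logit(FPR) and S = logit(TPR) plus logit(FPR) are calculated, a straight line D = a + bS is fitted by ordinary least squares, and the fitted line is mapped back to ROC space to give expected sensitivity at a chosen false positive rate. The study counts and the 0.5 continuity correction are assumptions of the example, not details taken from the guideline.

```python
# Minimal sketch of the Moses-Littenberg summary ROC approach.
# Study 2x2 counts are illustrative; a 0.5 continuity correction avoids zero cells.
import math

def logit(p):
    return math.log(p / (1 - p))

def moses_littenberg(studies):
    """studies: list of (tp, fp, fn, tn). Returns (intercept a, slope b)."""
    d_vals, s_vals = [], []
    for tp, fp, fn, tn in studies:
        tpr = (tp + 0.5) / (tp + fn + 1)             # continuity-corrected rates
        fpr = (fp + 0.5) / (fp + tn + 1)
        d_vals.append(logit(tpr) - logit(fpr))
        s_vals.append(logit(tpr) + logit(fpr))
    n = len(studies)
    s_mean, d_mean = sum(s_vals) / n, sum(d_vals) / n
    b = (sum((s - s_mean) * (d - d_mean) for s, d in zip(s_vals, d_vals))
         / sum((s - s_mean) ** 2 for s in s_vals))    # ordinary least squares slope
    a = d_mean - b * s_mean
    return a, b

def sroc_sensitivity(fpr, a, b):
    """Summary ROC curve: expected sensitivity at a given false positive rate."""
    y = (a + (1 + b) * logit(fpr)) / (1 - b)
    return 1 / (1 + math.exp(-y))

a, b = moses_littenberg([(90, 40, 10, 160), (70, 25, 20, 180), (55, 15, 30, 200)])
print(f"Sensitivity at FPR 0.2: {sroc_sensitivity(0.2, a, b):.2f}")
```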
A second analysis was conducted on studies that used two types of natriuretic peptides in the same study population. Results were plotted on one graph indicating paired results for each study. Paired results could show whether one peptide performed consistently better within study populations.
For those studies that reported area under the ROC curve (AUC) data, these were also plotted on a graph, for each diagnostic test. The AUC describes the overall diagnostic accuracy across the full range of thresholds. The GDG agreed on the following criteria for AUC:
- ≤0.50: worse than chance
- 0.50–0.60: very poor
- 0.61–0.70: poor
- 0.71–0.80: moderate
- 0.81–0.90: good
- 0.91–1.00: excellent or perfect test
Heterogeneity or inconsistency amongst studies was visually inspected in the forest plots where appropriate (only when there were similar thresholds).
When data from 5 or more studies were available, a diagnostic meta-analysis was carried out. Study results were pooled using the bivariate method for the direct estimation of summary sensitivity and specificity using a random-effects approach (in WinBUGS® software; for the program code see Appendix J in the full version of the guideline). This model also assesses the variability between studies by incorporating the precision with which sensitivity and specificity were measured in each study. A confidence ellipse was shown on the graph, indicating the confidence region around the summary sensitivity/specificity point.
Appraising the Quality of Evidence by Outcomes
The evidence for outcomes from the included RCTs and, where appropriate, observational studies was evaluated and presented using an adaptation of the 'Grading of Recommendations Assessment, Development and Evaluation (GRADE) toolbox' developed by the international GRADE working group (http://www.gradeworkinggroup.org/). The software developed by the GRADE working group (GRADEpro) was used to assess the quality of each outcome, taking into account individual study quality factors and the meta-analysis results. Results were presented in GRADE profiles ('GRADE tables'), which consist of 2 sections: the 'Clinical evidence profile' table includes details of the quality assessment, while the 'Clinical evidence summary of findings' table includes pooled outcome data, where appropriate, an absolute measure of intervention effect and the summary of the quality of evidence for that outcome. In this table, the columns for intervention and control indicate summary measures and measures of dispersion (such as mean and standard deviation, or median and range) for continuous outcomes, and the frequency of events (n/N: the sum across studies of the number of patients with events divided by the sum of the number of completers) for binary outcomes. Reporting or publication bias was only taken into consideration in the quality assessment, and included in the 'Clinical evidence profile' table, if it was apparent.
The evidence for each outcome was examined separately for the quality elements listed and defined in Table 2 in the full version of the guideline. Each element was graded using the quality levels listed in Table 3 in the full version of the guideline. For each of these quality elements, evidence for each outcome was downgraded where applicable using these levels.
The main criteria considered in the rating of these elements are discussed below. Footnotes were used to describe reasons for grading a quality element as having serious or very serious problems.
The ratings for each component were summed to obtain an overall assessment for each outcome. The grades described above lead to an overall quality rating as described in the "Rating Scheme for the Strength of the Evidence" field. For example, if the quality element 'risk of bias' was downgraded twice and 'imprecision' once, an overall rating of 'Very low' was given for this outcome, and any further low or high ratings in other quality elements would not change this rating.
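A minimal sketch of this summation is shown below, assuming a simple four-level scale and a list of points deducted per quality element; the function name and element ordering are illustrative.

```python
# Minimal sketch of the GRADE summation: the overall rating is obtained by
# deducting the total number of downgrades from the starting quality level.
GRADE_LEVELS = ["Very low", "Low", "Moderate", "High"]

def overall_quality(start, downgrades_per_element):
    """start: 'High' for RCTs or 'Low' for observational studies;
    downgrades_per_element: points deducted for each quality element
    (e.g. risk of bias, inconsistency, indirectness, imprecision, publication bias)."""
    level = GRADE_LEVELS.index(start) - sum(downgrades_per_element)
    return GRADE_LEVELS[max(level, 0)]        # the rating cannot fall below 'Very low'

# Example from the text: risk of bias downgraded twice, imprecision once
print(overall_quality("High", [2, 0, 0, 1, 0]))   # -> 'Very low'
```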
The GRADE toolbox is currently designed only for intervention reviews using randomised trials and observational studies, but the authors adapted the quality assessment elements and outcome presentation for diagnostic accuracy studies.
Grading the Quality of Clinical Evidence
After results were pooled, the overall quality of evidence for each outcome was considered. The following procedure was adopted when using GRADE:
- A quality rating was assigned, based on the study design. RCTs start as High, observational studies as Low, and uncontrolled case series as Low or Very low.
- The rating was then downgraded for the specified criteria: risk of bias (study limitations), inconsistency, indirectness, imprecision and publication bias. These criteria are detailed in the full version of the guideline. Evidence from observational studies (which had not previously been downgraded) was upgraded if there was: a large magnitude of effect, a dose–response gradient, and if all plausible confounding would reduce a demonstrated effect or suggest a spurious effect when results showed no effect. Each quality element considered to have 'serious' or 'very serious' risk of bias was rated down by 1 or 2 points respectively.
- The downgraded or upgraded marks were then summed and the overall quality rating was revised. For example, all RCTs started as High and the overall quality became Moderate, Low or Very low if 1, 2 or 3 points were deducted respectively.
- The reasons or criteria used for downgrading were specified in the footnotes.
The details of the criteria used for each of the main quality elements are discussed further in the Sections 3.3.6 to 3.3.10 in the full version of the guideline.
Assessing Clinical Importance (Benefit, Harm or No Difference)
The GDG assessed the evidence by outcome in order to determine if there was, or potentially was, a clinically important benefit, a clinically important harm or no clinically important difference between interventions. To facilitate this, binary outcomes were converted into absolute risk differences (ARDs) using GRADEpro software: the median control group risk across studies was used to calculate the ARD and its 95% CI from the pooled risk ratio.
The GDG considered a minimal important difference (MID) based on the point estimate of the absolute effect for intervention studies. For all outcomes, the GDG used the robustness of the evidence (i.e., the GRADE rating), as well as the absolute effect (if positive) of the outcome of interest, to decide whether the intervention could be considered beneficial for this outcome. The same point estimate, but in the opposite direction, would apply if the outcome was negative.
This assessment was carried out by the GDG for each critical outcome, and an evidence summary table was produced to compile the GDG's assessments of clinical importance per outcome, alongside the evidence quality.
Evidence statements are summary statements that are presented after the GRADE profiles, summarising the key features of the clinical effectiveness evidence presented. The wording of the evidence statements reflects the certainty or uncertainty in the estimate of effect. The evidence statements are presented by outcome and encompass the following key features of the evidence:
- The number of studies and the number of participants for a particular outcome
- A brief description of the participants
- An indication of the direction of effect (if one treatment is beneficial or harmful compared to the other, or whether there is no difference between the 2 tested treatments)
- A description of the overall quality of evidence (GRADE overall quality)
Evidence of Cost-effectiveness
The GDG is required to make decisions based on the best available evidence of both clinical and cost-effectiveness. Guideline recommendations should be based on the expected costs of the different options in relation to their expected health benefits (that is, their 'cost-effectiveness') rather than the total implementation cost. Thus, if the evidence suggests that a strategy provides significant health benefits at an acceptable cost per patient treated, it should be recommended even if it would be expensive to implement across the whole population.
Evidence on cost-effectiveness related to the key clinical issues being addressed in the guideline was sought. The health economist:
- Undertook a systematic review of the published economic literature
- Undertook a new cost-effectiveness analysis to cover priority areas
The health economist:
- Critically appraised relevant studies using the economic evaluations checklist as specified in the Guidelines Manual
- Extracted key information about the studies' methods and results into evidence tables (included in Appendix H in the full version of the guideline)
- Generated summaries of the evidence in NICE economic evidence profiles (included in the relevant chapter for each review question)
NICE Economic Evidence Profiles
The NICE economic evidence profile has been used to summarise cost and cost-effectiveness estimates. The economic evidence profile shows an assessment of applicability and methodological quality for each economic evaluation, with footnotes indicating the reasons for the assessment. These assessments were made by the health economist using the economic evaluation checklist from the Guidelines Manual. It also shows the incremental costs, incremental effects (for example, quality-adjusted life years [QALYs]) and incremental cost-effectiveness ratio for the base case analysis in the evaluation, as well as information about the assessment of uncertainty in the analysis. See Appendix H in the full version of the guideline for more details.
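For illustration, the sketch below computes an incremental cost-effectiveness ratio of the kind reported in the base case of an economic evidence profile; the costs and QALYs for the two strategies are purely hypothetical.

```python
# Minimal sketch of an incremental cost-effectiveness ratio (ICER) calculation;
# the costs and QALYs below are illustrative, not taken from the guideline.
def icer(cost_new, qalys_new, cost_comparator, qalys_comparator):
    """Incremental cost per additional quality-adjusted life year (QALY)."""
    return (cost_new - cost_comparator) / (qalys_new - qalys_comparator)

# Hypothetical strategies: a new strategy vs current practice
print(f"ICER: £{icer(12500, 6.10, 10200, 5.90):,.0f} per QALY gained")
```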
If a non-United Kingdom (UK) study was included in the profile, the results were converted into pounds sterling using the appropriate purchasing power parity.