Diagnostic Performance of Ultrasound-Based Risk Stratification Systems for Thyroid Nodules: A Systematic Review and Meta-Analysis
Abstract
Background
This study investigated the diagnostic performance of biopsy criteria in four society ultrasonography risk stratification systems (RSSs) for thyroid nodules, including the 2021 Korean (K)-Thyroid Imaging Reporting and Data System (TIRADS).
Methods
The Ovid-MEDLINE, Embase, Cochrane, and KoreaMed databases were searched and a manual search was conducted to identify original articles investigating the diagnostic performance of biopsy criteria for thyroid nodules (≥1 cm) in four widely used society RSSs.
Results
Eleven articles were included. The pooled sensitivity and specificity were 82% (95% confidence interval [CI], 74% to 87%) and 60% (95% CI, 52% to 67%) for the American College of Radiology (ACR)-TIRADS, 89% (95% CI, 85% to 93%) and 34% (95% CI, 26% to 42%) for the American Thyroid Association (ATA) system, 88% (95% CI, 81% to 92%) and 42% (95% CI, 22% to 67%) for the European (EU)-TIRADS, and 96% (95% CI, 94% to 97%) and 21% (95% CI, 17% to 25%) for the 2016 K-TIRADS. The sensitivity and specificity were 76% (95% CI, 74% to 79%) and 50% (95% CI, 49% to 52%) for the 2021 K-TIRADS1.5 (1.5-cm size cut-off for intermediate-suspicion nodules). The pooled unnecessary biopsy rates of the ACR-TIRADS, ATA system, EU-TIRADS, and 2016 K-TIRADS were 41% (95% CI, 32% to 49%), 65% (95% CI, 56% to 74%), 68% (95% CI, 60% to 75%), and 79% (95% CI, 74% to 83%), respectively. The unnecessary biopsy rate was 50% (95% CI, 47% to 53%) for the 2021 K-TIRADS1.5.
Conclusion
The unnecessary biopsy rate of the 2021 K-TIRADS1.5 was substantially lower than that of the 2016 K-TIRADS and comparable to that of the ACR-TIRADS. The 2021 K-TIRADS may help reduce potential harm due to unnecessary biopsies.
INTRODUCTION
The management of thyroid nodules has become a topic of debate worldwide with the increasing incidence of thyroid carcinomas and increasing number of thyroid incidentalomas [1-3]. Ultrasonography (US) is the standard imaging modality for evaluating thyroid nodules, and many professional societies have proposed US-based risk stratification systems (RSSs) or Thyroid Imaging Reporting and Data Systems (TIRADSs) for thyroid nodules [4-11]. Though these systems may share the purpose of optimally discriminating malignancy based on US findings, they have different structures for the risk stratification of nodules (pattern-based or point-based systems) and different size cut-offs for biopsy. RSSs are used for triage to select patients for US-guided biopsy and to rule out thyroid malignancy. As triage tests, RSSs play a role in reducing unnecessary nodule biopsies and require an appropriate sensitivity for thyroid malignancy [12]. Therefore, although many studies have evaluated the diagnostic performance of various RSSs or TIRADSs by using the thresholds for classifying nodules into categories [13,14], the diagnostic performance in real-world practice needs to be assessed using the biopsy criteria of each RSS or TIRADS.
A tendency for overdiagnosis leading to overtreatment has been noted in recent years, and the need to reduce the unnecessary biopsy rate is increasingly emphasized. Therefore, many studies have evaluated the unnecessary biopsy rate as an important index in diagnostic performance [15-19]. The recently updated 2021 K-TIRADS [11] raised the size cut-offs for biopsy for low and intermediate-suspicion nodules to reduce the unnecessary biopsy rate because previous studies had shown that the 2016 K-TIRADS afforded notably high sensitivity for malignancy, but had a high rate of unnecessary biopsies [15-17,20,21].
Therefore, this study aimed to evaluate the diagnostic performance of biopsy criteria in four widely used society RSSs, including the American College of Radiology (ACR)-TIRADS, the American Thyroid Association (ATA) system, the European (EU)-TIRADS, and the 2016/2021 Korean (K)-TIRADS.
METHODS
This systematic review and meta-analysis followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines [22].
Literature search strategy
The Ovid-MEDLINE, Embase, Cochrane, and KoreaMed databases were systematically searched through September 7, 2022, using the following search terms: [(thyroid)] AND [(cancer) OR (carcinoma) OR (tumor) OR (neoplasm)] AND [(ultrasonography) OR (sonography) OR (ultrasonic) OR (ultrasound)] AND [(screen) OR (detect) OR (early diagnosis) OR (sensitivity) OR (specificity)]. We included studies published in English. Two thyroid radiologists (L.J. and M.K.L.), with 8 and 9 years of experience, respectively, independently searched the literature and selected relevant articles. Any disagreements were resolved by consensus after discussion with a third reviewer (D.G.N.) with 23 years of experience.
Inclusion and exclusion criteria
The inclusion criteria were as follows: (1) population: adult patients who underwent thyroid US and had thyroid nodules ≥1 cm; (2) index test: US RSSs (ACR-TIRADS [9], ATA system [6], EU-TIRADS [8], 2016 K-TIRADS [7], and 2021 K-TIRADS [11]); (3) reference standard: cytopathologic diagnosis (fine-needle aspiration, core needle biopsy, or surgery) with or without imaging follow-up; (4) outcomes: sensitivity, specificity, and unnecessary biopsy rate; and (5) study design: all observational (retrospective or prospective) original articles.
The exclusion criteria were as follows: (1) studies that did not use RSSs; (2) studies without sufficient data to calculate the diagnostic performance for nodules (≥1 cm) based on the estimated true-positive, true-negative, false-positive, and false-negative rates, according to any of the ACR-TIRADS, ATA system, EU-TIRADS, 2016 K-TIRADS, and 2021 K-TIRADS; (3) the presence of a further size limitation for inclusion other than ≥1 cm; (4) studies with a suspected overlapping population or data (in the case of overlap, the study with the larger cohort was included); (5) review articles, case reports, editorials, letters, and conference abstracts; and (6) studies for which the full text was not available in English.
Data extraction
A structured form was used to extract the following information: (1) study characteristics: first author, year of publication, country where each study was performed, study design (prospective/retrospective; single/multicenter), study period, and reference standard; (2) demographic and clinical characteristics: numbers of total and male patients, mean age and range of included patients, numbers of total and malignant nodules, mean size and range of the included nodules; (3) RSSs (ACR-TIRADS, ATA system, EU-TIRADS, 2016 K-TIRADS, and 2021 K-TIRADS); and (4) outcomes: diagnostic performance of biopsy criteria in RSSs, including sensitivity, specificity, and the unnecessary biopsy rate. For the 2021 K-TIRADS, with a range of cut-off sizes for biopsy of intermediate-suspicion nodules (1.0 to 1.5 cm), the diagnostic performance of biopsy criteria was recorded separately as 2021 K-TIRADS1.0 and 2021 K-TIRADS1.5. The unnecessary biopsy rate was defined as the proportion of biopsy-confirmed benign nodules (false-positive) among all benign nodules (false-positive+true-negative), which also can be calculated as 1-specificity.
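The outcome definitions above can be sketched in a few lines of Python. This is only an illustration: the class name, field names, and counts are hypothetical and are not taken from any included study. A nodule is treated as "positive" when it meets the biopsy criteria of a given RSS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BiopsyTable:
    """2x2 table for one RSS; 'positive' means the nodule meets the biopsy criteria."""
    tp: int  # malignant nodules meeting the biopsy criteria
    fp: int  # benign nodules meeting the biopsy criteria
    fn: int  # malignant nodules not meeting the biopsy criteria
    tn: int  # benign nodules not meeting the biopsy criteria

    def sensitivity(self) -> float:
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        return self.tn / (self.tn + self.fp)

    def unnecessary_biopsy_rate(self) -> float:
        # Benign nodules indicated for biopsy among all benign nodules.
        return self.fp / (self.fp + self.tn)


# Hypothetical counts, for illustration only.
table = BiopsyTable(tp=90, fp=300, fn=10, tn=200)
print(table.sensitivity(), table.specificity(), table.unnecessary_biopsy_rate())
```

With this definition, the unnecessary biopsy rate and the specificity always sum to 1, which is why the two can be reported interchangeably.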
Quality assessment
Two reviewers (L.J. and M.K.L.), with 8 and 9 years of experience in thyroid radiology, respectively, independently extracted the data and performed the quality assessment. The quality of the included studies was evaluated using the Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2) [23]. Any disagreement was resolved by consensus after discussion with a third reviewer (D.G.N.) with 23 years of experience.
Data synthesis and analysis
The primary outcome of this meta-analysis was the diagnostic performance of each US RSS for thyroid nodules. Using random-effects modeling, the pooled sensitivity and specificity with 95% confidence intervals (CIs) were estimated from individual studies. Hierarchical summary receiver operating characteristic (HSROC) curves with 95% CIs and prediction regions were plotted. Publication bias was evaluated using Deeks’ funnel plot, and its statistical significance was assessed with Deeks’ asymmetry test [24]. A secondary outcome was the unnecessary biopsy rate, which was defined as the proportion of biopsy-confirmed benign nodules among all benign nodules. For meta-analytic pooling of the unnecessary biopsy rate, the inverse variance method was used to calculate weights, and the 95% CIs were obtained using DerSimonian-Laird random-effects modeling [25]. The Higgins I2 statistic was used to determine the heterogeneity (I2=0% to 40%, insignificant heterogeneity; 30% to 60%, moderate heterogeneity; 50% to 90%, substantial heterogeneity; and 75% to 100%, considerable heterogeneity) [26].
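As a rough sketch of the pooling step described above, the following Python function combines study-level proportions with inverse-variance weights, a DerSimonian-Laird estimate of the between-study variance, and the Higgins I2 statistic. It pools raw proportions with a normal-approximation CI, which is an assumption made here for brevity; the text does not state whether a transformation was applied, and the actual analysis was run in STATA. The function name and all input values are hypothetical.

```python
import math


def dersimonian_laird(estimates, variances):
    """Pool study estimates with inverse-variance weights and a DL tau^2."""
    k = len(estimates)
    w = [1.0 / v for v in variances]  # fixed-effect (inverse-variance) weights
    fixed = sum(wi * y for wi, y in zip(w, estimates)) / sum(w)
    q = sum(wi * (y - fixed) ** 2 for wi, y in zip(w, estimates))  # Cochran's Q
    df = k - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0  # between-study variance
    w_re = [1.0 / (v + tau2) for v in variances]  # random-effects weights
    pooled = sum(wi * y for wi, y in zip(w_re, estimates)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0  # Higgins I^2 in percent
    return {
        "pooled": pooled,
        "ci": (pooled - 1.96 * se, pooled + 1.96 * se),
        "tau2": tau2,
        "i2": i2,
    }


# Hypothetical study-level unnecessary biopsy rates and benign-nodule counts.
rates = [0.35, 0.45, 0.50, 0.38]
n_benign = [400, 250, 600, 150]
variances = [p * (1 - p) / n for p, n in zip(rates, n_benign)]  # binomial variance
result = dersimonian_laird(rates, variances)
print(result)
```

The returned I2 value can then be read against the ranges listed above to label the heterogeneity.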
Heterogeneity caused by a threshold effect was visually assessed using the coupled forest plots of pooled sensitivity and specificity. In addition, the threshold effect, which manifests as a positive correlation between sensitivity and the false-positive rate, was quantified; a Spearman correlation coefficient >0.6 between the sensitivity and the false-positive rate was considered to indicate a threshold effect [27].
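The threshold-effect check described above might look like the following Python sketch, which computes the Spearman coefficient as the Pearson correlation of average ranks and compares it with the 0.6 cut-off. The function names and per-study values are hypothetical, and the sketch assumes that neither input is constant across studies (otherwise the correlation is undefined).

```python
def average_ranks(values):
    """Return 1-based ranks, assigning tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


def has_threshold_effect(sensitivities, specificities, cutoff=0.6):
    """Flag a threshold effect when sensitivity rises with the false-positive rate."""
    false_positive_rates = [1 - s for s in specificities]
    return spearman(sensitivities, false_positive_rates) > cutoff


# Hypothetical per-study values, for illustration only.
sens = [0.80, 0.85, 0.90, 0.95]
spec = [0.60, 0.50, 0.45, 0.30]
print(has_threshold_effect(sens, spec))
```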
All statistical analyses were performed using STATA version 17.0 (Stata Corp, College Station, TX, USA).
RESULTS
Literature search and eligibility criteria
A flow diagram describing study selection is presented in Fig. 1. A total of 1,076 studies were initially identified. Nineteen duplicate studies were excluded, and 918 studies were excluded after screening of titles and abstracts. Afterward, 139 full-text articles with potential eligibility were assessed, and 136 studies were further excluded because they did not use any RSS for thyroid nodules (n=92), did not provide sufficient data for calculating the diagnostic performance of biopsy criteria for thyroid nodules (≥1 cm) according to any of the RSSs (ACR-TIRADS, ATA system, EU-TIRADS, 2016 K-TIRADS, or 2021 K-TIRADS) (n=29), had a further specific size limitation among thyroid nodules (≥1 cm) (n=1), were suspected of having an overlapping study population (n=3), were reviews (n=3), or were not written in English (n=9). Eight studies [16,17,21,28-32] were added after searching the bibliographies of these articles. Finally, a total of 11 articles were included [15-17,20,21,28-33].
Characteristics of the included studies
The characteristics of the 11 included studies are summarized in Table 1. Three studies [16,28,32] were prospectively designed, and five [15,20,21,29,33] were multicenter studies. The number of included patients ranged from 128 to 5,081 in all 11 studies, with the proportion of male patients ranging from 13.4% to 24.9% in 10 studies, excluding that of Middleton et al. [21], in which the data were not available. The mean or median age of the included patients ranged from 49.2 to 56 years, except for one article without age information. The number of included nodules ranged from 144 to 5,708, with the proportion of malignant nodules ranging from 6.5% to 29.5%. Diagnostic performance with respect to the biopsy criteria was reported with the following distribution: ACR-TIRADS (n=10) [15-17,20,21,28-32], ATA system (n=6) [15-17,20,21,31], EU-TIRADS (n=5) [16,28,29,31,32], 2016 K-TIRADS (n=8) [15-17,20,21,28,31,33], and 2021 K-TIRADS (2021 K-TIRADS1.0 and 2021 K-TIRADS1.5) (n=1) [33]. All studies used both cytologic and histopathologic findings as the reference standard, except one [28] that used cytology as the only reference standard. In two studies [15,17], US follow-up was used as one of the reference standards for benign nodules: thyroid nodules with initial benign results on biopsy and a decreased or stable size on follow-up US after more than 12 months were finally classified as benign.
Quality assessment
Nine studies fulfilled five domains, one study fulfilled four domains, and one study fulfilled all seven domains (Fig. 2). Ten studies [15-17,20,21,28,29,31-33] had a low risk of bias for patient selection because patients were consecutively registered. Patient selection was unclear in one study [30]. All studies had a low risk of bias in the index test domain owing to the use of specified RSSs. One study [28] had a low risk of bias in the reference standard domain because it specified that the pathologist was blinded to the radiology report, while the others had an unclear risk of bias. The same study [28] had a low risk of bias in the flow and timing domain because cytology was its only reference standard. The flow and timing domain was unclear in the other 10 studies [15-17,20,21,29-33]. All 11 studies were categorized as having low concerns for applicability in the patient selection, index test, and reference standard domains.
Diagnostic performance
The diagnostic performance of biopsy criteria in RSSs is summarized in Table 2. Among the studies evaluating the diagnostic performance of biopsy criteria in RSSs, the pooled sensitivity and specificity were 82% (95% CI, 74% to 87%) and 60% (95% CI, 52% to 67%) for the ACR-TIRADS, 89% (95% CI, 85% to 93%) and 34% (95% CI, 26% to 42%) for the ATA system, 88% (95% CI, 81% to 92%) and 42% (95% CI, 22% to 67%) for the EU-TIRADS, and 96% (95% CI, 94% to 97%) and 21% (95% CI, 17% to 25%) for the 2016 K-TIRADS (Fig. 3). A large-population multicenter study of the 2021 K-TIRADS [33] showed a sensitivity and specificity of 91% (95% CI, 89% to 93%) and 40% (95% CI, 38% to 41%), respectively, with the 1.0-cm cut-off for intermediate-suspicion nodules (2021 K-TIRADS1.0), and 76% (95% CI, 74% to 79%) and 50% (95% CI, 49% to 52%), respectively, with the 1.5-cm cut-off for intermediate-suspicion nodules (2021 K-TIRADS1.5). All pooled estimates showed considerable heterogeneity (I2>75%), except for the sensitivity of EU-TIRADS (I2=0%).
Unnecessary biopsy rates
The unnecessary biopsy rate in RSSs is summarized in Table 2. The pooled unnecessary biopsy rates of the ACR-TIRADS, ATA system, EU-TIRADS, and 2016 K-TIRADS were 41% (95% CI, 32% to 49%), 65% (95% CI, 56% to 74%), 68% (95% CI, 60% to 75%), and 79% (95% CI, 74% to 83%), respectively (Fig. 4). All pooled estimates showed considerable heterogeneity (I2>75%). A large-population multicenter study of the 2021 K-TIRADS [33] showed an unnecessary biopsy rate of 60% (95% CI, 59% to 62%) with the 1.0-cm cut-off for intermediate-suspicion nodules (2021 K-TIRADS1.0) and 50% (95% CI, 48% to 51%) with the 1.5-cm cut-off for intermediate-suspicion nodules (2021 K-TIRADS1.5). The pooled unnecessary biopsy rate in all RSSs was 60% (95% CI, 54% to 67%).
DISCUSSION
Our study, which included 11 studies with 27,250 nodules, showed that the diagnostic performance of US-based biopsy criteria was variable among the RSSs, ranging from 76% to 96% for sensitivity, from 21% to 60% for specificity, and from 41% to 79% for the unnecessary biopsy rate. Among the pooled data, the 2016 K-TIRADS had the highest sensitivity and unnecessary biopsy rate, and the ACR-TIRADS had the lowest sensitivity and unnecessary biopsy rate. The 2021 K-TIRADS1.5 had a sensitivity and unnecessary biopsy rate similar to those of the ACR-TIRADS, and the 2021 K-TIRADS showed a substantially lower unnecessary biopsy rate than the 2016 K-TIRADS with either cut-off (1 or 1.5 cm) for intermediate-suspicion nodules.
In this study, we investigated the diagnostic performance of biopsy criteria in RSSs, including the 2021 K-TIRADS, for clinically relevant thyroid nodules (≥1 cm). Although many studies have investigated the diagnostic performance of RSSs, very few studies have specifically focused on reviewing the diagnostic performance of the biopsy criteria in RSSs [34,35]. In a review article by Castellana et al. [34], diagnostic performance was also variable among RSSs and ranged from 54% to 87% for sensitivity and from 28% to 64% for specificity. Additionally, the tendency for higher sensitivity with the 2016 K-TIRADS (86%; 95% CI, 73% to 94%) and the ATA system (87%; 95% CI, 75% to 94%) and higher specificity with the ACR-TIRADS (74%; 95% CI, 61% to 83%) was similar to the findings of our study. However, there were two essential points that made our study different: (1) the inclusion of the 2021 K-TIRADS; and (2) the exclusion of data from sub-centimeter nodules. We only included studies with relevant data for nodules over 1 cm, as sub-centimeter nodules are not routinely recommended to be biopsied.
The unnecessary biopsy rate has received attention in studies evaluating the diagnostic performance of RSSs [15-19,35] with respect to the potential harm of unnecessary biopsy. Although US-guided biopsy is a safe procedure, false-positive results carry the risk of potential complications and increased costs due to an increased number of biopsies, and inconclusive biopsy results may lead to repeated biopsies or unnecessary diagnostic surgery for some nodules [36]. However, there are various definitions of the unnecessary biopsy rate: (1) the percentage of benign nodules among nodules requiring biopsy (1–positive predictive value) [28,33]; (2) the percentage of benign nodules requiring biopsy among all nodules [15,17,28,33]; and (3) the percentage of benign nodules requiring biopsy among all benign nodules (1–specificity) [20,31]. We used the third definition considering the heterogeneity in the prevalence of malignant tumors among the included studies, because the unnecessary biopsy rate defined using the other definitions depends on the prevalence of malignant tumors in the study population. In a review of the unnecessary biopsy rate for thyroid nodules according to four RSSs [35], the first definition of the unnecessary biopsy rate was applied, and the ACR-TIRADS showed a significantly lower unnecessary biopsy rate of 25% (95% CI, 22% to 29%) than that of the ATA system (51%; 95% CI, 44% to 58%; P<0.001) and the 2016 K-TIRADS (55%; 95% CI, 42% to 67%; P<0.001). In our study, the pooled unnecessary biopsy rate of the 2016 K-TIRADS (79%; 95% CI, 74% to 83%) was also higher than that of the ACR-TIRADS (41%; 95% CI, 32% to 49%) despite a different definition of the unnecessary biopsy rate. However, the unnecessary biopsy rate of the 2021 K-TIRADS1.5 was reported to be as low as 50% and was relatively similar to that of ACR-TIRADS [29].
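To illustrate why the first two definitions depend on prevalence while the third does not, the following Python sketch applies all three to the same hypothetical test accuracy at two hypothetical malignancy prevalences. The function name, dictionary keys, and numeric inputs are assumptions chosen for illustration and do not come from the included studies.

```python
def unnecessary_biopsy_rates(sensitivity, specificity, prevalence):
    """Return the three definitions for a cohort with the given malignancy prevalence."""
    # Cell proportions of the 2x2 table, expressed as fractions of all nodules.
    tp = sensitivity * prevalence
    fn = (1 - sensitivity) * prevalence
    fp = (1 - specificity) * (1 - prevalence)
    tn = specificity * (1 - prevalence)
    return {
        # Definition 1: benign among nodules requiring biopsy (1 - PPV).
        "among_biopsied": fp / (fp + tp),
        # Definition 2: benign nodules requiring biopsy among all nodules.
        "among_all_nodules": fp / (tp + fn + fp + tn),
        # Definition 3: benign nodules requiring biopsy among all benign nodules.
        "among_benign": fp / (fp + tn),
    }


# Same hypothetical accuracy, two hypothetical malignancy prevalences.
for prevalence in (0.10, 0.30):
    print(prevalence, unnecessary_biopsy_rates(0.85, 0.55, prevalence))
```

In this sketch only the third value stays fixed as the prevalence changes, which is the property that motivates its use when pooling studies with different malignancy rates.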
Our study is unique in that it includes the 2021 K-TIRADS. The diagnostic performance of the 2021 K-TIRADS was separately described in this study as 2021 K-TIRADS1.0 and 2021 K-TIRADS1.5 according to each size cut-off, considering the suggested range of 1 to 1.5 cm for biopsy in intermediate-suspicion nodules in the 2021 K-TIRADS. Since most missed malignancies will be small (<1.5 cm) low-risk tumors, it may be reasonable to apply the size cut-off of 1.5 cm for biopsy in most intermediate-suspicion nodules without high-risk clinical or US features of metastasis or gross extrathyroidal extension, despite the risk of decreased sensitivity for malignant tumors. However, the size cut-off of 1 cm for biopsy may be selectively applied in some patients with high-risk factors who require higher sensitivity for malignant tumors [11]. The unnecessary biopsy rate of the 2021 K-TIRADS1.5 was lower than those of the 2016 K-TIRADS, EU-TIRADS, and ATA system, but was similar to that of the ACR-TIRADS. According to a study comparing the diagnostic performance of biopsy criteria in RSSs [29], the unnecessary biopsy rate of small thyroid nodules (1 to 2 cm) in the 2021 K-TIRADS1.5 was the lowest among RSSs, even compared with the ACR-TIRADS. Accordingly, the difference in the unnecessary biopsy rate between the 2016 K-TIRADS and the 2021 K-TIRADS1.5 is due to the reduced number of unnecessary biopsies in small thyroid nodules (1 to 2 cm) with the 2021 K-TIRADS1.5. Although the 2021 K-TIRADS1.5 had lower sensitivity than the 2016 K-TIRADS, Chung et al. [33] showed that the decrease in sensitivity was exclusively noted for small thyroid nodules (1 to 2 cm) and demonstrated that most missed malignant tumors would be small low-risk tumors. US surveillance can mitigate the decreased sensitivity for small thyroid nodules (1 to 2 cm) in the 2021 K-TIRADS1.5.
This study has several limitations. First, only one relevant study evaluated the diagnostic performance of biopsy criteria in the 2021 K-TIRADS. Although that study was a multicenter study with a large sample size, further validation studies are needed. Second, studies that did not provide the specific outcomes for nodules (≥1 cm) could not be included according to the eligibility criteria. Third, we only presented pooled sensitivity, specificity, and unnecessary biopsy rates among studies without meta-regression due to the paucity of studies employing the 2021 K-TIRADS. It would be worthwhile to perform meta-regression for comparison between RSSs in the future, after more studies adopt the 2021 K-TIRADS.
In conclusion, the 2021 K-TIRADS showed a substantially lower unnecessary biopsy rate than that of the 2016 K-TIRADS, while maintaining an appropriate diagnostic sensitivity for clinically relevant thyroid malignancy. The 2021 K-TIRADS may help reduce the potential harm due to unnecessary biopsies.
Notes
CONFLICTS OF INTEREST
No potential conflict of interest relevant to this article was reported.
AUTHOR CONTRIBUTIONS
Conception or design: M.K.L., D.G.N. Acquisition, analysis, or interpretation of data: L.J., M.K.L., D.G.N. Drafting the work or revising: L.J., M.K.L., J.Y.L., E.J.H., D.G.N. Final approval of the manuscript: M.K.L., D.G.N.
Acknowledgements
We acknowledge and thank Miyoung Choi (National Evidence-based Healthcare Collaborating Agency, Division of Health Technology Assessment Research) and Chang Hee Cho (The Korean Society of Radiology), who contributed to searching and interpreting the evidence. This study was supported by the Korean Thyroid Association and a research fund from National Cancer Center (grant number: 2112570-3).