Clinical Problem and the Role of MRCP
Accurate differentiation of benign from malignant biliary strictures remains among the most consequential diagnostic challenges in hepatopancreatobiliary medicine. Malignant strictures, caused predominantly by cholangiocarcinoma (CCA), pancreatic head carcinoma, and ampullary adenocarcinoma, carry a 5-year survival rate of approximately 35% when resected early, whereas survival at advanced stages falls to less than 12 months 5. Benign etiologies encompass postoperative bile duct injury, primary sclerosing cholangitis (PSC), IgG4-related sclerosing cholangitis, and inflammatory fibrosis; these require entirely different management strategies, and unnecessary hepatectomy performed for presumed malignancy occurs in an estimated 3% of major biliary surgical series 10.
Magnetic resonance cholangiopancreatography (MRCP) has become a widely used noninvasive imaging modality for biliary tree evaluation, offering high soft-tissue resolution without ionizing radiation and avoiding the procedural risks associated with invasive biliary interventions. Nevertheless, MRCP interpretation is subjective and operator-dependent, with diagnostic accuracy ranging from 38% to 90% across published series 4. Endoscopic retrograde cholangiopancreatography (ERCP) with tissue sampling (brush cytology, forceps biopsy) remains the procedural reference standard but carries morbidity—including pancreatitis, perforation, and bleeding—and is ideally reserved for therapeutic intervention. This diagnostic gap has driven substantial interest in artificial intelligence (AI) and deep learning approaches to augment and standardize MRCP interpretation 9.
Deep Learning Applied to MRCP
Deep learning has been applied to MRCP through three principal strategies. First, convolutional neural networks (CNNs)—including ResNet50, DenseNet121, Xception, and EfficientNet—process 2D or 3D MRCP sequences to extract hierarchical imaging features automatically, without reliance on hand-crafted morphologic criteria. Ensemble architectures, combining predictions from multiple CNN models via logistic regression meta-learners or weighted averaging, reduce overfitting and improve robustness 2. Second, multimodal fusion models integrate CNN-derived imaging features with clinical and laboratory variables—age, sex, alkaline phosphatase (ALP), total bilirubin, alanine aminotransferase, and carbohydrate antigen 19-9 (CA19-9)—through separate neural network branches fused at the feature level, capturing complementary diagnostic information 6. Third, quantitative MRCP (MRCP+) employs AI-enabled post-processing to generate objective, scanner-independent biliary tree metrics—ductal volume, number, diameter, stricture count, and dilatation severity—from standard non-contrast 3D heavily T2-weighted MRCP sequences already acquired in routine clinical care 5.
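The weighted-averaging form of ensembling mentioned above can be illustrated with a minimal Python sketch. The model names, probabilities, weights, the 0.5 cut-off, and the function name `weighted_ensemble` are all hypothetical placeholders chosen for illustration; none of them is taken from the cited studies.

```python
from typing import Dict


def weighted_ensemble(probs: Dict[str, float], weights: Dict[str, float]) -> float:
    """Return the weighted average of per-model malignancy probabilities."""
    if set(probs) != set(weights):
        raise ValueError("probs and weights must cover the same models")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(probs[name] * weights[name] for name in probs) / total


# Hypothetical per-model outputs for a single MRCP study (illustrative only).
model_probs = {"resnet50": 0.80, "densenet121": 0.60, "xception": 0.70}
# Hypothetical weights, e.g. derived from validation performance.
model_weights = {"resnet50": 1.0, "densenet121": 1.0, "xception": 2.0}

score = weighted_ensemble(model_probs, model_weights)
# The 0.5 decision threshold is an assumption, not a published operating point.
label = "malignant" if score >= 0.5 else "benign"
print(f"ensemble probability = {score:.3f} -> {label}")
```

A logistic regression meta-learner would replace the fixed weights with coefficients fitted on held-out predictions; the averaging step shown here is the simpler of the two strategies.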
Object detection algorithms (e.g., YOLO) have been used to isolate the common bile duct (CBD) as a region of interest before feature extraction. Explainability techniques such as gradient-weighted class activation mapping (Grad-CAM) have been applied to visualize which image regions drive model predictions, though explainability remains incompletely addressed in the available literature 1.
Diagnostic Performance
The most methodologically robust evidence comes from a prospective external validation study of an Xception CNN ensemble combined with logistic regression (Xce-LR model), trained retrospectively across two institutions (n = 378) and validated prospectively in an independent cohort (n = 60). The Xce-LR model achieved an area under the receiver operating characteristic curve (AUC) of 0.890 on internal testing and 0.885 on external validation, with prospectively observed sensitivity, specificity, and accuracy each at 90.0%. A multimodal deep learning fusion model integrating ResNet50-derived MRCP features with clinical parameters (Thoi et al., 2025) demonstrated accuracy of 89.8%, AUC of 0.904, sensitivity of 81.8%, and specificity of 95.7%, substantially outperforming image-only (AUC 0.87) and clinical-only (AUC 0.79) approaches 6, 1.
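For readers less familiar with the AUC values quoted above, the statistic can be read as the probability that a randomly chosen malignant case receives a higher model score than a randomly chosen benign case. The sketch below computes it by direct pairwise comparison; the scores are invented toy values, not data from any cited study, and `auc_pairwise` is a hypothetical helper name.

```python
from typing import Sequence


def auc_pairwise(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Tied pairs count as half a correct ranking.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Toy model scores (illustrative only).
malignant_scores = [0.9, 0.8, 0.7, 0.4]
benign_scores = [0.6, 0.3, 0.2, 0.1]

auc = auc_pairwise(malignant_scores, benign_scores)
print(f"AUC = {auc:.4f}")
```

The quadratic loop is adequate for cohorts of the size discussed here; rank-based implementations give the same value more efficiently on large datasets.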
Quantitative MRCP metrics provide complementary evidence. Eurboonyanun and colleagues demonstrated that a biliary tree volume threshold of ≥25 ml yielded an AUC of 0.79 (95% confidence interval [CI]: 0.63–0.96), sensitivity of 86.96%, specificity of 73.33%, positive predictive value (PPV) of 83.33%, and negative predictive value (NPV) of 78.57% in 38 histologically confirmed patients 5. Additional metrics—total duct length (AUC 0.80), total length of strictures and dilatations (AUC 0.81), and number of dilatations (AUC 0.81)—provided similarly good discrimination. A separate AI model utilizing clinical biomarkers for distal CBD obstruction classification achieved an AUC of 0.908, sensitivity 83.1%, specificity 87.2%, PPV 74.5%, and NPV 92.0%, significantly outperforming individual markers including ALP (AUC 0.795) and CBD diameter (AUC 0.775) 7.
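The percentages reported for the biliary volume threshold can be cross-checked with a short worked calculation. The sketch below assumes a 2x2 table of 20 true positives, 3 false negatives, 4 false positives, and 11 true negatives. These counts are not stated in the text; they are inferred here from the reported percentages together with the 23 malignant and 15 benign cases listed for this cohort in the evidence table, and should be treated as an assumption.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ConfusionCounts:
    tp: int  # malignant, test positive
    fn: int  # malignant, test negative
    fp: int  # benign, test positive
    tn: int  # benign, test negative

    def metrics(self) -> Dict[str, float]:
        """Standard diagnostic accuracy metrics as fractions in [0, 1]."""
        total = self.tp + self.fn + self.fp + self.tn
        return {
            "sensitivity": self.tp / (self.tp + self.fn),
            "specificity": self.tn / (self.tn + self.fp),
            "ppv": self.tp / (self.tp + self.fp),
            "npv": self.tn / (self.tn + self.fn),
            "accuracy": (self.tp + self.tn) / total,
        }


# Inferred counts (assumption): 23 malignant and 15 benign patients, n = 38.
counts = ConfusionCounts(tp=20, fn=3, fp=4, tn=11)
results = counts.metrics()
for name, value in results.items():
    print(f"{name}: {100 * value:.2f}%")
```

Under these assumed counts the computed values round to the figures quoted above (86.96%, 73.33%, 83.33%, 78.57%) and to the 81.58% accuracy given in the evidence table, which suggests the reported metrics are internally consistent.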
In the PSC surveillance context, a meta-analysis of 8 studies and 531 patients reported pooled MRI/MRCP sensitivity for CCA detection of 98.9% (95% CI: 98.6–99.3%), with only 3 false-negative cases among 36 confirmed malignancies 9.
Concordance with Expert Radiologist Diagnosis
Reader concordance data are available from several sources. In the Xce-LR prospective cohort, model performance was statistically comparable to three expert radiologists (accuracy 76.7%–86.7%), with no significant differences in accuracy (p = 0.302), sensitivity (p = 0.143), or specificity (p = 0.774). The DeePSC multiview CNN ensemble for PSC-compatible bile duct changes outperformed four radiologists by 5.5 percentage points on 3-Tesla (T) data and 10.1 percentage points on 1.5-T data, though differences did not reach statistical significance (p = 0.34 and 0.13, respectively), likely reflecting small test sets (n = 39 per field strength) 2.
The most compelling concordance data arise from quantitative MRCP reader-assist studies in PSC. When radiologists reviewed MRCP supplemented with MRCP+ metrics, inter-reader agreement for intrahepatic high-grade stricture detection improved significantly—from 42.9% to 67.9% (p = 0.02)—and Cohen's kappa increased from 0.36 ± 0.12 to 0.53 ± 0.12 (p < 0.001). Reader confidence tended to improve concurrently 5. For extrahepatic duct assessment, MRCP+ metrics demonstrated an AUC of 0.85 for maximum dilatation diameter, with high intra-reader reproducibility (kappa 0.788–0.839) 5.
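Cohen's kappa, used above to summarize reader agreement, corrects raw percentage agreement for the agreement expected by chance. The following minimal sketch computes it for two readers; the ratings are invented toy data (1 = high-grade stricture present, 0 = absent) and do not reproduce any study result.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(reader_a: Sequence[Hashable], reader_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two readers rating the same cases."""
    if len(reader_a) != len(reader_b) or len(reader_a) == 0:
        raise ValueError("both readers must rate the same non-empty set of cases")
    n = len(reader_a)
    # Observed agreement: fraction of cases with identical ratings.
    observed = sum(a == b for a, b in zip(reader_a, reader_b)) / n
    # Chance agreement: product of each reader's marginal rates, summed over categories.
    freq_a = Counter(reader_a)
    freq_b = Counter(reader_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1.0 - expected)


# Toy ratings for 10 cases (illustrative only).
reader_1 = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
reader_2 = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

kappa = cohens_kappa(reader_1, reader_2)
print(f"kappa = {kappa:.3f}")
```

In this toy example the readers agree on 8 of 10 cases, yet kappa is noticeably lower than 0.8 because a substantial share of that agreement would be expected by chance alone. This is why percentage agreement and kappa are usefully reported side by side.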
Evidence Summary Table
| Study / Year | Population | Model / Input | Reference Standard | Validation Type | Sensitivity | Specificity | AUC / Accuracy | Concordance With Radiologists | Key Limitations |
|---|---|---|---|---|---|---|---|---|---|
| Liu et al., 2025 (Xce-LR) | Retro: n=378 (2 centers); Prospective: n=60 | Xception CNN ensemble + logistic regression (3T MRCP) | Histopathology, ERCP, surgical findings | Retro training + prospective external validation | 90.0% | 90.0% | AUC 0.885; Acc 90.0% | Comparable to 3 radiologists (76.7–86.7%); p=0.143–0.774 | Benign cases downsampled; small prospective cohort |
| Thoi et al., 2025 6 | 465 total; 143 with MRCP images | ResNet50 (MRCP) + clinical neural network (ALP, bilirubin, CA19-9, etc.); element-wise fusion | Histopathology, ERCP, clinical diagnosis | Retrospective, 3-fold cross-validation | 81.8% | 95.7% | AUC 0.904; Acc 89.8% | Not reported | Single-center; no external validation; validation design unclear |
| Eurboonyanun et al., 2025 (MRCP+) 5 | n=38 (23 malignant, 15 benign) | Quantitative MRCP metrics: biliary volume, duct dimensions, stricture/dilatation count | Histopathology (endoscopic biopsy or surgical resection) | Retrospective, single-center | 86.96% | 73.33% | AUC 0.79; Acc 81.58% | Performance comparable to expert radiologists; kappa 0.788–0.839 (intra-reader) | Very small cohort; no external validation; MRCP+ is not a deep learning classifier |
| DeePSC, 2023 (PSC detection) 2 | n=606 (342 PSC, 264 controls); 1.5T, 3T, external vendor | Multiview CNN ensemble (7 radial MRCP projections; 20 networks) | PSC diagnosis per EASL criteria | Internal (n=39 per field) + external multivendor | 80.0–100.0% | 80.0–83.5% | Acc 80.5–92.4% | Outperformed 4 radiologists by 5.5–10.1 pp; p=0.13–0.34 (NS) | PSC detection only, not malignancy classification; small test sets |
| AI biomarker model 7 | Not fully detailed | AI model (clinical biomarkers) | Histopathology (presumed) | Not fully detailed | 83.1% | 87.2% | AUC 0.908; Acc 85.9% | Not reported; superior to ALP (AUC 0.795) and CBD diameter (AUC 0.775) | Limited methodological detail accessible; biomarker-focused, not imaging-based |
| Wang et al., 2021 4 | n=168 (83 malignant, 85 benign DBS) | MRCP + CT 4-feature scoring model (stricture length, angle, double duct sign, arterial phase density) | Histology, endoscopy, follow-up | Retrospective, single-center | 73.5% | 85.9% | AUC 0.828 | Inter-observer kappa 0.41–0.80; reader accuracy improved with CT addition (70.2%→81.5%) | No deep learning; excluded cholangiolithiasis; single-center |
Abbreviations: AUC, area under receiver operating characteristic curve; Acc, accuracy; ALP, alkaline phosphatase; CBD, common bile duct; CNN, convolutional neural network; DBS, distal bile duct stricture; ERCP, endoscopic retrograde cholangiopancreatography; MRCP, magnetic resonance cholangiopancreatography; NS, not statistically significant; pp, percentage points; PSC, primary sclerosing cholangitis.
Strengths and Limitations of the Evidence Base
The published literature exhibits several notable strengths: prospective validation in the Xce-LR study 7, multicenter training data, integration of imaging with clinical variables, and application of quantitative MRCP across vendor platforms 5, 2. External validation of DeePSC across 1.5-T, 3-T, and a different scanner vendor demonstrated resilience to imaging protocol heterogeneity, with external accuracy reaching 92.4% 2.
Significant limitations are pervasive, however. Most studies are retrospective, and many are single-center. Sample sizes for MRCP imaging subsets range from 38 to 378, with prospective cohorts as small as 60 patients. Class imbalance, with malignant cases often overrepresented, may inflate apparent performance, and downsampling of benign cases introduces its own selection distortion. Spectrum bias is frequent: most studies exclude choledocholithiasis or PSC-related diffuse strictures, and no published study has systematically evaluated deep-learning MRCP performance across all major benign etiologies (PSC, IgG4-related disease, postoperative strictures) in parallel 10. Explainability analysis, though feasible via Grad-CAM, is incompletely reported. Integration into picture archiving and communication systems (PACS) and clinical radiology workflows has not been formally evaluated in any study identified in the retrieved literature. No FDA-cleared or CE-marked deep-learning MRCP system specifically intended for benign-versus-malignant biliary stricture classification was identified during the literature review; regulatory status should be verified at the time of publication 10.
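The practical consequence of small cohorts can be made concrete with a confidence interval. The sketch below computes a 95% Wilson score interval for an accuracy of 54 correct out of 60, a count inferred here from the 90.0% accuracy reported for the n = 60 prospective cohort. The inferred count and the choice of the Wilson method are assumptions of this example; the original study may report a different interval.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score confidence interval for a binomial proportion.

    z = 1.96 corresponds to an approximate 95% confidence level.
    """
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("require n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Assumed count: 90.0% accuracy in a cohort of 60 implies 54 correct classifications.
low, high = wilson_interval(54, 60)
print(f"95% CI for accuracy: {100 * low:.1f}% to {100 * high:.1f}%")
```

With these inputs the interval spans roughly 80% to 95%, so a point estimate of 90% from 60 patients is compatible with true performance that ranges from clearly below to somewhat above the reported radiologist accuracies.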
Clinical Readiness and Future Directions
Current evidence positions deep-learning-assisted MRCP as a promising investigational tool with potential to match or modestly exceed expert radiologist performance, rather than a validated clinical standard. Reported models have demonstrated sensitivity and specificity generally ranging from approximately 73% to 90% and AUC values from 0.79 to 0.91 across heterogeneous study populations and validation designs, and quantitative outputs used as decision-support aids have produced meaningful improvements in inter-reader agreement 5, 6. However, no published study fulfills the criteria for routine clinical implementation: prospective multicenter validation in diverse, consecutive patient populations using standardized imaging protocols and histopathologic reference standards remains absent.
Future research should prioritize: (1) prospective multicenter validation with standardized protocols and transparent reporting per the Checklist for Artificial Intelligence in Medical Imaging (CLAIM) and Standards for Reporting Diagnostic Accuracy (STARD) guidelines; (2) formal reader-assist study designs quantifying changes in radiologist accuracy and confidence when presented with AI outputs; (3) subgroup analyses across specific benign etiologies (PSC, IgG4-related disease, postoperative injury); (4) explainability analysis to identify the morphologic features driving model decisions; (5) PACS integration and workflow feasibility studies; and (6) cost-effectiveness analysis benchmarked against current diagnostic pathways including ERCP with tissue sampling and endoscopic ultrasound (EUS)-guided biopsy 9, 10.
In conclusion, deep-learning-assisted MRCP represents a scientifically compelling but investigational approach. Until robust prospective evidence and regulatory evaluation are available, these systems should be positioned as decision-support tools to augment—not supplant—expert radiologic interpretation and multidisciplinary clinical judgment in the diagnostic work-up of biliary strictures.