Poster Display - 26
Diagnostic Sensitivity of Artificial Intelligence in Detecting Acute Surgical Abdomen on Paediatric Abdominal X-rays compared with Operative Findings
Abdul Mannan, Zahra Auqil, Batool Fatima, Zuha Zafar, Rehman Waheed, Muhammad Kashif Bashir
Department of Pediatric Surgery, Mayo Hospital Lahore, Pakistan
Objective:
To evaluate the diagnostic performance of ChatGPT-4 (June 2024 version) in interpreting pediatric abdominal radiographs by comparing its outputs against intraoperative findings (gold standard) and radiologist reports, and to quantify agreement using Cohen’s kappa.
Methods:
A total of 28 pediatric abdominal X-rays were interpreted using ChatGPT-4 (June 2024 version). Interpretations were generated using only clinical summaries and radiographic images. Each interpretation was judged for diagnostic correctness against intraoperative findings as the gold standard. Additionally, in the 13 cases where radiologist reports were available, ChatGPT's interpretations were compared with those reports, and agreement was quantified using Cohen’s kappa statistic.
Results:
ChatGPT correctly interpreted 16 out of 28 cases (57.1%) when judged against intraoperative findings. It agreed with radiologist assessments in 9 out of 13 cases (69.2% raw agreement), yielding a Cohen’s kappa (κ) of 0.38, indicating fair agreement. Notably, ChatGPT missed pneumoperitoneum in both cases where it was present, a critical oversight. The tool performed reasonably well in identifying obstructive pathologies such as small bowel obstruction and Hirschsprung's disease, but showed limitations in detecting subtle or life-threatening findings.
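The reported percentages and kappa statistic can be checked with a short calculation. The Python sketch below is illustrative only: the function names are hypothetical, the abstract does not say which software the authors used, and the per-case labels are not given here, so the kappa function is shown in general form without study data. The last helper back-calculates the chance agreement implied by the reported figures (about 0.50), which is a derived value and not a figure stated in the abstract.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same cases.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed
    agreement and p_e is the agreement expected by chance from each
    rater's label frequencies.
    """
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of cases")
    n = len(rater_a)
    # Observed agreement: share of cases given identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: sum over labels of the product of marginal shares.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_o - p_e) / (1 - p_e)


def implied_chance_agreement(p_o, kappa):
    """Rearranged kappa formula: p_e = (p_o - kappa) / (1 - kappa)."""
    return (p_o - kappa) / (1 - kappa)


# Figures taken from the Results paragraph.
accuracy_pct = round(16 / 28 * 100, 1)    # correct vs. intraoperative findings
agreement_pct = round(9 / 13 * 100, 1)    # raw agreement with radiologists
# Derived (not reported) chance agreement implied by 9/13 and kappa = 0.38.
implied_p_e = implied_chance_agreement(9 / 13, 0.38)
```

With the abstract's figures, `accuracy_pct` and `agreement_pct` reproduce the reported 57.1% and 69.2%. Because kappa discounts agreement expected by chance, a raw agreement of 69.2% can correspond to a kappa of only 0.38.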
Conclusion:
ChatGPT-4 (June 2024 version) demonstrated moderate diagnostic accuracy (57.1%) in pediatric abdominal X-ray interpretation, with fair agreement with radiologist reports (κ = 0.38). While it shows potential as an educational or triage support tool, its current limitations, particularly its failure to detect emergent findings such as pneumoperitoneum, underscore the need for cautious and supervised clinical integration.