Title: Evaluating the risk of bias in randomized controlled trials in orthopedic surgery: using ChatGPT-4o, Gemini Pro, and Claude 3.5
Abstract:
Introduction: Accurate assessment of the risk of bias (RoB) is critical in clinical research to ensure the validity of study findings; RoB is traditionally evaluated using the Cochrane Risk of Bias tool. Large language models (LLMs), such as ChatGPT-4, have demonstrated the potential to automate RoB assessment through advanced natural language processing. This study investigates the capability of LLMs to evaluate RoB in randomized controlled trials (RCTs) in orthopedic surgery, assessing their accuracy and concordance with human raters.
Methods: Twenty RCTs in orthopedic surgery were randomly selected for RoB assessment across six domains: random sequence generation, allocation concealment, selective reporting, blinding (participants/personnel and outcome assessment), incomplete outcome data, and other sources of bias. Two independent orthopedic surgeons assessed RoB, with a third resolving discrepancies; their consensus ratings served as the reference standard for comparison with the LLM-generated evaluations. Each LLM (ChatGPT-4, Claude Pro, and Gemini Pro) was given the methods and results sections of each trial and asked to rate each domain using a standardized prompt based on the Cochrane framework. Statistical analyses, including Fisher's exact test, were conducted to compare concordance between the LLMs and the human raters.
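A minimal sketch of how such a standardized, Cochrane-based prompt could be assembled per trial is shown below. The domain list mirrors the abstract; the template wording, the `build_prompt` helper, and the Low/High/Unclear response format are illustrative assumptions, not the study's actual instrument or pipeline.

```python
# Illustrative sketch only: the template wording and helper names are
# assumptions, not the study's actual prompt or evaluation pipeline.

COCHRANE_DOMAINS = [
    "random sequence generation",
    "allocation concealment",
    "selective reporting",
    "blinding (participants/personnel and outcome assessment)",
    "incomplete outcome data",
    "other sources of bias",
]

PROMPT_TEMPLATE = (
    "You are assessing the risk of bias in a randomized controlled trial "
    "using the Cochrane Risk of Bias tool. For each domain listed below, "
    "respond with exactly one of Low / High / Unclear risk, followed by a "
    "one-sentence justification quoting the trial text.\n\n"
    "Domains: {domains}\n\n"
    "Methods section:\n{methods}\n\n"
    "Results section:\n{results}\n"
)

def build_prompt(methods: str, results: str) -> str:
    """Fill the template with one trial's methods and results text."""
    return PROMPT_TEMPLATE.format(
        domains="; ".join(COCHRANE_DOMAINS),
        methods=methods,
        results=results,
    )
```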
Results: ChatGPT-4 and Claude Pro demonstrated high concordance with the human raters, achieving 90% accuracy in RoB assessments across domains, with no statistically significant difference from human evaluations (p = 1.0). Gemini Pro showed moderate agreement at 75%, although this difference was also not statistically significant (p = 0.30). Domain-specific analysis revealed high accuracy for random sequence generation and blinding (participants/personnel and outcome assessment), whereas agreement was lower for allocation concealment and other sources of bias. Claude Pro performed best on selective reporting (90% concordance). These results underscore the potential of LLMs to replicate human-level performance in systematic reviews.
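To illustrate the mechanics of the comparison only, Fisher's exact test on a 2x2 table of concordant versus discordant domain ratings can be computed with SciPy as below; the counts are hypothetical placeholders and do not reproduce the study's data or its reported p-values.

```python
from scipy.stats import fisher_exact

# Hypothetical 2x2 table: rows are two raters (e.g., an LLM and a
# comparator), columns count domain ratings that did / did not match
# the consensus reference standard. Illustrative counts only.
table = [
    [18, 2],   # rater A: 18 concordant, 2 discordant ratings
    [15, 5],   # rater B: 15 concordant, 5 discordant ratings
]

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.3f}")
```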
Conclusion: This study highlights the capacity of advanced LLMs, particularly ChatGPT-4 and Claude Pro, to perform RoB assessments with accuracy comparable to that of human experts. While the LLMs excelled in several domains, their variability on allocation concealment and other sources of bias suggests they should complement, rather than replace, human expertise. Future research should focus on refining LLM prompts, expanding sample diversity, and integrating these tools into systematic review workflows to improve efficiency and standardization in evidence-based medicine.