ChatGPT-4 versus human generated multiple choice questions - A study from a medical college in Pakistan

Authors

  • Muhammad Ahsan Naseer Department of Health Professions Education, Liaquat National Hospital & Medical College, Karachi, Pakistan
  • Yusra Nasir Department of Health Professions Education, Liaquat National Hospital & Medical College, Karachi, Pakistan
  • Afifa Tabassum Department of Health Professions Education, Liaquat National Hospital & Medical College, Karachi, Pakistan
  • Sobia Ali Department of Health Professions Education, Liaquat National Hospital & Medical College, Karachi, Pakistan

DOI:

https://doi.org/10.53685/jshmdc.v5i2.253

Keywords:

Artificial intelligence, Multiple choice questions, Undergraduate medical examination, ChatGPT-4

Abstract

Background: There has been growing interest in using artificial intelligence (AI)-generated multiple choice questions (MCQs) to supplement traditional assessments. Although AI tools are claimed to generate higher-order questions, few studies have focused on undergraduate medical education assessment in Pakistan.

Objective: To compare the quality of human-developed versus ChatGPT-4-generated MCQs for the final-year MBBS written MCQ examination.

Methods: This observational study compared ChatGPT-4-generated and human-developed MCQs in four specialties: Pediatrics, Obstetrics and Gynecology (Ob/Gyn), Surgery, and Medicine. Based on the table of specifications, 204 MCQs were generated by ChatGPT-4 and 196 MCQs were retrieved from the medical college's question bank. All MCQs were anonymized, and their quality was scored using a checklist based on the National Board of Medical Examiners (NBME) criteria. Data were analyzed using SPSS version 23; Mann-Whitney U and chi-square tests were applied.
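
As an illustration of the analysis described above, the sketch below reruns the same two tests in Python with SciPy rather than SPSS; the scores and contingency counts are hypothetical placeholders, not the study's data.

```python
# Minimal sketch of the statistical comparison described above, using SciPy
# in place of SPSS. All numbers below are hypothetical placeholders.
from scipy.stats import mannwhitneyu, chi2_contingency

# Hypothetical checklist-based total quality scores per MCQ (higher = better)
human_scores = [9, 8, 10, 7, 9, 8]   # human-developed MCQs
gpt_scores = [8, 8, 9, 6, 7, 9]      # ChatGPT-4-generated MCQs

# Mann-Whitney U test: nonparametric comparison of the two score distributions
u_stat, p_total = mannwhitneyu(human_scores, gpt_scores, alternative="two-sided")
print(f"Mann-Whitney U = {u_stat}, p = {p_total:.3f}")

# Chi-square test on a single checklist item (e.g. "stem includes necessary
# details"): rows = criterion met / not met, columns = human / ChatGPT-4
observed = [[90, 70],   # criterion met
            [8, 28]]    # criterion not met
chi2, p_item, dof, _ = chi2_contingency(observed)
print(f"Chi-square = {chi2:.2f}, df = {dof}, p = {p_item:.3f}")
```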

Results: Out of 400 MCQs, 396 were included in the final review; four MCQs did not conform to the table of specifications. Total scores were not significantly different between human-developed and ChatGPT-4-generated MCQs (p = 0.12). However, human-developed MCQs performed significantly better than ChatGPT-4-generated MCQs in Ob/Gyn (p = 0.03). Human-developed MCQs also scored better than ChatGPT-4-generated MCQs on the checklist item "the stem includes the necessary details for answering the question" in Ob/Gyn and Pediatrics (p < 0.05), as well as on the item "is appropriate for the cover-the-options rule" in Surgery.

Conclusion: With well-structured and specific prompting, ChatGPT-4 has the potential to assist in developing MCQs for medical examinations. However, ChatGPT-4 has limitations where in-depth, context-specific item generation is required.
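
As a concrete example of such prompting, the sketch below shows one way a table-of-specifications-driven request could be sent to ChatGPT-4 through the OpenAI Python client. The prompt wording, topic, and model identifier are illustrative assumptions, not the study's actual prompt.

```python
# Illustrative sketch only: the prompt text, topic, and model name are
# assumptions for demonstration; the study's actual prompts are not shown here.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# A well-structured, specific prompt tied to one cell of a table of specifications
prompt = (
    "You are a medical examiner. Write one single-best-answer MCQ for a "
    "final-year MBBS examination in Obstetrics and Gynecology on the topic "
    "'postpartum hemorrhage', targeting application of knowledge (C3). "
    "Follow NBME item-writing guidelines: a clinical vignette stem containing "
    "all details needed to answer, one correct answer, and four homogeneous, "
    "plausible distractors. Label the correct answer."
)

response = client.chat.completions.create(
    model="gpt-4",  # assumed model identifier
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```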



Published

12/31/2024

How to Cite

Naseer, M.A., Nasir, Y., Tabassum, A. and Ali, S. 2024. ChatGPT-4 versus human generated multiple choice questions - A study from a medical college in Pakistan. Journal of Shalamar Medical & Dental College. 5, 2 (Dec. 2024), 58–64. DOI: https://doi.org/10.53685/jshmdc.v5i2.253.

Issue

Vol. 5, No. 2 (Dec. 2024)

Section

Original Articles