Evaluating Various Large Language Models’ Scoring Characteristics of Open-Ended Responses and Essays

Keywords: grading reliability, large language models (LLMs), prompt engineering

Abstract

This study evaluates the grading effectiveness of large language models (LLMs), including ChatGPT-5, Claude, and DeepSeek, on open-ended responses and essays. In Phase 1, AI-generated scores were compared with those of human instructors, revealing differences in leniency, depth, and alignment, quantified with a normalized distance metric. In Phase 2, prompt engineering and few-shot learning improved alignment with human graders, and Principal Component Analysis plots with Kernel Density Estimation contours supported these gains. A single middle-performing exemplar strategy consistently improved grading alignment across models and assignments without negative effects, whereas other strategies produced variable results. Careful prompt design remains essential for fair and responsible AI-assisted assessment.
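The abstract does not specify how the normalized distance metric is defined. As a rough illustration only of how such an AI-versus-human alignment measure might be computed (the formula, function name, and scores below are assumptions, not the authors' method):

```python
import numpy as np

def normalized_distance(ai_scores, human_scores, max_score):
    """Mean absolute difference between AI and human scores,
    scaled to [0, 1] by the rubric's maximum possible score.
    Illustrative only; the study's actual metric may differ."""
    ai = np.asarray(ai_scores, dtype=float)
    human = np.asarray(human_scores, dtype=float)
    return float(np.mean(np.abs(ai - human)) / max_score)

# Hypothetical rubric scores (0-10) for five essays.
human = [7, 8, 6, 9, 5]
model_a = [8, 9, 7, 9, 6]   # a slightly lenient grader
model_b = [7, 7, 6, 8, 5]   # closer to the human scores

print(normalized_distance(model_a, human, max_score=10))  # ~0.08
print(normalized_distance(model_b, human, max_score=10))  # ~0.04
```

Under this hypothetical definition, a score of 0 would indicate perfect agreement with the human grader and 1 maximal disagreement, so lower values mean better alignment.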

Author Biographies

  • Yeseul Nam, University of Central Arkansas

    Dr. Yeseul Nam is an Assistant Professor in the Department of Psychology and Counseling at the University of Central Arkansas. She teaches courses such as Introduction to Psychology, Multicultural Psychology, and Psychology Apprenticeship. Her research, including Nam and Chen (2021), focuses on cross-cultural differences, racial and socioeconomic disparities, pedagogy development, and emerging adulthood.

  • James E. Wages III, University of Central Arkansas

    Dr. James E. Wages III is an Assistant Professor of psychology in the Department of Psychology and Counseling at the University of Central Arkansas. Dr. Wages researches topics on social cognition, intergroup processes, decision-making, and the intersection of these areas (e.g., see Wages, Perry, Bodenhausen, & Skinner, 2022). Dr. Wages primarily teaches Research Methods, Psychology in Context, and Psychology Apprenticeship.

References

Aji, C. A., & Khan, M. J. (2019). The impact of active learning on students’ academic performance. Open Journal of Social Sciences, 7(3), 204-211. https://doi.org/10.4236/jss.2019.73017

Alabidi, S., Alarabi, K., Alsalhi, N. R., & Mansoori, M. A. (2023). The dawn of ChatGPT: Transformation in science assessment. Eurasian Journal of Educational Research, 106, 321-337. https://doi.org/10.14689/ejer.2023.106.019

Chiu, T. K. F., Xia, Q., Zhou, X., Chai, C. S., & Cheng, M. (2023). Systematic literature review on opportunities, challenges, and future research recommendations of artificial intelligence in education. Computers and Education: Artificial Intelligence, 4, 100118. https://doi.org/10.1016/j.caeai.2022.100118

Damaševičius, R., & Šidlauskiene, T. (2024). AI as a teacher: A new educational dynamic for modern classrooms for personalized learning support. In AI-enhanced teaching methods (pp. 1-24). IGI Global. https://doi.org/10.4018/979-8-3693-2728-9.ch001

Geewax, J. J. (2021). API design patterns. Manning.

Gilbert, S., Harvey, H., Melvin, T., Vollebregt, E., & Wicks, P. (2023). Large language models AI chatbots require approval as medical devices. Nature Medicine, 29, 2396-2398. https://doi.org/10.1038/s41591-023-02412-6

Giray, L. (2023). Prompt engineering with ChatGPT: A guide for academic writers. Annals of Biomedical Engineering, 51(12), 2629-2633. https://doi.org/10.1007/s10439-023-03272-4

Jukiewicz, M. (2024). The future of grading programming assignments in education: The role of ChatGPT in automating the assessment and feedback process. Thinking Skills and Creativity, 52, 101522. https://doi.org/10.1016/j.tsc.2024.101522

World Medical Association. (2008). Declaration of Helsinki: Ethical principles for medical research involving human subjects. Journal of the American Medical Association, 300(20), 2413–2415. https://doi.org/10.1001/jama.2008.346

Klyshbekova, M., & Abbott, P. (2024). ChatGPT and assessment in higher education: A magic wand or a disrupter? Electronic Journal of E-Learning, 22(2), 30-45. https://doi.org/10.34190/ejel.21.5.3114

Kooli, C., & Yusuf, N. (2024). Transforming educational assessment: Insights into the use of ChatGPT and large language models in grading. International Journal of Human-Computer Interaction, 1-12. https://doi.org/10.1080/10447318.2024.2338330

Lo, C. K. (2023). What is the impact of ChatGPT on education? A rapid review of the literature. Education Sciences, 13(4), 410.

McGovern, M. (2024). Using the generative artificial intelligence (AI) chatbots of Perplexity and ChatGPT as a teaching and learning tool for practice teachers and students within social work placement. The Journal of Practice Teaching and Learning, 22(1-2), 1-19. https://doi.org/10.1921/jpts.v21i3.2223

Shin, B., Lee, J., & Yoo, Y. (2024). Exploring automatic scoring of mathematical descriptive assessment using prompt engineering with the GPT-4 model: Focused on permutations and combinations. The Mathematical Education, 63(2), 187-207. https://doi.org/10.7468/mathedu.2024.63.2.187

Wang, S., Wang, F., Zhu, Z., Wang, J., Tran, T., & Du, Z. (2024). Artificial intelligence in education: A systematic literature review. Expert Systems with Applications, 254, 214167. https://doi.org/10.1016/j.eswa.2024.214167

Wetzler, E. L., Cassidy, K. S., Jones, M. J., Frazier, C. R., Korbut, N. A., Sims, C. M., Bowen, S. S., & Wood, M. (2024). Grading the grader: Comparing generative AI and human assessment in essay evaluation. Teaching of Psychology, 0(0), 1-7. https://doi.org/10.1177/00986283241282696

Yang, X., Wang, Q., & Lyu, J. (2023). Assessing ChatGPT’s educational capabilities and application potential. ECNU Review of Education. https://doi.org/10.1177/20965311231210006

Published

2026-03-17

Data Availability Statement

The datasets generated and analyzed during the current study are not publicly available due to student privacy and institutional restrictions but are available from the corresponding author on reasonable request.

Section

Articles

How to Cite

Evaluating Various Large Language Models’ Scoring Characteristics of Open-Ended Responses and Essays. (2026). Journal on Excellence in College Teaching. https://celt.miamioh.edu/index.php/JECT/article/view/1332