Introduction to AI-based Automated Item Generation and Automated Scoring
Alina A. von Davier and Duanli Yan
Full-day short course (9:00am – 5:00pm)
Short course #1
In the era of artificial intelligence (AI), the field of educational testing faces significant challenges, particularly in test development and scoring. Two key innovations address these challenges: automated item generation (AIG) and automated scoring (AS). Only recently has generative AI made it feasible to develop complex test items on a large scale.
We introduce “the item factory” for managing large-scale test development, including automation of item generation, quality review, quality assurance, and crowdsourcing techniques in adaptive testing. We present an overview of the latest natural language processing (NLP) techniques and large language models for AIG, alongside psychometric principles and practices for test development. We discuss the application of engineering principles in designing efficient item production processes (Luecht, 2008; Dede et al., 2018; von Davier, 2017). AS has become an integral part of the assessment landscape due to its advantages in reporting time, cost, objectivity, consistency, transparency, and feedback. We aim to demystify AS and provide a comprehensive understanding of its workings. We offer an overview of the design, development, evaluation, and quality control of automated scoring systems, along with practical advice and considerations for practitioners on the application of these systems in formative and summative assessments (Yan, Rupp, & Foltz, 2020).
Intended audience
This course is designed for individuals interested in the field of large-scale assessments and in AIG and AS. It covers various aspects, including AI for AIG and AS, the development and implementation of AIG and AS systems, standardization and validation techniques, and best practices for quality control and operations.
Summary of Objectives
This training offers a comprehensive overview of the many facets of automated item generation (AIG) and automated scoring (AS) in adaptive testing, drawing on real-world operational practices, published papers (Attali et al., 2022; von Davier et al., 2024), and the edited volume Handbook of Automated Scoring: Theory into Practice (Yan, Rupp, & Foltz, 2020). Participants will be guided through all operational aspects of AIG and AS, from design to implementation.
- Demystifying the Black Box: Participants will gain an in-depth understanding of the various methods used for generating items and constructing automated scoring systems. We will discuss a human-in-the-loop approach to automation and the importance of preserving human values in highly automated systems.
- Implementing Systems in Operational Practice: AIG and AS are now central to educational assessment, but transitioning them into operational, deployed systems requires a more complex process involving different implementation models and associated procedures. Participants will learn about preprocessing textual data, filtering unscorable essays and diverting them to human scoring, model building, score assignment, and reporting (a pipeline sketch follows this list).
- Evaluating and Maintaining Systems Over Time: Participants will learn various approaches to assessing the performance of AIG and AS systems, including comparisons to human test development and scoring using evaluation metrics (a metrics sketch also follows this list), as well as managing system changes in operational practice. See Analytics for Quality Assurance in Assessment (von Davier, Liao, et al., 2022) for an example of such a system.
- Exploring Open Issues, Future Directions, and Engaging in General Discussion: Participants will explore applications of AIG and AS, discuss potential implementation challenges, and consider what is needed to advance the field by bringing together AI, NLP, psychometrics, ethics, and human values to strengthen the operational use of AIG and AS.
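To make the operational scoring workflow in the second objective concrete, the sketch below walks through the filter, model-building, and score-assignment steps in Python. The TF-IDF features, ridge model, word-count filter, and toy data are all illustrative assumptions for teaching purposes, not components of any operational engine discussed in the course.

```python
# A minimal sketch of an automated scoring pipeline, assuming a
# TF-IDF + ridge-regression model as a stand-in for an operational
# engine. All names, data, and thresholds are illustrative only.
import re

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge


def is_unscorable(essay: str, min_words: int = 5) -> bool:
    """Filter step: flag responses too short to score automatically.

    Operational systems use far richer filters (off-topic, gibberish,
    non-English detection); this toy word count is for illustration.
    """
    return len(re.findall(r"[A-Za-z']+", essay)) < min_words


# Model building: fit on human-scored training essays (toy data).
train_essays = [
    "Plants grew faster under the lamp because light drives photosynthesis.",
    "The data support the claim that exercise improves short-term memory.",
    "Dogs are nice and I like them a lot because they are nice.",
    "Recycling reduces landfill waste and conserves raw materials.",
]
train_scores = np.array([4, 4, 2, 3])  # human holistic scores

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_essays)
model = Ridge(alpha=1.0).fit(X_train, train_scores)

# Score assignment: filter, then score or divert to human raters.
for essay in ["Light made the plants grow quickly.", "idk lol"]:
    if is_unscorable(essay):
        print("-> routed to human scoring:", essay)
    else:
        pred = model.predict(vectorizer.transform([essay]))[0]
        print(f"-> machine score {pred:.1f}:", essay)
```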
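For the evaluation objective, the following sketch computes quadratic weighted kappa (QWK) and agreement rates, metrics commonly used to compare machine scores against human scores. The score vectors and the closing rule of thumb are illustrative, not course materials.

```python
# Illustrative evaluation of machine vs. human scores using quadratic
# weighted kappa (QWK) and agreement rates; toy data, and acceptance
# thresholds in practice vary by assessment program.
import numpy as np
from sklearn.metrics import cohen_kappa_score

human = np.array([3, 2, 4, 3, 1, 5, 2, 4, 3, 3])
machine = np.array([3, 2, 3, 3, 2, 5, 2, 4, 4, 3])

qwk = cohen_kappa_score(human, machine, weights="quadratic")
exact = np.mean(human == machine)                 # identical scores
adjacent = np.mean(np.abs(human - machine) <= 1)  # within one point

print(f"QWK={qwk:.3f}  exact={exact:.0%}  adjacent={adjacent:.0%}")
# One common check: machine-human QWK should be comparable to
# human-human QWK on the same set of responses.
```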
About the Instructors
Alina A. von Davier (Duolingo)
Dr. Alina von Davier is the Chief of Assessment at Duolingo, where she leads research and development for the Duolingo English Test. She is also the founder and CEO of EdAstra Tech, a service-oriented EdTech company. She is a researcher, innovator, and executive leader in the fields of computational psychometrics, machine learning, and education.
Duanli Yan (Measurement Incorporated)
Dr. Duanli Yan is a research scientist at Measurement Incorporated working on AI-based automated scoring. She previously served as Director of Data Analysis and Computational Research in the Research & Development division at ETS, where she was responsible for evaluating automated scoring engine upgrades. She has extensive experience in adaptive testing, psychometric research, test security, and innovative technology research and implementation. Dr. Yan is also an adjunct professor at Fordham University and at Rutgers, The State University of New Jersey.
The instructors have published extensively, including many books and peer-reviewed journal articles. They are co-editors of several volumes:
- Artificial Intelligence Applications in Educational Learning and Assessment (2026)
- Computerized Multistage Testing: Theory and Applications (2014), winner of the 2016 AERA Division D award for Significant Contribution to Educational Measurement and Research Methodology
- Research for Practical Issues and Solutions in Computerized Multistage Testing (2024)
They also co-authored Computerized Adaptive and Multistage Testing with R (2017). Dr. von Davier is a co-editor of Computational Psychometrics: New Methodologies for a New Generation of Digital Learning and Assessment. Dr. Yan is a co-editor of the Handbook of Automated Scoring: Theory into Practice and of the Handbook of Research on Science Learning Progressions.