June 7, 2024 | By Arif Rachmatullah, Marta Mielicki, Hui Yang, and Nonye Alozie
The quality of science education heavily depends on the skill and knowledge of science teachers, especially their Pedagogical Content Knowledge (PCK). This unique, teacher-specific knowledge is crucial for teachers to effectively deliver, organize, and tailor instruction on scientific concepts to cater to the diverse needs and interests of students. Although PCK is vital, existing methods for evaluating PCK, which include written tests, surveys, interviews, and classroom observations, have their drawbacks.
There is consensus that combining interviews with observations is a suitable approach for assessing PCK. However, these methods are time-consuming and often subjective, making them impractical for widespread research applications. Consequently, scenario-based written assessments have become the most common method, especially in large-scale studies. But even this method is relatively labor-intensive and time-consuming, hindering its broad implementation in research and practice.
The emergence of artificial intelligence (AI) technologies, particularly large language models (LLMs), presents a significant opportunity for innovation in PCK evaluation. In the “A Multisource-based Automated Tool for Measuring Science Teachers’ Pedagogical Content Knowledge” (MASTer PCK) project, SRI’s Education and Information & Computing Sciences divisions began collaborating to find ways to leverage AI technology to improve PCK evaluation.
The goal of this collaboration was to train machine learning models to accurately assess teachers’ PCK. Because training such models requires extensive data, the original plan was to collect data from 5,000 teachers nationwide.
As with many education research projects, the MASTer PCK team had difficulty recruiting teachers. Completing hour-long, cognitively demanding PCK assessments proved a daunting task for teachers, which is not surprising considering their often-hectic schedules. Even after the team increased compensation and contacted 5,000 teachers, the MASTer PCK project saw less than 0.5% participation.
In response to this challenge, the MASTer PCK team pivoted to synthetic data, generated by large language models through a meticulous, iterative prompting process to mimic actual teacher responses. A dedicated team of coders spent extensive time scoring both the synthetic data and the data from real teachers.
Using synthetic data to train machine learning models that score both synthetic and teacher responses yielded promising results: 72% agreement with human scoring. However, this approach, which drew on data from six different LLMs (Falcon 40b, Llama 2 13b, Llama 2 7b, Mistral 7b, Zephyr 7b alpha, and MPT 7b), also revealed several lessons:
- For synthetic data, quantity does not equal quality. Many synthetic data sets were extensive and elaborate but often filled with irrelevant information, such as repeated scenario text or unrelated concepts, challenging the assumption that longer responses correlate with higher quality.
- Some synthetic responses are in unusual formats. Some deviated from the way teachers typically respond, for example by taking the form of dialogue-style interactions.
- The LLMs struggle to parse information in visual and mathematical formats. They had difficulty generating accurate responses to questions involving visual or mathematical models, such as scientific processes depicted through equations or pictorial models (e.g., diagrams and equations of photosynthesis and respiration processes), which led to off-target responses.
- Some synthetic responses within the same scenario are inconsistent. They varied significantly, particularly in identifying conceptual challenges and proposing relevant instructional strategies.
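Two of the points above lend themselves to a small sketch: screening synthetic responses for repeated scenario text or dialogue-style formatting, and computing simple percent agreement between model and human scores. The following is a minimal illustration in Python, not the MASTer PCK team's actual pipeline. The function names, the 5-word overlap window, the thresholds, and the speaker-label pattern are all hypothetical choices made for this example, and the article does not specify which agreement statistic lies behind the 72% figure.

```python
import re

# Hypothetical heuristics: the patterns and thresholds below are illustrative
# assumptions, not the screening rules used in the MASTer PCK project.
DIALOGUE_LINE = re.compile(r"^\s*(teacher|student)\s*(\d+)?\s*:", re.IGNORECASE)


def _word_ngrams(text, n):
    """Return the set of lowercase word n-grams found in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def scenario_overlap(response, scenario, n=5):
    """Fraction of the response's word n-grams that also occur in the scenario."""
    response_ngrams = _word_ngrams(response, n)
    if not response_ngrams:
        return 0.0
    shared = response_ngrams & _word_ngrams(scenario, n)
    return len(shared) / len(response_ngrams)


def flag_response(response, scenario, max_overlap=0.5, min_dialogue_lines=2):
    """Return a list of flags for a synthetic response (empty if none apply)."""
    flags = []
    # Flag responses that largely copy the scenario text back.
    if scenario_overlap(response, scenario) > max_overlap:
        flags.append("repeats_scenario")
    # Flag responses written as a script with speaker labels.
    dialogue_lines = sum(
        1 for line in response.splitlines() if DIALOGUE_LINE.match(line)
    )
    if dialogue_lines >= min_dialogue_lines:
        flags.append("dialogue_format")
    return flags


def percent_agreement(model_scores, human_scores):
    """Percentage of responses where the model and human scores match exactly."""
    if len(model_scores) != len(human_scores) or not model_scores:
        raise ValueError("score lists must be non-empty and of equal length")
    matches = sum(m == h for m, h in zip(model_scores, human_scores))
    return 100.0 * matches / len(model_scores)
```

Flagged responses would still need human review, since surface heuristics like these cannot judge whether a response reflects sound pedagogical reasoning. Exact-match percent agreement also does not correct for chance agreement, so a chance-corrected statistic such as Cohen's kappa is often reported alongside it.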
The promise of synthetic data in PCK research is substantial, offering a pathway to overcome the practical challenges of data collection in large-scale studies.
To maximize the potential of this approach, it is crucial to enhance the quality and relevance of synthetic data so that they closely resemble realistic teacher responses. This can be achieved through more sophisticated prompting techniques and iterative refinement of the LLMs’ outputs; hence, more research on prompting is needed.
Additionally, addressing the inconsistencies and format limitations in synthetic responses will be vital. As AI technologies continue to evolve, their application in education research, particularly in assessing and understanding the nuanced realm of PCK, holds great promise.
Specifically, AI technologies can be used to improve the way teacher education and professional development programs provide timely and on-point feedback to teachers, which significantly influences their instructional practices. With careful refinement, synthetic data can significantly contribute to the advancement of education research by potentially providing solutions to the recruitment and data collection issues.
Lastly, working through ethical considerations, such as whether and how teachers’ real data can be uploaded to publicly available, more advanced LLMs (e.g., the GPT series) to train them, is a promising next direction that will likely advance research and practice around developing an accurate machine learning model to assess teacher PCK.