1 General Description
Short answer scoring is an established task in educational natural language processing (Burrows et al., 2015; Bai & Stede, 2023; Bexte et al., 2024). It has already been addressed in at least two previous shared tasks, namely the SemEval-2013 Shared Task (Dzikovska et al., 2013) and the ASAP-SAS Kaggle competition. In recent years, it has primarily been approached using two paradigms: instance-based and similarity-based approaches (Bexte et al., 2022; Bexte et al., 2023). With this shared task, we consider another interpretation: rubric-based short-answer scoring. Here, a model is provided with an answer, a question, and a textual scoring rubric that, for each possible score, establishes the criteria that must be met for that score to be assigned to a given answer. To solve this task successfully, a model must understand the semantics of a scoring rubric and learn to apply its criteria to unseen student answers. This setup is directly inspired by the established practice of manually scoring student answers (Reddy & Andrade, 2010; Panadero & Jonsson, 2013), in which human scorers are often provided with a rubric and must reason about which criteria apply to a given answer in order to assign the correct score.
While rubrics have been used in various works on free-text scoring and have shown promise as a supplementary input or pre-training resource (Wang et al., 2019; Li et al., 2023; Condor et al., 2022), as well as for systems based on LLM prompting (Wei et al., 2025; Frohn et al., 2025), there is, as of now, a lack of focused research on rubric-based short-answer scoring. In practice, the task involves many challenges, such as dealing with uncertainty in the rubrics and, depending on the question, a large range of possible answers. There is thus a clear research gap in understanding how well current natural language processing methods can comprehend rubrics and apply them to students' answers to predict the corresponding scores. With this shared task, we aim to kickstart research on this specific interpretation of the short-answer scoring task and help discover promising approaches. Moreover, since the dataset we propose for this shared task is in German, we aim to contribute to the body of non-English benchmark datasets.
2 Dataset
Table 1: Example datapoint from the dataset.

| Input Category | Example |
|---|---|
| Question | Name consequences that the gas shortage could have for Germany and your school. |
| Answer | The school might have to close because it can no longer be heated. It could also be that students just have to wear jackets during lessons. |
| Rubric | (2) Students identify at least two links between a gas supply stop and the supply of electricity and/or heating. (1) Students identify one link between a gas supply stop and the supply of electricity and/or heating. (0) Students do not identify a link between a gas supply stop and the supply of electricity and/or heating. |
| Score | 1/2 |
The dataset we aim to use in this shared task is an entirely novel dataset called ALICE-LP-1.0, which contains middle- and high-school-level answers to questions from four STEM domains: physics, mathematics, biology, and chemistry. These answers were collected within the ALICE project, funded by the Leibniz Foundation, at middle and high schools in the German state of Schleswig-Holstein. The main goal of this project was to conduct classroom learning progression analytics (Kubsch et al., 2022). Teachers guided students synchronously through lessons in Moodle courses that involved solving various interactive learning activities, including a set of short-answer items, which form the basis of this dataset. The dataset is entirely in German.

Each question is rated on a three-point scale (incorrect, partially correct, correct). A single datapoint consists of the question text, the answer, a textual three-way rubric defining the criteria for each possible score, and the corresponding score. Table 1 shows an example datapoint from this dataset. For the shared task, we also aim to provide a two-point version that distinguishes only correct and incorrect answers, with partially correct answers counted as incorrect.

The training set contains 7,899 answers. The evaluation set is divided into two subsets: unseen answers and unseen questions. This setup was inspired by the established SemEval-2013 (Dzikovska et al., 2013) and SAF (Filighera et al., 2022) datasets and enables evaluating how well systems can transfer to unseen questions. The unseen answers subset includes 2,008 answers to questions contained in the training set (using a question-wise stratified 80/20 split), whereas the unseen questions subset contains 3,168 answers to questions not present in the training set.
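To make the data setup concrete, the question-wise 80/20 split and the two-way label collapse described above can be sketched as follows. This is a simplified illustration: it splits per question but does not stratify by score, and the `question_id` field name follows the JSON format shown in Section 5.

```python
import random
from collections import defaultdict

def to_two_way(score):
    # For the two-point version, partially correct answers count as incorrect.
    return "Correct" if score == "Correct" else "Incorrect"

def question_wise_split(data, train_frac=0.8, seed=0):
    """Split answers 80/20 within each question, so that every held-out
    answer belongs to a question that also appears in training
    (the unseen-answers setting)."""
    by_question = defaultdict(list)
    for item in data:
        by_question[item["question_id"]].append(item)
    rng = random.Random(seed)
    train, held_out = [], []
    for answers in by_question.values():
        rng.shuffle(answers)
        cut = int(len(answers) * train_frac)
        train.extend(answers[:cut])
        held_out.extend(answers[cut:])
    return train, held_out
```

The unseen-questions subset, by contrast, would hold out entire questions rather than splitting within them.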
3 Tracks and Evaluation Metrics
In the context of the shared task, we aim to host four evaluation tracks. All tracks are evaluated using the same set of evaluation metrics, namely precision, recall, weighted F1 and quadratic weighted kappa (QWK), with QWK acting as the primary evaluation metric that determines the leaderboard ranking for each track. The four tracks are the following:
- Unseen answers three-way: For this track, participating models are evaluated based on how well they can score unseen answers to questions seen during training. For this track, we distinguish between correct, partially correct and incorrect answers.
- Unseen questions three-way: For this track, participating models are evaluated based on how well they can score answers to questions not seen during training. For this track, we distinguish between correct, partially correct and incorrect answers.
- Unseen answers two-way: For this track, participating models are evaluated based on how well they can score unseen answers to questions seen during training. For this track, we only distinguish between correct and incorrect answers, with partially correct answers being counted as incorrect.
- Unseen questions two-way: For this track, participating models are evaluated based on how well they can score answers to questions not seen during training. For this track, we only distinguish between correct and incorrect answers, with partially correct answers being counted as incorrect.
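Since QWK determines the leaderboard ranking, the metric is worth spelling out. A minimal, dependency-free implementation for the ordinal label scale might look like the sketch below; the label names follow the JSON format in Section 5, and the official evaluation script may differ in details.

```python
from collections import Counter

LABELS = ["Incorrect", "Partially Correct", "Correct"]

def quadratic_weighted_kappa(gold, pred, labels=LABELS):
    """Quadratic weighted kappa over ordinal labels, indexed by their
    position in `labels` (0 = lowest score, len(labels)-1 = highest)."""
    k = len(labels)
    idx = {lab: i for i, lab in enumerate(labels)}
    g = [idx[x] for x in gold]
    p = [idx[x] for x in pred]
    n = len(g)
    # observed confusion matrix
    obs = [[0] * k for _ in range(k)]
    for a, b in zip(g, p):
        obs[a][b] += 1
    # marginal histograms -> expected counts under independence
    hg, hp = Counter(g), Counter(p)
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = (i - j) ** 2 / (k - 1) ** 2  # quadratic disagreement weight
            num += w * obs[i][j]
            den += w * hg[i] * hp[j] / n
    return 1.0 - num / den if den else 1.0
```

Perfect agreement yields 1.0, and systematic disagreement at the scale extremes yields -1.0; for the two-way tracks, the same function applies with a two-element label list.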
All participants will be allowed to submit five entries per track, with the best submission, as determined by QWK, counted towards the official leaderboard of the shared task. We allow for five submissions to account for different variants of authors' systems and the non-deterministic nature of deep learning methods. The shared task will be completely open, meaning that participants are not limited in any way regarding methods, models, and additional resources. We encourage all participants to publish their code on a platform such as GitHub or Codeberg under an open-source license of their choice.
4 System Description Papers
All participants are highly encouraged to submit a system description paper. System description papers must use the most recent ACL LaTeX templates and can be up to five pages in length, plus unlimited pages for references and appendices. Participants are also required to include a limitations section and an ethics section in their paper, neither of which counts towards the page limit. Each participating team that aims to submit a paper must nominate at least one reviewer, who will review two other participating teams' system description papers. While the papers must follow good scientific practice, we ask the reviewers to apply lenient criteria: reasons for rejection would mainly be plagiarism, fabricated results, or similar. In particular, a lack of methodological novelty is NOT a reason to reject a shared task system description paper.
5 Submission Format
The dataset comes in JSON format, structured as follows:
```json
[
  ...,
  {
    "id": "f81d4fae-7dec-11d0-a765-00a0c91e6bf6",
    "question_id": "41d2e746-86e5-4e0f-a541-4a10859ad0d7",
    "question": "...",
    "answer": "...",
    "rubric": {
      "Incorrect": "...",
      "Partially Correct": "...",
      "Correct": "..."
    },
    "score": "Incorrect/Partially Correct/Correct"
  },
  ...
]
```
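Assembling a model input from a datapoint in this format is straightforward. The template below is purely illustrative: neither the prompt layout nor the `train.json` filename is prescribed by the task.

```python
import json

def format_input(item):
    """Concatenate question, rubric, and answer into a single input string.
    (Illustrative template only; participants are free to choose their own.)"""
    rubric = "\n".join(f"({label}) {text}" for label, text in item["rubric"].items())
    return (f"Question: {item['question']}\n"
            f"Rubric:\n{rubric}\n"
            f"Answer: {item['answer']}")

# Loading the training data (hypothetical filename):
# with open("train.json", encoding="utf-8") as f:
#     data = json.load(f)
```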
For the submission of the predicted scores, it suffices to submit a JSON file in the following format per track:
```json
[
  ...,
  {
    "id": "f81d4fae-7dec-11d0-a765-00a0c91e6bf6",
    "question_id": "41d2e746-86e5-4e0f-a541-4a10859ad0d7",
    "score": "Incorrect/Partially Correct/Correct"
  },
  ...
]
```
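Before submitting, it can help to sanity-check a prediction file against the test set. The sketch below is not an official validator, just an illustration of the constraints implied by the format above; for the two-way tracks, "Partially Correct" would be dropped from the valid label set.

```python
VALID_SCORES = {"Incorrect", "Partially Correct", "Correct"}

def validate_submission(predictions, test_ids):
    """Check that a submission covers exactly the test instances,
    uses only the expected keys, and assigns valid score labels."""
    seen = set()
    for entry in predictions:
        assert set(entry) == {"id", "question_id", "score"}, "unexpected keys"
        assert entry["score"] in VALID_SCORES, f"invalid score: {entry['score']}"
        seen.add(entry["id"])
    assert seen == set(test_ids), "ids do not match the test set"
```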
6 Timeline
- 26th January 2026 AoE: release of training data
- 21st March 2026 AoE: release of unlabelled test data
- 28th March 2026 AoE: deadline for submitting results
- 3rd April 2026 AoE: publication of official leaderboards
- 24th April 2026 AoE: deadline for system description papers
- 2nd May 2026 AoE: deadline for reviews
- 4th May 2026 AoE: notification of acceptance
- 12th May 2026 AoE: deadline for camera-ready papers
7 Organizers
- Sebastian Gombert: DIPF | Leibniz-Institute for Research and Information in Education
- Zhifan Sun: DIPF | Leibniz-Institute for Research and Information in Education
- Fabian Zehner: DIPF | Leibniz-Institute for Research and Information in Education
- Jannik Lossjew: IPN | Leibniz-Institute for Science and Mathematics Education
- Tobias Wyrwich: IPN | Leibniz-Institute for Science and Mathematics Education
- Berrit Katharina Czinczel: IPN | Leibniz-Institute for Science and Mathematics Education
- David Bednorz: IPN | Leibniz-Institute for Science and Mathematics Education
- Sascha Bernholt: IPN | Leibniz-Institute for Science and Mathematics Education
- Knut Neumann: IPN | Leibniz-Institute for Science and Mathematics Education
- Ute Harms: IPN | Leibniz-Institute for Science and Mathematics Education
- Aiso Heinze: IPN | Leibniz-Institute for Science and Mathematics Education
- Hendrik Drachsler: DIPF | Leibniz-Institute for Research and Information in Education & Goethe-Universität Frankfurt
8 Registration
References
- Bai, X., & Stede, M. (2023). A survey of current machine learning approaches to student free-text evaluation for intelligent tutoring. International Journal of Artificial Intelligence in Education, 33(4), 992-1030.
- Bexte, M., Horbach, A., & Zesch, T. (2022, July). Similarity-based content scoring: How to make S-BERT keep up with BERT. In Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022) (pp. 118-123).
- Bexte, M., Horbach, A., & Zesch, T. (2023, July). Similarity-based content scoring: A more classroom-suitable alternative to instance-based scoring? In Findings of the Association for Computational Linguistics: ACL 2023 (pp. 1892-1903).
- Bexte, M., Horbach, A., & Zesch, T. (2024). Strengths and weaknesses of automated scoring of free-text student answers. Informatik Spektrum, 47(3), 78-86.
- Burrows, S., Gurevych, I., & Stein, B. (2015). The eras and trends of automatic short answer grading. International Journal of Artificial Intelligence in Education, 25(1), 60-117.
- Condor, A., Pardos, Z., & Linn, M. (2022, July). Representing scoring rubrics as graphs for automatic short answer grading. In International Conference on Artificial Intelligence in Education (pp. 354-365). Cham: Springer International Publishing.
- Dzikovska, M. O., Nielsen, R., Brew, C., Leacock, C., Giampiccolo, D., Bentivogli, L., … & Dang, H. T. (2013, June). SemEval-2013 Task 7: The joint student response analysis and 8th recognizing textual entailment challenge. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013) (pp. 263-274).
- Filighera, A., Parihar, S., Steuer, T., Meuser, T., & Ochs, S. (2022, May). Your answer is incorrect… Would you like to know why? Introducing a bilingual short answer feedback dataset. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 8577-8591).
- Frohn, S., Burleigh, T., & Chen, J. (2025, July). Automated scoring of short answer questions with large language models: Impacts of model, item, and rubric design. In International Conference on Artificial Intelligence in Education (pp. 44-51). Cham: Springer Nature Switzerland.
- Kubsch, M., Czinczel, B., Lossjew, J., Wyrwich, T., Bednorz, D., Bernholt, S., … & Rummel, N. (2022, August). Toward learning progression analytics—Developing learning environments for the automated analysis of learning using evidence centered design. In Frontiers in Education (Vol. 7, p. 981910). Frontiers Media SA.
- Li, Z., Lloyd, S., Beckman, M., & Passonneau, R. J. (2023, December). Answer-state recurrent relational network (AsRRN) for constructed response assessment and feedback grouping. In Findings of the Association for Computational Linguistics: EMNLP 2023 (pp. 3879-3891).
- Panadero, E., & Jonsson, A. (2013). The use of scoring rubrics for formative assessment purposes revisited: A review. Educational Research Review, 9, 129-144.
- Reddy, Y. M., & Andrade, H. (2010). A review of rubric use in higher education. Assessment & Evaluation in Higher Education, 35(4), 435-448.
- Wang, T., Inoue, N., Ouchi, H., Mizumoto, T., & Inui, K. (2019, November). Inject rubrics into short answer grading system. In Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019) (pp. 175-182).
- Wei, Y., Pearl, D., Beckman, M., & Passonneau, R. J. (2025). Concept-based rubrics improve LLM formative assessment and data synthesis. arXiv preprint arXiv:2504.03877.
