1 General Description

Short answer scoring is an established task in educational natural language processing (Burrows et al., 2015; Bai & Stede, 2023; Bexte et al., 2024). It has already been addressed in at least two previous shared tasks, namely the SemEval-2013 Shared Task (Dzikovska et al., 2013) and the ASAP-SAS Kaggle competition. In recent years, it has primarily been approached using two paradigms: instance-based and similarity-based approaches (Bexte et al., 2022; Bexte et al., 2023). With this shared task, we consider another interpretation: rubric-based short-answer scoring. In rubric-based short-answer scoring, a model is provided with an answer, a question, and a textual scoring rubric that, for each possible score, establishes the criteria that need to be met for that score to be assigned to a given answer. To solve this task successfully, a model must understand the semantics of a scoring rubric and learn to apply its criteria to unseen student answers. This setup is directly inspired by the established way of manually scoring student answers (Reddy & Andrade, 2010; Panadero & Jonsson, 2013): human scorers are often provided with a rubric and must reason about which criteria apply to a given answer in order to assign the correct score.
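Conceptually, the model input in this setup is a function of three texts. A minimal sketch of how question, answer, and rubric might be combined into a single model input (the function name and formatting are illustrative, not part of the task definition):

```python
def format_scoring_input(question: str, answer: str, rubric: dict[str, str]) -> str:
    """Combine question, student answer, and rubric into one model input string.

    `rubric` maps score labels (e.g. "Correct") to their criteria text.
    """
    rubric_text = "\n".join(f"({label}) {criteria}" for label, criteria in rubric.items())
    return (
        f"Question: {question}\n"
        f"Rubric:\n{rubric_text}\n"
        f"Student answer: {answer}\n"
        "Score:"
    )
```

The same three inputs could just as well be fed to an encoder model as separate segments; the point is only that the rubric is part of the input rather than implicit in the training labels.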

While rubrics have been applied in various works on free-text scoring and have shown promise as a supplementary input or pre-training resource (Wang et al., 2019; Li et al., 2023; Condor et al., 2022) as well as for systems based on LLM prompting (Wei et al., 2025; Frohn et al., 2025), there is, as of now, a lack of focused research on rubric-based short-answer scoring. In practice, the task involves many challenges, such as dealing with uncertainty in the rubrics and, depending on the question, a large range of possible answers. There is thus a clear research gap in understanding how well current natural language processing methods can comprehend rubrics and apply them to students' answers to predict the corresponding scores. With this shared task, we aim to kickstart research on this specific interpretation of the short-answer scoring task and help discover promising approaches. Moreover, since the dataset we propose for this shared task is in German, we aim to contribute to the body of non-English benchmark datasets.

2 Dataset

Input Category  Example
Question        Name consequences that the gas shortage could have for Germany and your school.
Answer          The school might have to close because it can no longer be heated. It could
                also be that students just have to wear jackets during lessons.
Rubric          (2) Students identify at least two links between a gas supply stop and the
                supply of electricity and/or heating.
                (1) Students identify one link between a gas supply stop and the supply of
                electricity and/or heating.
                (0) Students do not identify a link between a gas supply stop and the supply
                of electricity and/or heating.
Score           1/2

Table 1: An example question with the corresponding rubric, one example student answer, and the corresponding score, taken from the ALICE-LP dataset (translated from German to English).

The dataset we aim to use in this shared task is an entirely novel dataset called ALICE-LP-1.0, which contains middle- and high-school-level answers to questions from four STEM domains: physics, mathematics, biology, and chemistry. These answers were collected within the context of the ALICE project, funded by the Leibniz Foundation, at middle and high schools in the German state of Schleswig-Holstein. The main goal of this project was to conduct classroom learning progression analytics (Kubsch et al., 2022). Teachers guided students through lessons in which they synchronously solved various interactive learning activities in Moodle courses. These activities included a set of short-answer items, which form the basis of this dataset.

The dataset is entirely in German. Each answer is rated on a three-point scale (incorrect, partially correct, correct). A single datapoint consists of the question text, the answer, a textual three-way rubric defining the criteria for each possible score, and the corresponding score. Table 1 shows an example datapoint from this dataset. For the shared task, we also aim to provide a two-point version that distinguishes only correct and incorrect answers, with partially correct answers counted as incorrect.

The training set contains 7,899 answers. The evaluation set is divided into two subsets: unseen answers and unseen questions. This setup was inspired by the established SemEval-2013 (Dzikovska et al., 2013) and SAF (Filighera et al., 2022) datasets and enables evaluating how well systems transfer to unseen questions. The unseen answers subset includes 2,008 answers to questions contained in the training set (using a question-wise stratified 80/20 split). In contrast, the unseen questions subset contains 3,168 answers to questions not in the training set.
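The question-wise stratified 80/20 split behind the unseen-answers subset can be sketched as follows. This is an illustrative reconstruction under the assumption that answers are split per question, not the exact procedure used to build the dataset:

```python
import random
from collections import defaultdict

def split_unseen_answers(items, train_ratio=0.8, seed=0):
    """Split answers 80/20 within each question, so that every question's
    answers appear in both the training and the unseen-answers evaluation set."""
    by_question = defaultdict(list)
    for item in items:
        by_question[item["question_id"]].append(item)
    rng = random.Random(seed)
    train, unseen_answers = [], []
    for answers in by_question.values():
        rng.shuffle(answers)
        cut = int(len(answers) * train_ratio)
        train.extend(answers[:cut])
        unseen_answers.extend(answers[cut:])
    return train, unseen_answers
```

The unseen-questions subset, by contrast, would simply hold out entire questions with all of their answers.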

3 Tracks and Evaluation Metrics

In the context of the shared task, we aim to host four evaluation tracks. All tracks are evaluated using the same set of evaluation metrics, namely weighted precision, weighted recall, weighted F1, and quadratic weighted kappa (QWK), with QWK acting as the primary evaluation metric that determines the leaderboard ranking for each track. The four tracks are the following:

  1. Unseen answers three-way: For this track, participating models are evaluated based on how well they can score unseen answers to questions seen during training. For this track, we distinguish between correct, partially correct and incorrect answers.
  2. Unseen questions three-way: For this track, participating models are evaluated based on how well they can score answers to questions not seen during training. For this track, we distinguish between correct, partially correct and incorrect answers.
  3. Unseen answers two-way: For this track, participating models are evaluated based on how well they can score unseen answers to questions seen during training. For this track, we only distinguish between correct and incorrect answers, with partially correct answers being counted as incorrect.
  4. Unseen questions two-way: For this track, participating models are evaluated based on how well they can score answers to questions not seen during training. For this track, we only distinguish between correct and incorrect answers, with partially correct answers being counted as incorrect.
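Since QWK determines the leaderboard ranking, it may help to state it concretely. A minimal pure-Python sketch for integer labels 0..k-1, equivalent to, e.g., scikit-learn's cohen_kappa_score with weights="quadratic":

```python
def quadratic_weighted_kappa(gold, pred, num_classes=3):
    """Quadratic weighted kappa for integer labels in {0, ..., num_classes - 1}."""
    k = num_classes
    # Observed confusion matrix: rows = gold label, columns = predicted label.
    observed = [[0] * k for _ in range(k)]
    for g, p in zip(gold, pred):
        observed[g][p] += 1
    n = len(gold)
    gold_marginals = [sum(row) for row in observed]
    pred_marginals = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    disagreement = expected_disagreement = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2  # quadratic distance penalty
            expected = gold_marginals[i] * pred_marginals[j] / n  # chance agreement
            disagreement += weight * observed[i][j]
            expected_disagreement += weight * expected
    return 1.0 - disagreement / expected_disagreement
```

Unlike accuracy or F1, QWK penalizes predictions by their squared distance from the gold label, so confusing "correct" with "partially correct" costs less than confusing it with "incorrect".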

All participants will be allowed to submit five entries per track, with the best submission, as determined by QWK, counting towards the official leaderboard of the shared task. We allow five submissions to account for different variants of authors' systems and the non-deterministic nature of deep learning methods. The shared task will be completely open, meaning that participants are not limited in any way regarding methods, models, and additional resources. We encourage all participants to publish their code on a platform such as GitHub or Codeberg under an open-source license of their choice.

4 System Description Papers

All participants are highly encouraged to submit a system description paper. System description papers must use the most recent ACL LaTeX template and can be up to five pages in length, plus unlimited pages for references and appendices. Participants are also required to include a limitations section and an ethics section in their paper, neither of which counts towards the page limit. Each participating team that aims to submit a paper must nominate at least one reviewer, who will review two other participating teams' system description papers. While the papers must follow good scientific practice, we ask the reviewers to apply lenient criteria: reasons for rejection would mainly be plagiarism, fabricated results, or similar. In particular, a lack of methodological novelty is NOT a reason to reject a shared task system description paper.

5 Submission Format

The dataset comes in JSON format with the following structure:

[
  ...,
  {
    "id": "f81d4fae-7dec-11d0-a765-00a0c91e6bf6",
    "question_id": "41d2e746-86e5-4e0f-a541-4a10859ad0d7",
    "question": "...",
    "answer": "...",
    "rubric": {
      "Incorrect": "...",
      "Partially Correct": "...",
      "Correct": "..."
    },
    "score": "Incorrect/Partially Correct/Correct"
  },
  ...
]
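Datapoints in this format can be read with the standard json module. A minimal sketch, where the label-to-integer mapping is our own convention rather than something prescribed by the task:

```python
import json

# Map the task's string labels to integers for model training (our convention).
LABEL_TO_INT = {"Incorrect": 0, "Partially Correct": 1, "Correct": 2}

def load_examples(json_text):
    """Parse the dataset JSON and return (inputs, integer labels)."""
    records = json.loads(json_text)
    inputs, labels = [], []
    for record in records:
        inputs.append((record["question"], record["answer"], record["rubric"]))
        labels.append(LABEL_TO_INT[record["score"]])
    return inputs, labels
```

For the two-way tracks, the same mapping would simply collapse "Partially Correct" to 0.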

For the submission of the predicted scores, it suffices to submit one JSON file per track in the following format:

[
  ...,
  {
    "id": "f81d4fae-7dec-11d0-a765-00a0c91e6bf6",
    "question_id": "41d2e746-86e5-4e0f-a541-4a10859ad0d7",
    "score": "Incorrect/Partially Correct/Correct"
  },
  ...
]
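A submission file in this format can be produced directly from a mapping of datapoint ids to predicted labels. A small sketch; function and variable names are illustrative:

```python
import json

def build_submission(test_items, predictions):
    """Pair each test item with its predicted label in the required format.

    `test_items` are datapoints from the unlabelled test JSON;
    `predictions` maps each item's "id" to one of the score labels.
    """
    return json.dumps(
        [
            {
                "id": item["id"],
                "question_id": item["question_id"],
                "score": predictions[item["id"]],
            }
            for item in test_items
        ],
        ensure_ascii=False,
        indent=2,
    )
```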

6 Timeline

  • 26th January 2026 AoE: release of training data
  • 21st March 2026 AoE: release of unlabelled test data
  • 28th March 2026 AoE: deadline for submitting results
  • 3rd April 2026 AoE: publication of official leaderboards
  • 24th April 2026 AoE: deadline for system description papers
  • 2nd May 2026 AoE: deadline for reviews
  • 4th May 2026 AoE: notification of acceptance
  • 12th May 2026 AoE: deadline for camera-ready papers

7 Organizers

  • Sebastian Gombert: DIPF | Leibniz-Institute for Research and Information in Education
  • Zhifan Sun: DIPF | Leibniz-Institute for Research and Information in Education
  • Fabian Zehner: DIPF | Leibniz-Institute for Research and Information in Education
  • Jannik Lossjew: IPN | Leibniz-Institute for Science and Mathematics Education
  • Tobias Wyrwich: IPN | Leibniz-Institute for Science and Mathematics Education
  • Berrit Katharina Czinczel: IPN | Leibniz-Institute for Science and Mathematics Education
  • David Bednorz: IPN | Leibniz-Institute for Science and Mathematics Education
  • Sascha Bernholt: IPN | Leibniz-Institute for Science and Mathematics Education
  • Knut Neumann: IPN | Leibniz-Institute for Science and Mathematics Education
  • Ute Harms: IPN | Leibniz-Institute for Science and Mathematics Education
  • Aiso Heinze: IPN | Leibniz-Institute for Science and Mathematics Education
  • Hendrik Drachsler: DIPF | Leibniz-Institute for Research and Information in Education & Goethe-Universität Frankfurt

8 Leaderboards

BEA Shared Task Leaderboards

Leaderboard 1 (best submission per team):

Rank Team Name Quadratic Weighted Kappa Weighted Precision Weighted Recall Weighted F1-Score
1 IWM-DKM 0.726 0.887 0.887 0.887
2 WSE Research 0.717 0.882 0.883 0.883
3 SDPA 0.682 0.869 0.866 0.867
4 RETUYT-INCO 0.674 0.865 0.861 0.863
5 ASLAN 0.663 0.861 0.864 0.862
6 AMATI 0.644 0.857 0.843 0.847
7 Diffuser 0.615 0.84 0.842 0.841
8 Afrilan 0.546 0.814 0.803 0.807
9 HFT 0.477 0.786 0.793 0.787

Leaderboard 1 (all submissions):

Rank Team Name Quadratic Weighted Kappa Weighted Precision Weighted Recall Weighted F1-Score
1 IWM-DKM 0.726 0.887 0.887 0.887
2 WSE Research 0.717 0.882 0.883 0.883
3 WSE Research 0.717 0.882 0.883 0.883
4 IWM-DKM 0.71 0.88 0.881 0.881
5 WSE Research 0.71 0.88 0.88 0.88
6 WSE Research 0.702 0.878 0.879 0.878
7 IWM-DKM 0.7 0.876 0.877 0.876
8 IWM-DKM 0.698 0.875 0.877 0.876
9 WSE Research 0.697 0.874 0.875 0.874
10 IWM-DKM 0.684 0.869 0.87 0.869
11 SDPA 0.682 0.869 0.866 0.867
12 SDPA 0.679 0.867 0.866 0.866
13 RETUYT-INCO 0.674 0.865 0.861 0.863
14 RETUYT-INCO 0.673 0.866 0.861 0.862
15 RETUYT-INCO 0.672 0.864 0.862 0.863
16 RETUYT-INCO 0.667 0.862 0.862 0.862
17 ASLAN 0.663 0.861 0.864 0.862
18 SDPA 0.662 0.86 0.862 0.86
19 SDPA 0.656 0.857 0.856 0.857
20 ASLAN 0.654 0.86 0.863 0.859
21 ASLAN 0.654 0.857 0.86 0.858
22 RETUYT-INCO 0.654 0.858 0.851 0.853
23 SDPA 0.652 0.856 0.858 0.856
24 ASLAN 0.648 0.857 0.86 0.856
25 AMATI 0.644 0.857 0.843 0.847
26 AMATI 0.644 0.857 0.843 0.847
27 ASLAN 0.637 0.85 0.853 0.851
28 Diffuser 0.615 0.84 0.842 0.841
29 Afrilan 0.546 0.814 0.803 0.807
30 AMATI 0.489 0.801 0.807 0.795
31 AMATI 0.485 0.802 0.807 0.794
32 AMATI 0.485 0.802 0.807 0.794
33 HFT 0.477 0.786 0.793 0.787
34 HFT 0.465 0.781 0.789 0.783
35 HFT 0.435 0.766 0.77 0.768
36 HFT 0.424 0.761 0.766 0.763

Leaderboard 2 (best submission per team):

Rank Team Name Quadratic Weighted Kappa Weighted Precision Weighted Recall Weighted F1-Score
1 IWM-DKM 0.55 0.813 0.818 0.815
2 SDPA 0.535 0.806 0.804 0.805
3 WSE Research 0.533 0.806 0.811 0.808
4 RETUYT-INCO 0.49 0.787 0.791 0.789
5 HFT 0.482 0.786 0.793 0.788
6 ASLAN 0.457 0.78 0.789 0.779
7 Afrilan 0.437 0.766 0.757 0.761
8 AMATI 0.289 0.704 0.714 0.708

Leaderboard 2 (all submissions):

Rank Team Name Quadratic Weighted Kappa Weighted Precision Weighted Recall Weighted F1-Score
1 IWM-DKM 0.55 0.813 0.818 0.815
2 SDPA 0.535 0.806 0.804 0.805
3 WSE Research 0.533 0.806 0.811 0.808
4 SDPA 0.531 0.806 0.811 0.807
5 IWM-DKM 0.526 0.802 0.797 0.799
6 WSE Research 0.525 0.801 0.804 0.803
7 WSE Research 0.51 0.799 0.806 0.8
8 WSE Research 0.503 0.792 0.792 0.792
9 WSE Research 0.503 0.792 0.792 0.792
10 IWM-DKM 0.501 0.796 0.804 0.797
11 RETUYT-INCO 0.49 0.787 0.791 0.789
12 HFT 0.482 0.786 0.793 0.788
13 HFT 0.477 0.784 0.791 0.785
14 HFT 0.467 0.78 0.787 0.782
15 SDPA 0.461 0.776 0.781 0.778
16 ASLAN 0.457 0.78 0.789 0.779
17 ASLAN 0.454 0.778 0.787 0.778
18 HFT 0.452 0.772 0.763 0.767
19 SDPA 0.452 0.771 0.772 0.772
20 ASLAN 0.452 0.772 0.779 0.774
21 HFT 0.447 0.77 0.76 0.764
22 Afrilan 0.437 0.766 0.757 0.761
23 SDPA 0.436 0.767 0.775 0.769
24 RETUYT-INCO 0.432 0.764 0.77 0.766
25 ASLAN 0.417 0.77 0.78 0.765
26 ASLAN 0.388 0.747 0.757 0.75
27 RETUYT-INCO 0.341 0.746 0.76 0.736
28 AMATI 0.289 0.704 0.714 0.708

Leaderboard 3 (best submission per team):

Rank Team Name Quadratic Weighted Kappa Weighted Precision Weighted Recall Weighted F1-Score
1 IWM-DKM 0.796 0.781 0.782 0.78
2 WSE Research 0.79 0.773 0.774 0.773
3 SDPA 0.776 0.775 0.763 0.766
4 ASLAN 0.757 0.758 0.743 0.746
5 AMATI 0.749 0.76 0.736 0.739
6 RETUYT-INCO 0.729 0.728 0.728 0.728
7 Diffuser 0.698 0.712 0.703 0.705
8 Afrilan 0.647 0.659 0.661 0.658

Leaderboard 3 (all submissions):

Rank Team Name Quadratic Weighted Kappa Weighted Precision Weighted Recall Weighted F1-Score
1 IWM-DKM 0.796 0.781 0.782 0.78
2 IWM-DKM 0.79 0.772 0.77 0.771
3 WSE Research 0.79 0.773 0.774 0.773
4 WSE Research 0.79 0.769 0.768 0.768
5 WSE Research 0.788 0.773 0.773 0.773
6 WSE Research 0.784 0.765 0.765 0.765
7 WSE Research 0.781 0.765 0.765 0.765
8 IWM-DKM 0.781 0.779 0.767 0.77
9 IWM-DKM 0.78 0.775 0.768 0.77
10 IWM-DKM 0.779 0.771 0.763 0.765
11 SDPA 0.776 0.775 0.763 0.766
12 SDPA 0.758 0.746 0.749 0.744
13 SDPA 0.757 0.767 0.75 0.753
14 ASLAN 0.757 0.758 0.743 0.746
15 SDPA 0.757 0.77 0.752 0.755
16 SDPA 0.756 0.76 0.755 0.756
17 ASLAN 0.754 0.757 0.741 0.744
18 AMATI 0.749 0.76 0.736 0.739
19 ASLAN 0.747 0.75 0.737 0.74
20 ASLAN 0.744 0.744 0.728 0.731
21 ASLAN 0.739 0.74 0.725 0.728
22 RETUYT-INCO 0.729 0.728 0.728 0.728
23 AMATI 0.721 0.755 0.711 0.715
24 Diffuser 0.698 0.712 0.703 0.705
25 RETUYT-INCO 0.695 0.701 0.702 0.702
26 Afrilan 0.647 0.659 0.661 0.658
27 AMATI 0.614 0.662 0.655 0.65
28 AMATI 0.567 0.645 0.633 0.629
29 AMATI 0.491 0.596 0.596 0.595
 
Leaderboard 4 (best submission per team):

Rank Team Name Quadratic Weighted Kappa Weighted Precision Weighted Recall Weighted F1-Score
1 IWM-DKM 0.681 0.68 0.664 0.669
2 WSE Research 0.672 0.67 0.663 0.665
3 SDPA 0.644 0.634 0.633 0.633
4 ASLAN 0.579 0.653 0.587 0.593
5 Afrilan 0.523 0.607 0.608 0.607
6 AMATI 0.394 0.525 0.525 0.520

Leaderboard 4 (all submissions):

Rank Team Name Quadratic Weighted Kappa Weighted Precision Weighted Recall Weighted F1-Score
1 IWM-DKM 0.681 0.68 0.664 0.669
2 WSE Research 0.672 0.67 0.663 0.665
3 WSE Research 0.653 0.664 0.662 0.663
4 SDPA 0.644 0.634 0.633 0.633
5 SDPA 0.636 0.65 0.627 0.633
6 IWM-DKM 0.635 0.657 0.64 0.644
7 SDPA 0.629 0.644 0.625 0.63
8 SDPA 0.627 0.646 0.617 0.624
9 WSE Research 0.625 0.65 0.65 0.65
10 WSE Research 0.625 0.655 0.647 0.648
11 WSE Research 0.621 0.649 0.65 0.649
12 SDPA 0.621 0.63 0.628 0.628
13 IWM-DKM 0.591 0.727 0.601 0.6
14 ASLAN 0.579 0.653 0.587 0.593
15 ASLAN 0.542 0.701 0.576 0.57
16 IWM-DKM 0.541 0.738 0.576 0.563
17 ASLAN 0.539 0.676 0.571 0.567
18 ASLAN 0.525 0.643 0.566 0.567
19 Afrilan 0.523 0.607 0.608 0.607
20 ASLAN 0.517 0.609 0.562 0.568
21 AMATI 0.394 0.525 0.525 0.520

References

  1. Bai, X., & Stede, M. (2023). A survey of current machine learning approaches to student free-text evaluation for intelligent tutoring. International Journal of Artificial Intelligence in Education, 33(4), 992-1030.
  2. Bexte, M., Horbach, A., & Zesch, T. (2023, July). Similarity-based content scoring: A more classroom-suitable alternative to instance-based scoring? In Findings of the Association for Computational Linguistics: ACL 2023 (pp. 1892-1903).
  3. Bexte, M., Horbach, A., & Zesch, T. (2022, July). Similarity-based content scoring: How to make S-BERT keep up with BERT. In Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022) (pp. 118-123).
  4. Bexte, M., Horbach, A., & Zesch, T. (2024). Strengths and weaknesses of automated scoring of free-text student answers. Informatik Spektrum, 47(3), 78-86.
  5. Burrows, S., Gurevych, I., & Stein, B. (2015). The eras and trends of automatic short answer grading. International Journal of Artificial Intelligence in Education, 25(1), 60-117.
  6. Condor, A., Pardos, Z., & Linn, M. (2022, July). Representing scoring rubrics as graphs for automatic short answer grading. In International Conference on Artificial Intelligence in Education (pp. 354-365). Cham: Springer International Publishing.
  7. Dzikovska, M. O., Nielsen, R., Brew, C., Leacock, C., Giampiccolo, D., Bentivogli, L., … & Dang, H. T. (2013, June). SemEval-2013 Task 7: The joint student response analysis and 8th recognizing textual entailment challenge. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013) (pp. 263-274).
  8. Filighera, A., Parihar, S., Steuer, T., Meuser, T., & Ochs, S. (2022, May). Your answer is incorrect… Would you like to know why? Introducing a bilingual short answer feedback dataset. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 8577-8591).
  9. Frohn, S., Burleigh, T., & Chen, J. (2025, July). Automated Scoring of Short Answer Questions with Large Language Models: Impacts of Model, Item, and Rubric Design. In International Conference on Artificial Intelligence in Education (pp. 44-51). Cham: Springer Nature Switzerland.
  10. Kubsch, M., Czinczel, B., Lossjew, J., Wyrwich, T., Bednorz, D., Bernholt, S., … & Rummel, N. (2022, August). Toward learning progression analytics—Developing learning environments for the automated analysis of learning using evidence centered design. In Frontiers in Education (Vol. 7, p. 981910). Frontiers Media SA.
  11. Li, Z., Lloyd, S., Beckman, M., & Passonneau, R. J. (2023, December). Answer-state recurrent relational network (AsRRN) for constructed response assessment and feedback grouping. In Findings of the Association for Computational Linguistics: EMNLP 2023 (pp. 3879-3891).
  12. Panadero, E., & Jonsson, A. (2013). The use of scoring rubrics for formative assessment purposes revisited: A review. Educational Research Review, 9, 129-144.
  13. Reddy, Y. M., & Andrade, H. (2010). A review of rubric use in higher education. Assessment & Evaluation in Higher Education, 35(4), 435-448.
  14. Wang, T., Inoue, N., Ouchi, H., Mizumoto, T., & Inui, K. (2019, November). Inject rubrics into short answer grading system. In Proceedings of the 2nd workshop on deep learning approaches for low-resource NLP (DeepLo 2019) (pp. 175-182).
  15. Wei, Y., Pearl, D., Beckman, M., & Passonneau, R. J. (2025). Concept-based Rubrics Improve LLM Formative Assessment and Data Synthesis. arXiv preprint arXiv:2504.03877.