In the BEA 2025 Shared Task on Pedagogical Ability Assessment of AI-Powered Tutors, the goal was to evaluate how well LLM-based math tutors support students. The task focused on four aspects of feedback:

spotting mistakes

identifying where the mistake happens

giving guidance

providing actionable suggestions

For our submission, we built on FLAN-T5 models with a multi-step training pipeline. In addition to standard fine-tuning, we used model merging (DARE-TIES) to leverage information across all four labels, which clearly improved results over plain fine-tuning.
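To give a rough idea of what DARE-TIES merging does, here is a minimal sketch on toy parameter lists. It is an illustration only, not our actual pipeline: the function names (`dare_sparsify`, `ties_merge`, `dare_ties`), the drop probability, and the merge weight are all assumptions made for this example, and real merges operate on full model tensors rather than short Python lists.

```python
import random


def dare_sparsify(delta, drop_prob, rng):
    """DARE step: randomly drop entries of a task vector and rescale the survivors."""
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    scale = 1.0 / (1.0 - drop_prob)  # rescaling keeps the expected value unchanged
    return [d * scale if rng.random() >= drop_prob else 0.0 for d in delta]


def ties_merge(deltas):
    """TIES step: elect a sign per parameter, then average the entries that agree with it."""
    merged = []
    for values in zip(*deltas):
        sign = 1.0 if sum(values) >= 0 else -1.0  # elected sign for this parameter
        agreeing = [v for v in values if v * sign > 0]  # zeros and conflicts are skipped
        merged.append(sum(agreeing) / len(agreeing) if agreeing else 0.0)
    return merged


def dare_ties(base, finetuned, drop_prob=0.5, weight=1.0, seed=0):
    """Merge several fine-tuned parameter lists into one (hypothetical toy helper)."""
    rng = random.Random(seed)
    # Task vectors: what each fine-tuned model changed relative to the base model.
    deltas = [[f - b for f, b in zip(model, base)] for model in finetuned]
    sparse = [dare_sparsify(d, drop_prob, rng) for d in deltas]
    merged_delta = ties_merge(sparse)
    return [b + weight * m for b, m in zip(base, merged_delta)]


if __name__ == "__main__":
    base = [0.0, 0.0]
    # Two toy "fine-tuned models"; with drop_prob=0.0 nothing is dropped.
    print(dare_ties(base, [[1.0, -1.0], [3.0, 1.0]], drop_prob=0.0))
```

In this toy setting, each fine-tuned list stands in for a model trained on one of the four labels. The sign election is what lets the merged model keep updates the label-specific models agree on while discarding conflicting ones.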

Our models achieved F1 scores between 52 and 69 and accuracies between 62% and 87%, ranking 11th, 8th, 11th, and 9th across the four tracks.

Link to the paper: Link