In the BEA 2025 Shared Task on Pedagogical Ability Assessment of AI-Powered Tutors, the goal was to evaluate how well LLM-based math tutors support students. The task focused on four aspects of feedback:
spotting mistakes
identifying where the mistake happens
giving guidance
providing actionable suggestions
For our submission, we built on FLAN-T5 models with a multi-step training pipeline. In addition to standard fine-tuning, we used model merging (DARE-TIES) to combine what the models learned across all four labels, which gave clear improvements over plain fine-tuning.
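To make the merging step more concrete, here is a minimal sketch of the DARE-TIES idea on toy weight vectors. It is an illustration under assumptions, not our actual pipeline: the function name `dare_ties_merge`, the parameters `drop_prob`, `scale`, and `seed`, and the use of flat Python lists in place of FLAN-T5 checkpoints are all hypothetical. It follows the common description of the method: take each fine-tuned model's difference from the base model, randomly drop and rescale entries (DARE), then keep only the updates whose sign agrees with the elected sign (TIES).

```python
import random


def dare_ties_merge(base, finetuned, drop_prob=0.5, scale=1.0, seed=0):
    """Merge fine-tuned weight vectors (flat lists of floats) into one.

    Hypothetical helper for illustration; real checkpoints would be
    merged tensor by tensor.
    """
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    rng = random.Random(seed)

    # Task vectors: what each fine-tuned model changed relative to the base.
    deltas = [[w - b for w, b in zip(ft, base)] for ft in finetuned]

    # DARE: drop each entry with probability drop_prob and rescale the
    # survivors so the expected update stays the same.
    rescale = 1.0 / (1.0 - drop_prob)
    deltas = [
        [d * rescale if rng.random() >= drop_prob else 0.0 for d in delta]
        for delta in deltas
    ]

    merged = []
    for i, b in enumerate(base):
        column = [delta[i] for delta in deltas]
        total = sum(column)
        # TIES sign election: the sign of the summed updates wins.
        if total != 0:
            agreeing = [d for d in column if d != 0.0 and (d > 0) == (total > 0)]
        else:
            agreeing = []
        # Disjoint merge: average only the updates that agree with that sign.
        update = sum(agreeing) / len(agreeing) if agreeing else 0.0
        merged.append(b + scale * update)
    return merged


# Toy example with dropping disabled so the result is deterministic.
base = [0.0, 0.0, 0.0]
model_a = [1.0, -1.0, 2.0]
model_b = [3.0, 1.0, -1.0]
print(dare_ties_merge(base, [model_a, model_b], drop_prob=0.0))
# → [2.0, 0.0, 2.0]
```

In the toy example, the first weight is averaged because both models agree on its sign, the second is left at the base value because the updates cancel, and the third keeps only the update that matches the elected sign.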
Across the four tracks, our models achieved F1 scores between 52% and 69% and accuracies between 62% and 87%, placing 11th, 8th, 11th, and 9th.
Link to the paper: Link
