Publications

Scoring with Confidence? – Exploring High-confidence Scoring for Saving Manual Grading Effort

Bexte, Marie and Horbach, Andrea and Schützler, Lena and Christ, Oliver and Zesch, Torsten
Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024), 2024

Abstract

A possible way to save manual grading effort in short answer scoring is to automatically score answers for which the classifier is highly confident. We explore the feasibility of this approach in a high-stakes exam setting, evaluating three different similarity-based scoring methods, where the similarity score is a direct proxy for model confidence. The decision on an appropriate level of confidence should ideally be made before scoring a new prompt. We thus probe to what extent confidence thresholds are consistent across different datasets and prompts. We find that high-confidence thresholds vary on a prompt-to-prompt basis, and that the overall potential of increased performance at a reasonable cost of additional manual effort is limited.

Similarity-Based Content Scoring - A more Classroom-Suitable Alternative to Instance-Based Scoring?

Bexte, Marie and Horbach, Andrea and Zesch, Torsten
Findings of the Association for Computational Linguistics: ACL 2023, 2023

Abstract

Automatically scoring student answers is an important task that is usually solved using instance-based supervised learning. Recently, similarity-based scoring has been proposed as an alternative approach yielding similar performance. It has hypothetical advantages such as a lower need for annotated training data and better zero-shot performance, both of which are properties that would be highly beneficial when applying content scoring in a realistic classroom setting. In this paper we take a closer look at these alleged advantages by comparing different instance-based and similarity-based methods on multiple data sets in a number of learning curve experiments. We find that both the demand on data and cross-prompt performance is similar, thus not confirming the former two suggested advantages. The by default more straightforward possibility to give feedback based on a similarity-based approach may thus tip the scales in favor of it, although future work is needed to explore this advantage in practice.

To Score or Not to Score: Factors Influencing Performance and Feasibility of Automatic Content Scoring of Text Responses

Zesch, Torsten and Horbach, Andrea and Zehner, Fabian
Educational Measurement: Issues and Practice, 2023

Abstract

In this article, we systematize the factors influencing performance and feasibility of automatic content scoring methods for short text responses. We argue that performance (i.e., how well an automatic system agrees with human judgments) mainly depends on the linguistic variance seen in the responses and that this variance is indirectly influenced by other factors such as target population or input modality. Extending previous work, we distinguish conceptual, realization, and nonconformity variance, which are differentially impacted by the various factors. While conceptual variance relates to different concepts embedded in the text responses, realization variance refers to their diverse manifestation through natural language. Nonconformity variance is added by aberrant response behavior. Furthermore, besides its performance, the feasibility of using an automatic scoring system depends on external factors, such as ethical or computational constraints, which influence whether a system with a given performance is accepted by stakeholders. Our work provides (i) a framework for assessment practitioners to decide a priori whether automatic content scoring can be successfully applied in a given setup as well as (ii) new empirical findings and the integration of empirical findings from the literature on factors that influence automatic systems' performance.

Similarity-Based Content Scoring - How to Make S-BERT Keep Up With BERT

Bexte, Marie and Horbach, Andrea and Zesch, Torsten
Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022), 2022

Abstract

The dominating paradigm for content scoring is to learn an instance-based model, i.e. to use lexical features derived from the learner answers themselves. An alternative approach that receives much less attention is however to learn a similarity-based model. We introduce an architecture that efficiently learns a similarity model and find that results on the standard ASAP dataset are on par with a BERT-based classification approach.

LeSpell - A Multi-Lingual Benchmark Corpus of Spelling Errors to Develop Spellchecking Methods for Learner Language

Bexte, Marie and Laarmann-Quante, Ronja and Horbach, Andrea and Zesch, Torsten
Proceedings of the Thirteenth Language Resources and Evaluation Conference, 2022

Abstract

Spellchecking text written by language learners is especially challenging because errors made by learners differ both quantitatively and qualitatively from errors made by already proficient learners. We introduce LeSpell, a multi-lingual (English, German, Italian, and Czech) evaluation data set of spelling mistakes in context that we compiled from seven underlying learner corpora. Our experiments show that existing spellcheckers do not work well with learner data. Thus, we introduce a highly customizable spellchecking component for the DKPro architecture, which improves performance in many settings.

Evaluating Automatic Spelling Correction Tools on German Primary School Children's Misspellings

Laarmann-Quante, Ronja and Prepens, Lisa and Zesch, Torsten
Proceedings of the 11th Workshop on NLP for Computer Assisted Language Learning, 2022

'Meet me at the ribary' – Acceptability of spelling variants in free-text answers to listening comprehension prompts

Laarmann-Quante, Ronja and Schwarz, Leska and Horbach, Andrea and Zesch, Torsten
Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022), 2022

Abstract

When listening comprehension is tested as a free-text production task, a challenge for scoring the answers is the resulting wide range of spelling variants. When judging whether a variant is acceptable or not, human raters perform a complex holistic decision. In this paper, we present a corpus study in which we analyze human acceptability decisions in a high stakes test for German. We show that for human experts, spelling variants are harder to score consistently than other answer variants. Furthermore, we examine how the decision can be operationalized using features that could be applied by an automatic scoring system. We show that simple measures like edit distance and phonetic similarity between a given answer and the target answer can model the human acceptability decisions with the same inter-annotator agreement as humans, and discuss implications of the remaining inconsistencies.

Implicit Phenomena in Short-answer Scoring Data

Bexte, Marie and Horbach, Andrea and Zesch, Torsten
Proceedings of the 1st Workshop on Understanding Implicit and Underspecified Language, 2021

Chinese Content Scoring: Open-Access Datasets and Features on Different Segmentation Levels

Ding, Yuning and Horbach, Andrea and Zesch, Torsten
Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, 2020

Abstract

In this paper, we analyse the challenges of Chinese content scoring in comparison to English. As a review of prior work for Chinese content scoring shows a lack of open-access data in the field, we present two short-answer data sets for Chinese. The Chinese Educational Short Answers data set (CESA) contains 1800 student answers for five science-related questions. As a second data set, we collected ASAP-ZH with 942 answers by re-using three existing prompts from the ASAP data set. We adapt a state-of-the-art content scoring system for Chinese and evaluate it in several settings on these data sets. Results show that features on lower segmentation levels such as character n-grams tend to have better performance than features on token level.

Don't take "nswvtnvakgxpm" for an answer – The surprising vulnerability of automatic content scoring systems to adversarial input

Ding, Yuning and Riordan, Brian and Horbach, Andrea and Cahill, Aoife and Zesch, Torsten
Proceedings of the 28th International Conference on Computational Linguistics, 2020

Abstract

Automatic content scoring systems are widely used on short answer tasks to save human effort. However, the use of these systems can invite cheating strategies, such as students writing irrelevant answers in the hopes of gaining at least partial credit. We generate adversarial answers for benchmark content scoring datasets based on different methods of increasing sophistication and show that even simple methods lead to a surprising decrease in content scoring performance. As an extreme example, up to 60% of adversarial answers generated from random shuffling of words in real answers are accepted by a state-of-the-art scoring system. In addition to analyzing the vulnerabilities of content scoring systems, we examine countermeasures such as adversarial training and show that these measures improve system robustness against adversarial answers considerably but do not suffice to completely solve the problem.

The Influence of Variance in Learner Answers on Automatic Content Scoring

Horbach, Andrea and Zesch, Torsten
Frontiers in Education, 2019

Abstract

Automatic content scoring is an important application in the area of automatic educational assessment. Short texts written by learners are scored based on their content while spelling and grammar mistakes are usually ignored. The difficulty of automatically scoring such texts varies with the variance within the learner answers. In this paper, we first discuss factors that influence variance in learner answers, so that practitioners can better estimate if automatic scoring might be applicable to their usage scenario. We then compare the two main paradigms in content scoring: (i) similarity-based and (ii) instance-based methods, and discuss how well they can deal with each of the variance-inducing factors described before.

Cross-Lingual Content Scoring

Horbach, Andrea and Stennmanns, Sebastian and Zesch, Torsten
Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications, 2018

Abstract

We investigate the feasibility of cross-lingual content scoring, a scenario where training and test data in an automatic scoring task are from two different languages. Cross-lingual scoring can contribute to educational equality by allowing answers in multiple languages. Training a model in one language and applying it to another language might also help to overcome data sparsity issues by re-using trained models from other languages. As there is no suitable dataset available for this new task, we create a comparable bi-lingual corpus by extending the English ASAP dataset with German answers. Our experiments with cross-lingual scoring based on machine-translating either training or test data show a considerable drop in scoring quality.

ESCRITO - An NLP-Enhanced Educational Scoring Toolkit

Zesch, Torsten and Horbach, Andrea
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), 2018

Investigating neural architectures for short answer scoring

Riordan, Brian and Horbach, Andrea and Cahill, Aoife and Zesch, Torsten and Lee, Chong Min
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications, 2017

Abstract

Neural approaches to automated essay scoring have recently shown state-of-the-art performance. The automated essay scoring task typically involves a broad notion of writing quality that encompasses content, grammar, organization, and conventions. This differs from the short answer content scoring task, which focuses on content accuracy. The inputs to neural essay scoring models – ngrams and embeddings – are arguably well-suited to evaluate content in short answer scoring tasks. We investigate how several basic neural approaches similar to those used for automated essay scoring perform on short answer scoring. We show that neural architectures can outperform a strong non-neural baseline, but performance and optimal parameter settings vary across the more diverse types of prompts typical of short answer scoring.

The Influence of Spelling Errors on Content Scoring Performance

Horbach, Andrea and Ding, Yuning and Zesch, Torsten
Proceedings of the 4th Workshop on Natural Language Processing Techniques for Educational Applications (NLPTEA 2017), 2017

Abstract

Spelling errors occur frequently in educational settings, but their influence on automatic scoring is largely unknown. We therefore investigate the influence of spelling errors on content scoring performance using the example of the ASAP corpus. We conduct an annotation study on the nature of spelling errors in the ASAP dataset and utilize these findings in machine learning experiments that measure the influence of spelling errors on automatic content scoring. Our main finding is that scoring methods using both token and character n-gram features are robust against spelling errors up to the error frequency in ASAP.

Investigating Active Learning for Short-Answer Scoring

Horbach, Andrea and Palmer, Alexis
Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications, 2016

Reducing Annotation Efforts in Supervised Short Answer Scoring

Zesch, Torsten and Heilman, Michael and Cahill, Aoife
Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications, 2015

Finding a Tradeoff between Accuracy and Rater's Workload in Grading Clustered Short Answers

Horbach, Andrea and Palmer, Alexis and Wolska, Magdalena
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), 2014

Abstract

In this paper we investigate the potential of answer clustering for semi-automatic scoring of short answer questions for German as a foreign language. We use surface features like word and character n-grams to cluster answers to listening comprehension exercises per question and simulate having human graders only label one answer per cluster and then propagating this label to all other members of the cluster. We investigate various ways to select this single item to be labeled and find that choosing the item closest to the centroid of a cluster leads to improved (simulated) grading accuracy over random item selection. Averaged over all questions, we can reduce a teacher’s workload to labeling only 40% of all different answers for a question, while still maintaining a grading accuracy of more than 85%.