Thumbs down for Generative AI text detection tools

Ever since ChatGPT was released at the end of 2022, we have seen a race between students using Generative AI tools (such as ChatGPT) in inadmissible ways and lecturers trying to identify text generated by AI. Tools soon appeared that claimed to be able to identify text produced through such malpractice. Even OpenAI, the organisation that developed ChatGPT, released a tool (AI Text Classifier) that claimed to be able to distinguish between AI-generated and human-generated text – a tool that OpenAI has since discontinued because of inaccuracies.

Some of these tools can indeed distinguish between AI-generated text and human-generated text to some extent. The problem is that many of them produce ‘false positives’ (human-written text flagged as AI-generated) or ‘false negatives’ (AI-generated text that passes as human-written), potentially leading to reputational damage to students, lecturers and their institutions.

A recent article by Prof. Debora Weber-Wulff (University of Applied Sciences HTW Berlin) and others, ‘Testing of Detection Tools for AI-Generated Text’, reports on an extensive test of 14 publicly available detection tools for AI-generated text (two of them being commercially available). The article is still undergoing peer review, but is available on a preprint server at https://arxiv.org/abs/2306.15666. Weber-Wulff is a leading and highly respected researcher in the area of text-matching software, such as plagiarism checkers.

The authors report: “This paper exposes serious limitations of the state-of-the-art AI-generated text detection tools and their unsuitability for use as evidence of academic misconduct. Our findings do not confirm the claims presented by the systems. … Therefore, our conclusion is that the systems we tested should not be used in academic settings… Our findings strongly suggest that the ‘easy solution’ for detection of AI-generated text does not (and maybe even could not) exist”.

In the case of plagiarism identification tools, a ‘positive’ can still be verified manually: the flagged passage can be compared with the source text the tool identified, and that source serves as evidence. In the case of AI-generated text (such as text produced with ChatGPT), there is no source text to compare against, and therefore no such evidence.

Weber-Wulff et al. appeal to educators to rethink academic assessment strategies, and the learning and teaching processes leading up to assessment, instead of relying on detection tools.