'Scarlet Cloak and the Forest Adventure': A Preliminary Study of the Impact of AI on Commonly Used Writing Tools

Bibliographic Details
Title: 'Scarlet Cloak and the Forest Adventure': A Preliminary Study of the Impact of AI on Commonly Used Writing Tools
Language: English
Authors: Barbara Bordalejo, Davide Pafumi (ORCID 0000-0002-1113-187X), Frank Onuh, A. K. M. Iftekhar Khalid, Morgan Slayde Pearce, Daniel Paul O'Donnell
Source: International Journal of Educational Technology in Higher Education. 2025 22.
Availability: BioMed Central, Ltd. Available from: Springer Nature. 233 Spring Street, New York, NY 10013. Tel: 800-777-4643; Tel: 212-460-1500; Fax: 212-348-4505; e-mail: customerservice@springernature.com; Web site: https://www.springer.com/gp/biomedical-sciences
Peer Reviewed: Y
Page Count: 25
Publication Date: 2025
Document Type: Journal Articles
Reports - Research
Descriptors: Artificial Intelligence, Writing (Composition), Quality Control, Writing Evaluation, Writing Improvement, Writing Strategies, Technology Uses in Education, Readability, Identification
DOI: 10.1186/s41239-025-00505-5
ISSN: 2365-9440
Abstract: This paper explores the growing complexity of detecting and differentiating generative AI from other AI interventions. Initially prompted by noticing how tools like Grammarly were being flagged by AI detection software, it examines how popular tools such as Grammarly, EditPad, and Writefull, and AI models such as ChatGPT and Microsoft Bing Copilot, affect human-generated texts and how accurately current AI-detection systems, including Turnitin and GPTZero, can assess texts for use of these tools. The results highlight that widely used writing aids, even those not primarily generative, can trigger false positives in AI detection tools. To provide a dataset, the authors applied different AI-enhanced tools to a number of texts of different styles that were written prior to the development of consumer AI tools, and evaluated their impact through key metrics such as readability, perplexity, and burstiness. The findings reveal that tools like Grammarly that subtly enhance readability also trigger detection and increase false positives, especially for non-native speakers. In general, paraphrasing tools score low values in AI detection software, allowing the changes to go mostly unnoticed by the software. However, texts processed with Microsoft Bing Copilot and Writefull were able to evade AI detection fairly consistently. Compounding this problem, traditional AI detectors like Turnitin and GPTZero struggle to reliably differentiate between legitimate paraphrasing and AI generation, undermining their utility for enforcing academic integrity. The study concludes by urging educators to focus on managing interactions with AI in academic settings rather than outright banning its use. It calls for the creation of policies and guidelines that acknowledge the evolving role of AI in writing, emphasizing the need to interpret detection scores cautiously to avoid penalizing students unfairly. 
In addition, encouraging openness on how AI is used in writing could alleviate concerns in the research and writing process for both students and academics. The paper recommends a shift toward teaching responsible AI usage rather than pursuing rigid bans or relying on detection metrics that may not accurately capture misconduct.
Abstractor: As Provided
Entry Date: 2025
Accession Number: EJ1460975
Database: ERIC
FullText Links:
  – Type: pdflink
    Url: https://content.ebscohost.com/cds/retrieve?content=AQICAHj0k_4E0hTGH8RJwT4gCJyBsGNe_WN95AvKlDbXJGqwxwFIhT3QRZnbDWJB2HziAcHHAAAA4jCB3wYJKoZIhvcNAQcGoIHRMIHOAgEAMIHIBgkqhkiG9w0BBwEwHgYJYIZIAWUDBAEuMBEEDEliN_FbKs1r-lNpIwIBEICBmrAKEXOHrTAEg1W2UuhjVvkeheTdDN2o_JBt9PT1iUgXXLmgQcL0K9hHnAcf__FONwgD6z7_rcSW85jGIkkCfzYn00I-H1reOiNP9sfWfILJ3XuDJYtR40FpZ1ehEz9MN9bp9HU6StEunMfS-duBjOQkkuVLPUIeqiumErtzbo4-_DchzPfgF9ddn5X2dgSzbSwyNYGuLH7sEEo=
Text:
  Availability: 1
  Value: <anid>AN0182828918;[jq69]07feb.25;2025Feb10.01:49;v2.2.500</anid> <title id="AN0182828918-1">"Scarlet Cloak and the Forest Adventure": a preliminary study of the impact of AI on commonly used writing tools </title> <sbt id="AN0182828918-2">Introduction</sbt> <p>Artificial intelligence (AI) has become a powerful tool for the development of various applications, including writing assistants, i.e., software programs that help writers with tasks such as spelling correction, grammar checking, readability analysis, feedback generation, and/or even plagiarism detection (Giray, [<reflink idref="bib14" id="ref1">14</reflink>]; Nguyen et al., [<reflink idref="bib23" id="ref2">23</reflink>]; O'Neill & Russell, [<reflink idref="bib25" id="ref3">25</reflink>]). These tools are used by writers to enhance the clarity and quality of their writing, as well as support the development of writing skills. They are often used by non-native speakers of English to help identify typically problematic areas such as in the idiomatic use of prepositions or articles. 
Advancements in machine learning (ML) and Natural Language Processing (NLP), and the advent of consumer-friendly Large Language Model (LLM) chatbots such as ChatGPT, Copilot, Bard, Llama and Gemini, however, have scaled the capabilities of these tools greatly, from performing mere "corrective" tasks such as identifying misspellings or potential stylistic concerns to generating complete, coherent, and humanlike texts with just a few simple prompts.</p> <p>While these AI-powered writing tools offer great benefits, they have also introduced new challenges in academic settings, particularly in the assessment of student work (Adams & Chuah, [<reflink idref="bib4" id="ref4">4</reflink>]; Baidoo-Anu & Ansah, [<reflink idref="bib5" id="ref5">5</reflink>]; Debora et al., [<reflink idref="bib11" id="ref6">11</reflink>]; Giray, [<reflink idref="bib14" id="ref7">14</reflink>]; Lambert & Stevens, [<reflink idref="bib20" id="ref8">20</reflink>]; Perkins et al., [<reflink idref="bib31" id="ref9">31</reflink>]). Some challenges for educators include distinguishing human and AI-generated text, false positives in AI detection, and misleading detection metrics. The capabilities of LLMs have made it increasingly difficult for educators to distinguish between human-written content and AI-generated/assisted texts (Ostertag, [<reflink idref="bib27" id="ref10">27</reflink>]). Gary Ostertag ([<reflink idref="bib27" id="ref11">27</reflink>]) contested Porsdam Mann and colleagues' argument ([<reflink idref="bib32" id="ref12">32</reflink>]) that personalized LLMs can create high-quality academic writing. 
Ostertag grants that the model AUTOGEN, trained on a standard dataset and then fine-tuned, can perform better than the GPT-3 base model, produce text that is as readable and coherent as expert-written work, improve academic writing, and help generate ideas (Porsdam Mann et al., [<reflink idref="bib32" id="ref13">32</reflink>]), but argues that this does not equate to the production of high-quality scholarly work (Ostertag, [<reflink idref="bib27" id="ref14">27</reflink>]). The generated text lacks the genuine communicative intentions of a human author, which are essential for academic writing (Ostertag, [<reflink idref="bib27" id="ref15">27</reflink>]). For example, the University of Lethbridge subscribes to ProWritingAid, a tool that offers advanced grammar checking, style suggestions, and readability analysis to help students improve their writing skills. However, the tool's capabilities have expanded beyond these traditional features and now include a function that allows students to input as few as two sentences and receive AI-generated expansions of up to three paragraphs of logical and coherent text (Pereira, [<reflink idref="bib30" id="ref16">30</reflink>]; Porutiu, [<reflink idref="bib33" id="ref17">33</reflink>]). Current AI detection metrics still struggle to reliably assess texts that have been partially processed or enhanced by tools like ProWritingAid, as the resulting quality differs subtly from that generated entirely by mainstream LLMs like ChatGPT or Llama 2 (Elkhatat et al., [<reflink idref="bib13" id="ref18">13</reflink>]).</p> <p>The second challenge is that traditional plagiarism detection software (e.g., Turnitin) and newer AI-content detection tools such as GPTZero attempt to address the complexities introduced by AI writing tools, but their effectiveness remains limited (Tian & Cui, [<reflink idref="bib38" id="ref19">38</reflink>]; Turnitin, [<reflink idref="bib39" id="ref20">39</reflink>]). 
In this context, Nikolic et al.'s paper discusses the impact of ChatGPT on assessment and raises concerns about the authenticity of student submissions and the inherent difficulty in detecting plagiarism. While ChatGPT can pass some subjects and excel in certain assessment types, current assessment practices may need to change and require significant adaptation to withstand these advancements (Ray, [<reflink idref="bib35" id="ref21">35</reflink>]; Nikolic et al., [<reflink idref="bib24" id="ref22">24</reflink>]). Moreover, these plagiarism tools (particularly the application from Turnitin) can often condense multiple analyses to a single score, which can create a misleadingly definitive impression of the impact of AI (or not) on student work. This reductionist approach can obscure the complex interactions between human-generated and AI-assisted texts. Additionally, they also have a problem with false positives (i.e. mistakenly identifying human-generated writing as AI-generated), particularly with writing produced by non-native speakers of English and members of Equity-Deserving Groups (Chechitelli, [<reflink idref="bib8" id="ref23">8</reflink>]; Quidwai et al., [<reflink idref="bib34" id="ref24">34</reflink>]). With the integration of AI into many writing assistance tools, this third challenge has become even more complex: while grammar checkers, which also help with stylistic improvement and summarization, have long been used in academic settings, their increasing reliance on AI-powered content-generators means that even work presented by a student in good faith as their own can trigger an adverse misleading detection score in an AI detector. 
We must rethink assessing students' work and ensuring academic integrity in this evolving landscape because traditional methods may not work well in AI-enhanced writing environments.</p> <p>We demonstrate that AI detection software can have difficulty distinguishing between stylistic interventions from writing tools and wholly generated AI text. This limitation makes the detection tools inadequate for assessing academic misconduct and technological misuse. The paper discusses the implications of these findings for academic integrity and suggests that educators should approach AI detectors with caution. We conclude by offering recommendations to educators on effectively managing interactions with AI in educational settings, rather than attempting to eliminate its use. We also provide guidance on developing assessment policies that address AI's evolving nature and its influence on academic writing.</p> <hd id="AN0182828918-3">The origins of the "Scarlet Cloak" study</hd> <p>The title of this paper, "Scarlet Cloak and the Forest Adventure," originates from paraphrased results generated during our experiments with AI tools. These tests were part of our broader effort to gain insights into how AI engages with well-known cultural texts, including reimagining the title of Charles Perrault's classic "Little Red Riding Hood" ([<reflink idref="bib3" id="ref25">3</reflink>]). In the Fall of 2023, the University of Lethbridge, where the authors of this paper are engaged in teaching and research, convened a group of faculty members, teaching assistants, and library writing tutors to discuss the potential issues and emerging challenges posed by AI-generated writing content in student assignments. The group acknowledged the growing power of easily accessible and consumer-friendly LLMs (especially ChatGPT, developed by OpenAI and released on November 30, 2022; Gorichanaz, [<reflink idref="bib15" id="ref26">15</reflink>]). 
In this context, many instructors found that they did not have a complete understanding of either the systems used by the students or the metrics provided by AI detectors—a situation which could lead to overly aggressive and pedagogically counter-productive responses concerning their use in an institutional context that, at the time, suggested that the use of AI in writing assignments was synonymous with academic misconduct (Chan, [<reflink idref="bib7" id="ref27">7</reflink>]; Zhang et al., [<reflink idref="bib42" id="ref28">42</reflink>]). The situation was complicated by the lack of a comprehensive policy—at the University of Lethbridge and in many other institutions—dealing with writing tools such as Grammarly, Writefull, EditPad, QuillBot, and similar software that increasingly use LLMs to help identify and correct grammatical errors, punctuation, paraphrasing, and problems with clarity in human-generated text rather than large-scale text generation.</p> <p>Students and faculty members in the University's Humanities Innovation Lab (HIL) were already engaged with the technologies in their research and teaching and actively assisting instructors with non-technical backgrounds in understanding how such tools (and their detectors) worked (Pafumi et al., [<reflink idref="bib29" id="ref29">29</reflink>]). By this point, the doctoral students began to find that they could identify a base level of what we might call "GPT-Sprache" in student submissions (and researcher submissions at a journal in which they were employed) by its peculiar sentence structure and word choice (including the infamously non-nuanced use of "delve"; see Matsui, [<reflink idref="bib21" id="ref30">21</reflink>]). As the Fall of 2023 progressed, it became clear that some students were experimenting with AI and testing the boundaries of what was allowed in a classroom context. 
We began to study more closely the question of how students were using AI, implementing a policy for checking for the (mis)use of AI ("misuse" defined in this case as presenting AI-generated work as a student's own) in papers identified by Turnitin's internal detector as containing 30% or more AI-generated text in the classes we were teaching.</p> <p>The results of this preliminary work were very interesting. On the basis of our policy, we set aside for further analysis 13 submissions from the two sections of a common first-year course (English 1900, "Introduction to Language and Literature") that appeared to have made significant use of generative AI in their composition. But while some students whose work was flagged by our detectors as having a high percentage of likely AI-generated content admitted to using the standard chatbots, others claimed that they had "not used AI" in writing that our detectors suggested was at least partially AI-generated. Finally, one student disclosed that they had used "rephrasing" software as part of an accommodation provided to them by the university to support their work (an "accommodation" is an official prescription from student support services at the University that requires instructors to allow the use of specific technologies or allowances, such as screen-readers, special software, and extra time for tests and exams, for specific students). This disclosure prompted a series of conversations in which students expressed disbelief that their work could have been flagged by AI detection software and, just as significantly, easily passed informal tests we had been using to gauge student learning, such as defending their written content in an oral exam. 
Several students mentioned that they had used Grammarly (in addition to enlisting human help in the form of proof-reading assistance from their mother), and this made us wonder whether this software or other similar tools (Giray, [<reflink idref="bib14" id="ref31">14</reflink>]) might be skewing the results we had obtained. We then tested a long abstract that had been written within our research group that we knew had been human-generated (Bordalejo et al., [<reflink idref="bib6" id="ref32">6</reflink>]), but on which we had used Grammarly (our research group consists mainly of second-language speakers of English, and most of them use Grammarly for phrase-level assistance).</p> <p>Turnitin gave our abstract a score of 16% for AI generation, suggesting both that at least some of the students were correct in their claims about the origins of their papers and that a focus simply on detecting AI might produce far more false positives than we had suspected. More importantly, we began to see that an indiscriminate approach to identifying (and presumably condemning) the use of AI to generate text in student work might have the pedagogically adverse effect of discouraging students from continuing to use widely accepted practices for improving their (human-generated) writing, such as using grammar- and spelling-checkers and consulting with their parents and friends. 
The rest of this paper reports on our exploration of how such tools impact human-generated writing and how current AI-detection tools recognize them, focusing on two major questions:</p> <p></p> <ulist> <item> The degree to which (and how) widely-used writing tools such as new AI-powered versions of Grammarly and EditPad affect human-generated textual content and</item> <p></p> <item> How (and how reliably) such interventions are assessed by existing AI detection tools, such as Turnitin and GPTZero.</item> </ulist> <p>Although we began working on this paper in 2023, during a period of growing interest and concern about the role of AI in educational contexts, our research and writing have continued since then, allowing us to incorporate new insights and perspectives as the landscape of AI use and detection has evolved. Nonetheless, the present study is subject to several limitations that merit consideration. Most notably, the analysis is confined to a dataset comprising only three texts. The small sample size inherently limits the extent to which broader conclusions can be drawn. The possibility of self-selection bias in the selection of texts also plays a role in the generalizability of the results. In addition to this, the study employs a novel, exploratory methodology that has not yet been validated in prior research, necessitating cautious interpretation of the results. In fact, the potential influence of text genre and stylistic factors, though addressed, underscores the need for future research to incorporate a more extensive and diverse corpus to enhance the robustness and replicability of the findings. 
To mitigate these limitations, subsequent investigations should aim to replicate the methodology and employ larger, more representative datasets to substantiate the recommendations offered in this research.</p> <hd id="AN0182828918-4">Materials</hd> <p>The methodology employed in this study was specifically developed by the authors to address the unique challenges and objectives posed by the research topic, in light of the absence of established or widely-validated methods in the existing literature for investigating these tools. While this approach yielded insightful results, it is important to acknowledge that the procedure has not been previously validated in similar studies. This choice is due to the novel and rapidly evolving nature of the research area, which necessitates exploratory methodologies to keep pace with the fast-moving advancements in the field. Future research should prioritize refining this methodology, or integrating established procedures where applicable, to ensure scholarly consistency and reliability.</p> <hd id="AN0182828918-5">Text selection</hd> <p>Given the increasing ubiquity of LLM-trained tools, including within basic word processing software applications, and the evidence of our own abstract, we decided that the only way to study the first question was to begin with texts written by authors long before the advent of consumer-focused LLM tools and the immediately preceding generation of sophisticated grammar checkers and writing tools. 
Given the small-scale, preliminary nature of the study, we also decided to focus on short texts with a known provenance and written primarily or exclusively by first-language speakers of English:</p> <p></p> <ulist> <item> A multi-author interdisciplinary article focusing on a humanities topic but published in the primarily science-focused journal <emph>Nature</emph> that was written in part by a close collaborator of our lab (Barbrook et al., [<reflink idref="bib1" id="ref33">1</reflink>]);</item> <p></p> <item> An article focusing on the practice of the humanities written by O'Donnell (O'Donnell, [<reflink idref="bib2" id="ref34">2</reflink>]); and</item> <p></p> <item> A folk story "Little Red Riding Hood" known to many cultures, taken from a late-nineteenth century edition (Perrault, [<reflink idref="bib3" id="ref35">3</reflink>]).</item> </ulist> <p>The range of genres and styles here is also deliberate. While testing the impact of writing tools on a full range of genres is beyond the scope of this topic, we felt that it was important given the variability of student writing to include texts written in different ways and registers: from highly structured quasi-scientific writing (Barbrook et al., [<reflink idref="bib1" id="ref36">1</reflink>]), to a more informal but still scholarly register (O'Donnell, [<reflink idref="bib2" id="ref37">2</reflink>]), to a more literary style (Perrault, [<reflink idref="bib3" id="ref38">3</reflink>]). 
While these were all written by professional researchers and authors (in contrast to our student papers), and while none of them represent the well-known format of the "undergraduate essay" (though O'Donnell, [<reflink idref="bib2" id="ref39">2</reflink>] is structurally speaking a professional example of the scholarly essay), we felt that they provided enough of a range to eliminate the possibility that genre and style might be a determining factor in our results.</p> <hd id="AN0182828918-6">AI detectors</hd> <p>In order to understand the impact of common writing tools on these papers, and to assess the reliability of AI detection in its assessment, we needed to understand precisely how the main tools available to us at the time, Turnitin and GPTZero, worked.</p> <hd id="AN0182828918-7">GPTZero</hd> <p>The most transparent of these, GPTZero, uses metrics that are based most heavily on two relatively simple statistical and mathematical measurements: "perplexity" and "burstiness."</p> <p>Perplexity is a score derived from an attempt at reverse engineering the text under consideration in order to assess how likely it is that a given sequence was produced by a known LLM (see Chen et al., [<reflink idref="bib9" id="ref40">9</reflink>]). Bots like ChatGPT use a probability distribution model to determine their word choice. In producing text, the bot chooses its next word based on the values it has assigned to the words that precede it (this is why the beginnings of LLM-produced text often closely follow the original wording of the prompt). The perplexity score quantifies the degree to which the actual word choices in a text fail to match the LLM's predictions. Or, as formulated in Chen et al. ([<reflink idref="bib9" id="ref41">9</reflink>]): for a test set (T) of t words, perplexity (PP) is the inverse of the geometric mean of the probabilities that the model (M) assigns to each word (w<subs>i</subs>) of the test set, given the sequence of all previous words (p<subs>M</subs>(w<subs>i</subs> | w<subs>1</subs> ... w<subs>i–1</subs>)):</p> <p> <ephtml> <math display="block" xmlns="http://www.w3.org/1998/Math/MathML"><mrow><mi>P</mi><msub><mi>P</mi><mi>T</mi></msub><mfenced close=")" open="("><msub><mi>p</mi><mi>M</mi></msub></mfenced><mo>=</mo><mfrac><mn>1</mn><msup><mfenced close=")" open="("><mrow><msubsup><mo>∏</mo><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow><mi>t</mi></msubsup><msub><mi>p</mi><mi>M</mi></msub><mfenced close=")" open="("><mrow><msub><mi>w</mi><mi>i</mi></msub><mrow><mo stretchy="false">|</mo></mrow><msub><mi>w</mi><mn>1</mn></msub><mo>⋯</mo><msub><mi>w</mi><mrow><mi>i</mi><mo>-</mo><mn>1</mn></mrow></msub></mrow></mfenced></mrow></mfenced><mfrac><mn>1</mn><mi>t</mi></mfrac></msup></mfrac></mrow></math> </ephtml> </p> <p>Tian, the developer of GPTZero, considers perplexity to be one of the most reliable ways to assess whether the text is the output of a generative model since it essentially assesses the model's familiarity with the language of the text in question by testing its ability to anticipate on a probabilistic basis the next word in a sequence (Tian, [<reflink idref="bib37" id="ref42">37</reflink>]). The higher the score, the less likely it is that a given sequence was produced by the model under consideration (i.e., the word choice is "surprising" to the LLM); a lower score, on the other hand, indicates that the choice is closer to what the model would predict, suggesting, potentially, that it was produced by a bot using that particular LLM (or a human who writes in a manner consistent with the bot: this is one of the reasons why AI detectors are more likely to flag text produced by second-language speakers, who may be more likely to use stereotypical phraseology and less-fluid sentence structures and formulas in their writing).</p> <p>Burstiness, for its part, scores variation in word choice (Hoonlor et al., [<reflink idref="bib18" id="ref43">18</reflink>], p. 
6): the proportion of the documents containing a word (w) that fall within a given span (t), compared with the share 1/T that would be expected if the word were distributed evenly across all T spans of the corpus, where d<subs>t</subs> denotes a document in span t and d any document in the corpus:</p> <p> <ephtml> <math display="block" xmlns="http://www.w3.org/1998/Math/MathML"><mrow><mi>B</mi><mi>u</mi><mi>r</mi><mi>s</mi><mi>t</mi><mfenced close=")" open="("><mrow><mi>w</mi><mo>,</mo><mi>t</mi></mrow></mfenced><mo>=</mo><mfenced close=")" open="("><mrow><mfrac><mfenced close="|" open="|"><mrow><msub><mi>d</mi><mi>t</mi></msub><mo>:</mo><mi>w</mi><mo>∈</mo><msub><mi>d</mi><mi>t</mi></msub></mrow></mfenced><mfenced close="|" open="|"><mrow><mi>d</mi><mo>:</mo><mi>w</mi><mo>∈</mo><mi>d</mi></mrow></mfenced></mfrac><mo>-</mo><mfrac><mn>1</mn><mi>T</mi></mfrac></mrow></mfenced></mrow></math> </ephtml> </p> <p>A high burstiness means that the word under examination has been used more often during the specific interval than elsewhere in the corpus.</p> <p>This score is significant because of the way in which different word patterns correlate with certain writing behaviours. Human writers tend to have greater burstiness because of the way they use word choice to create sentence accents. They combine longer and shorter sentences in different ways depending on where they are in a text (e.g. topic sentences followed by more complex expository sentences in the main body of a paragraph; shorter and more declarative sentences at the beginning and end of a text rather than in the middle). They tend to cluster certain words, concepts and ideas together, and will often use things such as cognates (e.g. creature and creation) and parallel grammatical structures (the more the merrier) to emphasize these connections. 
AI-generated texts, in contrast, particularly in earlier generations of the current bots, tend to exhibit greater uniformity in these areas: more uniform sentence structures and length, fewer non-literal thematic connections and parallelisms in their word-choice and phrase-level structures (however, burstiness is likely to be higher for native speakers of a given language than it is for those for whom it is a second language, meaning that non-native speakers are once again likely to have a score closer to that of the bot).</p> <p>In addition to these two principal metrics, GPTZero has used four other tests that, while considered less conclusive, nevertheless help establish the probability of AI-generation (Tian & Cui, [<reflink idref="bib38" id="ref44">38</reflink>]):</p> <p></p> <ulist> <item> Readability: how easily a piece of text can be understood by a human recipient. The score rewards sentences made up of short words with few syllables, as these characteristics tend to result in high readability scores.</item> <p></p> <item> Simplicity: a measure of how straightforward and uncomplicated a text is to comprehend. It assesses the percentage of words used in the text that belong to the 100 most common words in the English language. Texts with a higher percentage of these common words are generally considered simpler and easier to understand.</item> <p></p> <item> SAT percentage: the proportion of words in a text that are considered "academic," drawing a comparison to the vocabulary found in the SAT (originally the Scholastic Aptitude Test), a US standardized college admissions exam whose materials have been used as training sets for generative AIs. (This is also the reason why, even when prompted for it, British spelling is often disregarded or blended into a sort of Canadian spelling by LLMs.) 
Generally, a higher percentage of SAT words indicates a more demanding vocabulary within the text (and likelihood of AI generation).</item> <p></p> <item> Average sentence length: provides insights into the variance in writing patterns within a text, reflecting the average length of sentences in the text. In the same vein as burstiness, variability in sentence length is common in human writing, and this metric can help identify deviations from typical patterns. In this sense, low scores are between 0 and 15, medium scores between 15 and 25, and high scores 25 and above.</item> </ulist> <p>These six metrics are scored using three different scales:</p> <p></p> <ulist> <item> For Perplexity, Burstiness, Readability, and Simplicity, the scale is an integer, where low runs from 0 to 35; medium, 35 to 80; and high, 80 to 100 or above;</item> <p></p> <item> SAT percentage is, as the name suggests, a percentage, where between 0 and 2% is considered low (likely not AI-generated), 2–4% medium, and 5%+ high (likely to be AI-Generated);</item> <p></p> <item> Average sentence length is scored using an integer, where a low score is 0–15, medium 15–25, and high 25+.</item> </ulist> <p>Finally, GPTZero provides the user with a figure expressing the overall likelihood that a given text has been generated by AI. 
This probability score is for the document as a whole and is based on the interaction of the different individual metrics.</p> <hd id="AN0182828918-8">Turnitin</hd> <p>In contrast to GPTZero, Turnitin's AI detector provides the user with a single metric (likelihood of AI generation) for both individual passages within the text and the text as a whole.</p> <p>In part, this appears to be a result of the application's history: while GPTZero was designed from the ground up as an AI-detection application, Turnitin was built on an existing application that provided a single "similarity score" between examples of student writing and a vast database of student submissions and other texts, primarily checking for plagiarism (Turnitin, [<reflink idref="bib40" id="ref45">40</reflink>]).</p> <p>One problem with Turnitin's approach is that it provides the user with far less material to work from in assessing the use (or the probability of use) of AI within a given text: whereas GPTZero provides the user with the constituent scores, allowing more subtle analysis, Turnitin, which presumably uses some combination of similar metrics, simply summarizes these results in a single measure of probability. Perhaps as a result, Turnitin acknowledges that its method can produce a significant number of false positives, particularly when the detected likelihood of AI-generated content is below 20% (D'Agostino, [<reflink idref="bib10" id="ref46">10</reflink>]; Kuykendall, [<reflink idref="bib19" id="ref47">19</reflink>]; Smith, [<reflink idref="bib36" id="ref48">36</reflink>]).</p> <hd id="AN0182828918-9">Tool selection</hd> <p>Finally, we had to decide which tools we would use to process the human-written texts. Based on our discussions with students, we divided these into two major groups: general AI tools (ChatGPT and Microsoft Bing Copilot) and specialist writing tools such as Grammarly, EditPad, Writefull, etc. 
Within this second group we made a distinction between non-generative (those that do not [currently] use generative AI) and hybrid (those that make use of generative AI for some or all applications).</p> <hd id="AN0182828918-10">General AI tools</hd> <p>For general AI tools, we focused particularly on ChatGPT, GPT 3.5 [gpt-3.5-turbo-0613 (OpenAI, [<reflink idref="bib26" id="ref49">26</reflink>])] and GPT 4 [gpt-4-0613 (OpenAI, [<reflink idref="bib26" id="ref50">26</reflink>])], and Microsoft Bing Copilot [Prometheus 4.0 (Microsoft, [<reflink idref="bib22" id="ref51">22</reflink>])]. The latter is a digital assistant tool integrated into the Bing search engine and, if enabled, capable of directly accessing web pages to perform several tasks, from understanding context to summarization. Copilot is also notable because it solves what was at the time one of the most evident problems affecting LLMs: the chronological limitations of a temporally-limited training set. A principal difference between ChatGPT and Copilot at the time, in fact, lay in the latter's capability to perform automatic web-based research. Overall, all of these models were chosen because they were widely used at the time and because they can be prompted to compose or improve entire texts. Additionally, Copilot—unlike ChatGPT, Llama 2, and GPT-derived models—did not appear to be listed among the models GPTZero claims to be able to detect (while Copilot is based on ChatGPT, its use of updated web results meant that its output was different).</p> <hd id="AN0182828918-11">Specialist writing aids</hd> <p>The tools we selected for this group were EditPad, Grammarly, and Writefull.</p> <p></p> <ulist> <item> EditPad is a free-to-access multilingual text editor equipped with many functions: a plagiarism checker, a paraphraser, a grammar checker, and a text summarizer (EditPad, [<reflink idref="bib12" id="ref52">12</reflink>]). 
It is able to paraphrase texts in three modes: smooth, "reworder," and formal, all of which intervene in the text at different levels to modify its overall tone—from informal to formal, going through substantial paraphrasing.</item> <p></p> <item> Grammarly is a non-specialized writing tool that provides real-time grammar, punctuation, and style checks, in response to which it suggests corrections to enhance clarity, correctness, and readability across various types of text (Grammarly, [<reflink idref="bib16" id="ref53">16</reflink>], [<reflink idref="bib17" id="ref54">17</reflink>]). While this tool long predates the rise of consumer-friendly AI, a more general AI has been integrated into recent iterations, complementing the grammar suggestion with the possibility of rewriting thanks to a full prompt interface.</item> <p></p> <item> Writefull is an automatic proofreading software specifically designed to support academic writing (Writefull, [<reflink idref="bib41" id="ref55">41</reflink>]). The software is compatible with many word processors and can be installed as a plug-in or used online. Writefull offers a range of specialized features tailored for scholarly communication. Among these: a title generator, which uses an abstract as an input; an abstract generator based on the text of the paper; and a paraphraser capable of rephrasing a sentence with three different progressive levels of intensity: low (minor changes to the original text); medium (changes at word and phrase level); and high (including changes at the phrase level or higher).</item> </ulist> <p>In this study, we utilized all of these tools to operationalize our research question on false positives in grammar checkers. Each tool, in fact, requires different procedures for data input and subsequent output processing through Turnitin and GPTZero, considering the different ways all these tools are designed and work—for instance, academic-oriented tools vs. non-specialized ones. 
We focused only on their paraphrasing capabilities, which we understand as the capability to rewrite a text in a way that alters its lexical and syntactic structure for the sake of clarity.</p> <hd id="AN0182828918-12">Method</hd> <p>The basic method we used was to apply each of these tools to our three texts and then examine the impact using Turnitin and GPTZero. Because of the wide range of interventions the tools we used can make (from suggesting titles to identifying spelling mistakes to creating entire texts from a single prompt) and the different ways and contexts in which they are used (from word processor plug-ins to free-standing web applications; from tools based on specific genres to more general applications), we focused primarily on a single task: paraphrasing (understood as the capability to rewrite a text in a way that alters its lexical and syntactic structure). This was common to all the tools and, not coincidentally, is the area in which both instructors and students seemed to experience the most confusion as to what was pedagogically desirable and acceptable.</p> <hd id="AN0182828918-13">Acquiring the texts</hd> <p>The first thing we did was acquire processable versions of our texts. We collected these by downloading them from their original online sources and converting them into plain text files (.txt) in order to eliminate extra-textual elements, such as graphic content, and residual HTML or XML elements related to the layout and formatting, hyperlinks, and other metadata. For the academic papers, we omitted the bibliographies and meta-textual elements such as the description of the figures and the content of footnotes. Each file used in this study is stored in a dedicated folder within a GitHub repository (Pafumi et al., [<reflink idref="bib28" id="ref56">28</reflink>]) according to our naming convention to ensure reproducibility. We named each file (.txt) in an informative way followed by a suffix indicating which tool would be used on the article. 
For instance, files with both the original and the rephrased texts were called "O'Donnell", "Perrault" or "Barbrook" followed by the name of the tool we used (e.g., "O'Donnell_GPT3.5"), whereas the AI detection reports (.pdf files) were named after "Turnitin" or "GPTZero" followed by the name of the file we ran through it (e.g., "GPTZero-O'Donnell-Copilot"). A Python file (.py) with the code used to obtain the visualizations was included as well.</p> <hd id="AN0182828918-14">Applying the tools</hd> <p>The next step was to apply the tools to our corpus of texts. As noted above, this required a different process for each tool.</p> <hd id="AN0182828918-15">ChatGPT 3.5 and 4</hd> <p>In the case of GPT 3.5 and 4, we pasted the text to be processed directly into the application interface, preceded by the prompt: "Could you improve my paper/story and its title, but maintaining the word count?" We did not include the reference lists (in the case of the two academic papers), which were instead removed before pasting and subsequently pasted back into the rewritten text after processing. This was to avoid the noise created by formulaic, non-content-focused passages such as in-line citations and bibliographic entries.</p> <hd id="AN0182828918-16">Copilot</hd> <p>Bing Copilot works slightly differently from ChatGPT. First, it requires one to set the desired "temperature" of the intervention (i.e., the nature and freedom with which the bot will apply changes to the text), choosing among "creative," "balanced," and "precise." In our case, we chose the "creative" mode on the assumption it would generate the most idiosyncratic text compared to the original. Afterward, we copied the texts to be processed (again without the reference list) into the chat textbox and prompted: "Improve my paper/story and its title by rewriting it for me, but maintaining the word count." 
Because Copilot limited prompts to 4000 characters, we had to divide the original text in each case into smaller chunks, prompting: "Do the exact same thing for the rest of it." When processing was finished, we copied the various chunks into a new .txt file and removed the links that Bing inserts at the bottom of each text as a way to refer to its sources, along with any meta-textual parts of its response (such as salutations and other stereotypical phrases explaining what it was about to do). We also reconstructed the original paragraph structure, which had often been modified in light of the character limit of the chatbox. All the modified texts were stored in separate .txt files, all using the naming convention discussed above.</p> <hd id="AN0182828918-17">EditPad</hd> <p>For EditPad, we used an approach similar to the one used with ChatGPT and Copilot, copying text from the original documents and inputting it into the paraphrasing tool. We also tried different combinations before choosing the one we assumed was most likely to be used by students, namely the "reworder" option, which changes the structure of the text to be paraphrased.</p> <p>A problem with EditPad lies in the unstable and limited nature of the software. This required splitting the original text into smaller chunks to avoid glitches and errors. Overall, EditPad produces poor-quality (though, as we will see, interesting) outcomes. This means that the text sometimes had to be generated multiple times or required extra cleaning later in order to provide a reasonable output (i.e., something similar to what an average-or-better student might submit for grading). During our experiment, for example, we had instances where EditPad produced texts in Korean, added blank spaces where they were not present in the original, and produced old XML entity references (such as andquot; and equot; to open and close straight quotation marks) as literals. 
In this case, we saved the rephrased texts in separate .txt files under their new names, as discussed above.</p> <hd id="AN0182828918-18">Grammarly and Writefull</hd> <p>In the case of Grammarly, we used the paid, premium version, which recently has begun to include a generative AI tool that can make different suggestions based on pre-defined or customizable sets of prompts. To submit the texts, we opened a copy of the original file using the application notepad, started the app, and asked it to improve the text and its title. After that, we accepted all suggestions and saved the output under a new filename.</p> <p>We used roughly the same process for Writefull, which works as a plug-in in most word processors—although in this case, we did not use the premium version, since it appears to differ from the regular version solely by removing a daily word limit. Other limitations are that texts must be in English and that the inputs must be between 40 and 2000 characters long. Accordingly, we divided the original texts into sections of the prescribed length and selected the "high" mode, which heavily rephrases the text by changing its structure.</p> <hd id="AN0182828918-19">Reviewing the results in Turnitin and GPTZero</hd> <p>The final step in our method involved inputting the original and rewritten texts into our AI detectors and comparing the results. For each test paper, we produced three heat maps, three scatter plots, three histograms, and three area graphs in order to visualize the values collected by GPTZero and Turnitin (these data are presented in tabular and graphic forms).</p> <p>These visualizations were chosen to represent the particular features of the texts analyzed in the most efficient way (see Tables 1, 2, and 3). 
The heat map is the most comprehensive among them because it provides an immediate description of the changes introduced in readability, simplicity, perplexity, and burstiness and their interrelatedness through a colour gradient corresponding to a range going from 0 to 200. The darker the colour on the heat map, the higher the values registered in the dataset and corresponding to that category. The heat map also accounts for the different tools used, apart from the original text.</p> <p>Table 1 Results of the application of various AI tools to the O'Donnell paper</p> <p> <ephtml> <table frame="hsides" rules="groups"><thead><tr><th align="left"><p>O'Donnell (<xref ref-type="bibr" rid="bibr2">2009</xref>)</p></th><th align="left"><p>Turnitin score (%)</p></th><th align="left"><p>GPTZero score (%)</p></th><th align="left"><p>Readability</p></th><th align="left"><p>SAT (%)</p></th><th align="left"><p>Simplicity</p></th><th align="left"><p>Perplexity</p></th><th align="left"><p>Burstiness</p></th><th align="left"><p>Avg. sentence length</p></th></tr></thead><tbody><tr><td align="left"><p>Original</p></td><td align="left"><p>0</p></td><td align="left"><p>0</p></td><td char="." align="char"><p>27.6</p></td><td char="." align="char"><p>3.7</p></td><td char="." align="char"><p>42.1</p></td><td char="." align="char"><p>55.5</p></td><td char="." align="char"><p>285.9</p></td><td char="." align="char"><p>25.5</p></td></tr><tr><td align="left"><p>EditPad</p></td><td align="left"><p>0</p></td><td align="left"><p>2</p></td><td char="." align="char"><p>17.7</p></td><td char="." align="char"><p>3.6</p></td><td char="." align="char"><p>41.6</p></td><td char="." align="char"><p>190.2</p></td><td char="." align="char"><p>321.6</p></td><td char="." align="char"><p>30.9</p></td></tr><tr><td align="left"><p>Writefull</p></td><td align="left"><p>0</p></td><td align="left"><p>2</p></td><td char="." align="char"><p>27.3</p></td><td char="." align="char"><p>3.1</p></td><td char="." 
align="char"><p>39.7</p></td><td char="." align="char"><p>59.7</p></td><td char="." align="char"><p>298.9</p></td><td char="." align="char"><p>22.7</p></td></tr><tr><td align="left"><p>Grammarly</p></td><td align="left"><p>0</p></td><td align="left"><p>0</p></td><td char="." align="char"><p>33.0</p></td><td char="." align="char"><p>3.7</p></td><td char="." align="char"><p>40.2</p></td><td char="." align="char"><p>70.8</p></td><td char="." align="char"><p>373.1</p></td><td char="." align="char"><p>16.5</p></td></tr><tr><td align="left"><p>GPT 3.5</p></td><td align="left"><p>11</p></td><td align="left"><p>65</p></td><td char="." align="char"><p>8.1</p></td><td char="." align="char"><p>5.1</p></td><td char="." align="char"><p>34.3</p></td><td char="." align="char"><p>72.2</p></td><td char="." align="char"><p>143.3</p></td><td char="." align="char"><p>23.1</p></td></tr></tbody></table> </ephtml> </p> <p>Table 2 Results of the application of various AI tools to the Barbrook et al. paper</p> <p> <ephtml> <table frame="hsides" rules="groups"><thead><tr><th align="left"><p>Barbrook et al</p></th><th align="left"><p>Turnitin score (%)</p></th><th align="left"><p>GPTZero score (%)</p></th><th align="left"><p>Readability</p></th><th align="left"><p>SAT (%)</p></th><th align="left"><p>Simplicity</p></th><th align="left"><p>Perplexity</p></th><th align="left"><p>Burstiness</p></th><th align="left"><p>Avg. sentence length</p></th></tr></thead><tbody><tr><td align="left"><p>Original</p></td><td align="left"><p>0</p></td><td align="left"><p>0</p></td><td char="." align="char"><p>28.1</p></td><td char="." align="char"><p>4.2</p></td><td char="." align="char"><p>34.1</p></td><td char="." align="char"><p>62.7</p></td><td char="." align="char"><p>64.1</p></td><td char="." align="char"><p>18.1</p></td></tr><tr><td align="left"><p>EditPad</p></td><td align="left"><p>0</p></td><td align="left"><p>0</p></td><td char="." align="char"><p>18.3</p></td><td char="." 
align="char"><p>4.7</p></td><td char="." align="char"><p>34.4</p></td><td char="." align="char"><p>172.0</p></td><td char="." align="char"><p>469.4</p></td><td char="." align="char"><p>19.4</p></td></tr><tr><td align="left"><p>Writefull</p></td><td align="left"><p>0</p></td><td align="left"><p>11</p></td><td char="." align="char"><p>22.7</p></td><td char="." align="char"><p>2.8</p></td><td char="." align="char"><p>31.4</p></td><td char="." align="char"><p>71.1</p></td><td char="." align="char"><p>79.4</p></td><td char="." align="char"><p>17.1</p></td></tr><tr><td align="left"><p>Grammarly</p></td><td align="left"><p>0</p></td><td align="left"><p>32</p></td><td char="." align="char"><p>21.1</p></td><td char="." align="char"><p>5.1</p></td><td char="." align="char"><p>36.7</p></td><td char="." align="char"><p>65.8</p></td><td char="." align="char"><p>76.1</p></td><td char="." align="char"><p>22.4</p></td></tr><tr><td align="left"><p>GPT 3.5</p></td><td align="left"><p>0</p></td><td align="left"><p>48</p></td><td char="." align="char"><p>10.5</p></td><td char="." align="char"><p>3.8</p></td><td char="." align="char"><p>26.6</p></td><td char="." align="char"><p>73.2</p></td><td char="." align="char"><p>66.1</p></td><td char="." align="char"><p>19.0</p></td></tr></tbody></table> </ephtml> </p> <p>Table 3 Results of the application of various AI tools to the Perrault story</p> <p> <ephtml> <table frame="hsides" rules="groups"><thead><tr><th align="left"><p>Perrault</p></th><th align="left"><p>Turnitin score (%)</p></th><th align="left"><p>GPTZero score (%)</p></th><th align="left"><p>Readability</p></th><th align="left"><p>SAT (%)</p></th><th align="left"><p>Simplicity</p></th><th align="left"><p>Perplexity</p></th><th align="left"><p>Burstiness</p></th><th align="left"><p>Avg. sentence length</p></th></tr></thead><tbody><tr><td align="left"><p>Original</p></td><td align="left"><p>0</p></td><td align="left"><p>0</p></td><td char="." 
align="char"><p>74.3</p></td><td char="." align="char"><p>0.6</p></td><td char="." align="char"><p>40.3</p></td><td char="." align="char"><p>50.8</p></td><td char="." align="char"><p>118.8</p></td><td char="." align="char"><p>18.3</p></td></tr><tr><td align="left"><p>EditPad</p></td><td align="left"><p>0</p></td><td align="left"><p>33</p></td><td char="." align="char"><p>68.2</p></td><td char="." align="char"><p>1.9</p></td><td char="." align="char"><p>39.5</p></td><td char="." align="char"><p>112.4</p></td><td char="." align="char"><p>4701.4</p></td><td char="." align="char"><p>19.7</p></td></tr><tr><td align="left"><p>Writefull</p></td><td align="left"><p>0</p></td><td align="left"><p>1</p></td><td char="." align="char"><p>74.2</p></td><td char="." align="char"><p>0.5</p></td><td char="." align="char"><p>38.7</p></td><td char="." align="char"><p>55.7</p></td><td char="." align="char"><p>85.0</p></td><td char="." align="char"><p>16.3</p></td></tr><tr><td align="left"><p>Grammarly</p></td><td align="left"><p>11</p></td><td align="left"><p>33</p></td><td char="." align="char"><p>70.2</p></td><td char="." align="char"><p>0.8</p></td><td char="." align="char"><p>39.4</p></td><td char="." align="char"><p>48.0</p></td><td char="." align="char"><p>103.3</p></td><td char="." align="char"><p>18.3</p></td></tr><tr><td align="left"><p>GPT 3.5</p></td><td align="left"><p>0</p></td><td align="left"><p>86</p></td><td char="." align="char"><p>63.8</p></td><td char="." align="char"><p>1.2</p></td><td char="." align="char"><p>35.4</p></td><td char="." align="char"><p>51.7</p></td><td char="." align="char"><p>213.8</p></td><td char="." align="char"><p>16.2</p></td></tr><tr><td align="left"><p>GPT 4</p></td><td align="left"><p>0</p></td><td align="left"><p>48</p></td><td char="." align="char"><p>64.0</p></td><td char="." align="char"><p>0.6</p></td><td char="." align="char"><p>33.0</p></td><td char="." align="char"><p>116.0</p></td><td char="." 
align="char"><p>386.8</p></td><td char="." align="char"><p>13.1</p></td></tr><tr><td align="left"><p>Microsoft Bing Copilot</p></td><td align="left"><p>0</p></td><td align="left"><p>0</p></td><td char="." align="char"><p>81.5</p></td><td char="." align="char"><p>0.5</p></td><td char="." align="char"><p>43.2</p></td><td char="." align="char"><p>43.3</p></td><td char="." align="char"><p>58.8</p></td><td char="." align="char"><p>17.5</p></td></tr></tbody></table> </ephtml> </p> <p>Similarly, the line chart represents a more fine-grained measure, that is, the average sentence length values obtained after the rewriting. The ranges go from a maximum of 30 to a minimum of 15 and are shown separately for each version of the same text. The percentage of SAT words resulting from the analysis of the vocabulary is visualized in the form of histograms, which have been colour-coded according to three SAT categories: green for values below 2%, yellow for values from 2% to below 4%, and red for values above 4%.</p> <p>The last visualization consists of the three area charts accounting for the difference in detection accuracy between GPTZero and Turnitin.</p> <hd id="AN0182828918-20">Results</hd> <p>The findings reveal interesting insights into how different text editing and AI tools manipulate text, focusing on various metrics such as readability, simplicity, perplexity, burstiness, and the likelihood of being flagged for plagiarism or AI generation.</p> <hd id="AN0182828918-21">GPT 3.5 and 4</hd> <p>GPT 3.5 and 4 often produce text that is likely to be flagged as AI-generated, primarily because, at the time of this study, detection tools like GPTZero and Turnitin were trained on the same models. Texts run through GPT 3.5, for instance, show moderate to high GPTZero scores, ranging from 48 to 86%, indicating a high likelihood of being flagged as AI-generated. This is especially evident in datasets like Barbrook et al. and Perrault. 
However, Turnitin scores remain mostly at 0%, with a single dataset (O'Donnell, [<reflink idref="bib2" id="ref57">2</reflink>]) showing an 11% score, suggesting limited AI detection. In terms of readability and simplicity, GPT 3.5 tends to produce text with lower readability and simplicity scores, making it harder to read and more complex. For example, readability scores are as low as 8.1 in the GPT 3.5-modified text of O'Donnell and 10.5 in Barbrook et al., while simplicity scores are 26.6 in Barbrook et al. and 34.3 in O'Donnell. This complexity is further highlighted by moderate perplexity values, ranging from 51.7 in Perrault to 73.2 in Barbrook et al., and burstiness scores like 213.8 in Perrault, which add a human-like quality to the text.</p> <p>GPT 4, on the other hand, produces text that closely mimics human variability due to its high perplexity and burstiness. It shows high perplexity values, especially 116.0 when applied to Perrault, and significant burstiness, particularly 386.8 in Perrault, reflecting substantial sentence structure variability. However, this comes at the expense of readability and simplicity, with scores like 7.7 in the GPT 4-modified O'Donnell and 64.0 in Perrault for readability, and 21.9 in Barbrook et al. and 29.3 in O'Donnell for simplicity.</p> <hd id="AN0182828918-22">Grammarly</hd> <p>Grammarly—which is known for its ability to enhance readability and simplicity as well as flag grammatical errors—did relatively better than GPT. When applied to the texts, it generally produced a Turnitin score of 0%, with one exception in Perrault, where there was an 11% score, suggesting potential issues with AI intervention. The GPTZero scores for the modified texts ranged from 0 to 33%, indicating a moderate likelihood of being flagged as AI-generated. 
Readability scores ranged between 21.1 and 70.2 and simplicity scores between 36.7 and 40.2, with moderate perplexity and burstiness (65.8 perplexity and 76.1 burstiness in Barbrook et al.). To summarize, these numbers suggest that Grammarly's modifications introduce slight but noticeable human-like variability into the text.</p> <hd id="AN0182828918-23">EditPad</hd> <p>While Grammarly focuses on enhancing readability and simplicity with moderate variability, EditPad stands out for its unique approach to text manipulation. It introduces significant variability through extreme perplexity and burstiness, offering a distinct contrast to Grammarly's more balanced modifications. The tool shows no issues with AI intervention as it consistently has a 0% Turnitin score, and generally low GPTZero scores, ranging from 0 to 33%. It also maintains moderate readability, with scores like 17.7 in EditPad-modified O'Donnell and 68.2 in Perrault. Its simplicity scores vary from 34.4 to 41.6, indicating that it keeps texts relatively simple. However, EditPad is notable for its extremely high perplexity and burstiness, particularly in Perrault, with 112.4 perplexity and 4701.4 burstiness. This suggests a significant human-like quality with highly varied sentence structures. Even in other datasets, it maintains high values, like 172.0 perplexity and 469.4 burstiness in Barbrook et al. EditPad's perplexity and burstiness enable it to generate remarkably human-like text, though this can also compromise readability, which makes it ideal for applications that require high structural variability.</p> <hd id="AN0182828918-24">Writefull</hd> <p>Writefull tends to take a more measured approach by maintaining a natural level of sentence structure variability. This allows it to refine text effectively while preserving human-like qualities, making it suitable for minor edits that enhance readability without drastically altering the original structure. 
Writefull has a consistent 0% Turnitin score, indicating no AI intervention, and generally low GPTZero scores, except for one instance where it reaches 11%. It maintains readability close to the original text, with scores ranging from 22.7 to 74.2, and slightly reduces simplicity, with scores from 31.4 to 39.7. This balance makes the tool effective for making minor edits that enhance readability without significantly altering the original structure or triggering AI detection. Its moderate perplexity and burstiness, such as 71.1 perplexity in Barbrook et al. and 85.0 burstiness in Perrault, suggest a natural level of variability in sentence structure. The tool refines text effectively while preserving human-like qualities through balanced perplexity and burstiness.</p> <hd id="AN0182828918-25">Microsoft Bing Copilot</hd> <p>Microsoft Bing Copilot consistently has 0% Turnitin and GPTZero scores, indicating no AI intervention and a low likelihood of being flagged as AI-generated. It generally produces the most readable and simplest text among the tools, with readability scores up to 81.5 and simplicity scores up to 43.2. This makes its output among the easiest to read and understand. Despite its focus on readability, Copilot maintains a moderate level of perplexity and burstiness, as demonstrated by scores like 43.3 perplexity and 58.8 burstiness in Copilot-modified Perrault. This suggests that while the text is easy to read, it also retains a degree of human-like variability, which is crucial for making AI-generated text appear more natural and less mechanical. The balance between readability and human-like qualities makes Microsoft Bing Copilot particularly effective for generating clear and accessible text. 
Its consistency in avoiding AI detection and plagiarism issues further enhances its reliability for straightforward communication, which makes it a valuable tool for users who require readable and human-like text without the risk of being flagged by detection systems.</p> <hd id="AN0182828918-26">Discussion</hd> <p>In the study, the original texts of the primary sources were processed with different AI-powered writing tools, including generative AI models (like GPT and Microsoft Bing Copilot), hybrid tools (such as Grammarly and Writefull), and predictive paraphrasers (like EditPad), all of which produced rephrased versions of the texts. We then evaluated the interaction between the AI detector GPTZero and the AI-generated rephrased texts. The results of this study show a complex and dynamic interaction between AI-powered writing tools, human-generated texts, and AI detection systems, highlighting significant challenges and implications for academic integrity and assessment policies. These findings show the need for educators to carefully interpret AI-detector results and focus on managing interactions with AI in educational settings, rather than attempting to eliminate its use.</p> <p>Starting with the first paper, the values of the O'Donnell article are intriguing (see Fig. 1). This is an academic article for which we anticipated high values of burstiness in the original since it is one of the key features of academic writing the LLM is trying to reproduce. However, the values for the same parameter in the part of the diagram devoted to GPT 3.5 and 4 are surprisingly lower than the average—especially in the second instance. This means that GPT models are actually flattening the complexity of the text when asked to rephrase it, consequently becoming less recognizable to detection tools. Another intriguing result in this text is given by EditPad, whose perplexity value is higher than the average produced by other tools. 
This might suggest that EditPad generates higher perplexity because it is not using generative AI, whereas the other tools achieve lower perplexity because they do.</p> <p>Graph: Fig. 1 Heatmap 1: O'Donnell dataset heatmap</p> <p>This pattern is also confirmed by the second set of results related to Perrault's "Little Red Riding Hood" (or as ChatGPT prefers to call it, "Scarlet Cloak"; see Fig. 2). As expected of a children's fairy tale, the values of burstiness and perplexity for the original are very low.</p> <p>Graph: Fig. 2 Heatmap 2: Perrault dataset heatmap</p> <p>However, the values for EditPad are again very high for both metrics, with a peak of 4701.4 for burstiness. Contrary to what was observed previously with O'Donnell ([<reflink idref="bib2" id="ref58">2</reflink>]), GPT 3.5 and 4 here increase the values of burstiness with a more foreignizing effect on the part of the model itself, which struggles to recognize the paraphrased text. The result changes partially in the case of Barbrook et al. (see Fig. 3), wherein the only rewriting detected by the metrics is EditPad, as in the previous cases. In all of the cases analyzed, Writefull and Grammarly seem to avoid being picked up by either detector completely. Looking at the values of average sentence length by tool used, it is possible to see that in every text—except Barbrook et al.'s article—the original average is always greater than that of the other texts. This indicates a tendency of LLMs to aim at simplifying while rewriting.</p> <p>Graph: Fig. 3 Heatmap 3: Barbrook et al. dataset heatmap</p> <p>One interesting result is the significant shift between the paid and unpaid ChatGPT. This shift relates to the fact that GPT 4 reduces sentence length more effectively than its predecessor, as observed in all three line charts (see Figs. 4, 5, and 6). 
The results obtained by rephrasing the texts with Grammarly, in turn, are not very informative: they seem to reflect the nature of the text or the random seed selection more than any systematic feature. EditPad, by contrast, shows the highest average sentence length.</p> <p>Graph: Fig. 4 Line chart 1: Average sentence length by tool for O'Donnell</p> <p>Graph: Fig. 5 Line chart 2: Average sentence length by tool for Perrault</p> <p>Graph: Fig. 6 Line chart 3: Average sentence length by tool for Barbrook et al.</p> <p>Notably, paraphrasing tools like Writefull and pure LLMs like GPT 3.5 show very similar values, which may let them go almost unnoticed by detection software. As for vocabulary (see Figs. 7, 8, and 9), the SAT percentage indexes the lexical intervention that the paraphrasing tool performs. In O'Donnell's article, the original is already fairly well aligned with the share of SAT words expected in an academic paper, yet GPT 3.5 and 4 increase that share even more than other AIs, such as Copilot, and the hybrid paraphrasing tools. A similar tendency appears in the results for Barbrook et al.'s article, where the original SAT percentage is above 4% but the SAT values registered for GPT 4 and for Grammarly are even higher than the original. Lastly, Perrault's text shows, consistent with our expectations, the lowest values, while EditPad peaks at its maximum. This reflects the fact that the original text is already quite low in SAT words, even though GPT 3.5 and EditPad score fairly high percentages. A concerning finding is that Microsoft Bing Copilot and Writefull were able to evade AI detection fairly consistently, registering percentages almost equal to or even lower than those of the original texts.</p> <p>Graph: Fig. 7 Histogram 1: SAT% by tool for O'Donnell</p> <p>Graph: Fig. 8 Histogram 2: SAT% by tool for Perrault</p> <p>Graph: Fig. 
9 Histogram 3: SAT% by tool for Barbrook et al.</p> <p>Overall, the comparison of Turnitin and GPTZero accuracy scores is very informative about the situation we have just outlined (see Figs. 10, 11, and 12). Neither Turnitin nor GPTZero can detect AI interventions definitively. The main difference is that Turnitin, in most cases, does not reveal when a text has been touched by AI, with the exception of Grammarly's rewriting of Perrault's "Little Red Riding Hood." The Barbrook et al. example makes this very clear: Turnitin does not detect any intervention at all, which makes it, in absolute terms, less accurate than GPTZero. However, GPTZero is not free of flaws either. While its scores are always apparent for academic writing, especially for Microsoft Bing Copilot and the ChatGPT models, paraphrasing tools always score low values, Writefull in particular. The comparison between Turnitin and GPTZero scores highlights their limitations in definitively detecting AI interventions.</p> <p>Graph: Fig. 10 Area chart 1: Turnitin versus GPTZero scores of O'Donnell dataset</p> <p>Graph: Fig. 11 Area chart 2: Turnitin versus GPTZero scores of Perrault dataset</p> <p>Graph: Fig. 12 Area chart 3: Turnitin versus GPTZero scores of Barbrook dataset</p> <p>The analysis of these AI tools, such as GPT 4, EditPad, and Grammarly, demonstrates that while they can mimic human variability to some extent, they often introduce issues with readability, simplicity, or detectability. GPT models, for instance, tend to flatten the complexity of academic writing, reducing burstiness and making the text less recognizable to detection tools. This was particularly evident in the analysis of the O'Donnell text, where GPT 3.5 and GPT 4 produced lower-than-average burstiness values. 
EditPad, on the other hand, significantly increases both perplexity and burstiness, which makes its output more human-like and may allow it to bypass AI detection. This pattern was seen, for example, in the analysis of Perrault's "Little Red Riding Hood," where EditPad's output showed extreme burstiness, while GPT models introduced a foreignizing effect that made the text harder for the model to recognize.</p> <p>Grammarly and Writefull manage to avoid detection while keeping readability and complexity close to those of the original text. Grammarly enhances readability and simplicity effectively, with scores from 33.0 to 70.2 for readability and from 36.7 to 39.4 for simplicity across the various datasets. It achieves this while maintaining moderate perplexity and burstiness, suggesting that Grammarly's modifications introduce slight but noticeable human-like variability into the text. Writefull likewise maintains moderate perplexity and burstiness, which indicates that it retains a natural level of variability in sentence structure. This balance makes Writefull effective for minor edits that enhance readability without significantly altering the original structure or triggering AI detection. Microsoft Bing Copilot avoids detection almost consistently and maintains readability and simplicity while mimicking human writing. It produces highly readable text, with scores such as 28.6 in the O'Donnell dataset and 81.5 in Perrault.</p> <hd id="AN0182828918-27">Conclusion</hd> <p>This paper presents an analysis of various AI tools, the different "traces" they leave on different types of texts, and the effectiveness of AI detection tools in flagging these interventions. It focuses mainly on the characteristics of AI interventions in texts and how they vary from tool to tool. 
Our study was prompted by noticing that more innocuous tools, like Grammarly, were being picked up by AI detection software even though they are considered a more "acceptable" way of using AI in the classroom. As AI becomes more embedded in consumer software, avoiding AI-generated content in academic work is increasingly difficult. Educators are advised to interpret AI-detector results carefully and to focus on managing interactions with AI rather than simply trying to eliminate its use in educational settings. Teachers in postsecondary education (and other sectors) must approach the use of LLM-based generative AI, and especially the interpretation of AI-detector results, with great caution: not only do different tools leave potentially different thumbprints, but perfectly legitimate uses of AI-enabled tools also produce "false positives", so detector results should not be taken as evidence of academic misconduct. This is further compounded by the rise of prompt engineering techniques. With careful, detailed instructions and advanced prompting, students can guide a model to generate precise and highly personalized responses that are coherent and tailored to specific styles or topics, making it difficult to distinguish human-written work from AI-generated work. The question therefore shifts from "has this writing been touched by AI?" to "how was the inevitable presence of AI handled?" Responding to these challenges by attempting to enforce a blanket ban on AI use in classroom settings is likely to become even less effective, as the accuracy scores of AI detection software like GPTZero and Turnitin have demonstrated. The study, in fact, revealed limitations in the capabilities of current detection tools, particularly when assessing paraphrased content, as shown in the discussion by the discrepancies between the performance of Turnitin and that of GPTZero. 
For example, Turnitin rarely reveals AI interventions in a text, while GPTZero remains imperfect, particularly when assessing paraphrased content.</p> <p>Considering the impact of the various tools on the different texts, we advise that suitable (if necessarily interim) assessment policies be designed for this fast-developing area. These policies should not aim to eliminate AI use entirely, as this would be impractical given the inevitable integration of these tools into various academic endeavours. Institutional policies in this regard should focus on asking: "How can our faculty and students use these tools without stifling the development and advancement of the critical thinking and innovation for which academia is known? What processes can be put in place to assess whether students are using AI tools to genuinely enhance learning and performance rather than to circumvent academic effort?" It is important to note that concern about the potential for unfair academic punishment due to AI use is not limited to students. Educators, whose work is central to knowledge production, also face this risk. While there may be several ways to address these concerns, we recommend developing detailed AI-use guidelines and training programs that educate faculty on both the ethical and the practical implications of using these tools. Transparency about how AI is used, for example in the methodology section and through the inclusion of an "AI Disclosure Statement," could also help alleviate concerns and promote trust in the research process and in the scholarly output of educators and academics.</p> <p>Because this paper focuses on students, we recommend that educators concentrate on teaching students how to use these tools effectively and correctly rather than attempting to ban the use of AI altogether, as many AI detection tools are imperfect and may lead to students being accused of plagiarism for honest use of AI. 
Educators can further advance this goal by stating clear parameters for AI use in their courses. To encourage transparency about their AI use, students might be asked to disclose their interactions with AI, including the tools and possibly the prompts they used, and/or to share links to the relevant conversations with the educator. AI use in real-world scenarios also often involves a combined application of tools, where outputs from one tool are refined or repurposed using another. For instance, a student might generate an essay draft using ChatGPT, refine it with an AI-powered writing assistant (e.g. Grammarly or ProWritingAid), and finalize it with a paraphrasing tool (e.g. Quillbot). To better understand how this multi-tool use affects detection, or enables AI-generated content to bypass it, educators could require students to document the specific tools used at each stage of producing a single piece, along with the purpose of each step. Incorporating activities such as in-class writing can also provide educators with a baseline understanding of students' unaided capabilities. As for academics, the limited size of our sample invites researchers to replicate this methodology with a larger dataset, possibly from a broader and/or more interdisciplinary perspective. Current AI detection tools predominantly focus on English-language outputs, which leaves a significant gap in understanding how they perform across other languages. 
Expanding the assessments beyond English could reveal language-specific nuances in AI-generated texts and so inform the development of detection methods that cover non-English languages.</p> <hd id="AN0182828918-28">Acknowledgements</hd> <p>Not applicable.</p> <hd id="AN0182828918-29">Author contributions</hd> <p>Conceptualization: BB, AK, FO, DO, DP; methodology: BB, AK, FO, DO, DP, MP; software: DP; validation: AK, FO, DP; formal analysis: BB, AK, FO, DO, DP; investigation: BB, AK, FO, DO, DP; resources: BB, DO; data curation: DP; writing (original draft preparation): BB, AK, FO, DO, DP; writing (review & editing): BB, AK, DP, FO, DO, MP; visualization: DP; supervision: BB, DO; project administration: BB, DO.</p> <hd id="AN0182828918-30">Funding</hd> <p>This research received no direct funding.</p> <hd id="AN0182828918-31">Availability of data and materials</hd> <p>The datasets generated and/or analysed during the current study are available on GitHub: https://github.com/DavidePafumi/scarlet-cloak-ai-repository</p> <hd id="AN0182828918-32">Declarations</hd> <p></p> <hd id="AN0182828918-33">Competing interests</hd> <p>The authors have no competing interests to declare.</p> <hd id="AN0182828918-34">Publisher's Note</hd> <p>Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.</p> <ref id="AN0182828918-35"> <title> References </title> <blist> <bibl id="bib1" idref="ref33" type="bt">1</bibl> <bibtext> Barbrook AC, Howe CJ, Blake N, Robinson P. The phylogeny of the Canterbury Tales. Nature. 1998; 394: 839-840. 10.1038/29667</bibtext> </blist> <blist> <bibl id="bib2" idref="ref34" type="bt">2</bibl> <bibtext> O'Donnell, D. P. (2009). Byte me: Technological education and the humanities. Heroic Age, 12. https://jemne.org/issues/12/em.php</bibtext> </blist> <blist> <bibl id="bib3" idref="ref25" type="bt">3</bibl> <bibtext> Perrault, C. (1891). 
Little Red Riding Hood. In A. Lang (Ed.), The blue fairy book (5th ed., pp. 51–53). Longmans, Green, and Company. https://sites.pitt.edu/~dash/type0333.html#perrault</bibtext> </blist> <blist> <bibl id="bib4" idref="ref4" type="bt">4</bibl> <bibtext> Adams D, Chuah K-M. Artificial intelligence-based tools in research writing: Current trends and future potentials. In: Churi PP, Joshi S, Elhoseny M, Omrane A, editors. Artificial intelligence in higher education. CRC Press; 2022: 169-184. 10.1201/9781003184157-9</bibtext> </blist> <blist> <bibl id="bib5" idref="ref5" type="bt">5</bibl> <bibtext> Baidoo-Anu D, Ansah LO. Education in the era of generative artificial intelligence (AI): Understanding the potential benefits of ChatGPT in promoting teaching and learning. Journal of AI. 2023; 7; 1: 52-62. 10.61969/jai.1337500</bibtext> </blist> <blist> <bibl id="bib6" idref="ref32" type="bt">6</bibl> <bibtext> Bordalejo, B., Pafumi, D., Pearce, M. S., & O'Donnell, D. P. (2024). Re-imagining research tools for teaching in a post-pandemic world [Paper presentation]. DH 2024 Conference Reinvention & Responsibility, Washington, D.C. https://doi.org/10.5281/ZENODO.13255256</bibtext> </blist> <blist> <bibl id="bib7" idref="ref27" type="bt">7</bibl> <bibtext> Chan CKY. Is AI changing the rules of academic misconduct? An in-depth look at students' perceptions of 'AI-giarism'. arXiv. 2023. 10.48550/arXiv.2306.03358</bibtext> </blist> <blist> <bibl id="bib8" idref="ref23" type="bt">8</bibl> <bibtext> Chechitelli, A. (2023, March 16). Understanding false positives within our AI writing detection capabilities. Turnitin. https://<ulink href="http://www.turnitin.com/blog/understanding-false-positives-within-our-ai-writing-detection-capabilities">www.turnitin.com/blog/understanding-false-positives-within-our-ai-writing-detection-capabilities</ulink></bibtext> </blist> <blist> <bibl id="bib9" idref="ref40" type="bt">9</bibl> <bibtext> Chen SF, Beeferman D, Rosenfeld R. 
Evaluation metrics for language models. 2018; Carnegie Mellon University. 10.1184/R1/6605324.v1</bibtext> </blist> <blist> <bibtext> D'Agostino, S. (2023, June 1). Turnitin's AI detector: Higher-than-expected false positives. Inside Higher Ed. https://<ulink href="http://www.insidehighered.com/news/quick-takes/2023/06/01/turnitins-ai-detector-higher-expected-false-positives">www.insidehighered.com/news/quick-takes/2023/06/01/turnitins-ai-detector-higher-expected-false-positives</ulink></bibtext> </blist> <blist> <bibtext> Weber-Wulff D, Anohina-Naumeca A, Bjelobaba S, Foltýnek T, Guerrero-Dib J, Popoola O. Testing of detection tools for AI-generated text. arXiv. 2023. 10.48550/arXiv.2306.15666</bibtext> </blist> <blist> <bibtext> EditPad. (2023). Online notepad & wordpad. Just Great Software. https://<ulink href="http://www.editpad.org/">www.editpad.org/</ulink></bibtext> </blist> <blist> <bibtext> Elkhatat AM, Elsaid K, Almeer S. Evaluating the efficacy of AI content detection tools in differentiating between human and AI-generated text. International Journal for Educational Integrity. 2023. 10.1007/s40979-023-00140-5</bibtext> </blist> <blist> <bibtext> Giray L. 'Don't let Grammarly overwrite your style and voice:' Writers' advice on using Grammarly in writing. Internet Reference Services Quarterly. 2024. 10.1080/10875301.2024.2344762</bibtext> </blist> <blist> <bibtext> Gorichanaz T. Accused: How students respond to allegations of using ChatGPT on assessments. Learning: Research and Practice. 2023; 9; 2: 183-196. 10.1080/23735082.2023.2254787</bibtext> </blist> <blist> <bibtext> Grammarly. (2023a). Grammarly: Grammar checker and AI writing app. Grammarly Inc. https://<ulink href="http://www.grammarly.com/">www.grammarly.com/</ulink></bibtext> </blist> <blist> <bibtext> Grammarly. (2023b, September 29). Putting Grammarly's generative AI capability into action. 
Grammarly. https://<ulink href="http://www.grammarly.com/blog/putting-grammarlys-generative-ai-capability-into-action/">www.grammarly.com/blog/putting-grammarlys-generative-ai-capability-into-action/</ulink></bibtext> </blist> <blist> <bibtext> Hoonlor A, Szymanski BK, Zaki MJ, Thompson J. An evolution of computer science research. Communications of the ACM. 2013; 56; 10: 74-83. 10.1145/2500892</bibtext> </blist> <blist> <bibtext> Kuykendall, K. (2023, May 30). Turnitin AI detector analyzed 38M submissions in its first 6 weeks; updates answer educator feedback. Technological Horizons in Education Journal (THE). https://thejournal.com/Articles/2023/05/30/Turnitin-Shares-Stats-From-AI-Detection-of-38M-Submissions-Tweaks-Detector-Feature.aspx</bibtext> </blist> <blist> <bibtext> Lambert J, Stevens M. ChatGPT and generative AI technology: A mixed bag of concerns and new opportunities. Computers in the Schools. 2023; 41; 4: 559-583. 10.1080/07380569.2023.2256710</bibtext> </blist> <blist> <bibtext> Matsui K. Delving into PubMed records: Some terms in medical writing have drastically changed after the arrival of ChatGPT. medRxiv. 2024. 10.1101/2024.05.14.24307373</bibtext> </blist> <blist> <bibtext> Microsoft. (2023). Bing Copilot. Microsoft. https://copilot.microsoft.com/</bibtext> </blist> <blist> <bibtext> Nguyen A, Hong Y, Dang B, Huang X. Human-AI collaboration patterns in AI-assisted academic writing. Studies in Higher Education. 2024; 49; 5: 847-864. 10.1080/03075079.2024.2323593</bibtext> </blist> <blist> <bibtext> Nikolic S, Daniel S, Haque R, Belkina M, Hassan GM, Grundy S, Lyden S, Neal P, Sandison C. ChatGPT versus engineering education assessment: A multidisciplinary and multi-institutional benchmarking and analysis of this generative artificial intelligence tool to investigate assessment integrity. European Journal of Engineering Education. 2023; 48; 4: 559-614. 
10.1080/03043797.2023.2213169</bibtext> </blist> <blist> <bibtext> O'Neill R, Russell AMT. Grammarly: Help or hindrance? Academic learning advisors' perceptions of an online grammar checker. Journal of Academic Language and Learning. 2019; 13; 1: A88-A107</bibtext> </blist> <blist> <bibtext> OpenAI. (2023). ChatGPT. OpenAI. https://chat.openai.com/</bibtext> </blist> <blist> <bibtext> Ostertag G. Meaning by courtesy: LLM-generated texts and the illusion of content. The American Journal of Bioethics. 2023; 23; 10: 91-93. 10.1080/15265161.2023.2249851</bibtext> </blist> <blist> <bibtext> Pafumi, D., Bordalejo, B., Onuh, F., O'Donnell, D., & Khalid, A. I. (2024). GitHub - DavidePafumi/Scarlet-Cloak-Ai-Repository. https://github.com/DavidePafumi/scarlet-cloak-ai-repository</bibtext> </blist> <blist> <bibtext> Pafumi, D., Onuh, F., & Khalid, A. I. (2023). Artificial intelligence and how to detect it in students' submissions. https://doi.org/10.5281/ZENODO.13393668</bibtext> </blist> <blist> <bibtext> Pereira, A. (2024, May 8). Top 7 AI academic writing tools for researchers. Career in STEM. https://<ulink href="http://www.careerinstem.com/top-7-ai-academic-writing-tools-for-researchers/">www.careerinstem.com/top-7-ai-academic-writing-tools-for-researchers/</ulink></bibtext> </blist> <blist> <bibtext> Perkins, M., Roe, J., Postma, D., McGaughran, J., & Hickerson, D. (2023). Game of tones: Faculty detection of GPT-4 generated content in university assessments. arXiv.org. https://doi.org/10.48550/arXiv.2305.18081</bibtext> </blist> <blist> <bibtext> Porsdam Mann S, Earp BD, Møller N, Vynn S, Savulescu J. AUTOGEN: A personalized large language model for academic enhancement - ethics and proof of principle. The American Journal of Bioethics. 2023; 23; 10: 28-41. 10.1080/15265161.2023.2233356</bibtext> </blist> <blist> <bibtext> Porutiu, T. (2024, August 7). 14 best AI writing software tools of 2024 (top picks). AuthorityHacker. 
https://<ulink href="http://www.authorityhacker.com/best-ai-writing-software/">www.authorityhacker.com/best-ai-writing-software/</ulink></bibtext> </blist> <blist> <bibtext> Quidwai, A., Li, C., & Dube, P. (2023). Beyond black box AI generated plagiarism detection: From sentence to document level. In Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023) (pp. 727–735). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.bea-1.58</bibtext> </blist> <blist> <bibtext> Ray PP. ChatGPT: A comprehensive review on background, applications, key challenges, bias, ethics, limitations and future scope. Internet of Things and Cyber-Physical Systems. 2023; 3: 121-154. 10.1016/j.iotcps.2023.04.003</bibtext> </blist> <blist> <bibtext> Smith, K. (2023, November 10). One size does not fit all: Interpreting Turnitin's AI writing score. Turnitin. https://<ulink href="http://www.turnitin.com/blog/one-size-does-not-fit-all-interpreting-turnitin-s-ai-writing-score">www.turnitin.com/blog/one-size-does-not-fit-all-interpreting-turnitin-s-ai-writing-score</ulink></bibtext> </blist> <blist> <bibtext> Tian, E. [@edward_the6]. (2023, January 2). I spent New Years building GPTZero - an app that can quickly and efficiently detect whether an essay is ChatGPT or human written [Tweet]. Twitter/X. https://twitter.com/edward_the6/status/1610067688449007618</bibtext> </blist> <blist> <bibtext> Tian, E., & Cui, A. (2023). GPTZero: Towards detection of AI-generated text using zero-shot and supervised methods. GPTZero. https://gptzero.me/</bibtext> </blist> <blist> <bibtext> Turnitin. (2023). Turnitin. Turnitin LLC. https://<ulink href="http://www.turnitin.com/">www.turnitin.com/</ulink></bibtext> </blist> <blist> <bibtext> Turnitin. (2024). 
Turnitin's AI writing detection capabilities. https://<ulink href="http://www.turnitin.com/solutions/topics/ai-writing/ai-detector/">www.turnitin.com/solutions/topics/ai-writing/ai-detector/</ulink></bibtext> </blist> <blist> <bibtext> Writefull. (2023). Writefull. Writefull. https://<ulink href="http://www.writefull.com/">www.writefull.com/</ulink></bibtext> </blist> <blist> <bibtext> Zhang L, Amos C, Pentina I. Interplay of rationality and morality in using ChatGPT for academic misconduct. Behaviour and Information Technology. 2024. 10.1080/0144929X.2024.2325023</bibtext> </blist> </ref> <aug> <p>By Barbara Bordalejo; Davide Pafumi; Frank Onuh; A. K. M. Iftekhar Khalid; Morgan Slayde Pearce and Daniel Paul O'Donnell</p> </aug> <nolink nlid="nl1" bibid="bib14" firstref="ref1"></nolink> <nolink nlid="nl2" bibid="bib23" firstref="ref2"></nolink> <nolink nlid="nl3" bibid="bib25" firstref="ref3"></nolink> <nolink nlid="nl4" bibid="bib11" firstref="ref6"></nolink> <nolink nlid="nl5" bibid="bib20" firstref="ref8"></nolink> <nolink nlid="nl6" bibid="bib31" firstref="ref9"></nolink> <nolink nlid="nl7" bibid="bib27" firstref="ref10"></nolink> <nolink nlid="nl8" bibid="bib32" firstref="ref12"></nolink> <nolink nlid="nl9" bibid="bib30" firstref="ref16"></nolink> <nolink nlid="nl10" bibid="bib33" firstref="ref17"></nolink> <nolink nlid="nl11" bibid="bib13" firstref="ref18"></nolink> <nolink nlid="nl12" bibid="bib38" firstref="ref19"></nolink> <nolink nlid="nl13" bibid="bib39" firstref="ref20"></nolink> <nolink nlid="nl14" bibid="bib35" firstref="ref21"></nolink> <nolink nlid="nl15" bibid="bib24" firstref="ref22"></nolink> <nolink nlid="nl16" bibid="bib34" firstref="ref24"></nolink> <nolink nlid="nl17" bibid="bib15" firstref="ref26"></nolink> <nolink nlid="nl18" bibid="bib42" firstref="ref28"></nolink> <nolink nlid="nl19" bibid="bib29" firstref="ref29"></nolink> <nolink nlid="nl20" bibid="bib21" 
firstref="ref30"></nolink> <nolink nlid="nl21" bibid="bib37" firstref="ref42"></nolink> <nolink nlid="nl22" bibid="bib18" firstref="ref43"></nolink> <nolink nlid="nl23" bibid="bib40" firstref="ref45"></nolink> <nolink nlid="nl24" bibid="bib10" firstref="ref46"></nolink> <nolink nlid="nl25" bibid="bib19" firstref="ref47"></nolink> <nolink nlid="nl26" bibid="bib36" firstref="ref48"></nolink> <nolink nlid="nl27" bibid="bib26" firstref="ref49"></nolink> <nolink nlid="nl28" bibid="bib22" firstref="ref51"></nolink> <nolink nlid="nl29" bibid="bib12" firstref="ref52"></nolink> <nolink nlid="nl30" bibid="bib16" firstref="ref53"></nolink> <nolink nlid="nl31" bibid="bib17" firstref="ref54"></nolink> <nolink nlid="nl32" bibid="bib41" firstref="ref55"></nolink> <nolink nlid="nl33" bibid="bib28" firstref="ref56"></nolink>
Header DbId: eric
DbLabel: ERIC
An: EJ1460975
PubType: Academic Journal
PubTypeId: academicJournal
Items – Name: Title
  Label: Title
  Group: Ti
  Data: 'Scarlet Cloak and the Forest Adventure': A Preliminary Study of the Impact of AI on Commonly Used Writing Tools
– Name: Language
  Label: Language
  Group: Lang
  Data: English
– Name: Author
  Label: Authors
  Group: Au
  Data: Barbara Bordalejo, Davide Pafumi (ORCID 0000-0002-1113-187X), Frank Onuh, A. K. M. Iftekhar Khalid, Morgan Slayde Pearce, Daniel Paul O'Donnell
– Name: TitleSource
  Label: Source
  Group: Src
  Data: International Journal of Educational Technology in Higher Education. 2025 22.
– Name: Avail
  Label: Availability
  Group: Avail
  Data: BioMed Central, Ltd. Available from: Springer Nature. 233 Spring Street, New York, NY 10013. Tel: 800-777-4643; Tel: 212-460-1500; Fax: 212-348-4505; e-mail: customerservice@springernature.com; Web site: https://www.springer.com/gp/biomedical-sciences
– Name: PeerReviewed
  Label: Peer Reviewed
  Group: SrcInfo
  Data: Y
– Name: Pages
  Label: Page Count
  Group: Src
  Data: 25
– Name: DatePubCY
  Label: Publication Date
  Group: Date
  Data: 2025
– Name: TypeDocument
  Label: Document Type
  Group: TypDoc
  Data: Journal Articles; Reports - Research
– Name: Subject
  Label: Descriptors
  Group: Su
  Data: Artificial Intelligence, Writing (Composition), Quality Control, Writing Evaluation, Writing Improvement, Writing Strategies, Technology Uses in Education, Readability, Identification
– Name: DOI
  Label: DOI
  Group: ID
  Data: 10.1186/s41239-025-00505-5
– Name: ISSN
  Label: ISSN
  Group: ISSN
  Data: 2365-9440
– Name: Abstract
  Label: Abstract
  Group: Ab
  Data: This paper explores the growing complexity of detecting and differentiating generative AI from other AI interventions. Initially prompted by noticing how tools like Grammarly were being flagged by AI detection software, it examines how these popular tools such as Grammarly, EditPad, Writefull, and AI models such as ChatGPT and Microsoft Bing Copilot affect human-generated texts and how accurately current AI-detection systems, including Turnitin and GPTZero, can assess texts for use of these tools. The results highlight that widely used writing aids, even those not primarily generative, can trigger false positives in AI detection tools. In order to provide a dataset, the authors applied different AI-enhanced tools to a number of texts of different styles that were written prior to the development of consumer AI tools, and evaluated their impact through key metrics such as readability, perplexity, and burstiness. The findings reveal that tools like Grammarly that subtly enhance readability also trigger detection and increase false positives, especially for non-native speakers. In general, paraphrasing tools score low values in AI detection software, allowing the changes to go mostly unnoticed by the software. However, the use of Microsoft Bing Copilot and Writefull on our selected texts were able to eschew AI detection fairly consistently. To exacerbate this problem, traditional AI detectors like Turnitin and GPTZero struggle to reliably differentiate between legitimate paraphrasing and AI generation, undermining their utility for enforcing academic integrity. The study concludes by urging educators to focus on managing interactions with AI in academic settings rather than outright banning its use. It calls for the creation of policies and guidelines that acknowledge the evolving role of AI in writing, emphasizing the need to interpret detection scores cautiously to avoid penalizing students unfairly. 
In addition, encouraging openness on how AI is used in writing could alleviate concerns in the research and writing process for both students and academics. The paper recommends a shift toward teaching responsible AI usage rather than pursuing rigid bans or relying on detection metrics that may not accurately capture misconduct.
– Name: AbstractInfo
  Label: Abstractor
  Group: Ab
  Data: As Provided
– Name: DateEntry
  Label: Entry Date
  Group: Date
  Data: 2025
– Name: AN
  Label: Accession Number
  Group: ID
  Data: EJ1460975
PLink https://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=eric&AN=EJ1460975
RecordInfo BibRecord:
  BibEntity:
    Identifiers:
      – Type: doi
        Value: 10.1186/s41239-025-00505-5
    Languages:
      – Text: English
    PhysicalDescription:
      Pagination:
        PageCount: 25
    Subjects:
      – SubjectFull: Artificial Intelligence
        Type: general
      – SubjectFull: Writing (Composition)
        Type: general
      – SubjectFull: Quality Control
        Type: general
      – SubjectFull: Writing Evaluation
        Type: general
      – SubjectFull: Writing Improvement
        Type: general
      – SubjectFull: Writing Strategies
        Type: general
      – SubjectFull: Technology Uses in Education
        Type: general
      – SubjectFull: Readability
        Type: general
      – SubjectFull: Identification
        Type: general
    Titles:
      – TitleFull: 'Scarlet Cloak and the Forest Adventure': A Preliminary Study of the Impact of AI on Commonly Used Writing Tools
        Type: main
  BibRelationships:
    HasContributorRelationships:
      – PersonEntity:
          Name:
            NameFull: Barbara Bordalejo
      – PersonEntity:
          Name:
            NameFull: Davide Pafumi
      – PersonEntity:
          Name:
            NameFull: Frank Onuh
      – PersonEntity:
          Name:
            NameFull: A. K. M. Iftekhar Khalid
      – PersonEntity:
          Name:
            NameFull: Morgan Slayde Pearce
      – PersonEntity:
          Name:
            NameFull: Daniel Paul O'Donnell
    IsPartOfRelationships:
      – BibEntity:
          Dates:
            – D: 01
              M: 12
              Type: published
              Y: 2025
          Identifiers:
            – Type: issn-electronic
              Value: 2365-9440
          Numbering:
            – Type: volume
              Value: 22
          Titles:
            – TitleFull: International Journal of Educational Technology in Higher Education
              Type: main