Using Keystroke Dynamics to Detect Nonoriginal Text

Bibliographic Details
Title: Using Keystroke Dynamics to Detect Nonoriginal Text
Language: English
Authors: Paul Deane, Mo Zhang, Jiangang Hao, Chen Li
Source: Journal of Educational Measurement. 2026 63(1).
Availability: Wiley. Available from: John Wiley & Sons, Inc. 111 River Street, Hoboken, NJ 07030. Tel: 800-835-6770; e-mail: cs-journals@wiley.com; Web site: https://www.wiley.com/en-us
Peer Reviewed: Y
Page Count: 32
Publication Date: 2026
Document Type: Journal Articles; Reports - Research
Descriptors: Keyboarding (Data Entry), Word Processing, Writing (Composition), Natural Language Processing, Automation, Identification, Essays, Artificial Intelligence, Accuracy, Writing Evaluation, Plagiarism
DOI: 10.1111/jedm.12431
ISSN: 0022-0655; 1745-3984
Abstract: Keystroke analysis has often been used for security purposes, most often to authenticate users and identify impostors. This paper examines the use of keystroke analysis to distinguish between the behavior of writers who are composing an original text, vs. copying or otherwise reproducing a non-original text. Recent advances in text generation using large language models make the use of behavioral cues to identify plagiarism more pressing, since users seeking an advantage on a writing assessment may be able to submit unique AI-generated texts. We examine the use of keystroke log analysis to detect non-original text under three conditions: a laboratory study, where participants were either copying a known text or drafting an original essay, and two studies from operational assessments, where it was possible to identify essays that were non-original by reference to their content. Our results indicate that it is possible to achieve accuracies in excess of 94% under ideal conditions where the nature of each writing session is known in advance, and greater than 89% in operational conditions where proxies for non-original status, such as similarity to other submitted essays, must be used.
Abstractor: As Provided
Entry Date: 2026
Accession Number: EJ1501462
Database: ERIC
FullText Links:
  – Type: pdflink
    Url: https://content.ebscohost.com/cds/retrieve?content=AQICAHj0k_4E0hTGH8RJwT4gCJyBsGNe_WN95AvKlDbXJGqwxwHnfoWLqHuwawNk3RKWaM3NAAAA4jCB3wYJKoZIhvcNAQcGoIHRMIHOAgEAMIHIBgkqhkiG9w0BBwEwHgYJYIZIAWUDBAEuMBEEDOy_Qe4V_-KcYnw5-QIBEICBmuWlaV6UyLeNl7pZ1GNC2diMoiBWlI061FclSXNMR7lFVVLEUtazSzBb4e7vGhzyMOMdPjVl98yt3DlcaHBDjX-N1mLN34Y0GcywgvxUf9vhK5ezd5M4eBT5AQ2DVie-IDFs0ASoSFGI3zjdy1b2ZA4ob8vS59S55qFqSDFNv4Lrdc7ph4ZS00yuC9F16m15b-Zlc9NDCaUkNB8=
Text:
  Availability: 1
  Value: <anid>AN0192629998;mea01mar.26;2026Apr01.06:22;v2.2.500</anid> <title id="AN0192629998-1">Using Keystroke Dynamics to Detect Nonoriginal Text </title> <sbt id="AN0192629998-2">Introduction</sbt> <p>Keystroke analysis has often been used for security purposes, most often to authenticate users and identify impostors. This paper examines the use of keystroke analysis to distinguish between the behavior of writers who are composing an original text, vs. copying or otherwise reproducing a non‐original text. Recent advances in text generation using large language models make the use of behavioral cues to identify plagiarism more pressing, since users seeking an advantage on a writing assessment may be able to submit unique AI‐generated texts. We examine the use of keystroke log analysis to detect non‐original text under three conditions: a laboratory study, where participants were either copying a known text or drafting an original essay, and two studies from operational assessments, where it was possible to identify essays that were non‐original by reference to their content. Our results indicate that it is possible to achieve accuracies in excess of 94% under ideal conditions where the nature of each writing session is known in advance, and greater than 89% in operational conditions where proxies for non‐original status, such as similarity to other submitted essays, must be used.</p> <p>Keystroke dynamics have been used for some time to verify user identity, both for simple fixed tasks, such as verifying password entry (Ilonen, [<reflink idref="bib38" id="ref1">38</reflink>]; Bhana & Flowerday, [<reflink idref="bib11" id="ref2">11</reflink>]), and for more complex tasks, such as verifying whether two essays were written by the same or different people (Choi et al., [<reflink idref="bib18" id="ref3">18</reflink>]). 
Such keystroke identification models exploit stable aspects of people's writing and keyboarding habits, typing and composing patterns that do not change very much from one task to the next. However, many features of writing are dynamic and unstable, depending on the task, the context, and the writer's state of mind (Bixler & D'Mello, [<reflink idref="bib13" id="ref4">13</reflink>]; Deane et al., [<reflink idref="bib25" id="ref5">25</reflink>]; Van Steendam et al., [<reflink idref="bib57" id="ref6">57</reflink>]). For instance, Deane et al. ([<reflink idref="bib25" id="ref7">25</reflink>]) found that middle school students showed significantly different keystroke patterns when they were composing from scratch, when they were editing an existing text, or when they were copying it verbatim.</p> <p>Such behavioral differences may be valuable as a way to check on the writing processes of candidates completing an assessment. In particular, candidates in a high‐stakes context often seek to get higher scores by replicating a high‐quality text, written as a whole or in part by someone else, instead of composing an original text of their own (Dinneen, [<reflink idref="bib28" id="ref8">28</reflink>]; Lane, 2013). One method of detecting nonoriginal texts is to check for overlap, in whole or in part, with other submissions, or with texts on the web (Anson & Kruse, 2023; Burstein et al., [<reflink idref="bib16" id="ref9">16</reflink>]; Lochbaum et al., [<reflink idref="bib43" id="ref10">43</reflink>]). However, such methods can only detect a portion of the nonoriginal text that gets submitted by test candidates. If no one else submits the same essay, or if the source text cannot be identified on the web, or if the writer has included enough paraphrased or original material, this kind of cheating behavior may go undetected (Dawson, [<reflink idref="bib24" id="ref11">24</reflink>]). 
This is especially problematic in the case of AI‐generated essays, or essays that seek to "game" the scoring algorithm (Baldwin et al., [<reflink idref="bib7" id="ref12">7</reflink>]). If an AI‐generated essay is submitted on a high‐stakes assessment, it may be "unique," and thus undetectable by essay similarity detection (Barrett et al., [<reflink idref="bib9" id="ref13">9</reflink>]). While detectors for AI‐generated text exist, they may cease to work effectively when a text mixes human‐written and AI‐generated paragraphs (Hao & Fauss, [<reflink idref="bib35" id="ref14">35</reflink>]; Sadasivan et al., [<reflink idref="bib51" id="ref15">51</reflink>]; Yan et al., [<reflink idref="bib59" id="ref16">59</reflink>]). Thus, it is valuable to seek information from other modalities, such as the writing process, since certain writing behaviors may strongly suggest that writers are copying or paraphrasing text produced by someone else. Such a capability would go far toward filling existing gaps in test security for writing assessment.</p> <p>The basic premise for the studies presented here is that the cognitive processes underlying writing differ by task (Hayes, [<reflink idref="bib36" id="ref17">36</reflink>]). That is, different writing tasks require writing processes to be managed and coordinated differently. We thus expect keystroke logs to be different at different stages of the writing process (e.g., planning/outlining vs. drafting), for different genres (argument writing vs. narrative writing) and especially for fundamentally different writing tasks such as copy‐typing versus drafting an original text. Copy‐typing involves no content generation, no translation of ideas into words. It requires keyboarding skill and monitoring of typing accuracy but is likely to elicit relatively few editing behaviors. 
Drafting an original text, on the other hand, requires planning, content generation, and translation of ideas into words, phrases, and sentences, transcription/keyboarding, and monitoring the output for both content and typing accuracy (Galbraith & Baaijen, [<reflink idref="bib33" id="ref18">33</reflink>]).</p> <p>This paper presents the results of three empirical studies using extensive datasets to illustrate the high accuracy that can be achieved in distinguishing copy‐typing from draft writing through the analysis of keystroke process data. In the first study, we drew a direct comparison between the writing process of copy‐typing and draft writing. Each participant was tasked with submitting both a draft essay and a copied essay, the keystroke features of which served as the foundation for training classifier models to effectively differentiate between the two writing styles. In the subsequent two studies, we used operational data from high stakes assessments, where the best indication of copying is the presence of verbatim material reproduced in two or more essays. We used the amount of common material as a measure of the likelihood that two essays were derived from a common (and therefore nonoriginal) model. To be precise, we used trigram cosine similarity metrics, which measured the extent to which two essays deployed a common set of three‐word sequences. High cosine similarity between two essays on this measure implies the presence of large amounts of text that is identical, or only slightly modified, from some common source. We then built classifiers that attempted to predict which essays were high or low on the trigram content similarity measure, using features derived from the keystroke logs as predictors. 
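</p> <p>To make the trigram cosine similarity measure concrete, the following is a minimal Python sketch, not the software used in the studies. The function names, the lowercasing, the whitespace tokenization, and the use of raw trigram counts (rather than, for example, weighted counts) are all assumptions made for illustration.</p>

```python
from collections import Counter
from math import sqrt


def trigram_counts(text: str) -> Counter:
    """Count three-word sequences in a text.

    Assumption: words are lowercased and split on whitespace.
    """
    words = text.lower().split()
    return Counter(zip(words, words[1:], words[2:]))


def trigram_cosine(text_a: str, text_b: str) -> float:
    """Cosine similarity between the trigram count vectors of two texts."""
    a = trigram_counts(text_a)
    b = trigram_counts(text_b)
    # Dot product over trigrams in a; a missing key in a Counter counts as 0.
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        # A text with fewer than three words has no trigrams.
        return 0.0
    return dot / (norm_a * norm_b)
```

<p>Under these assumptions, "the cat sat on the mat" and "the cat sat on a rug" share two of their four trigrams, giving a similarity of 0.5, while two texts with no three‐word sequence in common score 0.</p> <p>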
In the following sections, we will introduce the details of these studies.[<reflink idref="bib1" id="ref19">1</reflink>]</p> <hd id="AN0192629998-3">Literature Review</hd> <p>Multiple studies indicate the feasibility of using keystroke log analysis to detect the production of nonoriginal text. Specifically:</p> <p>Deane et al. ([<reflink idref="bib25" id="ref20">25</reflink>]) collected data from 463 eighth‐grade students at an urban middle school in the American West. Students completed three tasks: (i) a copy‐typing task, in which they were given paper copies of the passage "Mark Twain's Huckleberry Finn" from the America's Story Web site produced by the Library of Congress and asked to retype it; (ii) an essay writing task, in which they were asked to write an essay on the topic "Should schools be allowed to sell junk food to their students?"; and (iii) an editing task, in which they were asked to revise and edit the essay they produced. Multiple features were collected, using a digital writing platform that captured keystroke logs, including (i) the frequency of different types of events (in‐word keystrokes, word initial keystrokes, between‐word and between‐sentence keystrokes, backspacing actions and jump‐to‐edit events); (ii) differences in the median log duration of different types of inter‐key intervals or IKIs (for in‐word, word‐initial, between‐word, between‐sentence, backspacing, and jump‐to‐edit events); (iii) the percentage of time spent in different types of events (in‐word IKIs, word‐initial IKIs, between‐word IKIs, between‐sentence IKIs, backspacing, jump‐to‐edit events, very long pauses, and document‐initial pauses); and (iv) various other features (such as length of jump events where cursor position changes, multiword deletions and word substitutions). 
They found that (a) initial pauses and long pauses took up a larger percentage of time in drafting than in copy‐typing; (b) more time was spent pausing between words and sentences in drafting than in copy‐typing; (c) bursts of text production were longer in drafting than in copy‐typing; (d) there were more editing actions and more time spent on editing actions in drafting than in copy‐typing.</p> <p>Conijn et al. ([<reflink idref="bib22" id="ref21">22</reflink>]) examined two datasets: (i) the Villani dataset (Tappert et al., [<reflink idref="bib55" id="ref22">55</reflink>]; Monaco et al., [<reflink idref="bib46" id="ref23">46</reflink>]), which collected responses from 142 participants (359 copy texts and 1,262 emails), and (ii) an academic writing dataset, which included responses from 128 participants, each of whom completed a copy task (transcribing an 850‐character fable) and a writing task (summarizing an academic article). They examined a range of features designed to measure students' writing processes, including total time, the number of words, keystrokes, and backspaces produced, typing efficiency, and the number, mean, and standard deviation of IKIs (summarized over all IKIs, word‐internal IKIs, between‐word IKIs, and between‐sentence IKIs). They reported results very similar to those reported by Deane et al. ([<reflink idref="bib25" id="ref24">25</reflink>]). For the Villani dataset, they found that more backspaces were used, the largest IKIs took longer, total time was larger, and typing efficiency was lower on the email task than on the copy‐typing task. For the academic writing dataset, they found that students spent more time on long pauses, took longer to write, showed greater variance in between‐word latency, and showed lower typing efficiency on the academic summary task, compared to the copy task.</p> <p>Trezise et al. 
([<reflink idref="bib56" id="ref25">56</reflink>]) had 62 university students (47 men and 15 women) complete writing tasks under three distinct conditions: free writing, general transcription, and self‐transcription. They summarized student writing processes using features designed to measure general writing processes, including the number and variability of words added or deleted per minute and the number, duration, and length of bursts of text production. They report results very similar to those reported in Deane et al. ([<reflink idref="bib25" id="ref26">25</reflink>]) and Conijn et al. ([<reflink idref="bib22" id="ref27">22</reflink>]): free writing (original composition) was characterized by greater variability in timing (including a higher incidence of long pauses and shorter bursts of text production) and higher incidence of errors and revision behaviors (associated with backspacing and deletion). Self‐transcription was faster and less prone to error than free text generation, whereas transcription of a nonoriginal text was characterized by slower typing with little revision and relatively brief pauses. Cluster analysis identified three clusters, and reliably sorted free writing into one cluster (58 of 61 cases assigned to Cluster 1) and transcription of nonoriginal text into a separate cluster (51 of 61 cases assigned to Cluster 3, with only one case assigned to Cluster 1), with self‐transcription being classified either to Cluster 2 or Cluster 3.</p> <p>Crossley et al. ([<reflink idref="bib23" id="ref28">23</reflink>]) initially collected authentic essays from a sample of 4,992 Amazon Mechanical Turk crowdsourcing workers, then selected a sample of 500 essays from that set, which they also had transcribed by (different) Amazon Turk workers to produce a final dataset of 500 matched pairs of authentic versus transcribed keystroke logs. They sampled a wide range of features, similar to those reported by Deane et al. ([<reflink idref="bib25" id="ref29">25</reflink>]), Conijn et al. 
([<reflink idref="bib22" id="ref30">22</reflink>]), and Trezise et al. ([<reflink idref="bib56" id="ref31">56</reflink>]), and then built machine learning classifiers to distinguish between authentic and transcribed essays, using 2/3 of the data for training and 1/3 for evaluation. All four machine learning models they tested (linear discriminant analysis, multilayer perceptron, random forest, and support vector machine) showed strong performance, with precision ranging between .953 and .987, and recall ranging between .966 and .993. They found that compared to the transcription context, writers in an authentic context produced (and deleted) more content, made more revisions, showed a greater variability in the number of keystrokes they produced, and spent more time pausing overall, and especially before sentences and words. Once again, these results are largely consistent with the findings of Deane et al. ([<reflink idref="bib25" id="ref32">25</reflink>]), and indicate that with paired, prelabeled data, the production of nonoriginal text can be detected with high levels of classification accuracy.</p> <p>Setting aside related, but relatively distinct work focused on identifying plagiarism in software writing (e.g., Schneider et al., [<reflink idref="bib52" id="ref33">52</reflink>]), there is one recent article that addresses the problem of detecting plagiarism from keystroke logs, but from a rather different perspective. Kundu et al. 
([<reflink idref="bib41" id="ref34">41</reflink>]) approach the problem of plagiarism‐detection, building on previous work that measured keystroke dynamics to enforce security in software systems.</p> <p>The earliest line of research in this area focused on keystroke dynamics for fixed texts, such as passwords (Tappert et al., [<reflink idref="bib55" id="ref35">55</reflink>]; Idrus et al., [<reflink idref="bib37" id="ref36">37</reflink>]; Parkinson et al., [<reflink idref="bib49" id="ref37">49</reflink>]; Chang et al., [<reflink idref="bib17" id="ref38">17</reflink>]). A second line of research, focusing on free, rather than fixed texts, developed out of this tradition, focusing on extracting data from multiple interactions with a system to do continuous authentication and impostor detection (Gunetti & Picardi, [<reflink idref="bib34" id="ref39">34</reflink>]; Alsultan & Warwick, [<reflink idref="bib4" id="ref40">4</reflink>]; Ayotte et al., [<reflink idref="bib6" id="ref41">6</reflink>]). These approaches, which could be highly accurate when used to distinguish authentic users from impostors, focused on typing patterns, rather than general writing process patterns, and typically used features based on specific key presses or key press sequences. 
In addition to detecting impostors, similar methods have been extended to recognize cheating on tests (Agarwal et al., [<reflink idref="bib3" id="ref42">3</reflink>]; Fenu et al., [<reflink idref="bib31" id="ref43">31</reflink>]; Flior & Kowalski, [<reflink idref="bib32" id="ref44">32</reflink>]; Kochegurova & Zateev, [<reflink idref="bib40" id="ref45">40</reflink>]; Mungai & Huang, [<reflink idref="bib48" id="ref46">48</reflink>]) and to infer biometric characteristics, such as age or gender (Buker et al., [<reflink idref="bib15" id="ref47">15</reflink>]; Brizan et al., [<reflink idref="bib14" id="ref48">14</reflink>]; Pentel, [<reflink idref="bib50" id="ref49">50</reflink>]), or even to detect emotional states or deceptive behaviors (Epp et al., [<reflink idref="bib29" id="ref50">29</reflink>]; Monaro, et al., [<reflink idref="bib47" id="ref51">47</reflink>]; Banerjee et al., [<reflink idref="bib8" id="ref52">8</reflink>]). More recently, researchers in this tradition have begun to use deep learning methods, with generally good results (Sun et al., [<reflink idref="bib53" id="ref53">53</reflink>]; Bernardi et al., [<reflink idref="bib10" id="ref54">10</reflink>]; Lu et al., [<reflink idref="bib44" id="ref55">44</reflink>]).</p> <p>Kundu et al. ([<reflink idref="bib41" id="ref56">41</reflink>]) build on a specific deep learning keystroke dynamics model, Typenet (Acien et al., [<reflink idref="bib1" id="ref57">1</reflink>]; Acien et al., [<reflink idref="bib2" id="ref58">2</reflink>]) that achieved high levels of accuracy on an impostor recognition task (an Equal Error Rate, or EER, of 2.2% for physical keyboards and 9.2% for touchscreen keyboards). Typenet is a type of Recurrent Neural Network (RNN), specifically a Long Short‐Term Memory (LSTM) model, designed to predict keycodes, hold, and release times for keystroke sequences. 
In their study, Kundu and colleagues built a Typenet model using background data from two large datasets (the SBU dataset from Banerjee et al., 2014, with 196 participants, and the Buffalo dataset from Sun et al., 2016, with 148 participants), which they combined with an original (Proposed) dataset collected from 40 participants, who wrote essays with and without access to the internet and assistive AI tools. All three datasets included both freely written and copy‐typed texts for the same users. This design enabled Kundu and colleagues to contrast accuracy under different conditions: with different keyboards, under conditions of high or low cognitive load, and with or without AI assistance. Depending on the exact condition, accuracy ranged between .52 and .85, with the worst performances occurring when the classifier was required to generalize across the SBU, Buffalo, and Proposed datasets, and the best performance when the classifier was trained and tested on the same keyboard, or on the same dataset. These results suggest considerable variability in the keystroke patterns that distinguish original from plagiarized text across tasks and populations, but since the LSTM model Kundu and colleagues used is not easily interpretable, it is difficult to characterize how typing behaviors varied across tasks and contexts.[<reflink idref="bib2" id="ref59">2</reflink>]</p> <p>These results indicate that it should be feasible to create classifiers that use keystroke log analysis to identify plagiarized text on direct writing assessments. Since we wish to understand how models differ across assessments, we use a classical NLP model with explicit features derived from earlier keystroke analysis models, consistent with the analyses in Deane et al. ([<reflink idref="bib25" id="ref60">25</reflink>]), Deane et al. ([<reflink idref="bib26" id="ref61">26</reflink>]), Choi et al. 
([<reflink idref="bib18" id="ref62">18</reflink>]) and Choi & Deane ([<reflink idref="bib18" id="ref63">18</reflink>]). As mentioned in the introduction, the rest of this paper is divided into three studies, which examine the performance of keystroke classification models at detecting nonoriginal text under varying conditions and for different assessment programs. Study 1 reanalyzes the data from Deane et al. ([<reflink idref="bib25" id="ref64">25</reflink>]), building machine classification models to distinguish original drafting from copy‐typing. Studies 2 and 3 examine data from two different large‐scale testing programs, using essays detected using essay similarity detection software (and confirmed by test security professionals) as a proxy for nonoriginal text in the absence of explicit labeled data.</p> <hd id="AN0192629998-4">Study 1</hd> <p>Study 1 is an examination of how well keystroke‐based machine learning classifiers can perform under relatively straightforward circumstances, when we know which texts were copied and which were created during the assessment. Under the experimental conditions described below, we expected the contrasts between copy‐typing versus drafting to be stark, since the set‐up maximized the differences between the two conditions. To begin with, students' performances involved either 100% copy‐typing or 100% original drafting. In a typical cheating scenario, candidates may use shell language, memorized text, or copy from sources for part of the response, but may also introduce some original language of their own. In addition, we had all students copy‐type the same text, as opposed to having them copy a selection of different texts from different sources. Moreover, all students wrote essays on the same topic, and the copy‐typing and essay writing tasks were on different topics. 
These conditions maximized potential sources of difference in writing processes, supporting a demonstration of how well keystroke behavior patterns could distinguish what kind of writing task someone is performing, based solely on behavioral differences. However, it is important to note that the models we built were designed to discriminate between the same person's behavior under two testing conditions, which means that the model necessarily valued features that were characteristic of copying or drafting behaviors for most of the students in the sample.</p> <hd id="AN0192629998-5">Methodology</hd> <p></p> <hd id="AN0192629998-6">Participants</hd> <p>Data were collected from 8th grade students in an urban middle school in a state in the American West. This was the dataset previously analyzed in Deane et al. ([<reflink idref="bib25" id="ref65">25</reflink>]). This school had a population of students who identify primarily as being from minority groups and from households with low socioeconomic status:</p> <p></p> <ulist> <item> 72.1% free and reduced‐price lunch</item> <p></p> <item> 20% limited English proficiency</item> <p></p> <item> 60.3% Hispanic, 27.7% Black, 4.8% White, 2.6% Asian</item> </ulist> <p>The set of students for whom we had at least one valid log from the drafting session was just over two hundred (<emph>N</emph> = 201). Due to errors during the testing session, we had somewhat fewer logs from the copy‐typing session (<emph>N</emph> = 174).</p> <hd id="AN0192629998-7">Instrument</hd> <p>Students completed two tasks relevant to this analysis: a copy‐typing task and an essay drafting task. Both were administered digitally in classrooms with teacher proctoring. Students were expected only to use the materials provided to them. In the retyping task, students were asked to retype the article "Mark Twain's Huckleberry Finn," a public domain text that was provided to them in printed form. 
In the essay drafting task, students were asked to write an essay using an ETS CBAL® formative assessment module on the topic of junk food being sold in schools. Prior to drafting the essay, students read essays on the topic and completed an initial plan. Both the essays and the plans were available to them while they were drafting the essay. Both tasks were administered online using a research version of ETS' Criterion digital writing tool (ETS, [<reflink idref="bib30" id="ref66">30</reflink>].). Thirty minutes were allotted for each task. For each student, keystroke logs were collected for both tasks.</p> <hd id="AN0192629998-8">Data capture and analysis</hd> <p>A script written in JavaScript was attached to the web page that delivered the research version of Criterion, triggering an analysis of the text buffer every time the content changed. Among other things, this analysis performed a diff on the prior and current version of the buffer, and recorded the results, along with timestamps and total time elapsed. Figure 1 shows the kind of data that was recorded for each keystroke:</p> <p> <img src="https://imageserver.ebscohost.com/img/embimages/rdk/MEA/01mar26/jedm12431-fig-0001.jpg?ephost1=dGJyMNXb4kSepq84yOvqOLCmsE6epq5Srqa4SK6WxWXS" alt="jedm12431-fig-0001.jpg" title="1 Information captured in a keystroke log." /> </p> <p></p> <p>The resulting logs were subjected to further analysis to extract a broad range of keystroke features. Specifically, they were processed using the keystroke pipeline described in Deane and Zhang (2015), excluding features that might be content‐sensitive, such as features related to the overall length of the response, or targeting specific keystroke sequences. After data cleaning, this pipeline extracted 141 features describing various aspects of the writing process. 
These features included measures of:</p> <p></p> <ulist> <item> The total time, quantity, latency, and speed of various typing events, including events that take place in different linguistic contexts (within word, between words, between sentences, and between paragraphs). Total time was measured in seconds, latency in log milliseconds, and speed in characters per second.</item> <p></p> <item> The total time, quantity, latency, and speed of editing events (cut and paste events, backspacing events, and jump edits, where changes are made at different locations in the text being produced). Total time was measured in seconds, latency in log milliseconds, and speed in characters per second.</item> <p></p> <item> The extent of word‐level edits, defined as changes to partially typed words before the writer moved on to another word. These were measured in percent of total events.</item> <p></p> <item> Typing fluency, defined in terms of the number and length of bursts of text production (where bursts are defined as keystroke sequences delimited by long pauses). For this study, bursts were defined as sequences of keystroke events, where none of the consecutive events in the sequence took more than 2/3 of a second.</item> </ulist> <p>Multiple versions of specific features were produced by varying feature parameters. For instance, different statistical summaries were applied to each base feature definition, including raw count, mean and standard deviation, median and range. Roughly speaking, these features reflected different aspects of the writing process. 
The tempo of typing events within words was likely to reflect pure typing processes; pauses between words, sentences, and paragraphs were increasingly likely to reflect pauses associated with planning and idea generation processes; backspacing, cut, paste, and jump edit events were likely to reflect some form of metacognitive monitoring of the writing process, leading to revisions, edits, and corrections to low‐level errors.</p> <hd id="AN0192629998-10">Data cleaning</hd> <p>The primary goal of the data cleaning step was to make it possible to compare copy‐typing and drafting behaviors within‐student. We therefore selected participants whose keystroke log data was available to us for both the Retyping and the Drafting tasks. Of the 371 students initially selected, only a portion (182 students) had keystroke data for both writing tasks. We also removed a few cases where there was a large amount of missing data or the total time exceeded total test time (35 minutes), possibly reflecting an administration error where students went back to the task after the session was complete. As a consequence, 174 of the copy‐typing logs were kept for further analysis (although somewhat more students, <emph>N</emph> = 194, had valid logs from the drafting task). Features with too many missing values were also excluded, leading to a final dataset containing 82 keystroke features.[<reflink idref="bib3" id="ref67">3</reflink>] The median was imputed for any remaining missing values. These features were based upon an early version of the ETS keystroke data collection system, distinct from the one used in Studies 2 and 3. Studies 2 and 3 used a later iteration of the feature extraction system, which was developed for operational use and included a much larger selection of features to support the development of machine learning models.</p> <hd id="AN0192629998-11">Analysis</hd> <p>Logs from copy‐typing sessions were labeled "0" (as consisting of nonoriginal text). 
Logs from drafting sessions were labeled "1" (as consisting of original, candidate‐produced text). We trained and fine‐tuned seven commonly used machine learning methods (Mahesh, [<reflink idref="bib45" id="ref68">45</reflink>]), including Random Forest, Gradient Boosting, Neural Network, Logistic Regression, K Nearest Neighbors, Naive Bayes, and Linear Support Vector Machine. Given the relatively small size of the dataset, the seven models were trained and fine‐tuned using fourfold cross‐validation on the training set. A grid search algorithm was used for model tuning: we systematically evaluated every combination of hyperparameter values and, for each machine learning model, selected the combination that maximized the average cross‐validation score. Performance on the test set was evaluated in terms of overall classification accuracy, AUC (area under the curve), precision, recall, and <emph>F</emph>1 score. We also examined feature importance (calculated using the Gini index) on the two best‐performing models and also on the logistic regression model, which provided a simple linear model of feature contributions to prediction.</p> <hd id="AN0192629998-12">Results</hd> <p>Gradient boosting returned the strongest model, with a classification accuracy of .951, an area under the curve (AUC) of .990, an <emph>F</emph>1 score of .944, precision of .968, and recall of .947. The Random Forest model came in second, with a classification accuracy of .948, an AUC of .987, an <emph>F</emph>1 score of .942, a precision of .959, and a recall of .921. The Neural Network, Support Vector Machine, and Logistic Regression models fell in the middle, with classification accuracy between .93 and .94, AUC between .94 and .96, <emph>F</emph>1 scores between .91 and .92, precision between .93 and .96, and recall between .88 and .921. 
The <emph>K</emph> Nearest Neighbors and Naïve Bayes models had the worst performances, with classification accuracies of .910 and .896, AUCs of .943 and .938, <emph>F</emph>1 scores of .892 and .877, precision of .949 and .926, and recall of .842 and .836.[<reflink idref="bib4" id="ref69">4</reflink>]</p> <p>The 25 most important features for the highest‐performing Gradient Boosting model are shown in Table 2.[<reflink idref="bib5" id="ref70">5</reflink>] Three features dominated this model: the proportion of characters in multiword insertions (feature importance .3855), the number of characters inserted in multiword insertions (feature importance .3273), and the standard deviation of time spent in bursts of fast typing, where bursts are delimited by pauses greater than 2/3 of a second (feature importance .1576). Figure 2 gives a sense of how some of these features interact.</p> <p>1 Table Classification Results for the Experimental Copy‐Typing Dataset</p> <p> <ephtml> <table><thead><tr valign="bottom"><th /><th>Classification Accuracy</th><th>AUC</th><th><italic>F</italic>1</th><th>Precision</th><th>Recall</th></tr></thead><tbody><tr><td>Gradient Boosting</td><td>.951</td><td>.990</td><td>.944</td><td>.968</td><td>.947</td></tr><tr><td>Random Forest</td><td>.948</td><td>.987</td><td>.942</td><td>.959</td><td>.921</td></tr><tr><td>Neural Network</td><td>.942</td><td>.964</td><td>.910</td><td>.933</td><td>.921</td></tr><tr><td>SVM</td><td>.931</td><td>.971</td><td>.916</td><td>.941</td><td>.888</td></tr><tr><td>Logistic Regression</td><td>.928</td><td>.954</td><td>.914</td><td>.952</td><td>.882</td></tr><tr><td>kNN</td><td>.910</td><td>.943</td><td>.892</td><td>.949</td><td>.842</td></tr><tr><td>Naive Bayes</td><td>.896</td><td>.938</td><td>.877</td><td>.926</td><td>.836</td></tr></tbody></table> </ephtml> </p> <p>2 Table The 25 Most Important Features in the Study 1 Gradient Boosting Model</p> <p> <ephtml> <table><thead><tr valign="bottom"><th>Feature Name</th><th align="center">GB Feature 
Importances</th></tr></thead><tbody><tr><td>Percentage of characters in multiword insertions</td><td>.3855</td></tr><tr><td>No. characters inserted in multiword insertions</td><td>.3273</td></tr><tr><td>SD of time spent in bursts of fast typing</td><td>.1576</td></tr><tr><td>Proportion time spent pausing before sentence‐internal punctuation marks</td><td>.0495</td></tr><tr><td>Proportion time spent pausing between phrasal bursts (bursts delimited by very long [>2‐second] pauses)</td><td>.0138</td></tr><tr><td>Mean log latency between characters in a word</td><td>.0060</td></tr><tr><td>Proportion time spent pausing before the first character in a word</td><td>.0050</td></tr><tr><td>Median log latency before an end‐sentence punctuation mark</td><td>.0040</td></tr><tr><td>Proportion time spent pausing before the first character in a sentence</td><td>.0027</td></tr><tr><td>SD of log latency before a single backspace event</td><td>.0025</td></tr><tr><td>Proportion of keystrokes that are end sentence punctuation marks</td><td>.0025</td></tr><tr><td>Proportion of events that are multiword deletions</td><td>.0021</td></tr><tr><td>Proportion of characters inserted not deleted</td><td>.0021</td></tr><tr><td>Median of longest pauses in a sequence of one or more line or paragraph breaks</td><td>.0018</td></tr><tr><td>Total time spent at the start of a phrasal burst (bursts delimited by very long [>2‐second] pauses)</td><td>.0018</td></tr><tr><td>Proportion of deleted characters</td><td>.0018</td></tr><tr><td>Proportion of time spent in the longest pause in any given word</td><td>.0018</td></tr><tr><td>Median log latency before a space character between words</td><td>.0016</td></tr><tr><td>Proportion of time spent deleting multiword sequences</td><td>.0016</td></tr><tr><td>SD of log latency between characters within a word</td><td>.0015</td></tr><tr><td>Mean log length of bursts of fast typing (in characters)</td><td>.0014</td></tr><tr><td>Total time spent deleting multiword 
sequences</td><td>.0013</td></tr><tr><td>Mean log latency between sentences</td><td>.0012</td></tr><tr><td>Proportion of words edited</td><td>.0012</td></tr><tr><td>Proportion end sentence punctuation</td><td>.00184</td></tr><tr><td>SD log latency for within‐word keystrokes</td><td>.00154</td></tr><tr><td>Total pause time before a jump event</td><td>.00140</td></tr><tr><td>Proportion of time spent pausing at the end of a sentence</td><td>.00131</td></tr><tr><td>Total time spent pausing before multiword insertions</td><td>.00126</td></tr><tr><td>Proportion of uncorrected spelling errors detected</td><td>.000832</td></tr><tr><td>Mean log latency before a cut/paste/jump event</td><td>.000624</td></tr><tr><td>Proportion of time spent pausing before a multiword deletion</td><td>.000589</td></tr><tr><td>SD of length of phrasal bursts</td><td>.000462</td></tr><tr><td>Median of the longest pause spent before an editing action in any word</td><td>.000461</td></tr></tbody></table> </ephtml> </p> <p> <img src="https://imageserver.ebscohost.com/img/embimages/rdk/MEA/01mar26/jedm12431-fig-0002.jpg?ephost1=dGJyMNXb4kSepq84yOvqOLCmsE6epq5Srqa4SK6WxWXS" alt="jedm12431-fig-0002.jpg" title="2 Scatterplot: variability of burst time versus proportion chars in multiword insertions for Study 1." /> </p> <p></p> <p>As these figures reveal, many of the original texts were distinguished by the presence of multiword insertions, averaging between 1% and 10% of the total number of characters in the text, reflecting relatively long chunks of text copied and pasted in a different location in the text buffer as a result of editing operations. Original texts were also distinguished by greater variability in the amount of time spent in any given burst of fast typing, whereas copy‐typing tended to display more evenly timed bursts. 
Similarly, Figure 3 shows one of the trees constructed by the Gradient Boosting algorithm in Fold 4, which involved a first split based on the proportion of characters in multiword insertions, and secondary splits based on the variability of burst times and the proportion of multiword insertion events relative to all typing events.[<reflink idref="bib6" id="ref71">6</reflink>]</p> <p> <img src="https://imageserver.ebscohost.com/img/embimages/rdk/MEA/01mar26/jedm12431-fig-0003.jpg?ephost1=dGJyMNXb4kSepq84yOvqOLCmsE6epq5Srqa4SK6WxWXS" alt="jedm12431-fig-0003.jpg" title="3 Decision tree (Study 1, Fold 4)." /> </p> <p></p> <p>The 25 most important features for the second‐highest performing Random Forest model are shown in Table 3. A broader range of features has high importance, but the highly valued features come in groups, representing different ways of summarizing the same behaviors. Most of the highest‐ranked features in Table 3 are measures of multiword insertion and deletion: the number and proportion of characters inserted in multiword insertions, the number and relative proportion of multiword insertion events, the proportion of characters inserted or deleted, etc. Most of the remaining features reflect variability in pause patterns: the standard deviation of log time spent in bursts of fast typing, the standard deviation of log latency between sentences, the standard deviation of the length of bursts of fast typing in characters, the proportion of time spent in long pauses between and within words, etc. Other features in Table 3 have to do with other kinds of editing events: jump edits, word replacements, etc. 
The Random Forest and Gradient Boosting models therefore tell similar stories: original texts show greater variability of pause patterns and a greater prevalence of editing and especially revision behaviors.</p> <p>3 Table Feature Importance for the Random Forest Model</p> <p> <ephtml> <table><thead><tr valign="bottom"><th>Feature Name</th><th align="center">RF Feature Importances</th></tr></thead><tbody><tr><td>No. characters inserted in multiword insertions</td><td>.1143</td></tr><tr><td>Percentage of characters in multiword insertions</td><td>.1037</td></tr><tr><td>Proportion of multiword insertion events</td><td>.0640</td></tr><tr><td>Proportion of characters inserted rather than deleted</td><td>.0624</td></tr><tr><td>SD of log time spent in bursts of fast typing</td><td>.0584</td></tr><tr><td>Proportion of characters deleted rather than inserted</td><td>.0401</td></tr><tr><td>SD of log latency between sentences</td><td>.0347</td></tr><tr><td>SD of length of bursts of fast typing (in characters)</td><td>.0325</td></tr><tr><td>Total time spent in multiword deletions</td><td>.0316</td></tr><tr><td>Proportion of multiword deletion events</td><td>.0295</td></tr><tr><td>Proportion of time spent in the longest word‐internal pause in any word</td><td>.0290</td></tr><tr><td>Mean jump edit length relative to length of document</td><td>.0266</td></tr><tr><td>Proportion of time spent in multiword deletions</td><td>.0252</td></tr><tr><td>SD of log latency before cut/paste/jump events</td><td>.0204</td></tr><tr><td>Proportion of time spent pausing at the end of a word</td><td>.0197</td></tr><tr><td>Mean log latency between words</td><td>.0181</td></tr><tr><td>Proportion of time spent pausing at the start of a sentence</td><td>.0173</td></tr><tr><td>Proportion of time spent pausing before in‐sentence punctuation marks</td><td>.0165</td></tr><tr><td>Proportion of events replacing text with new content</td><td>.0160</td></tr><tr><td>Proportion of events involving choice of a 
different word</td><td>.0155</td></tr><tr><td>Average length of jump edit events in characters</td><td>.0153</td></tr><tr><td>Mean log length of bursts of fast typing (in characters)</td><td>.0145</td></tr><tr><td>Proportion of time spent at boundary of phrasal burst (typing sequence delimited by pause > 2 seconds)</td><td>.0143</td></tr><tr><td>Median time spent pausing at the start of a sentence</td><td>.0110</td></tr><tr><td>Mean log time spent in a burst of fast typing</td><td>.1045</td></tr></tbody></table> </ephtml> </p> <hd id="AN0192629998-15">Discussion for Study 1</hd> <p>This dataset represents in many ways a "best case" for distinguishing drafting from copy‐typing, since we know what students were doing in each session, and we know that each session consisted 100% of either drafting or copy‐typing behavior. The results we obtained (a classification accuracy of .951 and an area under the curve statistic greater than .99) can reasonably be viewed as demonstrating a high upper bound for the classification task. They demonstrate that people do, in fact, write in very different ways when they are copy‐typing versus producing original text. The models also tell a consistent story about the difference between copy‐typing and drafting: Copy‐typing tends to happen at a steady pace, and since the content is fixed in advance, there is little evidence of editing or revision. Drafting, on the other hand, requires the writer to engage in content generation and sentence planning, which create greater variability in tempo, and may lead the writer to produce content that they later reject or modify, leading to greater incidence of characteristic editing and revision behaviors, such as the use of cut and paste (or drag and drop) to rearrange text content.</p> <hd id="AN0192629998-16">Study 2</hd> <p></p> <hd id="AN0192629998-17">Methodology</hd> <p>All data for Study 2 was drawn from operational submissions to a large‐scale, high‐stakes assessment. 
As part of this test, candidates wrote two essays on different prompts. Keystroke logs and the final submitted essays were captured by the assessment delivery platform. We constructed the dataset used in this study from a corpus of 91,181 candidate responses.</p> <hd id="AN0192629998-18">Materials</hd> <p>The assessment was administered between March 31, 2021 and December 2, 2021 in an online, secure system, with administrations split between ETS‐administered testing centers with physical proctoring (45.6%), and at‐home testing with online proctoring (54.4%). The delivery platform only displayed the assessment and did not allow the users to bring up other applications. The test contained multiple sections, some multiple choice, others (specifically, the writing section), allowing typed responses. The total test time allowed was three hours and forty‐five minutes.</p> <p>On this assessment, each candidate wrote two essays. They had 30 minutes to complete each essay. One of the essay prompts required the candidate to produce a persuasive essay on a general topic (one that they could reasonably be expected to build an argument about from general background knowledge.) This task provided no source material that students could copy, nor were they allowed to look up source material while writing the essay. There were several dozen distinct prompts for this task in our sample. The second essay was source‐based and was excluded from the study.</p> <hd id="AN0192629998-19">Data Selection: Essay Similarity Detection</hd> <p>All essays written to the nonsource‐dependent writing prompt were processed using AutoESD, a tool developed by ETS researchers (Choi et al., [<reflink idref="bib19" id="ref72">19</reflink>]). At the core of AutoESD, pair‐wise essays are compared for similarity based on trigram cosine similarity. If two essays show excessive similarity, it indicates that the authors of the essays may have copied from a common essay template. 
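</p> <p>As a rough illustration of the comparison just described, the following sketch computes a trigram cosine similarity between essay pairs and, for each essay, its maximum similarity to any other essay. It is not the AutoESD implementation: the use of word‐level trigrams, lowercased whitespace tokenization, and raw counts are assumptions made for the example, and the function names are hypothetical. The labeling thresholds (.30 and .05) are the ones reported in this study.</p>

```python
from collections import Counter
from math import sqrt


def word_trigrams(text):
    """Count word-level trigrams (assumed tokenization: lowercase, whitespace split)."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:], tokens[2:]))


def trigram_cosine(text_a, text_b):
    """Cosine similarity between the trigram count vectors of two texts."""
    counts_a, counts_b = word_trigrams(text_a), word_trigrams(text_b)
    dot = sum(counts_a[g] * counts_b[g] for g in counts_a.keys() & counts_b.keys())
    norm_a = sqrt(sum(v * v for v in counts_a.values()))
    norm_b = sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # texts with fewer than three tokens have no trigrams
    return dot / (norm_a * norm_b)


def max_similarity_to_others(essays):
    """For each essay, the highest similarity to any other essay in the list."""
    scores = []
    for i, essay in enumerate(essays):
        others = (trigram_cosine(essay, other)
                  for j, other in enumerate(essays) if j != i)
        scores.append(max(others, default=0.0))
    return scores


def label_by_similarity(max_score, high=0.30, low=0.05):
    """Apply the study's thresholds; essays between them are left unlabeled."""
    if max_score > high:
        return "nonoriginal"
    if max_score < low:
        return "original"
    return None
```

<p>The pairwise loop is quadratic in the number of essays, so a corpus of the size used here would presumably need indexing or blocking that this sketch does not attempt.</p> <p>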
Based on human expert reviews, essay pairs with cosine similarity between .10 and .30 generally displayed some overlapping content, but not enough to conclude that the writer was producing almost entirely nonoriginal text. Essay pairs with cosine similarity above .30 contained large chunks of overlapping, near‐word‐for‐word text, sufficient to recommend that the essays in question be reviewed to consider score cancellation. In the source‐based writing task, an essay that merely quoted the prompt would also be considered nonoriginal text and therefore not scoreable as the candidate's own response.</p> <p>The distribution of trigram cosine similarity metrics for the entire corpus of 91,181 candidate essays is shown in Figure 4.</p> <p> <img src="https://imageserver.ebscohost.com/img/embimages/rdk/MEA/01mar26/jedm12431-fig-0004.jpg?ephost1=dGJyMNXb4kSepq84yOvqOLCmsE6epq5Srqa4SK6WxWXS" alt="jedm12431-fig-0004.jpg" title="4 Histogram of frequencies by maximum trigram cosine similarity metric to another essay in the same time period for Study 2." /> </p> <p></p> <p>As this figure illustrates, a very small percentage of the essays had trigram cosine similarities greater than .20, and an order of magnitude fewer had trigram cosine similarities greater than .30. In other words, clear cases of nonoriginal text were a very small portion of the total corpus.</p> <p>We first extracted all the responses from the set that had a trigram cosine similarity greater than .30 to at least one other essay in the set of candidate responses. These were classified as "nonoriginal texts." We then randomly sampled essays written to the same prompts with a cosine similarity less than .05 to any other submission. These were classified as "original texts." Filtering the data in this way yielded a total of 260 essays with a cosine similarity greater than .30 to at least one other essay.[<reflink idref="bib7" id="ref73">7</reflink>]
To obtain a sufficiently large sample, we added 1,210 essays that had been operationally flagged as being above the ESD similarity threshold of .30 on previous administrations. This yielded a total positive sample of 1,470 essays. We matched these essays with a random selection of 1,470 essays whose maximum cosine similarity to any other essay was less than .05. This yielded a balanced set of 2,940 essays, half of which had high trigram cosine similarity to at least one other essay, and therefore contained large chunks of nonoriginal text.</p> <hd id="AN0192629998-21">Data Processing and Cleaning: Keystroke Feature Analysis</hd> <p>The keystroke logs for the 2,940 essays in our dataset were put through the keystroke pipeline described in Choi et al. ([<reflink idref="bib18" id="ref74">18</reflink>]). This pipeline extracted a larger set of features than the earlier pipeline described in Deane and Zhang (2015). This feature set included features sensitive to different definitions of "bursts" of text production (created by choosing different pause lengths to define the end of the burst, and by distinguishing bursts that involved only append actions from bursts that included both append and delete actions). It also included features sensitive to additional kinds of edit actions (such as word substitution, created by backspacing over a partly created word and typing a different word instead). And it included features sensitive to spelling status, for example, features designed to recognize typo‐correction actions. 
However, similar basic metrics were extracted: total numbers and percents for specific events, and event summaries that measured total time in seconds, latency in log milliseconds, speed in characters per second, and burst length in characters.</p> <p>Before analysis, the data was cleaned by taking the following steps to handle issues related to features with large amounts of missing data:</p> <p></p> <ulist> <item> Features with zero variance were removed.</item> <p></p> <item> Features with less than 60% valid values were removed.</item> <p></p> <item> Missing values were imputed for features having at least 60% valid cases. Specifically, we imputed the median.</item> </ulist> <p>After the data cleaning step, the dataset contained 546 keystroke features.</p> <hd id="AN0192629998-22">Final Data Preparation and Analysis</hd> <p>Essays with a cosine similarity to some other essay greater than or equal to .30 were labeled "1" (as consisting primarily of nonoriginal text). Essays with no cosine similarity to any other essay greater than .05 were labeled "0" (as consisting of original, candidate‐produced text). As in Study 1, we trained and evaluated the performance of seven popular machine learning classifiers. The models were trained and fine‐tuned using fourfold cross‐validation on the training set. A grid search was used for model tuning: we systematically evaluated every combination of hyperparameter values and, for each model, selected the combination that maximized average cross‐validation performance. We also examined feature importance (calculated using the Gini index), not only for the two best‐performing models but also for the logistic regression model, which provided a simple linear model.</p> <hd id="AN0192629998-23">Results</hd> <p>Table 4 presents the results of predictive modeling. 
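</p> <p>For readers less familiar with the metrics reported in Table 4, the following sketch shows how accuracy, precision, recall, and <emph>F</emph>1 are conventionally computed from binary labels. It assumes the standard definitions with the nonoriginal class (label "1") treated as positive; the function name is hypothetical, and the sketch is not taken from the authors' code. AUC is omitted because it requires predicted scores rather than hard labels.</p>

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for binary labels.

    Assumption: `positive` marks the class of interest (here, nonoriginal text).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

<p>Because cross‐validated results are typically averaged over folds, the <emph>F</emph>1 values in a results table need not equal the harmonic mean of the tabled precision and recall exactly.</p> <p>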
The Gradient Boosting classifier had the strongest performance, with an AUC of .961, an overall accuracy of .904, an <emph>F</emph>1 of .899, a precision of .916, and a recall of .886. The Random Forest model came next, with an AUC of .930, an overall accuracy of .850, an <emph>F</emph>1 score of .848, a precision of .858, and a recall of .838. The Logistic Regression model came next, with an AUC of .900, an overall accuracy of .844, an <emph>F</emph>1 score of .848, a precision of .832, and a recall of .863. The Naïve Bayes model came next, with an AUC of .845, an accuracy of .802, an <emph>F</emph>1 score of .812, a precision of .775, and a recall of .852. The kNN model came next, with an AUC of .851, a classification accuracy of .778, an <emph>F</emph>1 score of .787, a precision of .760, and a recall of .818. The Neural Network model performed worst, with an AUC of .841, a classification accuracy of .771, an <emph>F</emph>1 score of .724, a precision of .827, and a recall of .782.</p> <p>4 Table Study 2: Performance of Classifier Models Predicting from Keystroke Data Whether a Submitted Essay Has Large Amounts of Overlapping Content with Another Essay</p> <p> <ephtml> <table><thead><tr valign="bottom"><th /><th>Classification Accuracy</th><th>AUC</th><th><italic>F</italic>1</th><th>Precision</th><th>Recall</th></tr></thead><tbody><tr><td>Gradient Boosting</td><td>.904</td><td>.961</td><td>.899</td><td>.916</td><td>.886</td></tr><tr><td>Random Forest</td><td>.850</td><td>.930</td><td>.848</td><td>.858</td><td>.838</td></tr><tr><td>Logistic Regression</td><td>.844</td><td>.900</td><td>.848</td><td>.832</td><td>.863</td></tr><tr><td>Naïve Bayes</td><td>.802</td><td>.845</td><td>.812</td><td>.775</td><td>.852</td></tr><tr><td>kNN</td><td>.778</td><td>.851</td><td>.787</td><td>.760</td><td>.818</td></tr><tr><td>Neural Network</td><td>.771</td><td>.841</td><td>.724</td><td>.827</td><td>.782</td></tr></tbody></table> </ephtml> </p> <p>Table 5 shows the top 25 predictive features for the highest‐performing Gradient Boosting model. 
The most important feature was the number of word substitutions, with an importance of .43. The next most important features were the range of the log latency before a cut event, or multiple character deletion, with a feature importance of .0914, and the number of runs of consecutive backspacing that cross word boundaries, with a feature importance of .0842. Feature importance diminishes rapidly beyond this point. However, most of the features in Table 5 address backspacing, cut events, word substitutions, time spent at the end of the text buffer (rather than earlier, editing previously written words), and variability in burst lengths; these are, for the most part, features that are associated, either directly or indirectly, with time spent pausing to decide how to edit or revise text content.</p> <p>5 Table The 25 Most Important Predictors for the Study 2 Gradient Boosting Model</p> <p> <ephtml> <table><thead><tr valign="bottom"><th>Feature Name</th><th align="center">Feature Importances</th></tr></thead><tbody><tr><td>Number of word substitutions</td><td>.4300</td></tr><tr><td>Range of log latency before a cut (multiple character deletion) event</td><td>.0914</td></tr><tr><td>Number of runs of consecutive backspacing events that cross word boundaries</td><td>.0842</td></tr><tr><td>Range of the speed of cut (multiple character deletion) events</td><td>.0297</td></tr><tr><td>Log odds of spending time in a cut (multiple character deletion) event compared to other events</td><td>.0236</td></tr><tr><td>Logit of probability of a word substitution event</td><td>.0137</td></tr><tr><td>Mean latency before deletion of characters between sentences</td><td>.0136</td></tr><tr><td>Median length in characters of bursts of typing delimited by pauses 8 SD > median speed of typing characters within a word</td><td>.0103</td></tr><tr><td>Range of the speed of deletion events involving characters between sentences</td><td>.0091</td></tr><tr><td>Mean log latency before a cut (multiple character 
deletion) event</td><td>.0090</td></tr><tr><td>Median log latency before deletion of characters between sentences</td><td>.0088</td></tr><tr><td>Mean speed of cut events (multiple character deletions)</td><td>.0079</td></tr><tr><td>Total time spent before cut events (multiple character deletions)</td><td>.0075</td></tr><tr><td>Logit of probability that an event will take place at the end of the text buffer</td><td>.0075</td></tr><tr><td>Standard deviation of the length in characters of a burst of fast typing (delimited by pauses greater than 600 ms)</td><td>.0067</td></tr><tr><td>Total time spent elsewhere in the test after completing the essay</td><td>.0064</td></tr><tr><td>Range of the speed of consecutive backspacing events</td><td>.0064</td></tr><tr><td>Standard deviation length in characters of bursts of typing delimited by pauses 6 SD > than median speed of typing characters within a word</td><td>.0059</td></tr><tr><td>Range of length in characters of bursts of typing delimited by pauses 8 SD > than median speed of typing characters within a word</td><td>.0056</td></tr><tr><td>Mean speed of deletion events involving characters between sentences</td><td>.0054</td></tr><tr><td>Standard deviation of log latency when typing whitespace between words</td><td>.0047</td></tr><tr><td>Mean log latency between characters within a word</td><td>.0046</td></tr><tr><td>Number of bursts of typing delimited by pauses 5 SD > than the median pause between characters within a word</td><td>.0045</td></tr><tr><td>Median length in characters of bursts of typing delimited by pauses 2 SD > than the median pause between characters within a word</td><td>.0044</td></tr><tr><td>Total time spent deleting whitespace or punctuation at a sentence boundary</td><td>.0043</td></tr></tbody></table> </ephtml> </p> <p>Figure 5 illustrates the relationship between two of these features: the number of word substitutions and the number of runs of consecutive backspacing events that cross word 
boundaries. As this figure illustrates, original texts tended to have higher incidence of these editing behaviors, whereas nonoriginal texts mostly fell in the lower left‐hand corner of the graph, where both of these (related) editing behaviors were infrequent. Figure 6 shows a classification tree that Gradient Boosting developed for Fold 4 of the analysis.</p> <p> <img src="https://imageserver.ebscohost.com/img/embimages/rdk/MEA/01mar26/jedm12431-fig-0005.jpg?ephost1=dGJyMNXb4kSepq84yOvqOLCmsE6epq5Srqa4SK6WxWXS" alt="jedm12431-fig-0005.jpg" title="5 Scatterplot: number of backspaces over a word versus number of word substitutions for Study 2." /> </p> <p></p> <p> <img src="https://imageserver.ebscohost.com/img/embimages/rdk/MEA/01mar26/jedm12431-fig-0006.jpg?ephost1=dGJyMNXb4kSepq84yOvqOLCmsE6epq5Srqa4SK6WxWXS" alt="jedm12431-fig-0006.jpg" title="6 Decision tree (Study 2, Fold 4)." /> </p> <p></p> <p>As this figure indicates, the primary split on this tree was on the number of substituted words.</p> <p></p> <ulist> <item> Nonoriginal texts had fewer word substitutions (less than 24.5 on average).</item> <p></p> <item> If there were a larger number of word substitutions, nonoriginal text was more likely if there was a lower number of consecutive runs of backspacing that crossed word boundaries (less than 22.5 on average).</item> <p></p> <item> If the number of word substitutions was below the cut, nonoriginal text was more likely if the variability of pauses before a cut action was relatively low (less than 1.544).</item> <p></p> <item> If the number of word substitutions and consecutive backspace runs over word boundaries were above the cut, nonoriginal texts showed less variability in the speed of cut actions (<.181).</item> <p></p> <item> If the number of word substitutions and the variability of pauses before a cut action were both below the cut, the log odds of spending time on a cut event were relatively low for nonoriginal texts (<.079).</item> <p></p> 
<item> If the number of word substitutions, the variability of pauses before a cut action, and the log odds of spending time on a cut event were relatively low, a text was more likely to be nonoriginal if the logit of words being produced but not appearing in the final text was low (less than –.067).</item> </ulist> <p>Essentially, original text showed more editing behavior that produced larger changes in the final text and greater variability in the timing of editing actions. Table 6 shows the top 25 predictive features for the next‐best Random Forest model.</p> <p>6 Table The 25 Most Important Features in the Study 2 Random Forest Model</p> <p> <ephtml> <table><thead><tr valign="bottom"><th>Feature Name</th><th align="center">Feature Importances</th></tr></thead><tbody><tr><td>Number of word substitutions</td><td>.5635</td></tr><tr><td>Number of runs of consecutive backspacing events that cross word boundaries</td><td>.1006</td></tr><tr><td>Range of log latency before a cut (multiple character deletion) event</td><td>.0971</td></tr><tr><td>Range of the speed of cut (multiple character deletion) events</td><td>.0217</td></tr><tr><td>Standard deviation of the speed of cut (multiple character deletion) events</td><td>.0185</td></tr><tr><td>Standard deviation of log latency before a cut (multiple character deletion) event</td><td>.0182</td></tr><tr><td>Logit probability of spending time in a cut (multiple character deletion) cut event</td><td>.0151</td></tr><tr><td>Log odds of spending time in a cut (multiple character deletion) event compared to other events</td><td>.0118</td></tr><tr><td>Standard deviation of the length of runs of consecutive backspacing events that cross word boundaries</td><td>.0117</td></tr><tr><td>Logit probability of word substitutions</td><td>.0076</td></tr><tr><td>Standard deviation of the length of runs of consecutive backspacing events</td><td>.0067</td></tr><tr><td>Mean speed of cut events (multiple character 
deletions)</td><td>.0055</td></tr><tr><td>Number of word tokens produced that are not in the final text</td><td>.0048</td></tr><tr><td>Logit of probability that an event will take place at the end of the text buffer</td><td>.0046</td></tr><tr><td>Median speed of cut events (multiple character deletions)</td><td>.0042</td></tr><tr><td>Mean log latency before a cut (multiple character deletion) event</td><td>.0037</td></tr><tr><td>Median log latency before deletion of characters between sentences</td><td>.0031</td></tr><tr><td>Total time spent before cut events (multiple character deletions)</td><td>.0031</td></tr><tr><td>Median log latency before deletion of characters between sentences</td><td>.0031</td></tr><tr><td>Mean number of variant forms of a word created</td><td>.0028</td></tr><tr><td>Mean speed of deletion events involving characters between sentences</td><td>.0025</td></tr><tr><td>Range of speed of deletion events involving characters between sentences</td><td>.0023</td></tr><tr><td>Median length in characters of bursts of typing delimited by pauses 6 SD > than median speed of typing characters within a word</td><td>.0023</td></tr><tr><td>Logit of probability of a cut (multiple character deletion) event</td><td>.0021</td></tr><tr><td>Range of the length in characters of bursts of typing delimited by pauses 2 SD > than the median pause between characters within a word</td><td>.0021</td></tr></tbody></table> </ephtml> </p> <p>Once again, the most important feature was the number of word substitutions, with an importance of .5635, followed by the number of runs of consecutive backspacing events that cross word boundaries, with an importance of .1006, and the range of log latencies before a cut (multiple character deletion) event, with an importance of .0971. 
After that, feature importances drop off rapidly, though once again, most of the features in Table 6 address backspacing, cut events, word substitutions, time spent at the end of the text buffer (rather than earlier, editing previously written words), and variability in burst lengths.</p> <p>7 Table Performance of Study 3: Classifier Models Predicting from Keystroke Data Whether a Submitted Essay Has Large Amounts of Overlapping Content with Another Essay</p> <p> <ephtml> <table><thead><tr valign="bottom"><th /><th>Classification Accuracy</th><th>AUC</th><th><italic>F</italic>1</th><th>Precision</th><th>Recall</th></tr></thead><tbody><tr><td>Gradient Boosting</td><td>.897</td><td>.952</td><td>.891</td><td>.928</td><td>.859</td></tr><tr><td>Random Forest</td><td>.879</td><td>.944</td><td>.876</td><td>.941</td><td>.809</td></tr><tr><td>Neural Network</td><td>.879</td><td>.943</td><td>.881</td><td>.931</td><td>.827</td></tr><tr><td>Logistic Regression</td><td>.876</td><td>.943</td><td>.872</td><td>.898</td><td>.847</td></tr><tr><td>K Nearest Neighbors</td><td>.882</td><td>.927</td><td>.874</td><td>.935</td><td>.821</td></tr><tr><td>Naive Bayes</td><td>.855</td><td>.890</td><td>.854</td><td>.861</td><td>.849</td></tr></tbody></table> </ephtml> </p> <hd id="AN0192629998-26">Discussion for Study 2</hd> <p>The classifiers' performance in Study 2 is lower than in Study 1. This is not surprising, since in Study 1 the copy‐typing sessions consisted entirely of text copied from an existing source. In contrast, Study 2 defines "nonoriginal texts" as texts that had high levels of exact textual overlap with other submitted texts (a trigram cosine similarity greater than .30). The feature sets used in Study 2 were different from those that were available for Study 1, so no direct comparison is possible, but on close examination, similar themes emerge. 
The most important features measured behaviors associated with editing (word substitutions, long runs of backspacing, and cut actions) and indicated that original texts showed a higher incidence of these editing behaviors and greater variability in their timing. Thus, in Study 2, as in Study 1, the differences between original and nonoriginal text centered on timing patterns (greater or lesser variability in latency, speed, and burst lengths) and on a greater or lesser incidence of editing behaviors.</p> <hd id="AN0192629998-27">Study 3</hd> <p></p> <hd id="AN0192629998-28">Methodology</hd> <p>To check whether the detection results of Study 2 also hold for other writing tasks, we drew writing submissions from a different large‐scale, high‐stakes assessment, which also required candidates to complete two distinct essay‐writing tasks. Keystroke logs and the final submitted essays were captured by the assessment delivery platform. Essays from 42,241 candidates randomly sampled from three months in 2022 were analyzed to construct the dataset used in this study.</p> <hd id="AN0192629998-29">Materials</hd> <p>The assessment was administered between February 1, 2022 and May 30, 2022 in an online, secure system, with administrations split between ETS testing centers with physical proctoring (25.8%) and at‐home testing with online proctoring (74.2%). The delivery platform displayed only the assessment and did not allow users to bring up other applications. The test contained multiple sections, some multiple choice and others (specifically, the writing section) allowing typed responses. Total time allowed for the assessment was three hours.</p> <p>On this assessment, each candidate wrote two essays, with thirty minutes allowed to complete each essay. One of the essay prompt types required the candidate to produce a persuasive essay on a general topic (one that they could reasonably be expected to build an argument about from general background knowledge). 
This task provided no source material that students could copy, nor were they allowed to look up source material while writing the essay. There were more than two hundred distinct prompts for this task in our sample. The second essay was source based and was excluded from this study.</p> <hd id="AN0192629998-30">Data Selection: Essay Similarity Detection</hd> <p>All essays were processed using the same AutoESD tool as Study 2. The distribution of essays by essay similarity score is shown in Figure 7.</p> <p> <img src="https://imageserver.ebscohost.com/img/embimages/rdk/MEA/01mar26/jedm12431-fig-0007.jpg?ephost1=dGJyMNXb4kSepq84yOvqOLCmsE6epq5Srqa4SK6WxWXS" alt="jedm12431-fig-0007.jpg" title="7 Histogram of frequencies by maximum trigram cosine similarity metric to another essay in the same time period for Study 3" /> </p> <p></p> <p>As Figure 7 shows, a very small percentage of the corpus had an essay similarity score above .20, and essays with essay similarity scores above .30 were an order of magnitude rarer. Thus, essays that unambiguously contained large portions of nonoriginal text were a very small portion of the total corpus.</p> <p>As in Study 2, we filtered the data by selecting essays whose essay similarity scores were either below .05 or above .30. Filtering the data in this way yielded a total of 1,315 essays with a cosine similarity greater than .30 to at least one other essay. We matched these essays with a random selection of 1,315 essays whose maximum cosine similarity to any other essay in the set of candidate responses was less than .05. This yielded a balanced set of 2,630 essays, half of which contained large chunks of nonoriginal text.</p> <hd id="AN0192629998-32">Data Processing and Cleaning: Keystroke Feature Analysis</hd> <p>The keystroke logs for the 2,630 essays in our dataset were put through the keystroke pipeline described in Choi et al. ([<reflink idref="bib18" id="ref75">18</reflink>]). 
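</p> <p>The selection rule described under Data Selection can be illustrated with a minimal sketch. It assumes word trigrams, lowercased whitespace tokenization, and raw counts; the internals of AutoESD are not described here, so those choices and all function names are hypothetical. Only the two thresholds (.30 and .05) come from the text.</p>

```python
import math
from collections import Counter


def trigram_counts(text):
    """Count word trigrams in a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:], tokens[2:]))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse trigram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # texts shorter than three tokens have no trigrams
    return dot / (norm_a * norm_b)


def max_similarity(essays):
    """For each essay, return its highest similarity to any other essay."""
    vectors = [trigram_counts(e) for e in essays]
    scores = []
    for i, vi in enumerate(vectors):
        others = (cosine_similarity(vi, vj)
                  for j, vj in enumerate(vectors) if j != i)
        scores.append(max(others, default=0.0))
    return scores


def label(score, low=0.05, high=0.30):
    """1 = nonoriginal (above high), 0 = original (below low), None = excluded."""
    if score > high:
        return 1
    if score < low:
        return 0
    return None
```

<p>Essays scoring between the two thresholds receive no label and are left out, which mirrors the filtering step that produced the balanced set of 2,630 essays.</p> <p>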
We applied the same data cleaning steps we used with Study 2. After the data cleaning step, the dataset contained 545 keystroke features.</p> <hd id="AN0192629998-33">Final Data Preparation and Analysis</hd> <p>Essays with a cosine similarity to some other essay greater than or equal to .30 were labeled "1" (as consisting primarily of nonoriginal text). Essays with no cosine similarity to any other essay greater than .05 were labeled "0" (as consisting of original, candidate‐produced text). The combined data was analyzed using four‐fold cross‐validation. The same machine learning classifiers were trained and evaluated as in the previous two studies. The seven machine learning models were trained and fine‐tuned using four‐fold cross‐validation on the training set. A grid search algorithm was used for model tuning: we systematically evaluated every combination of hyperparameter values, and for each machine learning model we selected the combination that maximized the average cross‐validation score. We also examined feature importance (calculated using the Gini index), not only for the two best‐performing models but also for the logistic regression model, which provided a simple linear model of feature contributions to prediction.</p> <hd id="AN0192629998-34">Results</hd> <p>Table 7 presents the results of predictive modeling. The Gradient Boosting model had the strongest performance, with a classification accuracy of .897, an AUC of .952, an <emph>F</emph>1 score of .891, a precision of .928, and a recall of .859. The Random Forest model performed slightly worse on most measures, with a classification accuracy of .879, an AUC of .944, an <emph>F</emph>1 score of .876, a precision of .941, and a recall of .809. The Logistic Regression and Neural Network models had very similar levels of performance to the Random Forest model. 
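</p> <p>The tuning procedure described under Final Data Preparation and Analysis (every hyperparameter combination scored by four‐fold cross‐validation, with the best mean score retained) can be sketched in plain Python. This is an illustration only, not the pipeline used in the study: a toy k‐nearest‐neighbors rule stands in for the classifiers, the folds are interleaved rather than randomized, accuracy is assumed as the selection metric, and the function names and grid values are hypothetical.</p>

```python
from itertools import product
from statistics import mean


def knn_predict(train_X, train_y, x, k, p):
    """Majority vote among the k nearest training rows (Minkowski order p)."""
    # The p-th root is omitted because it does not change the ranking.
    dists = sorted(
        (sum(abs(a - b) ** p for a, b in zip(row, x)), label)
        for row, label in zip(train_X, train_y)
    )
    votes = [label for _, label in dists[:k]]
    return int(sum(votes) * 2 > len(votes))


def cv_accuracy(X, y, k, p, n_folds=4):
    """Mean held-out accuracy over n_folds interleaved folds."""
    scores = []
    for fold in range(n_folds):
        test_idx = set(range(fold, len(X), n_folds))
        train_X = [row for i, row in enumerate(X) if i not in test_idx]
        train_y = [lab for i, lab in enumerate(y) if i not in test_idx]
        hits = sum(
            knn_predict(train_X, train_y, X[i], k, p) == y[i] for i in test_idx
        )
        scores.append(hits / len(test_idx))
    return mean(scores)


def grid_search(X, y, grid, n_folds=4):
    """Score every hyperparameter combination; return the best one."""
    names = list(grid)
    results = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((cv_accuracy(X, y, n_folds=n_folds, **params), params))
    # On ties, max() keeps the first combination in grid order.
    best_score, best_params = max(results, key=lambda r: r[0])
    return best_params, best_score
```

<p>Called as grid_search(X, y, {"k": [1, 3, 5], "p": [1, 2]}), with X a list of feature rows and y a list of 0/1 labels, the function returns the winning parameter dictionary together with its mean held‐out accuracy.</p> <p>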
The Neural Network model had a classification accuracy of .879, an AUC of .943, an <emph>F</emph>1 score of .881, a precision of .930, and a recall of .827. The Logistic Regression model had a classification accuracy of .876, an AUC of .943, an <emph>F</emph>1 score of .872, a precision of .898, and a recall of .847. The K Nearest Neighbors model was next on many of the measures, with a classification accuracy of .882, an AUC of .927, an <emph>F</emph>1 score of .874, a precision of .935, and a recall of .821. The Naïve Bayes model had the worst performance, with a classification accuracy of .855, an AUC of .890, an <emph>F</emph>1 score of .854, a precision of .860, and a recall of .849.</p> <p>Table 8 shows the top 25 predictive features for the Gradient Boosting model.</p> <p>Table 8. The 25 Most Important Features in the Study 3 Gradient Boosting Model</p> <p> <ephtml> <table><thead><tr valign="bottom"><th>Feature Name</th><th align="center">Feature Importances</th></tr></thead><tbody><tr><td>Total time spent paused after the last keystroke</td><td>.6787</td></tr><tr><td>Total time spent pausing before whitespace deletion between words</td><td>.0434</td></tr><tr><td>Logit probability of spending time typing characters within a word</td><td>.0295</td></tr><tr><td>Total time spent typing characters within a word</td><td>.0263</td></tr><tr><td>Mean log latency between words</td><td>.0120</td></tr><tr><td>Total time spent typing whitespace between words</td><td>.0095</td></tr><tr><td>Logit probability of word substitutions compared to other events</td><td>.0072</td></tr><tr><td>Mean log latency of pauses before typing whitespace between words</td><td>.0062</td></tr><tr><td>Mean log latency of whitespace typed between words</td><td>.0061</td></tr><tr><td>SD of the number of alternative word forms produced while typing a word</td><td>.0054</td></tr><tr><td>Logit probability of consecutive backspacing compared to other events</td><td>.0054</td></tr><tr><td>Logit probability of a typing event happening 
at the end of the text buffer</td><td>.0051</td></tr><tr><td>Range of burst length in characters (delimited by pauses 4 SD > the median pause between characters within a word)</td><td>.0046</td></tr><tr><td>Total active writing time</td><td>.0034</td></tr><tr><td>Number of bursts of typing (delimited by pause greater than 4 seconds)</td><td>.0026</td></tr><tr><td>Mean log latency of burst length in characters (delimited by pauses 8 SD > the median pause between characters within a word)</td><td>.0023</td></tr><tr><td>Total time spent at the end of the text buffer</td><td>.0023</td></tr><tr><td>Mean log latency of pauses before deleting whitespace between words</td><td>.0023</td></tr><tr><td>Mean length of runs of consecutive backspacing events</td><td>.0022</td></tr><tr><td>Range in the length in characters of bursts of typing (delimited by pauses greater than 4 seconds)</td><td>.0021</td></tr><tr><td>SD of the length in characters of bursts of fast typing (delimited by pauses 2 SD > than the median pause time between characters within a word)</td><td>.0021</td></tr><tr><td>Range of the log latency before deleting whitespace between words</td><td>.0019</td></tr><tr><td>Median log latency before typing whitespace between words</td><td>.0019</td></tr><tr><td>The log odds of a cut event relative to other events</td><td>.0018</td></tr><tr><td>The log odds of spending time pausing before a jump to edit event relative to other events</td><td>.0017</td></tr><tr><td>Range of the speed (in chars. per second) of pauses before typing sentence internal punctuation marks</td><td>.0016</td></tr></tbody></table> </ephtml> </p> <p>Total time spent paused after the last keystroke was the single most important feature, with a feature importance of .6787. A few other features had relatively strong importance, including the total time spent pausing before whitespace deletion between words (importance, .0434), the logit probability of spending time typing characters within a word (importance, .0295), the total time spent typing characters within a word (importance, .0263), and the mean log latency between words (importance, .0120). Figure 8 shows the scatterplot for the two most important features.</p> <p> <img src="https://imageserver.ebscohost.com/img/embimages/rdk/MEA/01mar26/jedm12431-fig-0008.jpg?ephost1=dGJyMNXb4kSepq84yOvqOLCmsE6epq5Srqa4SK6WxWXS" alt="jedm12431-fig-0008.jpg" title="8 Scatterplot: total time spent pausing before deleting whitespace between words versus total time spent paused after the last keystroke for Study 3." /> </p> <p></p> <p> <img src="https://imageserver.ebscohost.com/img/embimages/rdk/MEA/01mar26/jedm12431-fig-0009.jpg?ephost1=dGJyMNXb4kSepq84yOvqOLCmsE6epq5Srqa4SK6WxWXS" alt="jedm12431-fig-0009.jpg" title="9 Decision tree (Study 3, Fold 4)." /> </p> <p></p> <p>Examination of the scatterplot indicates that both features show strong separation between original and nonoriginal text. People who typed original texts typically paused for one half to three seconds after they finished typing their essay, before they moved on to the rest of the test, whereas people who typed nonoriginal texts moved on almost instantly. People who typed original texts tended not to spend much time deleting whitespace between words, whereas those who typed nonoriginal texts showed greater latency before deleting word boundaries.</p> <p>In the decision tree shown in Figure 9, the first‐level split focuses on time spent pausing after the essay is complete, with a threshold of 546.336 milliseconds. Responses that fell above the first split were further subdivided by mean log latency between words, with more nonoriginal responses falling below the threshold value of 5.441. 
Responses that fell below the first split were further subdivided by total active writing time, with more nonoriginal text responses falling below the threshold of 1237.57 seconds (about 20 minutes), and more original text responses taking more than 20 minutes to complete the essay. Further subdivisions focused on editing behaviors (word substitutions and jump edits) and on the length and number of bursts of fast typing, with the splits favoring the proposition that original texts were more likely to display high fluency and a greater incidence of editing behaviors.</p> <p>Table 9 shows the most important features for the Random Forest model. Essentially the same features were the most important as in the Gradient Boosting model: total time spent paused after the last keystroke (importance, .8114), logit probability of spending time typing characters within a word (importance, .0275), mean log latency between words (importance, .0165), and total time spent pausing before whitespace deletion between words (importance, .0133). 
The remaining features in the top 25 list overlapped strongly with those for gradient boosting, and were similar where they did not overlap, involving mostly features linked to editing behaviors and variability of typing tempos.</p> <p>Table 9. Most Important Features in the Study 3 Random Forest Model</p> <p> <ephtml> <table><thead><tr valign="bottom"><th>Feature Name</th><th align="center">Feature Importances</th></tr></thead><tbody><tr><td>Total time spent paused after the last keystroke</td><td>.8114</td></tr><tr><td>Logit probability of spending time typing characters within a word</td><td>.0275</td></tr><tr><td>Mean log latency between words</td><td>.0165</td></tr><tr><td>Total time spent pausing before whitespace deletion between words</td><td>.0113</td></tr><tr><td>Total time spent typing characters within a word</td><td>.0057</td></tr><tr><td>Logit probability of a typing event happening at the end of the text buffer</td><td>.0043</td></tr><tr><td>Total time spent typing whitespace between words</td><td>.0041</td></tr><tr><td>Mean log latency of pauses before typing whitespace between words</td><td>.0034</td></tr><tr><td>SD of the number of alternative word forms produced while typing a word</td><td>.0031</td></tr><tr><td>SD of log latency before whitespace deletion between words</td><td>.0025</td></tr><tr><td>SD of speed (in characters per second) of initial backspace events</td><td>.0023</td></tr><tr><td>Number of consecutive backspace events</td><td>.0020</td></tr><tr><td>Logit probability of a word type (defined by spelling) appearing in the final text</td><td>.0019</td></tr><tr><td>Logit probability of spending time pausing before a jump to edit event</td><td>.0018</td></tr><tr><td>Mean number of alternative word forms produced while typing a word</td><td>.0018</td></tr><tr><td>Logit probability of a word substitution event</td><td>.0016</td></tr><tr><td>Median length (in characters) of a burst of typing (delimited by pauses 4 SD > than the median 
pause between characters within a word)</td><td>.0016</td></tr><tr><td>Range of the length (in characters) of a burst of typing (delimited by pauses 4 SD > than the median pause between characters within a word)</td><td>.0015</td></tr><tr><td>Mean log latency before a line break event</td><td>.0015</td></tr><tr><td>The log odds of spending time pausing before a jump edit event compared to other events</td><td>.0015</td></tr><tr><td>The logit probability of a typing event being part of a run of backspacing events that crosses a word boundary</td><td>.0015</td></tr><tr><td>Mean length of runs of backspacing</td><td>.0014</td></tr><tr><td>Logit probability of consecutive backspacing events</td><td>.0014</td></tr><tr><td>The log odds of spending time pausing before a cut event compared to other events</td><td>.0014</td></tr><tr><td>interword_interval_logIKI_median</td><td>.0013</td></tr></tbody></table> </ephtml> </p> <hd id="AN0192629998-37">Discussion for Study 3</hd> <p>Overall, the results for Study 3 indicate that writers of nonoriginal text spent less time writing, spent more of that time typing within rather than between words, were more likely to move on quickly once they finished writing, and had different editing behaviors; for instance, they spent more time pausing before deleting whitespace between words. These patterns make sense when viewed in the light of the difference between copy‐typing and drafting. If the content is known in advance, there is no need to stop to think between words, resulting in a faster writing process. However, if typographic errors are made while copy‐typing, it might take some time to check the copied text against the original, resulting in more time spent pausing before the writer decides to backspace back to the point of error and retype. 
And when the writer finishes copying, there is no need to review the results to know they are good enough; it makes sense simply to go on.</p> <hd id="AN0192629998-38">General Discussion</hd> <p>In this paper, we presented three studies to demonstrate that keystroke‐based features can be used to reliably distinguish copy writing from draft writing. All three studies revealed profound differences between copy‐typing (or, the production of nonoriginal text) and normal drafting processes. The best models were generally accurate (.951 classification accuracy in Study 1, .904 in Study 2, and .897 in Study 3). Certain characteristic differences between the task types appeared to generalize: Specifically, in all three studies, people appear to have produced nonoriginal text with less effort, revision, or editing. However, there were also differences between the studies. Study 1 represents a controlled experiment in which we possess a ground truth dataset consisting of full essay drafts or written copies. On the other hand, both Study 2 and Study 3 were grounded in real‐world writing assessment data, but they differed in terms of the specific writing tasks and the characteristics of the test taker population. In these latter studies (Study 2 and Study 3), the copy‐writing component may not necessarily encompass the entire essay, leading to a reduction in classification accuracy when compared to the results obtained in Study 1. Nevertheless, all three studies converged in showing that keystroke‐based writing process features can be leveraged to detect copy writing with good accuracy. This could have significant implications for ensuring genuine and authentic responses in the context of writing assignments in an era where powerful generative AI is prevalent.</p> <p>However, the best models differed across the studies in highly suggestive ways. 
For example, the best model for Study 1 was particularly sensitive to measures of composition fluency and editing and revision behavior. While these features played a role in all three models, the best model in Study 2 was particularly sensitive to features measuring word substitution, cut and paste, and backspacing operations, and the best model in Study 3 was highly sensitive to overall patterns of time usage, such as how much time writers spent pausing after they had finished writing and how long they paused before deleting (most likely, backspacing over) whitespace between words. These differences suggest that there may have been differences in the strategies used by writers of nonoriginal text across the two assessments in Studies 2 and 3, though all three studies suggest that writers of original text spent far more time and effort on planning, evaluating, and editing behaviors than those who were largely reproducing nonoriginal text.</p> <p>The primary potential application of the models we describe would be to confirm whether keystroke logs of interest showed patterns indicative of copying nonoriginal texts, perhaps in combination with other kinds of security measures, such as AI‐generated text detectors. The criterion measure we used (cosine trigram similarity between essays submitted on the same task) is a powerful security metric in its own right, but it depends on people writing from a common model. This is, historically, among the most common forms of cheating on writing assessments; but now that reasonably high‐quality essays can be generated by large language models such as ChatGPT, it is entirely feasible for people to use AI to create individualized essays that will not overlap with anything submitted by any other student. In this situation, keystroke log analysis has the potential to provide important evidence about whether an essay submission needs to be screened for possible cheating. 
The level of classification accuracy achieved in this study (around 90%) does not match that of true biometric indicators, but it might be very useful for screening essay submissions to identify cases needing more careful review, or as supporting evidence in combination with other indicators of cheating.</p> <p>However, while our current work holds great promise, it is essential to acknowledge important limitations. First, in Studies 2 and 3, we flagged positive cases if they had cosine content similarity greater than .3 to at least one other essay. While this method has proven effective, it is nonetheless a proxy, since it can only flag nonoriginal content in essays that have been submitted by at least two different test candidates. Given the relative rarity of essays with high essay similarity scores, the vast majority of cases flagged positive would not be cases that would be detected using automated essay similarity detection. It will therefore be necessary to conduct follow‐up studies to determine how effectively we can combine detection of suspicious keystroke log patterns with other metrics to identify unambiguous examples of cheating on high‐stakes tests.</p> <p>In addition, this method will not flag essays containing small amounts of nonoriginal content (though such essays may also be less likely to be considered inappropriate by human raters). We recognize the potential for improvement in our modeling by identifying specific segments of copied text based on exact matches, and modeling how writing processes change when writers are producing such segments. We are actively investigating the extent to which this approach can enhance detection accuracy. Similarly, cosine similarity is only one of several potential metrics. 
We are exploring the use of other methods, including an exact matching ratio, which quantifies the fraction of text overlapping with other essays, as it has the potential to provide a more accurate assessment of text similarity. Ongoing work is dedicated to evaluating whether this alternative measure can significantly enhance our detection accuracy.</p> <p>Finally, it is important to note that our results demonstrated variability in the features that flag nonoriginal text across different tasks and populations. This implies that the writing process cues that indicate nonoriginal text may be contingent on a variety of factors, in ways we do not yet understand. Great caution should therefore be exercised in generalizing the models we have developed beyond the specific contexts for which they were developed (for Studies 2 and 3, standardized writing prompts, delivered digitally, as part of larger, proctored, high‐stakes assessments). To address this issue, we are expanding our research to encompass a broader range of writing tasks across various genres of essays. This expanded scope will allow us to assess the generalizability of our findings. We will report the corresponding findings in a future paper.</p> <ref id="AN0192629998-39"> <title> Footnotes </title> <blist> <bibl id="bib1" idref="ref19" type="bt">1</bibl> <bibtext> This paper is a companion to Jiang et al. ([39]), which cites this paper but happens to have been published first. The current paper focuses on evaluating, in depth, the potential of keystroke analysis to support prediction of nonoriginal text. Jiang and colleagues' paper, by contrast, focuses on analyzing the fairness implications of using such methods to support test security.</bibtext> </blist> <blist> <bibl id="bib2" idref="ref58" type="bt">2</bibl> <bibtext> Analyses like these seem to form part of a more general shift toward use of various kinds of deep learning models to analyze process data. For another example, see Xiong et al. 
(2024), who propose a sequential reservoir method to extract meaningful features from sequential log data.</bibtext> </blist> <blist> <bibl id="bib3" idref="ref42" type="bt">3</bibl> <bibtext> Missing values are common for some of the keystroke features we used in Studies 1, 2, and 3, simply because certain behaviors were rare. For example, features measuring the tempo of cut and paste operations will be missing if no cut or paste actions were taken. This creates issues that are not easily solved with standard statistical techniques, since the data is not missing at random; in fact, missingness itself implies qualitatively different writing processes. Restricting the feature set to those with very few missing values is not an option, because rare behaviors (such as editing actions) can be highly informative and predictive. Standard multiple imputation methods are also problematic because it is hard to see what feature values one should impute for an action (for instance, for the latency of an action) when that action does not occur. Given that our goal is prediction using multiple machine learning (ML) methods, and not to draw statistical conclusions from individual feature metrics, we decided to go with a fairly straightforward default rule commonly used in the machine learning community (imputing the median, after excluding features that had more than 40% missing values). While this could create bias if missing values carry significant qualitative information, at least some of the ML methods we apply, such as Random Forest and Gradient Boosting, should be able to differentiate qualitatively distinct cases concentrated at the median, and use this information to build effective predictive models. 
Since the prediction accuracy did not change much after we introduced this way of handling missing data, we believe it was appropriate, especially since using a more sophisticated form of imputation would be more difficult for others to replicate and (for the reasons mentioned above) would probably not be appropriate.</bibtext> </blist> <blist> <bibl id="bib4" idref="ref40" type="bt">4</bibl> <bibtext> The statistics in the table are standard metrics: classification accuracy, AUC, F1 score, precision, and recall. Classification accuracy is the proportion of all classifications that were correct, whether positive or negative. AUC is the area under the receiver operating characteristic (ROC) curve, and it represents the probability that the model, if given a randomly chosen positive and negative example, will rank the positive higher than the negative. Precision is the proportion of all the model's positive classifications that are actually positive. Recall is the proportion of all actual positives that were classified correctly as positives. The F1 score is the harmonic mean of precision and recall; it represents both symmetrically in one metric. See https://developers.google.com/machine‐learning/crash‐course/classification/accuracy‐precision‐recall and https://developers.google.com/machine‐learning/crash‐course/classification/roc‐and‐auc</bibtext> </blist> <blist> <bibl id="bib5" idref="ref70" type="bt">5</bibl> <bibtext> Both the Gradient Boosting and the Random Forest models used the Gini index to calculate feature importance.</bibtext> </blist> <blist> <bibl id="bib6" idref="ref41" type="bt">6</bibl> <bibtext> In Figure 3, each internal node in the decision tree represents a "test" on an attribute and each branch represents the outcome of the test. The "friedman_mse" is the mean squared error with improvement score using the Friedman rank test (cf. Biju & Prashanth, [12]). 
It is the best‐supported criterion to measure the quality of a split. The "value" refers to the predicted outcome or class label at a leaf node, which is the final decision reached by following the tree's branches.</bibtext> </blist> <blist> <bibl id="bib7" idref="ref12" type="bt">7</bibl> <bibtext> Above a cosine similarity of .30, there are multiple chunks of content in both essays that are word‐for‐word or nearly word‐for‐word identical. The sample thus produced may not represent all nonoriginal essays in the original sample, but we can be reasonably sure that they contain large amounts of nonoriginal text.</bibtext> </blist> </ref> <ref id="AN0192629998-40"> <title> References </title> <blist> <bibtext> Acien, A., Morales, A., Vera‐Rodriguez, R., Fierrez, J., & Monaco, J. V. (2020). TypeNet: Scaling up keystroke biometrics. In 2020 IEEE International Joint Conference on Biometrics (IJCB) (pp. 1–7). IEEE.</bibtext> </blist> <blist> <bibtext> Acien, A., Morales, A., Vera‐Rodriguez, R., Fierrez, J., Mondesire‐Crump, I., & Arroyo‐Gallego, T. (2022). Detection of mental fatigue in the general population: Feasibility study of keystroke dynamics as a real‐world biomarker. JMIR biomedical engineering, 7(2), e41003.</bibtext> </blist> <blist> <bibtext> Agarwal, N., Danielsen, N. F., Gravdal, P. K., & Bours, P. (2022). Contract cheat detection using biometric keystroke dynamics. In 2022 20th International Conference on Emerging eLearning Technologies and Applications (ICETA) (pp. 15–21). IEEE.</bibtext> </blist> <blist> <bibtext> Alsultan, A., & Warwick, K. (2013). Keystroke dynamics authentication: a survey of free‐text methods. International Journal of Computer Science Issues (IJCSI), 10(4), 1.</bibtext> </blist> <blist> <bibtext> Anson, C. M., & Kruse, O. (2023). Plagiarism detection and intertextuality software. In Digital Writing Technologies in Higher Education: Theory, Research, and Practice (pp. 231–243). 
Cham: Springer International Publishing.</bibtext> </blist> <blist> <bibtext> Ayotte, B., Banavar, M., Hou, D., & Schuckers, S. (2020). Fast free‐text authentication via instance‐based keystroke dynamics. IEEE Transactions on Biometrics, Behavior, and Identity Science, 2(4), 377–387.</bibtext> </blist> <blist> <bibtext> Baldwin, P., Yaneva, V., North, K., Ha, L. A., Zhou, Y., Mechaber, A. J., & Clauser, B. E. (2025). The vulnerability of AI‐based scoring systems to gaming strategies: A case study. Journal of Educational Measurement, 62(1), 172–194.</bibtext> </blist> <blist> <bibl id="bib8" idref="ref52" type="bt">8</bibl> <bibtext> Banerjee, R., Feng, S., Kang, J. S., & Choi, Y. (2014). Keystroke patterns as prosody in digital writings: A case study with deceptive reviews and essays. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) (pp. 1469–1473).</bibtext> </blist> <blist> <bibl id="bib9" idref="ref13" type="bt">9</bibl> <bibtext> Barrett, C., Boyd, B., Burzstein, E., Carlini, N., Chen, B., Choi, J., Chowdhury, A. R., Christodorescu, M., Datta, A., Feizi, S., Fisher, K., Hashimoto, T., Hendrycks, D., Jha, S., Kang, D., Kerschbaum, F., Mitchell, E., Mitchell, J., Ramzan, Z., ... Yang, D. (2023). Identifying and mitigating the security risks of generative AI. arXiv preprint arXiv:2308.14840.</bibtext> </blist> <blist> <bibtext> Bernardi, M. L., Cimitile, M., Martinelli, F., & Mercaldo, F. (2019). Keystroke analysis for user identification using deep neural networks. In 2019 international joint conference on neural networks (IJCNN) (pp. 1–8). IEEE.</bibtext> </blist> <blist> <bibtext> Bhana, B., & Flowerday, S. (2020). Passphrase and keystroke dynamics authentication: Usable security. Computers & Security, 96, 101925.</bibtext> </blist> <blist> <bibtext> Biju, V. G., & Prashanth, C. M. (2017). Friedman and Wilcoxon evaluations comparing SVM, bagging, boosting, K‐NN and decision tree classifiers. 
Journal of Applied Computer Science Methods, 9, 23–47.</bibtext> </blist> <blist> <bibtext> Bixler, R., & D'Mello, S. (2013, March). Detecting boredom and engagement during writing with keystroke analysis, task appraisals, and stable traits. In Proceedings of the 2013 international conference on Intelligent user interfaces (pp. 225–234). Association for Computing Machinery.</bibtext> </blist> <blist> <bibtext> Brizan, D. G., Goodkind, A., Koch, P., Balagani, K., Phoha, V. V., & Rosenberg, A. (2015). Utilizing linguistically enhanced keystroke dynamics to predict typist cognition and demographics. International Journal of Human‐Computer Studies, 82, 57–68.</bibtext> </blist> <blist> <bibtext> Buker, A. A., Roffo, G., & Vinciarelli, A. (2020). Type like a man! inferring gender from keystroke dynamics in live‐chats. IEEE Intelligent Systems, 34(6), 53–59.</bibtext> </blist> <blist> <bibtext> Burstein, J., Tetreault, J., & Madnani, N. (2013). The e‐rater automated essay scoring system. Handbook of automated essay evaluation: Current applications and new directions, 55–67.</bibtext> </blist> <blist> <bibtext> Chang, H. C. (2021). Keystroke dynamics based on machine learning. M.A. Thesis, Dept. of Computer Science, San Jose State University.</bibtext> </blist> <blist> <bibtext> Choi, I., Hao, J., Deane, P., & Zhang, M. (2021). Benchmark keystroke biometrics accuracy from high‐stakes writing tasks. ETS Research Report Series, 2021(1), 1–13.</bibtext> </blist> <blist> <bibtext> Choi, I., Hao, J., Li, C., Fauss, M., & Novak, J. (2024). AutoESD: An automated system for detecting non‐authentic texts for high‐stakes writing tests. ETS Research Report Series, 2024, 1–16.</bibtext> </blist> <blist> <bibtext> Choi, I., & Deane, P. (2021). Evaluating writing process features in an adult EFL writing assessment context: A keystroke logging study. Language Assessment Quarterly, 18(2), 107–132.</bibtext> </blist> <blist> <bibtext> Choi, I., Hao, J., Deane, P., & Zhang, M. (2021). 
Benchmark keystroke biometrics accuracy from high‐stakes writing tasks. ETS Research Report Series, 2021(1), 1–13.</bibtext> </blist> <blist> <bibtext> Conijn, R., Roeser, J., & Van Zaanen, M. (2019). Understanding the keystroke log: The effect of writing task on keystroke features. Reading and Writing, 32(9), 2353–2374.</bibtext> </blist> <blist> <bibtext> Crossley, S., Tian, Y., Choi, J. S., Holmes, L., & Morris, W. (2024). Plagiarism detection using keystroke logs. In Proceedings of the 17th International Conference on Educational Data Mining (pp. 476–483).</bibtext> </blist> <blist> <bibtext> Dawson, P. (2020). Defending assessment security in a digital world: Preventing e‐cheating and supporting academic integrity in higher education. Routledge.</bibtext> </blist> <blist> <bibtext> Deane, P., Roth, A., Litz, A., Goswami, V., Steck, F., Lewis, M., & Richter, T. (2018). Behavioral differences between retyping, drafting, and editing: A writing process analysis (RM‐18‐06). Princeton, NJ: Educational Testing Service.</bibtext> </blist> <blist> <bibtext> Deane, P., Wilson, J., Zhang, M., Li, C., van Rijn, P., Guo, H., ... & Richter, T. (2021). The sensitivity of a scenario‐based assessment of written argumentation to school differences in curriculum and instruction. International Journal of Artificial Intelligence in Education, 31, 57–98.</bibtext> </blist> <blist> <bibtext> Deane, P., & Zhang, M. (2015). Exploring the feasibility of using writing process features to assess text production skills. ETS Research Report Series, 2015(2), 1–16.</bibtext> </blist> <blist> <bibtext> Dinneen, C. (2021). Students' use of digital translation and paraphrasing tools in written assignments on Direct Entry English Programs. English Australia Journal, 37(1), 40–51.</bibtext> </blist> <blist> <bibtext> Epp, C. C. (2010). Identifying emotional states through keystroke dynamics (Doctoral dissertation, University of Saskatchewan).</bibtext> </blist> <blist> <bibtext> ETS. (n.d.). 
Criterion writing evaluation service. Retrieved January 22, 2024, from https://criterion.ets.org/</bibtext> </blist> <blist> <bibtext> Fenu, G., Marras, M., & Boratto, L. (2018). A multi‐biometric system for continuous student authentication in e‐learning platforms. Pattern Recognition Letters, 113, 83–92.</bibtext> </blist> <blist> <bibtext> Flior, E., & Kowalski, K. (2010, April). Continuous biometric user authentication in online examinations. In 2010 seventh International Conference on information technology: new generations (pp. 488–492). IEEE.</bibtext> </blist> <blist> <bibtext> Galbraith, D., & Baaijen, V. M. (2019). Aligning keystrokes with cognitive processes in writing. In Lindgren, E. & Sullivan, K. (Eds.), Observing writing (pp. 306–325). Brill.</bibtext> </blist> <blist> <bibtext> Gunetti, D., & Picardi, C. (2005). Keystroke analysis of free text. ACM Transactions on Information and System Security, 8(3), 312–347.</bibtext> </blist> <blist> <bibtext> Hao, J., & Fauss, M. (2023). Test security in remote testing age: Perspectives from process data analytics and AI. In H. Jiao and R.W. Lissitz (Eds.), Machine learning, natural language processing and psychometrics. IAP.</bibtext> </blist> <blist> <bibtext> Hayes, J. R. (2012). Modeling and remodeling writing. Written Communication, 29(3), 369–388.</bibtext> </blist> <blist> <bibtext> Idrus, F., Asadi, Z., & Mokhtar, N. (2016). Academic dishonesty and achievement motivation: A delicate relationship. Higher Education of Social Science, 11(1), 1–8.</bibtext> </blist> <blist> <bibtext> Ilonen, J. (2003). Keystroke dynamics. Advanced Topics in Information Processing—Lecture, 03–04.</bibtext> </blist> <blist> <bibtext> Jiang, Y., Zhang, M., Hao, J., Deane, P., & Li, C. (2024). Using keystroke behavior patterns to detect nonauthentic texts in writing assessments: Evaluating the fairness of predictive models. Journal of Educational Measurement, 61(4), 571–594.</bibtext> </blist> <blist> <bibtext> Kochegurova, E. 
A., & Zateev, R. P. (2022). Hidden monitoring based on keystroke dynamics in online examination system. Programming and Computer Software, 48(6), 385–398.</bibtext> </blist> <blist> <bibtext> Kundu, D., Mehta, A., Kumar, R., Lal, N., Anand, A., Singh, A., & Shah, R. R. (2024). Keystroke dynamics against academic dishonesty in the age of LLMs. In 2024 IEEE International Joint Conference on Biometrics (IJCB) (pp. 1–10). IEEE.</bibtext> </blist> <blist> <bibtext> Lane, S. (2013). Security issues in writing assessment. In Wollack, J.A. & Fremer, J.J. (Eds.), Handbook of test security (Chapter 5, pp. 101–124). Routledge.</bibtext> </blist> <blist> <bibtext> Lochbaum, K. E., Rosenstein, M., Foltz, P., & Derr, M. A. (2013, April). Detection of gaming in automated scoring of essays with the IEA. In 75th Annual meeting of NCME (p. 121).</bibtext> </blist> <blist> <bibtext> Lu, X., Zhang, S., Hui, P., & Lio, P. (2020). Continuous authentication by free‐text keystroke based on CNN and RNN. Computers & Security, 96, 101861.</bibtext> </blist> <blist> <bibtext> Mahesh, B. (2020). Machine learning algorithms‐a review. International Journal of Science and Research (IJSR), 9(1), 381–386.</bibtext> </blist> <blist> <bibtext> Monaco, J. V., Bakelman, N., Cha, S.‐H., & Tappert, C. C. (2012). Developing a keystroke biometric system for continual authentication of computer users. In Intelligence and Security Informatics Conference (EISIC), 2012 European (pp. 210–216). https://doi.org/10.1109/EISIC.2012.58.</bibtext> </blist> <blist> <bibtext> Monaro, M., Galante, C., Spolaor, R., Li, Q. Q., Gamberini, L., Conti, M., & Sartori, G. (2018). Covert lie detection using keyboard dynamics. Scientific Reports, 8(1), 1976.</bibtext> </blist> <blist> <bibtext> Mungai, P. K., & Huang, R. (2017). Using keystroke dynamics in a multi‐level architecture to protect online examinations from impersonation. In 2017 IEEE 2nd International Conference on Big Data Analysis (ICBDA) (pp. 622–627). 
IEEE.</bibtext> </blist> <blist> <bibtext> Parkinson, S., Khan, S., Crampton, A., Xu, Q., Xie, W., Liu, N., & Dakin, K. (2021). Password policy characteristics and keystroke biometric authentication. IET Biometrics, 10(2), 163–178.</bibtext> </blist> <blist> <bibtext> Pentel, A. (2017). Predicting age and gender by keystroke dynamics and mouse patterns. In Adjunct Publication of the 25th Conference on User Modeling, Adaptation and Personalization (pp. 381–385).</bibtext> </blist> <blist> <bibtext> Sadasivan, V. S., Kumar, A., Balasubramanian, S., Wang, W., & Feizi, S. (2023). Can AI‐generated text be reliably detected? arXiv preprint arXiv:2303.11156.</bibtext> </blist> <blist> <bibtext> Schneider, J., Bernstein, A., Vom Brocke, J., Damevski, K., & Shepherd, D. C. (2017). Detecting plagiarism based on the creation process. IEEE Transactions on Learning Technologies, 11(3), 348–361.</bibtext> </blist> <blist> <bibtext> Sun, L., Wang, Y., Cao, B., Yu, P. S., Srisa‐An, W., & Leow, A. D. (2017). Sequential keystroke behavioral biometrics for mobile user identification via multi‐view deep learning. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2017, Skopje, Macedonia, September 18–22, 2017, Proceedings, Part III 17 (pp. 228–240). Springer International Publishing.</bibtext> </blist> <blist> <bibtext> Sun, Y., Ceker, H., & Upadhyaya, S. (2016). Shared keystroke dataset for continuous authentication. In 2016 IEEE international workshop on information forensics and security (WIFS) (pp. 1–6). IEEE.</bibtext> </blist> <blist> <bibtext> Tappert, C. C., Villani, M., & Cha, S.‐H. (2009). Keystroke biometric identification and authentication on long‐text input. Behavioral biometrics for human identification: Intelligent applications, 342–367.</bibtext> </blist> <blist> <bibtext> Trezise, K., Ryan, T., de Barba, P., & Kennedy, G. (2019). Detecting academic misconduct using learning analytics. 
Journal of Learning Analytics, 6(3), 90–104.</bibtext> </blist> <blist> <bibtext> Van Steendam, E., Vandermeulen, N., De Maeyer, S., Lesterhuis, M., Van den Bergh, H., & Rijlaarsdam, G. (2022). How students perform synthesis tasks: An empirical study into dynamic process configurations. Journal of Educational Psychology, 114(8), 1773.</bibtext> </blist> <blist> <bibtext> Xiong, J., Wang, S., Tang, C., Liu, Q., Sheng, R., Wang, B., Kuang, H., Cohen, A. S., & Xiong, X. (2024). Sequential reservoir computing for log file‐based behavior process data analyses. Journal of Educational Measurement.</bibtext> </blist> <blist> <bibtext> Yan, D., Fauss, M., Hao, J., & Cui, W. (2023). Detection of AI‐generated essays in writing assessment. Psychological Testing and Assessment Modeling, 65(2), 125–144.</bibtext> </blist> </ref> <aug> <p>By Paul Deane; Mo Zhang; Jiangang Hao and Chen Li</p> <p>PAUL D. DEANE is a Principal Research Scientist at ETS Research Institute, ETS, 660 Rosedale Road, Princeton, NJ 08541; pdeane@ets.org. His primary research interests include automated writing evaluation, reading and writing assessment, and computational analysis of textual and writing process data.</p> <p>JIANGANG HAO is a Research Director at ETS Research Institute, ETS, 660 Rosedale Road, Princeton, NJ 08541; jhao@ets.org. His primary research interests include leveraging data science, artificial intelligence and psychometrics to assess complex skills.</p> <p>MO ZHANG is a Senior Research Scientist at ETS, 660 Rosedale Road, Princeton, NJ, 08541; mzhang@ets.org. Her primary research interests include psychometric methods as applied in educational and psychological assessments, and the integration of AI in teaching, learning, and assessments.</p> <p>CHEN LI is a Principal Data Analyst at ETS, 660 Rosedale Rd, Princeton, NJ 08542; cli@ets.org. 
Her primary research interests include AI Scoring and psychometric measurement.</p> </aug> <nolink nlid="nl1" bibid="bib38" firstref="ref1"></nolink> <nolink nlid="nl2" bibid="bib11" firstref="ref2"></nolink> <nolink nlid="nl3" bibid="bib18" firstref="ref3"></nolink> <nolink nlid="nl4" bibid="bib13" firstref="ref4"></nolink> <nolink nlid="nl5" bibid="bib25" firstref="ref5"></nolink> <nolink nlid="nl6" bibid="bib57" firstref="ref6"></nolink> <nolink nlid="nl7" bibid="bib28" firstref="ref8"></nolink> <nolink nlid="nl8" bibid="bib16" firstref="ref9"></nolink> <nolink nlid="nl9" bibid="bib43" firstref="ref10"></nolink> <nolink nlid="nl10" bibid="bib24" firstref="ref11"></nolink> <nolink nlid="nl11" bibid="bib35" firstref="ref14"></nolink> <nolink nlid="nl12" bibid="bib51" firstref="ref15"></nolink> <nolink nlid="nl13" bibid="bib59" firstref="ref16"></nolink> <nolink nlid="nl14" bibid="bib36" firstref="ref17"></nolink> <nolink nlid="nl15" bibid="bib33" firstref="ref18"></nolink> <nolink nlid="nl16" bibid="bib22" firstref="ref21"></nolink> <nolink nlid="nl17" bibid="bib55" firstref="ref22"></nolink> <nolink nlid="nl18" bibid="bib46" firstref="ref23"></nolink> <nolink nlid="nl19" bibid="bib56" firstref="ref25"></nolink> <nolink nlid="nl20" bibid="bib23" firstref="ref28"></nolink> <nolink nlid="nl21" bibid="bib52" firstref="ref33"></nolink> <nolink nlid="nl22" bibid="bib41" firstref="ref34"></nolink> <nolink nlid="nl23" bibid="bib37" firstref="ref36"></nolink> <nolink nlid="nl24" bibid="bib49" firstref="ref37"></nolink> <nolink nlid="nl25" bibid="bib17" firstref="ref38"></nolink> <nolink nlid="nl26" bibid="bib34" firstref="ref39"></nolink> <nolink nlid="nl27" bibid="bib31" firstref="ref43"></nolink> <nolink nlid="nl28" bibid="bib32" firstref="ref44"></nolink> <nolink nlid="nl29" bibid="bib40" firstref="ref45"></nolink> <nolink nlid="nl30" bibid="bib48" firstref="ref46"></nolink> <nolink nlid="nl31" bibid="bib15" firstref="ref47"></nolink> <nolink nlid="nl32" bibid="bib14" 
firstref="ref48"></nolink> <nolink nlid="nl33" bibid="bib50" firstref="ref49"></nolink> <nolink nlid="nl34" bibid="bib29" firstref="ref50"></nolink> <nolink nlid="nl35" bibid="bib47" firstref="ref51"></nolink> <nolink nlid="nl36" bibid="bib53" firstref="ref53"></nolink> <nolink nlid="nl37" bibid="bib10" firstref="ref54"></nolink> <nolink nlid="nl38" bibid="bib44" firstref="ref55"></nolink> <nolink nlid="nl39" bibid="bib26" firstref="ref61"></nolink> <nolink nlid="nl40" bibid="bib30" firstref="ref66"></nolink> <nolink nlid="nl41" bibid="bib45" firstref="ref68"></nolink> <nolink nlid="nl42" bibid="bib19" firstref="ref72"></nolink>
Header DbId: eric
DbLabel: ERIC
An: EJ1501462
AccessLevel: 3
PubType: Academic Journal
PubTypeId: academicJournal
Items – Name: Title
  Label: Title
  Group: Ti
  Data: Using Keystroke Dynamics to Detect Nonoriginal Text
– Name: Language
  Label: Language
  Group: Lang
  Data: English
– Name: Author
  Label: Authors
  Group: Au
  Data: Paul Deane, Mo Zhang, Jiangang Hao, Chen Li
– Name: TitleSource
  Label: Source
  Group: Src
  Data: Journal of Educational Measurement. 2026 63(1).
– Name: Avail
  Label: Availability
  Group: Avail
  Data: Wiley. Available from: John Wiley & Sons, Inc. 111 River Street, Hoboken, NJ 07030. Tel: 800-835-6770; e-mail: cs-journals@wiley.com; Web site: https://www.wiley.com/en-us
– Name: PeerReviewed
  Label: Peer Reviewed
  Group: SrcInfo
  Data: Y
– Name: Pages
  Label: Page Count
  Group: Src
  Data: 32
– Name: DatePubCY
  Label: Publication Date
  Group: Date
  Data: 2026
– Name: TypeDocument
  Label: Document Type
  Group: TypDoc
  Data: Journal Articles; Reports - Research
– Name: Subject
  Label: Descriptors
  Group: Su
  Data: Keyboarding (Data Entry), Word Processing, Writing (Composition), Natural Language Processing, Automation, Identification, Essays, Artificial Intelligence, Accuracy, Writing Evaluation, Plagiarism
– Name: DOI
  Label: DOI
  Group: ID
  Data: 10.1111/jedm.12431
– Name: ISSN
  Label: ISSN
  Group: ISSN
  Data: 0022-0655; 1745-3984
– Name: Abstract
  Label: Abstract
  Group: Ab
  Data: Keystroke analysis has often been used for security purposes, typically to authenticate users and identify impostors. This paper examines the use of keystroke analysis to distinguish between the behavior of writers who are composing an original text and that of writers who are copying or otherwise reproducing a non-original text. Recent advances in text generation using large language models make the use of behavioral cues to identify plagiarism more pressing, since users seeking an advantage on a writing assessment may be able to submit unique AI-generated texts. We examine the use of keystroke log analysis to detect non-original text under three conditions: a laboratory study, where participants were either copying a known text or drafting an original essay, and two studies from operational assessments, where it was possible to identify essays that were non-original by reference to their content. Our results indicate that it is possible to achieve accuracies in excess of 94% under ideal conditions where the nature of each writing session is known in advance, and greater than 89% in operational conditions where proxies for non-original status, such as similarity to other submitted essays, must be used.
– Name: AbstractInfo
  Label: Abstractor
  Group: Ab
  Data: As Provided
– Name: DateEntry
  Label: Entry Date
  Group: Date
  Data: 2026
– Name: AN
  Label: Accession Number
  Group: ID
  Data: EJ1501462
PLink https://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=eric&AN=EJ1501462
RecordInfo BibRecord:
  BibEntity:
    Identifiers:
      – Type: doi
        Value: 10.1111/jedm.12431
    Languages:
      – Text: English
    PhysicalDescription:
      Pagination:
        PageCount: 32
    Subjects:
      – SubjectFull: Keyboarding (Data Entry)
        Type: general
      – SubjectFull: Word Processing
        Type: general
      – SubjectFull: Writing (Composition)
        Type: general
      – SubjectFull: Natural Language Processing
        Type: general
      – SubjectFull: Automation
        Type: general
      – SubjectFull: Identification
        Type: general
      – SubjectFull: Essays
        Type: general
      – SubjectFull: Artificial Intelligence
        Type: general
      – SubjectFull: Accuracy
        Type: general
      – SubjectFull: Writing Evaluation
        Type: general
      – SubjectFull: Plagiarism
        Type: general
    Titles:
      – TitleFull: Using Keystroke Dynamics to Detect Nonoriginal Text
        Type: main
  BibRelationships:
    HasContributorRelationships:
      – PersonEntity:
          Name:
            NameFull: Paul Deane
      – PersonEntity:
          Name:
            NameFull: Mo Zhang
      – PersonEntity:
          Name:
            NameFull: Jiangang Hao
      – PersonEntity:
          Name:
            NameFull: Chen Li
    IsPartOfRelationships:
      – BibEntity:
          Dates:
            – D: 01
              M: 03
              Type: published
              Y: 2026
          Identifiers:
            – Type: issn-print
              Value: 0022-0655
            – Type: issn-electronic
              Value: 1745-3984
          Numbering:
            – Type: volume
              Value: 63
            – Type: issue
              Value: 1
          Titles:
            – TitleFull: Journal of Educational Measurement
              Type: main