Using Simulated Retests to Estimate the Reliability of Diagnostic Assessment Systems

Bibliographic Details
Title:	Using Simulated Retests to Estimate the Reliability of Diagnostic Assessment Systems
Language:	English
Authors:	Thompson, W. Jake (ORCID 0000-0001-7339-0300), Nash, Brooke (ORCID 0000-0001-9858-7062), Clark, Amy K. (ORCID 0000-0002-5804-8336), Hoover, Jeffrey C. (ORCID 0000-0002-0276-0308)
Source:	Journal of Educational Measurement. Fall 2023 60(3):455-475.
Availability:	Wiley. Available from: John Wiley & Sons, Inc. 111 River Street, Hoboken, NJ 07030. Tel: 800-835-6770; e-mail: cs-journals@wiley.com; Web site: https://www.wiley.com/en-us
Peer Reviewed:	Y
Page Count:	21
Publication Date:	2023
Sponsoring Agency:	Office of Special Education Programs (OSEP) (ED/OSERS)
Contract Number:	84373X100001
Document Type:	Journal Articles; Reports - Research
Descriptors:	Diagnostic Tests, Simulation, Test Reliability, Accuracy, Language Proficiency, English, Evaluation Methods
DOI:	10.1111/jedm.12359
ISSN:	0022-0655; 1745-3984
Abstract:	As diagnostic classification models become more widely used in large-scale operational assessments, we must give consideration to the methods for estimating and reporting reliability. Researchers must explore alternatives to traditional reliability methods that are consistent with the design, scoring, and reporting levels of diagnostic assessment systems. In this article, we describe and evaluate a method for simulating retests to summarize reliability evidence at multiple reporting levels. We evaluate how the performance of reliability estimates from simulated retests compares to other measures of classification consistency and accuracy for diagnostic assessments that have previously been described in the literature, but which limit the level at which reliability can be reported. Overall, the findings show that reliability estimates from simulated retests are an accurate measure of reliability and are consistent with other measures of reliability for diagnostic assessments. We then apply this method to real data from the Examination for the Certificate of Proficiency in English to demonstrate the method in practice and compare reliability estimates from observed data. Finally, we discuss implications for the field and possible next directions.
Abstractor:	As Provided
Entry Date:	2023
Accession Number:	EJ1391123
Database:	ERIC
<title id="AN0171370828-1">Using Simulated Retests to Estimate the Reliability of Diagnostic Assessment Systems</title> <p>Reliability of an assessment is a necessary and important source of validity evidence. Consistency of measurement must be demonstrated to support the valid interpretation and use of results. In the oft‐given example, using a measuring tape to measure the length of a box should produce the same result each time. The same can be said of measurement in education. If a test could be administered twice at the same point in time (e.g., no additional learning, practice effects, etc.), and the test provides an accurate measurement of knowledge, skills, and understandings, the respondent should, in theory, receive the same score from each administration. This is the concept behind test‐retest reliability (Guttman, [<reflink idref="bib15" id="ref1">15</reflink>]). Instances in which scores vary from one administration to the next indicate that the assessment lacks precision and that results are conflated with measurement error, which has an obvious negative impact on the validity of inferences made from the results.</p> <p>In large‐scale standardized testing environments, it is often impractical to administer the same assessment twice. Retest estimates may also be attenuated if knowledge is not retained between administrations or inflated if a practice effect is observed. For these reasons, reliability methods for operational programs often approximate test‐retest reliability through other means. For example, Cronbach's ([<reflink idref="bib8" id="ref2">8</reflink>]) coefficient alpha is one of the most commonly reported metrics of reliability for educational assessments. 
Rather than administering a test over two occasions, as is done for test‐retest reliability, coefficient alpha determines the average of all the possible split‐half reliability calculations for the assessment and represents the ratio of true score variance to observed score variance, effectively treating the halves as separate forms administered at the same time.</p> <p>Selection of a method for estimating the reliability of an assessment depends on several factors, including the design of the assessment, the scoring model used to provide results, and the availability of data. The guidelines put forth by the <emph>Standards for Educational and Psychological Testing</emph> (<emph>Standards</emph> hereafter; American Educational Research Association [AERA] et al., [<reflink idref="bib1" id="ref3">1</reflink>]) specify a number of considerations for reporting reliability of assessment results. For the purposes of this article, we focus on three specific standards:</p> <p></p> <ulist> <item> Standard 2.2: "The evidence provided for the reliability/precision of the scores should be consistent with the domain of replications associated with the testing procedures, and with the intended interpretations for use of the test scores" (p. 42).</item> <p></p> <item> Standard 2.3: "For each total score, subscore, or combination of scores that is to be interpreted, estimates of relevant indices of reliability/precision should be reported" (p. 43).</item> <p></p> <item> Standard 2.5: "Reliability estimation procedures should be consistent with the structure of the test" (p. 43).</item> </ulist> <p>Because classical test theory (CTT) and item‐response theory (IRT) models have dominated the field of educational measurement, methods for evaluating reliability aligned to these models have similarly dominated the reliability literature (Brennan, [<reflink idref="bib5" id="ref4">5</reflink>]; Haertel, [<reflink idref="bib16" id="ref5">16</reflink>]). 
While methods of obtaining traditional reliability estimates are well understood and documented, there is far less research on methods for calculating the reliability of assessment results derived from less commonly applied statistical models, namely, diagnostic classification models (DCMs).</p> <hd id="AN0171370828-2">Diagnostic Classification Models</hd> <p>DCMs, also known as cognitive diagnosis models (CDMs; e.g., Leighton &amp; Gierl, [<reflink idref="bib22" id="ref6">22</reflink>]), are confirmatory latent class models that represent the relationship of observed item responses to a set of categorical latent variables (e.g., Bradshaw, [<reflink idref="bib2" id="ref7">2</reflink>]; Rupp et al., [<reflink idref="bib30" id="ref8">30</reflink>]). Whereas traditional psychometric models (e.g., IRT) model a single, or occasionally multiple, continuous latent variables, DCMs model respondent mastery on a number of discrete latent variables (i.e., skills). Thus, a benefit of using DCMs for calibrating and scoring operational assessments is their ability to support instruction by providing fine‐grained reporting at the individual skill level.</p> <p>To provide detailed profiles of respondent mastery of skills measured by the assessment, DCMs require the specification of an item‐by‐skill (also referred to as item‐by‐attribute) matrix known as the Q‐matrix (Tatsuoka, [<reflink idref="bib34" id="ref9">34</reflink>]). Based on the collected item‐response data, the model determines the overall probability of respondents being classified into each latent class. The latent classes for DCMs are typically binary mastery status (master or nonmaster). This base‐rate probability of mastery (i.e., the structural parameter) is then related to respondents' individual response data to determine the respondents' posterior probability of mastery for each assessed skill. The posterior probability is on a scale of 0 to 1 and represents the certainty the respondent has mastered each skill. 
Values closer to the scale extremes of 0 or 1 indicate greater certainty in the classification; a value of 0 indicates the respondent has definitely not mastered the skill, and a value of 1 indicates the respondent definitely has mastered the skill. In contrast, values closer to .50 indicate greater uncertainty in the classification. A mastery probability of .50 indicates the model cannot distinguish whether, on the basis of the available response data, the respondent has mastered the skill; the respondent is just as likely a master as a nonmaster. Diagnostic assessment results are typically reported as the mastery probability values or as dichotomous mastery statuses when a threshold for demonstrating mastery is imposed (e.g., .80). The dichotomous mastery statuses can also be aggregated into a skill‐mastery profile for reporting results.</p> <p>The diagnostic scoring approach is unique in that the probability of mastery provides an indication of error or, conversely, certainty, for each skill and examinee. That is, the skills are modeled as binary variables where the mastery probability, <emph>p</emph>, is the expected value, and the variance is derived directly from the mastery probability as <emph>p</emph>(1 – <emph>p</emph>). However, the mastery probability does not provide information about consistency of measurement for the skill or for the assessment as a whole. 
Furthermore, because assessment results are the collection of skill‐mastery results, rather than a total raw or scale score, traditional approaches to reliability are not appropriate, and alternate methods must be considered for reporting the reliability of operational assessment results.</p> <hd id="AN0171370828-3">Measuring the Reliability of Diagnostic Assessments</hd> <p>Because DCMs have not been widely used in operational or applied settings (Ravand &amp; Baghaei, [<reflink idref="bib27" id="ref10">27</reflink>]; Sessoms &amp; Henson, [<reflink idref="bib31" id="ref11">31</reflink>]), there has been limited research examining how best to report the reliability of classifications from a DCM‐based assessment. However, there has been recent theoretical research on reliability methods for DCMs (for a review, see Sinharay &amp; Johnson, [<reflink idref="bib33" id="ref12">33</reflink>]). In general, this research has been divided into two segments, depending on how stakeholders intend to report results for the assessment. If results are reported as the probability of mastery for each skill, then reliability should be reported as the precision of the estimated probability, similar to a conditional standard error of measurement in an IRT‐based assessment (e.g., Johnson &amp; Sinharay, [<reflink idref="bib20" id="ref13">20</reflink>]; Templin &amp; Bradshaw, [<reflink idref="bib35" id="ref14">35</reflink>]). In contrast, when results are reported as a binary classification (i.e., master or nonmaster) at the skill level, reliability is conceptualized as classification consistency and classification accuracy (for a thorough review of classification as a measure of reliability, see Johnson &amp; Sinharay, [<reflink idref="bib19" id="ref15">19</reflink>]). 
This classification‐based reliability is the focus of this article.</p> <p> <emph>Classification accuracy</emph> is defined as the probability that an examinee receives a classification that is consistent with his or her true mastery status. <emph>Classification consistency</emph> is defined as the probability that an examinee receives the same classification across multiple administrations of an assessment (Cui et al., [<reflink idref="bib9" id="ref16">9</reflink>]). While classification‐based reliability in DCMs can be evaluated at multiple levels (e.g., skill level, profile level; Cui et al., [<reflink idref="bib9" id="ref17">9</reflink>]; Johnson &amp; Sinharay, [<reflink idref="bib19" id="ref18">19</reflink>]; Wang et al., [<reflink idref="bib40" id="ref19">40</reflink>]), the definitions for classification accuracy and consistency are not altered by the level of analysis. Thus, the same statistical procedures can be used to estimate reliability in DCMs at each of these levels.</p> <p>Early research on reliability in DCMs was conducted by Cui et al. ([<reflink idref="bib9" id="ref20">9</reflink>]). They defined the cognitive diagnostic classification accuracy index and the cognitive diagnostic classification consistency index at the profile level. These indices provide the marginal probability of classifying an examinee accurately and consistently, respectively, at the profile level (Cui et al., [<reflink idref="bib9" id="ref21">9</reflink>]). However, these indices do not allow for evaluating accuracy and consistency at the skill level.</p> <p>For assessments reporting results at the skill level, reliability evidence at the skill level should also be reported (e.g., Standard 2.3; Standard 2.5; Sinharay &amp; Haberman, [<reflink idref="bib32" id="ref22">32</reflink>]). Wang et al. ([<reflink idref="bib40" id="ref23">40</reflink>]) extended the work of Cui et al. 
([<reflink idref="bib9" id="ref24">9</reflink>]) by defining classification accuracy and consistency indices at the skill level. Wang et al. ([<reflink idref="bib40" id="ref25">40</reflink>]) calculated skill‐level classification accuracy and consistency as the proportion of examinees classified accurately and consistently within each skill's mastery status (i.e., masters and nonmasters).</p> <p>While the classification accuracy and consistency indices defined by Wang et al. ([<reflink idref="bib40" id="ref26">40</reflink>]) allow for calculating classification‐based reliability at the skill level, Johnson and Sinharay ([<reflink idref="bib19" id="ref27">19</reflink>]) noted these indices rely on the assumption that the posterior probabilities are constant across parallel forms of a test. Using a simple counterexample, Johnson and Sinharay demonstrated that this assumption is easily violated. They defined modified skill‐level classification accuracy and consistency indices using consistent estimators, and they provided interpretive guidelines for these new indices. Johnson and Sinharay also suggested calculating other commonly reported indices to estimate reliability in DCMs, including Youden's ([<reflink idref="bib43" id="ref28">43</reflink>]) statistic, Goodman and Kruskal's ([<reflink idref="bib14" id="ref29">14</reflink>]) lambda, Cohen's ([<reflink idref="bib7" id="ref30">7</reflink>]) kappa, the tetrachoric correlation (Pearson, [<reflink idref="bib25" id="ref31">25</reflink>]), and sensitivity and specificity (Yerushalmy, [<reflink idref="bib42" id="ref32">42</reflink>]).</p> <hd id="AN0171370828-4">Limitations of Current Classification‐Based Reliability</hd> <p>The classification‐based reliability indices defined by Cui et al. ([<reflink idref="bib9" id="ref33">9</reflink>]), Wang et al. 
([<reflink idref="bib40" id="ref34">40</reflink>]), and Johnson and Sinharay ([<reflink idref="bib19" id="ref35">19</reflink>]) can be calculated using data from a single administration, which acknowledges the impracticality of administering large‐scale assessments multiple times. However, the existing classification‐based reliability indices are limited to reporting reliability evidence at the skill and profile levels. This limitation may be problematic if results are aggregated and reported at a different level. For example, results may be reported as the total number of skills mastered or aggregated into an overall performance level (e.g., for state accountability systems) or pass/fail determinations (e.g., certification and licensure), yet the existing classification‐based reliability indices do not support reporting reliability evidence at these levels. To illustrate, if classification consistency is calculated for five separate skills, those individual estimates cannot be combined into a single measure of reliability for the total number of skills mastered. Thus, there is a need for methods of calculating classification‐based reliability that are flexible enough to support reporting at multiple levels and to provide the evidence recommended by Standards 2.2, 2.3, and 2.5 (AERA et al., [<reflink idref="bib1" id="ref36">1</reflink>]).</p> <hd id="AN0171370828-5">Simulation‐Retest Reliability</hd> <p>Roussos et al. ([<reflink idref="bib29" id="ref37">29</reflink>]) explained how data simulated from DCM parameters calibrated on real data can be used to produce summary statistics for evaluating a model, including several types of reliability indices. Specifically, the proportion of times each examinee is classified correctly for each skill was described as providing an estimate of the correspondence between the estimated skill classification in the observed and simulated data. 
Similarly, the proportion of times each examinee is classified to the same category (e.g., masters or nonmasters) across two parallel tests was described as providing an estimate of test‐retest consistency.</p> <p>Templin and Bradshaw ([<reflink idref="bib35" id="ref38">35</reflink>]) conducted a research study using a hypothetical second test administration to compare reliability estimates from a DCM to those of an IRT model for the same set of data collected from a single, fixed‐form assessment administered to approximately 2,300 students. Rather than using a diagnostic assessment constructed with the purpose of reporting results at the skill level, this application retrofitted a DCM to existing large‐scale assessment data designed to measure a single construct so that the assignment of items to skills was imposed post hoc. The researchers used posterior probabilities of mastery to calculate the probability of being assigned to each mastery profile and compared these probabilities to random draws from the theta distribution for the IRT‐scored assessment. Reliability results comparing the mastery statuses obtained from the DCM were reported with a tetrachoric correlation for each skill in the model. While their main findings demonstrated that the DCM produced higher reliability estimates than those obtained from the IRT model for a test of the same length, they also demonstrated that hypothetical retest methods may be useful for evaluating reliability.</p> <p>To report reliability evidence at multiple levels, a simulation‐retest methodology is one method for evaluating reliability of diagnostic assessment results. Conceptually, a second administration of an assessment can be simulated on the basis of the administered assessment. By simulating a second administration, scores from two assessments are available, providing a means for evaluating retest reliability in the traditional sense (i.e., consistency of scores across multiple administrations). 
The simulation‐retest approach differs from other CTT methods that report an estimate of the correlation between total scores from two forms, administrations, or halves of a test. Instead, a simulation‐retest approach reports the correspondence between the estimated mastery statuses in the observed and simulated data, and the interpretation of the reliability results remains the same as for CTT methods. That is, reliability estimates are provided on a metric of 0 to 1, with values of 0 being perfectly unreliable and all variation attributed to measurement error, and values of 1 being perfectly reliable and all variation attributed to respondent differences on the construct measured by the assessment.</p> <p>Consistent with existing classification‐based reliability procedures, the simulation‐retest methodology can be used to estimate the classification accuracy and consistency between the observed and simulated data at the skill and profile levels. However, the simulation‐retest methodology also allows for estimating reliability for other aggregated reporting levels. It is possible to compare, for example, overall performance level in the observed and simulated administrations, and the reliability indices can be calculated to compare the consistency of the performance level determination. Similarly, the simulation‐retest methodology can be used to estimate reliability at other levels of reporting.</p> <p>Thompson et al. ([<reflink idref="bib39" id="ref39">39</reflink>]) demonstrated how a simulation‐retest method could be used to estimate the reliability of assessments scaled with DCMs at different levels of reporting. Thompson et al. applied the simulation‐retest method to provide reliability evidence at multiple levels of reporting that are used for an operational, large‐scale state assessment. 
The purpose of the current article is to provide an in‐depth description of the simulation‐retest method for estimating reliability and to compare the results from applying the simulation‐retest method to those from other existing nonsimulation‐based methods. Because existing nonsimulation‐based methods cannot report reliability evidence at all levels at which results may be reported when mastery results are aggregated, the simulation‐retest method offers a means for reporting reliability evidence at any level used for reporting results. Consequently, it is important to compare the reliability estimates from the simulation‐retest and nonsimulation‐based methods at the skill level to generally demonstrate the accuracy and consistency of the simulation‐retest method.</p> <hd id="AN0171370828-6">Calculating reliability estimates</hd> <p>The general approach to the simulation‐retest reliability method, as described by Thompson et al. ([<reflink idref="bib39" id="ref40">39</reflink>]), is to simulate a second set of responses based on actual respondent performance and calibrated‐model parameters, score both the real‐test data and the simulated‐test data, and compare respondents' estimated mastery statuses for the observed and simulated data. That is, once response data have been collected, calibrated, and scored, a second administration can be simulated using the known model parameters from the first (i.e., real) administration. In the context of using DCMs to calibrate and score the assessment, respondent performance is the set of mastery statuses for each skill. 
The threshold for mastery status must be specified before calculating reliability.</p> <p>When calculating skill‐level classifications, a threshold is specified to distinguish masters and nonmasters, recognizing that values farther from .50 indicate greater certainty in the classification (e.g., respondents with a mastery probability of .80 or higher are classified as masters, and all others are classified as nonmasters). In applications of this methodology, the threshold value may vary depending on the design of the assessment, respondent population, stakeholder feedback, or other factors.</p> <p>Applying the mastery threshold to the posterior probabilities of mastery obtained from the diagnostic scoring model results in a dichotomous mastery status for each skill measured by the assessment. The mastery status is one level of reporting results for diagnostic assessments and, therefore, one level at which reliability should be summarized. Because the scoring model produces mastery decisions, the term "results" is used instead of the term "scores" throughout this article.</p> <p>The specific steps for a DCM‐based simulation are as follows:</p> <p></p> <ulist> <item> Step 1. Sample respondent record: Sample with replacement a respondent record from the operational data set. 
The respondent's mastery status or posterior probability of mastery from the operational scoring for each measured skill serves as the true value for the simulated respondent.</item> <p></p> <item> Step 2. Simulate second administration: For each item the respondent was administered, simulate a new response that is based on the model‐calibrated parameters, conditional on the true mastery probability or status for the skill.</item> <p></p> <item> Step 3. Score simulated responses: Using the operational scoring method, assign mastery status by imposing a threshold for mastery on the posterior probability of mastery obtained from the model.</item> <p></p> <item> Step 4. Repeat: Repeat the steps for a predetermined number of simulated respondents.</item> </ulist> <p>Step 1 draws respondent records from the operational data, and Step 2 simulates a second administration. This process ensures the simulation‐retest method replicates results from real examinees using the actual set of items each examinee has taken, which means that the two administrations are perfectly parallel. The sampling procedure is similar to bootstrap sampling, which is commonly used for estimating uncertainty (Efron, [<reflink idref="bib13" id="ref41">13</reflink>]), and, because respondents are sampled with replacement, the sampling method can account for uncertainty in the population completing the assessment. Additionally, because the model‐calibrated parameters are used in Step 2, there is an implicit assumption that the model fits the data. Therefore, practitioners should verify adequate model fit to ensure that the simulated retests adequately reflect the observed data generation process. In Step 3, the operational scoring procedure is applied to both the observed and simulated response data to calculate the posterior probability of mastery.</p> <p>To calculate reliability indices, the estimated skill‐mastery statuses for the observed and simulated data are compared across all replications determined in Step 4. 
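</p> <p>The four steps above can be sketched in code. The following is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a single skill, a hypothetical item model with one correct‐response probability for nonmasters and one for masters, made‐up parameter values and response records, and it treats the observed mastery status (rather than the posterior probability) as the true value in Step 1. All function and variable names are illustrative.</p>

```python
import random

THRESHOLD = 0.80  # example mastery threshold on the posterior probability

# Hypothetical calibrated parameters for five items measuring one skill:
# (P(correct | nonmaster), P(correct | master)).
ITEMS = [(0.20, 0.85), (0.15, 0.90), (0.25, 0.80), (0.10, 0.90), (0.20, 0.85)]
BASE_RATE = 0.50  # hypothetical structural parameter (base rate of mastery)

# Hypothetical operational data: one 0/1 response vector per respondent.
RECORDS = [
    [1, 1, 1, 1, 1], [1, 1, 0, 1, 1], [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0], [1, 0, 1, 1, 1], [0, 1, 0, 0, 0],
]


def posterior_mastery(responses, items, base_rate):
    """Posterior probability of mastery given 0/1 responses (Bayes' rule)."""
    master, nonmaster = base_rate, 1.0 - base_rate
    for x, (p_non, p_mas) in zip(responses, items):
        master *= p_mas if x == 1 else 1.0 - p_mas
        nonmaster *= p_non if x == 1 else 1.0 - p_non
    return master / (master + nonmaster)


def classify(responses, items, base_rate, threshold=THRESHOLD):
    """Operational scoring: impose the threshold on the posterior."""
    return int(posterior_mastery(responses, items, base_rate) >= threshold)


def simulate_retest(records, items, base_rate, n_sim, rng):
    """Return table[observed][simulated] of mastery-status counts."""
    table = [[0, 0], [0, 0]]
    for _ in range(n_sim):  # Step 4: repeat for n_sim simulated respondents
        # Step 1: sample a record with replacement and score it.
        observed = classify(rng.choice(records), items, base_rate)
        # Step 2: simulate new responses conditional on the sampled status
        # (index 0 = nonmaster probability, index 1 = master probability).
        retest = [int(p[observed] > rng.random()) for p in items]
        # Step 3: score the simulated responses with the same procedure.
        simulated = classify(retest, items, base_rate)
        table[observed][simulated] += 1
    return table


table = simulate_retest(RECORDS, ITEMS, BASE_RATE, 2000, random.Random(2023))
agreement = (table[0][0] + table[1][1]) / 2000
print(table, round(agreement, 3))
```

<p>The returned counts cross‐classify the observed and simulated mastery statuses, and the agreement value is the proportion of simulated respondents who received the same status in both administrations.</p> <p>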
Specifically, for each skill, reliability results are calculated using the 2 × 2 contingency table of estimated mastery statuses from the observed and simulated data, as shown in Table 1.</p> <p>Table 1. 2 × 2 Contingency Table of Estimated Mastery in the Observed and Simulated Administrations</p> <p> <ephtml> &lt;table&gt;&lt;thead&gt;&lt;tr valign="bottom"&gt;&lt;th align="left" rowspan="2"&gt;Observed Mastery Status&lt;/th&gt;&lt;th colspan="2"&gt;Simulated Mastery Status&lt;/th&gt;&lt;/tr&gt;&lt;tr&gt;&lt;th&gt;0&lt;/th&gt;&lt;th&gt;1&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;n&lt;sub&gt;00&lt;/sub&gt;&lt;/td&gt;&lt;td&gt;n&lt;sub&gt;01&lt;/sub&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;n&lt;sub&gt;10&lt;/sub&gt;&lt;/td&gt;&lt;td&gt;n&lt;sub&gt;11&lt;/sub&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt; </ephtml> </p> <p><emph>Note</emph>. 0 = skill nonmastery; 1 = skill mastery.</p> <p>In this study, the performance of the simulation‐retest reliability method is evaluated by comparing reliability estimates from the simulation‐retest method with multiple nonsimulation‐based reliability indices across a variety of simulated conditions. Because the nonsimulation‐based methods are limited to estimating reliability at the skill and profile levels, we focus on skill‐level reliability estimates in this article. That is, we cannot compare the simulation‐retest estimate of performance‐level reliability to a corresponding measure from the Wang et al. ([<reflink idref="bib40" id="ref42">40</reflink>]) or Johnson and Sinharay ([<reflink idref="bib19" id="ref43">19</reflink>]) methods, as those methods only estimate skill‐level reliability. However, the benefit of the simulation‐retest method is that the same procedure can be used for other levels of reporting. 
For example, using the simulated‐retest method, we can calculate a performance level for both the observed and simulated data using an assessment's operational rules. If there were four performance levels, we would then create a 4 × 4 contingency table similar to Table 1, showing the observed and simulated performance levels. In addition to evaluating the comparability of the simulated‐retest reliability method to more traditional approaches through a simulation study, we also applied the simulated‐retest method in an empirical data analysis of the grammar subtest of the Examination for the Certificate of Proficiency in English (ECPE; Templin &amp; Hoffman, [<reflink idref="bib37" id="ref44">37</reflink>]), which was previously used by Sinharay and Johnson ([<reflink idref="bib33" id="ref45">33</reflink>]) to demonstrate the application of a variety of classification‐based reliability indices for DCMs.</p> <hd id="AN0171370828-7">Simulation Study</hd> <p>We conducted a simulation study to evaluate the accuracy of the reliability estimates from the simulation‐retest method described above. In this study, we manipulated the number of assessed skills (three, four, five) and the minimum number of items measuring each skill (three, four, five). These values were chosen as they represent common designs of applied DCMs (e.g., Bradshaw et al., [<reflink idref="bib3" id="ref46">3</reflink>]; Templin &amp; Hoffman, [<reflink idref="bib37" id="ref47">37</reflink>]; Thompson, [<reflink idref="bib38" id="ref48">38</reflink>]). Additionally, we manipulated the base rate of mastery (.10, .50, .90), the correlation between the assessed skills (.0, .35, .70), and item discrimination (1.0, 1.5, 2.0). These values represent low, moderate, and high levels of each factor and provide a wide range of plausible item and skill characteristics. 
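</p> <p>Crossing every level of the five manipulated factors yields the set of simulation conditions. The short sketch below enumerates that crossing; the dictionary keys are illustrative names chosen here, not identifiers from the authors' code.</p>

```python
from itertools import product

# Five manipulated factors, three levels each (key names are illustrative).
FACTORS = {
    "n_skills": [3, 4, 5],
    "min_items_per_skill": [3, 4, 5],
    "base_rate": [0.10, 0.50, 0.90],
    "skill_correlation": [0.00, 0.35, 0.70],
    "discrimination": [1.0, 1.5, 2.0],
}

# One dictionary per condition: every combination of factor levels.
conditions = [dict(zip(FACTORS, levels)) for levels in product(*FACTORS.values())]

REPETITIONS = 100  # repetitions per condition
print(len(conditions), len(conditions) * REPETITIONS)
```

<p>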
This simulation used a full factorial design, resulting in 243 total conditions with 100 repetitions per condition.</p> <p>All simulations and analyses of the results were conducted in R version 3.6.1 (R Core Team, [<reflink idref="bib26" id="ref49">26</reflink>]). All DCMs were estimated using a maximum likelihood estimator developed by the authors. The repetitions were conducted in parallel using the portableParallelSeeds package (Johnson, [<reflink idref="bib18" id="ref50">18</reflink>]) to ensure that each repetition could be accurately reproduced. Analysis of the simulation results was conducted using the tidyverse suite of R packages (Wickham et al., [<reflink idref="bib41" id="ref51">41</reflink>]). All R code for the simulation study and analysis of the results is available upon request from the first author.</p> <hd id="AN0171370828-8">Data Simulation</hd> <p>The simulation study is modeled on Johnson and Sinharay's ([<reflink idref="bib19" id="ref52">19</reflink>]) evaluation of skill‐level classification reliability indices. In the simulation for this study, each simulated assessment measured three, four, or five skills. The number of items included in each assessment (<emph>I</emph>) is the product of the number of assessed skills (<emph>A</emph>) and the minimum number of items measuring each skill (<emph>J</emph>; i.e., <emph>I</emph> = <emph>A</emph> × <emph>J</emph>). 
The Q‐matrix (Tatsuoka, [<reflink idref="bib34" id="ref53">34</reflink>]) is specified so that the first six items form an identity matrix, and each remaining item has a 50% chance of assessing a second skill in addition to the identity matrix. Consistent with Johnson and Sinharay ([<reflink idref="bib19" id="ref54">19</reflink>]), the items could not measure more than two skills. Table 2 presents an example Q‐matrix for an assessment measuring three skills with a minimum of three items per skill.</p> <p>Table 2 Example Q‐Matrix</p> <p> <ephtml> &lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Item&lt;/th&gt;&lt;th&gt;Skill 1&lt;/th&gt;&lt;th&gt;Skill 2&lt;/th&gt;&lt;th&gt;Skill 3&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;2&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;3&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;4&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;5&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;6&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;7&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;8&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;9&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt; </ephtml> </p> <p>The base rate of mastery and the distributions for the item parameters were also simulated according to the approach used by Johnson and Sinharay ([<reflink idref="bib19" id="ref55">19</reflink>]). 
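</p> <p>Returning to the Q‐matrix specification and Table 2, a minimal Python sketch of one way such a Q‐matrix could be generated is given below. It is not the authors' code. It assumes that the Q‐matrix is built from stacked identity blocks, that the first two blocks remain single‐skill items (six items in the three‐skill example), and that the second skill of a two‐skill item is chosen uniformly at random; the function name make_q_matrix and its arguments are hypothetical.</p>

```python
import random

def make_q_matrix(n_skills, items_per_skill, rng, p_second=0.5):
    """Build a Q-matrix as a list of 0/1 rows (one row per item).

    Assumption: items come in identity blocks of size n_skills. Items in the
    first two blocks measure one skill only; each later item gains a second,
    randomly chosen skill with probability p_second (never more than two).
    """
    q = []
    for block in range(items_per_skill):
        for skill in range(n_skills):
            row = [0] * n_skills
            row[skill] = 1  # identity-block entry for this item
            if block >= 2 and rng.random() < p_second:
                others = [s for s in range(n_skills) if s != skill]
                row[rng.choice(others)] = 1  # add one second skill
            q.append(row)
    return q

rng = random.Random(2023)
q_matrix = make_q_matrix(n_skills=3, items_per_skill=3, rng=rng)
for item_number, row in enumerate(q_matrix, start=1):
    print(item_number, row)
```

<p>With three skills and three items per skill, this produces nine items whose first six rows match the first six rows of Table 2; the last three rows vary with the random seed.</p> <p>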
The base rate of mastery for the first assessed skill was determined by the simulation condition, where 10%, 50%, or 90% of examinees mastered the first skill. The base rates of mastery for the remaining skills were determined by drawing a random number from a uniform distribution ranging from .2 to .8.</p> <p>The generating model for this simulation was a log‐linear cognitive diagnosis model (LCDM; Henson et al., [<reflink idref="bib17" id="ref56">17</reflink>]), meaning the item parameters include item intercepts, main effects, and interaction effects. The item intercepts, which correspond to the probability of a nonmaster correctly responding to the item, were drawn from a uniform distribution ranging from .00 to .35, following the approach used by Johnson and Sinharay ([<reflink idref="bib19" id="ref57">19</reflink>]). The item main effects, which correspond to the log odds increase in the probability for a master of the skill correctly responding to the item, were drawn from a truncated normal distribution with a mean of 1.0, 1.5, or 2.0 (representing low, moderate, and high discrimination, respectively) and a standard deviation of .17, where the values were constrained to be positive, using Johnson and Sinharay's ([<reflink idref="bib19" id="ref58">19</reflink>]) approach. The item interaction effects, which correspond to the log odds increase in the probability for a master of two skills correctly responding to the item, were also drawn from a truncated normal distribution with a mean of 1.0, 1.5, or 2.0 and a standard deviation of .17, but the values were constrained to be greater than negative one times the smallest item main effect (i.e., −1 × min[main effects]) to meet the monotonicity constraints of the LCDM (Henson et al., [<reflink idref="bib17" id="ref59">17</reflink>]). 
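</p> <p>To make the parameter distributions above concrete, the following Python sketch draws parameters for a single two‐skill item and evaluates the LCDM response probability for each mastery profile. It is an illustration under stated assumptions, not the authors' implementation: the intercept is assumed to be drawn on the probability scale and then converted to a logit, the truncated normal is implemented by simple rejection sampling, and all function names are hypothetical.</p>

```python
import math
import random

def logit(p):
    return math.log(p / (1.0 - p))

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

def truncated_normal(rng, mean, sd, lower):
    """Rejection sampler: redraw until the value exceeds `lower`."""
    while True:
        x = rng.gauss(mean, sd)
        if x > lower:
            return x

def draw_two_skill_item(rng, discrimination):
    # Intercept: nonmaster success probability ~ Uniform(.00, .35),
    # moved to the logit scale (tiny floor avoids log(0); an assumption).
    p_nonmaster = max(rng.uniform(0.0, 0.35), 1e-6)
    intercept = logit(p_nonmaster)
    # Main effects: truncated normal, constrained to be positive.
    main = [truncated_normal(rng, discrimination, 0.17, 0.0) for _ in range(2)]
    # Interaction: truncated normal, greater than -1 x min(main effects).
    interaction = truncated_normal(rng, discrimination, 0.17, -min(main))
    return intercept, main, interaction

def p_correct(intercept, main, interaction, alpha):
    """LCDM probability of a correct response for mastery profile alpha."""
    a1, a2 = alpha
    eta = intercept + main[0] * a1 + main[1] * a2 + interaction * a1 * a2
    return inv_logit(eta)

rng = random.Random(7)
intercept, main, interaction = draw_two_skill_item(rng, discrimination=1.5)
for alpha in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    print(alpha, round(p_correct(intercept, main, interaction, alpha), 3))
```

<p>Because the main effects are positive and the interaction exceeds −1 × min[main effects], the probability of a correct response never decreases when an additional skill is mastered, which is the monotonicity property referred to above.</p> <p>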
Like the other item parameters, the distribution for the item‐interaction‐effect parameters followed the approach used by Johnson and Sinharay ([<reflink idref="bib19" id="ref60">19</reflink>]).</p> <p>In this study, 2,000 respondents were simulated for each generated data set. For each generated data set, we fit both an LCDM and a deterministic‐input, noisy‐and‐gate (DINA; de la Torre &amp; Douglas, [<reflink idref="bib10" id="ref61">10</reflink>]; Junker &amp; Sijtsma, [<reflink idref="bib21" id="ref62">21</reflink>]) model. Because the generating model for each data set is an LCDM, the LCDM is expected to fit better than the DINA model, and these differences in model fit are expected to have implications for classification‐based reliability. Specifically, the DINA model is expected to demonstrate lower reliability because model misfit is present. Additionally, because the simulated‐retest method treats the calibrated‐model parameters as true, fitting the DINA model allows us to evaluate the potential impact of model misfit on the resulting reliability estimates.</p> <p>The generated data sets and the estimated model parameters were then used to create the simulated retests. When calculating the simulation‐retest reliability estimates, 100,000 respondents were drawn with replacement and simulated for the retest data (i.e., we created 100,000 simulated retests from the original sample of 2,000 generated students). A threshold of .50 was used in this simulation study to determine skill mastery, as this is a commonly used threshold in the literature (e.g., Bradshaw &amp; Levy, [<reflink idref="bib4" id="ref63">4</reflink>]; Templin &amp; Bradshaw, [<reflink idref="bib35" id="ref64">35</reflink>]). However, any threshold can be used in an operational setting. 
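</p> <p>The resample, simulate, and rescore loop described above can be sketched for a single skill as follows. This Python example is a simplified illustration and not the authors' R implementation. It assumes one skill measured by five single‐skill items with known response probabilities, assumes that each retest draws a mastery status from the resampled respondent's posterior probability before simulating responses, and uses 20,000 retests rather than 100,000 to keep the run short. All names and parameter values are hypothetical.</p>

```python
import random

def posterior_mastery(responses, p_non, p_mast, base_rate):
    """Posterior probability of mastery given scored (0/1) responses."""
    like_m, like_n = base_rate, 1.0 - base_rate
    for x, pn, pm in zip(responses, p_non, p_mast):
        like_m *= pm if x else 1.0 - pm
        like_n *= pn if x else 1.0 - pn
    return like_m / (like_m + like_n)

def simulate_responses(rng, is_master, p_non, p_mast):
    probs = p_mast if is_master else p_non
    return [1 if rng.random() < p else 0 for p in probs]

def simulated_retest_consistency(rng, posteriors, p_non, p_mast,
                                 base_rate, n_retests, threshold=0.5):
    """Proportion of retest classifications matching the original ones."""
    matches = 0
    for _ in range(n_retests):
        post = rng.choice(posteriors)    # resample a respondent with replacement
        is_master = rng.random() < post  # assumed: draw status from the posterior
        retest = simulate_responses(rng, is_master, p_non, p_mast)
        retest_post = posterior_mastery(retest, p_non, p_mast, base_rate)
        matches += (retest_post >= threshold) == (post >= threshold)
    return matches / n_retests

rng = random.Random(1)
p_non = [0.20, 0.25, 0.15, 0.30, 0.20]   # hypothetical nonmaster probabilities
p_mast = [0.85, 0.90, 0.80, 0.90, 0.85]  # hypothetical master probabilities
base_rate = 0.5

# "Original" administration: 2,000 respondents scored with the same parameters.
posteriors = []
for _ in range(2000):
    is_master = rng.random() < base_rate
    responses = simulate_responses(rng, is_master, p_non, p_mast)
    posteriors.append(posterior_mastery(responses, p_non, p_mast, base_rate))

consistency = simulated_retest_consistency(
    rng, posteriors, p_non, p_mast, base_rate, n_retests=20000)
print(round(consistency, 3))
```

<p>In a multiskill application the same loop would be run with the full estimated model, and the retest results could be aggregated in whatever way the assessment reports them before agreement is computed.</p> <p>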
The simulation‐retest classification consistency was then calculated as the proportion of respondent‐mastery classifications for each skill that matched the respondent's mastery status estimated from the original generated data set (i.e., how consistent the classifications were across the resampled respondents' simulated retests). Similarly, the simulation‐retest classification accuracy for each skill was calculated as the average probability associated with each mastery classification across all simulated retests for the resampled respondents.</p> <hd id="AN0171370828-9">Method Comparisons</hd> <p>The simulation‐retest reliability estimates for the LCDM were then compared to nonsimulation‐based methods for estimating the reliability of DCMs. Specifically, the simulation‐retest classification consistency was compared to the <ephtml> &lt;math display="inline" altimg="urn:x-wiley:00220655:media:jedm12359:jedm12359-math-0002" xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;semantics&gt;&lt;msub&gt;&lt;mover accent="true"&gt;&lt;mi&gt;P&lt;/mi&gt;&lt;mo&gt;&amp;#770;&lt;/mo&gt;&lt;/mover&gt;&lt;mrow&gt;&lt;mi&gt;c&lt;/mi&gt;&lt;mi&gt;k&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;annotation encoding="application/x-tex"&gt;${\hat P&amp;#95;{ck}}$&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt; </ephtml> classification consistency measure defined by Johnson and Sinharay ([<reflink idref="bib19" id="ref65">19</reflink>]; their equation 27). The simulation‐retest classification accuracy was compared to the <ephtml> &lt;math display="inline" altimg="urn:x-wiley:00220655:media:jedm12359:jedm12359-math-0003" xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;semantics&gt;&lt;msub&gt;&lt;mover accent="true"&gt;&lt;mi&gt;&amp;#964;&lt;/mi&gt;&lt;mo&gt;&amp;#770;&lt;/mo&gt;&lt;/mover&gt;&lt;mi&gt;k&lt;/mi&gt;&lt;/msub&gt;&lt;annotation encoding="application/x-tex"&gt;${\hat \tau &amp;#95;k}$&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt; </ephtml> measure defined by Wang et al. 
([<reflink idref="bib40" id="ref66">40</reflink>]; their equation 11). This measure was denoted as <ephtml> &lt;math display="inline" altimg="urn:x-wiley:00220655:media:jedm12359:jedm12359-math-0004" xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;semantics&gt;&lt;msub&gt;&lt;mover accent="true"&gt;&lt;mi&gt;P&lt;/mi&gt;&lt;mo&gt;&amp;#770;&lt;/mo&gt;&lt;/mover&gt;&lt;mrow&gt;&lt;mi&gt;a&lt;/mi&gt;&lt;mi&gt;k&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;annotation encoding="application/x-tex"&gt;${\hat P&amp;#95;{ak}}$&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt; </ephtml> by Johnson and Sinharay ([<reflink idref="bib19" id="ref67">19</reflink>]; their equation 9).</p> <p>Because the simulated retests use the estimated model parameters to simulate item responses for the resampled respondents, the method implicitly treats the estimated parameters as correct. Thus, the simulation‐retest method may produce biased estimates of reliability if there is model misfit, and it is important to examine the impact of model misfit on reliability estimates derived from the method. To evaluate the impact of model misfit, we simulated retests using the true data‐generating parameters, the parameters estimated by the LCDM, and the parameters estimated by the DINA model. The LCDM was the data‐generating model; therefore, the estimated LCDM parameters should be similar to the true parameters with some sampling variability. In contrast, the DINA model is more restrictive and therefore represents a model that does not truly fit the data, which may bias the reliability estimates. That is, the DINA model should exhibit misfit because the estimated parameters do not fully capture the data‐generating process of the LCDM. 
Estimating both the LCDM and the DINA models allowed us to evaluate how reliability measures derived from simulated retests with parameters that either fit (i.e., the LCDM parameters) or did not fit (i.e., the DINA parameters) compared with the reliability measures derived from the true data‐generating parameters. For all comparisons, we used the mean absolute difference to evaluate discrepancies between the reliability measures.</p> <hd id="AN0171370828-10">Simulation Study Results</hd> <p>There were 243 conditions with 100 repetitions per condition in this simulation (24,300 total repetitions). The models and reliability estimates were successfully estimated for 22,158 (91.2%) of the repetitions using the true data‐generating parameters as well as the estimated parameters from the LCDM and DINA models. The repetitions in which reliability estimates could not be obtained were not the result of a limitation of the reliability method, but rather of the failure of one or more of the estimated models to converge. Of the 2,142 (8.8%) repetitions where one or more models failed to converge, 1,689 (7.0% of all repetitions) were in the low discrimination condition. Another 292 repetitions (1.2% of all repetitions) were in the moderate discrimination condition with four or five attributes. The remaining .6% of repetitions where one or more models failed to converge were evenly spread across the remaining conditions. These findings are consistent with published literature indicating that high‐quality items that discriminate between mastery classes are crucial for estimating a DCM, particularly as the number of classes (i.e., attributes) increases (Madison &amp; Bradshaw, [<reflink idref="bib24" id="ref68">24</reflink>]; Ravand &amp; Baghaei, [<reflink idref="bib27" id="ref69">27</reflink>]).</p> <p>For simplicity in the presentation of results, we report the estimated reliability for Skill 1 rather than for all skills that were included in each condition. 
All skills were generated following the same process (i.e., number of items, Q‐matrix generation, prevalence, discrimination, and association with other attributes), and the results for all skills were consistent with each other across conditions.</p> <hd id="AN0171370828-11">Classification Consistency</hd> <p>Figure 1 shows the average simulation‐retest classification consistency and nonsimulation‐based classification consistency ( <ephtml> &lt;math display="inline" altimg="urn:x-wiley:00220655:media:jedm12359:jedm12359-math-0005" xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;semantics&gt;&lt;msub&gt;&lt;mover accent="true"&gt;&lt;mi&gt;P&lt;/mi&gt;&lt;mo&gt;&amp;#770;&lt;/mo&gt;&lt;/mover&gt;&lt;mrow&gt;&lt;mi&gt;c&lt;/mi&gt;&lt;mi&gt;k&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;annotation encoding="application/x-tex"&gt;${\hat P&amp;#95;{ck}}$&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt; </ephtml> ; Johnson &amp; Sinharay, [<reflink idref="bib19" id="ref70">19</reflink>]) for the first skill across all conditions. Overall, the estimates from the simulation‐retest and nonsimulation‐based methods are highly consistent, with an average absolute difference of only .0002. The similarity between the two measures of classification consistency was stable across all simulation conditions, indicating that the simulation‐retest measure of classification consistency provides reliability estimates comparable to nonsimulation‐based measures across a variety of assessment conditions.</p> <p> <img src="https://imageserver.ebscohost.com/img/embimages/rdk/MEA/01sep23/jedm12359-fig-0001.jpg?ephost1=dGJyMMvl7ESepq84yOvsOLCmsE6epq5Srqa4SK6WxWXS" alt="jedm12359-fig-0001.jpg" title="1 Comparison of classification consistency across all simulation conditions. Note. Dashed line represents perfect agreement." 
/> </p> <hd id="AN0171370828-13">Classification Accuracy</hd> <p>When comparing the simulation‐retest classification accuracy estimates from the LCDM with the nonsimulation‐based classification accuracy estimates, the two sets of estimates were also highly similar. Figure 2 shows a scatterplot with the average simulation‐retest classification accuracy estimate for the first skill for each condition on the <emph>x</emph>‐axis and the average nonsimulation‐based classification accuracy estimate ( <ephtml> &lt;math display="inline" altimg="urn:x-wiley:00220655:media:jedm12359:jedm12359-math-0006" xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;semantics&gt;&lt;msub&gt;&lt;mover accent="true"&gt;&lt;mi&gt;&amp;#964;&lt;/mi&gt;&lt;mo&gt;&amp;#770;&lt;/mo&gt;&lt;/mover&gt;&lt;mi&gt;k&lt;/mi&gt;&lt;/msub&gt;&lt;annotation encoding="application/x-tex"&gt;${\hat \tau &amp;#95;k}$&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt; </ephtml> ; Wang et al., [<reflink idref="bib40" id="ref71">40</reflink>]) for each condition on the <emph>y</emph>‐axis. In the scatterplot, the dashed line is the line of perfect agreement. The simulation‐retest and nonsimulation‐based reliability estimates are close to the line of perfect agreement, with an average absolute difference of .0001 across conditions. Thus, as with the classification consistency, when the estimated model matches the generating model, simulation and nonsimulation‐based methods give nearly identical estimates of skill reliability.</p> <p> <img src="https://imageserver.ebscohost.com/img/embimages/rdk/MEA/01sep23/jedm12359-fig-0002.jpg?ephost1=dGJyMMvl7ESepq84yOvsOLCmsE6epq5Srqa4SK6WxWXS" alt="jedm12359-fig-0002.jpg" title="2 Comparison of classification accuracy across all simulation conditions. Note. Dashed line represents perfect agreement." 
/> </p> <hd id="AN0171370828-15">Model Fit</hd> <p>Figure 3 shows the average simulation‐retest classification consistency for the first skill across each of the manipulated factors in the study and each set of item parameters (i.e., true, LCDM, and DINA). As expected, when using parameters from the DINA model that do not fit the true structure of the data, the reliability estimates are slightly lower than the estimates derived when using the true data‐generating parameters (i.e., true reliability) or the LCDM estimates. The relatively small effect of misfit is likely an artifact of the Q‐matrix generation. In the Q‐matrix, each skill was always measured by two items in isolation (i.e., single‐skill items) and one to three items that may or may not have measured a second skill (e.g., Table 2). For single‐skill items, the LCDM and DINA models are equivalent (Rupp et al., [<reflink idref="bib30" id="ref72">30</reflink>]). Because the models are equivalent for single‐skill items, misfit would be present only for the comparatively few items that measured multiple skills. As such, this study included only small to moderate levels of misfit, depending on how many items were simulated to measure multiple skills. Thus, it is likely that more items measuring multiple skills or items measuring more than two skills would increase the observed differences for the DINA model, as there would be a greater difference between the DINA model and the data‐generating model.</p> <p> <img src="https://imageserver.ebscohost.com/img/embimages/rdk/MEA/01sep23/jedm12359-fig-0003.jpg?ephost1=dGJyMMvl7ESepq84yOvsOLCmsE6epq5Srqa4SK6WxWXS" alt="jedm12359-fig-0003.jpg" title="3 Average simulation‐retest classification consistency across study factors, by model." 
/> </p> <p>In contrast to the results from the DINA model, the reliability estimates when using parameters from the LCDM that do fit the true structure of the data are slightly higher than the estimates derived when using the true data‐generating parameters. This observation is especially true at the highest value of each of the study factors. The high values of these factors are typically associated with high‐quality assessments (e.g., highly discriminating items, longer test length, etc.). Across all simulation conditions, the average absolute difference between the simulation‐retest classification consistency derived from the true and LCDM parameters was .0099, compared to a difference of .0168 between the true and DINA parameters.</p> <p>Classification accuracy shows a similar pattern in Figure 4. Again, we see that estimates of classification accuracy are generally slightly lower when the parameters come from a model that does not fit the data (i.e., the DINA model) and that estimates of classification accuracy are generally slightly higher when the parameters come from a model that does fit the data (i.e., the LCDM). The differences are again most pronounced at the highest levels of each factor, but the differences are smaller overall than what we observed for the classification consistency. Across all simulation conditions, the average absolute difference between the true and LCDM estimates of classification accuracy was .0071, compared to an average absolute difference of .0110 between the true and DINA estimates (a difference of .004, compared to a difference of .007 for classification consistency).</p> <p> <img src="https://imageserver.ebscohost.com/img/embimages/rdk/MEA/01sep23/jedm12359-fig-0004.jpg?ephost1=dGJyMMvl7ESepq84yOvsOLCmsE6epq5Srqa4SK6WxWXS" alt="jedm12359-fig-0004.jpg" title="4 Average simulation‐retest classification accuracy across study factors, by model." 
/> </p> <p>Across all conditions, the estimates of classification accuracy were consistently higher than the estimates of classification consistency. Additionally, both the simulation‐retest classification consistency and classification accuracy show patterns that are expected for a reliability metric. For example, both consistency and accuracy tend to increase as the number of items increases (top left of Figures 3 and 4), as the items better differentiate between mastery classes (middle left of Figures 3 and 4), and as the correlation between skills increases (bottom right of Figures 3 and 4).</p> <hd id="AN0171370828-18">Empirical Data Analysis</hd> <p>To evaluate the performance of the simulation‐retest reliability method in a real‐data setting, we applied the method to the data set for the grammar subtest of the ECPE (Templin &amp; Hoffman, [<reflink idref="bib37" id="ref73">37</reflink>]), as the nonsimulation‐based reliability estimates have been reported for the ECPE (Sinharay &amp; Johnson, [<reflink idref="bib33" id="ref74">33</reflink>]). The ECPE is an internationally administered assessment of the grammatical rules in English at the Proficient level of the Common European Framework of Reference for Languages. More specifically, the ECPE assesses morphosyntactic rules, cohesive rules, and lexical rules. The ECPE is intended for secondary‐school students and adults. The ECPE data set is available from the CDM package in R (Robitzsch et al., [<reflink idref="bib28" id="ref75">28</reflink>]), and it has previously been used to demonstrate the application of the nonsimulation‐based classification‐based reliability estimates for DCMs (Sinharay &amp; Johnson, [<reflink idref="bib33" id="ref76">33</reflink>]). The data for the grammar subtest of the ECPE include 2,922 examinees and 28 items, with 13 items measuring the morphosyntactic rules, six items measuring cohesive rules, and 18 items measuring the lexical rules. 
The Q‐matrix and the estimated structural and item parameters for the grammar subtest of the ECPE are available in Templin and Hoffman ([<reflink idref="bib37" id="ref77">37</reflink>]).</p> <p>In this empirical data analysis, we fit an LCDM to the ECPE data. Then, for each examinee, we simulated 100 retests using the estimated parameters from the LCDM, for a total of 292,200 retests (2,922 examinees × 100 retests). The LCDM has been applied to this data set many times (e.g., Chen et al., [<reflink idref="bib6" id="ref78">6</reflink>]; Liu &amp; Johnson, [<reflink idref="bib23" id="ref79">23</reflink>]; Templin &amp; Bradshaw, [<reflink idref="bib36" id="ref80">36</reflink>]; Templin &amp; Hoffman, [<reflink idref="bib37" id="ref81">37</reflink>]). In each application, the parameter estimates are consistent, and adequate model fit has been reported. After fitting the LCDM and confirming that our parameter estimates were consistent with previous applications, we calculated the simulation‐retest estimates of classification consistency and accuracy by comparing the skill classifications from each retest to the classifications made from the observed data. 
We then compared the simulated‐retest estimates to the estimates of <ephtml> &lt;math display="inline" altimg="urn:x-wiley:00220655:media:jedm12359:jedm12359-math-0007" xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;semantics&gt;&lt;msub&gt;&lt;mover accent="true"&gt;&lt;mi&gt;P&lt;/mi&gt;&lt;mo&gt;&amp;#770;&lt;/mo&gt;&lt;/mover&gt;&lt;mrow&gt;&lt;mi&gt;c&lt;/mi&gt;&lt;mi&gt;k&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;annotation encoding="application/x-tex"&gt;${\hat P&amp;#95;{ck}}$&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt; </ephtml> and <ephtml> &lt;math display="inline" altimg="urn:x-wiley:00220655:media:jedm12359:jedm12359-math-0008" xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;semantics&gt;&lt;msub&gt;&lt;mover accent="true"&gt;&lt;mi&gt;&amp;#964;&lt;/mi&gt;&lt;mo&gt;&amp;#770;&lt;/mo&gt;&lt;/mover&gt;&lt;mi&gt;k&lt;/mi&gt;&lt;/msub&gt;&lt;annotation encoding="application/x-tex"&gt;${\hat \tau &amp;#95;k}$&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt; </ephtml> (denoted as <ephtml> &lt;math display="inline" altimg="urn:x-wiley:00220655:media:jedm12359:jedm12359-math-0009" xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;semantics&gt;&lt;msub&gt;&lt;mover accent="true"&gt;&lt;mi&gt;P&lt;/mi&gt;&lt;mo&gt;&amp;#770;&lt;/mo&gt;&lt;/mover&gt;&lt;mrow&gt;&lt;mi&gt;a&lt;/mi&gt;&lt;mi&gt;k&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;annotation encoding="application/x-tex"&gt;${\hat P&amp;#95;{ak}}$&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt; </ephtml> ) by Johnson and Sinharay ([<reflink idref="bib19" id="ref82">19</reflink>]; their tables 6 and 7). 
The <ephtml> &lt;math display="inline" altimg="urn:x-wiley:00220655:media:jedm12359:jedm12359-math-0010" xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;semantics&gt;&lt;msub&gt;&lt;mover accent="true"&gt;&lt;mi&gt;P&lt;/mi&gt;&lt;mo&gt;&amp;#770;&lt;/mo&gt;&lt;/mover&gt;&lt;mrow&gt;&lt;mi&gt;c&lt;/mi&gt;&lt;mi&gt;k&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;annotation encoding="application/x-tex"&gt;${\hat P&amp;#95;{ck}}$&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt; </ephtml> and <ephtml> &lt;math display="inline" altimg="urn:x-wiley:00220655:media:jedm12359:jedm12359-math-0011" xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;semantics&gt;&lt;msub&gt;&lt;mover accent="true"&gt;&lt;mi&gt;&amp;#964;&lt;/mi&gt;&lt;mo&gt;&amp;#770;&lt;/mo&gt;&lt;/mover&gt;&lt;mi&gt;k&lt;/mi&gt;&lt;/msub&gt;&lt;annotation encoding="application/x-tex"&gt;${\hat \tau &amp;#95;k}$&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt; </ephtml> are the previously described classification consistency and accuracy estimates defined by Johnson and Sinharay ([<reflink idref="bib19" id="ref83">19</reflink>]) and Wang et al. ([<reflink idref="bib40" id="ref84">40</reflink>]), respectively.</p> <hd id="AN0171370828-19">Empirical Data‐Analysis Results</hd> <p>The reliability estimates from both the simulated‐retest and nonsimulation‐based methods are presented in Table 3. The simulation‐retest reliability estimates were similar to the nonsimulation‐based reliability estimates reported by Johnson and Sinharay ([<reflink idref="bib19" id="ref85">19</reflink>]). The simulation‐retest classification accuracy estimates were within .01 of the nonsimulation‐based classification accuracy estimates. This high degree of similarity is expected, given the high degree of consistency between these measures that was observed in the simulation study, and indicates that the similarity between the two methods persists for real data. 
It is also worth noting that the simulation‐retest reliability estimates were equal to or marginally larger than the nonsimulation‐based reliability estimates, although this marginal inflation was never greater than .02.</p> <p>Table 3 Comparison of Simulation‐Retest and Nonsimulation‐Based Skill‐Level Reliability Estimates for the Data from the Examination for the Certificate of Proficiency in English</p> <p> <ephtml> &lt;table&gt;&lt;thead&gt;&lt;tr valign="bottom"&gt;&lt;th&gt;Measure&lt;/th&gt;&lt;th&gt;&lt;p&gt;&lt;math display="inline" altimg="urn:x-wiley:00220655:media:jedm12359:jedm12359-math-0012" xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;semantics xmlns=""&gt;&lt;msub&gt;&lt;mover accent="true"&gt;&lt;mi&gt;P&lt;/mi&gt;&lt;mo&gt;&amp;#770;&lt;/mo&gt;&lt;/mover&gt;&lt;mrow&gt;&lt;mi&gt;c&lt;/mi&gt;&lt;mi&gt;k&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;annotation encoding="application/x-tex"&gt;${\hat P&amp;#95;{ck}}$&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/p&gt;&lt;/th&gt;&lt;th&gt;Simulation&amp;#8208;Retest Consistency&lt;/th&gt;&lt;th&gt;&lt;p&gt;&lt;math display="inline" altimg="urn:x-wiley:00220655:media:jedm12359:jedm12359-math-0013" xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;semantics xmlns=""&gt;&lt;msub&gt;&lt;mover accent="true"&gt;&lt;mi&gt;P&lt;/mi&gt;&lt;mo&gt;&amp;#770;&lt;/mo&gt;&lt;/mover&gt;&lt;mrow&gt;&lt;mi&gt;a&lt;/mi&gt;&lt;mi&gt;k&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;annotation encoding="application/x-tex"&gt;${\hat P&amp;#95;{ak}}$&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/p&gt;&lt;/th&gt;&lt;th&gt;Simulation&amp;#8208;Retest Accuracy&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Skill 1&lt;/td&gt;&lt;td&gt;.83&lt;/td&gt;&lt;td&gt;.85&lt;/td&gt;&lt;td&gt;.90&lt;/td&gt;&lt;td&gt;.92&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Skill 
2&lt;/td&gt;&lt;td&gt;.81&lt;/td&gt;&lt;td&gt;.82&lt;/td&gt;&lt;td&gt;.86&lt;/td&gt;&lt;td&gt;.86&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Skill 3&lt;/td&gt;&lt;td&gt;.86&lt;/td&gt;&lt;td&gt;.87&lt;/td&gt;&lt;td&gt;.92&lt;/td&gt;&lt;td&gt;.93&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt; </ephtml> </p> <hd id="AN0171370828-20">Discussion</hd> <p>In this study, we compared the performance of a simulation‐retest reliability method to nonsimulation‐based methods that have previously been described in the literature. Although the simulated‐retests method has been described and implemented in previous research (e.g., Thompson et al., [<reflink idref="bib39" id="ref86">39</reflink>]), additional research was needed to fully evaluate the estimates derived from such a method.</p> <p>The findings from this article demonstrate that simulated retests provide high‐fidelity measures of classification consistency and classification accuracy for diagnostic assessments. When comparing the scores from simulated retests to the scores from an original data set, the simulated‐retests method provided estimates of classification consistency and accuracy that were highly consistent with more traditional, nonsimulation‐based methods. This similarity in the reliability estimates was true across all conditions evaluated in this study. Additionally, we demonstrated that the simulated‐retest method demonstrates the expected properties of a reliability metric, such as increased reliability with longer assessments, more‐discriminating items, and association between the measured constructs (de la Torre &amp; Patz, [<reflink idref="bib11" id="ref87">11</reflink>]; DeVellis, [<reflink idref="bib12" id="ref88">12</reflink>]). 
Finally, an empirical analysis of ECPE data further demonstrated that the simulated‐retests method produced reliability estimates similar to those of the nonsimulation‐based reliability methods.</p> <p>Best practice in the literature indicates reliability evidence should be presented at the same level at which the results are reported (e.g., Standards 2.2, 2.3, and 2.5, AERA et al., [<reflink idref="bib1" id="ref89">1</reflink>]; Sinharay &amp; Haberman, [<reflink idref="bib32" id="ref90">32</reflink>]). For DCMs, when results are reported at the skill level, the reliability evidence should also be reported at the skill level. This guiding principle has motivated much of the existing research on reliability in DCMs (e.g., Cui et al., [<reflink idref="bib9" id="ref91">9</reflink>]; Johnson &amp; Sinharay, [<reflink idref="bib19" id="ref92">19</reflink>]; Wang et al., [<reflink idref="bib40" id="ref93">40</reflink>]), where classification‐based reliability at the skill and profile levels has been emphasized.</p> <p>However, a limitation of existing classification‐based reliability approaches is that they do not readily scale to other levels of reporting. For example, in addition to reporting the skill‐level results, a testing program may also report results as the total number of skills mastered, a performance level for state accountability systems, or a pass/fail decision for certification and licensure. Consequently, it is important that reliability evidence can be reported at these levels.</p> <p>This article expands on previous work (e.g., Roussos et al., [<reflink idref="bib29" id="ref94">29</reflink>]; Thompson et al., [<reflink idref="bib39" id="ref95">39</reflink>]) to examine how simulation‐retest estimates of classification accuracy and consistency compare to other methods. 
Because findings were generally consistent with other methods, we argue that simulated retests may be preferred because they can estimate reliability at multiple levels of reporting, not just the skill level (Thompson et al., [<reflink idref="bib39" id="ref96">39</reflink>]). As operational programs continue to adopt DCM‐based assessments, the capacity to report results and provide reliability evidence at levels beyond just the skill level is important for meeting the needs of stakeholders.</p> <p>We recognize that the simulated‐retests method may not be necessary or preferable in all contexts. The process of simulating retests, calculating results for each retest, and summarizing the results with an appropriate agreement metric requires more time and computing than other reliability metrics for DCMs that provide equally useful information. However, when an assessment reports results at an aggregated level (e.g., an overall performance level), the simulated‐retests method provides a consistent approach that can be used to report reliability for all levels of reported results. Thus, this method is an important tool for operational programs or accountability assessments that aggregate respondent‐mastery results in addition to reporting individual skill‐mastery statuses.</p> <p>Because the simulated retests use the estimated model parameters to simulate the retests, model fit is a key component of the method. The results of the current study demonstrated that even the small to moderate amounts of misfit introduced in this study by using the DINA model may introduce bias in the reliability estimates. Therefore, practitioners implementing this method should carefully evaluate the fit of their model before using simulated retests to estimate reliability. Future work may consider the impact of different types and amounts of model misfit on the reliability estimates produced by simulated retests. 
Additionally, future work should consider the complexity of Q‐matrix configurations, as the number of skills measured by each item may affect the number of item parameters (for the LCDM) and the overall model complexity, which in turn has implications for model fit.</p> <hd id="AN0171370828-21">Conclusions</hd> <p>This study has positive implications for operational testing programs administering assessments that are scaled using DCMs. The simulation‐retest reliability method allows reliability evidence to be reported at multiple levels, which can support programs reporting both skill‐mastery profiles and overall performance‐level results. The current study demonstrates that the simulated‐retests method generates reliability estimates that are consistent with nonsimulation‐based methods and with the true reliability. These findings indicate that the simulation‐retest reliability method produces accurate and consistent reliability estimates under a variety of conditions. These findings are promising given the usefulness of the simulation‐retest reliability method to operational testing programs in reporting reliability evidence to support the use of their assessments.</p> <ref id="AN0171370828-22"> <title> References </title> <blist> <bibl id="bib1" idref="ref3" type="bt">1</bibl> <bibtext> American Educational Research Association, American Psychological Association, &amp; National Council on Measurement in Education. (2014). Standards for educational and psychological testing. American Educational Research Association.</bibtext> </blist> <blist> <bibl id="bib2" idref="ref7" type="bt">2</bibl> <bibtext> Bradshaw, L. (2016). Diagnostic classification models. In A. A. Rupp &amp; J. P. Leighton (Eds.), The Wiley handbook of cognition and assessment: Frameworks, methodologies, and applications (1st ed., pp. 297 – 327). John Wiley &amp; Sons. 
https://doi.org/10.1002/9781118956588.ch13</bibtext> </blist> <blist> <bibl id="bib3" idref="ref46" type="bt">3</bibl> <bibtext> Bradshaw, L., Izsák, A., Templin, J., &amp; Jacobsen, E. (2014). Diagnosing teachers' understandings of rational numbers: Building a multidimensional test within the diagnostic classification framework. Educational Measurement: Issues and Practice, 33 (1), 2 – 14. https://doi.org/10.1111/emip.12020</bibtext> </blist> <blist> <bibl id="bib4" idref="ref63" type="bt">4</bibl> <bibtext> Bradshaw, L., &amp; Levy, R. (2019). Interpreting probabilistic classifications from diagnostic psychometric models. Educational Measurement: Issues and Practice, 38 (2), 79 – 88. https://doi.org/10.1111/emip.12247</bibtext> </blist> <blist> <bibl id="bib5" idref="ref4" type="bt">5</bibl> <bibtext> Brennan, R. L. (2001). An essay on the history and future of reliability from the perspective of replications. Journal of Educational Measurement, 38 (4), 295 – 317. https://doi.org/10.1111/j.1745‐3984.2001.tb01129.x</bibtext> </blist> <blist> <bibl id="bib6" idref="ref78" type="bt">6</bibl> <bibtext> Chen, F., Liu, Y., Xin, T., &amp; Cui, Y. (2018). Applying the M 2 statistic to evaluate the fit of diagnostic classification models in the presence of attribute hierarchies. Frontiers in Psychology, 9, 1875. https://doi.org/10.3389/fpsyg.2018.01875</bibtext> </blist> <blist> <bibl id="bib7" idref="ref30" type="bt">7</bibl> <bibtext> Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20 (1), 37 – 46. https://doi.org/10.1177/001316446002000104</bibtext> </blist> <blist> <bibl id="bib8" idref="ref2" type="bt">8</bibl> <bibtext> Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16 (3), 297 – 334. https://doi.org/10.1007/BF02310555</bibtext> </blist> <blist> <bibl id="bib9" idref="ref16" type="bt">9</bibl> <bibtext> Cui, Y., Gierl, M. J., &amp; Chang, H.‐H. (2012). 
Estimating classification consistency and accuracy for cognitive diagnostic assessment. Journal of Educational Measurement, 49 (1), 19 – 38. https://doi.org/10.1111/j.1745‐3984.2011.00158.x</bibtext> </blist> <blist> <bibtext> de la Torre, J., &amp; Douglas, J. A. (2004). Higher‐order latent trait models for cognitive diagnosis. Psychometrika, 69 (3), 333 – 353. https://doi.org/10.1007/BF02295640</bibtext> </blist> <blist> <bibtext> de la Torre, J., &amp; Patz, R. J. (2005). Make the most of what we have: A practical application of multidimensional item response theory in test scoring. Journal of Educational and Behavioral Statistics, 30 (3), 295 – 311. https://doi.org/10.3102/10769986030003295</bibtext> </blist> <blist> <bibtext> DeVellis, R. F. (2006). Classical test theory. Medical Care, 44 (11), S50 – S59. https://doi.org/10.1097/01.mlr.0000245426.10853.30</bibtext> </blist> <blist> <bibtext> Efron, B. (2000). The bootstrap and modern statistics. Journal of the American Statistical Association, 95 (452), 1293 – 1296. https://doi.org/10.2307/2669773</bibtext> </blist> <blist> <bibtext> Goodman, L. A., &amp; Kruskal, W. H. (1954). Measures of association for cross classifications. Journal of the American Statistical Association, 49, 732 – 764. https://doi.org/10.1080/01621459.1954.10501231</bibtext> </blist> <blist> <bibtext> Guttman, L. (1945). A basis for analyzing test‐retest reliability. Psychometrika, 10 (4), 255 – 282. https://doi.org/10.1007/BF02288892</bibtext> </blist> <blist> <bibtext> Haertel, E. H. (2006). Reliability. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 65 – 110). Praeger.</bibtext> </blist> <blist> <bibtext> Henson, R. A., Templin, J. L., &amp; Willse, J. T. (2009). Defining a family of cognitive diagnosis models using log‐linear models with latent variables. Psychometrika, 74 (2), 191 – 210. https://doi.org/10.1007/s11336‐008‐9089‐5</bibtext> </blist> <blist> <bibtext> Johnson, P. (2016). 
portableParallelSeeds: Allow replication of simulations on parallel and serial computers. R package version 0.97. https://github.com/wjakethompson/portableParallelSeeds</bibtext> </blist> <blist> <bibtext> Johnson, M. S., &amp; Sinharay, S. (2018). Measures of agreement to assess attribute‐level classification accuracy and consistency for cognitive diagnostic assessments. Journal of Educational Measurement, 55 (4), 635 – 664. https://doi.org/10.1111/jedm.12196</bibtext> </blist> <blist> <bibtext> Johnson, M. S., &amp; Sinharay, S. (2020). The reliability of the posterior probability of skill attainment in diagnostic classification models. Journal of Educational and Behavioral Statistics, 45 (1), 5 – 31. https://doi.org/10.3102/1076998619864550</bibtext> </blist> <blist> <bibtext> Junker, B. W., &amp; Sijtsma, K. (2001). Cognitive assessment models with few assumptions, and connections with nonparametric item response theory. Applied Psychological Measurement, 25 (3), 258 – 272. https://doi.org/10.1177/01466210122032064</bibtext> </blist> <blist> <bibtext> Leighton, J., &amp; Gierl, M. (2007). Cognitive diagnostic assessment for education: Theory and applications. Cambridge University Press.</bibtext> </blist> <blist> <bibtext> Liu, X., &amp; Johnson, M. S. (2019). Estimating CDMs using MCMC. In M. von Davier &amp; Y.‐S. Lee (Eds.), Handbook of diagnostic classification models (pp. 629 – 646). Springer Nature. https://doi.org/10.1007/978‐3‐030‐05584‐4_31</bibtext> </blist> <blist> <bibtext> Madison, M., &amp; Bradshaw, L. (2015). The effects of Q‐matrix design on classification accuracy in the log‐linear cognitive diagnosis model. Educational and Psychological Measurement, 75 (3), 491 – 511. https://doi.org/10.1177/0013164414539162</bibtext> </blist> <blist> <bibtext> Pearson, K. (1900). I. Mathematical contributions to the theory of evolution.—VII. On the correlation of characters not quantitatively measurable. 
Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 195, 1 – 47. https://doi.org/10.1098/rsta.1900.0022</bibtext> </blist> <blist> <bibtext> R Core Team. (2019). R: A language and environment for statistical computing (Version 3.6.1) [Computer software]. https://r‐project.org</bibtext> </blist> <blist> <bibtext> Ravand, H., &amp; Baghaei, P. (2020). Diagnostic classification models: Recent developments, practical issues, and prospects. International Journal of Testing, 20 (1), 24 – 56. https://doi.org/10.1080/15305058.2019.1588278</bibtext> </blist> <blist> <bibtext> Robitzsch, A., Kiefer, T., George, A. C., &amp; Ünlü, A. (2020). CDM: Cognitive diagnosis modeling (Version 8.1‐12) [Computer software]. https://CRAN.R‐project.org/package=CDM</bibtext> </blist> <blist> <bibtext> Roussos, L. A., Dibello, L. V., Stout, W., Hartz, S. M., Henson, R. A., &amp; Templin, J. L. (2007). The fusion model skills diagnosis system. In J. Leighton &amp; M. Gierl (Eds.), Cognitive diagnostic assessment for education: Theory and applications (pp. 275 – 318). Cambridge University Press. https://doi.org/10.1017/CBO9780511611186.010</bibtext> </blist> <blist> <bibtext> Rupp, A. A., Templin, J., &amp; Henson, R. A. (2010). Diagnostic measurement: Theory, methods, and applications. Guilford Press.</bibtext> </blist> <blist> <bibtext> Sessoms, J., &amp; Henson, R. A. (2018). Applications of diagnostic classification models: A literature review and critical commentary. Measurement: Interdisciplinary Research and Perspectives, 16 (1), 1 – 17. https://doi.org/10.1080/15366367.2018.1435104</bibtext> </blist> <blist> <bibtext> Sinharay, S., &amp; Haberman, S. J. (2009). How much can we reliably know about what examinees know? Measurement, 7 (1), 46 – 49. https://doi.org/10.1080/15366360802715486</bibtext> </blist> <blist> <bibtext> Sinharay, S., &amp; Johnson, M. S. (2019). 
Measures of agreement: Reliability, classification accuracy, and classification consistency. In M. von Davier &amp; Y.‐S. Lee (Eds.), Handbook of diagnostic classification models (pp. 359 – 377). Springer Nature. https://doi.org/10.1007/978‐3‐030‐05584‐4_17</bibtext> </blist> <blist> <bibtext> Tatsuoka, K. K. (1983). Rule space: An approach for dealing with misconceptions based on item response theory. Journal of Educational Measurement, 20 (4), 345 – 354. https://doi.org/10.1111/j.1745‐3984.1983.tb00212.x</bibtext> </blist> <blist> <bibtext> Templin, J., &amp; Bradshaw, L. (2013). Measuring the reliability of diagnostic classification model examinee estimates. Journal of Classification, 30, 251 – 275. https://doi.org/10.1007/s00357‐013‐9129‐4</bibtext> </blist> <blist> <bibtext> Templin, J., &amp; Bradshaw, L. (2014). Hierarchical diagnostic classification models: A family of models for estimating and testing attribute hierarchies. Psychometrika, 79 (2), 317 – 339. https://doi.org/10.1007/s11336‐013‐9362‐0</bibtext> </blist> <blist> <bibtext> Templin, J., &amp; Hoffman, L. (2013). Obtaining diagnostic classification model estimates using Mplus. Educational Measurement Issues and Practice, 32 (2), 37 – 50. https://doi.org/10.1111/emip.12010</bibtext> </blist> <blist> <bibtext> Thompson, W. J. (2019). Bayesian psychometrics for diagnostic assessments: A proof of concept (Research Report No. 19‐01). University of Kansas; Accessible Teaching, Learning, and Assessment Systems. https://doi.org/10.35542/osf.io/jzqs8</bibtext> </blist> <blist> <bibtext> Thompson, W. J., Clark, A. K., &amp; Nash, B. (2019). Measuring the reliability of diagnostic mastery classifications at multiple levels of reporting. Applied Measurement in Education, 32 (4), 298 – 309. https://doi.org/10.1080/08957347.2019.1660345</bibtext> </blist> <blist> <bibtext> Wang, W., Song, L., Chen, P., Meng, Y., &amp; Ding, S. (2015). 
Attribute‐level and pattern‐level classification consistency and accuracy indices for cognitive diagnostic assessment. Journal of Educational Measurement, 52 (4), 457 – 476. https://doi.org/10.1111/jedm.12096</bibtext> </blist> <blist> <bibtext> Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L. D., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., ... Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4 (43), 1686. https://doi.org/10.21105/joss.01686.</bibtext> </blist> <blist> <bibtext> Yerushalmy, J. (1947). Statistical problems in assessing methods of medical diagnosis, with special reference to X‐ray techniques. Public Health Records (1896–1970), 62 (40), 1432 – 1449. https://doi.org/10.2307/4586294</bibtext> </blist> <blist> <bibtext> Youden, W. J. (1950). Index for rating diagnostic tests. Cancer, 3 (1), 32 – 35. https://doi.org/10.1002/1097‐0142(1950)3:1 &lt;32::AID‐CNCR2820030106&gt;3.0.CO;2‐3</bibtext> </blist> </ref> <aug> <p>By W. Jake Thompson; Brooke Nash; Amy K. Clark and Jeffrey C. Hoover</p> <p>W. JAKE THOMPSON is Assistant Director of Psychometrics at Accessible Teaching, Learning, and Assessment Systems, 1122 West Campus Road, Lawrence, KS 66045; jakethompson@ku.edu. His primary research interests include diagnostic score reporting, assessment of model fit, Bayesian methods, and R development.</p> <p>BROOKE NASH is Associate Director of Psychometrics at Accessible Teaching, Learning, and Assessment Systems, 1122 West Campus Road, Lawrence, KS 66045; bnash@ku.edu. Her primary research interests include formative assessment, scoring models for technology‐enabled items, implementation of diagnostic assessment platforms, and assessment of the Next Generation Science Standards.</p> <p>AMY K. 
CLARK is Associate Director of Operational Research at Accessible Teaching, Learning, and Assessment Systems, 1122 West Campus Road, Lawrence, KS 66045; akclark@ku.edu. Her research interests include assessment validation, score reporting, and operationalizing diagnostic assessment systems.</p> <p>JEFFREY C. HOOVER is a Psychometrician at Accessible Teaching, Learning, and Assessment Systems, 1122 West Campus Road, Lawrence, KS 66045; jhoover4@ku.edu. His primary research interests include diagnostic classification modeling, machine learning, and model fit.</p> </aug> <nolink nlid="nl1" bibid="bib15" firstref="ref1"></nolink> <nolink nlid="nl2" bibid="bib16" firstref="ref5"></nolink> <nolink nlid="nl3" bibid="bib22" firstref="ref6"></nolink> <nolink nlid="nl4" bibid="bib30" firstref="ref8"></nolink> <nolink nlid="nl5" bibid="bib34" firstref="ref9"></nolink> <nolink nlid="nl6" bibid="bib27" firstref="ref10"></nolink> <nolink nlid="nl7" bibid="bib31" firstref="ref11"></nolink> <nolink nlid="nl8" bibid="bib33" firstref="ref12"></nolink> <nolink nlid="nl9" bibid="bib20" firstref="ref13"></nolink> <nolink nlid="nl10" bibid="bib35" firstref="ref14"></nolink> <nolink nlid="nl11" bibid="bib19" firstref="ref15"></nolink> <nolink nlid="nl12" bibid="bib40" firstref="ref19"></nolink> <nolink nlid="nl13" bibid="bib32" firstref="ref22"></nolink> <nolink nlid="nl14" bibid="bib43" firstref="ref28"></nolink> <nolink nlid="nl15" bibid="bib14" firstref="ref29"></nolink> <nolink nlid="nl16" bibid="bib25" firstref="ref31"></nolink> <nolink nlid="nl17" bibid="bib42" firstref="ref32"></nolink> <nolink nlid="nl18" bibid="bib29" firstref="ref37"></nolink> <nolink nlid="nl19" bibid="bib39" firstref="ref39"></nolink> <nolink nlid="nl20" bibid="bib13" firstref="ref41"></nolink> <nolink nlid="nl21" bibid="bib37" firstref="ref44"></nolink> <nolink nlid="nl22" bibid="bib38" firstref="ref48"></nolink> <nolink nlid="nl23" bibid="bib26" firstref="ref49"></nolink> <nolink nlid="nl24" bibid="bib18" 
firstref="ref50"></nolink> <nolink nlid="nl25" bibid="bib41" firstref="ref51"></nolink> <nolink nlid="nl26" bibid="bib17" firstref="ref56"></nolink> <nolink nlid="nl27" bibid="bib10" firstref="ref61"></nolink> <nolink nlid="nl28" bibid="bib21" firstref="ref62"></nolink> <nolink nlid="nl29" bibid="bib24" firstref="ref68"></nolink> <nolink nlid="nl30" bibid="bib28" firstref="ref75"></nolink> <nolink nlid="nl31" bibid="bib23" firstref="ref79"></nolink> <nolink nlid="nl32" bibid="bib36" firstref="ref80"></nolink> <nolink nlid="nl33" bibid="bib11" firstref="ref87"></nolink> <nolink nlid="nl34" bibid="bib12" firstref="ref88"></nolink>