Examining the Precision of Cut Scores within a Generalizability Theory Framework: A Closer Look at the Item Effect

Bibliographic Details
Title:	Examining the Precision of Cut Scores within a Generalizability Theory Framework: A Closer Look at the Item Effect
Language:	English
Authors:	Clauser, Brian E., Kane, Michael, Clauser, Jerome C.
Source:	Journal of Educational Measurement. Sum 2020 57(2):216-229.
Availability:	Wiley-Blackwell. 350 Main Street, Malden, MA 02148. Tel: 800-835-6770; Tel: 781-388-8598; Fax: 781-388-8232; e-mail: cs-journals@wiley.com; Web site: http://www.wiley.com/WileyCDA
Peer Reviewed:	Y
Page Count:	14
Publication Date:	2020
Document Type:	Journal Articles Reports - Evaluative
Descriptors:	Cutting Scores, Generalization, Decision Making, Standard Setting, Evaluators, Item Analysis, Error of Measurement, Difficulty Level, Probability, Item Response Theory, Guidelines
DOI:	10.1111/jedm.12247
ISSN:	0022-0655
Abstract:	An Angoff standard setting study generally yields judgments on a number of items by a number of judges (who may or may not be nested in panels). Variability associated with judges (and possibly panels) contributes error to the resulting cut score. The variability associated with items plays a more complicated role. To the extent that the mean item judgments directly reflect empirical item difficulties, the variability in Angoff judgments over items would not add error to the cut score, but to the extent that the mean item judgments do not correspond to the empirical item difficulties, variability in mean judgments over items would add error to the cut score. In this article, we present two generalizability-theory-based analyses of the proportion of the item variance that contributes to error in the cut score. For one approach, variance components are estimated on the probability (or proportion-correct) scale of the Angoff judgments, and for the other, the judgments are transferred to the theta scale of an item response theory model before estimating the variance components. The two analyses yield somewhat different results but both indicate that it is not appropriate to simply ignore the item variance component in estimating the error variance.
Abstractor:	As Provided
Entry Date:	2020
Accession Number:	EJ1255534
Database:	ERIC

<title id="AN0143594889-1">Examining the Precision of Cut Scores Within a Generalizability Theory Framework: A Closer Look at the Item Effect </title> <p>An Angoff standard setting study generally yields judgments on a number of items by a number of judges (who may or may not be nested in panels). Variability associated with judges (and possibly panels) contributes error to the resulting cut score. The variability associated with items plays a more complicated role. To the extent that the mean item judgments directly reflect empirical item difficulties, the variability in Angoff judgments over items would not add error to the cut score, but to the extent that the mean item judgments do not correspond to the empirical item difficulties, variability in mean judgments over items would add error to the cut score. In this article, we present two generalizability‐theory–based analyses of the proportion of the item variance that contributes to error in the cut score. For one approach, variance components are estimated on the probability (or proportion‐correct) scale of the Angoff judgments, and for the other, the judgments are transferred to the theta scale of an item response theory model before estimating the variance components. 
The two analyses yield somewhat different results but both indicate that it is not appropriate to simply ignore the item variance component in estimating the error variance.</p> <p>The credibility of the cut score (or cut scores) is an essential part of the validity evidence for the use of test scores to make classification decisions. There are several sources of evidence that might be considered in establishing the credibility of a cut score: conceptual coherence, procedural evidence, evidence of internal consistency, and evidence based on external criteria (Kane, [<reflink idref="bib14" id="ref1">14</reflink>]). Although each of these criteria is important, there is a kind of centrality to the type of internal consistency evidence that depends on the extent to which the results of the standard setting process are replicable. As Kane put it, "No matter how well designed the standard setting study and no matter how carefully implemented, we are not likely to have much faith in the outcome if we know that the results would be likely to be very different if the study were repeated" (pp. 70–71). The importance of this type of evidence is reflected in the <emph>Standards for Educational and Psychological Testing</emph> (American Educational Research Association, American Psychological Association, &amp; National Council on Measurement in Education, [<reflink idref="bib1" id="ref2">1</reflink>]) which state, "Where applicable, variability over participants should be reported. Whenever feasible, an estimate should be provided of the amount of variation in cut scores that might be expected if the standard‐setting procedure were replicated with a comparable standard‐setting panel" (Standard 5.21, p. 108). 
Several authors have recommended generalizability theory as a useful framework for evaluating the stability or replicability of cut scores estimated using the Angoff method (e.g., Brennan, [<reflink idref="bib2" id="ref3">2</reflink>]; Brennan &amp; Lockwood, [<reflink idref="bib4" id="ref4">4</reflink>]; Camilli, Cizek, &amp; Lugg, [<reflink idref="bib6" id="ref5">6</reflink>]; B. E. Clauser, Harik, et al., [<reflink idref="bib7" id="ref6">7</reflink>]; B. E. Clauser, Swanson, &amp; Harik, [<reflink idref="bib10" id="ref7">10</reflink>]; Hambleton, Pitoniak, &amp; Copella, [<reflink idref="bib13" id="ref8">13</reflink>]; Kane, [<reflink idref="bib14" id="ref9">14</reflink>]; Kane &amp; Wilson, [<reflink idref="bib15" id="ref10">15</reflink>]).</p> <p>In most common applications of generalizability theory, there is an object of measurement, typically the test taker or an aggregation of test takers such as a classroom or school. The variability in scores for objects of measurement over replications of the assessment is interpreted as measurement error. This error may be associated with several facets (e.g., with test items or tasks, with occasions, and with scorers). The variability associated with objects of measurement is not considered error, because it is the variable of interest. The universe score for each object of measurement is defined as the expected observed score for the object of measurement over replications and the variability over replications around this universe score is taken to define the error.</p> <p>The application of generalizability theory to Angoff standard‐setting data is somewhat different from applications in which the value of interest is the universe score for a test taker or classroom. 
In a simple Angoff study with a sample of judges reviewing a sample of items, we have a judgment for each judge‐item pair, and applying generalizability theory to these data we can estimate a variance component for judges, for items, and for the judge‐item interaction. In the context of standard setting, we want to estimate a single value for a cut score, rather than estimating a universe score for each object of measurement in some population. The error variance of interest in the standard‐setting context is the error for a mean score across all facets of the design (judges, items); the main concern is the estimation of the cut score and of the error in this estimate. For example, in a situation in which three panels are convened with 10 judges in each panel and each judge reviews the same set of items, and the estimated cut score is the mean across items, judges, and panels, there are potentially five variance components that could contribute to the error for the estimated cut score: (1) panel, (2) judge nested in panel, (3) item, (4) item by panel, and (5) item by judge nested in panel. The panel component represents the variability in mean judgments for panels. The judge‐nested‐within‐panel component represents the variability in mean judgments for judges within panels. The item component represents the variability in mean judgments for items. The item‐by‐panel component represents variability in the mean judgments for item‐panel combinations. The item‐by‐judge‐nested‐in‐panel component is the residual term.</p> <p>All five of these variance components may contribute to error. Which components do contribute to error depends on how we answer the question, what constitutes a replication? 
It is certainly possible to imagine a situation in which the answer is that all of the components contribute to error. If we are establishing a cut score for a domain and the items used in this exercise are a random sample from the domain, then a replication might well use different items evaluated by different judges in different panels. Although this is a possible view of a replication, more commonly the standard will be established for a single test form (or a set of test forms), and this standard will then be applied to other test forms, with the variability in empirical item difficulty across forms being controlled through equating. Under this condition, the empirical difficulty of the specific set of items used in the standard setting should not impact the precision of the estimated cut score.</p> <p>Because this empirical item difficulty is presumably reflected in the judgments, some previous researchers have suggested that the variance component associated with items should not be considered in estimating the error variance for the cut score (e.g., B. E. Clauser et al., [<reflink idref="bib10" id="ref16">10</reflink>]; B. E. Clauser, Harik, et al., [<reflink idref="bib7" id="ref17">7</reflink>]). Implicit in the decision to drop this component from the estimate of error variance is the assumption that the variability in judged item difficulty reflects the variability in empirical item difficulty, and therefore, would be eliminated by statistical equating across test forms. It has, however, been well established that the correspondence between judged and empirical item difficulties is at best moderate (e.g., Busch &amp; Jaeger, [<reflink idref="bib5" id="ref18">5</reflink>]; B. E. Clauser et al., [<reflink idref="bib10" id="ref19">10</reflink>]; B. E. Clauser, Harik, et al., [<reflink idref="bib7" id="ref20">7</reflink>]; B. E. Clauser, Mee, Baldwin, Margolis, &amp; Dillon, 2009). 
This raises a question regarding the extent to which—after accounting for random error in the judgments—the item difficulty judgments do correspond to the empirical item difficulties.</p> <p>Two recent studies report evidence suggesting that this lack of correspondence may in fact be systematic (B. E. Clauser, Mee, &amp; Margolis, [<reflink idref="bib9" id="ref21">9</reflink>]; Mee, Clauser, &amp; Margolis, [<reflink idref="bib17" id="ref22">17</reflink>]). In both of these studies, three panels were used and the correlation between judged item difficulties (for the minimally proficient test taker) was higher between panels than the correlation between any of the panel judgments and the empirical item difficulties for test takers with scores near the cut score. This provides fairly strong evidence that at best only a portion of the variance attributable to item difficulty can be eliminated through equating. Although it might be reasonable to exclude the portion of the item variance component for the Angoff judgments that is accounted for by empirical item difficulties when estimating the variability of cut scores across replications, the portion of that variance that is not associated with empirical item difficulty should be viewed as error. The issue is important because even when moderate samples of items are used, this source of variance may make a significant contribution to the overall error in estimating the cut score.</p> <p>The challenge is to separate that proportion of the variability in judged item difficulty that is associated with variability in empirical item difficulty from the remaining variance. 
In this article, we present two approaches for evaluating the item variance component for Angoff judgments and partitioning it into that part that is associated with empirical differences in item difficulty and that which results from other influences presumably related to judge perceptions; the first is correlation‐based, the second is based on item response theory (IRT).</p> <hd id="AN0143594889-2">The Correlation‐Based Approach</hd> <p>One approach to conceptualizing and estimating the proportion of variance in judged item difficulties associated with empirical item difficulties is based on the strength of the correlation between judged item difficulties and the empirical difficulties for those same items. The judges in an Angoff standard‐setting study are asked to estimate the probability of success on an item for a test taker whose proficiency level is at the cut score; that is, a conditional probability of success. If judges were completely accurate and internally consistent, their estimated probabilities would be perfectly correlated with empirical probabilities of success (<emph>p</emph>‐values) estimated for test takers whose observed scores were at the cut score. The square of the correlation between the mean item judgments and empirical conditional probabilities of success on the items is the proportion of variance in the mean judgments (across judges) that is accounted for by the empirical item difficulties. This squared correlation can be used as a basis for estimating the proportion of the Angoff item variance component that is associated with the empirical item difficulties.</p> <p>The exact application of this relationship varies depending on the design used in the study. 
For a simple crossed design (judges by items), it can be shown that the variance component for items equals the observed variance of the mean judgments minus the residual variance component divided by the number of judges (as shown in Equation 4).</p> <p>Based on the equations provided by Brennan ([<reflink idref="bib3" id="ref23">3</reflink>], pp. 26–27), the estimate of the variance component for items is</p> <p>\[ \hat{\sigma}^2(i) = \frac{MS(i) - MS(ji)}{n_j}, \tag{1} \]</p> <p>where <emph>i</emph> represents items, <emph>j</emph> represents judges, and <emph>MS</emph> refers to mean square. 
The variance component for the residual is</p> <p>\[ \hat{\sigma}^2(ji) = MS(ji). \tag{2} \]</p> <p>And the mean square for items is</p> <p>\[ MS(i) = \frac{n_j \sum_i (\bar{X}_i - \bar{X})^2}{n_i - 1}. \tag{3} \]</p> <p>Substituting Equations 2 and 3 into Equation 1 results in</p> <p>\[ \hat{\sigma}^2(i) = \frac{\sum_i (\bar{X}_i - \bar{X})^2}{n_i - 1} - \frac{\hat{\sigma}^2(ji)}{n_j}. \tag{4} \]</p> <p>The estimated variance of the mean item judgments is given by the first term on the right side of Equation 4:</p> <p>\[ \sigma_i^2 = \frac{\sum_i (\bar{X}_i - \bar{X})^2}{n_i - 1}. \tag{5} \]</p> <p>As indicated in Equation 4, the estimated variance of the item mean judgments (in Equation 5) is larger than the item variance component (in Equation 1) except in the unusual circumstance in which the final term in Equation 4 is 0, in which case they are equal.</p> <p>The first term on the right of Equation 4 is the estimated variance in the mean item judgments, and the second term on the right is the estimated contribution of the residual error to the variance in the mean item judgments. By subtracting the second term from the first, we get an estimate of the item variance component. If the second term on the right of Equation 4 is as large as, or almost as large as, the first term on the right, the item variance component would be 0 or close to 0, and could be ignored in estimating the overall error for the Angoff cut score. If the second term on the right side of Equation 4 were larger than the first term, the estimated variance component for items would be negative and would be set to 0. In the present application, this is unlikely. 
In G theory, small negative estimates for variance components are generally attributed to sampling error, and are set equal to 0. Variance components are, of course, nonnegative by definition; negative estimates may occur when sample sizes are relatively small or when there are a large number of effects in the design. Cronbach, Gleser, Nanda, and Rajaratnam ([<reflink idref="bib12" id="ref24">12</reflink>]) recommended that these negative values be set to 0 and this has become the convention in the generalizability theory literature (Brennan, [<reflink idref="bib3" id="ref25">3</reflink>]).</p> <p>The square of the correlation between the Angoff mean item judgments and the empirical conditional item difficulties is the proportion of the Angoff item mean variance that is accounted for by the empirical item difficulties. What we want, however, is the proportion of the item variance component that is accounted for by the empirical item difficulties, and, as indicated in Equation 4, this variance component is smaller than the variance in item mean judgments. 
If it is assumed that the residual errors are uncorrelated with the empirical item difficulties, then the proportion of the variance component that can be accounted for by variability in empirical item difficulties will be equal to the square of the correlation between the Angoff judgments and the empirical conditional item difficulties multiplied by the ratio of the variance of the item mean judgments to the variance component for items (Equation 5 provides an estimate of the value in the numerator in Equation 6):</p> <p>\[ R^2 \frac{\sigma_i^2}{\sigma_i^2 - \frac{\sigma^2(ji)}{n_j}}, \tag{6} \]</p> <p>where <emph>R</emph> is the correlation between the mean item judgments and the empirical item difficulties. 
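</p> <p>The crossed-design calculation described above can be sketched in code. The following is a minimal illustration, not the authors' implementation: the function name <emph>item_variance_adjustment</emph>, its argument names, and the returned keys are hypothetical, and it assumes a complete judges-by-items matrix of Angoff judgments plus one empirical conditional p-value per item. The sketch caps the adjusted value at 1 and, as a further assumption, treats a zero item variance component as fully accounted for.</p>

```python
import numpy as np


def item_variance_adjustment(judgments, empirical_p):
    """Correlation-based adjustment for a crossed judges-by-items design.

    judgments   : array of shape (n_judges, n_items) of Angoff probabilities.
    empirical_p : length n_items array of conditional p-values at the cut score.
    """
    x = np.asarray(judgments, dtype=float)
    p = np.asarray(empirical_p, dtype=float)
    n_j, n_i = x.shape

    grand_mean = x.mean()
    item_means = x.mean(axis=0)    # mean judgment per item (over judges)
    judge_means = x.mean(axis=1)   # mean judgment per judge (over items)

    # Residual mean square for the j x i design; Equation 2 takes this
    # as the residual variance component.
    resid = x - judge_means[:, None] - item_means[None, :] + grand_mean
    var_ji = np.sum(resid ** 2) / ((n_j - 1) * (n_i - 1))

    # Equation 5: variance of the item mean judgments.
    var_item_means = np.sum((item_means - grand_mean) ** 2) / (n_i - 1)

    # Equation 4: item variance component, with negative estimates set to 0.
    var_i = max(var_item_means - var_ji / n_j, 0.0)

    # Squared correlation between mean item judgments and empirical difficulties.
    r_squared = np.corrcoef(item_means, p)[0, 1] ** 2

    # Expression 6, capped at 1. Returning 1.0 when var_i is 0 is an assumption
    # of this sketch: a zero component adds nothing to the error either way.
    if var_i > 0.0:
        explained = min(r_squared * var_item_means / var_i, 1.0)
    else:
        explained = 1.0

    return {
        "var_ji": var_ji,
        "var_item_means": var_item_means,
        "var_i": var_i,
        "r_squared": r_squared,
        "explained": explained,
        "error_share": 1.0 - explained,
    }
```

<p>Here <emph>explained</emph> estimates the share of the item variance component associated with empirical item difficulty, and <emph>error_share</emph> is the remainder that would be carried into the error of the cut score under the stated assumptions.</p> <p>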
The proportion of the variance component that might be viewed as error is then one minus the square of the adjusted correlation, which is given by Expression 6.</p> <p>As indicated above, if the second term in the denominator of Equation  were almost as large as the first term, the estimated item variance component could be quite small, and as a result the correction shown in Equation  could yield adjusted values that are greater than 1, and in such cases it would be reasonable to take the adjustment to be 1.0, indicating that the item variance component does not need to be included in the error of the Angoff cut score.</p> <p>This logic can readily be extended to other designs. The most likely alternative design in standard setting is one in which there are multiple panels with judges nested within panels. For this design—items (<emph>i</emph>) crossed with judges (<emph>j</emph>) nested in panels (<emph>p</emph>)— the variance component for items is</p> <p> <ephtml> &lt;math display="block" altimg="urn:x-wiley:00220655:media:jedm12247:jedm12247-math-0007" xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;msup&gt;&lt;mover accent="true"&gt;&lt;mi&gt;&amp;#963;&lt;/mi&gt;&lt;mo&gt;&amp;#770;&lt;/mo&gt;&lt;/mover&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;/msup&gt;&lt;mfenced open="(" close=")"&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;/mfenced&gt;&lt;mspace width="0.33em" /&gt;&lt;mo&gt;=&lt;/mo&gt;&lt;mfrac&gt;&lt;mrow&gt;&lt;mi&gt;MS&lt;/mi&gt;&lt;mfenced open="(" close=")"&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;/mfenced&gt;&lt;mo&gt;&amp;#8722;&lt;/mo&gt;&lt;mi&gt;MS&lt;/mi&gt;&lt;mfenced open="(" close=")"&gt;&lt;mi&gt;ip&lt;/mi&gt;&lt;/mfenced&gt;&lt;/mrow&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mi&gt;n&lt;/mi&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/msub&gt;&lt;msub&gt;&lt;mi&gt;n&lt;/mi&gt;&lt;mi&gt;p&lt;/mi&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;/mfrac&gt;&lt;mo&gt;.&lt;/mo&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml> </p> <p>The mean square for items is</p> <p> <ephtml> &lt;math display="block" 
altimg="urn:x-wiley:00220655:media:jedm12247:jedm12247-math-0008" xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;mi&gt;MS&lt;/mi&gt;&lt;mspace width="0.33em" /&gt;&lt;mfenced open="(" close=")"&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;/mfenced&gt;&lt;mo&gt;=&lt;/mo&gt;&lt;mfrac&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mi&gt;n&lt;/mi&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/msub&gt;&lt;msub&gt;&lt;mi&gt;n&lt;/mi&gt;&lt;mi&gt;p&lt;/mi&gt;&lt;/msub&gt;&lt;msub&gt;&lt;mo&gt;&amp;#8721;&lt;/mo&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;/msub&gt;&lt;msup&gt;&lt;mfenced separators="" open="(" close=")"&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mover accent="true"&gt;&lt;mi&gt;X&lt;/mi&gt;&lt;mo&gt;&amp;#175;&lt;/mo&gt;&lt;/mover&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;/msub&gt;&lt;mo&gt;&amp;#8722;&lt;/mo&gt;&lt;mover accent="true"&gt;&lt;mi&gt;X&lt;/mi&gt;&lt;mo&gt;&amp;#175;&lt;/mo&gt;&lt;/mover&gt;&lt;/mrow&gt;&lt;/mfenced&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;/msup&gt;&lt;/mrow&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mi&gt;n&lt;/mi&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;/msub&gt;&lt;mo&gt;&amp;#8722;&lt;/mo&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;/mrow&gt;&lt;/mfrac&gt;&lt;mo&gt;,&lt;/mo&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml> </p> <p>and</p> <p> <ephtml> &lt;math display="block" altimg="urn:x-wiley:00220655:media:jedm12247:jedm12247-math-0009" xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;mi&gt;MS&lt;/mi&gt;&lt;mspace width="0.33em" /&gt;&lt;mfenced open="(" close=")"&gt;&lt;mi&gt;ip&lt;/mi&gt;&lt;/mfenced&gt;&lt;mo&gt;=&lt;/mo&gt;&lt;msub&gt;&lt;mi&gt;n&lt;/mi&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/msub&gt;&lt;mspace width="0.33em" /&gt;&lt;msup&gt;&lt;mover accent="true"&gt;&lt;mi&gt;&amp;#963;&lt;/mi&gt;&lt;mo&gt;&amp;#770;&lt;/mo&gt;&lt;/mover&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;/msup&gt;&lt;mfenced open="(" close=")"&gt;&lt;mi&gt;ip&lt;/mi&gt;&lt;/mfenced&gt;&lt;mo&gt;+&lt;/mo&gt;&lt;msup&gt;&lt;mover accent="true"&gt;&lt;mi&gt;&amp;#963;&lt;/mi&gt;&lt;mo&gt;&amp;#770;&lt;/mo&gt;&lt;/mover&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;/msup&gt;&lt;mfenced separators="" 
open="(" close=")"&gt;&lt;mrow&gt;&lt;mi&gt;ij&lt;/mi&gt;&lt;mo&gt;:&lt;/mo&gt;&lt;mi&gt;p&lt;/mi&gt;&lt;/mrow&gt;&lt;/mfenced&gt;&lt;mo&gt;.&lt;/mo&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml> </p> <p>Substituting Equations  and  into Equation  produces</p> <p> <ephtml> &lt;math display="block" altimg="urn:x-wiley:00220655:media:jedm12247:jedm12247-math-0010" xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;msup&gt;&lt;mover accent="true"&gt;&lt;mi&gt;&amp;#963;&lt;/mi&gt;&lt;mo&gt;&amp;#770;&lt;/mo&gt;&lt;/mover&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;/msup&gt;&lt;mfenced open="(" close=")"&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;/mfenced&gt;&lt;mspace width="0.33em" /&gt;&lt;mo&gt;=&lt;/mo&gt;&lt;mfrac&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mo&gt;&amp;#8721;&lt;/mo&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;/msub&gt;&lt;msup&gt;&lt;mfenced separators="" open="(" close=")"&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mover accent="true"&gt;&lt;mi&gt;X&lt;/mi&gt;&lt;mo&gt;&amp;#175;&lt;/mo&gt;&lt;/mover&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;/msub&gt;&lt;mo&gt;&amp;#8722;&lt;/mo&gt;&lt;mover accent="true"&gt;&lt;mi&gt;X&lt;/mi&gt;&lt;mo&gt;&amp;#175;&lt;/mo&gt;&lt;/mover&gt;&lt;/mrow&gt;&lt;/mfenced&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;/msup&gt;&lt;/mrow&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mi&gt;n&lt;/mi&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;/msub&gt;&lt;mo&gt;&amp;#8722;&lt;/mo&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;/mrow&gt;&lt;/mfrac&gt;&lt;mspace width="0.33em" /&gt;&lt;mo&gt;&amp;#8722;&lt;/mo&gt;&lt;mfrac&gt;&lt;mrow&gt;&lt;msup&gt;&lt;mover accent="true"&gt;&lt;mi&gt;&amp;#963;&lt;/mi&gt;&lt;mo&gt;&amp;#770;&lt;/mo&gt;&lt;/mover&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;/msup&gt;&lt;mfenced separators="" open="(" close=")"&gt;&lt;mrow&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;mi&gt;p&lt;/mi&gt;&lt;/mrow&gt;&lt;/mfenced&gt;&lt;/mrow&gt;&lt;msub&gt;&lt;mi&gt;n&lt;/mi&gt;&lt;mi&gt;p&lt;/mi&gt;&lt;/msub&gt;&lt;/mfrac&gt;&lt;mo&gt;&amp;#8722;&lt;/mo&gt;&lt;mfrac&gt;&lt;mrow&gt;&lt;msup&gt;&lt;mover 
accent="true"&gt;&lt;mi&gt;&amp;#963;&lt;/mi&gt;&lt;mo&gt;&amp;#770;&lt;/mo&gt;&lt;/mover&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;/msup&gt;&lt;mfenced separators="" open="(" close=")"&gt;&lt;mrow&gt;&lt;mi&gt;ij&lt;/mi&gt;&lt;mo&gt;:&lt;/mo&gt;&lt;mi&gt;p&lt;/mi&gt;&lt;/mrow&gt;&lt;/mfenced&gt;&lt;/mrow&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mi&gt;n&lt;/mi&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/msub&gt;&lt;msub&gt;&lt;mi&gt;n&lt;/mi&gt;&lt;mi&gt;p&lt;/mi&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;/mfrac&gt;&lt;mo&gt;.&lt;/mo&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml> </p> <p>And again, if it is assumed that the residual variance component and the variance component of <emph>ip</emph> are uncorrelated with the empirical item difficulties, then the proportion of the variance component that can be accounted for by variability in empirical item difficulties will be equal to the square of the correlation between the judgments and the empirical conditional item difficulties multiplied by the ratio of the variance for item judgments to the variance component for items:</p> <p> <ephtml> &lt;math display="block" altimg="urn:x-wiley:00220655:media:jedm12247:jedm12247-math-0011" xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;msup&gt;&lt;mi&gt;R&lt;/mi&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;/msup&gt;&lt;mfrac&gt;&lt;msubsup&gt;&lt;mi&gt;&amp;#963;&lt;/mi&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;/msubsup&gt;&lt;mrow&gt;&lt;msubsup&gt;&lt;mi&gt;&amp;#963;&lt;/mi&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;/msubsup&gt;&lt;mo&gt;&amp;#8722;&lt;/mo&gt;&lt;mspace width="0.33em" /&gt;&lt;mfrac&gt;&lt;mrow&gt;&lt;msup&gt;&lt;mi&gt;&amp;#963;&lt;/mi&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;/msup&gt;&lt;mfenced separators="" open="(" close=")"&gt;&lt;mrow&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;mi&gt;p&lt;/mi&gt;&lt;/mrow&gt;&lt;/mfenced&gt;&lt;/mrow&gt;&lt;msub&gt;&lt;mi&gt;n&lt;/mi&gt;&lt;mi&gt;p&lt;/mi&gt;&lt;/msub&gt;&lt;/mfrac&gt;&lt;mspace width="0.33em" /&gt;&lt;mo&gt;&amp;#8722;&lt;/mo&gt;&lt;mspace width="0.33em" 
/&gt;&lt;mfrac&gt;&lt;mrow&gt;&lt;msup&gt;&lt;mi&gt;&amp;#963;&lt;/mi&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;/msup&gt;&lt;mfenced separators="" open="(" close=")"&gt;&lt;mrow&gt;&lt;mi&gt;ij&lt;/mi&gt;&lt;mo&gt;:&lt;/mo&gt;&lt;mi&gt;p&lt;/mi&gt;&lt;/mrow&gt;&lt;/mfenced&gt;&lt;/mrow&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mi&gt;n&lt;/mi&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/msub&gt;&lt;msub&gt;&lt;mi&gt;n&lt;/mi&gt;&lt;mi&gt;p&lt;/mi&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;/mfrac&gt;&lt;/mrow&gt;&lt;/mfrac&gt;&lt;mo&gt;,&lt;/mo&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml> </p> <p>where <emph>R</emph> is the correlation between the mean item judgments and the empirical item difficulties.</p> <p>As in the earlier example, if the sum of the second and third terms on the right of Equation  is as large as or larger than the first term on the right of Equation , the item variance component would be taken to be 0, and could be ignored in estimating the overall error for the Angoff cut score.</p> <hd id="AN0143594889-3">The Item Response Theory Framework</hd> <p>A second approach to evaluating the proportion of the item variance component that is independent of the variability in the empirical item difficulties is to transform the judgments using an IRT model. On an observed proportion‐correct scale, judgments made as part of an Angoff standard setting exercise depend on the specific set of items selected. In an IRT framework, the Angoff judgments can be transformed to a theta scale, and the transformed judgments would be independent of the difficulty of the specific items, assuming that the model fits the data. Within this IRT framework, if the judges in an Angoff exercise were behaving in a manner that was internally consistent and consistent with the IRT model, their judgments would always map to the same proficiency level (<emph>θ</emph>) regardless of the difficulty of the specific items (van der Linden, [<reflink idref="bib19" id="ref26">19</reflink>]). 
The panelists, of course, are not asked to make judgments on the theta scale, but if the item parameters are known or well estimated, the probabilities that the judges provide can be converted to the theta scale. For example, using the two‐parameter model</p> <p> <ephtml> &lt;math display="block" altimg="urn:x-wiley:00220655:media:jedm12247:jedm12247-math-0012" xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;mi&gt;P&lt;/mi&gt;&lt;mfenced open="(" close=")"&gt;&lt;mi&gt;&amp;#952;&lt;/mi&gt;&lt;/mfenced&gt;&lt;mo&gt;=&lt;/mo&gt;&lt;mfrac&gt;&lt;msup&gt;&lt;mi&gt;e&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;a&lt;/mi&gt;&lt;mfenced separators="" open="(" close=")"&gt;&lt;mrow&gt;&lt;mi&gt;&amp;#952;&lt;/mi&gt;&lt;mo&gt;&amp;#8722;&lt;/mo&gt;&lt;mi&gt;b&lt;/mi&gt;&lt;/mrow&gt;&lt;/mfenced&gt;&lt;/mrow&gt;&lt;/msup&gt;&lt;mrow&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;mo&gt;+&lt;/mo&gt;&lt;msup&gt;&lt;mi&gt;e&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;a&lt;/mi&gt;&lt;mfenced separators="" open="(" close=")"&gt;&lt;mrow&gt;&lt;mi&gt;&amp;#952;&lt;/mi&gt;&lt;mo&gt;&amp;#8722;&lt;/mo&gt;&lt;mi&gt;b&lt;/mi&gt;&lt;/mrow&gt;&lt;/mfenced&gt;&lt;/mrow&gt;&lt;/msup&gt;&lt;/mrow&gt;&lt;/mfrac&gt;&lt;mo&gt;,&lt;/mo&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml> </p> <p>the transformation is</p> <p> <ephtml> &lt;math display="block" altimg="urn:x-wiley:00220655:media:jedm12247:jedm12247-math-0013" xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;mi&gt;&amp;#952;&lt;/mi&gt;&lt;mspace width="0.33em" /&gt;&lt;mo&gt;=&lt;/mo&gt;&lt;mspace width="0.33em" /&gt;&lt;mfenced separators="" open="[" close="]"&gt;&lt;mrow&gt;&lt;mi&gt;a&lt;/mi&gt;&lt;mi&gt;b&lt;/mi&gt;&lt;mo&gt;+&lt;/mo&gt;&lt;mi&gt;ln&lt;/mi&gt;&lt;mfenced separators="" open="(" close=")"&gt;&lt;mfrac&gt;&lt;mi&gt;p&lt;/mi&gt;&lt;mrow&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;mo&gt;&amp;#8722;&lt;/mo&gt;&lt;mi&gt;p&lt;/mi&gt;&lt;/mrow&gt;&lt;/mfrac&gt;&lt;/mfenced&gt;&lt;/mrow&gt;&lt;/mfenced&gt;&lt;mfenced open="/" 
close=""&gt;&lt;mphantom&gt;&lt;mpadded width="0pt"&gt;&lt;mfrac&gt;&lt;mi&gt;p&lt;/mi&gt;&lt;mrow&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;mo&gt;&amp;#8722;&lt;/mo&gt;&lt;mi&gt;p&lt;/mi&gt;&lt;/mrow&gt;&lt;/mfrac&gt;&lt;/mpadded&gt;&lt;/mphantom&gt;&lt;/mfenced&gt;&lt;mi&gt;a&lt;/mi&gt;&lt;mo&gt;.&lt;/mo&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml> </p> <p>In Equation , <emph>P</emph> is the judged probability and (as in Equation ) <emph>θ</emph> represents proficiency on the latent trait scale, and <emph>a</emph> and <emph>b</emph> are the discrimination and difficulty parameters, respectively. In theory, the transformation in Equation  yields an estimate of the cut score (on the theta scale) that is independent of the item difficulty for each judge on each item, because the item difficulty is taken into account by the IRT model, specifically by the difficulty parameter. Using this approach, each judgment on the proportion‐correct scale can be transferred to the theta scale. The generalizability analysis can then be carried out on the transformed judge‐item values on the theta scale rather than the proportion‐correct scale. With this approach, the portion of the item variance associated with differences in empirical item difficulty would be removed—to the extent that the item response theory model fits the data—and any remaining item variance component would appropriately be considered error.</p> <p>This approach is simple and elegant, but it has limitations. The main problem is that the empirical judgments have undergone a nonlinear transformation (as in Equation ) that complicates the interpretation of the resulting variance components. In addition, the IRT analysis assumes that it is reasonable to model the Angoff judgments in terms of the model that is used to analyze test‐taker responses and to expect that the item parameters would be the same for these two very different kinds of data. 
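</p> <p>The transformation from a judged probability to the theta scale described above can be sketched in a few lines. This is a minimal illustration under the two‐parameter model, not the authors' implementation; the function names and the item parameters are hypothetical and chosen only for the example.</p>

```python
import math


def p_2pl(theta, a, b):
    """Two-parameter logistic model: probability of a correct response at theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def judgment_to_theta(p, a, b):
    """Map a judged probability p to the theta scale by inverting the 2PL model.

    Implements theta = [a*b + ln(p / (1 - p))] / a.
    """
    if not 0.0 < p < 1.0:
        # A judgment of exactly 0 or 1 has no finite image on the theta scale.
        raise ValueError("judged probability must lie strictly between 0 and 1")
    return (a * b + math.log(p / (1.0 - p))) / a


# Hypothetical item parameters (assumed for illustration, not from the study).
a, b = 1.2, -0.3
theta = judgment_to_theta(0.70, a, b)

# Round trip: feeding theta back into the model recovers the judged probability.
recovered = p_2pl(theta, a, b)
print(round(theta, 3), round(recovered, 3))
```

<p>The guard on the input reflects a practical constraint of this approach: a judged probability of exactly 1.0 (or 0) cannot be placed on the theta scale, so such judgments have to be handled before the generalizability analysis.</p> <p>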
The IRT approach provides a theoretically elegant estimate of the effect of interest, but it makes strong assumptions, and therefore yields estimates of error variance that are difficult to interpret.</p> <hd id="AN0143594889-4">Method</hd> <p>To provide an empirical example of how the proposed procedures function, generalizability analyses were run on data from a standard‐setting exercise for Step 2 of the United States Medical Licensing Examination. This examination sequence is required for physicians with an MD degree who wish to practice medicine in the United States. The Step 2 examination is a day‐long computer‐administered test assessing clinical knowledge.</p> <p>Three separate panels were convened. The panels had nine, eight, and ten judges, respectively.[<reflink idref="bib1" id="ref27">1</reflink>] The judges were all practicing physicians who were additionally involved in medical education either as medical school faculty or as part of residency training. The data were collected as part of an operational standard setting exercise. For each panel, the group was convened and provided with background information on standard setting, the logic of the Angoff method, and the purpose and structure of the examination. They then discussed the concept of the minimally proficient test taker in the context of the examination. The judges then reviewed a sample set of items, made independent judgments about the items and discussed their judgments as a group. The sample of items was selected to include relatively difficult as well as relatively easy items, items with different numbers of options, and items with various nontext stimulus materials such as pictures and graphs. After this discussion, the judges independently provided judgments for a set of 75 items; the same items were used across the three panels. (For a more complete description of the entire standard‐setting process, see J. C. 
Clauser, Margolis, &amp; Clauser, [<reflink idref="bib11" id="ref28">11</reflink>]; Margolis &amp; Clauser, [<reflink idref="bib16" id="ref29">16</reflink>].)</p> <p>For the correlation‐based approach, variance components were estimated on the proportion‐correct scale. The variance component for items was then adjusted using the approach described previously. To do this, it was necessary to have (1) an estimate of the correlation between the mean item judgments and the empirical conditional item difficulties, (2) an estimate of the variance of the mean item judgments, and (3) estimated variance components.</p> <p>Previous researchers have suggested that conditional probabilities could be estimated by finding the mean observed proportion‐correct score for each item for test takers with a total score close to the estimated cut score (Smith &amp; Smith, [<reflink idref="bib18" id="ref33">18</reflink>]). Under some circumstances this might be a reasonable approach, but the approach has two potential limitations: (1) large samples are needed to accurately estimate the conditional probability because only that part of the sample with scores close to the cut score can be used, and (2) the definition of "close to the cut score" will be unavoidably arbitrary. As the range around the cut score is widened to increase the test taker sample, the resulting proportion‐correct score becomes less and less clearly representative of performance at the cut score.</p> <p>Alternatively, the item difficulties at the cut score can be estimated using the IRT model. With this approach, the cut score on the number‐ (or proportion‐) correct scale can be translated to the theta scale using the test characteristic curve. 
Using this <emph>θ</emph> value and item parameters estimated from the test administration, conditional probabilities can be estimated using the model of choice. Again, in the present research, the two‐parameter model was used, so the equation of interest is</p> <p> <ephtml> &lt;math display="block" altimg="urn:x-wiley:00220655:media:jedm12247:jedm12247-math-0014" xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;mi&gt;P&lt;/mi&gt;&lt;mfenced open="(" close=")"&gt;&lt;mi&gt;&amp;#952;&lt;/mi&gt;&lt;/mfenced&gt;&lt;mo&gt;=&lt;/mo&gt;&lt;mfrac&gt;&lt;msup&gt;&lt;mi&gt;e&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;a&lt;/mi&gt;&lt;mfenced separators="" open="(" close=")"&gt;&lt;mrow&gt;&lt;mi&gt;&amp;#952;&lt;/mi&gt;&lt;mo&gt;&amp;#8722;&lt;/mo&gt;&lt;mi&gt;b&lt;/mi&gt;&lt;/mrow&gt;&lt;/mfenced&gt;&lt;/mrow&gt;&lt;/msup&gt;&lt;mrow&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;mo&gt;+&lt;/mo&gt;&lt;msup&gt;&lt;mi&gt;e&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;a&lt;/mi&gt;&lt;mfenced separators="" open="(" close=")"&gt;&lt;mrow&gt;&lt;mi&gt;&amp;#952;&lt;/mi&gt;&lt;mo&gt;&amp;#8722;&lt;/mo&gt;&lt;mi&gt;b&lt;/mi&gt;&lt;/mrow&gt;&lt;/mfenced&gt;&lt;/mrow&gt;&lt;/msup&gt;&lt;/mrow&gt;&lt;/mfrac&gt;&lt;mo&gt;.&lt;/mo&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml> </p> <p>Again, in this equation, θ represents proficiency on the latent scale, <emph>a</emph> is the discrimination parameter, <emph>b</emph> is the difficulty parameter, and <ephtml> &lt;math display="inline" altimg="urn:x-wiley:00220655:media:jedm12247:jedm12247-math-0015" xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;mi&gt;P&lt;/mi&gt;&lt;mo&gt;(&lt;/mo&gt;&lt;mi&gt;&amp;#952;&lt;/mi&gt;&lt;mo&gt;)&lt;/mo&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml> represents the probability of a correct response for a test taker with proficiency equal to <ephtml> &lt;math display="inline" altimg="urn:x-wiley:00220655:media:jedm12247:jedm12247-math-0016" xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;mspace width="0.33em" 
/&gt;&lt;mi&gt;&amp;#952;&lt;/mi&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml>. This approach has been used in numerous previous studies (e.g., Clauser, Mee, et al., [<reflink idref="bib8" id="ref36">8</reflink>]). Variance components for an items‐crossed‐with‐judges‐nested‐in‐panels design were then estimated using the mGENOVA software (Brennan, [<reflink idref="bib3" id="ref37">3</reflink>]).</p> <p>For the IRT‐based approach, the individual judgments were converted to the theta scale using Equation  and item parameter estimates from a large‐scale calibration based on operational test‐taker responses. The two‐parameter model was used because previous unpublished analyses indicated that the model fit was not significantly improved by applying the three‐parameter model. Again, variance components for an items‐crossed‐with‐judges‐nested‐in‐panels design were then estimated using the mGENOVA software (Brennan, [<reflink idref="bib3" id="ref38">3</reflink>]).</p> <hd id="AN0143594889-5">Results</hd> <p>Table  presents the variance components for G and D studies with 75 items and three panels with a total of 27 judges on the proportion‐correct scale. The first column in Table  provides the estimated variance components on the proportion‐correct scale for the G study with items crossed with raters nested within panels. The middle column in Table  repeats the results from the first column, but with the estimated item variance component replaced by its adjusted value based on the correlation between the Angoff mean item judgments and the empirical conditional probabilities. That correlation was .514 and the variance of the item means was .00759. 
The value for the proportional adjustment is</p> <p> <ephtml> &lt;math display="block" altimg="urn:x-wiley:00220655:media:jedm12247:jedm12247-math-0017" xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;msup&gt;&lt;mi&gt;R&lt;/mi&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;/msup&gt;&lt;mspace width="0.33em" /&gt;&lt;mfrac&gt;&lt;msubsup&gt;&lt;mi&gt;&amp;#963;&lt;/mi&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;/msubsup&gt;&lt;mrow&gt;&lt;msubsup&gt;&lt;mi&gt;&amp;#963;&lt;/mi&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;/msubsup&gt;&lt;mo&gt;&amp;#8722;&lt;/mo&gt;&lt;mspace width="0.33em" /&gt;&lt;mfrac&gt;&lt;mrow&gt;&lt;msup&gt;&lt;mover accent="true"&gt;&lt;mi&gt;&amp;#963;&lt;/mi&gt;&lt;mo&gt;&amp;#770;&lt;/mo&gt;&lt;/mover&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;/msup&gt;&lt;mfenced open="(" close=")"&gt;&lt;mi&gt;ip&lt;/mi&gt;&lt;/mfenced&gt;&lt;/mrow&gt;&lt;msub&gt;&lt;mi&gt;n&lt;/mi&gt;&lt;mi&gt;p&lt;/mi&gt;&lt;/msub&gt;&lt;/mfrac&gt;&lt;mo&gt;&amp;#8722;&lt;/mo&gt;&lt;mspace width="0.33em" /&gt;&lt;mfrac&gt;&lt;mrow&gt;&lt;msup&gt;&lt;mover accent="true"&gt;&lt;mi&gt;&amp;#963;&lt;/mi&gt;&lt;mo&gt;&amp;#770;&lt;/mo&gt;&lt;/mover&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;/msup&gt;&lt;mfenced separators="" open="(" close=")"&gt;&lt;mrow&gt;&lt;mi&gt;ij&lt;/mi&gt;&lt;mo&gt;:&lt;/mo&gt;&lt;mi&gt;p&lt;/mi&gt;&lt;/mrow&gt;&lt;/mfenced&gt;&lt;/mrow&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mi&gt;n&lt;/mi&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/msub&gt;&lt;msub&gt;&lt;mi&gt;n&lt;/mi&gt;&lt;mi&gt;p&lt;/mi&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;/mfrac&gt;&lt;/mrow&gt;&lt;/mfrac&gt;&lt;mo&gt;=&lt;/mo&gt;&lt;mspace width="0.33em" /&gt;&lt;mo&gt;.&lt;/mo&gt;&lt;mn&gt;30&lt;/mn&gt;&lt;mo&gt;.&lt;/mo&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml> </p> <p>Estimated Variance Components for Judgments Using the Proportion‐Correct Scale</p> <p> <ephtml> &lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Source of Variance&lt;/th&gt;&lt;th align="center"&gt;Single Observation&lt;/th&gt;&lt;th align="center"&gt;Adjusted Single 
Observation&lt;/th&gt;&lt;th align="center"&gt;Mean Judgment&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Item (I)&lt;/td&gt;&lt;td&gt;.00666&lt;/td&gt;&lt;td&gt;.00466&lt;/td&gt;&lt;td&gt;.00005&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Panel (P)&lt;/td&gt;&lt;td&gt;&amp;#8722;.00033&lt;/td&gt;&lt;td&gt;&amp;#8722;.00033&lt;/td&gt;&lt;td&gt;.00000&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Judge: Panel (J:P)&lt;/td&gt;&lt;td&gt;.00465&lt;/td&gt;&lt;td&gt;.00465&lt;/td&gt;&lt;td&gt;.00017&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;I by P&lt;/td&gt;&lt;td&gt;&amp;#8722;.00008&lt;/td&gt;&lt;td&gt;&amp;#8722;.00008&lt;/td&gt;&lt;td&gt;.00000&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;I by J:P&lt;/td&gt;&lt;td&gt;.02585&lt;/td&gt;&lt;td&gt;.02585&lt;/td&gt;&lt;td&gt;.00001&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt; </ephtml> </p> <p>1 <emph>Note</emph>: The D‐study variance component estimates assume a design with items crossed with judges nested within panels, and sample sizes of 75 items and three panels with a total of nine judges per panel.</p> <p>So the proportion of the item variance component that is not due to variability in empirical item difficulty, and therefore can be viewed as error, is .70. (That is, .00466 is 70% of .00666; none of the other variance components is affected by the adjustment.)</p> <p>The third column in Table  reports D‐study variance components for a design with items crossed with judges nested within panels, and sample sizes of 75 items and three panels with a total of nine judges per panel. In this third column, the negative estimates for the panel variance component and for the item‐by‐panel interaction component have been replaced by zeros.</p> <p>If only 30% of the variance component for items reflects variability in empirical item difficulty, removing the full item component from the estimate of total error for the estimated cut score is certainly questionable. 
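</p> <p>The proportional adjustment reported above can be checked with a short calculation. The sketch below is illustrative only: the function names are hypothetical, the inputs are the rounded values reported in the text and in the table, the negative item‐by‐panel estimate is used as reported rather than set to zero, and nine judges per panel is assumed so that the three panels contain 27 judges.</p>

```python
def item_variance_component(var_item_means, var_ip, var_res, n_p, n_j):
    """Item variance component: variance of the item means minus the
    contributions of the ip and residual (ij:p) components."""
    return var_item_means - var_ip / n_p - var_res / (n_j * n_p)


def proportion_explained(r, var_item_means, var_i):
    """Share of the item variance component accounted for by empirical difficulty."""
    return r ** 2 * var_item_means / var_i


# Reported (rounded) values on the proportion-correct scale.
r = 0.514                 # correlation of mean judgments with empirical difficulties
var_item_means = 0.00759  # variance of the mean item judgments
var_ip = -0.00008         # item-by-panel component, raw negative estimate (assumption)
var_res = 0.02585         # residual (ij:p) component
n_p, n_j = 3, 9           # panels and judges per panel (assumed balanced)

var_i = item_variance_component(var_item_means, var_ip, var_res, n_p, n_j)
explained = proportion_explained(r, var_item_means, var_i)
adjusted = var_i * (1.0 - explained)

print(round(var_i, 5), round(explained, 2), round(adjusted, 5))
```

<p>With these inputs the item variance component comes out near .00666 and the explained proportion near .30, in line with the reported values; the adjusted component differs from the tabled .00466 only in the last reported digit because the inputs are rounded.</p> <p>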
The resulting adjusted value represents 22% of the total error variance for the mean judgment. This is certainly a nontrivial contribution to measurement error. The result suggests that ignoring the item variance component may significantly underestimate the error associated with the cut score resulting from Angoff standard setting.</p> <p>Table  presents the variance components estimated on the IRT scale. The first column in Table  provides the estimated variance components on the theta scale for the G study with items crossed with raters nested within panels. The second column in Table  reports D‐study variance components for a design with items crossed with judges nested within panels, and sample sizes of 75 items and three panels with a total of nine judges per panel. In the second column, the negative estimates for the panel variance component and for the item‐panel interaction component have been replaced by zeros.</p> <p>Estimated Variance Components for Judgments Using the IRT Scale</p> <p> <ephtml> &lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Source of Variance&lt;/th&gt;&lt;th align="center"&gt;Single Observation&lt;/th&gt;&lt;th align="center"&gt;Mean Judgment&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Item (I)&lt;/td&gt;&lt;td&gt;1.74885&lt;/td&gt;&lt;td&gt;.02332&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Panel (P)&lt;/td&gt;&lt;td&gt;&amp;#8722;.02636&lt;/td&gt;&lt;td&gt;.00000&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Judge: Panel (J:P)&lt;/td&gt;&lt;td&gt;.56497&lt;/td&gt;&lt;td&gt;.02092&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;I by P&lt;/td&gt;&lt;td&gt;&amp;#8722;.03027&lt;/td&gt;&lt;td&gt;.00000&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;I by J:P&lt;/td&gt;&lt;td&gt;3.20434&lt;/td&gt;&lt;td&gt;.00158&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt; </ephtml> </p> <p>2 <emph>Note</emph>: The D‐study variance component estimates assume a design with items crossed with judges nested within panels, and sample sizes of 75 items 
and three panels with a total of nine judges per panel.</p> <p>These IRT‐based results are more difficult to interpret than the proportion‐correct analyses because of the nonlinear transformation of the scale, but they show the variance component for items to be the second largest variance component in the G‐study analysis and to represent 51% of the total error variance in the D‐study. Although the IRT transformation accounts for item difficulty, and thereby removes some of the variability in mean item judgments associated with empirical item difficulty, a substantial item effect remains in the judgments.</p> <hd id="AN0143594889-6">Discussion</hd> <p>In this article, we have presented two approaches to evaluating the extent to which the variability represented by the item variance component in generalizability analyses of Angoff standard‐setting data can be accounted for by the variability in empirical item difficulties. To the extent that the item variance component in the Angoff judgments reflects differences in item difficulty, it would be reasonable to drop this component from estimates of the error for the cut score because the judgments are supposed to reflect the difficulties of the items on the test. The cut score can then be applied to different forms of the test using equating methodology. However, to the extent that the judged item difficulties are unrelated to empirical item difficulties, the variability in judged item difficulties would be viewed as a source of error in estimating the cut score.</p> <p>This has implications for both the estimate of the error and for efforts to develop optimal standard‐setting designs. It is central to answering the question of whether increased precision can most efficiently be achieved by increasing the number of panels, judges (within panels), or items. 
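</p> <p>The design question raised here can be explored with a small D‐study calculation. The sketch below is an illustration under stated assumptions rather than the authors' procedure: it uses the theta‐scale G‐study estimates reported in the Results, keeps the same design (items crossed with judges nested within panels), treats negative estimates as zero, and counts every component as error, as the text does for the theta scale. The function name and the alternative sample size of 150 items are hypothetical.</p>

```python
def d_study(g, n_i, n_p, n_j):
    """Error variance contributions for the mean judgment in an i x (j:p) design.

    g holds single-observation (G-study) variance components; n_j is the
    number of judges per panel. Negative estimates are replaced by zero.
    """
    g = {name: max(value, 0.0) for name, value in g.items()}
    return {
        "i": g["i"] / n_i,
        "p": g["p"] / n_p,
        "j:p": g["j:p"] / (n_j * n_p),
        "ip": g["ip"] / (n_i * n_p),
        "ij:p": g["ij:p"] / (n_i * n_j * n_p),
    }


# Reported G-study estimates on the theta scale.
theta_g = {"i": 1.74885, "p": -0.02636, "j:p": 0.56497,
           "ip": -0.03027, "ij:p": 3.20434}

base = d_study(theta_g, n_i=75, n_p=3, n_j=9)
item_share = base["i"] / sum(base.values())

# Hypothetical alternative: twice as many items, same panels and judges.
more_items = d_study(theta_g, n_i=150, n_p=3, n_j=9)

print({name: round(value, 5) for name, value in base.items()})
print(round(item_share, 2))
print(round(sum(base.values()), 5), round(sum(more_items.values()), 5))
```

<p>For 75 items and three panels of nine judges, the item contribution is about half of the total, which agrees with the reported D‐study values. Changing only the number of items changes the item and residual terms and leaves the judge term untouched, which is why the status of the item component matters for design decisions.</p> <p>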
For example, with the design used in this study, without changing the total effort required of the judges, the number of items can be increased substantially by having each panel review a different set of items. This change in design would, however, only impact the item variance component, so there would be no reason to add to the complexity of administration if this component is not viewed as contributing to error in estimating the cut score.</p> <p>The analyses reported in this article make it clear that the variance associated with items can be a substantial part of the total error variance. On the IRT scale, the variance component for items conceptually represents error; it also represents the largest single contribution to error for mean judgments. As noted previously, it is difficult to directly interpret the results for the IRT analysis because the judgments have undergone a nonlinear transformation. Nevertheless, in theory this value would be 0 if the judges were perfectly internally consistent in making their judgments and their judgments reflected the item difficulty parameter based on the test‐taker responses, that is, if their definition of the minimally proficient test taker were the same across items. The fact that the item variance component is relatively large in this analysis certainly suggests that the judges display a significant lack of internal consistency across items.</p> <p>On the proportion‐correct scale, the variance component for items again makes a nontrivial contribution to total error variance. 
The unadjusted variance component includes both error variance and variance associated with the variability in empirical item difficulties; the adjusted variance component represents the value after the proportion of variance associated with empirical item difficulty is removed.</p> <p>The results presented in this article make it clear that assuming that the item variance component should be disregarded when estimating error variance associated with an Angoff cut score could lead to an overly optimistic report on the precision of the estimate. Disregarding the item variance is typically based on the assumption that the Angoff judgments reflect the actual spread of item difficulty. Application of the analytic approaches presented in this article calls that practice into question.</p> <p>The procedures illustrated in this article can be used to evaluate the impact of the item variance component in generalizability analyses of standard‐setting data using the Angoff method. The empirical example presented in this article is not intended to represent a generalizable conclusion; it is intended as an example. Previous research has suggested that the correspondence between judged and empirical item difficulty is moderate at best (e.g., Busch &amp; Jaeger, [<reflink idref="bib5" id="ref39">5</reflink>]; B. E. Clauser et al., [<reflink idref="bib10" id="ref40">10</reflink>]; B. E. Clauser, Harik, et al., [<reflink idref="bib7" id="ref41">7</reflink>]; B. E. Clauser, Mee, et al., [<reflink idref="bib8" id="ref42">8</reflink>]), but it may be that in some content areas these relationships are much stronger. Additionally, in the data analyzed for this study the panel effect was near 0. J. C. Clauser et al. ([<reflink idref="bib11" id="ref43">11</reflink>]) reported generalizability analyses for other standard setting exercises that showed the panel effect to be substantial. 
In such cases, the proportion of the item variance component that would be viewed as contributing to error would likely be similar to that reported for this study, but the proportion of total error variance represented by this effect would be substantially smaller.</p> <p>Although the general conclusions described in the previous paragraphs are clearly supported by the reported results, there are aspects of the results that are more difficult to understand. For several reasons, we did not expect the generalizability analyses on the two scales (the proportion‐correct scale and the theta scale) to yield the same pattern of results, but neither did we expect the differences to be as large as they are. As indicated in Table , which reports G‐study results for the theta scale, the variance component for items is over three times as large as the variance component for judges within panels. In contrast, in Table , which reports G‐study results for the original proportion‐correct scale, the variance component for items is less than 1.5 times as large as the variance component for judges within panels. In both analyses, the variance component for items is larger than the variance component for judges within panels, but the relative magnitudes are quite different; on the proportion‐correct scale, the two variance components are roughly comparable, whereas on the theta scale, the item component is much larger (by a factor of three) than the judge‐within‐panel component.</p> <p>As noted, we expected the pattern of variance components to be somewhat different across the two scales for at least three reasons. First, the transformation from the proportion‐correct scale to the <emph>θ</emph> scale for each item (an inverse logistic function) is nonlinear; under a linear transformation, the ratios (but not the magnitudes) of the variance components would remain the same, but no such expectation exists for a nonlinear transformation. 
Second, the transformations are different for different items, depending on their item parameters; so, for example, two equal Angoff judgments on two different items would generally be transformed into different (perhaps very different) <emph>θ</emph> values. Third, to the extent that the Angoff judgments are correlated with the IRT difficulty levels, the transformation to the theta scale should lead to a relative reduction in the variance over items. A logistic item‐response curve for an item is approximately linear with a relatively steep slope near the inflection point (the empirical item difficulty value on the theta scale) and flattens out on either side, as it approaches asymptotes of 0 and 1.0; so as we move away from the inflection point, the strength of the relationship between the proportion‐correct scale and <emph>θ</emph> decreases.</p> <p>Tables  and  are quite consistent in a number of ways. First, in both tables the variance components for the main effect for panels and the item‐panel interaction have relatively small negative values, which can be considered to be 0. Second, the ratio of the residual variance (I by J:P) for <emph>θ</emph> to the residual variance for the proportion‐correct scale is similar to the corresponding ratio for the judge‐within‐panel effect: 3.20434/.02585 = 124.0 for the residual variance components, and .56497/.00465 = 121.5 for the judge‐within‐panel components. 
The big difference seems to be in the variance components for items; the ratio of the item variance component to the residual component (ij:p) for the analyses on the theta scale (1.74885/3.20434 = .546) is more than twice as large as the corresponding ratio on the proportion‐correct scale (.00666/.02585 = .258), and the difference is even larger if the adjusted estimate for the item variance component is used for the proportion‐correct G study (.00466/.02585 = .180).</p> <p>The two analyses are consistent in indicating that the Angoff judgments are not correlated with the empirical item difficulty levels strongly enough to justify ignoring the item variance component in estimating the error variance. In both analyses, the item effect makes a substantial contribution to the error. So, we suggest that it would be reasonable to partition the item variance component into a part that is accounted for by empirical item difficulties and therefore does not need to be included in the error variance and a part that is not accounted for by empirical item difficulties and therefore does need to be included in the error variance.</p> <p>Given that the two G‐study analyses are somewhat different, which should get more attention? We suggest that in most cases, analyses based on the original Angoff judgments (i.e., analyses on the proportion‐correct scale) should get more attention for two reasons. First, this is the scale on which the judges made their judgments, and it is the scale on which they will revise their judgments if there are several rounds of judgments (as there usually are). Second, it seems unrealistic to assume that the judges will accurately mimic item response functions in making their judgments, and using the item parameters based on the test takers' item responses to model the Angoff judgments seems highly questionable. 
If the two analyses yielded similar results, we could avail ourselves of the conveniences of using the IRT scale, but because they do not agree, we think it prudent to rely on the analyses of the original data on the proportion‐correct scale.</p> <ref id="AN0143594889-7"> <title> Footnotes </title> <blist> <bibl id="bib1" idref="ref2" type="bt">1</bibl> <bibtext> The original panels had a total of 29 judges. Two judges were dropped because they either had missing data or gave probability estimates of 1.0 that would lead to an estimate of positive infinity on the IRT scale.</bibtext> </blist> </ref> <ref id="AN0143594889-8"> <title> References </title> <blist> <bibtext> American Educational Research Association, American Psychological Association, &amp; National Council on Measurement in Education. (2014). Standards for educational and psychological testing. Washington, DC : American Educational Research Association.</bibtext> </blist> <blist> <bibl id="bib2" idref="ref3" type="bt">2</bibl> <bibtext> Brennan, R. L. (1995). Standard setting from the perspective of generalizability theory. In M. L. Bourque (Ed.), Joint conference on standard setting for large‐scale assessments (pp. 269 – 287). Washington, DC : NCSE‐NAGB.</bibtext> </blist> <blist> <bibl id="bib3" idref="ref13" type="bt">3</bibl> <bibtext> Brennan, R. L. (2001). Manual for mGENOVA. (Occasional Paper Number 47). Iowa City, IA : Iowa Testing Program.</bibtext> </blist> <blist> <bibl id="bib4" idref="ref4" type="bt">4</bibl> <bibtext> Brennan, R. L., &amp; Lockwood, R. E. (1980). A comparison of the Nedelsky and Angoff cutting score procedures using generalizability theory. Applied Psychological Measurement, 4, 219 – 240.</bibtext> </blist> <blist> <bibl id="bib5" idref="ref15" type="bt">5</bibl> <bibtext> Busch, J. C., &amp; Jaeger, R. M. (1990). Influence of type of judge, normative information, and discussion on standards recommended for the National Teacher Examinations. 
Journal of Educational Measurement, 27, 145 – 163.</bibtext> </blist> <blist> <bibl id="bib6" idref="ref5" type="bt">6</bibl> <bibtext> Camilli, G., Cizek, G. J., &amp; Lugg, C. A. (2001). Psychometric theory and the validation of performance standards: History and future perspectives. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods, and perspectives (pp. 445 – 475). Mahwah, NJ : Lawrence Erlbaum.</bibtext> </blist> <blist> <bibl id="bib7" idref="ref6" type="bt">7</bibl> <bibtext> Clauser, B. E., Harik, P., Margolis, M. J., McManus, I. C., Mollon, J., Chis, L., &amp; Williams, S. (2009). An empirical examination of the impact of group discussion and examinee performance information on judgments made in the Angoff standard‐setting procedure. Applied Measurement in Education, 22, 1 – 21.</bibtext> </blist> <blist> <bibl id="bib8" idref="ref36" type="bt">8</bibl> <bibtext> Clauser, B. E., Mee, J., Baldwin, S. G., Margolis, M. J., &amp; Dillon, G. F. (2009). Judges' use of examinee performance data in an Angoff standard‐setting exercise for a medical licensing examination: An experimental study. Journal of Educational Measurement, 46, 390 – 407.</bibtext> </blist> <blist> <bibl id="bib9" idref="ref21" type="bt">9</bibl> <bibtext> Clauser, B. E., Mee, J., &amp; Margolis, M. J. (2013). The effect of data format on integration of performance data into Angoff judgments. International Journal of Testing, 13, 65 – 85.</bibtext> </blist> <blist> <bibtext> Clauser, B. E., Swanson, D. B., &amp; Harik, P. (2002). A multivariate generalizability analysis of the impact of training and examinee performance information on judgments made in an Angoff‐style standard‐setting procedure. Journal of Educational Measurement, 39, 269 – 290.</bibtext> </blist> <blist> <bibtext> Clauser, J. C., Margolis, M. J., &amp; Clauser, B. E. (2014). An examination of the replicability of Angoff standard setting results within a generalizability theory framework. 
Journal of Educational Measurement, 51, 127 – 140.</bibtext> </blist> <blist> <bibtext> Cronbach, L. J., Gleser, G. C., Nanda, H., &amp; Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York, NY : Wiley.</bibtext> </blist> <blist> <bibtext> Hambleton, R. K., Pitoniak, M. J., &amp; Copella, J. M. (2012). Essential steps in setting performance standards on educational tests and strategies for assessing the reliability of results. In G. J. Cizek (Ed.), Setting performance standards: Foundations, methods, and innovations (2nd ed., pp. 47 – 76). New York, NY : Routledge.</bibtext> </blist> <blist> <bibtext> Kane, M. T. (2001). So much remains the same: Conception and status of validation in setting standards. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods, and perspectives (pp. 53 – 88). Mahwah, NJ : Lawrence Erlbaum.</bibtext> </blist> <blist> <bibtext> Kane, M. T., &amp; Wilson, J. (1984). Errors of measurement in standard setting in mastery testing. Applied Psychological Measurement, 8, 107 – 115.</bibtext> </blist> <blist> <bibtext> Margolis, M. J., &amp; Clauser, B. E. (2014). The impact of examinee performance information on judges' cut scores in modified‐Angoff standard setting exercises. Educational Measurement: Issues and Practice, 33 (1), 15 – 22.</bibtext> </blist> <blist> <bibtext> Mee, J., Clauser, B. E., &amp; Margolis, M. J. (2013). The impact of process instructions on judges' use of examinee performance data in Angoff standard setting exercises. Educational Measurement: Issues and Practice, 32 (3), 27 – 35.</bibtext> </blist> <blist> <bibtext> Smith, R. L., &amp; Smith, J. K. (1988). Differential use of item information by judges using Angoff and Nedelsky procedures. Journal of Educational Measurement, 25, 259 – 274.</bibtext> </blist> <blist> <bibtext> van der Linden, W. J. (1982). 
A latent trait method for determining intrajudge consistency in the Angoff and Nedelsky techniques of standard setting. Journal of Educational Measurement, 19, 295 – 308.</bibtext> </blist> </ref> <aug> <p>By Brian E. Clauser, Michael Kane, and Jerome C. Clauser</p> <p>BRIAN E. CLAUSER is Vice President, Center for Advanced Assessment, National Board of Medical Examiners, 3750 Market Street, Philadelphia, PA 19104. His primary research interests include psychometric methods.</p> <p>MICHAEL KANE holds the Messick Chair in Validity, Educational Testing Service, 660 Rosedale Road, Princeton, NJ 08541. His primary research interests include test validity.</p> <p>JEROME C. CLAUSER is Director, Research &amp; Innovations, American Board of Internal Medicine, 510 Walnut Street, Suite 1700, Philadelphia, PA 19106. His primary research interests include standard setting, applications of generalizability theory, and simulations in assessment.</p> </aug> <nolink nlid="nl1" bibid="bib14" firstref="ref1"></nolink> <nolink nlid="nl2" bibid="bib10" firstref="ref7"></nolink> <nolink nlid="nl3" bibid="bib13" firstref="ref8"></nolink> <nolink nlid="nl4" bibid="bib15" firstref="ref10"></nolink> <nolink nlid="nl5" bibid="bib17" firstref="ref22"></nolink> <nolink nlid="nl6" bibid="bib12" firstref="ref24"></nolink> <nolink nlid="nl7" bibid="bib19" firstref="ref26"></nolink> <nolink nlid="nl8" bibid="bib11" firstref="ref28"></nolink> <nolink nlid="nl9" bibid="bib16" firstref="ref29"></nolink> <nolink nlid="nl10" bibid="bib18" firstref="ref33"></nolink>