Routing Strategies and Optimizing Design for Multistage Testing in International Large-Scale Assessments
| Title: | Routing Strategies and Optimizing Design for Multistage Testing in International Large-Scale Assessments |
|---|---|
| Language: | English |
| Authors: | Svetina, Dubravka, Liaw, Yuan-Ling, Rutkowski, Leslie, Rutkowski, David |
| Source: | Journal of Educational Measurement. Spr 2019 56(1):192-213. |
| Availability: | Wiley-Blackwell. 350 Main Street, Malden, MA 02148. Tel: 800-835-6770; Tel: 781-388-8598; Fax: 781-388-8232; e-mail: cs-journals@wiley.com; Web site: http://www.wiley.com/WileyCDA |
| Peer Reviewed: | Y |
| Page Count: | 22 |
| Publication Date: | 2019 |
| Document Type: | Journal Articles Reports - Research |
| Descriptors: | Measurement, Item Analysis, Test Construction, Item Response Theory, Test Length, Scoring, Test Bias, Test Items, Simulation |
| DOI: | 10.1111/jedm.12206 |
| ISSN: | 0022-0655 |
| Abstract: | This study investigates the effect of several design and administration choices on item exposure and person/item parameter recovery under a multistage test (MST) design. In a simulation study, we examine whether number-correct (NC) or item response theory (IRT) methods are differentially effective at routing students to the correct next stage(s) and whether routing choices (optimal versus suboptimal routing) have an impact on achievement precision. Additionally, we examine the impact of testlet length on both person and item recovery. Overall, our results suggest that no single approach works best across the studied conditions. With respect to the mean person parameter recovery, IRT scoring (via either Fisher information or preliminary EAP estimates) outperformed classical NC methods, although differences in bias and root mean squared error were generally small. Item exposure rates were found to be more evenly distributed when suboptimal routing methods were used, and item recovery (both difficulty and discrimination) was most precisely observed for items with moderate difficulties. Based on the results of the simulation study, we draw conclusions and discuss implications for practice in the context of international large-scale assessments that recently introduced adaptive assessment in the form of MST. Future research directions are also discussed. |
| Abstractor: | As Provided |
| Entry Date: | 2019 |
| Accession Number: | EJ1208659 |
| Database: | ERIC |
| FullText | Links: – Type: pdflink Url: https://content.ebscohost.com/cds/retrieve?content=AQICAHj0k_4E0hTGH8RJwT4gCJyBsGNe_WN95AvKlDbXJGqwxwE6k_RwMcF9jPyR1DldZADnAAAA4zCB4AYJKoZIhvcNAQcGoIHSMIHPAgEAMIHJBgkqhkiG9w0BBwEwHgYJYIZIAWUDBAEuMBEEDEsux-R9Kioe5hblEAIBEICBm4Ziev9886pjqhVRO1K7f6kaJCCQp5P3T3UZAchIPUObWjbNxl9vG9ypLhfqjvZOK1wewYgHClvEInqYeOumPcPV5736iTx1Xcvnoy3BybuctYA_T3szVmj4Mm19l1K3dKk3uO8SiUHq8dw7szwV6zHjQG5YpPIclx81nK4nkx8ES5CUOZeY1Ebbpop2cjxQVa8M3POkWiBsW8VJ Text: Availability: 1 Value: <anid>AN0135294541;mea01mar.19;2019Mar15.05:21;v2.2.500</anid> <title id="AN0135294541-1">Routing Strategies and Optimizing Design for Multistage Testing in International Large‐Scale Assessments </title> <p>This study investigates the effect of several design and administration choices on item exposure and person/item parameter recovery under a multistage test (MST) design. In a simulation study, we examine whether number‐correct (NC) or item response theory (IRT) methods are differentially effective at routing students to the correct next stage(s) and whether routing choices (optimal versus suboptimal routing) have an impact on achievement precision. Additionally, we examine the impact of testlet length on both person and item recovery. Overall, our results suggest that no single approach works best across the studied conditions. With respect to the mean person parameter recovery, IRT scoring (via either Fisher information or preliminary EAP estimates) outperformed classical NC methods, although differences in bias and root mean squared error were generally small. Item exposure rates were found to be more evenly distributed when suboptimal routing methods were used, and item recovery (both difficulty and discrimination) was most precisely observed for items with moderate difficulties. 
Based on the results of the simulation study, we draw conclusions and discuss implications for practice in the context of international large‐scale assessments that recently introduced adaptive assessment in the form of MST. Future research directions are also discussed.</p> <p>Recognizing the advantages of computer‐based assessment (CBA; Jodoin, Zenisky, &amp; Hambleton, [<reflink idref="bib8" id="ref1">8</reflink>]; Yan, Lewis, &amp; von Davier, [<reflink idref="bib30" id="ref2">30</reflink>], p. 4), the Organization for Economic Cooperation and Development (OECD) implemented an optional CBA in the 2012 cycle of their flagship study, the Programme for International Student Assessment (PISA; OECD, [<reflink idref="bib20" id="ref3">20</reflink>]). In particular, the OECD cited CBAs as a way to "make the assessment process more efficient and narrow the time lag between collecting the data and making results available to feed into educational improvement" (OECD, [<reflink idref="bib19" id="ref4">19</reflink>], p. 4). Given that dozens of highly heterogeneous educational systems take part in PISA and other international large‐scale assessments (ILSAs), a computerized platform offers a further advantage: the potential to include an adaptive element.</p> <p>Beginning in 2011, the OECD's Program for the International Assessment of Adult Competencies (PIAAC; Kirsch &amp; Lennon, [<reflink idref="bib11" id="ref5">11</reflink>]) implemented the first adaptive ILSA in the form of a multistage test (MST). Following PIAAC's lead, the 2018 cycle of PISA also features an MST component (Educational Testing Service, [<reflink idref="bib5" id="ref6">5</reflink>]). Two key reasons for this innovation include an improved assessment experience for students and more accurate and valid population estimates for many participating systems (OECD, [<reflink idref="bib19" id="ref7">19</reflink>], p. 4). 
In spite of the advantages of an MST design over a more traditional linear fixed‐length test, implementing this sort of approach presents unique challenges in an ILSA context. For example, PIAAC faced administration challenges because the adaptive algorithms needed to be sufficiently flexible to account for population diversity while ensuring even exposure of test booklets across populations (Chen, Yamamoto, &amp; von Davier, [<reflink idref="bib3" id="ref8">3</reflink>]). Further issues, discussed subsequently, involve test stage length and scoring. In the current study, we are interested in even item exposure within a population, with an interest in examining these issues in multiple populations in the future. It is in this context that the current article is written.</p> <hd id="AN0135294541-2">Background</hd> <p>MST is a design that allows for limited adaptation of a test's difficulty to the proficiency of the examinee. It is limited in that, by comparison to a fully computerized adaptive test (CAT), adaptation happens not at the item level but instead at the <emph>module</emph> or testlet level. Here, module is defined as a group of items that always appear together in a block or cluster. One example of an MST is presented in Figure 1. Here, a three‐stage MST structure, referred to collectively as a <emph>panel</emph>, shows the various paths that an examinee may take.[<reflink idref="bib1" id="ref9">1</reflink>] Examinees begin in Stage 1 with a core or routing testlet (S11). Depending on the score in S11, an easier or more difficult Stage 2 testlet is selected, either S21 or S22. Again, based on performance in the previous stage(s), the examinee is routed into a Stage 3 testlet. In total, each examinee receives three blocks of nonoverlapping items. Although the above description is limited to a single panel, an MST can be composed of many panels. 
Further, the described design is just one of many possible in the MST framework.[<reflink idref="bib2" id="ref10">2</reflink>]</p> <p>Graph: An example of a 1‐2‐3 three‐stage MST design used in the study.</p> <p>Scholars have pointed to several advantages of an MST design when compared to other testing formats. For example, when compared to linear fixed‐length tests, MST showed greater testing efficiency (Jodoin et al., [<reflink idref="bib8" id="ref11">8</reflink>]; H. Kim &amp; Plake, [<reflink idref="bib9" id="ref12">9</reflink>]) and increased accuracy in ability estimates (and classification). Moreover, while not as efficient as a CAT, MST may still be a desirable operational choice (Melican, Breithaupt, &amp; Zhang, [<reflink idref="bib16" id="ref13">16</reflink>]). As Luo and Kim ([<reflink idref="bib14" id="ref14">14</reflink>]) suggested, MST provides several practical advantages over CAT (Melican et al., [<reflink idref="bib16" id="ref15">16</reflink>]), including the <emph>a priori</emph> knowledge of psychometric and content properties of all possible test forms, as MST is constructed prior to administration; a more efficient approach to deal with complex test constraints and thus reduce computing power; and the flexibility for the test taker to review and revise responses in the same stage of the assessment.</p> <p>Although MST theory and practice is fairly well‐established for tests used to make individual decisions, the novelty of this method in an ILSA setting implies several open research questions. As noted above, controlling item exposure in PIAAC posed operational challenges. Although controlling item <emph>over‐</emph>exposure in many adaptive tests is desirable for test security (Stocking &amp; Lewis, [<reflink idref="bib25" id="ref16">25</reflink>]), even item exposure in an ILSA setting is necessary to ensure that item parameter estimates are not biased. 
This is because item parameters are estimated subsequent to data collection and exposure to limited subsets of the tested populations, which, in PISA, number more than 80, raises concerns about the stability of parameter estimates. In response, Chen and colleagues developed a probabilistic routing procedure that relied on examinee background information known <emph>a priori</emph>, including their native language and education levels, as well as performance on previous testlets (Chen et al., [<reflink idref="bib3" id="ref17">3</reflink>]). This approach ensured that, regardless of background and prior performance, examinees had a nonzero probability of being routed into any of the available subsequent testlets, offering some safeguards against item over‐ or underexposure. In contrast to PIAAC, most other ILSAs, including PISA and the Trends in International Mathematics and Science Study (TIMSS) collect background information on students <emph>ex post facto</emph>, eliminating the possibility of using student background as part of a routing scheme. Especially in a setting where educational systems differ meaningfully across the proficiency spectrum, an open question regards how best to ensure even testlet exposure. 
For example, given a historic proficiency mean and standard deviation of 500 and 100 (Mullis, Martin, Foy, &amp; Hooper, [<reflink idref="bib18" id="ref18">18</reflink>]), respectively, the difference between the highest and lowest 2015 grade eight TIMSS mathematics performers was just over 2.5 standard deviations (Singapore, $\bar{x} = 621$; Saudi Arabia, $\bar{x} = 368$). Such wide differences suggest that an MST setting that uses a routing scheme based only on performance will underexpose difficult items and overexpose easy items in low‐performing systems and vice versa in high‐performing systems.</p> <p>A second issue involves the method used to score testlets for routing purposes. 
Routing to the next optimal testlet is often, although not always, based on one of two general approaches: (a) a number‐correct (NC) scoring approach (i.e., <emph>classical</emph>) using some predetermined observed score cutoff value (and/or associated percentage of passing/not passing examinees); or (b) an item‐pattern (item response theory [IRT]) approach such as using maximum information or an interim ability estimate[<reflink idref="bib3" id="ref19">3</reflink>] (Hendrickson, [<reflink idref="bib7" id="ref20">7</reflink>]; Yan, von Davier, &amp; Lewis, [<reflink idref="bib28" id="ref21">28</reflink>]). As S. Kim, Moses, and Yoo ([<reflink idref="bib10" id="ref22">10</reflink>]) suggested, the practical benefits of NC scoring, including being easier for test takers to understand (e.g., Armstrong, [<reflink idref="bib1" id="ref23">1</reflink>]), versus the psychometric benefits of item‐pattern scoring, such as more precise estimation, have been debated in the literature. For example, Luecht and Nungester ([<reflink idref="bib13" id="ref24">13</reflink>]) empirically showed that NC scoring can be sufficiently accurate to select testlets. Similarly, although Robin, Manfred, and Liang ([<reflink idref="bib23" id="ref25">23</reflink>]) found in their preliminary study that NC scoring lost measurement accuracy relative to maximum likelihood estimates, the observed losses were small (5% or less across most of the latent trait continuum, and 10% at the upper and lower ends of the continuum). Practical support for an NC method was also reported by Weissman, Belov, and Armstrong ([<reflink idref="bib27" id="ref26">27</reflink>]), who found that, although information‐based methods yielded higher overall classification rates than NC, this came at the expense of item (over)exposure, particularly in later MST stages. 
Still others found that shrinkage is more pronounced under NC scoring due to its lower precision than item‐pattern scoring such as expected <emph>a posteriori</emph> (EAP; e.g., Kolen &amp; Tong, [<reflink idref="bib12" id="ref27">12</reflink>]). The above literature was situated in the context of high‐stakes tests used for decision making. To our knowledge, there is little ILSA‐based research on this question, where inferences are limited to the population and subpopulation level (Mislevy, Johnson, &amp; Muraki, [<reflink idref="bib17" id="ref28">17</reflink>]). A particular focus is on the way in which performance‐based and probabilistic routing decisions might interact. We pursue this question here.</p> <p>A third issue considered in the current article is one aspect of test assembly. To that end, Yan, Lewis, and von Davier ([<reflink idref="bib29" id="ref29">29</reflink>]) examined dimensions of assembly in the context of a small sample size study with few items to study the performance of regression‐tree‐based scoring. In the same volume, Zheng, Wang, Culbertson, and Chang ([<reflink idref="bib31" id="ref30">31</reflink>]) review several assembly methods, including a number of automated approaches. In both studies, the findings are not well connected to the ILSA setting, where, again, item parameters are only known subsequent to data collection (although preliminary estimates are available from field trials) and sample sizes are large (hundreds of thousands of students are available for international item calibration). As such, we consider one aspect of testlet and panel assembly here: that of testlet length. In other words, we consider whether having testlets of equal or unbalanced lengths demonstrates advantages.</p> <p>As ILSAs move into MST, it is timely that these design and administration choices are more systematically considered. It is in this context that we situate our article. 
Specifically, we examine the effect of several design and administration choices on item exposure and person‐parameter recovery. In order to address the main goals of the study, we utilize a Monte Carlo simulation approach. In the next section, we describe the methods used in the current study with an emphasis on the implemented study design, rationale for selected design choices, and the outcome variables.</p> <hd id="AN0135294541-3">Methods</hd> <p>We utilize a Monte Carlo simulation study to address the research questions in our study via the <emph>R</emph> (R Development Core Team, 2018) package <emph>mstR</emph> 1.2 (Magis, Yan, &amp; von Davier, [<reflink idref="bib15" id="ref31">15</reflink>])[<reflink idref="bib4" id="ref32">4</reflink>] for MST simulation and analyses and the <emph>mirt</emph> package (Chalmers, [<reflink idref="bib2" id="ref33">2</reflink>]) for item parameter calibration. In the study, several aspects were treated as fixed: the sample size (<emph>n</emph> = 4,000 simulees), for which abilities were generated from a normal distribution <emph>N</emph>(0, 1); number of items per design was fixed at 36; and the MST was set as a <emph>1‐2‐3</emph> three‐stage design (as presented in Figure 1).[<reflink idref="bib5" id="ref34">5</reflink>] One hundred replications were performed within each condition. Our sample size is generally reflective of operational ILSA settings, where sample sizes range from 3,000 to 8,000 or more students per tested population.</p> <p>Several manipulated factors were included in the study: (a) the number of items per testlet, (b) routing method to the next testlet, and (c) routing probabilities. 
Next, we elaborate on the study design, including the manipulated factors (and respective levels), provide a rationale for the design choices as driven by the ILSA context, and outline the data generation and analysis plan, including outcome variables that align with our research goals.</p> <hd id="AN0135294541-4">Manipulated Factors</hd> <hd id="AN0135294541-5">Number of items per testlet</hd> <p>We manipulated four sizes that any one testlet could assume. Recall that our design involved a 1‐2‐3 MST form. We were particularly interested in examining different lengths of testlets because of our interest in balanced item exposures and parameter recovery. Across all stages, either 6, 10, 12, or 20 items per testlet were selected in any condition, while maintaining a fixed number of 36 items for all simulees. This resulted in four designs, with balanced and imbalanced testlet lengths:</p> <ulist> <item> Design 1 (<emph>equal</emph>) had 12 items in each of the three testlets (<emph>EQ</emph>; 12‐12‐12 items),</item> <item> Design 2 (<emph>short‐to‐long</emph>) had six items in the Core testlet, followed by 10 and 20 items in subsequent testlets (<emph>S‐L</emph>; 6‐10‐20 items),</item> <item> Design 3 (<emph>long‐to‐short</emph>) had 20 items in the Core testlet, followed by 10 and 6 items in subsequent testlets (<emph>L‐S</emph>; 20‐10‐6 items), and</item> <item> Design 4 (<emph>short‐long‐short</emph>) had six items in the Core testlet, followed by 20 and 10 items in subsequent testlets (<emph>S‐L‐S</emph>; 6‐20‐10 items).</item> </ulist> <hd id="AN0135294541-6">Routing method</hd> <p>In the current study, we utilized five 
different routing methods in order to select the optimal next testlet:</p> <ulist> <item> (1) random selection, which meant that examinees were distributed to the next testlet with equal probability regardless of their provisional performance,</item> <item> (2) NC cumulative score (i.e., using a cutoff value based on total score),</item> <item> (3) NC last testlet score (i.e., using a cutoff value based on the last administered testlet only, rather than performance on all previously administered testlets),</item> <item> (4) IRT EAP (i.e., IRT EAP ability estimate), and</item> <item> (5) IRT information (i.e., IRT maximum Fisher information function).</item> </ulist> <p>Cutoff values for the routing methods (2) and (3) for two adaptations are listed in Table . For example, under the <emph>EQ</emph> design, there are 12 items in the Core testlet, so a simulee could earn anywhere from 0 points (no correct answer) to 12 points (all correct answers). 
If a simulee scored 6 or below on the Core testlet, they would be routed to the lower testlet in Stage 2 (labeled as S21 in Figure 1); otherwise, they would be routed to the higher testlet in Stage 2 (labeled as S22 in Figure 1).</p> <p>Ranges of Values for Classical Approaches in Selecting the Next Optimal Testlet</p> <p> <ephtml> &lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Design&lt;/th&gt;&lt;th align="center"&gt;Stage 1&lt;/th&gt;&lt;th align="center"&gt;Stage 2&lt;/th&gt;&lt;th align="center"&gt;Stage 3&lt;/th&gt;&lt;th align="center"&gt;1st Routing Range Values&lt;/th&gt;&lt;th align="center"&gt;2nd Routing Range Values&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;1. EQ&lt;/td&gt;&lt;td&gt;12&lt;/td&gt;&lt;td&gt;12&lt;/td&gt;&lt;td&gt;12&lt;/td&gt;&lt;td&gt;(0,6) (7,12)&lt;/td&gt;&lt;td&gt;(0,7) (8,16) (17,24)&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;2. S&amp;#8208;L&lt;/td&gt;&lt;td&gt;6&lt;/td&gt;&lt;td&gt;10&lt;/td&gt;&lt;td&gt;20&lt;/td&gt;&lt;td&gt;(0,3) (4,6)&lt;/td&gt;&lt;td&gt;(0,5) (6,11) (12,16)&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;3. L&amp;#8208;S&lt;/td&gt;&lt;td&gt;20&lt;/td&gt;&lt;td&gt;10&lt;/td&gt;&lt;td&gt;6&lt;/td&gt;&lt;td&gt;(0,10) (11,20)&lt;/td&gt;&lt;td&gt;(0,9) (11,20) (21,30)&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;4. S&amp;#8208;L&amp;#8208;S&lt;/td&gt;&lt;td&gt;6&lt;/td&gt;&lt;td&gt;20&lt;/td&gt;&lt;td&gt;10&lt;/td&gt;&lt;td&gt;(0,3) (4,6)&lt;/td&gt;&lt;td&gt;(0,8) (9,17) (18,26)&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt; </ephtml> </p> <p><emph>Note</emph>. 
Design 1: EQ (equal) had 12 items in each of the three testlets (12‐12‐12 items); Design 2: S‐L (short‐to‐long) had six items in the Core testlet, followed by 10 and 20 items in subsequent testlets (6‐10‐20 items); Design 3: L‐S (long‐to‐short) had 20 items in the Core testlet, followed by 10 and 6 items in subsequent testlets (20‐10‐6 items); and Design 4: S‐L‐S (short‐long‐short) had six items in the Core testlet, followed by 20 and 10 items in subsequent testlets (6‐20‐10 items). Routing range values represent cutoff values used in selecting the next module. For example, under the EQ design, the first routing implies that if a simulee obtains a total score of 6 or below (out of 12) in the Stage 1 (Core) module, they would be routed to the S21 module in Stage 2. If they obtained a score of 7 or above (out of 12), they would be routed to S22 in Stage 2.</p> <hd id="AN0135294541-7">Routing probabilities</hd> <p>To investigate the effect on item exposure and parameter recovery, we consider a probabilistic routing mechanism, where, regardless of performance, a student can be routed to either an optimal or suboptimal testlet, with some nonzero probability (e.g., dashed lines in Figure 1). In other words, a student who answers all items correctly and <emph>should</emph> be routed to a more difficult testlet in the next stage would have a predetermined probability of being routed into an easier testlet, ensuring that some portion of high achievers would be exposed to easy items.</p> <p>Levels of this manipulated factor included optimal testlet routing probabilities of (a) 1.00, (b) .80, or (c) .70. When the probability equals 1.00, all simulees are optimally routed based on performance only. In the other two conditions, all simulees face a .20 or .30 probability of routing to a suboptimal testlet regardless of performance. For example, optimal routing from Stage 1 to Stage 2 has a probability of .80 or .70, respectively, while suboptimal routing happens with a probability of .20 or .30, respectively. 
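</p> <p>The first routing step can be sketched in a few lines of Python. This is only an illustration, assuming the EQ design's number-correct cutoff of 6 and an optimal-routing probability of .80; the function name route_stage2 and its parameters are hypothetical and are not taken from the study, which was run in R.</p>

```python
import random


def route_stage2(core_score, cutoff=6, p_optimal=0.80, rng=random):
    """Return 'S21' (easier) or 'S22' (harder) for a Stage 1 number-correct score.

    Hypothetical helper: cutoff=6 matches the EQ design, where scores of
    6 or below are routed to S21 and scores of 7 or above to S22.
    """
    # Performance-based (optimal) choice.
    optimal = "S21" if cutoff >= core_score else "S22"
    suboptimal = "S22" if optimal == "S21" else "S21"
    # Keep the optimal testlet with probability p_optimal, otherwise switch.
    return optimal if p_optimal > rng.random() else suboptimal


# With p_optimal = 1.00 routing depends on performance only.
print(route_stage2(6, p_optimal=1.0), route_stage2(7, p_optimal=1.0))
```

<p>Setting p_optimal to .70 gives the third level of the routing-probability factor.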
An identical procedure is implemented for routing from Stage 2 to Stage 3. A distinction in the second routing procedure is that when suboptimal routing is selected, either of two possible testlets is selected with equal probability.</p> <p>When the routing probability was 1.00 for an optimal testlet, there existed four possible paths: Path 1: Core – Moderately Easy – Easy; Path 2: Core – Moderately Easy – Moderate; Path 3: Core – Moderately Difficult – Moderate; Path 4: Core – Moderately Difficult – Difficult. Introducing a probabilistic element into routing produced an additional two paths to which simulees could be routed. That is, when the routing probability was less than 1.00, the two additional paths (marked as dashed lines in Figure 1) were: Path 5: Core – Moderately Easy – Difficult, and Path 6: Core – Moderately Difficult – Easy.</p> <hd id="AN0135294541-8">Data Generation and Analysis</hd> <p>In simulating data, we selected item parameters with specific ranges for each testlet. The Core testlet at Stage 1 was of medium difficulty, where item difficulty (location) parameters were selected randomly between –1 and 1 logits. At Stage 2, item difficulty parameters were selected randomly between .5 and 1.5 for the moderately high difficulty testlet and between –1.5 and –.5 for the moderately low difficulty testlet. At Stage 3, the item difficulty parameters for the highest, medium, and lowest level of difficulty testlets were randomly selected from 1 to 2, from –1 to 1, and from –2 to –1 logits, respectively. Item discrimination parameters were randomly sampled between .5 and 1.5 logits for all items, balancing similar levels of discrimination across all testlets. Table  displays means and variances for item difficulty and discrimination parameters used in data generation. 
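</p> <p>As a rough illustration of this setup, the Python sketch below draws item parameters from the stated ranges and simulates responses for one simulee on the Core testlet. Several details are assumptions of the sketch rather than facts from the study: uniform sampling within each range, the labels S31 to S33 for the Stage 3 testlets, the seed, and all function names. The response step uses the two-parameter logistic (2PL) model that the study reports for data generation.</p>

```python
import math
import random

# Difficulty ranges (logits) per testlet, taken from the description in the text.
# The labels S31-S33 are assumed here for the Stage 3 testlets.
DIFFICULTY_RANGES = {
    "S11": (-1.0, 1.0),   # Core, medium difficulty
    "S21": (-1.5, -0.5),  # Stage 2, moderately easy
    "S22": (0.5, 1.5),    # Stage 2, moderately difficult
    "S31": (-2.0, -1.0),  # Stage 3, easy
    "S32": (-1.0, 1.0),   # Stage 3, moderate
    "S33": (1.0, 2.0),    # Stage 3, difficult
}


def draw_items(n_items, b_range, a_range=(0.5, 1.5), rng=random):
    """Draw (a, b) pairs; uniform sampling is an assumption of this sketch."""
    return [(rng.uniform(*a_range), rng.uniform(*b_range)) for _ in range(n_items)]


def p_correct(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def simulate_responses(theta, items, rng=random):
    """Return a list of 0/1 responses for one simulee on one testlet."""
    return [1 if p_correct(theta, a, b) > rng.random() else 0 for a, b in items]


rng = random.Random(2019)  # arbitrary seed for reproducibility
core = draw_items(12, DIFFICULTY_RANGES["S11"], rng=rng)  # EQ design: 12 Core items
theta = rng.gauss(0.0, 1.0)  # ability drawn from N(0, 1)
responses = simulate_responses(theta, core, rng=rng)
print(sum(responses), "of", len(responses), "Core items correct")
```

<p>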
These item parameters are generally reflective of empirical parameters observed in PISA (OECD, [<reflink idref="bib21" id="ref46">21</reflink>]) and were chosen to reflect differences between the more difficult and easier testlets. All item responses were generated under the dichotomous two‐parameter logistic (2PL) IRT model using above‐discussed item parameters and randomly selected person parameters from a standard normal distribution.</p> <p>Means and SDs of Item Difficulty (Panel a) and Discrimination (Panel b) Parameters in Each Testlet and for Each MST Design</p> <p> <ephtml> &lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th /&gt;&lt;th align="center"&gt;Stage 1&lt;/th&gt;&lt;th align="center"&gt;Stage 2&lt;/th&gt;&lt;th align="center"&gt;Stage 3&lt;/th&gt;&lt;/tr&gt;&lt;tr&gt;&lt;th /&gt;&lt;th align="center"&gt;Testlet 1&lt;/th&gt;&lt;th align="center"&gt;Testlet 21&lt;/th&gt;&lt;th align="center"&gt;Testlet 22&lt;/th&gt;&lt;th align="center"&gt;Testlet 31&lt;/th&gt;&lt;th align="center"&gt;Testlet 32&lt;/th&gt;&lt;th align="center"&gt;Testlet 33&lt;/th&gt;&lt;/tr&gt;&lt;tr&gt;&lt;th&gt;Design&lt;/th&gt;&lt;th align="center"&gt;Mean&lt;/th&gt;&lt;th align="center"&gt;&lt;italic&gt;SD&lt;/italic&gt;&lt;/th&gt;&lt;th align="center"&gt;Mean&lt;/th&gt;&lt;th align="center"&gt;&lt;italic&gt;SD&lt;/italic&gt;&lt;/th&gt;&lt;th align="center"&gt;Mean&lt;/th&gt;&lt;th align="center"&gt;&lt;italic&gt;SD&lt;/italic&gt;&lt;/th&gt;&lt;th align="center"&gt;Mean&lt;/th&gt;&lt;th align="center"&gt;&lt;italic&gt;SD&lt;/italic&gt;&lt;/th&gt;&lt;th align="center"&gt;Mean&lt;/th&gt;&lt;th align="center"&gt;&lt;italic&gt;SD&lt;/italic&gt;&lt;/th&gt;&lt;th align="center"&gt;Mean&lt;/th&gt;&lt;th align="center"&gt;&lt;italic&gt;SD&lt;/italic&gt;&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td /&gt;&lt;td align="center"&gt;&lt;italic&gt;Panel (a) Difficulty Parameters&lt;/italic&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;1. 
EQ&lt;/td&gt;&lt;td&gt;&amp;#8722;0.117&lt;/td&gt;&lt;td&gt;0.545&lt;/td&gt;&lt;td&gt;&amp;#8722;1.192&lt;/td&gt;&lt;td&gt;0.265&lt;/td&gt;&lt;td&gt;0.712&lt;/td&gt;&lt;td&gt;0.154&lt;/td&gt;&lt;td&gt;&amp;#8722;1.579&lt;/td&gt;&lt;td&gt;0.312&lt;/td&gt;&lt;td&gt;&amp;#8722;0.477&lt;/td&gt;&lt;td&gt;0.396&lt;/td&gt;&lt;td&gt;1.338&lt;/td&gt;&lt;td&gt;0.300&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;2. S&amp;#8208;L&lt;/td&gt;&lt;td&gt;0.022&lt;/td&gt;&lt;td&gt;0.789&lt;/td&gt;&lt;td&gt;&amp;#8722;1.128&lt;/td&gt;&lt;td&gt;0.252&lt;/td&gt;&lt;td&gt;0.746&lt;/td&gt;&lt;td&gt;0.273&lt;/td&gt;&lt;td&gt;&amp;#8722;1.621&lt;/td&gt;&lt;td&gt;0.302&lt;/td&gt;&lt;td&gt;&amp;#8722;0.243&lt;/td&gt;&lt;td&gt;0.604&lt;/td&gt;&lt;td&gt;1.379&lt;/td&gt;&lt;td&gt;0.302&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;3. L&amp;#8208;S&lt;/td&gt;&lt;td&gt;&amp;#8722;0.005&lt;/td&gt;&lt;td&gt;0.612&lt;/td&gt;&lt;td&gt;&amp;#8722;1.223&lt;/td&gt;&lt;td&gt;0.240&lt;/td&gt;&lt;td&gt;0.746&lt;/td&gt;&lt;td&gt;0.273&lt;/td&gt;&lt;td&gt;&amp;#8722;1.707&lt;/td&gt;&lt;td&gt;0.270&lt;/td&gt;&lt;td&gt;&amp;#8722;0.114&lt;/td&gt;&lt;td&gt;0.819&lt;/td&gt;&lt;td&gt;1.272&lt;/td&gt;&lt;td&gt;0.232&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;4. S&amp;#8208;L&amp;#8208;S&lt;/td&gt;&lt;td&gt;0.022&lt;/td&gt;&lt;td&gt;0.789&lt;/td&gt;&lt;td&gt;&amp;#8722;1.121&lt;/td&gt;&lt;td&gt;0.302&lt;/td&gt;&lt;td&gt;0.879&lt;/td&gt;&lt;td&gt;0.302&lt;/td&gt;&lt;td&gt;&amp;#8722;1.700&lt;/td&gt;&lt;td&gt;0.251&lt;/td&gt;&lt;td&gt;&amp;#8722;0.209&lt;/td&gt;&lt;td&gt;0.619&lt;/td&gt;&lt;td&gt;1.519&lt;/td&gt;&lt;td&gt;0.333&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td /&gt;&lt;td align="center"&gt;&lt;italic&gt;Panel (b) Discrimination Parameters&lt;/italic&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;1. 
EQ&lt;/td&gt;&lt;td&gt;1.200&lt;/td&gt;&lt;td&gt;0.424&lt;/td&gt;&lt;td&gt;1.200&lt;/td&gt;&lt;td&gt;0.318&lt;/td&gt;&lt;td&gt;1.200&lt;/td&gt;&lt;td&gt;0.534&lt;/td&gt;&lt;td&gt;1.200&lt;/td&gt;&lt;td&gt;0.461&lt;/td&gt;&lt;td&gt;1.200&lt;/td&gt;&lt;td&gt;0.389&lt;/td&gt;&lt;td&gt;1.200&lt;/td&gt;&lt;td&gt;0.433&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;2. S&amp;#8208;L&lt;/td&gt;&lt;td&gt;1.200&lt;/td&gt;&lt;td&gt;0.415&lt;/td&gt;&lt;td&gt;1.200&lt;/td&gt;&lt;td&gt;0.399&lt;/td&gt;&lt;td&gt;1.200&lt;/td&gt;&lt;td&gt;0.547&lt;/td&gt;&lt;td&gt;1.200&lt;/td&gt;&lt;td&gt;0.415&lt;/td&gt;&lt;td&gt;1.200&lt;/td&gt;&lt;td&gt;0.449&lt;/td&gt;&lt;td&gt;1.200&lt;/td&gt;&lt;td&gt;0.416&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;3. L&amp;#8208;S&lt;/td&gt;&lt;td&gt;1.200&lt;/td&gt;&lt;td&gt;0.462&lt;/td&gt;&lt;td&gt;1.427&lt;/td&gt;&lt;td&gt;0.369&lt;/td&gt;&lt;td&gt;1.200&lt;/td&gt;&lt;td&gt;0.547&lt;/td&gt;&lt;td&gt;1.200&lt;/td&gt;&lt;td&gt;0.499&lt;/td&gt;&lt;td&gt;1.200&lt;/td&gt;&lt;td&gt;0.314&lt;/td&gt;&lt;td&gt;1.205&lt;/td&gt;&lt;td&gt;0.371&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;4. S&amp;#8208;L&amp;#8208;S&lt;/td&gt;&lt;td&gt;1.200&lt;/td&gt;&lt;td&gt;0.415&lt;/td&gt;&lt;td&gt;1.200&lt;/td&gt;&lt;td&gt;0.387&lt;/td&gt;&lt;td&gt;1.200&lt;/td&gt;&lt;td&gt;0.501&lt;/td&gt;&lt;td&gt;1.200&lt;/td&gt;&lt;td&gt;0.350&lt;/td&gt;&lt;td&gt;1.200&lt;/td&gt;&lt;td&gt;0.426&lt;/td&gt;&lt;td&gt;1.200&lt;/td&gt;&lt;td&gt;0.492&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt; </ephtml> </p> <p>The optimal panels (a)–(d) in Figure 2 show the results for the test information function constructed from four different paths for each test length design. Under the same path, different test length designs yield parallel test forms. The peaks of the information functions for the four paths are located at approximately –1.14, –0.69, 0.28, and 0.89, respectively, on standard normal proficiency scale. And the maximum values of the information functions are 12.39, 12.43, 11.79, and 12.10. 
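</p> <p>For readers who want to reproduce peaks of this kind, a path information function under the 2PL model can be sketched as follows. The sketch assumes the standard 2PL item information a²P(1 − P) with no additional scaling constant, evaluates it on a theta grid, and uses hypothetical function names and grid settings; it is not the authors' code.</p>

```python
import math


def item_information(theta, a, b):
    """2PL item information: a^2 * P * (1 - P)."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)


def path_information(theta, items):
    """Sum item information over all (a, b) pairs on a path."""
    return sum(item_information(theta, a, b) for a, b in items)


def peak(items, lo=-4.0, hi=4.0, steps=801):
    """Return (theta, information) at the grid point with maximum information."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    best = max(grid, key=lambda t: path_information(t, items))
    return best, path_information(best, items)


# Toy path of four identical items (a = 1, b = 0); the information peaks at theta = 0.
toy_path = [(1.0, 0.0)] * 4
print(peak(toy_path))
```

<p>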
On the other hand, panels (e) and (f) display the results of two nonoptimal paths with information functions that are more irregular. For Path 5, the peaks for the four test length designs are located at –0.60, 1.27, –0.62, and –0.97, respectively, on the theta scale, and the maximum values of the path information function ranged from 9.16 to 10.91. For Path 6, the peaks are located at 0.02, –1.36, 0.45, and 0.75, and the maximum values of the path information function ranged from 9.44 to 10.88. Taken together, Figure 2 shows contrasting breadth and height of the information functions between the optimal (panels a through d) and nonoptimal (panels e and f) paths. Namely, the information functions were higher for the optimal paths than for the nonoptimal paths, while they were broader for the nonoptimal paths than for their optimal counterparts.</p> <p>Graph: Test information function by optimal path (panels a–d) and nonoptimal path (panels e–f).</p> <p>Graph: image_n/jedm12206-fig-0002.png</p> <hd id="AN0135294541-9">Evaluation Criteria</hd> <p>As stated above, the twofold purpose of the study was to examine person/item parameter estimates and item exposure. With respect to ability estimation, we evaluated performance by computing the average bias and root mean square error (RMSE) of the parameter estimates across 100 replications within each condition, as well as correlations between the estimated and true values of theta. Item parameter bias was defined as the average difference between the estimated and true values of the parameters across 36 items, while the RMSE was obtained by taking the square root of the mean of the squared deviations of the estimated parameter values from their true values. With respect to item exposure, we computed item exposure rates as the ratio of the number of times an item was administered to the total number of simulees. 
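</p> <p>The criteria above can be sketched in a few lines of code. This is a minimal illustration under our own assumptions rather than the code used in the study: the function names are hypothetical, estimates and true values are passed as plain lists, and the administration record is assumed to hold one collection of item indices per simulee.</p>

```python
import math


def bias(estimates, truths):
    """Average signed difference between estimated and true values."""
    return sum(e - t for e, t in zip(estimates, truths)) / len(truths)


def rmse(estimates, truths):
    """Square root of the mean squared deviation from the true values."""
    squared = [(e - t) ** 2 for e, t in zip(estimates, truths)]
    return math.sqrt(sum(squared) / len(squared))


def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def exposure_rates(administered, n_items):
    """Share of simulees who saw each item.

    `administered` holds one collection of item indices per simulee
    (an assumed data layout, chosen for this illustration).
    """
    counts = [0] * n_items
    for items in administered:
        for index in set(items):  # count an item once per simulee
            counts[index] += 1
    return [count / len(administered) for count in counts]
```

<p>In a replicated simulation, these quantities would be computed within each replication and then averaged across replications for each condition.</p> <p>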
We also examined item parameter recovery to determine whether item calibration was affected by item exposure rates.</p> <hd id="AN0135294541-10">Results</hd> <p>We report results in two sections: Section I presents results for proficiency estimation recovery, while Section II discusses results related to item exposure and recovery. For space reasons, we present selected results to highlight the main findings; however, all tabulated and expanded graphical results can be found at https://figshare.com/s/a9345b3c71a8b973630f.</p> <hd id="AN0135294541-11">Section I. Person Ability Parameter Recovery</hd> <p>Figure 3 presents results related to the recovery of person parameters. In the figure, rows represent three different outcome variables: bias, RMSE, and correlations between the true and estimated thetas. Columns represent different groups of simulees. Results are divided (albeit arbitrarily) into three groups: low (estimated theta values below –1), moderate (estimated theta values between –1 and +1), and high (estimated theta values greater than +1). Within each graph, results for the four designs are represented by different markers (e.g., squares represent the EQ design, which had 12 items in each of the three testlets).[<reflink idref="bib6" id="ref47">6</reflink>] Further, in each of the graphs, the <emph>x</emph>‐axis represents the five methods used as the basis for testlet selection (random, NC cumulative score, NC last testlet score, IRT EAP, and IRT information), while the <emph>y</emph>‐axis represents the respective outcome. 
Finally, the <emph>x</emph>‐axis ticks on the bottom indicate the various studied probabilities for routing.</p> <p>Graph: Person parameter recovery rates of bias, RMSE, and correlations for low, medium, and high performers.</p> <p>Graph: image_n/jedm12206-fig-0003.png</p> <p>Focusing on bias, the IRT‐based methods (EAP and information) and random routing were most successful in recovering the theta estimates across the three group levels for all designs and routing probability conditions. Across the conditions, bias ranged from –0.001 to 0.001 for the IRT‐based methods. Slightly larger bias was found under the NC methods, and its direction was reversed for the low‐ versus the high‐performing group. Specifically, bias was positive for low performers, while negative bias of similar magnitude was found for high performers. Two further patterns were noted. First, for the NC methods, differences among designs were noted across the probability levels. Namely, the smallest bias was found under the <emph>S‐L</emph> design, while the largest bias was found under the <emph>L‐S</emph> design. Second, for the low/high performing group, as the routing probabilities decreased from 1.0 to .8 and .7, bias slightly increased/decreased for all four designs. For the medium group (in the –1 to +1 theta range), recovery was near perfect across all methods and routing probabilities.</p> <p>The second row of Figure 3 reports theta recovery expressed as RMSE values. While the conclusions are in line with the bias results, patterns of design performance across routing probabilities are more pronounced. 
One notable pattern was the poorer performance of the NC methods, compared to the random and IRT‐based methods, in selecting the next optimal testlet for all three groups, although RMSEs were higher for the more extreme groups (i.e., low and high) across the four designs.</p> <p>The bottom row of Figure 3 shows correlations between true and estimated person proficiency values. High correlations were found mostly in the medium group and for the IRT‐based and random methods, regardless of the routing probability. Most notably, for the NC methods in the low/high groups, correlations decreased as the routing probabilities moved away from 1.0 (optimal routing), with the <emph>L‐S</emph> design being the most affected (correlations ranged from the mid‐.50s at their highest to .40 at their lowest). The order (pattern of performance) of the four studied designs also remained unchanged, with <emph>S‐L</emph> yielding the highest correlations across routing probabilities in the low and high groups, while <emph>L‐S</emph> yielded the lowest correlations.</p> <hd id="AN0135294541-12">Section II. Item Exposure Rates and Parameter Recovery</hd> <p>In the current study, we were interested in examining designs that would yield the most even item exposure rates. For space reasons, we summarize exposure rates across the studied designs and conditions for different types of items. Specifically, in Table  we report the average exposure rates for items that are (albeit arbitrarily) divided into three groups: easy (items with generated <emph>b</emph> values less than –1), moderate (items with generated <emph>b</emph> values between –1 and +1), and difficult (items with generated <emph>b</emph> values greater than +1). 
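</p> <p>The grouping and averaging just described can be sketched as follows. The function names are hypothetical, and the treatment of items whose <emph>b</emph> value equals exactly –1 or +1 (placed in the moderate group here) is our assumption, since the text does not specify how boundary values were handled.</p>

```python
def difficulty_level(b):
    """Classify an item by its generated difficulty (b) value."""
    if b < -1.0:
        return "easy"
    if b > 1.0:
        return "difficult"
    return "moderate"  # boundary values of -1 and +1 fall here (assumption)


def mean_exposure_by_level(b_values, exposure):
    """Average exposure rate within each difficulty group."""
    totals, counts = {}, {}
    for b, rate in zip(b_values, exposure):
        level = difficulty_level(b)
        totals[level] = totals.get(level, 0.0) + rate
        counts[level] = counts.get(level, 0) + 1
    return {level: totals[level] / counts[level] for level in totals}
```

<p>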
This grouping is consistent with our person parameter criteria.</p> <p>Average Item Exposure Rates (in Proportions) Across Routing Methods and Probabilities of Routing</p> <p> <ephtml> &lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th align="center"&gt;Routing to Next Testlet Methods&lt;/th&gt;&lt;/tr&gt;&lt;tr&gt;&lt;th valign="bottom" /&gt;&lt;th valign="bottom" align="center"&gt;Random&lt;/th&gt;&lt;th valign="bottom" align="center"&gt;Number&amp;#8208;correct Cumulative Score&lt;/th&gt;&lt;th valign="bottom" align="center"&gt;Number&amp;#8208;correct Last Testlet Score&lt;/th&gt;&lt;th valign="bottom" align="center"&gt;IRT EAP&lt;/th&gt;&lt;th valign="bottom" align="center"&gt;IRT Information&lt;/th&gt;&lt;/tr&gt;&lt;tr&gt;&lt;th&gt;&lt;bold&gt;b level/prob&lt;/bold&gt;&lt;/th&gt;&lt;th align="center"&gt;1&lt;/th&gt;&lt;th align="center"&gt;1&lt;/th&gt;&lt;th align="center"&gt;.8&lt;/th&gt;&lt;th align="center"&gt;.7&lt;/th&gt;&lt;th align="center"&gt;1&lt;/th&gt;&lt;th align="center"&gt;.8&lt;/th&gt;&lt;th align="center"&gt;.7&lt;/th&gt;&lt;th align="center"&gt;1&lt;/th&gt;&lt;th align="center"&gt;.8&lt;/th&gt;&lt;th align="center"&gt;.7&lt;/th&gt;&lt;th align="center"&gt;1&lt;/th&gt;&lt;th align="center"&gt;.8&lt;/th&gt;&lt;th align="center"&gt;.7&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td align="center"&gt;Panel (a) &lt;italic&gt;EQ&lt;/italic&gt; (12&amp;#8208;12&amp;#8208;12)&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Easy (22)&lt;/td&gt;&lt;td&gt;.36&lt;/td&gt;&lt;td&gt;.29&lt;/td&gt;&lt;td&gt;.32&lt;/td&gt;&lt;td&gt;.34&lt;/td&gt;&lt;td&gt;.32&lt;/td&gt;&lt;td&gt;.34&lt;/td&gt;&lt;td&gt;.36&lt;/td&gt;&lt;td&gt;.30&lt;/td&gt;&lt;td&gt;.32&lt;/td&gt;&lt;td&gt;.34&lt;/td&gt;&lt;td&gt;.26&lt;/td&gt;&lt;td&gt;.30&lt;/td&gt;&lt;td&gt;.32&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Moderate 
(25)&lt;/td&gt;&lt;td&gt;.50&lt;/td&gt;&lt;td&gt;.61&lt;/td&gt;&lt;td&gt;.55&lt;/td&gt;&lt;td&gt;.53&lt;/td&gt;&lt;td&gt;.56&lt;/td&gt;&lt;td&gt;.52&lt;/td&gt;&lt;td&gt;.49&lt;/td&gt;&lt;td&gt;.60&lt;/td&gt;&lt;td&gt;.57&lt;/td&gt;&lt;td&gt;.54&lt;/td&gt;&lt;td&gt;.55&lt;/td&gt;&lt;td&gt;.54&lt;/td&gt;&lt;td&gt;.52&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Difficult (13)&lt;/td&gt;&lt;td&gt;.27&lt;/td&gt;&lt;td&gt;.19&lt;/td&gt;&lt;td&gt;.23&lt;/td&gt;&lt;td&gt;.25&lt;/td&gt;&lt;td&gt;.23&lt;/td&gt;&lt;td&gt;.26&lt;/td&gt;&lt;td&gt;.29&lt;/td&gt;&lt;td&gt;.18&lt;/td&gt;&lt;td&gt;.21&lt;/td&gt;&lt;td&gt;.23&lt;/td&gt;&lt;td&gt;.34&lt;/td&gt;&lt;td&gt;.30&lt;/td&gt;&lt;td&gt;.29&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center"&gt;Panel (b) &lt;italic&gt;S&amp;#8208;L&lt;/italic&gt; (6&amp;#8208;10&amp;#8208;20)&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Easy (27)&lt;/td&gt;&lt;td&gt;.31&lt;/td&gt;&lt;td&gt;.23&lt;/td&gt;&lt;td&gt;.27&lt;/td&gt;&lt;td&gt;.29&lt;/td&gt;&lt;td&gt;.30&lt;/td&gt;&lt;td&gt;.32&lt;/td&gt;&lt;td&gt;.34&lt;/td&gt;&lt;td&gt;.22&lt;/td&gt;&lt;td&gt;.25&lt;/td&gt;&lt;td&gt;.28&lt;/td&gt;&lt;td&gt;.22&lt;/td&gt;&lt;td&gt;.25&lt;/td&gt;&lt;td&gt;.28&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Moderate (32)&lt;/td&gt;&lt;td&gt;.50&lt;/td&gt;&lt;td&gt;.66&lt;/td&gt;&lt;td&gt;.59&lt;/td&gt;&lt;td&gt;.56&lt;/td&gt;&lt;td&gt;.56&lt;/td&gt;&lt;td&gt;.52&lt;/td&gt;&lt;td&gt;.48&lt;/td&gt;&lt;td&gt;.65&lt;/td&gt;&lt;td&gt;.59&lt;/td&gt;&lt;td&gt;.55&lt;/td&gt;&lt;td&gt;.57&lt;/td&gt;&lt;td&gt;.55&lt;/td&gt;&lt;td&gt;.53&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Difficult (21)&lt;/td&gt;&lt;td&gt;.26&lt;/td&gt;&lt;td&gt;.12&lt;/td&gt;&lt;td&gt;.18&lt;/td&gt;&lt;td&gt;.21&lt;/td&gt;&lt;td&gt;.18&lt;/td&gt;&lt;td&gt;.23&lt;/td&gt;&lt;td&gt;.26&lt;/td&gt;&lt;td&gt;.16&lt;/td&gt;&lt;td&gt;.20&lt;/td&gt;&lt;td&gt;.23&lt;/td&gt;&lt;td&gt;.28&lt;/td&gt;&lt;td&gt;.26&lt;/td&gt;&lt;td&gt;.26&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center"&gt;Panel (c) 
&lt;italic&gt;L&amp;#8208;S&lt;/italic&gt; (20&amp;#8208;10&amp;#8208;6)&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Easy (15)&lt;/td&gt;&lt;td&gt;.40&lt;/td&gt;&lt;td&gt;.39&lt;/td&gt;&lt;td&gt;.40&lt;/td&gt;&lt;td&gt;.40&lt;/td&gt;&lt;td&gt;.39&lt;/td&gt;&lt;td&gt;.40&lt;/td&gt;&lt;td&gt;.41&lt;/td&gt;&lt;td&gt;.36&lt;/td&gt;&lt;td&gt;.37&lt;/td&gt;&lt;td&gt;.38&lt;/td&gt;&lt;td&gt;.36&lt;/td&gt;&lt;td&gt;.37&lt;/td&gt;&lt;td&gt;.38&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Moderate (16)&lt;/td&gt;&lt;td&gt;.50&lt;/td&gt;&lt;td&gt;.54&lt;/td&gt;&lt;td&gt;.51&lt;/td&gt;&lt;td&gt;.51&lt;/td&gt;&lt;td&gt;.53&lt;/td&gt;&lt;td&gt;.51&lt;/td&gt;&lt;td&gt;.48&lt;/td&gt;&lt;td&gt;.57&lt;/td&gt;&lt;td&gt;.55&lt;/td&gt;&lt;td&gt;.53&lt;/td&gt;&lt;td&gt;.50&lt;/td&gt;&lt;td&gt;.51&lt;/td&gt;&lt;td&gt;.50&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Difficult (7)&lt;/td&gt;&lt;td&gt;.29&lt;/td&gt;&lt;td&gt;.21&lt;/td&gt;&lt;td&gt;.25&lt;/td&gt;&lt;td&gt;.27&lt;/td&gt;&lt;td&gt;.24&lt;/td&gt;&lt;td&gt;.27&lt;/td&gt;&lt;td&gt;.30&lt;/td&gt;&lt;td&gt;.20&lt;/td&gt;&lt;td&gt;.23&lt;/td&gt;&lt;td&gt;.25&lt;/td&gt;&lt;td&gt;.37&lt;/td&gt;&lt;td&gt;.32&lt;/td&gt;&lt;td&gt;.31&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center"&gt;Panel (d) &lt;italic&gt;S&amp;#8208;L&amp;#8208;S&lt;/italic&gt; (6&amp;#8208;20&amp;#8208;10)&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Easy (24)&lt;/td&gt;&lt;td&gt;.40&lt;/td&gt;&lt;td&gt;.37&lt;/td&gt;&lt;td&gt;.38&lt;/td&gt;&lt;td&gt;.39&lt;/td&gt;&lt;td&gt;.41&lt;/td&gt;&lt;td&gt;.41&lt;/td&gt;&lt;td&gt;.42&lt;/td&gt;&lt;td&gt;.31&lt;/td&gt;&lt;td&gt;.35&lt;/td&gt;&lt;td&gt;.36&lt;/td&gt;&lt;td&gt;.30&lt;/td&gt;&lt;td&gt;.33&lt;/td&gt;&lt;td&gt;.36&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Moderate 
(30)&lt;/td&gt;&lt;td&gt;.50&lt;/td&gt;&lt;td&gt;.59&lt;/td&gt;&lt;td&gt;.55&lt;/td&gt;&lt;td&gt;.54&lt;/td&gt;&lt;td&gt;.54&lt;/td&gt;&lt;td&gt;.51&lt;/td&gt;&lt;td&gt;.49&lt;/td&gt;&lt;td&gt;.59&lt;/td&gt;&lt;td&gt;.56&lt;/td&gt;&lt;td&gt;.53&lt;/td&gt;&lt;td&gt;.57&lt;/td&gt;&lt;td&gt;.54&lt;/td&gt;&lt;td&gt;.53&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Difficult (16)&lt;/td&gt;&lt;td&gt;.34&lt;/td&gt;&lt;td&gt;.21&lt;/td&gt;&lt;td&gt;.27&lt;/td&gt;&lt;td&gt;.29&lt;/td&gt;&lt;td&gt;.25&lt;/td&gt;&lt;td&gt;.30&lt;/td&gt;&lt;td&gt;.33&lt;/td&gt;&lt;td&gt;.30&lt;/td&gt;&lt;td&gt;.32&lt;/td&gt;&lt;td&gt;.33&lt;/td&gt;&lt;td&gt;.37&lt;/td&gt;&lt;td&gt;.35&lt;/td&gt;&lt;td&gt;.35&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt; </ephtml> </p> <p><emph>Notes</emph>. Values in cells are average exposure proportions. b level: Easy = items with <emph>b</emph> values less than –1; Moderate = items with <emph>b</emph> values between –1 and +1; Difficult = items with <emph>b</emph> values greater than +1. Numbers in parentheses are the numbers of items at each b level over which the averages were taken. Prob refers to the probability of routing correctly to the next testlet: 1 means that the next testlet is always correctly selected, while probabilities of .8 and .7 represent conditions in which suboptimal routing is modeled for 20% and 30% of examinees, respectively. 
Each panel represents a different design used in the study: <emph>EQ</emph> ‐ Design 1 (equal) had 12 items in each of the three testlets; <emph>S‐L</emph> ‐ Design 2 (short‐to‐long) had six items in the Core testlet, followed by 10 and 20 items in subsequent testlets; <emph>L‐S</emph> ‐ Design 3 (long‐to‐short) had 20 items in the Core testlet, followed by 10 and 6 items in subsequent testlets; <emph>S‐L‐S</emph> ‐ Design 4 (short‐long‐short) had six items in the Core testlet, followed by 20 and 10 items in subsequent testlets.</p> <p>We note that Table  does not include items at the Core stage (e.g., the first 12 items that all simulees were administered), because their exposure rates were 1.00. Further, in order to gain insight into the exposures for various types of items (e.g., easier or more difficult according to their <emph>b</emph> value), we averaged across items with similar difficulty levels (presented in Table  as b level). Below, we highlight several interesting findings, first by looking across the different designs and then within each routing method. With respect to the different study designs, it was noted, not surprisingly, that when the design was balanced (<emph>EQ</emph> design) and the next testlet was selected at random, items of moderate difficulty were more exposed (.50) than either easier or more difficult items (.36 and .27 exposure rates, respectively). Similarly, across the remaining four routing methods, items in the moderate range of difficulties were more exposed than items on either extreme (ranging from .49 to .61). However, an effect of suboptimal routing was observed such that as suboptimal probabilities increased (going from .00 to .20 to .30 for probabilities of 1.0, .80, and .70, respectively), item exposure rates for easy/difficult items also tended to increase. 
This was the case for all conditions except for the IRT information method for difficult items, in which the exposure rates decreased, contrary to what we would have expected.</p> <p>Under the <emph>S‐L</emph> design, findings for the random selection routing method were similar to those in the <emph>EQ</emph> design. That is, the moderate items were more exposed, at a proportion of .50, than either easy or difficult items (.31 and .26, respectively). Additionally, as the probability of optimal routing decreased (from 1 to .7), exposure rates for moderate items decreased across the other four routing methods as well. Moreover, the NC last testlet score method yielded the highest exposure rates for easy items under the <emph>S‐L</emph> design, ranging from .30 to .34. The highest item exposure rates for difficult items were observed for the IRT information method, ranging from .26 to .28, while the lowest item exposure rates for difficult items were observed for the NC cumulative score method, ranging from .12 to .21.</p> <p>While patterns similar to those observed in the <emph>EQ</emph> and <emph>S‐L</emph> designs were found in the <emph>L‐S</emph> design, on average, easy and difficult items were exposed at higher rates than in the previous two designs. The highest exposure rates for easy items were found under NC last testlet score, ranging from .39 to .41, and the highest exposure rates for difficult items were found under the IRT information method, ranging from .31 to .37. 
Under the <emph>S‐L‐S</emph> design, exposure rates for easy or difficult items were all above .30 across all routing methods, with the exception of the NC cumulative score method and the NC last testlet score method at a probability of 1.</p> <p>Examining exposure rates within a particular routing method, we observed that when random selection was utilized, the <emph>S‐L</emph> design yielded the lowest exposure rates for easy/difficult items at .31 and .26, respectively, while the <emph>S‐L‐S</emph> design yielded the highest exposure rates, in the range of .34 to .40. Comparing the classical approaches, NC cumulative score and NC last testlet score, the latter yielded equally high or higher exposure rates for easy and difficult items, and lower exposure rates for moderate items, regardless of the design. The most extreme differences between the two classical approaches were found in exposure rates under the <emph>S‐L</emph> design for moderate items. In particular, under a routing probability of 1, the exposure rate for the NC cumulative score method was .66 (which was also the highest exposure rate reported for any item type across any method and design) compared to .56 for its NC last testlet score counterpart.</p> <p>Examining the performance of the IRT‐based routing methods, EAP ability and information, we noted a pattern. For difficult items, items were exposed more frequently on average when IRT information, rather than the EAP ability estimate, was used as the routing method. However, the opposite was true for the easy items, which were generally exposed at equal or higher rates using EAP ability estimates compared to IRT information. 
The exposure rates for easy items were identical for EAP ability and IRT information across all three routing probabilities under the <emph>S‐L</emph> design (.22, .25, and .28) and the <emph>L‐S</emph> design (.36, .37, and .38).</p> <p>A brief remark regarding the exposure rates concerns the idea of suboptimal routing. As noted above, the probability of routing had a major impact on item exposures for all items. This was not surprising, as the study design was meant to affect item exposure such that some people were purposefully misrouted with the goal of evening out exposure rates. However, different routing methods were affected in different ways. For moderate items, in all but one case, an increase in misrouting (i.e., the probability of selecting the optimal testlet decreasing from 1 to .70) yielded lower exposure rates for items with difficulties ranging from –1 to +1. The one exception was noted under IRT information in the <emph>L‐S</emph> design, where the exposure rates for moderate items remained at roughly .50 (specifically, exposure rates at routing probabilities of 1.00, .80, and .70 were .50, .51, and .50, respectively). An irregular pattern of decreased exposure rates at routing probabilities less than 1 was also noted across all designs (<emph>EQ</emph>, <emph>S‐L</emph>, <emph>L‐S</emph>, and <emph>S‐L‐S</emph>) for difficult items under IRT information, but it was not noted for easy items, where exposure rates either remained the same or increased across all methods and designs.</p> <p>Recovery of item difficulty (Figure 4) and item discrimination (Figure 5) parameters varied depending on the type of items. 
Namely, in general, items that were recovered with the highest level of precision (i.e., smallest biases/RMSEs, highest correlations) were those in the middle of the distribution, that is, items considered moderate in difficulty, regardless of the routing probability levels or designs (with a couple of exceptions in <emph>L‐S</emph> designs). Moreover, as Figure 4 suggests, bias for easy items ranged from –.01 to .06, with recovery being influenced by the design choice (note the irregular scatter of the markers within any one routing method, in particular for the NC routing methods). Similar, and perhaps slightly more pronounced, reverse patterns were found for difficult items, where, unlike for the easier items, bias tended to be near zero or negative.</p> <p>Graph: Item difficulty parameter recovery rates reported as bias, RMSE, and correlations for easy, moderate, and difficult items.</p> <p>Graph: image_n/jedm12206-fig-0004.png</p> <p>Graph: Item discrimination parameter recovery rates reported as bias, RMSE, and correlations for easy, moderate, and difficult items.</p> <p>Graph: image_n/jedm12206-fig-0005.png</p> <p>RMSE values for item difficulty recovery provide further insight into the irregularity of the difficulty recovery across the designs and studied conditions. Namely, in some instances, the <emph>S‐L‐S</emph> design yielded lower RMSEs (IRT‐based routing methods and NC last testlet for easy items), while in others it yielded one of the highest RMSE values (difficult items under NC cumulative testlet score with a routing probability of 1). The inconsistency of the methods across the four designs was most clearly noted when the RMSE values were used as the outcome for evaluating the recovery of item difficulty parameters under the studied conditions. It was thus not surprising that correlations of nearly 1.0 were found for moderate items, while lower correlations were found for easy and difficult items. 
However, we note that almost all correlations were .90 or higher (with one exception for the <emph>L‐S</emph> design under the random method at a routing probability of 1 for difficult items, where the correlation dipped just below .90).</p> <p>The recovery of item discriminations varied across the routing methods for all three types of items. While the random routing method yielded, on average, the least amount of bias across the four studied designs (in particular for easy items), discriminations of moderate items were recovered most similarly across the four studied designs, and IRT‐based routing methods yielded on average the lowest bias.</p> <p>It was also noted that item discriminations were recovered to a varying degree depending on the routing probabilities and routing methods, with no one routing method outperforming the others and no one design yielding the lowest bias across different item types. For example, the <emph>EQ</emph> design was affected more (i.e., largest differences in bias) by routing probabilities under the NC methods for easy items (yielding as large or larger bias than other designs). However, for difficult items, the results were the opposite: under the <emph>EQ</emph> design, routing probabilities away from 1.0 yielded less bias for those same routing methods. Similarly, for difficult items and the two NC routing methods, the <emph>L‐S</emph> design produced the highest bias for routing probabilities of .8 and .7, but yielded the lowest bias for routing probabilities of 1.0 for the easy items.</p> <p>RMSEs (as shown in the second row of Figure 5) further supported the conclusion that across the different item types, no one routing method consistently outperformed the others, nor did one design yield the lowest RMSEs across the studied conditions. It was noted that the moderate items' discriminations were most successfully recovered under all five routing methods. 
Differences in the ordering of the designs (i.e., which design yielded the lowest RMSEs) were more pronounced for the easy and difficult items, in particular when routing probabilities were set at 1.0. Additionally, for difficult items and routing probabilities of .8 or .7, item discriminations were recovered very similarly among the four designs for all five routing methods. This was less obvious for the recovery of the easy items' discriminations, for which RMSEs tended to be slightly higher than in the other item groups.</p> <p>Correlations were generally high, with most of them above .90 across the easy/moderate/difficult items and only a few instances where the correlations between the generated and estimated item discriminations were below .90 (last row of Figure 5). Most notably, the lowest correlations were obtained in conditions under a routing probability of 1.0 for the <emph>S‐L‐S</emph>, <emph>S‐L</emph>, and <emph>EQ</emph> designs for easy and difficult items. Out of the five routing methods, random generally yielded as high or higher correlations across all designs.</p> <hd id="AN0135294541-13">Discussion</hd> <p>ILSAs' move from paper‐and‐pencil administration to a computerized platform affords testing organizations a number of advantages, including the possibility of implementing an adaptive test. This move is particularly important in low‐performing countries where average student proficiency is well below the international average. Without some type of adaptive testing, the vast majority of students in these systems receive questions that are too difficult, offering little opportunity to engage with items and potentially resulting in biased achievement estimates for the lowest performers (Rutkowski, Rutkowski, &amp; Liaw, [<reflink idref="bib24" id="ref48">24</reflink>]). In response, PISA is implementing a multistage test design for the 2018 cycle. 
With the promise of MST, a number of open questions remain, particularly in the unique context of ILSA.</p> <p>To begin answering some of these research questions, we used a simulation study to examine several design possibilities, including testlet length and routing procedures. We considered these design features in terms of item and person parameter recovery as well as item exposure rates. Our findings suggest that no single approach is best for all purposes. We summarize this point subsequently. In terms of mean person parameter recovery, IRT scoring (via either Fisher information or preliminary EAP estimates) outperformed classical NC methods. This was true in terms of bias, RMSE, and correlations between generating and estimated proficiency. In spite of this well‐known performance advantage, we note that from an operational perspective, NC methods offer a much simpler algorithm that is easier to implement. To that end, NC methods exhibited only slightly worse bias, especially if suboptimal probability routing is used with cumulative scoring for routing to the third stage. Alternatively, if only the last testlet is considered for routing, a design with fewer questions in the first (routing) stage is better at recovering mean proficiency, with lower RMSE values and higher generating‐estimated proficiency correlations, compared to all other testlet length designs.</p> <p>We also considered the way in which design and scoring choices impacted item exposure rates and recovery. With respect to exposure rates, as expected, suboptimal routing produced the most even item exposure across scoring methods and testlet lengths. Further, in nearly every condition with a routing probability less than 1, moderately difficult items were exposed to about 50% to 60% of examinees, while easy and difficult items were exposed to about 30% to 40% of examinees. We found, however, that IRT routing and a short routing test produced lower exposure rates, especially for difficult items. 
Our primary interest in exposure rates was motivated by the goal of producing stable item parameters, which we turn to next.</p> <p>Item difficulties were generally recovered well, with bias ranging from –.04 to .06 across the varying routing methods and designs. Items considered moderate in difficulty were recovered best, and most consistently, across the four designs. Easy and difficult items tended to be biased in terms of recovery, and on most occasions the NC‐based routing methods tended to be less precise in difficulty parameter recovery. Similarly, items' discriminations were generally recovered well (with a magnitude of bias of .04 or lower for most conditions), especially for items considered of moderate difficulty. In general, random and IRT‐based routing methods tended to outperform the NC methods, although this pattern did not hold across the four studied designs. In other words, the length of the testlets yielded varying degrees of precision in item discrimination recovery within a routing method.</p> <p>In summary, this article provides some practical guidance for testing organizations as they begin and/or continue to implement adaptive designs. As a simulation study, this project has limitations. First, although we considered several design and administration conditions and modeled our simulation after empirically observed conditions, other options exist. For example, we considered a single population with no model misspecification. Further research with many groups and, especially, violations of the measurement invariance assumption is needed for a fuller picture. Similarly, the issue of item drift should be considered, with a focus on establishing ways for testing programs to prevent item parameters from drifting. Nevertheless, our findings point to the fact that no single design will meet all goals. 
Rather, a careful evaluation of trade‐offs should be made for any ILSA design.</p> <hd id="AN0135294541-14">Acknowledgments</hd> <p>This research was supported in part by a grant from the Norwegian Research Council under the FINNUT program (Grant Number 255246).</p> <ref id="AN0135294541-15"> <title> Footnotes </title> <blist> <bibl id="bib1" idref="ref9" type="bt">1</bibl> <bibtext> In the MST literature, a panel represents a test form. However, multiple different forms could exist within the same panel structure. In the current study, we utilize a single panel as represented in Figure 2, but we manipulate different test lengths, thus creating multiple forms.</bibtext> </blist> <blist> <bibl id="bib2" idref="ref10" type="bt">2</bibl> <bibtext> In their edited volume, Yan, von Davier, and Lewis ([28]) provide in‐depth research on important issues related to computerized multistage testing, in which the authors and co‐contributors discuss an array of MST designs and applications.</bibtext> </blist> <blist> <bibl id="bib3" idref="ref8" type="bt">3</bibl> <bibtext> Weissman ([26]) discusses in depth two types of routing rules that these two broad approaches encompass: (a) static routing rules such as NC, and (b) dynamic routing rules. Within each type of routing rule, several decisions ought to be made with respect to administration. For example, when using a static routing rule, one ought to determine a threshold score that would apply to a group of examinees, whereas under a dynamic routing rule different algorithms can be used in real time to make routing decisions (i.e., the focus is on an individual test taker).</bibtext> </blist> <blist> <bibl id="bib4" idref="ref32" type="bt">4</bibl> <bibtext> The <emph>mst</emph> function was custom modified by Magis et al. 
([15]) to allow for a probabilistic routing element.</bibtext> </blist> <blist> <bibl id="bib5" idref="ref6" type="bt">5</bibl> <bibtext> These choices of fixed factors are not unusual, although sample sizes vary widely across studies. For example, Hambleton and Xing ([6]) sampled 5,000 scores from a standard normal distribution for their 1‐3‐3 MST design, with 20 items per testlet. Chuah, Drasgow, and Luecht ([4]) suggested that as few as 300 examinees per item might be sufficient in an MST design for accurate item parameter estimation, whereas S. Kim et al. ([10]) simulated 2,000 simulees at each of the 41 quadrature points across the continuum for a total sample size of 82,000.</bibtext> </blist> <blist> <bibl id="bib6" idref="ref47" type="bt">6</bibl> <bibtext> The star represents the <emph>S‐L</emph> design, which had six items in the Core testlet, followed by 10 and 20 items in subsequent testlets; the <emph>L‐S</emph> design (diamond) had 20 items in the Core testlet, followed by 10 and 6 items in subsequent testlets; and the <emph>S‐L‐S</emph> design (circle) had six items in the Core testlet, followed by 20 and 10 items in subsequent testlets.</bibtext> </blist> </ref> <ref id="AN0135294541-16"> <title> References </title> <blist> <bibtext> Armstrong, A. (2002). Routing rules for multiple‐form structures (LSAC Research Report Series No. 02–08). Newtown, PA : Law School Admission Council. Retrieved from <ulink href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.310.3575&amp;rep=rep1&amp;type=pdf">http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.310.3575&amp;rep=rep1&amp;type=pdf</ulink></bibtext> </blist> <blist> <bibtext> Chalmers, R. P. (2012). mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48 (6), 1 – 29. https://doi.org/10.18637/jss.v048.i06</bibtext> </blist> <blist> <bibtext> Chen, H., Yamamoto, K., &amp; von Davier, M. (2014). 
Controlling multistage testing exposure rates in international large‐scale assessments. In D. Yan, A. A. von Davier, &amp; C. Lewis (Eds.), Computerized multistage testing: Theory and applications (pp. 391 – 409). Boca Raton, FL : CRC Press.</bibtext> </blist> <blist> <bibtext> Chuah, S. C., Drasgow, F., &amp; Luecht, R. M. (2006). How big is big enough? Sample size requirements for CAST item parameter estimation. Applied Measurement in Education, 19, 241 – 255.</bibtext> </blist> <blist> <bibtext> Educational Testing Service. (2016). PISA 2018 integrated design. Princeton, NJ : Author. Retrieved from https://<ulink href="http://www.oecd.org/pisa/pisaproducts/PISA-2018-INTEGRATED-DESIGN.pdf">www.oecd.org/pisa/pisaproducts/PISA-2018-INTEGRATED-DESIGN.pdf</ulink></bibtext> </blist> <blist> <bibtext> Hambleton, R. K., &amp; Xing, D. (2006). Optimal and nonoptimal computer based test designs for making pass‐fail decisions. Applied Measurement in Education, 19, 221 – 239.</bibtext> </blist> <blist> <bibl id="bib7" idref="ref20" type="bt">7</bibl> <bibtext> Hendrickson, A. (2007). An NCME instructional module on multistage testing. Educational Measurement: Issues and Practice, 26 (2), 44 – 52.</bibtext> </blist> <blist> <bibl id="bib8" idref="ref1" type="bt">8</bibl> <bibtext> Jodoin, M. G., Zenisky, A., &amp; Hambleton, R. K. (2006). Comparison of the psychometric properties of several computer‐based test designs for credentialing exams with multiple purposes. Applied Measurement in Education, 19, 203 – 220.</bibtext> </blist> <blist> <bibl id="bib9" idref="ref12" type="bt">9</bibl> <bibtext> Kim, H., &amp; Plake, B. S. (1993, April). Monte Carlo simulation comparison of two‐stage testing and computerized adaptive testing. Paper presented at the meeting of the National Council on Measurement in Education, Atlanta, GA.</bibtext> </blist> <blist> <bibtext> Kim, S., Moses, T., &amp; Yoo, H. H. (2015). 
A comparison of IRT proficiency estimation methods under adaptive multistage testing. Journal of Educational Measurement, 52, 70 – 79.</bibtext> </blist> <blist> <bibtext> Kirsch, I., &amp; Lennon, M. L. (2017). PIAAC: A new design for a new era. Large‐Scale Assessments in Education, 5 (1). https://doi.org/10.1186/s40536-017-0046-6</bibtext> </blist> <blist> <bibtext> Kolen, M. J., &amp; Tong, Y. (2010). Psychometric properties of IRT proficiency estimates. Educational Measurement: Issues and Practice, 29 (3), 8 – 14.</bibtext> </blist> <blist> <bibtext> Luecht, R. M., &amp; Nungester, R. J. (1998). Some practical examples of computer‐adaptive sequential testing. Journal of Educational Measurement, 35, 229 – 249.</bibtext> </blist> <blist> <bibtext> Luo, X., &amp; Kim, D. (2018). A top‐down approach to designing the computerized adaptive multistage test. Journal of Educational Measurement, 55, 243 – 263.</bibtext> </blist> <blist> <bibtext> Magis, D., Yan, D., &amp; von Davier, A. (2018). mstR: Procedures to generate patterns under multistage testing (R package version 1.2). Retrieved from https://CRAN.R-project.org/package=mstR</bibtext> </blist> <blist> <bibtext> Melican, G. J., Breithaupt, K., &amp; Zhang, Y. (2009). Designing and implementing a multistage adaptive test: The uniform CPA exam. In W. J. van der Linden &amp; C. A. W. Glas (Eds.), Elements of adaptive testing (pp. 167 – 189). New York, NY : Springer.</bibtext> </blist> <blist> <bibtext> Mislevy, R. J., Johnson, E. G., &amp; Muraki, E. (1992). Scaling procedures in NAEP. Journal of Educational and Behavioral Statistics, 17 (2), 131 – 154.</bibtext> </blist> <blist> <bibtext> Mullis, I. V. S., Martin, M. O., Foy, P., &amp; Hooper, M. (2016). TIMSS 2015 international results in mathematics. Boston, MA : TIMSS &amp; PIRLS International Study Center, Lynch School of Education, Boston College. 
Retrieved from <ulink href="http://timssandpirls.bc.edu/timss2015/international-results/timss-2015/mathematics/student-achievement/">http://timssandpirls.bc.edu/timss2015/international-results/timss-2015/mathematics/student-achievement/</ulink></bibtext> </blist> <blist> <bibtext> OECD. (2010). PISA Computer‐based assessment of student skills in science. Paris, France : OECD Publishing. Retrieved from <ulink href="http://www.oecd.org/education/school/programmeforinternationalstudentassessmentpisa/pisacomputer-basedassessmentofstudentskillsinscience.htm">http://www.oecd.org/education/school/programmeforinternationalstudentassessmentpisa/pisacomputer-basedassessmentofstudentskillsinscience.htm</ulink></bibtext> </blist> <blist> <bibtext> OECD. (2014). PISA 2012 technical report. Paris, France : OECD Publishing.</bibtext> </blist> <blist> <bibtext> OECD. (2017). PISA 2015 technical report. Paris, France : OECD Publishing. Retrieved from <ulink href="http://www.oecd.org/pisa/data/2015-technical-report/">http://www.oecd.org/pisa/data/2015-technical-report/</ulink></bibtext> </blist> <blist> <bibtext> R Development Core Team. (2012). R: A language and environment for statistical computing. Vienna, Austria : R Foundation for Statistical Computing. <ulink href="http://www.R-project.org/">http://www.R-project.org/</ulink></bibtext> </blist> <blist> <bibtext> Robin, F., Manfred, S., &amp; Liang, L. (2014). The multistage test implementation of the GRE revised general test. In D. Yan, A. A. von Davier, &amp; C. Lewis (Eds.), Computerized multistage testing: Theory and applications (pp. 325 – 341). Boca Raton, FL : CRC Press.</bibtext> </blist> <blist> <bibtext> Rutkowski, D., Rutkowski, L., &amp; Liaw, Y.‐L. (2018). Measuring widening proficiency differences in international assessments: Are current approaches enough? Educational Measurement: Issues and Practice, 37 (4), 40 – 48.</bibtext> </blist> <blist> <bibtext> Stocking, M. L., &amp; Lewis, C. (1998). 
Controlling item exposure conditional on ability in computerized adaptive testing. Journal of Educational and Behavioral Statistics, 23 (1), 57 – 75.</bibtext> </blist> <blist> <bibtext> Weissman, A. (2014). IRT‐based multistage testing. In D. Yan, A. A. von Davier, &amp; C. Lewis (Eds.), Computerized multistage testing: Theory and applications (pp. 153 – 168). Boca Raton, FL : CRC Press.</bibtext> </blist> <blist> <bibtext> Weissman, A., Belov, D., &amp; Armstrong, A. (2007). Information‐based versus number‐correct routing in multistage classification tests (LSAC Research Report Series No. 07–05). Newtown, PA : Law School Admission Council. Retrieved from https://<ulink href="http://www.lsac.org/docs/default-source/research-(lsac-resources)/rr-07-05.pdf">www.lsac.org/docs/default-source/research-(lsac-resources)/rr-07-05.pdf</ulink></bibtext> </blist> <blist> <bibtext> Yan, D., von Davier, A. A., &amp; Lewis, C. (Eds.), (2014). Computerized multistage testing: Theory and applications. Boca Raton, FL : CRC Press.</bibtext> </blist> <blist> <bibtext> Yan, D., Lewis, C., &amp; von Davier, A. A. (2014a). Multistage test design and scoring with small samples. In D. Yan, A. A. von Davier, &amp; C. Lewis (Eds.), Computerized multistage testing: Theory and applications (pp. 303 – 324). Boca Raton, FL : CRC Press.</bibtext> </blist> <blist> <bibtext> Yan, D., Lewis, C., &amp; von Davier, A. A. (2014b). Overview of computerized multistage tests. In D. Yan, A. A. von Davier, &amp; C. Lewis (Eds.), Computerized multistage testing: Theory and applications (pp. 3 – 20). Boca Raton, FL : CRC Press.</bibtext> </blist> <blist> <bibtext> Zheng, Y., Wang, C., Culbertson, M. J., &amp; Chang, H.‐H. (2014). Overview of test assembly methods in multistage testing. In D. Yan, A. A. von Davier, &amp; C. Lewis (Eds.), Computerized multistage testing: Theory and applications (pp. 87 – 99). 
Boca Raton, FL : CRC Press.</bibtext> </blist> </ref> <aug> <p>By Dubravka Svetina; Yuan‐Ling Liaw; Leslie Rutkowski and David Rutkowski</p> <p>DUBRAVKA SVETINA is Associate Professor of Inquiry Methodology, School of Education, Indiana University, 201 N. Rose Avenue, Bloomington, IN 47405. Her primary research interests include educational and psychological measurement, (international) large‐scale assessment, item response theory, measurement invariance, and psychometric modeling (e.g., Bayesian and cognitive diagnostic models).</p> <p>YUAN‐LING LIAW is Postdoctoral Researcher at the Centre for Educational Measurement (CEMO), Faculty of Educational Sciences, University of Oslo, P.O. Box 1161, Blindern, 0318 Oslo, Norway. Yuan‐Ling's primary research focuses on practical applications of item response theory, with greatest emphasis on international large‐scale assessment. These include test fairness, differential item functioning, and ability estimation.</p> <p>LESLIE RUTKOWSKI is Associate Professor of Inquiry Methodology, School of Education, Indiana University, 201 N. Rose Ave, Bloomington, IN 47405. Her primary research interests are in latent variable modeling, especially models that pertain to cross‐cultural measurement and international comparisons among heterogeneous populations.</p> <p>DAVID RUTKOWSKI is Associate Professor of Educational Policy, School of Education, Indiana University, 201 N. Rose Ave, Bloomington, IN 47405. 
His primary research interests are in educational policy and large‐scale assessment, focusing on cross‐cultural measurement and international comparisons among heterogeneous populations.</p> </aug> |
|---|---|
| PLink | https://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=eric&AN=EJ1208659 |