Impact of Design Effects in Large-Scale District and State Assessments
| Title: | Impact of Design Effects in Large-Scale District and State Assessments |
|---|---|
| Language: | English |
| Authors: | Phillips, Gary W. |
| Source: | Applied Measurement in Education. 2015 28(1):33-47. |
| Availability: | Routledge. Available from: Taylor & Francis, Ltd. 325 Chestnut Street Suite 800, Philadelphia, PA 19106. Tel: 800-354-1420; Fax: 215-625-2940; Web site: http://www.tandf.co.uk/journals |
| Peer Reviewed: | Y |
| Page Count: | 15 |
| Publication Date: | 2015 |
| Document Type: | Journal Articles Reports - Research |
| Descriptors: | State Programs, Sampling, Research Design, Error of Measurement, Testing Programs, Statistical Significance, School Districts, Probability, Sample Size, Response Rates (Questionnaires), Weighted Scores, Item Analysis, Equated Scores, Item Response Theory, Testing Problems, Evaluation Methods, Evaluation Problems, Experimenter Characteristics, Test Reliability, Test Validity, Group Testing |
| DOI: | 10.1080/08957347.2014.973561 |
| ISSN: | 0895-7347 |
| Abstract: | This article proposes that sampling design effects have potentially huge unrecognized impacts on the results reported by large-scale district and state assessments in the United States. When design effects are unrecognized and unaccounted for they lead to underestimating the sampling error in item and test statistics. Underestimating the sampling errors, in turn, results in unanticipated instability in the testing program and an increase in Type I errors in significance tests. This is especially true when the standard error of equating is underestimated. The problem is caused by the typical district and state practice of using nonprobability cluster-sampling procedures, such as convenience, purposeful, and quota sampling, then calculating statistics and standard errors as if the samples were simple random samples. |
| Abstractor: | As Provided |
| Number of References: | 21 |
| Entry Date: | 2015 |
| Accession Number: | EJ1048933 |
| Database: | ERIC |
| FullText | <title id="AN0100241096-1">Impact of Design Effects in Large-Scale District and State Assessments. </title> <p>The last decade in the United States has seen an increase in the number of district and state large-scale testing programs. This was brought about by the passage of the No Child Left Behind Act of 2001 (PL 107-110). The law required all schools receiving federal funds to administer a standardized statewide test in reading and mathematics to all students in grades 3 through 8 and high school. 
The goal of the legislation was to improve student achievement by setting high standards for academic achievement. The increase in the number of testing programs has been accompanied by an increase in the stakes of testing. Schools receiving Title I funding through the Elementary and Secondary Education Act are held accountable and must make adequate yearly progress (AYP).</p> <p>As the stakes have increased, so have the expectations that tests provide accurate estimates of student progress. When the state superintendent reports progress, he or she must be confident that students have in fact made progress. Education policy makers assume that the state testing program is using the most accurate sampling and analysis procedures possible. Unfortunately, this is not always the case. One indication of the lack of accuracy occurs when state results bounce around from year to year. The average state results may have gone up in a prior year, which was a cause for celebration, but gone down in the current year, which cannot adequately be explained. Behind the scenes, the state testing staff may see considerable year-to-year instability in their item statistics and equating results, which they cannot explain. For example, the difficulty of items may vary considerably across forms for no apparent reason and estimates of equating parameters may bounce around from year to year. This article proposes that a significant portion of the instability in large-scale testing programs may be due to the lack of proper sampling techniques in district and state testing programs, and to the unrecognized influence of design effects.</p> <p>A design effect in a large-scale assessment is the inflation in the error variance of item and test statistics caused by the design of the sample. The major contributor to the design effect is the intraclass correlation of students within the sampled clusters. 
Design effects are well understood and managed in large-scale national assessments such as the National Assessment of Educational Progress (NAEP) and large-scale international assessments such as the Trends in International Mathematics and Science Study (TIMSS), the Progress in International Reading Literacy Study (PIRLS), and the Programme for International Student Assessment (PISA). In fact, the three international assessments listed above employ an international sampling referee who provides sampling plans that minimize design effects and guarantees that these effects are accurately reported in technical documentation. However, design effects are not well understood in large-scale district and state assessment programs. This is probably because of the widespread practice of using nonprobability sampling methods such as convenience sampling, purposeful sampling, and quota sampling. In <emph>convenience sampling,</emph> schools are selected because they are known to be cooperative. In <emph>purposeful sampling,</emph> schools are handpicked to cover a range of demographics and types of schools. In <emph>quota sampling,</emph> types of schools are selected in proportion to their known incidence in the population. Unfortunately, nonprobability sampling does not guarantee a representative sample, so such samples are likely to yield biased estimates of population parameters. More relevant to this article is the fact that the nonprobability sampling methods listed above produce cluster samples (CS), but these samples are often analyzed as if they were simple random samples (SRSs). They are cluster samples because the sample is drawn in several stages. For example, a sample of schools might be selected first; students clustered within those schools are then selected as the second stage of sampling. Treating them as SRSs underestimates standard errors and increases Type I errors. 
Furthermore, the practice of analyzing cluster samples as if they were SRSs will lead the district and state testing programs to overstate the sample size in their significance testing and overstate the power of statistical tests.</p> <p>The concept of the design effect originated with Cornfield ([<reflink idref="bib8" id="ref1">8</reflink>]) as a way to characterize the efficiency of a sample design. His idea was to compare the error variance of an SRS with the error variance of a given sample design of the same sample size. Kish ([<reflink idref="bib10" id="ref2">10</reflink>]) used the inverse of Cornfield's ratio and named it the design effect (<emph>Deff</emph>). Design effects are described in more detail by Cochran ([<reflink idref="bib4" id="ref3">4</reflink>]), Levy ([<reflink idref="bib13" id="ref4">13</reflink>]), and Lohr ([<reflink idref="bib15" id="ref5">15</reflink>]). There is also recent literature related to design effects in education group randomized trials by Bloom, Bos, and Lee ([<reflink idref="bib3" id="ref6">3</reflink>]), Hedges and Hedberg ([<reflink idref="bib9" id="ref7">9</reflink>]), and Raudenbush ([<reflink idref="bib20" id="ref8">20</reflink>]). However, there is virtually no research literature on design effects in district and state large-scale assessments. This might be because there is no reference to the role that design effects play in the standard error calculations of item and test statistics in popular testing and measurement books. For example, even though design effects are often a major contributor to the standard error of equating in large-scale assessments, recent books on equating and linking do not even mention the concept of design effects (Kolen &amp; Brennan, [<reflink idref="bib11" id="ref9">11</reflink>], [<reflink idref="bib12" id="ref10">12</reflink>]). 
Similarly, the most recent joint technical standards by the American Educational Research Association (AERA), American Psychological Association (APA), and National Council on Measurement in Education (NCME), the <emph>Standards for Educational and Psychological Testing</emph> (1999), make no reference to sample design effects. There is also no reference to design effects in the upcoming revised joint technical standards scheduled to be released in 2014.</p> <p>To illustrate how design effects can have a major unrecognized impact on test statistics and test results in large-scale assessments, this article provides a set of data obtained through probability sampling, then compares the standard errors of calibration and equating, assuming SRS versus complex sampling. The article provides some typical steps associated with probability sampling such as sample size selection through power analysis, stratification, weighting, and adjustments for non-response to give the reader an understanding of some of the details of complex sampling.</p> <hd id="AN0100241096-2">METHOD</hd> <p></p> <hd id="AN0100241096-3">Sample Design</hd> <p>The study used a school district with <emph>M</emph> = 21 schools as the sampling frame from which a sample of <emph>m</emph> = 13 schools was randomly selected with probability proportional to size (PPS).[<reflink idref="bib1" id="ref11">1</reflink>] The sample of schools was also stratified by type of school: urban, suburban, or rural. Within each selected school, an SRS of students was selected. The sampling was carried out using the complex sampling add-on module to SPSS (<ulink href="http://www.spss.com/">http://www.spss.com/</ulink>). Subsequently, each sampled student was administered an independent field-test form consisting of 40 grade 8 mathematics items with 35 common items (both multiple choice and constructed response items) and 5 items unique to each form (all multiple choice items). 
There were a total of 15 unique field-test items distributed over three forms with 5 unique field-test items per form. The three forms were randomly assigned to selected students within each school, which resulted in each form being administered to about one third of the students in each school. The goal of the study was to calibrate and equate all items to a common scale using the one-parameter logistic item response theory model, also called the Rasch model. The most common software used to calibrate items and equate test forms with the Rasch model is Winsteps® (Linacre, [<reflink idref="bib14" id="ref12">14</reflink>]). As this article will demonstrate, Winsteps assumes that the sample on which items are calibrated is an SRS. However, in almost every case in large-scale assessments, the calibration sample is actually a cluster sample, so Winsteps underestimates the standard error of item calibrations.</p> <hd id="AN0100241096-4">Sample Size</hd> <p>Two considerations formed the basis for determining the sample size: the anticipated response rate and power analysis. On the basis of previous experience with the school district, it was anticipated that all selected schools and about 90% of the students would agree to participate. Therefore, the overall response rate was assumed to be about 90%. The sample sizes used in the subsequent power analysis were based on the anticipated response rates.</p> <p>A power analysis was used to determine the number of schools and students to be included in the sample. The previous practice in the school system was to consider item parameter estimates as unstable if their Rasch item calibrations varied by more than 0.30 across forms (this is sometimes referred to as the .3 rule in state assessments). Therefore, the minimum detectable effect (MDE) was set at 0.30. The probability of detecting the MDE was set at , with the probability of a Type I error equal to . 
According to the calculations provided by Cohen ([<reflink idref="bib5" id="ref13">5</reflink>], [<reflink idref="bib6" id="ref14">6</reflink>]), to meet these a priori criteria, we would need an SRS of 175 students in the sample.[<reflink idref="bib2" id="ref15">2</reflink>] Since we did not have an SRS, Cohen's traditional sample-size calculation is incorrect. Instead, the sample-size calculation must factor in the design effect of our two-stage cluster sample.</p> <p>TABLE 1 Sample Design Table for Two-Stage Cluster Samples That Yield an Effective Sample Size Equal to 175</p> <p> <ephtml> &lt;table&gt;&lt;thead valign="bottom"&gt;&lt;tr&gt;&lt;td&gt;Number&lt;/td&gt;&lt;td&gt;Number&lt;/td&gt;&lt;td&gt;Cluster Size&lt;/td&gt;&lt;td&gt;Design Effect&lt;/td&gt;&lt;td&gt;Number&lt;/td&gt;&lt;td&gt;Number&lt;/td&gt;&lt;td&gt;Effective&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Students&lt;/td&gt;&lt;td&gt;Students&lt;/td&gt;&lt;td&gt;per Item for&lt;/td&gt;&lt;td&gt;per Item for&lt;/td&gt;&lt;td&gt;Responding&lt;/td&gt;&lt;td&gt;Responding&lt;/td&gt;&lt;td&gt;Sample Size&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Selected&lt;/td&gt;&lt;td&gt;Responding&lt;/td&gt;&lt;td&gt;Responding&lt;/td&gt;&lt;td&gt;Responding&lt;/td&gt;&lt;td&gt;Students Needed&lt;/td&gt;&lt;td&gt;Schools&lt;/td&gt;&lt;td&gt;Needed&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;per School&lt;/td&gt;&lt;td&gt;per School&lt;/td&gt;&lt;td&gt;Students&lt;/td&gt;&lt;td&gt;Students&lt;/td&gt;&lt;td&gt;per Item&lt;/td&gt;&lt;td&gt;Needed&lt;/td&gt;&lt;td&gt;per 
Item&lt;/td&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;22&lt;/td&gt;&lt;td&gt;20&lt;/td&gt;&lt;td&gt;7&lt;/td&gt;&lt;td&gt;1.34&lt;/td&gt;&lt;td&gt;235&lt;/td&gt;&lt;td&gt;35&lt;/td&gt;&lt;td&gt;175&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;44&lt;/td&gt;&lt;td&gt;40&lt;/td&gt;&lt;td&gt;13&lt;/td&gt;&lt;td&gt;1.74&lt;/td&gt;&lt;td&gt;305&lt;/td&gt;&lt;td&gt;23&lt;/td&gt;&lt;td&gt;175&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;67&lt;/td&gt;&lt;td&gt;60&lt;/td&gt;&lt;td&gt;20&lt;/td&gt;&lt;td&gt;2.14&lt;/td&gt;&lt;td&gt;375&lt;/td&gt;&lt;td&gt;19&lt;/td&gt;&lt;td&gt;175&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;89&lt;/td&gt;&lt;td&gt;80&lt;/td&gt;&lt;td&gt;27&lt;/td&gt;&lt;td&gt;2.54&lt;/td&gt;&lt;td&gt;445&lt;/td&gt;&lt;td&gt;17&lt;/td&gt;&lt;td&gt;175&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;111&lt;/td&gt;&lt;td&gt;100&lt;/td&gt;&lt;td&gt;33&lt;/td&gt;&lt;td&gt;2.94&lt;/td&gt;&lt;td&gt;515&lt;/td&gt;&lt;td&gt;15&lt;/td&gt;&lt;td&gt;175&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;133&lt;/td&gt;&lt;td&gt;120&lt;/td&gt;&lt;td&gt;40&lt;/td&gt;&lt;td&gt;3.34&lt;/td&gt;&lt;td&gt;585&lt;/td&gt;&lt;td&gt;15&lt;/td&gt;&lt;td&gt;175&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;156&lt;/td&gt;&lt;td&gt;140&lt;/td&gt;&lt;td&gt;47&lt;/td&gt;&lt;td&gt;3.74&lt;/td&gt;&lt;td&gt;655&lt;/td&gt;&lt;td&gt;14&lt;/td&gt;&lt;td&gt;175&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;178&lt;/td&gt;&lt;td&gt;160&lt;/td&gt;&lt;td&gt;53&lt;/td&gt;&lt;td&gt;4.14&lt;/td&gt;&lt;td&gt;725&lt;/td&gt;&lt;td&gt;14&lt;/td&gt;&lt;td&gt;175&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;200&lt;/td&gt;&lt;td&gt;180&lt;/td&gt;&lt;td&gt;60&lt;/td&gt;&lt;td&gt;4.54&lt;/td&gt;&lt;td&gt;795&lt;/td&gt;&lt;td&gt;13&lt;/td&gt;&lt;td&gt;175&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;222&lt;/td&gt;&lt;td&gt;200&lt;/td&gt;&lt;td&gt;67&lt;/td&gt;&lt;td&gt;4.94&lt;/td&gt;&lt;td&gt;865&lt;/td&gt;&lt;td&gt;13&lt;/td&gt;&lt;td&gt;175&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;244&lt;/td&gt;&lt;td&gt;220&lt;/td&gt;&lt;td&gt;73&lt;/t
d&gt;&lt;td&gt;5.34&lt;/td&gt;&lt;td&gt;935&lt;/td&gt;&lt;td&gt;13&lt;/td&gt;&lt;td&gt;175&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;267&lt;/td&gt;&lt;td&gt;240&lt;/td&gt;&lt;td&gt;80&lt;/td&gt;&lt;td&gt;5.74&lt;/td&gt;&lt;td&gt;1,005&lt;/td&gt;&lt;td&gt;13&lt;/td&gt;&lt;td&gt;175&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;289&lt;/td&gt;&lt;td&gt;260&lt;/td&gt;&lt;td&gt;87&lt;/td&gt;&lt;td&gt;6.14&lt;/td&gt;&lt;td&gt;1,075&lt;/td&gt;&lt;td&gt;12&lt;/td&gt;&lt;td&gt;175&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;311&lt;/td&gt;&lt;td&gt;280&lt;/td&gt;&lt;td&gt;93&lt;/td&gt;&lt;td&gt;6.54&lt;/td&gt;&lt;td&gt;1,145&lt;/td&gt;&lt;td&gt;12&lt;/td&gt;&lt;td&gt;175&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;333&lt;/td&gt;&lt;td&gt;300&lt;/td&gt;&lt;td&gt;100&lt;/td&gt;&lt;td&gt;6.94&lt;/td&gt;&lt;td&gt;1,215&lt;/td&gt;&lt;td&gt;12&lt;/td&gt;&lt;td&gt;175&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt; </ephtml> </p> <p>The design effect is the inflation in the error variance of a statistic caused by the intraclass correlation among observations (Kish, [<reflink idref="bib10" id="ref16">10</reflink>]). The design effect will be different for different statistics from the same sample. For example, the design effect for a <emph>p</emph>-value will be different from the design effect for a Rasch item calibration or a biserial correlation. In the results reported below the focus is on the design effects for Rasch calibrations. In a two-stage cluster sample, the design effect is</p> <p>(<reflink idref="bib1" id="ref17">1</reflink>)</p> <p>$\mathit{Deff} = 1 + (\bar{n} - 1)\rho$</p> <p>In Equation 1, $\bar{n}$ is the average cluster size (or the average number of students per school per form in the sample) and $\rho$ is the intraclass correlation for Rasch calibrations within schools. From previous experience in the school system, it was assumed that the school intraclass correlation for item calibration estimates was approximately $\rho = 0.06$, the value that reproduces the design effects listed in Table 1. 
With cluster sampling, the goal is to find the effective sample size that meets power analysis specifications, where the effective sample size is defined as $n_{\mathit{eff}} = n / \mathit{Deff}$, the number of responding students per item divided by the design effect. Table 1 provides 15 different two-stage cluster designs of different combinations of schools and students that meet the power analysis specifications and result in an effective sample size of 175 when the intraclass correlation is $\rho = 0.06$. The table assumes that all selected schools will participate, that 90% of the selected students will participate, and that the three field test forms will be spiraled among students within schools (about a third of the students in each school will take each of the three field test forms).</p> <p>From among the available sampling designs, the design with <emph>m</emph> = 13 schools and 200 students responding per school was selected. To confirm that this design yields an effective sample size equal to 175, the anticipated sample size in column 5 is divided by the anticipated design effect in column 4 (i.e., 865/4.94 = 175). The cluster size per item in column 3 was determined by dividing the number of responding students per school by three, the number of field test forms spiraled within each school. The design effect was determined via Equation 1, where $\bar{n}$ is the cluster size per item. 
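</p> <p>As a rough check on the planning arithmetic, the following Python sketch recomputes rows of Table 1 from Equation 1. It assumes an intraclass correlation of 0.06 (the value implied by the design effects printed in Table 1), three spiraled forms, a 90% student response rate, and rounding rules chosen to match the printed table. The function and parameter names are illustrative and are not taken from the study.</p>

```python
import math


def design_effect(avg_cluster_size, icc):
    """Equation 1: Deff = 1 + (n_bar - 1) * rho for a two-stage cluster sample."""
    return 1.0 + (avg_cluster_size - 1.0) * icc


def plan_design(responding_per_school, n_forms=3, icc=0.06,
                target_n_eff=175, student_response_rate=0.90):
    """Recompute one row of a Table 1 style planning table.

    The defaults are assumptions: icc=0.06 is inferred from Table 1, and
    the rounding rules are chosen to reproduce the printed values.
    """
    # Cluster size per item: responding students split across spiraled forms.
    cluster_size = responding_per_school / n_forms
    deff = design_effect(cluster_size, icc)
    # Responding students needed per item = target effective n times Deff.
    students_per_item = math.ceil(target_n_eff * deff)
    return {
        "selected_per_school": round(responding_per_school / student_response_rate),
        "cluster_size_per_item": round(cluster_size),
        "design_effect": round(deff, 2),
        "students_per_item": students_per_item,
        "schools_needed": round(target_n_eff * deff / cluster_size),
        "effective_n": round(students_per_item / deff),
    }


if __name__ == "__main__":
    for responding in (20, 100, 200, 300):
        print(responding, plan_design(responding))
```

<p>Under these assumptions, the row for 200 responding students per school gives a design effect of 4.94, 865 responding students per item, and 13 schools, in line with the design that was selected.</p> <p>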
The number of responding students needed per item was determined by multiplying the needed effective sample size by the expected design effect.</p> <hd id="AN0100241096-5">RESULTS</hd> <p></p> <hd id="AN0100241096-6">Response Rates</hd> <p>Table 2 presents the school response rates associated with the field-test sample, and Table 3 shows the student response rates.</p> <p>TABLE 2 Response Rates for Schools</p> <p> <ephtml> &lt;table&gt;&lt;thead valign="bottom"&gt;&lt;tr&gt;&lt;td&gt;Stratum&lt;/td&gt;&lt;td&gt;Schools in Population&lt;/td&gt;&lt;td&gt;Selected Schools&lt;/td&gt;&lt;td&gt;Responding Schools&lt;/td&gt;&lt;td&gt;School Response Rate&lt;/td&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Urban&lt;/td&gt;&lt;td&gt;11&lt;/td&gt;&lt;td&gt;7&lt;/td&gt;&lt;td&gt;6&lt;/td&gt;&lt;td&gt;0.86&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Suburban&lt;/td&gt;&lt;td&gt;6&lt;/td&gt;&lt;td&gt;4&lt;/td&gt;&lt;td&gt;4&lt;/td&gt;&lt;td&gt;1.00&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Rural&lt;/td&gt;&lt;td&gt;4&lt;/td&gt;&lt;td&gt;2&lt;/td&gt;&lt;td&gt;2&lt;/td&gt;&lt;td&gt;1.00&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Total&lt;/td&gt;&lt;td&gt;21&lt;/td&gt;&lt;td&gt;13&lt;/td&gt;&lt;td&gt;12&lt;/td&gt;&lt;td&gt;0.92&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt; </ephtml> </p> <p>TABLE 3 Response Rates for Students</p> <p> <ephtml> &lt;table&gt;&lt;thead valign="bottom"&gt;&lt;tr&gt;&lt;td&gt;Stratum&lt;/td&gt;&lt;td&gt;Students in Population&lt;/td&gt;&lt;td&gt;Selected Students&lt;/td&gt;&lt;td&gt;Responding Students&lt;/td&gt;&lt;td&gt;Student Response Rate&lt;/td&gt;&lt;td&gt;Overall Response 
Rate&lt;/td&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Urban&lt;/td&gt;&lt;td&gt;3,848&lt;/td&gt;&lt;td&gt;1,400&lt;/td&gt;&lt;td&gt;1,035&lt;/td&gt;&lt;td&gt;0.74&lt;/td&gt;&lt;td&gt;0.63&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Suburban&lt;/td&gt;&lt;td&gt;1,834&lt;/td&gt;&lt;td&gt;800&lt;/td&gt;&lt;td&gt;753&lt;/td&gt;&lt;td&gt;0.94&lt;/td&gt;&lt;td&gt;0.94&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Rural&lt;/td&gt;&lt;td&gt;1,089&lt;/td&gt;&lt;td&gt;400&lt;/td&gt;&lt;td&gt;374&lt;/td&gt;&lt;td&gt;0.94&lt;/td&gt;&lt;td&gt;0.94&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Total&lt;/td&gt;&lt;td&gt;6,771&lt;/td&gt;&lt;td&gt;2,600&lt;/td&gt;&lt;td&gt;2,162&lt;/td&gt;&lt;td&gt;0.83&lt;/td&gt;&lt;td&gt;0.77&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt; </ephtml> </p> <p>In Table 2, out of <emph>M</emph> = 21 schools, <emph>m</emph> = 13 were sampled but only 12 responded (school number 10 refused to participate). This resulted in a school response rate equal to .92. In planning for the study, it was assumed all selected schools would participate, but this turned out not to be the case. In Table 3, of 6,771 students, 2,600 were selected but only 2,162 responded, a student response rate equal to .83. The overall response rate was equal to .92 × .83 = .77 (computed from the unrounded rates). The response rates were used to adjust the sampling weights below for non-response.</p> <hd id="AN0100241096-7">Sampling Weights</hd> <p>Sampling weights are useful for improving the representativeness of the sample and providing unbiased estimates of population parameters. Table 4 provides the sampling weights for the primary sampling units (schools). The probability of selecting school <emph>i</emph> in stratum <emph>h</emph> is found by $\pi_{hi} = m_h N_{hi} / N_h$, where $m_h$ is the number of schools selected in stratum <emph>h</emph>, $N_{hi}$ is the number of students in school <emph>i</emph> in stratum <emph>h</emph>, and $N_h$ is the number of students in stratum <emph>h</emph>. 
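</p> <p>The PPS selection probability just described, together with the school weight taken as its inverse, can be sketched in Python as follows. The school and stratum sizes are copied from Tables 3 and 4; the function names and the dictionary layout are illustrative assumptions, not part of the study's SPSS workflow.</p>

```python
def pps_selection_probability(schools_selected, school_size, stratum_size):
    """Probability of selecting a school under PPS within a stratum.

    schools_selected: number of schools drawn in the stratum
    school_size:      number of students in the school
    stratum_size:     number of students in the stratum
    """
    return schools_selected * school_size / stratum_size


def school_weight(selection_probability):
    """School weight: the inverse of the selection probability."""
    return 1.0 / selection_probability


# A few sampled schools, with sizes taken from Tables 3 and 4.
EXAMPLES = [
    # (stratum, school, schools selected, school size, stratum size)
    ("Urban", 2, 7, 387, 3848),
    ("Urban", 4, 7, 381, 3848),
    ("Suburban", 12, 4, 292, 1834),
    ("Rural", 18, 2, 277, 1089),
]

if __name__ == "__main__":
    for stratum, school, m_h, n_school, n_stratum in EXAMPLES:
        prob = pps_selection_probability(m_h, n_school, n_stratum)
        print(stratum, school, round(prob, 2), round(school_weight(prob), 2))
```

<p>For school 2 in the urban stratum this gives a selection probability of about 0.70 and a school weight of about 1.42, matching the corresponding row of Table 4.</p> <p>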
The school weight is the inverse of the probability of selection.</p> <p>TABLE 4 Sampling Weights for Primary Sampling Units (Schools)</p> <p> <ephtml> &lt;table&gt;&lt;thead valign="bottom"&gt;&lt;tr&gt;&lt;td&gt;Stratum&lt;/td&gt;&lt;td&gt;School&lt;/td&gt;&lt;td&gt;Population Size in Each School&lt;/td&gt;&lt;td&gt;School Selected in the Sample&lt;/td&gt;&lt;td&gt;Probability of Selecting Each School&lt;/td&gt;&lt;td&gt;School Weight&lt;/td&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Urban&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;449&lt;/td&gt;&lt;td /&gt;&lt;td /&gt;&lt;td /&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td /&gt;&lt;td&gt;2&lt;/td&gt;&lt;td&gt;387&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;0.70&lt;/td&gt;&lt;td&gt;1.42&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td /&gt;&lt;td&gt;3&lt;/td&gt;&lt;td&gt;385&lt;/td&gt;&lt;td /&gt;&lt;td /&gt;&lt;td /&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td /&gt;&lt;td&gt;4&lt;/td&gt;&lt;td&gt;381&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;0.69&lt;/td&gt;&lt;td&gt;1.44&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td /&gt;&lt;td&gt;5&lt;/td&gt;&lt;td&gt;356&lt;/td&gt;&lt;td /&gt;&lt;td /&gt;&lt;td /&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td /&gt;&lt;td&gt;6&lt;/td&gt;&lt;td&gt;347&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;0.63&lt;/td&gt;&lt;td&gt;1.58&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td /&gt;&lt;td&gt;7&lt;/td&gt;&lt;td&gt;340&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;0.62&lt;/td&gt;&lt;td&gt;1.62&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td /&gt;&lt;td&gt;8&lt;/td&gt;&lt;td&gt;314&lt;/td&gt;&lt;td /&gt;&lt;td /&gt;&lt;td /&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td /&gt;&lt;td&gt;9&lt;/td&gt;&lt;td&gt;299&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;0.54&lt;/td&gt;&lt;td&gt;1.84&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td /&gt;&lt;td&gt;10&lt;/td&gt;&lt;td&gt;295&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;0.54&lt;/td&gt;&lt;td&gt;1.86&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td 
/&gt;&lt;td&gt;11&lt;/td&gt;&lt;td&gt;295&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;0.54&lt;/td&gt;&lt;td&gt;1.86&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Suburban&lt;/td&gt;&lt;td&gt;12&lt;/td&gt;&lt;td&gt;292&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;0.64&lt;/td&gt;&lt;td&gt;1.57&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td /&gt;&lt;td&gt;13&lt;/td&gt;&lt;td&gt;290&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;0.63&lt;/td&gt;&lt;td&gt;1.58&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td /&gt;&lt;td&gt;14&lt;/td&gt;&lt;td&gt;287&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;0.63&lt;/td&gt;&lt;td&gt;1.60&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td /&gt;&lt;td&gt;15&lt;/td&gt;&lt;td&gt;285&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;0.62&lt;/td&gt;&lt;td&gt;1.61&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td /&gt;&lt;td&gt;16&lt;/td&gt;&lt;td&gt;399&lt;/td&gt;&lt;td /&gt;&lt;td /&gt;&lt;td /&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td /&gt;&lt;td&gt;17&lt;/td&gt;&lt;td&gt;281&lt;/td&gt;&lt;td /&gt;&lt;td /&gt;&lt;td /&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Rural&lt;/td&gt;&lt;td&gt;18&lt;/td&gt;&lt;td&gt;277&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;0.51&lt;/td&gt;&lt;td&gt;1.97&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td /&gt;&lt;td&gt;19&lt;/td&gt;&lt;td&gt;275&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;0.51&lt;/td&gt;&lt;td&gt;1.98&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td /&gt;&lt;td&gt;20&lt;/td&gt;&lt;td&gt;270&lt;/td&gt;&lt;td /&gt;&lt;td /&gt;&lt;td /&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td /&gt;&lt;td&gt;21&lt;/td&gt;&lt;td&gt;267&lt;/td&gt;&lt;td /&gt;&lt;td /&gt;&lt;td /&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt; </ephtml> </p> <p>Table 5 presents the sampling weights for the secondary sampling units (students). The probability of selecting student <emph>j</emph> in the sample is equal to $\pi_{hij} = \pi_{hi}\,(n_{hi} / N_{hi})$, the probability of selecting school <emph>i</emph> in stratum <emph>h</emph>, $\pi_{hi}$, times the probability of selecting the student within that school, where $n_{hi}$ is the number of students selected in the school and $N_{hi}$ is the number of students in the school. The student base weight, $w_{hij} = 1/\pi_{hij}$, is the inverse of the probability of selection. It should be noted that the sum of the base weights equals the population size in each stratum, $N_h$. 
This ensures that the sample is representative of each stratum individually and of the population as a whole. The weighted number of students in each cell in Table 5 is equal to $n_{hi} w_{hij}$, the number of students selected in the school times the student base weight.</p> <p>TABLE 5 Sampling Weights for Secondary Sampling Units (Students)</p> <p> <ephtml> &lt;table&gt;&lt;thead valign="bottom"&gt;&lt;tr&gt;&lt;td&gt;Stratum&lt;/td&gt;&lt;td&gt;School&lt;/td&gt;&lt;td&gt;Number Students Selected&lt;/td&gt;&lt;td&gt;Probability of Selecting Each Student&lt;/td&gt;&lt;td&gt;Sampling Base Weight for Each Student&lt;/td&gt;&lt;td&gt;Weighted Number of Students&lt;/td&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Urban&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td /&gt;&lt;td /&gt;&lt;td /&gt;&lt;td /&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td /&gt;&lt;td&gt;2&lt;/td&gt;&lt;td&gt;200&lt;/td&gt;&lt;td&gt;0.36&lt;/td&gt;&lt;td&gt;2.75&lt;/td&gt;&lt;td&gt;550&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td /&gt;&lt;td&gt;3&lt;/td&gt;&lt;td /&gt;&lt;td /&gt;&lt;td /&gt;&lt;td /&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td /&gt;&lt;td&gt;4&lt;/td&gt;&lt;td&gt;200&lt;/td&gt;&lt;td&gt;0.36&lt;/td&gt;&lt;td&gt;2.75&lt;/td&gt;&lt;td&gt;550&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td /&gt;&lt;td&gt;5&lt;/td&gt;&lt;td /&gt;&lt;td /&gt;&lt;td /&gt;&lt;td /&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td /&gt;&lt;td&gt;6&lt;/td&gt;&lt;td&gt;200&lt;/td&gt;&lt;td&gt;0.36&lt;/td&gt;&lt;td&gt;2.75&lt;/td&gt;&lt;td&gt;550&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td /&gt;&lt;td&gt;7&lt;/td&gt;&lt;td&gt;200&lt;/td&gt;&lt;td&gt;0.36&lt;/td&gt;&lt;td&gt;2.75&lt;/td&gt;&lt;td&gt;550&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td /&gt;&lt;td&gt;8&lt;/td&gt;&lt;td /&gt;&lt;td /&gt;&lt;td /&gt;&lt;td /&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td /&gt;&lt;td&gt;9&lt;/td&gt;&lt;td&gt;200&lt;/td&gt;&lt;td&gt;0.36&lt;/td&gt;&lt;td&gt;2.75&lt;/td&gt;&lt;td&gt;550&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td /&gt;&lt;td&gt;10&lt;/td&gt;&lt;td&gt;200&lt;/td&gt;&lt;td&gt;0.36&lt;/td&gt;&lt;td&gt;2.75&lt;/td&gt;&lt;td&gt;550&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td 
/&gt;&lt;td&gt;11&lt;/td&gt;&lt;td&gt;200&lt;/td&gt;&lt;td&gt;0.36&lt;/td&gt;&lt;td&gt;2.75&lt;/td&gt;&lt;td&gt;550&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Suburban&lt;/td&gt;&lt;td&gt;12&lt;/td&gt;&lt;td&gt;200&lt;/td&gt;&lt;td&gt;0.44&lt;/td&gt;&lt;td&gt;2.29&lt;/td&gt;&lt;td&gt;459&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td /&gt;&lt;td&gt;13&lt;/td&gt;&lt;td&gt;200&lt;/td&gt;&lt;td&gt;0.44&lt;/td&gt;&lt;td&gt;2.29&lt;/td&gt;&lt;td&gt;459&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td /&gt;&lt;td&gt;14&lt;/td&gt;&lt;td&gt;200&lt;/td&gt;&lt;td&gt;0.44&lt;/td&gt;&lt;td&gt;2.29&lt;/td&gt;&lt;td&gt;459&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td /&gt;&lt;td&gt;15&lt;/td&gt;&lt;td&gt;200&lt;/td&gt;&lt;td&gt;0.44&lt;/td&gt;&lt;td&gt;2.29&lt;/td&gt;&lt;td&gt;459&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td /&gt;&lt;td&gt;16&lt;/td&gt;&lt;td /&gt;&lt;td /&gt;&lt;td /&gt;&lt;td /&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td /&gt;&lt;td&gt;17&lt;/td&gt;&lt;td /&gt;&lt;td /&gt;&lt;td /&gt;&lt;td /&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Rural&lt;/td&gt;&lt;td&gt;18&lt;/td&gt;&lt;td&gt;200&lt;/td&gt;&lt;td&gt;0.37&lt;/td&gt;&lt;td&gt;2.72&lt;/td&gt;&lt;td&gt;545&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td /&gt;&lt;td&gt;19&lt;/td&gt;&lt;td&gt;200&lt;/td&gt;&lt;td&gt;0.37&lt;/td&gt;&lt;td&gt;2.72&lt;/td&gt;&lt;td&gt;545&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td /&gt;&lt;td&gt;20&lt;/td&gt;&lt;td /&gt;&lt;td /&gt;&lt;td /&gt;&lt;td /&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td /&gt;&lt;td&gt;21&lt;/td&gt;&lt;td /&gt;&lt;td /&gt;&lt;td /&gt;&lt;td /&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt; </ephtml> </p> <p>The base weight was adjusted for non-response as presented in Table 6.</p> <p>TABLE 6 Adjustments to Sampling Weights for Non-Response</p> <p> <ephtml> &lt;table&gt;&lt;thead valign="bottom"&gt;&lt;tr&gt;&lt;td&gt;Stratum&lt;/td&gt;&lt;td&gt;School&lt;/td&gt;&lt;td&gt;School Responding in the Selected Sample&lt;/td&gt;&lt;td&gt;Sampling Weight After Adjustment for Non-Response&lt;/td&gt;&lt;td&gt;Weighted Number of 
Students&lt;/td&gt;&lt;td&gt;Normalized Weight After Adjustment for Non-Response&lt;/td&gt;&lt;td&gt;Normalized Weighted Number of Students in Responding Sample&lt;/td&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Urban&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td /&gt;&lt;td /&gt;&lt;td /&gt;&lt;td /&gt;&lt;td /&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td /&gt;&lt;td&gt;2&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;3.66&lt;/td&gt;&lt;td&gt;641.3&lt;/td&gt;&lt;td&gt;.98&lt;/td&gt;&lt;td&gt;173&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td /&gt;&lt;td&gt;3&lt;/td&gt;&lt;td /&gt;&lt;td /&gt;&lt;td /&gt;&lt;td /&gt;&lt;td /&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td /&gt;&lt;td&gt;4&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;3.54&lt;/td&gt;&lt;td&gt;641&lt;/td&gt;&lt;td&gt;.95&lt;/td&gt;&lt;td&gt;173&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td /&gt;&lt;td&gt;5&lt;/td&gt;&lt;td /&gt;&lt;td /&gt;&lt;td /&gt;&lt;td /&gt;&lt;td /&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td /&gt;&lt;td&gt;6&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;3.72&lt;/td&gt;&lt;td&gt;641&lt;/td&gt;&lt;td&gt;1.00&lt;/td&gt;&lt;td&gt;173&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td /&gt;&lt;td&gt;7&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;3.83&lt;/td&gt;&lt;td&gt;641&lt;/td&gt;&lt;td&gt;1.03&lt;/td&gt;&lt;td&gt;173&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td /&gt;&lt;td&gt;8&lt;/td&gt;&lt;td /&gt;&lt;td /&gt;&lt;td /&gt;&lt;td /&gt;&lt;td /&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td /&gt;&lt;td&gt;9&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;3.76&lt;/td&gt;&lt;td&gt;641&lt;/td&gt;&lt;td&gt;1.01&lt;/td&gt;&lt;td&gt;173&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td /&gt;&lt;td&gt;10&lt;/td&gt;&lt;td /&gt;&lt;td /&gt;&lt;td /&gt;&lt;td /&gt;&lt;td /&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td 
/&gt;&lt;td&gt;11&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;3.82&lt;/td&gt;&lt;td&gt;641&lt;/td&gt;&lt;td&gt;1.03&lt;/td&gt;&lt;td&gt;173&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Suburban&lt;/td&gt;&lt;td&gt;12&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;2.45&lt;/td&gt;&lt;td&gt;459&lt;/td&gt;&lt;td&gt;1.00&lt;/td&gt;&lt;td&gt;188&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td /&gt;&lt;td&gt;13&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;2.43&lt;/td&gt;&lt;td&gt;459&lt;/td&gt;&lt;td&gt;1.00&lt;/td&gt;&lt;td&gt;188&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td /&gt;&lt;td&gt;14&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;2.44&lt;/td&gt;&lt;td&gt;459&lt;/td&gt;&lt;td&gt;1.00&lt;/td&gt;&lt;td&gt;188&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td /&gt;&lt;td&gt;15&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;2.43&lt;/td&gt;&lt;td&gt;459&lt;/td&gt;&lt;td&gt;1.00&lt;/td&gt;&lt;td&gt;188&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td /&gt;&lt;td&gt;16&lt;/td&gt;&lt;td /&gt;&lt;td /&gt;&lt;td /&gt;&lt;td /&gt;&lt;td /&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td /&gt;&lt;td&gt;17&lt;/td&gt;&lt;td /&gt;&lt;td /&gt;&lt;td /&gt;&lt;td /&gt;&lt;td /&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Rural&lt;/td&gt;&lt;td&gt;18&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;2.89&lt;/td&gt;&lt;td&gt;545&lt;/td&gt;&lt;td&gt;.99&lt;/td&gt;&lt;td&gt;187&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td /&gt;&lt;td&gt;19&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;2.93&lt;/td&gt;&lt;td&gt;545&lt;/td&gt;&lt;td&gt;1.01&lt;/td&gt;&lt;td&gt;187&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td /&gt;&lt;td&gt;20&lt;/td&gt;&lt;td /&gt;&lt;td /&gt;&lt;td /&gt;&lt;td /&gt;&lt;td /&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td /&gt;&lt;td&gt;21&lt;/td&gt;&lt;td /&gt;&lt;td /&gt;&lt;td /&gt;&lt;td /&gt;&lt;td /&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt; </ephtml> </p> <p>The adjustments to the base weights are largest for urban schools because that is the stratum in which the largest amount of non-response was observed. The weight adjusted for non-response, , is found by . 
The weights adjusted for non-response sum to the population size within each stratum, . These would be the weights used in SPSS for all analyses involving complex sampling. The weighted number of students in each cell in Table 6 is equal to . Ultimately, the weights are used in secondary software, such as iAM,[<reflink idref="bib3" id="ref18">3</reflink>] which requires normalized weights that sum to the responding sample size rather than the population size. The normalized weights, , are found by dividing the weights (after adjustment for non-response) by the average of the weights within each stratum . The normalized weights are provided in column 6 of Table 6 and sum to the responding sample size in each stratum, . The normalized weighted number of students in each cell in Table 6 is equal to .</p> <hd id="AN0100241096-8">Standard Error of Item Calibration</hd> <p>The above normalized weights, stratification information, and cluster designation, along with the student item responses, were used to calibrate and equate the three test forms using the iAM software. In Table 7, the <emph>p</emph>-values associated with the 15 field-test items are presented along with their standard errors. There are two sets of standard errors associated with each <emph>p</emph>-value. The first standard error in column 4 assumes that the <emph>p</emph>-value was obtained from an SRS, and the second standard error in column 5 assumes the data were collected under a complex sampling design. The design-consistent standard errors are larger by a factor equal to the root design effect (defined as the square root of the design effect). Relatedly, any significance test involving the <emph>p</emph>-value should be based on the effective sample sizes, not the actual sample sizes. 
The effective sample size is the actual sample size divided by the design effect.</p> <p>TABLE 7 Standard Errors of P-Values Based on Simple Random Sample Versus Complex Sample</p> <p> <ephtml> &lt;table&gt;&lt;thead valign="bottom"&gt;&lt;tr&gt;&lt;td&gt;Field Test Item&lt;/td&gt;&lt;td&gt;Sample Size&lt;/td&gt;&lt;td&gt;P-Value&lt;/td&gt;&lt;td&gt;SRS SE (P)&lt;/td&gt;&lt;td&gt;CS SE (P)&lt;/td&gt;&lt;td&gt;Design Effect&lt;/td&gt;&lt;td&gt;Root Design Effect&lt;/td&gt;&lt;td&gt;Effective Sample Size&lt;/td&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;800&lt;/td&gt;&lt;td&gt;.57&lt;/td&gt;&lt;td&gt;.01&lt;/td&gt;&lt;td&gt;.028&lt;/td&gt;&lt;td&gt;3.8&lt;/td&gt;&lt;td&gt;1.9&lt;/td&gt;&lt;td&gt;212&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;2&lt;/td&gt;&lt;td&gt;800&lt;/td&gt;&lt;td&gt;.24&lt;/td&gt;&lt;td&gt;.01&lt;/td&gt;&lt;td&gt;.014&lt;/td&gt;&lt;td&gt;1.3&lt;/td&gt;&lt;td&gt;1.1&lt;/td&gt;&lt;td&gt;626&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;3&lt;/td&gt;&lt;td&gt;800&lt;/td&gt;&lt;td&gt;.28&lt;/td&gt;&lt;td&gt;.01&lt;/td&gt;&lt;td&gt;.020&lt;/td&gt;&lt;td&gt;2.4&lt;/td&gt;&lt;td&gt;1.5&lt;/td&gt;&lt;td&gt;338&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;4&lt;/td&gt;&lt;td&gt;800&lt;/td&gt;&lt;td&gt;.57&lt;/td&gt;&lt;td&gt;.01&lt;/td&gt;&lt;td&gt;.030&lt;/td&gt;&lt;td&gt;4.2&lt;/td&gt;&lt;td&gt;2.1&lt;/td&gt;&lt;td&gt;189&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;5&lt;/td&gt;&lt;td&gt;800&lt;/td&gt;&lt;td&gt;.75&lt;/td&gt;&lt;td&gt;.01&lt;/td&gt;&lt;td&gt;.021&lt;/td&gt;&lt;td&gt;2.9&lt;/td&gt;&lt;td&gt;1.7&lt;/td&gt;&lt;td&gt;279&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;6&lt;/td&gt;&lt;td&gt;804&lt;/td&gt;&lt;td&gt;.68&lt;/td&gt;&lt;td&gt;.01&lt;/td&gt;&lt;td&gt;.017&lt;/td&gt;&lt;td&gt;1.6&lt;/td&gt;&lt;td&gt;1.3&lt;/td&gt;&lt;td&gt;506&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;7&lt;/td&gt;&lt;td&gt;804&lt;/td&gt;&lt;td&gt;.35&lt;/td&gt;&lt;td&gt;.01&lt;/td&gt;&lt;td&gt;.012&lt;/td&gt;&lt;td&gt;1.0&lt;/td&gt;&lt;td&gt;1.0&lt;/t
d&gt;&lt;td&gt;804&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;8&lt;/td&gt;&lt;td&gt;804&lt;/td&gt;&lt;td&gt;.80&lt;/td&gt;&lt;td&gt;.01&lt;/td&gt;&lt;td&gt;.018&lt;/td&gt;&lt;td&gt;2.4&lt;/td&gt;&lt;td&gt;1.5&lt;/td&gt;&lt;td&gt;335&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;9&lt;/td&gt;&lt;td&gt;804&lt;/td&gt;&lt;td&gt;.69&lt;/td&gt;&lt;td&gt;.01&lt;/td&gt;&lt;td&gt;.024&lt;/td&gt;&lt;td&gt;3.2&lt;/td&gt;&lt;td&gt;1.8&lt;/td&gt;&lt;td&gt;251&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;10&lt;/td&gt;&lt;td&gt;804&lt;/td&gt;&lt;td&gt;.74&lt;/td&gt;&lt;td&gt;.01&lt;/td&gt;&lt;td&gt;.022&lt;/td&gt;&lt;td&gt;3.0&lt;/td&gt;&lt;td&gt;1.7&lt;/td&gt;&lt;td&gt;273&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;11&lt;/td&gt;&lt;td&gt;556&lt;/td&gt;&lt;td&gt;.69&lt;/td&gt;&lt;td&gt;.02&lt;/td&gt;&lt;td&gt;.019&lt;/td&gt;&lt;td&gt;1.3&lt;/td&gt;&lt;td&gt;1.1&lt;/td&gt;&lt;td&gt;425&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;12&lt;/td&gt;&lt;td&gt;556&lt;/td&gt;&lt;td&gt;.29&lt;/td&gt;&lt;td&gt;.01&lt;/td&gt;&lt;td&gt;.015&lt;/td&gt;&lt;td&gt;1.0&lt;/td&gt;&lt;td&gt;1.0&lt;/td&gt;&lt;td&gt;556&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;13&lt;/td&gt;&lt;td&gt;556&lt;/td&gt;&lt;td&gt;.80&lt;/td&gt;&lt;td&gt;.01&lt;/td&gt;&lt;td&gt;.018&lt;/td&gt;&lt;td&gt;1.7&lt;/td&gt;&lt;td&gt;1.3&lt;/td&gt;&lt;td&gt;334&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;14&lt;/td&gt;&lt;td&gt;556&lt;/td&gt;&lt;td&gt;.47&lt;/td&gt;&lt;td&gt;.02&lt;/td&gt;&lt;td&gt;.023&lt;/td&gt;&lt;td&gt;1.8&lt;/td&gt;&lt;td&gt;1.3&lt;/td&gt;&lt;td&gt;313&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;15&lt;/td&gt;&lt;td&gt;556&lt;/td&gt;&lt;td&gt;.89&lt;/td&gt;&lt;td&gt;.01&lt;/td&gt;&lt;td&gt;.014&lt;/td&gt;&lt;td&gt;1.6&lt;/td&gt;&lt;td&gt;1.3&lt;/td&gt;&lt;td&gt;352&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt; </ephtml> </p> <p>TABLE 8 Standard Error (SE) of Item Calibration Based on Simple Random Sampling Versus Complex Sampling</p> <p> <ephtml> &lt;table&gt;&lt;thead valign="bottom"&gt;&lt;tr&gt;&lt;td&gt;Field Test 
Item&lt;/td&gt;&lt;td&gt;Sample Size&lt;/td&gt;&lt;td&gt;b-Value&lt;/td&gt;&lt;td&gt;SRS SE (b)&lt;/td&gt;&lt;td&gt;CS SE (b)&lt;/td&gt;&lt;td&gt;Design Effect&lt;/td&gt;&lt;td&gt;Root Design Effect&lt;/td&gt;&lt;td&gt;Effective Sample Size&lt;/td&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;800&lt;/td&gt;&lt;td&gt;&amp;#8722;0.33&lt;/td&gt;&lt;td&gt;0.09&lt;/td&gt;&lt;td&gt;0.19&lt;/td&gt;&lt;td&gt;4.8&lt;/td&gt;&lt;td&gt;2.2&lt;/td&gt;&lt;td&gt;166&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;2&lt;/td&gt;&lt;td&gt;800&lt;/td&gt;&lt;td&gt;1.38&lt;/td&gt;&lt;td&gt;0.10&lt;/td&gt;&lt;td&gt;0.12&lt;/td&gt;&lt;td&gt;1.6&lt;/td&gt;&lt;td&gt;1.3&lt;/td&gt;&lt;td&gt;503&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;3&lt;/td&gt;&lt;td&gt;800&lt;/td&gt;&lt;td&gt;1.12&lt;/td&gt;&lt;td&gt;0.09&lt;/td&gt;&lt;td&gt;0.15&lt;/td&gt;&lt;td&gt;2.8&lt;/td&gt;&lt;td&gt;1.7&lt;/td&gt;&lt;td&gt;287&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;4&lt;/td&gt;&lt;td&gt;800&lt;/td&gt;&lt;td&gt;&amp;#8722;0.36&lt;/td&gt;&lt;td&gt;0.09&lt;/td&gt;&lt;td&gt;0.20&lt;/td&gt;&lt;td&gt;5.3&lt;/td&gt;&lt;td&gt;2.3&lt;/td&gt;&lt;td&gt;152&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;5&lt;/td&gt;&lt;td&gt;800&lt;/td&gt;&lt;td&gt;&amp;#8722;1.3&lt;/td&gt;&lt;td&gt;0.10&lt;/td&gt;&lt;td&gt;0.20&lt;/td&gt;&lt;td&gt;4.2&lt;/td&gt;&lt;td&gt;2.0&lt;/td&gt;&lt;td&gt;191&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;6&lt;/td&gt;&lt;td&gt;804&lt;/td&gt;&lt;td&gt;&amp;#8722;0.87&lt;/td&gt;&lt;td&gt;0.09&lt;/td&gt;&lt;td&gt;0.16&lt;/td&gt;&lt;td&gt;3.2&lt;/td&gt;&lt;td&gt;1.8&lt;/td&gt;&lt;td&gt;252&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;7&lt;/td&gt;&lt;td&gt;804&lt;/td&gt;&lt;td&gt;0.76&lt;/td&gt;&lt;td&gt;0.09&lt;/td&gt;&lt;td&gt;0.20&lt;/td&gt;&lt;td&gt;4.4&lt;/td&gt;&lt;td&gt;2.1&lt;/td&gt;&lt;td&gt;182&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;8&lt;/td&gt;&lt;td&gt;804&lt;/td&gt;&lt;td&gt;&amp;#8722;1.6&lt;/td&gt;&lt;td&gt;0.10&lt;/td&gt;&lt;td&gt;0.17&lt;/td&gt;&lt;td&gt;2.9&lt;/td&gt;&lt;t
d&gt;1.7&lt;/td&gt;&lt;td&gt;280&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;9&lt;/td&gt;&lt;td&gt;804&lt;/td&gt;&lt;td&gt;&amp;#8722;0.9&lt;/td&gt;&lt;td&gt;0.09&lt;/td&gt;&lt;td&gt;0.18&lt;/td&gt;&lt;td&gt;4.1&lt;/td&gt;&lt;td&gt;2.0&lt;/td&gt;&lt;td&gt;197&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;10&lt;/td&gt;&lt;td&gt;804&lt;/td&gt;&lt;td&gt;&amp;#8722;1.19&lt;/td&gt;&lt;td&gt;0.09&lt;/td&gt;&lt;td&gt;0.15&lt;/td&gt;&lt;td&gt;2.8&lt;/td&gt;&lt;td&gt;1.7&lt;/td&gt;&lt;td&gt;283&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;11&lt;/td&gt;&lt;td&gt;556&lt;/td&gt;&lt;td&gt;&amp;#8722;0.93&lt;/td&gt;&lt;td&gt;0.11&lt;/td&gt;&lt;td&gt;0.16&lt;/td&gt;&lt;td&gt;2.3&lt;/td&gt;&lt;td&gt;1.5&lt;/td&gt;&lt;td&gt;247&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;12&lt;/td&gt;&lt;td&gt;556&lt;/td&gt;&lt;td&gt;1.1&lt;/td&gt;&lt;td&gt;0.12&lt;/td&gt;&lt;td&gt;0.17&lt;/td&gt;&lt;td&gt;2.1&lt;/td&gt;&lt;td&gt;1.4&lt;/td&gt;&lt;td&gt;269&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;13&lt;/td&gt;&lt;td&gt;556&lt;/td&gt;&lt;td&gt;&amp;#8722;1.63&lt;/td&gt;&lt;td&gt;0.12&lt;/td&gt;&lt;td&gt;0.18&lt;/td&gt;&lt;td&gt;2.5&lt;/td&gt;&lt;td&gt;1.6&lt;/td&gt;&lt;td&gt;221&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;14&lt;/td&gt;&lt;td&gt;556&lt;/td&gt;&lt;td&gt;0.16&lt;/td&gt;&lt;td&gt;0.10&lt;/td&gt;&lt;td&gt;0.19&lt;/td&gt;&lt;td&gt;3.6&lt;/td&gt;&lt;td&gt;1.9&lt;/td&gt;&lt;td&gt;154&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;15&lt;/td&gt;&lt;td&gt;556&lt;/td&gt;&lt;td&gt;&amp;#8722;2.36&lt;/td&gt;&lt;td&gt;0.14&lt;/td&gt;&lt;td&gt;0.21&lt;/td&gt;&lt;td&gt;2.3&lt;/td&gt;&lt;td&gt;1.5&lt;/td&gt;&lt;td&gt;244&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt; </ephtml> </p> <p>The Rasch calibrations for the 15 field-test items are presented in Table 8. 
The standard errors in column 4 are similar to those that Winsteps would produce.[<reflink idref="bib4" id="ref19">4</reflink>] However, these standard errors do not reflect the design effect in the calibration sample caused by the intraclass correlation between observations within clusters. The design-consistent standard errors appear in column 5. The average design effect, , across the field-test items in Table 8 was equal to 3.3. This means that the design-consistent standard errors of item calibration were, on average, 1.82 times as large as those that Winsteps would report. This represents an unrecognized source of error in the testing program, which, in this example, will substantially increase the Type I error rate of all significance tests involving the item calibrations. For example, the researcher will over-identify items that do not fit the model, over-identify items exhibiting differential item functioning (DIF), and underestimate the standard error of equating.</p> <hd id="AN0100241096-9">Standard Error of Equating</hd> <p>The standard error of equating is an index of random equating error defined as "the standard deviation of equated scores over hypothetical replications of an equating procedure in samples from a population of examinees" (Kolen &amp; Brennan, [<reflink idref="bib12" id="ref20">12</reflink>], p. 232). The standard error of equating has received considerable attention in the psychometric literature. Much of the research is summarized in Kolen and Brennan ([<reflink idref="bib12" id="ref21">12</reflink>]), who provide several procedures for estimating the standard error of equating, including the bootstrap method and the delta method. They also provide analytic formulas for common designs such as single group designs, randomly equivalent group designs, and common-item nonequivalent group designs. 
However, the procedures and formulas provided by Kolen and Brennan assume simple random sampling, so they do not capture the additional error in the standard error of equating caused by design effects in large-scale assessment sampling. One example of research that explicitly incorporates the design effect in the standard error of equating was reported by Cohen et al. ([<reflink idref="bib7" id="ref22">7</reflink>]).</p> <p>The design effect in the error variance of item calibration propagates to the error variance in equating. In other words, the design of the sample on which the calibrations are based can dramatically affect the standard error of equating. In this article mean/mean equating was used with the two sets of <emph>p</emph> = 35 common item difficulty estimates on three forms, which will be labeled <emph>Y, X<subs>1</subs></emph>, and <emph>X<subs>2</subs></emph>. The goal is to equate Forms <emph>X<subs>1</subs></emph> and <emph>X<subs>2</subs></emph> to Form <emph>Y</emph>. Equating Form <emph>X<subs>1</subs></emph> to Form <emph>Y</emph> will be used to illustrate the process.</p> <p>Let the two sets of Rasch difficulty parameter estimates be symbolized by , and . For a partial credit item (Masters, [<reflink idref="bib17" id="ref23">17</reflink>]), the above notation represents the difficulty of the partial credit item, which is the average of the step values of the item. 
The Rasch equating constant that re-expresses <emph>X<subs>1</subs></emph> on the <emph>Y</emph> scale is</p> <p>(<reflink idref="bib2" id="ref24">2</reflink>)</p> <p>Graph</p> <p>The difficulty estimates for <emph>X<subs>1</subs></emph> are equated to <emph>Y</emph> through</p> <p>(<reflink idref="bib3" id="ref25">3</reflink>)</p> <p>Graph</p> <p>The standard error of equating[<reflink idref="bib5" id="ref26">5</reflink>] is</p> <p>(<reflink idref="bib4" id="ref27">4</reflink>)</p> <p>Graph</p> <p>In the above equation, <emph>m</emph> and <emph>n</emph> index the rows and columns of the variance–covariance matrix. The notation indicates that both <emph>m</emph> and <emph>n</emph> start from 1, but we keep .</p> <p>In Equation 4, both the error variance of the item difficulty estimates and the covariance of the item difficulty estimates influence the standard error of equating. Both the error variance and the covariance are affected by design effects in a complex sample. When commercially available software such as Winsteps is used to calibrate the items, the covariances are assumed to be equal to zero, and the design effects are assumed to be equal to one.</p> <p>Table 9 presents the standard errors of equating. In this example <emph>X<subs>1</subs></emph> and <emph>X<subs>2</subs></emph> were equated to <emph>Y</emph> using mean/mean equating. The Rasch equating constants were calculated via Equation 2, and the standard errors of equating were calculated using Equation 4. The dramatic increase in the standard error of equating caused by the sample design effects can be illustrated by reviewing the results for equating <emph>X<subs>1</subs></emph> to <emph>Y</emph>. If output similar to that of Winsteps were used, the district testing staff would believe that the standard error of equating was equal to .022. 
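</p> <p>The no-covariance figure can be illustrated with a minimal sketch. It assumes that the error variance of a mean/mean equating constant is the sum of all elements of the two error variance–covariance matrices divided by the squared number of common items, with the two calibrations treated as independent; whether this agrees with Equation 4 term by term cannot be confirmed here. The function names and the 0.09 and 0.001 inputs are hypothetical, chosen only to resemble the SRS standard errors in Table 8.</p>

```python
import math


def mean_mean_equating_se(cov_y, cov_x):
    """SE of the constant mean(b_Y) - mean(b_X) over p common items.

    cov_y and cov_x are p x p error variance-covariance matrices
    (lists of lists) of the common-item difficulty estimates.
    Assumption of this sketch: the two calibrations are independent.
    """
    p = len(cov_y)
    total = sum(sum(row) for row in cov_y) + sum(sum(row) for row in cov_x)
    return math.sqrt(total) / p


def build_matrix(variance, covariance, p):
    """Hypothetical helper: constant variance on the diagonal,
    constant covariance everywhere else."""
    return [[variance if i == j else covariance for j in range(p)]
            for i in range(p)]


P_COMMON = 35          # number of common items, as in the article
ILLUSTRATIVE_SE = 0.09  # hypothetical per-item SRS standard error

# Covariances set to zero, as calibration software typically assumes.
diag_only = build_matrix(ILLUSTRATIVE_SE ** 2, 0.0, P_COMMON)
se_no_cov = mean_mean_equating_se(diag_only, diag_only)

# Same variances with a small hypothetical positive covariance.
with_cov = build_matrix(ILLUSTRATIVE_SE ** 2, 0.001, P_COMMON)
se_with_cov = mean_mean_equating_se(with_cov, with_cov)

print(f"no covariance:   {se_no_cov:.3f}")
print(f"with covariance: {se_with_cov:.3f}")
```

<p>For these assumed inputs the diagonal-only value rounds to .022, and any positive covariance between item estimates raises it.</p> <p>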
However, Winsteps omits two important sources of error variance: the error covariance between item parameter estimates and the design effects of the calibration sample. When these components of error are factored in, the standard error of equating is .191, which is 868% of the Winsteps-based value. Not accounting for these additional components of error, and therefore underestimating the standard error of equating, is an unrecognized source of Type I errors in large-scale testing programs. In this example, the testing director would conduct significance tests using the Winsteps-based standard error of equating, not realizing that the actual standard error of equating was almost nine times as large.</p> <p>TABLE 9 Standard Error of Equating Based on Simple Random Sampling Versus Complex Sampling</p> <p> <ephtml> &lt;table&gt;&lt;thead valign="bottom"&gt;&lt;tr&gt;&lt;td /&gt;&lt;td colspan="2"&gt;Equating X&lt;sub&gt;1&lt;/sub&gt; to Y&lt;/td&gt;&lt;td colspan="2"&gt;Equating X&lt;sub&gt;2&lt;/sub&gt; to Y&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td /&gt;&lt;td&gt;Simple Random Sample&lt;/td&gt;&lt;td&gt;Complex Sample&lt;/td&gt;&lt;td&gt;Simple Random Sample&lt;/td&gt;&lt;td&gt;Complex Sample&lt;/td&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Rasch equating constants&lt;/td&gt;&lt;td&gt;0.040&lt;/td&gt;&lt;td&gt;0.040&lt;/td&gt;&lt;td&gt;0.038&lt;/td&gt;&lt;td&gt;0.038&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Standard error of equating (no item covariance)&lt;/td&gt;&lt;td&gt;0.022&lt;/td&gt;&lt;td&gt;0.042&lt;/td&gt;&lt;td&gt;0.024&lt;/td&gt;&lt;td&gt;0.044&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Standard error of equating (with item 
covariance)&lt;/td&gt;&lt;td&gt;0.052&lt;/td&gt;&lt;td&gt;0.191&lt;/td&gt;&lt;td&gt;0.061&lt;/td&gt;&lt;td&gt;0.180&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt; </ephtml> </p> <p>To illustrate the way underestimating the standard error of equating will lead to unrecognized Type I errors, Table 10 presents the situation in which the district in our study used Form <emph>Y</emph> of the test in 2011 and Form <emph>X<subs>1</subs></emph> in 2012. The district performance mean was 0.00 in 2011 and 0.11 in 2012 (expressed in the Rasch logit metric).</p> <p>TABLE 10 Illustration of Impact of the Standard Error of Equating on Test of Significance for Mean Difference</p> <p> <ephtml> &lt;table&gt;&lt;thead valign="bottom"&gt;&lt;tr&gt;&lt;td /&gt;&lt;td&gt;District Mean (Logits)&lt;/td&gt;&lt;td&gt;District Standard Deviation (Logits)&lt;/td&gt;&lt;td&gt;Standard Error of Mean (Without Including Standard Error of Equating)&lt;/td&gt;&lt;td&gt;Winsteps Standard Error of Equating (Simple Random Sampling)&lt;/td&gt;&lt;td&gt;Design Consistent Standard Error of Equating (Complex Sampling)&lt;/td&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;2011&lt;/td&gt;&lt;td&gt;0.00&lt;/td&gt;&lt;td&gt;0.95&lt;/td&gt;&lt;td&gt;0.01&lt;/td&gt;&lt;td /&gt;&lt;td /&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;2012&lt;/td&gt;&lt;td&gt;0.11&lt;/td&gt;&lt;td&gt;0.97&lt;/td&gt;&lt;td&gt;0.01&lt;/td&gt;&lt;td&gt;0.022&lt;/td&gt;&lt;td&gt;0.191&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt; </ephtml> </p> <p>If there were no standard error of equating, the significance test for the mean difference between 2011 and 2012 would be . Since |z| &gt; 1.96, the district would conclude that there was a significant increase in student achievement from 2011 to 2012. In this case, the district testing director believes that the margin of error is . 
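</p> <p>The significance tests built on Table 10 can be sketched as follows. The sketch assumes that the standard error of the mean difference is the square root of the sum of the two squared standard errors of the mean plus the squared standard error of equating; that combination rule and the function name are assumptions of this sketch rather than the article's stated formulas.</p>

```python
import math


def z_test_mean_change(mean_1, mean_2, se_1, se_2, se_equating=0.0):
    """Return (z, se_diff) for the change from mean_1 to mean_2.

    Assumption of this sketch: the error components are independent,
    so their variances add.
    """
    se_diff = math.sqrt(se_1 ** 2 + se_2 ** 2 + se_equating ** 2)
    return (mean_2 - mean_1) / se_diff, se_diff


# Table 10 values, in Rasch logits.
MEAN_2011, MEAN_2012 = 0.00, 0.11
SE_2011, SE_2012 = 0.01, 0.01

# No equating error, SRS-based equating error, design-consistent equating error.
z_none, se_none = z_test_mean_change(MEAN_2011, MEAN_2012, SE_2011, SE_2012)
z_srs, se_srs = z_test_mean_change(MEAN_2011, MEAN_2012, SE_2011, SE_2012, 0.022)
z_cs, se_cs = z_test_mean_change(MEAN_2011, MEAN_2012, SE_2011, SE_2012, 0.191)

for label, z in [("no equating error", z_none),
                 ("SRS equating error", z_srs),
                 ("design-consistent equating error", z_cs)]:
    verdict = "significant" if abs(z) > 1.96 else "not significant"
    print(f"{label}: z = {z:.2f} ({verdict} at the .05 level)")
```

<p>Under this assumption only the design-consistent standard error of equating brings |z| below 1.96, and the standard error of the difference is then about 0.192 logits.</p> <p>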
However, Form <emph>X<subs>1</subs></emph> was equated to Form <emph>Y</emph>, so the significance test needs to include the standard error of equating. If the results from Winsteps were used, the revised significance test would be , which is still statistically significant. Now the margin of error is .</p> <p>Unfortunately, the standard error of equating of .022 is an underestimate because it does not include all the components of error that affect our estimate of the linking constant. Even though the analysts may not be aware of these components of error, they are still present in our sample estimate of the equating constant. When iAM is used to calculate the standard error of equating, which includes the components of error due to the design effect and the error covariance between item parameter estimates, the significance test would be , which is not significant. Concluding that there was a significant annual increase in student achievement would turn out to be a Type I error. In fact, once all the components of error are estimated and accounted for, the district increase of 0.11 Rasch logits is less than the margin of error, which is . Any increases or decreases in the annual district mean within –.192 and +.192 are really just random mean changes bouncing around within the margin of error. These unrecognized (and therefore unanticipated) random fluctuations in annual results of large-scale assessments have been referred to as "score drift" (Phillips, [<reflink idref="bib18" id="ref28">18</reflink>]; Phillips, Doorey, Forgione, &amp; Monfils, [<reflink idref="bib19" id="ref29">19</reflink>]).</p> <p>As the preceding material shows, sample design effects can cause substantial inflation in the standard error of equating. However, this inflation could have been much worse if a less efficient field-test design had been used. The field-test administration design in this article randomly assigned field-test forms to students within schools. 
This is referred to as <emph>spiraling by student</emph> because testing vendors package the field-test booklets for each school and alternate or spiral the field-test forms within each package of booklets. What would the design effect be if we had randomly assigned forms to schools instead of to students within schools? This is referred to as <emph>spiraling by school</emph>, in which all students in a given school receive the same form. Districts and states often spiral forms by school as a way to reduce costs (it costs less to package and distribute one form to a school than to do so with multiple forms) and reduce the burden on the schools (each school only has to deal with one form, instead of coping with the logistics of multiple forms). Unfortunately, spiraling by schools would triple the cluster size and greatly increase the design effect in our estimates of item calibrations, which would in turn increase the standard error of equating. Table 11 presents a comparison between student-spiraling and school-spiraling.</p> <p>TABLE 11 Two Field-Test Designs for a Given Sample Design</p> <p> <ephtml> &lt;table&gt;&lt;thead valign="bottom"&gt;&lt;tr&gt;&lt;td /&gt;&lt;td&gt;Student Spiraling&lt;/td&gt;&lt;td&gt;School Spiraling&lt;/td&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Number of forms per school&lt;/td&gt;&lt;td&gt;3&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Population schools (&lt;italic&gt;M&lt;/italic&gt;)&lt;/td&gt;&lt;td&gt;21&lt;/td&gt;&lt;td&gt;21&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;School sample size (&lt;italic&gt;m&lt;/italic&gt;)&lt;/td&gt;&lt;td&gt;12&lt;/td&gt;&lt;td&gt;12&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Population students (&lt;italic&gt;N&lt;/italic&gt;)&lt;/td&gt;&lt;td&gt;6,771&lt;/td&gt;&lt;td&gt;6,771&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Student sample size 
(&lt;italic&gt;n&lt;/italic&gt;)&lt;/td&gt;&lt;td&gt;2,160&lt;/td&gt;&lt;td&gt;2,160&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Average intraclass correlation &lt;p id="ILM0040"&gt;&lt;inline-graphic href="hame&amp;#95;a&amp;#95;973561&amp;#95;ilm0040.gif" /&gt;&lt;/p&gt; for Rasch difficulties&lt;/td&gt;&lt;td&gt;.038&lt;/td&gt;&lt;td&gt;.038&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Average cluster size per form (&lt;p id="ILM0041"&gt;&lt;inline-graphic href="hame&amp;#95;a&amp;#95;973561&amp;#95;ilm0041.gif" /&gt;&lt;/p&gt;)&lt;/td&gt;&lt;td&gt;60&lt;/td&gt;&lt;td&gt;180&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Average design effect for Rasch difficulty&lt;/td&gt;&lt;td&gt;3.3&lt;/td&gt;&lt;td&gt;7.8&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Average root design effect for Rasch difficulties&lt;/td&gt;&lt;td&gt;1.82&lt;/td&gt;&lt;td&gt;2.79&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Average sample size per item&lt;/td&gt;&lt;td&gt;720&lt;/td&gt;&lt;td&gt;720&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Average effective sample size per item&lt;/td&gt;&lt;td&gt;222&lt;/td&gt;&lt;td&gt;92&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Average minimum detectable effect size&lt;/td&gt;&lt;td&gt;.27&lt;/td&gt;&lt;td&gt;.41&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt; </ephtml> </p> <p>At the time the sampling was planned, it was assumed on the basis of past experience that the average intraclass correlation for the Rasch difficulty estimates for field-test items would equal about .06. After the data were collected, it turned out that the average intraclass correlation for Rasch item difficulties was equal to .04. The average intraclass correlations can be inferred from the average design effects provided by iAM by . Therefore, the effective sample size turned out to be larger than anticipated (effective sample size was 222, instead of 175) and the minimum detectable effect size (MDE) was smaller than anticipated (the MDE was equal to 0.27 instead of 0.30). 
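</p> <p>The link between cluster size, intraclass correlation, design effect, and effective sample size in Table 11 can be sketched with the standard approximation in which the design effect equals 1 + (average cluster size − 1) × intraclass correlation (Kish). The function names are hypothetical, and the rounded Table 11 inputs are used, so the results agree with the table only to rounding.</p>

```python
def design_effect(cluster_size, icc):
    """Approximate design effect for equal-sized clusters."""
    return 1.0 + (cluster_size - 1.0) * icc


def effective_sample_size(n, deff):
    """Actual sample size divided by the design effect."""
    return n / deff


def icc_from_design_effect(deff, cluster_size):
    """Invert the approximation to recover the intraclass correlation."""
    return (deff - 1.0) / (cluster_size - 1.0)


ICC = 0.038        # average intraclass correlation (Table 11)
N_PER_ITEM = 720   # average sample size per item (Table 11)

# Spiraling by student: about 60 students per school take each form.
deff_student = design_effect(60, ICC)
# Spiraling by school: all 180 sampled students in a school take one form.
deff_school = design_effect(180, ICC)

n_eff_student = effective_sample_size(N_PER_ITEM, deff_student)
n_eff_school = effective_sample_size(N_PER_ITEM, deff_school)

print(f"student spiraling: deff = {deff_student:.2f}, n_eff = {n_eff_student:.0f}")
print(f"school spiraling:  deff = {deff_school:.2f}, n_eff = {n_eff_school:.0f}")
print(f"ICC implied by deff = 3.25 at cluster size 60: "
      f"{icc_from_design_effect(3.25, 60):.3f}")
```

<p>With these inputs the sketch gives effective sample sizes of about 222 and 92, matching Table 11; the student-spiraling design effect comes out near 3.24 rather than 3.25 because the intraclass correlation is rounded to three decimals.</p> <p>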
However, what would be the result if the field-test forms had been spiraled by school instead of by student? This would have increased the cluster size from 60 to 180 and the average design effect from 3.25 to 7.8. Spiraling by schools would more than double the design effect of spiraling by students. Even though the average sample size per form would be about 720 students, the average effective sample size associated with spiraling by schools is 92. Most districts and states would consider such a small effective sample size inadequate for calibrating and equating items. Furthermore, the sample is not large enough to have the power to detect the MDE of 0.30. Instead of being able to detect the minimum effect of 0.30 (which was one of the goals of the complex sample), it has the power to detect only a minimum effect size of 0.41.</p> <hd id="AN0100241096-10">DISCUSSION</hd> <p>Since the passage of the No Child Left Behind Act of 2001, there has been a sense of urgency among the states and districts to show progress in kindergarten through 12th-grade education. The No Child Left Behind Act shined a bright light on the inner workings of district and state large-scale assessments. Federal peer review required large-scale assessments to meet a high standard of technical adequacy and transparency. If things go as planned, beginning in the 2014–2015 school year, two state consortia with federal funding from the Race to the Top grants will begin implementing even larger scale assessments. These are Partnership for Assessment of Readiness for College and Careers (PARCC), at <ulink href="http://www.fldoe.org/parcc/">http://www.fldoe.org/parcc/</ulink>, and SMARTER Balanced Assessment Consortium (SBAC), at <ulink href="http://www.k12.wa.us/smarter/">http://www.k12.wa.us/smarter/</ulink>. This will raise expectations for improvement once again, and make large-scale assessments subject to greater external monitoring and evaluation. 
As this article has demonstrated, sampling error can be far more complicated (and often substantially larger) than large-scale assessment programs report. Underreporting the margin of error leads to increased Type I errors, unexpected bouncing around of testing results, and anxious post hoc spin control by the district superintendent or chief state school officer. The solution to this problem is for large-scale assessment programs to consistently use proper probability sampling techniques, and to manage and minimize the design effects in their samples.</p> <p>The above example of spiraling by student versus spiraling by school illustrates the way the item field-test design interacts with the student sample design. An inefficient item field-test design can spoil an efficient student sample design. The best strategy for field-testing with clustered samples is to spread the sample out over as many clusters (or schools) as possible. This reduces the cluster size and reduces the design effect, which reduces Type I errors in significance testing.</p> <p>A topic that often comes up in discussion of IRT sampling in large-scale district and state assessments is the concept of invariance of item parameters and person ability in item response theory models. The analysis of design effects reported in this paper grew out of a series of workshops and seminars conducted over a 4-year period (2008–2011) on how state testing results can bounce around from year to year due to underestimating the margin of error in state statistics. During the CCSSO workshops the state testing directors frequently argued that proper probability sampling was not necessary because IRT models are sample invariant. 
However, there are at least two reasons why the IRT invariance assumption does not eliminate the need for sufficiently large and representative probability samples in large-scale assessments.</p> <p>First, as Lord has pointed out, invariance is a property of the model parameters and not the sample estimates of model parameters (Lord, [<reflink idref="bib16" id="ref30">16</reflink>]). There is a difference between a property of a parameter and a property of an estimate of a parameter. IRT sample estimates would only approximate invariance if the model fits the data and the item parameters are accurately estimated. If the item calibrations are poorly estimated (due to large amounts of sampling error), then the invariance property will not hold in the sample.</p> <p>Second, there are many parameters used in a large-scale testing program that do not claim to have the invariance property. These include classical test theory statistics such as <emph>p</emph>-values, item–test correlation coefficients, discrimination indices, classification consistency indices, reliability coefficients, and differential item functioning statistics. Estimates of these parameters require sufficiently large and representative samples of examinees.</p> <ref id="AN0100241096-11"> <title> Footnotes </title> <blist> <bibl id="bib1" idref="ref11" type="bt">1</bibl> <bibtext> A district with 21 schools was selected for illustrative purposes from a larger statewide sample of schools in which the district was a stratum. 
Data from the district were reanalyzed in order to provide a simple illustration of design effects for this article.</bibtext> </blist> <blist> <bibl id="bib2" idref="ref15" type="bt">2</bibl> <bibtext> For example, see page 53, Table 2.4.1, in Cohen ([5]).</bibtext> </blist> <blist> <bibl id="bib3" idref="ref6" type="bt">3</bibl> <bibtext> iAM is free psychometric software available from the American Institutes for Research (AIR) and Jon Cohen, and can be downloaded at <ulink href="http://am.air.org">http://am.air.org</ulink>. The software uses an item response theory module (iAM) for parameter estimation of all major item response theory models, along with design consistent estimates of standard errors (Cohen, Chan, Jiang, &amp; Seburn, [7]). The standard errors are estimated using Taylor series approximations, which take into account student weights, stratification information, and clustering (Binder, [2]; Woodruff, [21]). The iAM module was used for all psychometric analyses in this article. When SRS is assumed, and it is also assumed that there is no covariance between item parameter estimates, then iAM provides essentially the same item parameter estimates and standard errors as Winsteps. The minor differences are related to the fact that Winsteps uses joint maximum likelihood estimation (JMLE), whereas iAM uses marginal maximum likelihood estimation (MMLE).</bibtext> </blist> <blist> <bibl id="bib4" idref="ref3" type="bt">4</bibl> <bibtext> When simple random sampling is assumed (no stratification and no clustering) the iAM software estimates of the standard errors of item calibrations are comparable to those in WINSTEPS (within 2 to 3 decimal places).</bibtext> </blist> <blist> <bibl id="bib5" idref="ref13" type="bt">5</bibl> <bibtext> I thank Dr. 
Tao Jiang, at the American Institutes for Research (AIR), for the derivation of this equation.</bibtext> </blist> </ref> <ref id="AN0100241096-12"> <title> REFERENCES </title> <blist> <bibtext> American Educational Research Association, American Psychological Association, &amp; National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Psychological Association.</bibtext> </blist> <blist> <bibtext> Binder, D. A. (1983). On the variances of asymptotically normal estimators from complex surveys. International Statistical Review, 51, 279–292.</bibtext> </blist> <blist> <bibtext> Bloom, H., Bos, J., &amp; Lee, S. (1999). Using cluster random assignment to measure program impact: Statistical implications for evaluation of education programs. Evaluation Review, 23(4), 445–469.</bibtext> </blist> <blist> <bibtext> Cochran, W. G. (1977). Sampling techniques (3rd ed.). New York, NY: John Wiley and Sons.</bibtext> </blist> <blist> <bibtext> Cohen, J. (1969). Statistical power analysis for the behavioral sciences. New York, NY: Academic Press.</bibtext> </blist> <blist> <bibl id="bib6" idref="ref14" type="bt">6</bibl> <bibtext> Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.</bibtext> </blist> <blist> <bibl id="bib7" idref="ref22" type="bt">7</bibl> <bibtext> Cohen, J., Chan, T., Jiang, T., &amp; Seburn, M. (2008). Consistent estimation of Rasch item parameters and their standard errors under complex sample designs. Applied Psychological Measurement, 32(4), 289–310.</bibtext> </blist> <blist> <bibl id="bib8" idref="ref1" type="bt">8</bibl> <bibtext> Cornfield, J. (1951). Modern methods in the sampling of human populations. American Journal of Public Health, 41, 654–661.</bibtext> </blist> <blist> <bibl id="bib9" idref="ref7" type="bt">9</bibl> <bibtext> Hedges, L., &amp; Hedberg, E. 
(2007). Intraclass correlation values for planning group randomized trials in education. Educational Evaluation and Policy Analysis, 29(1), 60–87.</bibtext> </blist> <blist> <bibtext> Kish, L. (1965). Survey sampling. New York, NY: John Wiley and Sons.</bibtext> </blist> <blist> <bibtext> Kolen, M. J., &amp; Brennan, R. L. (1995). Test equating: Methods and practices. New York, NY: Springer-Verlag.</bibtext> </blist> <blist> <bibtext> Kolen, M. J., &amp; Brennan, R. L. (2004). Test equating, scaling, and linking: Methods and practices (2nd ed.). New York, NY: Springer.</bibtext> </blist> <blist> <bibtext> Levy, P. S. (1999). Sampling populations: Methods and applications (3rd ed.). New York, NY: John Wiley and Sons.</bibtext> </blist> <blist> <bibtext> Linacre, J. M. (2011). Winsteps® (Version 3.72.0) [Computer software]. Beaverton, OR: Winsteps.com.</bibtext> </blist> <blist> <bibtext> Lohr, S. L. (1999). Sampling: Design and analysis. Pacific Grove, CA: Duxbury Press, Brooks/Cole Publishing.</bibtext> </blist> <blist> <bibtext> Lord, F. M. (1980). Applications of item response theory to practical testing problems. Mahwah, NJ: Lawrence Erlbaum Associates.</bibtext> </blist> <blist> <bibtext> Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149–174.</bibtext> </blist> <blist> <bibtext> Phillips, G. W. (2010). Score drift: Why district and state achievement results unexpectedly bounce up and down from year to year. Training session at the National Council on Measurement in Education, Denver, CO.</bibtext> </blist> <blist> <bibtext> Phillips, G. W., Doorey, N. A., Forgione, P. D., &amp; Monfils, L. (2011). Addressing two commonly unrecognized sources of score instability in annual state assessments. Washington, DC: Council of Chief State School Officers.</bibtext> </blist> <blist> <bibtext> Raudenbush, S. (1997). Statistical analysis and optimal design for cluster randomized trials. 
Psychological Methods, 2(2), 173–185.</bibtext> </blist> <blist> <bibtext> Woodruff, R. S. (1971). A simple method for approximating the variance of a complicated estimate. Journal of the American Statistical Association, 66, 411–414.</bibtext> </blist> </ref> <aug> <p>By Gary W. Phillips</p> <p>Reported by Author</p> </aug> <nolink nlid="nl1" bibid="bib10" firstref="ref2"></nolink> <nolink nlid="nl2" bibid="bib13" firstref="ref4"></nolink> <nolink nlid="nl3" bibid="bib15" firstref="ref5"></nolink> <nolink nlid="nl4" bibid="bib20" firstref="ref8"></nolink> <nolink nlid="nl5" bibid="bib11" firstref="ref9"></nolink> <nolink nlid="nl6" bibid="bib12" firstref="ref10"></nolink> <nolink nlid="nl7" bibid="bib14" firstref="ref12"></nolink> <nolink nlid="nl8" bibid="bib17" firstref="ref23"></nolink> <nolink nlid="nl9" bibid="bib18" firstref="ref28"></nolink> <nolink nlid="nl10" bibid="bib19" firstref="ref29"></nolink> <nolink nlid="nl11" bibid="bib16" firstref="ref30"></nolink> |
|---|---|
| PLink | https://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=eric&AN=EJ1048933 |