The Effect of Small Calibration Sample Sizes on TOEFL IRT-Based Equating.

Bibliographic Details
Title: The Effect of Small Calibration Sample Sizes on TOEFL IRT-Based Equating.
Language: English
Authors: Tang, K. Linda, Educational Testing Service, Princeton, NJ.
Peer Reviewed: N
Page Count: 52
Publication Date: 1993
Document Type: Reports - Evaluative
Descriptors: Comparative Analysis, Computer Simulation, Equated Scores, Estimation (Mathematics), Item Response Theory, Pretests Posttests, Sample Size, Scaling, Simulation, Test Construction
Assessment and Survey Identifiers: Test of English as a Foreign Language
Abstract: This study compared the performance of the LOGIST and BILOG computer programs on item response theory (IRT)-based scaling and equating for the Test of English as a Foreign Language (TOEFL), using real and simulated data and two calibration structures. Applications of IRT in the TOEFL program are based on the three-parameter logistic (3PL) model. The results show that item parameter estimates obtained from the smaller real-data samples were more consistent with the larger-sample estimates when based on BILOG than when based on LOGIST. In addition, the root mean squared error statistics suggest that the BILOG estimates of the item parameters and item characteristic curves were closer in magnitude to the "true" parameter values than were the LOGIST estimates. The equating results based on the parameter estimates suggest that the rule-of-thumb recommendation that pretest sample sizes be at least 1,000 for LOGIST should be retained if at all possible. Eight tables and 13 figures present results of the analyses. Two appendixes contain specifications and summary statistics. (Contains 15 references.) (Author/SLD)
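The abstract refers to the 3PL model and to root mean squared error (RMSE) comparisons of item parameters and item characteristic curves (ICCs) against "true" values. The record does not give the study's exact definitions, so the following is only a minimal sketch under stated assumptions: it uses the standard 3PL form with the conventional scaling constant D = 1.7, computes RMSE over paired values, and compares two ICCs on an unweighted grid of ability (theta) values. The function names and all item parameter values are hypothetical illustrations, not taken from the study.

```python
import math

# Assumed scaling constant; 1.7 is the conventional choice for the 3PL logistic form.
D = 1.7


def p_3pl(theta, a, b, c, d=D):
    """Probability of a correct response under the 3PL model.

    a: discrimination, b: difficulty, c: lower asymptote (pseudo-guessing).
    """
    return c + (1.0 - c) / (1.0 + math.exp(-d * a * (theta - b)))


def rmse(estimates, truths):
    """Root mean squared error between paired estimated and true values."""
    if len(estimates) != len(truths) or not estimates:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(e - t) ** 2 for e, t in zip(estimates, truths)]
    return math.sqrt(sum(squared) / len(squared))


def icc_rmse(est_item, true_item, thetas):
    """RMSE between an estimated and a true ICC over a theta grid.

    Assumption: every grid point is weighted equally; a study might instead
    weight points by the examinee ability distribution.
    """
    est_curve = [p_3pl(t, *est_item) for t in thetas]
    true_curve = [p_3pl(t, *true_item) for t in thetas]
    return rmse(est_curve, true_curve)


if __name__ == "__main__":
    true_item = (1.0, 0.0, 0.2)   # hypothetical "true" (a, b, c)
    est_item = (1.1, 0.1, 0.18)   # hypothetical estimate of the same item
    grid = [x / 10 for x in range(-30, 31)]  # theta from -3.0 to 3.0

    print(round(p_3pl(0.0, *true_item), 3))  # at theta == b: c + (1 - c) / 2 = 0.6
    print(icc_rmse(est_item, true_item, grid))
    print(rmse([1.1, 0.9], [1.0, 1.0]))  # RMSE of discrimination estimates
```

Comparing whole curves rather than individual parameters matters for the 3PL because different (a, b, c) combinations can yield very similar ICCs over the ability range where most examinees fall.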
Entry Date: 1995
Accession Number: ED382662
Database: ERIC