Measurement Invariance for Multilingual Learners Using Item Response and Response Time in PISA 2018

Bibliographic Details
Title: Measurement Invariance for Multilingual Learners Using Item Response and Response Time in PISA 2018
Language: English
Authors: Jung Yeon Park, Sean Joo (ORCID 0000-0003-4861-4362), Zikun Li (ORCID 0000-0002-3572-707X), Hyejin Yoon
Source: Educational Measurement: Issues and Practice. 2025 44(1):55-65.
Availability: Wiley. Available from: John Wiley & Sons, Inc. 111 River Street, Hoboken, NJ 07030. Tel: 800-835-6770; e-mail: cs-journals@wiley.com; Web site: https://www.wiley.com/en-us
Peer Reviewed: Y
Page Count: 11
Publication Date: 2025
Document Type: Journal Articles, Reports - Research
Education Level: Secondary Education
Descriptors: Achievement Tests, Secondary School Students, International Assessment, Test Bias, Multilingualism, Monolingualism, Reaction Time, Native Language, Research Problems
Geographic Terms: United States
Assessment and Survey Identifiers: Program for International Student Assessment
DOI: 10.1111/emip.12640
ISSN: 0731-1745, 1745-3992
Abstract: This study examines potential assessment bias based on students' primary language status in PISA 2018. Specifically, multilingual (MLs) and nonmultilingual (non-MLs) students in the United States are compared with regard to their response time as well as scored responses across three cognitive domains (reading, mathematics, and science). Differential item functioning (DIF) analysis reveals that 7-14% of items exhibit DIF-related problems in scored responses between the two groups, aligning with PISA technical report results. While MLs generally spend more time on the test than non-MLs across cognitive levels, differential response time (DRT) functioning identifies significant time differences in 7-10% of items for students with similar cognitive levels. It was noticeable that items with DIF and DRT issues show limited overlap, suggesting diverse reasons for student struggles in the assessment. A deeper examination of item characteristics is recommended for test developers and teachers to gain a better understanding of these nuances.
Abstractor: As Provided
Entry Date: 2025
Accession Number: EJ1460468
Database: ERIC
FullText Links:
  – Type: pdflink
    Url: https://content.ebscohost.com/cds/retrieve?content=AQICAHj0k_4E0hTGH8RJwT4gCJyBsGNe_WN95AvKlDbXJGqwxwF99OX6L8vRIhlbg0HRGkd_AAAA4zCB4AYJKoZIhvcNAQcGoIHSMIHPAgEAMIHJBgkqhkiG9w0BBwEwHgYJYIZIAWUDBAEuMBEEDE4J6UAmP1smm7L3TgIBEICBm-mC-hSA4iHYo3rM6jGW2E2zD-IBXtACIkx52PEK5S-5-FwVL96bdJIl5hYs0jbuwcih_nOsGgCWzVSkKp6K3egDAhE5pJNN8jaiUYtj3ZrPm5IrDAzp6ZNqXqp75bpKcLPmHAhi4Dd8qnedAhiIRa1u-co8VMSnEx2KcJGADt-QjTZehCrFBu0VZpi5eyuOXPdw05aQLRZ1owYL
Text:
  Availability: 1
  Value: <anid>AN0183983946;ems01mar.25;2025Mar26.05:51;v2.2.500</anid> <title id="AN0183983946-1">Measurement Invariance for Multilingual Learners Using Item Response and Response Time in PISA 2018 </title> <p>Keywords: DIF; differential response time; multilingual learners; PISA; process data</p> <p>Language proficiency and understanding of the host country's culture not only enable students to maximize their educational opportunities but also foster a sense of belonging and alignment with their school communities (Cummins, [<reflink idref="bib13" id="ref1">13</reflink>]; Geay et al., [<reflink idref="bib20" id="ref2">20</reflink>]; Krashen, [<reflink idref="bib32" id="ref3">32</reflink>]; Phinney et al., [<reflink idref="bib44" id="ref4">44</reflink>]; Sam et al., [<reflink idref="bib46" id="ref5">46</reflink>]; Zhou & Xiong, [<reflink idref="bib59" id="ref6">59</reflink>]). 
However, a contrasting cultural and linguistic environment at home poses challenges. When students are exposed to a language at home that diverges from the instructional language in school, they often face an unseen academic hurdle. This divergence is recognized as a risk factor, amplifying the vulnerability of students who are part of linguistic minorities.</p> <p>A few initiatives have been taken to support cultural and linguistic minority students in the United States. The landscape of education policy, particularly the No Child Left Behind (NCLB) legislation, has been shaped by various efforts to ensure equal educational opportunities and hold schools accountable for the performance of subgroups, including English Learners (ELs) and Multilingual Learners (MLs). These students, who come from non‐English speaking backgrounds, often face unique challenges in the American education system, and the provisions and ramifications of NCLB have had profound implications for their academic journeys (Adger et al., [<reflink idref="bib5" id="ref7">5</reflink>]; Gándara & Hopkins, [<reflink idref="bib17" id="ref8">17</reflink>]; Menken & Kleyn, [<reflink idref="bib38" id="ref9">38</reflink>]). As shown by results from regional and national tests, there is a consistent and significant performance gap between ELs/MLs and their native English‐speaking counterparts (Carnoy & García, [<reflink idref="bib12" id="ref10">12</reflink>]; Maxwell, 2013; Thomas & Collier, 2002; Umansky & Reardon, [<reflink idref="bib52" id="ref11">52</reflink>]). For example, an analysis of the National Assessment of Educational Progress (NAEP) underscored the stagnant performance of non‐English speakers in the past decade (Carnoy & García, [<reflink idref="bib12" id="ref12">12</reflink>]). 
The analysis from the National Center for Education Statistics (NCES) also revealed a potential academic disparity between native English‐speaking students and those who speak other languages at home (García & Weiss, [<reflink idref="bib18" id="ref13">18</reflink>]).</p> <hd id="AN0183983946-2">Validity Concerns in the Assessment of ELs/MLs</hd> <p>The complexities in the interpretation of assessment results in K–12 schools are heightened by the presence of ELs/MLs in the United States. At the core of this concern is the assumption that a student's test score is appropriately representative of one's ability in a particular school subject. However, this might not be true for ELs/MLs. As the National Research Council ([<reflink idref="bib41" id="ref14">41</reflink>]) has pointed out, content‐based assessments might not be valid for students who are unable to speak English fluently, stating that "its results will be invalid if the test‐takers' limited English proficiency prevents them from understanding all of the questions, presenting their answers, or completing the work in the allotted time" (p. 20). The challenge in deciphering ELs/MLs' test results stems from the phenomenon of construct‐irrelevant variance, which occurs when an assessment evaluates aspects other than the intended construct the test is designed to measure (American Educational Research Association et al., [<reflink idref="bib7" id="ref15">7</reflink>]). 
Extensive scholarly research on ELs/MLs has consistently revealed that the level of English language proficiency among these students can significantly impact their performance on assessments (Abedi, [<reflink idref="bib4" id="ref16">4</reflink>]; Kopriva et al., [<reflink idref="bib26" id="ref17">26</reflink>]; Martiniello, [<reflink idref="bib37" id="ref18">37</reflink>]; Noble et al., [<reflink idref="bib51" id="ref19">51</reflink>]; Solano‐Flores & Trumbull, [<reflink idref="bib49" id="ref20">49</reflink>]; Wolf & Leon, [<reflink idref="bib58" id="ref21">58</reflink>]).</p> <p>Abedi ([<reflink idref="bib1" id="ref22">1</reflink>]) critically examined the implications of the NCLB Act for ELs/MLs, emphasizing issues related to assessment and accountability. Specifically, the author pointed out that conventional assessments might introduce construct‐irrelevant variance, especially when these tests contain linguistically complex items unrelated to the subject being measured. This perspective is further echoed by Solano‐Flores ([<reflink idref="bib48" id="ref23">48</reflink>]). The study stressed the need to address both linguistic complexity and cultural relevance to ensure that assessments appropriately measure the intended constructs, particularly for linguistically and culturally diverse student populations. Meanwhile, Bailey and Wolf ([<reflink idref="bib10" id="ref24">10</reflink>]) delved into the intersection of language and content in large‐scale assessments. 
Their research underscored the significance of integrating both content and language considerations when designing and interpreting assessments for culturally and linguistically diverse students.</p> <hd id="AN0183983946-3">Measurement Comparability in Large‐Scale Assessments</hd> <p>In large‐scale standardized assessments such as the Programme for International Student Assessment (PISA), achieving "measurement comparability" (Oliveri et al., [<reflink idref="bib43" id="ref25">43</reflink>]) across linguistically and culturally diverse groups can be challenging. The challenge arises from test‐related factors, including the linguistic and cultural context embedded in the items, the test format (e.g., multiple‐choice, open‐ended), and translation for different countries or language groups (Akour et al., 2014; Kankaraš & Moors, [<reflink idref="bib30" id="ref26">30</reflink>]). For example, in standardized assessments, mathematics items that presume advanced literacy skills and implicit cultural context may assess not only mathematical proficiency but also reading comprehension skills. This can lead to construct‐irrelevant variations in performance on math exams (Walker et al., [<reflink idref="bib56" id="ref27">56</reflink>]).</p> <p>Methodologically, measurement comparability within PISA across cultural, linguistic, and national contexts has been investigated using confirmatory factor analysis (CFA) and item response theory (IRT). For example, Asil and Brown (2015) examined test invariance between Austria and other countries for the PISA 2009 literacy test, using multiple group CFA (MG‐CFA), whereas Kankaraš and Moors ([<reflink idref="bib30" id="ref28">30</reflink>]) analyzed differential item functioning (DIF) on mathematics, reading, and science tests for the same assessment across different countries, using the two‐parameter logistic (2PL) IRT model. Huang et al. 
(2014) examined PISA 2006 science assessments, identifying plausible sources of DIF (e.g., language, curriculum, and culture), all of which can influence the assessment outcomes. Le ([<reflink idref="bib33" id="ref29">33</reflink>]) focused on gender‐related DIF across countries and test languages for science tests. The study demonstrated that gender‐related DIF is contingent upon item formats and content domains. Joo et al. ([<reflink idref="bib28" id="ref30">28</reflink>]) and Khorramdel et al. ([<reflink idref="bib31" id="ref31">31</reflink>]) examined the potential impact of DIF using PISA 2018 data, comparing country‐by‐language groups based on mean deviation (MD) and root mean squared deviation (RMSD). Costa and Araújo ([<reflink idref="bib14" id="ref32">14</reflink>]) analyzed the PISA reading test using DIF to compare immigrant students with their nonimmigrant counterparts in European countries. They found that immigrant students excelled at reading items focused on learning, while native students excelled at items related to personal or public situations. Liu and Bradley ([<reflink idref="bib34" id="ref33">34</reflink>]) addressed the achievement gap in mathematics between English language learners (ELLs) and non‐ELLs by examining DIF on PISA 2012. They found that, among the mathematics items showing DIF, more than three times as many were more difficult for ELLs. Similarly, in other large‐scale assessments, Gökçe et al. ([<reflink idref="bib21" id="ref34">21</reflink>]) conducted DIF analysis on TIMSS 2015 assessment items and found that differences between language versions of the test were more influential than differences between countries, likely due to variations in culture and curriculum. 
In contrast, Mahoney ([<reflink idref="bib36" id="ref35">36</reflink>]), based on NAEP 1995 mathematics, found no DIF for second‐language learners compared to native speakers.</p> <hd id="AN0183983946-4">Differential Response Time Using Digitally Based Assessments</hd> <p>With the recent adoption of digitally based assessment (DBA) in large‐scale assessments, there has been a growing need to uncover variations in test‐taking behaviors that may not be reflected in the item scores derived from the final responses submitted by students. Among the data recorded in DBA, response time (RT) has been extensively analyzed as a potential indicator of test‐taking behaviors (Lundgren & Eklöf, [<reflink idref="bib35" id="ref36">35</reflink>]). Shin et al. ([<reflink idref="bib47" id="ref37">47</reflink>]) examined common RT scales across countries by conducting IRT analysis with categorized RT data in PISA 2015. Ercikan et al. ([<reflink idref="bib16" id="ref38">16</reflink>]) and Guo and Ercikan ([<reflink idref="bib23" id="ref39">23</reflink>]) investigated the differences between EL and non‐EL students in a large‐scale mathematics assessment conducted in English. While their studies found no DIF in any of the item responses between the two groups, their analysis of response times within individual items showed differences suggesting that, among students with similar cognitive abilities in the domain, EL students tend to spend more time answering most test questions. Guo and Ercikan ([<reflink idref="bib22" id="ref40">22</reflink>]) examined rapid response rate and performance across different countries and found that rapid responses and their impact on performance varied across countries in PISA 2018. 
These recent articles used a novel approach to evaluate differential response time (DRT) functioning (Guo & Ercikan, [<reflink idref="bib23" id="ref41">23</reflink>]), which extends the DIF framework to response time data. For example, when matched on total response time, EL students tended to spend significantly longer than their peers on items in the first half of the form. The specifics are described in the Methods section of this paper.</p> <hd id="AN0183983946-5">Purpose</hd> <p>This study aims to explore potential assessment bias in relation to students' primary language status in the United States. Using the digitally based PISA 2018, a comparison is made between ELs/MLs (whose primary home language is not English) and their mainstream peers (i.e., non‐ELs/MLs). For convenience, we refer to them as MLs and non‐MLs, respectively, in the remainder of the study. The study aims to contribute to the existing literature in three ways. First, it presents a comprehensive view of test fairness between the two language groups by using most items across the three cognitive domains (reading, mathematics, and science). Second, it fills a gap in measurement comparability studies, which have primarily focused on between‐country (or between‐language) differences. Specifically, it examines situations in which comparability issues related to linguistic and cultural diversity arise within a country when the test is administered in a single language. Third, it addresses the need to identify test‐taking behavior and potential item bias that can be informed by response time data, using recent developments in psychometric methods. Therefore, the study investigates the following research questions:</p> <p></p> <ulist> <item> Do MLs and non‐MLs exhibit a difference in performance, even when their ability levels are the same? 
Does this result vary across different domains and item formats?</item> <p></p> <item> Does the item response time of MLs differ from that of their mainstream peers (non‐MLs) during assessments? Is this difference consistent across varying ability levels?</item> <p></p> <item> Is the overall response time of MLs distinct from that of their mainstream peers (non‐MLs) at the same ability level in assessments? Does this result vary across domains and item formats?</item> </ulist> <p>The rest of the paper is organized as follows. First, we provide an overview of the cognitive domains, samples, and study variables in PISA 2018. Then, we introduce the theoretical framework of DIF and DRT functioning procedures. Next, we present results corresponding to the two methods. Finally, we discuss the implications of the results.</p> <hd id="AN0183983946-6">Methods</hd> <p></p> <hd id="AN0183983946-7">PISA 2018 Cognitive Domains</hd> <p>PISA 2018 was the seventh round of the international assessment since the program was launched in 2000. Every PISA test assesses students' knowledge and skills in reading, mathematics, and science, but each assessment focuses on one of these subjects and provides a summary assessment of the other two. In PISA 2018, reading was the major domain, focusing on assessing students' reading literacy in a digital environment and measuring its trend over the past two decades. In particular, reading literacy is defined as "an individual's capacity to understand, use, evaluate, reflect on and engage with texts in order to achieve one's goals, develop one's knowledge and potential, and participate in society" (OECD, [<reflink idref="bib42" id="ref42">42</reflink>], p. 14). PISA evaluated students' reading abilities by examining their proficiency across various processes, text formats, and situations. In terms of processes, PISA focuses on students' capabilities in locating information, understanding texts, and evaluating and reflecting on them. 
This encompasses tasks such as accessing and retrieving information, comprehending literal and integrated meanings, and assessing text quality. The assessment also incorporates diverse text formats, including single and multiple‐source texts, both static and dynamic, continuous formats like paragraphs, non‐continuous ones like lists or diagrams, and mixed texts. Additionally, the context in which a text is used, termed as situations, plays a pivotal role. Texts can be designed for personal, public, occupational, or educational purposes. Recognizing that students might excel in different reading contexts, PISA ensures a broad spectrum of reading situations in the test (OECD, [<reflink idref="bib42" id="ref43">42</reflink>]).</p> <p>Mathematics and science were the minor domains in PISA 2018, allowing for comparisons of student performance over time. Mathematical literacy is an individual's capacity to "analyze, reason and communicate ideas effectively as they pose, formulate, solve and interpret solutions to mathematical problems in a variety of situations" (OECD, [<reflink idref="bib42" id="ref44">42</reflink>], p. 15). PISA's assessment of students' mathematical skills revolves around specific processes, content, and contexts. The processes encompass three main categories—formulating mathematical situations, employing mathematical concepts and reasoning, and interpreting mathematical outcomes—which are underpinned by seven core capabilities, all rooted in detailed mathematical knowledge. The content focuses on four intertwined ideas related to traditional subjects like algebra and geometry. 
Lastly, the problems are set within four real‐world contexts: personal, educational, societal, and scientific (OECD, [<reflink idref="bib42" id="ref45">42</reflink>]).</p> <p>Lastly, according to PISA's definition, scientific literacy is the ability to engage with science‐related issues and with the ideas of science as a reflective citizen, and PISA evaluates students' scientific proficiency based on contexts, knowledge, and competencies. The contexts encompass personal, local/national, and global issues, both contemporary and historical, requiring an understanding of science and technology. Knowledge entails grasping essential facts, concepts, and theories foundational to scientific understanding, which covers content knowledge of the natural world and technological artifacts, procedural knowledge of how these ideas emerge, and epistemic knowledge explaining the rationale behind these procedures. Competencies focus on students' capabilities to scientifically explain phenomena, assess and formulate scientific inquiries, and interpret data and evidence (OECD, [<reflink idref="bib42" id="ref46">42</reflink>]).</p> <hd id="AN0183983946-8">Sample</hd> <p>In this study, we utilized a sample of <emph>N</emph> = 4,838 students in the United States, comprising <emph>n</emph> = 2,462 boys and <emph>n</emph> = 2,376 girls. Notably, a significant majority of these 15‐year‐old participants, approximately 74.37%, were enrolled in Grade 10. To ascertain students' primary language at home, the response to the questionnaire item "What language do you speak at home most of the time" served as the indicator. Students who reported "language of the test" as the predominant home language were classified as non‐MLs in this study, whereas those who reported "Other languages" were categorized as MLs. See Table 1 for details. 
Parental background also differed noticeably between the groups: for ML students, only 12% of mothers and 11% of fathers were born in the United States, whereas for non‐ML students, 85% of mothers and 83% of fathers were born in the United States.</p> <p>Table 1. Characteristics of Students in the United States</p> <p> <ephtml> <table><thead><tr><th align="left">Characteristics</th><th align="center"><italic>n</italic></th><th align="center"><italic>%</italic></th></tr></thead><tbody><tr><td>Gender</td><td align="center" /><td align="center" /></tr><tr><td>Boy</td><td>2,462</td><td>50.9%</td></tr><tr><td>Girl</td><td>2,376</td><td>49.1%</td></tr><tr><td>Grade</td><td align="center" /><td align="center" /></tr><tr><td>8th grade</td><td>8</td><td>0.2%</td></tr><tr><td>9th grade</td><td>401</td><td>8.3%</td></tr><tr><td>10th grade</td><td>3,598</td><td>74.4%</td></tr><tr><td>11th grade</td><td>826</td><td>17.1%</td></tr><tr><td>12th grade</td><td>5</td><td>0.1%</td></tr><tr><td>Home language</td><td align="center" /><td align="center" /></tr><tr><td>English</td><td>4,054</td><td>83.8%</td></tr><tr><td>Other language</td><td>736</td><td>15.2%</td></tr></tbody></table> </ephtml> </p> <hd id="AN0183983946-9">Item Responses and Response Times</hd> <p>We used student responses to a total of 511 items across three cognitive domains in the PISA database. Of these 511 items, 366 were designed to assess reading literacy, 70 aimed at evaluating mathematical literacy, and the remaining 115 focused on scientific literacy. There are two types of item formats (see Figure 1): (a) computer‐scored items, which comprise simple (single response selection) and complex (multiple response selections) multiple‐choice items and constructed responses (questions starting with "C"), and (b) human‐coded items, which are constructed response items graded by human raters (questions starting with "D"). 
Two primary coding approaches were used in the database: (a) a binary scoring system, in which a correct answer received full credit (1) and an incorrect answer received no credit (0), and (b) a ternary scoring system, in which a completely correct answer received full credit (2), a partially correct answer received partial credit (1), and an incorrect answer received no credit (0). Furthermore, several coded response questions that had multiple distinct values corresponding to partial credit were recoded. In such instances, all partial credit values were merged into a single value of 1.</p> <p> <img src="https://imageserver.ebscohost.com/img/embimages/rdk/EMS/01mar25/emip12640-fig-0001.jpg?ephost1=dGJyMMvl7ESepq84yOvsOLCmsE6epq5Srqa4SK6WxWXS" alt="emip12640-fig-0001.jpg" title="1 PISA 2018 Reading Sample QuestionsNote. Retrieved from https://www.oecd.org/pisa/test/." /> </p> <p></p> <p>Regarding response time data, we analyzed all available item‐level response time data for a total of 423 items within the three cognitive domains in the PISA database. Among these, 238 items were designed to assess reading literacy, 70 were aimed at evaluating mathematical literacy, and the remaining 115 focused on scientific literacy. These variables can be identified by their names, which all end with "TT" in the dataset.</p> <hd id="AN0183983946-11">Measurement Invariance for Item Responses</hd> <p>To evaluate the measurement invariance of MLs and non‐MLs with regard to scored responses (0/1 or 0/1/2) in each test, we employed the conceptual framework of multiple‐group IRT (MG‐IRT; Reckase, [<reflink idref="bib45" id="ref47">45</reflink>]). This framework is rooted in the basic concept of IRT, enabling the comparison of how each item functions across different groups. Specifically, it constrains some item parameters to be equal across the groups, while permitting the parameters of the remaining items to differ across the groups. 
This property allows for the examination of measurement invariance across the groups by assessing model‐data misfit resulting from assuming homogeneity of item parameters between the groups in PISA (e.g., von Davier et al., [<reflink idref="bib55" id="ref48">55</reflink>]).</p> <p>In our study, a two‐parameter logistic model (2PL; Birnbaum, [<reflink idref="bib11" id="ref49">11</reflink>]) was fitted to the data with dichotomous responses and a generalized partial credit model (GPCM; Muraki, [<reflink idref="bib39" id="ref50">39</reflink>]) for polytomous responses: 1 <ephtml> <math display="block" altimg="urn:x-wiley:07311745:media:emip12640:emip12640-math-0001" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>P</mi><mspace width="0.33em" /><mfenced separators="" open="(" close=")"><mrow><mspace width="0.33em" /><msub><mi>X</mi><mrow><mi>i</mi><mi>j</mi><mi>g</mi></mrow></msub><mo linebreak="goodbreak">=</mo><mn>1</mn><mo>|</mo><mi>θ</mi></mrow></mfenced><mo linebreak="badbreak">=</mo><mfrac><mrow><mi>exp</mi><mfenced separators="" open="[" close="]"><mrow><msub><mi>a</mi><mrow><mi>j</mi><mi>g</mi></mrow></msub><mfenced separators="" open="(" close=")"><mrow><mi>θ</mi><mo>−</mo><msub><mi>b</mi><mrow><mi>j</mi><mi>g</mi></mrow></msub></mrow></mfenced></mrow></mfenced></mrow><mrow><mn>1</mn><mo>+</mo><mi>exp</mi><mfenced separators="" open="[" close="]"><mrow><msub><mi>a</mi><mrow><mi>j</mi><mi>g</mi></mrow></msub><mfenced separators="" open="(" close=")"><mrow><mi>θ</mi><mo>−</mo><msub><mi>b</mi><mrow><mi>j</mi><mi>g</mi></mrow></msub></mrow></mfenced></mrow></mfenced></mrow></mfrac><mo>,</mo><mspace width="0.33em" /></mrow><annotation encoding="application/x-tex">$$\begin{equation}P\ \left({\ {{X}_{ijg}} = 1{\mathrm{|}}\theta } \right) = \frac{{\exp \left[ {{{a}_{jg}}\left({\theta - {{b}_{jg}}} \right)} \right]}}{{1 + \exp \left[ {{{a}_{jg}}\left({\theta - {{b}_{jg}}} \right)} \right]}},\ \end{equation}$$</annotation></semantics></math> </ephtml> 2 
<ephtml> <math display="block" altimg="urn:x-wiley:07311745:media:emip12640:emip12640-math-0002" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>P</mi><mspace width="0.33em" /><mfenced separators="" open="(" close=")"><mrow><mspace width="0.33em" /><msub><mi>X</mi><mrow><mi>i</mi><mi>j</mi><mi>g</mi></mrow></msub><mo linebreak="goodbreak">=</mo><mi>x</mi><mo>|</mo><mi>θ</mi></mrow></mfenced><mo linebreak="badbreak">=</mo><mfrac><mrow><mi>exp</mi><mfenced separators="" open="[" close="]"><mrow><msubsup><mo>∑</mo><mrow><mi>r</mi><mo>=</mo><mn>0</mn></mrow><mi>x</mi></msubsup><msub><mi>a</mi><mrow><mi>j</mi><mi>g</mi></mrow></msub><mspace width="0.33em" /><mfenced separators="" open="(" close=")"><mrow><mi>θ</mi><mo>−</mo><msub><mi>b</mi><mrow><mi>j</mi><mi>g</mi></mrow></msub><mo>+</mo><msub><mi>t</mi><mrow><mi>j</mi><mi>r</mi><mi>g</mi></mrow></msub></mrow></mfenced></mrow></mfenced></mrow><mrow><msubsup><mo>∑</mo><mrow><mi>u</mi><mo>=</mo><mn>0</mn></mrow><msub><mi>m</mi><mi>j</mi></msub></msubsup><mi>exp</mi><mfenced separators="" open="[" close="]"><mrow><msubsup><mo>∑</mo><mrow><mi>r</mi><mo>=</mo><mn>0</mn></mrow><mi>u</mi></msubsup><msub><mi>a</mi><mrow><mi>j</mi><mi>g</mi></mrow></msub><mspace width="0.33em" /><mfenced separators="" open="(" close=")"><mrow><mi>θ</mi><mo>−</mo><msub><mi>b</mi><mrow><mi>j</mi><mi>g</mi></mrow></msub><mo>+</mo><msub><mi>t</mi><mrow><mi>j</mi><mi>r</mi><mi>g</mi></mrow></msub></mrow></mfenced></mrow></mfenced></mrow></mfrac><mspace width="0.33em" /><mo>,</mo></mrow><annotation encoding="application/x-tex">$$\begin{equation}P\ \left({\ {{X}_{ijg}} = x{\mathrm{|}}\theta } \right) = \frac{{\exp \left[ {\sum_{r = 0}^x {{a}_{jg}}\ \left({\theta - {{b}_{jg}} + {{t}_{jrg}}} \right)} \right]}}{{\sum_{u = 0}^{{{m}_j}} \exp \left[ {\sum_{r = 0}^u {{a}_{jg}}\ \left({\theta - {{b}_{jg}} + {{t}_{jrg}}} \right)} \right]}}\ ,\end{equation}$$</annotation></semantics></math> </ephtml> where <ephtml> <math display="inline" 
altimg="urn:x-wiley:07311745:media:emip12640:emip12640-math-0003" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mi>θ</mi><annotation encoding="application/x-tex">$\theta $</annotation></semantics></math> </ephtml> is the latent trait parameter, <ephtml> <math display="inline" altimg="urn:x-wiley:07311745:media:emip12640:emip12640-math-0004" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><msub><mi>a</mi><mrow><mi>j</mi><mi>g</mi></mrow></msub><annotation encoding="application/x-tex">${{a}_{jg}}$</annotation></semantics></math> </ephtml> and <ephtml> <math display="inline" altimg="urn:x-wiley:07311745:media:emip12640:emip12640-math-0005" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><msub><mi>b</mi><mrow><mi>j</mi><mi>g</mi></mrow></msub><annotation encoding="application/x-tex">${{b}_{jg}}$</annotation></semantics></math> </ephtml> are the discrimination and difficulty parameters of item <emph>j</emph> for group <emph>g</emph>. For the GPCM, <ephtml> <math display="inline" altimg="urn:x-wiley:07311745:media:emip12640:emip12640-math-0006" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><msub><mi>t</mi><mrow><mi>j</mi><mi>r</mi><mi>g</mi></mrow></msub><annotation encoding="application/x-tex">${{t}_{jrg}}$</annotation></semantics></math> </ephtml> is the <emph>r</emph>th category threshold parameter of item <emph>j</emph> for group <emph>g</emph>. 
The model has constraints where <ephtml> <math display="inline" altimg="urn:x-wiley:07311745:media:emip12640:emip12640-math-0007" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>t</mi><mrow><mi>j</mi><mn>0</mn><mi>g</mi></mrow></msub><mo>=</mo><mn>0</mn></mrow><annotation encoding="application/x-tex">${{t}_{j0g}} = 0$</annotation></semantics></math> </ephtml> and <ephtml> <math display="inline" altimg="urn:x-wiley:07311745:media:emip12640:emip12640-math-0008" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msubsup><mo>∑</mo><mrow><mi>r</mi><mo>=</mo><mn>1</mn></mrow><mi>k</mi></msubsup><mspace width="0.33em" /><msub><mi>t</mi><mrow><mi>j</mi><mi>r</mi><mi>g</mi></mrow></msub><mo>=</mo><mn>0</mn></mrow><annotation encoding="application/x-tex">$\sum_{r = 1}^k \ {{t}_{jrg}} = 0$</annotation></semantics></math> </ephtml> . Finally, <ephtml> <math display="inline" altimg="urn:x-wiley:07311745:media:emip12640:emip12640-math-0009" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><msub><mi>m</mi><mi>j</mi></msub><annotation encoding="application/x-tex">${{m}_j}$</annotation></semantics></math> </ephtml> is the total number of categories – 1 for an item <emph>j</emph> (e.g., <ephtml> <math display="inline" altimg="urn:x-wiley:07311745:media:emip12640:emip12640-math-0010" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><msub><mi>X</mi><mrow><mi>i</mi><mi>j</mi><mi>g</mi></mrow></msub><annotation encoding="application/x-tex">${{X}_{ijg}}$</annotation></semantics></math> </ephtml> = 0, 1, ..., <ephtml> <math display="inline" altimg="urn:x-wiley:07311745:media:emip12640:emip12640-math-0011" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><msub><mi>m</mi><mi>j</mi></msub><annotation encoding="application/x-tex">${{m}_j}$</annotation></semantics></math> </ephtml> ). 
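The response functions in Equations (1) and (2) can be illustrated numerically. The following is a minimal Python sketch for a single item in a single group, not the estimation software used in the study; the function names p_2pl and p_gpcm and all parameter values are hypothetical and chosen only for illustration.

```python
import math


def p_2pl(theta, a, b):
    """Equation (1): probability of a correct response under the 2PL."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def p_gpcm(theta, a, b, t):
    """Equation (2): category probabilities under the GPCM.

    t holds the thresholds t_0, ..., t_m for one item, with t[0] == 0.
    Returns a list of m + 1 probabilities for scores 0, ..., m.
    """
    cumulative = []
    running = 0.0
    for t_r in t:
        # Inner sum over r = 0..x of a * (theta - b + t_r).
        running += a * (theta - b + t_r)
        cumulative.append(running)
    # Subtract the maximum before exponentiating for numerical stability;
    # the shift cancels between numerator and denominator.
    top = max(cumulative)
    weights = [math.exp(c - top) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical parameter values (assumptions, not estimates from PISA).
binary = p_2pl(theta=0.3, a=1.2, b=-0.5)
ternary = p_gpcm(theta=0.3, a=1.2, b=-0.5, t=[0.0, 0.4, -0.4])
print(round(binary, 3), [round(p, 3) for p in ternary])
```

With only two score categories and thresholds [0.0, 0.0], p_gpcm returns the same probability of a score of 1 as p_2pl, which reflects that the GPCM reduces to the 2PL for dichotomous items.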
In the multigroup structure, <ephtml> <math display="inline" altimg="urn:x-wiley:07311745:media:emip12640:emip12640-math-0012" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mi>θ</mi><annotation encoding="application/x-tex">$\theta $</annotation></semantics></math> </ephtml> is assumed to be distributed as <emph>N</emph>( <ephtml> <math display="inline" altimg="urn:x-wiley:07311745:media:emip12640:emip12640-math-0013" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><msub><mi>μ</mi><mi>g</mi></msub><annotation encoding="application/x-tex">${{\mu }_g}$</annotation></semantics></math> </ephtml> , <ephtml> <math display="inline" altimg="urn:x-wiley:07311745:media:emip12640:emip12640-math-0014" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><msubsup><mi>σ</mi><mi>g</mi><mn>2</mn></msubsup><annotation encoding="application/x-tex">$\sigma _g^2$</annotation></semantics></math> </ephtml> ), where <ephtml> <math display="inline" altimg="urn:x-wiley:07311745:media:emip12640:emip12640-math-0015" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><msub><mi>μ</mi><mi>g</mi></msub><annotation encoding="application/x-tex">${{\mu }_g}$</annotation></semantics></math> </ephtml> is the mean and <ephtml> <math display="inline" altimg="urn:x-wiley:07311745:media:emip12640:emip12640-math-0016" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><msubsup><mi>σ</mi><mi>g</mi><mn>2</mn></msubsup><annotation encoding="application/x-tex">$\sigma _g^2$</annotation></semantics></math> </ephtml> is the variance of the group <emph>g</emph>.</p> <p>In the IRT scaling procedure, the DIF items were detected as follows. First, the IRT models with the equality constraint on the item parameters (item parameters are the same between the ML and non‐ML groups) were fitted to the data. In this way, the common item parameters can be estimated. 
Based on the item parameters estimated using the marginal maximum likelihood (MML) approach with the expectation‐maximization (EM) algorithm, latent trait scores were subsequently estimated using weighted likelihood estimates (WLE; Warm, [<reflink idref="bib57" id="ref51">57</reflink>]). Next, to detect items that reveal any misfit due to the restrictive assumption of equal item parameters between the groups, the model‐based (or expected) item characteristic curve (ICC) was obtained from the IRT models with the common parameter estimates. Item fit statistics were then computed to measure how far the model‐based ICC lies from the ICC calculated from the observed data. The observed ICC is obtained by approximating the pseudo responses from the MML‐EM algorithm (von Davier, [<reflink idref="bib54" id="ref52">54</reflink>]). The group indicator <emph>g</emph> is dropped in the following Equations (<reflink idref="bib3" id="ref53">3</reflink>) and (<reflink idref="bib4" id="ref54">4</reflink>) for simplicity. (3) <ephtml> <math display="block" altimg="urn:x-wiley:07311745:media:emip12640:emip12640-math-0017" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>P</mi><mi>obs</mi></msub><mspace width="0.33em" /><mfenced separators="" open="(" close=")"><mrow><msub><mi>X</mi><mi>i</mi></msub><mo linebreak="goodbreak">=</mo><mi>x</mi><mo>|</mo><mi>θ</mi></mrow></mfenced><mo linebreak="badbreak">=</mo><munderover><mo>∑</mo><mrow><mi>p</mi><mo>=</mo><mn>1</mn></mrow><mi>N</mi></munderover><mfrac><mrow><msub><mi>x</mi><mrow><mi>p</mi><mi>i</mi></mrow></msub><mi>L</mi><mfenced separators="" open="(" close=")"><mrow><mrow><mi>θ</mi><mo>|</mo></mrow><msub><mi mathvariant="bold-italic">X</mi><mi>p</mi></msub></mrow></mfenced><mi>A</mi><mfenced open="(" close=")"><mi>θ</mi></mfenced></mrow><mrow><msubsup><mo>∑</mo><mrow><mi>q</mi><mo>=</mo><mn>1</mn></mrow><mi>Q</mi></msubsup><mi>L</mi><mfenced separators="" open="(" 
close=")"><mrow><msub><mi>θ</mi><mi>q</mi></msub><mrow><mo>|</mo></mrow><msub><mi mathvariant="bold-italic">X</mi><mi>p</mi></msub></mrow></mfenced><mi>A</mi><mfenced separators="" open="(" close=")"><msub><mi>θ</mi><mi>q</mi></msub></mfenced></mrow></mfrac><mspace width="0.33em" /><mo>,</mo></mrow><annotation encoding="application/x-tex">$$\begin{equation}{{P}_{{\mathrm{obs}}}}\ \left({ {{X}_i} = x{\mathrm{|}}\theta } \right) = \sum_{p = 1}^N \frac{{{{x}_{pi}}L\left({\theta |{{{\bm{X}}}_p}} \right)A\left(\theta \right)}}{{\sum_{q = 1}^Q L\left({{{\theta }_q}|{{{\bm{X}}}_p}} \right)A\left({{{\theta }_q}} \right)}}\ ,\end{equation}$$</annotation></semantics></math> </ephtml> where <ephtml> <math display="inline" altimg="urn:x-wiley:07311745:media:emip12640:emip12640-math-0018" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><msub><mi>x</mi><mrow><mi>p</mi><mi>i</mi></mrow></msub><annotation encoding="application/x-tex">${{x}_{pi}}$</annotation></semantics></math> </ephtml> is the observed response of examinee <emph>p</emph> to item <emph>i</emph>, <emph>N</emph> is the total number of examinees (<emph>p</emph> = 1, ..., <emph>N</emph>), <emph>Q</emph> is the total number of quadrature points (<emph>q</emph> = 1, ..., <emph>Q</emph>) in MML estimation, and <ephtml> <math display="inline" altimg="urn:x-wiley:07311745:media:emip12640:emip12640-math-0019" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>A</mi><mo>(</mo><mi>θ</mi><mo>)</mo></mrow><annotation encoding="application/x-tex">$A(\theta)$</annotation></semantics></math> </ephtml> is the normalized prior distribution in MML estimation. 
Note that <ephtml> <math display="inline" altimg="urn:x-wiley:07311745:media:emip12640:emip12640-math-0020" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>L</mi><mo>(</mo><mi>θ</mi><mo>|</mo><msub><mi mathvariant="bold-italic">X</mi><mi>p</mi></msub><mo>)</mo></mrow><annotation encoding="application/x-tex">$L(\theta |{{{\bm{X}}}_p})$</annotation></semantics></math> </ephtml> is the likelihood function for examinee <emph>p</emph>, computed as (4) <ephtml> <math display="block" altimg="urn:x-wiley:07311745:media:emip12640:emip12640-math-0021" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>L</mi><mspace width="0.33em" /><mfenced separators="" open="(" close=")"><mrow><mrow><mi>θ</mi><mo>|</mo></mrow><msub><mi mathvariant="bold-italic">X</mi><mi>p</mi></msub></mrow></mfenced><mo linebreak="badbreak">=</mo><munderover><mo>∏</mo><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow><mi>J</mi></munderover><mi>P</mi><mfenced separators="" open="(" close=")"><mrow><msub><mi>X</mi><mrow><mi>p</mi><mi>i</mi></mrow></msub><mo>=</mo><msub><mi>x</mi><mrow><mi>p</mi><mi>i</mi></mrow></msub><mrow><mo>|</mo><mi>θ</mi></mrow></mrow></mfenced><mspace width="0.33em" /><mo>,</mo></mrow><annotation encoding="application/x-tex">$$\begin{equation}L\ \left({\theta | {{{\bm{X}}}_p}} \right) = \mathop \prod \limits_{i = 1}^J P\left({{{X}_{pi}} = {{x}_{pi}}|\theta } \right)\ ,\end{equation}$$</annotation></semantics></math> </ephtml> where <emph>J</emph> is the total number of items (<emph>i</emph> = 1, ..., <emph>J</emph>).</p> <p>Because the observed ICC is specific to each home language group, a noticeable discrepancy between the model‐based and observed ICCs in either group indicates that the item does not function equally between the two groups (i.e., DIF). 
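</p> <p>As a rough illustration of Equations (3) and (4), the Python sketch below computes each examinee's likelihood on a small quadrature grid, converts it into posterior weights, and accumulates posterior‐weighted counts to obtain an observed ICC. It is not the mdltm implementation: the two‐parameter logistic item function, the helper names (p_2pl, likelihood, observed_icc), the toy responses, and the three‐node grid are all assumptions made for illustration. The final division by the expected number of examinees at each node is likewise an assumption, added so that the result can be read as a proportion.</p>

```python
import math


def p_2pl(theta, a, b):
    """Probability of a correct response under a 2PL item (assumed model)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def likelihood(theta, responses, items):
    """Equation (4): product over items of P(X_pi = x_pi | theta)."""
    value = 1.0
    for x, (a, b) in zip(responses, items):
        p = p_2pl(theta, a, b)
        value *= p if x == 1 else 1.0 - p
    return value


def observed_icc(data, items, item_index, nodes, prior):
    """Posterior-weighted proportion correct on one item at each node."""
    correct = [0.0] * len(nodes)   # expected number of correct responses
    total = [0.0] * len(nodes)     # expected number of examinees
    for responses in data:
        joint = [likelihood(t, responses, items) * w
                 for t, w in zip(nodes, prior)]
        norm = sum(joint)          # denominator of Equation (3)
        for q, value in enumerate(joint):
            posterior = value / norm
            total[q] += posterior
            correct[q] += responses[item_index] * posterior
    # Assumption: normalize the pseudo-counts to obtain a proportion.
    return [c / n for c, n in zip(correct, total)]


# Hypothetical toy input: three examinees, two items, three quadrature nodes.
items = [(1.0, 0.0), (1.0, 0.5)]           # (discrimination, difficulty)
data = [[1, 1], [0, 0], [1, 0]]            # scored responses
nodes = [-1.0, 0.0, 1.0]
prior = [0.25, 0.5, 0.25]                  # normalized prior weights A(theta)
icc_item_0 = observed_icc(data, items, 0, nodes, prior)
print(icc_item_0)
```

<p>With real data, the same posterior weights would be computed separately for the ML and non‐ML groups, giving one observed ICC per group to compare against the common model‐based ICC.</p> <p>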
Two item fit statistics were considered, the mean deviation (MD) and the root mean squared deviation (RMSD). Both summarize the (squared) difference between the two ICCs, weighted by the density of the latent trait scores, as follows: (5) <ephtml> <math display="block" altimg="urn:x-wiley:07311745:media:emip12640:emip12640-math-0022" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>M</mi><msub><mi>D</mi><mrow><mi>i</mi><mi>g</mi></mrow></msub><mo linebreak="badbreak">=</mo><mo>∫</mo><mfenced separators="" open="[" close="]"><mrow><msub><mi>P</mi><mi>obs</mi></msub><mfenced separators="" open="(" close=")"><mrow><msub><mi>X</mi><mrow><mi>i</mi><mi>g</mi></mrow></msub><mo>|</mo><mi>θ</mi></mrow></mfenced><mo>−</mo><msub><mi>P</mi><mi>exp</mi></msub><mfenced separators="" open="(" close=")"><mrow><msub><mi>X</mi><mrow><mi>i</mi><mi>g</mi></mrow></msub><mo>|</mo><mi>θ</mi></mrow></mfenced></mrow></mfenced><msub><mi>f</mi><mi>g</mi></msub><mfenced open="(" close=")"><mi>θ</mi></mfenced><mi>d</mi><mi>θ</mi><mo>,</mo><mspace width="0.33em" /><mspace width="0.33em" /></mrow><annotation encoding="application/x-tex">$$\begin{equation}M{{D}_{ig}} = \smallint \left[ {{{P}_{{\mathrm{obs}}}}\left({{{X}_{ig}}{\mathrm{|}}\theta } \right) - {{P}_{{\mathrm{exp}}}}\left({{{X}_{ig}}{\mathrm{|}}\theta } \right)} \right]{{f}_g}\left(\theta \right)d\theta ,\ \ \end{equation}$$</annotation></semantics></math> </ephtml> where <ephtml> <math display="inline" altimg="urn:x-wiley:07311745:media:emip12640:emip12640-math-0023" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>P</mi><mi>obs</mi></msub><mrow><mo>(</mo><mrow><msub><mi>X</mi><mrow><mi>i</mi><mi>g</mi></mrow></msub><mo>|</mo><mi>θ</mi></mrow><mo>)</mo></mrow></mrow><annotation encoding="application/x-tex">${{P}_{{\mathrm{obs}}}}({{{X}_{ig}}{\mathrm{|}}\theta })$</annotation></semantics></math> </ephtml> is the observed ICC for the group <ephtml> <math display="inline" 
altimg="urn:x-wiley:07311745:media:emip12640:emip12640-math-0024" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mi>g</mi><annotation encoding="application/x-tex">$g$</annotation></semantics></math> </ephtml> ( <ephtml> <math display="inline" altimg="urn:x-wiley:07311745:media:emip12640:emip12640-math-0025" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>g</mi><mo>=</mo><mn>1</mn><mo>,</mo><mn>2</mn></mrow><annotation encoding="application/x-tex">$g = 1,2$</annotation></semantics></math> </ephtml> ) in item <ephtml> <math display="inline" altimg="urn:x-wiley:07311745:media:emip12640:emip12640-math-0026" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mi>i</mi><annotation encoding="application/x-tex">$i$</annotation></semantics></math> </ephtml> , <ephtml> <math display="inline" altimg="urn:x-wiley:07311745:media:emip12640:emip12640-math-0027" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>P</mi><mi>exp</mi></msub><mrow><mo>(</mo><mrow><msub><mi>X</mi><mrow><mi>i</mi><mi>g</mi></mrow></msub><mo>|</mo><mi>θ</mi></mrow><mo>)</mo></mrow></mrow><annotation encoding="application/x-tex">${{P}_{{\mathrm{exp}}}}({{{X}_{ig}}{\mathrm{|}}\theta })$</annotation></semantics></math> </ephtml> is the model‐based (or expected) ICC for the group <ephtml> <math display="inline" altimg="urn:x-wiley:07311745:media:emip12640:emip12640-math-0028" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mi>g</mi><annotation encoding="application/x-tex">$g$</annotation></semantics></math> </ephtml> in item <ephtml> <math display="inline" altimg="urn:x-wiley:07311745:media:emip12640:emip12640-math-0029" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mi>i</mi><annotation encoding="application/x-tex">$i$</annotation></semantics></math> </ephtml> , and <ephtml> <math display="inline" altimg="urn:x-wiley:07311745:media:emip12640:emip12640-math-0030" 
xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>f</mi><mi>g</mi></msub><mrow><mo>(</mo><mi>θ</mi><mo>)</mo></mrow></mrow><annotation encoding="application/x-tex">${{f}_g}(\theta)$</annotation></semantics></math> </ephtml> is the latent trait density function for group <emph>g</emph>. The MD can take negative or positive values. A negative value for a specific group <ephtml> <math display="inline" altimg="urn:x-wiley:07311745:media:emip12640:emip12640-math-0031" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mi>g</mi><annotation encoding="application/x-tex">$g$</annotation></semantics></math> </ephtml> suggests that the item was comparatively more difficult for that group.</p> <p>The RMSD, in contrast, ranges from 0 to 1, and a higher RMSD for either group indicates a higher level of misfit: (6) <ephtml> <math display="block" altimg="urn:x-wiley:07311745:media:emip12640:emip12640-math-0032" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>R</mi><mi>M</mi><mi>S</mi><msub><mi>D</mi><mrow><mi>i</mi><mi>g</mi></mrow></msub><mo linebreak="badbreak">=</mo><msqrt><mrow><mo>∫</mo><msup><mfenced separators="" open="[" close="]"><mrow><msub><mi>P</mi><mi>obs</mi></msub><mfenced separators="" open="(" close=")"><mrow><msub><mi>X</mi><mrow><mi>i</mi><mi>g</mi></mrow></msub><mo>|</mo><mi>θ</mi></mrow></mfenced><mo>−</mo><msub><mi>P</mi><mi>exp</mi></msub><mfenced separators="" open="(" close=")"><mrow><msub><mi>X</mi><mrow><mi>i</mi><mi>g</mi></mrow></msub><mo>|</mo><mi>θ</mi></mrow></mfenced></mrow></mfenced><mn>2</mn></msup><msub><mi>f</mi><mi>g</mi></msub><mfenced open="(" close=")"><mi>θ</mi></mfenced><mspace width="0.33em" /><mi>d</mi><mi>θ</mi></mrow></msqrt><mo>,</mo></mrow><annotation encoding="application/x-tex">$$\begin{equation}RMS{{D}_{ig}} = \sqrt {\smallint {{{\left[ {{{P}_{{\mathrm{obs}}}}\left({{{X}_{ig}}{\mathrm{|}}\theta } \right) - 
{{P}_{{\mathrm{exp}}}}\left({{{X}_{ig}}{\mathrm{|}}\theta } \right)} \right]}}^2}{{f}_g}\left(\theta \right)\ d\theta } ,\end{equation}$$</annotation></semantics></math> </ephtml></p> <p>In the PISA operational procedure, the DIF thresholds were set at RMSD = .12 for cognitive domains (Joo et al., [<reflink idref="bib29" id="ref55">29</reflink>]) and RMSD = .15 for noncognitive domains (OECD, 2019). Because each item has two RMSD values, one per group, items with an RMSD exceeding .12 for either group were identified as having DIF. The IRT scaling and DIF detection procedure was conducted using the "mdltm" software (von Davier, [<reflink idref="bib54" id="ref56">54</reflink>]).</p> <hd id="AN0183983946-12">Measurement Invariance for Response Time</hd> <p>To examine invariance with regard to response time between MLs and non‐MLs in a test, we conducted two analyses. First, we examined mean response time conditional on the latent trait scores obtained from the IRT models, using locally weighted scatterplot smoothing (LOWESS). Next, we adopted a hypothesis test for differential response time (DRT) functioning, which was developed by Guo and Ercikan ([<reflink idref="bib23" id="ref57">23</reflink>]). The basic steps are as follows: For each individual stratum, we compute the response time difference between the two home language groups, under the assumptions that (a) response times are independent among individuals and between the groups conditional upon the stratum; (b) response times are a random sample from the specific population; and (c) both language groups experience speededness, fatigue, or motivation similarly. See Guo and Ercikan ([<reflink idref="bib23" id="ref58">23</reflink>]; pp. 
4, 5) for details with regard to the assumptions and the derivation of the test statistic: (7) <ephtml> <math display="block" altimg="urn:x-wiley:07311745:media:emip12640:emip12640-math-0033" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>z</mi><mo linebreak="badbreak">=</mo><mfrac><mn>1</mn><msqrt><mrow><mn>2</mn><msubsup><mo>∑</mo><mrow><mi>s</mi><mo>=</mo><mn>1</mn></mrow><mi>S</mi></msubsup><msup><mfenced separators="" open="(" close=")"><mfrac><msub><mi>n</mi><mrow><mi>s</mi><mo>,</mo><mrow><mspace width="0.33em" /><mi>focal</mi></mrow></mrow></msub><msub><mi>N</mi><mi>focal</mi></msub></mfrac></mfenced><mn>2</mn></msup></mrow></msqrt></mfrac><mspace width="0.33em" /><munderover><mo>∑</mo><mrow><mi>s</mi><mo>=</mo><mn>1</mn></mrow><mi>S</mi></munderover><mfrac><msub><mi>n</mi><mrow><mi>s</mi><mo>,</mo><mrow><mspace width="0.33em" /><mi>focal</mi></mrow></mrow></msub><msub><mi>N</mi><mi>focal</mi></msub></mfrac><mspace width="0.33em" /><mfrac><mrow><msub><mover accent="true"><mi>μ</mi><mo>̂</mo></mover><mrow><mi>s</mi><mo>,</mo><mrow><mspace width="0.33em" /><mi>focal</mi></mrow></mrow></msub><mo>−</mo><msub><mover accent="true"><mi>μ</mi><mo>̂</mo></mover><mrow><mi>s</mi><mo>,</mo><mrow><mspace width="0.33em" /><mi>ref</mi></mrow></mrow></msub></mrow><msqrt><mrow><mfrac><msubsup><mover accent="true"><mi>σ</mi><mo>̂</mo></mover><mrow><mi>s</mi><mo>,</mo><mi>focal</mi></mrow><mn>2</mn></msubsup><msub><mi>n</mi><mrow><mi>s</mi><mo>,</mo><mrow><mspace width="0.33em" /><mi>focal</mi></mrow></mrow></msub></mfrac><mo>+</mo><mfrac><msubsup><mover accent="true"><mi>σ</mi><mo>̂</mo></mover><mrow><mi>s</mi><mo>,</mo><mi>ref</mi></mrow><mn>2</mn></msubsup><msub><mi>n</mi><mrow><mi>s</mi><mo>,</mo><mrow><mspace width="0.33em" /><mi>ref</mi></mrow></mrow></msub></mfrac></mrow></msqrt></mfrac><mspace width="0.33em" /><mo>∼</mo><mspace width="0.33em" /><mi>N</mi><mfenced separators="" open="(" 
close=")"><mrow><mn>0</mn><mo>,</mo><mn>1</mn></mrow></mfenced><mo>,</mo></mrow><annotation encoding="application/x-tex">$$\begin{equation}z = \frac{1}{{\sqrt {2\sum_{s = 1}^S {{{\left({\frac{{{{n}_{s,{\mathrm{\ focal}}}}}}{{{{N}_{{\mathrm{focal}}}}}}} \right)}}^2}} }}\ \sum_{s = 1}^S \frac{{{{n}_{s,{\mathrm{\ focal}}}}}}{{{{N}_{{\mathrm{focal}}}}}}\ \frac{{{{{\hat{\mu }}}_{s,{\mathrm{\ focal}}}} - {{{\hat{\mu }}}_{s,{\mathrm{\ ref}}}}}}{{\sqrt {\frac{{\hat{\sigma }_{s,{\mathrm{focal}}}^2}}{{{{n}_{s,{\mathrm{\ focal}}}}}} + \frac{{\hat{\sigma }_{s,{\mathrm{ref}}}^2}}{{{{n}_{s,{\mathrm{\ ref}}}}}}} }}\ \sim \ N\left({0,1} \right),\end{equation}$$</annotation></semantics></math> </ephtml> where <ephtml> <math display="inline" altimg="urn:x-wiley:07311745:media:emip12640:emip12640-math-0034" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><msub><mi>n</mi><mrow><mi>s</mi><mo>,</mo><mrow><mspace width="0.33em" /><mi>focal</mi></mrow></mrow></msub><annotation encoding="application/x-tex">${{n}_{s,{\mathrm{\ focal}}}}$</annotation></semantics></math> </ephtml> and <ephtml> <math display="inline" altimg="urn:x-wiley:07311745:media:emip12640:emip12640-math-0035" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><msub><mi>n</mi><mrow><mi>s</mi><mo>,</mo><mrow><mspace width="0.33em" /><mi>ref</mi></mrow></mrow></msub><annotation encoding="application/x-tex">${{n}_{s,{\mathrm{\ ref}}}}$</annotation></semantics></math> </ephtml> are the number of students for the stratum <ephtml> <math display="inline" altimg="urn:x-wiley:07311745:media:emip12640:emip12640-math-0036" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mi>s</mi><annotation encoding="application/x-tex">$s$</annotation></semantics></math> </ephtml> ( <ephtml> <math display="inline" altimg="urn:x-wiley:07311745:media:emip12640:emip12640-math-0037" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mi>s</mi><annotation encoding="application/x-tex">$s$</annotation></semantics></math> </ephtml> = 
1,..., <ephtml> <math display="inline" altimg="urn:x-wiley:07311745:media:emip12640:emip12640-math-0038" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mi>S</mi><annotation encoding="application/x-tex">$S$</annotation></semantics></math> </ephtml> ) in the focal group and in the reference group, respectively; <ephtml> <math display="inline" altimg="urn:x-wiley:07311745:media:emip12640:emip12640-math-0039" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><msub><mi>N</mi><mi>focal</mi></msub><annotation encoding="application/x-tex">${{N}_{{\mathrm{focal}}}}$</annotation></semantics></math> </ephtml> is the total number of students in the focal group; <ephtml> <math display="inline" altimg="urn:x-wiley:07311745:media:emip12640:emip12640-math-0040" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><msub><mover accent="true"><mi>μ</mi><mo>̂</mo></mover><mrow><mi>s</mi><mo>,</mo><mrow><mspace width="0.33em" /><mi>focal</mi></mrow></mrow></msub><annotation encoding="application/x-tex">${{\hat{\mu }}_{s,{\mathrm{\ focal}}}}$</annotation></semantics></math> </ephtml> and <ephtml> <math display="inline" altimg="urn:x-wiley:07311745:media:emip12640:emip12640-math-0041" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><msub><mover accent="true"><mi>μ</mi><mo>̂</mo></mover><mrow><mi>s</mi><mo>,</mo><mrow><mspace width="0.33em" /><mi>ref</mi></mrow></mrow></msub><annotation encoding="application/x-tex">${{\hat{\mu }}_{s,{\mathrm{\ ref}}}}$</annotation></semantics></math> </ephtml> are the mean response time for the stratum <ephtml> <math display="inline" altimg="urn:x-wiley:07311745:media:emip12640:emip12640-math-0042" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mi>s</mi><annotation encoding="application/x-tex">$s$</annotation></semantics></math> </ephtml> in the focal group and in the reference group, respectively; <ephtml> <math display="inline" altimg="urn:x-wiley:07311745:media:emip12640:emip12640-math-0043" 
xmlns="http://www.w3.org/1998/Math/MathML"><semantics><msub><mover accent="true"><mi>σ</mi><mo>̂</mo></mover><mrow><mi>s</mi><mo>,</mo><mrow><mspace width="0.33em" /><mi>focal</mi></mrow></mrow></msub><annotation encoding="application/x-tex">${{\hat{\sigma }}_{s,{\mathrm{\ focal}}}}$</annotation></semantics></math> </ephtml> and <ephtml> <math display="inline" altimg="urn:x-wiley:07311745:media:emip12640:emip12640-math-0044" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><msub><mover accent="true"><mi>σ</mi><mo>̂</mo></mover><mrow><mi>s</mi><mo>,</mo><mrow><mspace width="0.33em" /><mi>ref</mi></mrow></mrow></msub><annotation encoding="application/x-tex">${{\hat{\sigma }}_{s,{\mathrm{\ ref}}}}$</annotation></semantics></math> </ephtml> are the standard deviation of response time for the stratum <ephtml> <math display="inline" altimg="urn:x-wiley:07311745:media:emip12640:emip12640-math-0045" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mi>s</mi><annotation encoding="application/x-tex">$s$</annotation></semantics></math> </ephtml> in the focal group and in the reference group, respectively.</p> <p>The effect size indicates the average difference between the conditional mean response times across strata between the focal and reference groups. 
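</p> <p>A minimal Python sketch of the test statistic in Equation (7), together with the stratum‐weighted mean difference just described, is given below. It is not the authors' code. The function name drt_statistics, the input layout (one pair of response‐time lists per ability stratum), and the use of the sample variance with an n − 1 denominator are assumptions for illustration, and the toy response times are invented.</p>

```python
import math
from statistics import mean, variance


def drt_statistics(strata):
    """Return (z, effect) for a list of (focal_times, ref_times) pairs.

    Each pair holds the response times of the focal and reference groups
    within one ability stratum; every list needs at least two values.
    """
    n_focal_total = sum(len(focal) for focal, _ in strata)
    weighted_sum = 0.0      # sum of w_s * standardized stratum difference
    sum_sq_weights = 0.0    # sum of w_s ** 2
    effect = 0.0            # sum of w_s * raw stratum difference
    for focal, ref in strata:
        weight = len(focal) / n_focal_total        # n_s,focal / N_focal
        diff = mean(focal) - mean(ref)
        # Assumption: sample variances (n - 1 denominator).
        std_err = math.sqrt(variance(focal) / len(focal)
                            + variance(ref) / len(ref))
        weighted_sum += weight * diff / std_err
        sum_sq_weights += weight ** 2
        effect += weight * diff
    z = weighted_sum / math.sqrt(2 * sum_sq_weights)
    return z, effect


# Hypothetical response times in seconds for two ability strata.
toy_strata = [
    ([70, 80], [60, 70]),
    ([90, 110], [80, 100]),
]
z_value, drt_effect = drt_statistics(toy_strata)
print(z_value, drt_effect)
```

<p>The first returned value is compared against the standard normal distribution. The second is the effect size, expressed in the same time unit as the input.</p> <p>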
This effect size is informed by the standardized proportion difference statistic (Dorans & Kulick, [<reflink idref="bib15" id="ref59">15</reflink>]) and the standardized mean difference (Zwick & Thayer, [<reflink idref="bib60" id="ref60">60</reflink>]) as follows: (8) <ephtml> <math display="block" altimg="urn:x-wiley:07311745:media:emip12640:emip12640-math-0046" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mrow><mi>DRT</mi><mspace width="0.33em" /></mrow><mo linebreak="badbreak">=</mo><munderover><mo>∑</mo><mrow><mi>s</mi><mo>=</mo><mn>1</mn></mrow><mi>S</mi></munderover><mfrac><msub><mi>n</mi><mrow><mi>s</mi><mo>,</mo><mrow><mspace width="0.33em" /><mi>focal</mi></mrow></mrow></msub><msub><mi>N</mi><mi>focal</mi></msub></mfrac><mspace width="0.33em" /><mfenced separators="" open="[" close="]"><mrow><msub><mover accent="true"><mi>μ</mi><mo>̂</mo></mover><mrow><mi>s</mi><mo>,</mo><mrow><mspace width="0.33em" /><mi>focal</mi></mrow></mrow></msub><mo>−</mo><msub><mover accent="true"><mi>μ</mi><mo>̂</mo></mover><mrow><mi>s</mi><mo>,</mo><mrow><mspace width="0.33em" /><mi>ref</mi></mrow></mrow></msub></mrow></mfenced><mo>.</mo></mrow><annotation encoding="application/x-tex">$$\begin{equation}{\mathrm{DRT\ }} = \sum_{s = 1}^S \frac{{{{n}_{s,{\mathrm{\ focal}}}}}}{{{{N}_{{\mathrm{focal}}}}}}\ \left[ {{{{\hat{\mu }}}_{s,{\mathrm{\ focal}}}} - {{{\hat{\mu }}}_{s,{\mathrm{\ ref}}}}} \right].\end{equation}$$</annotation></semantics></math> </ephtml></p> <p>The test statistic and effect size derived from Guo and Ercikan ([<reflink idref="bib23" id="ref61">23</reflink>]) were coded in R software. The R code written by the author is available upon request.</p> <hd id="AN0183983946-13">Results</hd> <p></p> <hd id="AN0183983946-14">Results of Differential Item Functioning: Scored Response</hd> <p>Figure 2 shows three panels displaying RMSD results for the three cognitive domains: reading (left), mathematics (middle), and science (right). 
In each panel, the <emph>x</emph>‐axis represents RMSD values for MLs, while the <emph>y</emph>‐axis represents RMSD values for non‐MLs. The diagonal line (solid gray) marks where the RMSD values of the two groups are identical for an item. Additionally, vertical and horizontal lines (dotted black) create four quadrants based on the .12 threshold for RMSD, which indicates DIF in the PISA operational procedure. Dots falling in the upper right, upper left, and lower right quadrants indicate RMSD values exceeding the threshold (RMSD > .12) for both groups, the non‐ML group only, and the ML group only, respectively, and therefore signify DIF for the corresponding items. Dot shapes distinguish two question types: (a) computer‐scored responses (pink circles) and (b) human‐coded responses (blue triangles). Across the cognitive domains, larger RMSD values appear more frequently among MLs than among non‐MLs, as evidenced by dots located below the diagonal line in each panel. In reading, 13.66% of all items (50 items) demonstrated DIF for MLs, comprising 12.81% of the computer‐scored items (36 out of 281 items) and 16.47% of the human‐coded items (14 out of 85 items). In mathematics, 7.14% of all items (5 items) displayed DIF, comprising 5.77% of the computer‐scored items (3 out of 52 items) and 11.11% of the human‐coded items (2 out of 18 items). High RMSD values were concentrated in the ML group, suggesting potential differences in item characteristics for this group. In science, 11.30% of all items (13 items) exhibited DIF, comprising 13.25% of the computer‐scored items (11 out of 83 items) and 6.25% of the human‐coded items (2 out of 32 items), all within the ML group. Overall, similar to PISA country‐language groups, the highest comparability was found in mathematics, while the lowest comparability was in reading. 
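</p> <p>The flagging rule and the reported percentages can be checked with a short Python sketch. The function names dif_quadrant and dif_percentage and the dictionary layout are assumptions for illustration. The item counts are the ones reported in the text, and the .12 cutoff is the PISA operational threshold cited earlier.</p>

```python
def dif_quadrant(rmsd_ml, rmsd_non_ml, threshold=0.12):
    """Classify one item by which group's RMSD exceeds the threshold."""
    ml_flag = rmsd_ml > threshold
    non_ml_flag = rmsd_non_ml > threshold
    if ml_flag and non_ml_flag:
        return "both groups"       # upper right quadrant
    if non_ml_flag:
        return "non-ML only"       # upper left quadrant
    if ml_flag:
        return "ML only"           # lower right quadrant
    return "no DIF"                # lower left quadrant


# (DIF items, total items) by domain and scoring type, as reported in the text.
COUNTS = {
    "reading": {"computer-scored": (36, 281), "human-coded": (14, 85)},
    "mathematics": {"computer-scored": (3, 52), "human-coded": (2, 18)},
    "science": {"computer-scored": (11, 83), "human-coded": (2, 32)},
}


def dif_percentage(domain):
    """Percentage of all items in a domain that were flagged for DIF."""
    flagged = sum(dif for dif, _ in COUNTS[domain].values())
    total = sum(n for _, n in COUNTS[domain].values())
    return round(100 * flagged / total, 2)


for name in COUNTS:
    print(name, dif_percentage(name))
```

<p>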
The percentage of DIF items appears to align with previous studies (e.g., Joo et al., [<reflink idref="bib29" id="ref62">29</reflink>]).</p> <p> <img src="https://imageserver.ebscohost.com/img/embimages/rdk/EMS/01mar25/emip12640-fig-0002.jpg?ephost1=dGJyMMvl7ESepq84yOvsOLCmsE6epq5Srqa4SK6WxWXS" alt="emip12640-fig-0002.jpg" title="2 Root Mean Squared Deviations (RMSDs) for Differential Item Functioning. Note. RMSD = root mean square deviation; dotted lines = threshold of .12 for ML (vertical) and non‐ML (horizontal); CR = coded response (multiple‐choice) for Reading; DR = human‐coded response (open‐ended) for Reading; CM = coded response (multiple‐choice) for Mathematics; DM = human‐coded response (open‐ended) for Mathematics; CS = coded response (multiple‐choice) for Science; DS = human‐coded response (open‐ended) for Science." /> </p> <p></p> <p>Based on the MD statistics, 50% of all reading items had a negative MD for the ML group. In both mathematics and science, 49% of all items had a negative MD for the ML group. These findings indicate that, overall, item difficulty was balanced between the two groups. However, among the DIF items, 22 out of 50 (44%) in reading, 1 out of 5 (20%) in mathematics, and 9 out of 13 (69%) in science had a negative MD for the ML group. This indicates that mathematics DIF items tended to be easier for the ML group, whereas science DIF items tended to be more difficult for the ML group.</p> <hd id="AN0183983946-16">Home Language and Ability Estimates</hd> <p>Figure 3 compares latent ability scores obtained using WLE estimation for the non‐ML and ML groups in the reading, mathematics, and science domains. Note that these scores were estimated after releasing the equality constraints on the parameters of items identified as having DIF, as informed by the results presented in the previous section. 
All distributions are approximately normal, and the interquartile ranges fall between –1 and 1 for both groups. We found that the average latent ability scores of non‐MLs were significantly greater than those of MLs in all domains: <emph>t</emph>(4,778) = 6.51, <emph>p</emph> < .001, <emph>d</emph> = .26 for reading; <emph>t</emph>(2,594) = 4.19, <emph>p</emph> < .001, <emph>d</emph> = .23 for mathematics; <emph>t</emph>(2,588) = 7.18, <emph>p</emph> < .001, <emph>d</emph> = .39 for science, indicating small effect sizes according to Cohen's rule of thumb. It is also worth noting that Levene's test for equality of variances showed no significant difference in variances between the two groups when constraints on the DIF items were released. However, this was not the case when the item parameters were constrained to be equal between the two groups on all items. Specifically, there were significant differences in variances for reading and science, due to greater variances in the ML group.</p> <p> <img src="https://imageserver.ebscohost.com/img/embimages/rdk/EMS/01mar25/emip12640-fig-0003.jpg?ephost1=dGJyMMvl7ESepq84yOvsOLCmsE6epq5Srqa4SK6WxWXS" alt="emip12640-fig-0003.jpg" title="3 Boxplots of Ability Estimates. Note. ◊ = mean ability estimates." 
/> </p> <p></p> <p>Next, reading scores correlated strongly with scores in the two STEM subjects, <ephtml> <math display="inline" altimg="urn:x-wiley:07311745:media:emip12640:emip12640-math-0047" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><msub><mi>r</mi><mrow><mo>(</mo><mrow><mi>reading</mi><mo>,</mo><mi>math</mi></mrow><mo>)</mo></mrow></msub><annotation encoding="application/x-tex">${{r}_{({{\mathrm{reading}},{\mathrm{math}}})}}$</annotation></semantics></math> </ephtml> = .74; and <ephtml> <math display="inline" altimg="urn:x-wiley:07311745:media:emip12640:emip12640-math-0048" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><msub><mi>r</mi><mrow><mo>(</mo><mrow><mi>reading</mi><mo>,</mo><mi>science</mi></mrow><mo>)</mo></mrow></msub><annotation encoding="application/x-tex">${{r}_{({{\mathrm{reading}},{\mathrm{science}}})}}$</annotation></semantics></math> </ephtml> = .81. Analyses of covariance (ANCOVA) were conducted to test whether the group differences in these subjects remain significant when reading scores are controlled. Results showed that the two groups are not significantly different in math ability scores after controlling for reading scores. However, the difference between the two groups in science ability scores remains significant even after controlling for reading scores (<emph>F</emph>(1, 2,578) = 18.53, <emph>p</emph> < .001). 
Possible explanations for the difference in science scores between groups could include factors beyond reading skills, such as decontextualized language, students' self‐assessment of language proficiency, socioeconomic status, and potentially sociocultural factors embedded in the test items (Abedi & Gándara, [<reflink idref="bib3" id="ref71">3</reflink>]; Hiebert & Lubliner, [<reflink idref="bib24" id="ref72">24</reflink>]; Van Laere et al., [<reflink idref="bib53" id="ref73">53</reflink>]).</p> <hd id="AN0183983946-18">Differential Response Time (DRT) Functioning</hd> <p></p> <hd id="AN0183983946-19">Relationship between response time and ability estimate</hd> <p>Figure 4 plots the mean response time (RT) per item (<emph>y</emph>‐axis) against latent ability scores estimated by WLE (<emph>x</emph>‐axis). The mean RT was calculated by dividing the total time a student spent on the test by the number of items the student answered. Each data point represents the (observed) mean RT at the specific ability score. The figure consists of three plots corresponding to the reading (left), math (middle), and science (right) domains. 
In each plot, LOWESS curves with 95% confidence intervals (CIs) are presented by home language group (solid line for MLs and dotted line for non‐MLs).</p> <p> <img src="https://imageserver.ebscohost.com/img/embimages/rdk/EMS/01mar25/emip12640-fig-0004.jpg?ephost1=dGJyMMvl7ESepq84yOvsOLCmsE6epq5Srqa4SK6WxWXS" alt="emip12640-fig-0004.jpg" title="4 Mean Response Time and Ability Estimates (Locally Weighted Scatterplot Smoothing)" /> </p> <p></p> <p>Across all domains, we observed a nonmonotonic relationship between ability score and response time, suggesting that the strength and direction of the relationship between ability and mean test‐taking time varied across ability ranges and home language groups.</p> <p>More specifically, in the reading domain, we found that the mean RT rises more steeply with ability among below‐average students (i.e., those with negative ability scores). For above‐average students, although the rate of change is subtle, there appears to be an increase for the MLs and a decrease for the non‐MLs, with the trend becoming more evident for students whose ability scores exceed <ephtml> <math display="inline" altimg="urn:x-wiley:07311745:media:emip12640:emip12640-math-0049" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mover accent="true"><mi>θ</mi><mo>̂</mo></mover><mo>=</mo></mrow><annotation encoding="application/x-tex">$\hat{\theta } = $</annotation></semantics></math> </ephtml> 1.5.</p> <p>In the math domain, the rapid increase for the below‐average students persists, with the pattern being more noticeable for students in the range of <ephtml> <math display="inline" altimg="urn:x-wiley:07311745:media:emip12640:emip12640-math-0050" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mover accent="true"><mi>θ</mi><mo>̂</mo></mover><annotation encoding="application/x-tex">$\hat{\theta }$</annotation></semantics></math> </ephtml> < –1.
Regarding the mean RT, an average ML student ( <ephtml> <math display="inline" altimg="urn:x-wiley:07311745:media:emip12640:emip12640-math-0051" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mover accent="true"><mi>θ</mi><mo>̂</mo></mover><annotation encoding="application/x-tex">$\hat{\theta }$</annotation></semantics></math> </ephtml> = 0) spent 75 seconds on a reading item and longer than that on a math item; an average non‐ML student spent 70 seconds on a reading item and likewise longer than that on a math item.</p> <p>Finally, in the science domain, the smoothing curve for MLs differs noticeably from their curves in the other domains in a few respects: among below‐average MLs, those with very low ability scores ( <ephtml> <math display="inline" altimg="urn:x-wiley:07311745:media:emip12640:emip12640-math-0052" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mover accent="true"><mi>θ</mi><mo>̂</mo></mover><annotation encoding="application/x-tex">$\hat{\theta }$</annotation></semantics></math> </ephtml> < –1.5) tended to show a decreasing trend in their mean RT; students with ability scores in the range –1.5 < <ephtml> <math display="inline" altimg="urn:x-wiley:07311745:media:emip12640:emip12640-math-0053" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mover accent="true"><mi>θ</mi><mo>̂</mo></mover><annotation encoding="application/x-tex">$\hat{\theta }$</annotation></semantics></math> </ephtml> < –.5 show an increase in their mean RT; for the above‐average MLs ( <ephtml> <math display="inline" altimg="urn:x-wiley:07311745:media:emip12640:emip12640-math-0054" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mover accent="true"><mi>θ</mi><mo>̂</mo></mover><annotation encoding="application/x-tex">$\hat{\theta }$</annotation></semantics></math> </ephtml> > –.5), there is a subtle increase in their mean RT.
For non‐ML students, the mean response time increased among below‐average students, whereas for students with greater ability the trend weakens.</p> <p>When the two home language groups are compared, the overall results in Figure 4 indicate that the MLs (solid curve) generally have a longer mean RT than non‐MLs (dotted curve) when their abilities are matched. Specifically, no interaction effect was found between the ability levels and the home language groups. These findings could suggest that the test functions differently, particularly revealing a uniform bias against the MLs.</p> <hd id="AN0183983946-21">Differential response time (DRT) functioning</hd> <p>Figure 5 provides histograms to summarize the results of investigating DRT functioning, as developed by Guo and Ercikan ([<reflink idref="bib23" id="ref74">23</reflink>]). To conduct this test, the students were grouped according to a criterion, namely the latent ability scores obtained by WLE in the MG‐IRT analysis, as described in the section above.</p> <p> <img src="https://imageserver.ebscohost.com/img/embimages/rdk/EMS/01mar25/emip12640-fig-0005.jpg?ephost1=dGJyMMvl7ESepq84yOvsOLCmsE6epq5Srqa4SK6WxWXS" alt="emip12640-fig-0005.jpg" title="5 Differential Response Time (Effect Size) Stratified by Ability Level. Note. No = items with RMSD < .12; Yes = items with RMSD > .12; ability level was obtained by weighted likelihood estimation (WLE)." /> </p> <p></p> <p>Using this criterion (five strata in total), we found that 18 items (8%) in reading, 7 items (10%) in mathematics, and 8 items (7%) in science exceeded the critical value of 1.96. The DRT effect size values range as follows: (–8.36, 42.59) in reading, (–12.61, 42.68) in mathematics, and (–6.82, 41.52) in science.
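</p> <p>To make the stratified comparison concrete, the following is a minimal sketch of a standardization‐style DRT statistic for a single item. It is an illustration under stated assumptions rather than the exact procedure of Guo and Ercikan: the equal‐count ability strata, the weighting by ML (focal group) counts, and the normal‐approximation z value are choices made here for illustration, and the names drt_effect, records, and n_strata are hypothetical.</p>

```python
from math import sqrt
from statistics import mean, variance


def drt_effect(records, n_strata=5):
    """records: (ability, is_ml, rt_seconds) tuples for one item.

    Returns (weighted mean RT difference, z statistic); positive = MLs slower.
    """
    ordered = sorted(records, key=lambda rec: rec[0])
    per_stratum = len(ordered) / n_strata
    strata = [[] for _ in range(n_strata)]
    for i, rec in enumerate(ordered):
        # Equal-count strata over the pooled ability ordering (an assumption).
        strata[min(int(i / per_stratum), n_strata - 1)].append(rec)

    usable = []
    for stratum in strata:
        focal = [rt for _, ml, rt in stratum if ml]
        reference = [rt for _, ml, rt in stratum if not ml]
        if len(focal) >= 2 and len(reference) >= 2:  # need a variance in each cell
            usable.append((focal, reference))
    if not usable:
        raise ValueError("no stratum contains enough students from both groups")

    n_focal = sum(len(focal) for focal, _ in usable)
    diff, var = 0.0, 0.0
    for focal, reference in usable:
        weight = len(focal) / n_focal  # weight strata by focal-group (ML) counts
        diff += weight * (mean(focal) - mean(reference))
        var += weight ** 2 * (
            variance(focal) / len(focal) + variance(reference) / len(reference)
        )
    return diff, diff / sqrt(var)
```

<p>In this sketch, a positive value means that MLs took longer than non‐MLs of similar ability, and an item would be compared against the 1.96 critical value mentioned above.</p> <p>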
Although not presented in the figure, we also used the mean item score as an alternative criterion and found close‐to‐perfect linear relationships between the DRT values obtained under the two criteria: the Pearson correlation coefficients were <emph>r</emph> = .99 (reading), .99 (mathematics), and .98 (science).</p> <p>The skewness of DRT values was most severe in science (1.29), followed by reading (1.04) and mathematics (.58). In contrast, the proportion of items with positive DRT was highest in mathematics (89%), followed by science (85%) and reading (85%). It is worth noting that positive DRT values indicate that the mean response time for MLs was predominantly longer than that for non‐MLs.</p> <hd id="AN0183983946-23">Discussion</hd> <p>In this study, we investigated potential assessment bias in the context of linguistically and culturally diverse students as compared to their mainstream peers in the PISA 2018 data. The study utilized all available items within the reading, mathematics, and science domains, which were generated from DBA. In relation to item score data, we examined issues of measurement comparability via DIF analysis within each domain. This analysis was conducted using IRT‐based item fit statistics, specifically the RMSD statistic, which assesses the discrepancy in item estimates between the assumption of item universality and the assumption of item uniqueness for both MLs and non‐MLs. The results indicated that a small proportion of items exceeded the predefined thresholds, signifying that the two groups responded differently to those items, even when their performance and estimated ability scores were controlled.
The DIF proportions were approximately 14% (reading), 7% (mathematics), and 11% (science), similar to the DIF proportions across multiple country‐by‐language groups reported in the PISA 2018 technical report (OECD, [<reflink idref="bib42" id="ref75">42</reflink>]), where reading items showed more DIF than mathematics and science items.</p> <p>Next, we conducted a further examination of the discrepancy in item response times between the two groups across the three domains. We consistently observed that MLs spent more time, on average, regardless of domain or ability estimate. That is, MLs tended to spend more time completing the test compared to non‐MLs. Regarding response time for individual items, results of DRT functioning revealed significant differences in the time that ML and non‐ML students with similar abilities spent on a number of items. The consistent directions of the majority of DRT effect sizes suggest that distributions of response time were skewed against ML students. Among the cognitive domains, the tendency for MLs to have longer mean response times was most pronounced in mathematics.</p> <p>As suggested by Figure 5, it is worth noting that the items identified as having a DIF issue did not substantially overlap with the items with a DRT issue. This suggests that the reasons why students with similar ability levels struggled to answer items and the reasons why they spent more time on them did not always align. It is possible that MLs, when faced with less time pressure, could devote extra time to solve certain items, ultimately answering them correctly. On the other hand, there were a few items that displayed both DIF and DRT issues, indicating that MLs with similar abilities not only needed more time but still could not answer them correctly.
To effectively identify potentially problematic items, we recommend isolating items with each type of issue (or both) and further interpreting the nuances by examining item characteristic estimates, self‐report difficulty in relation to linguistic and cultural factors, and item format (multiple‐choice versus open‐ended options).</p> <p>Finally, the results of ANCOVA analyses, utilizing the WLE estimates of latent abilities, suggested that reading comprehension significantly influences success in STEM‐related subjects: once reading scores were controlled, the ability scores of MLs in mathematics no longer differed from those of non‐MLs. While reading comprehension remains influential in science ability scores, our findings indicate that the disparity between groups in science scores can be attributed not only to reading skills but also to other factors. Such factors could include student and school‐related variables such as decontextualized language, students' self‐assessment of their proficiency in the language of instruction, socioeconomic status, etc. (Hiebert & Lubliner, [<reflink idref="bib24" id="ref76">24</reflink>]; Van Laere et al., [<reflink idref="bib53" id="ref77">53</reflink>]). Alternatively, addressing sociocultural factors underlying the test items and a linguistically modified version of the same items could enhance MLs' performance in the domain (Abedi et al., [<reflink idref="bib2" id="ref78">2</reflink>]; Abedi & Gándara, [<reflink idref="bib3" id="ref79">3</reflink>]). The findings of the current study emphasize the necessity for further development of assessments that are equitable for students from diverse cultural and linguistic backgrounds. These assessments should provide students with access to additional time or effective resources to address construct‐irrelevant factors, while maintaining test validity.</p> <hd id="AN0183983946-24">Limitation and Conclusion</hd> <p>There are several limitations in this study.
First, while the study focuses on the test‐taking behavior of MLs, there may be variations among students within the groups, depending on external factors including socioeconomic status (e.g., low‐SES MLs versus high‐SES MLs), which could constitute an additional layer of subgroups concerning DIF and DRT. To address this, an extension of methods capable of handling multiple groups with adequate sample sizes for these finely differentiated subgroups is necessary. Second, regarding DRT functioning, we note that relatively few studies exist in this area, likely because both the data and the methods have become available only recently. We recommend further research to offer guidance on the practical implications of the DRT effect size (comparable to the RMSD criterion in DIF) in assessments such as PISA and other large‐scale evaluations.</p> <p>Revisiting our introduction on NCLB, the study's findings of test‐taking behaviors and achievement outcomes still reveal distinct disparities between MLs and their non‐ML peers, within the context of NCLB policy‐based discussions about educational equity. Despite the enactment of the policy aimed at ensuring equal access to education, these results indicate persistent challenges to fair assessment practices, particularly for MLs in K–12 education settings.</p> <p>Bringing attention to the complexity of accurately assessing MLs' abilities, our results align with concerns raised by Abedi ([<reflink idref="bib1" id="ref80">1</reflink>]) and Solano‐Flores ([<reflink idref="bib48" id="ref81">48</reflink>]) about the potential for construct‐irrelevant factors in conventional assessments. Additionally, in line with Bailey and Wolf ([<reflink idref="bib10" id="ref82">10</reflink>]), this study reinforces the importance of considering linguistic complexity and cultural relevance in assessments. It provides empirical evidence of performance disparities between MLs and non‐MLs on standardized tests, despite similar cognitive abilities.
Such discrepancies challenge the validity of these assessments for MLs and emphasize the need for further investigation for test developers and teachers to better interpret the nuances by examining item characteristic estimates, self‐report difficulty in relation to linguistic and cultural factors, and item format (multiple‐choice versus open‐ended options).</p> <p>In conclusion, this study has addressed a gap in the literature by focusing on within‐country comparability issues in the large‐scale assessment of PISA 2018. This focus is particularly relevant given the increasing linguistic diversity in U.S. schools and the necessity for policies and practices that rectify the principles of educational equity established by NCLB. The findings necessitate a reexamination of policies to ensure they are effectively contributing to the academic success of MLs. This may require a comprehensive review and revision of accountability measures, including the criteria and processes used to evaluate the performance of schools and students. Furthermore, it is essential that schools are equipped with the necessary resources, and that teachers participate in professional development programs designed to effectively support culturally and linguistically diverse students.</p> <hd id="AN0183983946-25">Conflict of Interest Statement</hd> <p>The authors report there are no competing interests to declare.</p> <ref id="AN0183983946-26"> <title> References </title> <blist> <bibl id="bib1" idref="ref22" type="bt">1</bibl> <bibtext> Abedi, J. (2004). The No Child Left Behind Act and English language learners: Assessment and accountability issues. Educational Researcher, 33 (1), pp. 4 – 14. https://doi.org/10.3102/0013189X033001004</bibtext> </blist> <blist> <bibl id="bib2" idref="ref65" type="bt">2</bibl> <bibtext> Abedi, J., Courtney, M., & Leon, S. (2003). Effectiveness and validity of accommodations for English language learners in large‐scale assessments (CSE Tech. Rep. No. 608).
Los Angeles : University of California, National Center for Research on Evaluation, Standards, and Student Testing.</bibtext> </blist> <blist> <bibl id="bib3" idref="ref53" type="bt">3</bibl> <bibtext> Abedi, J., & Gándara, P. (2006). Performance of English language learners as a subgroup in large‐scale assessment: Interaction of research and policy. Educational Measurement, Issues and Practice, 25 (4), pp. 36 – 46.</bibtext> </blist> <blist> <bibl id="bib4" idref="ref16" type="bt">4</bibl> <bibtext> Abedi, J. (2007). Language factors in the assessment of English language learners: Theory and principles underlying the linguistic modification approach. Paper developed for the U.S. Department of Education LEP Partnership. Retrieved from <ulink href="http://www.ncela.gwu.edu/files/uploads/11/abedi%5fsato.pdf">http://www.ncela.gwu.edu/files/uploads/11/abedi%5fsato.pdf</ulink></bibtext> </blist> <blist> <bibl id="bib5" idref="ref7" type="bt">5</bibl> <bibtext> Adger, C. T., Snow, C. E., & Christian, D. (Eds.). (2018). What teachers need to know about language (Vol. 2). Multilingual Matters.</bibtext> </blist> <blist> <bibl id="bib6" type="bt">6</bibl> <bibtext> Akour, M., Sabah, S., & Hammouri, H. (2014). Net and global differential item functioning in PISA polytomously scored science items. Journal of Psychoeducational Assessment, 33 (2), pp. 166 – 176. https://doi.org/10.1177/0734282914541337</bibtext> </blist> <blist> <bibl id="bib7" idref="ref15" type="bt">7</bibl> <bibtext> American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. 
Retrieved from https://<ulink href="http://www.aera.net/Publications/Books/Standards‐for‐Educational‐Psychological‐Testing‐2014‐Edition">www.aera.net/Publications/Books/Standards‐for‐Educational‐Psychological‐Testing‐2014‐Edition</ulink></bibtext> </blist> <blist> <bibl id="bib8" type="bt">8</bibl> <bibtext> Asil, M., & Brown, G. T. L. (2015). Comparing OECD PISA reading in English to other languages: Identifying potential sources of non‐invariance. International Journal of Testing, 16 (1), pp. 71 – 93. https://doi.org/10.1080/15305058.2015.1064431</bibtext> </blist> <blist> <bibl id="bib9" type="bt">9</bibl> <bibtext> Ayvalli, M., & Biçak, B. (2018). An investigation into the measurement invariance of PISA 2012 mathematical literacy test. European Journal of Education Studies, 4, 39 – 58.</bibtext> </blist> <blist> <bibtext> Bailey, A. L., & Wolf, M. K. (2012). The role and importance of content and language integration in large‐scale assessment. In A. L. Bailey (Ed.), The Oxford handbook of language and large‐scale assessment (pp. 58 – 78). Oxford University Press.</bibtext> </blist> <blist> <bibtext> Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores (pp. 397 – 479). Addison‐Wesley.</bibtext> </blist> <blist> <bibtext> Carnoy, M., & García, E. (2017). Five key trends in U.S. student performance: Progress by blacks and Hispanics, the takeoff of Asians, the stall of non‐English speakers, the persistence of socioeconomic gaps, and the damaging effect of highly segregated schools. Economic Policy Institute. 
Retrieved from https://<ulink href="http://www.epi.org/publication/five‐key‐trends‐in‐u‐s‐student‐performance‐progress‐by‐blacks‐and‐hispanics‐the‐takeoff‐of‐asians‐the‐stall‐of‐non‐english‐speakers‐the‐persistence‐of‐socioeconomic‐gaps‐and‐the‐damaging‐effect/">www.epi.org/publication/five‐key‐trends‐in‐u‐s‐student‐performance‐progress‐by‐blacks‐and‐hispanics‐the‐takeoff‐of‐asians‐the‐stall‐of‐non‐english‐speakers‐the‐persistence‐of‐socioeconomic‐gaps‐and‐the‐damaging‐effect/</ulink></bibtext> </blist> <blist> <bibtext> Cummins, J. (2000). Language, power, and pedagogy: Bilingual children in the crossfire. Clevedon, UK : Multilingual Matters.</bibtext> </blist> <blist> <bibtext> Da Costa, P. D., & Araújo, L. (2012). Differential item functioning (DIF): What function differently for Immigrant students in PISA 2009 reading items. JRC Scientific and Policy Reports. Luxembourg: European Commission. <ulink href="http://doi.org/10.2788/60811">http://doi.org/10.2788/60811</ulink></bibtext> </blist> <blist> <bibtext> Dorans, N. J., & Kulick, E. (1986). Demonstrating the utility of the standardization approach to assessing unexpected differential item performance on the Scholastic Aptitude Test. Journal of Educational Measurement, 23 (4), 355 – 368. https://doi.org/10.1111/j.1745‐3984.1986.tb00255.x</bibtext> </blist> <blist> <bibtext> Ercikan, K., Guo, H., & He, Q. (2020). Use of response process data to inform group comparisons and fairness research. Educational Assessment, 25 (3), pp. 179 – 197. https://doi.org/10.1080/10627197.2020.1804353</bibtext> </blist> <blist> <bibtext> Gándara, P., & Hopkins, M. (2010). Forbidden language: English learners and restrictive language policies (pp. 321 – 323). New York : Teachers College Press.</bibtext> </blist> <blist> <bibtext> García, E., & Weiss, E. (2017). Education inequalities at the school starting gate: Gaps, trends, and strategies to address them. Economic Policy Institute. 
Retrieved from https://<ulink href="http://www.epi.org/publication/education‐inequalities‐at‐the‐school‐starting‐gate/">www.epi.org/publication/education‐inequalities‐at‐the‐school‐starting‐gate/</ulink></bibtext> </blist> <blist> <bibtext> Geay, C., McNally, S., & Telhaj, S. (2012). Non‐native speakers of English in the classroom: what are the effects on pupil performance? SSRN Electronic Journal. https://doi.org/10.2139/ssrn.2039637</bibtext> </blist> <blist> <bibtext> Geay, C., McNally, S., & Telhaj, S. (2013). Non‐native speakers of English in the classroom: what are the effects on pupil performance? The Economic Journal, 123 (570), F281 ‐ F307. https://doi.org/10.1111/ecoj.12054</bibtext> </blist> <blist> <bibtext> Gökçe, S., Berberoğlu, G., Wells, C. S., & Sireci, S. G. (2021). Linguistic distance and translation differential item functioning on trends in international mathematics and science study mathematics assessment items. Journal of Psychoeducational Assessment, 39 (6), pp. 728 – 745. https://doi.org/10.1177/07342829211010537</bibtext> </blist> <blist> <bibtext> Guo, H., & Ercikan, K. (2020). Differential rapid responding across language and cultural groups. Educational Research and Evaluation, 26 (5–6), pp. 302 – 327. https://doi.org/10.1080/13803611.2021.1963941</bibtext> </blist> <blist> <bibtext> Guo, H., & Ercikan, K. (2021). Comparing test‐taking behaviors of English language learners (ELLs) to Non‐ELL students: Use of response time in measurement comparability research. ETS Research Report Series, 2021 (1), pp. 1 – 15. https://doi.org/10.1002/ets2.12340</bibtext> </blist> <blist> <bibtext> Hiebert, E. H., & Lubliner, S. (2008). The nature, learning, and instruction of general academic vocabulary. In A. E. Farstrup & S. J. Samuels (Eds.), What research has to say about vocabulary instruction (pp. 106 – 129). Newark, DE : International Reading Association.</bibtext> </blist> <blist> <bibtext> Hopkins, D. J. (2010).
Politicized places: Explaining where and when immigrants provoke local opposition. American Political Science Review, 104 (1), pp. 40 – 60. https://doi.org/10.1017/s0003055409990360</bibtext> </blist> <blist> <bibtext> Kopriva, R. J., Emick, J. E., Hipolito‐Delgado, C. P., & Cameron, C. A. (2007). Do proper accommodation assignments make a difference? Examining the impact of improved decision making on scores for English language learners. Educational Measurement: Issues and Practice, 26 (3), pp. 11 – 20. https://doi.org/10.1111/j.1745‐3992.2007.00097.x</bibtext> </blist> <blist> <bibtext> Huang, X., Wilson, M., & Wang, L. (2014). Exploring plausible causes of differential item functioning in the PISA science assessment: Language, curriculum or culture. Educational Psychology, 36 (2), pp. 378 – 390. https://doi.org/10.1080/01443410.2014.946890</bibtext> </blist> <blist> <bibtext> Joo, S., Ali, U., Robin, F., & Shin, H. J. (2022). Impact of differential item functioning on group score reporting in the context of large‐scale assessments. Large‐Scale Assessments in Education, 10 (1). https://doi.org/10.1186/s40536‐022‐00135‐7</bibtext> </blist> <blist> <bibtext> Joo, S. H., Khorramdel, L., Yamamoto, K., Shin, H. J., & Robin, F. (2021). Evaluating item fit statistic thresholds in PISA: Analysis of cross‐country comparability of cognitive items. Educational Measurement: Issues and Practice, 40 (2), pp. 37 – 48. https://doi.org/10.1111/emip.12404</bibtext> </blist> <blist> <bibtext> Kankaraš, M., & Moors, G. (2013). Analysis of cross‐cultural comparability of PISA 2009 scores. Journal of Cross‐Cultural Psychology, 45 (3), pp. 381 – 399. https://doi.org/10.1177/0022022113511297</bibtext> </blist> <blist> <bibtext> Khorramdel, L., Pokropek, A., Joo, S. H., Kirsch, I., & Halderman, L. (2020). Examining gender DIF and gender differences in the PISA 2018 reading literacy scale: A partial invariance approach. Psychological Test and Assessment Modeling, 62 (2), 179 – 231. 
https://<ulink href="http://www.proquest.com/scholarly‐journals/examining‐gender‐dif‐differences‐pisa‐2018/docview/2426140657/se‐2">www.proquest.com/scholarly‐journals/examining‐gender‐dif‐differences‐pisa‐2018/docview/2426140657/se‐2</ulink></bibtext> </blist> <blist> <bibtext> Krashen, S. D. (1982). Principles and practice in second language acquisition. Oxford, UK : Pergamon.</bibtext> </blist> <blist> <bibtext> Le, L. T. (2009). Investigating gender differential item functioning across countries and test languages for PISA science items. International Journal of Testing, 9 (2), pp. 122 – 133. https://doi.org/10.1080/15305050902880769</bibtext> </blist> <blist> <bibtext> Liu, R., & Bradley, K. D. (2021). Differential item functioning among English language learners on a large‐scale mathematics assessment. Frontiers in Psychology, 12. https://doi.org/10.3389/fpsyg.2021.657335</bibtext> </blist> <blist> <bibtext> Lundgren, E., & Eklöf, H. (2020). Within‐item response processes as indicators of test‐taking effort and motivation. Educational Research and Evaluation, 26 (5–6), pp. 275 – 301. https://doi.org/10.1080/13803611.2021.1963940</bibtext> </blist> <blist> <bibtext> Mahoney, K. (2008). Linguistic influences on differential item functioning for second language learners on the National Assessment of Educational Progress. International Journal of Testing, 8 (1), pp. 14 – 33. https://doi.org/10.1080/15305050701808615</bibtext> </blist> <blist> <bibtext> Martiniello, M. (2009). Linguistic complexity, schematic representations, and differential item functioning for English language learners in math tests. Educational Assessment, 14 (3–4), pp. 160 – 179. https://doi.org/10.1080/10627190903422906</bibtext> </blist> <blist> <bibtext> Menken, K., & Kleyn, T. (2010). The long‐term impact of subtractive schooling in the educational experiences of secondary English language learners. International Journal of Bilingual Education and Bilingualism, 13 (4), pp. 399 – 417. 
https://doi.org/10.1080/13670050903370143</bibtext> </blist> <blist> <bibtext> Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. ETS Research Report Series, 1992 (1), pp. i – 23.</bibtext> </blist> <blist> <bibtext> National Center for Education Statistics. (2020). U.S. Department of Education. Institute of Education Sciences, National Center for Education Statistics.</bibtext> </blist> <blist> <bibtext> National Research Council. (2001). Testing English‐language learners in US schools: Report and workshop summary. National Academies Press. https://doi.org/10.17226/9998</bibtext> </blist> <blist> <bibtext> OECD. (2019). PISA 2018 Assessment and Analytical Framework. Paris : OECD Publishing. https://doi.org/10.1787/b25efab8‐en</bibtext> </blist> <blist> <bibtext> Oliveri, M. E., Ercikan, K., & Simon, M. (2015). A framework for developing comparable multilingual assessments for minority populations: Why context matters. International Journal of Testing, 15 (2), pp. 94 – 113. https://doi.org/10.1080/15305058.2014.986271</bibtext> </blist> <blist> <bibtext> Phinney, J. S., Berry, J. W., Sam, D. L., & Vedder, P. (2022). Understanding immigrant youth: Conclusions and implications. In J. W. Berry, J. S. Phinney, D. L. Sam, & P. Vedder (Eds.), Immigrant youth in cultural transition (pp. 212 – 236). Routledge.</bibtext> </blist> <blist> <bibtext> Reckase, M. D. (2009). Multidimensional item response theory. New York : Springer‐Verlag.</bibtext> </blist> <blist> <bibtext> Sam, D. L., Vedder, P., Ward, C., & Horenczyk, G. (2022). Psychological and sociocultural adaptation of immigrant youth. In J. W. Berry, J. S. Phinney, D. L. Sam, & P. Vedder (Eds.), Immigrant youth in cultural transition (pp. 119 – 143). Routledge.</bibtext> </blist> <blist> <bibtext> Shin, H. J., Kerzabi, E., Joo, S. H., Robin, F., & Yamamoto, K. (2020). Comparability of response time scales in PISA. Psychological Test and Assessment Modeling, 62 (1), pp. 107 – 135. 
https://shorturl.at/klAC3</bibtext> </blist> <blist> <bibtext> Solano‐Flores, G. (2016). Assessing the cultural validity of assessment practices: An introduction. In G. Solano‐Flores (Ed.), Cultural validity in assessment: Addressing linguistic and cultural diversity (pp. 1 – 13). Routledge.</bibtext> </blist> <blist> <bibtext> Solano‐Flores, G., & Trumbull, E. (2003). Examining language in context: The need for new research and practice paradigms in the testing of English‐language learners. Educational Researcher, 32 (2), pp. 3 – 13. https://doi.org/10.3102/0013189X032002003</bibtext> </blist> <blist> <bibtext> Thomas, W. P., & Collier, V. P. (2002). A national study of school effectiveness for language minority students' long‐term academic achievement. Washington, DC : Center for Research on Education, Diversity & Excellence.</bibtext> </blist> <blist> <bibtext> Noble, T., Wells, C. S., & Rosebery, A. S. (2023). English learners and constructed‐response science test items challenges and opportunities. Educational Assessment, pp. 1 – 27. https://doi.org/10.1080/10627197.2023.2226387</bibtext> </blist> <blist> <bibtext> Umansky, I. M., & Reardon, S. F. (2014). Reclassification patterns among Latino English learner students in bilingual, dual immersion, and English immersion classrooms. American Educational Research Journal, 51 (5), pp. 879 – 912. https://doi.org/10.3102/0002831214545110</bibtext> </blist> <blist> <bibtext> Van Laere, E., Aesaert, K., & van Braak, J. (2014). The Role of Students' Home Language in Science Achievement: A multilevel approach. International Journal of Science Education, 36 (16), pp. 2772 – 2794.</bibtext> </blist> <blist> <bibtext> von Davier, M. (2005). mdltm: Software for the general diagnostic model and for estimating mixtures of multidimensional discrete latent traits models [Computer software]. Princeton, NJ : ETS.</bibtext> </blist> <blist> <bibtext> von Davier, M., Yamamoto, K., Shin, H.
J., Chen, H., Khorramdel, L., Weeks, J., ... & Kandathil, M. (2019). Evaluating item response theory linking and model fit for data from PISA 2000–2012. Assessment in Education: Principles, Policy & Practice, 26 (4), 466 – 488. https://doi.org/10.1080/0969594X.2019.1586642</bibtext> </blist> <blist> <bibtext> Walker, C. M., Zhang, B., & Surber, J. (2008). Using a multidimensional differential item functioning framework to determine if reading ability affects student performance in mathematics. Applied Measurement in Education, 21 (2), pp. 162 – 181. https://doi.org/10.1080/08957340801926201</bibtext> </blist> <blist> <bibtext> Warm, T. A. (1989). Weighted likelihood estimation of ability in item response theory. Psychometrika, 54 (3), 427 – 450. https://doi.org/10.1007/BF02294627</bibtext> </blist> <blist> <bibtext> Wolf, M. K., & Leon, S. (2009). An investigation of the language demands in content assessments for English language learners. Educational Assessment, 14 (3–4), pp. 139 – 159. https://doi.org/10.1080/10627190903425883</bibtext> </blist> <blist> <bibtext> Zhou, M., & Xiong, Y. S. (2005). The multifaceted American experiences of the children of Asian immigrants: Lessons for segmented assimilation. Ethnic and Racial Studies, 28 (6), pp. 1119 – 1152. https://doi.org/10.1080/01419870500224455</bibtext> </blist> <blist> <bibtext> Zwick, R., & Thayer, D. T. (1996). Evaluating the magnitude of differential item functioning in polytomous items. Journal of Educational and Behavioral Statistics, 21 (3), 187 – 201. 
https://doi.org/10.3102/10769986021003187</bibtext> </blist> </ref> <aug> <p>By Jung Yeon Park; Sean Joo; Zikun Li and Hyejin Yoon</p> </aug>
Header DbId: eric
DbLabel: ERIC
An: EJ1460468
AccessLevel: 3
PubType: Academic Journal
PubTypeId: academicJournal
Items – Name: Title
  Label: Title
  Group: Ti
  Data: Measurement Invariance for Multilingual Learners Using Item Response and Response Time in PISA 2018
– Name: Language
  Label: Language
  Group: Lang
  Data: English
– Name: Author
  Label: Authors
  Group: Au
  Data: Jung Yeon Park, Sean Joo (ORCID 0000-0003-4861-4362), Zikun Li (ORCID 0000-0002-3572-707X), Hyejin Yoon
– Name: TitleSource
  Label: Source
  Group: Src
  Data: Educational Measurement: Issues and Practice. 2025 44(1):55-65.
– Name: Avail
  Label: Availability
  Group: Avail
  Data: Wiley. Available from: John Wiley & Sons, Inc. 111 River Street, Hoboken, NJ 07030. Tel: 800-835-6770; e-mail: cs-journals@wiley.com; Web site: https://www.wiley.com/en-us
– Name: PeerReviewed
  Label: Peer Reviewed
  Group: SrcInfo
  Data: Y
– Name: Pages
  Label: Page Count
  Group: Src
  Data: 11
– Name: DatePubCY
  Label: Publication Date
  Group: Date
  Data: 2025
– Name: TypeDocument
  Label: Document Type
  Group: TypDoc
  Data: Journal Articles, Reports - Research
– Name: Audience
  Label: Education Level
  Group: Audnce
  Data: Secondary Education
– Name: Subject
  Label: Descriptors
  Group: Su
  Data: Achievement Tests, Secondary School Students, International Assessment, Test Bias, Multilingualism, Monolingualism, Reaction Time, Native Language, Research Problems
– Name: Subject
  Label: Geographic Terms
  Group: Su
  Data: United States
– Name: SubjectThesaurus
  Label: Assessment and Survey Identifiers
  Group: Su
  Data: Program for International Student Assessment
– Name: DOI
  Label: DOI
  Group: ID
  Data: 10.1111/emip.12640
– Name: ISSN
  Label: ISSN
  Group: ISSN
  Data: 0731-1745, 1745-3992
– Name: Abstract
  Label: Abstract
  Group: Ab
  Data: This study examines potential assessment bias based on students' primary language status in PISA 2018. Specifically, multilingual (MLs) and nonmultilingual (non-MLs) students in the United States are compared with regard to their response time as well as scored responses across three cognitive domains (reading, mathematics, and science). Differential item functioning (DIF) analysis reveals that 7-14% of items exhibit DIF-related problems in scored responses between the two groups, aligning with PISA technical report results. While MLs generally spend more time on the test than non-MLs across cognitive levels, differential response time (DRT) functioning identifies significant time differences in 7-10% of items for students with similar cognitive levels. It was noticeable that items with DIF and DRT issues show limited overlap, suggesting diverse reasons for student struggles in the assessment. A deeper examination of item characteristics is recommended for test developers and teachers to gain a better understanding of these nuances.
– Name: AbstractInfo
  Label: Abstractor
  Group: Ab
  Data: As Provided
– Name: DateEntry
  Label: Entry Date
  Group: Date
  Data: 2025
– Name: AN
  Label: Accession Number
  Group: ID
  Data: EJ1460468
PLink https://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=eric&AN=EJ1460468
RecordInfo BibRecord:
  BibEntity:
    Identifiers:
      – Type: doi
        Value: 10.1111/emip.12640
    Languages:
      – Text: English
    PhysicalDescription:
      Pagination:
        PageCount: 11
        StartPage: 55
    Subjects:
      – SubjectFull: Achievement Tests
        Type: general
      – SubjectFull: Secondary School Students
        Type: general
      – SubjectFull: International Assessment
        Type: general
      – SubjectFull: Test Bias
        Type: general
      – SubjectFull: Multilingualism
        Type: general
      – SubjectFull: Monolingualism
        Type: general
      – SubjectFull: Reaction Time
        Type: general
      – SubjectFull: Native Language
        Type: general
      – SubjectFull: Research Problems
        Type: general
      – SubjectFull: United States
        Type: general
      – SubjectFull: Program for International Student Assessment
        Type: general
    Titles:
      – TitleFull: Measurement Invariance for Multilingual Learners Using Item Response and Response Time in PISA 2018
        Type: main
  BibRelationships:
    HasContributorRelationships:
      – PersonEntity:
          Name:
            NameFull: Jung Yeon Park
      – PersonEntity:
          Name:
            NameFull: Sean Joo
      – PersonEntity:
          Name:
            NameFull: Zikun Li
      – PersonEntity:
          Name:
            NameFull: Hyejin Yoon
    IsPartOfRelationships:
      – BibEntity:
          Dates:
            – D: 01
              M: 03
              Type: published
              Y: 2025
          Identifiers:
            – Type: issn-print
              Value: 0731-1745
            – Type: issn-electronic
              Value: 1745-3992
          Numbering:
            – Type: volume
              Value: 44
            – Type: issue
              Value: 1
          Titles:
            – TitleFull: Educational Measurement: Issues and Practice
              Type: main