Conditioning on the Pre-Test versus Gain Score Modelling: Revisiting the Controversy in a Multilevel Setting
| Title: | Conditioning on the Pre-Test versus Gain Score Modelling: Revisiting the Controversy in a Multilevel Setting |
|---|---|
| Language: | English |
| Authors: | Bruno Arpino (ORCID |
| Source: | Evaluation Review. 2025 49(2):179-208. |
| Availability: | SAGE Publications. 2455 Teller Road, Thousand Oaks, CA 91320. Tel: 800-818-7243; Tel: 805-499-9774; Fax: 800-583-2665; e-mail: journals@sagepub.com; Web site: https://sagepub.com |
| Peer Reviewed: | Y |
| Page Count: | 30 |
| Publication Date: | 2025 |
| Document Type: | Journal Articles Reports - Research |
| Descriptors: | Scores, Pretesting, Conditioning, Achievement Gains, Comparative Analysis, Outcomes of Treatment, Hierarchical Linear Modeling, Context Effect |
| DOI: | 10.1177/0193841X241246833 |
| ISSN: | 0193-841X 1552-3926 |
| Abstract: | We consider estimating the effect of a treatment on a given outcome measured on subjects tested both before and after treatment assignment in observational studies. A vast literature compares the competing approaches of modelling the post-test score conditionally on the pre-test score versus modelling the difference, namely, the gain score. Our contribution lies in analyzing the merits and drawbacks of two approaches in a multilevel setting. This is relevant in many fields, such as education, where students are nested within schools. The multilevel structure raises peculiar issues related to contextual effects and the distinction between individual-level and cluster-level treatments. We compare the two approaches through a simulation study. For individual-level treatments, our findings align with existing literature. However, for cluster-level treatments, the scenario is more complex, as the cluster mean of the pre-test score plays a key role. Its reliability crucially depends on the cluster size, leading to potentially unsatisfactory estimators with small clusters. |
| Abstractor: | As Provided |
| Entry Date: | 2025 |
| Accession Number: | EJ1466338 |
| Database: | ERIC |
<title id="AN0183370750-1">Conditioning on the Pre-Test versus Gain Score Modelling: Revisiting the Controversy in a Multilevel Setting </title> <p>We consider estimating the effect of a treatment on a given outcome measured on subjects tested both before and after treatment assignment in observational studies. A vast literature compares the competing approaches of modelling the post-test score conditionally on the pre-test score versus modelling the difference, namely, the gain score. Our contribution lies in analyzing the merits and drawbacks of two approaches in a multilevel setting. This is relevant in many fields, such as education, where students are nested within schools. The multilevel structure raises peculiar issues related to contextual effects and the distinction between individual-level and cluster-level treatments. We compare the two approaches through a simulation study. For individual-level treatments, our findings align with existing literature. However, for cluster-level treatments, the scenario is more complex, as the cluster mean of the pre-test score plays a key role. 
Its reliability crucially depends on the cluster size, leading to potentially unsatisfactory estimators with small clusters.</p> <p>Keywords: achievement tests; causal inference; common trend assumption; random effects model; reliability; treatment effect</p> <hd id="AN0183370750-2">Introduction</hd> <p>We consider the problem of estimating a treatment effect on an outcome measured at one time point before and one time point after the treatment assignment in observational studies where individuals are not randomly allocated to the treatment. In particular, we focus on a typical situation in the educational setting, where the performance of a student is assessed by two achievement tests at different educational stages, and the target is to estimate the effect of a school-level or student-level treatment applied between the two tests, hence named pre-test and post-test.</p> <p>The evaluation of the treatment effect can be achieved by means of two approaches (see, for instance, [<reflink idref="bib7" id="ref1">7</reflink>]; [<reflink idref="bib24" id="ref2">24</reflink>]): the conditioning approach and the gain score approach. The <emph>conditioning</emph> approach consists in estimating the effect of the treatment on the post-test score, conditionally on the pre-test score. In the <emph>gain score</emph> approach, instead, the analysis is carried out on the gain score (also known as change score), given by the difference between the post-test score and the pre-test score.</p> <p>Disappointingly, the conditioning and gain score approaches may give contradictory results in observational studies (Lord's paradox: [<reflink idref="bib25" id="ref3">25</reflink>]). The debate on which of the two approaches should be preferred is still ongoing. Recently, [<reflink idref="bib35" id="ref4">35</reflink>] and [<reflink idref="bib19" id="ref5">19</reflink>] reconsidered this issue using graphical models to show the conditions under which each approach should be preferred. 
Despite the important contribution of these and several other studies that we review later, the literature has overlooked the fact that test scores are often collected in multilevel settings: standardised scores on students' performance represent the prototypical case with students nested within schools. Indeed, analyses in the educational field are commonly carried out by means of multilevel models ([<reflink idref="bib34" id="ref6">34</reflink>]), mostly adopting the conditioning approach ([<reflink idref="bib30" id="ref7">30</reflink>]).</p> <p>In our contribution, we assess the effect of a given treatment on the post-test. To this end, we compare the competing approaches of modelling the post-test conditional on the pre-test versus modelling the gain score, explicitly accounting for the multilevel structure of the data. The multilevel setting generates new scenarios: in particular, the comparison between the two approaches depends on whether the treatment of interest is at the individual level or at the cluster level. As an example of an individual-level treatment, [<reflink idref="bib15" id="ref8">15</reflink>] found a positive effect of after-class tutoring on secondary school students' achievement, measured as the average scores of second-year Korean, mathematics and English tests. As an example of a cluster-level treatment, [<reflink idref="bib10" id="ref9">10</reflink>] estimated the effect of the participation of schools in a program of investments in information and communication technologies on students' performance in mathematics and language, finding a positive impact on mathematics test scores.</p> <p>The remainder of the paper is organised as follows. In Section 'Motivating example: preparing students for standardised testing', we illustrate a study on real data that motivates our work. In Section 'Literature overview', we briefly summarise the debate on the use of conditioning versus gain score approaches. 
In Section 'Conditioning and gain score approaches in a multilevel setting', we compare the two approaches in a multilevel setting, whereas in Section 'Simulation study', we devise a simulation study to assess the performance of the estimators. We end with concluding remarks summarising our findings and giving directions for future research.</p> <hd id="AN0183370750-3">Motivating Example: Preparing Students for Standardised Testing</hd> <p>In the Italian school system, the National Institute for the Evaluation of the Education and Training System (INVALSI, https://<ulink href="http://www.invalsiopen.it/">www.invalsiopen.it/</ulink>) is responsible for conducting standardised tests to assess students' proficiency in Italian, Mathematics and English on a large scale. These tests are administered to students attending primary school at the end of the second grade and at the end of the fifth grade, to students in the final year of middle school (eighth grade), and to students in high school at the end of the 10th and 13th grades.</p> <p>In this study, we focus on a cohort of students who participated in the annual standardised Mathematics test in the second grade (year 2013–14) and fifth grade (year 2016–17). Our goal is to evaluate how specific training provided to fifth-grade students impacts their test scores. To this end, we selected a sample of 1110 students from 97 classes where no specific training for the Mathematics test was provided by teachers in the second grade. In most of those classes, tailored training was implemented in the fifth grade.</p> <p>Preparing primary school students for standardised tests involves providing them with a mock exam that simulates the previous year's test. 
This allows them to become familiar with the types of questions they will encounter, estimate the time required to complete the test and identify the most important materials to study in advance.</p> <p>The Mathematics test is composed of multiple-choice items that are dichotomously scored (1 for a correct answer, 0 for a wrong answer). The selection of test items is based on internationally validated methods relying on the Rasch model ([<reflink idref="bib29" id="ref10">29</reflink>]). The raw score (i.e. the total number of correct answers to test items) is a minimal sufficient statistic for the unobservable ability and thus serves as a measure of student ability. The scores are then normalised to a national average of 200 with a standard deviation of 40.</p> <p>Table 1 shows the average scores on the Mathematics test for second and fifth-grade students, distinguishing between treated pupils (i.e. those who received training) and untreated pupils (i.e. those without specific training).</p> <p>Table 1. 
INVALSI Math Test: Second and Fifth Grade Average Scores.</p> <p> <ephtml> &lt;table&gt;&lt;thead valign="top"&gt;&lt;tr&gt;&lt;th align="left" rowspan="2"&gt;Treated&lt;/th&gt;&lt;th align="center" colspan="2"&gt;Grade&lt;/th&gt;&lt;th align="center"&gt;n&lt;/th&gt;&lt;th align="center"&gt;n&lt;/th&gt;&lt;/tr&gt;&lt;tr&gt;&lt;th align="center"&gt;2nd&lt;/th&gt;&lt;th align="center"&gt;5th&lt;/th&gt;&lt;th align="center"&gt;Pupils&lt;/th&gt;&lt;th align="center"&gt;Classes&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody valign="top"&gt;&lt;tr&gt;&lt;td align="left"&gt;1 yes&lt;/td&gt;&lt;td align="char" char="."&gt;203.74&lt;/td&gt;&lt;td align="char" char="."&gt;203.88&lt;/td&gt;&lt;td align="char" char="."&gt;941&lt;/td&gt;&lt;td align="char" char="."&gt;84&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="left"&gt;0 no&lt;/td&gt;&lt;td align="char" char="."&gt;212.63&lt;/td&gt;&lt;td align="char" char="."&gt;192.30&lt;/td&gt;&lt;td align="char" char="."&gt;169&lt;/td&gt;&lt;td align="char" char="."&gt;13&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="left"&gt;Total&lt;/td&gt;&lt;td align="char" char="."&gt;205.09&lt;/td&gt;&lt;td align="char" char="."&gt;202.12&lt;/td&gt;&lt;td align="char" char="."&gt;1110&lt;/td&gt;&lt;td align="char" char="."&gt;97&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt; </ephtml> </p> <p>The descriptive table reveals that the average score from the second grade to the fifth grade is stable for treated pupils, while it decreases by approximately 20 points for untreated pupils.</p> <p>To evaluate the impact of the training, we specify a multilevel random intercept model ([<reflink idref="bib34" id="ref11">34</reflink>]) with pupils at level 1 and classes at level 2. Assuming that the treatment effect is constant, we alternatively apply the conditioning and the gain score approaches, as outlined in the Introduction.</p> <p>In the conditioning approach, the response variable is the post-test score (i.e. 
the fifth-grade test score), while the pre-test score (i.e. the second-grade test score) is included as a covariate. In the gain score approach, the response variable is the gain score, defined as the difference between the fifth and second-grade test scores, and the pre-test score is omitted from the covariates. Both models include the treatment variable, defined as an indicator for tailored test training. Moreover, in multilevel modelling, it is customary to include the group mean of a covariate to account for contextual effects. Therefore, both models are alternatively specified with and without the class mean of the pre-test score. In this illustrative example, we keep the models as simple as possible by omitting other student and class characteristics. Table 2 reports the results of the estimated models.</p> <p>Table 2. Random Intercept Models for Fifth Grade Math Test: Conditional and Gain Score Specifications Without and With Class Means (CM) (Standard Errors in Parentheses).</p> <p> <ephtml> &lt;table&gt;&lt;thead valign="top"&gt;&lt;tr&gt;&lt;th align="left" rowspan="2"&gt;Variable&lt;/th&gt;&lt;th align="center" colspan="4"&gt;Models&lt;/th&gt;&lt;/tr&gt;&lt;tr&gt;&lt;th align="left"&gt;Conditioning&lt;/th&gt;&lt;th align="center"&gt;Conditioning with CM&lt;/th&gt;&lt;th align="center"&gt;Gain score&lt;/th&gt;&lt;th align="center"&gt;Gain score with CM&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody valign="top"&gt;&lt;tr&gt;&lt;td align="left" rowspan="2"&gt;Constant&lt;/td&gt;&lt;td align="center"&gt;60.603&lt;/td&gt;&lt;td align="center"&gt;99.969&lt;/td&gt;&lt;td align="center"&gt;&amp;#8722;22.479&lt;/td&gt;&lt;td align="center"&gt;98.800&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center"&gt;(6.543)&lt;/td&gt;&lt;td align="center"&gt;(18.725)&lt;/td&gt;&lt;td align="center"&gt;(5.429)&lt;/td&gt;&lt;td align="center"&gt;(19.141)&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="left" rowspan="2"&gt;Pre-test score&lt;/td&gt;&lt;td 
align="center"&gt;0.611&lt;/td&gt;&lt;td align="center"&gt;0.623&lt;/td&gt;&lt;td align="left" /&gt;&lt;td align="left" /&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center"&gt;(0.021)&lt;/td&gt;&lt;td align="center"&gt;(0.022)&lt;/td&gt;&lt;td align="left" /&gt;&lt;td align="left" /&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="left" rowspan="2"&gt;CM pre-test score&lt;/td&gt;&lt;td align="left" /&gt;&lt;td align="center"&gt;&amp;#8722;0.197&lt;/td&gt;&lt;td align="left" /&gt;&lt;td align="center"&gt;&amp;#8722;0.567&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="left" /&gt;&lt;td align="center"&gt;(0.088)&lt;/td&gt;&lt;td align="left" /&gt;&lt;td align="center"&gt;(0.087)&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="left" rowspan="2"&gt;Treatment&lt;/td&gt;&lt;td align="center"&gt;20.319&lt;/td&gt;&lt;td align="center"&gt;18.172&lt;/td&gt;&lt;td align="center"&gt;24.900&lt;/td&gt;&lt;td align="center"&gt;17.905&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center"&gt;(5.054)&lt;/td&gt;&lt;td align="center"&gt;(5.068)&lt;/td&gt;&lt;td align="center"&gt;(5.863)&lt;/td&gt;&lt;td align="center"&gt;(5.039)&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="left" colspan="5"&gt;Variance components&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="left"&gt; Level 2 (class)&lt;/td&gt;&lt;td align="center"&gt;192.414&lt;/td&gt;&lt;td align="center"&gt;185.303&lt;/td&gt;&lt;td align="center"&gt;262.961&lt;/td&gt;&lt;td align="center"&gt;167.723&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="left"&gt; Level 1 (pupil)&lt;/td&gt;&lt;td align="center"&gt;679.266&lt;/td&gt;&lt;td align="center"&gt;677.547&lt;/td&gt;&lt;td align="center"&gt;877.610&lt;/td&gt;&lt;td align="center"&gt;868.533&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="left"&gt; ICC&lt;/td&gt;&lt;td align="center"&gt;0.221&lt;/td&gt;&lt;td align="center"&gt;0.215&lt;/td&gt;&lt;td align="center"&gt;0.231&lt;/td&gt;&lt;td align="center"&gt;0.162&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt; </ephtml> </p> <p>Examining the models without the 
class mean (the Conditioning and Gain score columns of Table 2), the two competing approaches yield noticeably different results regarding the treatment effect. According to the conditioning approach, treated pupils obtain on average 20.3 points more than untreated pupils, whereas this difference rises to 24.9 with the gain score approach. Furthermore, when the class mean is included in the models, the estimated treatment effect diminishes and the disparity between the two approaches nearly vanishes (18.2 for the conditioning approach and 17.9 for the gain score approach).</p> <p>To explain the patterns observed in this case study, we need to examine the assumptions underlying the two modelling approaches, in particular to clarify the theoretical reasons supporting the inclusion of the cluster mean of the pre-test score in the analysis.</p> <hd id="AN0183370750-4">Literature Overview</hd> <p>Several studies contributed to the debate about the performance of the conditioning and gain score approaches. In applications, the conditioning approach is usually preferred, perhaps because of the sensitivity of the gain score approach to regression toward the mean ([<reflink idref="bib36" id="ref12">36</reflink>]). In the 1990s, [<reflink idref="bib2" id="ref13">2</reflink>] resumed the debate on the two methods, partly disproving this widespread belief and showing that the gain score method has to be considered superior when the treatment is subsequent to the pre-test and is uncorrelated with the time-varying components of the pre-test. Afterwards, [<reflink idref="bib27" id="ref14">27</reflink>] and, more recently, [<reflink idref="bib38" id="ref15">38</reflink>] showed that under randomisation of the treatment assignment, both approaches are unbiased but the conditioning one is more powerful, whereas under assignment depending on the pre-test score only the conditioning approach is unbiased. 
These conclusions hold regardless of whether the pre-test suffers from measurement error ([<reflink idref="bib27" id="ref16">27</reflink>]). Although these results seem to favour the conditioning approach, they are anything but definitive, as discussed by [<reflink idref="bib38" id="ref17">38</reflink>], who provided a novel interpretation of the two approaches in terms of models for repeated measures. On the one hand, the conditioning approach yields unbiased estimates provided that, conditionally on the pre-test and possibly other covariates, pre-treatment differences between the control and treatment groups are absent, implying the regression of the post-test scores toward a common mean if neither group is treated. On the other hand, the estimates of the gain score approach are unbiased if the change in the outcome over time for treated units would have been the same as that for untreated units in the absence of the treatment. Unfortunately, both assumptions are generally untestable, although sensitivity analyses and indirect tests can be implemented in the presence of multiple control clusters or multiple pre-treatment measurements (see [<reflink idref="bib32" id="ref18">32</reflink>]; [<reflink idref="bib39" id="ref19">39</reflink>]). [<reflink idref="bib19" id="ref20">19</reflink>] further contributed to the debate focussing on a causal inference perspective and considering the situation in which the treatment assignment is based on an unobservable variable (e.g. latent ability), as often happens in observational studies.</p> <p>The causal inference literature makes it clear that the conditioning approach yields valid inferences under the unconfoundedness assumption, namely, when there is conditional independence between treatment and potential outcomes ([<reflink idref="bib3" id="ref21">3</reflink>]; [<reflink idref="bib18" id="ref22">18</reflink>]). 
On the other hand, the gain score approach entails taking the first difference of the outcome and thus removes confounding under the so-called <emph>common trend assumption</emph> (e.g. [<reflink idref="bib23" id="ref23">23</reflink>]), which requires that the average outcomes in the treated and untreated groups in the absence of treatment would have followed parallel paths over time ([<reflink idref="bib12" id="ref24">12</reflink>]). To satisfy this assumption, in the case of a continuous treatment, for each value of the treatment, trends in average outcomes should be parallel with respect to all other values of the treatment ([<reflink idref="bib8" id="ref25">8</reflink>]). The typical approach is to control for observed confounders to make the assumptions underlying the methods (conditionally on these controls) more plausible. Note that the gain score approach is a special case of the difference-in-differences design (see [<reflink idref="bib39" id="ref26">39</reflink>]) that can be generalised to multiple time points and treatment/control groups. In cases where multiple pre-treatment observations (multiple pre-tests in our context) are available, falsification tests of the common trend assumption can be implemented; that is, it can be tested whether pre-treatment trends are parallel, a hypothesis that the data often reject.</p> <p>Exploiting graphical models, [<reflink idref="bib19" id="ref27">19</reflink>] derived formulae for the bias of causal effect estimators under the two approaches. In particular, they considered a linear data generating model with constant effects across units, where a latent ability affects three observed variables, namely, the treatment variable, the pre-test score and the post-test score; additionally, the treatment variable affects the post-test score. In this setting, under the common trend assumption, low pre-test reliability (high measurement error) favours the gain score approach. 
However, this assumption cannot be tested with only two measurements of the test score. In general, in non-randomised studies, the choice of the method in the presence of measurement error depends on specific assumptions, as outlined by [<reflink idref="bib38" id="ref28">38</reflink>]. [<reflink idref="bib19" id="ref29">19</reflink>] also considered more complex scenarios, such as when the treatment assignment is directly affected by the pre-test score, in addition to the latent ability: this situation complicates the assessment of the bias under the gain score approach and, consequently, the choice between the two approaches.</p> <p>The literature on the conditioning versus gain score approaches has mainly focused on unstructured (single-level) data, overlooking the fact that in many settings, such as the educational context, the evaluation of an intervention is complicated by the hierarchical structure of data, for example, students nested within classes and schools. Specifically, when the data have a multilevel structure, the framework has to be extended to handle new issues: some of these issues relate to identifying assumptions, others are conceptual considerations, and yet others relate to statistical modelling aspects. Indeed, in the multilevel setting, the treatment may intervene at the individual level (i.e. the treatment unit is the student) or at the cluster level (i.e. the treatment unit is the school or the class) and individual-level variables, such as the latent ability, may have a relevant contextual effect (i.e. an effect due to the aggregation of individuals into clusters). All these aspects make the comparison between the conditioning and the gain score approaches in the multilevel setting more complex. In their recent paper, [<reflink idref="bib22" id="ref30">22</reflink>] considered a multilevel setting; however, they focused only on the cluster (school/class) level, ignoring the individual (student) level. 
To our knowledge, our paper is the first to compare the gain score and conditioning approaches in a fully multilevel observational setting, simultaneously considering cluster and individual levels. In the setting of experimental studies, instead, [<reflink idref="bib21" id="ref31">21</reflink>] considered similar approaches in the context of cluster randomised trials.</p> <p>Before moving to a deeper discussion of the gain score and conditioning approaches, it is worth mentioning that other approaches can be used to estimate the causal effect of a treatment, such as the regression discontinuity design ([<reflink idref="bib17" id="ref32">17</reflink>]) or instrumental variable methods ([<reflink idref="bib6" id="ref33">6</reflink>]). However, while the two approaches we consider in principle only require measurement of the outcome before and after the treatment, these other approaches can be considered only in specific cases where a discontinuity exists in the treatment assignment or when an instrumental variable is available, respectively.</p> <hd id="AN0183370750-5">Conditioning and Gain Score Approaches in a Multilevel Setting</hd> <p>In this section, we first describe the assumptions concisely; for a more thorough and formal description, we refer the interested reader to Section 1 of the Supplementary Online Material. Then, we illustrate data generating models and estimation methods based on the conditioning and the gain score approaches in a multilevel observational setting. 
We extend the findings of [<reflink idref="bib19" id="ref34">19</reflink>] in three directions: (<emph>i</emph>) we consider a general treatment variable, which could be continuous or binary, whereas they focused on the continuous case; (<emph>ii</emph>) we assume from the beginning that a common measurement error affects both pre-test and post-test scores, whereas they elaborated this case as an extension of the basic framework without common measurement error; and (<emph>iii</emph>) we consider two-level models with random effects.</p> <hd id="AN0183370750-6">Assumptions</hd> <p>We consider a two-level data structure where individuals (<emph>i</emph> = 1, ..., <emph>n</emph><subs><emph>j</emph></subs>) are nested within clusters (<emph>j</emph> = 1, ..., <emph>J</emph>). Let <emph>A</emph><subs><emph>ij</emph></subs> be a latent variable summarising the unobservable ability of individual (student) <emph>i</emph> nested in cluster (school) <emph>j</emph>. Moreover, let <emph>Y</emph><subs>1<emph>ij</emph></subs> and <emph>Y</emph><subs>2<emph>ij</emph></subs> be continuous variables for the observed scores on the pre-test and the post-test, respectively, for individual <emph>i</emph> in cluster <emph>j</emph>. We assume that <emph>A</emph><subs><emph>ij</emph></subs> is an unobservable variable measured with error by <emph>Y</emph><subs>1<emph>ij</emph></subs> and affecting <emph>Y</emph><subs>2<emph>ij</emph></subs>. Moreover, in a multilevel setting, it is often plausible to assume that an individual-level variable has a contextual effect, which is then modelled by adding its cluster mean as an additional covariate ([<reflink idref="bib34" id="ref35">34</reflink>]). 
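</p> <p>As a minimal sketch of how such a cluster-mean covariate can be built, assume an observed covariate (here a pre-test score) stored in a NumPy array alongside cluster labels. The function name cluster_means and the toy values are hypothetical and are not taken from the paper; the sketch also computes the within-cluster deviations that are often used together with the cluster mean.</p>

```python
import numpy as np


def cluster_means(values, cluster_ids):
    """Return, for each unit, the mean of `values` within its own cluster."""
    values = np.asarray(values, dtype=float)
    # Map arbitrary cluster labels to integer codes 0..J-1.
    _, codes = np.unique(cluster_ids, return_inverse=True)
    sums = np.bincount(codes, weights=values)   # per-cluster totals
    counts = np.bincount(codes)                 # per-cluster sizes n_j
    return (sums / counts)[codes]               # broadcast back to units


# Toy data (hypothetical): five pupils in two schools.
pre_test = np.array([190.0, 210.0, 200.0, 220.0, 240.0])
school = np.array(["a", "a", "a", "b", "b"])

pre_mean = cluster_means(pre_test, school)   # contextual (between) component
pre_within = pre_test - pre_mean             # within-cluster deviation
print(pre_mean)
print(pre_within)
```

<p>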
Therefore, we introduce the cluster mean of the latent ability <ephtml> &lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mover accent="true"&gt;&lt;mi&gt;A&lt;/mi&gt;&lt;mo&gt;&amp;#175;&lt;/mo&gt;&lt;/mover&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/msub&gt;&lt;mo&gt;=&lt;/mo&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;mo&gt;/&lt;/mo&gt;&lt;msub&gt;&lt;mi&gt;n&lt;/mi&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/msub&gt;&lt;msubsup&gt;&lt;mo&gt;&amp;#8721;&lt;/mo&gt;&lt;mrow&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;mo&gt;=&lt;/mo&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;/mrow&gt;&lt;msub&gt;&lt;mi&gt;n&lt;/mi&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/msub&gt;&lt;/msubsup&gt;&lt;msub&gt;&lt;mi&gt;A&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml> to account for the influence of the school mean of the latent ability on the test scores, in addition to the effect of the student ability.</p> <p>We also introduce a variable denoting the treatment. In comparison with a single-level setting, the multilevel setting is essential to distinguish the case of an individual-level treatment <emph>Z</emph><subs><emph>ij</emph></subs> (e.g. the treatment unit is the student) from the case of a cluster-level treatment <emph>Z</emph><subs><emph>j</emph></subs> (e.g. the treatment unit is the school). We consider the two cases separately because they have different implications for some of the identifying assumptions and the analysis. We assume that the treatment <emph>Z</emph><subs><emph>ij</emph></subs> or <emph>Z</emph><subs><emph>j</emph></subs> is assigned after the pre-test on the basis of <emph>A</emph><subs><emph>ij</emph></subs>. Therefore, <emph>A</emph><subs><emph>ij</emph></subs> is an unobservable confounder. Differently from [<reflink idref="bib19" id="ref36">19</reflink>], who focused on the case where the treatment variable is continuous, we do not specify the nature of the treatment variable. 
However, we illustrate the methodology with reference to a binary treatment, for example, <emph>Z</emph><subs><emph>ij</emph></subs> = 1 if student <emph>i</emph> of school <emph>j</emph> participates in after-class tutoring.</p> <p>According to the causal road map proposed by [<reflink idref="bib1" id="ref37">1</reflink>], we summarise the key assumptions that allow us to identify the causal effect of interest:</p> <p></p> <ulist> <item> • intact clusters,</item> <p></p> <item> • SUTVA,</item> <p></p> <item> • overlap,</item> <p></p> <item> • latent unconfoundedness.</item> </ulist> <p>The <emph>latent unconfoundedness</emph> assumption can be replaced by other sets of assumptions. Specifically, under the conditioning approach, we can alternatively assume <emph>unconfoundedness</emph> and <emph>no measurement error</emph>, whereas under the gain score approach we can alternatively assume <emph>common trend at level 1 (common within effect)</emph> and <emph>common trend at level 2 (common contextual effect)</emph>. See also Section 1 of the Supplementary Online Material for further details.</p> <p>A complication with respect to the single-level setting concerns the genesis of the clusters. We assume that the clusters are formed before the pre-test and remain the same at post-test (<emph>intact clusters</emph> assumption, see [<reflink idref="bib16" id="ref38">16</reflink>]). In this situation, the cluster mean of the ability can affect the pre-test score, in addition to the post-test score. 
This is the case in the analyses of [<reflink idref="bib16" id="ref39">16</reflink>] on the effect of retention in kindergarten on pupils' performance and of [<reflink idref="bib37" id="ref40">37</reflink>] on the effect of repeating grade 1 on passing the grade 3 Texas Assessment of Knowledge and Skills math achievement test.</p> <p>The <emph>Stable Unit Treatment Value Assumption (SUTVA)</emph>, and, specifically, its 'no interference' component, rules out the possibility that the treatment assigned to an individual influences the outcomes of other individuals, whereas with <emph>overlap</emph> we assume that each unit has a non-zero probability of being assigned to either the treatment or the control group.</p> <p>The <emph>unconfoundedness</emph> assumption amounts to the absence of unobserved confounders or, more precisely, it consists in assuming that the potential outcomes are independent of the treatment assignment conditional on the confounders. In our setting where <emph>A</emph><subs><emph>ij</emph></subs> is unobserved, this assumption would hold only when conditioning on an unobserved confounder; this is why we refer to it as 'latent unconfoundedness'. Clearly, the conditioning approach would then be biased because it would not be possible to condition on <emph>A</emph><subs><emph>ij</emph></subs>. 
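</p> <p>This bias can be illustrated with a small single-level simulation. It is only a sketch under assumed values (unit-variance ability and errors, a true effect of 2, assignment driven by ability plus noise) and is not the simulation design of the paper; the function simulate_estimates is hypothetical. Because ability enters the pre-test and the post-test with the same coefficient in this sketch, the gain score estimator is expected to recover the effect, whereas conditioning on the error-prone pre-test leaves residual confounding.</p>

```python
import numpy as np


def simulate_estimates(n=20000, tau=2.0, seed=0):
    """Compare conditioning and gain score OLS estimates of `tau`.

    All distributions and parameter values are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    ability = rng.normal(size=n)                   # latent confounder A
    pre = ability + rng.normal(size=n)             # Y1: A measured with error
    # Treatment assigned on the basis of latent ability plus noise.
    treat = (ability + rng.normal(size=n) > 0).astype(float)
    post = ability + tau * treat + rng.normal(size=n)   # Y2

    ones = np.ones(n)
    # Conditioning approach: regress Y2 on Y1 and Z.
    x_cond = np.column_stack([ones, pre, treat])
    beta_cond, *_ = np.linalg.lstsq(x_cond, post, rcond=None)
    # Gain score approach: regress Y2 - Y1 on Z only.
    x_gain = np.column_stack([ones, treat])
    beta_gain, *_ = np.linalg.lstsq(x_gain, post - pre, rcond=None)
    return beta_cond[2], beta_gain[1]


cond_est, gain_est = simulate_estimates()
print(f"conditioning: {cond_est:.2f}, gain score: {gain_est:.2f}")
```

<p>Under these assumed values, the conditioning estimate is expected to exceed the true effect, while the gain score estimate should lie close to it; changing the coefficient of ability in the post-test equation would break the common trend and alter this picture.</p> <p>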
As mentioned above, an alternative set of assumptions that can be invoked when using the conditioning approach consists in assuming that unconfoundedness holds conditional on the observed pre-test <emph>Y</emph><subs>1<emph>ij</emph></subs>, which amounts to assuming that <emph>Y</emph><subs>1<emph>ij</emph></subs> is a perfect measure of the unobserved ability <emph>A</emph><subs><emph>ij</emph></subs> (no measurement error).</p> <p>The gain score approach relies on a different assumption that, in the single-level setting, is known as the <emph>common trend assumption</emph>. In the multilevel setting it requires a more detailed specification, depending on whether the treatment is at the individual or at the cluster level, distinguishing between <emph>common trend at level 1</emph> and <emph>common trend at level 2</emph>, as detailed in the following two sections (Section 'Individual-level treatment' and Section 'Cluster-level treatment').</p> <p>The common trend assumption requires that, possibly conditional on observed variables, the average outcomes in the treated and untreated groups in the absence of treatment would have followed parallel paths over time ([<reflink idref="bib12" id="ref41">12</reflink>]). Several unobserved factors might be responsible for the initial gap in outcomes between the two groups (assumed to be constant over time). If <emph>A</emph><subs><emph>ij</emph></subs> is the only confounder, a sufficient although not necessary condition for the common trend assumption is that ability is constant over time and has the same effect on the pre-test and the post-test. Although more restrictive than necessary, this condition might be plausible in certain applications, and it is easier to argue from a theoretical point of view than the common trend assumption per se. 
We note again that, more generally, the common trend assumption requires only that the average change in outcomes among the treated and untreated groups in the absence of treatment would have been the same. In principle, this can be satisfied by a compensation of within-unit changes in the effects and in the values of one or more unobserved confounders (e.g. two confounders may change their influence on the outcome in opposite directions, offsetting each other).</p> <hd id="AN0183370750-7">Individual-Level Treatment</hd> <p>We consider a setting with an individual-level treatment <emph>Z</emph><subs><emph>ij</emph></subs> influenced by the individual latent ability <emph>A</emph><subs><emph>ij</emph></subs> as well as by its cluster mean <ephtml> &lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mover accent="true"&gt;&lt;mi&gt;A&lt;/mi&gt;&lt;mo&gt;&amp;#175;&lt;/mo&gt;&lt;/mover&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/msub&gt;&lt;mo&gt;=&lt;/mo&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;mo&gt;/&lt;/mo&gt;&lt;msub&gt;&lt;mi&gt;n&lt;/mi&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/msub&gt;&lt;msubsup&gt;&lt;mo&gt;&amp;#8721;&lt;/mo&gt;&lt;mrow&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;mo&gt;=&lt;/mo&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;/mrow&gt;&lt;msub&gt;&lt;mi&gt;n&lt;/mi&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/msub&gt;&lt;/msubsup&gt;&lt;msub&gt;&lt;mi&gt;A&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml> .</p> <p>The causal graph in Figure 1, which is an extension of [<reflink idref="bib19" id="ref42">19</reflink>], represents the case of a treatment acting at the individual level, for instance when students are individually assigned (or self-selected) into an after-class tutoring program. 
The two hierarchical levels are clearly separated and the cluster mean of the latent ability <ephtml> &lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mover accent="true"&gt;&lt;mi&gt;A&lt;/mi&gt;&lt;mo&gt;&amp;#175;&lt;/mo&gt;&lt;/mover&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml> is a contextual variable affecting both the pre-test score <emph>Y</emph><subs>1<emph>ij</emph></subs> and the post-test score <emph>Y</emph><subs>2<emph>ij</emph></subs> and, hence, also the gain score <emph>G</emph><subs><emph>ij</emph></subs> = <emph>Y</emph><subs>2<emph>ij</emph></subs> − <emph>Y</emph><subs>1<emph>ij</emph></subs>. The separation between individual- and cluster-level variables clarifies the nature of the variables and thus provides a guide for the estimation process. This separation is especially useful when the treatment acts at the cluster level (see Figure 2 further on). Unlike 'traditional' causal graphs, we embrace the proposal by [<reflink idref="bib19" id="ref43">19</reflink>] to explicitly represent the measurement error <emph>e</emph><subs><emph>ij</emph></subs> in order to clarify the relationships between pre- and post-tests.</p> <p>In the causal graph of Figure 1, <emph>β</emph><subs>1</subs> is the within effect and <emph>ψ</emph><subs>1</subs> is the contextual effect of the ability on the pre-test score, whereas <emph>β</emph><subs>2</subs> and <emph>ψ</emph><subs>2</subs> are the corresponding effects on the post-test score, and <emph>τ</emph> is the causal effect of interest. The parameters <emph>α</emph> and <emph>ψ</emph><subs><emph>z</emph></subs> are the within and contextual effects of the latent ability on the treatment assignment.</p> <p>Graph: Figure 1. Causal graph for the data generating model in a multilevel setting with an individual-level treatment. 
Clusters formed before treatment assignment; measurement error common to pre-test and post-test (dashed arrow: deterministic relationship; solid arrow: probabilistic relationship; full circle: observed variable; empty circle: unobserved variable; cm: cluster mean).</p> <p>Graph: Figure 2. Causal graph for the data generating model in a multilevel setting with cluster-level treatment assignment depending on the cluster mean of the latent ability. Clusters formed before treatment assignment; measurement error common to pre-test and post-test (dashed arrow: deterministic relationship; solid arrow: probabilistic relationship; full circle: observed variable; empty circle: unobserved variable; cm: cluster mean).</p> <p>A data generating model consistent with the causal graph of Figure 1 is the multilevel model with random intercept ([<reflink idref="bib34" id="ref44">34</reflink>]), given by equations for <emph>Y</emph><subs>1<emph>ij</emph></subs> and <emph>Y</emph><subs>2<emph>ij</emph></subs>. 
In particular, the equation for <emph>Y</emph><subs>1<emph>ij</emph></subs> is <ephtml> &lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mi&gt;Y&lt;/mi&gt;&lt;mrow&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;mo&gt;=&lt;/mo&gt;&lt;msub&gt;&lt;mi&gt;&amp;#956;&lt;/mi&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;/msub&gt;&lt;mo&gt;+&lt;/mo&gt;&lt;msub&gt;&lt;mi&gt;&amp;#946;&lt;/mi&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;/msub&gt;&lt;msub&gt;&lt;mi&gt;A&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;mo&gt;+&lt;/mo&gt;&lt;msub&gt;&lt;mi&gt;&amp;#968;&lt;/mi&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;/msub&gt;&lt;msub&gt;&lt;mover accent="true"&gt;&lt;mi&gt;A&lt;/mi&gt;&lt;mo&gt;&amp;#175;&lt;/mo&gt;&lt;/mover&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/msub&gt;&lt;mo&gt;+&lt;/mo&gt;&lt;msub&gt;&lt;mi&gt;u&lt;/mi&gt;&lt;mrow&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;mo&gt;+&lt;/mo&gt;&lt;msub&gt;&lt;mi mathvariant="normal"&gt;&amp;#955;&lt;/mi&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;/msub&gt;&lt;msub&gt;&lt;mi&gt;e&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;mo&gt;,&lt;/mo&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml></p> <p>where <emph>u</emph><subs>1<emph>j</emph></subs> is the cluster-level error with <emph>E</emph> (<emph>u</emph><subs>1<emph>j</emph></subs>) = 0 and <ephtml> &lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;mi&gt;V&lt;/mi&gt;&lt;mi&gt;a&lt;/mi&gt;&lt;mi&gt;r&lt;/mi&gt;&lt;mrow&gt;&lt;mo&gt;(&lt;/mo&gt;&lt;msub&gt;&lt;mi&gt;u&lt;/mi&gt;&lt;mrow&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;mo&gt;)&lt;/mo&gt;&lt;/mrow&gt;&lt;mo&gt;=&lt;/mo&gt;&lt;msubsup&gt;&lt;mi&gt;&amp;#963;&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;u&lt;/mi&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;/mrow&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;/msubsup&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml> , 
whereas λ<subs>1</subs><emph>e</emph><subs><emph>ij</emph></subs> is the measurement error with <emph>E</emph> (<emph>e</emph><subs><emph>ij</emph></subs>) = 0 and <emph>Var</emph>(<emph>e</emph><subs><emph>ij</emph></subs>) = 1. To ensure identifiability, the individual ability has zero mean and unit variance, that is, <emph>E</emph> (<emph>A</emph><subs><emph>ij</emph></subs>) = 0 and <emph>Var</emph>(<emph>A</emph><subs><emph>ij</emph></subs>) = 1. The parameter λ<subs>1</subs> regulates the measurement error: if λ<subs>1</subs> = 0, there is no measurement error; thus, <emph>Y</emph><subs>1<emph>ij</emph></subs> is a perfect measure of the latent ability up to a scale factor <emph>β</emph><subs>1</subs>. The random intercept model (<reflink idref="bib1" id="ref45">1</reflink>) has a cluster-specific intercept composed of a fixed part, <ephtml> &lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mi&gt;&amp;#956;&lt;/mi&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;/msub&gt;&lt;mo&gt;+&lt;/mo&gt;&lt;msub&gt;&lt;mi&gt;&amp;#968;&lt;/mi&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;/msub&gt;&lt;msub&gt;&lt;mover accent="true"&gt;&lt;mi&gt;A&lt;/mi&gt;&lt;mo&gt;&amp;#175;&lt;/mo&gt;&lt;/mover&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml> , and a random part, <emph>u</emph><subs>1<emph>j</emph></subs>.</p> <p>The post-test score is assumed to depend on both the latent ability and the treatment assignment <emph>Z</emph><subs><emph>ij</emph></subs>: <ephtml> &lt;math 
xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mi&gt;Y&lt;/mi&gt;&lt;mrow&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;mo&gt;=&lt;/mo&gt;&lt;msub&gt;&lt;mi&gt;&amp;#956;&lt;/mi&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;/msub&gt;&lt;mo&gt;+&lt;/mo&gt;&lt;msub&gt;&lt;mi&gt;&amp;#946;&lt;/mi&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;/msub&gt;&lt;msub&gt;&lt;mi&gt;A&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;mo&gt;+&lt;/mo&gt;&lt;msub&gt;&lt;mi&gt;&amp;#968;&lt;/mi&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;/msub&gt;&lt;msub&gt;&lt;mover accent="true"&gt;&lt;mi&gt;A&lt;/mi&gt;&lt;mo&gt;&amp;#175;&lt;/mo&gt;&lt;/mover&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/msub&gt;&lt;mo&gt;+&lt;/mo&gt;&lt;mi&gt;&amp;#964;&lt;/mi&gt;&lt;msub&gt;&lt;mi&gt;Z&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;mo&gt;+&lt;/mo&gt;&lt;msub&gt;&lt;mi&gt;u&lt;/mi&gt;&lt;mrow&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;mo&gt;+&lt;/mo&gt;&lt;msub&gt;&lt;mi mathvariant="normal"&gt;&amp;#955;&lt;/mi&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;/msub&gt;&lt;msub&gt;&lt;mi&gt;e&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;mo&gt;+&lt;/mo&gt;&lt;msub&gt;&lt;mi&gt;v&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;mo&gt;,&lt;/mo&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml></p> <p>The term <emph>u</emph><subs>2<emph>j</emph></subs> is the cluster-level error with <emph>E</emph> (<emph>u</emph><subs>2<emph>j</emph></subs>) = 0 and <ephtml> &lt;math 
xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;mi&gt;V&lt;/mi&gt;&lt;mi&gt;a&lt;/mi&gt;&lt;mi&gt;r&lt;/mi&gt;&lt;mrow&gt;&lt;mo&gt;(&lt;/mo&gt;&lt;msub&gt;&lt;mi&gt;u&lt;/mi&gt;&lt;mrow&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;mo&gt;)&lt;/mo&gt;&lt;/mrow&gt;&lt;mo&gt;=&lt;/mo&gt;&lt;msubsup&gt;&lt;mi&gt;&amp;#963;&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;u&lt;/mi&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;/mrow&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;/msubsup&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml> , whereas λ<subs>2</subs><emph>e</emph><subs><emph>ij</emph></subs> is the measurement error of the post-test in common with the pre-test, which arises if the two tests are administered through the same instrument (e.g. a standardised test). In addition, <emph>v</emph><subs><emph>ij</emph></subs> is an error component independent of <emph>e</emph><subs><emph>ij</emph></subs> having <emph>E</emph> (<emph>v</emph><subs><emph>ij</emph></subs>) = 0 and <ephtml> &lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;mi&gt;V&lt;/mi&gt;&lt;mi&gt;a&lt;/mi&gt;&lt;mi&gt;r&lt;/mi&gt;&lt;mrow&gt;&lt;mo&gt;(&lt;/mo&gt;&lt;msub&gt;&lt;mi&gt;v&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;mo&gt;)&lt;/mo&gt;&lt;/mrow&gt;&lt;mo&gt;=&lt;/mo&gt;&lt;msubsup&gt;&lt;mi&gt;&amp;#963;&lt;/mi&gt;&lt;mi&gt;v&lt;/mi&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;/msubsup&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml> . The role of <emph>v</emph><subs><emph>ij</emph></subs> is to account for the post-test specific measurement error and for the effect of unobserved factors intervening after the pre-test. 
Because they share <emph>e</emph><subs><emph>ij</emph></subs>, the error terms of the two test scores are correlated; specifically, <emph>Cov</emph>(λ<subs>1</subs><emph>e</emph><subs><emph>ij</emph></subs>, λ<subs>2</subs><emph>e</emph><subs><emph>ij</emph></subs> + <emph>v</emph><subs><emph>ij</emph></subs>) = λ<subs>1</subs>λ<subs>2</subs><emph>Var</emph>(<emph>e</emph><subs><emph>ij</emph></subs>) = λ<subs>1</subs>λ<subs>2</subs>.</p> <p>The random intercept model (<reflink idref="bib2" id="ref46">2</reflink>) has a cluster-specific intercept composed of a fixed part, <ephtml> &lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mi&gt;&amp;#956;&lt;/mi&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;/msub&gt;&lt;mo&gt;+&lt;/mo&gt;&lt;msub&gt;&lt;mi&gt;&amp;#968;&lt;/mi&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;/msub&gt;&lt;msub&gt;&lt;mover accent="true"&gt;&lt;mi&gt;A&lt;/mi&gt;&lt;mo&gt;&amp;#175;&lt;/mo&gt;&lt;/mover&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml> , and a random part, <emph>u</emph><subs>2<emph>j</emph></subs>. The cluster-level errors <emph>u</emph><subs>1<emph>j</emph></subs> and <emph>u</emph><subs>2<emph>j</emph></subs> are independently and identically distributed across clusters according to a bivariate distribution with zero means and an unconstrained covariance matrix, so they may be arbitrarily correlated. 
Moreover, <emph>u</emph><subs>1<emph>j</emph></subs> and <emph>u</emph><subs>2<emph>j</emph></subs> are uncorrelated with <emph>Z</emph><subs><emph>ij</emph></subs>, thus, they are not confounders.</p> <p>In the conditioning approach, an estimable version of model (<reflink idref="bib2" id="ref47">2</reflink>) is obtained by replacing <emph>A</emph><subs><emph>ij</emph></subs> with the pre-test score <emph>Y</emph><subs>1<emph>ij</emph></subs> and <ephtml> &lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mover accent="true"&gt;&lt;mi&gt;A&lt;/mi&gt;&lt;mo&gt;&amp;#175;&lt;/mo&gt;&lt;/mover&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml> with the cluster mean pre-test score <ephtml> &lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mover accent="true"&gt;&lt;msub&gt;&lt;mi&gt;Y&lt;/mi&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;/msub&gt;&lt;mo&gt;&amp;#175;&lt;/mo&gt;&lt;/mover&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/msub&gt;&lt;mo&gt;=&lt;/mo&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;mo&gt;/&lt;/mo&gt;&lt;msub&gt;&lt;mi&gt;n&lt;/mi&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/msub&gt;&lt;msubsup&gt;&lt;mo&gt;&amp;#8721;&lt;/mo&gt;&lt;mrow&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;mo&gt;=&lt;/mo&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;/mrow&gt;&lt;msub&gt;&lt;mi&gt;n&lt;/mi&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/msub&gt;&lt;/msubsup&gt;&lt;msub&gt;&lt;mi&gt;Y&lt;/mi&gt;&lt;mrow&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml> . 
The drawback of this approach is the attenuation of the regression coefficients of the pre-test score <emph>Y</emph><subs>1<emph>ij</emph></subs> and its cluster mean <ephtml> &lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mover accent="true"&gt;&lt;msub&gt;&lt;mi&gt;Y&lt;/mi&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;/msub&gt;&lt;mo&gt;&amp;#175;&lt;/mo&gt;&lt;/mover&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml> due to measurement error. In other words, the conditioning approach partially adjusts for the latent ability, causing a bias in the estimation of the treatment effect <emph>τ</emph> ([<reflink idref="bib19" id="ref48">19</reflink>]; [<reflink idref="bib38" id="ref49">38</reflink>]). The bias depends on the magnitude of the measurement error, which is regulated by the parameter λ<subs>1</subs> of equation (<reflink idref="bib1" id="ref50">1</reflink>), and it vanishes in the absence of measurement error (λ<subs>1</subs> = 0).</p> <p>For the gain score approach, the estimation model is obtained as the difference between equations (<reflink idref="bib2" id="ref51">2</reflink>) and (<reflink idref="bib1" id="ref52">1</reflink>) <ephtml> &lt;math 
xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mi&gt;G&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;mo&gt;=&lt;/mo&gt;&lt;mi&gt;&amp;#956;&lt;/mi&gt;&lt;mo&gt;+&lt;/mo&gt;&lt;mrow&gt;&lt;mo&gt;(&lt;/mo&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mi&gt;&amp;#946;&lt;/mi&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;/msub&gt;&lt;mo&gt;&amp;#8722;&lt;/mo&gt;&lt;msub&gt;&lt;mi&gt;&amp;#946;&lt;/mi&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;mo&gt;)&lt;/mo&gt;&lt;/mrow&gt;&lt;msub&gt;&lt;mi&gt;A&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;mo&gt;+&lt;/mo&gt;&lt;mrow&gt;&lt;mo&gt;(&lt;/mo&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mi&gt;&amp;#968;&lt;/mi&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;/msub&gt;&lt;mo&gt;&amp;#8722;&lt;/mo&gt;&lt;msub&gt;&lt;mi&gt;&amp;#968;&lt;/mi&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;mo&gt;)&lt;/mo&gt;&lt;/mrow&gt;&lt;msub&gt;&lt;mover accent="true"&gt;&lt;mi&gt;A&lt;/mi&gt;&lt;mo&gt;&amp;#175;&lt;/mo&gt;&lt;/mover&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/msub&gt;&lt;mo&gt;+&lt;/mo&gt;&lt;mi&gt;&amp;#964;&lt;/mi&gt;&lt;msub&gt;&lt;mi&gt;Z&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;mo&gt;+&lt;/mo&gt;&lt;mrow&gt;&lt;mo&gt;(&lt;/mo&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mi&gt;u&lt;/mi&gt;&lt;mrow&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;mo&gt;&amp;#8722;&lt;/mo&gt;&lt;msub&gt;&lt;mi&gt;u&lt;/mi&gt;&lt;mrow&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;mo&gt;)&lt;/mo&gt;&lt;/mrow&gt;&lt;mo&gt;+&lt;/mo&gt;&lt;msub&gt;&lt;mi&gt;&amp;#949;&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;mo&gt;,&lt;/mo&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml></p> <p>where <emph>G</emph><subs><emph>ij</emph></subs> = <emph>Y</emph><subs>2<emph>ij</emph></subs> − 
<emph>Y</emph><subs>1<emph>ij</emph></subs>, <emph>μ</emph> = <emph>μ</emph><subs>2</subs> − <emph>μ</emph><subs>1</subs> and <emph>ɛ</emph><subs><emph>ij</emph></subs> = λ<subs>2</subs><emph>e</emph><subs><emph>ij</emph></subs> + <emph>v</emph><subs><emph>ij</emph></subs> − λ<subs>1</subs><emph>e</emph><subs><emph>ij</emph></subs>. We define the <emph>common trend at level 1 assumption</emph> as <emph>β</emph><subs>2</subs> = <emph>β</emph><subs>1</subs>, and the <emph>common trend at level 2 assumption</emph> as <emph>ψ</emph><subs>2</subs> = <emph>ψ</emph><subs>1</subs>. The gain score approach is implemented by fitting a random intercept linear model where <emph>G</emph><subs><emph>ij</emph></subs> is regressed on <emph>Z</emph><subs><emph>ij</emph></subs>. Under the above common trend assumptions, the gain score approach provides an unbiased estimator of the treatment effect, regardless of measurement error, namely, even if λ<subs>1</subs> ≠ 0 or λ<subs>2</subs> ≠ 0. As noted by [<reflink idref="bib19" id="ref53">19</reflink>], the pre-test score <emph>Y</emph><subs>1<emph>ij</emph></subs> should not be included as a covariate in the gain score model, because this would reintroduce the measurement error.</p> <p>The novelty of the multilevel setting is that the common trend assumption has two parts: at the individual level, the equality <emph>β</emph><subs>2</subs> = <emph>β</emph><subs>1</subs> makes <emph>A</emph><subs><emph>ij</emph></subs> vanish; at the cluster level, the equality <emph>ψ</emph><subs>2</subs> = <emph>ψ</emph><subs>1</subs> makes <ephtml> &lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mover accent="true"&gt;&lt;mi&gt;A&lt;/mi&gt;&lt;mo&gt;&amp;#175;&lt;/mo&gt;&lt;/mover&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml> vanish. 
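</p> <p>The contrast between the two approaches for an individual-level treatment can be illustrated with a small simulation. The following is a minimal sketch, not the authors' simulation code: it draws data from equations (1) and (2) using the values of <emph>β</emph> and λ that appear for configuration II in Table 3, with <emph>ψ</emph><subs>1</subs> = <emph>ψ</emph><subs>2</subs> = 8 borrowed from configuration I. The values of <emph>τ</emph>, <emph>α</emph>, <emph>ψ</emph><subs><emph>z</emph></subs>, the error variances, the logistic assignment rule, the cluster sizes and the function names are assumptions made here for illustration. Ordinary least squares is used in place of the random intercept model to keep the example dependency-free.</p>

```python
import math
import random


def simulate(n_clusters=300, n_per=20, tau=4.0, beta1=16.0, beta2=16.0,
             psi1=8.0, psi2=8.0, lam1=6.0, lam2=0.0,
             alpha=1.0, psi_z=1.0, sigma_u=1.0, sigma_v=4.0, seed=1):
    """Draw rows (Y1, Y2, Z, cluster mean of Y1) from equations (1) and (2).

    tau, alpha, psi_z, sigma_u, sigma_v, the logistic assignment rule and the
    cluster sizes are illustrative assumptions; mu1 = mu2 = 0 for simplicity.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n_clusters):
        ability = [rng.gauss(0.0, 1.0) for _ in range(n_per)]
        a_bar = sum(ability) / n_per
        u1 = rng.gauss(0.0, sigma_u)  # cluster-level errors, drawn independently here
        u2 = rng.gauss(0.0, sigma_u)
        cluster = []
        for a in ability:
            e = rng.gauss(0.0, 1.0)  # measurement error shared by both tests
            p = 1.0 / (1.0 + math.exp(-(alpha * a + psi_z * a_bar)))
            z = 1.0 if rng.random() < p else 0.0
            y1 = beta1 * a + psi1 * a_bar + u1 + lam1 * e
            y2 = (beta2 * a + psi2 * a_bar + tau * z + u2
                  + lam2 * e + rng.gauss(0.0, sigma_v))
            cluster.append((y1, y2, z))
        y1_bar = sum(r[0] for r in cluster) / n_per
        rows.extend((y1, y2, z, y1_bar) for (y1, y2, z) in cluster)
    return rows


def ols(X, y):
    """Least-squares coefficients via the normal equations (Gauss-Jordan)."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    M = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(k):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * w for x, w in zip(M[r], M[c])]
    return [M[i][k] / M[i][i] for i in range(k)]


def estimate(rows):
    """Return (gain score estimate, conditioning estimate) of tau."""
    # Gain score approach: regress G = Y2 - Y1 on Z only.
    gain = [y2 - y1 for (y1, y2, z, y1_bar) in rows]
    tau_gain = ols([[1.0, z] for (_, _, z, _) in rows], gain)[1]
    # Conditioning approach: regress Y2 on Z, Y1 and the cluster mean of Y1.
    X_cond = [[1.0, z, y1, y1_bar] for (y1, _, z, y1_bar) in rows]
    tau_cond = ols(X_cond, [y2 for (_, y2, _, _) in rows])[1]
    return tau_gain, tau_cond


if __name__ == "__main__":
    tau_gain, tau_cond = estimate(simulate())
    print("gain score estimate of tau:  ", round(tau_gain, 2))
    print("conditioning estimate of tau:", round(tau_cond, 2))
```

<p>Under these assumed settings, where both common trend conditions hold and the pre-test is measured with error, the gain score estimate should fall close to the assumed <emph>τ</emph> = 4, while the conditioning estimate should be shifted upwards because the pre-test only partially adjusts for ability. Setting <code>lam1=0.0</code> removes the measurement error and should bring the conditioning estimate back towards <emph>τ</emph>.</p> <p>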
Note that the above parameter constraints are sufficient conditions for unbiased estimation, though they are not necessary since the within and contextual effects could compensate for each other.</p> <p>It is worth noting that, for an individual-level treatment, the contextual effect of the latent ability on the treatment assignment <emph>ψ</emph><subs><emph>z</emph></subs> is irrelevant. Indeed, a variant of the simulation study discussed below shows that the results are unchanged if <emph>Z</emph><subs><emph>ij</emph></subs> is determined only by <emph>A</emph><subs><emph>ij</emph></subs> (i.e. omitting <ephtml> &lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mover accent="true"&gt;&lt;mi&gt;A&lt;/mi&gt;&lt;mo&gt;&amp;#175;&lt;/mo&gt;&lt;/mover&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml> ). This is expected as the outcome <emph>Y</emph><subs>2<emph>ij</emph></subs> depends on the treatment assignment exclusively through the individual value <emph>Z</emph><subs><emph>ij</emph></subs>.</p> <hd id="AN0183370750-8">Cluster-Level Treatment</hd> <p>In the case of a cluster-level treatment, the identifying assumptions for the causal effect of interest need to be modified since the confounder is the cluster mean ability <ephtml> &lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mover accent="true"&gt;&lt;mi&gt;A&lt;/mi&gt;&lt;mo&gt;&amp;#175;&lt;/mo&gt;&lt;/mover&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml> . The modified assumptions are described formally in Section 1 of the Supplementary Online Material.</p> <p>An example of a cluster-level treatment is when all students of a subset of schools benefit from an investment in IT infrastructure. 
In this case, we can assume that the assignment to the treatment depends on the cluster mean of the unobserved ability <ephtml> &lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mover accent="true"&gt;&lt;mi&gt;A&lt;/mi&gt;&lt;mo&gt;&amp;#175;&lt;/mo&gt;&lt;/mover&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml> . Thus, the causal graph of Figure 1 is modified as shown in Figure 2, where the cluster mean of the latent ability <ephtml> &lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mover accent="true"&gt;&lt;mi&gt;A&lt;/mi&gt;&lt;mo&gt;&amp;#175;&lt;/mo&gt;&lt;/mover&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml> affects the treatment variable <emph>Z</emph><subs><emph>j</emph></subs>, in addition to the pre-test score <emph>Y</emph><subs>1<emph>ij</emph></subs> and the post-test score <emph>Y</emph><subs>2<emph>ij</emph></subs>.</p> <p>The model equation of the pre-test score <emph>Y</emph><subs>1<emph>ij</emph></subs> remains as in equation (<reflink idref="bib1" id="ref54">1</reflink>), while in the equation of the post-test score <emph>Y</emph><subs>2<emph>ij</emph></subs> (<reflink idref="bib2" id="ref55">2</reflink>), we replace the individual-level treatment variable <emph>Z</emph><subs><emph>ij</emph></subs> by the cluster-level treatment <emph>Z</emph><subs><emph>j</emph></subs>, namely <ephtml> &lt;math 
xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mi&gt;Y&lt;/mi&gt;&lt;mrow&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;mo&gt;=&lt;/mo&gt;&lt;msub&gt;&lt;mi&gt;&amp;#956;&lt;/mi&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;/msub&gt;&lt;mo&gt;+&lt;/mo&gt;&lt;msub&gt;&lt;mi&gt;&amp;#946;&lt;/mi&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;/msub&gt;&lt;msub&gt;&lt;mi&gt;A&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;mo&gt;+&lt;/mo&gt;&lt;msub&gt;&lt;mi&gt;&amp;#968;&lt;/mi&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;/msub&gt;&lt;msub&gt;&lt;mover accent="true"&gt;&lt;mi&gt;A&lt;/mi&gt;&lt;mo&gt;&amp;#175;&lt;/mo&gt;&lt;/mover&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/msub&gt;&lt;mo&gt;+&lt;/mo&gt;&lt;mi&gt;&amp;#964;&lt;/mi&gt;&lt;msub&gt;&lt;mi&gt;Z&lt;/mi&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/msub&gt;&lt;mo&gt;+&lt;/mo&gt;&lt;msub&gt;&lt;mi&gt;u&lt;/mi&gt;&lt;mrow&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;mo&gt;+&lt;/mo&gt;&lt;msub&gt;&lt;mi mathvariant="normal"&gt;&amp;#955;&lt;/mi&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;/msub&gt;&lt;msub&gt;&lt;mi&gt;e&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;mo&gt;+&lt;/mo&gt;&lt;msub&gt;&lt;mi&gt;v&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;mo&gt;.&lt;/mo&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml></p> <p>The causal graph in Figure 2 shows that the cluster mean of the latent ability <ephtml> &lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mover accent="true"&gt;&lt;mi&gt;A&lt;/mi&gt;&lt;mo&gt;&amp;#175;&lt;/mo&gt;&lt;/mover&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml> influences <emph>Z</emph><subs><emph>j</emph></subs>; therefore, in the presence of measurement error, the implementation of the conditioning approach requires fitting a random intercept model with 
<emph>Y</emph><subs>2<emph>ij</emph></subs> regressed on <ephtml> &lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mover accent="true"&gt;&lt;mi&gt;Y&lt;/mi&gt;&lt;mo&gt;&amp;#175;&lt;/mo&gt;&lt;/mover&gt;&lt;mrow&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml> , in addition to <emph>Z</emph><subs><emph>j</emph></subs> and <emph>Y</emph><subs>1<emph>ij</emph></subs>. The simulation study presented later on quantifies the bias in the estimation of the treatment effect <emph>τ</emph> if <ephtml> &lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mover accent="true"&gt;&lt;mi&gt;Y&lt;/mi&gt;&lt;mo&gt;&amp;#175;&lt;/mo&gt;&lt;/mover&gt;&lt;mrow&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml> is omitted.</p> <p>The gain score approach is implemented, as usual, by computing the difference between the post-test and pre-test scores. 
The expression is like equation (<reflink idref="bib3" id="ref56">3</reflink>) for the individual-level treatment except that <emph>Z</emph><subs><emph>ij</emph></subs> is replaced by <emph>Z</emph><subs><emph>j</emph></subs><ephtml> &lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mi&gt;G&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;mo&gt;=&lt;/mo&gt;&lt;mi&gt;&amp;#956;&lt;/mi&gt;&lt;mo&gt;+&lt;/mo&gt;&lt;mrow&gt;&lt;mo&gt;(&lt;/mo&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mi&gt;&amp;#946;&lt;/mi&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;/msub&gt;&lt;mo&gt;&amp;#8722;&lt;/mo&gt;&lt;msub&gt;&lt;mi&gt;&amp;#946;&lt;/mi&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;mo&gt;)&lt;/mo&gt;&lt;/mrow&gt;&lt;msub&gt;&lt;mi&gt;A&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;mo&gt;+&lt;/mo&gt;&lt;mrow&gt;&lt;mo&gt;(&lt;/mo&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mi&gt;&amp;#968;&lt;/mi&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;/msub&gt;&lt;mo&gt;&amp;#8722;&lt;/mo&gt;&lt;msub&gt;&lt;mi&gt;&amp;#968;&lt;/mi&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;mo&gt;)&lt;/mo&gt;&lt;/mrow&gt;&lt;msub&gt;&lt;mover 
accent="true"&gt;&lt;mi&gt;A&lt;/mi&gt;&lt;mo&gt;&amp;#175;&lt;/mo&gt;&lt;/mover&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/msub&gt;&lt;mo&gt;+&lt;/mo&gt;&lt;mi&gt;&amp;#964;&lt;/mi&gt;&lt;msub&gt;&lt;mi&gt;Z&lt;/mi&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/msub&gt;&lt;mo&gt;+&lt;/mo&gt;&lt;mrow&gt;&lt;mo&gt;(&lt;/mo&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mi&gt;u&lt;/mi&gt;&lt;mrow&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;mo&gt;&amp;#8722;&lt;/mo&gt;&lt;msub&gt;&lt;mi&gt;u&lt;/mi&gt;&lt;mrow&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;mo&gt;)&lt;/mo&gt;&lt;/mrow&gt;&lt;mo&gt;+&lt;/mo&gt;&lt;msub&gt;&lt;mi&gt;&amp;#949;&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;mo&gt;.&lt;/mo&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml></p> <p>Also for a cluster-level treatment, the gain score approach gives an unbiased estimator of the treatment effect under the common trend assumption at both levels.</p> <hd id="AN0183370750-9">General Remarks</hd> <p>Under our data generating models, one should account for the biasing effects introduced by the individual ability and its cluster mean. However, in the case of no measurement error and for an individual-level treatment, the cluster mean of the latent ability becomes irrelevant. When the treatment is at the cluster level, the cluster mean of the latent ability cannot be ignored, thus, the cluster mean of the pre-test must be included.</p> <p>In the gain score approach, unbiased estimation of the treatment effect is ensured by the common trend assumption at both levels.</p> <p>In general, as noted in the 'Literature overview', the common trend assumption is not directly testable, although falsification tests can be implemented in designs with more than two repeated measures or more than one control group. 
If the assumption does not hold, <emph>A</emph><subs><emph>ij</emph></subs> and <ephtml> &lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mover accent="true"&gt;&lt;mi&gt;A&lt;/mi&gt;&lt;mo&gt;&amp;#175;&lt;/mo&gt;&lt;/mover&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml> do not vanish, thus, the gain score approach is not helpful. However, when the common trend holds only at level 1, a possible strategy is to replace <ephtml> &lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mover accent="true"&gt;&lt;mi&gt;A&lt;/mi&gt;&lt;mo&gt;&amp;#175;&lt;/mo&gt;&lt;/mover&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml> with its observed counterpart <ephtml> &lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mover accent="true"&gt;&lt;mi&gt;Y&lt;/mi&gt;&lt;mo&gt;&amp;#175;&lt;/mo&gt;&lt;/mover&gt;&lt;mrow&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml> . 
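</p> <p>To see why cluster size matters for this replacement, one can average equation (1) over the members of cluster <emph>j</emph>. The following is a sketch that assumes the <emph>e</emph><subs><emph>ij</emph></subs> are mutually independent within a cluster, which is not stated explicitly in the text; the symbol for the cluster mean of <emph>e</emph><subs><emph>ij</emph></subs> is notation introduced here.</p>

```latex
\bar{Y}_{1j}
  = \frac{1}{n_j}\sum_{i=1}^{n_j} Y_{1ij}
  = \mu_1 + (\beta_1 + \psi_1)\,\bar{A}_j + u_{1j} + \lambda_1 \bar{e}_j,
\qquad
\bar{e}_j = \frac{1}{n_j}\sum_{i=1}^{n_j} e_{ij},
\qquad
\operatorname{Var}\!\left(\lambda_1 \bar{e}_j\right) = \frac{\lambda_1^2}{n_j}.
```

<p>Under this assumption, the measurement-error component of the cluster mean of the pre-test has variance λ<subs>1</subs><sups>2</sups>/<emph>n</emph><subs><emph>j</emph></subs>, which shrinks as the cluster size grows.</p> <p>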
Indeed, when clusters are large, the measurement error due to <ephtml> &lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mover accent="true"&gt;&lt;mi&gt;A&lt;/mi&gt;&lt;mo&gt;&amp;#175;&lt;/mo&gt;&lt;/mover&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml> is negligible; thus, the cluster mean of the pre-test score <ephtml> &lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mover accent="true"&gt;&lt;mi&gt;Y&lt;/mi&gt;&lt;mo&gt;&amp;#175;&lt;/mo&gt;&lt;/mover&gt;&lt;mrow&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml> is a reliable indicator of <ephtml> &lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mover accent="true"&gt;&lt;mi&gt;A&lt;/mi&gt;&lt;mo&gt;&amp;#175;&lt;/mo&gt;&lt;/mover&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml> and its inclusion in the gain score model makes the approach unbiased regardless of the common trend assumption at the cluster level. This is shown by the simulation study (see next section).</p> <p>The bias formulae for the single-level case (eqs. (2.1)–(2.4) in Section 2 of the Supplementary Online Material) cannot be easily extended to the multilevel case because the estimators are not available in closed form. However, in a linear model, the inclusion of random effects does not change the meaning of the regression coefficients because marginal and conditional coefficients coincide. Adding random effects only modifies the estimation procedure, which typically results in similar point estimates and larger standard errors ([<reflink idref="bib34" id="ref57">34</reflink>]). Therefore, we expect that the bias formulae for the single-level case are good approximations for the multilevel case. 
Nevertheless, this expectation needs to be checked with a simulation study.</p> <hd id="AN0183370750-10">Simulation Study</hd> <p>In a multilevel setting, the estimators do not have closed-form expressions, so a suitable simulation study is needed to evaluate the performance of the conditioning and gain score approaches illustrated above. First, we outline the main characteristics of the scenarios adopted in the study and provide details about data generation and model fitting. Then, we discuss the main results.</p> <p>The goal of the simulation study is to assess the bias of the estimator of the treatment effect under different conditions, checking whether the bias formulae derived for the single-level setting (Section 2 of the Supplementary Online Material) are good approximations.</p> <p>Consistently with Sections 'Individual-level treatment' and 'Cluster-level treatment', we consider two scenarios. In <emph>Scenario 1,</emph> the treatment is at the individual level (causal graph of Figure 1), while in <emph>Scenario 2,</emph> the treatment is at the cluster level (causal graph of Figure 2). In both cases, we assume intact clusters, namely, the clusters are the same at pre-test and post-test. Each scenario is analyzed under eight configurations, which differ in their measurement error and common trend assumptions, as defined by the parameter values reported in Table 3, where the symbol '<emph>✓</emph>' marks which assumption holds for the measurement error on the pre-test (present or absent) and for the common trend (at level 1, at level 2, at both levels).</p> <p>Table 3. Configurations of the Simulation Study.</p> <p> <ephtml> &lt;table&gt;&lt;thead valign="top"&gt;&lt;tr&gt;&lt;th align="left" rowspan="2"&gt;Conf&lt;/th&gt;&lt;th align="center" colspan="2"&gt;Measurement error&lt;/th&gt;&lt;th align="center" colspan="2"&gt;Common trend&lt;/th&gt;&lt;th align="center" colspan="2"&gt;Meas. 
Err.&lt;/th&gt;&lt;th align="center" colspan="2"&gt;Level 1 par.&lt;/th&gt;&lt;th align="center" colspan="2"&gt;Level 2 par.&lt;/th&gt;&lt;/tr&gt;&lt;tr&gt;&lt;th align="left"&gt;No&lt;/th&gt;&lt;th align="center"&gt;Yes&lt;/th&gt;&lt;th align="center"&gt;Level 1&lt;/th&gt;&lt;th align="center"&gt;Level 2&lt;/th&gt;&lt;th align="center"&gt;&amp;#955;&lt;sub&gt;1&lt;/sub&gt;&lt;/th&gt;&lt;th align="center"&gt;&amp;#955;&lt;sub&gt;2&lt;/sub&gt;&lt;/th&gt;&lt;th align="center"&gt;&lt;italic&gt;&amp;#946;&lt;/italic&gt;&lt;sub&gt;1&lt;/sub&gt;&lt;/th&gt;&lt;th align="center"&gt;&lt;italic&gt;&amp;#946;&lt;/italic&gt;&lt;sub&gt;2&lt;/sub&gt;&lt;/th&gt;&lt;th align="center"&gt;&lt;italic&gt;&amp;#968;&lt;/italic&gt;&lt;sub&gt;1&lt;/sub&gt;&lt;/th&gt;&lt;th align="center"&gt;&lt;italic&gt;&amp;#968;&lt;/italic&gt;&lt;sub&gt;2&lt;/sub&gt;&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody valign="top"&gt;&lt;tr&gt;&lt;td align="left"&gt;I&lt;/td&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="left" /&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="char" char="."&gt;0&lt;/td&gt;&lt;td align="char" char="."&gt;0&lt;/td&gt;&lt;td align="char" char="."&gt;16&lt;/td&gt;&lt;td align="char" char="."&gt;16&lt;/td&gt;&lt;td align="char" char="."&gt;8&lt;/td&gt;&lt;td align="char" char="."&gt;8&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="left"&gt;II&lt;/td&gt;&lt;td align="left" /&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="char" char="."&gt;6&lt;/td&gt;&lt;td align="char" char="."&gt;0&lt;/td&gt;&lt;td align="char" char="."&gt;16&lt;/td&gt;&lt;td align="char" char="."&gt;16&lt;/td&gt;&lt;td align="char" 
char="."&gt;8&lt;/td&gt;&lt;td align="char" char="."&gt;8&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="left"&gt;III&lt;/td&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="left" /&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="left" /&gt;&lt;td align="char" char="."&gt;0&lt;/td&gt;&lt;td align="char" char="."&gt;0&lt;/td&gt;&lt;td align="char" char="."&gt;16&lt;/td&gt;&lt;td align="char" char="."&gt;16&lt;/td&gt;&lt;td align="char" char="."&gt;4&lt;/td&gt;&lt;td align="char" char="."&gt;8&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="left"&gt;IV&lt;/td&gt;&lt;td align="left" /&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="left" /&gt;&lt;td align="char" char="."&gt;6&lt;/td&gt;&lt;td align="char" char="."&gt;0&lt;/td&gt;&lt;td align="char" char="."&gt;16&lt;/td&gt;&lt;td align="char" char="."&gt;16&lt;/td&gt;&lt;td align="char" char="."&gt;4&lt;/td&gt;&lt;td align="char" char="."&gt;8&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="left"&gt;V&lt;/td&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="left" /&gt;&lt;td align="left" /&gt;&lt;td align="left" /&gt;&lt;td align="char" char="."&gt;0&lt;/td&gt;&lt;td align="char" char="."&gt;0&lt;/td&gt;&lt;td align="char" char="."&gt;16&lt;/td&gt;&lt;td align="char" char="."&gt;24&lt;/td&gt;&lt;td align="char" char="."&gt;8&lt;/td&gt;&lt;td align="char" char="."&gt;4&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="left"&gt;VI&lt;/td&gt;&lt;td align="left" /&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="left" /&gt;&lt;td align="left" /&gt;&lt;td align="char" char="."&gt;6&lt;/td&gt;&lt;td align="char" char="."&gt;0&lt;/td&gt;&lt;td align="char" char="."&gt;16&lt;/td&gt;&lt;td align="char" char="."&gt;24&lt;/td&gt;&lt;td align="char" 
char="."&gt;8&lt;/td&gt;&lt;td align="char" char="."&gt;4&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="left"&gt;VII&lt;/td&gt;&lt;td align="left" /&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;(&amp;#955;&lt;sub&gt;1&lt;/sub&gt; = &amp;#955;&lt;sub&gt;2&lt;/sub&gt;)&lt;/td&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="char" char="."&gt;6&lt;/td&gt;&lt;td align="char" char="."&gt;6&lt;/td&gt;&lt;td align="char" char="."&gt;16&lt;/td&gt;&lt;td align="char" char="."&gt;16&lt;/td&gt;&lt;td align="char" char="."&gt;8&lt;/td&gt;&lt;td align="char" char="."&gt;8&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="left"&gt;VIII&lt;/td&gt;&lt;td align="left" /&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;(&amp;#955;&lt;sub&gt;1&lt;/sub&gt; = &amp;#955;&lt;sub&gt;2&lt;/sub&gt;)&lt;/td&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="left" /&gt;&lt;td align="char" char="."&gt;6&lt;/td&gt;&lt;td align="char" char="."&gt;6&lt;/td&gt;&lt;td align="char" char="."&gt;16&lt;/td&gt;&lt;td align="char" char="."&gt;16&lt;/td&gt;&lt;td align="char" char="."&gt;4&lt;/td&gt;&lt;td align="char" char="."&gt;8&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt; </ephtml> </p> <p>The simulation study is based on <emph>K</emph> = 1000 data sets having a balanced hierarchical structure. In particular, for each data set, we consider <emph>J</emph> = 100 clusters (e.g. schools) having the same number <emph>n</emph><subs><emph>j</emph></subs> = <emph>n</emph> of individuals (e.g. students). We repeat the simulations for <emph>n</emph> = 100 and <emph>n</emph> = 4 to evaluate the influence of cluster size on the bias. 
Indeed, the cluster size plays a key role since the confounder <ephtml> &lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mover accent="true"&gt;&lt;mi&gt;A&lt;/mi&gt;&lt;mo&gt;&amp;#175;&lt;/mo&gt;&lt;/mover&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml> is measured by <ephtml> &lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mover accent="true"&gt;&lt;mi&gt;Y&lt;/mi&gt;&lt;mo&gt;&amp;#175;&lt;/mo&gt;&lt;/mover&gt;&lt;mrow&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml> , whose reliability depends on the cluster size <emph>n</emph>.</p> <p>The simulation study mimics the data on student achievement collected by Invalsi ([<reflink idref="bib9" id="ref58">9</reflink>]). Those data could be used to estimate the effect of a student-level or school-level intervention on test scores measured in the last year of primary school (fifth grade), accounting for pre-test scores measured in the second year of primary school (second grade), as in the motivating example described above.</p> <p>The data generation process consists of the following steps, where <emph>i</emph> = 1, ..., <emph>n</emph> denotes individuals and <emph>j</emph> = 1, ..., <emph>J</emph> denotes clusters.</p> <p></p> <ulist> <item> <bold> Step 1. Generation of individual ability. </bold> For each individual, the ability <emph>A</emph><subs><emph>ij</emph></subs> is generated from a standard normal distribution <emph>A</emph><subs><emph>ij</emph></subs> ∼ <emph>N</emph> (0, 1).</item> <p></p> <item> <bold> Step 2. Allocation of individuals into clusters. 
 </bold> 25% of the individuals are randomly assigned to the clusters, whereas 75% are assigned on the basis of their ability with the following procedure: (<emph>i</emph>) individuals are ordered according to <emph>A</emph><subs><emph>ij</emph></subs>, (<emph>ii</emph>) they are placed into clusters in consecutive groups of <emph>n</emph> × 0.75, that is, 75 when <emph>n</emph> = 100 and 3 when <emph>n</emph> = 4. After the allocation of individuals into clusters, the cluster means of the ability are computed as <ephtml> &lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mover accent="true"&gt;&lt;mi&gt;A&lt;/mi&gt;&lt;mo&gt;&amp;#175;&lt;/mo&gt;&lt;/mover&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/msub&gt;&lt;mo&gt;=&lt;/mo&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;mo&gt;/&lt;/mo&gt;&lt;msub&gt;&lt;mi&gt;n&lt;/mi&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/msub&gt;&lt;msubsup&gt;&lt;mo&gt;&amp;#8721;&lt;/mo&gt;&lt;mrow&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;mo&gt;=&lt;/mo&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;/mrow&gt;&lt;msub&gt;&lt;mi&gt;n&lt;/mi&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/msub&gt;&lt;/msubsup&gt;&lt;msub&gt;&lt;mi&gt;A&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml> .</item> <p></p> <item> <bold> Step 3. Treatment assignment. </bold> For an individual-level treatment (Scenario 1), the binary treatment variable is sampled from a Bernoulli distribution <emph>Z</emph><subs><emph>ij</emph></subs> ∼ Bernoulli (<emph>π</emph><subs><emph>ij</emph></subs>), where <emph>π</emph><subs><emph>ij</emph></subs> = <emph>Pr</emph> (<emph>Z</emph><subs><emph>ij</emph></subs> = 1∣<emph>A</emph><subs><emph>ij</emph></subs>) is the probability of being assigned to treatment for individual <emph>i</emph> in cluster <emph>j</emph>. 
The probability <emph>π</emph><subs><emph>ij</emph></subs> is derived from a logit model with linear predictor <ephtml> &lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mi&gt;&amp;#951;&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;mo&gt;=&lt;/mo&gt;&lt;mi&gt;&amp;#948;&lt;/mi&gt;&lt;mo&gt;+&lt;/mo&gt;&lt;mi&gt;&amp;#945;&lt;/mi&gt;&lt;msub&gt;&lt;mi&gt;A&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;mo&gt;+&lt;/mo&gt;&lt;msub&gt;&lt;mi&gt;&amp;#968;&lt;/mi&gt;&lt;mi&gt;z&lt;/mi&gt;&lt;/msub&gt;&lt;msub&gt;&lt;mover accent="true"&gt;&lt;mi&gt;A&lt;/mi&gt;&lt;mo&gt;&amp;#175;&lt;/mo&gt;&lt;/mover&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml> , setting <emph>α</emph> = 1, <emph>ψ</emph><subs><emph>z</emph></subs> = 1 and <emph>δ</emph> = log(0.2/0.8) = −1.38629 to obtain about 20% of treated individuals. Similarly, for a cluster-level treatment (Scenario 2), the binary treatment variable is sampled once for each cluster from a Bernoulli distribution <emph>Z</emph><subs><emph>j</emph></subs> ∼ Bernoulli (<emph>π</emph><subs><emph>j</emph></subs>), where <ephtml> &lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mi&gt;&amp;#960;&lt;/mi&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/msub&gt;&lt;mo&gt;=&lt;/mo&gt;&lt;mi mathvariant="italic"&gt;Pr&lt;/mi&gt;&lt;mrow&gt;&lt;mo&gt;(&lt;/mo&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mi&gt;Z&lt;/mi&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/msub&gt;&lt;mo&gt;=&lt;/mo&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;mo&gt;&amp;#8739;&lt;/mo&gt;&lt;msub&gt;&lt;mover accent="true"&gt;&lt;mi&gt;A&lt;/mi&gt;&lt;mo&gt;&amp;#175;&lt;/mo&gt;&lt;/mover&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;mo&gt;)&lt;/mo&gt;&lt;/mrow&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml> is the probability of being assigned to treatment for cluster <emph>j</emph>. 
The probability <emph>π</emph><subs><emph>j</emph></subs> is generated by a logit model with linear predictor <ephtml> &lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mi&gt;&amp;#951;&lt;/mi&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/msub&gt;&lt;mo&gt;=&lt;/mo&gt;&lt;mi&gt;&amp;#948;&lt;/mi&gt;&lt;mo&gt;+&lt;/mo&gt;&lt;msub&gt;&lt;mi&gt;&amp;#968;&lt;/mi&gt;&lt;mi&gt;z&lt;/mi&gt;&lt;/msub&gt;&lt;msub&gt;&lt;mover accent="true"&gt;&lt;mi&gt;A&lt;/mi&gt;&lt;mo&gt;&amp;#175;&lt;/mo&gt;&lt;/mover&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml> .</item> <p></p> <item> <bold> Step 4. Generation of pre-test and post-test scores. </bold> The individual pre-test scores are generated using the random intercept model (<reflink idref="bib1" id="ref59">1</reflink>) under both scenarios, while the post-test scores are generated using the random intercept model (<reflink idref="bib2" id="ref60">2</reflink>) for Scenario 1 and model (<reflink idref="bib4" id="ref61">4</reflink>) for Scenario 2. 
The error terms are generated as follows: <emph>e</emph><subs><emph>ij</emph></subs> ∼ <emph>N</emph> (0, 1), <emph>v</emph><subs><emph>ij</emph></subs> ∼ <emph>N</emph> (0, 1) and <ephtml> &lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;msup&gt;&lt;mrow&gt;&lt;mo&gt;(&lt;/mo&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mi&gt;u&lt;/mi&gt;&lt;mrow&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;mo&gt;,&lt;/mo&gt;&lt;msub&gt;&lt;mi&gt;u&lt;/mi&gt;&lt;mrow&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;mo&gt;)&lt;/mo&gt;&lt;/mrow&gt;&lt;mo&gt;&amp;#8242;&lt;/mo&gt;&lt;/msup&gt;&lt;mo&gt;&amp;#8764;&lt;/mo&gt;&lt;mi&gt;N&lt;/mi&gt;&lt;mrow&gt;&lt;mo&gt;(&lt;/mo&gt;&lt;mrow&gt;&lt;mn mathvariant="bold"&gt;0&lt;/mn&gt;&lt;mo&gt;,&lt;/mo&gt;&lt;mi mathvariant="bold"&gt;&amp;#931;&lt;/mi&gt;&lt;/mrow&gt;&lt;mo&gt;)&lt;/mo&gt;&lt;/mrow&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml> , where <bold>Σ</bold> has variances equal to 1 and covariance 0.8. In order to mimic the structure of Invalsi data, in the simulations, the regression coefficients are set as follows: <emph>μ</emph><subs>1</subs> = <emph>μ</emph><subs>2</subs> = 60, <emph>τ</emph> = 2, <emph>β</emph><subs>1</subs> = 16. The values of <emph>β</emph><subs>2</subs>, <emph>ψ</emph><subs>1</subs>, <emph>ψ</emph><subs>2</subs>, λ<subs>1</subs> and λ<subs>2</subs> depend on the specific configuration I-VIII of Table 3.</item> </ulist> <p>As for the parameter λ<subs>1</subs> regulating the measurement error of the pre-test, it is set either to 0 (no measurement error, i.e., perfect reliability) or to 6, corresponding to a reliability <emph>ρ</emph> = 0.88, which is the value observed in the fifth level Invalsi pre-test (see Section 3 of the Supplementary Online Material for details). 
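</p> <p>As an illustration, Steps 1 to 3 of the data generation process listed above can be sketched in Python using only the standard library. This is a minimal sketch, not the code used for the study: the function names <emph>expit</emph> and <emph>generate_design</emph> are hypothetical, and the sketch assumes that the 75% of individuals allocated by ability is itself a random subset and that <emph>n</emph> × 0.75 is an integer. Step 4 is omitted because it depends on models (1), (2) and (4).</p>

```python
import math
import random
from statistics import fmean


def expit(eta):
    """Inverse logit: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def generate_design(J=100, n=4, alpha=1.0, psi_z=1.0, cluster_level=False, seed=0):
    """Sketch of Steps 1-3: abilities, cluster allocation, treatment assignment.

    Returns one dict per individual (hypothetical layout, for illustration).
    """
    rng = random.Random(seed)
    total = J * n
    k = round(0.75 * n)  # ability-sorted members per cluster (75 or 3)

    # Step 1: individual ability A_ij ~ N(0, 1).
    ability = [rng.gauss(0.0, 1.0) for _ in range(total)]

    # Step 2: 75% of individuals are ordered by ability and placed into
    # clusters in consecutive groups of k; the other 25% are allocated at random.
    # ASSUMPTION: the 75% subset is drawn at random from all individuals.
    ids = list(range(total))
    rng.shuffle(ids)
    by_ability = sorted(ids[:J * k], key=lambda i: ability[i])
    at_random = ids[J * k:]
    clusters = [
        by_ability[j * k:(j + 1) * k] + at_random[j * (n - k):(j + 1) * (n - k)]
        for j in range(J)
    ]
    a_bar = [fmean(ability[i] for i in members) for members in clusters]

    # Step 3: Bernoulli treatment from a logit model, delta = log(0.2 / 0.8).
    delta = math.log(0.2 / 0.8)
    rows = []
    for j, members in enumerate(clusters):
        # Scenario 2 draw: once per cluster, eta_j = delta + psi_z * a_bar_j.
        z_cluster = int(expit(delta + psi_z * a_bar[j]) > rng.random())
        for i in members:
            if cluster_level:
                z = z_cluster
            else:
                # Scenario 1: eta_ij = delta + alpha * A_ij + psi_z * a_bar_j.
                eta = delta + alpha * ability[i] + psi_z * a_bar[j]
                z = int(expit(eta) > rng.random())
            rows.append({"cluster": j, "ability": ability[i],
                         "a_bar": a_bar[j], "z": z})
    return rows
```

<p>For example, <emph>generate_design(J=100, n=4, cluster_level=True)</emph> returns 400 records in which the treatment indicator is constant within each cluster. Because most individuals are allocated by ability, the cluster means of ability vary far more than they would under purely random allocation.</p> <p>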
The parameter λ<subs>2</subs> is set to 0 except in the last two configurations (VII and VIII), where it is set to 6 to generate a common measurement error of the same magnitude as λ<subs>1</subs>, in order to check whether equation (2.3) of Section 2 of the Supplementary Online Material also holds in the multilevel setting.</p> <p>The common trend at level 1 (<emph>β</emph><subs>1</subs> = <emph>β</emph><subs>2</subs>) holds in configurations I–IV and VII–VIII. The common trend at level 2 (<emph>ψ</emph><subs>1</subs> = <emph>ψ</emph><subs>2</subs>) holds in configurations I, II and VII. It follows that the common trend holds at both levels in configurations I, II and VII.</p> <p>It is worth noting that in configurations where <emph>ψ</emph><subs>1</subs> = <emph>ψ</emph><subs>2</subs> we obtain the same results for <emph>ψ</emph><subs>1</subs> = <emph>ψ</emph><subs>2</subs> = 0 (no contextual effect) and <emph>ψ</emph><subs>1</subs> = <emph>ψ</emph><subs>2</subs> = 8 (contextual effects of the same magnitude on pre-test and post-test). For this reason, Table 3 reports only <emph>ψ</emph><subs>1</subs> = <emph>ψ</emph><subs>2</subs> = 8.</p> <p>The model parameters are estimated using random intercept linear models fitted by maximum likelihood with the NLMIXED procedure of SAS ([<reflink idref="bib33" id="ref62">33</reflink>]).</p> <p>The post-test score <emph>Y</emph><subs>2<emph>ij</emph></subs> is regressed on the treatment variable <emph>Z</emph><subs><emph>ij</emph></subs> (or <emph>Z</emph><subs><emph>j</emph></subs>) and the pre-test score <emph>Y</emph><subs>1<emph>ij</emph></subs>. 
In order to evaluate the role of the sample cluster mean <ephtml> &lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mover accent="true"&gt;&lt;mi&gt;Y&lt;/mi&gt;&lt;mo&gt;&amp;#175;&lt;/mo&gt;&lt;/mover&gt;&lt;mrow&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml> , we considered for each configuration two model specifications with and without <ephtml> &lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mover accent="true"&gt;&lt;mi&gt;Y&lt;/mi&gt;&lt;mo&gt;&amp;#175;&lt;/mo&gt;&lt;/mover&gt;&lt;mrow&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml> .</p> <p>The parameter of interest is the treatment effect <emph>τ</emph>, thus, we restrict our comments to this parameter. We report the results of the simulation study in terms of relative error (%) of the point estimate of the treatment effect <emph>τ</emph>, namely, <ephtml> &lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;mn&gt;100&lt;/mn&gt;&lt;mo&gt;&amp;#215;&lt;/mo&gt;&lt;mrow&gt;&lt;mo&gt;(&lt;/mo&gt;&lt;mrow&gt;&lt;mover accent="true"&gt;&lt;mi&gt;&amp;#964;&lt;/mi&gt;&lt;mo&gt;^&lt;/mo&gt;&lt;/mover&gt;&lt;mo&gt;&amp;#8722;&lt;/mo&gt;&lt;mi&gt;&amp;#964;&lt;/mi&gt;&lt;/mrow&gt;&lt;mo&gt;)&lt;/mo&gt;&lt;/mrow&gt;&lt;mo&gt;/&lt;/mo&gt;&lt;mi&gt;&amp;#964;&lt;/mi&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml> , where <ephtml> &lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;mover accent="true"&gt;&lt;mi&gt;&amp;#964;&lt;/mi&gt;&lt;mo&gt;^&lt;/mo&gt;&lt;/mover&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml> is the Monte Carlo mean over 1000 simulations and the true value is <emph>τ</emph> = 2.</p> <hd id="AN0183370750-11">Results of Scenario 1 (Individual-Level Treatment)</hd> <p>The results of the Monte Carlo (MC) simulation study when the treatment acts at individual level (Scenario 1) are displayed 
in Table 4. For each configuration defined in Table 3, Table 4 reports, separately for cluster sizes <emph>n</emph> = 100 and <emph>n</emph> = 4, the MC mean of the estimated treatment effect <ephtml> &lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;mover accent="true"&gt;&lt;mi&gt;&amp;#964;&lt;/mi&gt;&lt;mo&gt;^&lt;/mo&gt;&lt;/mover&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml> and its relative error under the conditioning approach and the gain score approach, for models with and without the Cluster Mean (CM) of the pre-test <ephtml> &lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mover accent="true"&gt;&lt;mi&gt;Y&lt;/mi&gt;&lt;mo&gt;&amp;#175;&lt;/mo&gt;&lt;/mover&gt;&lt;mrow&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml> .</p> <p>Table 4. Scenario 1 (Individual-Level Treatment, Figure 1): MC Means of Estimated Treatment Effect <ephtml> &lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;mo&gt;(&lt;/mo&gt;&lt;mover accent="true"&gt;&lt;mi&gt;&amp;#964;&lt;/mi&gt;&lt;mo&gt;^&lt;/mo&gt;&lt;/mover&gt;&lt;mo&gt;)&lt;/mo&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml> and Corresponding Relative Error (%err), Using Conditioning and Gain Score Approaches (Configurations Defined in Table 3; 100 Clusters of Size n ∈ {4, 100}; CM Stands for Cluster Mean <ephtml> &lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mover accent="true"&gt;&lt;mi&gt;Y&lt;/mi&gt;&lt;mo&gt;&amp;#175;&lt;/mo&gt;&lt;/mover&gt;&lt;mrow&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml> ).</p> <p> <ephtml> &lt;table&gt;&lt;thead valign="top"&gt;&lt;tr&gt;&lt;th align="left" rowspan="3"&gt;Conf&lt;/th&gt;&lt;th align="center" colspan="2"&gt;Measurement error&lt;/th&gt;&lt;th align="center" colspan="2"&gt;Common trend&lt;/th&gt;&lt;th align="center" 
colspan="4"&gt;Conditioning&lt;/th&gt;&lt;th align="center" colspan="4"&gt;Gain&lt;/th&gt;&lt;/tr&gt;&lt;tr&gt;&lt;th align="left" rowspan="2"&gt;No&lt;/th&gt;&lt;th align="center" rowspan="2"&gt;Yes&lt;/th&gt;&lt;th align="center" rowspan="2"&gt;Level 1&lt;/th&gt;&lt;th align="center" rowspan="2"&gt;Level 2&lt;/th&gt;&lt;th align="center" colspan="2"&gt;Without CM&lt;/th&gt;&lt;th align="center" colspan="2"&gt;With CM&lt;/th&gt;&lt;th align="center" colspan="2"&gt;Without CM&lt;/th&gt;&lt;th align="center" colspan="2"&gt;With CM&lt;/th&gt;&lt;/tr&gt;&lt;tr&gt;&lt;th align="center"&gt;&lt;p&gt;&lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow xmlns=""&gt;&lt;mover accent="true"&gt;&lt;mi&gt;&amp;#964;&lt;/mi&gt;&lt;mo&gt;^&lt;/mo&gt;&lt;/mover&gt;&lt;/mrow&gt;&lt;/math&gt;&lt;/p&gt;&lt;/th&gt;&lt;th align="center"&gt;%err&lt;/th&gt;&lt;th align="center"&gt;&lt;p&gt;&lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow xmlns=""&gt;&lt;mover accent="true"&gt;&lt;mi&gt;&amp;#964;&lt;/mi&gt;&lt;mo&gt;^&lt;/mo&gt;&lt;/mover&gt;&lt;/mrow&gt;&lt;/math&gt;&lt;/p&gt;&lt;/th&gt;&lt;th align="center"&gt;%err&lt;/th&gt;&lt;th align="center"&gt;&lt;p&gt;&lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow xmlns=""&gt;&lt;mover accent="true"&gt;&lt;mi&gt;&amp;#964;&lt;/mi&gt;&lt;mo&gt;^&lt;/mo&gt;&lt;/mover&gt;&lt;/mrow&gt;&lt;/math&gt;&lt;/p&gt;&lt;/th&gt;&lt;th align="center"&gt;%err&lt;/th&gt;&lt;th align="center"&gt;&lt;p&gt;&lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow xmlns=""&gt;&lt;mover accent="true"&gt;&lt;mi&gt;&amp;#964;&lt;/mi&gt;&lt;mo&gt;^&lt;/mo&gt;&lt;/mover&gt;&lt;/mrow&gt;&lt;/math&gt;&lt;/p&gt;&lt;/th&gt;&lt;th align="center"&gt;%err&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody valign="top"&gt;&lt;tr&gt;&lt;td align="left" colspan="13"&gt;&lt;italic&gt;n&lt;/italic&gt; = 100&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="left"&gt; I&lt;/td&gt;&lt;td 
align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="left" /&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="char" char="."&gt;2.00&lt;/td&gt;&lt;td align="char" char="."&gt;&amp;#8722;0.17&lt;/td&gt;&lt;td align="char" char="."&gt;2.00&lt;/td&gt;&lt;td align="char" char="."&gt;&amp;#8722;0.19&lt;/td&gt;&lt;td align="char" char="."&gt;2.00&lt;/td&gt;&lt;td align="char" char="."&gt;&amp;#8722;0.05&lt;/td&gt;&lt;td align="char" char="."&gt;2.00&lt;/td&gt;&lt;td align="char" char="."&gt;&amp;#8722;0.19&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="left"&gt; II&lt;/td&gt;&lt;td align="left" /&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="char" char="."&gt;3.69&lt;/td&gt;&lt;td align="char" char="."&gt;84.50&lt;/td&gt;&lt;td align="char" char="."&gt;3.61&lt;/td&gt;&lt;td align="char" char="."&gt;80.55&lt;/td&gt;&lt;td align="char" char="."&gt;2.01&lt;/td&gt;&lt;td align="char" char="."&gt;0.45&lt;/td&gt;&lt;td align="char" char="."&gt;2.02&lt;/td&gt;&lt;td align="char" char="."&gt;1.15&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="left"&gt; III&lt;/td&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="left" /&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="left" /&gt;&lt;td align="char" char="."&gt;2.00&lt;/td&gt;&lt;td align="char" char="."&gt;0.06&lt;/td&gt;&lt;td align="char" char="."&gt;2.00&lt;/td&gt;&lt;td align="char" char="."&gt;&amp;#8722;0.17&lt;/td&gt;&lt;td align="char" char="."&gt;2.00&lt;/td&gt;&lt;td align="char" char="."&gt;0.18&lt;/td&gt;&lt;td align="char" char="."&gt;2.00&lt;/td&gt;&lt;td align="char" 
char="."&gt;&amp;#8722;0.17&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="left"&gt; IV&lt;/td&gt;&lt;td align="left" /&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="left" /&gt;&lt;td align="char" char="."&gt;3.69&lt;/td&gt;&lt;td align="char" char="."&gt;84.60&lt;/td&gt;&lt;td align="char" char="."&gt;3.63&lt;/td&gt;&lt;td align="char" char="."&gt;81.70&lt;/td&gt;&lt;td align="char" char="."&gt;2.17&lt;/td&gt;&lt;td align="char" char="."&gt;8.49&lt;/td&gt;&lt;td align="char" char="."&gt;2.03&lt;/td&gt;&lt;td align="char" char="."&gt;1.40&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="left"&gt; V&lt;/td&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="left" /&gt;&lt;td align="left" /&gt;&lt;td align="left" /&gt;&lt;td align="char" char="."&gt;2.00&lt;/td&gt;&lt;td align="char" char="."&gt;&amp;#8722;0.01&lt;/td&gt;&lt;td align="char" char="."&gt;2.00&lt;/td&gt;&lt;td align="char" char="."&gt;&amp;#8722;0.14&lt;/td&gt;&lt;td align="char" char="."&gt;5.45&lt;/td&gt;&lt;td align="char" char="."&gt;172.36&lt;/td&gt;&lt;td align="char" char="."&gt;5.13&lt;/td&gt;&lt;td align="char" char="."&gt;156.33&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="left"&gt; VI&lt;/td&gt;&lt;td align="left" /&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="left" /&gt;&lt;td align="left" /&gt;&lt;td align="char" char="."&gt;4.44&lt;/td&gt;&lt;td align="char" char="."&gt;121.95&lt;/td&gt;&lt;td align="char" char="."&gt;4.38&lt;/td&gt;&lt;td align="char" char="."&gt;119.00&lt;/td&gt;&lt;td align="char" char="."&gt;5.62&lt;/td&gt;&lt;td align="char" char="."&gt;181.00&lt;/td&gt;&lt;td align="char" char="."&gt;5.15&lt;/td&gt;&lt;td align="char" char="."&gt;157.50&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="left"&gt; VII&lt;/td&gt;&lt;td align="left" /&gt;&lt;td 
align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;(&amp;#955;&lt;sub&gt;1&lt;/sub&gt; = &amp;#955;&lt;sub&gt;2&lt;/sub&gt;)&lt;/td&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="char" char="."&gt;2.00&lt;/td&gt;&lt;td align="char" char="."&gt;&amp;#8722;0.05&lt;/td&gt;&lt;td align="char" char="."&gt;2.00&lt;/td&gt;&lt;td align="char" char="."&gt;&amp;#8722;0.06&lt;/td&gt;&lt;td align="char" char="."&gt;2.00&lt;/td&gt;&lt;td align="char" char="."&gt;0.06&lt;/td&gt;&lt;td align="char" char="."&gt;2.00&lt;/td&gt;&lt;td align="char" char="."&gt;&amp;#8722;0.07&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="left"&gt; VIII&lt;/td&gt;&lt;td align="left" /&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;(&amp;#955;&lt;sub&gt;1&lt;/sub&gt; = &amp;#955;&lt;sub&gt;2&lt;/sub&gt;)&lt;/td&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="left" /&gt;&lt;td align="char" char="."&gt;2.00&lt;/td&gt;&lt;td align="char" char="."&gt;0.05&lt;/td&gt;&lt;td align="char" char="."&gt;2.00&lt;/td&gt;&lt;td align="char" char="."&gt;0.02&lt;/td&gt;&lt;td align="char" char="."&gt;2.00&lt;/td&gt;&lt;td align="char" char="."&gt;0.19&lt;/td&gt;&lt;td align="char" char="."&gt;2.00&lt;/td&gt;&lt;td align="char" char="."&gt;0.03&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="left" colspan="13"&gt;&lt;italic&gt;n&lt;/italic&gt; = 4&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="left"&gt; I&lt;/td&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="left" /&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="char" char="."&gt;1.99&lt;/td&gt;&lt;td align="char" char="."&gt;&amp;#8722;0.48&lt;/td&gt;&lt;td align="char" char="."&gt;1.99&lt;/td&gt;&lt;td align="char" 
char="."&gt;&amp;#8722;0.55&lt;/td&gt;&lt;td align="char" char="."&gt;2.00&lt;/td&gt;&lt;td align="char" char="."&gt;0.11&lt;/td&gt;&lt;td align="char" char="."&gt;1.99&lt;/td&gt;&lt;td align="char" char="."&gt;&amp;#8722;0.53&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="left"&gt; II&lt;/td&gt;&lt;td align="left" /&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="char" char="."&gt;4.32&lt;/td&gt;&lt;td align="char" char="."&gt;116.00&lt;/td&gt;&lt;td align="char" char="."&gt;3.82&lt;/td&gt;&lt;td align="char" char="."&gt;91.00&lt;/td&gt;&lt;td align="char" char="."&gt;1.96&lt;/td&gt;&lt;td align="char" char="."&gt;&amp;#8722;2.00&lt;/td&gt;&lt;td align="char" char="."&gt;2.56&lt;/td&gt;&lt;td align="char" char="."&gt;28.00&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="left"&gt; III&lt;/td&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="left" /&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="left" /&gt;&lt;td align="char" char="."&gt;2.14&lt;/td&gt;&lt;td align="char" char="."&gt;7.00&lt;/td&gt;&lt;td align="char" char="."&gt;1.99&lt;/td&gt;&lt;td align="char" char="."&gt;&amp;#8722;0.30&lt;/td&gt;&lt;td align="char" char="."&gt;2.15&lt;/td&gt;&lt;td align="char" char="."&gt;7.48&lt;/td&gt;&lt;td align="char" char="."&gt;1.99&lt;/td&gt;&lt;td align="char" char="."&gt;&amp;#8722;0.30&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="left"&gt; IV&lt;/td&gt;&lt;td align="left" /&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="left" /&gt;&lt;td align="char" char="."&gt;4.89&lt;/td&gt;&lt;td align="char" char="."&gt;144.50&lt;/td&gt;&lt;td align="char" 
char="."&gt;3.97&lt;/td&gt;&lt;td align="char" char="."&gt;98.50&lt;/td&gt;&lt;td align="char" char="."&gt;4.44&lt;/td&gt;&lt;td align="char" char="."&gt;122.00&lt;/td&gt;&lt;td align="char" char="."&gt;2.71&lt;/td&gt;&lt;td align="char" char="."&gt;35.50&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="left"&gt; V&lt;/td&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="left" /&gt;&lt;td align="left" /&gt;&lt;td align="left" /&gt;&lt;td align="char" char="."&gt;1.57&lt;/td&gt;&lt;td align="char" char="."&gt;&amp;#8722;21.50&lt;/td&gt;&lt;td align="char" char="."&gt;2.00&lt;/td&gt;&lt;td align="char" char="."&gt;&amp;#8722;0.10&lt;/td&gt;&lt;td align="char" char="."&gt;7.65&lt;/td&gt;&lt;td align="char" char="."&gt;282.63&lt;/td&gt;&lt;td align="char" char="."&gt;4.73&lt;/td&gt;&lt;td align="char" char="."&gt;136.71&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="left"&gt; VI&lt;/td&gt;&lt;td align="left" /&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="left" /&gt;&lt;td align="left" /&gt;&lt;td align="char" char="."&gt;4.50&lt;/td&gt;&lt;td align="char" char="."&gt;124.80&lt;/td&gt;&lt;td align="char" char="."&gt;4.62&lt;/td&gt;&lt;td align="char" char="."&gt;130.85&lt;/td&gt;&lt;td align="char" char="."&gt;7.62&lt;/td&gt;&lt;td align="char" char="."&gt;281.00&lt;/td&gt;&lt;td align="char" char="."&gt;5.46&lt;/td&gt;&lt;td align="char" char="."&gt;173.20&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="left"&gt; VII&lt;/td&gt;&lt;td align="left" /&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;(&amp;#955;&lt;sub&gt;1&lt;/sub&gt; = &amp;#955;&lt;sub&gt;2&lt;/sub&gt;)&lt;/td&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="char" char="."&gt;1.99&lt;/td&gt;&lt;td align="char" char="."&gt;&amp;#8722;0.40&lt;/td&gt;&lt;td align="char" 
char="."&gt;1.99&lt;/td&gt;&lt;td align="char" char="."&gt;&amp;#8722;0.45&lt;/td&gt;&lt;td align="char" char="."&gt;2.01&lt;/td&gt;&lt;td align="char" char="."&gt;0.25&lt;/td&gt;&lt;td align="char" char="."&gt;1.99&lt;/td&gt;&lt;td align="char" char="."&gt;&amp;#8722;0.35&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="left"&gt; VIII&lt;/td&gt;&lt;td align="left" /&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;(&amp;#955;&lt;sub&gt;1&lt;/sub&gt; = &amp;#955;&lt;sub&gt;2&lt;/sub&gt;)&lt;/td&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="left" /&gt;&lt;td align="char" char="."&gt;2.09&lt;/td&gt;&lt;td align="char" char="."&gt;4.35&lt;/td&gt;&lt;td align="char" char="."&gt;2.08&lt;/td&gt;&lt;td align="char" char="."&gt;4.05&lt;/td&gt;&lt;td align="char" char="."&gt;2.15&lt;/td&gt;&lt;td align="char" char="."&gt;7.60&lt;/td&gt;&lt;td align="char" char="."&gt;2.08&lt;/td&gt;&lt;td align="char" char="."&gt;3.90&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt; </ephtml> </p> <p>The results displayed in Table 4 confirm the findings of [<reflink idref="bib19" id="ref63">19</reflink>] in a multilevel context. Indeed, the treatment effect is correctly estimated under either approach when the pre-test is measured without error (reliability equal to 1) and the common trend assumption is fully satisfied at both level 1 and level 2 (configuration I). Moreover, under the common trend assumption, the gain score approach is to be preferred to the conditioning approach in the presence of measurement error on the pre-test (configuration II), even when the reliability is fairly high (0.88 is a typical value for validated tests like the ones adopted by Invalsi). Notably, the advantage of the gain score approach under the common trend assumption is that it cancels out the terms affected by measurement error. 
As shown in Table 4, this advantage is lost with the inclusion of the pre-test cluster mean <ephtml> &lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mover accent="true"&gt;&lt;mi&gt;Y&lt;/mi&gt;&lt;mo&gt;&amp;#175;&lt;/mo&gt;&lt;/mover&gt;&lt;mrow&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml> when the cluster size is small (<emph>n</emph> = 4).</p> <p>Similar results are obtained when the common trend assumption holds only at level 1 (configurations III and IV) for a large cluster size (<emph>n</emph> = 100). Namely, the treatment effect is estimated without bias under both approaches when measurement error is absent (configuration III), while in the presence of measurement error (configuration IV) the estimate is nearly unbiased only with the gain score approach. By contrast, for a small cluster size (<emph>n</emph> = 4), both approaches yield unbiased estimates only in the absence of measurement error (configuration III), as expected from equation (<reflink idref="bib3" id="ref64">3</reflink>).</p> <p>When the common trend assumption is violated at both levels (configurations V and VI), the treatment effect is seriously overestimated with the gain score approach. On the other hand, the conditioning approach leads to an unbiased estimator when the pre-test is measured without error (configuration V) and the cluster size is large (<emph>n</emph> = 100). However, with a small cluster size (<emph>n</emph> = 4), it is necessary to include the pre-test cluster mean to obtain an unbiased estimator. Moreover, when the pre-test is affected by measurement error (configuration VI), the conditioning approach is also biased, though less so than the gain score approach.</p> <p>Configurations VII and VIII show what happens when measurement error affects both the pre-test and the post-test with the same magnitude (λ<subs>1</subs> = λ<subs>2</subs>). 
In this situation, the treatment effect is estimated without bias even with the conditioning approach, as long as the common trend assumption holds at least at level 1 (equation 2.2 of Section 2 of the Supplementary Online Material). However, if the common trend at level 1 is relaxed or the magnitude of the measurement error differs between pre-test and post-test (λ<subs>1</subs> ≠ λ<subs>2</subs>), the unbiasedness of the conditioning approach is lost (not shown here).</p> <p>We performed many other simulations to check the effect of other parameters on the treatment effect estimation, but the findings were less interesting (results available upon request). For instance, we found that, as expected, an increase in the effect of the ability on the treatment assignment is associated with a substantial increase in the estimation bias under both approaches. By contrast, changing the values of other parameters had a negligible impact on the estimation of the treatment effect, for example considering a higher proportion of treated subjects (30% instead of 20%) or forming the clusters in a different way (in the simulation study, 75% of the subjects were allocated on the basis of their ability and 25% at random).</p> <p>In summary, the conditioning approach is to be preferred in the absence of measurement error since it yields an unbiased estimator of the treatment effect. The amount of measurement error is summarised by the reliability of the test. On the other hand, when measurement error is substantial, say when the reliability is less than 0.90, the conditioning approach yields a markedly biased estimator of the treatment effect; in this situation, the gain score approach yields an unbiased estimator when the cluster size is large and the common trend assumption holds at least at level 1, while for a small cluster size the common trend must hold at both levels. 
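</p> <p>The contrast between the two approaches under measurement error can be illustrated with a minimal sketch in Python. It is not the simulation design used in this study: it assumes a single-level setting estimated by ordinary least squares rather than a multilevel model, a standardised latent ability, and a common trend (ability has the same coefficient at pre-test and post-test). The helper ols, the assignment rule, the sample size, the illustrative true effect of 2 and the noise scales are all hypothetical choices; only the reliability of 0.88 is taken from the text.</p>

```python
import numpy as np

def ols(y, *regressors):
    """Ordinary least squares with an intercept; returns the coefficient vector."""
    X = np.column_stack([np.ones_like(y), *regressors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

rng = np.random.default_rng(2025)
N, tau = 100_000, 2.0   # hypothetical sample size and illustrative true effect
reliability = 0.88      # reliability of the pre-test quoted in the text

ability = rng.normal(size=N)  # latent ability with unit variance (assumption)
# Hypothetical assignment rule: treatment is more likely for high ability.
z = (ability + rng.normal(size=N) > 0.8).astype(float)

# Error standard deviation implied by the reliability when Var(ability) = 1.
error_sd = np.sqrt((1 - reliability) / reliability)
y1_clean = ability                                   # pre-test without error
y1_noisy = ability + error_sd * rng.normal(size=N)   # pre-test with error
# Common trend: ability enters the post-test with the same coefficient (1).
y2 = ability + tau * z + 0.5 * rng.normal(size=N)

tau_cond_clean = ols(y2, z, y1_clean)[1]     # conditioning, error-free pre-test
tau_cond_noisy = ols(y2, z, y1_noisy)[1]     # conditioning, error-prone pre-test
tau_gain_noisy = ols(y2 - y1_noisy, z)[1]    # gain score, error-prone pre-test

print(f"conditioning, no error:   {tau_cond_clean:.3f}")
print(f"conditioning, with error: {tau_cond_noisy:.3f}")
print(f"gain score, with error:   {tau_gain_noisy:.3f}")
```

<p>Under these assumed settings, the conditioning estimate based on the error-prone pre-test should drift away from the true effect, whereas the gain score estimate and the conditioning estimate based on the error-free pre-test should stay close to it.</p> <p>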
As for the inclusion of the pre-test cluster mean <ephtml> &lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mover accent="true"&gt;&lt;mi&gt;Y&lt;/mi&gt;&lt;mo&gt;&amp;#175;&lt;/mo&gt;&lt;/mover&gt;&lt;mrow&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml> as a covariate, it is better to always include it with the conditioning approach regardless of the cluster size, while with the gain score approach, it depends on the type of common trend assumption: adding the pre-test cluster mean reduces the bias if the common trend holds only at level 1, while it increases the bias if it holds at both levels.</p> <hd id="AN0183370750-12">Results of Scenario 2 (Cluster-Level Treatment)</hd> <p>The results of the MC simulation study when the treatment is at cluster level (Scenario 2) are displayed in Table 5. For each configuration defined in Table 3, we show the MC means of the estimated treatment effect together with the corresponding relative error, under the conditioning approach and the gain score approach. We also distinguish between models including or not including the Cluster Mean (CM) pre-test <ephtml> &lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mover accent="true"&gt;&lt;mi&gt;Y&lt;/mi&gt;&lt;mo&gt;&amp;#175;&lt;/mo&gt;&lt;/mover&gt;&lt;mrow&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml> .</p> <p>Table 5. 
Scenario 2 (Cluster-Level Treatment, Figure 2): MC Means of Estimated Treatment Effect <ephtml> &lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;mo&gt;(&lt;/mo&gt;&lt;mover accent="true"&gt;&lt;mi&gt;&amp;#964;&lt;/mi&gt;&lt;mo&gt;^&lt;/mo&gt;&lt;/mover&gt;&lt;mo&gt;)&lt;/mo&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml> and Corresponding Relative Error (%err), Using Conditioning and Gain Score Approaches (Configurations Defined in Table 3; 100 Clusters of Size n ∈ {4, 100}; CM Stands for Cluster Mean <ephtml> &lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mover accent="true"&gt;&lt;mi&gt;Y&lt;/mi&gt;&lt;mo&gt;&amp;#175;&lt;/mo&gt;&lt;/mover&gt;&lt;mrow&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml> ).</p> <p> <ephtml> &lt;table&gt;&lt;thead valign="top"&gt;&lt;tr&gt;&lt;th align="left" rowspan="3"&gt;Conf&lt;/th&gt;&lt;th align="center" colspan="2"&gt;Measurement Error&lt;/th&gt;&lt;th align="center" colspan="2"&gt;Common Trend&lt;/th&gt;&lt;th align="center" colspan="4"&gt;Conditioning&lt;/th&gt;&lt;th align="center" colspan="4"&gt;Gain&lt;/th&gt;&lt;/tr&gt;&lt;tr&gt;&lt;th align="left" rowspan="2"&gt;Yes&lt;/th&gt;&lt;th align="center" rowspan="2"&gt;No&lt;/th&gt;&lt;th align="center" rowspan="2"&gt;Level 1&lt;/th&gt;&lt;th align="center" rowspan="2"&gt;Level 2&lt;/th&gt;&lt;th align="center" colspan="2"&gt;Without CM&lt;/th&gt;&lt;th align="center" colspan="2"&gt;With CM&lt;/th&gt;&lt;th align="center" colspan="2"&gt;Without CM&lt;/th&gt;&lt;th align="center" colspan="2"&gt;With CM&lt;/th&gt;&lt;/tr&gt;&lt;tr&gt;&lt;th align="center"&gt;&lt;p&gt;&lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow xmlns=""&gt;&lt;mover accent="true"&gt;&lt;mi&gt;&amp;#964;&lt;/mi&gt;&lt;mo&gt;^&lt;/mo&gt;&lt;/mover&gt;&lt;/mrow&gt;&lt;/math&gt;&lt;/p&gt;&lt;/th&gt;&lt;th align="center"&gt;%err&lt;/th&gt;&lt;th 
align="center"&gt;&lt;p&gt;&lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow xmlns=""&gt;&lt;mover accent="true"&gt;&lt;mi&gt;&amp;#964;&lt;/mi&gt;&lt;mo&gt;^&lt;/mo&gt;&lt;/mover&gt;&lt;/mrow&gt;&lt;/math&gt;&lt;/p&gt;&lt;/th&gt;&lt;th align="center"&gt;%err&lt;/th&gt;&lt;th align="center"&gt;&lt;p&gt;&lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow xmlns=""&gt;&lt;mover accent="true"&gt;&lt;mi&gt;&amp;#964;&lt;/mi&gt;&lt;mo&gt;^&lt;/mo&gt;&lt;/mover&gt;&lt;/mrow&gt;&lt;/math&gt;&lt;/p&gt;&lt;/th&gt;&lt;th align="center"&gt;%err&lt;/th&gt;&lt;th align="center"&gt;&lt;p&gt;&lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow xmlns=""&gt;&lt;mover accent="true"&gt;&lt;mi&gt;&amp;#964;&lt;/mi&gt;&lt;mo&gt;^&lt;/mo&gt;&lt;/mover&gt;&lt;/mrow&gt;&lt;/math&gt;&lt;/p&gt;&lt;/th&gt;&lt;th align="center"&gt;%err&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody valign="top"&gt;&lt;tr&gt;&lt;td align="left" colspan="13"&gt;&lt;italic&gt;n&lt;/italic&gt; = 100&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="left"&gt; I&lt;/td&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="left" /&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="char" char="."&gt;2.00&lt;/td&gt;&lt;td align="char" char="."&gt;&amp;#8722;0.15&lt;/td&gt;&lt;td align="char" char="."&gt;1.99&lt;/td&gt;&lt;td align="char" char="."&gt;&amp;#8722;0.33&lt;/td&gt;&lt;td align="char" char="."&gt;2.00&lt;/td&gt;&lt;td align="char" char="."&gt;&amp;#8722;0.01&lt;/td&gt;&lt;td align="char" char="."&gt;1.99&lt;/td&gt;&lt;td align="char" char="."&gt;&amp;#8722;0.35&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="left"&gt; II&lt;/td&gt;&lt;td align="left" /&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td 
align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="char" char="."&gt;4.95&lt;/td&gt;&lt;td align="char" char="."&gt;147.50&lt;/td&gt;&lt;td align="char" char="."&gt;2.01&lt;/td&gt;&lt;td align="char" char="."&gt;0.35&lt;/td&gt;&lt;td align="char" char="."&gt;2.00&lt;/td&gt;&lt;td align="char" char="."&gt;&amp;#8722;0.03&lt;/td&gt;&lt;td align="char" char="."&gt;2.01&lt;/td&gt;&lt;td align="char" char="."&gt;0.35&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="left"&gt; III&lt;/td&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="left" /&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="left" /&gt;&lt;td align="char" char="."&gt;4.08&lt;/td&gt;&lt;td align="char" char="."&gt;104.15&lt;/td&gt;&lt;td align="char" char="."&gt;2.00&lt;/td&gt;&lt;td align="char" char="."&gt;&amp;#8722;0.10&lt;/td&gt;&lt;td align="char" char="."&gt;4.09&lt;/td&gt;&lt;td align="char" char="."&gt;104.35&lt;/td&gt;&lt;td align="char" char="."&gt;2.00&lt;/td&gt;&lt;td align="char" char="."&gt;&amp;#8722;0.10&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="left"&gt; IV&lt;/td&gt;&lt;td align="left" /&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="left" /&gt;&lt;td align="char" char="."&gt;6.58&lt;/td&gt;&lt;td align="char" char="."&gt;229.10&lt;/td&gt;&lt;td align="char" char="."&gt;2.02&lt;/td&gt;&lt;td align="char" char="."&gt;0.90&lt;/td&gt;&lt;td align="char" char="."&gt;4.09&lt;/td&gt;&lt;td align="char" char="."&gt;104.50&lt;/td&gt;&lt;td align="char" char="."&gt;2.02&lt;/td&gt;&lt;td align="char" char="."&gt;0.90&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="left"&gt; V&lt;/td&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="left" /&gt;&lt;td align="left" /&gt;&lt;td align="left" /&gt;&lt;td align="char" 
char="."&gt;&amp;#8722;2.19&lt;/td&gt;&lt;td align="char" char="."&gt;&amp;#8722;209.70&lt;/td&gt;&lt;td align="char" char="."&gt;2.00&lt;/td&gt;&lt;td align="char" char="."&gt;&amp;#8722;0.10&lt;/td&gt;&lt;td align="char" char="."&gt;4.10&lt;/td&gt;&lt;td align="char" char="."&gt;104.95&lt;/td&gt;&lt;td align="char" char="."&gt;2.00&lt;/td&gt;&lt;td align="char" char="."&gt;&amp;#8722;0.10&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="left"&gt; VI&lt;/td&gt;&lt;td align="left" /&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="left" /&gt;&lt;td align="left" /&gt;&lt;td align="char" char="."&gt;2.14&lt;/td&gt;&lt;td align="char" char="."&gt;6.83&lt;/td&gt;&lt;td align="char" char="."&gt;2.02&lt;/td&gt;&lt;td align="char" char="."&gt;0.85&lt;/td&gt;&lt;td align="char" char="."&gt;4.09&lt;/td&gt;&lt;td align="char" char="."&gt;104.70&lt;/td&gt;&lt;td align="char" char="."&gt;2.02&lt;/td&gt;&lt;td align="char" char="."&gt;0.85&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="left"&gt; VII&lt;/td&gt;&lt;td align="left" /&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;(&amp;#955;&lt;sub&gt;1&lt;/sub&gt; = &amp;#955;&lt;sub&gt;2&lt;/sub&gt;)&lt;/td&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="char" char="."&gt;2.00&lt;/td&gt;&lt;td align="char" char="."&gt;&amp;#8722;0.10&lt;/td&gt;&lt;td align="char" char="."&gt;1.99&lt;/td&gt;&lt;td align="char" char="."&gt;&amp;#8722;0.32&lt;/td&gt;&lt;td align="char" char="."&gt;2.00&lt;/td&gt;&lt;td align="char" char="."&gt;&amp;#8722;0.01&lt;/td&gt;&lt;td align="char" char="."&gt;1.99&lt;/td&gt;&lt;td align="char" char="."&gt;&amp;#8722;0.32&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="left"&gt; VIII&lt;/td&gt;&lt;td align="left" /&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;(&amp;#955;&lt;sub&gt;1&lt;/sub&gt; = 
&amp;#955;&lt;sub&gt;2&lt;/sub&gt;)&lt;/td&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="left" /&gt;&lt;td align="char" char="."&gt;4.10&lt;/td&gt;&lt;td align="char" char="."&gt;104.95&lt;/td&gt;&lt;td align="char" char="."&gt;2.00&lt;/td&gt;&lt;td align="char" char="."&gt;0.08&lt;/td&gt;&lt;td align="char" char="."&gt;4.10&lt;/td&gt;&lt;td align="char" char="."&gt;105.15&lt;/td&gt;&lt;td align="char" char="."&gt;2.00&lt;/td&gt;&lt;td align="char" char="."&gt;0.08&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="left" colspan="13"&gt;&lt;italic&gt;n&lt;/italic&gt; = 4&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="left"&gt; I&lt;/td&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="left" /&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="char" char="."&gt;1.99&lt;/td&gt;&lt;td align="char" char="."&gt;&amp;#8722;0.30&lt;/td&gt;&lt;td align="char" char="."&gt;1.99&lt;/td&gt;&lt;td align="char" char="."&gt;&amp;#8722;0.35&lt;/td&gt;&lt;td align="char" char="."&gt;2.00&lt;/td&gt;&lt;td align="char" char="."&gt;&amp;#8722;0.05&lt;/td&gt;&lt;td align="char" char="."&gt;1.99&lt;/td&gt;&lt;td align="char" char="."&gt;&amp;#8722;0.35&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="left"&gt; II&lt;/td&gt;&lt;td align="left" /&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="char" char="."&gt;3.19&lt;/td&gt;&lt;td align="char" char="."&gt;59.40&lt;/td&gt;&lt;td align="char" char="."&gt;2.34&lt;/td&gt;&lt;td align="char" char="."&gt;17.00&lt;/td&gt;&lt;td align="char" char="."&gt;1.99&lt;/td&gt;&lt;td align="char" char="."&gt;&amp;#8722;0.75&lt;/td&gt;&lt;td align="char" 
char="."&gt;2.34&lt;/td&gt;&lt;td align="char" char="."&gt;17.00&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="left"&gt; III&lt;/td&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="left" /&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="left" /&gt;&lt;td align="char" char="."&gt;4.13&lt;/td&gt;&lt;td align="char" char="."&gt;106.45&lt;/td&gt;&lt;td align="char" char="."&gt;2.00&lt;/td&gt;&lt;td align="char" char="."&gt;&amp;#8722;0.10&lt;/td&gt;&lt;td align="char" char="."&gt;4.29&lt;/td&gt;&lt;td align="char" char="."&gt;114.35&lt;/td&gt;&lt;td align="char" char="."&gt;2.00&lt;/td&gt;&lt;td align="char" char="."&gt;&amp;#8722;0.10&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="left"&gt; IV&lt;/td&gt;&lt;td align="left" /&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="left" /&gt;&lt;td align="char" char="."&gt;5.15&lt;/td&gt;&lt;td align="char" char="."&gt;157.50&lt;/td&gt;&lt;td align="char" char="."&gt;2.56&lt;/td&gt;&lt;td align="char" char="."&gt;28.00&lt;/td&gt;&lt;td align="char" char="."&gt;4.29&lt;/td&gt;&lt;td align="char" char="."&gt;114.40&lt;/td&gt;&lt;td align="char" char="."&gt;2.55&lt;/td&gt;&lt;td align="char" char="."&gt;27.50&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="left"&gt; V&lt;/td&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="left" /&gt;&lt;td align="left" /&gt;&lt;td align="left" /&gt;&lt;td align="char" char="."&gt;0.96&lt;/td&gt;&lt;td align="char" char="."&gt;&amp;#8722;52.00&lt;/td&gt;&lt;td align="char" char="."&gt;2.00&lt;/td&gt;&lt;td align="char" char="."&gt;&amp;#8722;0.07&lt;/td&gt;&lt;td align="char" char="."&gt;4.30&lt;/td&gt;&lt;td align="char" char="."&gt;114.85&lt;/td&gt;&lt;td align="char" char="."&gt;2.00&lt;/td&gt;&lt;td align="char" 
char="."&gt;&amp;#8722;0.05&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="left"&gt; VI&lt;/td&gt;&lt;td align="left" /&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="left" /&gt;&lt;td align="left" /&gt;&lt;td align="char" char="."&gt;2.29&lt;/td&gt;&lt;td align="char" char="."&gt;14.70&lt;/td&gt;&lt;td align="char" char="."&gt;2.41&lt;/td&gt;&lt;td align="char" char="."&gt;20.50&lt;/td&gt;&lt;td align="char" char="."&gt;4.30&lt;/td&gt;&lt;td align="char" char="."&gt;114.95&lt;/td&gt;&lt;td align="char" char="."&gt;2.42&lt;/td&gt;&lt;td align="char" char="."&gt;20.95&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="left"&gt; VII&lt;/td&gt;&lt;td align="left" /&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;(&amp;#955;&lt;sub&gt;1&lt;/sub&gt; = &amp;#955;&lt;sub&gt;2&lt;/sub&gt;)&lt;/td&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="char" char="."&gt;3.21&lt;/td&gt;&lt;td align="char" char="."&gt;60.50&lt;/td&gt;&lt;td align="char" char="."&gt;2.36&lt;/td&gt;&lt;td align="char" char="."&gt;17.85&lt;/td&gt;&lt;td align="char" char="."&gt;2.00&lt;/td&gt;&lt;td align="char" char="."&gt;0.20&lt;/td&gt;&lt;td align="char" char="."&gt;2.36&lt;/td&gt;&lt;td align="char" char="."&gt;17.85&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="left"&gt; VIII&lt;/td&gt;&lt;td align="left" /&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;(&amp;#955;&lt;sub&gt;1&lt;/sub&gt; = &amp;#955;&lt;sub&gt;2&lt;/sub&gt;)&lt;/td&gt;&lt;td align="left"&gt;&lt;italic&gt;&amp;#10003;&lt;/italic&gt;&lt;/td&gt;&lt;td align="left" /&gt;&lt;td align="char" char="."&gt;4.17&lt;/td&gt;&lt;td align="char" char="."&gt;108.50&lt;/td&gt;&lt;td align="char" char="."&gt;2.09&lt;/td&gt;&lt;td align="char" char="."&gt;4.35&lt;/td&gt;&lt;td align="char" char="."&gt;4.29&lt;/td&gt;&lt;td align="char" 
char="."&gt;114.50&lt;/td&gt;&lt;td align="char" char="."&gt;2.09&lt;/td&gt;&lt;td align="char" char="."&gt;4.35&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt; </ephtml> </p> <p>Let us first consider the results without the cluster mean pre-test score <ephtml> &lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mover accent="true"&gt;&lt;mi&gt;Y&lt;/mi&gt;&lt;mo&gt;&amp;#175;&lt;/mo&gt;&lt;/mover&gt;&lt;mrow&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml> . The simulations suggest that the absence of measurement error is necessary, but not sufficient, to correctly estimate the treatment effect under the conditioning approach. Indeed, the treatment effect is correctly estimated only when the measurement error on the pre-test is absent and the common trend holds at both levels, as in configuration I (in addition to the special case of common measurement error of configuration VII).</p> <p>The situation changes sharply when the cluster mean pre-test <ephtml> &lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mover accent="true"&gt;&lt;mi&gt;Y&lt;/mi&gt;&lt;mo&gt;&amp;#175;&lt;/mo&gt;&lt;/mover&gt;&lt;mrow&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml> is introduced as a regressor in the conditioning model and the cluster size is large (<emph>n</emph> = 100). Indeed, in this case, the conditioning approach yields an unbiased estimator even in the presence of measurement error (λ<subs>1</subs> ≠ 0). 
In fact, the treatment assignment now depends on the cluster mean ability <ephtml> &lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mover accent="true"&gt;&lt;mi&gt;A&lt;/mi&gt;&lt;mo&gt;&amp;#175;&lt;/mo&gt;&lt;/mover&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml> , which is measured by the cluster mean pre-test <ephtml> &lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mover accent="true"&gt;&lt;mi&gt;Y&lt;/mi&gt;&lt;mo&gt;&amp;#175;&lt;/mo&gt;&lt;/mover&gt;&lt;mrow&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml> , whose reliability tends to one as the cluster size increases, whatever the value of λ<subs>1</subs> (equation 3.5 in Section 3 of the Supplementary Online Material). In configurations III and V, where there is no measurement error but the common trend at level 2 does not hold, the inclusion of <ephtml> &lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mover accent="true"&gt;&lt;mi&gt;Y&lt;/mi&gt;&lt;mo&gt;&amp;#175;&lt;/mo&gt;&lt;/mover&gt;&lt;mrow&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml> is needed to avoid level 2 endogeneity ([<reflink idref="bib14" id="ref65">14</reflink>]).</p> <p>The above considerations are confirmed when looking at simulation results for a small cluster size. 
Namely, when <emph>n</emph> = 4, adding the cluster mean pre-test <ephtml> &lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mover accent="true"&gt;&lt;mi&gt;Y&lt;/mi&gt;&lt;mo&gt;&amp;#175;&lt;/mo&gt;&lt;/mover&gt;&lt;mrow&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml> does not solve the measurement error problem (configurations II, IV and VI) because the reliability of <ephtml> &lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mover accent="true"&gt;&lt;mi&gt;Y&lt;/mi&gt;&lt;mo&gt;&amp;#175;&lt;/mo&gt;&lt;/mover&gt;&lt;mrow&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml> is far from 1. On the other hand, the inclusion of <ephtml> &lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mover accent="true"&gt;&lt;mi&gt;Y&lt;/mi&gt;&lt;mo&gt;&amp;#175;&lt;/mo&gt;&lt;/mover&gt;&lt;mrow&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml> is needed to solve the endogeneity problem of configurations III and V.</p> <p>It is worth noting that, when the pre-test cluster mean <ephtml> &lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mover accent="true"&gt;&lt;mi&gt;Y&lt;/mi&gt;&lt;mo&gt;&amp;#175;&lt;/mo&gt;&lt;/mover&gt;&lt;mrow&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml> is included in the model, the gain score approach provides the same estimate of the treatment effect as the conditioning approach. 
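</p> <p>The coincidence of the two estimates once the pre-test cluster mean is included can be checked numerically. The sketch below is an illustration under stated assumptions, not the model fitted in this study: it uses ordinary least squares instead of a random intercept model, and the data generating values (100 clusters of size 4, an illustrative effect of 2, the assignment rule and the noise scales) as well as the helper ols are hypothetical. Under least squares the equality is exact, because the within-cluster deviation of the pre-test from its cluster mean sums to zero in every cluster and is therefore orthogonal to any cluster-level regressor, including a cluster-level treatment.</p>

```python
import numpy as np

def ols(y, *regressors):
    """Ordinary least squares with an intercept; returns the coefficient vector."""
    X = np.column_stack([np.ones_like(y), *regressors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

rng = np.random.default_rng(7)
J, n, tau = 100, 4, 2.0                 # hypothetical design: 100 clusters of size 4
cluster = np.repeat(np.arange(J), n)    # cluster index of each subject

# Latent ability = cluster component + individual component (assumption).
ability = rng.normal(size=J)[cluster] + rng.normal(size=J * n)
mean_ability = np.bincount(cluster, weights=ability) / n
# Cluster-level treatment depending on the cluster mean ability (hypothetical rule).
z = (mean_ability + rng.normal(size=J) > 0.5).astype(float)[cluster]

y1 = ability + 0.4 * rng.normal(size=J * n)                # error-prone pre-test
y1_bar = (np.bincount(cluster, weights=y1) / n)[cluster]   # pre-test cluster mean
y2 = ability + tau * z + 0.5 * rng.normal(size=J * n)      # post-test

tau_cond = ols(y2, z, y1, y1_bar)[1]    # conditioning with cluster mean
tau_gain = ols(y2 - y1, z, y1_bar)[1]   # gain score with cluster mean

print(f"conditioning with CM: {tau_cond:.6f}")
print(f"gain score with CM:   {tau_gain:.6f}")
```

<p>In this sketch the two printed values agree up to floating-point error, although neither needs to equal the true effect, since the cluster mean of an error-prone pre-test is itself error-prone when clusters are small.</p> <p>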
As expected, if the common trend holds at both levels, the gain score approach without the pre-test cluster mean <ephtml> &lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mover accent="true"&gt;&lt;mi&gt;Y&lt;/mi&gt;&lt;mo&gt;&amp;#175;&lt;/mo&gt;&lt;/mover&gt;&lt;mrow&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml> provides an unbiased estimator of the treatment effect; the unbiasedness is lost if the cluster mean is added in settings with small clusters and measurement error (configurations II and VII).</p> <p>In sum, with a large cluster size, it is always advisable to include the pre-test cluster mean: in fact, the estimator of the treatment effect is unbiased for all configurations and identical for the conditioning and gain score approaches. By contrast, with a small cluster size, the two approaches give different results: in some configurations, the best approach is conditioning with the pre-test cluster mean, whereas in other configurations, the best approach is the gain score without the cluster mean.</p> <p>Note that the other parameter estimates (not shown here) may differ between the two approaches. For example, the estimate of the cluster-level variance is greater in the conditioning model, while the individual-level variance is greater in the gain score model.</p> <p>So far, we have considered data generating models where the treatment assignment depends only on a latent variable (i.e. the latent ability). However, in applications, there are also situations where the pre-test score, or a summary of it, directly influences the treatment assignment ([<reflink idref="bib20" id="ref66">20</reflink>], pp. 13–14). 
To take into account situations of this type, Section 4 of the Supplementary Online Material reports the results of simulations for a third scenario, where the cluster-level treatment is assigned only on the basis of observed variables, specifically the cluster mean of the pre-test score <ephtml> &lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mover accent="true"&gt;&lt;mi&gt;Y&lt;/mi&gt;&lt;mo&gt;&amp;#175;&lt;/mo&gt;&lt;/mover&gt;&lt;mrow&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml> . As expected, the treatment effect estimator is always unbiased when the pre-test cluster mean is included in the model. On the other hand, the omission of <ephtml> &lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mover accent="true"&gt;&lt;mi&gt;Y&lt;/mi&gt;&lt;mo&gt;&amp;#175;&lt;/mo&gt;&lt;/mover&gt;&lt;mrow&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;/math&gt; </ephtml> leads to biased estimates due to endogeneity.</p> <hd id="AN0183370750-13">Concluding Remarks</hd> <p>In this paper, we considered the estimation of a treatment effect on a test score in observational studies when a pre-test is available, comparing the alternative approaches known as conditioning and gain score. Our contribution mainly lies in analyzing the merits and drawbacks of the two approaches in a multilevel setting, which is relevant in many research fields.</p> <p>We first outlined the assumptions for causal inference and the data generating models. Then, we provided details on the approaches to estimate the causal effect of the treatment, distinguishing whether the treatment is at individual or at cluster level. 
The performance of the conditioning and gain score estimating approaches was investigated through a simulation study under different scenarios.</p> <p>For a treatment at the individual level, our results confirm the findings of the literature for a single-level setting. Specifically, the conditioning approach gives a biased estimator of the treatment effect whenever the pre-test is affected by measurement error, though the bias is reduced if the pre-test and post-test scores are affected by a common source of error of the same magnitude. As a consequence, designs with the same instrument at pre-test and post-test should be preferred as they help reduce the bias. As for the gain score approach, it provides an unbiased estimator if the common trend assumption holds at both individual and cluster levels. For an individual-level treatment, including the cluster mean of the pre-test score as a regressor is not recommended, as it introduces further measurement error without reducing the bias.</p> <p>On the other hand, the findings for a treatment at the cluster level are different because the cluster mean of the latent ability acts as a confounder. Thus, its observable counterpart, namely, the cluster mean of the pre-test score, should be included as a regressor. However, this is not sufficient to completely eliminate the bias if the pre-test is affected by measurement error, because the cluster mean of the pre-test is in turn affected by measurement error. This issue is especially relevant with small clusters (e.g. size 4 in our simulation study). Nevertheless, including the cluster mean of the pre-test as a regressor is generally advisable because it reduces the bias. 
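</p> <p>The dependence of the reliability of the pre-test cluster mean on the cluster size can be made concrete with a short Python sketch. It does not reproduce the formula of the Supplementary Online Material. It assumes classical additive measurement error that is independent across subjects, so that averaging over n subjects divides the error variance by n, and it treats the variance of the cluster mean ability as a fixed input. The function name cluster_mean_reliability and the numerical values are hypothetical; error_var stands for the measurement error variance of an individual pre-test score.</p>

```python
def cluster_mean_reliability(var_mean_ability, error_var, n):
    """Share of the variance of the observed cluster mean that is due to the
    true cluster mean ability, assuming independent additive errors."""
    return var_mean_ability / (var_mean_ability + error_var / n)

# Hypothetical values, chosen only to show the role of the cluster size.
for n in (4, 25, 100):
    r = cluster_mean_reliability(var_mean_ability=0.3, error_var=1.0, n=n)
    print(f"n = {n:3d}: reliability of the cluster mean = {r:.3f}")
```

<p>Whatever the assumed error variance, the ratio approaches one as n grows, which is consistent with the contrast between the small and the large cluster size discussed in the text.</p> <p>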
It is worth noting that, if the cluster mean of the pre-test is used as a regressor, then the conditioning and gain score approaches have the same estimand, thus providing similar estimates of the treatment effect regardless of the cluster size.</p> <hd id="AN0183370750-14">Limitations and Future Work</hd> <p>In applied work, the most difficult task is evaluating the credibility of the assumptions underlying the chosen method. As mentioned above, researchers can use statistical tests and graphs to empirically support the credibility of the assumptions. However, these are only indirect tests of the assumptions underlying the conditioning and gain score approaches. Researchers should also think carefully about the conceptual reasons for which these assumptions may be (or may not be) plausible. As for the common trend assumption, it may be helpful to interpret it as a byproduct of a set of unobserved variables that differ across treatment groups. If the researcher believes that these factors and their influence on the outcome are stable over time, then the plausibility of the common trend assumption becomes stronger. Instead, if the researcher suspects that one or more unobserved confounders have changed over time, then the common trend assumption is less likely to hold. In situations where the common trend assumption is not plausible so the gain score approach is not valid, an alternative route is to adjust the conditioning approach by explicitly accounting for measurement error. This is possible when an external estimate of the reliability is available (e.g. classical test theory; [<reflink idref="bib26" id="ref67">26</reflink>]). 
Another option is to embed the conditioning model into a structural equation model with multiple indicators measuring the latent ability ([<reflink idref="bib11" id="ref68">11</reflink>]), an approach also known as the multilevel latent covariate model ([<reflink idref="bib28" id="ref69">28</reflink>]).</p> <p>Our data generating models do not include covariates. This choice was aimed at simplifying the analysis in order to focus on the main goal, namely, assessing the bias induced by the latent ability. Thus, we worked in a framework where the unconfoundedness assumption holds conditionally on the latent ability. Other variables can be included as controls. In this framework, the controls are useful if they contribute to improving the measurement of the latent ability or if they block the causal paths from ability to the treatment or to the outcome.</p> <p>In the setting of an individual-level treatment, another extension consists of accounting for the heterogeneity of the causal effect across clusters. This aim can be achieved by introducing a random slope for the treatment variable, in addition to the random intercept ([<reflink idref="bib31" id="ref70">31</reflink>]). This extension is easy to implement, though further research is needed to investigate the properties of the estimators.</p> <p>Future work should also investigate more complex clustering settings, for example, one in which clusters differ before and after the treatment. This situation occurs when pre- and post-tests are conducted at different school levels, as in the evaluation of a treatment effect on a test taken in the first grade of secondary school, adjusting for a pre-test administered during primary school.</p> <p>Another interesting route for future work is to consider some forms of interference among units, for example, by letting the individual outcome depend on the proportion of treated units in the cluster. 
Work in the causal inference literature shows that ignoring interference does not necessarily imply a substantial bias in treatment effect estimates ([<reflink idref="bib4" id="ref71">4</reflink>]), unless the level of interference is high or the association between individuals' and their peers' treatments is high ([<reflink idref="bib13" id="ref72">13</reflink>]). Future research should investigate how different kinds of interference affect the gain score and conditioning approaches.</p> <p>Finally, a promising research route is exploiting propensity score techniques for clustered data ([<reflink idref="bib5" id="ref73">5</reflink>]; [<reflink idref="bib37" id="ref74">37</reflink>]) to implement the conditioning and gain score approaches in a multilevel setting.</p> <hd id="AN0183370750-15">Supplemental Material</hd> <p>Graph: Supplemental Material for Conditioning on the Pre-Test versus Gain Score Modelling: Revisiting the Controversy in a Multilevel Setting by Bruno Arpino, Silvia Bacci, Leonardo Grilli, Raffaele Guetto, and Carla Rampichini in Evaluation Review</p> <hd id="AN0183370750-16">ORCID iDs</hd> <p>Bruno Arpino https://orcid.org/0000-0002-8374-3066</p> <p>Silvia Bacci https://orcid.org/0000-0001-8097-3870</p> <p>Leonardo Grilli https://orcid.org/0000-0002-3886-7705</p> <p>Raffaele Guetto https://orcid.org/0000-0001-8052-9809</p> <ref id="AN0183370750-17"> <title> References </title> <blist> <bibl id="bib1" idref="ref37" type="bt">1</bibl> <bibtext> Ahern J. (2018). Start with the "C-word," follow the roadmap for causal inference. American Journal of Public Health, 108(5), 621. https://doi.org/10.2105/AJPH.2018.304358</bibtext> </blist> <blist> <bibl id="bib2" idref="ref13" type="bt">2</bibl> <bibtext> Allison P. D. (1990). Change scores as dependent variables in regression analysis. Sociological Methodology, 20, 93–114. 
https://doi.org/10.2307/271083</bibtext> </blist> <blist> <bibl id="bib3" idref="ref21" type="bt">3</bibl> <bibtext> Arpino B., Aassve A. (2013). Estimating the causal effect of fertility on economic well-being: Data requirements, identifying assumptions and estimation methods. Empirical Economics, 44(1), 355–385. https://doi.org/10.1007/s00181-010-0356-9</bibtext> </blist> <blist> <bibl id="bib4" idref="ref61" type="bt">4</bibl> <bibtext> Arpino B., Mattei A. (2016). Assessing the causal effects of financial aids to firms in Tuscany allowing for interference. Annals of Applied Statistics, 10(3), 1170–1194. https://doi.org/10.1214/15-aoas902</bibtext> </blist> <blist> <bibl id="bib5" idref="ref73" type="bt">5</bibl> <bibtext> Arpino B., Mealli F. (2011). The specification of the propensity score in multilevel observational studies. Computational Statistics &amp; Data Analysis, 55(4), 1770–1780. https://doi.org/10.1016/j.csda.2010.11.008</bibtext> </blist> <blist> <bibl id="bib6" idref="ref33" type="bt">6</bibl> <bibtext> Baiocchi M., Cheng J., Small D. S. (2014). Instrumental variable methods for causal inference. Statistics in Medicine, 33(13), 2297–2340. https://doi.org/10.1002/sim.6128</bibtext> </blist> <blist> <bibl id="bib7" idref="ref1" type="bt">7</bibl> <bibtext> Bereiter C. (1963). Some persisting dilemmas in the measurement of change. In Harris C. W. (Ed.), Problems in measuring change (pp. 3–20). University of Wisconsin Press.</bibtext> </blist> <blist> <bibl id="bib8" idref="ref25" type="bt">8</bibl> <bibtext> Callaway B., Goodman-Bacon A., Sant'Anna P. H. (2021). Difference-in-differences with a continuous treatment. arXiv preprint arXiv:2107.02637.</bibtext> </blist> <blist> <bibl id="bib9" idref="ref58" type="bt">9</bibl> <bibtext> Cardone M., Falzetti P., Sacco C. (2019). INVALSI data for school system improvement: The value added. Working papers INVALSI, n. 5694, Invalsi.</bibtext> </blist> <blist> <bibtext> Carrillo P. E., Onofa M., Ponce J. 
(2011). Information technology and student achievement: Evidence from a randomized experiment in Ecuador. Research Department Publications 4698, Inter-American Development Bank, Research Department.</bibtext> </blist> <blist> <bibtext> Croon M. A., van Veldhoven M. J. P. M. (2007). Predicting group-level outcome variables from variables measured at the individual level: A latent variable multilevel model. Psychological Methods, 12(1), 45–57. https://doi.org/10.1037/1082-989X.12.1.45</bibtext> </blist> <blist> <bibtext> Ding P., Li F. (2019). A bracketing relationship between difference-in-differences and lagged-dependent-variable adjustment. Political Analysis, 27(4), 605–615. https://doi.org/10.1017/pan.2019.25</bibtext> </blist> <blist> <bibtext> Forastiere L., Airoldi E. M., Mealli F. (2021). Identification and estimation of treatment and interference effects in observational studies on networks. Journal of the American Statistical Association, 116(534), 901–918. https://doi.org/10.1080/01621459.2020.1768100</bibtext> </blist> <blist> <bibtext> Grilli L., Rampichini C. (2011). The role of sample cluster means in multilevel models: A view on endogeneity and measurement error issues. Methodology: European Journal of Research Methods for the Behavioral and Social Sciences, 7(4), 121–133. https://doi.org/10.1027/1614-2241/a000030</bibtext> </blist> <blist> <bibtext> Ha Y., Park H. J. (2017). Can after-school programs and private tutoring help improve students' achievement? Revisiting the effects in Korean secondary schools. Asia Pacific Education Review, 18(1), 65–79. https://doi.org/10.1007/s12564-016-9451-8</bibtext> </blist> <blist> <bibtext> Hong G., Raudenbush S. W. (2006). Evaluating kindergarten retention policy: A case study of causal inference for multilevel observational data. Journal of the American Statistical Association, 101(475), 901–910. https://doi.org/10.1198/016214506000000447</bibtext> </blist> <blist> <bibtext> Imbens G. W., Lemieux T. (2008). 
Regression discontinuity designs: A guide to practice. Journal of Econometrics, 142(2), 615–635. https://doi.org/10.1016/j.jeconom.2007.05.001</bibtext> </blist> <blist> <bibtext> Imbens G. W., Rubin D. B. (2015). Causal inference for statistics, social, and biomedical sciences: An introduction. Cambridge University Press.</bibtext> </blist> <blist> <bibtext> Kim Y., Steiner P. M. (2021a). Gain scores revisited: A graphical models perspective. Sociological Methods &amp; Research, 50(3), 1353–1375. https://doi.org/10.1177/0049124119826155</bibtext> </blist> <blist> <bibtext> Kim Y., Steiner P. M. (2021b). Causal graphical views of fixed effects and random effects models. British Journal of Mathematical and Statistical Psychology, 74(2), 165–183. https://doi.org/10.1111/bmsp.12217</bibtext> </blist> <blist> <bibtext> Klar N., Darlington G. (2004). Methods for modelling change in cluster randomization trials. Statistics in Medicine, 23(15), 2341–2357. https://doi.org/10.1002/sim.1858</bibtext> </blist> <blist> <bibtext> Köhler C., Hartig J., Schmid C. (2020). Deciding between the covariance analytical approach and the change-score approach in two wave panel data. Multivariate Behavioral Research, 56(3), 447–458. https://doi.org/10.1080/00273171.2020.1726723</bibtext> </blist> <blist> <bibtext> Lechner M. (2011). The estimation of causal effects by difference-in-difference methods. Foundations and Trends in Econometrics, 4(3), 165–224. https://doi.org/10.1561/0800000014</bibtext> </blist> <blist> <bibtext> Lord F. M. (1963). Elementary models for measuring change. In Harris C. W. (Ed.), Problems in measuring change. University of Wisconsin Press.</bibtext> </blist> <blist> <bibtext> Lord F. M. (1967). A paradox in the interpretation of group comparisons. Psychological Bulletin, 68(5), 304–305. https://doi.org/10.1037/h0025105</bibtext> </blist> <blist> <bibtext> Lord F. M., Novick M. R. (1968). Statistical theories of mental test scores. 
Addison-Wesley.</bibtext> </blist> <blist> <bibtext> Maris E. (1998). Covariance adjustment versus gain scores - revisited. Psychological Methods, 3(3), 309–327. https://doi.org/10.1037/1082-989x.3.3.309</bibtext> </blist> <blist> <bibtext> Marsh H. W., Lüdtke O., Robitzsch A., Trautwein U., Asparouhov T., Muthen B., Nagengast B. (2009). Doubly-latent models of school contextual effects: Integrating multilevel and structural equation approaches to control measurement and sampling error. Multivariate Behavioral Research, 44(6), 764–802. https://doi.org/10.1080/00273170903333665</bibtext> </blist> <blist> <bibtext> Rasch G. (1960). Probabilistic models for some intelligence and attainment tests. Danish Institute for Educational Research.</bibtext> </blist> <blist> <bibtext> Raudenbush S. W. (2005). Learning from attempts to improve schooling: The contribution of methodological diversity. Educational Researcher, 34(5), 25–31. https://doi.org/10.3102/0013189x034005025</bibtext> </blist> <blist> <bibtext> Raudenbush S. W., Schwartz D. (2020). Randomized experiments in education, with implications for multilevel causal inference. Annual Review of Statistics and Its Application, 7(1), 177–208. https://doi.org/10.1146/annurev-statistics-031219-041205</bibtext> </blist> <blist> <bibtext> Rosenbaum P. R. (2002). Observational studies (2nd ed.). Springer-Verlag.</bibtext> </blist> <blist> <bibtext> SAS Institute Inc. (2018). SAS/STAT 15.1 User's guide. SAS Institute Inc.</bibtext> </blist> <blist> <bibtext> Snijders T., Bosker R. (2012). Multilevel analysis: An introduction to basic and advanced multilevel modeling (2nd ed.). Sage Publications Ltd.</bibtext> </blist> <blist> <bibtext> Steiner P. M., Kim Y. (2016). The mechanics of omitted variable bias: Bias amplification and cancellation of offsetting biases. Journal of Causal Inference, 4(2), 20160009. https://doi.org/10.1515/jci-2016-0009</bibtext> </blist> <blist> <bibtext> Stigler S. M. (1997). 
Regression towards the mean, historically considered. Statistical Methods in Medical Research, 6(2), 103–114. https://doi.org/10.1177/096228029700600202</bibtext> </blist> <blist> <bibtext> Thoemmes F. J., West S. G. (2011). The use of propensity scores for nonrandomized designs with clustered data. Multivariate Behavioral Research, 46(3), 514–543. https://doi.org/10.1080/00273171.2011.569395</bibtext> </blist> <blist> <bibtext> van Breukelen G. J. P. (2013). ANCOVA versus CHANGE from baseline in nonrandomized studies: The difference. Multivariate Behavioral Research, 48(6), 895–922. https://doi.org/10.1080/00273171.2013.831743</bibtext> </blist> <blist> <bibtext> Wing C., Simon K., Bello-Gomez R. A. (2018). Designing difference in difference studies: Best practices for public health policy research. Annual Review of Public Health, 39, 453–469. https://doi.org/10.1146/annurev-publhealth-040617-013507</bibtext> </blist> </ref> <ref id="AN0183370750-18"> <title> Footnotes </title> <blist> <bibtext> The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.</bibtext> </blist> <blist> <bibtext> The author(s) received no financial support for the research, authorship, and/or publication of this article.</bibtext> </blist> <blist> <bibtext> Supplemental material for this article is available online.</bibtext> </blist> </ref> <aug> <p>By Bruno Arpino; Silvia Bacci; Leonardo Grilli; Raffaele Guetto and Carla Rampichini</p> <p>Reported by Author; Author; Author; Author; Author</p> </aug> <nolink nlid="nl1" bibid="bib24" firstref="ref2"></nolink> <nolink nlid="nl2" bibid="bib25" firstref="ref3"></nolink> <nolink nlid="nl3" bibid="bib35" firstref="ref4"></nolink> <nolink nlid="nl4" bibid="bib19" firstref="ref5"></nolink> <nolink nlid="nl5" bibid="bib34" firstref="ref6"></nolink> <nolink nlid="nl6" bibid="bib30" firstref="ref7"></nolink> <nolink nlid="nl7" bibid="bib15" firstref="ref8"></nolink> 
<nolink nlid="nl8" bibid="bib10" firstref="ref9"></nolink> <nolink nlid="nl9" bibid="bib29" firstref="ref10"></nolink> <nolink nlid="nl10" bibid="bib36" firstref="ref12"></nolink> <nolink nlid="nl11" bibid="bib27" firstref="ref14"></nolink> <nolink nlid="nl12" bibid="bib38" firstref="ref15"></nolink> <nolink nlid="nl13" bibid="bib32" firstref="ref18"></nolink> <nolink nlid="nl14" bibid="bib39" firstref="ref19"></nolink> <nolink nlid="nl15" bibid="bib18" firstref="ref22"></nolink> <nolink nlid="nl16" bibid="bib23" firstref="ref23"></nolink> <nolink nlid="nl17" bibid="bib12" firstref="ref24"></nolink> <nolink nlid="nl18" bibid="bib22" firstref="ref30"></nolink> <nolink nlid="nl19" bibid="bib21" firstref="ref31"></nolink> <nolink nlid="nl20" bibid="bib17" firstref="ref32"></nolink> <nolink nlid="nl21" bibid="bib16" firstref="ref38"></nolink> <nolink nlid="nl22" bibid="bib37" firstref="ref40"></nolink> <nolink nlid="nl23" bibid="bib33" firstref="ref62"></nolink> <nolink nlid="nl24" bibid="bib14" firstref="ref65"></nolink> <nolink nlid="nl25" bibid="bib20" firstref="ref66"></nolink> <nolink nlid="nl26" bibid="bib26" firstref="ref67"></nolink> <nolink nlid="nl27" bibid="bib11" firstref="ref68"></nolink> <nolink nlid="nl28" bibid="bib28" firstref="ref69"></nolink> <nolink nlid="nl29" bibid="bib31" firstref="ref70"></nolink> <nolink nlid="nl30" bibid="bib13" firstref="ref72"></nolink> |
|---|---|