
Which Score for What? Operationalizing Standardized Cognitive Test Performance for the Assessment of Change

Bibliographic Details
Title:	Which Score for What? Operationalizing Standardized Cognitive Test Performance for the Assessment of Change
Language:	English
Authors:	Cristan Farmer, Audrey Thurm, Tanvi Das, E. Martina Bebin, Jonathan A. Bernstein, Elizabeth Berry-Kravis, Joseph D. Buxbaum, Charis Eng, Thomas Frazier, Antonio Y. Hardan, Alexander Kolevzon, Darcy A. Krueger, Julian A. Martinez-Agosto, Hope Northrup, Craig M. Powell, Latha Valluripalli Soorya, Joyce Y. Wu, Mustafa Sahin
Source:	American Journal on Intellectual and Developmental Disabilities. 2025 130(5):344-361.
Availability:	American Association on Intellectual and Developmental Disabilities. P.O. Box 1897, Lawrence, KS 66044-1897. Tel: 785-843-1235; Fax: 785-843-1274; e-mail: AJMR@allenpress.com; Web site: https://meridian.allenpress.com/aaidd
Peer Reviewed:	Y
Page Count:	18
Publication Date:	2025
Document Type:	Journal Articles Reports - Research
Descriptors:	Cognitive Tests, Intelligence Tests, Cognitive Ability, Intellectual Disability, Developmental Disabilities, Scores, Test Use, Test Interpretation, Genetic Disorders, Longitudinal Studies, Change, Standardized Tests
Assessment and Survey Identifiers:	Stanford Binet Intelligence Scale
DOI:	10.1352/1944-7558-130.5.344
ISSN:	1944-7515 1944-7558
Abstract:	Developmental domains, such as cognitive, language, and motor, are key concepts of interest in longitudinal studies of intellectual and developmental disabilities (IDD). Normative scores (e.g., IQ) are often used to operationalize performance on standardized tests of these concepts, but it is the interval-distributed person-ability scores that are intended for the assessment of within-individual change. Here we illustrate the use and interpretation of several Stanford Binet, 5th Edition score types (IQ, extended IQ, Z-normalized raw score, developmental quotient, raw sum score, age equivalent, and ability score) using data from two longitudinal studies of rare genetic conditions associated with IDD. We found that, although normality assumptions were tenuous for all score types, floor effects led to model unsuitability for longitudinal analysis of most types of norm-referenced scores, and that the validity of interpretation with respect to individual change was best for ability scores. [This article was authored on behalf of the Developmental Synaptopathies Consortium.]
Abstractor:	As Provided
Entry Date:	2025
Accession Number:	EJ1482497
Database:	ERIC

<anid>AN0187547685;[8z1j]01sep.25;2025Aug29.02:09;v2.2.500</anid> <title id="AN0187547685-1">Which Score for What? Operationalizing Standardized Cognitive Test Performance for the Assessment of Change </title> <p>Developmental domains, such as cognitive, language, and motor, are key concepts of interest in longitudinal studies of intellectual and developmental disabilities (IDD). Normative scores (e.g., IQ) are often used to operationalize performance on standardized tests of these concepts, but it is the interval-distributed person-ability scores that are intended for the assessment of within-individual change. Here we illustrate the use and interpretation of several Stanford Binet, 5<sups>th</sups> Edition score types (IQ, extended IQ, Z-normalized raw score, developmental quotient, raw sum score, age equivalent, and ability score) using data from two longitudinal studies of rare genetic conditions associated with IDD. 
We found that, although normality assumptions were tenuous for all score types, floor effects led to model unsuitability for longitudinal analysis of most types of norm-referenced scores, and that the validity of interpretation with respect to individual change was best for ability scores.</p> <p>Keywords: standard scores; age equivalents; person ability scores; sum scores; item response theory; Rasch analysis; psychometrics; longitudinal data; rare genetic conditions; change sensitive score; Stanford-Binet</p> <hd id="AN0187547685-2">Introduction</hd> <p>Developmental concepts, such as cognition, motor skills, social and emotional abilities, and adaptive behavior, are central to research on intellectual and developmental disabilities (IDD). Developmental measures that may have robust psychometric profiles when used in the general population may be insufficient for those with IDD, especially when the goal is not to identify disability but to monitor change in longitudinal research. Standardized assessments often have limited validity for populations with IDD because of the test floor; tests that are appropriate for the chronological age of an individual with IDD may be too difficult for the individual's developmental level, especially in the case of moderate-to-profound levels of IDD. A common solution in this situation is to administer an out-of-age-range test, which necessitates the use of scoring methods alternative to norm-referencing, such as developmental quotients (DQs; the ratio of age equivalent to chronological age; [<reflink idref="bib22" id="ref1">22</reflink>]). The use of DQs is generally viewed as necessary but suboptimal, as they allow for the estimation of an individual's performance but have significant limitations. Especially for individuals at the extremes of the distribution, the DQ is a poor approximation of the IQ, and the discrepancy is inconstant across age ([<reflink idref="bib16" id="ref2">16</reflink>]). 
The meaning of change in DQ can be unclear because the denominator of chronological age continues to increase even after the numerator of mental age plateaus, leading to artifactual declines in DQ over time ([<reflink idref="bib2" id="ref3">2</reflink>]). Further, age equivalents and, by extension, DQ are subject to a second type of floor effect, which is the youngest age at which the test is normed.</p> <p>If an individual with IDD has sufficient ability to perform the easiest items of an assessment intended for their chronological age (i.e., exceed the test floor), then a norm-referenced score is possible. Norm-referencing is a key feature of many standardized developmental tests, as it is used to assist the interpretation of performance by comparing it to that of same-age peers. At the individual level, this allows for validity in the diagnostic context, because disability is typically defined relative to expected functioning. At the group level, norm-referencing is intended to facilitate the valid comparison of performance across individuals when the effect of development is considered a nuisance. Norm-referenced scores for developmental tests, whether they are standard scores, <emph>T</emph> scores, or scaled scores, usually express performance as a function of a normal distribution, and so the units of norm-referenced scores are standard deviations (<emph>SD</emph>) and the values correspond directly to percentiles (see Table 1). A standard score of 100 reflects performance that is as good as or better than 50% of the population, a standard score of 55 (3 <emph>SD</emph> below the mean) is as good as or better than 0.13% of the population, and so on. 
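</p> <p>The correspondence between standard scores and percentiles can be sketched with the normal cumulative distribution function. This is a minimal illustration that assumes the conventional standard score metric (mean 100, <emph>SD</emph> 15); the function name and the example scores are chosen for illustration and are not taken from the SB5 manual.</p>

```python
import math


def standard_score_percentile(score, mean=100.0, sd=15.0):
    """Percent of a normal population scoring at or below `score`.

    Assumes the conventional standard score metric (mean 100, SD 15).
    """
    z = (score - mean) / sd
    # Normal CDF expressed through the error function.
    return 100.0 * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Scores at 0, 1, 2, 3, and 4 SD below the mean.
for s in (100, 85, 70, 55, 40):
    print(s, standard_score_percentile(s))
```

<p>Under these assumptions a score 3 <emph>SD</emph> below the mean (55) falls at roughly the 0.13th percentile, which shows how little normative data can be expected in that region.</p> <p>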
Because directly estimating a score at the &lt;1<sups>st</sups> percentile requires a prohibitively large sample per normative group, scores more than about 3 <emph>SD</emph> below average are extrapolations (i.e., actual standardization data in this range may not be present; see [<reflink idref="bib23" id="ref4">23</reflink>], for a tutorial on one type of regression-based norming procedures). Even after borrowing statistical information from adjacent age groups, the precision of the extrapolated values is low. Thus, by convention, scores more than 3 to 4 <emph>SD</emph> below average are usually censored. This lowest standard score offered by the publisher is the third type of floor effect, referred to here as the standard score floor.</p> <p>Table 1 Summary of Available Methods for Operationalizing Performance on the SB5 Full-Scale Composite</p> <p>PHOTO (COLOR)</p> <hd id="AN0187547685-3">Floor Effects: Challenges and Proposed Solutions</hd> <p>The test floor and standard score floor effects are of significant relevance in IDD because they have important consequences for the statistical analysis and interpretation of group-level effects ([<reflink idref="bib26" id="ref5">26</reflink>]). These effects are applicable to both the cross-sectional and longitudinal contexts. Test floors are important because if a test cannot be used for some proportion of a sample with IDD, the data are systematically missing and the results from the available testing will be biased positively. For those who can get past the test floor, the standard score floor obscures variability for only low scores. This reduces responsiveness to change in ability that occurs below the standard score floor and induces heterogeneity in variance (heteroscedasticity) that biases the estimated standard errors. This is diagnosed by a "conical" pattern in model residuals, where the variance in residuals increases as a function of the predicted values. 
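</p> <p>The consequences of a standard score floor for group-level summaries can be illustrated with a small simulation. This is a sketch under stated assumptions, not a reanalysis of the study data: the latent mean (45), <emph>SD</emph> (15), sample size, and seed are hypothetical, and the floor of 40 is used only as an example censoring point. It addresses a cross-sectional mean and spread rather than the longitudinal model parameters.</p>

```python
import random
import statistics


def simulate_floor(true_mean=45.0, sd=15.0, floor=40.0, n=10_000, seed=1):
    """Compare latent scores with scores censored at a publisher floor.

    All parameter values are hypothetical and chosen for illustration.
    """
    rng = random.Random(seed)
    latent = [rng.gauss(true_mean, sd) for _ in range(n)]
    # Scores below the floor are reported as the floor value.
    observed = [max(x, floor) for x in latent]
    return {
        "latent_mean": statistics.fmean(latent),
        "observed_mean": statistics.fmean(observed),
        "latent_sd": statistics.stdev(latent),
        "observed_sd": statistics.stdev(observed),
        "share_at_floor": sum(x <= floor for x in latent) / n,
    }


result = simulate_floor()
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

<p>In this sketch the censored scores have a higher mean and a smaller spread than the latent scores, which mirrors the positive bias and lost variability described here.</p> <p>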
Similar to the effect of the test floor, standard score floors result in positively biased estimates of the intercepts and slopes with inaccurate standard errors, depending on the score region. As standard score floor effects increase as a function of age, they could cause artifactual nonlinearity in both the fixed and random effects ([<reflink idref="bib26" id="ref6">26</reflink>]; though we note that the test floor effects may lessen as a function of age).</p> <p>One proposed solution to standard score floor effects is the Z-score method ([<reflink idref="bib12" id="ref7">12</reflink>]), wherein an individual's raw score is expressed as a function of the norm-group raw score mean and <emph>SD</emph>, with no censoring (see Table 1). This method does remove the normative floor effects, but major test publishers do not use it for at least two reasons. First, it rests on the assumption that raw scores are normally distributed within each normative age group. Skewness, which is often observed especially at the youngest and oldest ages, compromises this. When test developers do base standard scores on raw scores, this skewness is addressed by first normalizing, or converting to percentiles, the raw scores. Second, the Z-score method creates discontinuity at age breaks, such that one could observe a dramatic difference in Z-score for the similar performance across two adjacent age groups. For standard scores based on raw scores, this is addressed with the statistical procedure of smoothing growth curves. Especially in IDD research, however, the benefit of estimating normative scores below the standard score floor may exceed the risk of these limitations.</p> <hd id="AN0187547685-4">Understanding Change</hd> <p>In the longitudinal context, the goal is to understand within-person change in the outcome of interest. 
Regardless of the method, change in normative scores is challenging to interpret because it quantifies not only within-individual change in performance, but age-related differences in the normative sample. As a result, change in norm-referenced score has an indeterminate relationship with change in absolute levels of the underlying construct. A decrease in norm-referenced score can—though not necessarily—occur due to an actual decrease in skills (e.g., degeneration observed in many rare genetic conditions associated with IDD). However, because skills in developmental constructs are expected to increase over time, decreasing normative scores can also occur when the acquisition of skills is slower than expected or if skills are simply maintained. Thus, degeneration cannot be distinguished from slow gains or stability—a serious threat to the validity of the interpretation of change in norm-referenced scores. Thus, both floor effects and indeterminacy of change seriously threaten the validity of interpreting norm-referenced scores from a developmental test in the longitudinal context.</p> <p>Many developmental tests do contain a scoring method intended for monitoring change ([<reflink idref="bib9" id="ref8">9</reflink>]). These scores, called person-ability scores, are derived via item response theory or Rasch analysis (see Table 1). These approaches transform the ordinal raw sum score (or sometimes the pattern of item scores) into an interval-level measurement representing the ability that would produce that performance. Interval-level measurement means that a given difference in ability score has the same meaning at all points in the scale—this property is an essential assumption for most statistical models common to longitudinal data analysis and is required for the valid interpretation of the magnitude of resulting parameters. 
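</p> <p>The distinction between an ordinal raw sum score and an interval-level ability score can be sketched with the dichotomous Rasch model. The item difficulties below are hypothetical and are not SB5 parameters; the sketch shows only that the same one-logit difference in ability corresponds to different raw score differences depending on where it falls relative to the items.</p>

```python
import math


def rasch_probability(theta, difficulty):
    """Probability of passing an item under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


def expected_raw_score(theta, difficulties):
    """Expected raw sum score for a person with ability `theta` (logits)."""
    return sum(rasch_probability(theta, b) for b in difficulties)


# Hypothetical item difficulties in logits (not taken from the SB5).
difficulties = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]

# The same one-logit gain in ability, located low versus mid-scale.
gain_low = expected_raw_score(-3.0, difficulties) - expected_raw_score(-4.0, difficulties)
gain_mid = expected_raw_score(0.5, difficulties) - expected_raw_score(-0.5, difficulties)

print("raw gain for one logit near the floor:", gain_low)
print("raw gain for one logit mid-scale:", gain_mid)
```

<p>The expected raw score always increases with ability, but by unequal amounts, so equal raw score differences do not imply equal differences in ability.</p> <p>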
Because all Rasch-based and most IRT-based ability scores have a monotonic relationship with the raw sum score (the raw sum score is an ordered approximation of the ability score; [<reflink idref="bib21" id="ref9">21</reflink>]), valid interpretation of the direction of change is possible. Finally, ability scores are subject only to the single test floor and, therefore, have limited risk of flooring-related bias in previously described parameter estimates. For these reasons, ability scores have the reverse functional profile of norm-referenced scores, well-suited to longitudinal but not diagnostic contexts.</p> <hd id="AN0187547685-5">Current Study</hd> <p>The range of options for expressing performance on any test (Table 1) presents an opportunity for researchers to select the option that is best fit for their intended purpose. We propose that the key elements to consider are the study population, the study design, and the intended interpretation of the scores. Here, we focus on the case of longitudinal research in rare genetic conditions associated with neurodevelopmental disorder (GCAND), which are often associated with moderate-to-profound levels of IDD. In contrast to cross-sectional studies, which focus on between-person differences, the goal of longitudinal research is to describe within-person change. Longitudinal research may be interventional, as in clinical trials, or observational, as in natural history studies. We use a common longitudinal analytic method, hierarchical linear modeling, to evaluate the relative profiles of each score type with respect to statistical and practical interpretation and offer an appraisal of the validity argument for the use of each as an outcome in longitudinal research. The principles discussed here should apply to any norm-referenced test of a developmental construct, but here we focus on cognitive ability and the Stanford-Binet, 5<sups>th</sups> edition. 
We hypothesized that, consistent with the previously described background, we would observe statistical and theoretical limitations to the validity of model results using norm-referenced scores, and that the person ability score would have the most favorable profile of results. This report is intended to support clinical trial readiness by contributing to the literature base supporting the selection of person ability scores as endpoints for studies of individuals with IDD.</p> <hd id="AN0187547685-6">Methods</hd> <p></p> <hd id="AN0187547685-7">Participants</hd> <p>The Developmental Synaptopathies Consortium (DSC), a Rare Disease Clinical Research Consortium (<ulink href="https://www.rarediseasesnetwork.org/">https://www.rarediseasesnetwork.org/</ulink>), comprises researchers conducting three multisite natural history studies of GCANDs: Phelan McDermid Syndrome (NCT02461420; see [<reflink idref="bib15" id="ref10">15</reflink>]), tuberous sclerosis complex (TSC; NCT02461459), and PTEN hamartoma tumor syndrome (PTEN; NCT02461446; see [<reflink idref="bib3" id="ref11">3</reflink>]; [<reflink idref="bib4" id="ref12">4</reflink>]). Phelan McDermid Syndrome is caused by a terminal 22q13.3 deletion encompassing the <emph>SHANK3</emph> gene or a pathogenic sequence variant in <emph>SHANK3</emph>, both resulting in haploinsufficiency. TSC is an autosomal dominant condition caused by loss-of-function mutations in <emph>TSC1</emph> or <emph>TSC2</emph>. PTEN hamartoma tumor syndrome is a genetic condition caused by germline mutations in <emph>PTEN</emph>, which encodes phosphatase and tensin homolog. The clinical manifestations of each of these conditions are heterogeneous, but each is associated with IDD (among numerous other features). Because it is the score types and not the conditions that are the focus of this article, we do not further describe the conditions themselves. 
Each study was approved by a centralized institutional review board, and informed consent was obtained from legal guardians, as well as assent where possible. The dataset was issued in October 2021. Participants in the dataset were included in the analysis if they had at least one assessment with the Stanford-Binet, 5<sups>th</sups> edition.</p> <hd id="AN0187547685-8">Measures</hd> <p>The Stanford-Binet, 5<sups>th</sups> edition (SB5), was refined using Rasch analysis and normed on a nationally representative sample of <emph>N</emph> = 4,800 aged 2 to 85 years ([<reflink idref="bib18" id="ref13">18</reflink>]). There are 10 subtests that feed into the full-scale (FS) composite used in this study. The available score types for the SB5 FS are described in detail in Table 1; in the current study we evaluated the IQ, extended IQ (EXIQ), developmental quotient (DQ), Z-score (Z), raw sum score (RAW), age equivalent (AE), and change sensitive score (CSS). IQ, EXIQ, DQ, and Z are normative scores. RAW, AE, and CSS are absolute scores. CSS is the person ability score on the SB5; test publishers commonly apply a trade name to these scores (e.g., growth scale values on the Vineland Adaptive Behavior Scales).</p> <p>The SB5 scoring process automatically generates the CSS, AE, and IQ scores. When an IQ is at the floor (40) or ceiling (160), the user may also choose to access EXIQ scores via the manual. DQ is calculated as AE × 100 divided by the chronological age (in months). Z is calculated using the raw score and published age-group-specific means and standard deviations ([<reflink idref="bib19" id="ref16">19</reflink>]). The script for score derivation can be obtained at https://doi.org/10.31234/osf.io/mgf9a.</p> <hd id="AN0187547685-9">Statistical Analyses</hd> <p>To model within-person change in SB5 scores, we used hierarchical linear modeling. 
An identical but separate analysis was performed for each score type within each study. To account for clustering within participants, a random subject-level intercept (ID) was included in the model. A subject-level slope for DURATION (described below) was also included, reflecting variability across participants in their rate of change.</p> <p>Age at baseline was highly variable and, so, the chronological age variable contained both between-subject (i.e., differences between older and younger participants) and within-subject (i.e., change within a participant over time) information. To differentiate between developmental and cohort effects, thereby avoiding the inferential error of attributing between-subject differences to the within-subject effect, chronological age was decomposed into two fixed effects ([<reflink idref="bib6" id="ref17">6</reflink>]): time-invariant $\overline{AGE}$ (the participant's average age during participation, centered at the sample mean age of 11 years) and time-varying DURATION (the passage of time within a person, centered at the person's mean age; the slope for this term is referred to as "annualized change"). This disaggregation also creates more accurate (larger) estimates of variability in the fixed effects. 
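</p> <p>The decomposition of chronological age into a between-subject and a within-subject term can be sketched as follows. This is an illustration only and is not the authors' script (which is available at the DOI cited in this section); the function name, the field names, and the example visits are hypothetical, and the centering value of 11 years is the sample mean age reported here.</p>

```python
from collections import defaultdict

SAMPLE_MEAN_AGE = 11.0  # years; centering value reported in the text


def decompose_age(visits):
    """Split age into person-mean age (centered) and within-person duration.

    `visits` is a list of (participant_id, age_in_years) tuples.
    """
    ages_by_id = defaultdict(list)
    for pid, age in visits:
        ages_by_id[pid].append(age)
    person_mean = {pid: sum(a) / len(a) for pid, a in ages_by_id.items()}
    return [
        {
            "id": pid,
            "age": age,
            # Between-subject term: person-mean age, grand-mean centered.
            "age_bar": person_mean[pid] - SAMPLE_MEAN_AGE,
            # Within-subject term: time relative to the person's own mean age.
            "duration": age - person_mean[pid],
        }
        for pid, age in visits
    ]


# Hypothetical visits for two participants.
rows = decompose_age(
    [("A", 5.0), ("A", 6.0), ("A", 7.0), ("B", 14.0), ("B", 16.0)]
)
for row in rows:
    print(row)
```

<p>Within each participant the duration values sum to zero, so the between-subject term carries all of the information about how old a participant was on average.</p> <p>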
A quadratic form was specified for $\overline{AGE}$ and, in parallel, for DURATION via the DURATION × $\overline{AGE}$ interaction (i.e., within-subject change was allowed to depend on the participant's age). To allow for comparability across results, the same fixed ($\overline{AGE}$, $\overline{AGE}^2$, DURATION, $\overline{AGE}$ × DURATION) and random effects (ID, DURATION) were specified for all score types. 
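</p> <p>One way to write out the specification just described is given below. The notation is an assumption introduced for illustration ($Y_{ti}$ is the score for participant $i$ at occasion $t$, the $\beta$ terms are fixed effects, $u_{0i}$ and $u_{1i}$ are the random intercept and random DURATION slope, and $\varepsilon_{ti}$ is the residual); the covariance structure of the random effects is not stated in the text and is therefore left unspecified.</p>

```latex
\[
\begin{aligned}
Y_{ti} ={}& \beta_0 + \beta_1\,\overline{AGE}_i + \beta_2\,\overline{AGE}_i^{\,2}
          + \beta_3\,DURATION_{ti}
          + \beta_4\,\bigl(\overline{AGE}_i \times DURATION_{ti}\bigr) \\
        & + u_{0i} + u_{1i}\,DURATION_{ti} + \varepsilon_{ti}, \\[4pt]
\frac{\partial\,\mathrm{E}[Y_{ti}]}{\partial\,DURATION_{ti}}
        ={}& \beta_3 + \beta_4\,\overline{AGE}_i .
\end{aligned}
\]
```

<p>Under this form the expected within-person change per year is $\beta_3$ for a participant at the sample mean age and shifts by $\beta_4$ for each year that the participant's mean age differs from it.</p> <p>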
The within-subject terms (DURATION and $\overline{AGE}$ × DURATION) correspond to the research questions of longitudinal research, specifically how much change was observed at the individual level. To aid interpretation of these within-subject parameter values, specific contrasts were used to estimate the fixed effect of DURATION for hypothetical participants at a representative range of ages (3, 7, 11, 15, and 19 years). Models were fit with the lme4 package ([<reflink idref="bib1" id="ref19">1</reflink>]) in R version 4.2.2. The R script can be obtained from https://doi.org/10.31234/osf.io/mgf9a.</p> <hd id="AN0187547685-10">Model Fitting</hd> <p>Because they reflect the study design, all fixed and random effects were retained regardless of their contribution to the model. Both level 1 and level 2 model residuals were visually inspected for consistency with the required assumptions of normality and constant variance. Across most models, the residuals departed from these assumptions in two important ways. First, a conical shape in residuals can be induced by floor effects and, second, an excess of extreme residuals can occur in the presence of unmodeled causal variables. Addressing non-normality is beyond the scope of this article, but the identification of score types more likely to exhibit violations of modeling assumptions is of high relevance to the goal of comparing optimal scoring rules for a longitudinal context of use. 
For brevity, only our conclusions from visual inspection are included here, but residual plots can be found at https://doi.org/10.31234/osf.io/mgf9a.</p> <hd id="AN0187547685-11">Statistical Interpretation</hd> <p>The magnitude and precision of fixed effects are described using the parameter estimates with 95% confidence intervals, as well as the associated test statistics and uncorrected <emph>p</emph>-values. To facilitate understanding of the results, here we review the statistical meaning of these parameters. The <bold>intercept</bold> for each model is the estimated point-in-time score for a participant whose average age of participation was 11 years ($\overline{AGE}$ = 0; this variable is grand-mean centered) at the middle of their study participation (DURATION = 0). 
The <bold>estimate of $\overline{AGE}$</bold> is interpreted as the expected between-subject <emph>difference</emph> in the outcome for an individual whose average age of participation is 1 year older than the group average; the <bold>quadratic term for $\overline{AGE}$</bold> allows for this between-subject difference to become smaller or larger for participants older or younger than the average. A negative slope for the quadratic term indicates that differences as a function of $\overline{AGE}$ are smaller (or more negative) for older participants than younger participants. If the main effect of $\overline{AGE}$ or its quadratic term is nonzero, then the estimated average value (i.e., the intercept) depends on the average age of the participant. 
The <bold>slope of DURATION</bold> is interpreted as the within-person expected <emph>change</emph> in the outcome for each year of participation in the study, also known as the annualized change. An <bold>interaction between $\overline{AGE}$ and DURATION</bold> allows for the annualized change (DURATION) to depend on the person's mean age ($\overline{AGE}$); for example, older participants might gain skills more slowly than younger participants (a negative slope for the $\overline{AGE}$ × DURATION interaction). If the linear and quadratic effects are similar between DURATION and $\overline{AGE}$, then one might interpret the estimated annualized change as applying for the full age range in the study. 
When the effects of DURATION and $\overline{AGE}$ are dissimilar, however, the discrepancy is referred to as a cohort effect. This means that the estimated differences between participants are more or less than would be expected as a function of the estimated within-subject change. Cohort effects are especially relevant for cross-sectional comparisons or any use of the between-subject terms in longitudinal data, as differences between ages cannot be solely attributed to the effect of development.</p> <hd id="AN0187547685-12">Results</hd> <p>The Phelan McDermid Syndrome cohort was excluded from analysis because too few participants in the dataset had sufficient ability to take the SB5, resulting in a too-small sample size for the proposed analyses (<emph>n</emph> = 24 out of 101 participants, several with only one assessment). Most participants in the PTEN dataset (<emph>n</emph> = 91 of 107) and the TSC dataset (<emph>n</emph> = 81 of 106) received at least one SB5 and were included in the analysis (Table 2). About half of the individuals without SB5 were reported to not have sufficient ability to take the test. The median number of yearly SB5 assessments per person in both studies was 3 [IQR: 2, 3].</p> <p>Table 2 Baseline Characteristics of Cohorts Used in Analysis</p> <p>PHOTO (COLOR)</p> <p>The first available assessment from each person was used to illustrate the relationships amongst scores. Amongst the norm-referenced scores, Z yielded the largest estimates, followed by IQ and EXIQ. The EXIQ had the largest standard deviations, though the standard deviations for all norm-referenced scores exceeded 15 (the value in the population; Table 2). In both groups, DQ yielded a lower estimate than the norm-referenced scores. 
The norm-referenced scores (FSIQ, EXIQ, Z) and DQ were all very strongly and positively correlated (ρ &gt; 0.93) with one another, and more moderately with the absolute scores (RAW, AE, and CSS; for PTEN ρ = 0.62–0.67 and for TSC ρ = 0.35–0.46; Figure 1). The nearly perfect rank-order correlation amongst the absolute scores (RAW, AE, and CSS) is expected as by definition they have a monotonic relationship.</p> <p>PHOTO (COLOR): Figure 1 Observed Data by Age for Each Study Note. IQ = intelligence quotient; EXIQ = extended IQ; Z = Z-normalized score; DQ = developmental quotient; AE = age equivalent; CSS = change sensitive score. Each observation is marked with a filled circle and observations from the same individual are connected by a solid line. For IQ, EXIQ, Z, and DQ, the dotted line reflects the expected population average. For RAW, the dotted line indicates the raw score corresponding to AEs plotted across the X axis. For AE, the dotted line indicates the AE corresponding to each chronological age. There are no normative values for CSS, but the CSS corresponding to each AE is plotted (AE is on the chronological age axis).</p> <hd id="AN0187547685-13">Statistical Interpretation</hd> <p>The raw data subjected to hierarchical linear modeling are shown in Figure 2. Here, we offer a narrative summary of this modeling (see Table 3 for parameter estimates and test statistics and Table 4 for summary). For PTEN, the older and less uniformly impaired of the two samples, the classic standard scores (IQ and EXIQ) would be interpreted as stable within person regardless of age because slope point estimates did not differ from zero (Figure 3), but the floor effects suggested that the model results might be biased. Because EXIQ simply replaced the floor values in IQ with a new floor (i.e., almost no scores between 40 and 10 were observed), it did not successfully address the censoring in IQ. 
The norm-referencing methods that mitigate floor effects, DQ and Z, did both result in lower estimated scores across the age range and did decline within person. These effects were presumably obscured by censoring in IQ and EXIQ, and so these models support the observation that parameter estimates from IQ and EXIQ models were biased. The absolute scoring methods, RAW, AE, and CSS, all indicated growth that was more rapid for younger participants and leveled off for older participants, consistent with expectations for a developmental trajectory (Figure 3). The AE data, however, exhibited both floor and ceiling effects that indicated the possibility of bias. RAW and CSS had excess positive residuals that could threaten model validity. Most of these observations were also true for TSC, except that the DQ and Z were stable within person and behaved more similarly to the standard scores. Further, only DQ and AE had nonhomogenous residual variance; the norm-referenced scores exhibited excess positive residuals that could threaten model validity.</p> <p>PHOTO (COLOR): Figure 2 Distributions and Interrelationships of Each Score Type at Baseline Note. IQ = intelligence quotient; EXIQ = extended IQ; Z = Z-normalized score; DQ = developmental quotient; AE = age equivalent; CSS = change sensitive score. Panel A: PTEN cohort data. Panel B: TSC cohort data. Both panels: Diagonal is the density plot per score. Below the diagonal is scatter plot of scores on X and Y axis. Above the diagonal is Spearman rank-order correlation for scores on X and Y axis.</p> <p>PHOTO (COLOR): Figure 3 Annualized Change Estimates Note. IQ = intelligence quotient; EXIQ = extended IQ; Z = Z-normalized score; DQ = developmental quotient; AE = age equivalent; CSS = change sensitive score. Both panels: Contrasts were used to generate the predicted fixed estimate for DURATION (annualized change) at several hypothetical ages. 
This can be expressed as a function of $\overline{AGE}$ (see Table 4 for the DURATION × $\overline{AGE}$ term), though if the interaction was not different from zero the DURATION estimate will be similar across values of $\overline{AGE}$. Panel A: Annualized change in norm-referenced scores for the PTEN sample. Panel B: Annualized change in absolute scores for the PTEN sample. Panel C: Annualized change in norm-referenced scores for the TSC sample. Panel D: Annualized change in absolute scores for the TSC sample.</p> <p>Table 3 Fixed Effects From Hierarchical Linear Models</p> <p>PHOTO (COLOR)</p> <p>Table 4 Summary of Hierarchical Linear Model Results</p> <p>PHOTO (COLOR)</p> <hd id="AN0187547685-14">Discussion</hd> <p>Researchers have a range of options for operationalizing performance on most standardized developmental tests, and the best option must be determined based on the context of use and the intended interpretation of the score. Here, we leveraged data from two GCAND studies to illustrate the analysis of several score types in the longitudinal context, so that we might discuss the quantitative differences as well as the validity case for interpretation of these scores as reflecting individual change. For TSC, we found agreement across norm-referenced scores in that no within-person change was observed, whereas in PTEN, the use of different norm-referenced scores led to different conclusions. However, the floor effects observed for these scores suggest that the results might be biased. In both studies, the absolute scores were consistent with a theoretical developmental curve (faster gains for younger children and slower gains for older children), but both floor and ceiling effects were found for AE. 
Overall, our results were consistent with our theory-based hypothesis that the person ability score (CSS) would be the most appropriate score type for the longitudinal context.</p> <hd id="AN0187547685-15">Model Suitability</hd> <p>We found that the model residuals for all score types were in some way inconsistent with assumptions in the PTEN and/or TSC studies. Future investigators should be aware that analysis of norm-referenced scores is likely to violate assumptions about variance and normality, and that models using RAW and CSS may violate normality assumptions. The violation of normality might in future research be addressable via transformation or the inclusion of additional explanatory variables, but the censoring causing nonhomogenous variance in the norm-referenced scores cannot be remediated within the general linear model framework. We selected the multilevel model because it is the standard in the field for modeling an outcome as a function of age, but one might consider a Tobit growth curve ([<reflink idref="bib26" id="ref20">26</reflink>]) for data with high rates of floor effects (such as IQ) or a random effects generalization of quantile regression ([<reflink idref="bib17" id="ref21">17</reflink>]) for data where variance increases as a function of scale (such as CSS). These results underscore the importance of reviewing residuals to evaluate the tenability of assumptions in every new model, and suggest that the norm-referenced scores are less well-suited to the standard modeling procedures than the absolute scores.</p> <hd id="AN0187547685-16">Validity of Parameter Interpretation</hd> <p>Next, we turn to the validity arguments for interpreting the parameters produced by the statistical models as pertaining to individual change. 
Some of the current authors ([<reflink idref="bib9" id="ref22">9</reflink>]; [<reflink idref="bib11" id="ref23">11</reflink>]) and others (e.g., [<reflink idref="bib8" id="ref24">8</reflink>]; [<reflink idref="bib14" id="ref25">14</reflink>]; [<reflink idref="bib20" id="ref26">20</reflink>]) have argued that for measuring change over time in developmental concepts, the validity case for ability scores like CSS is stronger than that for raw sum scores, AEs, or any type of norm-referenced scores. The most important feature of the ability score with respect to interpretation in longitudinal research is interval-level measurement, because the evaluation of meaningfulness in change rests on the assumption that the meaning is the same regardless of scale location. The AE is an ordinal variable and so it does not meet this standard. As [<reflink idref="bib16" id="ref27">16</reflink>] pointed out, Wechsler himself lamented that AEs continued to appear in test manuals due to their "firm place in clinical practice ... in spite of the fact that these methods violate the philosophy of the Scale" ([<reflink idref="bib27" id="ref28">27</reflink>], p. 381). Dividing an ordinal variable (AE) by an interval-level variable (chronological age) does not yield an interval-level variable, and so this criticism extends to DQ ([<reflink idref="bib16" id="ref29">16</reflink>]). Further, although the DQ does avoid the normative floor effect, the underlying AE is still subject to the AE floor and ceiling, as observed in the current study. Still, the AE could play an important supporting role in understanding the results of a longitudinal study. The AE can be used to aid clinical interpretation of change in the ability score, which is itself unitless ([<reflink idref="bib10" id="ref30">10</reflink>]). For example, we understand that the model intercept in this study is the estimated mean score when the average age is 11 years. 
For TSC, the CSS intercept was 472, which corresponds to an AE of 5 years, 2 months. For the average 11-year-old TSC participant, the average annualized change of about 4 points per year would translate to an increase to 5 years, 7 months. Note that CSS is the basis of statistical interpretation, and AE is used only as an interpretative support. Ultimately, however, the AE and DQ cannot be reasonably interpreted as an interval level variable and so the valid interpretation of the longitudinal modeling of AE and DQ is simply not possible. AE and DQ should be rejected as endpoints in longitudinal research.</p> <p>Like AE, the raw score sum is also an ordinal variable, though it has many more levels. However, as observed in the results of this study, the patterns of quantitative results were similar for RAW and CSS. Both exhibited faster change for younger participants that plateaued for older participants, and the model residuals for both were inconsistent with the normality assumption. This was unsurprising, because raw sum scores are an ordered approximation of the ability score ([<reflink idref="bib21" id="ref31">21</reflink>]). But when the precise magnitude of change in score is the parameter of interest, the ordinal nature of the raw sum score adversely impacts the validity argument. Importantly, the degree to which this is a problem depends on test construction; where there is more information (items of similar levels of difficulty) on the test, the raw sum score will be closer to interval. Tests developed with considerable resources—like nationally normed and established IQ tests—are likelier to have denser item information than an investigator-created survey instrument. For the SB5, a one-unit change in RAW is equivalent to a one-unit change in CSS for the middle ∼50% of the raw sum score range, supporting its interval-level interpretation in that range. 
As the raw sum score approaches the minimum or maximum extremes, however, this becomes less true (on the SB5, between 2 and 7 RAW points might correspond to a one-unit change in CSS). This could be particularly impactful for studies of individuals with IDD, because samples are likely to have performance in the lower range of scores. Thus, as with any aspect of the validity argument, the extent to which the ordinality of raw sum scores threatens the validity of interpretation must be evaluated for the specific context. Raw sum scores are an absolute measure of ability and the direction of change is interpretable, but their ordinal nature may adversely impact the valid interpretation of the magnitude of change. Thus, if ability scores are not available for a test, the raw sum score appears to have the next strongest validity argument for use in longitudinal analyses.</p> <p>Finally, we consider the norm-referenced scores. Unlike RAW, AE, and DQ, normalized standard scores (e.g., IQ, EXIQ) can be considered interval-level to the extent that ability is normally distributed in the population ([<reflink idref="bib8" id="ref32">8</reflink>]). However, for individuals with IDD, norm-referenced scores like IQ are often significantly limited by floor effects, which the EXIQ is intended to address. As illustrated in both samples in this article, almost no intermediate values were assigned between the original floor of 40 and the EXIQ floor of 10. Although the EXIQ method only moved the standard score floor, the Z-score method did successfully remove it, revealing variability that was censored by the IQ and EXIQ. However, we observed that, despite the normative metrics all putatively measuring relative standing of an individual's cognitive ability, the results of statistical analysis did not always lead to the same interpretation. For the PTEN study, the model results for the norm-referenced score types disagreed not only in magnitude but in direction of effect. 
IQ and EXIQ indicated that, on average, skills were gained in a manner consistent with the normative groups (i.e., point estimates for slopes did not differ from zero), but this was not the case for Z, which indicated declines in scores for younger participants. Further, the clinical interpretation of this negative annualized change estimate for Z, like for any norm-referenced score, is indeterminate. Average skill gains may have been slower than necessary to keep up with the normative age tables; there may have been only stability (neither loss nor gain of skills); or, on average, individuals may have lost skills during study participation. Not only are these three interpretations of Z-score change different from one another, but all of them differ markedly from that of the other norm-referenced scores. Given the psychometric limitations in norm-referenced scores, and the potential for disagreement in conclusions regarding their change, it is difficult to claim that one is more valid than the other.</p> <p>This is partially driven by the fact that the normative scores measure both change within the person and change in reference groups (each to unknowable degrees); our desired interpretation corresponds to the direct quantification of individual change as a function of age. This directly threatens the validity of interpreting the results of a longitudinal study as pertaining to change in the individual. The Projected Retained Ability Score (PRAS) is a new approach to dealing with this limitation ([<reflink idref="bib13" id="ref33">13</reflink>]). With PRAS, an individual's performance over time is scaled against the normative group corresponding to that individual's baseline age. We had hoped to include the PRAS in the current study, but the SB5 publisher declined to provide digital versions of the normative scoring tables to facilitate bulk rescoring (ProEd, personal communication, November 15, 2023), so we address it only in theory. 
Indeed, a difference in PRAS does reflect absolute change (like the CSS), but the units are relative (like IQ). In addition to failing to address the significant limitation of standard score floor effects, this could complicate the validity case for interpretation of change scores, depending on the length of follow up of a study (e.g., what is the meaning of comparing change within a now-10-year-old to the baseline distribution of 5-year-olds?). Further theoretical and quantitative evaluation of this method is needed.</p> <hd id="AN0187547685-17">Clinical Meaningfulness</hd> <p>Supporting clinical trial readiness is a goal of the Developmental Synaptopathies Consortium, and the results in this article are intended to aid researchers in the construction of endpoints where a developmental concept is of interest. It is therefore essential to distinguish clinical <emph>meaning</emph>, which we have used here to refer to the interpretation of statistical results with respect to the human behavior under study, from clinical <emph>meaningfulness</emph> ([<reflink idref="bib28" id="ref34">28</reflink>]). A statistically detectable difference may not be judged by the patient to be of sufficient magnitude to warrant use of a medication ([<reflink idref="bib25" id="ref35">25</reflink>]). Thus, regulatory agencies require both statistical evidence of efficacy and qualitative evidence supporting the clinical meaningfulness of that statistical effect to stakeholders ([<reflink idref="bib25" id="ref36">25</reflink>]). In this study, we have focused only on the former. The meaningfulness of an effect depends on the context, and so one must determine via qualitative methods what a clinically meaningful effect is for an individual study, regardless of the selected scoring method. We have sometimes encountered the argument that the clinical meaningfulness of the score types mentioned here can be established by comparing to their <emph>SD</emph> or <emph>SEM</emph>. 
This is especially true of norm-referenced scores, for which meaningful change is often colloquially defined based on the population standard deviation (e.g., one-half an <emph>SD</emph>). This is a distribution-based method of establishing meaningful differences, which is not considered adequate by regulatory agencies ([<reflink idref="bib24" id="ref37">24</reflink>]). The anchor-based method, though not without limitations ([<reflink idref="bib29" id="ref38">29</reflink>]), is one alternative approach. This method compares quantitative change on the outcome measure against qualitative ratings of improvement (usually from the patient or caregiver's perspective; e.g., [<reflink idref="bib5" id="ref39">5</reflink>]). And so, although we cannot speak to the clinical meaningfulness of any of the statistical results described in this study because we have not performed the necessary qualitative work, we refer interested readers to work exploring methods for establishing clinically meaningful change for individuals with GCAND and related conditions ([<reflink idref="bib7" id="ref40">7</reflink>]).</p> <hd id="AN0187547685-18">Conclusion</hd> <p>Whether an endpoint is fit for purpose depends on the context of use and, in the case of longitudinal research, the goal is to derive information about individual change in ability as a function of time. Here, we described and illustrated theoretical and quantitative threats to validity for several types of scores from a standardized developmental test. Some limitations of the norm-referenced scores, such as floor effects, were observable in the data. However, other limitations, such as the indeterminacy of change in norm-referenced scores and the ordinal nature of AEs and raw sum scores, are not observable in model results and must be considered theoretically. Researchers must consider the validity of each score type for their particular context. 
Based on theory and the statistical results of this study, we argue that, for longitudinal studies of people with IDD, the person ability score is most appropriate.</p> <p> <emph>Portions of this project were presented at the 2024 Gatlinburg Conference on Intellectual and Developmental Disabilities and the 2024 International Association for the Scientific Study of Intellectual and Developmental Disabilities. The Developmental Synaptopathies Consortium (U54NS092090) is part of the National Center for Advancing Translational Sciences (NCATS) Rare Diseases Clinical Research Network (RDCRN) and is supported by the RDCRN Data Management and Coordinating Center (DMCC; U2CTR002818). RDCRN is an initiative of the Office of Rare Diseases Research (ORDR), NCATS, funded through a collaboration between NCATS and the National Institute of Neurological Disorders and Stroke of the National Institutes of Health (NINDS), Eunice Kennedy Shriver National Institute of Child Health &amp; Human Development (NICHD) and National Institute of Mental Health (NIMH). This work was also supported in part by the Intramural Research Program of the NIMH (ZIC-MH002961).</emph> </p> <p> <emph>The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health (NIH).</emph> </p> <p> <emph>We are saddened by the passing of our coauthor, Dr. Charis Eng, in August 2024. Her outstanding contributions to clinical practice, science, and mentorship leave a lasting legacy. We are sincerely indebted to the generosity of the families and patients in PMS, PTEN, and TSC clinics across the United States who contributed their time and effort to these studies. We also thank the Phelan-McDermid Syndrome Foundation, the PTEN Hamartoma Tumor Syndrome Foundation, the PTEN Research Foundation, and the TSC Alliance for their continued support. We thank Aaron J. 
Kaat, Ph.D., and Mark Daniel, Ph.D., for providing helpful suggestions and guidance in the preparation of this manuscript.</emph> </p> <ref id="AN0187547685-19"> <title> References </title> <blist> <bibl id="bib1" idref="ref19" type="bt">1</bibl> <bibtext> Bates, D., Mächler, M., Bolker, B., &amp; Walker, S. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1–48. <ulink href="http://dx.doi.org/10.18637/jss.v067.i01">http://dx.doi.org/10.18637/jss.v067.i01</ulink></bibtext> </blist> <blist> <bibl id="bib2" idref="ref3" type="bt">2</bibl> <bibtext> Bishop, S. L., Farmer, C., &amp; Thurm, A. (2015). Measurement of nonverbal IQ in autism spectrum disorder: scores in young adulthood compared to early childhood. Journal of Autism and Developmental Disorders, 45(4), 966–974. https://doi.org/10.1007/s10803-014-2250-3</bibtext> </blist> <blist> <bibl id="bib3" idref="ref11" type="bt">3</bibl> <bibtext> Busch, R. M., Frazier, T. W., II, Sonneborn, C., Hogue, O., Klaas, P., Srivastava, S., Hardan, A. Y., Martinez-Agosto, J. A., Sahin, M., &amp; Eng, C. (2023). Longitudinal neurobehavioral profiles in children and young adults with PTEN hamartoma tumor syndrome and reliable methods for assessing neurobehavioral change. Journal of Neurodevelopmental Disorders, 15(1), 3. https://doi.org/10.1186/s11689-022-09468-4</bibtext> </blist> <blist> <bibl id="bib4" idref="ref12" type="bt">4</bibl> <bibtext> Busch, R. M., Srivastava, S., Hogue, O., Frazier, T. W., Klaas, P., Hardan, A., Martinez-Agosto, J. A., Sahin, M., &amp; Eng, C. (2019). Neurobehavioral phenotype of autism spectrum disorder associated with germline heterozygous mutations in PTEN. Translational Psychiatry, 9(1), 253. https://doi.org/10.1038/s41398-019-0588-1</bibtext> </blist> <blist> <bibl id="bib5" idref="ref39" type="bt">5</bibl> <bibtext> Chatham, C. H., Taylor, K. I., Charman, T., Liogier D'ardhuy, X., Eule, E., Fedele, A., Hardan, A. 
Y., Loth, E., Murtagh, L., Del Valle Rubido, M., San Jose Caceres, A., Sevigny, J., Sikich, L., Snyder, L., Tillmann, J. E., Ventola, P. E., Walton-Bowen, K. L., Wang, P. P., Willgoss, T., &amp; Bolognani, F. (2018). Adaptive behavior in autism: Minimal clinically important differences on the Vineland-II. Autism Research, 11(2), 270–283. https://doi.org/10.1002/aur.1874</bibtext> </blist> <blist> <bibl id="bib6" idref="ref17" type="bt">6</bibl> <bibtext> Curran, P. J., &amp; Bauer, D. J. (2011). The disaggregation of within-person and between-person effects in longitudinal models of change. Annual Review of Psychology, 62, 583–619.</bibtext> </blist> <blist> <bibl id="bib7" idref="ref40" type="bt">7</bibl> <bibtext> Duong, T., Staunton, H., Braid, J., Barriere, A., Trzaskoma, B., Gao, L., Willgoss, T., Cruz, R., Gusset, N., Gorni, K., Randhawa, S., Yang, L., &amp; Vuillerot, C. (2021). A patient-centered evaluation of meaningful change on the 32-item Motor Function Measure in spinal muscular atrophy using qualitative and quantitative data. Frontiers in Neurology, 12, 770423. https://doi.org/10.3389/fneur.2021.770423</bibtext> </blist> <blist> <bibl id="bib8" idref="ref24" type="bt">8</bibl> <bibtext> Eisengart, J. B., Daniel, M. H., Adams, H. R., Williams, P., Kuca, B., &amp; Shapiro, E. (2022). Increasing precision in the measurement of change in pediatric neurodegenerative disease. Molecular Genetics and Metabolism, 137(1), 201–209. https://doi.org/10.1016/j.ymgme.2022.09.001</bibtext> </blist> <blist> <bibl id="bib9" idref="ref8" type="bt">9</bibl> <bibtext> Farmer, C., Kaat, A. J., Berry-Kravis, E., &amp; Thurm, A. (2022). Psychometric perspectives on developmental outcome and endpoint selection in treatment trials for genetic conditions associated with neurodevelopmental disorder. In Esbensen A. J. &amp; Schworer E. K. (Eds.), International review of research in developmental disabilities (Vol. 62, pp. 1–39). Academic Press. 
https://doi.org/10.1016/bs.irrdd.2022.05.001</bibtext> </blist> <blist> <bibtext> Farmer, C., Ludwig, N. N., &amp; Thurm, A. (2025). A tutorial on person ability scores for the intellectual and developmental disabilities clinician. International Journal of Developmental Disabilities, 1–10.</bibtext> </blist> <blist> <bibtext> Farmer, C., Thurm, A., Troy, J. D., &amp; Kaat, A. J. (2023). Comparing ability and norm-referenced scores as clinical trial outcomes for neurodevelopmental disabilities: A simulation study. Journal of Neurodevelopmental Disorders, 15(1), 4. https://doi.org/10.1186/s11689-022-09474-6</bibtext> </blist> <blist> <bibtext> Hessl, D., Nguyen, D. V., Green, C., Chavez, A., Tassone, F., Hagerman, R. J., Senturk, D., Schneider, A., Lightbody, A., Reiss, A. L., &amp; Hall, S. (2009). A solution to limitations of cognitive testing in children with intellectual disabilities: The case of fragile X syndrome. Journal of Neurodevelopmental Disorders, 1(1), 33–45. https://doi.org/10.1007/s11689-008-9001-8</bibtext> </blist> <blist> <bibtext> Kronenberger, W. G., Harrington, M., &amp; Yee, K. S. (2021). Projected Retained Ability Score (PRAS): A new methodology for quantifying absolute change in norm-based psychological test scores over time. Assessment, 28(2), 367–379. https://doi.org/10.1177/1073191119872250</bibtext> </blist> <blist> <bibtext> Kwok, E., Feiner, H., Grauzer, J., Kaat, A., &amp; Roberts, M. Y. (2022). Measuring change during intervention using norm-referenced, standardized measures: A comparison of raw scores, standard scores, age equivalents, and growth scale values from the Preschool Language Scales-Fifth edition. Journal of Speech, Language, and Hearing Research, 65(11), 4268–4279. https://doi.org/10.1044/2022_jslhr-22-00122</bibtext> </blist> <blist> <bibtext> Levy, T., Foss-Feig, J. H., Betancur, C., Siper, P. M., Trelles-Thorne, M. D. P., Halpern, D., Frank, Y., Lozano, R., Layton, C., Britvan, B., Bernstein, J. A., Buxbaum, J. 
D., Berry-Kravis, E., Powell, C. M., Srivastava, S., Sahin, M., Soorya, L., Thurm, A., &amp; Kolevzon, A. (2022). Strong evidence for genotype-phenotype correlations in Phelan-McDermid syndrome: results from the developmental synaptopathies consortium. Human Molecular Genetics, 31(4), 625–637. https://doi.org/10.1093/hmg/ddab280</bibtext> </blist> <blist> <bibtext> Ostrolenk, A., &amp; Courchesne, V. (2023). Examining the validity of the use of ratio IQs in psychological assessments. Acta Psychologica, 240, 104054. https://doi.org/10.1016/j.actpsy.2023.104054</bibtext> </blist> <blist> <bibtext> Petscher, Y., &amp; Logan, J. A. R. (2014). Quantile regression in the study of developmental sciences. Child Development, 85(3), 861–881. https://doi.org/10.1111/cdev.12190</bibtext> </blist> <blist> <bibtext> Roid, G. H. (2003). Stanford-Binet Intelligence Scales, fifth edition, technical Manual. Riverside Publishing.</bibtext> </blist> <blist> <bibtext> Sansone, S. M., Schneider, A., Bickel, E., Berry-Kravis, E., Prescott, C., &amp; Hessl, D. (2014). Improving IQ measurement in intellectual disabilities using true deviation from population norms. Journal of Neurodevelopmental Disorders, 6(1), 16. https://doi.org/10.1186/1866-1955-6-16</bibtext> </blist> <blist> <bibtext> Shapiro, E. G., Eisengart, J. B., Whiteman, D., &amp; Whitley, C. B. (2024). Ability change across multiple domains in mucopolysaccharidosis (Sanfilippo syndrome) type IIIA. Molecular Genetics and Metabolism, 141(2), 108110. https://doi.org/10.1016/j.ymgme.2023.108110</bibtext> </blist> <blist> <bibtext> Sijtsma, K., Ellis, J. L., &amp; Borsboom, D. (2024). Recognize the value of the sum score, psychometrics' greatest accomplishment. Psychometrika, 1–34.</bibtext> </blist> <blist> <bibtext> Soorya, L., Leon, J., Trelles, M. P., &amp; Thurm, A. (2018). 
Framework for assessing individuals with rare genetic disorders associated with profound intellectual and multiple disabilities (PIMD): The example of Phelan McDermid Syndrome. The Clinical Neuropsychologist, 32(7), 1226–1255. https://doi.org/10.1080/13854046.2017.1413211</bibtext> </blist> <blist> <bibtext> Timmerman, M. E., Voncken, L., &amp; Albers, C. J. (2021). A tutorial on regression-based norming of psychological tests with GAMLSS. Psychological Methods, 26(3), 357–373. https://doi.org/10.1037/met0000348</bibtext> </blist> <blist> <bibtext> U.S. Department of Health and Human Services. (2022). Patient-Focused drug development: Selecting, developing, or modifying fit-for-purpose clinical outcome assessments [Guidance document]. U.S. Food and Drug Administration.</bibtext> </blist> <blist> <bibtext> U.S. Department of Health and Human Services. (2023). Patient-Focused drug development: Incorporating clinical outcome assessments into endpoints for regulatory decision-making [Guidance document]. U.S. Food and Drug Administration.</bibtext> </blist> <blist> <bibtext> Wang, L., Zhang, Z., McArdle, J. J., &amp; Salthouse, T. A. (2008). Investigating ceiling effects in longitudinal data analysis. Multivariate Behavioral Research, 43(3), 476–496.</bibtext> </blist> <blist> <bibtext> Wechsler, D. (1951). Equivalent test and mental ages for the WISC. Journal of Consulting Psychology, 15(5), 381.</bibtext> </blist> <blist> <bibtext> Weinfurt, K. P. (2019). Clarifying the meaning of clinically meaningful benefit in clinical research: Noticeable change vs valuable change. JAMA, 322(24), 2381–2382. https://doi.org/10.1001/jama.2019.18496</bibtext> </blist> <blist> <bibtext> Wyrwich, K. W., &amp; Norman, G. R. (2023). The challenges inherent with anchor-based approaches to the interpretation of important change in clinical outcome assessments. Quality of Life Research, 32(5), 1239–1246. 
https://doi.org/10.1007/s11136-022-03297-7</bibtext> </blist> </ref> <aug> <p>By Cristan Farmer; Audrey Thurm; Tanvi Das; E. Martina Bebin; Jonathan A. Bernstein; Elizabeth Berry-Kravis; Joseph D. Buxbaum; Charis Eng; Thomas Frazier; Antonio Y. Hardan; Alexander Kolevzon; Darcy A. Krueger; Julian A. Martinez-Agosto; Hope Northrup; Craig M. Powell; Latha Valluripalli Soorya; Joyce Y. Wu and Mustafa Sahin</p> </aug>
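<p>The AE and DQ arithmetic in the Validity of Parameter Interpretation section, and the floor censoring described for IQ, can be sketched in a few lines of Python. This is a minimal illustration, not the authors' analysis code. The values for the average 11-year-old TSC participant (an AE of 5 years, 2 months rising to 5 years, 7 months) and the IQ floor of 40 are taken from the text; the ×100 scaling of DQ, the helper names, and the four yearly standard scores in the second part are assumptions introduced only for illustration.</p>

```python
def months(years, extra_months=0):
    """Convert an age given as years plus months into total months."""
    return 12 * years + extra_months


def developmental_quotient(age_equivalent_months, chronological_age_months):
    """Ratio-style DQ: age equivalent over chronological age, times 100.

    The x100 scaling is the usual convention and is an assumption here.
    """
    return 100.0 * age_equivalent_months / chronological_age_months


def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Part 1: values quoted in the text for the average 11-year-old TSC participant.
ae_now = months(5, 2)    # AE corresponding to the CSS intercept of 472
ae_next = months(5, 7)   # AE after roughly one more year of CSS growth
dq_now = developmental_quotient(ae_now, months(11))
dq_next = developmental_quotient(ae_next, months(12))

# Part 2: HYPOTHETICAL standard scores for one person, falling 4 points per year.
years = [0, 1, 2, 3]
uncensored = [46, 42, 38, 34]
IQ_FLOOR = 40  # lowest IQ the standard norm tables assign, per the text
censored = [max(score, IQ_FLOOR) for score in uncensored]

if __name__ == "__main__":
    print(f"AE gain over one year: {ae_next - ae_now} months")
    print(f"DQ at 11 years: {dq_now:.2f}; DQ at 12 years: {dq_next:.2f}")
    print(f"Slope without floor: {ols_slope(years, uncensored):.1f} points/year")
    print(f"Slope with floor at {IQ_FLOOR}: {ols_slope(years, censored):.1f} points/year")
```

<p>Under these assumptions, the AE rises by five months while the DQ stays near 47 (dipping slightly), and flooring the hypothetical declining scores at 40 halves the fitted slope, from -4 to -2 points per year. A single-person least-squares line is only a stand-in for the hierarchical linear models used in the study.</p>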
Items	– Name: AbstractInfo Label: Abstractor Group: Ab Data: As Provided – Name: DateEntry Label: Entry Date Group: Date Data: 2025 – Name: AN Label: Accession Number Group: ID Data: EJ1482497
PLink	https://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=eric&AN=EJ1482497
RecordInfo	BibRecord: BibEntity: Identifiers: – Type: doi Value: 10.1352/1944-7558-130.5.344 Languages: – Text: English PhysicalDescription: Pagination: PageCount: 18 StartPage: 344 Subjects: – SubjectFull: Cognitive Tests Type: general – SubjectFull: Intelligence Tests Type: general – SubjectFull: Cognitive Ability Type: general – SubjectFull: Intellectual Disability Type: general – SubjectFull: Developmental Disabilities Type: general – SubjectFull: Scores Type: general – SubjectFull: Test Use Type: general – SubjectFull: Test Interpretation Type: general – SubjectFull: Genetic Disorders Type: general – SubjectFull: Longitudinal Studies Type: general – SubjectFull: Change Type: general – SubjectFull: Standardized Tests Type: general – SubjectFull: Stanford Binet Intelligence Scale Type: general Titles: – TitleFull: Which Score for What? Operationalizing Standardized Cognitive Test Performance for the Assessment of Change Type: main BibRelationships: HasContributorRelationships: – PersonEntity: Name: NameFull: Cristan Farmer – PersonEntity: Name: NameFull: Audrey Thurm – PersonEntity: Name: NameFull: Tanvi Das – PersonEntity: Name: NameFull: E. Martina Bebin – PersonEntity: Name: NameFull: Jonathan A. Bernstein – PersonEntity: Name: NameFull: Elizabeth Berry-Kravis – PersonEntity: Name: NameFull: Joseph D. Buxbaum – PersonEntity: Name: NameFull: Charis Eng – PersonEntity: Name: NameFull: Thomas Frazier – PersonEntity: Name: NameFull: Antonio Y. Hardan – PersonEntity: Name: NameFull: Alexander Kolevzon – PersonEntity: Name: NameFull: Darcy A. Krueger – PersonEntity: Name: NameFull: Julian A. Martinez-Agosto – PersonEntity: Name: NameFull: Hope Northrup – PersonEntity: Name: NameFull: Craig M. Powell – PersonEntity: Name: NameFull: Latha Valluripalli Soorya – PersonEntity: Name: NameFull: Joyce Y. Wu – PersonEntity: Name: NameFull: Mustafa Sahin IsPartOfRelationships: – BibEntity: Dates: – D: 01 M: 01 Type: published Y: 2025 Identifiers: – Type: issn-print Value: 1944-7515 – Type: issn-electronic Value: 1944-7558 Numbering: – Type: volume Value: 130 – Type: issue Value: 5 Titles: – TitleFull: American Journal on Intellectual and Developmental Disabilities Type: main