
Statistically Guided Grading Judgements: Contextualisation or Contamination?

Bibliographic Details
Title:	Statistically Guided Grading Judgements: Contextualisation or Contamination?
Language:	English
Authors:	Louise Badham (ORCID 0000-0003-0411-1827)
Source:	Oxford Review of Education. 2025 51(1):17-35.
Availability:	Routledge. Available from: Taylor & Francis, Ltd. 530 Walnut Street Suite 850, Philadelphia, PA 19106. Tel: 800-354-1420; Tel: 215-625-8900; Fax: 215-207-0050; Web site: http://www.tandf.co.uk/journals
Peer Reviewed:	Y
Page Count:	19
Publication Date:	2025
Document Type:	Journal Articles Reports - Research
Descriptors:	Advanced Placement Programs, Grading, Interrater Reliability, Evaluative Thinking, Evaluation Criteria, Statistical Data, Teacher Attitudes, Error of Measurement, Scoring, Scoring Rubrics, Context Effect, Bias
DOI:	10.1080/03054985.2023.2290640
ISSN:	0305-4985 1465-3915
Abstract:	Different sources of assessment evidence are reviewed during International Baccalaureate (IB) grade awarding to convert marks into grades and ensure fair results for students. Qualitative and quantitative evidence are analysed to determine grade boundaries, with statistical evidence weighed against examiner judgement and teachers' feedback on examinations. A trial was conducted to explore how examiners' grading decisions were influenced by having access to statistical evidence. Grade awards were replicated in nine exams across five subjects, with examiners accessing all available evidence in one model, and only scripts and grade descriptors in the other. Preliminary findings suggest that both approaches lead to broadly comparable grading outcomes. Focus group feedback indicates that examiners consider judging the grade-worthiness of student work to be their primary role in grade award. Whilst they found item-level data helpful for prioritising questions for review, participants reported that access to evidence such as statistically recommended boundaries can cloud their judgement or encourage strategic grading. This study also raises further questions about the purposes and uses of different forms of statistical evidence, as well as how and when they should be integrated with qualitative evidence in grade awarding.
Abstractor:	As Provided
Entry Date:	2025
Accession Number:	EJ1458475
Database:	ERIC

<anid>AN0182340775;oxr01feb.25;2025Jan23.01:33;v2.2.500</anid> <title id="AN0182340775-1">Statistically guided grading judgements: contextualisation or contamination? </title> <p>Different sources of assessment evidence are reviewed during International Baccalaureate (IB) grade awarding to convert marks into grades and ensure fair results for students. Qualitative and quantitative evidence are analysed to determine grade boundaries, with statistical evidence weighed against examiner judgement and teachers' feedback on examinations. A trial was conducted to explore how examiners' grading decisions were influenced by having access to statistical evidence. Grade awards were replicated in nine exams across five subjects, with examiners accessing all available evidence in one model, and only scripts and grade descriptors in the other. Preliminary findings suggest that both approaches lead to broadly comparable grading outcomes. Focus group feedback indicates that examiners consider judging the grade-worthiness of student work to be their primary role in grade award. Whilst they found item-level data helpful for prioritising questions for review, participants reported that access to evidence such as statistically recommended boundaries can cloud their judgement or encourage strategic grading. 
This study also raises further questions about the purposes and uses of different forms of statistical evidence, as well as how and when they should be integrated with qualitative evidence in grade awarding.</p> <p>Keywords: Grade awarding; examination standards; examiner judgement; criterion referencing; International Baccalaureate</p> <hd id="AN0182340775-2">Introduction</hd> <p>The International Baccalaureate (IB) offers pre-university summative assessments through its Diploma Programme, which is intended to serve as a 'passport to higher education' (Hill, [<reflink idref="bib21" id="ref1">21</reflink>], p. 101). Students' final grades have a significant impact on their future academic and professional trajectories, so examination standards must be maintained over time to ensure their results are fair and meaningful. The maintenance of examination standards is also central to stakeholders' trust in assessment systems, without which grades lack value and meaning (Simpson &amp; Baird, [<reflink idref="bib32" id="ref2">32</reflink>]). It is therefore essential that grade awarding (also known as standard setting and maintaining) – the process through which grade boundaries are established to convert marks into grades – is robust and rigorous.</p> <p>In 2019, the IB implemented an alternative model of grade awarding in some subjects for operational reasons. Rather than examiners reviewing statistical evidence and teacher feedback on examinations before grading student work, in the new model they graded scripts without access to additional evidence. In an exploratory study, experienced IB examiners and assessment staff replicated grade award processes in nine exams across five subjects, each following the opposite model for their subject compared to 2019. Grade boundary decisions were compared in each subject, and focus groups were held to gather participants' feedback on the advantages and disadvantages of the different approaches. 
This article discusses the findings of the study and considers what they mean for how different sources of evidence are integrated in grade award.</p> <hd id="AN0182340775-3">Approaches to grade awarding</hd> <p>Examination standards are a complex and multifaceted area of assessment that is mired in contradictory understandings of what is meant by 'standards' (Baird, [<reflink idref="bib5" id="ref3">5</reflink>]; Coe, [<reflink idref="bib16" id="ref4">16</reflink>]). This is further complicated by the opacity of terminology, processes, and practices in how grades are established in educational assessment. An additional complicating factor in navigating the theory and practice of standard setting is that different techniques are used in different examination systems around the world (Baird et al., [<reflink idref="bib8" id="ref5">8</reflink>]). Approaches vary from being solely reliant on quantitative evidence, to prioritising only qualitative evidence, or some combination of the two. Similar to the UK-based examination boards offering the Advanced Level (A Level), the IB uses both quantitative and qualitative evidence to establish grade boundaries. However, most standard setting research tends to emerge from the US-based psychometrics tradition where assessments are not directly tied to curricula (Baird &amp; Gray, [<reflink idref="bib7" id="ref6">7</reflink>]), so statistical methods associated with psychometrics tend to dominate the literature. Yet, most awarding processes draw upon both quantitative and qualitative approaches in practice, as 'well-conceived and implemented standard setting must recognize that any procedure requires participants to rely on both dimensions to effectively carry out their task' (Cizek &amp; Bunch, [<reflink idref="bib15" id="ref7">15</reflink>], p. 10). 
Despite the seemingly common 'mixed methods' approach to grade awarding, research into how qualitative and quantitative sources of evidence are combined in the process is surprisingly scarce (Opposs &amp; Gorgen, [<reflink idref="bib29" id="ref8">29</reflink>]).</p> <hd id="AN0182340775-4">Statistical versus judgemental evidence</hd> <p>Clarity about how qualitative and quantitative data are combined in awarding is needed to facilitate discussions about the benefits and drawbacks of the different approaches (Cambridge Assessment, [<reflink idref="bib14" id="ref9">14</reflink>]). In the UK context, this issue has generated research on both statistical and judgemental evidence in awarding. Criticisms of statistical equating methods include that they can fail to account for changes in candidate ability or the difficulty of exams from one year to the next (Newton, [<reflink idref="bib28" id="ref10">28</reflink>]). Atypical awarding contexts such as small cohorts, fluctuating candidate entries and changes to course specifications also create challenges for statistical methods used to generate statistically recommended boundaries (AlphaPlus, [<reflink idref="bib2" id="ref11">2</reflink>]). Examiner judgement faces different reliability challenges, such as the consistency of candidate performance and different features of exam scripts unduly influencing examiners' grading decisions (Scharaschkin &amp; Baird, [<reflink idref="bib31" id="ref12">31</reflink>]; Suto &amp; Novakovic, [<reflink idref="bib34" id="ref13">34</reflink>]). Consideration has also been given to the strengths and weaknesses of the different approaches and how much relative weight should be placed on statistical versus judgemental methods (Benton &amp; Bramley, [<reflink idref="bib10" id="ref14">10</reflink>]; Newton, [<reflink idref="bib27" id="ref15">27</reflink>]). 
Yet, whilst practices for combining statistical and judgemental evidence may be detailed in organisational documentation, there remains a scarcity of academic literature dedicated to exploring the interaction between the two.</p> <p>There is, however, general consensus that examiner judgement is needed, but is insufficient as the sole basis for grade boundary decisions (Benton &amp; Bramley, [<reflink idref="bib10" id="ref16">10</reflink>]), as examiners are unable to account sufficiently for changes in exam difficulty (Baird, [<reflink idref="bib4" id="ref17">4</reflink>]). Whilst examiners might shift their grading expectations to an extent to accommodate changing levels of difficulty, they do not adjust as far as statistical data suggests is necessary (Good &amp; Cresswell, [<reflink idref="bib20" id="ref18">20</reflink>]). This is not an indictment of examiners' professional expertise, but rather points to the impossibility of pinpointing an exact point at which student responses collectively shift in quality from one grade to the next. Examiners' grading judgements are also influenced by different features of scripts, such as missing responses, accuracy, the length of responses, and poor answers to easy questions or good answers to hard questions (Bramley, [<reflink idref="bib12" id="ref19">12</reflink>]; Suto &amp; Novakovic, [<reflink idref="bib34" id="ref20">34</reflink>]). The internal consistency of responses also has an impact, as examiners tend to be harsher when reviewing individual elements compared to the whole piece of work (Baird &amp; Scharaschkin, [<reflink idref="bib9" id="ref21">9</reflink>]).</p> <p>Furthermore, other elements can influence examiner grading judgements, such as the original marks awarded or social dynamics within an examining team (Crisp, [<reflink idref="bib18" id="ref22">18</reflink>]). 
However, factors that are intended to contribute to the judgement process do not always have a significant impact, and awarders tend to rely more on their own internalised notion of the 'standard' rather than that which is exemplified in archived boundary scripts (Baird, [<reflink idref="bib3" id="ref23">3</reflink>]). Similarly, grade descriptors which are intended to describe the levels of attainment for each grade are more helpful in '"conveying the flavour" of a grade rather than defining it' (Suto &amp; Novakovic, [<reflink idref="bib34" id="ref24">34</reflink>], p. 315). Judgements of grade-worthiness are therefore not 'fine-grained' in nature, but 'a broad-beamed searchlight for identifying standards' (Baird &amp; Dhillon, [<reflink idref="bib6" id="ref25">6</reflink>], p. 2). Yet, whilst there are certainly challenges with examiner grading in awarding, there is also a tendency to focus only on reliability issues, and overlook the benefits of judgemental evidence (Benton &amp; Bramley, [<reflink idref="bib10" id="ref26">10</reflink>]).</p> <p>Statistical evidence suffers from similar misconceptions, particularly as statistical methods are often conflated with norm-referencing. Instead, statistically recommended boundaries might more appropriately be considered as additional evidence to support awarding decisions, or 'statistical rules of thumb' for where grade boundaries should sit (Newton, [<reflink idref="bib28" id="ref27">28</reflink>], p. 879). As statistical and judgemental methods both have strengths and weaknesses, it is often recommended that both are incorporated into grade awarding processes (Black &amp; Bramley, [<reflink idref="bib11" id="ref28">11</reflink>]; Newton, [<reflink idref="bib27" id="ref29">27</reflink>]). 
Yet, the mechanics of how these different elements are integrated in awarding remain somewhat opaque.</p> <hd id="AN0182340775-5">Integrating sources of evidence</hd> <p>In IB grade award, the approach of balancing expert judgement on student work with statistical evidence and teacher feedback on assessments is believed to offer the fairest possible results. IB Diploma subjects are graded on a scale of 1 to 7 (with 7 representing the highest level of attainment), and the fairest outcome is believed to be that which ensures that grades 'mean the same whichever session a candidate takes their exam in' (IBO, [<reflink idref="bib25" id="ref30">25</reflink>], p. 133). By combining qualitative and quantitative methods in awarding, 'controlled, rigorous and complex' outcomes can be produced (Opposs &amp; Gorgen, [<reflink idref="bib29" id="ref31">29</reflink>], p. 62). However, questions remain around how the different sources of evidence should be integrated. Should examiners be expected to account for statistical recommendations explicitly in their grading decisions? In this sense, the purpose of examiners reviewing statistical evidence may be to contextualise their grading, guide decision-making and avoid judgements taking place in a vacuum. Or does this muddy the waters, with additional evidence distracting examiners from the grading task at hand, or even contaminating their judgements by introducing bias into the process? From this perspective, the preference may be for grading to take place as an independent activity, with grading decisions weighed against other evidence after the fact. The question, then, is who should integrate the different sources of evidence: examiners or assessment staff?</p> <p>This may depend on how different sources of evidence influence examiners' judgements. 
Examiner grading typically occurs 'within the context or confines of statistical information arguably constraining the purity of the expert judgement' (Black &amp; Bramley, [<reflink idref="bib11" id="ref32">11</reflink>], p. 359). Some examiners may be reluctant to accept statistical recommendations if they are seen to undermine their professional competence (Cresswell, [<reflink idref="bib17" id="ref33">17</reflink>]). On the other hand, awarders might instinctively restrict grading within the confines of the previous year's grade boundaries (Benton &amp; Bramley, [<reflink idref="bib10" id="ref34">10</reflink>]). Flawed statistically recommended boundaries can also skew grading judgements and undermine awarders' confidence (Stringer, [<reflink idref="bib33" id="ref35">33</reflink>]). Thus, there appear to be different ways in which the 'purity' of examiner grading judgements can be contaminated through the provision of statistical data in awarding.</p> <hd id="AN0182340775-6">Influencing factors in examiners' grading decisions</hd> <p>Subject-specific features can also influence examiners' interactions with statistical evidence. Different academic traditions can affect how examiners prioritise different types of evidence in awarding (Cresswell, [<reflink idref="bib17" id="ref36">17</reflink>]). Similarly, assessment design is important, with extended responses offering more substance for holistic grading judgements, compared to shorter items which might not neatly align with grade descriptors. In multi-item exams, examiners therefore tend to focus on extended responses that are more useful for distinguishing between grades (Crisp, [<reflink idref="bib18" id="ref37">18</reflink>]). Linguistic factors can also impact examiners' experiences. Non-native speakers can find the technical language of awarding challenging, which includes statistical terms (e.g. standard deviation), assessment terminology (e.g. cohort) and technological jargon (e.g. 
familiarisation mode) that are not common parlance for most examiners, let alone those working in their second language. There are also broader cultural and contextual influences. The IB examining community represents varied assessment experiences from different education systems around the world, all of which are 'enshrined through culture and context' (Isaacs &amp; Gorgen, [<reflink idref="bib26" id="ref38">26</reflink>], p. 307). Examiners in IB grade award typically represent diverse linguistic and cultural backgrounds and are likely to have different culturally embedded notions of what standard setting should look like, including the relative weighting of different forms of evidence.</p> <hd id="AN0182340775-7">Context: IB grade award</hd> <p>The purpose of grade award is to transform candidates' raw marks on individual assessments into one overall subject grade that represents their level of attainment in the discipline. This involves identifying the 'turning point' on the original mark scale where the quality of the work shifts from one grade to the next: i.e. the grade boundary. The IB defines its approach to grade awarding as 'weak criterion-referencing' (also known as attainment referencing) to reflect that it 'is based upon criteria but recognizes the evidence of the Good and Cresswell effect' (IBO, [<reflink idref="bib25" id="ref39">25</reflink>], p. 54). That is to say, it acknowledges the benefits of criterion referencing, but recognises its limitations as examiner judgement cannot sufficiently account for changes in exam difficulty. This is mitigated by analysing statistical data alongside judgemental evidence. Examiners review scripts in specified mark ranges where the grade boundary is considered likely to fall based on statistical evidence, including statistically recommended boundaries (SRBs). 
SRBs provide statistical estimates of which grade boundaries would produce the most similar percentage of candidates at each grade compared to the previous year.</p> <p>However, whilst SRBs are a helpful indicator, they are not sufficient alone to produce boundaries. Certain scenarios can undermine the SRB procedure, such as a change in curriculum or very small cohorts which are inherently unstable, making SRBs less reliable (AlphaPlus, [<reflink idref="bib2" id="ref40">2</reflink>]). Another challenge is that SRBs cannot usually reproduce the previous year's percentages exactly. Instead, the closest possible percentage must be identified, which is likely to be (albeit marginally) harsher or more generous than the previous year. On rare occasions, multiple SRBs may be produced for the same grade. For example, a grade 7 boundary at 64 marks may have resulted in 14% of candidates at grade 7 last year. However, this year the same boundary may result in 13% at grade 7, but a 63 mark boundary would produce 15%. Both options therefore differ from last year's figure by one percentage point, so there are two possible SRBs.</p> <p>Even without these practical challenges, the IB's method of calculating SRBs is 'an example of pure norm-referencing' (AlphaPlus, [<reflink idref="bib1" id="ref41">1</reflink>], p. 5). Therefore, if applied blindly, they would not account for changes in exam difficulty or cohort ability. Taking the SRBs at face value would be contrary to the IB's aim of ensuring that grades represent the same quality of attainment between years, hence the need to balance them against expert judgement. However, examiners are inclined to give candidates the benefit of the doubt when recommending grade boundaries, which can lead to grade inflation (Stringer, [<reflink idref="bib33" id="ref42">33</reflink>]). 
So, SRBs and other evidence are needed to focus grading within a statistically-reasonable mark range.</p> <p>Qualitative and quantitative approaches merge at various stages in IB awarding, with SRBs used to identify mark ranges for grading, examiners reviewing statistical data before grading, and subject managers analysing statistical and judgemental evidence to determine final grade boundaries. Furthermore, different awarding models[<reflink idref="bib1" id="ref43">1</reflink>] are used in the IB, which might be categorised as:</p> <p></p> <ulist> <item> <emph>Extended access model</emph>: statistical data are used to identify mark ranges for grading. Statistical evidence is shared with examiners, who grade scripts with contextual data in mind. They form part of the awarding committee (led by subject manager), to discuss grading outcomes and statistical data. Subject manager makes final grade boundary recommendations in light of discussions.</item> <p></p> <item> <emph>Limited access model</emph>: statistical data are used to identify mark ranges for grading, but examiners do not access statistical evidence. They grade scripts individually with no discussion. Subject managers review judgements and statistical data, and make final grade boundary recommendations independently.</item> <p></p> <item> <emph>Independent grading model</emph>: where there are very few candidates, examiners grade all scripts.</item> <p></p> <item> <emph>Verification model</emph>: examiners review samples of scripts at 'fixed boundaries' to verify their suitability.</item> <p></p> <item> <emph>Confirmation model</emph>: examiners review scripts at the mark which is considered the most appropriate grade boundary according to statistical evidence, to confirm its suitability.</item> </ulist> <p>Different models are used for different boundary setting scenarios. 
For example, coursework-only subjects have 'fixed' grade boundaries, where scripts are reviewed only for verification purposes (AlphaPlus, [<reflink idref="bib2" id="ref44">2</reflink>]). In subjects with extremely small cohorts where statistical evidence is weaker, the independent grading model may be followed, as qualitative evidence is prioritised. The IB's default approach is the extended access model. However, in May 2019, the limited model was introduced in several subjects for operational reasons. This study compared the historical approach with the new, limited model of grade award.</p> <hd id="AN0182340775-8">Aims</hd> <p>The broad aims of this study were to explore how different sources of evidence are integrated in IB grade award, and how examiners' grading decisions are impacted by reviewing contextual evidence before making their judgements. The study was guided by the following research questions:</p> <p></p> <ulist> <item> How and to what effect are judgemental and statistical evidence combined during IB grade award?</item> <p></p> <item> To what extent does access to statistical evidence impact examiners' grading decisions?</item> <p></p> <item> Do grade awards when examiners review statistical evidence lead to similar grading outcomes, compared to when they do not?</item> <p></p> <item> What are the perceived benefits and drawbacks of examiners reviewing statistical evidence on exams in IB grade award?</item> </ulist> <hd id="AN0182340775-9">Methodology</hd> <p>In Spring 2022, IB grade award processes were replicated in nine examinations across five subjects: business management, English literature, Spanish literature and Japanese literature from the Diploma Programme (all higher level), and mathematics from the Middle Years Programme. 
English, Spanish and Japanese literature followed the extended access model where examiners reviewed statistical evidence and teacher feedback on exams, whereas mathematics and business management followed the limited access model, where examiners only provided feedback on exams and graded scripts for their respective exams. Each subject followed the opposite model in previous examination sessions, which allowed comparisons to be made between the different approaches. Details of the two awarding models are provided in Table 1.</p> <p>Table 1. Limited and extensive awarding models.</p> <p> <ephtml> &lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;td&gt;Grade award process&lt;/td&gt;&lt;td&gt;Limited model&lt;/td&gt;&lt;td&gt;Extended model&lt;/td&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Examiners submit feedback questionnaire on exam session&lt;/td&gt;&lt;td&gt;x&lt;/td&gt;&lt;td /&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Examiners and subject manager discuss cohort performance and exam difficulty before grading&lt;/td&gt;&lt;td /&gt;&lt;td&gt;x&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Examiners review teacher feedback and statistical evidence (e.g. 
mean marks, mark distributions and item-level data)&lt;/td&gt;&lt;td /&gt;&lt;td&gt;x&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Examiners may be provided with statistically recommended boundaries (SRBs)&lt;/td&gt;&lt;td /&gt;&lt;td&gt;x&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;SRBs used to select mark ranges for grading&lt;/td&gt;&lt;td&gt;x&lt;/td&gt;&lt;td&gt;x&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Examiners grade samples of scripts against grade descriptors&lt;/td&gt;&lt;td&gt;x&lt;/td&gt;&lt;td&gt;x&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Examiners and subject manager discuss all evidence, and agree final grade boundaries&lt;/td&gt;&lt;td /&gt;&lt;td&gt;x&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Subject manager reviews all evidence and submits final grade boundary recommendations to senior management&lt;/td&gt;&lt;td&gt;x&lt;/td&gt;&lt;td&gt;x&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;bold&gt;Subjects&lt;/bold&gt;&lt;/td&gt;&lt;td /&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;English, Spanish and Japanese literature&lt;/td&gt;&lt;td&gt;2019&lt;/td&gt;&lt;td&gt;Trial&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Business management and mathematics&lt;/td&gt;&lt;td&gt;Trial&lt;/td&gt;&lt;td&gt;2019&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt; </ephtml> </p> <p>A mixed methods design was used to compare the two models, with quantitative comparisons carried out on grading outcomes, and focus groups held to gather feedback on participants' experiences.</p> <hd id="AN0182340775-10">Participants</hd> <p>In total, 25 experienced examiners and five subject managers participated in the study. All participants were experienced in IB grade award. To ensure ecological validity, the number of participants was relatively small, as it is typical to have three examiners per exam in grade award (although this can vary depending on cohort size and marking team structure). 
To avoid examiners being unduly influenced by knowledge of previous grade boundaries, wherever possible examiners graded a different exam compared to 2019 (e.g. by swapping time zones[<reflink idref="bib2" id="ref45">2</reflink>]). However, this was not always possible (e.g. mathematics had only one assessment). Grade boundaries are also published in subject reports after each exam session, so examiners were advised not to refer to historical documentation that referenced the original boundaries. A risk remained that some participants may have remembered previous boundaries either from participating in the original award or from published reports, although this was believed to be minimal as five assessment sessions[<reflink idref="bib3" id="ref46">3</reflink>] had taken place since the original exams.</p> <p>A formal standardisation process was not conducted prior to the trial, as all participants were experienced and familiar with the grade descriptors and grade awarding processes for their respective subjects. However, examiners were provided with examination papers and corresponding markschemes to familiarise themselves with the content. Participants were also given access to an online training module that described the background, rationale and process to be followed in the trial.</p> <hd id="AN0182340775-11">Materials</hd> <p>Assessment data and sample examination scripts were taken from the May 2019 examination session. Four of the selected subjects typically have large cohorts (5,000+ candidates per May examination session), and Japanese literature represented a smaller entry subject, with 255 candidates across both levels in May 2019. The five subjects also represented differences in question type and maximum marks. Literature candidates produced one essay per exam, with paper 1 marked out of 20 marks and paper 2 out of 25 marks (IBO, [<reflink idref="bib22" id="ref48">22</reflink>]). 
The business management exams comprised a combination of structured questions and extended responses, with paper 1 marked out of 60 marks and paper 2 out of 70 marks (IBO, [<reflink idref="bib24" id="ref49">24</reflink>]). Mathematics represented the only onscreen assessment, which comprised multiple short items marked out of a total of 100 marks (IBO, [<reflink idref="bib23" id="ref50">23</reflink>]).</p> <hd id="AN0182340775-12">Grade awarding tasks</hd> <p>As in live examination sessions, all awarding processes were carried out online. Assessment data and examination scripts were shared with examiners via the virtual learning environment Moodle and, for the extended model, subject managers held virtual meetings with examiners via Microsoft Teams. In both models, examiners judged the grade-worthiness of scripts within specified mark ranges. As in a live session, marks on scripts were retained and examiners were guided to start grading at the highest mark and work down. Three examiners graded each exam, except mathematics (where one participant dropped out) and Japanese paper 1 (which typically has only two awarders due to the small cohort and examining team). Examiners typically graded five scripts at each mark in the specified mark range.[<reflink idref="bib4" id="ref51">4</reflink>] Mimicking a live examination session, participants only graded scripts for three key 'judgemental boundaries' (from grades 1 to 7, these comprised the 2/3, 3/4 and 6/7 boundaries). Examiners recorded their grading decisions in 'script comment forms' (e.g. at the 6/7 boundary, each script was recorded as 6-, 6, 6+, 7-, 7 or 7+). Grading results were submitted via email to the subject manager, who analysed all available evidence. This included identifying a 'zone of uncertainty' for each grade, which is the range of marks where a grade boundary could sit based on the examiners' collective grading decisions. 
Finally, subject managers compiled reports detailing their final grade boundary recommendations.</p> <hd id="AN0182340775-13">Focus groups</hd> <p>In total, 15 trial participants took part in three focus groups: five in each group. Two were held with examiners (one for business management and mathematics, and one for English, Japanese and Spanish literature) and the other with subject managers. Each focus group was scheduled for 1.5 hours. The focus groups with examiners were held online via Microsoft Teams due to the geographical spread of participants, and the focus group with subject managers took a hybrid format where most participants met face-to-face at the IB global centre in Cardiff, and one joined remotely via Microsoft Teams. Each participant was provided with an information sheet outlining the aims of the study, emphasising that participation was optional, and they had the right to withdraw at any time.</p> <p>Participants were provided with prompts to guide discussion, with a loosely structured format to encourage interaction and allow information to be elicited about their views and opinions, as well as the reasoning behind them (Denscombe, [<reflink idref="bib19" id="ref52">19</reflink>]). Examples of the broad prompts were: <emph>What Grade Award approach is typically used in your subject? Which types of assessment evidence do you find the most useful in grade award? How do different types of evidence influence the way you grade? What are the benefits and drawbacks of having access to different types of evidence in grade award? What kind of evidence base do you value in grade award (and why)?</emph></p> <hd id="AN0182340775-14">Data analysis</hd> <p>Focus groups were recorded and transcripts autogenerated in Microsoft Teams. Transcripts were reviewed by the researcher and corrected based on repeated viewings of the recordings to ensure participants' contributions were captured verbatim. 
Transcripts were analysed using computer-assisted data analysis software NVivo, version 12 (QSR International Pty Ltd, [<reflink idref="bib30" id="ref53">30</reflink>]). Taking a grounded theory approach where 'codes are open to change and refinements as research progresses' (Denscombe, [<reflink idref="bib19" id="ref54">19</reflink>], p. 116), the qualitative analysis was carried out in three stages. First, an open coding process was carried out, where transcript data were organised into purely descriptive codes (categorised by stage of the process, awarding role, type of evidence, etc.). A second round of coding was conducted to refine the data by looking for patterns and relationships between the codes, and prioritise the most significant themes. This included organising the descriptive codes into broader categories (e.g. codes such as 'provides context' and 'shows how questions work' were gathered under 'advantages of statistical evidence'). This stage also included sifting out extraneous categories, such as Covid-19-related challenges or issues relating to different assessment processes (e.g. standardisation of marking) which were outside the scope of the current investigation. Finally, a selective coding process was conducted to reveal the 'core codes', aiming to 'encapsulate the way the categories relate to each other in a single notion' (Denscombe, [<reflink idref="bib19" id="ref55">19</reflink>], pp. 116–7). For example, codes on subject-specific features such as exams' maximum marks, item types or language variants, were combined under the overarching theme: 'needs and requirements vary across subjects'. The identification of main themes was also determined by frequency of references. 
For instance, the theme relating to script evidence being most central included 70 references across eight subcodes (including 'focus on script evidence' and 'prioritization of evidence'), whereas the impact of teacher predicted grades on examiners' grading was only mentioned once.</p> <hd id="AN0182340775-15">Results</hd> <p></p> <hd id="AN0182340775-16">Grading judgements and boundary recommendations</hd> <p>In total, 1,320 grading decisions were made by 25 examiners in the trial and 1,168 decisions by 23 examiners in 2019. The number of grading decisions by awarding model is shown in Table 2.</p> <p>Table 2. Number of grading decisions by awarding model.</p> <p> <ephtml> &lt;table&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;graphic href="core&amp;#95;a&amp;#95;2290640&amp;#95;ilg0001.jpg" content-type="Graph" /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt; </ephtml> </p> <p>To create a binary measure of grade-worthiness, examiners' grading decisions were recoded as 0 or 1 to indicate whether they reached the threshold for each grade (e.g. for the 6/7 boundary, judgements at 6-, 6 or 6+ were coded as 0 and 7-, 7 or 7+ were coded as 1). A small number of scripts (7 in the trial and 9 in 2019) were judged as borderline (e.g. 6/7), and were removed for the purpose of the analysis. The proportions of scripts judged to be grade-worthy were compared to the SRBs in the limited and extended models, as shown in Figure 1.</p> <p>Graph: Figure 1. Proportion of scripts judged grade-worthy versus statistically recommended boundaries (SRBs).</p> <p>As might be expected, the largest differences between models appeared in Japanese due to the very small cohort and limited availability of scripts. Japanese paper 1 seemed particularly distorted, as there were only three scripts at two marks above the SRB across all three boundaries, which were all judged grade-worthy in the limited model, but not in the extended model. 
Assessments with large cohorts and smaller mark ranges, such as English and Spanish, were very consistent across the two models. In all subjects combined, there was a marginal difference of 0.1% in the proportion of scripts judged to be grade-worthy at/above the grade 7 SRB in the two models. At grades 4 and 3 the differences were slightly larger (2.32% and 1.25% respectively), but still relatively small.</p> <p>To compare how judgemental and statistical evidence were combined to produce the final outcomes in each model, the SRBs and judgemental boundaries (defined by the zones of uncertainty) were considered for each subject, as shown in Table 3.</p> <p>Table 3. Judgemental boundaries (zone of uncertainty [ZoU]), statistical boundaries and final grade boundaries.</p> <p> <ephtml> &lt;table&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;graphic href="core&amp;#95;a&amp;#95;2290640&amp;#95;ilg0002.jpg" content-type="Graph" /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt; </ephtml> </p> <p>Interestingly, the SRBs appeared more often within the zones of uncertainty when examiners <emph>did not</emph> review statistical evidence (22 times), compared to when they did (17 times). It is somewhat surprising that examiners' judgements were more aligned with SRBs when they were not provided with statistical data, as the opposite would be assumed if examiners were influenced by contextual evidence.</p> <p>Of the final grade boundaries, 20 out of 27 (74%) were identical regardless of the model followed. Of those that differed, six varied by just one mark, and one differed by three marks. Significantly, two of those that differed (the 2/3 boundaries on the business management papers) balanced out, with paper 1 being one mark lower in the limited model and paper 2 one mark higher, resulting in identical boundaries for the subject overall. Thus, 22 out of the 27 (81%) final boundary recommendations gave identical subject level outcomes. 
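</p>
<p>The arithmetic behind these agreement figures can be reproduced in a few lines. The sketch below is illustrative only: it rebuilds a list of absolute differences from the counts reported above (20 identical boundaries, six differing by one mark, one by three) rather than from the actual boundary values in Table 3, and the helper name pct is hypothetical.</p>

```python
def pct(part, whole):
    """Percentage of `whole` made up by `part`, rounded to the nearest integer."""
    return round(100 * part / whole)


# Absolute limited-versus-extended differences in marks, reconstructed from
# the reported counts only (not the real boundary values).
abs_diffs = [0] * 20 + [1] * 6 + [3]

total = len(abs_diffs)
identical_components = abs_diffs.count(0)

# Taken from the text: the two business management 2/3 differences offset
# each other, so they count as identical at subject level.
offsetting = 2
identical_subject_level = identical_components + offsetting

print(f"{identical_components}/{total} = {pct(identical_components, total)}%")
print(f"{identical_subject_level}/{total} = {pct(identical_subject_level, total)}%")
```

<p>Run as written, this prints 20/27 = 74% and 22/27 = 81%, leaving five boundaries that differ at subject level.</p>
<p>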
Of the remaining five, only mathematics differed by more than one mark, and it should be noted that this was the assessment with most marks available, so more fluctuations in boundaries may be expected.</p> <p>These findings are not conclusive as to which method is preferred, due to other factors impacting live grade awarding that could not be replicated here. For example, in a live exam session, examiners start grade awarding immediately after completing marking, so they already have an idea of how students performed. Outside the trial environment, other factors can influence final grade boundaries, such as subject pairs analyses which may suggest that harsher or more generous awards are appropriate to address issues of inter-subject comparability (IBO, [<reflink idref="bib25" id="ref56">25</reflink>]). Additionally, whilst the same SRBs and mark ranges were used in the trial and in 2019, the same scripts could not be used in both conditions, as the original script IDs could not be retrieved. Such differences between the live and trial conditions restricted the possibility of conducting certain types of analyses meaningfully, such as inferential statistics. Despite these limitations, however, the similarity in boundary recommendations may be taken as a general indication that the different approaches result in broadly comparable outcomes. 
Next, focus group feedback was analysed to gather insight into the realities of combining different forms of evidence in IB grade award.</p> <hd id="AN0182340775-17">Focus groups</hd> <p>The five main themes that emerged from the focus groups were:</p> <p></p> <ulist> <item> Script evidence is considered central to examiners' role in IB grade award.</item> <p></p> <item> Statistical recommendations can cloud examiner judgement during grading.</item> <p></p> <item> The most useful statistical evidence for guiding examiners' grading is item-level data.</item> <p></p> <item> Statistical evidence has many uses for examiners outside of grading.</item> <p></p> <item> One size does not fit all: needs and requirements vary across subjects.</item> </ulist> <hd id="AN0182340775-18">Script evidence is considered central to examiners' role in IB grade award</hd> <p>Examiners overwhelmingly reported that script evidence was central to their role in awarding:</p> <p>Script evidence is just the core of the process, no doubt about it. [participant 14]</p> <p>it's most important the script evidence. [participant 11]</p> <p>Examiners further noted that scripts ranked highest in terms of prioritisation of evidence:</p> <p>evidences are not, let's say equally weighted. I think that there is the main thing is the script evidence ... [The statistics] is just an external factor. [participant 4]</p> <p>there's sort of a ranking ... script evidence is by all means the most important thing ... and the statistics as least. [participant 14]</p> <p>They also observed that examiner and subject manager roles differed in terms of statistical expertise:</p> <p>for the statistic[s], I think I trust in the ... subject manager's judgement ... because I'm not you know, a specialist in that domain. [participant 11]</p> <p>Additionally, it was observed that examiners and subject managers have different perspectives in awarding:</p> <p>the senior examiner in grade award has the total view from inside. 
And the subject manager has perhaps the total view, but from the outside. [participant 14]</p> <p>examiners and ... subject managers are sort of looking at the same problem from a different place, right? That we're ... in the helicopter, and they're in the trenches. [participant 7]</p> <p>Participants generally considered examiners' priority to be judging the quality (grade-worthiness) of the scripts, whereas for subject managers it was having a broader overview so they could integrate all the evidence.</p> <hd id="AN0182340775-19">Statistical recommendations can cloud examiner judgement during grading</hd> <p>Examiners reported that statistical evidence provided useful context, particularly when unusual circumstances affected marking or when scripts were borderline:</p> <p>when the [marking] team changed quite dramatically ... we looked into the statistics and ... it really helped. [participant 15]</p> <p>if I'm toying between whether a grade be let's say, [at] 26 or 27 [marks]. If I had some other information, it might just make me drop one side or the other. [participant 3]</p> <p>However, participants also raised concerns that statistical evidence could cloud examiners' judgement or influence grading decisions:</p> <p>looking at evidence ... sometimes it clouds the judgement of the senior examiner. [participant 4]</p> <p>if I'm given [the statistics] in advance, obviously that influences me ... I shouldn't be influenced. [participant 11]</p> <p>[I] think that it influences their judgements ... whereas if I don't show them the stats, they have a blank slate to come up with their own view. [participant 7]</p> <p>This feedback suggests that the 'purity' of examiner judgement is constrained by statistical recommendations (Black &amp; Bramley, [<reflink idref="bib11" id="ref57">11</reflink>]). 
This relates to concerns that in the extended model, statistical recommendations risk creating confirmation bias or encouraging strategic grading:</p> <p>[When] they had access to all of the information in those meetings ... they would ... come out of that meeting knowing very well which grade they wanted. [participant 6]</p> <p>we're tending to ... look at ... statistically recommended boundaries and where we think the grade should be, and we are playing a bit of a game sometimes to make it fit. [participant 3]</p> <p>Examiners also feared that their judgements would be overruled by statistical evidence:</p> <p>They felt quite actively like antagonistic towards the outcome data ... as if it was more important than what they had to say, and ... I was going to override them with ... the power of the stats. [participant 7]</p> <p>Some examiners may therefore lean towards statistical recommendations, whilst others may be inclined to reject SRBs, particularly if they conflict with their professional judgement. Consequently, different types of bias may be introduced when examiners review statistical evidence before grading.</p> <hd id="AN0182340775-20">The most useful statistical evidence for guiding examiners' grading is item-level data</hd> <p>Item-level data were repeatedly highlighted for their usefulness in awarding:</p> <p>my ideal thought of the grade award ... the script evidence together with the item-level statistics. And nothing else. [participant 4]</p> <p>Significantly, the purposes of item-level data differed from those of other statistical evidence. Where available, item-level data were viewed as essential for gauging how individual questions performed:</p> <p>we found [item-level data] really useful ... [for a question] they think has been difficult for candidates, or ... hardly any candidates have answered it. 
[participant 10]</p> <p>Its main function was to identify 'key discriminators' to help examiners target items that may be more fruitful for grading (Bramley, [<reflink idref="bib13" id="ref58">13</reflink>]). Item-level data were not reported to cloud examiners' judgement in the same way as other statistical evidence, but rather helped them to prioritise and manage their grading task more efficiently.</p> <hd id="AN0182340775-21">Statistical evidence has many uses for examiners outside of grading</hd> <p>Most challenges with statistical data were reported to be <emph>at the time of grading</emph>. However, other benefits were identified:</p> <p>as professional development or to reflect on the paper setting or for the subject report. [participant 6]</p> <p>Some participants also suggested it may be more helpful later in the process:</p> <p>I would like to see [the statistics] at the end but not the beginning because that's going to influence the papers in the team. [participant 11]</p> <p>As well as avoiding potential biases, an additional benefit of examiners reviewing statistical evidence later on might be allowing subject managers more time to dedicate to analyses during the busy awarding period. Conversely, examiners may also be resentful of statistical data that are kept from them initially, only to overrule their judgemental recommendations later on (Cresswell, [<reflink idref="bib17" id="ref59">17</reflink>]). Therefore, thoughtful discussion would be needed to help examiners understand how their judgemental input contributed to final outcomes.</p> <hd id="AN0182340775-22">One size does not fit all: needs and requirements vary across subjects</hd> <p>Whilst consistent themes arose across the focus groups, it was apparent that no 'one size fits all' approach was possible in awarding. Differences such as academic discipline, cohort size and marks available in exams were all reported to have an impact:</p> <p>the difference in our experiences is ... 
because of the subject matter that we're dealing with. [participant 7]</p> <p>[in] a very small subject ... If they've only marked three candidate responses, I don't even know whether we can have a valid discussion [on] the difficulty of the paper. [participant 6]</p> <p>two marks is not... as influential as it would be in... another subject which is out of small marks. [participant 8]</p> <p>There were also observations that team dynamics and the nature of discussions varied depending on the cultural and linguistic representation in awarding teams:</p> <p>there is a lot of respect for hierarchy and that is also cultural. [participant 6]</p> <p>Therefore, needs and requirements vary depending on subject-specific nuances in awarding.</p> <hd id="AN0182340775-23">Discussion and conclusions</hd> <p>Whilst some research exists on individual aspects of IB standard setting, such as the procedure for calculating SRBs (AlphaPlus, [<reflink idref="bib2" id="ref60">2</reflink>]), this is believed to be the first study investigating the overall awarding process and, as such, it represents a contribution to the literature. This study compared awarding processes in an extended model where examiners reviewed contextual evidence including statistical recommendations before grading, with a limited model where examiners graded scripts without access to additional evidence. Preliminary findings suggest that the two approaches lead to broadly comparable outcomes. Surprisingly, grading judgements aligned more often with SRBs when examiners were not provided with additional evidence, suggesting that contextual data does not always encourage examiners to align their grading with statistical recommendations.</p> <p>Findings also raise questions about the purposes of different types of statistical evidence in grade award. 
Participants distinguished between item-level data and other statistical evidence in awarding, with item-level data generally preferred as they help examiners prioritise questions for review. This is consistent with other evidence that item-level data improve efficiency by identifying 'key discriminators' for review during grading (Bramley, [<reflink idref="bib13" id="ref61">13</reflink>]). Statistical evidence such as SRBs is intended to provide context for the exam session and to guide grading decisions. Yet these findings suggest it may cloud examiner judgement instead. In IB awarding, the aim is to balance judgemental and statistical evidence to ensure grades retain the same value over time. If this awarding process is analogous to mixed methods research, it would be expected to 'involve the connection, integration or linking of two <emph>independent</emph> strands of quantitative and qualitative data' [emphasis added] (Opposs &amp; Gorgen, [<reflink idref="bib29" id="ref62">29</reflink>], p. 62). However, qualitative awarding data cease to be independent in nature if statistical recommendations contaminate rather than contextualise examiner grading judgements.</p> <hd id="AN0182340775-24">Limitations and further research</hd> <p>The nature of this study required pragmatic decisions in the trial design to ensure ecological validity, which created some limitations that restrict the generalisability of the findings. Only a small number of subjects were included, and it is possible that statistical and judgemental evidence are combined and valued differently in other disciplinary contexts. The small sample size, and the risk that some participants may have remembered original grade boundaries, created further limitations in the study. Finally, using SRBs to identify mark ranges for grading meant that statistical and judgemental approaches were entangled from the outset, which creates some limitations for the findings. 
The aim, therefore, was not to provide definitive evidence, but rather to encourage further consideration and discussion about the integration of different sources of evidence in awarding.</p> <p>Further research may include exploratory studies with larger samples and different disciplines. Additional possibilities might include inviting examiners to consider statistical recommendations post hoc rather than before grading, to avoid clouding their judgement. Wider mark ranges could also be selected to give examiners more freedom in their judgements, with subject managers combining grading results with other evidence after the fact. Finally, investigations into the impact for very small cohort subjects would be useful, as the small entry size makes statistical evidence less reliable in these scenarios.</p> <hd id="AN0182340775-25">Acknowledgements</hd> <p>The author would like to thank all participants who were involved in the study, as well as Antony Furlong, Dr Matt Glanville and Professor Jo-Anne Baird for their invaluable comments on earlier versions of this manuscript.</p> <p>Initial results were shared as a conference presentation, but are published here in full for the first time.</p> <hd id="AN0182340775-26">Disclosure statement</hd> <p>No potential conflict of interest was reported by the author. The views expressed in this paper are those of the author and are not to be taken as the views of the International Baccalaureate (IB).</p> <hd id="AN0182340775-27">Data availability statement</hd> <p>The data that support the findings of this study may be available from the corresponding author upon reasonable request. The data are not publicly available due to privacy restrictions, as they may contain information that could compromise the privacy of research participants.</p> <hd id="AN0182340775-28">Appendix</hd> <p>Table A1. 
Grade-worthiness of scripts: grading decisions versus statistically recommended boundaries (SRBs).</p> <p> <ephtml> &lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;td&gt;Assessment&lt;/td&gt;&lt;td&gt;Grade&lt;/td&gt;&lt;td&gt;Limited&lt;/td&gt;&lt;td&gt;Extended&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Total # decisions&lt;/td&gt;&lt;td&gt;Proportion grade-worthy below SRB&lt;/td&gt;&lt;td&gt;Proportion grade-worthy at/above SRB&lt;/td&gt;&lt;td&gt;Total # decisions&lt;/td&gt;&lt;td&gt;Proportion grade-worthy below SRB&lt;/td&gt;&lt;td&gt;Proportion grade-worthy at/above SRB&lt;/td&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Mathematics&lt;/td&gt;&lt;td&gt;7&lt;/td&gt;&lt;td&gt;50&lt;/td&gt;&lt;td&gt;26&lt;/td&gt;&lt;td&gt;26&lt;/td&gt;&lt;td&gt;62&lt;/td&gt;&lt;td&gt;20&lt;/td&gt;&lt;td&gt;32&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;4&lt;/td&gt;&lt;td&gt;70&lt;/td&gt;&lt;td&gt;35&lt;/td&gt;&lt;td&gt;12&lt;/td&gt;&lt;td&gt;100&lt;/td&gt;&lt;td&gt;37&lt;/td&gt;&lt;td&gt;11&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;3&lt;/td&gt;&lt;td&gt;50&lt;/td&gt;&lt;td&gt;24&lt;/td&gt;&lt;td&gt;32&lt;/td&gt;&lt;td&gt;66&lt;/td&gt;&lt;td&gt;30&lt;/td&gt;&lt;td&gt;38&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Business management paper 1&lt;/td&gt;&lt;td&gt;7&lt;/td&gt;&lt;td&gt;100&lt;/td&gt;&lt;td&gt;13&lt;/td&gt;&lt;td&gt;43&lt;/td&gt;&lt;td&gt;60&lt;/td&gt;&lt;td&gt;3&lt;/td&gt;&lt;td&gt;38&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;4&lt;/td&gt;&lt;td&gt;85&lt;/td&gt;&lt;td&gt;18&lt;/td&gt;&lt;td&gt;50&lt;/td&gt;&lt;td&gt;66&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;41&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;3&lt;/td&gt;&lt;td&gt;90&lt;/td&gt;&lt;td&gt;22&lt;/td&gt;&lt;td&gt;33&lt;/td&gt;&lt;td&gt;49&lt;/td&gt;&lt;td&gt;24&lt;/td&gt;&lt;td&gt;33&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Business management paper 
2&lt;/td&gt;&lt;td&gt;7&lt;/td&gt;&lt;td&gt;90&lt;/td&gt;&lt;td&gt;11&lt;/td&gt;&lt;td&gt;29&lt;/td&gt;&lt;td&gt;45&lt;/td&gt;&lt;td&gt;9&lt;/td&gt;&lt;td&gt;38&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;4&lt;/td&gt;&lt;td&gt;89&lt;/td&gt;&lt;td&gt;18&lt;/td&gt;&lt;td&gt;31&lt;/td&gt;&lt;td&gt;45&lt;/td&gt;&lt;td&gt;9&lt;/td&gt;&lt;td&gt;29&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;3&lt;/td&gt;&lt;td&gt;78&lt;/td&gt;&lt;td&gt;53&lt;/td&gt;&lt;td&gt;3&lt;/td&gt;&lt;td&gt;42&lt;/td&gt;&lt;td&gt;14&lt;/td&gt;&lt;td&gt;13&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;English paper 1&lt;/td&gt;&lt;td&gt;7&lt;/td&gt;&lt;td&gt;43&lt;/td&gt;&lt;td&gt;5&lt;/td&gt;&lt;td&gt;52&lt;/td&gt;&lt;td&gt;42&lt;/td&gt;&lt;td&gt;14&lt;/td&gt;&lt;td&gt;60&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;4&lt;/td&gt;&lt;td&gt;44&lt;/td&gt;&lt;td&gt;10&lt;/td&gt;&lt;td&gt;51&lt;/td&gt;&lt;td&gt;41&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;52&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;3&lt;/td&gt;&lt;td&gt;43&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;58&lt;/td&gt;&lt;td&gt;40&lt;/td&gt;&lt;td&gt;9&lt;/td&gt;&lt;td&gt;63&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;English paper 2&lt;/td&gt;&lt;td&gt;7&lt;/td&gt;&lt;td&gt;65&lt;/td&gt;&lt;td&gt;36&lt;/td&gt;&lt;td&gt;28&lt;/td&gt;&lt;td&gt;39&lt;/td&gt;&lt;td&gt;34&lt;/td&gt;&lt;td&gt;22&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;4&lt;/td&gt;&lt;td&gt;59&lt;/td&gt;&lt;td&gt;30&lt;/td&gt;&lt;td&gt;59&lt;/td&gt;&lt;td&gt;37&lt;/td&gt;&lt;td&gt;5&lt;/td&gt;&lt;td&gt;56&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;3&lt;/td&gt;&lt;td&gt;60&lt;/td&gt;&lt;td&gt;16&lt;/td&gt;&lt;td&gt;61&lt;/td&gt;&lt;td&gt;38&lt;/td&gt;&lt;td&gt;15&lt;/td&gt;&lt;td&gt;53&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Japanese paper 
1&lt;/td&gt;&lt;td&gt;7&lt;/td&gt;&lt;td&gt;14&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;61&lt;/td&gt;&lt;td&gt;18&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;4&lt;/td&gt;&lt;td&gt;10&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;82&lt;/td&gt;&lt;td&gt;17&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;100&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;3&lt;/td&gt;&lt;td&gt;3&lt;/td&gt;&lt;td&gt;50&lt;/td&gt;&lt;td&gt;50&lt;/td&gt;&lt;td&gt;2&lt;/td&gt;&lt;td&gt;33&lt;/td&gt;&lt;td&gt;67&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Japanese paper 2&lt;/td&gt;&lt;td&gt;7&lt;/td&gt;&lt;td&gt;10&lt;/td&gt;&lt;td&gt;34&lt;/td&gt;&lt;td&gt;27&lt;/td&gt;&lt;td&gt;41&lt;/td&gt;&lt;td&gt;50&lt;/td&gt;&lt;td&gt;50&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;4&lt;/td&gt;&lt;td&gt;7&lt;/td&gt;&lt;td&gt;13&lt;/td&gt;&lt;td&gt;83&lt;/td&gt;&lt;td&gt;24&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;100&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;3&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;100&lt;/td&gt;&lt;td&gt;3&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;100&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Spanish paper 1&lt;/td&gt;&lt;td&gt;7&lt;/td&gt;&lt;td&gt;49&lt;/td&gt;&lt;td&gt;10&lt;/td&gt;&lt;td&gt;37&lt;/td&gt;&lt;td&gt;51&lt;/td&gt;&lt;td&gt;31&lt;/td&gt;&lt;td&gt;39&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;4&lt;/td&gt;&lt;td&gt;45&lt;/td&gt;&lt;td&gt;4&lt;/td&gt;&lt;td&gt;64&lt;/td&gt;&lt;td&gt;45&lt;/td&gt;&lt;td&gt;2&lt;/td&gt;&lt;td&gt;58&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;3&lt;/td&gt;&lt;td&gt;45&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;60&lt;/td&gt;&lt;td&gt;45&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;62&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Spanish paper 
2&lt;/td&gt;&lt;td&gt;7&lt;/td&gt;&lt;td&gt;45&lt;/td&gt;&lt;td&gt;2&lt;/td&gt;&lt;td&gt;62&lt;/td&gt;&lt;td&gt;45&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;67&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;4&lt;/td&gt;&lt;td&gt;45&lt;/td&gt;&lt;td&gt;2&lt;/td&gt;&lt;td&gt;60&lt;/td&gt;&lt;td&gt;45&lt;/td&gt;&lt;td&gt;2&lt;/td&gt;&lt;td&gt;64&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;3&lt;/td&gt;&lt;td&gt;45&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;60&lt;/td&gt;&lt;td&gt;45&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;67&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Overall&lt;/td&gt;&lt;td&gt;7&lt;/td&gt;&lt;td&gt;466&lt;/td&gt;&lt;td&gt;16&lt;/td&gt;&lt;td&gt;39&lt;/td&gt;&lt;td&gt;403&lt;/td&gt;&lt;td&gt;15&lt;/td&gt;&lt;td&gt;39&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;4&lt;/td&gt;&lt;td&gt;454&lt;/td&gt;&lt;td&gt;18&lt;/td&gt;&lt;td&gt;46&lt;/td&gt;&lt;td&gt;420&lt;/td&gt;&lt;td&gt;9&lt;/td&gt;&lt;td&gt;43&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;3&lt;/td&gt;&lt;td&gt;415&lt;/td&gt;&lt;td&gt;16&lt;/td&gt;&lt;td&gt;44&lt;/td&gt;&lt;td&gt;330&lt;/td&gt;&lt;td&gt;15&lt;/td&gt;&lt;td&gt;43&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt; </ephtml> </p> <ref id="AN0182340775-29"> <title> Notes </title> <blist> <bibl id="bib1" idref="ref41" type="bt">1</bibl> <bibtext> For the purposes of this study, alternative approaches implemented to accommodate atypical circumstances resulting from the Covid-19 pandemic are not considered.</bibtext> </blist> <blist> <bibl id="bib2" idref="ref11" type="bt">2</bibl> <bibtext> In large cohort subjects where IB exams are administered in different regions around the world, it is standard practice to produce parallel forms of the same exam, to avoid the risk of time zone cheating.</bibtext> </blist> <blist> <bibl id="bib3" idref="ref23" type="bt">3</bibl> <bibtext> The IB has two exam sessions each year, in May and November.</bibtext> </blist> <blist> <bibl id="bib4" idref="ref17" type="bt">4</bibl> <bibtext> Although this was limited by the 
small number of scripts at some marks, notably in Japanese.</bibtext> </blist> </ref> <ref id="AN0182340775-30"> <title> References </title> <blist> <bibtext> AlphaPlus. (2021). Statistical grade boundary setting approaches: Literature review for the IB. AlphaPlus Ltd.</bibtext> </blist> <blist> <bibtext> AlphaPlus. (2022). Statistical grade boundary setting approaches: Simulation analysis for the IB. AlphaPlus Ltd.</bibtext> </blist> <blist> <bibtext> Baird, J.-A. (2000). Are examination standards all in the head?: Experiments with examiners' judgements of standards in A level examinations. Research in Education, 64 (1), 91 – 100. https://doi.org/10.7227/RIE.64.9</bibtext> </blist> <blist> <bibtext> Baird, J.-A. (2008). Alternative conceptions of comparability. In P. Newton, J. Baird, H. Goldstein, H. Patrick, &amp; P. Tymms (Eds.), Techniques for monitoring the comparability of examination standards (pp. 124 – 165). Qualifications and Curriculum Authority.</bibtext> </blist> <blist> <bibl id="bib5" idref="ref3" type="bt">5</bibl> <bibtext> Baird, J.-A. (2018). The meaning of national examination standards. In J.-A. Baird, T. Isaacs, D. Opposs &amp; L. Gray (Eds.), Examination standards: How measures and meanings differ around the world (pp. 284 – 306). UCL IOE Press.</bibtext> </blist> <blist> <bibl id="bib6" idref="ref25" type="bt">6</bibl> <bibtext> Baird, J.-A., &amp; Dhillon, D. (2005). Qualitative expert judgements on examination standards: Valid, but inexact (Internal Report RPA 05 JB RP 077). Assessment and Qualifications Alliance.</bibtext> </blist> <blist> <bibl id="bib7" idref="ref6" type="bt">7</bibl> <bibtext> Baird, J.-A., &amp; Gray, L. (2016). The meaning of curriculum-related examination standards in Scotland and England: A home–international comparison. Oxford Review of Education, 42 (3), 266 – 284. 
https://doi.org/10.1080/03054985.2016.1184866</bibtext> </blist> <blist> <bibl id="bib8" idref="ref5" type="bt">8</bibl> <bibtext> Baird, J.-A., Isaacs, T., Opposs, D., &amp; Gray, L. (2018). Examination standards: How measures and meanings differ around the world. UCL IOE Press.</bibtext> </blist> <blist> <bibl id="bib9" idref="ref21" type="bt">9</bibl> <bibtext> Baird, J.-A., &amp; Scharaschkin, A. (2002). Is the whole worth more than the sum of the parts? Studies of examiners' grading of individual papers and candidates' whole A-level examination performances. Educational Studies, 28 (2), 143 – 162. https://doi.org/10.1080/03055690220124588</bibtext> </blist> <blist> <bibtext> Benton, T., &amp; Bramley, T. (2015). The use of evidence in setting and maintaining standards in GCSEs and A levels: Discussion paper. Cambridge Assessment.</bibtext> </blist> <blist> <bibtext> Black, B., &amp; Bramley, T. (2008). Investigating a judgemental rank‐ordering method for maintaining standards in UK examinations. Research Papers in Education, 23 (3), 357 – 373. https://doi.org/10.1080/02671520701755440</bibtext> </blist> <blist> <bibtext> Bramley, T. (2009, November). The effect of manipulating features of examinees' scripts on their perceived quality. In Association for Educational Assessment – Europe (AEA-Europe) annual conference, Malta. https://<ulink href="http://www.cambridgeassessment.org.uk/Images/459334-the-effect-of-manipulating-features-of-examinees-scripts-on-their-perceived-quality.pdf">www.cambridgeassessment.org.uk/Images/459334-the-effect-of-manipulating-features-of-examinees-scripts-on-their-perceived-quality.pdf</ulink></bibtext> </blist> <blist> <bibtext> Bramley, T. (2010). 'Key discriminators' and the use of item level data in awarding. Research Matters: A Cambridge Assessment Publication, 9, 32 – 38.</bibtext> </blist> <blist> <bibtext> Cambridge Assessment. (2010). Cambridge assessment exam standards: The big debate report and recommendations. 
https://<ulink href="http://www.cambridgeassessment.org.uk/Images/125765-exam-standards-report-and-recommendations.pdf">www.cambridgeassessment.org.uk/Images/125765-exam-standards-report-and-recommendations.pdf</ulink></bibtext> </blist> <blist> <bibtext> Cizek, G., &amp; Bunch, M. (2007). Standard setting: A guide to establishing and evaluating performance standards on tests (1st ed.). Sage.</bibtext> </blist> <blist> <bibtext> Coe, R. (2010). Understanding comparability of examination standards. Research Papers in Education, 25 (3), 271 – 284. https://doi.org/10.1080/02671522.2010.498143</bibtext> </blist> <blist> <bibtext> Cresswell, M. (1997). Examining judgements: Theory and practice of awarding public examination grades.</bibtext> </blist> <blist> <bibtext> Crisp, V. (2010). Judging the grade: Exploring the judgement processes involved in examination grading decisions. Evaluation &amp; Research in Education, 23 (1), 19 – 35. https://doi.org/10.1080/09500790903572925</bibtext> </blist> <blist> <bibtext> Denscombe, M. (2017). The Good research guide: For small-scale social research projects (6th ed.). Open University Press.</bibtext> </blist> <blist> <bibtext> Good, F., &amp; Cresswell, M. (1988). Grade awarding judgements in differentiated examinations. British Educational Research Journal, 14 (3), 263 – 281. https://doi.org/10.1080/0141192880140304</bibtext> </blist> <blist> <bibtext> Hill, I. (2006). Do International Baccalaureate programs internationalise or globalise? International Education Journal, 7 (1), 98 – 108.</bibtext> </blist> <blist> <bibtext> IBO. (2013). Language A: Literature guide first examinations 2015.</bibtext> </blist> <blist> <bibtext> IBO. (2014). Mathematics guide: For use from September 2014/January 2015.</bibtext> </blist> <blist> <bibtext> IBO. (2017). Business management guide first assessment 2016.</bibtext> </blist> <blist> <bibtext> IBO. (2019). 
Assessment principles and practices – quality assessments in a digital age.</bibtext> </blist> <blist> <bibtext> Isaacs, T., &amp; Gorgen, K. (2018). Culture, context and controversy in setting national examination standards. In J.-A. Baird, T. Isaacs, D. Opposs, &amp; L. Gray (Eds.), Examination standards: How measures and meanings differ around the world (pp. 307 – 330). UCL IOE Press.</bibtext> </blist> <blist> <bibtext> Newton, P. E. (2007). Techniques for monitoring the comparability of examination standards, Version 2.1. Qualifications and Curriculum Authority.</bibtext> </blist> <blist> <bibtext> Newton, P. E. (2022). Demythologising A level exam standards. Research Papers in Education, 37 (6), 875 – 906. https://doi.org/10.1080/02671522.2020.1870543</bibtext> </blist> <blist> <bibtext> Opposs, D., &amp; Gorgen, K. (2018). What is standard setting? In J.-A. Baird, T. Isaacs, D. Opposs, &amp; L. Gray (Eds.), Examination standards: How measures and meanings differ around the world (pp. 54 – 76). UCL IOE Press.</bibtext> </blist> <blist> <bibtext> QSR International Pty Ltd. (2018). NVivo (version 12) [Computer software]. https://<ulink href="http://www.qsrinternational.com/nvivo-qualitative-data-analysis-software/home">www.qsrinternational.com/nvivo-qualitative-data-analysis-software/home</ulink></bibtext> </blist> <blist> <bibtext> Scharaschkin, A., &amp; Baird, J.-A. (2000). The effects of consistency of performance on A level examiners' judgements of standards. British Educational Research Journal, 26 (3), 343 – 357. https://doi.org/10.1080/713651557</bibtext> </blist> <blist> <bibtext> Simpson, L., &amp; Baird, J.-A. (2013). Perceptions of trust in public examinations. Oxford Review of Education, 39 (1), 17 – 35. https://doi.org/10.1080/03054985.2012.760264</bibtext> </blist> <blist> <bibtext> Stringer, N. S. (2012). Setting and maintaining GCSE and GCE grading standards: The case for contextualised cohort-referencing. 
Research Papers in Education, 27 (5), 535 – 554. https://doi.org/10.1080/02671522.2011.580364</bibtext> </blist> <blist> <bibtext> Suto, I., &amp; Novakovic, N. (2011). An exploration of the examination script features that most influence expert judgements in three methods of evaluating script quality. Assessment in Education: Principles, Policy &amp; Practice, 19 (3), 301 – 320. https://doi.org/10.1080/0969594X.2011.592971</bibtext> </blist> </ref> <aug> <p>By Louise Badham</p> <p>Reported by Author</p> <p>Louise Badham is Manager for Assessment Research and Design at the International Baccalaureate. She is currently carrying out doctoral research into comparability issues in international summative assessments at the University of Oxford.</p> </aug>