Assessing the Validity of Can-Do Statements in Retrospective (Then-Now) Self-Assessment

Bibliographic Details
Title: Assessing the Validity of Can-Do Statements in Retrospective (Then-Now) Self-Assessment
Language: English
Authors: Brown, N. Anthony, Dewey, Dan P., Cox, Troy L.
Source: Foreign Language Annals. Sum 2014 47(2):261-285.
Availability: Wiley-Blackwell. 350 Main Street, Malden, MA 02148. Tel: 800-835-6770; Tel: 781-388-8598; Fax: 781-388-8232; e-mail: cs-journals@wiley.com; Web site: http://www.wiley.com/WileyCDA
Peer Reviewed: Y
Page Count: 25
Publication Date: 2014
Document Type: Journal Articles
Reports - Research
Descriptors: Self Evaluation (Individuals), Pretests Posttests, Interviews, Language Proficiency, Oral Language, Language Tests, Correlation, Predictive Validity, Item Analysis, Internship Programs, Foreign Countries, Achievement Gains, Russian, Second Language Learning, Student Attitudes, Test Reliability
Geographic Terms: Russia
DOI: 10.1111/flan.12082
ISSN: 0015-718X
Abstract: In this study, the authors evaluated the strengths and limitations of a self-assessment based on ACTFL Can-Do statements (ACTFL, 2013) as a tool for measuring linguistic gains over an internship abroad in Russia. They assessed its reliability, determined how its items mapped with the ACTFL scale, and measured the degree to which students' self-evaluations matched oral proficiency interview (OPI) test results (i.e., predictive validity). Data revealed a high level of reliability. Furthermore, self-assessment items ascended in the order of difficulty expected (i.e., Superior items were the most difficult, followed by Advanced), but differences between the means for items representing the ACTFL levels were not statistically significant. Finally, while students demonstrated significant gains from pre- to posttests on both the OPI and the self-assessment, correlations between these measures were only moderate.
Abstractor: As Provided
Entry Date: 2014
Accession Number: EJ1031126
Database: ERIC
Text:
  Availability: 1
  Value: <anid>AN0096730569;fla01jun.14;2018Aug09.15:26;v2.2.500</anid> <title id="AN0096730569-1">Assessing the Validity of Can-Do Statements in Retrospective (Then-Now) Self-Assessment. </title> <p>In this study, the authors evaluated the strengths and limitations of a self‐assessment based on ACTFL Can‐Do statements (ACTFL, 2013) as a tool for measuring linguistic gains over an internship abroad in Russia. They assessed its reliability, determined how its items mapped with the ACTFL scale, and measured the degree to which students' self‐evaluations matched oral proficiency interview (OPI) test results (i.e., predictive validity). Data revealed a high level of reliability. Furthermore, self‐assessment items ascended in the order of difficulty expected (i.e., Superior items were the most difficult, followed by Advanced), but differences between the means for items representing the ACTFL levels were not statistically significant. Finally, while students demonstrated significant gains from pre‐ to posttests on both the OPI and the self‐assessment, correlations between these measures were only moderate.</p> <p>learning environment; oral proficiency; Russian as a second language; self‐evaluation; study abroad</p> <p>Several years ago, a former student related an experience he had after completing an intensive 12‐week Russian course in preparation for spending 2 years in Russia. Just prior to boarding his plane, he called his parents to say goodbye and to impress them with his newly acquired foreign language skills. Quite sincerely, he informed them that he still had some work to do in order to become fluent in Russian but that he was almost there. To this student's credit, after returning to the United States 2 years later and completing another 2 years of coursework in Russian, he tested Superior on the ACTFL Oral Proficiency Interview (OPI) and since then has proven remarkably adept at learning other foreign languages. 
Nevertheless, the question arises as to the criteria by which individuals self‐assess their language proficiency at any given time.</p> <p>In the case of the aforementioned student, he assessed himself at the Superior level after only 12 weeks of study, but he still needed another 4 years of studying Russian to reach that level. Certainly, lack of exposure to established proficiency criteria partly explained such a self‐assessment. Research has also suggested that the unskilled have a propensity for overrating their abilities (Ehrlinger, Johnson, Banner, Dunning, & Kruger, [<reflink idref="bib25" id="ref1">25</reflink>] ). Regardless of the factors contributing to this student's linguistic self‐perception when he spoke with his parents, his knowledge and skills proved to be insufficient at best after setting foot in the target language country less than 24 hours later. Alternatively, more accurate self‐assessment can heighten individuals' awareness of skills that they can and cannot carry out effectively and guide their judgment in setting learning goals. Recognizing the need to implement a more effective self‐assessment procedure at their home institution, the researchers developed a survey instrument based on the National Council of State Supervisors of Languages‐ACTFL (NCSSFL‐ACTFL) Can‐Do Statements (ACTFL, [<reflink idref="bib3" id="ref2">3</reflink>] ) that was designed to compare self‐assessed language abilities both prior and subsequent to participating in a 12‐week internship program in Moscow. This article presents pre‐ and post–study abroad data that were gathered using both the Can‐Do self‐assessment instrument and official OPIs, and also addresses the strengths and limitations of the self‐assessment instrument—in particular, the process of linguistic self‐assessment. 
Three research questions guided the collection and analysis of the data as well as the subsequent interpretation of the findings:</p> <p>What is the reliability of the retrospective Can‐Do self‐assessment instrument used in this study?</p> <p>To what extent do the survey items ascend in a hierarchy of difficulty based on the ACTFL speaking proficiency guidelines?</p> <p>What is the predictive validity of the self‐assessment items in determining an OPI score?</p> <hd id="AN0096730569-2">Literature Review</hd> <hd id="AN0096730569-3">Internships, Study Abroad, and Other Experiential Language Learning</hd> <p>Recent literature on the nature and role of experiential language learning informed this research. Experiential language learning is learning that allows a student to acquire and use a foreign language both in and out of class in a variety of formal and informal settings. The number of U.S. students studying abroad has steadily increased over the past decade, reaching more than 340,000 in 2011–2012 (Institute of International Education, [<reflink idref="bib36" id="ref3">36</reflink>] , n.p.). Furthermore, in the decade between 2000 and 2010, the number of students who traveled abroad for a credit‐earning internship program ballooned from 1,700 to 16,400, with another 8,700 who worked abroad on a noncredit basis (Simon, [<reflink idref="bib56" id="ref4">56</reflink>] , n.p.). Perhaps this growth could most accurately be described as one of augmenting existing study abroad programs and thereby providing students “work‐study abroad” (Simon, [<reflink idref="bib56" id="ref5">56</reflink>] , n.p.). In addition to these increases, enrollment in domestic intensive study abroad programs, such as those at Beloit College and Middlebury College, continues to grow. 
In fact, ongoing demand for enrollment in Middlebury Language Schools led to the creation of a West Coast campus to accommodate the dramatic increase in student applications (Duran, [<reflink idref="bib22" id="ref6">22</reflink>] ). Other experiential learning programs, such as on‐campus language‐specific housing options, where speakers of the same foreign language agree to speak only in that language so as to improve their proficiency and depth of cultural knowledge in a domestic setting, have also drawn interest (Bown, Dewey, Martinsen, & Baker, [<reflink idref="bib9" id="ref7">9</reflink>] ; Dewey, Bown, Baker, & Martinsen, [<reflink idref="bib19" id="ref8">19</reflink>] ; Martinsen, Baker, Dewey, Bown, & Johnson, [<reflink idref="bib45" id="ref9">45</reflink>] ). In short, interest in and demand for experiential language learning opportunities both at home and abroad have grown considerably over the past decade as students, faculty, and administrators increasingly view foreign languages as an indispensable component of a global community.</p> <hd id="AN0096730569-4">Self‐Assessment</hd> <p>The literature on self‐assessment has typically highlighted several advantages: (<reflink idref="bib1" id="ref10">1</reflink>) it is cost‐effective and relatively easy to design, administer, and score; (<reflink idref="bib2" id="ref11">2</reflink>) it can promote greater learner awareness and self‐regulation; and (<reflink idref="bib3" id="ref12">3</reflink>) it can motivate students by adding variety to, as well as increased participation in, the assessment process (Dickinson, [<reflink idref="bib21" id="ref13">21</reflink>] ; Oscarson, [<reflink idref="bib49" id="ref14">49</reflink>] ; Ross, [<reflink idref="bib53" id="ref15">53</reflink>] , [<reflink idref="bib54" id="ref16">54</reflink>] ). 
Critics have suggested that self‐assessment is not appropriate because learners are not capable of accurately gauging their own abilities, and some asserted that it can lead to lower standards, rewarding students who overestimate their abilities (Ross, [<reflink idref="bib54" id="ref17">54</reflink>] ). In spite of these criticisms, self‐assessment has been used regularly either as the sole measure of language development or as a complement to other measures in research on study abroad (see Badstübner & Ecke, [<reflink idref="bib5" id="ref18">5</reflink>] ; Carlson, Burn, Ussem, & Yachimovicz, [<reflink idref="bib13" id="ref19">13</reflink>] ; Dewey, [<reflink idref="bib18" id="ref20">18</reflink>] ; Dyson, [<reflink idref="bib24" id="ref21">24</reflink>] ; Magnan & Back, [<reflink idref="bib44" id="ref22">44</reflink>] ; Meara, [<reflink idref="bib46" id="ref23">46</reflink>] ; Opper, Teichler, & Carlson, [<reflink idref="bib48" id="ref24">48</reflink>] ; Teichler & Maiworm, [<reflink idref="bib61" id="ref25">61</reflink>] ), internships abroad (see Feldman & Bolino, [<reflink idref="bib29" id="ref26">29</reflink>] ; Gillespie, Braskamp, & Braskamp, [<reflink idref="bib30" id="ref27">30</reflink>] ; Steinberg, [<reflink idref="bib58" id="ref28">58</reflink>] ; van‘t Klooster, van Wijk, Go, & van Rekom, [<reflink idref="bib62" id="ref29">62</reflink>] ; Waryszak, [<reflink idref="bib63" id="ref30">63</reflink>] ), intensive domestic immersion (see Dewey, [<reflink idref="bib18" id="ref31">18</reflink>] ; Savchenko, [<reflink idref="bib55" id="ref32">55</reflink>] ), and foreign language residences (see Bown et al., [<reflink idref="bib9" id="ref33">9</reflink>] ; Martinsen et al., [<reflink idref="bib45" id="ref34">45</reflink>] ).</p> <p>To determine the validity of self‐assessment as a proxy for more objective measures, researchers in a variety of fields, including math, science, first and second language reading and writing, and medicine, have administered both 
self‐assessments and objective measures and have found that learners are largely able to make good judgments of their own abilities, but the accuracy of these judgments improves as learners reach higher levels of achievement in the domain being assessed (Falchikov & Boud, [<reflink idref="bib28" id="ref35">28</reflink>] ). In fact, it appears that learners are more aware of their areas of difficulty than they are of what is easier for them (Burson, Larrick, & Klayman, [<reflink idref="bib11" id="ref36">11</reflink>] ). Researchers have also found that accuracy can vary depending on one's field, with self‐estimates of abilities in the hard sciences being the most accurate (Falchikov & Boud, [<reflink idref="bib28" id="ref37">28</reflink>] ). Meta‐analyses of such studies typically have indicated that correlations between self‐assessments and more objective measures often range between r =0.60 and r =0.80, although some certainly are considerably lower and others even higher (Falchikov & Boud, [<reflink idref="bib28" id="ref38">28</reflink>] ; Oscarson, [<reflink idref="bib49" id="ref39">49</reflink>] ; Ross, [<reflink idref="bib53" id="ref40">53</reflink>] ). While one might consider correlations in this range too low to merit direct substitution of a self‐assessment for another standardized measure, as Oscarson ([<reflink idref="bib49" id="ref41">49</reflink>] ) pointed out, these correlations are “about the same magnitude as those obtained between different sub‐sections in a major language test battery” (p. 179). In other words, subtests purporting to measure aspects of the same construct are as closely related (i.e., correlated) as self‐assessment results are with more objective measures of language proficiency, suggesting some overlap in terms of the constructs being measured. 
In the 1980s, researchers conducting large‐scale studies of second language learners, including students enrolled in higher education institutions (Clark, [<reflink idref="bib14" id="ref42">14</reflink>] ) and instructors in K–12 settings (e.g., Hilton, Grandy, Kline, & Liskin‐Gasparro, [<reflink idref="bib35" id="ref43">35</reflink>] ) across the United States, deemed these correlations high enough to merit using self‐assessment as a primary tool for gathering information on proficiency.</p> <p>Some factors that can influence the accuracy of learners’ self‐assessments include academic record, peer‐group and parental expectations, career aspirations, lack of training in self‐assessment, cultural background, and self‐management skills (see the following for more extensive reviews of the findings in these areas: Falchikov & Boud, [<reflink idref="bib28" id="ref44">28</reflink>] ; Oscarson, [<reflink idref="bib49" id="ref45">49</reflink>] ; Ross, [<reflink idref="bib53" id="ref46">53</reflink>] ). Awareness of these variables and their possible influence can allow educators to make adjustments to improve both the accuracy and interpretation of self‐assessment data.</p> <p>Researchers have found that self‐assessments prove most useful when they are tied to tasks that learners are likely to encounter and can imagine themselves experiencing. As Oscarson ([<reflink idref="bib49" id="ref47">49</reflink>] ) pointed out, “Self‐assessments are more accurate when based on task content closely tied to students' situations as potential users of the language in question.” Furthermore, “The evidence is that it is easier for learners to assess their ability in relation to concrete descriptions of more narrowly defined linguistic situations” (p. 183). Providing concrete descriptors with specific examples can help learners more accurately evaluate their abilities. 
Recent efforts to implement self‐assessment in second language education have involved using Can‐Do statements connected with tasks that are often the focus of language curricula and that learners ought to expect to encounter in authentic real‐world situations. Learners respond to statements such as, “I can give directions from one location to another within my neighborhood,” using options such as the following: (<reflink idref="bib1" id="ref48">1</reflink>) “I can do this with no difficulty at all,” (<reflink idref="bib2" id="ref49">2</reflink>) “I can do this with little difficulty,” (<reflink idref="bib3" id="ref50">3</reflink>) “I can do this with some difficulty,” or (<reflink idref="bib4" id="ref51">4</reflink>) “I cannot do this at all.” Such Can‐Do statements have been based on the Common European Framework of Reference for Languages/CEFR (Engelhardt & Pfingsthorn, [<reflink idref="bib27" id="ref52">27</reflink>] ), the ACTFL Speaking Proficiency Guidelines (Dewey, Bown, & Eggett, [<reflink idref="bib20" id="ref53">20</reflink>] ), and the Interagency Language Roundtable Language Proficiency Skill Level Descriptions/ILR scale (Stansfield, Gao, & Rivers, [<reflink idref="bib57" id="ref54">57</reflink>] ). Bandura ([<reflink idref="bib6" id="ref55">6</reflink>] ) noted that in order to measure a learner's self‐perceived capability, “items should be phrased in terms of can do” (p. 308). 
In fact, he asserted that valid self‐assessment requires such statements.[<reflink idref="bib1" id="ref56">1</reflink>] Early indications are that these Can‐Do self‐assessments can be used effectively to evaluate and facilitate learners' linguistic progress.</p> <hd id="AN0096730569-5">Then‐Now Self‐Assessment</hd> <p>When it is used to evaluate changes resulting from immersion programs such as internships or study abroad, self‐assessment typically consists of having students evaluate their abilities to perform specific linguistic tasks two times: once at the beginning of their experience and once again at the end (see Dyson, [<reflink idref="bib24" id="ref57">24</reflink>] ; Meara, [<reflink idref="bib46" id="ref58">46</reflink>] ; Teichler & Maiworm, [<reflink idref="bib61" id="ref59">61</reflink>] ). This pretest‐posttest self‐assessment design has been a common feature in educational research for many years, but it has been replaced in recent decades in many studies by a post + retrospective pretest method, in which participants assess their skills only at the end of their study period. In that single end‐of‐program survey, students retrospectively evaluate their abilities prior to their learning experience (this retrospection is often labeled “Then”) and then provide an additional rating of their abilities following instruction (often labeled “Now”). Even though the instrument is administered in one sitting, the results are analyzed as if it were administered in a traditional pretest‐posttest design in which gain is measured through the difference of the means of the Then questions and those of the Now questions. 
In an extensive review of research on the use of post + retrospective (i.e., Then‐Now) surveys, Lam and Bengo ([<reflink idref="bib39" id="ref60">39</reflink>] ) concluded, “More than three decades of research on post + retrospective method has unequivocally supported this approach over the traditional pretest‐posttest approach to measuring change” (p. 78).</p> <p>A major disadvantage of the traditional pretest‐posttest approach is that learners experience a perspective shift between pre‐ and posttesting because their standard of measurement at posttesting tends to be different from pretesting due to increased experience with the tasks being self‐assessed. Learners with little or no experience with a task can often be fairly confident well before they are expected to perform the task, but as the day of reckoning draws closer and they begin to feel the reality approach, their degree of confidence consistently drops (Gilovich, Kerr, & Medvec, [<reflink idref="bib31" id="ref61">31</reflink>] ); and when they actually have experience with attempting the task, their evaluations can change again. The more experience and proficiency learners have, the more accurately they assess (Caputo & Dunning, [<reflink idref="bib12" id="ref62">12</reflink>] ; Ehrlinger et al., [<reflink idref="bib25" id="ref63">25</reflink>] ). Addressing these types of response shift, Lam and Bengo ([<reflink idref="bib39" id="ref64">39</reflink>] ) and Rohs and Langone ([<reflink idref="bib52" id="ref65">52</reflink>] ) argued that a Then‐Now approach to self‐assessment is more accurate than the traditional pre‐ and post‐self‐assessment technique because students completing Then‐Now self‐assessments are “evaluating themselves with the same standard of measurement or level of understanding on both their posttest responses (how they feel now) and how they felt before the program (then)” (Rohs & Langone, [<reflink idref="bib52" id="ref66">52</reflink>] , p. 156). 
Prior to the learning experience, they may overestimate or underestimate their abilities due to a lack of experience, but following the experience, they better understand what is entailed by the tasks on which they are asked to rate themselves.</p> <p>Based on a review of 40 years of research on post + retrospective self‐assessment, Hill and Betz ([<reflink idref="bib34" id="ref67">34</reflink>] ) listed two important benefits of this approach: first, it is a good means of promoting self‐efficacy, and second, it is capable of “describ[ing] change as experienced subjectively by … participants” (p. 514). Hill and Betz continued, “If the aim is to understand how participants feel about program effectiveness and their personal growth or skill acquisition, the retrospective test provides a more direct assessment of these factors” (p. 514).</p> <hd id="AN0096730569-6">Measurement Principles</hd> <p>Statistical analyses of surveys often use a classical test theory paradigm. This practice can be problematic because the resulting data regularly violate the basic parametric statistical assumption that data are interval in nature. For this reason, it is important to address several key issues related to quantitative analysis of results, including the conversion of raw test scores to measures used in statistical analyses, characteristics of interval data, Rasch scaling, measurement invariance, diagnosing rating scales, and analyzing reliability.</p> <hd id="AN0096730569-7">Conversion of Raw Scores to Measures</hd> <p>One criticism of scoring in the human sciences is that the data are presumed to be interval when that presumption has not been tested empirically. 
Stevens ([<reflink idref="bib59" id="ref68">59</reflink>] ), in his seminal work on types of measurement scales, noted that most of the scales used by researchers in the social sciences were actually ordinal, and while using parametric statistics “illegally” can yield results that lead to interesting insights, the means and standard deviations computed on the data are in “error to the extent that the successive intervals are unequal in size” (p. 679). The tendency to assign numbers to objects and then treat the numbers as interval data when running statistical tests still persists (Bond & Fox, [<reflink idref="bib8" id="ref69">8</reflink>] ; Crocker & Algina, [<reflink idref="bib17" id="ref70">17</reflink>] ; Raykov & Marcoulides, [<reflink idref="bib51" id="ref71">51</reflink>] ). One way to ensure that the data truly meet the interval criterion is to convert the raw scores to measures. Raw scores are the observed counts in their original state with no statistical adjustment (Bond & Fox, [<reflink idref="bib8" id="ref72">8</reflink>] ). Measures are derived by assigning numerals to objects based on rules (Stevens, [<reflink idref="bib59" id="ref73">59</reflink>] ). For the measures to be interval, the rules require that the numerals be assigned in a linear manner based on a scale (Crocker & Algina, [<reflink idref="bib17" id="ref74">17</reflink>] ).</p> <hd id="AN0096730569-8">Characteristics of Interval Data</hd> <p>In classical test theory, an instrument is designed to measure the amount of a trait that someone possesses. Each question is designed to demonstrate the extent to which a person possesses that trait; analogously, a series of hurdles on a track represents an opportunity for a person to demonstrate jumping ability. In classical test theory, a person's ability is estimated by a score based on the total number of items answered correctly (Brown, [<reflink idref="bib10" id="ref75">10</reflink>] ). 
So if 10 people went through a course with 10 hurdles, and the first person cleared 8 hurdles while the second only cleared 4, a score of 80% would be assigned for the first person and 40% for the second. An item's difficulty is calculated by dividing the number of examinees who answer the item correctly by the total number of examinees (Bachman, [<reflink idref="bib4" id="ref76">4</reflink>] ). So a hurdle that is 2 feet high is expected to be easier than a hurdle that is 4 feet high. If 9 of the 10 people cleared the 2‐foot hurdle, the ratio would be 9/10 for an item difficulty of 0.9, and if only 1 of the 10 cleared the 4‐foot hurdle, 0.1 would be the item difficulty. Both person ability and item difficulty are dependent on the people who took the survey and the items that were included (Crocker & Algina, [<reflink idref="bib17" id="ref77">17</reflink>] ). What is more, even if the items are intended to measure the same construct at one ability level (e.g., all the hurdles are supposed to be 3 feet tall), there is no guarantee—and it is in fact unlikely—that the resulting item and person measures have the properties of interval data (Bond & Fox, [<reflink idref="bib8" id="ref78">8</reflink>] ). Note that in using this hurdle analogy, it has been possible to reference the height of the hurdles through the standard English measurement of feet. With classical test theory, no such standard test‐external rulers exist to measure latent traits. If jumping were a latent trait, one would be left to describe the behavior using primitive measurements, such as cubits (i.e., the tip of the finger to the elbow).</p> <p>The property of equal‐intervalness requires that space between any two adjacent scores be equidistant (Stevens, [<reflink idref="bib59" id="ref79">59</reflink>] ). Furthermore, the same distance between two points should demonstrate the same increase in ability regardless of where that space falls. 
So if a student completed a hypothetical 25‐question self‐assessment survey with a score of 10 before participating in an internship and a score of 15 afterward, it would be assumed that the student increased in ability by five points. For the data to be interval, that difference of five points would need to reflect the same amount of growth whether it ran from 1 to 6 or from 18 to 23. Not all learning is linear. Some learning is characterized by initial steep gains in ability that flatten out as the person becomes more adept, resulting in a construct that is nonlinear (see the accompanying figure). To be interval data, that ability increase of five should have the same meaning throughout the scale, that is, from 0 to 25. The issue of equal intervals becomes even more pronounced with survey instruments that use Likert‐type scales (e.g., five‐point rating scales ranging from “strongly disagree” to “strongly agree”). Rarely do ordinal categories reflecting affective characteristics progress in an interval manner. While most social scientists acknowledge that the data from their measurement instruments are not truly interval in this sense, they still use parametric statistics (Wright & Linacre, [<reflink idref="bib64" id="ref80">64</reflink>] ). Furthermore, although some may argue that parametric statistics are robust enough to use with ordinal data (Knapp, [<reflink idref="bib37" id="ref81">37</reflink>] ; Norman, [<reflink idref="bib47" id="ref82">47</reflink>] ), the use of Rasch scaling can make the criticism a moot point.</p> <hd id="AN0096730569-9">Rasch Scaling</hd> <p>Instead of interpreting person ability and item‐difficulty estimates from the total points of a test score or survey instrument, Rasch scaling uses the logit scale (Baylor et al., [<reflink idref="bib7" id="ref83">7</reflink>] ). 
Logits (or log odds ratios) are the natural logarithm of odds ratios of success and can be converted to and from probabilities. For example, if someone has a 0.6 probability of answering an item correctly, then that person would have a 0.4 probability of answering it incorrectly. Therefore the odds of answering the item correctly are 0.6 divided by 0.4, or 1.5 to 1. By being transformed to a log odds ratio, the measures are now interval data and have additive properties (Linacre, [<reflink idref="bib41" id="ref84">41</reflink>] ). Those logits can then be transformed back into probabilities. Georg Rasch, the Danish mathematician who developed the measurement model, described the principle as follows:</p> <p>A person having a greater ability than another person should have the greater probability of solving any item of the type in question, and similarly, one item being more difficult than another means that for any person the probability of solving the second item is the greater one. (Rasch, [<reflink idref="bib50" id="ref85">50</reflink>] , p. 117)</p> <p>In Rasch scaling, person ability and item difficulty are measured conjointly so that an examinee with a person ability estimate of a given value will have a 0.50 probability of answering an item with a difficulty parameter of that same value correctly (Linacre, [<reflink idref="bib41" id="ref86">41</reflink>] ). So if an examinee has a person ability estimate logit of 1.00 and a prompt has an item difficulty parameter logit of 1.00, the probability of the examinee responding to that prompt correctly is 50‐50, or odds of 1 to 1. 
In contrast, if an examinee has a person ability estimate of 1.00 and the prompt has an item difficulty parameter of −1.00, a distance of two logits between person ability and item difficulty, then the probability of the examinee responding to that prompt correctly is e^2/(1 + e^2) ≈ 0.88, or odds of about 7.4 to 1 (e^2 ≈ 7.39).</p> <p>Besides yielding interval data, another advantage of Rasch scaling is that the parameter estimates for both persons and items have the quality of measurement invariance (Engelhard, [<reflink idref="bib26" id="ref87">26</reflink>] ). That is, when measuring a unitary construct, person ability estimates are the same regardless of the items that are presented to the examinees, and item difficulty estimates are the same regardless of who responds to them. A helpful metaphor is to envision “the logit scale as a type of ‘ruler’ for latent traits, with the units on the ruler being logits instead of inches” (Baylor et al., [<reflink idref="bib7" id="ref88">7</reflink>] , p. 245). In essence, instead of using primitive measure descriptions to describe the heights of hurdles, the use of “calibrated” hurdles yields an instrument that will provide consistent person ability estimates for whoever uses it. Conversely, if there is a group of individuals with known jumping ability, one can observe their performance on the hurdles to calculate the item difficulty parameters and obtain evidence of whether the intended difficulty levels correspond empirically with actual performance.</p> <p>Finally, Rasch scaling provides more tools for determining the reliability of the measurement instrument. 
A rating scale can be diagnosed by evaluating the following: (<reflink idref="bib1" id="ref89">1</reflink>) category frequencies, (<reflink idref="bib2" id="ref90">2</reflink>) average logit measures, (<reflink idref="bib3" id="ref91">3</reflink>) threshold estimates, (<reflink idref="bib4" id="ref92">4</reflink>) category probability curves, and (<reflink idref="bib5" id="ref93">5</reflink>) fit statistics (Bond & Fox, [<reflink idref="bib8" id="ref94">8</reflink>] ). Because reliability is defined as the ratio of the true variance to the observed variance (Crocker & Algina, [<reflink idref="bib17" id="ref95">17</reflink>] ), Rasch reliability reports the relative reproducibility of results by including the error variance of the model in its calculation. A reliability close to 1.0 indicates that the observed variance of the object being measured (person or item) is nearly equivalent to the true (and immeasurable) variance. Therefore, as person reliability approaches 1, the differences in scores are due more to differences in examinee ability than to measurement error.</p> <hd id="AN0096730569-10">Methods</hd> <hd id="AN0096730569-11">Background</hd> <p>Since 2007, students at Brigham Young University (BYU) have interned in Moscow with a number of prestigious nongovernment organizations, political and economic think tanks, hospitals, law firms, businesses, investment banks, consulting firms, news media organizations, and the like. Prior to going abroad, students complete an OPI administered by ACTFL, or as of winter 2013, a computerized ACTFL OPI (OPIc). Shortly before returning to the United States, they complete a post‐OPIc in order to ascertain the degree to which their language skills improved while participating in the program. 
In order to clarify what constitutes a speaker at the Advanced and Superior levels, participants receive a copy of the ACTFL Oral Proficiency Guidelines followed by a careful explanation (ACTFL, [<reflink idref="bib1" id="ref96">1</reflink>] ). With such criteria in hand and having received their pre‐OPI ratings, students are able to both pinpoint their strengths and weaknesses and target specific areas on which to focus.</p> <p>In addition to pursuing full‐time internships, students attend advanced foreign language courses twice a week (6 hours total) in which they analyze and discuss readings dealing with global issues (e.g., climate change), review grammar topics, and address language‐related questions that arise at their respective internships. Consistent contact with and feedback from a native speaker trained in teaching Russian as a foreign language lends structure to their otherwise informal study of the language on the job.</p> <p>In its infancy, the program spanned only the spring/summer semester in order to stay within the 90‐day visa window; however, as demand grew, program dates likewise expanded to include winter and fall semesters. With such growth also came increased accountability, which prompted the development of a survey instrument to promote student self‐awareness and self‐reflection on their experiences.</p> <hd id="AN0096730569-12">Context</hd> <p>Most students in upper‐division Russian courses at BYU have spent more than 2 years living in a Russian‐speaking country. Upon returning, many opt to continue their study of Russian and matriculate directly into third‐year advanced grammar courses. Unlike the gradual attrition that normally occurs in university foreign language programs over the course of 4 years, enrollments at BYU swell at the third year and stay consistently high through the fourth year. 
However, even though most students in upper‐division courses have lived in a Russian‐speaking country, they lack experience using the target language in a professional capacity.</p> <p>When educators at BYU were considering possible ways to structure a program that would give students professional language opportunities, the question arose of whether to partner with providers on an individual basis or with an in‐country institution that would handle the logistics of placing students in internships, not to mention facilitate housing and visa support. The latter proved far more advantageous than relying on long‐distance contact limited by both time zone difference and access to top‐level management. In addition to streamlining logistical aspects of the program, partnering with an in‐country institution enabled students to enroll in advanced‐level Russian language courses designed specifically for learners of Russian as a second language. Aside from choosing to partner with an institution of higher education in Russia, the decision was made to focus on Moscow rather than outlying cities in order to take advantage of the immense influx of capital into one city that translates into increased work opportunities for students. In 2006, this initiative was formalized through an agreement between Brigham Young University and the Russian Academy of National Economy and Public Administration.</p> <p>For this study, everyone that participated in this program was contacted and asked to complete the Can‐Do statement Then‐Now self‐assessment survey. 
As of winter semester 2014, 68 students had participated in the program, of whom 36 (27 male, 9 female) responded to the entire survey and completed a pre‐ and post‐OPI/OPIc.</p> <hd id="AN0096730569-13">Self‐Assessment Survey Instrument</hd> <p>During summer 2013, the authors developed a survey that asked students to respond to Then‐Now statements regarding their language abilities on a five‐point categorical scale to indicate their degree of confidence. The statements were culled from the major subheadings of the NCSSFL‐ACTFL Can‐Do Statements (ACTFL, [<reflink idref="bib3" id="ref97">3</reflink>] ) that reflect the ACTFL proficiency Interpersonal and Presentational Communication standards (see Table [NaN] for Interpersonal Standards).[<reflink idref="bib2" id="ref98">2</reflink>] The survey framed tasks in the form of Can‐Do statements and required learners to provide both pre‐ and post‐program estimates of their abilities (e.g., “I could exchange detailed information on topics within and beyond my fields of interest.”). As these participants already had extensive language experience, their self‐assessed levels ranged from Intermediate High to Distinguished. A scale including the following five categories accompanied each task:</p> <p>Could not do this even with extensive preparation.</p> <p>Unsure as to whether I could or could not do this.</p> <p>Could do this with extensive preparation.</p> <p>Could do this with minimal preparation.</p> <p>Could do this without any preparation.</p> <p>Description of NCSSFL‐ACTFL Can‐Do Statements on Interpersonal Communication (ACTFL, 3, pp. 
4–5)</p> <p> <ephtml> <table><tr><th align="left">ACTFL Standard</th><th align="left">Description</th></tr><tr><td align="left">Novice Low</td><td align="left">I can communicate on some very familiar topics using single words and phrases that I have practiced and memorized.</td></tr><tr><td align="left">Novice Mid</td><td align="left">I can communicate on very familiar topics using a variety of words and phrases that I have practiced and memorized.</td></tr><tr><td align="left">Novice High</td><td align="left">I can communicate and exchange information about familiar topics using phrases and simple sentences, sometimes supported by memorized language. I can usually handle short social interactions in everyday situations by asking and answering simple questions.</td></tr><tr><td align="left">Intermediate Low</td><td align="left">I can participate in conversations on a number of familiar topics using simple sentences. I can handle short social interactions in everyday situations by asking and answering simple questions.</td></tr><tr><td align="left">Intermediate Mid</td><td align="left">I can participate in conversations on familiar topics using sentences and series of sentences. I can handle short social interactions in everyday situations by asking and answering a variety of questions. I can usually say what I want to say about myself and my everyday life.</td></tr><tr><td align="left">Intermediate High</td><td align="left">I can participate with ease and confidence in conversations on familiar topics. I can usually talk about events and experiences in various time frames. I can usually describe people, places, and things. I can handle social interactions in everyday situations, sometimes even when there is an unexpected complication.</td></tr><tr><td align="left">Advanced Low</td><td align="left">I can participate in conversations about familiar topics that go beyond my everyday life. 
I can talk in an organized way and with some detail about events and experiences in various time frames. I can describe people, places, and things in an organized way and with some detail. I can handle a familiar situation with an unexpected complication.</td></tr><tr><td align="left">Advanced Mid</td><td align="left">I can express myself fully not only on familiar topics but also on some concrete social, academic, and professional topics. I can talk in detail and in an organized way about events and experiences in various time frames. I can confidently handle routine situations with an unexpected complication. I can share my point of view in discussions on some complex issues.</td></tr><tr><td align="left">Advanced High</td><td align="left">I can express myself freely and spontaneously, and for the most part accurately, on concrete topics and on most complex issues. I can usually support my opinion and develop hypotheses on topics of particular interest or personal expertise.</td></tr><tr><td align="left">Superior</td><td align="left">I can communicate with ease, accuracy, and fluency. I can participate fully and effectively in discussions on a variety of topics in formal and informal settings. 
I can discuss at length complex issues by structuring arguments and developing hypotheses.</td></tr><tr><td align="left">Distinguished</td><td align="left">I can communicate reflectively on a wide range of global issues and highly abstract concepts in a culturally sophisticated manner.</td></tr></table> </ephtml> </p> <p>In addition to asking questions about specific linguistic tasks, the survey inquired about changes relating to personal development, e.g., “My internship helped me develop increased self‐confidence”; changes in terms of academic commitment, e.g., “My internship served to enhance my interest in academic study”; changes in terms of intercultural development, e.g., “My internship helped me better understand my own cultural values and biases”; and changes in terms of career development, e.g., “My internship helped me acquire skill sets that influenced my career path” (Dwyer & Peters, [<reflink idref="bib23" id="ref99">23</reflink>] ). Previous participants received an e‐mail invitation to participate in a Qualtrics survey (see <ulink href="http://qualtrics.com/">http://qualtrics.com/</ulink> for information on this survey tool) in regard to their experience on the Moscow internship program.</p> <hd id="AN0096730569-14">Findings</hd> <p>The goals of this study included evaluating the reliability of a Then‐Now self‐assessment instrument, determining the extent to which the Can‐Do statements reflected the hierarchy of the proficiency scale empirically with item‐difficulty parameters, and examining the predictive validity of using the self‐assessment instrument for participants to self‐report their OPI level.</p> <hd id="AN0096730569-15">Reliability of Self‐Assessment Instrument</hd> <p>The first research question examined the reliability of the self‐assessment instrument, which comprised Can‐Do statements (ACTFL, [<reflink idref="bib3" id="ref100">3</reflink>] ) spanning the proficiency range of ACTFL Intermediate High to Distinguished levels. 
To answer this question, researchers applied Rasch measurement to the results of the survey[<reflink idref="bib3" id="ref101">3</reflink>] using the rating scale model for polytomous data. A diagnosis of the functionality of the scale is followed by a reliability analysis of the test scores from the use of the scale.</p> <hd id="AN0096730569-16">Scale Diagnosis</hd> <p>The five‐level scale categories (one through five) functioned well within established guidelines (Linacre, [<reflink idref="bib42" id="ref102">42</reflink>] ). Each category had an absolute frequency of at least 10, though categories one and two had a combined relative frequency of only 7% and thus could be candidates for combination into a single category. The average measures for each category increased monotonically without exception, as did the threshold estimates. The threshold estimates were separated by at least the recommended minimum of 1.4 logits between adjacent categories, thus indicating that each category was distinct. Furthermore, for the scale to be treated as interval data, it was desirable for the thresholds to be regularly spaced (see Figure [NaN] ), a criterion met by the survey data. An examination of the category probability distributions indicated that each category functioned well (see Table [NaN] ) and that none of the outfit mean squares exceeded 2.0. One note of interest: The two most frequent categories were four (43% of responses) and five (29% of responses), indicating that 72% of the time, students self‐assessed that they could accomplish the Can‐Do tasks with minimal to no preparation. This finding suggests that students perceived the tasks to be fairly easy. 
Despite this observation, the category descriptions of the scale functioned within the expected parameters, and there was no need to make adjustments to the categories in order to evaluate the reliability of the instrument.</p> <p>Human‐Rated Holistic Speaking Level Rating Category Statistics</p> <p> <ephtml> <table><tr><th align="left">Category</th><th align="left">Absolute Frequency</th><th align="left">Relative Frequency</th><th align="left">Average Measure</th><th align="left">Outfit</th><th align="left">Threshold</th><th align="left">SE</th></tr><tr><td align="left">1</td><td align="left">17</td><td align="left">1%</td><td align="char" char=".">−2.98</td><td align="left">0.98</td><td align="left" /><td align="left" /></tr><tr><td align="left">2</td><td align="left">149</td><td align="left">6%</td><td align="char" char=".">−0.30</td><td align="left">1.61</td><td align="char" char=".">−4.13</td><td align="left">0.26</td></tr><tr><td align="left">3</td><td align="left">544</td><td align="left">22%</td><td align="char" char=".">0.60</td><td align="left">0.80</td><td align="char" char=".">−1.24</td><td align="left">0.11</td></tr><tr><td align="left">4</td><td align="left">1070</td><td align="left">43%</td><td align="char" char=".">2.80</td><td align="left">0.86</td><td align="char" char=".">1.13</td><td align="left">0.06</td></tr><tr><td align="left">5</td><td align="left">720</td><td align="left">29%</td><td align="char" char=".">4.91</td><td align="left">1.02</td><td align="char" char=".">4.24</td><td align="left">0.07</td></tr></table> </ephtml> </p> <p>Categories:</p> <ulist> <item>1. Could not do this even with extensive preparation.</item> <item>2. Unsure as to whether I could or could not do this.</item> <item>3. Could do this with extensive preparation.</item> <item>4. Could do this with minimal preparation.</item> <item>5. 
Could do this without any preparation.</item> </ulist> <hd id="AN0096730569-17">Reliability Analysis</hd> <p>One advantage of a Rasch measurement analysis is that the facets can be compared on a vertical scale showing the link between persons, items, and the rating scale. This vertical scale helps one visualize separation reliability. Figure [NaN] illustrates the logits in the first column, the “Then” person ability estimate in the second, the “Now” person ability estimate in the third, the item difficulty in the fourth, and the rating scale in the fifth. The person labels are coded as the OPI/OPIc score (six = Intermediate High, seven = Advanced Low, etc.) and randomized initials. Accordingly, in Figure [NaN] , the person label 9‐NEW in the Then column represents a student with a pre‐internship OPI/OPIc of Advanced High and a pre‐internship self‐assessment logit of 1. That same student in the Now column has the label 10‐NEW, thus indicating a post‐internship OPI/OPIc of Superior and a post‐internship self‐assessment logit of just over 3. The item labels are coded as intended OPI level and question number, so the item labeled AM5 is the fifth survey question that reflects an Advanced Mid self‐assessment task. The italicized letters along the vertical axes represent the mean (M), one standard deviation from the mean (S) and two standard deviations from the mean (T) for the Then‐Now and Items columns. The horizontal axis indicates the 50% probability threshold, so a person with a logit of 1 (i.e., 9‐NEW) had a 50% probability of self‐assessing his/her ability to accomplish tasks AM5, AH3, and AH4 in category four (i.e., “can do this with minimal preparation”).</p> <p>From the reliability, one can calculate how many statistically separate groups existed for both persons and items. Figure [NaN] shows that the person ability estimates ranged from −3 to 7 on the scale, with a mean of 2.68. 
The analysis found that the separation reliability among the students was 0.96, with a separation strata index of 5.57, thus suggesting that estimated person ability parameters were indicative of reliable differences in the students' self‐perceptions of their abilities across five distinct levels. The item difficulty estimates ranged from −3 to 2 on the scale, with a mean of 0. The item separation reliability was 0.95, with a separation strata index of 4.52. Such findings indicate that the items were reliably different from each other and represented four distinct difficulty levels.</p> <p>Thus, to summarize the findings in regard to the first research question addressing the reliability of the self‐assessment instrument: The five categorical levels for each of the Can‐Do statements functioned well, although the students perceived the self‐assessment items to be quite easy. The reliability estimates of both persons and items approached the upper limit of 1, indicating that the instrument was internally reliable.</p> <hd id="AN0096730569-18">Hierarchy of Item Difficulty Levels</hd> <p>The second research question asked whether the difficulty of the Can‐Do statements ascended in the hierarchical order that was predicted. For example, it was predicted that the Intermediate tasks would be perceived to be easier than the Advanced, the Advanced easier than the Superior, and so forth. To answer this question, the item logits from the Rasch analysis were grouped by their intended ACTFL levels and an ANOVA was run. An analysis of the descriptive statistics (see Table [NaN] ) shows that the Intermediate items (mean = −0.83, sd = 1.14, 95% CI [−2.02, 0.36]) were perceived to be the easiest, the Advanced (mean = −0.02, sd = 1.00, 95% CI [−0.50, 0.46]) were next, followed by the Superior (mean = 0.38, sd = 0.98, 95% CI [−0.65, 1.41]), and finally the Distinguished items were perceived as the most difficult (mean = 0.60, sd = 0.57, 95% CI [−0.11, 1.30]) (see Figure [NaN] ). 
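</p> <p>A one‐way ANOVA of this kind can be approximately reconstructed from group sizes, means, and standard deviations alone. The sketch below is a rough plausibility check under that assumption, using the rounded descriptive statistics reported for the four intended levels; it is not the authors' computation, and the function name is hypothetical.</p>

```python
from typing import List, Tuple

# (number of items, mean logit, standard deviation) for each intended level,
# taken from the rounded descriptive statistics in the text.
GROUPS: List[Tuple[int, float, float]] = [
    (6, -0.83, 1.14),   # Intermediate High
    (19, -0.02, 1.00),  # Advanced (Low, Mid, and High combined)
    (6, 0.38, 0.98),    # Superior
    (5, 0.60, 0.57),    # Distinguished
]


def anova_from_summary(groups: List[Tuple[int, float, float]]) -> Tuple[float, int, int]:
    """Return (F, df_between, df_within) for a one-way ANOVA from summary statistics."""
    total_n = sum(n for n, _, _ in groups)
    grand_mean = sum(n * mean for n, mean, _ in groups) / total_n
    # Between-groups sum of squares: spread of group means around the grand mean.
    ss_between = sum(n * (mean - grand_mean) ** 2 for n, mean, _ in groups)
    # Within-groups sum of squares: recovered from each group's sample variance.
    ss_within = sum((n - 1) * sd ** 2 for n, _, sd in groups)
    df_between = len(groups) - 1
    df_within = total_n - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


f_stat, df_between, df_within = anova_from_summary(GROUPS)
print(f"F({df_between}, {df_within}) = {f_stat:.2f}")
```

<p>With these rounded inputs the sketch gives an F of about 2.37 on 3 and 32 degrees of freedom, close to the reported value of 2.36; the small difference is what one would expect from rounding in the published means and standard deviations.</p> <p>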
An independent‐measures ANOVA found that the differences were not statistically significant (F = 2.36, df = 3, p = 0.09).</p> <p>Intended OPI Difficulty Levels</p> <p> <ephtml> <table><tr><th align="left">Intended OPI Level</th><th align="left" /><th align="left">N</th><th align="left">Mean</th><th align="left">SD</th><th align="left">SE</th><th align="center">95% Confidence Interval for Mean</th></tr><tr><th align="left">Lower Bound</th><th align="left">Upper Bound</th></tr><tr><td align="left">Intermediate</td><td align="left">High</td><td align="left">6</td><td align="char" char=".">−0.83</td><td align="left">1.14</td><td align="left">0.46</td><td align="char" char=".">−2.02</td><td align="left">0.36</td></tr><tr><td align="left">Advanced</td><td align="left" /><td align="left">19</td><td align="char" char=".">−0.02</td><td align="left">1.00</td><td align="left">0.23</td><td align="char" char=".">−0.50</td><td align="left">0.47</td></tr><tr><td align="left" /><td align="left">Low</td><td align="left">8</td><td align="char" char=".">−0.32</td><td align="left">0.97</td><td align="left">0.34</td><td align="char" char=".">−1.13</td><td align="left">0.49</td></tr><tr><td align="left" /><td align="left">Mid</td><td align="left">6</td><td align="char" char=".">−0.26</td><td align="left">1.17</td><td align="left">0.48</td><td align="char" char=".">−1.48</td><td align="left">0.96</td></tr><tr><td align="left" /><td align="left">High</td><td align="left">5</td><td align="char" char=".">0.76</td><td align="left">0.44</td><td align="left">0.20</td><td align="char" char=".">0.21</td><td align="left">1.30</td></tr><tr><td align="left">Superior</td><td align="left">—</td><td align="left">6</td><td align="char" char=".">0.38</td><td align="left">0.98</td><td align="left">0.40</td><td align="char" char=".">−0.65</td><td align="left">1.42</td></tr><tr><td align="left">Distinguished</td><td align="left">—</td><td align="left">5</td><td align="char" char=".">0.60</td><td 
align="left">0.57</td><td align="left">0.25</td><td align="char" char=".">−0.11</td><td align="left">1.30</td></tr><tr><td align="left">Total</td><td align="left" /><td align="left">36</td><td align="char" char=".">0.00</td><td align="left">1.04</td><td align="left">0.17</td><td align="char" char=".">−0.35</td><td align="left">0.35</td></tr></table> </ephtml> </p> <p>Revisiting Figure [NaN] , one sees that the self‐assessment items were clustered together and that students felt that most of the Can‐Do statements could be accomplished successfully with minimal or no preparation. This could be indicative of a failure of the students to understand the intended difficulty of the tasks or an overinflated sense among students of what they could actually accomplish. Regardless, the data suggest that the survey items ascended in a hierarchy of difficulty levels based on the ACTFL Oral Proficiency Guidelines (ACTFL, [<reflink idref="bib1" id="ref103">1</reflink>] ) but not to a statistically significant degree.</p> <hd id="AN0096730569-19">Predictive Validity of Self‐Assessment Items for OPIs/OPIcs</hd> <p>The third research question addressed the predictive validity of the self‐assessment instrument for OPI scores. Answering this question involved a three‐step analysis: (<reflink idref="bib1" id="ref104">1</reflink>) examining if OPI scores changed from the pretest to the posttest, (<reflink idref="bib2" id="ref105">2</reflink>) determining whether students perceived a difference in their ability by looking at the person ability estimates from the Then questions to the Now questions, and (<reflink idref="bib3" id="ref106">3</reflink>) assessing how well the students' perceived abilities correlated with their OPI scores.</p> <hd id="AN0096730569-20">Pre‐Internship vs. Post‐Internship OPIs</hd> <p>To determine the extent of language gain during the internship, pre‐ and post‐internship OPIs were administered. 
For the pre‐internship OPI, the mean score was 7.42 (sd = 1.01), with the median at Advanced Low and the mode at Advanced Mid (see Figure [NaN] ). For the post‐OPI, the mean was 8.45 (sd = 1.09), with the median at Advanced Mid and the mode at Advanced High. As OPI scores do not have the characteristics of interval‐level data and the data were not normally distributed, the nonparametric Wilcoxon matched‐pairs signed ranks test was conducted with the ratings from the pre‐ and post‐internship OPIs. Results of the analysis indicated a significant gain from pre‐ to post‐internship OPIs (Z = −5.57, p < 0.001), with 41 of the 68 subjects scoring higher on the posttest. There were 12 instances in which subjects had the same rating on the pre‐ and posttest and only two instances in which a student scored lower on the post‐internship OPI than on the pre‐internship OPI. In those two instances, neither subject crossed a major threshold.</p> <hd id="AN0096730569-21">Then vs. Now Self‐Assessments</hd> <p>To determine the extent of perceived language gain during the internship, the Then‐Now instrument was administered at the end of the experience. The mean person ability estimates of the Then statements were compared to the mean person ability estimates of the Now statements (see Figure [NaN] ). As logits are interval data and the data were normally distributed (see Figure [NaN] ), a paired‐samples t test was used. The difference of the means was −1.88 (sd = 1.64, 95% CI [−2.43, −1.33]), resulting in t = −7.00, df = 36, p < 0.001. The students therefore reported a perceived gain in language ability on the self‐assessment instrument.</p> <hd id="AN0096730569-22">Then‐Now Self‐Assessments and OPIs/OPIcs</hd> <p>To determine the extent to which Then vs. Now self‐assessments correlated with OPIs/OPIcs, correlations were run between the Then person ability estimates and the pre‐internship OPIs/OPIcs and between the Now person ability estimates and post‐internship OPIs/OPIcs. 
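</p> <p>Two of the inferential steps in this part of the Findings can be sketched from summary statistics alone: the paired‐samples t statistic for the Then‐Now difference and an approximate confidence interval for a correlation coefficient. The code below is a minimal sketch under stated assumptions. It takes n = 37 (implied by df = 36) and a Pearson correlation of 0.27, and it uses the Fisher z approximation for the interval; the article does not say how its intervals were obtained, so the output need not match the published bounds exactly. The function names are hypothetical.</p>

```python
import math
from statistics import NormalDist
from typing import Tuple


def paired_t_from_summary(mean_diff: float, sd_diff: float, n: int) -> Tuple[float, int]:
    """Return (t, df) for a paired-samples t test from the mean and SD of the differences."""
    standard_error = sd_diff / math.sqrt(n)
    return mean_diff / standard_error, n - 1


def fisher_ci(r: float, n: int, confidence: float = 0.95) -> Tuple[float, float]:
    """Approximate confidence interval for a Pearson correlation via Fisher's z transform."""
    z = math.atanh(r)                      # transform r to the z scale
    standard_error = 1.0 / math.sqrt(n - 3)
    critical = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    # Build the interval on the z scale, then transform back to the r scale.
    return math.tanh(z - critical * standard_error), math.tanh(z + critical * standard_error)


t_stat, df = paired_t_from_summary(-1.88, 1.64, 37)
print(f"t = {t_stat:.2f}, df = {df}")

low, high = fisher_ci(0.27, 37)
print(f"approximate 95% CI for r = 0.27: [{low:.2f}, {high:.2f}]")
```

<p>With the rounded inputs, the t statistic comes out near −6.97, in line with the reported −7.00, and the approximate interval for the correlation spans zero, running from roughly −0.06 to 0.55.</p> <p>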
The correlation between the Then estimates and the pre‐OPI/OPIc ratings was 0.27 (n = 37, 95% CI [−0.05, 0.56]), indicating a small to medium effect size. The correlation between the Now estimates and the post‐OPI/OPIc ratings was 0.21 (n = 37, 95% CI [−0.13, 0.49]), indicating a small to medium effect size. Figure [NaN] illustrates a slight trend in scatterplot form, but the effect size was just that: slight. To see if the gain in the person ability estimates correlated with gain in OPI scores, those variables were correlated, yielding a coefficient of 0.21 (n = 37, 95% CI [−0.13, 0.49]), a small to medium effect size (see Figure [NaN] ). Thus, in answer to the third question, both the OPIs and the Then‐Now self‐assessments showed gains, but the correlation between the two measures was small, perhaps owing to the relatively small sample size.</p> <hd id="AN0096730569-23">Discussion</hd> <p>To determine the strengths and limitations of self‐assessment as a tool for evaluating linguistic gains over an experiential language learning setting (internship in Russia), the reliability of the instrument was evaluated, the item difficulties were mapped with the ACTFL scale, and the degree to which students' self‐evaluations matched OPI test results (i.e., predictive validity) was explored. The data revealed a high level of reliability (i.e., the scale functioned within the expected parameters, and there was no need to make adjustments to the categories in order to evaluate the reliability of the instrument); furthermore, there were reliable differences in students' self‐perceptions of their own abilities and questionnaire items proved to be reliably different from each other. Next, data confirmed that the self‐assessment items ascended in the order of difficulty expected (Superior items were most difficult, followed by Advanced, and so forth), but differences among the means for items representing the ACTFL levels were not statistically significant. 
Finally, while students demonstrated statistically significant gains from pre‐ to posttests on both the OPI and the self‐assessment, correlations between these measures were relatively low.</p> <p>Given the results presented here, the relative merits and limitations of the self‐assessment can be considered. First, it is encouraging that the level of reliability was so high and that the response scale seemed to function well in terms of differentiating individuals and items from each other. As Gronlund ([<reflink idref="bib32" id="ref107">32</reflink>] ) stated, “Unless the results are generalizable over similar samples of tasks, time periods, and raters, we are not likely to have confidence in them” (p. 212). In terms of this self‐assessment, a high level of confidence in the consistency of the results is possible. Furthermore, as illustrated in Figure [NaN] , the thresholds for the scale were regularly spaced, indicating a fairly distinct hierarchy in terms of difficulty for the numeric responses. Such regular spacing suggests that a 1 represents the lowest level of difficulty, followed by a 2, and so forth.</p> <p>Next, because the data fit the Rasch model, the raw scores were transformed successfully into interval values, thus meeting the assumptions needed to conduct parametric analyses appropriate only for interval data, such as the ANOVA used to determine whether there were significant differences between the means for the items from each of the ACTFL levels. These interval values (logits) further permitted more meaningful comparisons between items and individuals, thus allowing increased confidence in measuring how far apart two individuals were in terms of their perceived abilities, how much more difficult one item was than another, etc.</p> <p>It is encouraging that the self‐assessment items tended to align with the ACTFL categories with which they were associated (i.e., they followed the expected hierarchy). 
However, the mean scores for items at each ACTFL level were not significantly different from each other, indicating that this hierarchy may not be as straightforward as one would expect. Figure [NaN] illustrates how some Can‐Do item results were lower or higher in difficulty than expected (e.g., SU3, which had a logit value of lower than −1, indicating it was easier than some Intermediate and Advanced items and below average overall). Figure [NaN] also depicts an overall tendency for students to rate themselves at an excessively high level, in particular when it came to Superior and Distinguished tasks. The figure maps item difficulty with person ability and demonstrates that item difficulty tended to be lower than person ability. In other words, learners perceived items to be easier than they ought to be. Two questions therefore arise: (<reflink idref="bib1" id="ref108">1</reflink>) Are some Can‐Do items truly easier or more difficult than other items aimed at the same ACTFL level or sublevel? and (<reflink idref="bib2" id="ref109">2</reflink>) Are the mismatches described here due to students' failure to accurately comprehend the nature and difficulty of the tasks on which students were being asked to rate themselves?</p> <p>Regarding possible mismatches in difficulty between items at the same level, the necessary alignment between function, text type, content, and accuracy to fulfill the task may not have been communicated clearly to participants. For example, supporting an opinion represents a Superior function, but only when discussing abstract topics with extended discourse. Thus, while an Intermediate speaker could offer a series of sentences with an opinion on where to eat dinner because of price and location, the Superior speaker could speak at length on the effect that globalization and franchised chains have had on locally owned restaurants. 
This failure to account for the necessary alignment in tasks could have made some tasks appear easier than they were intended to be. In addition, the design of the instrument could have contributed to the students' tendency to inflate their self‐assessments. To keep the length of the survey down to 36 statements, the items were based on the subheadings rather than the specific examples in the Can‐Do statement, which may have overgeneralized the nature of the tasks.</p> <p>In addition, instructors may believe that they have a familiarity with the proficiency guidelines, but their cursory understanding may cause them to provide student feedback that is higher than warranted. This feedback could, in turn, encourage students to overinflate their self‐assessments. For example, an instructor who knows “bits and pieces” of information about the proficiency scales may think that success on discrete features is evidence that his or her students are at a higher level than they truly are (ACTFL, personal communication, March 4, 2014). If the instructors share that feedback with the students, the students may in turn self‐assess too high. Another source of misconception concerning the scale comes from conflating performance and proficiency. For example, if a student completes the Superior function of supporting an opinion, yet the response has been practiced, with vocabulary and content that has been explicitly taught, the instructor may mistakenly inform the students that they are at the Superior level when in fact there is insufficient evidence on which to make such a judgment. In order to assist language educators in their efforts to provide students realistic feedback, ACTFL developed the ACTFL 2012 Performance Descriptors for Language Learners, which explicitly differentiate proficiency and performance (ACTFL, [<reflink idref="bib2" id="ref110">2</reflink>] ). 
Indeed, misinformed feedback from instructors can hamper the ability of students to accurately self‐assess.</p> <p>As far as students' possible failure to accurately comprehend the nature and difficulty of the Can‐Do tasks, there is evidence that learners regularly fall into this trap when self‐assessing (Oscarson, [<reflink idref="bib49" id="ref111">49</reflink>] ). Strong‐Krause ([<reflink idref="bib60" id="ref112">60</reflink>] ) suggested that the more specific and closer the self‐assessment is to a real‐world task, the more accurate the self‐assessment will likely be. In her research, she paired self‐assessments of difficulty with scenarios that learners were asked to imagine themselves acting out, including descriptions of who the interlocutor would be, several specific things they would need to say to the interlocutor, and what they were expected to accomplish linguistically. In other research, Strong‐Krause (personal communication, January 29, 2014) also found that learners with limited experience in an area tend to overestimate their abilities on more difficult tasks more than those with no experience at all. In other words, their limited experience with the tasks can produce confidence, even though the ability they develop through that experience may be minimal. It is possible that learners in this study had just enough experience with what they perceived to be higher‐level tasks to develop greater confidence in their abilities to accomplish them.</p> <p>One possible explanation for students rating some items as much easier or harder than expected even though they were actually at the same ACTFL level is that learners rated individual items relative to other items: that is, if participants thought that they could not do some of the items at all and thus considered these items to be beyond their abilities, they rated themselves higher on the ones that they felt they could do. 
Promoting more accurate self‐assessment when learners overestimate their abilities can be accomplished by providing prompt feedback (Lichtenstein & Fischhoff, [<reflink idref="bib40" id="ref113">40</reflink>] ) and getting learners to think of the reasons why their estimates might be wrong (Koriat, Lichtenstein, & Fischhoff, [<reflink idref="bib38" id="ref114">38</reflink>] ). People tend to be more accurate in their self‐evaluations of ability when they experience counterevidence (evidence that suggests it would be difficult to achieve the task), usually as they are asked to attempt or imagine themselves attempting a specific task or as they are asked to provide the specific tools that would be necessary for a task, such as the vocabulary, the structures, and the pragmatic knowledge (Koriat et al., [<reflink idref="bib38" id="ref115">38</reflink>] ).</p> <p>Regardless of the lack of a clear‐cut hierarchy and the apparently inflated estimates, the Can‐Do items provided valuable data that can be informative to experiential language programs. These self‐assessment data can provide an overall picture of the tasks about which individuals or groups of individuals feel most and/or least confident and can suggest some general patterns in terms of task difficulty at the various ACTFL levels. This understanding can inform second language teaching by helping instructors identify which tasks may require additional instruction.</p> <p>Introducing a self‐assessment instrument into a program has the effect of enhancing rather than diminishing the role of the OPI. Indeed, such an approach provides formative checks for learners prior to taking an OPI/c and clear objectives that assist students when instructors articulate language‐learning goals. It is noteworthy that learners showed significant gains on both measures, suggesting that self‐assessment may be useful for tracking group performance in terms of gains over time. 
Furthermore, the self‐assessments provided insights into individuals' confidence in their abilities to perform specific tasks, which the OPI does not provide. The OPI yields a holistic score based on an external evaluation, whereas the self‐assessment is a collection of self‐estimates that can be summed up into one collective score or broken down into discrete items spread across the range of difficulty represented by the ACTFL scale.</p> <p>Stansfield et al. ([<reflink idref="bib57" id="ref116">57</reflink>] ) evaluated the validity of Can‐Do statements that were based on ILR descriptions and were similar to those in this survey. Their correlations were higher than those found in the present study (moderate to high effect sizes), but they used raw scores rather than logits, and the timing of their self‐assessments was different. They administered the Can‐Do statements in order to screen applicants and determine whether to administer the OPI. In their study of predictive validity, they concluded that applicants to National Language Service Corps programs “can make reasonably effective judgments about their own language skills,” given that “self‐assessment scores … exhibited significant positive relationships with OPI scores” (Stansfield et al., [<reflink idref="bib57" id="ref117">57</reflink>] , p. 313). They pointed out that their correlations were well above those found between standardized tests used for university admission and first‐year GPA. They argued that an assessment that predicts performance on the criterion measure to a similar or better degree ought to be acceptable.</p> <p>This study focused strictly on self‐assessment administered at a single time (retrospective self‐assessment following an internship abroad). The literature on self‐assessment suggests that it is a good tool for developing learner autonomy and self‐awareness, but the aim of this study was not to test this claim. 
Rather, the focus was on determining the value of the aforementioned Then‐Now Can‐Do self‐assessment for understanding student performance over the internship abroad experience. Because the correlations between self‐assessments and the OPI were not high enough to use the self‐assessments to predict OPI scores, it seems that additional training involving multiple self‐assessments and reflections designed to make learners more aware of their abilities and to facilitate more accurate self‐assessment is in order.</p> <p>The current study focused strictly on university‐level learners approaching or at the Advanced level or above. There are many other possible audiences for Can‐Do statements, and the growing body of literature in this area suggests great potential. For example, the Can‐Do statements making up the CEFR have been used widely in a range of settings and for a variety of assessment purposes. One of the original aims of the CEFR and the subsequent European Language Portfolio was to facilitate learner reflection and autonomy (Council of Europe, [<reflink idref="bib15" id="ref118">15</reflink>] , [<reflink idref="bib16" id="ref119">16</reflink>] ), and this has consequently led to the use of self‐assessment in portfolios and other measures (Little, [<reflink idref="bib43" id="ref120">43</reflink>] ). 
Similarly, the ACTFL World Readiness Standards for Learning Languages (<ulink href="http://www.actfl.org/publications/all/world‐readiness‐standards‐learning‐languages">http://www.actfl.org/publications/all/world‐readiness‐standards‐learning‐languages</ulink>) and the 21st Century Skill Map (<ulink href="http://www.actfl.org/sites/default/files/pdfs/21stCenturySkillsMap/p21%5fworldlanguagesmap.pdf">http://www.actfl.org/sites/default/files/pdfs/21stCenturySkillsMap/p21%5fworldlanguagesmap.pdf</ulink>) encourage learners to “take responsibility for their own learning.” Even at the Novice level, the 21st Century Skill Map promotes self‐assessment as a means of taking initiative and self‐directing in an effort to improve one's linguistic and cultural competence.</p> <hd id="AN0096730569-24">Conclusion</hd> <p>This study falls in line with previous research, which indicates that self‐assessment can be high in reliability and can provide valuable information on learners' perceived gains over time. The self‐assessment data met the requirements of Rasch analysis, allowing the data to be treated as interval. Using this approach opens up a variety of possibilities when it comes to combining self‐assessment with inferential statistics in future studies. In spite of these insights, many questions remain unanswered. For example: Were learners adequately trained to accurately self‐assess? Why did some items not align well with others at the same ACTFL level? Why were correlations between the self‐assessment and OPI results so low?</p> <p>While this study has provided a few possible answers above, only additional research will allow more definitive conclusions. Perhaps the Can‐Do statements used in this study need to be paired with more specific sample scenarios such as those presented by Strong‐Krause ([<reflink idref="bib60" id="ref121">60</reflink>] ) in order to obtain estimates that more closely resemble proficiency test scores given by raters. 
Furthermore, revising the self‐assessment experience so that students have additional exposure to the proficiency scale, explanations of the linguistic expectations, and examples of linguistic failure could improve their ability to self‐assess. Using more explicit scenarios may also improve the alignment of the intended task difficulty with the empirical findings. In addition, while participants in this study were higher‐level language learners in a study abroad context, it would be valuable to examine how self‐assessment changes with beginning learners who may be studying in the classroom. Future studies ought to employ Can‐Do self‐assessments (perhaps modified versions that contain specific examples of tasks and are more focused) on a regular basis throughout an experience abroad. Adding ongoing assessment would allow researchers to track whether learners become more accurate in their self‐assessments over time.</p> <p>This study of Then‐Now Can‐Do self‐assessment was limited to one setting (internship abroad). This same assessment approach could be used in a variety of other settings, including classroom instruction (e.g., to have students reflect on and evaluate their progress on various tasks over the course of instruction) in addition to internships and other out‐of‐class experiential learning. Doing so will allow language educators to better understand how Then‐Now and other types of Can‐Do self‐assessments (i.e., more regular reflective self‐assessments spaced over time) can be used to promote self‐awareness and reflection and to better understand what learners feel they are learning during their experiences.</p> <p>The primary issues that come to the forefront are learner awareness, experience, and training. Learners typically have limited experience with Advanced or higher tasks prior to experiential learning, such as study abroad and internships. It may therefore be difficult for them to make judgments regarding their own capability to perform these tasks. 
Language educators and program directors typically assume that learners ought to have opportunities to at least observe Advanced or higher language usage while engaged in experiential and classroom learning. If this is the case, such experience ought to inform learners and help them make more accurate judgments of their own abilities. Given that the data showed that students tended to overestimate their own abilities on these higher‐level tasks, even after an internship abroad, several questions arise. Are students actually engaging in higher‐level tasks that would allow them to better understand their own abilities? If not, are they at least observing such tasks? If they are encountering Advanced or higher experiences, are they processing them sufficiently and identifying their own strengths and gaps in terms of their ability to perform these tasks well?</p> <p>One additional area for further research is the role of feedback in self‐assessment. Once a learner self‐assesses, what evidence does he or she receive regarding the accuracy of that assessment? As Lichtenstein and Fischhoff ([<reflink idref="bib40" id="ref122">40</reflink>] ) found, prompt feedback helps promote more accurate self‐assessment. The 21st Century Skill Map suggests that learners use a digital self‐assessment and portfolio to track their progress over time. Instructors could easily view these self‐assessments and provide ongoing feedback to help learners know where they are under‐ or overestimating their own abilities. Furthermore, computer‐based tools could provide instant feedback to learners to indicate whether their performance on objective measures (questions graded by the computer) matches their own self‐assessments of their performance on the same measures. Diagnostic feedback could be used to help learners know where gaps exist and therefore gain a more accurate understanding of their abilities. 
Students could even listen to audio recordings of their own performance after they have self‐assessed and then have another opportunity to assess their own speaking performance. These examples illustrate how providing feedback (i.e., evidence regarding performance) could help learners develop their self‐assessment skills. The effectiveness of various types of feedback should be carefully evaluated before drawing any conclusions.</p> <p>Finally, future research could evaluate the accuracy and impact of self‐assessment over time. For example, when learners begin self‐assessing at the Novice level, do they become more accurate over time? Does regular self‐evaluation promote increased linguistic development over the course of learning? In addition, self‐assessment has been used successfully with children (Hasselgreen, [<reflink idref="bib33" id="ref123">33</reflink>] ). Research to determine differences between adults and children in terms of the accuracy and effects of self‐assessment would be helpful. Are children and adults equally accurate in their assessment of their own abilities? Do they develop similarly in their accuracy as they assess themselves over time? These and other questions remain unanswered in this research, but they certainly merit further consideration.</p> <ref id="AN0096730569-25"> <title>Footnotes</title> <blist> <bibl id="bib1" idref="ref10" type="bt">1</bibl> <bibtext>Sometimes Can‐Do statements are reversed by creating statements such as “I cannot do X.” Other wording, such as “It is difficult for me to do X” may also be used. We are unaware of research comparing the effects of such wording, but it is possible that such slight changes, in particular moves away from Bandura's (2006) suggested “can do,” could change the results and lead to measurement of a different construct. 
</bibtext> </blist> <blist> <bibl id="bib2" idref="ref11" type="bt">2</bibl> <bibtext>Note that the standards progress from survival language skills produced with memorized material (i.e., Novice Level) through the highest skills that require rhetorical skills and cultural knowledge (Distinguished level), which only a small percentage of native speakers ever attain. </bibtext> </blist> <blist> <bibl id="bib3" idref="ref2" type="bt">3</bibl> <bibtext>Statistical analyses were conducted using Winsteps software. </bibtext> </blist> </ref> <ref id="AN0096730569-26"> <title>References</title> <blist> <bibtext>ACTFL. ( 2012a ). ACTFL proficiency guidelines—speaking. Retrieved April 1, 2014, from http://www.actfl/org </bibtext> </blist> <blist> <bibtext>ACTFL. ( 2012b ). NCSSFL‐ACTFL Can‐Do statements: Progress indicators for language learners. Alexandria, VA : Author. Retrieved January 22, 2014, from <ulink href="http://www.actfl.org/sites/default/files/pdfs/Can‐Do%5fStatements.pdf">http://www.actfl.org/sites/default/files/pdfs/Can‐Do%5fStatements.pdf</ulink></bibtext> </blist> <blist> <bibtext>ACTFL. ( 2013 ). NCSSFL‐ACTFL Can‐Do statements. Retrieved March 28, 2013, from <ulink href="http://www.actfl.org/global%5fstatements">http://www.actfl.org/global%5fstatements</ulink></bibtext> </blist> <blist> <bibl id="bib4" idref="ref51" type="bt">4</bibl> <bibtext>Bachman, L. F. ( 2004 ). Statistical analyses for language assessment. Cambridge, UK : Cambridge University Press. Retrieved from Google Scholar. </bibtext> </blist> <blist> <bibl id="bib5" idref="ref18" type="bt">5</bibl> <bibtext>Badstübner, T., & Ecke, P. ( 2009 ). Student expectations, motivations, target language use, and perceived learning progress in a summer study abroad program in Germany. Die Unterrichtspraxis/Teaching German. 42, 41 – 49. </bibtext> </blist> <blist> <bibl id="bib6" idref="ref55" type="bt">6</bibl> <bibtext>Bandura, A. ( 2006 ). Guide for constructing self‐efficacy scales. In F. 
Pajares & T. C. Urdan (Eds.), Self‐efficacy beliefs of adolescents (pp. 307 – 337 ). Charlotte, NC : IAP Information Age Publishing. </bibtext> </blist> <blist> <bibl id="bib7" idref="ref83" type="bt">7</bibl> <bibtext>Baylor, C., Hula, W., Donovan, N. J., Doyle, P. J., Kendall, D., & Yorkston, K. ( 2011 ). An introduction to item response theory and Rasch models for speech‐language pathologists. American Journal of Speech‐Language Pathology/American Speech‐Language‐Hearing Association, 20, 243 – 259. </bibtext> </blist> <blist> <bibl id="bib8" idref="ref69" type="bt">8</bibl> <bibtext>Bond, T. G., & Fox, C. M. ( 2007 ). Applying the Rasch model: Fundamental measurement in the human sciences. Mahwah, NJ : Lawrence Erlbaum. </bibtext> </blist> <blist> <bibl id="bib9" idref="ref7" type="bt">9</bibl> <bibtext>Bown, J., Dewey, D. P., Martinsen, R., & Baker, W. ( 2011 ). Foreign language houses: Identities in transition. Critical Inquiry in Language Studies, 8, 203 – 235. </bibtext> </blist> <blist> <bibl id="bib10" idref="ref75" type="bt">10</bibl> <bibtext>Brown, J. D. ( 1996 ). Testing in language programs. Upper Saddle River, NJ : Prentice Hall Regents. </bibtext> </blist> <blist> <bibl id="bib11" idref="ref36" type="bt">11</bibl> <bibtext>Burson, K. A., Larrick, R. P., & Klayman, J. ( 2006 ). Skilled and unskilled, but still aware of it: How perceptions of difficulty drive miscalibration in relative comparisons. Journal of Personality and Social Psychology, 90, 60 – 77. </bibtext> </blist> <blist> <bibl id="bib12" idref="ref62" type="bt">12</bibl> <bibtext>Caputo, D., & Dunning, D. ( 2005 ). What you don't know: The role played by errors of omission in imperfect self‐assessments. Journal of Experimental Social Psychology, 41, 488 – 505. </bibtext> </blist> <blist> <bibl id="bib13" idref="ref19" type="bt">13</bibl> <bibtext>Carlson, J., Burn, B., Useem, J., & Yachimowicz, D. ( 1990 ). 
Study abroad: The experience of American undergraduates. Westport, CT : Greenwood Press. </bibtext> </blist> <blist> <bibl id="bib14" idref="ref42" type="bt">14</bibl> <bibtext>Clark, J. D. ( 1981 ). Language. In T. D. Barrows (Ed.), College students' knowledge and beliefs: A survey of global understanding (pp. 87 – 100 ). Princeton, NJ : Educational Testing Service. </bibtext> </blist> <blist> <bibl id="bib15" idref="ref118" type="bt">15</bibl> <bibtext>Council of Europe. ( 2001 ). Common European framework of reference for languages: Learning, teaching, assessment. Cambridge, UK : Cambridge University Press. </bibtext> </blist> <blist> <bibl id="bib16" idref="ref119" type="bt">16</bibl> <bibtext>Council of Europe. ( 2004 ). European language portfolio (ELP): Principles and guidelines. Strasbourg : Council of Europe. </bibtext> </blist> <blist> <bibl id="bib17" idref="ref70" type="bt">17</bibl> <bibtext>Crocker, L., & Algina, J. ( 1986 ). Introduction to classical and modern test theory. Fort Worth, TX : Harcourt Brace. </bibtext> </blist> <blist> <bibl id="bib18" idref="ref20" type="bt">18</bibl> <bibtext>Dewey, D. P. ( 2004 ). A comparison of reading development by learners of Japanese in intensive domestic immersion and study abroad contexts. Studies in Second Language Acquisition, 26, 303 – 327. </bibtext> </blist> <blist> <bibl id="bib19" idref="ref8" type="bt">19</bibl> <bibtext>Dewey, D. P., Bown, J., Baker, W., & Martinsen, R. ( 2011 ). Foreign language housing in the U.S.: Results of a nationwide survey. ADFL Bulletin, 41, 70 – 86. </bibtext> </blist> <blist> <bibl id="bib20" idref="ref53" type="bt">20</bibl> <bibtext>Dewey, D. P., Bown, J., & Eggett, D. ( 2012 ). Japanese language proficiency, social networking, and language use during study abroad: Learners' perspectives. Canadian Modern Language Review, 68, 111 – 137. </bibtext> </blist> <blist> <bibl id="bib21" idref="ref13" type="bt">21</bibl> <bibtext>Dickinson, L. ( 1987 ). 
Self‐instruction in language learning. Cambridge, UK : Cambridge University Press. </bibtext> </blist> <blist> <bibl id="bib22" idref="ref6" type="bt">22</bibl> <bibtext>Duran, D., ( 2008, September 25). Middlebury expands language ties to California. The Middlebury Campus (Electronic version). Retrieved January 22, 2014, from <ulink href="http://middleburycampus.com/article/middlebury‐expands‐language‐ties‐to‐california/">http://middleburycampus.com/article/middlebury‐expands‐language‐ties‐to‐california/</ulink></bibtext> </blist> <blist> <bibl id="bib23" idref="ref99" type="bt">23</bibl> <bibtext>Dwyer, M. M., & Peters, C. K. ( 2004 ). The benefits of study abroad. Transitions Abroad Magazine, 27 (Electronic version). Retrieved January 22, 2014, from <ulink href="http://www.transitionsabroad.com/publications/magazine/0403/benefits%5fstudy%5fabroad.shtml">http://www.transitionsabroad.com/publications/magazine/0403/benefits%5fstudy%5fabroad.shtml</ulink></bibtext> </blist> <blist> <bibl id="bib24" idref="ref21" type="bt">24</bibl> <bibtext>Dyson, P. ( 1988 ). The year abroad: Report for the Central Bureau for Educational Visits and Exchanges. Oxford : Oxford University Language Teaching Centre. </bibtext> </blist> <blist> <bibl id="bib25" idref="ref1" type="bt">25</bibl> <bibtext>Ehrlinger, J., Johnson, K., Banner, M., Dunning, D., & Kruger, J. ( 2008 ). Why the unskilled are unaware: Further explorations of (absent) self‐insight among the incompetent. Organizational Behavior and Human Decision Processes, 105, 98 – 121. </bibtext> </blist> <blist> <bibl id="bib26" idref="ref87" type="bt">26</bibl> <bibtext>Engelhard, G. Jr. ( 2008 ). Historical perspectives on invariant measurement: Guttman, Rasch, and Mokken. Measurement, 6, 155 – 189. </bibtext> </blist> <blist> <bibl id="bib27" idref="ref52" type="bt">27</bibl> <bibtext>Engelhardt, M., & Pfingsthorn, J. ( 2013 ). Self‐assessment and placement tests—A worthwhile combination ? 
Language Learning in Higher Education, 2, 75 – 89. </bibtext> </blist> <blist> <bibl id="bib28" idref="ref35" type="bt">28</bibl> <bibtext>Falchikov, N., & Boud, D. J. ( 1989 ). Student self‐assessment in higher education: A meta‐analysis. Review of Educational Research, 59, 395 – 430. </bibtext> </blist> <blist> <bibl id="bib29" idref="ref26" type="bt">29</bibl> <bibtext>Feldman, D. C., & Bolino, M. C. ( 2000 ). Skill utilization of overseas interns: Antecedents and consequences. Journal of International Management, 6, 29 – 47. </bibtext> </blist> <blist> <bibl id="bib30" idref="ref27" type="bt">30</bibl> <bibtext>Gillespie, J., Braskamp, L. A., & Braskamp, D. C. ( 1999 ). Evaluation and study abroad: Developing assessment criteria and practices to promote excellence. Frontiers: The Interdisciplinary Journal of Study Abroad, 5, 101 – 127. </bibtext> </blist> <blist> <bibl id="bib31" idref="ref61" type="bt">31</bibl> <bibtext>Gilovich, T., Kerr, M., & Medvec, V. H. ( 1993 ). Effect of temporal perspective on subjective confidence. Journal of Personality and Social Psychology, 64, 552. </bibtext> </blist> <blist> <bibl id="bib32" idref="ref107" type="bt">32</bibl> <bibtext>Gronlund, N. E. ( 2003 ). Assessment of student achievement. Needham Heights, MA : Allyn & Bacon. </bibtext> </blist> <blist> <bibl id="bib33" idref="ref123" type="bt">33</bibl> <bibtext>Hasselgreen, A. ( 2005 ). Assessing the language of young learners. Language Testing, 22, 337 – 354. </bibtext> </blist> <blist> <bibl id="bib34" idref="ref67" type="bt">34</bibl> <bibtext>Hill, L. G., & Betz, D. L. ( 2005 ). Revisiting the retrospective pretest. American Journal of Evaluation, 26, 501 – 517. </bibtext> </blist> <blist> <bibl id="bib35" idref="ref43" type="bt">35</bibl> <bibtext>Hilton, T., Grandy, J., Kline, R., & Liskin‐Gasparro, J. ( 1985 ). The oral language proficiency of teachers in the United States in the 1980's: An empirical study. Princeton, NJ : Educational Testing Service. 
</bibtext> </blist> <blist> <bibl id="bib36" idref="ref3" type="bt">36</bibl> <bibtext>Institute of International Education. ( 2013 ). Report on international educational exchange online (Open Doors report). Retrieved November 11, 2013, from <ulink href="http://www.iie.org/Research‐and‐Publications/Open‐Doors">http://www.iie.org/Research‐and‐Publications/Open‐Doors</ulink></bibtext> </blist> <blist> <bibl id="bib37" idref="ref81" type="bt">37</bibl> <bibtext>Knapp, T. R. ( 1990 ). Treating ordinal scales as interval scales: An attempt to resolve the controversy. Nursing Research, 39, 121 – 123. </bibtext> </blist> <blist> <bibl id="bib38" idref="ref114" type="bt">38</bibl> <bibtext>Koriat, A., Lichtenstein, S., & Fischhoff, B. ( 1980 ). Reasons for confidence. Journal of Experimental Psychology: Human Learning and Memory, 6, 107 – 118. </bibtext> </blist> <blist> <bibl id="bib39" idref="ref60" type="bt">39</bibl> <bibtext>Lam, T., & Bengo, P. ( 2003 ). A comparison of three retrospective self‐reporting methods of measuring change in instructional practice. American Journal of Evaluation, 24, 65 – 80. </bibtext> </blist> <blist> <bibl id="bib40" idref="ref113" type="bt">40</bibl> <bibtext>Lichtenstein, S., & Fischhoff, B. ( 1980 ). Training for calibration. Organizational Behavior and Human Performance, 26, 149 – 171. </bibtext> </blist> <blist> <bibl id="bib41" idref="ref84" type="bt">41</bibl> <bibtext>Linacre, J. M. ( 1991 ). Log‐odds in Sherwood Forest. Rasch Measurement Transactions, 5, 162‐163 (Electronic version). Retrieved March 31, 2014, from <ulink href="http://www.rasch.org/rmt/rmt53d.htm">http://www.rasch.org/rmt/rmt53d.htm</ulink></bibtext> </blist> <blist> <bibl id="bib42" idref="ref102" type="bt">42</bibl> <bibtext>Linacre, J. ( 2002 ). Optimizing rating scale category effectiveness. Journal of Applied Measurement, 3, 85 – 106. </bibtext> </blist> <blist> <bibl id="bib43" idref="ref120" type="bt">43</bibl> <bibtext>Little, D. ( 2005 ). 
The Common European Framework and the European Language Portfolio: Involving learners and their judgments in the assessment process. Language Testing, 22, 321 – 336. </bibtext> </blist> <blist> <bibl id="bib44" idref="ref22" type="bt">44</bibl> <bibtext>Magnan, S. S., & Back, M. ( 2007 ). Social interaction and linguistic gain during study abroad. Foreign Language Annals, 40, 43 – 61. </bibtext> </blist> <blist> <bibl id="bib45" idref="ref9" type="bt">45</bibl> <bibtext>Martinsen, R., Baker, W., Dewey, D. P., Bown, J., & Johnson, C. ( 2010 ). Exploring diverse settings for language acquisition and use: Comparing study abroad, service learning abroad, and foreign language housing. Applied Language Learning, 20, 45 – 66. </bibtext> </blist> <blist> <bibl id="bib46" idref="ref23" type="bt">46</bibl> <bibtext>Meara, P. ( 1994 ). The year abroad and its effects. Language Learning Journal, 10, 32 – 38. </bibtext> </blist> <blist> <bibl id="bib47" idref="ref82" type="bt">47</bibl> <bibtext>Norman, G. ( 2010 ). Likert scales, levels of measurement and the “laws” of statistics. Advances in Health Sciences Education: Theory and Practice, 15, 625 – 632. doi: 10.1007/s10459‐010‐9222‐y </bibtext> </blist> <blist> <bibl id="bib48" idref="ref24" type="bt">48</bibl> <bibtext>Opper, S., Teichler, U., & Carlson, J. (Eds.). ( 1990 ). Impact of study abroad programmes on students and graduates. London : Jessica Kingsley. </bibtext> </blist> <blist> <bibl id="bib49" idref="ref14" type="bt">49</bibl> <bibtext>Oscarson, M. ( 1997 ). Self‐assessment of foreign and second language proficiency. In C. Clapham & D. Corson (Eds.), Encyclopedia of language and education, vol. 7: Language testing and assessment (pp. 175 – 187 ). Amsterdam : Kluwer Academic Publishers. </bibtext> </blist> <blist> <bibl id="bib50" idref="ref85" type="bt">50</bibl> <bibtext>Rasch, G. ( 1960 ). Probabilistic models for some intelligence and attainment tests. Copenhagen : Danmarks Paedagogiske Institut. 
</bibtext> </blist> <blist> <bibl id="bib51" idref="ref71" type="bt">51</bibl> <bibtext>Raykov, T., & Marcoulides, G. A. ( 2010 ). Introduction to psychometric theory. New York : Routledge. </bibtext> </blist> <blist> <bibl id="bib52" idref="ref65" type="bt">52</bibl> <bibtext>Rohs, F. R., & Langone, C. A. ( 1997 ). Increased accuracy in measuring leadership impacts. Journal of Leadership Studies, 4, 150 – 158. </bibtext> </blist> <blist> <bibl id="bib53" idref="ref15" type="bt">53</bibl> <bibtext>Ross, S. ( 1998 ). Self‐assessment in second language testing: A meta‐analysis and analysis of experiential factors. Language Testing, 15, 1 – 20. </bibtext> </blist> <blist> <bibl id="bib54" idref="ref16" type="bt">54</bibl> <bibtext>Ross, S. ( 2006 ). The reliability, validity, and utility of self‐assessment. Practical Assessment, Research, and Evaluation, 11, 1 – 13. </bibtext> </blist> <blist> <bibl id="bib55" idref="ref32" type="bt">55</bibl> <bibtext>Savchenko, U. ( 2011 ). Vulnerable L2 semantics: The case of Russian dative subjects. Toronto Working Papers in Linguistics, 33, (Electronic version). Retrieved March 31, 2014, from https://jps.library.utoronto.ca/index.php/twpl/article/view/6895/12259 </bibtext> </blist> <blist> <bibl id="bib56" idref="ref4" type="bt">56</bibl> <bibtext>Simon, C. C., ( 2013, 30 January). The world is their workplace. The New York Times. Retrieved March 28, 2013, from <ulink href="http://www.nytimes.com/2013/02/03/education/edlife/the‐world‐is‐their‐workplace.html?%5fr=0">http://www.nytimes.com/2013/02/03/education/edlife/the‐world‐is‐their‐workplace.html?%5fr=0</ulink></bibtext> </blist> <blist> <bibl id="bib57" idref="ref54" type="bt">57</bibl> <bibtext>Stansfield, C. W., Gao, J., & Rivers, W. P. ( 2010 ). A concurrent validity study of self‐assessment and the federal interagency roundtable Oral Proficiency Interview. Russian Language Journal, 60, 299 – 315. 
</bibtext> </blist> <blist> <bibl id="bib58" idref="ref28" type="bt">58</bibl> <bibtext>Steinberg, M. ( 2002 ). Involve me and I will understand: Academic quality in experiential programs abroad. Frontiers: The Interdisciplinary Journal of Study Abroad, 8, 207 – 227. </bibtext> </blist> <blist> <bibl id="bib59" idref="ref68" type="bt">59</bibl> <bibtext>Stevens, S. S. ( 1946 ). On the theory of scales of measurement. Science, 677 – 680. </bibtext> </blist> <blist> <bibl id="bib60" idref="ref112" type="bt">60</bibl> <bibtext>Strong‐Krause, D. ( 2000 ). Exploring the effectiveness of self‐assessment strategies in ESL placement. In G. Ekbatani & H. Pierson (Eds.), Learner‐directed assessment in ESL (pp. 49 – 73 ). London : Lawrence Erlbaum Associates. </bibtext> </blist> <blist> <bibl id="bib61" idref="ref25" type="bt">61</bibl> <bibtext>Teichler, U., & Maiworm, F. ( 1997 ). The ERASMUS experience: Major findings of the ERASMUS evaluation research. Luxembourg : Office for Official Publications of the European Communities. </bibtext> </blist> <blist> <bibl id="bib62" idref="ref29" type="bt">62</bibl> <bibtext>van ’t Klooster, E., van Wijk, J., Go, F., & van Rekom, J. ( 2008 ). Educational travel: The overseas internship. Annals of Tourism Research, 35, 690 – 711. </bibtext> </blist> <blist> <bibl id="bib63" idref="ref30" type="bt">63</bibl> <bibtext>Waryszak, R. Z. ( 2000 ). Before, during and after: International perspective of students’ perceptions of their cooperative education placements in the tourism industry. Journal of Cooperative Education, 35, 84 – 94. </bibtext> </blist> <blist> <bibl id="bib64" idref="ref80" type="bt">64</bibl> <bibtext>Wright, B. D., & Linacre, J. M. ( 1989 ). Observations are always ordinal; measurements, however, must be interval. Archives of Physical Medicine and Rehabilitation, 70, 857. </bibtext> </blist> </ref> <p>Graph: Example of Linear vs. 
Nonlinear Growth</p> <p>Graph: Human‐Rated Holistic Speaking Level Rating Category Distribution</p> <p>Graph: Self‐Assessment Vertical Scale</p> <p>Graph: Items Grouped by Intended Difficulty Level</p> <p>Graph: Pre‐ and Post‐Internship OPI Results</p> <p>Graph: Then vs. Now Person Ability Estimates</p> <p>Graph: Scatterchart of Then Statements With Pre‐Internship OPIs and Now Statements of Post‐Internship OPIs</p> <p>Graph: Scatterchart of Then‐Now Gains With Pre‐ and Post‐Internship OPI Gains</p> <aug> <p>By N. Anthony Brown, Dan P. Dewey, and Troy L. Cox</p> <p>N. Anthony Brown (PhD, Bryn Mawr) is Associate Professor of Germanic and Slavic Languages and Director of Foreign Language Housing, Brigham Young University, Provo, Utah.</p> <p>Dan P. Dewey (PhD, Carnegie Mellon University) is Associate Professor of Linguistics and English Language and Associate Department Chair, Brigham Young University, Provo, Utah.</p> <p>Troy L. Cox (PhD, Brigham Young University) is Assistant Professor of Linguistics and English Language and Associate Director of the Center for Language Studies, Brigham Young University, Provo, Utah.</p> </aug>
Header DbId: eric
DbLabel: ERIC
An: EJ1031126
AccessLevel: 3
PubType: Academic Journal
PubTypeId: academicJournal
Items – Name: Title
  Label: Title
  Group: Ti
  Data: Assessing the Validity of Can-Do Statements in Retrospective (Then-Now) Self-Assessment
– Name: Language
  Label: Language
  Group: Lang
  Data: English
– Name: Author
  Label: Authors
  Group: Au
  Data: Brown, N. Anthony; Dewey, Dan P.; Cox, Troy L.
– Name: TitleSource
  Label: Source
  Group: Src
  Data: Foreign Language Annals. Sum 2014 47(2):261-285.
– Name: Avail
  Label: Availability
  Group: Avail
  Data: Wiley-Blackwell. 350 Main Street, Malden, MA 02148. Tel: 800-835-6770; Tel: 781-388-8598; Fax: 781-388-8232; e-mail: cs-journals@wiley.com; Web site: http://www.wiley.com/WileyCDA
– Name: PeerReviewed
  Label: Peer Reviewed
  Group: SrcInfo
  Data: Y
– Name: Pages
  Label: Page Count
  Group: Src
  Data: 25
– Name: DatePubCY
  Label: Publication Date
  Group: Date
  Data: 2014
– Name: TypeDocument
  Label: Document Type
  Group: TypDoc
  Data: Journal Articles; Reports - Research
– Name: Subject
  Label: Descriptors
  Group: Su
  Data: Self Evaluation (Individuals), Pretests Posttests, Interviews, Language Proficiency, Oral Language, Language Tests, Correlation, Predictive Validity, Item Analysis, Internship Programs, Foreign Countries, Achievement Gains, Russian, Second Language Learning, Student Attitudes, Test Reliability
– Name: Subject
  Label: Geographic Terms
  Group: Su
  Data: Russia
– Name: DOI
  Label: DOI
  Group: ID
  Data: 10.1111/flan.12082
– Name: ISSN
  Label: ISSN
  Group: ISSN
  Data: 0015-718X
– Name: Abstract
  Label: Abstract
  Group: Ab
  Data: In this study, the authors evaluated the strengths and limitations of a self-assessment based on ACTFL Can-Do statements ("ACTFL," 2013) as a tool for measuring linguistic gains over an internship abroad in Russia. They assessed its reliability, determined how its items mapped with the ACTFL scale, and measured the degree to which students' self-evaluations matched oral proficiency interview (OPI) test results (i.e., predictive validity). Data revealed a high level of reliability. Furthermore, self-assessment items ascended in the order of difficulty expected (i.e., Superior items were the most difficult, followed by Advanced), but differences between the means for items representing the ACTFL levels were not statistically significant. Finally, while students demonstrated significant gains from pre- to posttests on both the OPI and the self-assessment, correlations between these measures were only moderate.
– Name: AbstractInfo
  Label: Abstractor
  Group: Ab
  Data: As Provided
– Name: DateEntry
  Label: Entry Date
  Group: Date
  Data: 2014
– Name: AN
  Label: Accession Number
  Group: ID
  Data: EJ1031126
PLink https://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=eric&AN=EJ1031126
RecordInfo BibRecord:
  BibEntity:
    Identifiers:
      – Type: doi
        Value: 10.1111/flan.12082
    Languages:
      – Text: English
    PhysicalDescription:
      Pagination:
        PageCount: 25
        StartPage: 261
    Subjects:
      – SubjectFull: Self Evaluation (Individuals)
        Type: general
      – SubjectFull: Pretests Posttests
        Type: general
      – SubjectFull: Interviews
        Type: general
      – SubjectFull: Language Proficiency
        Type: general
      – SubjectFull: Oral Language
        Type: general
      – SubjectFull: Language Tests
        Type: general
      – SubjectFull: Correlation
        Type: general
      – SubjectFull: Predictive Validity
        Type: general
      – SubjectFull: Item Analysis
        Type: general
      – SubjectFull: Internship Programs
        Type: general
      – SubjectFull: Foreign Countries
        Type: general
      – SubjectFull: Achievement Gains
        Type: general
      – SubjectFull: Russian
        Type: general
      – SubjectFull: Second Language Learning
        Type: general
      – SubjectFull: Student Attitudes
        Type: general
      – SubjectFull: Test Reliability
        Type: general
      – SubjectFull: Russia
        Type: general
    Titles:
      – TitleFull: Assessing the Validity of Can-Do Statements in Retrospective (Then-Now) Self-Assessment
        Type: main
  BibRelationships:
    HasContributorRelationships:
      – PersonEntity:
          Name:
            NameFull: Brown, N. Anthony
      – PersonEntity:
          Name:
            NameFull: Dewey, Dan P.
      – PersonEntity:
          Name:
            NameFull: Cox, Troy L.
    IsPartOfRelationships:
      – BibEntity:
          Dates:
            – D: 01
              M: 01
              Type: published
              Y: 2014
          Identifiers:
            – Type: issn-print
              Value: 0015-718X
          Numbering:
            – Type: volume
              Value: 47
            – Type: issue
              Value: 2
          Titles:
            – TitleFull: Foreign Language Annals
              Type: main