    Hispania [Publicaciones periódicas]. Volume 73, Number 4, December 1990
    

Development and Analysis of a Flexible Spanish Language Test for Placement and Outcome Assessment

Irene Wherritt


T. Anne Cleary


Cynthia Ann Druva-Roush



University of Iowa

Enrollment in foreign languages at B.A.-granting institutions in the United States has been increasing dramatically in recent decades (see, for example, Brod, 1988). Recent studies and commissions have recommended more language study at all levels of schooling, and universities have offered new opportunities for students to pursue international careers through interdisciplinary studies. Increases in foreign language study have also evolved through strengthening of school and college language requirements (see Brod and Lapointe, 1989).

Large state institutions may be especially vulnerable to a myriad of problems brought about by burgeoning enrollments in foreign languages. Computer registration has decreased informal advising and counseling for placement. At large institutions, such as the University of Iowa, placement tests are administered to a great number of students during a one-hour period. The results are required for placement decisions within hours of the test administration. Developing and administering tests for placement and assessment of large numbers of foreign language students is a major undertaking. In addition, with increased enrollments, the training and testing of foreign language teachers have become an even greater responsibility.

The creation of the ACTFL Oral Proficiency Interview (OPI) has offered a form of oral testing intended for the college setting, yet the OPI has been criticized as being limited in the type of discourse elicited (see, for example, Barnwell, 1989; Valdés, 1989; Bachman and Clark, 1987; and Lantolf and Frawley, 1985). Furthermore, OPI training demands considerable expenditure of money and time. These financial and time constraints make such oral testing feasible only at critical junctures or for borderline cases in placement. Machine-scored tests of listening comprehension and reading are promising modes in which ACTFL has made little progress to date. Tape-recorded speaking tests, writing correction tests, and dictations may also prove to be useful for placement testing in foreign languages. Tape-recorded speaking tests can be graded more efficiently than the OPI and still collect a speech sample; writing correction tests can be given in a multiple-choice format that allows machine scoring; and dictation can be quickly graded by counting each word as either right or wrong.

Several institutions have refined placement procedures that are not based on «seat time», for example, the University of Arizona (Schultz, 1988), Brigham Young University (Larson, 1989), the University of Minnesota (Lange, 1987), the University of Pennsylvania (Freed, 1987), and the University of South Carolina (Mosher, 1989). Nevertheless, recent surveys have shown that foreign language placement testing procedures are neither systematic nor satisfactory for most institutions (Wherritt and Cleary, 1990; Klee and Rogers, 1989). Commercially available tests have had three critical problems: availability only in predetermined difficulty levels, limitation to discrete-point testing, and the large expense associated with the purchase of test booklets. This paper reports on the development and field test of an item bank developed at the University of Iowa for placement and assessment of language students.


The Setting

A program has been created at the University of Iowa called the Foreign Language Assessment Project (FLAP). The primary goals of FLAP are to improve articulation in foreign languages between feeder high schools and the University of Iowa and to develop outcome measures for the exit from the language requirement, the language major, and teacher certification.

Prior to the creation of FLAP, placement procedures in the six language departments at the University of Iowa were haphazard and subjective, and revision of the curriculum had moved at differing paces. Previous efforts by the University of Iowa Evaluation and Examination Service and by individual language departments had failed to upgrade and standardize placement procedures and testing. Recent modifications of the language requirement at the University of Iowa provided a major stimulus to the current project. Beginning in the fall of 1990, all freshmen entering the University will be required to have taken two years of a single foreign language in high school. Faculty and administration realized that any previous inadequacies in placement procedures could only be compounded by the implementation of the new requirement. For example, enrollments in Spanish language courses for the first two years of study account for 47% of total language enrollments, involving more than 2,000 students. Because of the new requirement, anticipated increased enrollments in Spanish created a major challenge in placing incoming students into Spanish courses.

A group of faculty, staff and administrators delineated an initial three-year plan for FLAP. Funding was initiated through the cooperative efforts of the Evaluation and Examination Service, the Center for International and Comparative Studies, and the College of Liberal Arts. A faculty member in the Department of Spanish and Portuguese was appointed director of the project and was given a half-time release from teaching duties to oversee activities. An advisory committee was formed with representatives from all language departments and related units. In addition, a full-time test developer was hired.

Through the efforts of FLAP, the University's Faculty Assembly has adopted a policy stating that incoming freshmen who do not place in at least the third semester course be required to take the first-year course(s) for no credit towards graduation. In addition, Incentive Credit for doing well in high school has been approved to encourage freshmen to place at as high a level as possible. Those who place in a fourth semester language course and receive a B- or better will receive Incentive Credit for the prerequisite third semester course, and those who place in the fifth semester course and earn a B- or better will receive Incentive Credit for the prerequisite third and fourth semester classes.






Test Development

FLAP developed an item bank for Spanish that can be used to generate a variety of tests. The first tests generated from the item bank were designed to place incoming freshmen in one of the first four semesters of language instruction or exempt them from the language requirement. To date, a total of over 500 reading and listening comprehension items have been written. Items testing speaking and writing will be added to the bank in the future. Except for short dictations, all items developed to date are machine scoreable. At a later date, items will be created at more advanced levels for use in tests for completion of the language major and teacher certification. Following the Spanish model, item banks are being developed for French and German.


The Table of Specifications and Item Bank Structure

A table of specifications was created from the syllabi of first- and second-year University of Iowa Spanish courses and from a survey of the content of textbooks most frequently used in Iowa high schools. Each part of the table of specifications has been assigned a numerical identifier so that items can be coded for entry into the item bank. The following skill areas have been defined as part of the specifications: dictation, grammar, reading comprehension, listening comprehension, writing, and speaking. For the first stage of development, attention was focused on dictation, listening comprehension, reading, and some grammar. The specifications at the beginning level include four domains: 1) the basics [numbers, greetings, common objects, colors, calendar terms, nationalities, and professions]; 2) personal [family, clothing, food, daily activities, house, body parts, health, neighborhood, likes, dislikes, plans, memories, and future]; 3) outside [transportation, travel, telephone, recreational activities, shopping, geography, and history]; and 4) sociolinguistic [invitation, refusal, seeking information, expressing attitudes, determining attitudes, persuading, and socializing]. Item writers were asked to target the item difficulties at a level appropriate for the end of the first year of language study.




Item Writing

Items were written and edited by faculty and teaching assistants selected from the Department of Spanish and Portuguese. All texts simulated authentic materials, using target language newspapers and magazines as models. Materials judged by the item reviewers as giving advantage to students with prior background knowledge were discarded or altered. For example, a passage on a recipe for gazpacho, a soup likely to be quite familiar to those who cook, was converted into a recipe for cold cauliflower soup, a dish less likely to be known to any examinee. Item writers worked in the same room, and each completed item was circulated among several participants for suggestions for improving clarity. Extensive proofreading was done later by native speakers and language teachers. Graduate students took sample exams generated from the item bank to further improve items and format.

Seven item types were developed: dictations, dialogues, announcements, modified cloze test of grammar, modified cloze test of reading, multiple-choice reading, and picture interpretation. Both written and tape-recorded (when appropriate) instructions were created in English for each item type. Tape recordings in Spanish and tape-recorded instructions in English were read by native speakers. To assess integrated language skills, 25-word dictations were written. The tape-recorded dictations contained simple prose read once quickly, once slowly in phrases, and again quickly.

Listening comprehension items included simulated authentic tape-recorded dialogues and announcements accompanied by printed multiple-choice items in English. The dialogues and announcements were read more slowly than native tempo, but at a reasonable speed. Dialogues portrayed short exchanges between friends or family, such as invitations, minor disagreements, and instructions. Announcements included those heard in such places as train stations and airports and through media such as radio and telephone.

Modified cloze passages with multiple-choice responses were developed to test the recognition of correct grammatical usage. Basic grammatical items mastered in first-year courses were chosen, such as uses of ser and estar, present tense, and simple prepositions.

Three types of items to assess global reading comprehension were developed. In one type, a modified cloze procedure uses passages of general interest followed by multiple-choice items in which all foils are semantically and grammatically possible within the sentence but only the correct answer is appropriate for the passage as a whole. To choose the correct answer, an examinee must understand the entire text. The blank generally appears towards the end of a paragraph in a simple declarative sentence. A second type of item uses simulated authentic newspaper passages followed by multiple-choice items in English. The articles cover topics of general interest including sports, weather, history, and health. The items test inference and general comprehension rather than recognition of specific facts. English is used in the questions to ensure that examinees understand each item. A third type of item tests general comprehension through use of pictures, maps, and diagrams accompanied by multiple-choice questions in Spanish. This item type measures ability to use the target language for simple spatial relationships and directions.

Each item was coded according to skill area, domain, and difficulty and entered into the computer bank.






Field Trial


Test Forms and Administration

Ten parallel alternative test forms were generated from the item bank and field tested during one testing period with a total of 1,422 students who were completing one of the four semesters of general education courses in Spanish. Bonus points were given to encourage students to perform well. Absenteeism for the test was minimal. Each test was generated through random selection from the item bank and contained a dictation plus ten listening comprehension items, five grammar/vocabulary items, and 20 items from reading comprehension and drawings. All test forms were administered in a 50-minute class period. The majority of students finished within 35 minutes. Each test administrator filled out a report, and any irregularities or suggestions for improvement were noted. Generally, test administrators approved of the test and the manner of administration.
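The form-generation step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the FLAP software: the `Item` record, the `generate_form` and `demo_bank` names, the pool sizes, and the numeric difficulty field are all hypothetical. Only the blueprint (one dictation, ten listening items, five grammar/vocabulary items, and 20 reading items) comes from the text.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    item_id: int
    skill: str         # e.g. "dictation", "listening", "grammar", "reading"
    domain: str        # e.g. "basics", "personal", "outside", "sociolinguistic"
    difficulty: float  # hypothetical numeric difficulty code


# Blueprint taken from the field trial description.
BLUEPRINT = {"dictation": 1, "listening": 10, "grammar": 5, "reading": 20}


def generate_form(bank, blueprint=BLUEPRINT, seed=None):
    """Draw a test form by random selection within each skill area."""
    rng = random.Random(seed)
    form = []
    for skill, count in blueprint.items():
        pool = [item for item in bank if item.skill == skill]
        if len(pool) < count:
            raise ValueError(f"not enough {skill} items: need {count}, have {len(pool)}")
        # sample() draws without replacement, so no item repeats within a form.
        form.extend(rng.sample(pool, count))
    return form


def demo_bank():
    """Build a small synthetic bank; the pool sizes are invented for illustration."""
    sizes = {"dictation": 12, "listening": 60, "grammar": 40, "reading": 120}
    domains = ["basics", "personal", "outside", "sociolinguistic"]
    bank = []
    for skill, size in sizes.items():
        for _ in range(size):
            item_id = len(bank) + 1
            bank.append(Item(item_id, skill, domains[item_id % 4], difficulty=0.5))
    return bank
```

Calling `generate_form(demo_bank(), seed=s)` for ten different seeds would yield ten forms with the same blueprint. Because each draw is independent, two forms may share items in this sketch.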




Scoring

Multiple-choice responses were scored by assigning one point for each correct response and zero for an incorrect response or an omitted question. The dictation was scored by giving one point for each completely correct word, including accent marks and any punctuation that followed the word. The total number of points possible on each dictation was 25. The dictations were scored by Evaluation and Examination Service staff, some of whom had no knowledge of Spanish. The group was tested first for accuracy and speed and later spot checked for accuracy. Probably, for a language less phonetic than Spanish, such as German or French, only persons skilled in the language would be able to correct dictations rapidly and accurately.
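The word-level scoring rule above lends itself to a short sketch. The version below assumes that words are separated by whitespace and that a word counts only if it matches the reference exactly, accents and attached punctuation included. The article does not say how scorers handled omitted or inserted words, so the alignment step (Python's `difflib.SequenceMatcher`, a heuristic matcher) and the `score_dictation` name are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def score_dictation(reference: str, response: str) -> int:
    """Count the reference words reproduced exactly in the response.

    Tokens are whitespace-separated, so an accent mark or a trailing comma
    is part of the word and must match for the point to be awarded.
    """
    ref_tokens = reference.split()
    resp_tokens = response.split()
    # Align the two word sequences so that one omitted or extra word
    # does not shift every later word out of position.
    matcher = SequenceMatcher(a=ref_tokens, b=resp_tokens, autojunk=False)
    return sum(block.size for block in matcher.get_matching_blocks())
```

For a 25-word passage the maximum returned value is 25, matching the point total described in the text.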




Item and Test Analysis

The test analysis information for the 10 field trial forms is presented in Table 1. The machine-scoreable sections show coefficient alpha reliabilities that range from .58 to .82, quite respectable levels for the first trial administration of a 35-item test. It is not possible to compute an estimate of the internal consistency of the dictation section because it did not consist of independent items, but information about its reliability can be obtained from its correlation with the other sections of the test. The dictation, which took only three minutes to administer, had very high correlations with the multiple-choice section, sometimes close to the reliability of the multiple-choice section. Other than one very low correlation of .43, the correlations ranged from .53 to .68. When the dictation and machine-scored items were combined, the coefficient alpha (consistency of the two parts within each test) ranged from .60 to .80.
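For readers who want to reproduce this kind of reliability estimate, coefficient alpha can be computed from a matrix of item scores as k/(k-1) times one minus the ratio of summed item variances to total-score variance. The sketch below is a generic implementation of that standard formula, not the Evaluation and Examination Service's program; the function name and the input layout (one row of item scores per examinee) are assumptions.

```python
from statistics import pvariance


def coefficient_alpha(scores):
    """Coefficient alpha for a list of per-examinee score rows.

    Each row holds one examinee's scores on the k parts (items or sections).
    """
    k = len(scores[0])
    if k < 2:
        raise ValueError("alpha needs at least two parts")
    part_variances = [pvariance([row[i] for row in scores]) for i in range(k)]
    total_variance = pvariance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("total scores do not vary")
    return (k / (k - 1)) * (1 - sum(part_variances) / total_variance)
```

Passing 35 columns of 0/1 item scores gives a section-level alpha. Passing two columns (multiple-choice total and dictation total) gives a two-part composite of the kind reported in the last column of Table 1.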

The dictations varied substantially in difficulty, from an average of 8.8 points to an average of 15 points (out of a possible 25 points).

Table 1: Field Trial Test Characteristics

           Multiple-Choice Section  Dictation Section            Composite
Form  N    Mean  S.D.  Reliability  Mean  S.D.  Correlation with Reliability
                       (Alpha)*                 Multiple-Choice  (Alpha)*
                                                Section*
1     130  22.9  3.6   .62          13.5  4.7   .55              .69
2     146  22.6  3.6   .62          13.5  4.7   .55              .69
3     135  19.2  3.8   .58           8.8  4.7   .57              .72
4     152  22.3  4.3   .70          14.6  5.7   .61              .74
5     145  24.1  4.6   .77          12.3  4.0   .53              .68
6     151  19.9  4.4   .68          10.0  5.1   .43              .60
7     125  23.3  4.2   .69          15.0  4.8   .57              .72
8     145  23.0  5.5   .82          12.0  5.1   .56              .72
9     142  22.7  4.7   .75          11.0  4.9   .63              .77
10    151  26.4  4.1   .73          11.7  5.2   .68              .80

* All reliability coefficients and correlations are significantly different from zero.

Several characteristics that contributed to unusual difficulty were identified: the use of rare proper names, difficult accented words that appeared several times, and more punctuation. Control of these factors in the future should make different dictation passages more comparable in difficulty. In addition, once dictation passages have been field tested several times, the dictation scores can be standardized to have the same mean and standard deviation.

The statistical characteristics of each of the machine-scored item types are shown in Table 2. Difficulty is expressed as the percentage of examinees answering correctly, so the multiple-choice reading items were the easiest and the grammar items were the most difficult. The median discriminations (how well the top students did compared to those in the bottom ranges) are all quite high. The lowest median discrimination, that for multiple-choice reading, could be improved by increasing the difficulty of these items. The median difficulty of 88% describes items that are too easy to discriminate well.

Table 2: Statistical Characteristics of Different Item Types

Item Type                        No. of Items  Median Difficulty  Median Discrimination
Grammar                          56            44%                .33
Audio Dialogues                  54            74%                .32
Audio Announcements              34            61%                .27
Reading Passages                 82            88%                .26
Modified Cloze Reading Passages  48            68%                .35
Picture Interpretation           65            76%                .35
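The two item statistics in Table 2 can be illustrated with a short sketch. Difficulty is computed here as the proportion of examinees answering correctly. The article does not name the discrimination index it used, so the upper-minus-lower group difference below, along with the 27% group size and the `item_statistics` name, is an assumption chosen because it matches the description of comparing top students with those in the bottom ranges.

```python
def item_statistics(responses, item, fraction=0.27):
    """Return (difficulty, discrimination) for one item.

    responses: list of per-examinee rows of 0/1 item scores.
    item: column index of the item to analyze.
    fraction: share of examinees in each extreme group (assumed 27%).
    """
    n = len(responses)
    # Difficulty: proportion of all examinees who answered correctly.
    difficulty = sum(row[item] for row in responses) / n
    # Rank examinees by total score, highest first.
    ranked = sorted(responses, key=sum, reverse=True)
    group = max(1, round(n * fraction))
    p_upper = sum(row[item] for row in ranked[:group]) / group
    p_lower = sum(row[item] for row in ranked[-group:]) / group
    # Discrimination: how much better the top group did than the bottom group.
    return difficulty, p_upper - p_lower
```

An item that nearly everyone answers correctly leaves little room for the two groups to differ, which is why very easy items tend to show low discrimination.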









Validity Study

In order to establish a measurement instrument as credible, it is essential that the instrument be validated for the purposes for which it is to be used. One form of validation, content validation, has been discussed earlier: items were constructed to reflect the current curriculum used in Spanish language classes at the University of Iowa. A second form of validation concerns the relationship between test score results and performance in a Spanish language class. Course grades for spring semester, 1989, were chosen as an appropriate criterion. These course grades are calculated using 50% oral comprehension and production, 25% reading comprehension, and 25% vocabulary and grammar. The test was administered near the end of the semester so that the results would reflect abilities of individuals having completed each semester of instruction. If a strong relationship exists between test scores and grades in particular courses, adequate performance on the test can be used to place an incoming student into one of the four courses.

Test scores were standardized (mean of 50 and standard deviation of 10) within each of the ten forms (see Cleary, Wherritt, and Druva-Roush, 1990, for a detailed report on the statistical analyses of test scores). The end-of-semester grades were determined by the teaching staff without inclusion of the test scores. There is a clear positive relationship between test scores and grades; the correlation ranges from .62 to .66 from the end of the first semester through the end of the fourth semester. Individuals doing less well in the course scored lower on the test.
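The within-form standardization can be sketched as a linear rescaling of raw scores. The function below is a minimal illustration; whether the original analysis used the population or the sample standard deviation is not stated, so the use of `pstdev` is an assumption, as is the `standardize` name.

```python
from statistics import mean, pstdev


def standardize(raw_scores, target_mean=50.0, target_sd=10.0):
    """Rescale one form's raw scores to the target mean and standard deviation."""
    m = mean(raw_scores)
    s = pstdev(raw_scores)  # population SD; an assumption, see lead-in
    if s == 0:
        raise ValueError("scores do not vary; cannot standardize")
    return [target_mean + target_sd * (x - m) / s for x in raw_scores]
```

Applying the function separately to each form puts all ten forms on a common scale, so that a score of 60 means one standard deviation above the mean of whichever form the examinee took.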

The test was constructed to be used both for placement and for assessment of competence. Four cut scores were statistically determined, three for placement into the three higher-level courses and one for exemption from the language requirement. A loss function for each course was created to establish cut scores for placement into the first four semesters of Spanish taught at the University of Iowa. A loss function is an analysis of the possible errors to be made by setting the cut score at a specific level. Success was defined here as receiving at least a C+ in a course; failure as receiving a C or lower. Two types of error were counted for the loss function: 1) predicting failure for individuals who in fact succeeded in the course [false negative] and 2) predicting success for individuals who in fact failed [false positive]. To set a cut score, one looks for the score at which the sum of the two types of error is minimized.
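The loss-function search described above amounts to counting both kinds of misclassification at each candidate cut score and keeping the candidate with the smallest total. The following sketch assumes that a score at or above the cut predicts success and that the two error types are weighted equally, as the text implies; the function names and the record layout are hypothetical.

```python
def loss_at(cut, records):
    """Number of misclassified students if the cut score is set at `cut`.

    records: list of (standard_score, succeeded) pairs, where `succeeded`
    is True for a course grade of C+ or better.
    """
    # False negatives: predicted to fail (score below cut) but succeeded.
    false_negatives = sum(1 for score, ok in records if score < cut and ok)
    # False positives: predicted to succeed (score at or above cut) but failed.
    false_positives = sum(1 for score, ok in records if score >= cut and not ok)
    return false_negatives + false_positives


def best_cut(records, candidates):
    """Return the candidate cut score with the smallest loss (lowest wins ties)."""
    return min(candidates, key=lambda cut: loss_at(cut, records))
```

Running `best_cut` once per course, on the students finishing that course, would give one cut score per semester level.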

As can be seen in Figure 1, the cut score is progressively higher for each higher-level course. The loss function minimum is around a score of 53 for the end-of-fourth semester, so that score can be used for exemption. For placement into the fourth semester, the loss function for the end-of-third semester is used, where the minimum lies in the range of 42 to 46. The end-of-second semester reaches a minimum at about 39, and the end-of-first semester at about 30.
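Once the four cut scores are fixed, placement reduces to a table lookup. The sketch below uses the approximate minima reported above (30, 39, and 53) and takes 44 as an illustrative midpoint of the 42 to 46 range; that midpoint, the rule that a score equal to a cut meets it, and the `place` function itself are assumptions rather than the University's stated policy.

```python
from bisect import bisect_right

# Approximate cut scores read from the text; 44 is an assumed midpoint of 42-46.
CUT_SCORES = [30, 39, 44, 53]
PLACEMENTS = [
    "first semester",
    "second semester",
    "third semester",
    "fourth semester",
    "exempt from language requirement",
]


def place(score, cuts=CUT_SCORES):
    """Map a standardized test score to a placement level."""
    # bisect_right counts how many cut scores the examinee has met or exceeded.
    return PLACEMENTS[bisect_right(cuts, score)]
```

For example, a standardized score of 41 meets the first two cuts but not the third, so the student would be placed in the third semester course.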


Figure 1: Loss Function of Test Scores for End of First Four Semesters of Spanish

An alternate means of setting a cut score for a particular course is to examine the mean grade across individuals within a course for each score. When the mean grade drops below an acceptable level, the cut score is defined. The cut scores defined by the «mean grade» method lie within the range defined by the loss function approach, so two different methods have defined cut scores within the same range of values. The triangulation of these two cut score methods, coupled with the fact that the score ranges predicting adequate performance are higher with each succeeding course, lends evidence for the construct validity of the test. The data indicate that the test is indeed capable of measuring across a rather wide range of ability.




Conclusions

FLAP has been successful in developing a large item bank for the testing of reading comprehension and listening in Spanish at the level of the first four semesters of college work. Tests generated from the bank have demonstrated validity for placement of incoming students into these college-level courses. With seven item types, the test is varied and holds the interest of examinees. The variety in item types also allows students with different learning styles to excel in their areas of strength. Because a reasonable amount of the test is devoted to dictation and listening comprehension, the test reflects more accurately than traditional tests the importance of oral comprehension in first- and second-year language courses.

Revision of defective items based on the field trial administrations will allow a large bank of items to be used for the generation of multiple alternative forms in the future. Similarly, through use of the item bank, tests can be generated containing different skill emphases and higher or lower difficulty levels.

The success of the development of the FLAP item bank has been aided by broad financial and staff support. This broad base of funding has encouraged faculty and staff of divergent groups to come together to talk about common problems. FLAP is housed in the University of Iowa Evaluation and Examination Service and therefore enjoys the availability of measurement and statistics specialists and previously developed item bank software. The FLAP staff has background in foreign language pedagogy, enabling accurate input on content. High school teachers have been consulted through local workshops. By having a central location, a staff, and a recognized name, FLAP has achieved a reasonable amount of credibility on campus and in the state of Iowa in a short period of time.




Works Cited

Bachman, L. and J. Clark. «The Measurement of Foreign / Second Language Proficiency.» The Annals of the American Academy of Political and Social Sciences 490 (1987): 20-33.

Barnwell, D. «Proficiency and the Native Speaker.» ADFL Bulletin 20(2) [1989]: 42-46.

Brod, R. «Foreign Language Enrollments in U.S. Institutions of Higher Education-Fall 1986.» ADFL Bulletin 19(2) [1988]: 39-44.

Brod, R. and M. Lapointe. «The MLA Survey of Foreign Language Entrance and Degree Requirements, 1987-88.» ADFL Bulletin 20(2) [1989]: 17-41.

Cleary, T. A., I. Wherritt and C. Druva-Roush. «Development and Analysis of a Spanish Language Item Bank for Placement Testing and Outcome Assessment.» Iowa City, IA: The University of Iowa, Unpublished manuscript, 1990.

Freed, B. «Issues in Establishing and Maintaining a Language Proficiency Requirement.» Proceedings of the Symposium on the Evaluation of Foreign Language Proficiency. Ed. A. Valdman. Bloomington, IN: Indiana University, 1987. 263-73.

Klee, C. and E. Rogers. «Status of Articulation: Placement, Advanced Placement Credit, and Course Options.» Hispania 72 (1989): 763-73.

Lange, D. «Developing and Implementing Proficiency Oriented Tests for a New Language Requirement at the University of Minnesota: Issues and Problems for Implementing the ACTFL/ETS/ILR Proficiency Guidelines.» Proceedings of the Symposium on the Evaluation of Foreign Language Proficiency. Ed. A. Valdman. Bloomington, IN: Indiana University, 1987. 275-90.

Lantolf, J. and J. Frawley. «Oral-Proficiency Testing: A Critical Analysis.» The Modern Language Journal 69 (1985): 337-45.

Larson, J. «S-CAPE: A Spanish Computerized Adaptive Placement Exam.» Modern Technology in Foreign Language Education: Applications and Projects. Ed. W. F. Smith. Lincolnwood, IL: National Textbook Company, 1989. 277-89.

Mosher, A. «The South Carolina Plan for Improved Curriculum Articulation Between High Schools and Colleges.» Foreign Language Annals 22 (1989): 157-62.

Schultz, R. «Proficiency-Based Foreign Language Requirements: A Plan for Action.» ADFL Bulletin 19(4) [1988]: 24-28.

Valdés, G. «Teaching Spanish to Hispanic Bilinguals: A Look at Oral Proficiency Testing and the Proficiency Movement.» Hispania 72 (1989): 392-401.

Wherritt, I. and T. A. Cleary. «A National Survey of Spanish Language Testing for Placement or Outcome Assessment at B.A.-Granting Institutions in the United States.» Foreign Language Annals 23 (1990): 157-630.






