The use of computers has infiltrated many areas of education, including the fields of language instruction and testing. Although sophisticated computer programs have historically been more common in curricula outside foreign languages, we are now beginning to see several worthwhile applications in this field as well. One of these applications is language testing.
Computer-assisted testing (CAT) includes any use of the computer that aids in the testing process. The assistance may be in the form of test-item generation, test delivery, scoring, record keeping, reporting results, providing feedback to examinees, and the like. While CAT programs vary considerably in the range of assistance they provide to testers, ideally they should eliminate as much of the drudgery of testing as possible.
As with many innovations, the question arises: Is computer-assisted testing really profitable? The answer to this query depends on several factors. Budget restrictions, numbers of students, testing frequency, availability of equipment, and so forth are all important. One must also evaluate the advantages and disadvantages of both computer and conventional testing procedures and materials. To assist in such an evaluation, a brief discussion of various features of computer-assisted language testing follows.
When one evaluates the appropriateness of any CAT program, it is important to remember that computer-assisted testing uses unique capabilities of the computer. If specialized computer capabilities are not used, this form of testing will be no more efficient than conventional tests.
Memory capability is clearly one of the distinct advantages of using computers for testing. The ability to store and retrieve data makes item banking and random or selected item generation possible. Items within the bank can be arranged and classified however they are needed for immediate access. If, for example, they are cataloged by difficulty level, they can be called up by the computer in such a way as to tailor the test to the ability level of the examinee; it is then possible for all examinees to take individualized, yet equated, tests. Items can be added to or deleted from the bank as course requirements change or if certain items are found to be unsatisfactory.
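The item-banking arrangement described above can be sketched in a few lines of code. The bank structure, item IDs, prompts, and difficulty levels below are hypothetical, invented only to illustrate cataloging by difficulty and random retrieval of unused items:

```python
import random

# Hypothetical item bank cataloged by difficulty level (all IDs, prompts,
# and levels are illustrative; a real bank holds many items per level).
ITEM_BANK = {
    1: [{"id": 101, "prompt": "Yo ___ estudiante.", "answer": "soy"},
        {"id": 102, "prompt": "Ella ___ de Mexico.", "answer": "es"}],
    2: [{"id": 201, "prompt": "Ayer ___ al cine.", "answer": "fui"}],
}

def draw_item(level, used_ids):
    """Randomly select an unused item at the requested difficulty level."""
    candidates = [item for item in ITEM_BANK.get(level, [])
                  if item["id"] not in used_ids]
    return random.choice(candidates) if candidates else None
```

Because items are keyed by difficulty, the same structure supports both random test generation and the tailored selection discussed later; adding or deleting items is a matter of editing the bank.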
Dynamic memory capability allows the computer to keep track of examinee performance and provide instantaneous feedback regarding areas of strength or deficiency upon completion of the test. These records enable the teacher to track student progress and to diagnose general areas of concern, thus facilitating the task of counseling and recommending remediation. In addition, the computer can be programmed to provide a performance record of the test items. This record can be saved and subsequently used for evaluating the effectiveness of individual items.
Word-processing programs are available for most microcomputers, and by using the word processor, test developers can easily edit, add, or delete items in the test bank. This potential, coupled with record-keeping capability, encourages and facilitates regular item analysis and test improvement. Word-processing features also make it possible to include writing-type test items via the computer. Hence, item types such as fill-in-the-blank and short and long essays become feasible.
Since the microcomputer is normally used on an individualized basis, self-paced language testing is readily accomplished. Examinees can progress through the test according to their competence and test-taking styles; they are not forced to complete anxiety-producing timed tests. In addition, individuals can be allowed to take a test when they are ready, instead of having to wait until the rest of the class is prepared or being compelled to take the test before they are familiar with the information being tested.
Using the computer for test delivery or construction does not limit the possible kinds of tests. Progress and achievement tests for classroom purposes are easily handled via the computer. Printed test forms can be generated from a bank of items dealing with the information covered in class, or tests can be taken directly from the computer terminal. Depending on the size and arrangement of the item bank, several alternative test forms may be produced.
Placement testing is easily accomplished via computer. (A particular type of CAT placement test, a computerized adaptive placement exam, is discussed later in this paper.) Foreign language diagnostic tests are also well suited for computer-assisted testing. In fact, because of the memory and record-keeping capability of the computer, diagnostic testing is in many ways accomplished much better via computer than through paper-and-pencil testing procedures.
Computer-assisted language proficiency testing has recently received increased attention. The ACTFL is presently developing a computer-adaptive proficiency test of listening and reading.
Although computer-assisted testing appears better suited to assessing reading skills and knowledge of grammar and vocabulary, it is possible to test listening and writing skills as well. As mentioned earlier, the word-processing capabilities of the computer make writing test items possible. Computer scoring of this type of item, however, does require sophisticated parsing and evaluation routines. Writing items such as essays or short compositions that are composed on the computer but not scored by it tend to be cleaner and more polished, since examinees tend to revise and edit their work repeatedly (Nickell). Testing listening skills requires a peripheral in addition to the computer itself: a random-access audio device, either an audiotape or audiodisc player driven by the computer. The computer can be programmed to access items from a bank prerecorded on the audio device rather than items stored in the computer itself. At this point, unfortunately, assessing speaking ability via computer-assisted testing has not become a reality. However, considering the advances in technology made over the past few years, it is conceivable that this skill, too, may soon be evaluated by computer.
Other benefits, besides those relating to the unique capabilities of the computer, should be considered. For example, computer-assisted testing reduces test administration costs. (While the hardware used for computer testing is expensive, it is not solely a testing expense; these same computers can be used by faculty, staff, and students for many other purposes, such as secretarial assistance, word processing, data analysis, budget and student-record keeping, and CAI exercises.) Computer-delivered tests do not require the hiring of test proctors and reduce the expense of testing materials and handling, that is, test packets, paper, pencils, answer sheets, scoring, and reporting.
Computer-assisted testing also greatly diminishes the concern for test security and compromise. There are no test packets that have to be safeguarded and stored from one test administration to the next. Even if students were to gain access to one of the CAT computer disks, they would virtually have to memorize the entire contents of the disk since test items can be generated at random, yielding a separate and unique test for each testing session.
Perhaps the first limitation that one recognizes is the expense of providing a full-scale CAT program. In order to test several students in a limited amount of time, several computer stations are necessary. The exact cost per station will vary greatly, depending on the type of computer, monitor, and peripherals required. If, however, the computer is used mainly as a test generator, meaning that the tests are sent to a printer for reproduction, the expense of the CAT program is within the range of most language departments' budgets. Expenses for testing software development or for purchase of software must also be considered. One must weigh these expenditures against the amounts required for alternative testing procedures.
Another disadvantage associated with computer-assisted testing is limited screen size. Computer systems normally allow approximately twenty-five lines of forty to eighty columns of text on one screen display and hence do not provide a lot of space for long items that are commonly used to test reading comprehension. An option, however, is to allow for scrolling forward and backward, which makes it possible to include long reading passages.
Since computers judge responses on a right-wrong basis, computer-assisted testing relies almost exclusively on objectively scored item types (e.g., multiple-choice, matching, true-false, exact-answer fill-in-the-blank). These item types are often satisfactory for discrete-point information assessment but present some difficulties in the evaluation of global language competence. An additional concern is that sound approaches to evaluation, including essays, dictation, and holistically scored oral interviews, might well be neglected by persons eager to employ computers in language testing (Larson and Madsen 32).
Although most students today are generally comfortable around computers, some may feel intimidated by them. This uneasiness, coupled with the usual test anxiety, may contribute to a substandard performance by these students and give advantage to students who have had previous computer experience (Cohen).
New to the field of computer-assisted language testing are computer adaptive tests. Within just the last year or two, language testers have succeeded in producing tests administered by computer that adapt to the ability level demonstrated by the examinee taking the test. These tailored tests provide a common-yardstick measurement to all examinees. The items in the test bank are calibrated on a difficulty continuum, using item-response-theory measurement models. The technique used in these tests is similar to that followed in oral proficiency interviewing. The first item presented to the examinee is a low-difficulty item. If the item is answered correctly, the next item will be more difficult. Conversely, if the first item is answered incorrectly, an easier item will be presented. Items increase in difficulty until the ability level of the examinee has been determined.
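The up-one-level/down-one-level selection rule just described can be sketched as follows. The level bounds and test length are arbitrary placeholders, and `answer_fn` stands in for the examinee's response to an item at a given level:

```python
def adaptive_sequence(answer_fn, n_items=20, start=1, lo=1, hi=10):
    """Sketch of the adaptive selection rule: present a harder item after
    a correct answer and an easier one after an incorrect answer.
    answer_fn(level) stands in for the examinee's response at that level."""
    level, history = start, []
    for _ in range(n_items):
        correct = answer_fn(level)
        history.append((level, correct))
        level = min(level + 1, hi) if correct else max(level - 1, lo)
    return history
```

For an examinee who can handle items up to, say, level 5, the sequence climbs from the easy starting item and then oscillates between levels 5 and 6, which is where the ability estimate settles.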
Ability estimates in CALT tests are determined by the examinee's pattern of responses. This testing methodology results in increased measures at the individual's approximate level of ability, which in turn provide for increased test accuracy. Improved test accuracy results in greater test efficiency, meaning that fewer items are necessary to evaluate the student's ability, and, consequently, less time is required for testing (Olsen et al. 22).
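As a rough illustration of how a pattern of responses yields an ability estimate under a latent-trait model, the fragment below does a brute-force maximum-likelihood search under the one-parameter (Rasch) model. The item difficulties, search range, and grid resolution are illustrative assumptions, not part of any particular CALT implementation:

```python
import math

def p_correct(theta, b):
    """Rasch model: probability of a correct answer for an examinee of
    ability theta on an item of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def estimate_ability(responses, lo=-4.0, hi=4.0, steps=801):
    """Grid-search maximum-likelihood ability estimate from a response
    pattern given as (item_difficulty, answered_correctly) pairs."""
    best_theta, best_loglik = lo, float("-inf")
    for i in range(steps):
        theta = lo + (hi - lo) * i / (steps - 1)
        loglik = sum(math.log(p_correct(theta, b)) if right
                     else math.log(1.0 - p_correct(theta, b))
                     for b, right in responses)
        if loglik > best_loglik:
            best_theta, best_loglik = theta, loglik
    return best_theta
```

Because every item administered sits near the examinee's provisional estimate, each response is highly informative, which is why fewer items suffice than in a fixed-form test.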
When taking conventional paper-and-pencil tests, students are forced to answer a wide range of questions, many of which are either much too easy or much too difficult. Because adaptive tests administer items near the ability level of the student, they reduce test boredom from too-easy items and frustration from too-hard items. Thus attitudes toward tests improve. In addition, having students work at their own pace answering items at their approximate ability levels makes for an ideal power test.
Since the items used in a computer adaptive test are carefully calibrated on a difficulty scale, it becomes possible to generate almost any number of alternative, equated tests. Examinees who take a CALT test are all measured on the same scale, yet their tests may be quite different in composition. Test security and cheating are no longer a major concern.
As stated above, the items in computer adaptive tests have been calibrated using one of the item-response-theory (IRT), or latent-trait, models. (For discussions of the IRT models, see Appendix C.) One of the assumptions that must be met in order to justify the use of these models is that of unidimensionality, meaning that all test items must measure only one variable, such as language proficiency. Some psychometricians have maintained that language tests involve more than one principal domain and therefore latent-trait models are not appropriate. However, recent research suggests that even though the overall test covers several skill areas (e.g., listening comprehension, vocabulary recognition, and grammatical accuracy), no violations of unidimensionality exist (Henning, Hudson, and Turner 151).
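For reference, the simplest of these latent-trait models, the one-parameter (Rasch) model, gives the probability that an examinee of ability \(\theta\) answers an item of difficulty \(b\) correctly as:

```latex
P(\text{correct} \mid \theta, b) = \frac{e^{\theta - b}}{1 + e^{\theta - b}}
```

Calibration places every item difficulty \(b\), and hence every examinee's \(\theta\), on this single common scale, which is what makes the equated alternative test forms described above possible.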
The success of a CALT test depends heavily on the accuracy of the item calibrations. If the items are not located properly on the difficulty/ability scale, the test will be neither valid nor reliable. In addition, several items are needed at each difficulty level in order to provide repeated measures at that level. Therefore, a fairly large bank of items may need to be created.
Another concern in developing computer adaptive tests is the determination of performance cutoff points (for grading, placement, advancement, remediation, etc.). These parameters must be set according to criteria appropriate to each department. For example, cutoff scores from a CALT placement test used to place students into beginning level language courses may differ from department to department; therefore, each department using the test must decide which ability scores belong to which of its courses.
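A department's cutoff decisions amount to a simple lookup from ability score to course. The score thresholds and course numbers below are invented for illustration and, as noted above, would differ from department to department:

```python
# Hypothetical placement cutoffs (threshold, course), highest first.
# Real thresholds must be set by each department against its own courses.
CUTOFFS = [(30, "Spanish 201"), (20, "Spanish 102"), (10, "Spanish 101")]

def place(ability_score):
    """Map an ability score to a suggested course placement."""
    for threshold, course in CUTOFFS:
        if ability_score >= threshold:
            return course
    return "Spanish 100 (beginning)"  # below all cutoffs
```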
An example of a computer adaptive language test currently in use is the S-CAPE (Spanish Computer Adaptive Placement Exam), which was developed at Brigham Young University.[1] Designed to assist in placing incoming students into appropriate lower-division Spanish courses in the Department of Spanish and Portuguese at BYU, the S-CAPE can estimate a student's ability level and provide a suggested course placement in about twenty to thirty minutes. The placement criteria have been set according to the contents of the courses. Students may take the test during the first few days of the semester or upon request during the semester, when they need information regarding which class they should take the following semester.
Using microcomputers in the Humanities Learning Resource Center, approximately 350 to 400 students can complete the test per day. Results are provided on completion of the test; a message appears on the screen informing the examinee that the test has been completed and indicates the student's placement level. The student can then consult the course placement chart, which is posted in the room, to determine which course is most appropriate.
If desired, in addition to the brief placement message presented to the examinee, a complete display of the student's performance can be shown on the computer screen or printed on hard copy. Appendix A shows three sample performance reports of actual students. Information on the reports includes the name and identification number of the student, the date of the test, the time the test began and ended, the amount of the student's previous exposure to Spanish instruction or use, the items answered (by item ID number in the test item bank), the level of each item, the student's answer, right/wrong indication, and the student's placement level.
In deciding whether computer-assisted testing is profitable for a particular situation, one needs to consider many aspects of the process. Each department must determine whether the advantages outweigh the disadvantages. It appears, however, that present and future developments in computer-assisted testing will have a significant impact on foreign language testing. In his article "Using Microcomputers to Administer Tests: An Alternate Point of View," Millman presents a few dampening comments about CAT, but his prognosis is that "computer-assisted test administration not only will survive, it will flourish as a viable and healthy part of the body of test practices" (21).
The author is Director of the Humanities Learning Resource Center at Brigham Young University. This article is based on a paper presented at ADFL Seminar West, 26–29 June 1986, in Monterey, California.
[1] S-CAPE was produced in the Humanities Research Center at Brigham Young University by Jerry W. Larson and Kim L. Smith. For further information about the exam, write the Humanities Research Center, 3060 JKHB, Brigham Young Univ., Provo, UT 84602.
Cohen, Andrew D. "Fourth ACROLT Meeting on Language Testing." TESOL Newsletter 18.2 (1984): 23.
Henning, Grant, Thom Hudson, and Jean Turner. "Item Response Theory and the Assumption of Unidimensionality for Language Tests." Language Testing 2.2 (1985): 141–54.
Larson, Jerry W., and Harold S. Madsen. "Computerized Adaptive Language Testing: Moving beyond Computer-Assisted Testing." CALICO Journal 2.3 (1985): 32–36.
Millman, Jason. "Using Microcomputers to Administer Tests: An Alternate Point of View." Educational Measurement: Issues and Practices 3.2 (1984): 20–21.
Nickell, Samila S. "Computer-Assisted Writing Conferences." CALICO 1985 Symposium. Baltimore, 2 Feb. 1985.
Olsen, James B., Dennis D. Maynes, Dean Slawson, and Kevin Ho. "Comparison and Equating of Paper-Administered, Computer-Administered and Computerized Adaptive Tests of Achievement." American Educational Research Association meeting. San Francisco, Apr. 1986.
SPANISH COMPUTER ADAPTIVE PLACEMENT TEST REPORT
Student A, S.S. No. Date: 04-02-1986
Time started: 11:56:06
Spanish background: One semester in college.
There were four incorrect answers at level 24
Number right: 10 Number wrong: 10
This student placed at level 23.
Test completed at 12:07:52
SPANISH COMPUTER ADAPTIVE PLACEMENT TEST REPORT
Student B, S.S. No. Date: 04-03-1986
Time started: 13:11:12
Spanish background: Two semesters in college.
There were four incorrect answers at level 34
Number right: 12 Number wrong: 11
This student placed at level 33.
Test completed at 14:03:04
SPANISH COMPUTER ADAPTIVE PLACEMENT TEST REPORT
Student C, S.S. No. Date: 06-23-1986
Time started: 14:41:37
Spanish background: One year in secondary school; lived in a Spanish-speaking area for six months or longer.
There were four incorrect answers at level 34
Number right: 8 Number wrong: 7
This student placed at level 33.
Test completed at 14:48:55
Green, Bert F. "Adaptive Testing by Computer." Measurement, Technology, and Individuality in Education. Ed. R. B. Ekstrom. New Directions for Testing and Measurement 17. San Francisco: Jossey-Bass, 1983. 5–12.
———. "The Promises of Tailored Tests." Principles of Modern Psychological Measurement: A Festschrift in Honor of Frederic Lord. Ed. H. Wainer and S. A. Messick. Hillsdale: Erlbaum, 1983. 69–80.
Madsen, Harold S., and Jerry W. Larson. "Employing Computerized Adaptive Language Testing Techniques." Selected Papers from the Proceedings of the Eleventh Annual Symposium of the Deseret Language and Linguistic Society. Ed. Robert A. Russell. Provo: DLLS, 1986. 117–28.
Takalo, Ronald. "Language Test Generator." CALICO Journal 2.4 (1985): 45–48.
Urry, V. W. "Tailored Testing: A Successful Application of Latent Trait Theory." Journal of Educational Measurement 14 (1977): 181–96.
Weiss, David J., and Nancy E. Betz. Ability Measurement: Conventional or Adaptive? Research Report 73-1. Psychometric Methods Program. Minneapolis: Dept. of Psychology, Univ. of Minnesota, 1973.
Wyatt, David H. "Computer-Assisted Teaching and Testing of Reading and Listening." Foreign Language Annals 17.4 (1984): 393–407.
Hambleton, R. K., and L. L. Cook. "Latent Trait Models and Their Use in the Analysis of Educational Test Data." Journal of Educational Measurement 14 (1977): 75–96.
Henning, Grant. "Advantages of Latent Trait Measurement in Language Testing." Language Testing 1.2 (1984): 123–33.
Lord, F. M. Applications of Item Response Theory to Practical Testing Problems. Hillsdale: Erlbaum, 1980.
Wainer, Howard. "On Item Response Theory and Computerized Adaptive Tests." Journal of College Admissions 28.4 (1983): 9–16.
Woods, Anthony, and Rosemary Baker. "Item Response Theory." Language Testing 2.2 (1985): 119–40.
Wright, Benjamin D., and Mark H. Stone. Best Test Design. Chicago: Mesa, 1979.
Wright, Benjamin D., and R. J. Mead. BICAL: Calibrating Items with the Rasch Model. Research Memorandum 23. Chicago: Statistical Laboratory, Dept. of Education, Univ. of Chicago, 1976.
© 1987 by the Association of Departments of Foreign Languages. All Rights Reserved.