大型測驗中同時進行垂直與水平等化效果之探討

A Simultaneous Vertical and Horizontal Equating of Large-scale Assessments

Bor-Chen Kuo (郭伯臣); Hsuan-Po Wang (王暄博)
Co-author: Hsuan-Po Wang (王暄博)


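Journal issue: Volume 4, Issue 4, "Testing and Assessment" (測驗與評量)
Editor-in-chief: Wen-Chung Wang (王文中), Chair Professor, The Hong Kong Institute of Education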
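System ID: vol015_04
Topic: Testing and Assessment
Year of publication: 2008
Authors: Bor-Chen Kuo (郭伯臣); Hsuan-Po Wang (王暄博)
Author (English): Bor-Chen Kuo
Title: 大型測驗中同時進行垂直與水平等化效果之探討
Title (English): A Simultaneous Vertical and Horizontal Equating of Large-scale Assessments
Co-author: Hsuan-Po Wang (王暄博)
Highest degree:
Institution:
Department:
Language:
Number of pages: 34
Chinese keywords: 平衡不完全區塊設計 (balanced incomplete block design); 定錨不等組設計 (non-equivalent groups with anchor test design); 定錨試題 (anchor item); 測驗等化 (test equating)
English keywords: balanced incomplete block design; non-equivalent groups with anchor test design; anchor item; test equating
Affiliation: Director and Professor, Graduate Institute of Educational Measurement and Statistics, National Taichung University of Education; graduate student, Graduate Institute of Educational Measurement and Statistics, National Taichung University of Education
Manuscript word count: 13335
Author expertise: image recognition, computerized adaptive testing, machine learning and data mining, item response theory, statistical learning theory
Submission date: 2008/10/31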
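Abstract (translated from Chinese): In recent years, with rapid advances in information technology, changes in test formats, and fast-growing demand, large-scale assessments have attracted wide attention. The main purpose of these assessments is to build a complete and objective database of student achievement and, through equating and linking methods, to make the test scores of examinees in different grades and different years comparable, so that the learning outcomes of students nationwide can be understood. Taking the three-parameter logistic model of item response theory (IRT) as its theoretical basis, this study examines the linking performance of two linking designs, the balanced incomplete block (BIB) design and the non-equivalent groups with anchor test (NEAT) design, when large-scale educational assessments are equated across grades and years. Simulation experiments vary the number of examinees, the proportion of anchor items, and the method of selecting the anchor-item difficulty range. The study finds that the estimation errors of item parameters and examinee abilities decrease as the number of examinees increases; that the estimation errors of examinee ability and item difficulty decrease as the proportion of anchor items increases; that the choice of difficulty range makes no clear difference; that the BIB design is generally more precise than the NEAT design in estimating item parameters; and that the NEAT design is more precise than the BIB design in estimating examinee abilities.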
Abstract (English): In large-scale assessments, the range of subject matter is usually wide, and sampling items and students simultaneously is a practical way to obtain representative indications of student performance. The balanced incomplete block (BIB) design and the non-equivalent groups with anchor test (NEAT) design are two popular test equating designs for this situation. This study explores the linking performance of the BIB and NEAT designs for two large-scale assessments administered in different years and different grades. The effects of the number of examinees, the percentage of anchor items, and the difficulty range of the anchor items are examined under the two equating designs. The simulation results show that: 1. the estimation error decreases as the number of examinees increases; 2. the estimation error decreases as the number of anchor items increases, and better equating performance is obtained when the percentage of anchor items is 30%; 3. BIB outperforms NEAT in estimating item parameters, while NEAT outperforms BIB in estimating examinee abilities.
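As an illustrative sketch (not the authors' simulation code), the three-parameter logistic model named in the abstract and a small BIB booklet layout can be written as follows. The scaling constant D = 1.7, the function names, and the layout of 7 item blocks in 7 booklets are assumptions chosen for illustration; the study's actual item parameters, block counts, and estimation settings are not given in this record.

```python
import math
from collections import Counter
from itertools import combinations


def p_3pl(theta, a, b, c, D=1.7):
    """3PL probability that an examinee with ability theta answers correctly.

    a = discrimination, b = difficulty, c = pseudo-guessing parameter.
    D = 1.7 is a conventional scaling constant (an assumption here).
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


# Hypothetical BIB layout: 7 item blocks (0..6) spread over 7 booklets,
# 3 blocks per booklet, built cyclically from the base set {0, 1, 3} mod 7.
BOOKLETS = [tuple(sorted((s + k) % 7 for s in (0, 1, 3))) for k in range(7)]


def pair_counts(booklets):
    """Count how often each pair of blocks shares a booklet."""
    counts = Counter()
    for booklet in booklets:
        counts.update(combinations(booklet, 2))
    return counts


def block_counts(booklets):
    """Count how many booklets each block appears in."""
    return Counter(block for booklet in booklets for block in booklet)


if __name__ == "__main__":
    # When theta equals the item difficulty b, the probability is c + (1 - c) / 2.
    print(p_3pl(theta=0.0, a=1.0, b=0.0, c=0.2))
    for booklet in BOOKLETS:
        print(booklet)
    # "Balanced" means every pair of blocks is paired equally often.
    print(set(pair_counts(BOOKLETS).values()))
```

In this layout every block appears in three booklets and every pair of blocks shares exactly one booklet, which is the balance property that lets all blocks be linked onto one scale even though no examinee takes every item. In a NEAT design, by contrast, the link between forms runs only through the shared anchor items.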
References: 王寶墉(1995)。現代測驗理論。臺北市:心理。
李源煌、楊玉女(2000)。建立學科評量量尺之理論基礎。中國測驗學會測驗年刊,47(1),95-116。
張鈺卿、陳昇座、郭伯臣、王暄博(2006)。大型教育測驗不同年度量尺等化效果之模擬研究。第七屆海峽兩岸心理與教育測驗學術研討會,2006年10月28日、29日,國立政治大學。
郭伯臣、王暄博、許天維、張雅媛(2005)。大型測驗不同等化設計效果之模擬研究。2005年教育與心理測驗學術研討會,2005年11月12日。中國測驗學會,國立政治大學。
陳煥文(2004)。垂直等化連結特性之研究-四種連結方法的比較。(國科會專題研究計畫,NSC92-2413-H-024-015)。臺南市︰國立臺南大學測驗統計研究所。
曾玉琳、王暄博、郭伯臣、許天維(2005)。不同BIB設計對測驗等化的影響。測驗統計年刊,第十三輯下期,209-229。台中市:國立台中教育大學。
Angoff, W. H. (1971). Scales, norms, and equivalent scores. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 508-600). Washington, DC: American Council on Education. (Reprinted as W. H. Angoff, Scales, norms, and equivalent scores. Princeton, NJ: Educational Testing Service, 1984.)
Allen, N. L., Donoghue, J. R., & Schoeps, T. L. (2001). The NAEP 1998 technical report. Washington, DC: National Center for Education Statistics.
Crocker, L. & Algina, J. (1986). Introduction to Classical and Modern Test Theory. New York: Holt, Rinehart and Winston.
Dorans, N. J. & Holland, P. W. (2000). Linking Scores from Multiple Instruments.
Haebara, T. (1980). Equating Logistic Ability Scales by a Weighted Least Squares Method. Japanese Psychological Research, 22, 144-149.
Hambleton, R. K., & Swaminathan, H. (1985). Item Response Theory: Principles and Applications. Boston, MA: Kluwer-Nijhoff.
Hanson, B. A. & Béguin, A. A. (2002). Obtaining a Common Scale for Item Response Theory Item Parameters Using Separate Versus Concurrent Estimation in the Common-Item Equating Design. Applied Psychological Measurement, 26, 3-24.
Harris, D. J. & Crouse, J. D. (1993). A study of criteria used in equating. Applied Measurement in Education, 6, 195-240.
Kolen, M. J. (2000). Issues in Combining State NAEP and Main NAEP. In J. W. Pellegrino, L. R. Jones, & K. J. Mitchell (Eds.), Grading the Nation’s Report Card: Research from the Evaluation of NAEP. Committee on the Evaluation of National and State Assessments of Educational Progress.
Kuehl, R. O. (2000). Design of Experiments: Statistical Principles of Research Design and Analysis. CA: Duxbury Press.
Kim, S. H. & Cohen, A. S. (1998). A Comparison of Linking and Concurrent Calibration Under Item Response Theory. Applied Psychological Measurement, 22, 131-143.
Kolen, M. J. & Brennan, R. L. (1995). Test Equating: Methods and Practices. New York: Springer-Verlag.
Kolen, M. J. & Brennan, R. L. (2004). Test equating, scaling, and linking: methods and practices (2nd ed.). New York: Springer-Verlag.
Lord, F. M. (1980). Applications of Item Response Theory to Practical Testing Problems. Hillsdale, NJ: Lawrence Erlbaum.
Mislevy, R. J. (1986). Bayes modal estimation in item response models. Psychometrika, 51, 177-195.
Morris, C. N. (1982). On the foundations of test equating. In P. W. Holland & D. B. Rubin (Eds.), Test equating (pp. 169-191). New York: Academic Press.
National Research Council. (1999). Uncommon Measures: Equivalency and Linkage of Educational Tests. Washington, DC: Author.
Nemhauser, G. L., & Wolsey, L. A. (1999). Integer and Combinatorial Optimization. New York: John Wiley.
Petersen, N. S., Kolen, M. J., & Hoover, H. D. (1989). Scaling, norming, and equating. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 221-262). New York: Macmillan.
Stocking, M. L. & Lord, F. M. (1983). Developing a Common Metric in Item Response Theory. Applied Psychological Measurement, 7(2), 201-211.
Swaminathan, H., & Gifford, J. A. (1986). Bayesian estimation in the three-parameter logistic model. Psychometrika, 51, 589-601.
Wang, T. (2005). An Alternative Continuization Method to the Kernel Method in von Davier, Holland, and Thayer's (2004) Test Equating Framework.
van der Linden, W. J., Veldkamp, B. P., & Carlson, J. E. (2004). Optimizing Balanced Incomplete Block Designs for Educational Assessments. Applied Psychological Measurement, 28, 317-331.
von Davier, A. A., Holland, P. W., & Thayer, D. T. (2004). The kernel method of test equating. New York: Springer.
Yen, W. M. (1983). Tau-equivalence and equipercentile equating. Psychometrika, 48, 353-369.
Zimowski, M. F., Muraki, E., Mislevy, R. J., & Bock, R. D. (2003). BILOG-MG. Scientific Software International.