試題呈現與回饋模式對Angoff標準設定結果一致性提升效益之比較研究

Evaluating the Utility of Different Item Presentation and Feedback Approaches with the Modified Angoff Method

吳宜芳;鄒慧英
Yi-Fang Wu;Hueying Tzou


Journal issue: Vol. 6, No. 4, "Testing and Assessment"
Editor-in-chief: 林世華, Department of Educational Psychology and Counseling, National Taiwan Normal University
System ID: vol023_02
Topic: Testing and Assessment
Publication year: 2010
Authors: 吳宜芳; 鄒慧英
Authors (English): Yi-Fang Wu; Hueying Tzou
Title: 試題呈現與回饋模式對Angoff標準設定結果一致性提升效益之比較研究
Title (English): Evaluating the Utility of Different Item Presentation and Feedback Approaches with the Modified Angoff Method
Pages: 34
Keywords: Angoff method; item-grouping; Reckase charts; standard setting
Affiliations: Doctoral student, Educational Measurement and Statistics, The University of Iowa, USA; Professor, Graduate Institute of Measurement and Statistics, National University of Tainan
Manuscript length: 17,581 characters
Author expertise: Testing and assessment
Submission date: 2010/10/20
Abstract (Chinese): Among the many standard setting methods, the Angoff method and its variations, extensions, and modifications are among the most widely used procedures in educational practice. However, panelists carrying out an Angoff standard setting face considerable cognitive challenges in conceptualizing the borderline examinee and estimating that examinee's probability of answering each item correctly. The influence of item characteristics (e.g., item difficulty) on inter- and intra-panelist consistency may in turn affect the validity of the resulting standard. Accordingly, this study incorporated feedback based on ordered empirical p-values, feedback based on Reckase charts, and item presentation with or without advance grouping into a modified Angoff standard setting procedure, in order to improve the consistency of the setting results and to compare the relative merits of these approaches.
The standard setting in this study was conducted after test administration and is thus of the a posteriori decision type. The study examines how different feedback modes and the grouped versus ungrouped presentation of items affect standard setting results, so that the relative merits of the two approaches can be compared; this is what distinguishes the study. In addition, the two modifications are expected to give panelists a better awareness of item difficulty, thereby improving inter- and intra-panelist consistency, raising the consistency of the setting results, and contributing to the validity of the standard; this is the study's practical contribution.
Abstract (English): Numerous standard setting methods have been developed to assist panelists in estimating the performance of borderline examinees. Among them, the Angoff method is one of the most popular judgmental standard setting procedures, and its extensions, modifications, and variations are often applied in practice. Panelists play a central role in standard setting, especially in judgmental methods such as the Angoff method and its variations. Their ability to accurately estimate borderline examinees' performance is to some extent affected by item difficulty, and when that accuracy is in question, the validity of the performance standard is undermined. A variety of procedures and several types of feedback have therefore been developed to reduce inconsistency among panelists or within a single panelist.
To compare different procedures embedded in the modified Angoff standard setting method for establishing cutoff scores on a large-scale achievement assessment, we designed two standard setting activities, each integrating procedures intended to help panelists make more accurate estimates.
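For reference, in the conventional (modified) Angoff method each panelist estimates, for every item, the probability that a borderline examinee would answer it correctly, and the cut score is the panelist-average of the summed estimates. This is the standard textbook formulation (e.g., Cizek & Bunch, 2007), given here as context rather than as the exact computation used in the study:

$$ c = \frac{1}{J}\sum_{j=1}^{J}\sum_{i=1}^{n}\hat{p}_{ij}, $$

where $\hat{p}_{ij}$ is panelist $j$'s probability estimate for item $i$, $J$ is the number of panelists, and $n$ is the number of items.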
Two sets of data from a national achievement assessment in mathematics in Taiwan were used in the standard setting activities. Each set contained 104 operational multiple-choice items measuring students' grade-level math ability. Twelve panelists participated in the 4th grade standard setting activity, and the 6th grade panel consisted of 14 panelists. All were math educators, and some had prior experience with modified Angoff standard setting procedures.
The standard setting procedures crossed two factors, each with two conditions: item presentation with or without advance item-grouping, and type of feedback, either empirical p-values or IRT-calibration Reckase charts (Reckase, 1998, 2001). A generalizability analysis design was used to examine the improvement in consistency under each of these procedures. The item effect, the item difficulty effect (both within and between difficulty levels), and the panelist effect were of interest.
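The abstract does not spell out the full facet structure of the analysis, so the following is a generic sketch of the simplest crossed panelist-by-item design underlying such a generalizability analysis; the study's actual design additionally separates items by difficulty level. A rating $r_{ij}$ is decomposed as

$$ r_{ij} = \mu + \nu_i + \nu_j + \nu_{ij,e}, $$

with variance components $\sigma^2_i$ (items), $\sigma^2_j$ (panelists), and $\sigma^2_{ij,e}$ (interaction confounded with error). Treating items as the objects of measurement and panelists as the facet of generalization, an index of dependability for a panel of $n_j$ panelists takes the familiar form

$$ \Phi = \frac{\sigma^2_i}{\sigma^2_i + \left(\sigma^2_j + \sigma^2_{ij,e}\right)/n_j}, $$

so a growing share of variance attributable to items, together with a shrinking share attributable to panelists, indicates improving consistency.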
First, the percentage of variance attributable to the item effect increased consistently from Round 1 to Round 3, while the percentage attributable to the panelist effect decreased across rounds. Panelists' consistency thus improved, and relatively more of the panelist variability was eliminated under feedback with Reckase charts. Second, with or without item-grouping, panelists came to make similar estimates of item performance for items of similar difficulty as the rounds progressed. Finally, item-grouping combined with Reckase-chart feedback yielded the greatest improvement in intra-judge consistency: under this condition the root mean square error estimates were the smallest, and the generalizability coefficients and intraclass correlation coefficients (ICCs) were the highest.
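The intra-judge root mean square error referred to here is conventionally computed by comparing each panelist's item estimates against the IRT model-implied probabilities of success at the cut point, which is also the quantity a Reckase chart tabulates for each item across the ability scale. The abstract does not give the study's exact formula, so the usual form (cf. van der Linden, 1982; Reckase, 2001) is shown as an assumption:

$$ \mathrm{RMSE}_j = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{p}_{ij} - P_i(\hat{\theta}_c)\right)^2}, $$

where $P_i(\hat{\theta}_c)$ is the model-implied probability of a correct response to item $i$ at the cut score ability $\hat{\theta}_c$ implied by panelist $j$'s ratings.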
Panelists are capable of distinguishing hard items from easy ones; moreover, with the help of grouping items by difficulty and of Reckase-chart feedback, the variability induced by item difficulty, which impairs panelists' consistency, was reduced as far as possible. This finding is clearly beneficial for defending the validity of the standard.
References: 吳裕益 (1986). 標準參照測驗通過分數設定方法之研究 [A study of methods for setting passing scores on criterion-referenced tests] (Unpublished doctoral dissertation). National Chengchi University, Taipei, Taiwan.
吳裕益 (1988). 標準參照測驗通過分數設定方法之研究 [A study of methods for setting passing scores on criterion-referenced tests]. 測驗年刊, 35, 159-166.
鄭明長、余民寧 (1994). 各種通過分數設定方法之比較 [A comparison of methods for setting passing scores]. 測驗年刊, 41, 19-40.
Allen, N. L., Jenkins, F., Kulick, E., & Zelenak, C. A. (1997). Technical report of the NAEP 1996 state assessment program in mathematics. Washington, DC: National Center for Education Statistics.
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
Angoff, W. H. (1971). Scales, norms, and equivalent scores. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 508-600). Washington, DC: American Council on Education.
Berk, R. A. (1986). A consumer's guide to setting performance standards on criterion-referenced tests. Review of Educational Research, 56(1), 137-172.
Brandon, P. R. (2004). Conclusions about frequently studied modified Angoff standard-setting topics. Applied Measurement in Education, 17(1), 59-88.
Buckendahl, C. W., Smith, R. W., Impara, J. C., & Plake, B. S. (2002). A comparison of Angoff and Bookmark standard setting methods. Journal of Educational Measurement, 39(3), 253-263.
Cizek, G. J. (2001). Conjectures on the rise and call of standard setting: An introduction to context and practice. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods, and perspectives (pp. 1-17). Mahwah, NJ: Lawrence Erlbaum Associates.
Cizek, G. J. (2006). Standard setting. In S. M. Downing & T. M. Haladyna (Eds.), Handbook of test development (pp. 225-258). Mahwah, NJ: Lawrence Erlbaum Associates.
Cizek, G. J., Bunch, M. B., & Koons, H. (2004). Setting performance standards: Contemporary methods. Educational Measurement: Issues and Practice, 23(4), 31-50.
Cizek, G. J., & Bunch, M. B. (2007). Standard setting—A guide to establishing and evaluating performance standards on tests. Thousand Oaks, CA: Sage.
Clauser, B. E., Swanson, D. B., & Harik, P. (2002). Multivariate generalizability analysis of the impact of training and examinee performance information on judgments made in an Angoff-style standard-setting procedure. Journal of Educational Measurement, 39(4), 269-290.
Ferdous, A. A., & Plake, B. S. (2005). Understanding the factors that influence decisions of panelists in a standard-setting study. Applied Measurement in Education, 18(3), 257-267.
Goodwin, L. D. (1999). Relations between observed item difficulty levels and Angoff minimum passing levels for a group of borderline examinees. Applied Measurement in Education, 12, 13-28.
Hambleton, R. K. (2001). Setting performance standards on educational assessments and criteria for evaluating the process. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods, and perspectives (pp. 89-116). Mahwah, NJ: Lawrence Erlbaum Associates.
Hambleton, R. K., & Pitoniak, M. J. (2006). Setting performance standards. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 433-470). Washington, DC: American Council on Education.
Impara, J. C., & Plake, B. S. (1997). Standard setting: An alternative approach. Journal of Educational Measurement, 34(4), 353-366.
Jaeger, R. M. (1995). Setting performance standards through two-stage judgmental policy capturing. Applied Measurement in Education, 8(1), 15-40.
Kane, M. (1987). On the use of IRT models with judgmental standard setting procedures. Journal of Educational Measurement, 24(4), 333-345.
Kane, M. (1994). Validating the performance standards associated with passing scores. Review of Educational Research, 64(3), 425-461.
Lorge, I., & Kruglov, L. K. (1953). The improvement of the estimates of test difficulty.
Educational and Psychological Measurement, 13, 34-46.
MacCann, R. G., & Stanley, G. (2006, January). The use of Rasch modeling to improve standard setting. Practical Assessment, Research & Evaluation, 11(2). Retrieved from http://pareonline.net/pdf/v11n2.pdf
McLaughlin, D. H. (1993). Validity of the 1992 NAEP achievement-level setting process. In L. Shepard, R. Glaser, R. Linn, & G. Bohrnstedt (Eds.), Setting performance standards for student achievement tests: Background studies (pp. 81-122). Stanford, CA: National Academy of Education.
Matter, J. D. (2000). Investigation of the validity of the Angoff standard setting procedure for multiple-choice items (Unpublished doctoral dissertation). University of Massachusetts, Amherst, MA.
Maurer, T. J., Alexander, R. A., Callahan, C. M., Bailey, J. J., & Dambrot, F. H. (1991).
Methodological and psychometric issues in setting cutoff scores using the Angoff method. Personnel Psychology, 44, 235-262.
National Assessment Governing Board. (2006). Writing framework and specifications for the 2007 National Assessment of Educational Progress. Washington, DC: National Assessment Governing Board.
Pitoniak, M. J. (2003). Standard setting methods for complex licensure examinations (Unpublished
doctoral dissertation). University of Massachusetts, Amherst, MA.
Plake, B. S., & Impara, J. C. (2001). Ability of panelists to estimate item performance for a target group of candidates: An issue in judgmental standard setting. Educational Assessment, 7(3), 87-97.
Plake, B. S., & Melican, G. J. (1989). Effects of item context on intrajudge consistency of expert judgments via the Nedelsky standard setting method. Educational and Psychological
Measurement, 49(1), 45-51.
Plake, B. S., Melican, G. J., & Mills, C. N. (1991). Factors influencing intrajudge consistency during standard-setting. Educational Measurement: Issues and Practice, 10(2), 15-25.
Raymond, M. R., & Reid, J. B. (2001). Who made thee a judge? Selecting and training participants for standard setting. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods, and perspectives (pp. 119-157). Mahwah, NJ: Lawrence Erlbaum Associates.
Reckase, M. D. (1998). Setting standards to be consistent with an IRT item calibration. Iowa City, IA: ACT.
Reckase, M. D. (2000). The ACT/NAGB standard setting process: How "modified" does it have to be before it is no longer a modified-Angoff process? Paper presented at the annual meeting of the American Educational Research Association, New Orleans, LA. (ERIC Document Reproduction Service No. ED442825)
Reckase, M. D. (2001). Innovative methods for helping standard-setting participants to perform their task: The role of feedback regarding consistency, accuracy and impact. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods, and perspectives (pp. 159-173). Mahwah, NJ: Lawrence Erlbaum Associates.
Reckase, M. D. (2006). Some criteria for evaluating the functioning of standard-setting methods with application to bookmark and modified Angoff methods. Educational Measurement:
Issues and Practice, 25(2), 4-18.
Schraw, G., & Roedel, T. D. (1994). Test difficulty and judgment bias. Memory & Cognition, 22(1), 63-69.
Shepard, L. A. (1995). Implications for standard setting of the National Academy of Education evaluation of National Assessment of Educational Progress achievement levels. Proceedings
from the Joint Conference on Standard Setting for Large-Scale Assessments. Washington, D.C.: National Assessment Governing Board and National Center for Education Statistics.
Shepard, L., Glaser, R., Linn, R., & Bohrnstedt, G. (1993). Setting performance standards for student achievement tests. Stanford, CA: National Academy of Education.
Sireci, S. G., & Biskin, B. H. (1992). A survey of national professional licensure examination programs. CLEAR Exam Review, 3, 21-25.
Smith, R. L., & Smith, J. K. (1988). Differential use of item information by judges using Angoff and Nedelsky procedures. Journal of Educational Measurement, 25(4), 259-274.
Taube, K. T. (1997). The incorporation of empirical item difficulty data into the Angoff standard-setting procedure. Evaluation & the Health Professions, 20, 479-498.
van der Linden, W. J. (1982). A latent trait method for determining intrajudge inconsistency in the Angoff and Nedelsky techniques of standard setting. Journal of Educational Measurement, 19(4), 295-308.
van der Linden, W. J. (1986). A latent trait method for determining intrajudge inconsistency in the Angoff and Nedelsky techniques of standard setting (Addendum). Journal of Educational Measurement, 23(3), 265-266.
Verhoeven, B. H., van der Steeg, A. F. W., Scherpbier, A. J. J. A., Muijtjens, A. M. M., Verwijnen, G. M., & van der Vleuten, C. P. M. (1999). Reliability and credibility of an Angoff standard setting procedure in progress testing using recent graduates as judges. Medical Education, 33, 832-837.
Wuensch, K. L. (2003). Inter-rater agreement. Retrieved from http://core.ecu.edu/psyc/wuenschk/docs30/InterRater.doc