Evaluating the Utility of Different Item Presentation and Feedback Approaches with the Modified Angoff Method
Yi-Fang Wu; Hueying Tzou
Keywords: Angoff method; item-grouping; Reckase charts; standard setting
Abstract: Numerous standard setting methods have been developed to assist panels in estimating the performance of borderline examinees. Among them, the Angoff method is one of the most popular judgmental standard setting procedures, and its extensions, modifications, and variations are often applied in practice. Panelists play an important role in standard setting, especially in judgmental methods such as the Angoff method and its variations. Panelists' ability to accurately estimate borderline examinees' performance is to some extent subject to item difficulty, and once that accuracy is questioned, the validity of the performance standard is compromised. Therefore, a variety of procedures and several types of feedback have been developed to reduce inconsistency among panelists or within a single panelist.
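As a hypothetical illustration of the core computation behind the Angoff method (a minimal sketch, not the study's actual code or data), each panelist estimates the probability that a borderline examinee answers each item correctly, and the raw-score cutoff is the sum across items of the mean estimates:

```python
import statistics

def angoff_cut_score(ratings):
    """Compute a modified-Angoff cut score.

    ratings: dict mapping item id -> list of panelists' probability
    estimates that a borderline examinee answers the item correctly.
    Returns the raw-score cutoff: the sum of per-item mean estimates.
    """
    return sum(statistics.mean(ps) for ps in ratings.values())

# Three items rated by three panelists (hypothetical values):
ratings = {
    "item1": [0.60, 0.70, 0.65],
    "item2": [0.30, 0.40, 0.35],
    "item3": [0.80, 0.90, 0.85],
}
print(round(angoff_cut_score(ratings), 2))  # 1.85
```

In later rounds, panelists revise these estimates after receiving feedback, and the cut score is recomputed the same way.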
To compare different procedures embedded in the modified Angoff standard setting method for establishing cutoff scores on a large-scale achievement assessment, we designed two standard setting activities, each integrating different procedures to help panelists make more accurate estimates.
Two sets of data from a national achievement assessment in mathematics in Taiwan were used in the standard setting activities. Each set contained 104 operational multiple-choice items measuring students' grade-level mathematics ability. Twelve panelists participated in the 4th-grade standard setting activity, and the 6th-grade panel consisted of 14 panelists. All were mathematics educators, and some had prior experience with modified Angoff standard setting procedures.
The standard setting procedures involved two factors, each with two conditions: test items presented with or without item-grouping in advance, and two types of feedback, namely feedback with empirical p-values and feedback with IRT calibration/Reckase charts (Reckase, 1998, 2001). We used a generalizability analysis design to examine the improvement in consistency under the procedures described above. The item effect, the item difficulty effect (both within and between difficulty levels), and the panelist effect were of interest.
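A minimal sketch of how such variance components might be estimated for a fully crossed panelist × item design, using standard ANOVA estimators (illustrative only; the study's actual G-study design also modeled difficulty levels):

```python
def variance_components(X):
    """ANOVA variance-component estimates for a crossed
    panelist x item design.

    X: list of lists, X[p][i] = panelist p's estimate for item i.
    Returns (var_panelist, var_item, var_residual).
    """
    n_p, n_i = len(X), len(X[0])
    grand = sum(sum(row) for row in X) / (n_p * n_i)
    p_means = [sum(row) / n_i for row in X]
    i_means = [sum(X[p][i] for p in range(n_p)) / n_p for i in range(n_i)]
    # Sums of squares for panelists, items, and the residual.
    ss_p = n_i * sum((m - grand) ** 2 for m in p_means)
    ss_i = n_p * sum((m - grand) ** 2 for m in i_means)
    ss_tot = sum((X[p][i] - grand) ** 2
                 for p in range(n_p) for i in range(n_i))
    ss_res = ss_tot - ss_p - ss_i
    # Mean squares, then the usual expected-mean-square solutions.
    ms_p = ss_p / (n_p - 1)
    ms_i = ss_i / (n_i - 1)
    ms_res = ss_res / ((n_p - 1) * (n_i - 1))
    var_p = max((ms_p - ms_res) / n_i, 0.0)
    var_i = max((ms_i - ms_res) / n_p, 0.0)
    return var_p, var_i, ms_res

# Two panelists rating two items (hypothetical, purely additive data,
# so the residual component should be essentially zero):
X = [[0.5, 0.7],
     [0.6, 0.8]]
var_p, var_i, var_res = variance_components(X)
```

Tracking how the panelist component shrinks, and the item component grows, across rounds is exactly the kind of evidence the analysis below reports.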
First, the percentage of variance attributable to the item effect increased consistently from Round 1 to Round 3, while the percentage attributable to the panelist effect decreased across rounds: panelists' consistency improved. In addition, relatively more panelist variability was eliminated under the procedure with Reckase-chart feedback. Second, with or without item-grouping, panelists made increasingly similar performance estimates for items of similar difficulty as the rounds progressed. Finally, item-grouping combined with Reckase-chart feedback yielded the greatest improvement in intra-judge consistency: under this condition, the root mean square error estimates were the smallest and the estimates of the generalizability coefficients and intraclass correlation coefficients (ICCs) were the highest.
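The evaluation criteria above can be sketched as follows (hypothetical helper functions, not the study's code): an RMSE comparing mean panelist estimates with empirical item p-values, and a generalizability coefficient for mean ratings with items as the object of measurement:

```python
def rmse(estimates, p_values):
    """Root mean square error between mean panelist estimates and
    empirical item p-values; smaller values indicate estimates
    closer to observed item difficulty."""
    n = len(estimates)
    return (sum((e - p) ** 2 for e, p in zip(estimates, p_values)) / n) ** 0.5

def g_coefficient(var_item, var_residual, n_panelists):
    """Generalizability coefficient for ratings averaged over
    n_panelists: the share of observed-score variance attributable
    to true item differences."""
    return var_item / (var_item + var_residual / n_panelists)

# Hypothetical values: two items, and variance components from a G-study.
print(rmse([0.6, 0.4], [0.5, 0.5]))   # ~0.1
print(g_coefficient(0.02, 0.01, 12))  # ~0.96
```

Under this reading, a condition with smaller RMSE and larger coefficients is the one in which panelists' judgments track item difficulty most consistently.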
Panelists are capable of distinguishing hard from easy items; nevertheless, with the help of item-grouping by difficulty and feedback with Reckase charts, the variability induced by item difficulty, which affects panelists' consistency, was reduced as much as possible. This finding is clearly beneficial for defending the validity of the standard.
Allen, N. L., Jenkins, F., Kulick, E., & Zelenak, C. A. (1997). Technical report of the NAEP 1996 state assessment program in mathematics. Washington, DC: National Center for Education Statistics.
American Educational Research Association, American Psychological Association, and National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
Angoff, W. H. (1971). Scales, norms, and equivalent scores. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp.508-600). Washington, DC: American Council on Education.
Berk, R. A. (1986). A consumer's guide to setting performance standards on criterion-referenced tests. Review of Educational Research, 56(1), 137-172.
Brandon, P. R. (2004). Conclusions about frequently studied modified Angoff standard-setting topics. Applied Measurement in Education, 17(1), 59-88.
Buckendahl, C. W., Smith, R. W., Impara, J. C., & Plake, B. S. (2002). A comparison of Angoff and Bookmark standard setting methods. Journal of Educational Measurement, 39(3), 253-263.
Cizek, G. J. (2001). Conjectures on the rise and call of standard setting: An introduction to context and practice. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods, and perspectives (pp.1-17). Mahwah, NJ: Lawrence Erlbaum Associates.
Cizek, G. J. (2006). Standard setting. In S. M. Downing, & T. M. Haladyna (Eds.), Handbook of test development (pp.225-258). Mahwah, NJ: Lawrence Erlbaum Associates.
Cizek, G. J., Bunch, M. B., & Koons, H. (2004). Setting performance standards: Contemporary methods. Educational Measurement: Issues and Practice, 23(4), 31-50.
Cizek, G. J., & Bunch, M. B. (2007). Standard setting—A guide to establishing and evaluating performance standards on tests. Thousand Oaks, CA: Sage.
Clauser, B. E., Swanson, D. B., & Harik, P. (2002). Multivariate generalizability analysis of the impact of training and examinee performance information on judgments made in an Angoff-style standard-setting procedure. Journal of Educational Measurement, 39(4), 269-290.
Ferdous, A. A., & Plake, B. S. (2005). Understanding the factors that influence decisions of panelists in a standard-setting study. Applied Measurement in Education, 18(3), 257-267.
Goodwin, L. D. (1999). Relations between observed item difficulty levels and Angoff minimum passing levels for a group of borderline examinees. Applied Measurement in Education, 12, 13-28.
Hambleton, R. K. (2001). Setting performance standards on educational assessments and criteria for evaluating the process. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods, and perspectives (pp. 89-116). Mahwah, NJ: Lawrence Erlbaum Associates.
Hambleton, R. K., & Pitoniak, M. J. (2006). Setting performance standards. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 433-470). Washington, DC: American Council on Education/Praeger.
Impara, J. C., & Plake, B. S. (1997). Standard setting: An alternative approach. Journal of Educational Measurement, 34(4), 353-366.
Jaeger, R. M. (1995). Setting performance standards through two-stage judgmental policy capturing. Applied Measurement in Education, 8(1), 15-40.
Kane, M. (1987). On the use of IRT models with judgmental standard setting procedures. Journal of Educational Measurement, 24(4), 333-345.
Kane, M. (1994). Validating the performance standards associated with passing scores. Review of Educational Research, 64(3), 425-461.
Lorge, I., & Kruglov, L. K. (1953). The improvement of the estimates of test difficulty.
Educational and Psychological Measurement, 13, 34-46.
MacCann, R. G., & Stanley, G. (2006, January). The use of Rasch modeling to improve standard setting. Practical Assessment, Research & Evaluation, 11(2). Retrieved from http://pareonline.
McLaughlin, D. H. (1993). Validity of the 1992 NAEP achievement-level setting process. In L. Shepard, R. Glaser, R. Linn, & G. Bohrnstedt (Eds.), Setting performance standards for student achievement tests: Background studies (pp.81-122). Stanford, CA: National Academy of Education.
Matter, J. D. (2000). Investigation of the validity of the Angoff standard setting procedure for multiple-choice items (Unpublished doctoral dissertation). University of Massachusetts, Amherst, MA.
Maurer, T. J., Alexander, R. A., Callahan, C. M., Bailey, J. J., & Dambrot, F. H. (1991).
Methodological and psychometric issues in setting cutoff scores using the Angoff method. Personnel Psychology, 44, 235-262.
National Assessment Governing Board (2006). Writing framework and specifications for the 2007 National Assessment of Educational Progress. Washington, DC: National Assessment Governing Board.
Pitoniak, M. J. (2003). Standard setting methods for complex licensure examinations (Unpublished
doctoral dissertation). University of Massachusetts, Amherst, MA.
Plake, B. S., & Impara, J. C. (2001). Ability of panelists to estimate item performance for a target group of candidates: An issue in judgmental standard setting. Educational Assessment, 7(3), 87-97.
Plake, B. S., & Melican, G. J. (1989). Effects of item context on intrajudge consistency of expert judgments via the Nedelsky standard setting method. Educational and Psychological
Measurement, 49(1), 45-51.
Plake, B. S., Melican, G. J., & Mills, C. N. (1991). Factors influencing intrajudge consistency during standard-setting. Educational Measurement: Issue and Practice, 10(2), 15-25.
Raymond, M. R., & Reid, J. B. (2001). Who made thee a judge? Selecting and training participants for standard setting. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods, and perspectives (pp. 119-157). Mahwah, NJ: Lawrence Erlbaum Associates.
Reckase, M. D. (1998). Setting standards to be consistent with an IRT item calibration. Iowa City, IA: ACT.
Reckase, M. D. (2000). The ACT/NAGB standard setting process: How "modified" does it have to be before it is no longer a modified-Angoff process? Paper presented at the Annual Meeting of the American Educational Research Association, New Orleans, LA. (ED442825)
Reckase, M. D. (2001). Innovative methods for helping standard-setting participants to perform their task: The role of feedback regarding consistency, accuracy and impact. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods, and perspectives (pp. 159-173). Mahwah, NJ: Lawrence Erlbaum Associates.
Reckase, M. D. (2006). Some criteria for evaluating the functioning of standard-setting methods with application to bookmark and modified Angoff methods. Educational Measurement:
Issues and Practice, 25(2), 4-18.
Schraw, G., & Roedel, T. D. (1994). Test difficulty and judgment bias. Memory and Cognition, 22(1), 63-69.
Shepard, L. A. (1995). Implications for standard setting of the National Academy of Education evaluation of National Assessment of Educational Progress achievement levels. Proceedings from the Joint Conference on Standard Setting for Large-Scale Assessments. Washington, DC: National Assessment Governing Board and National Center for Education Statistics.
Shepard, L., Glaser, R., Linn, R., & Bohrnstedt, G. (1993). Setting performance standards for student achievement tests. Stanford, CA: National Academy of Education.
Sireci, S. G., & Biskin, B. H. (1992). A survey of national professional licensure examination programs. CLEAR Exam Review, 3, 21-25.
Smith, R. L., & Smith, J. K. (1988). Differential use of item information by judges using Angoff and Nedelsky procedures. Journal of Educational Measurement, 25(4), 259-274.
Taube, K. T. (1997). The incorporation of empirical item difficulty data into the Angoff standard-setting procedure. Evaluation & the Health Professions, 20, 479-498.
van der Linden, W. J. (1982). A latent trait method for determining intrajudge inconsistency in the Angoff and Nedelsky techniques of standard setting. Journal of Educational Measurement, 19(4), 295-308.
van der Linden, W. J. (1986). A latent trait method for determining intrajudge inconsistency in the Angoff and Nedelsky techniques of standard setting (Addendum). Journal of Educational Measurement, 23(3), 265-266.
Verhoeven, B. H., van der Steeg, A. F. W., Scherpbier, A. F. F. A., Muijtjens, A. M. M., Verwijnen, & van der Vleuten, C. P. M. (1999). Reliability and credibility of an Angoff standard setting procedure in progress testing using recent graduates as judges. Medical Education, 33, 832-837.
Wuensch, K. L. (2003). Inter-rater agreement. Retrieved from http://core.ecu.edu/psyc/wuenschk/docs30/InterRater.doc