Often a written test is used as an inexpensive substitute for a performance measure. A specified minimum performance level or probability of successful performance can be translated into a minimum passing score for the written test most efficiently by measuring the performance of students whose written test scores are near the desired cutoff score. Stochastic approximation methods accomplish this purpose. The up-and-down method and the Robbins-Monro process are presented, discussed, and...
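
As a rough illustration of the Robbins-Monro process named above, the sketch below searches for the written-test score at which the probability of successful performance equals a target value. The function name `robbins_monro`, the gain sequence `step / n`, and the logistic success curve in the usage lines are assumptions made for this example, not details taken from the paper.

```python
import math
import random


def robbins_monro(observe, target_p, x0, step=10.0, n_trials=200):
    """Estimate the score x at which P(success | x) equals target_p.

    observe(x) returns the performance outcome (1 = success, 0 = failure)
    of one examinee whose written-test score is near x. The estimate moves
    down after a success and up after a failure, with a gain of step / n
    that shrinks as trials accumulate.
    """
    x = x0
    for n in range(1, n_trials + 1):
        y = observe(x)
        x = x - (step / n) * (y - target_p)
    return x


# Hypothetical success curve, used only to simulate examinees.
def success_probability(score):
    return 1.0 / (1.0 + math.exp(-(score - 60.0) / 5.0))


rng = random.Random(0)


def simulated_examinee(score):
    # Draw one pass/fail performance outcome at the given written score.
    return 1 if rng.random() < success_probability(score) else 0


estimate = robbins_monro(simulated_examinee, target_p=0.8, x0=50.0,
                         step=100.0, n_trials=2000)
```

The gain constant matters in practice: a value that is too small slows convergence, so `step` would need tuning to the steepness of the real score-performance relationship.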

Some probabilistic illustrations of the reliability coefficient are provided to assist in interpretation of this measure. All explanations are derived under the assumption that the joint distribution of examinee scores from two parallel tests is well approximated by a bivariate normal distribution.
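
One probabilistic reading of this kind can be sketched with a standard bivariate normal result (Sheppard's formula): when two parallel-test scores are bivariate normal with correlation equal to the reliability, the probability that an examinee falls on the same side of the mean on both tests is 1/2 + arcsin(reliability)/π. Whether the paper uses this particular illustration is not stated in the abstract; the function name is hypothetical.

```python
import math


def same_side_probability(reliability):
    """P(both parallel-test scores fall on the same side of the mean).

    Assumes a bivariate normal joint distribution whose correlation
    equals the reliability coefficient.
    """
    if not -1.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie in [-1, 1]")
    return 0.5 + math.asin(reliability) / math.pi


# A reliability of 0.5 gives 0.5 + (pi / 6) / pi, i.e. about two in three.
p_half = same_side_probability(0.5)
```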

The subject of constructing criterion-referenced tests is often researched, but many technical problems remain to be satisfactorily resolved. Foremost, criterion-referenced test developers need a comprehensive set of steps for construction. In this paper, 14 logical steps for building criterion-referenced tests that refer to several different applications and allow for objective and non-objective formats are offered: 1) preliminary considerations; 2) identification of possible content; 3)...

A measure of the usefulness of a pass/fail testing decision procedure is the ratio of the utility of the given procedure to the utility of a procedure based on knowledge of scores on a criterion measure. It is computed from scores for a representative sample of persons tested. Utility functions may be specified by the test user or set by convention to be linear with unit slope. The utility ratio can be used for comparing tests or for selecting test items. (Author)

Research indicating that different cut-off points result from the use of different standard-setting techniques leaves decision makers with a disturbing dilemma: Which standard-setting method is best? This investigation of the reliability and validity of 10 different standard-setting approaches was designed to provide information that might help answer that question. The 10 procedures for setting a standard on the Missouri College English Test included: a normative method (33rd percentile), the...

This paper provides formulas for expected true-score measures and reliability of binary items as a function of their Rasch difficulty parameters when the trait distribution is normal or logistic. With the proposed formula, one can evaluate the theoretical values of classical reliability indexes for norm-referenced and criterion-referenced interpretations without information about raw-score or trait scores of persons from the target population. This is achieved by representing the theoretical...

Some general questions about minimum competency tests are discussed, and various methods of setting standards are reviewed with major attention devoted to those methods used for dichotomizing a continuum. Methods reviewed under the heading of Absolute Judgments of Test Content include Nedelsky's, Angoff's, Ebel's, and Jaeger's. These methods are compared and a preference for Jaeger's approach is stated. Under Standards Based on Judgments about Groups, the Zieky and Livingston contrasting group...

In the spring of 1975, a total of 634 colleges and universities were surveyed to determine the institutions' use of the College Level Examination Program (CLEP) General Examinations to award credit. Of the total, 535 institutions completed and returned usable questionnaires. This report presents a full account of the survey and its findings. Some of the principal findings are: (1) a majority of the CLEP-user institutions permit any student to earn credit through the General Examinations...

A general model along with four illustrations is presented for the consideration of budgetary constraints in the setting of passing scores in instructional programs involving remedial action for poor test performers. Budgetary constraints normally put an upper limit on any choice of passing score. Given relevant information, this limit may be determined. Alternately, ways to assess the budgetary consequences associated with a given passing score are provided. Such information would be useful in...

Wilcox (1977) examines two methods of estimating the probability of a false-positive or false-negative decision with a mastery test. Both procedures make assumptions about the form of the true score distribution which might not give good results in all situations. In this paper, upper and lower bounds on the two possible error types are described which make no assumption about the form of the true score distribution. Illustrations are given on how these bounds might be used to determine the...

This paper presents comparisons among three item-selection criteria for the sequential probability ratio test. The criteria were compared in terms of their efficiency in selecting items, as indicated by average test length and the percentage of correct decisions. The item-selection criteria applied in this study were the Fisher information function, the Kullback-Leibler information function, and a weighted log-odds ratio. Also examined were the effects of the cutoff scores, the width of the...
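
For readers unfamiliar with the sequential probability ratio test, a minimal sketch of Wald's stopping rule for a mastery decision follows. It is a simplification: every item is assumed to have the same probability of a correct response for nonmasters and for masters (`p_nonmaster`, `p_master`), whereas the study selects items by information criteria. All names and default error rates are assumptions for illustration.

```python
import math


def sprt_decision(responses, p_nonmaster, p_master, alpha=0.05, beta=0.05):
    """Classify an examinee with Wald's sequential probability ratio test.

    responses is a sequence of 1 (correct) / 0 (incorrect) item scores.
    Returns (decision, items_used), where decision is "master",
    "nonmaster", or "undecided" if the items run out first.
    """
    upper = math.log((1 - beta) / alpha)   # accept "master" at or above this
    lower = math.log(beta / (1 - alpha))   # accept "nonmaster" at or below this
    llr = 0.0                              # running log-likelihood ratio
    for n, correct in enumerate(responses, start=1):
        if correct:
            llr += math.log(p_master / p_nonmaster)
        else:
            llr += math.log((1 - p_master) / (1 - p_nonmaster))
        if llr >= upper:
            return "master", n
        if llr <= lower:
            return "nonmaster", n
    return "undecided", len(responses)
```

Average test length, one of the efficiency measures mentioned above, corresponds to the mean of `items_used` over examinees.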

Four recent indices emphasizing the interrelationships of score distribution shape, modality, mean, and variance were investigated to determine the reliability of mastery tests. Attention was focused on the values of the indices when the cutoff score was near to or far from the modes of distribution. Five types of score distributions were examined: bell shaped; highly negatively skewed unimodal; bimodal with a stronger mode at the upper end; symmetric bimodal with modes well separated; and...

Cut scores, quartile ranking, sample size, and overall classification scheme were studied as personnel selection procedures in two samples. The first was 120 simulated observations of employee scores based on actual selection procedures for applicants for administrative assistant positions. The other sample was composed of test results for 73 applicants for a municipality staff assistant position. It was apparent that overall ranking of candidates may result in loss in cost and staff time, since...

Since the inception of the FCAT, the scaled scores students receive have been summarized by categorizing them into levels of performance. For every single test (in one subject area, at one grade level, for one year) the State of Florida has carefully determined cutoff scores that classify performance into 5 different levels of proficiency. It seems quite natural to refer to the percentage of students beyond a particular cutoff, as when one says a certain percentage of students scored at Level 3...

Based on the experiences of four equating studies conducted by the Austin (Texas) Independent School District, a practical "cookbook" approach to test equating is presented. Three types of equating procedures are discussed: choosing a cutoff score on a new instrument, predicting Y from X, and symmetric equating of X and Y. (BW)

The purpose of this study is to determine the extent of scale drift on a test that employs cut scores. It is essential to examine scale drift in a testing program using new forms that are often put on scale through a series of intermediate equatings (known as equating chains). This may cause equating error to accumulate to a point where scale scores are rendered incomparable across two parallel chains or time periods. The study examined whether scale drift occurred for two conditions (i.e.,...

A phenomenon noted in prior prediction of mathematics scores on the Content Mastery Examinations for Educators (CMEE) from the National Teacher Examinations Core Battery subtests of General Knowledge (NTE-GK) and Communication Skills (NTE-CS) was investigated. Prior research with 1991-1992 data sets had established an equation for the prediction of the CMEE mathematics required cut-score of 340 for Mississippi teachers from the NTE scores. The subsequent addition of more data sets revealed...

In two-stage course placement systems, students first take a screening test. Students who score at or above the screening test cutoff score "K" enroll directly in a standard college course, whereas those who score below "K" take a placement test. Students who subsequently score at or above the placement test cutoff "K1" also enroll in the standard course. Consequently, students in the standard course will not have placement scores below "K1." Moreover,...

The purpose of this study was to investigate whether systematic, non-zero differences between pairs of item bank b-values have occurred in the recent history of two licensure examinations. Licensing examinations were studied for two related health care professions (Program 1 and Program 2). A series of analysis of covariance models was fit to the data in order to investigate the magnitude of changes in item bank b-values and the relationship of any changes to variables indexing factors that...

The kernel of the Angoff method of standard setting (W. Angoff, 1971) would seem to be the judgment of whether a minimally competent person could answer an item on a test correctly or not. So it would seem that any procedure that requires independent judgment of the correctness or incorrectness of a response to items for a minimally acceptable examinee would merit being labeled as an Angoff method. Standard setting methods that use the Angoff kernel can be quite different in practice, but they...
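
The yes/no judgment described above can be turned into a cut score in a few lines. The sketch assumes one common aggregation, the mean across judges of each judge's count of items judged answerable by a minimally competent examinee; the abstract does not say which aggregation the paper favors, and the function name is hypothetical.

```python
def angoff_cut_score(judgments):
    """Cut score from yes/no Angoff-style judgments.

    judgments[j][i] is 1 if judge j believes a minimally competent
    examinee would answer item i correctly, else 0. Each judge's implied
    cut score is the number of "yes" items; the panel cut score is the
    mean of those counts.
    """
    if not judgments:
        raise ValueError("at least one judge is required")
    per_judge = [sum(row) for row in judgments]
    return sum(per_judge) / len(per_judge)


# Three hypothetical judges rating a four-item test.
panel = [[1, 0, 1, 1],
         [1, 1, 1, 0],
         [0, 0, 1, 1]]
cut = angoff_cut_score(panel)  # judges imply 3, 3, and 2 items
```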

This document contains the results of a standard setting conducted in January 2002 on the Delaware Student Testing Program (DSTP) Science and Social Studies tests at grades 4 and 6. Each standard setting process entailed convening four groups, one for each grade level and content area, and each group met for 2 days. At the standard setting, judges were asked to recommend only the cut point between Below the Standard and Meets the Standard, and the cut point between Meets the Standard and Exceeds...

Elements of arbitrariness in the standard setting process are explored, and an alternative to the use of cut scores is presented. The first part of the paper analyzes the use of cut scores in large-scale assessments, discussing three different functions: (1) cut scores define the qualifications used in assessments; (2) they simplify the reporting of achievement distributions; and (3) they allow for the setting of targets for such distributions. The second part of the paper gives a...

Two types of classification error are possible in competency tests: erroneous classification of an individual as a "master" of the subject (Type II error), and erroneous classification of a master as a "nonmaster" of the subject (Type I). If steps are taken to minimize Type II errors, an artificially high number of true masters will be classified as nonmasters. The remedy for this problem is to empirically correct for incorrect answers caused by irrelevant factors (fatigue,...

This report gives the basic definition and purpose of competency-based teacher education (CBTE) cut-off scores. It describes the basic characteristics of CBTE as a yes-no dichotomous decision regarding the presence of a specific ability or knowledge, which necessitates the establishment of a cut-off point to designate competency vs. incompetency on stated objectives. Statistical considerations for establishing CBTE cut-off scores are reviewed, and, based on test scores, two types of...

The United States Training and Employment Service General Aptitude Test Battery (GATB), first published in 1947, has been included in a continuing program of research to validate the tests against success in many different occupations. The GATB consists of 12 tests which measure nine aptitudes: General Learning Ability; Verbal Aptitude; Numerical Aptitude; Spatial Aptitude; Form Perception; Clerical Perception; Motor Coordination; Finger Dexterity; and Manual Dexterity. The aptitude scores are...

This study investigated the cut score setting process as it occurred in two large Midwestern school districts, focusing on how the teachers who were the instruments by which cut scores were set experienced the process. Eight standard setting workshops using the Angoff approach were observed. Workshops for mathematics, reading, or writing at grades 2, 5, and 8 involved panels of from 24 to 28 teachers, and the ninth grade workshop involved 15 teachers. In addition to observation data,...

Numerous techniques are available for determining cutoff scores for distinguishing between proficient and non-proficient examinees. One of the more commonly cited techniques for standard setting is the Nedelsky Method. In response to criticism of this method, Gross (1985) presented a revised Nedelsky technique. However, no research beyond that presented by Gross has yet appeared. This study examined and compared cutoff scores derived using the original and revised Nedelsky techniques and...

This handbook treats a restricted set of statistical procedures for addressing some of the most prevalent technical issues that arise in domain-referenced testing. The procedures discussed here were chosen because they do not necessitate extensive computations. The five major sections of the paper cover: (1) item analysis procedures for using data to help identify items that may be flawed; (2) a simple procedure for establishing a cutting score; (3) a procedure for establishing an advancement...

Two- and three-category versions of Nedelsky's procedure for setting minimum passing scores, based on item content, were compared. Graduate students acting as judges classified the response options on their midterm into two categories: (1) those which should be rejected as incorrect by a minimally performing (B average) student; and (2) those which should not. Another group of classmates was also allowed to categorize options as undecided. Comparisons of the resulting sets of passing scores...
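
The two-category procedure can be sketched as follows, using the usual Nedelsky computation: for each item, the minimally performing student is assumed to guess among the options not rejected, so the item contributes the reciprocal of the number of remaining options, and the passing score is the sum over items. How the three-category version scores "undecided" options is not given in the abstract and is not modeled here; the function name is hypothetical.

```python
def nedelsky_cut_score(n_options, n_rejected):
    """Minimum passing score under the two-category Nedelsky procedure.

    n_options[i]  -- number of response options on item i
    n_rejected[i] -- number of options a minimally performing student
                     should reject as incorrect on item i
    """
    total = 0.0
    for options, rejected in zip(n_options, n_rejected):
        remaining = options - rejected
        if remaining < 1:
            raise ValueError("the correct option can never be rejected")
        total += 1.0 / remaining  # chance of guessing correctly
    return total


# Hypothetical three-item test: 1/2 + 1/1 + 1/4 expected items correct.
score = nedelsky_cut_score([4, 4, 5], [2, 3, 1])
```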

The validity and dependability of functional competency tests for adults are examined as they relate to the information needs of instructional decision makers. Test data from the Adult Performance Level (APL) Program (funded by the U.S. Office of Education at the University of Texas at Austin) is used to illustrate key points. In the discussion of validity, the importance of a test's demonstrated relevance to functional competency is discussed in terms of the definitions of the competency....

Adult norms are shown as cutting scores for each of the aptitudes judged significant for a given occupation. Tables for converting adult scores to their ninth and tenth grade equivalents are included. The standard error of measurement is reported for each of the nine aptitudes of the General Aptitude Test Battery (GATB): intelligence, verbal aptitude, numerical aptitude, spatial aptitude, form perception, clerical perception, motor coordination, finger dexterity, and manual dexterity. The bulk...

The purpose of this study was to introduce a procedure to detect differential item functioning (DIF) particularly suitable for criterion-referenced tests and to demonstrate how this approach would affect the identification of DIF items using real data sets. The procedure based on item response theory (IRT) assesses DIF at a limited closed interval of thetas at which a cutoff score falls. Illustrative data showed that identification of DIF could be quite different with this unconventional...

Standard-setting research has yielded a rich array of more than 50 standard-setting procedures, but practitioners are likely to be confused about which to use. By synthesizing the accumulated research on standard setting and progress monitoring, this study developed a three-dimensional taxonomy for conceptualizing and operationalizing the various procedures: outcome versus growth assessment, theory-driven versus data-driven approach, and observed scale versus latent scale mapping. An empirical...

Until data is obtained concerning the regression of job-performance on test-performance, the setting of passing scores on professional licensing and certification examinations will contain some degree of arbitrariness. Data from performance domains suggests that some tests have differential validity and an adverse impact on minority groups, disproportionately excluding them from professional practice in the absence of any data indicating criterion-related validity of the licensing examinations....

A person's obtained score on a test provides an estimate of the individual's "true" score on that test. The obtained score is considered to have two parts, the true component and the error component. Classical test theory assumes that obtained scores for an individual over multiple administrations of the same test will lie symmetrically around the individual's true score. Confidence intervals can be used to determine the range within which the true score is apt to fall and to identify...
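
A confidence interval of the kind described can be sketched with the classical standard error of measurement, SEM = SD × sqrt(1 − reliability), and a symmetric normal-theory band around the obtained score. The function name and the default multiplier of 1.96 (roughly a 95% band) are assumptions for this example.

```python
import math


def true_score_interval(obtained, sd, reliability, z=1.96):
    """Symmetric confidence interval for the true score.

    Uses the classical standard error of measurement,
    SEM = sd * sqrt(1 - reliability), centered on the obtained score.
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie in [0, 1]")
    sem = sd * math.sqrt(1.0 - reliability)
    return obtained - z * sem, obtained + z * sem


# Hypothetical test: SD = 10, reliability = 0.91, so SEM is about 3.
low, high = true_score_interval(70.0, sd=10.0, reliability=0.91)
```

With these numbers the band runs from about 64.1 to 75.9, so an obtained score of 70 would not clearly separate an examinee from a cut score of, say, 72.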

This paper discusses how cut scores are set and used and how accurately they reflect student achievement. Regardless of the method used, the cut-score setting process is subjective. The cut score is the point on a score scale that separates one performance standard from another. Cut scores may also be used to set performance levels for open-response assessments like essay tests. This monograph discusses three methods of setting cut scores: (1) the modified Angoff method; (2) Contrasting Groups;...

The competency testing of teacher candidates has become almost universal in the United States. This paper explores some of the issues associated with competency testing, especially as they relate to teacher testing in California. The practice has its origin in concern that some of those who choose to pursue teaching careers in the elementary and secondary schools may lack the competencies that effective teaching requires. Stated in terms of decision errors in scoring competency tests, the fear...

The aim of the current study is to analyse the 2014 Post UTME scores of candidates at the University of Ibadan in order to establish a cutoff score using two standard-setting methods. Prospective candidates who seek admission to higher institutions are often denied admission through the Post UTME exercise. There is no single recommended approach to standard setting, and many methods exist. These include norm-referenced and criterion-referenced methods. The Angoff method is the...

Some of the different approaches to standard setting are discussed. Brief comments and references are offered concerning strategies that rely primarily on the use of expert judgment. Controversy surrounds methods that use expert judges, as well as those using test groups to set scores empirically. A minimax procedure developed by H. Huynh, an empirical procedure that invokes evaluation of the mathematical properties of various cutoffs through the application of decision theory, is illustrated....

This paper reports the results of using several alternative methods of setting cut scores. The methods used were: (1) a variation of the Angoff method (1971); (2) a variation of the borderline group method; and (3) an advanced impact method (G. Dillon, 1996). The results discussed are from studies undertaken to set the cut scores for fourth grade reading and mathematics in a public school system. Twenty-two elementary school teachers served as judges. The three methods tended to result in...

A single-administration classification reliability index is described that estimates the probability of consistently classifying examinees to mastery or nonmastery states as if those examinees had been tested with two alternate forms. The procedure is applicable to any test used for classification purposes, subdividing that test into two half-tests, each with a cut score, where the sum of the two half-test cut scores is equal to the cut score for the total test. The application of this...

The choice of a cutting score for criterion-related tests influences decisions related to classifying people into dichotomous categories. This paper proposes an empirical methodology for determining the best cutting score when there is information about the test score frequency distribution of test-takers defined as actually successful and actually unsuccessful on some criterion. The method is based on two statistics calculated for each possible cutting score. The first is a pure hit rate,...
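
The first statistic can be illustrated with a short sketch. Here the hit rate at a candidate cutting score is assumed to be the proportion of all test-takers classified correctly: actually successful people scoring at or above the cut plus actually unsuccessful people scoring below it. The abstract's second statistic is cut off in the source and is not reconstructed; function names are hypothetical.

```python
def hit_rates(successful_scores, unsuccessful_scores):
    """Hit rate for every observed score used as a candidate cutting score."""
    n = len(successful_scores) + len(unsuccessful_scores)
    if n == 0:
        raise ValueError("no scores supplied")
    candidates = sorted(set(successful_scores) | set(unsuccessful_scores))
    rates = {}
    for cut in candidates:
        hits = sum(s >= cut for s in successful_scores)      # correct passes
        hits += sum(u < cut for u in unsuccessful_scores)    # correct fails
        rates[cut] = hits / n
    return rates


def best_cutting_score(successful_scores, unsuccessful_scores):
    """Candidate cut with the highest hit rate (lowest cut wins ties)."""
    rates = hit_rates(successful_scores, unsuccessful_scores)
    return max(rates, key=rates.get)


# Hypothetical criterion groups.
rates = hit_rates([7, 8, 9, 6], [3, 4, 5, 7])
best = best_cutting_score([7, 8, 9, 6], [3, 4, 5, 7])
```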

This paper is a report of a study designed to develop recommendations on minimum qualifying scores for National Teacher Examinations (NTE) that are valid for certification and endorsement in Tennessee. The functions performed in the review of the NTE Core Battery and Specialty Area tests were conceptualized as panel activities. The number of panels required for the study was determined by the number of tests and the functions to be performed. The size of each panel was based on the scope of the...

The purpose of this investigation was to establish the effects of repeaters on test equating. Since consideration was not given to repeaters in test equating, such as in the derivation of equations by Angoff (1971), the hypothetical effect needed to be established. A case study was examined which showed results on a test as expected; overall mean was lower for repeaters. Applying these data to the available equating equations, it was shown that an additional 3 percent of the examinees was...

Issues involved in standard setting along with methods for standard setting are reviewed, with specific reference to their relevance for criterion referenced testing. Definitions are given of continuum and state models, and traditional and normative standard setting procedures. Since continuum models are considered more appropriate for criterion referenced testing purposes, they are examined in greater depth. The continuum models are subdivided into three categories: judgmental methods;...

There are two major aspects to the cutoff score issue in criterion-referenced (CR) measurement: whether a cutoff score is actually needed for a CR test, and the method for setting the cutoff score if one is used. There are many factors to be considered in determining a cutoff score. While cutoff scores are very often set arbitrarily (e.g., 80%), there have been many methods suggested to improve the quality of judgment or to utilize quantitative approaches of varying degrees of complexity and...

Student scores on the Florida Comprehensive Assessment Test (FCAT) tests are increasingly being used for administrative purposes. Promotion and placement decisions are largely governed by FCAT performance. Although efforts are made to make the FCAT scores available as soon as possible, for some purposes they are not on hand when needed. Additionally, as the importance of FCAT scores increases, it becomes more desirable to have some kind of early indication of how students are expected to score...

The traditional reliability coefficient and standard error of measurement are not adequate measures of reliability for tests used to make pass/fail decisions. Answering the important reliability questions requires estimation of the joint distribution of true and observed scores. Lord's "Method 20" estimates this distribution without the deficiencies of other methods. New output formats condense the estimated distribution into readily usable information, including a 2 x 2 contingency...

This paper describes (1) the procedures developed to set achievement levels for the National Assessment of Educational Progress (NAEP) that contribute to establishing the validity of the levels and (2) the research studies designed to collect information related to the validity of the achievement levels and the outcomes of the process. The central issue in examining the validity of standards is whether there is evidence of procedural validity. The standards must be generally accepted as...

Different procedures for setting cut points on achievement test scales provide the standard-setting participants with different information to support the unique judgment task associated with each procedure. This study examined how participants in standard settings used the different information from three different procedures in Kentucky in 2000. Three cut points had to be established on the scales for each of six content areas in each of three grades for Kentucky's assessment system. The...

