In practical applications of item response theory (IRT), item parameters are usually estimated first from a calibration sample. After treating these estimates as fixed and known, ability parameters are then estimated. However, the statistical inferences based on the estimated abilities can be misleading if the uncertainty of the item parameter estimates is ignored. Instead, estimated item parameters can be regarded as covariates measured with error. Along the line of this...

A method of adjusting for bias due to nonresponse to an alumni survey is demonstrated, based upon analysis of data from both on-time and late respondents. The responses are opinions of 254 alumni regarding how they rate their recently completed program of study within a college of an urban university. As part of the university's curriculum review process, a questionnaire was developed to obtain alumni ratings of their degree program. Twenty-two questionnaire items were selected from the...

This review examines the recently released Thomas P. Fordham Institute report, "Education Olympics: The Games in Review." Published just after the completion of the 2008 Beijing Summer Olympics, Education Olympics strategically parallels the international competition by awarding gold, silver and bronze medals to top performing countries based on indicators including scores from international assessments in reading, mathematics, and science. The report contrasts American students'...

Many arguments have been made against allowing examinees to review and change their answers after completing a computer adaptive test (CAT). These arguments include: (1) increased bias; (2) decreased precision; and (3) susceptibility to test-taking strategies. Results of simulations suggest that the strength of these arguments is reduced or eliminated by using specific information item selection (SIIS), under which items are selected to meet information targets, instead of the more common...

Qualitative research evokes rather stereotyped responses from the mainstream of social science. The following 10 standardized responses to the stimulus "qualitative research interview" (QRI) are discussed: (1) it is not scientific, only common sense; (2) it is not objective, but subjective; (3) it is not trustworthy, but biased; (4) it is not reliable, but rests on leading questions; (5) it is not intersubjective, as different interpreters find different meanings; (6) it is not...

A valuable extension of the single-rating regression discontinuity design (RDD) is a multiple-rating RDD (MRRDD). To date, four main methods have been used to estimate average treatment effects at the multiple treatment frontiers of an MRRDD: the "surface" method, the "frontier" method, the "binding-score" method, and the "fuzzy instrumental variables" method. This paper uses a series of simulations to evaluate the relative performance of each of these...

Fairness or unfairness may be an attribute of a test per se, or of its use, or of its statistical treatment. A hypothetical situation designed to be intrinsically fair and unbiased is used to show that analysis of covariance as a statistical method may introduce bias to the treatment of test scores. In contrast, equipercentile equating methods are shown, in this situation, to result in a fair and unbiased treatment of test scores. A graphic figure illustrates the comparison of the two...

A report from the School Choice Demonstration Project examines issues concerning the funding formula used for the Milwaukee Parental Choice Program (MPCP). It finds that the program generates a net saving to taxpayers in Wisconsin but imposes a significant fiscal burden on taxpayers in Milwaukee. However, these findings depend significantly on how many students would have attended public school if the voucher option were not available, as well as on the actual resource requirements for those...

The effects of monetary gratuities on response rates to mail surveys have been considered in a number of studies. This meta-analysis examined: (1) the nature of the population surveyed; (2) the effects of gratuities in relation to the number of follow-ups; (3) whether the gratuity was equally effective across different populations; (4) whether the gratuity was promised or enclosed; and (5) the year of publication of the study. The bulk of the studies was done in the context of market research....

The National Household Education Survey (NHES) was conducted for the first time in 1991 as a way to collect data on the early childhood education experiences of young children and participation in adult education. Because the NHES methodology is relatively new, field tests were necessary. A large field test of approximately 15,000 households was conducted during the fall of 1989 to examine several methodological issues. This report focuses on measurement errors arising from the use of proxy...

Federal assistance for special educational programs makes necessary the regular study of evaluations of thousands of innovations in compensatory education, bilingual education, and reading programs. The results are reported to the President and to Congress. However, investigating organizations find only a few programs with adequate evidence and thousands with faulty evaluation designs. Some of the most common faults are discussed, with examples. There are other factors which lower hopes. If...

To assess the relative effectiveness of public and private school environments, researchers must distinguish between the effects of the schools' programs and the students' innate abilities. Student background variables do not appear to account for all the important differences among students attending public and private schools. This document proposes and tests a model of school selection that relates school choice to a family's assessment of public and private school quality. The assumption is...

In this paper, the detection of response patterns aberrant from the Rasch model is considered. For this purpose, a new person fit index, recently developed by I. W. Molenaar (1987), and an iterative estimation procedure are used in a simulation study of Rasch model data mixed with aberrant data. Three kinds of aberrant response behavior are considered: (1) guessing to complete the test; (2) guessing in accordance with the three-parameter logistic model; and (3) responding with different...

Theoretically preferred item response theory (IRT) bias detection procedures were applied to both a mathematics achievement and vocabulary test. The data were from black seniors and white seniors on the High School and Beyond data files. We wished to account for statistical artifacts by conducting cross-validation or replication studies. Therefore, each analysis was repeated on randomly equivalent samples of blacks and whites (n's=1500). Furthermore, to establish a baseline for judging bias...

Among the most popular techniques used to estimate item response theory (IRT) parameters are those used in the LOGIST and BILOG computer programs. Because of its accuracy with smaller sample sizes or differing test lengths, BILOG has become the standard to which new estimation programs are compared. However, BILOG is still complex and labor-intensive, and the sample sizes required are still rather large. For this reason, J. Ramsay developed the program TESTGRAF (1989), which uses nonparametric...

A project was designed to develop and test a library of cassette audiotapes for improving the technical skills of educational researchers. Fourteen outstanding researchers from diverse fields were identified, and a short instructional tape was prepared by each. Subjects of the tapes included instructional objectives for intellectual skills, sources of bias in surveys, implications for the next 20 years of change, some precepts for conducting educational research, statistical interactions,...

This paper evaluates the logic underlying various criticisms of statistical significance testing and makes specific recommendations for scientific and editorial practice that might better increase the knowledge base. Reliance on the traditional hypothesis testing model has led to a major bias against nonsignificant results and to misinterpretation of significant results. A finding of statistical significance does not mean that the null hypothesis is false, since there are many factors affecting...

The Education Trust research report "Stuck Schools" suggests a framework for identifying chronically low-performing schools in need of turnaround. The study uses Maryland and Indiana to show that some low-performing schools make progress while others remain stagnant. The report has four serious problems of reliability and validity, however. First, the norm-referenced methodology guarantees "failed" schools independent of any true performance or improvement level by the...

When most people think of the perks of teaching, an image that comes to mind is a shiny apple presented by a gap-toothed pupil. A recent paper by Jason Richwine of the Heritage Foundation and Andrew Biggs of the American Enterprise Institute claims that public school teachers enjoy lavish benefits that are more valuable than their base pay and twice as generous as those of private-sector workers (Richwine and Biggs 2011). According to Richwine and Biggs, this makes teachers' total compensation...

The effects of variations in degree of range restriction and different subgroup sample sizes on the validity of several item bias detection procedures based on Item Response Theory (IRT) were investigated in a simulation study. The degree of range restriction for each of two subpopulations was varied by cutting the specified subpopulation ability distribution at different locations and retaining the upper portion of the distribution. It was found that range restriction did have an effect on the...

Of particular import to this study is collider bias originating from stratification on pretreatment variables forming an embedded M or bowtie structural design. That is, rather than assume an M structural design, which suggests that "X" is a collider but not a confounder, the authors adopt what they consider a more reasonable position: that "X" is both a collider and a confounder. Accordingly, in this study they examined the extent to which confounder-induced bias...

A procedure for predicting categorical outcomes using categorical predictor variables was described by Moonan. This paper describes a related technique which uses prior probabilities, updated by joint likelihoods, as classification criteria. The procedure differs from Moonan's in that the outcome having the greatest posterior probability is selected as the prediction regardless of misclassification cost. It also differs in method of screening and weighting the predictor variables, and treats...
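
The classification rule described above (prior probabilities updated by joint likelihoods, with the highest-posterior outcome chosen regardless of misclassification cost) can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names and the toy data are hypothetical, the joint likelihood is estimated from raw pattern counts, and the screening and weighting of predictors mentioned in the abstract are not modeled.

```python
from collections import Counter

def posterior(records, pattern):
    """Posterior probability of each outcome given a predictor pattern.

    records: list of (pattern, outcome) pairs, where pattern is a tuple of
    categorical predictor values. Assumption: the joint likelihood
    P(pattern | outcome) is estimated by simple relative frequency.
    """
    n = len(records)
    outcome_counts = Counter(outcome for _, outcome in records)
    pattern_counts = Counter(o for pat, o in records if pat == pattern)
    unnormalized = {}
    for outcome, count in outcome_counts.items():
        prior = count / n                              # P(outcome)
        likelihood = pattern_counts[outcome] / count   # P(pattern | outcome)
        unnormalized[outcome] = prior * likelihood
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("pattern was never observed in the records")
    return {o: v / total for o, v in unnormalized.items()}

def classify(records, pattern):
    """Pick the outcome with the greatest posterior probability."""
    post = posterior(records, pattern)
    return max(post, key=post.get)

# Hypothetical toy data: two categorical predictors, binary outcome.
toy_records = (
    [(("a", "x"), "pass")] * 3
    + [(("a", "x"), "fail")] * 1
    + [(("b", "y"), "fail")] * 2
)
```

With the toy data, `classify(toy_records, ("a", "x"))` selects the outcome that is most probable for that pattern; no cost matrix enters the decision.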

Previous studies have indicated that the reliability of test scores composed of testlets is overestimated by conventional item-based reliability estimation methods (S. Sireci, D. Thissen, and H. Wainer, 1991; H. Wainer, 1995; H. Wainer and D. Thissen, 1996; G. Lee and D. Frisbie). In light of these studies, it seems reasonable to ask whether the item-based estimation methods for the conditional standard errors of measurement (SEM) would provide underestimates for tests composed of testlets. The...

A study was conducted to investigate whether augmenting the calibration of items using computerized adaptive test (CAT) data matrices produced estimates that were unbiased and improved the stability of existing item parameter estimates. Item parameter estimates from four pools of items constructed for operational use were used in the study to arrive at a final number of 1,392 unique items. Fifty sets of true parameter estimates were generated from the base item prior information, and each true...

Regression discontinuity design (RD) has been widely used to produce reliable causal estimates. Researchers have validated the accuracy of the RD design using within-study comparisons (Cook, Shadish & Wong, 2008; Cook & Steiner, 2010; Shadish et al., 2011). A within-study comparison examines the validity of a quasi-experiment by comparing its estimates to a trustworthy benchmark (usually an experiment) with the same treatment group. First developed by Lalonde (1986), it is a rigorous method to...

A meta-analysis of 29 separate studies investigating pretest effects was conducted. Outcomes of the studies (achievement gains or attitude improvements) were computed as standardized differences between pretested and non-pretested groups. Eleven other variables were coded for each outcome. Initial descriptive statistics were indicative of differences between randomized and nonrandomized studies, so all further analyses were based on the 110 randomized group outcomes. For all outcomes, the...
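
As a rough illustration of a standardized difference between pretested and non-pretested groups, the sketch below divides the difference in group means by a pooled standard deviation. The pooled-SD denominator is an assumption; the abstract does not say which standardizer was used, and the function name is hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def standardized_difference(pretested, not_pretested):
    """Mean difference divided by the pooled standard deviation.

    Assumption: the standardizer is the pooled within-group SD;
    other choices (e.g. the control-group SD) are also common.
    """
    n1, n2 = len(pretested), len(not_pretested)
    pooled_var = (
        (n1 - 1) * variance(pretested) + (n2 - 1) * variance(not_pretested)
    ) / (n1 + n2 - 2)
    return (mean(pretested) - mean(not_pretested)) / sqrt(pooled_var)
```

A positive value indicates that the pretested group scored higher on the outcome.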

A new report published by the Manhattan Institute for Education Policy, "The Effect of Special Education Vouchers on Public School Achievement: Evidence from Florida's McKay Scholarship Program," attempts to examine the complex issue of how competition introduced through school vouchers affects student outcomes in public schools. The possible contributions of this report, however, are outweighed by research design problems, failure to take into account alternative explanations, and...

The surgical theatre educational environment measures STEEM, OREEM, and mini-STEEM for students (student-STEEM) contain a previously disregarded systematic overestimation (OE) caused by inaccurate percentage calculation. The aim of the present study was to investigate the magnitude of and suggest a correction for this systematic bias. After an initial theoretical exploration of the problem, published scores were retrieved from the literature and corrected using statistical theorems....

Grouping is a statistical procedure through which members of the same group are considered as a single unit of observation. There are various ways to assign group membership and various ways to assign values of variables to groups. There are methodological problems associated with grouping in general and with particular methods of grouping. This paper argues that a wide variety of complex analytical problems concerning inferences from grouped observations can be understood from the use of a few...

A framework is described for preparing process profiles of two different surveys conducted by the National Center for Education Statistics (NCES): the Recent College Graduates Survey and the Higher Education General Information Survey (HEGIS) Fall Enrollment Survey. The process profile examines the adequacy of the entire survey process, using generally available data. For each component of the survey--sample design, instrumentation, data collection, data processing, estimation procedure, and...

The validity of the analysis of variance (ANOVA) of a split plot factorial design was investigated using a complex interaction contrast with matched two-group ability data for detecting biased items. The definition of a biased item by this method is: an item is biased if there is an item-by-group interaction when there is no group difference in achievement levels. Data were drawn from a study by M. J. Subkoviak et al. (1984), consisting of the responses of 1,022 white and 1,008 black college...

Results of an initial attempt at measuring teamwork processes using computer simulation are presented, focusing on assessing team processes that emerge during the negotiation of a contract. The interaction between team members and how that interaction affects team performance were explored using a computer simulated negotiation. Participants interacted through computers using predetermined messages that were categorized as belonging to five teamwork processes. Tracking the messages gives a...

The National Assessment Governing Board and the National Center for Education Statistics sponsored a Joint Conference on Standard Setting for Large-Scale Assessments to provide a forum for technical and policy issues relevant to setting standards at local, state, and national levels. Volume I contains an executive summary of the conference and synopses of the conference papers. This volume comprises the papers prepared for the conference and summaries of the plenary sessions and small breakout...

Data drawn from 30 journal articles and ERIC documents reporting on gender differences in natural science achievement were re-examined. Three meta-analysis methods were used: (1) vote counts and vote-counting estimation procedures; (2) tests of combined significance; and (3) analyses of effect sizes. The three methods produced seemingly contradictory conclusions which were explained in terms of differences in the hypotheses tested by the methods, as well as the statistical properties of the...
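
The three families of methods can give seemingly contradictory answers because they test different hypotheses. The sketch below illustrates this with hypothetical numbers. It assumes one-tailed p-values and uses Stouffer's sum-of-z method as the combined significance test and an unweighted mean as the effect-size summary; the abstract does not state which specific procedures were applied.

```python
from math import sqrt
from statistics import NormalDist, mean

def vote_count(p_values, alpha=0.05):
    """Number of studies that are individually significant at alpha."""
    return sum(p < alpha for p in p_values)

def stouffer_combined_p(p_values):
    """Combined one-tailed p-value via Stouffer's sum-of-z method.

    Each p-value must lie strictly between 0 and 1.
    """
    std_normal = NormalDist()
    z_scores = [std_normal.inv_cdf(1.0 - p) for p in p_values]
    combined_z = sum(z_scores) / sqrt(len(z_scores))
    return 1.0 - std_normal.cdf(combined_z)

def mean_effect_size(effect_sizes):
    """Unweighted average effect size (a deliberately simple summary)."""
    return mean(effect_sizes)

# Hypothetical example: five studies, none individually significant.
example_p_values = [0.10] * 5
example_effects = [0.20, 0.15, 0.25, 0.18, 0.22]
```

In the hypothetical example, the vote count finds no significant studies, yet the combined test rejects the joint null hypothesis, because consistent small effects accumulate across studies.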

We explore the use of instrumental variables (IV) analysis with a multi-site randomized trial to estimate the effect of a mediating variable on an outcome in cases where it can be assumed that the observed mediator is the only mechanism linking treatment assignment to outcomes, an assumption known in the instrumental variables literature as the exclusion restriction. We use a random-coefficient IV model that allows both the impact of program assignment on the mediator (compliance with...

This paper investigates how bias reduction was affected when different degrees of measurement error were systematically introduced into the measures constituting the final estimated propensity score (PS), the PS only for the set of effective covariates and the PS only for the ineffective ones. Since there was already some error in the Shadish et al. covariate measures, a more complex simulation was also done without this source of error. In many ways, this last analysis is the most important....

There has been an active debate in the literature over the validity of value-added models. In this study, the author tests the central assumption of value-added models that school assignment is random relative to expected test scores conditional on prior test scores, demographic variables, and other controls. He uses a Chicago charter school's lottery to identify school effects, and then compares this "experimental" estimate to that of a school value-added model, which is estimated...

Using examples from evaluations of the Emergency School Aid Act (ESAA) Basic Grants Program, the ESAA Pilot Program, and the sustaining effects of compensatory education programs, school and student attrition are discussed. As in the example cases, appreciable attrition can be expected in most longitudinal studies. The possible effects of this attrition on descriptive analyses, analyses of student gains for each school year, and analyses of differential achievement gains for different treatment...

This study addressed which, if any, contemporary fit indices are least susceptible to the bias associated with confirmatory factor analysis (CFA) involving a large number of measured variables. Data were obtained from student responses from 1980 to 1990 on the Student Evaluations of Educational Quality (SEEQ) instrument of H. Marsh (1987). Factor analytic studies have validated the factor structure of the SEEQ. For this study, only student scores for 28 SEEQ items (7,407 classes) were included...

This fifth and final paper in the Fordham Institute's series examining digital learning policy is "Overcoming the Governance Challenge in K-12 Online Learning". The purpose of this report is to outline the steps required to move the governance of K-12 online learning from the local district level to the less restrictive state level and to create a free market for corporate innovation in K-12 online learning. Unfortunately, the report is based on an unsupported premise that K-12 online...

The research reported here uses a pre/post-test model and stimulated recall interviews to assess teachers' statistical reasoning about comparing distributions, when enrolled in a graduate-level statistics education course. We discuss key aspects of the course design aimed at improving teachers' learning and teaching of statistics, and the resulting different ways of reasoning about comparing distributions that teachers exhibited before and after the course.

The purpose of this study is to determine an efficient way to reduce the bias in estimates of the Rasch model parameters due to aberrant response patterns. First, the benefits of using one- or two-sided goodness-of-fit tests of patterns with the model are discussed. Then, the consequences of removing non-fitting patterns from Rasch model data are considered. Finally, an iterative procedure to reduce the bias is presented. This procedure replaces non-fitting patterns by certain patterns sampled...

This study compared correctional education participants and non-participants in Maryland, Minnesota, and Ohio to assess the impact of correctional education on recidivism and post-release employment of inmates. The study attempted to address criticism of previous studies by using a treatment and comparison group, using statistical controls, addressing possible self-selection bias, using more than one measure of recidivism, and using a longer time period. These two study groups were chosen:...

The goal of this study is to better understand how methods for estimating treatment effects of latent groups operate. In particular, the authors identify where violations of assumptions can lead to biased estimates, and explore how covariates can be critical in the estimation process. For each set of approaches, the authors first review the assumptions necessary for identification and discuss practical issues that arise in estimation; second, they then examine how covariates allow for improved...

It is a truism of research on social stratification that the effects of socioeconomic or family background on educational attainment lead to biases in the simple regression of occupational status (or other putative outcomes of schooling) on educational attainment. Using a structural equation model of sibling resemblance in educational attainment and occupational status, Hauser and Mossel have found minimal evidence of family bias in the effects of postsecondary schooling on occupational status...

The use of value-added models in education research has expanded rapidly. These models allow researchers to explore how a wide variety of policies and measured school inputs affect the academic performance of students. An important question is whether such effects are sufficiently large to achieve various policy goals. Judging whether a change in student achievement is important requires some meaningful point of reference. In certain cases a grade-equivalence scale or some other intuitive and...

This paper examines the question of the hereditary nature of intelligence and the validity of some of the statistical procedures which have been used in measuring the degree of hereditability. The author feels that proof of the question lacks sufficient scientific rigor for the support of any conclusion, particularly for a question of such political and emotional importance. (CTM)

Fleiss' popular multirater kappa is known to be influenced by prevalence and bias, which can lead to the paradox of high agreement but low kappa. It also assumes that raters are restricted in how they can distribute cases across categories, which is not a typical feature of many agreement studies. In this article, a free-marginal, multirater alternative to Fleiss' multirater kappa is introduced. Free-marginal Multirater Kappa (multirater K[free]), like its birater free-marginal counterparts...
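
The paradox of high agreement but low kappa can be illustrated with a small sketch. It assumes the commonly cited form of the free-marginal statistic, in which chance agreement is fixed at 1/k for k categories, while Fleiss' kappa estimates chance agreement from the observed category proportions. The function names and example ratings are hypothetical.

```python
def observed_agreement(counts):
    """Proportion of agreeing rater pairs, averaged over cases.

    counts[i][j] is the number of raters who assigned case i to
    category j; every case must be rated by the same number of raters.
    """
    n_cases = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each case needs the same number (>= 2) of raters")
    sum_sq = sum(c * c for row in counts for c in row)
    return (sum_sq - n_cases * n_raters) / (n_cases * n_raters * (n_raters - 1))

def fleiss_kappa(counts):
    """Fixed-marginal kappa: chance agreement from observed proportions."""
    total = sum(sum(row) for row in counts)
    n_cats = len(counts[0])
    props = [sum(row[j] for row in counts) / total for j in range(n_cats)]
    p_exp = sum(p * p for p in props)
    return (observed_agreement(counts) - p_exp) / (1.0 - p_exp)

def free_marginal_kappa(counts):
    """Free-marginal kappa: chance agreement assumed to be 1/k."""
    p_exp = 1.0 / len(counts[0])
    return (observed_agreement(counts) - p_exp) / (1.0 - p_exp)

# Hypothetical ratings: 10 cases, 3 raters, 2 categories, one rare category.
skewed_counts = [[3, 0]] * 9 + [[2, 1]]
```

For the skewed example, observed agreement is high, but Fleiss' kappa falls slightly below zero because nearly all ratings land in one category, whereas the free-marginal value stays high.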

Extensive education research on the contribution of teachers to student achievement produces two generally accepted results. First, teacher quality varies substantially as measured by the value added to student achievement or future academic attainment or earnings. Second, variables often used to determine entry into the profession and salaries--including postgraduate schooling, experience, and licensing examination scores--appear to explain little of the variation in teacher quality so...

The increasing availability of data from multi-site randomized trials provides a potential opportunity to use instrumental variables methods to study the effects of multiple hypothesized mediators of the effect of a treatment. We derive nine assumptions needed to identify the effects of multiple mediators when using site-by-treatment interactions to generate multiple instruments. Three of these assumptions are unique to the multiple-site, multiple-mediator case: 1) the assumption that the...

