Aggregation, or grouping, is a statistical procedure through which all members of a study within a specified range of scores (usually observed scores) are assigned a common or "group" score (for example, the group mean). The various social science methodology literatures agree on the costs of grouping: not only does one always lose information in grouping, in a wide variety of situations grouping introduces systematic error (bias). For most educational research applications the...
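
A minimal sketch of the information loss described above, assuming hypothetical scores and a fixed bin width (neither comes from the abstract): each score is replaced by the mean of its range, and the variance of the grouped scores is compared with the variance of the originals. The function name `group_scores` is illustrative only.

```python
from statistics import mean, pvariance

def group_scores(scores, low, width):
    """Replace each score with the mean of its bin [low + k*width, low + (k+1)*width)."""
    bins = {}
    for s in scores:
        bins.setdefault((s - low) // width, []).append(s)
    bin_means = {k: mean(v) for k, v in bins.items()}
    return [bin_means[(s - low) // width] for s in scores]

scores = [1, 2, 3, 4, 5, 6]                 # hypothetical observed scores
grouped = group_scores(scores, low=1, width=3)

# Grouping keeps only between-group variation; within-group variation is discarded.
original_variance = pvariance(scores)
grouped_variance = pvariance(grouped)
```

In this toy case the grouped variance is smaller than the original variance, which is the sense in which grouping always loses information; whether it also introduces bias depends on the analysis applied afterwards.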

This review examines the recently released Thomas B. Fordham Institute report, "Education Olympics: The Games in Review." Published just after the completion of the 2008 Beijing Summer Olympics, Education Olympics strategically parallels the international competition by awarding gold, silver, and bronze medals to top-performing countries based on indicators including scores from international assessments in reading, mathematics, and science. The report contrasts American students'...

The conclusion of a 1999 Organisation for Economic Cooperation and Development (OECD) report that wage gains for training are higher for workers with lower levels of education was revisited using data for males from the 1997 Australian Survey of Education and Training (SET). The study used methods similar to the OECD report (ordinary least squares and treatment effects model) with the following findings: (1) earnings effects for workers with Skilled and Basic Vocational Qualifications were...

This article presents a method for addressing the self-selection bias of students who participate in learning communities (LCs). More specifically, this research utilizes equivalent comparison groups based on selected incoming characteristics of students, known as bootstraps, to account for self-selection bias. To address the differences in academic preparedness in the fall 2012 cohort, three stratified random samples of students were drawn from the non-LC population to match the LC cohort in...

To assess the relative effectiveness of public and private school environments, researchers must distinguish between the effects of the schools' programs and the students' innate abilities. Student background variables do not appear to account for all the important differences among students attending public and private schools. This document proposes and tests a model of school selection that relates school choice to a family's assessment of public and private school quality. The assumption is...

The National Household Education Survey (NHES) was conducted for the first time in 1991 as a way to collect data on the early childhood education experiences of young children and participation in adult education. Because the NHES methodology is relatively new, field tests were necessary. A large field test of approximately 15,000 households was conducted during the fall of 1989 to examine several methodological issues. This report focuses on measurement errors arising from the use of proxy...

A project was designed to develop and test a library of cassette audiotapes for improving the technical skills of educational researchers. Fourteen outstanding researchers from diverse fields were identified, and a short instructional tape was prepared by each. Subjects of the tapes included instructional objectives for intellectual skills, sources of bias in surveys, implications for the next 20 years of change, some precepts for conducting educational research, statistical interactions,...

The Center for Research on Education Outcomes (CREDO) at Stanford University conducted a large-scale analysis of the impact of charter schools on student performance. The center's data covered 65-70% of the nation's charter schools. Although results varied by state, 17% of the charter school students have significantly higher math results than their matched twins in comparable traditional public schools (TPS), while 37% had significantly worse results. The CREDO study strengthens the...

A new report published by the Buckeye Institute for Public Policy Solutions is a minor variant on six similar reports published by the Friedman Foundation over the past three years. The new report repeats some of the errors in the previous reports, and it follows a parallel structure, arguing that the costs of dropping out are dramatic for the state of Ohio, and that last-chance charter schools for dropouts can increase graduation and address the dropout problem. However, the report's claims...

Smith et al. (1980) analyzed 475 psychotherapy studies and concluded that individuals receiving treatment were better off than 80 percent of the untreated control groups. These studies were criticized on methodological grounds, particularly for failing to enable calculation of an index of effect size. To address these methodological issues, 20 published studies cited in the Smith study, from two treatment domains, i.e., the effectiveness of client-centered therapy (N=17) and transactional...
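
For readers unfamiliar with effect size indices, the sketch below shows one common form, a standardized mean difference scaled by the control group's standard deviation, and how it maps to a percentile of the control distribution under a normality assumption. The data, the function names, and the choice of this particular index are assumptions for illustration; the abstract does not say which index the reanalysis used.

```python
from statistics import NormalDist, mean, stdev

def standardized_mean_difference(treated, control):
    """Difference in group means divided by the control group's sample SD."""
    return (mean(treated) - mean(control)) / stdev(control)

def percentile_of_control(d):
    """Share of a normal control distribution falling below the average treated case."""
    return NormalDist().cdf(d)

# Hypothetical outcome scores, not taken from any study.
treated = [5, 6, 7]
control = [3, 4, 5]
d = standardized_mean_difference(treated, control)
share = percentile_of_control(d)
```

Under these assumptions, an effect size of zero places the average treated case at the 50th percentile of the control group, and a value near 0.85 places it at roughly the 80th percentile.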

The relationship of sample size to number of variables in the use of factor analysis has been treated by many investigators. In attempting to explore what the minimum sample size should be, none of these investigators pointed out the constraints imposed on the dimensionality of the variables by using a sample size smaller than the number of variables. A review of studies in this area is made as well as suggestions for resolution of the problem. (Author)

An investigation of the effects of randomly missing data in two-predictor regression analyses is described. The differences in the effectiveness of five common treatments of missing data on estimates of R-squared values and each of the two standardized regression weights is also investigated. Bootstrap sample sizes of 50, 100, and 200 were drawn from three sets of actual field data. Randomly missing data were created within each sample, and the parameter estimates were compared with those...

Intended to provide an overview of program evaluation as it applies to the evaluation of faculty development and clinical training programs in substance abuse for health and mental health professional schools, this guide enables program developers and other faculty to work as partners with evaluators in the development of evaluation designs that meet the specialized needs of faculty development and clinical training programs. Section I discusses conceptual issues in program evaluation,...

Technical problems with norm-referenced achievement testing that can lead to the erroneous evaluation of schools for Chapter 1 Program improvement are discussed, and an alternative testing model is presented. The history of Chapter 1 testing and evaluation policies is briefly reviewed, and problems with the norm-referenced model are explored. Data from a large urban district with an extensive Chapter 1 system for the Iowa Test of Basic Skills are used to demonstrate the way in which regression...

An empirical investigation of alternate item nonresponse adjustment procedures in a National Longitudinal Study (NLS) of missing and faulty data indicates that in some cases imputation can reduce the accuracy of survey estimates. A National Sample of the high school class of 1972 is designed to provide statistics on students moving into early adulthood. The bias resulting from nonresponse and response errors is evaluated using hot deck and weighting class adjustment techniques to...

Given the different possibilities of matching in the context of multilevel data and the lack of research on corresponding matching strategies, the author investigates two main research questions. The first research question investigates the advantages and disadvantages of different matching strategies that can be pursued with multilevel data structures. The goal is first to outline possible matching strategies and then to identify an optimal matching strategy for different treatment selection...

This exploratory study extends the work done by B. Plake and others (2000) and R. Guille and others (2001) by investigating whether a negligible occasion facet would still be found when ratings for licensure and certification examinations were completed in isolation. A set of items was sent to a standard-setting committee to be reviewed at home, completely independently of all other members of the committee. Seven to nine raters reviewed each item. The examination was a medical certification...

The purpose of this inquiry was to investigate the effectiveness of item response theory (IRT) proficiency estimators in terms of estimation bias and error under multistage testing (MST). We chose a 2-stage MST design in which 1 adaptation to the examinees' ability levels takes place. It includes 4 modules (1 at Stage 1, 3 at Stage 2) and 3 paths (low, middle, and high). When creating 2-stage MST panels (i.e., forms), we manipulated 2 assembly conditions in each module, such as difficulty level...

Many arguments have been made against allowing examinees to review and change their answers after completing a computer adaptive test (CAT). These arguments include: (1) increased bias; (2) decreased precision; and (3) susceptibility to test-taking strategies. Results of simulations suggest that the strength of these arguments is reduced or eliminated by using specific information item selection (SIIS), under which items are selected to meet information targets, instead of the more common...

Previous studies have indicated that the reliability of test scores composed of testlets is overestimated by conventional item-based reliability estimation methods (S. Sireci, D. Thissen, and H. Wainer, 1991; H. Wainer, 1995; H. Wainer and D. Thissen, 1996; G. Lee and D. Frisbie). In light of these studies, it seems reasonable to ask whether the item-based estimation methods for the conditional standard errors of measurement (SEM) would provide underestimates for tests composed of testlets. The...

A report from the School Choice Demonstration Project examines issues concerning the funding formula used for the Milwaukee Parental Choice Program (MPCP). It finds that the program generates a net saving to taxpayers in Wisconsin but imposes a significant fiscal burden on taxpayers in Milwaukee. However, these findings depend significantly on how many students would have attended public school if the voucher option were not available, as well as on the actual resource requirements for those...

We explore the use of instrumental variables (IV) analysis with a multi-site randomized trial to estimate the effect of a mediating variable on an outcome in cases where it can be assumed that the observed mediator is the only mechanism linking treatment assignment to outcomes, an assumption known in the instrumental variables literature as the exclusion restriction. We use a random-coefficient IV model that allows both the impact of program assignment on the mediator (compliance with...

This study addresses the sample error and linking bias that occur with small and unrepresentative samples in a non-equivalent groups anchor test (NEAT) design. We propose a linking method called the "synthetic function," which is a weighted average of the identity function (the trivial equating function for forms that are known to be completely parallel) and a traditional equating function (in this case, the chained linear equating function) used in the normal case in which forms are...
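
The weighted average described above can be sketched as follows. The weight `w`, the convention that `w` applies to the equating function rather than to the identity, and the linear coefficients are all assumptions for illustration; the abstract does not state how the weight is chosen, and the stand-in `linear_equating` is a generic linear function rather than a full chained linear equating computed from anchor-test moments.

```python
def linear_equating(slope, intercept):
    """Return a generic linear equating function y = slope * x + intercept."""
    return lambda x: slope * x + intercept

def synthetic_function(x, equate, w):
    """Weighted average of an equating function (weight w) and the identity (weight 1 - w)."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    return w * equate(x) + (1.0 - w) * x

# Hypothetical coefficients, not estimated from any data.
equate = linear_equating(slope=1.5, intercept=-2.0)

identity_only = synthetic_function(20, equate, w=0.0)   # no equating applied
equating_only = synthetic_function(20, equate, w=1.0)   # full equating applied
blended = synthetic_function(20, equate, w=0.5)         # halfway between the two
```

With `w` near 0 the result leans on the identity function, which has no sampling error but may be biased if the forms differ; with `w` near 1 it leans on the estimated equating function, which is the trade-off the abstract describes.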

In practical applications of item response theory (IRT), item parameters are usually estimated first from a calibration sample. After treating these estimates as fixed and known, ability parameters are then estimated. However, the statistical inferences based on the estimated abilities can be misleading if the uncertainty of the item parameter estimates is ignored. Instead, estimated item parameters can be regarded as covariates measured with error. Along the line of this...

The surgical theatre educational environment measures STEEM, OREEM, and mini-STEEM for students (student-STEEM) contain a previously disregarded systematic overestimation (OE) caused by inaccurate percentage calculation. The aim of the present study was to investigate the magnitude of and suggest a correction for this systematic bias. After an initial theoretical exploration of the problem, published scores were retrieved from the literature and corrected using statistical theorems....

This study used real data to construct testing conditions for comparing results of chained linear, Tucker, and Levine-observed score equatings. The comparisons were made under conditions where the new- and old-form samples were similar in ability and when they differed in ability. The length of the anchor test was also varied to enable examination of its effect on the three different equating methods. Two tests were used in the study, and the three equating methods were compared to a criterion...

Eta-Squared (ES) is often used as a measure of strength of association of an effect, a measure often associated with effect size. It is also considered the proportion of total variance accounted for by an independent variable. It is simple to compute and interpret. However, it has one critical weakness cited by several authors (C. Huberty, 1994; P. Snyder and S. Lawson, 1993; and T. Snijders, 1996), and that is a sampling bias that leads to an inflated judgment of true effect. The purpose of...
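
As an illustration of the quantity discussed above, the sketch below computes eta-squared for a one-way design as the between-groups sum of squares divided by the total sum of squares. For comparison it also computes omega-squared, a commonly used bias-adjusted alternative; that choice is an assumption made here, since the abstract does not say which correction the study examined. The data are hypothetical.

```python
def anova_effect_sizes(groups):
    """Return (eta_squared, omega_squared) for a one-way layout given as a list of groups."""
    n = sum(len(g) for g in groups)
    k = len(groups)
    grand_mean = sum(sum(g) for g in groups) / n
    group_means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
    ss_total = ss_between + ss_within
    ms_within = ss_within / (n - k)
    eta_squared = ss_between / ss_total
    omega_squared = (ss_between - (k - 1) * ms_within) / (ss_total + ms_within)
    return eta_squared, omega_squared

# Hypothetical scores for three groups.
eta_sq, omega_sq = anova_effect_sizes([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
```

In this example the adjusted estimate is smaller than eta-squared, consistent with the upward sampling bias the cited authors describe.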

This study addresses the test-disclosure-related need for more Graduate Record Examinations (GRE) General Test editions in a situation where the number of examinees is stable or declining. Equating is used to guarantee that examinees of different test editions are treated equitably. The data collection designs used in this study were: (1) Nonrandom Group External Anchor Test (NREAT); and (2) Random Group, Preoperational Section (RPOS). Bias and root mean squared error were calculated for the...

Data drawn from 30 journal articles and ERIC documents reporting on gender differences in natural science achievement were re-examined. Three meta-analysis methods were used: (1) vote counts and vote-counting estimation procedures; (2) tests of combined significance; and (3) analyses of effect sizes. The three methods produced seemingly contradictory conclusions which were explained in terms of differences in the hypotheses tested by the methods, as well as the statistical properties of the...

Problems associated with low response rates to surveys are considered, drawing from the literature on the methodology of survey research. A series of analyses is presented that was designed to examine the efficacy of Astin and Molm's procedure to adjust for nonresponse biases. Data were obtained from the Cooperative Institutional Research Program survey of 1987 incoming students and the 1991 followup survey. Data were analyzed for 209,627 students at 390 institutions. After separating the...

The application of a multivariate analytic technique for the analysis of data from longitudinal designs with multiple dependent variables is presented. The technique is the multivariate generalization of univariate repeated measures ANOVA. An application of the technique to data collected using materials from the Asian Studies Curriculum Project is included. The example analysis indicated the technique is viable and should be a useful tool for the methodologist/evaluator. (Author)

This study evaluated the impact of unequal reliability on test equating methods in the nonequivalent groups with anchor test (NEAT) design. Classical true score-based models were compared in terms of their assumptions about how reliability impacts test scores. These models were related to treatment of population ability differences by different NEAT equating methods. A score model was then developed based on the most important features of the reviewed score models and used to study reliability...

A valuable extension of the single-rating regression discontinuity design (RDD) is a multiple-rating RDD (MRRDD). To date, four main methods have been used to estimate average treatment effects at the multiple treatment frontiers of an MRRDD: the "surface" method, the "frontier" method, the "binding-score" method, and the "fuzzy instrumental variables" method. This paper uses a series of simulations to evaluate the relative performance of each of these...

In a provocative and influential paper, Jesse Rothstein (2010) finds that standard value-added models (VAMs) suggest implausible future teacher effects on past student achievement, a finding that obviously cannot be viewed as causal. This is the basis of a falsification test (the Rothstein falsification test) that appears to indicate bias in VAM estimates of current teacher contributions to student learning. More precisely, the falsification test is designed to identify whether or not students...

Advantages and disadvantages of standard Rasch analysis computer programs are discussed. The unconditional maximum likelihood algorithm allows all observations to participate equally in determining the measures and calibrations to be obtained quickly from a data set. On the advantage side, standard Rasch programs can be used immediately, are debugged and accurate, and can report statistics that are difficult to calculate. On the disadvantage side, the user must have the correct hardware,...

This study uses simulation examples representing three types of treatment assignment mechanisms in data generation (the random intercept and slopes setting, the random intercept setting, and a third setting with a cluster-level treatment and an individual-level outcome) in order to determine optimal procedures for reducing bias and improving precision in each of these three settings. Evaluation criteria include bias, variance, MSE, confidence interval coverage rate, and remaining sample size....

The effects of variations in degree of range restriction and different subgroup sample sizes on the validity of several item bias detection procedures based on Item Response Theory (IRT) were investigated in a simulation study. The degree of range restriction for each of two subpopulations was varied by cutting the specified subpopulation ability distribution at different locations and retaining the upper portion of the distribution. It was found that range restriction did have an effect on the...

The effectiveness of Stout's procedure for assessing latent trait unidimensionality was studied. Strong empirical evidence of the utility of the statistical test in a variety of settings is provided. The procedure was modified to correct for increased bias, and a new algorithm to determine the size of assessment sub-tests was used. The following two issues were addressed via a Monte Carlo simulation: (1) the ability to approximate the nominal level of significance via the observed level of...

The Education Trust research report "Stuck Schools" suggests a framework for identifying chronically low-performing schools in need of turnaround. The study uses Maryland and Indiana to show that some low-performing schools make progress while others remain stagnant. The report has four serious problems of reliability and validity, however. First, the norm-referenced methodology guarantees "failed" schools independent of any true performance or improvement level by the...

When most people think of the perks of teaching, an image that comes to mind is a shiny apple presented by a gap-toothed pupil. A recent paper by Jason Richwine of the Heritage Foundation and Andrew Biggs of the American Enterprise Institute claims that public school teachers enjoy lavish benefits that are more valuable than their base pay and twice as generous as those of private-sector workers (Richwine and Biggs 2011). According to Richwine and Biggs, this makes teachers' total compensation...

This paper explores the applicability of the Bayesian random-effect analysis of variance (ANOVA) model developed by G. C. Tiao and W. Y. Tan (1966) and a method suggested by H. K. Suen and P. S. Lee (1987) for the generalizability analysis of autocorrelated data. According to Tiao and Tan, if time series data could be described as a first-order autoregressive series with parameter "p" (rho), unbiased estimates of random error variance could be derived via a Bayesian process. Suen and Lee's...

Lord's bias function and the weighted likelihood estimation method are effective in reducing the bias of the maximum likelihood estimate of an examinee's ability under the assumption that the true item parameters are known. This paper presents simulation studies to determine the effectiveness of these two methods in reducing the bias when the item parameters are unknown. The simulation results show that Lord's bias function and the weighted likelihood estimation method might not be as effective...

The purpose of this study is to determine an efficient way to reduce the bias in estimates of the Rasch model parameters due to aberrant response patterns. First, the benefits of using one- or two-sided goodness-of-fit tests of patterns with the model are discussed. Then, the consequences of removing non-fitting patterns from Rasch model data are considered. Finally, an iterative procedure to reduce the bias is presented. This procedure replaces non-fitting patterns by certain patterns sampled...

The 11 papers in this volume were presented at the 1996 American Statistical Association (ASA) meeting in Chicago (Illinois), August 4 through 8. This is the fourth collection of ASA papers of particular interest to users of National Center for Education Statistics (NCES) survey data published in the "Working Papers" series. The following are included: (1) "Teacher Quality and Educational Inequality" (Richard M. Ingersoll); (2) "Using Qualitative Methods To Validate...

The goal of this paper is to provide guidance for applied education researchers in using multi-level data to study the effects of interventions implemented at the school level. Two primary approaches are currently employed in observational studies of the effect of school-level interventions. One approach employs intact school matching: matching schools that are implementing the treatment to schools not implementing the treatment that are similar in observable characteristics. An alternative...

Three judges rated General Educational Development (GED) student essays that had been written by nine incarcerated youths. If judges' ratings were not consistent, students would receive a biased average rating. Traditional classical measurement theory computes an intraclass reliability coefficient to determine if judges' ratings are reliable. In contrast, a latent trait measurement theory approach can determine a student's fair average rating adjusted for any bias in the judges' ratings. A...

Most of the recent literature on the achievement effects of school size has examined school and district performance. These studies have demonstrated substantial benefits of smaller school and district size in impoverished settings. To date, however, no work has adequately examined the relationship of size and socioeconomic status (SES) with students as the unit of analysis. One study came close (Lee & Smith, 1997) but failed to adjust its analyses or conclusions to the...

This paper applies a standard treatment effects model to determine that participation in Freshman Learning Communities (FLCs) improves academic performance and retention. Not controlling for individual self-selection into FLC participation leads one to incorrectly conclude that the impact is the same across race and gender groups. Accurately assessing the impact of any educational program is essential in determining what resources institutions should devote to it. Reduced form OLS estimation...

This study examined the effect of type of correlation matrix on the robustness of LISREL maximum likelihood and unweighted least squares structural parameter estimates for models with categorical manifest variables. Two types of correlation matrices were analyzed; one containing Pearson product-moment correlations and one containing tetrachoric, polyserial, and product-moment correlations as appropriate. Using continuous variables generated according to the equations defining the population...

In the UK, education is the third largest area of government spending (of which school spending has the largest share). Since 2000, school expenditure has increased by about 40 per cent in real terms for both primary and secondary schools (see Figure 1). The question as to whether such investment is worthwhile is of central importance. The national debate is not revealing as to the answer. The government points to the improvement in the number of students achieving government targets in...

