Criteria for prediction of multinomial responses are examined in terms of estimation bias. Logarithmic penalty and least squares are quite similar in behavior but quite different from maximum probability. The differences ultimately reflect deficiencies in the behavior of the criterion of maximum probability.
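
As a rough illustration of the three criteria named above, the sketch below scores two hypothetical probability forecasts for a three-category response. The function names, the example probabilities, and the reading of "maximum probability" as a 0/1 error on the modal category are assumptions made for illustration, not details taken from the paper.

```python
import math

def log_penalty(probs, observed):
    """Logarithmic penalty: negative log of the probability given to the observed category."""
    return -math.log(probs[observed])

def squared_error(probs, observed):
    """Least-squares penalty: squared distance between the forecast and the one-hot outcome."""
    return sum((p - (1.0 if j == observed else 0.0)) ** 2 for j, p in enumerate(probs))

def max_prob_error(probs, observed):
    """Maximum-probability criterion (assumed form): 0 if the modal category occurred, else 1."""
    predicted = max(range(len(probs)), key=lambda j: probs[j])
    return 0 if predicted == observed else 1

# Hypothetical forecasts; category 0 is the one that actually occurs.
confident = [0.90, 0.05, 0.05]
hesitant = [0.40, 0.30, 0.30]

for name, probs in (("confident", confident), ("hesitant", hesitant)):
    print(name,
          round(log_penalty(probs, 0), 3),
          round(squared_error(probs, 0), 3),
          max_prob_error(probs, 0))
```

Under these assumptions, the logarithmic and least-squares penalties both favor the confident forecast, while the maximum-probability criterion gives the two forecasts the same score because it looks only at the modal category.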

Smith et al. (1980) analyzed 475 psychotherapy studies and concluded that individuals receiving treatment were better off than 80 percent of the untreated control groups. These studies were criticized on methodological grounds, particularly for failing to enable calculation of an index of effect size. To address these methodological issues, 20 published studies cited in the Smith study, from two treatment domains, i.e., the effectiveness of client-centered therapy (N=17) and transactional...

The relationship of sample size to number of variables in the use of factor analysis has been treated by many investigators. In attempting to explore what the minimum sample size should be, none of these investigators pointed out the constraints imposed on the dimensionality of the variables by using a sample size smaller than the number of variables. A review of studies in this area is made as well as suggestions for resolution of the problem. (Author)

An investigation of the effects of randomly missing data in two-predictor regression analyses is described. The differences in the effectiveness of five common treatments of missing data on estimates of R-squared values and each of the two standardized regression weights are also investigated. Bootstrap sample sizes of 50, 100, and 200 were drawn from three sets of actual field data. Randomly missing data were created within each sample, and the parameter estimates were compared with those...

Intended to provide an overview of program evaluation as it applies to the evaluation of faculty development and clinical training programs in substance abuse for health and mental health professional schools, this guide enables program developers and other faculty to work as partners with evaluators in the development of evaluation designs that meet the specialized needs of faculty development and clinical training programs. Section I discusses conceptual issues in program evaluation,...

This exploratory study extends the work done by B. Plake and others (2000) and R. Guille and others (2001) by investigating whether a negligible occasion facet would still be found when ratings for licensure and certification examinations were completed in isolation. A set of items was sent to a standard-setting committee to be reviewed at home, completely independently of all other members of the committee. Seven to nine raters reviewed each item. The examination was a medical certification...

This fifth and final paper in the Fordham Institute's series examining digital learning policy is "Overcoming the Governance Challenge in K-12 Online Learning". The purpose of this report is to outline the steps required to move the governance of K-12 online learning from the local district level to the less restrictive state level and to create a free market for corporate innovation in K-12 online learning. Unfortunately, the report is based on an unsupported premise that K-12 online...

Lord's bias function and the weighted likelihood estimation method are effective in reducing the bias of the maximum likelihood estimate of an examinee's ability under the assumption that the true item parameters are known. This paper presents simulation studies to determine the effectiveness of these two methods in reducing the bias when the item parameters are unknown. The simulation results show that Lord's bias function and the weighted likelihood estimation method might not be as effective...

The purpose of this study is to determine an efficient way to reduce the bias in estimates of the Rasch model parameters due to aberrant response patterns. First, the benefits of using one- or two-sided goodness-of-fit tests of patterns with the model are discussed. Then, the consequences of removing non-fitting patterns from Rasch model data are considered. Finally, an iterative procedure to reduce the bias is presented. This procedure replaces non-fitting patterns by certain patterns sampled...

The 11 papers in this volume were presented at the 1996 American Statistical Association (ASA) meeting in Chicago (Illinois), August 4 through 8. This is the fourth collection of ASA papers of particular interest to users of National Center for Education Statistics (NCES) survey data published in the "Working Papers" series. The following are included: (1) "Teacher Quality and Educational Inequality" (Richard M. Ingersoll); (2) "Using Qualitative Methods To Validate...

This paper applies a standard treatment effects model to determine that participation in Freshman Learning Communities (FLCs) improves academic performance and retention. Not controlling for individual self-selection into FLC participation leads one to incorrectly conclude that the impact is the same across race and gender groups. Accurately assessing the impact of any educational program is essential in determining what resources institutions should devote to it. Reduced form OLS estimation...

It is a truism of research on social stratification that the effects of socioeconomic or family background on educational attainment lead to biases in the simple regression of occupational status (or other putative outcomes of schooling) on educational attainment. Using a structural equation model of sibling resemblance in educational attainment and occupational status, Hauser and Mossel have found minimal evidence of family bias in the effects of postsecondary schooling on occupational status...

Hierarchical Bayes procedures were compared for estimating item and ability parameters in item response theory. Simulated data sets from the two-parameter logistic model were analyzed using three different hierarchical Bayes procedures: (1) the joint Bayesian with known hyperparameters (JB1); (2) the joint Bayesian with information hyperpriors (JB2); and (3) the marginal Bayesian with known hyperparameters (MB). Prior and posterior distributions focusing on one- and two-stage hierarchical...

A Monte Carlo study was conducted to estimate the small sample standard errors and statistical bias of psychometric statistics commonly used in the analysis of achievement tests. The statistics examined in this research were: (1) the index of item difficulty; (2) the index of item discrimination; (3) the corrected item-total point-biserial correlation coefficient; and (4) coefficient alpha. Sample sizes of 5, 10, 20, 40, 80, and 160 were evaluated. One thousand samples of each size were drawn...
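
Two of the statistics listed above can be sketched briefly. The example below computes the index of item difficulty (proportion correct) and coefficient alpha for a small, made-up matrix of dichotomous responses. The data and function names are hypothetical, and the code is not the study's simulation procedure.

```python
from statistics import pvariance

def item_difficulty(responses, item):
    """Proportion of examinees answering the given item correctly."""
    return sum(row[item] for row in responses) / len(responses)

def coefficient_alpha(responses):
    """Coefficient alpha: k/(k-1) * (1 - sum of item variances / variance of total scores).

    Assumes at least two items and non-constant total scores.
    """
    k = len(responses[0])
    item_vars = [pvariance([row[i] for row in responses]) for i in range(k)]
    total_var = pvariance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical data: 5 examinees (rows) by 4 items (columns), scored 1 = correct.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

print([item_difficulty(responses, i) for i in range(4)])
print(round(coefficient_alpha(responses), 3))
```

For this toy matrix the item difficulties are 0.8, 0.6, 0.4, and 0.2, and alpha works out to 0.8. A Monte Carlo study of the kind described would repeat such calculations over many random samples of each size.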

This report is the methodology report for the National Longitudinal Study of the High School Class of 1972 follow-up in 1986. The fifth follow-up survey of the National Longitudinal Study of the High School Class of 1972 (NLS-72) took place during spring and summer of 1986. A mail questionnaire was sent to a subsample of 14,489 members of the original sample of 22,652. A total of 12,841 persons returned the questionnaire, for a response rate of 89 percent. By the time of the survey, the sample...

A normally distributed data set of 1,000 values--ranging from 50 to 150, with a mean of 50 and a standard deviation of 20--was created in order to evaluate the bootstrap method of repeated random sampling. Nine bootstrap samples of N=10 and nine more bootstrap samples of N=25 were randomly selected. One thousand random samples were selected from each of the 18 bootstrap samples, and its mean and standard deviation were calculated. The cumulative means and standard deviations diverged from the...
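
The resampling step described above can be sketched as follows. This is a minimal illustration, not the study's code: the population parameters, the seed, and the function name `bootstrap_means` are assumptions chosen for the example.

```python
import random
from statistics import mean, stdev

def bootstrap_means(sample, n_resamples, rng):
    """Draw n_resamples resamples (with replacement, same size as sample) and return their means."""
    return [mean(rng.choices(sample, k=len(sample))) for _ in range(n_resamples)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Illustrative population of 1,000 normal values (parameters are assumptions).
population = [rng.gauss(100, 20) for _ in range(1000)]

# One small sample of N=10 drawn without replacement from the population.
sample = rng.sample(population, 10)

# One thousand bootstrap resamples of that sample.
means = bootstrap_means(sample, 1000, rng)

print("sample mean:", round(mean(sample), 2))
print("mean of bootstrap means:", round(mean(means), 2))
print("SD of bootstrap means:", round(stdev(means), 2))
```

The bootstrap means cluster around the mean of the small sample rather than the population mean, which is one way such a procedure can diverge from the population values when N is small.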

The Anchor Test Study provides a method for translating a pupil's score on any one of eight widely used standardized reading tests for Grades 4, 5, and 6 to a corresponding score on any of the other seven tests, as well as furnishing new nationally representative norms for each of the eight tests. In addition, the study presents new estimates of alternate form reliability for each test, provides estimates of the intercorrelations among the tests, and explores empirically some methodological...

To correct for the effects of measurement error on structural parameter estimates, many researchers are now estimating models of educational achievement with LISREL. In order to estimate such models it is desirable to obtain multiple manifest measures of the latent constructs. Many researchers restrict their models to two manifest measures per latent construct for reasons of economy, but doing so assumes, in the absence of external information, that all of the covariance between the...

Empirical evidence is presented of the relative efficiency of two potential linkage plans to be used when equivalent test forms are being administered. Equating is a process by which scores on one form of a test are converted to scores on another form of the same test. A Monte Carlo study was conducted to examine equating stability and statistical bias in a single and double linkage plan in small samples. Small random samples of 25, 50, and 100 were drawn with replacement from archival test...

A key issue in quasi-experimental studies, and in many evaluations that require a treatment-effects design (i.e., a control or experimental group), is selection bias (Shadish et al., 2002). Selection bias refers to the selection of individuals, groups or data for analysis such that proper randomization is not achieved, thereby ensuring that the sample obtained is not representative of the population intended to be analyzed (Shadish et al., 2002). There are many ways in which selection bias...

A variety of methods are used by the Austin (Texas) Independent School District to report the results of student achievement testing. These techniques were developed to alleviate some of the problems that occurred previously: (1) a school's average score represents very few of its students because large numbers of students score very high or very low; (2) a median score masks achievement gains; or (3) a total group's average declines while all subgroups' averages rise. Case studies illustrate...

The findings of a recent Economic Policy Institute report asserting that U.S. schools are financially undernourished compared to those of other nations are critiqued in this paper. The argument is made that not only is there no systematic relationship between increased spending on education and academic performance, but also that increased spending is likely to cause economic decline. Spending more on obsolete, inefficient schools and colleges will waste resources and weaken the U.S. economy....

Sampling teachers for nationwide surveys poses the challenge of obtaining a representative and adequate sample that truly reflects teachers' opinions. Ten national surveys of public school teachers conducted between 1980 and 1985 are presented with respect to their sampling design and procedures. Concepts and theoretical considerations provide the background of the critical review. This paper discusses adequacy and representativeness as criteria of a good sample, sample...

The research reported here uses a pre/post-test model and stimulated recall interviews to assess teachers' statistical reasoning about comparing distributions, when enrolled in a graduate-level statistics education course. We discuss key aspects of the course design aimed at improving teachers' learning and teaching of statistics, and the resulting different ways of reasoning about comparing distributions that teachers exhibited before and after the course.

This module on statistics consists of 18 worksheets that cover such topics as sample spaces, mean, median, mode, taking samples, posting results, analyzing data, and graphing. The last four worksheets require the students to work with samples and use these to compare people's responses. A computer dating service is one result of this work. Teaching suggestions are included. (MK)

A Monte Carlo investigation of six robust correlation estimators was conducted for data from distributions with longer than Gaussian tails: a bisquare coefficient, the Tukey correlation, the standardized sums and differences, a biweight standardized sums and differences, the transformed Spearman's rho and a bivariate trimmed Pearson. Evaluation of the estimators was based on bias and variability as measured by mean square error, and efficiency relative to the Pearson correlation coefficient....

Using examples from evaluations of the Emergency School Aid Act (ESAA) Basic Grants Program, the ESAA Pilot Program, and the sustaining effects of compensatory education programs, school and student attrition are discussed. As in the example cases, appreciable attrition can be expected in most longitudinal studies. The possible effects of this attrition on descriptive analyses, analyses of student gains for each school year, and analyses of differential achievement gains for different treatment...

Item response theory (IRT) has been adapted as the theoretical foundation of computerized adaptive testing (CAT) for several decades. In applying IRT to CAT, there are certain considerations that are essential, and yet tend to be neglected. These essential issues are addressed in this paper, and then several ways of eliminating noise and bias in estimating the individual parameter, theta, of person "a" are proposed and discussed, so that accuracy and efficiency in ability estimation...

I take issue with several points in the Howleys' reanalysis (Vol. 12 No. 52 of this journal) of "High School Size: Which Works Best and for Whom?" (Lee & Smith, 1997). That the original sample of NELS schools might have underrepresented small rural public schools would not bias results, as they claim. Their assertion that our conclusions about an ideal high-school size privileged excellence over equity ignores the fact that our multilevel analyses explored the two outcomes...

A comparison between six rater agreement measures obtained using three different approaches was achieved by means of a simulation study. Rater coefficients suggested by Bennet's [sigma] (1954), Scott's [pi] (1955), Cohen's [kappa] (1960) and Gwet's [gamma] (2008) were selected to represent the classical, descriptive approach, [alpha] agreement parameter from Aickin (1990) to represent loglinear and mixture model approaches and [Delta] measure from Martin and Femia (2004) to represent...
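
For orientation, three of the classical coefficients named above share the form (observed agreement - chance agreement) / (1 - chance agreement) and differ only in how chance agreement is defined. The sketch below uses the standard textbook definitions for Bennett's, Scott's, and Cohen's coefficients. The ratings are made up, and taking the number of categories from the observed data is an assumption of this example.

```python
from collections import Counter

def agreement_coefficients(rater_a, rater_b):
    """Return Bennett-type, Scott's pi, and Cohen's kappa for two raters' nominal ratings."""
    n = len(rater_a)
    categories = sorted(set(rater_a) | set(rater_b))
    p_obs = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)

    # Chance agreement under each coefficient's assumption.
    pe_bennett = 1 / len(categories)                      # uniform categories
    pe_scott = sum(((count_a[c] + count_b[c]) / (2 * n)) ** 2 for c in categories)  # pooled marginals
    pe_cohen = sum((count_a[c] / n) * (count_b[c] / n) for c in categories)         # separate marginals

    def corrected(pe):
        return (p_obs - pe) / (1 - pe)

    return {
        "bennett": corrected(pe_bennett),
        "scott_pi": corrected(pe_scott),
        "cohen_kappa": corrected(pe_cohen),
    }

# Hypothetical ratings of 10 items by two raters (1 = present, 0 = absent).
rater_a = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
rater_b = [1, 1, 1, 1, 1, 0, 0, 0, 0, 1]

for name, value in agreement_coefficients(rater_a, rater_b).items():
    print(name, round(value, 3))
```

With these ratings the raters agree on 8 of 10 items. Because both raters have the same marginal distribution, Scott's pi and Cohen's kappa coincide (about 0.583), while the uniform-chance coefficient is 0.6.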

Since there is no standard national Pre and Post Test for Principles of Finance, akin to the one for Economics, the authors created one by selecting questions from previously administered examinations. The Cronbach's Alpha of 0.851, exceeding the minimum of 0.70 for a reliable pen-and-paper test, indicates that our Test can detect differences in learning outcomes. Improvements between Pre and Post Test scores, statistically significant at the 1% level, in the entire sample and within different...

The goal of this study is to better understand how methods for estimating treatment effects of latent groups operate. In particular, the authors identify where violations of assumptions can lead to biased estimates, and explore how covariates can be critical in the estimation process. For each set of approaches, the authors first review the assumptions necessary for identification and discuss practical issues that arise in estimation; second, they then examine how covariates allow for improved...

Abuses and misuses of statistics are frequent. This digest attempts to warn against these in three broad classes of pitfalls: sources of bias, errors of methodology, and misinterpretation of results. Sources of bias are conditions or circumstances that affect the external validity of statistical results. In order for a researcher to make legitimate conclusions about the specified population, two characteristics must be present in the sample: representative sampling and valid statistical...

Qualitative research evokes rather stereotyped responses from the mainstream of social science. The following 10 standardized responses to the stimulus "qualitative research interview" (QRI) are discussed: (1) it is not scientific, only common sense; (2) it is not objective, but subjective; (3) it is not trustworthy, but biased; (4) it is not reliable, but rests on leading questions; (5) it is not intersubjective, as different interpreters find different meanings; (6) it is not...

The use of value-added models in education research has expanded rapidly. These models allow researchers to explore how a wide variety of policies and measured school inputs affect the academic performance of students. An important question is whether such effects are sufficiently large to achieve various policy goals. Judging whether a change in student achievement is important requires some meaningful point of reference. In certain cases a grade-equivalence scale or some other intuitive and...

This study compared correctional education participants and non-participants in Maryland, Minnesota, and Ohio to assess the impact of correctional education on recidivism and post-release employment of inmates. The study attempted to address criticism of previous studies by using a treatment and comparison group, using statistical controls, addressing possible self-selection bias, using more than one measure of recidivism, and using a longer time period. These two study groups were chosen:...

The known interval scale, referred to as the 7.8 scale, has been criticized as an invalid measuring instrument in the form of an attitude scale. It is the purpose of this paper to demonstrate that this scale can produce spuriously inflated correlation coefficients, high reliability, and false significance on statistical tests. The case will be made along two general lines. First, the effects of the scale on reliability, validity, and significance testing will be presented and second the...

The traditional indicator of test speededness, missing responses, clearly indicates a lack of time to respond (thereby indicating the speededness of the test), but it is inadequate for evaluating speededness in a multiple-choice test scored as number correct, and it underestimates test speededness. Conventional item response theory (IRT) parameter estimation ignores the mixture of random response during calibration; consequently, estimated parameters are biased. The HYBRID model (K. Yamamoto,...

The third-year evaluation of the federally funded Washington, D.C. voucher program shows that low-income students offered vouchers in the first two years of the program had modestly higher reading scores after three years but showed no significant difference in mathematics. Students were randomly assigned to treatment and control groups, and the authors assessed the treatment effect on the overall, combined sample as well as some sample subgroups. The authors, however, interpret the results in...

This paper studies the determinants of college major choice using a unique "information" experiment embedded in a survey. We first ask respondents their "self" beliefs--beliefs about their own expected earnings and other major-specific outcomes conditional on various majors, their "population" beliefs--beliefs about the population distribution of these characteristics, as well as their subjective beliefs that they will graduate with each major. After eliciting...

The development of an index reflecting the probability that the observed correspondence between multiple choice test responses of two examinees was due to chance in the absence of copying was previously reported. The present paper reports the implementation of a statistic requiring less restrictive underlying assumptions but more computation time and a related Bayesian procedure designed to adjust the standard error estimates to counteract the effect of the presence of a substantial proportion...

This study addressed which, if any, contemporary fit indices are least susceptible to the bias associated with confirmatory factor analysis (CFA) involving a large number of measured variables. Data were obtained from student responses from 1980 to 1990 on the Student Evaluations of Educational Quality (SEEQ) instrument of H. Marsh (1987). Factor analytic studies have validated the factor structure of the SEEQ. For this study, only student scores for 28 SEEQ items (7,407 classes) were included...

In the educational literature, responses to surveys commonly serve as the source of data for many empirical articles. Whenever a survey is used as a source of data, the response rate can greatly affect the potential generalizability of the findings. Using Monte Carlo methods, this study examined the effects on sample estimates of the population mean and standard deviation for 3 levels of effect size differences between the responders and nonresponders (0.0, 0.25, and 0.50). Two data sets were...
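
The kind of simulation described above can be sketched in a few lines. In this illustration, responders and nonresponders are drawn from unit-variance normal distributions whose means differ by a chosen effect size, and the respondent-only mean is compared with the full-sample mean. The response rate, sample size, seed, and function name are assumptions for the example and are not taken from the study.

```python
import random
from statistics import mean

def respondent_mean_bias(effect_size, response_rate, n, rng):
    """Return (mean of responders) - (mean of everyone) for one simulated sample.

    Nonresponders' scores are shifted upward by effect_size standard deviations.
    """
    everyone, responders = [], []
    for _ in range(n):
        responds = rng.random() < response_rate
        score = rng.gauss(0.0 if responds else effect_size, 1.0)
        everyone.append(score)
        if responds:
            responders.append(score)
    return mean(responders) - mean(everyone)

rng = random.Random(1)  # fixed seed so the sketch is repeatable
for d in (0.0, 0.25, 0.50):  # the three effect-size levels named in the abstract
    bias = respondent_mean_bias(d, response_rate=0.6, n=20000, rng=rng)
    print(f"effect size {d:.2f}: bias in respondent mean = {bias:+.3f}")
```

Under this setup the expected bias is -(1 - response rate) x effect size, so it vanishes when responders and nonresponders do not differ and grows as either the nonresponse rate or the group difference increases.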

The purpose of this study was to empirically determine the effects of quantified violations of the underlying assumptions of parametric statistical tests commonly used in educational research, namely the correlation coefficient (r) and the t test. The effects of heterogeneity of variance, nonnormality, and nonlinear transformations of scales were studied separately and in all combinations. Monte Carlo procedures were followed to generate random digits which had the following shapes: normal,...

The document, part of a series of chapters described in SO 011 759, considers the problem of censoring in the analysis of event-histories (data on dated events, including dates of change from one qualitative state to another). Censoring refers to the lack of information on events that occur before or after the period for which data are available. Unless censorship is dealt with, researchers are likely to make erroneous inferences about the change process. The report considers several approaches...

Some students are excluded from the National Education Longitudinal Study of 1988 (NELS:88) because of an inability, whether due to physical, mental, or linguistic barriers, to participate in studies requiring questionnaire or cognitive test completion. The implications of this exclusion for sample representativeness, national estimation, and policy studies are examined. Also described is a special study undertaken in the NELS:88 First Follow-Up to compensate in key respects for undercoverage...

Grouping is a statistical procedure through which members of the same group are considered as a single unit of observation. There are various ways to assign group membership and various ways to assign values of variables to groups. There are methodological problems associated with grouping in general and with particular methods of grouping. This paper argues that a wide variety of complex analytical problems concerning inferences from grouped observations can be understood from the use of a few...

A framework is described for preparing process profiles of two different surveys conducted by the National Center for Education Statistics (NCES): the Recent College Graduates Survey and the Higher Education General Information Survey (HEGIS) Fall Enrollment Survey. The process profile examines the adequacy of the entire survey process, using generally available data. For each component of the survey--sample design, instrumentation, data collection, data processing, estimation procedure, and...

Possible bias due to sampling problems or low response rates has been a troubling "nuisance" variable in empirical research since seminal and classical studies were done on these problems at the beginning of this century. Recent research suggests that: (1) earlier views of the alleged bias problem were misleading; (2) under a variety of fairly well-specified conditions, allegedly biased samples are in fact random; and (3) "de facto" biased samples can and will...

The main goal of this study was to illustrate and provide some direction for dealing with the complexities of propensity score matching within different multilevel contexts. Special attention is given to how procedures typically applied in a non-hierarchical setting may be modified to properly reduce the expected bias in the estimated treatment effect of a high school-level intervention on college-level outcomes. In particular, students self-selected into a high school level intervention and...

