Proposal view
Proposal Type: Individual Paper 
Domain: Assessment and Evaluation 
SIG: Assessment and Evaluation 
Type Submitted Paper 
Equipment  
Paper Details
Title Measuring measuring: An item response theory approach
Abstract
The purpose of this study is to examine individual understanding of a particular evidence-based framework for constructing measures (NRC, 2001). Given the broad range of misconceptions that still persist, experts have lamented the challenges for the field of measurement and assessment (Braun & Mislevy, 2005; Popham, 2000, 2004). Building on the model of expert-novice studies of how individuals think and learn in other domains, the study explores the “building blocks” associated with proficiency in a new evidence-based approach to measurement, the Constructing Measures (CM) framework (Wilson, 2005). The study employs mixed quantitative and qualitative methods to generate and evaluate an instrument designed to measure CM proficiency. The instrument is based on a content analysis of the CM framework, which yielded six dimensions of proficiency. The test instrument, which combines open-ended and fixed-response formats, was used to collect responses from 72 participants. Guided by the 1999 Standards for Educational and Psychological Testing, the analysis of validity evidence (related to content, response processes, internal structure, and relations to other variables) suggested that the instrument can detect meaningful differences in proficiency on several dimensions. The partial credit Rasch item response model fit the data acceptably well. Internal consistency indicators, such as person separation (.87) and Cronbach’s alpha (.89), provided evidence for the instrument’s reliability. Further reliability analysis indicated high inter-rater agreement (.98) in the scoring of the non-objective items. The relationship between individual CM proficiency and other background variables was tested with multiple linear regression analysis. The results indicated that both research and consulting experience are statistically significant (p < .05) predictors of CM proficiency.
While the consecutive scaling analyses of the “building block” constructs indicate a degree of positive correlation between the dimensions, additional results from a multidimensional scaling analysis are considered.
Summary
The primary purpose of this paper is to offer a conceptual framework for defining the nature of measurement knowledge as a subject domain for research. This work is based in part on: (a) the National Research Council's (2001) assessment triangle, which identifies cognition, observation, and interpretation as the foundations of educational measurement, and (b) a "building blocks" or evidence-based framework (Mislevy, Steinberg, & Almond, 2003) that conceptualizes assessment in terms of a construct map, items design, outcome space, and measurement model (Wilson, 2005). This paper employs a cognitive developmental framework (Flavell, 1982, 1985) for the study of individual thinking and learning about a subject domain, here evidence-based frameworks for measurement.

The second purpose of this paper is to empirically investigate the schemata employed by a range of individuals, including but not limited to those enrolled in an introductory graduate-level measurement course. The nature and structure of novice, intermediate, and expert thinking are explored with multiple measures, including course assessments, interviews, and a test instrument. Consistent with these purposes, a final aim is to present and analyze evidence for the reliability and validity of the CM instrument employed in the study.

A survey of the methodologies used to research cognition and learning strongly suggests that mixed qualitative and quantitative methods are important to the investigation of any subject domain (NRC, 1999, 2001). The paper shows how mixed methods are applied, primarily at the level of instrument design and validation. The qualitative methods employed include examination of university course-embedded assessments and semi-structured interviews conducted with course participants. The quantitative methods employed in the instrument development process primarily concern the choice and evaluation of a measurement model to examine the data from the CM instrument. The use of Rasch item response modeling software such as ConQuest (Wu, Adams, & Wilson, 1998) provided an opportunity to check hypotheses about the constructs, examine overall item functioning, and analyze the utility of the response categories for each item. Item and respondent fit statistics also guide judgments about the appropriateness of the measurement model itself.

The paper specifically examines empirical data provided by a mixed-format test instrument (26 items). Respondents (n = 72) were sampled from three pools: those who have taken an introductory graduate-level measurement course; those who have had experience with the principles expounded in the building blocks framework, including through research projects affiliated with a measurement research center; and those who are measurement experts, particularly in Rasch item response modeling.

Several sources of evidence were collected for the instrument’s quality (AERA, APA, & NCME, 1999). Validity data on the instrument’s content, response processes, internal structure, and relations to other variables were collected. All test items were fitted with a partial credit item response model (Wright & Masters, 1981). Furthermore, data on internal consistency and inter-rater consistency were collected for the reliability analyses.
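
As an illustration of the measurement model named above, the following minimal sketch computes the category probabilities for a single polytomous item under the partial credit model. It is not the ConQuest estimation routine used in the study; the function names, the ability value, and the step difficulties are hypothetical and were chosen only to show the form of the model.

```python
import math


def pcm_probabilities(theta, step_difficulties):
    """Category probabilities for one item under the partial credit model.

    theta: person ability in logits (hypothetical value supplied by the caller).
    step_difficulties: step parameters delta_1..delta_m in logits (hypothetical).
    Returns a list of m + 1 probabilities for scores 0..m.
    """
    # Running sums of (theta - delta_k); score 0 corresponds to an empty sum of 0.
    cumulative = [0.0]
    for delta in step_difficulties:
        cumulative.append(cumulative[-1] + (theta - delta))

    # Subtract the largest exponent before exponentiating to avoid overflow.
    largest = max(cumulative)
    weights = [math.exp(value - largest) for value in cumulative]
    total = sum(weights)
    return [weight / total for weight in weights]


def expected_score(theta, step_difficulties):
    """Expected item score: sum of each score times its probability."""
    probabilities = pcm_probabilities(theta, step_difficulties)
    return sum(score * p for score, p in enumerate(probabilities))


if __name__ == "__main__":
    # Illustrative values only: a three-category item and a person at 0 logits.
    print(pcm_probabilities(0.0, [-1.0, 1.0]))
    print(expected_score(0.0, [-1.0, 1.0]))
```

With a single step difficulty the function reduces to the dichotomous Rasch model, so a respondent whose ability equals the step difficulty has a success probability of .5.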

NRC (2001) and other evidence-based frameworks derived from experts’ scripts, frames, and schemata were used to define measurement knowledge. Hypotheses about the nature and structure of that knowledge were put to empirical test. The instrumentation used to measure knowledge of an evidence-based framework for measuring in this empirical study has resulted in a careful examination of key dimensions articulated in the NRC report.

First, the content validity evidence supports the CM instrument. The qualitative analysis of the instrument’s alignment with measurement course objectives, assignments, and the instructor’s expectations provides strong evidence to support claims about the representativeness of the constructs and their mapping onto the building block framework. Second, the validity evidence for the response processes also supports the argument for the CM instrument’s uses. Respondents were, on the whole, positive about the items. Overall, few found them confusing or inappropriate; hence construct-irrelevant information was generally kept to a minimum. Third, the internal structure evidence for the CM instrument’s uses was more complicated. Both general item analyses and differential item functioning analysis indicated support for the items design of the CM instrument overall. The internal structure evidence in support of the construct maps was more mixed. The item thresholds for the UCM, UID, and UWM sub-scales corresponded relatively well with the respective construct theories. Yet the results for the UQC dimensions appear to move in the other direction and require further research. In particular, it appears that the UQC-R construct map represented by the fixed-response items was not supported by the Wright map for that dimension. Interestingly, after a correction for disattenuation was performed, the correlations between sub-dimensions were considerable, with coefficients ranging from .796 to .985 (lending support to the argument for the unidimensionality of the CM instrument). Fourth, scores on the CM instrument were positively and strongly correlated (e.g., .89) with the grades respondents received in the course EDU274A, as expected. Fifth, the reliability evidence for the CM instrument was more than acceptable for research purposes.
Despite the lack of any absolute standards for what is acceptable, the reliability (r > 0.85) of the CM instrument overall offered support for limited uses, including formative classroom assessment and professional development. Moreover, the inter-rater reliability evidence for the non-objective items on the CM instrument was also acceptable: in fact, the Pearson correlation between the MLE values for the two raters was very high (0.98).
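
The correction for disattenuation and the inter-rater correlation reported above can be sketched as follows. This is a minimal illustration, assuming the standard Spearman correction (the observed correlation divided by the square root of the product of the two reliabilities). The function names and all numeric inputs are hypothetical and are not the study's data.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cross / math.sqrt(ss_x * ss_y)


def disattenuate(r_observed, reliability_x, reliability_y):
    """Spearman's correction: estimated correlation between error-free scores."""
    for reliability in (reliability_x, reliability_y):
        if not 0 < reliability <= 1:
            raise ValueError("reliabilities must lie in (0, 1]")
    return r_observed / math.sqrt(reliability_x * reliability_y)


if __name__ == "__main__":
    # Hypothetical proficiency estimates from two raters for five respondents.
    rater_a = [-1.2, -0.4, 0.1, 0.9, 1.6]
    rater_b = [-1.0, -0.5, 0.2, 1.0, 1.5]
    print(pearson(rater_a, rater_b))

    # Hypothetical observed correlation and sub-scale reliabilities.
    print(disattenuate(0.60, 0.80, 0.80))
```

Because the correction divides by a quantity no larger than 1, the disattenuated coefficient is never smaller in magnitude than the observed one, and it can exceed 1 when the reliabilities are underestimated.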

Lastly, the results from the multiple linear regression analysis suggest that, together, at least four independent variables can “explain” about 35% of the variation in (CMEAP) proficiency.
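
The sense in which a set of predictors can “explain” variation is the coefficient of determination, R². The sketch below fits an ordinary least squares model with an intercept and returns R². It assumes nothing about the study's actual variables or software, and the function names and data are hypothetical.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; system is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * solution[j] for j in range(i + 1, n))
        solution[i] = (aug[i][n] - tail) / aug[i][i]
    return solution


def fit_ols(predictors, outcome):
    """Fit y = b0 + b1*x1 + ... by least squares; return (coefficients, R^2)."""
    # Prepend a column of ones so the first coefficient is the intercept.
    rows = [[1.0] + [float(v) for v in row] for row in predictors]
    p = len(rows[0])
    # Normal equations: (A'A) b = A'y.
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    aty = [sum(r[i] * y for r, y in zip(rows, outcome)) for i in range(p)]
    coefficients = solve_linear_system(ata, aty)

    fitted = [sum(b * v for b, v in zip(coefficients, r)) for r in rows]
    mean_y = sum(outcome) / len(outcome)
    ss_residual = sum((y - f) ** 2 for y, f in zip(outcome, fitted))
    ss_total = sum((y - mean_y) ** 2 for y in outcome)
    if ss_total == 0:
        raise ValueError("R^2 is undefined when the outcome is constant")
    return coefficients, 1.0 - ss_residual / ss_total


if __name__ == "__main__":
    # Hypothetical data: two background variables and a proficiency estimate.
    background = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
    proficiency = [1.0, 3.5, 4.0, 5.5, 8.2]
    coefficients, r_squared = fit_ols(background, proficiency)
    print(coefficients, r_squared)
```

An R² of about .35 would mean that the fitted values account for roughly 35% of the total sum of squares of the outcome, which is a statement about shared variance and not about causation.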

Many experts in the field of educational measurement, assessment, and testing have commented on the consequences of assessment illiteracy, test-theory misconceptions, and “novice” thinking about measurement on both educational policy and practice (Braun & Mislevy, 2005). To address these misconceptions, measurement teachers and professionals need to better assess student learning, particularly at the classroom level. This study offers a conceptual framework for defining the nature of measurement knowledge, and evaluates the effectiveness of measures that are intended to assess that knowledge.
Keywords Assessment of competence
Item response theory (IRT)
Validity/reliability
Appendices
Authors
Name | Surname | Institution | Country | e-mail | EARLI Number | Presenting
Brent | Duckor | University of California, Berkeley | United States | bduckor@berkeley.edu | | *
© European Association for Research on Learning and Instruction, 2012 All rights reserved.