Proposal Type: Individual Paper 
Domain: Assessment and Evaluation 
SIG: Assessment and Evaluation 
Type Submitted Paper 
Equipment PC and projector
Paper Details
Title Educational Standards for Mathematics in Germany: Results for coder reliability
Abstract
Following Germany's mediocre results in international large-scale assessments such as TIMSS, PISA and PIRLS, the 16 German federal states recently decided to introduce educational standards for several subject areas and graduation levels, with the goal of improving student performance. The standards define educational goals in terms of competencies. Whether these educational standards are met will be tested by means of standardized tests. The present contribution describes the development of a standardized test to measure attainment of the mathematics standard for the “medium” graduation level (mittlerer Schulabschluss), usually obtained after grade ten, with a focus on coder reliability. Based on a faceted theoretical competency model, an item set was developed and field tested as a national extension of PISA 2006 (N = 12000). Nearly half of the 313 items were open-ended, so the students’ answers had to be coded before statistical analysis. For this purpose, 14 coders were trained on the basis of a standardized coding guide. To ensure reliable measures, the consistency of the codes given by all coders for 180 responses to each of the open-ended items was examined within a generalisability theory framework. The observed variance was decomposed into variance components for three main effects (Student, Coder, Item), three first-order interactions (Student x Item, Student x Coder, Coder x Item) and a residual (including the second-order interaction). The results show that the observed variance is mostly explained by effects involving the variables Item and Student, while the variance components involving coders are negligibly small. Thus, no systematic coder effect was found, and the codes can be interpreted independently of the coders. Advantages of using generalisability theory to examine coder reliability are discussed in contrast to other approaches.
Summary
Following Germany's mediocre results in international large-scale assessments such as TIMSS, PISA and PIRLS, the 16 German federal states recently decided to introduce educational standards for several subject areas and graduation levels, with the goal of improving student performance. The standards define educational goals in terms of competencies. Whether these educational standards are met will be tested by means of standardized tests. Implementing a standards-based approach means a shift from the traditionally input-oriented German school system towards an output-oriented system.

The present contribution describes the development of a standardized test to measure attainment of the mathematics standard for the “medium” graduation level (mittlerer Schulabschluss), usually obtained after grade ten, with a focus on coder reliability. The test is based on a faceted theoretical competency model that is aligned with the German mathematics curricula. The model differentiates five big ideas, six competencies and three difficulty levels. About 800 items were developed to cover the theoretical model. The items were field tested as a national extension of PISA 2006 (field trial and main study). In the PISA 2006 main study, 313 items were used. Nearly half of them were open-ended, so the students’ answers had to be coded before statistical analysis could take place. For this purpose, 14 coders were trained on the basis of a standardized coding guide. To ensure reliable measures, the consistency of the codes given by different coders for the same answers was examined. The study investigates three main questions:

1. Are the given codes sufficiently consistent between the coders?

2. How many coders should be used for future applications of the test?

3. To what extent do the coders have systematic effects on the observed scores that may jeopardize interpretation of the data?

Based on a balanced design, 180 responses to each of the open-ended items were coded by multiple coders. The results were analysed within a generalisability theory framework. A similar procedure was used in PISA 2003 to examine coder reliability. Generalisability theory expands the concept of reliability by allowing more than one source of error in a measured variable to be modelled at the same time. The observed variance was decomposed into variance components for three main effects (Student, Coder, Item), three first-order interactions (Student x Item, Student x Coder, Coder x Item) and a residual (including the second-order interaction Student x Coder x Item).
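The decomposition described above can be illustrated with a minimal sketch. It assumes a fully crossed Student x Coder x Item design with exactly one code per cell and estimates the components from the expected mean squares of a random-effects ANOVA. The function name, the dictionary keys and the simulated data are illustrative assumptions and are not taken from the study, whose actual design and estimation procedure may differ.

```python
import numpy as np

def variance_components(x):
    """ANOVA-based variance component estimates for a fully crossed
    student x coder x item design with one observation per cell.

    x: array of shape (n_students, n_coders, n_items).
    Estimates can come out slightly negative; they are not truncated here.
    """
    x = np.asarray(x, dtype=float)
    ns, nc, ni = x.shape
    m = x.mean()
    # Marginal means, kept broadcastable against x.
    m_s = x.mean(axis=(1, 2), keepdims=True)
    m_c = x.mean(axis=(0, 2), keepdims=True)
    m_i = x.mean(axis=(0, 1), keepdims=True)
    m_sc = x.mean(axis=2, keepdims=True)
    m_si = x.mean(axis=1, keepdims=True)
    m_ci = x.mean(axis=0, keepdims=True)

    # Mean squares for main effects, first-order interactions and residual.
    ms_s = nc * ni * ((m_s - m) ** 2).sum() / (ns - 1)
    ms_c = ns * ni * ((m_c - m) ** 2).sum() / (nc - 1)
    ms_i = ns * nc * ((m_i - m) ** 2).sum() / (ni - 1)
    ms_sc = ni * ((m_sc - m_s - m_c + m) ** 2).sum() / ((ns - 1) * (nc - 1))
    ms_si = nc * ((m_si - m_s - m_i + m) ** 2).sum() / ((ns - 1) * (ni - 1))
    ms_ci = ns * ((m_ci - m_c - m_i + m) ** 2).sum() / ((nc - 1) * (ni - 1))
    resid = x - m_sc - m_si - m_ci + m_s + m_c + m_i - m
    ms_res = (resid ** 2).sum() / ((ns - 1) * (nc - 1) * (ni - 1))

    # Solve the expected-mean-square equations for the components.
    return {
        "student": (ms_s - ms_sc - ms_si + ms_res) / (nc * ni),
        "coder": (ms_c - ms_sc - ms_ci + ms_res) / (ns * ni),
        "item": (ms_i - ms_si - ms_ci + ms_res) / (ns * nc),
        "student:coder": (ms_sc - ms_res) / ni,
        "student:item": (ms_si - ms_res) / nc,
        "coder:item": (ms_ci - ms_res) / ns,
        "residual": ms_res,
    }

# Synthetic illustration (NOT study data): every coder gives the same code,
# so all components that involve the coder should be (numerically) zero.
rng = np.random.default_rng(0)
ns, nc, ni = 30, 4, 5
student = rng.normal(0, 1.0, (ns, 1, 1))
item = rng.normal(0, 1.0, (1, 1, ni))
student_item = rng.normal(0, 0.5, (ns, 1, ni))
codes = np.broadcast_to(student + item + student_item, (ns, nc, ni))
vc = variance_components(codes)
for name, value in vc.items():
    print(f"{name:14s} {value: .4f}")
```

In this synthetic case the variance is carried entirely by Student, Item and Student x Item. A sizeable Coder, Student x Coder or Coder x Item estimate would instead point to coders as a source of variation.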

Three main results can be distinguished, one for each of the questions above. First, the generalisability coefficients computed for different scenarios are very high. Thus, the codes are sufficiently consistent between coders. Second, the generalisability coefficient remains very high even if only one coder is considered. Hence, the coding in future applications of the test can be done by a single coder. Third, the observed variance is mostly explained by effects involving the variables Item and Student, while the variance components involving coders are negligibly small. Thus, no systematic coder effect (main effect or first-order interactions) was found, and the codes can be interpreted independently of the coders.
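How a generalisability coefficient responds to the number of coders can be sketched as follows. The sketch uses the textbook relative coefficient for a crossed Student x Coder x Item design with students as the object of measurement. Whether the study used exactly this coefficient is an assumption, and the variance components and the number of items below are hypothetical values chosen only for illustration.

```python
def relative_g_coefficient(vc, n_coders, n_items):
    """Relative generalisability coefficient for student scores that are
    averaged over n_coders coders and n_items items (textbook formula for
    a crossed student x coder x item design; assumed, not from the study).
    """
    error = (
        vc["student:coder"] / n_coders
        + vc["student:item"] / n_items
        + vc["residual"] / (n_coders * n_items)
    )
    return vc["student"] / (vc["student"] + error)

# HYPOTHETICAL variance components, not results reported by the study.
vc = {
    "student": 0.20,
    "student:item": 0.10,
    "student:coder": 0.001,
    "residual": 0.01,
}
n_items = 20  # hypothetical test length

for n_coders in (1, 2, 14):
    g = relative_g_coefficient(vc, n_coders, n_items)
    print(f"coders={n_coders:2d}  G={g:.3f}")
```

When the components involving coders are small, as in these hypothetical values, adding coders changes the coefficient very little, which is the kind of pattern that supports coding by a single coder.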

In summary, the study shows that the codes were sufficiently consistent between coders, that the coding of open-ended items can be done by one coder in future applications of the test, and that the results can be interpreted independently of the respective coder. Furthermore, the study demonstrates the application of generalisability theory to examine coder reliability. In contrast to the more commonly used Cohen’s Kappa, generalisability theory allows more than two coders to be considered simultaneously. The approach also gives more detailed information on the sources of variation in the observed codes, which can be used to identify threats to coder reliability. If, for example, the results show a relatively large variance component for the interaction Coder x Item (which was not observed in the present study), a researcher can most probably identify systematic problems that one or more coders had with specific items. Such detailed information is mostly not available within other approaches to coder reliability. From a practical point of view, it is discussed how generalisability theory can be applied when incomplete designs are used, which is mostly the case for large-scale assessments.
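For comparison, Cohen’s Kappa can be sketched in a few lines. The function and the example codes are illustrative assumptions. The point of the sketch is that the statistic is defined for one pair of coders at a time, so a study with 14 coders would need it for every pair, and the result does not separate item-related from coder-related variation.

```python
from collections import Counter

def cohens_kappa(codes_a, codes_b):
    """Cohen's kappa for the codes two coders gave to the same responses."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("both coders must code the same non-empty set of responses")
    n = len(codes_a)
    # Observed proportion of agreement.
    p_observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    # Agreement expected by chance from each coder's marginal code frequencies.
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    p_expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / n**2
    if p_expected == 1.0:
        # Both coders used a single identical code throughout; kappa is
        # undefined (0/0), and 1.0 is returned here by convention.
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)

# Made-up codes for four responses (1 = credit, 0 = no credit).
coder_a = [1, 1, 0, 0]
coder_b = [1, 0, 0, 0]
kappa = cohens_kappa(coder_a, coder_b)
print(kappa)
```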
Keywords Assessment of competence
Generalizability theory
Validity/reliability
Appendices
Authors
Name | Surname | Institution | Country | e-mail | EARLI Number | Presenting
Andreas | Frey | Leibniz-Institute for Science Education (IPN) | Germany | frey@ipn.uni-kiel.de | | *
Claus H. | Carstensen | Leibniz-Institute for Science Education (IPN) | Germany | carstensen@ipn.uni-kiel.de | |
© European Association for Research on Learning and Instruction, 2012 All rights reserved.