The Kappa statistic
The Kappa statistic is a measure of inter-rater reliability for a construct or behavior that is rated using a dichotomous categorical outcome. When a sequential series of steps must be completed to yield an end product, as in performance assessment, independent raters score a "checklist" or series of "yes/no" responses. The Kappa statistic can then be used to assess the level of agreement (consistency, reliability) between raters on those dichotomous responses.
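To make the idea concrete, here is a minimal sketch of Cohen's kappa for two raters, assuming the usual definition kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed proportion of agreement and p_e is the agreement expected by chance from each rater's own "yes"/"no" rates. The function name `cohens_kappa` and the ten checklist ratings are hypothetical and made up for illustration only.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who scored the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equally long, non-empty rating lists")
    n = len(rater_a)

    # Observed agreement: share of items where both raters gave the same score.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of the raters' marginal proportions,
    # summed over every category either rater used.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    categories = set(count_a) | set(count_b)
    p_e = sum(count_a[c] * count_b[c] for c in categories) / n**2

    if p_e == 1.0:
        # Both raters used one identical category for every item.
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical data: two raters scoring the same 10 checklist steps.
rater_1 = ["yes", "yes", "no", "yes", "no", "yes", "yes", "no", "yes", "yes"]
rater_2 = ["yes", "no", "no", "yes", "no", "yes", "yes", "yes", "yes", "yes"]

kappa = cohens_kappa(rater_1, rater_2)
print(round(kappa, 3))  # 0.524
```

In this made-up example the raters agree on 8 of 10 steps (p_o = 0.80), but because both say "yes" 70% of the time, chance alone would produce p_e = 0.58, so kappa is noticeably lower than the raw agreement.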
It is important that raters share an operational definition of what constitutes a "yes" or "no" with regard to performance. The construct or behavior of interest must be standardized between raters so that unsystematic bias can be reduced. A lack of operationalization and standardization in performance assessment significantly DECREASES the chances of obtaining evidence of inter-rater reliability when using the Kappa statistic.
Kappa is not a "powerful" statistic because it is computed from dichotomous categorical variables. Larger sample sizes are needed to achieve adequate statistical power when categorical outcomes are utilized. So, many observations of the performance or simulation may be needed to adequately assess BOTH inter-rater reliability and outcomes of interest. The chances of demonstrating adequate inter-rater reliability decrease with fewer observations of performance or simulation.
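The effect of the number of observations can be sketched with a rough large-sample approximation for the standard error of kappa, SE = sqrt(p_o(1 - p_o) / (n(1 - p_e)^2)). This is only an approximation, and the agreement values used below (p_o = 0.80, p_e = 0.58) and the function name `kappa_standard_error` are assumptions chosen for illustration, not results from any study.

```python
import math


def kappa_standard_error(p_o, p_e, n):
    """Approximate large-sample standard error of kappa (illustrative only)."""
    return math.sqrt(p_o * (1 - p_o) / (n * (1 - p_e) ** 2))


# Assumed agreement levels, held fixed while the number of observations grows.
P_O, P_E = 0.80, 0.58

for n in (10, 40, 160):
    se = kappa_standard_error(P_O, P_E, n)
    print(f"n={n:>3}  approx. SE of kappa = {se:.3f}")
```

Under this approximation the standard error shrinks with the square root of the number of observations, so quadrupling the observations only halves the uncertainty around kappa.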