Kappa is a measure of inter-rater reliability. It is used when rating performance or constructs at a dichotomous categorical level (e.g., "yes"/"no").
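To make the calculation concrete, the sketch below computes Cohen's kappa for two raters scoring the same set of performances as "yes" or "no". The ratings, function name, and data are hypothetical illustrations, not taken from the text; kappa is computed with the standard formula (observed agreement minus chance agreement, divided by one minus chance agreement).

```python
# Minimal sketch of Cohen's kappa for two raters and dichotomous ratings.
# The ratings below are hypothetical illustration data.

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters using dichotomous ("yes"/"no") ratings."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)

    # Observed agreement: proportion of items both raters scored identically.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement, based on each rater's marginal "yes"/"no" rates.
    p_yes_a = rater_a.count("yes") / n
    p_yes_b = rater_b.count("yes") / n
    p_expected = p_yes_a * p_yes_b + (1 - p_yes_a) * (1 - p_yes_b)

    # kappa = (observed agreement - chance agreement) / (1 - chance agreement)
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical ratings of 10 simulated performances by two raters.
rater_a = ["yes", "yes", "no", "yes", "no", "yes", "yes", "no", "yes", "no"]
rater_b = ["yes", "no",  "no", "yes", "no", "yes", "yes", "yes", "yes", "no"]

print(f"kappa = {cohens_kappa(rater_a, rater_b):.2f}")  # prints kappa = 0.58
```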
It is important that raters have an operational definition of what constitutes a "yes" or "no" with regard to performance. The construct or behavior of interest must be standardized between raters so that unsystematic bias is reduced. A lack of operationalization and standardization in performance assessment significantly decreases the chances of obtaining evidence of inter-rater reliability when using the Kappa statistic.
Kappa is not a "powerful" statistic because the variables entered into the analysis are dichotomous and categorical. Larger sample sizes are needed to achieve adequate statistical power when categorical outcomes are used, so many observations of performance or simulation may be needed to adequately assess both inter-rater reliability and the outcomes of interest. The chances of obtaining adequate inter-rater reliability decrease with fewer observations of performance or simulation.
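The sketch below illustrates this sample-size point with hypothetical data (the agreement rate, "yes" rate, and sample sizes are assumptions made only for the demonstration): bootstrapping kappa over 20 versus 200 paired observations shows how much more the kappa estimate can swing when few performances are observed.

```python
# Rough sketch (hypothetical data) of why few observations make kappa unstable:
# bootstrap the spread of kappa estimates at n = 20 versus n = 200 paired ratings.
import random

random.seed(0)

def cohens_kappa(pairs):
    n = len(pairs)
    p_obs = sum(a == b for a, b in pairs) / n
    p_yes_a = sum(a == "yes" for a, _ in pairs) / n
    p_yes_b = sum(b == "yes" for _, b in pairs) / n
    p_exp = p_yes_a * p_yes_b + (1 - p_yes_a) * (1 - p_yes_b)
    if p_exp == 1:   # degenerate resample (all ratings identical);
        return 1.0   # treat as perfect agreement by convention
    return (p_obs - p_exp) / (1 - p_exp)

def simulate_pairs(n, p_yes=0.6, p_agree=0.85):
    """Hypothetical paired ratings: rater B agrees with rater A 85% of the time."""
    pairs = []
    for _ in range(n):
        a = "yes" if random.random() < p_yes else "no"
        b = a if random.random() < p_agree else ("no" if a == "yes" else "yes")
        pairs.append((a, b))
    return pairs

def bootstrap_spread(pairs, reps=2000):
    """2.5th-97.5th percentile spread of kappa across bootstrap resamples."""
    kappas = sorted(
        cohens_kappa([random.choice(pairs) for _ in range(len(pairs))])
        for _ in range(reps)
    )
    return kappas[int(0.025 * reps)], kappas[int(0.975 * reps)]

for n in (20, 200):
    lo, hi = bootstrap_spread(simulate_pairs(n))
    print(f"n = {n:3d}: kappa estimate ranges roughly from {lo:.2f} to {hi:.2f}")
```

With the smaller sample, the plausible range of kappa values is far wider, which is why many observations are needed before evidence of adequate inter-rater reliability can be claimed with confidence.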