Categorical - Eric Heidel, PhD PStat - Statistician For Hire

Tags

Published on

April 1, 2015

Categorical measurement caveats

95% Confidence Interval, Categorical, Diagnostic Testing, Inter-rater Reliability, Intraclass Correlation Coefficient, Kappa Statistic, Multivariate Statistics, Negative Predictive Value, Odds Ratio With 95% CI, Positive Predictive Value, Relative Risk, Sample Size, Sensitivity, Specificity, Statistical-power-test

Effects of categorical measurement

Decrease statistical power and increase sample size

Categorical variables are very prevalent in medicine. Measures like presence of comorbidities, mortality, and test results are categorical in nature. Here are some general caveats associated with categorical measurement and sample size:

1. Compared with continuous outcomes, categorical outcomes DECREASE statistical power and INCREASE the needed sample size. This is due to the lack of precision and accuracy in categorical measurement.

2. The underlying algebra for calculating 95% confidence intervals of odds ratios and relative risks depends directly on the cell counts, and therefore on the sample size. With smaller sample sizes, wider and less precise 95% confidence intervals will be found. If one of the cells of a cross-tabulation table has far fewer observations than the other cells, then the 95% confidence interval will be wider and potentially not truly interpretable. A 95% confidence interval will become narrower, or more precise, only with larger sample sizes.

3. When using categorical variables for diagnostic testing purposes, larger sample sizes will be needed to calculate precise measures of sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV). With smaller sample sizes in diagnostic studies, a change in one or two observations can have drastic effects on the diagnostic values.

This is especially true when a subjective rating is used to diagnose someone as "positive" or "negative" for a given disease state (for example, a radiologist reading an X-ray). Inter-rater reliability coefficients such as Kappa or the intraclass correlation coefficient (ICC) should be employed to ensure consistency and reliability among subsequent ratings and raters. Sensitivity, specificity, and PPV will be affected by inter-rater reliability. Receiver Operating Characteristic (ROC) curves can be used to find the cut-off value at which the sensitivity and specificity of a test are jointly maximized. ROC curves can also be used to compare the area under the curve (AUC) between several diagnostic tests at the same time so that the best can be chosen.

4. For each categorical predictor parameter (or variable) that you want to include in a multivariate model, you have to increase your sample size by at least 20-40 observations of the outcome. This is due to the limited precision, accuracy, and statistical power associated with categorical measurement. Researchers HAVE to collect more observations in order to detect any potential significant multivariate associations.

If a polychotomous variable is to be used in a model, create (a-1) dichotomous variables, where a is the number of categories. In each one, "0" means not being in that category and "1" means being in that category; the category without a variable of its own serves as the reference. For each level, 20-40 more observations of the outcome will be needed to have enough statistical power to detect differences amongst the multiple groups.
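
Caveat 2 above can be made concrete with a small sketch. The post does not say which interval formula it has in mind, so this assumes the common Wald approximation on the log odds ratio; the function name and all cell counts are hypothetical. Both tables have the same proportions, but the second has ten times as many observations.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio with a Wald 95% CI for a 2x2 table (assumed method).

    a = exposed with outcome,   b = exposed without outcome
    c = unexposed with outcome, d = unexposed without outcome
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR): every cell contributes 1/count,
    # so a single sparse cell inflates the whole interval.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

# Hypothetical counts: identical proportions, ten times the sample size
small = odds_ratio_ci(10, 20, 5, 25)
large = odds_ratio_ci(100, 200, 50, 250)

for label, (or_, lo, hi) in (("n = 60", small), ("n = 600", large)):
    print(f"{label}: OR = {or_:.2f}, 95% CI {lo:.2f} to {hi:.2f}")
```

The point estimate is the same in both tables, yet only the larger table gives an interval that excludes 1. Because each cell enters the standard error as 1/count, the smallest cell dominates the width of the interval.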
Published on

November 16, 2014

Dichotomous variables in SPSS

Categorical Variables

Analyze dichotomous variables in SPSS

Choose reference categories or dummy code variables

Here is a really quick tip for making the statistics and outputs of SPSS much easier to interpret when using dichotomous predictor and outcome variables. Whichever "level" of the dichotomy you are most interested in should be coded as "1." If a participant has the characteristic or outcome of interest, code those observations as "1" and the absence of the characteristic or outcome of interest as "0."

SPSS has a default that always makes the highest numerical category the reference group. However, most times, researchers want to know the odds of something occurring versus not occurring, NOT the odds of something not occurring versus the odds of it occurring. Therefore, it is important when running bivariate associations between dichotomous categorical variables to always use the coding scheme above so that the statistical outputs can be interpreted properly.

When conducting multivariate analyses, SPSS still uses the same default of treating the highest-numbered category as the reference. The "point and click" interface for multivariate statistics in SPSS gives you the option to click on a "Categorical" button. Always do this and make sure that you set the reference category to "first" when running these types of statistics.
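
To see why the direction of the coding matters, here is a minimal sketch in plain Python (not SPSS syntax). The function and variable names and the data are hypothetical. It computes the odds ratio with the outcome of interest coded as "1," then again with the coding flipped.

```python
def odds_ratio(exposure, outcome):
    """Odds of outcome == 1 for exposure == 1 versus exposure == 0."""
    pairs = list(zip(exposure, outcome))
    a = pairs.count((1, 1))  # exposed, outcome present
    b = pairs.count((1, 0))  # exposed, outcome absent
    c = pairs.count((0, 1))  # unexposed, outcome present
    d = pairs.count((0, 0))  # unexposed, outcome absent
    return (a * d) / (b * c)

# Hypothetical data: 1 = characteristic/outcome present, 0 = absent
smoker = [1] * 30 + [0] * 30
disease = [1] * 20 + [0] * 10 + [1] * 10 + [0] * 20

# Odds of disease for smokers versus non-smokers
or_of_interest = odds_ratio(smoker, disease)

# Flip the outcome coding so that "1" now means disease ABSENT
no_disease = [1 - y for y in disease]
or_flipped = odds_ratio(smoker, no_disease)

print(or_of_interest, or_flipped)
```

The two results are reciprocals of each other. Both are correct, but only the first answers the question most researchers are asking, which is why the reference category should be set deliberately rather than left at the default.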
Published on

October 7, 2014

Measurement at continuous levels

Categorical, Continuous, Ordinal, Parametric Statistics, Variables

Measure variables at the highest level possible

Don't discount your continuous variables!

There is a tendency for researchers to take continuous variables and recode them into ordinal or categorical variables. For example, researchers may ask participants to indicate whether they are 20-30 years old, 31-40 years old, 41-50 years old, 51-60 years old, or 60+ years old. Or, they may set an arbitrary "cut-off" and compare values above and below it (people who are 55 years and older versus everyone younger than 55 years).

Researchers lose valuable precision and accuracy in measurement when continuous variables are demoted to ordinal or categorical levels. It is ALWAYS better to take an actual numerical value with a "true zero" and analyze it using parametric statistics. If there is a theoretical, conceptual, or empirical basis for paring down continuous measures into lower levels of measurement, then and only then should it be done. If you were a researcher and wanted to know the most precise and accurate measure possible of my age, which of the following is the best way to ask?

1. How many years old are you? (continuous)

2. How old are you? (circle one) 20-30 31-40 41-50 51-60 60+ (ordinal)

3. Are you above or below the age of 55? (categorical)

The continuous method will give you a stronger measure of age, which can then be broken down into separate ordinal or categorical levels, AT YOUR DISCRETION. So, always measure at the continuous level if at all possible.
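
The loss of precision from demoting a continuous variable can be illustrated with a small simulation. This is only a sketch under stated assumptions: the data are simulated, not real, and the effect size, seed, and variable names are arbitrary choices made for illustration.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient, written out with the stdlib only."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

random.seed(2014)  # fixed seed so the sketch is repeatable

# Simulated data: age in years and an outcome that rises with age
ages = [random.uniform(20, 80) for _ in range(2000)]
outcome = [0.05 * age + random.gauss(0, 1) for age in ages]

# Demote age to a dichotomy at the arbitrary cut-off of 55 years
age_55_plus = [1 if age >= 55 else 0 for age in ages]

r_continuous = pearson_r(ages, outcome)
r_dichotomous = pearson_r(age_55_plus, outcome)

print(f"continuous age:  r = {r_continuous:.3f}")
print(f"55+ versus <55:  r = {r_dichotomous:.3f}")
```

The same participants and the same outcome produce a weaker observed association once age is cut at 55. A weaker observed association means less statistical power at a given sample size.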

With this being said, PLEASE realize that while we can go from continuous to ordinal and categorical levels of measurement, it is IMPOSSIBLE to change categorical and ordinal variables into a continuous level of measurement.

Let's use a basic example:

Gender - 0 = male and 1 = female

Is there any way to convert this into a continuous variable? No.

Here is another example:

How old are you? (circle one) 20-30 31-40 41-50 51-60 60+

Can you convert this into a continuous variable? No, again.
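
The one-way nature of this conversion can be sketched in a few lines of Python. The age bands and the cut-off of 55 come from the examples above. The function names and sample ages are hypothetical, and ages are assumed to be 20 or older.

```python
import bisect

# Upper bounds of the first four age bands in the survey question above
BAND_EDGES = [30, 40, 50, 60]
BAND_LABELS = ["20-30", "31-40", "41-50", "51-60", "60+"]

def age_to_band(age):
    """Ordinal band for an exact age in whole years (assumes age >= 20)."""
    return BAND_LABELS[bisect.bisect_left(BAND_EDGES, age)]

def age_to_cutoff(age, cutoff=55):
    """Dichotomous coding: 1 = at or above the cut-off, 0 = below it."""
    return 1 if age >= cutoff else 0

ages = [31, 40, 55, 59, 72]
bands = [age_to_band(a) for a in ages]
flags = [age_to_cutoff(a) for a in ages]

print(bands)  # five distinct ages collapse into three bands
print(flags)  # and into only two categories
```

Going from exact ages to bands or to a cut-off is a simple lookup. Going back is not possible: a response of "31-40" could have come from a 31-year-old or a 40-year-old, and nothing in the recorded category says which.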

In conclusion, ALWAYS try to measure your variables at a continuous level, if at all possible or feasible. They can be broken down into ordinal and categorical variables as needed. Also, REALIZE that once you have decided to measure something at a categorical or ordinal level, it cannot be converted to continuous.