Outliers - Eric Heidel, PhD PStat - Statistician For Hire

Published on

April 11, 2015

Basic principles of correlational research

Correlations Kurtosis Non-parametric Statistics Normality Outliers Pearson's R Skewness Spearman's Rho

Spearman's rho vs. Pearson's r

Bivariate associations between variables

Surveys and the outcomes they generate often fail to meet the assumption of normality, as assessed by skewness and kurtosis statistics. Also, some types of variables are just naturally skewed (e.g., income, length of stay at a hospital), and thus require the use of non-parametric statistics.

Spearman's rho correlation is considered non-parametric because it is computed on the ranks of the observations rather than on their raw values. That makes it the correlational test to use when finding the association between two variables measured at an ordinal level. Ordinal-level measurement has neither equal intervals between values nor a "true zero" and therefore cannot possess the precision and accuracy of continuous variables.

Pearson's r is used when correlating two continuous variables. However, one MUST check for the assumption of normality and identify and make decisions about any outliers (observations more than 3.29 standard deviations away from the mean). This is of PARAMOUNT IMPORTANCE because correlations are highly influenced by outlying observations. Just ONE outlier can artificially skew a correlation positively or negatively, and in a statistically significant fashion!
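
To make the influence of a single outlier concrete, here is a minimal sketch in plain Python (standard library only). The data are made up for illustration, and the helper names pearson_r, average_ranks, and spearman_rho are hypothetical, not taken from any statistics package. Spearman's rho is computed here as Pearson's r on the ranks, with tied values given their average rank.

```python
import math

def pearson_r(x, y):
    """Pearson's r from sums of squared deviations and cross-products."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """Rank values from 1 to n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho as Pearson's r applied to the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Hypothetical data with essentially no linear association.
x_clean = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y_clean = [5, 3, 6, 2, 7, 4, 6, 3, 5, 4]

# The same data plus ONE extreme observation.
x_out = x_clean + [30]
y_out = y_clean + [40]

print("Pearson r, no outlier:    ", round(pearson_r(x_clean, y_clean), 3))
print("Pearson r, one outlier:   ", round(pearson_r(x_out, y_out), 3))
print("Spearman rho, no outlier: ", round(spearman_rho(x_clean, y_clean), 3))
print("Spearman rho, one outlier:", round(spearman_rho(x_out, y_out), 3))
```

With these made-up numbers, the one added observation moves Pearson's r from roughly zero to above 0.9, while Spearman's rho stays below 0.3 because the outlier only counts as the highest rank.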

Going back to the introduction, remember to use Spearman's rho on interval and ordinal variables as well as with variables that are naturally skewed. Statistics, in and of itself as a science, is very flawed. Not everything you come across in existence will fit the normal curve. Luckily, we have non-parametric statistics that are robust to these common violations of the assumptions behind inferential statistical tests.
Published on

December 23, 2014

Non-parametric Friedman's ANOVA

Friedman's ANOVA Logarithmic Transformations Mauchly's Test Non-parametric Statistics Normality Of Difference Scores Outliers Sphericity Assumption Wilcoxon

Analyze three or more measures of an ordinal outcome

Wilcoxon is used as a post hoc test for significant main effects

The Greenhouse-Geisser correction is often employed when analyzing data with repeated-measures ANOVA. The statistical assumption of sphericity, as assessed by Mauchly's test in SPSS, is violated more often than not. The Greenhouse-Geisser correction adjusts the degrees of freedom of the F-test to compensate for the violation of this statistical assumption with repeated-measures ANOVA. The means and standard deviations from a repeated-measures ANOVA can then be interpreted.

Friedman's ANOVA, in my experience, does not make many appearances in the empirical literature. Few people take three or more within-subjects or repeated measures of an ordinal outcome in order to answer their primary research question, I guess. It is a non-parametric statistical test since the data are measured at an ordinal level. When a significant main effect is found with a Friedman's ANOVA, pairwise post hoc comparisons between the repeated observations must be made using Wilcoxon tests.
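
As an illustration of what the test does under the hood, here is a minimal sketch of the Friedman statistic in plain Python. It makes several assumptions that are mine rather than the post's: the ratings are invented, there are no tied values within a participant's row (so no tie correction is applied), and there are exactly three repeated measures, in which case the chi-square approximation has 2 degrees of freedom and its p-value reduces to exp(-Q / 2). The function name friedman_statistic is hypothetical.

```python
import math

def friedman_statistic(rows):
    """Friedman's Q for rows of repeated measures (one row per participant).

    Assumes no tied values within a row, so no tie correction is applied.
    Returns Q and the per-condition rank sums.
    """
    n = len(rows)        # number of participants
    k = len(rows[0])     # number of repeated measures
    rank_sums = [0.0] * k
    for row in rows:
        # Rank this participant's k observations from 1 (lowest) to k.
        order = sorted(range(k), key=lambda j: row[j])
        for rank, j in enumerate(order, start=1):
            rank_sums[j] += rank
    q = 12.0 / (n * k * (k + 1)) * sum(r ** 2 for r in rank_sums) - 3.0 * n * (k + 1)
    return q, rank_sums

# Hypothetical ordinal ratings (1-7) from 8 participants at three time points.
ratings = [
    [2, 4, 6],
    [1, 3, 5],
    [3, 2, 6],
    [2, 5, 7],
    [1, 4, 5],
    [3, 4, 7],
    [2, 3, 6],
    [4, 5, 6],
]

q, rank_sums = friedman_statistic(ratings)

# With k = 3 conditions, df = k - 1 = 2, and the chi-square upper-tail
# probability with 2 degrees of freedom is exactly exp(-q / 2).
p_value = math.exp(-q / 2)

print("Rank sums per time point:", rank_sums)
print("Friedman Q:", q)
print("Approximate p-value (df = 2):", p_value)
```

In practice a statistics package handles ties and any number of conditions; SciPy, for example, provides scipy.stats.friedmanchisquare for the omnibus test and scipy.stats.wilcoxon for the pairwise post hoc comparisons.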

Friedman's ANOVA, while being a non-parametric statistic, may have the most statistical power when employed with cross-sectional data yielded from a survey instrument that has limited reliability and validity evidence. Likert scales and composite scores from such tests may be naturally skewed due to systematic and unsystematic error. Friedman's ANOVA is robust to these types of distributions that come from cross-sectional studies in the social sciences.

If the assumption of normality among the difference scores between observations of a continuous outcome cannot be met, then Friedman's ANOVA can be used to yield inferential evidence. But it is always a better idea to first check for outliers in a distribution (individual observations that are more than 3.29 standard deviations away from the mean) and make a decision as to whether to 1) delete the observation in a listwise fashion, or 2) run a logarithmic transformation on the distribution.

You will have to transform the other observations of the outcome as well if you choose #2 above. The means and standard deviations of transformed variables cannot be interpreted, but the p-values can be interpreted. Report the median and interquartile range for transformed variables.

Deleting observations can introduce bias into the statistical analysis. This should only be done if the number of outliers constitutes less than 10% of the overall distribution. One can also run between-subjects comparisons between participants with all observations of the outcome versus participants without all observations. If there are no differences on predictor, confounding, and outcome variables between these two groups, then lessened observation bias can be assumed.
Published on

December 3, 2014

The assumption of independence of observations

Generalized Estimating Equations (GEE) Homogeneity Of Variance Independence Of Observations Assumption Kurtosis Levene's Test Non-parametric Statistics Normality Outliers P-value Skewness Statistical Assumptions

Independence of observations

Each participant in a sample can only be counted as one observation

As a biostatistician, I spend a lot of time testing for normality and homogeneity of variance.

Skewness and kurtosis statistics are used to assess the normality of a continuous variable's distribution. A skewness or kurtosis statistic above an absolute value of 2.0 is considered to be non-normal. Distributions are often non-normal due to outliers in the distribution. Any observation that falls more than 3.29 standard deviations away from the mean is considered an outlier.
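
The screening rules in this paragraph can be sketched in a few lines of plain Python. This is a rough illustration under stated assumptions: the data are invented, the function name screen_distribution is hypothetical, and skewness and kurtosis are the simple moment-based versions (kurtosis reported as excess kurtosis, so a normal curve scores 0). SPSS reports bias-adjusted versions, so its values will differ somewhat, especially in small samples.

```python
import math

def screen_distribution(values, cutoff=3.29):
    """Return (skewness, excess kurtosis, outliers) for a list of numbers.

    Outliers are observations more than `cutoff` sample standard
    deviations away from the mean.
    """
    n = len(values)
    mean = sum(values) / n
    # Central moments (divide by n).
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    sample_sd = math.sqrt(m2 * n / (n - 1))  # divide by n - 1
    outliers = [v for v in values if abs(v - mean) / sample_sd > cutoff]
    return skewness, excess_kurtosis, outliers

# Hypothetical scores: twenty ordinary values and one extreme observation.
scores = [1, 2, 2, 3, 1, 2, 3, 2, 1, 2, 3, 2, 2, 1, 3, 2, 2, 3, 1, 2, 45]

skew, kurt, outliers = screen_distribution(scores)
print("Skewness:", round(skew, 2))
print("Excess kurtosis:", round(kurt, 2))
print("Observations beyond 3.29 SD:", outliers)

# Re-screen without the flagged observation to see how much it was driving things.
trimmed = [v for v in scores if v not in outliers]
skew_t, kurt_t, outliers_t = screen_distribution(trimmed)
print("After removal, |skewness| < 2 and |kurtosis| < 2:",
      abs(skew_t) < 2.0 and abs(kurt_t) < 2.0)
```

Note that in very small samples no observation can reach 3.29 sample standard deviations from the mean, so the cutoff only becomes meaningful once there are a dozen or more observations.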

Levene's Test of Equality of Variances is used to test the assumption of homogeneity of variance. Any Levene's Test with a p-value below .05 means that the assumption has been violated. In the event that the assumption is violated, non-parametric tests can be employed.

There is one more important statistical assumption that exists coincident with the aforementioned two, the assumption of independence of observations. Simply stated, this assumption stipulates that study participants are independent of each other in the analysis. They are only counted once.

In between-subjects designs, each study participant is a mutually exclusive observation that is completely independent from all other participants in all other groups.

For within-subjects designs, each participant is independent of other participants. There are just multiple observations of the outcome, per participant.

With this being said, it is common for researchers to take multiple measurements of an outcome and compare these multiple measurements in an independent fashion (oftentimes with differing numbers of observations across participants) or within-subjects (ALWAYS with differing numbers of observations of the outcome). By default, these are not independent measures and violate the assumption of independence of observations. What is one to do?

The answer is generalized estimating equations (GEE). This family of statistical tests is robust to multiple observations (or correlated observations) of an outcome and can be used for between-subjects, within-subjects, factorial, and multivariate analyses.
Published on

September 23, 2014

Using naturally skewed continuous variables as outcome variables

Kurtosis Listwise Deletion Logarithmic Transformations Non-parametric Statistics Outcome Outliers Skewness

Transformed outcomes

Some continuous variables will be naturally skewed

In medicine, length of stay (LOS) in the hospital is an important metric that signifies efficiency and quality in healthcare. When thinking about the distribution of a variable such as LOS, you have to put it into a relative context. The vast majority of people will have an LOS of between 0 and 3 days given the type of treatment or injury that brought them to the hospital. VERY FEW individuals will stay at the hospital one month, six months, or a year. Therefore, the distribution looks nothing like the normal curve and is extremely positively skewed.

As a researcher, you may want to predict a continuous variable that has a natural and logical skewness to its distribution in the population. Yet, the assumption of normality is a central tenet of running parametric statistical analyses. What is one to do in this situation?

The answer is to first run skewness and kurtosis statistics to assess the normality of your continuous outcome. If either statistic is above an absolute value of 2.0, then the distribution is non-normal. Check for outliers in the distribution that are more than 3.29 standard deviations away from the mean. Make sure that the outlying observations were entered correctly.

You now have a choice:

1. You can delete the outlying observations in a listwise fashion. This should be done only if the number of outlying observations is less than 10% of the overall distribution. This is the least preferable choice.

2. You can conduct a logarithmic transformation on the outcome variable. Doing this will often normalize a positively skewed distribution so that you can run the analysis using parametric statistics. The unstandardized beta coefficients, standard errors, and standardized beta coefficients are not interpretable in the original units of the outcome, but the significance of the associations between the predictor variables and the transformed outcome can yield some inferential evidence.

3. You can recode the continuous outcome variable into a lower level scale of measurement such as ordinal or categorical and run non-parametric statistics to seek out any associations. Of course, you are losing the precision and accuracy of continuous-level measurement and introducing measurement error into the outcome variable, but you will still be able to run inferential statistics.

4. You can use non-parametric statistics without changing the skewed variable at all. That is one of the primary benefits of non-parametric statistics: They are robust to violations of normality and homogeneity of variance. Instead of interpreting means and standard deviations, you will interpret medians and interquartile ranges with non-parametric statistics.
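
Options 2 and 4 above can be sketched quickly in plain Python. The length-of-stay values below are invented for illustration, the helper name skewness is hypothetical, and the skewness is the simple moment-based version rather than the adjusted one SPSS reports. The natural log is used, which assumes every value is greater than zero; stays recorded as 0 days would need a constant added before transforming.

```python
import math
import statistics

def skewness(values):
    """Moment-based skewness: third central moment over SD cubed."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical length of stay in days: mostly short stays, a few very long ones.
los_days = [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5, 6, 7, 9, 12, 16, 25, 40, 90]

# Option 2: logarithmic transformation of the outcome.
log_los = [math.log(d) for d in los_days]

raw_skew = skewness(los_days)
log_skew = skewness(log_los)
print("Skewness of raw LOS:", round(raw_skew, 2))
print("Skewness of log LOS:", round(log_skew, 2))

# Option 4: leave the variable alone and describe it with the median and
# interquartile range instead of the mean and standard deviation.
median_los = statistics.median(los_days)
q1, _, q3 = statistics.quantiles(los_days, n=4)
print("Median LOS (days):", median_los)
print("Interquartile range (days):", q1, "to", q3)
```

For these made-up values the raw skewness is above the 2.0 threshold and the log-transformed skewness is well below it. Quartile values depend on the interpolation method, so the interquartile range printed here may differ slightly from what SPSS or another package reports.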
