Continuation of the summary of the replies to the question posed by John De Witt regarding statistical power and sample sizes. Part 2 includes responses in the categories 'CORRECT IN THE APPROACH' and 'CONCERNS ABOUT MEETING ANOVA ASSUMPTIONS'.

SUMMARY OF RESPONSES

CORRECT IN THE APPROACH

John,

I agree with you that these reviewers are wrong. A sample size of 5 can be enough: if you do an experiment on 5 subjects and the effect of an intervention or treatment is always in the same direction, a simple non-parametric test will tell you that there is a probability of 1 in 2 to the power 5 that this occurs by chance. This is less than 0.05, which is generally considered sufficient. I agree with you that in such cases the sample size turns out to be sufficient (but only after getting the results).

The example also shows nicely that a sample size of 4 is always too small.

Good comment on the null hypothesis. If you were to hypothesize that a treatment has no effect, and you get one effect in 2 subjects and the opposite effect in 3 (i.e. exactly what you would expect if there was no effect), the same non-parametric statistic will tell you that this can happen by chance with a probability much higher than 0.05, so there the sample size is not enough for the conclusion that there is no effect.
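[Editor's note: the sign-test arithmetic in the two scenarios above can be checked in a few lines; this is a sketch using the counts from the examples, nothing more.]

```python
from math import comb

def sign_test_p(k, n):
    """One-sided sign test: P(at least k of n differences fall in the
    observed direction) under the null of no effect, where each
    direction is equally likely (probability 1/2 per subject)."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n

# All 5 subjects respond in the same direction:
print(sign_test_p(5, 5))  # 1/2**5 = 0.03125, below 0.05

# 3 subjects one way, 2 the other (what "no effect" would look like):
print(sign_test_p(3, 5))  # 0.5, far above 0.05
```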

Ton van den Bogert

------------------

Hi, John. I would think by definition that if you find a significant effect, the sample size is not too small. From all statistics books I have read, performing a power study after the fact (as reviewers have suggested to me in the past) is essentially meaningless. If you fail to find a significant effect, then one reason can be that the study is underpowered. Then I guess it could make sense to do a post-hoc power study to see how big a sample size you would need. But if you find a significant effect, I think your sample size is big enough.

Just my two cents,

Dana Carpenter

------------------

Statistical power increases with sample size; therefore, if an effect is weak you need a larger sample to discover a difference. If the effect is strong, you don't need a large sample. Since you got a significant difference with a small sample, the effect was strong. Therefore, I totally agree with you that you did NOT need a larger sample. It would have been a waste of time and money.

Gordon Robertson

------------------

Dear John,

I agree with you. If you found a statistically significant effect then you had enough power. The only problem may be for results that were not statistically significant. Perhaps if you included the effect size (required by some journals) and power you may be able to address the reviewers' concerns.
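[Editor's note: for the non-significant results mentioned above, power can be estimated by simulation when no closed-form calculation is at hand. The sketch below assumes normally distributed paired differences and uses a one-sided sign test as the analysis; the effect size and standard deviation are hypothetical.]

```python
import random
from math import comb

def sign_test_p(k, n):
    """One-sided sign test p-value: P(at least k of n positive differences)."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n

def estimated_power(n, effect, sd, alpha=0.05, trials=5000, seed=1):
    """Monte Carlo power estimate: the fraction of simulated studies that
    reach p <= alpha, assuming paired differences ~ Normal(effect, sd)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        diffs = [rng.gauss(effect, sd) for _ in range(n)]
        if sign_test_p(sum(d > 0 for d in diffs), n) <= alpha:
            hits += 1
    return hits / trials

print(estimated_power(5, effect=2.0, sd=1.0))  # large effect: high power
print(estimated_power(5, effect=0.0, sd=1.0))  # no effect: stays near alpha
```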

Regards,

Danny Russell

------------------

Hi John

I think that you are right on.

If you collect data from substantially more people or other animals than is needed, I would argue that it is unethical.

The repeated measures design is quite powerful and not all reviewers recognize that.

Perhaps you should include your a priori power calculation in the methods section of the paper.

rk

Rodger Kram, Ph.D.

------------------

Hi John,

I think you are exactly right. If you are rejecting the null hypothesis, type II error is not a concern. I assume you have appealed to the editor and referred the reviewer to a statistics textbook? Obviously this can be a touchy issue and should be phrased appropriately, but the bottom line is that you are correct.

__________________________________________

Samuel R. Ward, PT, PhD

------------------

Hi John,

I emailed an article from JBJS that you might find interesting; although the author discusses the improper use of a post-hoc power analysis in cases where the results are not significant, I think that much of it is relevant to your question. As long as you have performed a power calculation using a clinically meaningful difference, then your study should have adequate power. If the results are significant, then I don't quite understand why there is even a question about power; perhaps you could reference this article in response to your reviewers. I am by no means an expert in statistics, but I hope this helps.

Regards,

Chris Deuel, Ph.D

------------------

CONCERNS ABOUT MEETING ANOVA ASSUMPTIONS

Hi, John

Good to hear from you on Biomch-L.

I had the same situation previously with repeated-measures studies. I hope these articles give you some help; I like the Overall & Doyle 1994 article.

Normally, from your pilot study you obtain a proposed statistical power. Using this result, you look up the required value in a table depending on your number of treatments. However, most articles only address the one-way repeated-measures situation; I have not seen sample-size determination for the two-way repeated-measures case. Thus you have to guess the appropriate number of treatments for your study.

It's not an easy job.

There is an alternative statistical approach if your data collection is already done.

Since ANOVA and repeated-measures ANOVA rely on a model (an assumption of normality for ANOVA, and sphericity for repeated-measures ANOVA), there is error in the reported results when the sample size is small. In cases of serious violation of the assumptions, the results can be meaningless, because it is very hard to meet these assumptions with a small sample.

Thus some people suggest non-parametric statistical methods. Some non-parametric methods are nearly assumption-free, and the idea is the same as a paired t-test: the Wilcoxon signed-rank test is the non-parametric equivalent of the paired t-test.
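[Editor's note: for a paired sample this small, the Wilcoxon signed-rank test mentioned above can be computed exactly by enumerating all sign patterns. A stdlib-only sketch with invented data; it assumes no zero and no tied absolute differences.]

```python
from itertools import product

def wilcoxon_exact_p(before, after):
    """Exact two-sided Wilcoxon signed-rank p-value for paired data.
    Assumes no zero differences and no tied absolute differences."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    # Rank differences by absolute value (rank 1 = smallest magnitude)
    rank = {d: r for r, d in enumerate(sorted(diffs, key=abs), start=1)}
    w_plus = sum(rank[d] for d in diffs if d > 0)
    # Null distribution of W+ over all 2^n equally likely sign patterns
    w_null = [sum(r for r, s in zip(range(1, n + 1), signs) if s)
              for signs in product([0, 1], repeat=n)]
    tail = min(sum(w <= w_plus for w in w_null),
               sum(w >= w_plus for w in w_null))
    return min(1.0, 2 * tail / 2 ** n)

# 5 hypothetical subjects measured under two conditions:
before = [10.1, 9.8, 11.2, 10.5, 9.9]
after  = [11.0, 10.4, 12.3, 11.3, 10.6]
print(wilcoxon_exact_p(before, after))  # 0.0625, the smallest two-sided p possible at n = 5
```

Note the side effect visible in the output: with n = 5, a two-sided Wilcoxon test can never reach p < 0.05, which is itself a useful talking point about very small samples.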

I hope it would work for you.

Have a good weekend!!

YK - Young-Kwan Kim

------------------

Hi John,

My name is Tamar and I'm a Biostatistics PhD student. I got your email through Biomch-L.

There are probably people who can be more helpful, but in case you didn't hear from them---

There are a few problems with small samples. For one, you would need a pretty big effect to reach statistical significance.

Another important issue is that of asymptotic tests. Many statistical tests rely on asymptotic distributions, and thus with a small sample size they may not be valid. For example, the variances of the parameter estimates and the confidence intervals output by standard software packages are not reliable.

What I would recommend you try is "exact methods" - usually permutation based. (You can look at this link, for example, for an explanation: http://v8doc.sas.com/sashtml/stat/chap28/sect28.htm )

The basic idea of a permutation test is this: you take the measurements you have and set aside your summary statistic. You permute the results between the groups or environments and compute a new summary statistic. You repeat this many times. At the end, you compare your "real" statistic with the distribution of the permuted ones and see how likely it is to obtain this statistic if the measurements had been assigned randomly.
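[Editor's note: a minimal version of the permutation procedure described above, for a difference in group means; the data are made up.]

```python
import random

def permutation_p(group_a, group_b, n_perm=10000, seed=0):
    """Approximate two-sided permutation p-value for a difference in means.
    Shuffles group labels and counts how often the permuted difference is
    at least as extreme as the observed one."""
    rng = random.Random(seed)
    mean = lambda xs: sum(xs) / len(xs)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(mean(pooled[:n_a]) - mean(pooled[n_a:])) >= observed:
            extreme += 1
    return extreme / n_perm

print(permutation_p([1.2, 1.4, 1.1, 1.3], [2.6, 2.8, 2.5, 2.9]))  # small (exact value is 2/70)
print(permutation_p([1.2, 1.4, 1.1, 1.3], [1.3, 1.1, 1.4, 1.2]))  # 1.0: no separation at all
```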

Another question: did you compare the results of an experiment on a subject between two environments and then take the difference? Or did you "ignore" the fact that you have the same subject doing the experiments in both environments? (Depending on your design and question, I'm not sure it helps you, in terms of type I and type II error rates, that you have two or three experiments done by the same person in different environments.)

I hope I helped you somehow

- Tamar Sofer.

------------------

John,

It sounds to me that you are technically correct in your statistics; however, while a significant result by definition means your power is sufficient (for that variable), there is a danger in having too few samples. Consider if you only ran a single subject, found significance and claimed success. I am not up to date on guidelines that define the minimum number of samples to ensure your sample is representative. Maybe you could do an analysis where you add or remove samples and compare the increasing samples until they demonstrate convergence? Perhaps you could make the argument that you are in practice testing the population and not just a representative sample? I'm not sure if that would be easier to pass by a reviewer or not. Otherwise you might try the nonparametric methods. I'm not convinced that they would be more correct than your repeated measures ANOVA, but you may find reviewers are more accepting.

Good luck

Bryan Kirking

------------------

Dear Dr. Dewitt,

In my opinion, your argument is sound. With your sample size, you were able to determine that p < 0.05, which is a well-accepted criterion for statistical significance. Certainly, with a larger sample size you might obtain a smaller p-value if the trend continues, but unless you think that p < 0.05 is insufficient, adding more subjects wouldn't help that concern. I also agree that a power analysis is important when you do not detect a statistically significant difference, in order to determine how confident you are that there is no difference. However, with a small sample size there is the concern as to whether the data are really normally distributed. To address this concern, I recommend performing a non-parametric test (e.g., the Wilcoxon signed-rank test in place of, or in addition to, a paired t-test).

In our research we also often have repeated measures designs. I find that people are very confused as to how we can detect small differences with large variations between subjects. Often I find it useful to present the difference in a variable (calculated for each subject) due to the treatments; in this case, the standard deviation bars shrink considerably and you can easily visualize the difference between treatments.

Thanks for your time.

Sincerely,

Lou DeFrate, Ph.D.

------------------

John - I take your example of a repeated measures design where each subject tests two or more treatments. If the small number of subjects is truly a random sample from a homogeneous population, where the differences in the response measure between any two treatments are normally distributed and all these differences have the same population variance, then yes, the t-test or multiple comparisons after ANOVA are all valid. The problem lies with the above assumptions - both with how the subjects are sampled and with respect to the distribution of the response variable. With very small sample sizes, moderate violations of these assumptions can lead to completely erroneous results. With larger sample sizes, ANOVA and t-tests are much more robust to such violations.

As a reviewer or reader of the journal, I would be extremely skeptical of trying to make inferences about a population of astronauts from what you observed on 4 or 5 self-selected subjects (who are probably not even astronauts!). Also, all you are doing when you do ANOVA or t-tests is compare means - that says nothing about variance or (for example) what percent of potential astronauts might be helped by this countermeasure. You could not hope to estimate this percentage with any reliability with such small sample sizes.

You could point out these limitations in your manuscript - but whether the journal would accept this sort of disclaimer, I wouldn't know.

Al - Al Feiveson

------------------

I'll echo Al's comments, and provide my "2-cents" as well, as I've experienced similar struggles with small-n studies, and as a Biostatistician have had to really contemplate the appropriateness of hypothesis testing in general, and then specifically using traditional t-test or ANOVA techniques. And I've taught a fair amount on this subject too, so forgive my verbose email!

One requirement that we all have with our small-n studies is to begin our inquiry into the data with a very deliberate, nearly obnoxious test of model assumptions. This is especially relevant because, as Al indicated, what we learned early in grad school about statistics, when we were all "newbies" to the math and art of it, was that "ANOVA is robust to violations of assumptions." But what we sometimes forget from those lessons is that this is only true for violations of SOME of the assumptions, and then only when there is sufficient n to be able to rely on the law of large numbers. We don't live in that world here at NASA most of the time, so we MUST pay special attention to our assumption testing.

With repeated measures designs in the ANOVA context, that means NOT ONLY that our data are normally distributed, but also that they meet the assumptions of sphericity and homogeneity of variance (if there are any between-subjects factors). These latter two are critical, and there are statistical tests to determine whether you've met the assumptions. With big-n studies, even if your data aren't normally distributed, studies have shown that as long as you meet the latter two assumptions, ANOVA usually performs adequately. But again, we don't have big n, so we can't rely on that being true for us. And if you haven't met the assumptions, then you probably should be using other techniques, or at least considering alternative adjustments for violations, and/or data transformations.

In the instances where you DO MEET all of the statistical assumptions (possibly after some data transformations), it would be beneficial in your publication efforts to BEGIN your statistics/results section by clarifying what tests you performed to check those assumptions, and showing that your data meet them. That way you convince the reviewers/readers that your data are appropriately analyzed with your techniques. Then proceed...

As for post-hoc tests with repeated measures, that's another area where researchers sometimes get confused, and a statistically savvy reviewer will pick up on it. The term "post-hoc tests" is typically reserved for comparisons between GROUPS, not between times/within groups, so I need to clarify that we're not talking about something like Tukey's post-hoc adjustments, but instead something like Bonferroni, or other flavors appropriate for within-group comparisons. There again, you're starting to add fuel to the fire for the skeptic, so you should be considerate of the skeptics and choose more CONSERVATIVE options for these tests. Bonferroni is a good choice because it's commonly used and known to be conservative. Another approach might be to utilize a-priori contrasts, which hold alpha to .05 (or whatever) for k-1 comparisons, where k = the number of levels of your repeated measures factor. Sometimes researchers want to "make all pairwise comparisons" because they haven't clearly thought out what they REALLY want to do, so they avoid a-priori contrasts because they're limited to k-1 comparisons. This is an unfortunate situation, because oftentimes we aren't truly interested in ALL pairwise comparisons, but instead in a scientifically meaningful set of comparisons. For example, if you're interested in comparing pre-flight observations to multiple observations taken during and post flight, then simple-effects contrasts comparing all values to pre would solve the problem without getting excessive with your Bonferroni adjustments for all possible pairs (do you really care if R+12 is different from R+14?). There are other commonly used contrasts too... maybe you want to model the PATTERN of change to determine if it fits different polynomial functions (common in biological science).
A few other choices exist, but my point is that maybe you can avoid some criticism by narrowing in on the comparisons that ARE important, and thus increase your potential for determining significance (because you haven't adjusted critical alpha so much), AND address your scientific inquiry more appropriately. Icing on the cake--less critical reviews.
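[Editor's note: the trade-off described above is simple arithmetic — with k levels, a Bonferroni correction over all pairwise comparisons is much harsher than over k-1 planned contrasts. A sketch with a hypothetical k = 5.]

```python
from math import comb

k, alpha = 5, 0.05        # 5 repeated-measures levels (hypothetical), overall alpha

all_pairs = comb(k, 2)    # 10 pairwise comparisons
planned   = k - 1         # 4 a-priori contrasts (e.g., each time point vs. pre)

print(alpha / all_pairs)  # per-test alpha over all pairs: 0.005
print(alpha / planned)    # per-test alpha over planned contrasts: 0.0125
```

The per-test criterion is 2.5 times easier to meet with planned contrasts, which is exactly the power argument being made.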

Having said all of this... you might also consider newer statistical techniques commonly referred to as mixed modeling, multi-level modeling, hierarchical modeling, or growth modeling. These techniques are fairly recent extensions of ANOVA/regression, and they are better able to handle some of the data problems that we experience with repeated-measures ANOVA. They won't solve all of your post-hoc comparison problems, but they are generally more appropriate for all longitudinal research (big or small n). If your data meet the assumptions for these tests, then you're better off starting here instead of with ANOVA. (FYI: I've seen NIH make references in their presentations to new investigators stating, in effect, that if you propose repeated-measures ANOVA for longitudinal research instead of MLM, it's a flat-out reason for rejection.) Remember... you need to meet the assumptions of WHATEVER test you employ, so you might as well shoot for the best test first!

There are good applied text books out there on these techniques if you are interested. Software is a little tricky sometimes, but SAS does an excellent job if you're already a SAS user. STATA is equally good and easier to use if you're starting from scratch. R software is commonly used too but I've no experience with it. Avoid SPSS for these techniques...

Rob

________________________________________________

Robert J. Ploutz-Snyder, PhD

------------------

Hello

I am by no means an expert in statistics; however, like you, I often work with small samples. I generally use non-parametric tests, based on the idea that t-tests and ANOVAs usually require n > 30 and are certainly based on the assumption of a normal distribution. Apparently tests of normality also require n > 30; therefore, in very small samples it is not possible to test for normality.

I don't know if that will help you any...

Johanna

Johanna Robertson

------------------

Dr. Dewitt, I do not consider myself a stats expert; however, I do teach basic stats to our orthopaedic residents and at the College of Charleston, where I am a professor in exercise science.

While I understand your concern regarding sample size I would encourage you to use a non-parametric test on your samples.

With a repeated measure, use Friedman's test, and with 3 or more separate groups use Kruskal-Wallis.

While these are less powerful tests (ANOVA and t-tests are actually more robust), this may reduce the referee concerns and comments you are seeing on your reviews.

For two-sample comparisons, use Mann-Whitney or Wilcoxon, depending on whether the samples are independent or paired.

My contribution may be what you already know.

If not, hope this is helpful.

Regards.

Bill Barfield, Ph.D., FACSM

------------------

Pax!

In fact I was just calculating sample size effects for our own paper.

The required sample size depends entirely on the purpose of the study, the desired strength of the results, the estimated size of the effect, and other background assumptions. For instance, if I want to establish a correlation between two variables with p < 0.05 and r >= 27 %, then n of about 50 is required. But if one e.g. wants to test whether an exercise program affects weight, then 5 persons may suffice. In this case, if the null hypothesis is no change, then an observed weight decrease for all 5 has p = 1/2^5 = 1/32 (simple sign test). (Of course, without a control we cannot conclude it was the program alone that resulted in the change...) The point is to be able to show with probabilistic arguments that the probability that the result is due to *pure chance* is less than 1:20.

Non-parametric tests (such as Wilcoxon) can be useful here since they do not require assumptions about the distributions. The sign test, though, does not reflect the size of the change in the example, so one might instead use the variable z = x_before - x_after (or a scaled variant [x_before - x_after]/x_before if appropriate). If the hypothesis is that no systematic change has occurred, one may assume that z is normally distributed around 0, and use the sample variances V_before + V_after to estimate the variance of z. This finally lets us calculate p = Prob(Z < z) (the probability that the decrease is due to chance). One possibility today is to use computer simulations to calculate probabilities; they can make strong arguments by showing *concretely* the odds against the result being due to chance.

OK, I have to hurry to the office.
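[Editor's note: the z = x_before - x_after recipe above, in code. The numbers are hypothetical; estimating Var(z) by V_before + V_after as suggested treats the two measurements as independent rather than paired.]

```python
from math import erf, sqrt

def p_decrease_by_chance(before, after):
    """One-sided normal-approximation p-value for a mean decrease,
    assuming z = x_before - x_after is Normal(0, V_before + V_after)
    under the null hypothesis of no systematic change."""
    n = len(before)
    z = [b - a for b, a in zip(before, after)]
    z_bar = sum(z) / n
    var = lambda xs: sum((x - sum(xs) / n) ** 2 for x in xs) / (n - 1)
    se = sqrt((var(before) + var(after)) / n)     # std. error of mean(z)
    phi = lambda x: 0.5 * (1 + erf(x / sqrt(2)))  # standard normal CDF
    return 1 - phi(z_bar / se)  # P(mean decrease at least this large by chance)

before = [80, 82, 78, 81, 79]
after  = [77, 79, 75, 78, 76]  # every subject lost 3 units
print(p_decrease_by_chance(before, after))  # well below 0.05
```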

Regards Frank Borg

------------------

John Dewitt,

It is my understanding that ANOVA and the t-test have an underlying normality assumption; with such a small sample size you cannot show that your samples are in fact normally distributed. In other words, the a priori p-value is meaningless if the assumptions of the testing tool are violated. I recommend non-parametric tests like the Mann-Whitney or Kruskal-Wallis tests; they are not as powerful, but that is the trade-off.

By the way, the rule of thumb given to me to test normality was a minimum sample size of 12. This was quickly followed by, "I have been sworn to secrecy as to from where that number came."

Cheers,

Rob Richards

------------------

Have you looked into non-parametric vs. parametric statistical results? That addresses 'small' sample sizes... I am not an expert, but I had a similar issue in the past... good luck

Nicole Jacobs

------------------

Hello,

Just my humble opinion, but given that a Type I error can still be committed, even with small samples, the reviewers' concerns are valid.

A solution would be to assure your reviewers that you have tested your data for assumptions of normality (e.g., skewness, variance), and, given the repeated measures design, that sphericity is not an issue. Even with small sample sizes, Type I error is a potential risk because of the influence of skewness associated with a small cluster of observations at one end. Also, with small samples, providing all your data in a table might alleviate concerns - it allows the reader to see the variance, etc.
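[Editor's note: checking skewness, as suggested here, needs no special software; a quick sample-skewness function, with invented data.]

```python
def sample_skewness(xs):
    """Fisher-Pearson coefficient of skewness, g1 = m3 / m2**1.5.
    Near 0 for symmetric data; large positive values flag a long right tail."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in xs) / n  # third central moment
    return m3 / m2 ** 1.5

print(sample_skewness([4.0, 5.0, 6.0, 5.0, 4.0, 6.0]))  # symmetric: 0.0
print(sample_skewness([4.0, 4.1, 4.2, 4.0, 9.5]))       # one high outlier: strongly positive
```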

Thanks,

Daniel

Daniel Cipriani, PT, PhD

------------------

Dear John,

There are several problems with small samples:

1. Statistical power: you analysed this problem correctly, in my eyes.

2. The verification of normal distribution, which is one of the prerequisites to apply parametric tests, such as Student's t and ANOVA (to a lesser extent for the latter). The problem is that the tests available to check normality of distribution are not appropriate for small samples. Thus, a major criticism is the choice of statistical tests. Non-parametric tests, such as a Wilcoxon matched-pairs test, would be more appropriate.

3. Power to generalise observations from the sample to the target population. Of course, if the sample is small, the chance is large that the sample is not representative enough of the population.

I hope that this is of use

With kindest regards

Veronique

________________________________________________________

Prof. V. Feipel, PhD

------------------

John,

Just a quick note . . . I'll try to add more later:

I use paired testing methods in my studies as well due to sample size/cost constraints. In your case, I'm obviously not familiar with the specifics; however, two possibilities come to mind:

1. If your subjects are not randomly drawn from the target population, the results are biased regardless of sample size and/or the statistical outcome.

For example: measuring weight in a population of athletes would not represent the general US population. Usability assessed on four 20-year-olds cannot be extrapolated to older adults.

It seems obvious; however, I have seen several studies where internal subjects were used to assess product usability. The results, of course, did not match the general population because of the subjects' familiarity with the control design.

2. Is normality assumed? The complexity of the measure's distribution may not be captured with such a small sample size. If the distribution is sufficiently skewed, nonparametric techniques would be required.

Scott A. Ziolek

------------------

Dear Dr. De Witt,

My understanding is that most traditional parametric statistical analyses (e.g., the t-test) are based on the assumption that the data are normally distributed. When this assumption is true, the p-value, based on the area under the normal-distribution bell curve, is meaningful. However, with only 4-10 data points, it is very difficult to assess whether the data fit the assumption of a normal distribution. In this case, using a traditional statistical analysis may be problematic, even though your results are "significant". Based on my previous conversations with statisticians, they usually recommend having at least 30 data points in order to assess the distribution of the data, which can be very challenging in biomechanical research. Recently, there have been some modern statistical analyses that don't require the normal-distribution assumption. You can try them to see if the results match the results of the paired t-test. If they do, then this may provide strong support to convince the reviewers.

Ching

****************************************************************

Liang-Ching Tsai, MS, PT

------------------

SUMMARY OF RESPONSES

CORRECT IN THE APPROACH

John,

I agree with you that these reviewers are wrong. Sample size 5 can be enough, if you do an experiment in 5 subjects and the effect of an intervention or treatment is always in the same direction, a simple non-parametric test will tell you that there is a probability of 1 in 2 to the power 5 that this occurs by chance. This is less than 0.05 which is considered enough. I agree with you that in such cases, the sample size turns out to be sufficient (but only after getting the results).

The example also shows nicely that a sample size of 4 is always too small.

Good comment on the null hypothesis. If you were to hypothesize that a treatment has no effect, and you get one effect in 2 subjects and the opposite effect in 3 (i.e. exactly what you would expect if there was no effect), the same non-parametric statistic will tell you that this can happen by chance with a probability much higher than 0.05, so there the sample size is not enough for the conclusion that there is no effect.

Ton van den Bogert

------------------

Hi, John. I would think by definition that if you find a significant effect, the sample size is not too small. From all statistics books I have read, performing a power study after the fact (as reviewers have suggested to me in the past) is essentially meaningless. If you fail to find a significant effect, then one reason can be that the study is underpowered. Then I guess it could make sense to do a post-hoc power study to see how big a sample size you would need. But if you find a significant effect, I think your sample size is big enough.

Just my two cents,

Dana Carpenter

------------------

Statistical power increases with sample size therefore if an effect is weak you need a larger sample to discover a difference. If the effect is strong, you don't need a large sample. Since you got a significant difference with a small sample than the effect was strong. Therefore, I totally agree with you that you did NOT need a larger sample. It would have been a waste of time and money.

Gordon Robertson

------------------

Dear John,

I agree with you. If you found a statistically significant effect then you had enough power. The only problem may be for results that were not statistically significant. Perhaps if you included the effect size (required by some journals) and power you may be able to address the reviewers' concerns.

Regards,

Danny Russell

------------------

Hi John

I think that you are right on.

If you collect data from substantially more people or other animals than is needed, I would argue that it is unethical.

The repeated measures design is quite powerful and not all reviewers recognize that.

Perhaps you should include your apriori power calculation in the methods section of the paper.

rk

Rodger Kram, Ph.D.

------------------

Hi John,

I think you are exactly right. If you are rejecting the null hypothesis, type II error is not a concern. I assume you have appealed to the editor

and referred the reviewer to a statistics text book? Obviously this can be

a touchy issue and should be phrased appropriately, but the bottom line is that you are correct.

__________________________________________

Samuel R. Ward, PT, PhD

------------------

Hi John,

I emailed an article from JBJS that you might find interesting; although the author discusses the improper use of a post-hoc power analysis in cases where the results are not significant, I think that much of it is relevant to your question. As long as you have performed a power calculation using a clinically meaningful difference, then your study should have adequate power. If the results are significant, then I don't quite understand why there is even a question about power; perhaps you could reference this article in response to your reviewers. I am by no means an expert in statistics, but I hope this helps.

Regards,

Chris Deuel, Ph.D

------------------

CONCERNS ABOUT MEETING ANOVA ASSUMPTIONS

Hi, John

Good to hear you from Biomech-L.

I had same situation previously on repeated measure studies. I hope these articles would give some help to you. I like Overall_Doyle 1994 article.

Normally according to your pilot study, we will get proposed statistical power. And using this result of statistical power, you need to search the value inside of table depending on your number of treatment. However, most of article only suggested one-way repeated measure situation. I have not seen two-way repeated measure case for sample size determination. Thus you have to guess the appropriate number of treatment in your study.

It's not easy job.

There is alternative way in stat if your data collection was done.

Since ANOVA and repeated measure ANOVA used a model (assumption of normality for ANOVA and sphericity for repeated measures ANOVA), there is error in reporting when sample size was small. In case of huge violation of assumption, the results are meaningless sometimes. Because it is very hard to meet this assumption with small sample.

Thus some people suggested non-parametric stat method. Some non-parametric stat method is assumption-free method. And it's idea is same as paired t-test. Wilcoxon T test or Wilcoxon test is the non-parametric equivalent of paired t test.

I hope it would work for you.

Have a good weekend!!

YK - Young-Kwan Kim

------------------

Hi John,

My name is Tamar and I'm a Biostatistics PhD student. I got your email through Biomch-L.

There are probably people who can be more helpful, but in case you didn't hear from them---

There are a few problems with small samples. Foe once, you would need a pretty big effect to reach statistical significance.

Another important issue is that of asymptotic tests. Many statistical tests rely on asymptotic distributions and thus with a small sample size, the may not be valid. For example, the variance of the parameters estimates and their confidence intervals that are output in regular software packages are not reliable.

What I would recommend you to try is "exact methods"- usually permutation based. (You can look up this link for example for an explanation http://v8doc.sas.com/sashtml/stat/chap28/sect28.htm )

The basic idea in permutation tests are - you take the measurements you have and put aside your summary statistics. You permute the results between the groups or environments and have a new summary statistics. You repeat many times. At the end you compare your "real" statistics with the distribution of the permuted ones and see how likely it is to receive this statistics if the measurements were taken randomly.

Another question- did you compare the results of an experiment on a subject between two environments and then took the difference? or did you "ignore" the fact that you have the same subject doing both experiments in two environments? (because, depends on your design and question, I'm not sure it helps you in terms of type 1 and type 2 error rates that you have two or three experiments done by the same person in different environments).

I hope I helped you somehow

- Tamar Sofer.

------------------

John,

It sounds to me that you are technically correct in your statistics; however, while a significant result by definition means your power is sufficient (for that variable), there is a danger in having too few samples. Consider if you only ran a single subject, found significance and claimed success. I am not up to date on guidelines that define the minimum number of samples to ensure your sample is representative. Maybe you could do an analysis where you add or remove samples and compare the increasing samples until they demonstrate convergence? Perhaps you could make the argument that you are in practice testing the population and not just a representative sample? I'm not sure if that would be easier to pass by a reviewer or not. Otherwise you might try the nonparametric methods. I'm not convinced that they would be more correct than your repeated measures ANOVA, but you may find reviewers are more accepting.

Good luck

Bryan Kirking

------------------

Dear Dr. Dewitt,

In my opinion, your argument is sound. With your sample size, you were able to determine that p < 0.05, which is a well-accepted criterion for statistical significance. Certainly, with a larger sample size you might be able to measure a smaller p value if the trend continues, but unless you think that p < 0.05 is insufficient, adding more subjects wouldn't help that concern. I also agree that a power analysis is important when you do not detect a statistically significant difference, in order to determine how confident you are that there is no difference. However, with a small sample size there is the concern as to whether the data are really normally distributed. To address this concern, I recommend performing a non-parametric test (e.g., the Wilcoxon Signed Rank Test in place of, or in addition to, a paired t-test).

In our research we also often have repeated measures designs. I find that people are very confused as to how we can detect small differences with large variations between subjects. Often I find it useful to present the difference in a variable (calculated for each subject) due to the treatments - in this case, the standard deviation bars shrink considerably and you can easily visualize the difference between treatments.

Thanks for your time.

Sincerely,

Lou DeFrate, Ph.D.

------------------

John - I take your example of a repeated measures design where each subject tests two or more treatments. If the small number of subjects is truly a random sample from a homogeneous population, where the differences in the response measure between any two treatments are normally distributed and all these differences have the same population variance, then yes, the t-test or multiple comparisons after ANOVA are all valid. The problem lies with the above assumptions - both with how the subjects are sampled and with respect to the distribution of the response variable. With very small sample sizes, moderate violations of these assumptions can lead to completely erroneous results. With larger sample sizes, ANOVA and t-tests are much more robust to such violations.

As a reviewer or reader of the journal, I would be extremely skeptical of trying to make inferences about a population of astronauts from what you observed on 4 or 5 self-selected subjects (who are probably not even astronauts!). Also, all you are doing when you do ANOVA or t-tests is compare means - it says nothing about variance or (for example) what percent of potential astronauts might be helped by this countermeasure. You could not hope to estimate this percentage with any reliability with such small sample sizes.

You could point out these limitations in your manuscript - but whether the journal would accept this sort of disclaimer, I wouldn't know.

Al - Al Feiveson

------------------

I'll echo Al's comments, and provide my "2-cents" as well, as I've experienced similar struggles with small-n studies, and as a Biostatistician have had to really contemplate the appropriateness of hypothesis testing in general, and then specifically using traditional t-test or ANOVA techniques. And I've taught a fair amount on this subject too, so forgive my verbose email!

One requirement that we all have with our small-n studies is to begin our inquiry into the data with a very deliberate, nearly obnoxious test of model assumptions. This is especially relevant because, as Al indicated, what we learned early in grad school about statistics, when we were all "newbies" to the math and art of it, was that "ANOVA is robust to violations of assumptions," but what we sometimes forget from those lessons is that this is only true for violations of SOME of the assumptions, and then only when there is sufficient n to be able to rely on the law of large numbers. We don't live in that world here at NASA most of the time, so we MUST pay special attention to our assumption testing.

With repeated measures designs in the ANOVA context, that means NOT ONLY that our data are normally distributed, but also that they meet the assumption of sphericity, and homogeneity of variance (if there are any between-subjects factors). These latter two are critical, and there are statistical tests to determine whether you've met the assumptions. With big-n studies, even if your data aren't normally distributed, studies have shown that as long as you meet the latter two assumptions, ANOVA usually performs adequately. But again, we don't have big-n, so we can't rely on that being true for us. And, if you haven't met the assumptions, then you probably should be using other techniques, or at least considering alternative adjustments for violations, and/or data transformations.

In the instances where you DO MEET all of the statistical assumptions (possibly after some data transformations), then it would be beneficial in your publication efforts that you BEGIN your statistics/results section by clarifying what tests you have performed to test them, and that your data meet the assumptions. That way you convince the reviewers/readers that your data are appropriately analyzed with your techniques. Then proceed...

As for post-hoc tests with repeated measures, that's another area where researchers sometimes get confused, and a statistically savvy reviewer will pick up on it. The term "post-hoc test" is typically reserved for comparisons between GROUPS, not between times/within groups, so I need to clarify that we're not talking about something like Tukey's post-hoc adjustments, but instead something like Bonferroni, or other flavors appropriate for within-group comparisons. There again, you're starting to add fuel to the fire for the skeptic, so you should be considerate of the skeptics and choose more CONSERVATIVE options for these tests. Bonferroni is a good choice because it's commonly used and known to be conservative. Another approach might be to utilize a-priori contrasts, which hold alpha to .05 (or whatever) for k-1 comparisons, where k = the number of levels of your repeated-measures factor. Sometimes researchers want to "make all pairwise comparisons" because they haven't clearly thought out what they REALLY want to do, so they avoid a-priori contrasts because they're limited to k-1 comparisons. This is an unfortunate situation, because often we aren't truly interested in ALL pairwise comparisons, but instead in a scientifically meaningful set of comparisons. For example, if you're interested in comparing pre-flight observations to multiple observations taken during and post flight, then simple-effects contrasts comparing all values to pre would solve the problem without getting excessive with your Bonferroni adjustments for all possible pairs (do you really care if R+12 is different from R+14?). There are other commonly used contrasts too... maybe you want to model the PATTERN of change to determine if it fits different polynomial functions (common in biological science).
A few other choices exist, but my point is that maybe you can avoid some criticism by narrowing in on the comparisons that ARE important, and thus increase your potential for determining significance (because you haven't adjusted critical alpha so much), AND address your scientific inquiry more appropriately. Icing on the cake--less critical reviews.
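The arithmetic behind this point is easy to show. As a sketch (k = 5 is just an assumed example, not a value from the study), dividing alpha over all pairwise comparisons is much harsher than dividing it over k-1 planned contrasts:

```python
# Bonferroni cost: all pairwise comparisons vs. k-1 planned contrasts.
k = 5            # assumed number of repeated-measures levels (illustrative)
alpha = 0.05

all_pairs = k * (k - 1) // 2     # 10 pairwise comparisons
planned = k - 1                  # 4 contrasts, e.g. everything vs. pre-flight

print(f"all pairwise: per-test alpha = {alpha / all_pairs:.4f}")  # 0.0050
print(f"planned:      per-test alpha = {alpha / planned:.4f}")    # 0.0125
```

With the planned contrasts, each test is run against a critical alpha two and a half times larger, so a real effect of a given size is considerably easier to detect.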

Having said all of this... you might also consider newer statistical techniques commonly referred to as mixed modeling, multi-level modeling, hierarchical modeling, or growth modeling. These techniques are fairly recent extensions of ANOVA/regression, and they are better capable of handling some of the data problems that we experience with repeated-measures ANOVA. They won't solve all of your post-hoc comparison problems, but they are generally more appropriate for all longitudinal research (big or small n). If your data meet the assumptions for these tests, then you're better off starting here instead of ANOVA. (FYI: I've seen NIH make references in their presentations to new investigators stating something to the effect that if you propose repeated measures ANOVA for longitudinal research instead of MLM, it's a flat-out reason for rejection.) Remember... you need to meet the assumptions of WHATEVER test you employ, so you might as well shoot for the best test first!

There are good applied text books out there on these techniques if you are interested. Software is a little tricky sometimes, but SAS does an excellent job if you're already a SAS user. STATA is equally good and easier to use if you're starting from scratch. R software is commonly used too but I've no experience with it. Avoid SPSS for these techniques...

Rob

________________________________________________

Robert J. Ploutz-Snyder, PhD

------------------

Hello

I am far from an expert in statistics; however, like you, I often work on small samples. I generally use non-parametric tests, based on the idea that t-tests and ANOVAs usually require n > 30 and are certainly based on the assumption of a normal distribution. Apparently, tests of normality also require n > 30, so in very small samples it is not possible to test for normality.

I don't know if that will help you any...

Johanna

Johanna Robertson

------------------

Dr. Dewitt, I do not consider myself a stats expert; however, I do teach basic stats to our orthopaedic residents and at the College of Charleston, where I am a professor in exercise science.

While I understand your concern regarding sample size I would encourage you to use a non-parametric test on your samples.

With a repeated measure, use Friedman's test, and with 3 or more separate groups, use Kruskal-Wallis.

While these are less powerful tests (and ANOVA and t-tests are actually more robust), this may reduce the referee concerns and comments you are seeing on your reviews.

With two-sample comparisons, use the Mann-Whitney or Wilcoxon test, depending on whether the samples are independent or paired, respectively.
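A minimal SciPy sketch of this mapping, with made-up numbers (six values per condition; none of these data come from the study under discussion):

```python
from scipy import stats

# Illustrative data only: three conditions, six values each.
c1 = [10.0, 12.0, 11.0, 13.0, 12.0, 11.0]
c2 = [12.0, 13.0, 12.0, 15.0, 14.0, 13.0]
c3 = [14.0, 15.0, 14.0, 16.0, 15.0, 15.0]

# Repeated measures (same six subjects in each condition): Friedman test.
print(stats.friedmanchisquare(c1, c2, c3))

# Three independent groups: Kruskal-Wallis test.
print(stats.kruskal(c1, c2, c3))

# Two samples: Wilcoxon signed-rank if paired, Mann-Whitney U if independent.
print(stats.wilcoxon(c1, c2))
print(stats.mannwhitneyu(c1, c2))
```

The choice between the last two is exactly the paired-vs-independent distinction above: `wilcoxon` ranks the within-subject differences, while `mannwhitneyu` ranks the pooled observations.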

My contribution may be what you already know.

If not, hope this is helpful.

Regards.

Bill Barfield, Ph.D., FACSM

------------------

Pax!

In fact I was just calculating sample size effects for our own paper.

The required sample size depends entirely on the purpose of the study, the desired strength of the results, the estimated size of the effect, and other background assumptions. For instance, if I want to establish a correlation between two variables with p < 0.05 and r >= 27 %, then n of about 50 is required. But if one wants, e.g., to test whether an exercise program affects weight, then 5 persons may suffice. In this case, if the null hypothesis is no change, then an observed weight decrease for all 5 has p = 1/2^5 = 1/32 (simple sign test). (Of course, without a control we cannot conclude it was the program alone that produced the change...)

The point is to be able to show with probabilistic arguments that the probability that the result is due to *pure chance* is less than 1:20. Non-parametric tests (such as Wilcoxon) can be useful here since they do not require assumptions about the distributions. The sign test, though, does not reflect the size of the change in the example, so one might instead use the variable z = x_before - x_after (or a scaled variant [x_before - x_after]/x_before, if appropriate). If the hypothesis is that no systematic change has occurred, one may assume that z is normally distributed around 0, and use the sample variance Vx_before + Vx_after to estimate that of z. This finally lets us calculate p = Prob(Z < z) (the probability that the decrease is due to chance).

One possibility today is to use computer simulations to calculate probabilities; they can make strong arguments by showing *concretely* the odds against the result being due to chance. Ok, I have to hurry to the office.
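The sign-test arithmetic above is easy to verify: with 5 subjects under a null of no change, each is equally likely to move up or down, so the chance that all 5 move the same given direction is:

```python
# One-sided sign-test probability: all n subjects change in the same
# direction purely by chance under a null of "no effect".
n = 5
p = 0.5 ** n
print(p)   # 0.03125, i.e. 1/32 < 0.05
```

Note that with n = 4 the same calculation gives 1/16 = 0.0625, which no longer clears the 0.05 threshold, which is why a unanimous result in 5 subjects is about the smallest design this argument can rescue.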

Regards Frank Borg

------------------

John Dewitt,

It is my understanding that ANOVA and the t-test have an underlying normality assumption; with such a small sample size you cannot show that your samples are in fact normally distributed. In other words, the a priori p value is meaningless if the assumptions of the testing tool are violated. I recommend non-parametric tests like the Mann-Whitney or Kruskal-Wallis tests; they are not as powerful, but that is the trade-off.

By the way, the rule of thumb given to me to test normality was a minimum sample size of 12. This was quickly followed by, "I have been sworn to secrecy as to from where that number came."

Cheers,

Rob Richards

------------------

Have you looked into the aspect of non-parametric v. parametric statistical results? That addresses 'small' sample sizes...I am not an expert but I had a similar issue in the past....good luck

Nicole Jacobs

------------------

Hello,

Just my humble opinion, but given that Type I error can still be committed, even with small samples, the reviewers' concerns are valid.

A solution would be to assure your reviewers that you have tested your data for assumptions of normality (e.g., skewness, variance), and given the repeated measures design, that sphericity is not an issue. Even with small sample sizes, Type I error is a potential risk because of the influence of skewness associated with a small cluster of observations at one end. Also, with small samples, providing all your data in a table might alleviate concerns - it allows the reader to see the variance, etc.

Thanks,

Daniel

Daniel Cipriani, PT, PhD

------------------

Dear John,

There are several problems with small samples:

1. Statistical power: you analysed this problem correctly, in my eyes.

2. The verification of normal distribution, which is one of the prerequisites for applying parametric tests such as Student's "t" and ANOVA (to a lesser extent for the latter). The problem is that the tests available to check normality of distribution are not appropriate for small samples. Thus, a major criticism is the choice of statistical tests. Non-parametric tests, such as a Wilcoxon matched-pairs test, would be more appropriate.

3. Power to generalise observations from the sample to the target population. Of course, if the sample is small, the chance is large that the sample is not representative enough of the population.
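Point 2 above can be illustrated with a hedged sketch (the numbers are invented): a normality test such as Shapiro-Wilk will run on five points, but it has so little power at that sample size that a non-rejection tells you almost nothing about the true distribution.

```python
from scipy import stats

# Illustrative only: a visibly skewed sample, but just 5 points.
sample = [0.2, 0.4, 0.9, 2.5, 7.1]

stat, p = stats.shapiro(sample)
print(f"Shapiro-Wilk W = {stat:.3f}, p = {p:.3f}")
# Whatever p comes out, with n = 5 the test has little power to
# detect non-normality, so "not rejected" is not "normal".
```

This is why, for very small samples, the safer route is the non-parametric test, which does not rely on an assumption that cannot be checked.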

I hope that this is of use

With kindest regards

Veronique

__________________________________________________ ______

Prof. V. Feipel, PhD

------------------

John,

Just a quick note . . . I 'll try to add more later:

I use paired testing methods in my studies as well due to sample size/ cost constraints. In your case, I'm obviously not familiar with the specifics; however, two possibilities come to mind:

1. If your subjects are not randomly drawn from the target population, the results are biased regardless of sample size and/or the statistical outcome.

For example: measuring weight in a population of athletes would not represent the general US population. Usability assessed on four 20-year-olds cannot be extrapolated to older adults.

It seems obvious; however, I have seen several studies where internal subjects were used to assess product usability. The results, of course, did not match the general population because of the subjects' familiarity with the control design.

2. Is normality assumed? The complexity of the measure's distribution may not be captured with such a small sample size. If the distribution is sufficiently skewed, nonparametric techniques would be required.

Scott A. Ziolek

------------------

Dear Dr. De Witt,

My understanding is that most traditional parametric statistical analyses (e.g., the t-test) are based on the assumption that the data are normally distributed. When this assumption is true, the p-value, based on the area under the normal distribution bell curve, is meaningful. However, with only 4-10 data points, it is very difficult to assess whether the data fit the assumption of a normal distribution. In this case, using a traditional statistical analysis may be problematic, even though your results are "significant". Based on my previous conversations with some statisticians, they usually recommend having at least 30 data points in order to assess the distribution of the data, which can be very challenging in biomechanical research. Recently, some modern statistical analyses have appeared that don't require the normal distribution assumption. You can try them to see if the results match those of the paired t-test. If they do, that may provide strong support to convince the reviewers.

Ching

************************************************** **************

Liang-Ching Tsai, MS, PT

------------------