Statistical Power and Sample Size - Summary of Replies Part 3


    Continuation of the summary of the replies to the question posed by John De Witt regarding statistical power and sample sizes. Part 3 includes responses in the categories of 'APPLICATION TO THE POPULATION', 'PROVIDE EFFECT SIZE CALCULATIONS WITH CONFIDENCE INTERVALS TO GIVE AN INDICATION OF THE MAGNITUDE OF THE DIFFERENCE', 'PERFORM A PRIORI POWER ANALYSIS TO JUSTIFY THE SMALL SAMPLE SIZE', and 'GENERAL COMMENTS'.



    Hello Dr. Dewitt,

    I feel your pain! I also run repeated measures studies with limits on how large I can make my sample, and I test populations known to have high variability.

    The smaller your sample size, the greater the chance that your sample is not representative of the whole population. That can make reviewers skeptical of your results even if they do reach statistical significance. The first people to volunteer tend to be very active, highly motivated people (or their parents are highly motivated, if you are testing minors, etc.). Individual differences and outliers can have a much stronger effect on the overall study results than in larger samples. You could get positive results that may not apply to the whole population, depending on how representative your sample is.

    For an extreme example, last I checked, there were 9 people known to have a complete loss of their sense of proprioception. One of them, Ian Waterman, actively participates in research. By testing Ian, we may guess that this condition affects people in their 20s but that it is possible for people to relearn how to control their movements through the use of vision alone. However, even though my sample is more than 10% of the entire population, the things we learn from studying a small sample may not apply to the rest of that population: most of the rest of that population developed this condition when they were 50 or more years of age, and Ian is the only one who has relearned how to walk. He is an outlier, and his high level of functioning makes it less daunting for him to get to the labs for research studies.

    In terms of planning studies, multi-site research can help get larger samples. Depending on the population, it may also help if you can get portable data collection equipment as many people are willing to let researchers come to their home rather than drive several hours for a study.

    In terms of publishing, providing the observed power statistics when you send in your results can help address researchers' concerns--statisticians are typically satisfied if power is above .7, though lower is acceptable in some fields.

    Some journals will flat out not accept papers if the sample size is considered too small, so checking if the journal has published some papers in similar populations with small sample size before deciding where to submit the paper can help. The more focused the journal is on your particular research area, the more likely reviewers are to understand the inherent limitations on sample sizes, along with the benefits of your repeated measures design.

    If you have provided information on group means & stds, consider providing data on each individual's performance. When reviewers see that 9 out of the 10 people in the sample benefited from a particular environment, it helps assure them that your significant results aren't just due to an outlier or two. Depending on what kind of measurements you are collecting, MANOVAs may be able to increase your power, especially if you have multiple dependent variables.

    Many reviewers will simply be satisfied if you make certain to address the limits in sample size in the discussion, and suggest other researchers try to replicate your results before giving a lot of weight to the recommendations you suggest based on your results. Others may prefer that the paper be presented as a 'pilot study' rather than something more definitive, but would still be perfectly willing to publish that pilot study, and others may still be perfectly willing to cite it.

    Hope this helps,

    Genna Mulvey, Ph.D.


    Excellent questions. I believe there is quite a problem in the scientific literature based on the inconsistent and incorrect use of statistical analysis techniques. Too many researchers were poorly trained in statistics and do not attempt to remedy that or try to keep up with advancements in the field. The push for many journals to have standards for statistical analysis and recent papers on statistical analysis support your questions and my concerns. I hope you had a chance to read the papers out earlier this year on statistical analysis by Will Hopkins et al. in MSSE (Feb 09) and by myself in Sports Biomechanics (March 09).

    Here are my opinions on your questions:

    Rejecting the null hypothesis does not negate all concerns about a small sample size. Remember, the size and quality of the sample (how representative it is of the population) are of critical importance because the assumed purpose of statistical tests is to act as decision makers about the effect of the treatment/independent variable on THE POPULATION. Too many modern scientific writers forget this fact and mix up the internal and external validity issues in writing up their reports. Often modern writers talk about the statistical test they do on the sample evidence, and unconsciously switch to external validity and talk about the results in general. Often they compound their mistake of overgeneralizing from a small convenience sample to the population of similar subjects by OVERgeneralizing to all subjects and musculoskeletal systems! To make matters worse, it doesn't matter what the statistical test says if there is just one error in experimental control or bias introduced from non-randomization in the sample.

    My advice is to focus the discussion with the reviewer/readers on the justification for the sample size and limit the discussion of your results to your sample. All statistical tests rest, somewhere, on a subjective decision rule: you have to subjectively set the alpha level and the expected difference/association. Focus on why you need the sample to be small. Unfortunately, there are not a lot of well-known sample size calculation formulae for the repeated measures designs you described.

    In some cases, what looks like a small sample is actually almost the whole population. John, I believe that if you studied 10 astronauts in a study it would certainly represent a large percentage of astronauts on the planet. If the study is a preliminary exploration of an issue, it certainly makes sense to use a small sample, limit the explanation/discussion of the results to the sample, and encourage further study in a larger sample or other populations to verify the results. I don't believe it is very effective to argue that biomechanical data are hard and costly to collect and calculate, so a sample size of 10 is common in biomechanics. I think you are just as likely to run into a small sample bias in reviewers from almost any journal (biomechanics or not).

    Editor and reviewer bias/standards about sample sizes are hard to overcome, but if you focus on the statistical and research design issues and limit your discussion of the results, you have put yourself in the best position to win the argument.

    Some of the best work is often published in second tier journals because of bias and other injustice in so-called top tier journals. Remember that you will have the last laugh when your paper published in this less prestigious journal creates numerous citations and contributes to the decline in prestige of the journal with reviewers that did not focus on substantive issues.

    Duane Knudson, Ph.D.

    Dear Dr. Dewitt,

    I have encountered the same problem a few times recently.

    Assuming sample size was the only thing that has been criticized, statistically speaking, in your manuscript, I would debate how we understand the term "Significant difference".

    Some reviewers suggest that a p-value represents whether your observed difference really exists or whether you saw the difference by chance, and that a greater sample size will reduce the chance of the latter. Stats textbooks tell us otherwise. A p-value tells us the chance that your observed difference represents the population from which your data were sampled. There is a difference in your samples if you see one, but you are risking a false positive, or type I error, meaning there may not be a difference in the population even if you see a difference in your sample. In your case, I sense that you have sampled the entire population! You are not even trying to use your data to represent a larger population; you have exhausted, or come close to exhausting, the population. Therefore, there is a difference if you see one. Theoretically, you don't even need to run a t-test or paired t-test, if that is the case.

    Tell the reviewers that you have exhausted the population, if I am correct here.

    "Statistics helps you generate an educated guess if you don't know the truth," a friend of mine once said. "You don't need statistics if you know the truth."


    Li Li, Ph.D.

    Dr. Dewitt,
    Generally the comment on small sample size is that the results may not be generalizable. Certainly small sample size can decrease statistical power, but small sample sizes also tend to violate normality assumptions in the distribution, so the statistical inferences become less trustworthy. In particular, your sample mean values may not accurately represent population mean values. Therefore it is hard to infer if the population mean really changes during the 2 or 3 conditions in your experiment. Thus this is a type I error issue.

    In terms of explaining the work, this seems to be a limitation that needs to be addressed, but not necessarily correctable. You could say that "This result needs to be confirmed/replicated in other similar experiments." A more statistically explicit way to deal with the issue may be the use of Bayesian inference, but this is not very widely used, and may be more difficult to explain than "the small sample size is a limitation."

    I hope that helps.

    Hyun Gu Kang, PhD


    Please post all the replies you get so that we all can learn from them.
    Obviously, if you find significance the effect was large enough. The concern is in the actual power of the sample size. One might surmise that the generalizability of the results is limited due to the small sample size. In other words, even though power appears to be sufficient since the null hypothesis was rejected, we may have a question of whether or not the sample studied accurately reflects the population under study (your statistical procedures assume normally distributed data, and with a small sample size it is difficult to assess whether or not the distribution truly is normal). If you look at your confidence intervals they may be rather wide even though you found significance. It would seem to me, and I am not a statistician, that ways around this might be to compare your confidence intervals with those of similarly done studies that have been published and use these data in your discussion when you address the small sample size as a limitation, to explain clearly why you had to have such a small sample size, and/or to call the manuscript a pilot.

    I'm not sure if this helps at all.




    Hi John,
    If you look at where the statistical techniques that we base all of our "significant" results on came from, they all assume decisively normal / Gaussian distributions, which is impossible to verify with any confidence for any sample that isn't on the order of hundreds or thousands of samples at least, and the holy "p < 0.05" threshold is completely arbitrary and has no basis in physiology.

    I'm not sure exactly how to address the issues you mentioned, but one statistic I like a lot is the effect size. It was developed by a guy named Cohen:

    Cohen J (1990). Statistical Power Analysis for the Behavioral Sciences. New Jersey: Erlbaum.

    The effect size (ES) is simply the ratio of the difference in group or condition means to the pooled standard deviation. Cohen provides some guidelines for interpreting ES as a measure of the "biological significance" of the results. For example if ES > 0.8, that implies a result with "strong" biological significance that would be even more significant at the p-value level if you had more subjects in the study. I like to use it as an argument against folks who complain about small sample sizes, even though I see that argument as trivial and pointless in our field since everyone's sample size is small.
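    The ratio described above can be sketched in a few lines of Python. This is only an illustration: the function name `cohens_d` and the data are hypothetical, and the pooled standard deviation is computed in the usual degrees-of-freedom-weighted way, which is an assumption about what the reply intends.

```python
from math import sqrt
from statistics import mean, stdev


def cohens_d(group_a, group_b):
    """Difference in means divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    # Pooled variance weights each sample variance by its degrees of freedom.
    pooled_var = ((n_a - 1) * stdev(group_a) ** 2 +
                  (n_b - 1) * stdev(group_b) ** 2) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical measurements, invented purely for illustration.
condition_1 = [12.1, 13.4, 11.8, 14.0, 12.9]
condition_2 = [10.2, 11.9, 10.8, 12.1, 11.0]
d = cohens_d(condition_1, condition_2)
print(f"Cohen's d = {d:.2f}")
```

    With a repeated measures design, some authors instead standardize by the standard deviation of the paired differences, so it is worth stating in the paper which version was used.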

    Hope all is well,
    Ross Miller



    It appears to me you are doing things right for a data analysis that is focused on p-value based interpretations. I assume you have adjusted the critical p value if you were doing multiple comparisons in the single paper. Other than that, my only suggestion is, if you are thinking about doing a post-hoc power test to prove to the reviewers that there was sufficient power even with your small sample size, then the following article arguing against that approach could be considered:
    Hoenig, J., & Heisey, D. (2001). The abuse of power: The pervasive fallacy of power calculations for data analysis. The American Statistician, 55, 19-24
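    As a small illustration of adjusting the critical p value for multiple comparisons, here is a sketch using the Bonferroni adjustment. The reply does not say which adjustment is meant, so Bonferroni is an assumption, and the function name and p values are hypothetical.

```python
def bonferroni_alpha(family_alpha, n_comparisons):
    """Per-comparison critical p value under a Bonferroni adjustment."""
    if n_comparisons < 1:
        raise ValueError("n_comparisons must be at least 1")
    return family_alpha / n_comparisons


# Hypothetical p values from three comparisons reported in one paper.
p_values = [0.004, 0.020, 0.030]
critical_p = bonferroni_alpha(0.05, len(p_values))
# A comparison counts as significant only if it beats the adjusted threshold.
significant = [p <= critical_p for p in p_values]
print(critical_p, significant)
```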

    I would be interested in seeing other comments you get.

    Gordon Chalmers, Ph.D.

    Dear John,

    Since it's a must to work with small sample sizes you are to do three things:

    1- Make sure that no one in your field has done similar research using bigger sample sizes.

    2- Report in your paper the motive and/or reasons to do so:
    it should make sense if it is the nature of your research and all the researchers in your area have reported similar cases and have used small sample sizes as well.

    3- When presenting your results, report the power of your statistical analysis associated with the significance levels you obtained for your tests (you should be able to calculate this value easily using your statistical software package **SPSS, SAS, or Minitab**). Power of 80% or more should justify your results.


    Best Regards

    Tamer Khalaf, PhD

    Dear John,

    I am not an expert, but I would say you are largely right. As you say, the sample size you need depends on your effect size. If whatever you're looking at has a large effect, you only need a small sample. The classical reference here is the work by Jacob Cohen - for instance "Cohen J. A power primer Psychological Bulletin 112(1992):155-159" which you can find on the web through Google.

    Nevertheless, it would still be good (definitely for a paper but I suppose also for your application to an ethical committee?) to perform a power analysis. You can download free software (G*Power) to do that from

    Since you probably have a good idea of the effect size you try to find, and have good data on standard deviations, you should be able to do sample size calculations which you can include in future papers.

    One thing where the reviewers might have a point is the generalizability of your findings. After all, with only four volunteers they might have some special characteristic? Therefore you might want to describe clearly in the methods section of your paper how you selected your volunteers, and in the discussion section you should try to point out why you think your research findings on four people are more widely valid.

    Hope this helps and best regards,

    Jan Herman

    Hi John,
    As an author I certainly understand your frustration, but as a peer reviewer this is one of the ways we gauge how rigorous the science is. No doubt some reviewers are more picky than others, but there's probably room for compromise on both sides: it really is the author's job to demonstrate that their experiments have been designed to maximize the value of the results.

    Small samples are a fact of life in our field, and your approach of using repeated measures designs is one way to deal with this. However, taking the extra step of doing a "power" or "sample size" analysis is not terribly onerous, and can help you determine how many reps/conditions you need with a given sample to achieve a certain power.

    All you need is a computer program that does power calculations. There are many out there, but my favorite is G*Power, created at University of Duesseldorf. You can get it free at this website:
    Among the F-tests you will find various repeated measures designs to choose from.

    Another suggestion, if you want to gain a better understanding of the "nuts & bolts" behind power analysis, is to get a copy of Cohen's "Statistical Power Analysis for the Behavioral Sciences" (Lawrence Erlbaum Associates, NJ). It's basically the bible of statistical power. It is also convenient that G*Power bases its calculations largely on Cohen's work, so having both at hand is very helpful.

    Chris A. McGibbon, PhD


    All I can give you is my 2 cents. First, a little statistical knowledge is dangerous and many reviewers fancy themselves experts.

    If you conducted pre-experimental sample size estimation for a power of 0.80, you are on fairly safe ground. You also have to demonstrate that the data meet the assumptions of the statistical test (most people neglect this aspect of data analysis). A significant result then means something. I do not understand what the problem is.

    If you conducted pre-experimental sample size estimation at a power of 0.80, and the data meet the assumptions of the statistical test, and you fail to demonstrate significant differences, then there really were no significant differences. Post-experimental power analysis will always say that you need more subjects, and numerous papers exist that explain why post-experimental power analysis is a fallacious form of data analysis. This is where most people get "dinged".

    You will have to justify your results based on pre-experimental sample size estimation techniques and that the data meet the assumption of the statistical tests. If the reviewer continues to give you a hard time, they probably do not understand statistics enough to recognize the right answer and you are in trouble. Sorry ;-)

    Best Wishes,

    David Gabriel

    Hi John,

    I don't know exactly what the reviewers have written but I have gotten similar comments before. Interestingly enough, I received similar comments on computer simulations of N=1, which didn't make sense at all.

    It sounds like your statistics are solid. What you could do is add a power analysis to your manuscript justifying the small sample size. While this wouldn't change anything in the subsequent analysis, it would take the concerns off the reviewers' minds before they even think about questioning it.

    Good luck,
    Michael Liebschner

    Hi John,

    Like I said, I even got the comment when I submitted a manuscript on a numerical simulation. Since the model will always return the same answer it really didn't make any sense to add another model.

    You also made an interesting point that I sometimes question in work done by my colleagues. In your case your small sample size resulted in rejecting your null hypothesis, which is good. My collaborators sometimes use a small sample size because the experiment is cumbersome and time consuming. However, they ended up having to accept the null hypothesis. When I asked them to do a power analysis with the data at hand to see if it would make sense to just add a couple more samples, I was just given the comment that their previous analyses justified the sample size. This brings up an interesting question: what is more important, your statistics or the actual research findings? In my opinion, statistics is only one way to explain the data, and the actual data are more important. If your research data suggest that you should go back and include a few more samples to get statistical significance, then you should have the obligation to do so.

    Best wishes,
    Michael Liebschner

    One suggestion might be to perform a brief pilot study prior to engaging in the full blown event. Using your pilot data you can calculate an effect size and then using some statistical software (i.e. G*Power) you could calculate a sample size appropriate for the effect size at a given level of power.
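    A rough sketch of that workflow in Python follows. It is not what G*Power computes: it uses a normal approximation for a two-sided paired comparison, which slightly underestimates the required n when samples are small. The function names and the pilot data are hypothetical.

```python
from math import ceil
from statistics import NormalDist, mean, stdev


def paired_effect_size(pilot_pre, pilot_post):
    """Mean of the paired differences divided by their standard deviation."""
    diffs = [post - pre for pre, post in zip(pilot_pre, pilot_post)]
    return mean(diffs) / stdev(diffs)


def paired_sample_size(d_z, alpha=0.05, power=0.80):
    """Approximate number of subjects for a two-sided paired comparison.

    Normal approximation: n = ((z_(1 - alpha/2) + z_power) / d_z) ** 2.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(((z_alpha + z_power) / abs(d_z)) ** 2)


# Hypothetical pilot data: the same five subjects in two conditions.
pilot_pre = [10.0, 12.0, 11.0, 13.0, 12.0]
pilot_post = [12.0, 11.5, 12.5, 13.0, 13.0]
d_z = paired_effect_size(pilot_pre, pilot_post)
n_needed = paired_sample_size(d_z)
print(f"d_z = {d_z:.2f}, subjects needed = {n_needed}")
```

    An effect size estimated from a handful of pilot subjects is itself very uncertain, so the resulting n is best treated as a planning figure rather than a guarantee.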

    In your manuscripts, in order to limit the discussion on whether or not your sample size is sufficient you could discuss the findings of the pilot study and your sample size determination/power analysis.

    Hope this helps.

    Jason Scibek, PhD, ATC

    Dr. De Witt,
    You may be able to respond by performing an a priori power analysis with something like GPower software that will tell you the number of subjects required for the study based on pilot or others' work. I believe this will certainly improve your study and the significance of the results, rather than choosing what might be considered by the statistically bent reviewers to be a random number of collected subjects. I would certainly be interested to hear what others on the listserv say though, and most importantly how the reviewers have responded to your pleas.


    Toran MacLeod

    I ran into the same problem in trying to get my dissertation approved through my college dean. It wasn't until I ran a power analysis and showed that I could have sufficient power in a repeated measures design with 8 participants that he signed off on the prospectus. You often can't go back and collect data on more individuals, but if you include an a priori power analysis in your methods section, that may answer the reviewer's concerns.
    Gary Christopher


    Dear John,
    I work in the field of sport psychology and movement science and often have the same problem.
    Recently we started to use a priori calculations on sample size, power, etc using the GPOWER Software. It can be used for free.
    You can find it here:
    It is used a priori, for instance to calculate sample size, given alpha, power and effect size.
    It is used post-hoc to calculate achieved power, given alpha, sample size and effect size.
    If you can estimate effect sizes from current research and use a priori calculation to justify your design, this may help in your argumentation. Even if you calculate achieved power in case you have an effect or not, this will strengthen your argumentation and get rid of any subjective judgment on sample sizes. A further resource is: Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates...but maybe you already know this.
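    To make the post-hoc direction concrete, here is a minimal sketch that takes alpha, sample size and effect size and returns an approximate achieved power for a two-sided paired comparison. It relies on a normal approximation rather than the noncentral t distribution a dedicated power program would use, and the function name and inputs are hypothetical. Other replies in this summary caution against leaning on post-hoc power, so treat it as an illustration of the calculation only.

```python
from math import sqrt
from statistics import NormalDist


def approx_power_paired(d_z, n, alpha=0.05):
    """Approximate power of a two-sided paired comparison.

    Normal approximation: power ~ Phi(|d_z| * sqrt(n) - z_(1 - alpha/2)).
    The small contribution of the opposite tail is ignored.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    return z.cdf(abs(d_z) * sqrt(n) - z_crit)


# Hypothetical effect size of 0.8 evaluated at several sample sizes.
for n in (5, 10, 20):
    print(n, round(approx_power_paired(0.8, n), 2))
```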
    Tom (Thomas Heinen)

    Hi John, I too have the same issue. I have done a series of longitudinal studies on the motion mechanics of pregnant women. They were tested 5 times. I also have a control group, so I have a really good understanding of what is a change due to pregnancy and what is due to repeat testing. Getting pregnant women into a lab when they A) have morning sickness and B) are likely not to have even told their friends and family they are pregnant because they are less than 12 weeks along is really, really hard. I still get papers knocked back because of small n (n=9 maternal in this case). My frustration is mounting. Given the very small number of papers on the gait of pregnant women in existence, I find it astounding that the papers keep being knocked back because of sample size. Perhaps it is no wonder there is a small number of published studies. There may well be several studies out there that have never managed to get past this issue with the journals.
    Please put your summary of replies on biomech-l, as I think there are probably a number of us who are working with small n for a very good reason, but it is still valuable information to get out.
    Wendy Gilleard

    Hi John,

    Unfortunately this is a very 'easy' and common criticism of studies: that the sample size is too small. Often, I feel, the criticism is made without much thought. In my opinion, there are three essential issues regarding sample size selection:

    1) The nature of the population. For young, healthy, heterogeneous individuals, a sample size of around 10 seems fine for most biomechanics studies; some would say that once the sample size is in the double digits it's acceptable! For a clinical/patient population, a larger sample size would be required to be able to generalize the results to that clinical population.

    2) Study design. If your study involves any kind of between-group comparison then more than 10 would be needed (depending on the anticipated effect size), but it doesn't sound like this is the case for your research.

    3) Available subject pool. It sounds like this is a limitation for you.
    This is often also a limitation in clinical research where there are just not a lot of patients available that meet certain criteria. As long as this limitation is explained in the paper, it should be acceptable.

    Finally, as long as you justify your sample size, I believe it would be harsh to allow a seemingly small sample to prevent a paper from being published.

    I hope this helps!

    Avril Mansfield

    Hi John,

    I believe the rule of thumb for comparing the means of two groups is that you need 16 per group to detect an effect size of 1 (which is actually rather large). While not a formal reference, I believe this link helps to explain that concept:
    For my own a priori calculations, I usually refer to Chow's "Sample Size Calculations in Clinical Research"
    p/0824709705). This book contains formulae for some common study designs in clinical research. If that doesn't contain the design I'm concerned about, I will try searching statistics journals for relevant articles.
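    The 16-per-group rule of thumb can be checked with a short sketch. It assumes a two-sided test at alpha = 0.05 with 80% power and uses the normal approximation n = 2(z_{1-alpha/2} + z_{power})^2 / d^2; those settings and the function name are assumptions, not something stated in the reply.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate subjects per group for a two-sided two-group comparison.

    Normal approximation: n = 2 * (z_(1 - alpha/2) + z_power) ** 2 / d ** 2.
    """
    z = NormalDist()
    z_sum = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return ceil(2 * z_sum ** 2 / effect_size ** 2)


# Smaller effects require disproportionately larger groups.
for d in (1.0, 0.8, 0.5):
    print(f"d = {d}: about {n_per_group(d)} per group")
```

    Because n scales with 1/d^2, halving the effect size roughly quadruples the number of subjects needed per group.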

    In reading over my response, I noted a typo. Of course, I meant with a young, healthy, *homogeneous* population, a sample of ~10 is sufficient.

    I look forward to reading the summary of responses.

    Avril Mansfield

    Small sample size is indeed a concern when using standard statistical measures. Try using Effect Size statistics and magnitude-based inferences; this will overcome your problem.
    There are recent publications by Batterham and W. Hopkins (one in MSSE and one in IJSPP) on this issue. Also look at Will Hopkins site for further reading and helpful spreadsheets.

    Kind regards,

    Paul Montgomery

    Hey John,

    The described problem seems to be very common in many research areas that only have a small sample size available. To understand the concerns the reviewers have, it is necessary to clarify the meaning of "significance" in this context.

    First of all a statistical analysis normally is performed to estimate an effect, observed in a representative sample and to transfer that observation to a bigger population. For this reason, the experimental sample should fulfil some conditions related to the research question / hypothesis.

    To estimate an effect from a sample, it should be normally distributed, have a certain standard deviation, etc. If you only want to draw conclusions about a very small population that presumably has a small standard deviation and a limited number of varying factors that (in the best case) could even be described or controlled, a very small sample size may be sufficient. In the discussion it needs to become clear why this small sample size should be sufficient. Sometimes one should even ask whether a statistical analysis is needed at all in these cases.

    If you want to draw conclusions for a bigger population that has more and some uncontrolled factors that may also affect the measured variables, you need a bigger sample size. The sample should have the same standard deviation that would be expected for the whole population. Normally this is only provided by a certain minimum sample size. Otherwise the SDs of the sample and of the population are not homogeneous.

    In addition, the sample size depends on the number of factors you want / need to measure. The more factors you have, the bigger your sample size needs to be. In the best case, you only have one factor (all other conditions are constant / or very well controlled for all measurements).

    Furthermore the sample size depends on the magnitude of the detectable contrast you want to measure. This again is directly related to the standard deviation of the sample or (as described above) to the presumed SD of the whole population.

    A "significant" difference for a measured parameter depends on two facts. One is the sample size; the other is the magnitude of the difference between the two samples relative to the standard deviation of the parameter. In a paired test the sample sizes should be the same (as every subject performs every test). If the sample size is very small, the standard deviation within one test may be very small as well. This need not represent the "real" standard deviation you might get if you were measuring more subjects or looking at the whole population. If the difference between the first and the second test is somewhat bigger than the small standard deviation, you may see a "significant" difference between the tests. This may be true for the small sample size, but need not be true for a bigger sample size or even a whole population.

    This leads back to the introduction. The sample size and subject selection must be related to the research question and the size of the population for which the estimations / conclusions should be made.
    Dr. Lars Janshen


    My guess is the reviewers are concerned that you have a Type I error, rather than Type II:
    Null hypothesis was true and you said it wasn't.
    The findings are outside your alpha level (extremes).
    The problem is you may lead others astray to a dead end.

    This could still happen in the case you have suggested. One of the ways to decrease the reviewers' concerns would be to provide them with an effect size (basically how much different the conditions are). SPSS and some other programs will calculate this for you when you select the appropriate option; however, it is not available under t-tests, only with ANOVAs. This isn't to say you can't calculate it by hand after running a t-test. Then you can say whether the effect is small, medium or large.

    Effect sizes using partial eta squared ($\eta_p^2$) were also obtained for each dependent variable using the formula: $\eta_p^2 = SS_{effect} / (SS_{effect} + SS_{error})$, where $SS_{effect}$ = effect variance and $SS_{error}$ = error variance. Interpretation of effect size was done using a scale for effect size classification based on f-values for effect size, which were converted to $\eta_p^2$ using the formula: $f = (\eta_p^2 / (1 - \eta_p^2))^{0.5}$. Consequently, the scale for classification of $\eta_p^2$ was: < 0.04 = trivial, 0.041 to 0.249 = small, 0.25 to 0.549 = medium, 0.55 to 0.799 = large, and ≥ 0.8 = very large [Comyns TM, Harrison AJ, Hennessy L, Jensen RL. The determination of the optimal complex training resistive load in male rugby players. Sport Biomech 6: 59-70, 2007]. I hope the equations come out...
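    For readers who want to compute this by hand from an ANOVA table, here is a minimal sketch. It uses the standard definition of partial eta squared, SS_effect / (SS_effect + SS_error), the conversion to Cohen's f, and the classification cut-offs quoted above. The function names and the sums of squares are hypothetical.

```python
from math import sqrt


def partial_eta_squared(ss_effect, ss_error):
    """Partial eta squared: SS_effect / (SS_effect + SS_error)."""
    return ss_effect / (ss_effect + ss_error)


def cohens_f(eta_p2):
    """Cohen's f from partial eta squared: sqrt(eta_p2 / (1 - eta_p2))."""
    return sqrt(eta_p2 / (1 - eta_p2))


def classify(eta_p2):
    """Label an effect using the cut-offs quoted in the reply above."""
    if eta_p2 < 0.04:
        return "trivial"
    if eta_p2 < 0.25:
        return "small"
    if eta_p2 < 0.55:
        return "medium"
    if eta_p2 < 0.80:
        return "large"
    return "very large"


# Hypothetical sums of squares read from a repeated measures ANOVA table.
eta_p2 = partial_eta_squared(ss_effect=30.0, ss_error=70.0)
print(f"partial eta^2 = {eta_p2:.2f} ({classify(eta_p2)}), "
      f"f = {cohens_f(eta_p2):.2f}")
```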

    Essentially it gives you an indication of how different is different, and not just the probability of difference. This is because you could find a significant difference, but it doesn't really mean much because the values are still so close together. Hope that helps.

    Randall Jensen, PhD, FACSM, CSCS

    A small sample might consist only of outliers. Even if you reject the null hypothesis, you might just be evaluating the outliers. Although rejecting shows that you didn't make a type II error, it doesn't show that you got it right either. You might just be working on the outliers of a bigger sample: still doing the right thing, but only for the outliers, because your sample is not big enough.
    Hope this gives a different point of view to your understanding.
    Senay Mihcin

    John De Witt, Ph.D., C.S.C.S.
    Exercise Physiology Laboratory Lead / Biomechanist
    Exercise Physiology Laboratory
    NASA - Johnson Space Center
    281-483-8939 / 281-483-4181 (fax)