Summary of Responses: Minimum number of trials for reliable measures?

Corey Scholes
05-03-2006, 09:48 AM
Hello to all,

Here is a summary of responses to my query on determining the minimum number of trials necessary to describe a 'stable/reliable' performance. Based on the information received, I will use pilot testing to quantify the variability of the measurement equipment and a reliability study to quantify the variability of the measures in my target population. It seems that the minimum number of trials is dictated by the question to be answered and the movement/task of interest. I will use a combination of power calculations (based on pilot testing) and intra-class correlations to determine the minimum number of trials for maximum reliability. I would be very interested in any additional thoughts on the discussion so far.
I have included my initial query below and the responses underneath.

Many thanks to all that responded

Corey Scholes
............................................................

Hi everyone,

I'm investigating the effects of fatigue on knee function during landing movements. I am using a simple pre-post comparison to determine changes in muscle activation and knee load. Unfortunately, there seems to be some uncertainty about the minimum number of trials necessary to obtain a reliable representation of normal performance, particularly for landing movements other than walking, running and drop landing.

Can anyone offer an opinion and/or references on how to determine when a particular measure is 'stable' or 'reliable'? One suggested criterion is that successive mean deviations fall within 1/4 of a standard deviation of the mean value for each variable (Bates et al 1983; Hamill & McNiven). Many thanks to Nick Stergiou, who recommended the text Innovative Analyses of Human Movement (Human Kinetics), where this criterion is mentioned.

Is the 1/4 SD a common approach? Is anyone using other criteria to determine adequate 'stability' of a measure prior to a treatment? Hopefully this will generate some discussion on this forum, as it seems to be a largely unwritten component of biomechanics studies in the literature.
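For readers unfamiliar with the sequential-averaging idea behind the 1/4 SD criterion, it can be sketched roughly as follows. This is a minimal illustration, not Bates et al.'s exact procedure, and the trial values are invented:

```python
import statistics

def trials_to_stability(values, sd_fraction=0.25):
    """Earliest trial count after which every cumulative mean stays
    within sd_fraction * SD of the grand mean (sequential averaging)."""
    grand_mean = statistics.mean(values)
    band = sd_fraction * statistics.stdev(values)
    cum_means = [statistics.mean(values[:i + 1]) for i in range(len(values))]
    for i in range(len(cum_means)):
        # Stable once the running mean enters, and never leaves, the band
        if all(abs(m - grand_mean) <= band for m in cum_means[i:]):
            return i + 1
    return None  # never settled within the band

# Hypothetical peak knee-flexion angles (deg) over 10 landing trials
angles = [62.1, 63.5, 61.8, 62.9, 62.4, 62.6, 62.3, 62.7, 62.5, 62.4]
n_stable = trials_to_stability(angles)
```

Note that the answer depends on the (somewhat arbitrary) band width, which is exactly the criticism raised below.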

Of course I will post a summary of replies.

Thanks in advance

Corey Scholes

............................................................

Hi Corey,

I've done some work with neck EMG and reliability analysis of MVICs. We used only 5 subjects and got good results, using ICC and %SEM.
Reference is attached.
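As an aside for other readers: one common way to derive an ICC and %SEM from a subjects-by-trials table is a two-way ANOVA decomposition. A minimal sketch using the ICC(3,1) consistency form — the data are invented, and the attached paper may well use a different ICC model:

```python
import math

def icc_and_sem(table):
    """ICC(3,1) and SEM from a subjects x trials table
    (two-way ANOVA without replication). Returns (icc, sem, pct_sem)."""
    n, k = len(table), len(table[0])
    grand = sum(sum(row) for row in table) / (n * k)
    row_means = [sum(row) / k for row in table]
    col_means = [sum(row[j] for row in table) / n for j in range(k)]
    ss_subj = k * sum((m - grand) ** 2 for m in row_means)
    ss_trial = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in table for x in row)
    ms_subj = ss_subj / (n - 1)
    ms_err = (ss_total - ss_subj - ss_trial) / ((n - 1) * (k - 1))
    icc = (ms_subj - ms_err) / (ms_subj + (k - 1) * ms_err)
    sem = math.sqrt(ms_err)          # SEM from the error mean square
    return icc, sem, 100 * sem / grand

# Invented normalised EMG amplitudes: 4 subjects x 3 trials
data = [[10, 11, 10], [14, 15, 15], [20, 19, 21], [12, 12, 13]]
icc, sem, pct_sem = icc_and_sem(data)
```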

Hope this helps


Kevin Netto
Lecturer - Exercise and Sport Science
School of Science and Primary Industries
Charles Darwin University

Tel: +61 8 8946 6716
Fax: +61 8 8946 6847

............................................................

Hello Corey,

I read with interest your recent posting on BIOMCH-L. A few studies over the years have examined the number-of-trials reliability/stability issue for a variety of movement patterns and variables. As you indicated in your message, I have explored this issue somewhat for landing. In addition to the retrospective data presented in Dr. Stergiou's book, I conducted a follow-up study whose results were presented at ACSM along with a published abstract. The full manuscript was submitted to J. Applied Biomechanics, but unfortunately the reviewers indicated that there was little value in having such information. Based on your question and the number of publications on this topic over the years, there obviously is value in understanding the nature of inter-trial variability. The SD criterion method has been used in a few studies to determine stability, but the SD value is somewhat arbitrary, and the results (the number of trials necessary for stability) vary with the criterion value. Other, more traditional methods of determining reliability and stability have been suggested for examining this issue, but I am not aware of any published works for landing.

Best wishes and good luck!
C. Roger James, PhD, FACSM
Associate Professor and Director
Center for Rehabilitation Research
Clinical Biomechanics Research Laboratory
Department of Rehabilitation Sciences
3601 4th Street, MS 6223
Texas Tech University Health Sciences Center
Lubbock, TX 79430-6223

Email: Roger.James@ttuhsc.edu
Voice: (806) 743-4524 x280 (faculty office)
Voice: (806) 743-3251 (Center lab)
Fax: (806) 743-1262 (dept office)

............................................................

Hi Corey,

A measure is reliable if the noise in the data is small enough for you to detect what you are looking for.
A comparison has sufficient statistical power if it is able to detect the difference you were looking for.

Having said that.....

1. This is an interesting question and can be considered in various ways. First, I think you should be critical about allocating all the variance in a data set to the performance of the individual. Clearly, variance is related to three domains: the subject, the instrument and the tester. The first step is to quantify the variance attributable to factors other than the subject (not always easy). If 80% of the variance is related to the equipment, set-up or instructions, then trying to achieve a reduction in variability could be problematic.
Knowing the typical error associated with the set-up will often define the minimal detectable difference that is possible with it. If a subject is randomly performing within this typical error range, then you are unlikely to detect a smaller systematic change using a reasonable number of trials. Probably a good reason to stop.

2. Look at your derived variables. If you calculate the median/mean of a moving average of trials, you may find stability earlier than a single-trial data file would suggest. This type of analysis relates to the issues of group variance and removing random variance by some form of signal averaging (see A and B below).
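To illustrate this point: the mean of a k-trial window is less variable than single trials (the variance of a mean falls roughly as 1/k), so a derived variable built on trial clusters can look stable sooner. A quick sketch with invented trial values:

```python
import statistics

def moving_means(series, window):
    """Means of overlapping k-trial windows across a trial series."""
    return [statistics.mean(series[i:i + window])
            for i in range(len(series) - window + 1)]

# Hypothetical single-trial values across 10 trials
trials = [62.1, 63.5, 61.8, 62.9, 62.4, 62.6, 62.3, 62.7, 62.5, 62.4]
raw_sd = statistics.stdev(trials)
smoothed_sd = statistics.stdev(moving_means(trials, 3))
# smoothed_sd sits well below raw_sd: the clustered estimate settles sooner
```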

A. A method of determining the best way to reduce group variance.
If you are looking at the stability of derived variables, consider the group coefficient of variation (CV). Some work was done on this in determining EMG amplitude variance and how best to amplitude-normalise the data; it has been argued that reducing the CV of the group data is a key factor in choosing the best amplitude normalisation procedure. (In statistics, the number of trials needed to detect greater cross-group stability is a similar analogy.) The problem here is that normalising to reduce the CV could remove true biological variation, as was evident in amplitude normalisation of the EMG signal during eccentric and concentric actions.
So a feature you should consider is the magnitude of the bias you observe with continuing practice of the task, i.e. the systematic change across trials relative to the variance in the data. In other words, look at the p-value: the p-value of the trial effect tells you the effect size relative to the variance. If you are set on establishing p-values at 0.05, you are probably limiting the value of this technique, since once the value reaches p = .06 people will tend to say there are no differences from trial 4 onwards. This (un)fortunately will probably get through many journal reviews, so it is worth a try :-) The fact is that this is a relative comparison and again relates to the variance in the data. The p-value relates to statistical power: if you did the study on 12 people and found that the difference between trials became non-significant at trial 4, then with 20 subjects you would find you have to take more trials to "see" statistical stability. Therefore the key is to be able to quantify the magnitude of the bias as the person systematically changes their performance.

Review your derived variables. If they are estimators of performance criteria, are you using a cluster of trials to reflect what you see? The estimate of the performance is related to multiple tests of that performance, particularly if it is a skill-related task. (I know it is argued that maximal peak torque is the best performance measure, etc., but if you are looking for performance and representative movements then it is less clear.)
Consider the case of assessing joint-position matching error. The estimates of accuracy (mean error of a subset of trials) and precision (SD of the set of trials) derived from 6 trials improved the statistical power substantially compared with 3 trials. (Yes, calculating an SD from a population of 3 numbers is common and accepted in this area of the literature.) So, in terms of logistics, we use the idea that the derived variable is only an estimate of the true performance, and we therefore investigate the stability of the derived variable in an attempt to improve statistical power (Allison and Fukushima, Spine, 2005).
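The power gain from six trials over three shows up in the standard error of the derived mean, which shrinks with the square root of the number of trials. A toy illustration — the repositioning errors below are invented, not taken from Allison and Fukushima:

```python
import math
import statistics

def accuracy_precision_se(trials):
    """Accuracy (mean error), precision (SD), and the standard error
    of the derived mean for a cluster of trials."""
    acc = statistics.mean(trials)
    prec = statistics.stdev(trials)
    return acc, prec, prec / math.sqrt(len(trials))

# Hypothetical joint-repositioning errors (deg) over 6 trials
errors = [1.8, -0.6, 2.4, 0.9, -1.2, 1.5]
acc3, prec3, se3 = accuracy_precision_se(errors[:3])
acc6, prec6, se6 = accuracy_precision_se(errors)
# se6 < se3: more trials -> a tighter estimate -> more statistical power
```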

B. Try to find a clinically meaningful correlate.
Where possible, this quantification can be anchored to something with clinical meaning. For example, in the management of pain syndromes, some studies have quantified the threshold at which patients feel there has been a clinically meaningful change; that threshold then defines when a clinically meaningful change in performance has occurred.
In biomechanics, you might try to derive a clinically meaningful benchmark from pilot work. For example, the difference between a bench press with 40 kg and one with 45 kg could be taken to reflect a small, clinically relevant effect.

The 1/4 SD criterion is of interest. I am sure people have rules of thumb that would work equally well.

Good luck

Dr Garry T Allison Associate Professor of Physiotherapy
The Centre for Musculoskeletal Studies http://www.cms.uwa.edu.au/
School of Surgery and Pathology, The University of Western Australia.
Level 2 Medical Research Foundation Building Rear 50 Murray Street Perth Western Australia 6000.
email: gta@cms.uwa.edu.au
ph: (618) 9224 0219. Fax: (618) 9224 0204
The University of Western Australia: CRICOS Provider No. 00126G

............................................................


I'd suggest using a power calculation. You would need to decide what magnitude of difference you would like to be able to detect, then use the power calculation to find out how many trials you would need, given the amount of variability you have (so you'd need some preliminary data to measure this variability).
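A rough version of that power calculation can be done with the usual normal-approximation sample-size formula (paired/one-sample form; a t-based calculation would give slightly larger numbers). The delta and SD values below are placeholders you would replace with your pilot data:

```python
import math
from statistics import NormalDist

def trials_needed(delta, sd, alpha=0.05, power=0.80):
    """Trials needed to detect a mean change `delta` given
    trial-to-trial SD `sd` (two-sided, normal approximation)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_b = NormalDist().inv_cdf(power)          # ~0.84 for 80% power
    return math.ceil(((z_a + z_b) * sd / delta) ** 2)

# e.g. to detect a change equal to half the trial-to-trial SD:
n = trials_needed(delta=2.5, sd=5.0)
```

The formula makes the trade-off explicit: halving the detectable difference quadruples the number of trials required.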

Good luck with your work,

Sharon R. Bullimore
Room B304
Human Performance Lab
Faculty of Kinesiology
University of Calgary
2500 University Drive NW
Calgary Alberta
Canada T2N 1N4
Email: sbullimore@kin.ucalgary.ca
Phone: 1-403-220-2269
Fax: 1-403-284-3553

............................................................

I strongly advise you to read the chapter by Dr. Bates and colleagues in the Human Kinetics book "Innovative Analyses of Human Movement". In general, his work and Dr. Dufek's address this exact problem.
Nick Stergiou, PhD
Isaacson Professor and Director of the HPER Biomechanics Laboratory
University of Nebraska at Omaha
6001 Dodge St.
Omaha, NE 68182-0216
tel. 402-554-2670
fax. 402-554-3693
e-mail: nstergiou@mail.unomaha.edu

............................................................

I use an ICC of >0.7 and a CV% of less than 15% (Stokes, M.J. Reliability and repeatability of methods for measuring muscle in physiotherapy. Physiother Pract. 1:71-76, 1985).

Good luck.

Brian Schilling, PhD, CSCS
Assistant Professor, The University of Memphis
Exercise Neuromechanics Lab Director

NSCA South Central Regional Coordinator
O: 901-678-3475 F: 901-678-3591