Test-retest reliability: implication of the average or single measures of ICC models


  • Test-retest reliability: implication of the average or single measures of ICC models

    Dear subscribers,

    The literature is inconsistent on whether to choose ICC(2,1) or ICC(2,k) for day-to-day test-retest reliability assessment, and on how to interpret them. I have a question about this and would be grateful if anyone could share their experience. Please also let me know if my understanding of ICCs, as described below, is mistaken.

    Following Shrout and Fleiss [1], the six ICC forms are denoted ICC(n,k), where n = 1, 2, 3 is the main model and k denotes a single measure (k = 1) or the average of several trials (k > 1). For test-retest studies in which we wish to generalize the results and find the trial-to-trial or day-to-day relative reliability, the second model (two-way random, n = 2) is suitable [2]. I have no question about n, so let's focus on choosing k.

    In a test-retest context, k is the number of trials in each session (the number of columns of data). Suppose we have collected 3 trials of data for 10 subjects on two different days. For within-day (intra-session) reliability we can use ICC(2,1) and ICC(2,3), and we can also compute ICC(2,2) after eliminating the last column. An acceptable ICC(2,1) value means that the within-day recordings are reliable and one can rely on the result of a single trial (perhaps the first trial). ICC(2,3), on the other hand, implies that the reader should rely on the average of the 3 trials. Accordingly, if ICC(2,1) turns out to be unacceptable, I think it is better to compute ICC(2,2) before calculating ICC(2,3): if ICC(2,2) is acceptable, it tells readers that averaging two trials rather than three is enough.
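
    For readers who want to try this, here is a minimal sketch of how the single-measure and average-measure coefficients can be computed from the two-way ANOVA mean squares given by Shrout and Fleiss [1]. The function name `icc2` and the variable names are my own, and the 6 x 4 data set is the worked example from that paper as I recall it, so treat it as illustrative only.

```python
import numpy as np

def icc2(scores):
    """Return (ICC(2,1), ICC(2,k)) for an n-subjects x k-trials array."""
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    grand = x.mean()
    # Sums of squares for a two-way layout with one score per cell
    ss_subjects = k * np.sum((x.mean(axis=1) - grand) ** 2)
    ss_trials = n * np.sum((x.mean(axis=0) - grand) ** 2)
    ss_error = np.sum((x - grand) ** 2) - ss_subjects - ss_trials
    # Mean squares: between subjects, between trials, residual
    bms = ss_subjects / (n - 1)
    jms = ss_trials / (k - 1)
    ems = ss_error / ((n - 1) * (k - 1))
    single = (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n)
    average = (bms - ems) / (bms + (jms - ems) / n)
    return single, average

# Illustrative data: rows are subjects, columns are trials (or raters)
example = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
single, average = icc2(example)
print(round(single, 2), round(average, 2))
```

    To get ICC(2,2) from three trials of data, pass only the first two columns, e.g. `icc2(np.asarray(data)[:, :2])`.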

    For day-to-day (inter-session) reliability in the above example, I think only ICC(2,1) would be informative, because if we use ICC(2,k), where k is the number of days, its message would be that one should rely on the average of the values from different days!
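
    A short worked relation may make this concrete. Using the mean-square notation of Shrout and Fleiss [1] (BMS between subjects, JMS between columns, EMS residual, n subjects), the two coefficients are linked by the Spearman-Brown step-up, which follows algebraically from their definitions:

```latex
\mathrm{ICC}(2,1) = \frac{BMS - EMS}{BMS + (k-1)\,EMS + k\,(JMS - EMS)/n},
\qquad
\mathrm{ICC}(2,k) = \frac{BMS - EMS}{BMS + (JMS - EMS)/n}

\mathrm{ICC}(2,k) = \frac{k\,\mathrm{ICC}(2,1)}{1 + (k-1)\,\mathrm{ICC}(2,1)}
\qquad\text{e.g. } k = 2,\ \mathrm{ICC}(2,1) = 0.60
\ \Rightarrow\ \mathrm{ICC}(2,2) = \frac{1.20}{1.60} = 0.75
```

    So with two days, ICC(2,2) is always at least as high as ICC(2,1) whenever ICC(2,1) is positive, but it describes the reliability of the two-day mean, not of a single day's score.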

    Contrary to my understanding above, some researchers have used ICC(2,k) for day-to-day reliability assessment. For the aforementioned example, which ICC models would you suggest for intra- and inter-session reliability?

    Sorry for my lengthy post.

    References:
    1. Shrout, P.E. and J.L. Fleiss, Intraclass correlations: uses in assessing rater reliability. Psychological bulletin, 1979. 86(2): p. 420-428.
    2. Weir, J.P., Quantifying test-retest reliability using the intraclass correlation coefficient and the SEM. The Journal of Strength & Conditioning Research, 2005. 19(1): p. 231-240.


    Bests,
    Ali

    -----------------------------------
    M.A. Sanjari, PhD.
    Director of Biomechanics Lab.,
    Rehabilitation Research Center, TUMS.
    Tehran University of Medical Sciences
    Tel: (+98) 21 2225 9306
    Fax: (+98) 21 2222 0946

    -----------------------------------

  • #2
    Re: Test-retest reliability: implication of the average or single measures of ICC mod

    Dear Ali,

    My co-authors and I deal with this issue directly in a paper that answers your questions. The response would be quite lengthy, but if you e-mail me directly (dgabriel@brocku.ca) I will send you a reprint. I will do so for anyone else who finds this topic interesting. The reference is below:

    Best Wishes,

    David

    Christie, A., Kamen, G., Boucher, J.P., Inglis, J.G., & Gabriel, D.A. (2010). A comparison of statistical models for calculating reliability of the H-reflex. Measurement in Physical Education and Exercise Science, 14, 164-175.



    • #3
      Re: Test-retest reliability: implication of the average or single measures of ICC mod

      It's definitely the (2,1) model. I explain all in the well-cited Hopkins WG (2000). Measures of reliability in sports medicine and science. Sports Medicine 30, 1-15.

      Will
      Will G Hopkins, PhD FACSM
      Contact info: http://sportsci.org/will
      Sportscience: http://sportsci.org
      Statistics: http://newstats.org
      Be creative: break rules.



      • #4
        Re: Test-retest reliability: implication of the average or single measures of ICC mod

        Originally posted by whopkins59:
        It's definitely the (2,1) model. I explain all in the well-cited Hopkins WG (2000). Measures of reliability in sports medicine and science. Sports Medicine 30, 1-15.
        It is true that this paper is well cited, and it is a first-rate review paper all the way. However, it does not answer the question of the original poster, who is mainly concerned with determining the optimal combination of days and trials for his proposed experimental design. We address not only the most appropriate ICC model but also the most appropriate statistical model from which to make the calculations. All of these things are specified in Christie et al. (2010).

        Best Wishes,

        David



        • #5
          Re: Test-retest reliability: implication of the average or single measures of ICC mod

          I am guilty of not reading the original poster's message carefully. He wanted to know what formula to use to calculate the retest correlation between two days when on each day you perform three trials and average them. I said it's the (2,1) model, which is the plain old between-day retest correlation that you would get almost exactly by doing the usual Pearson correlation, but I should have also said that you do the calculation using the mean of each day's three trials.

          My Sports Med paper does actually provide a practical approach to estimating the between-day error of measurement for the mean of within-day multiple trials and for deciding when there is nothing to be gained by adding more trials on each day. I wasn't too concerned about the various intraclass correlations then or now, because they can be hard to interpret, and because the error of measurement is the key. There is, of course, the problem of thresholds for important differences or changes in the variable under study, and the default approach is fractions and multiples of the between-subject standard deviation (my thresholds are 0.20, 0.60, 1.2, 2.0 and 4.0 for small, moderate, large, very large and extremely large). The intraclass correlation, being a combination of within-subject and between-subject variance, is a statistic telling you how big the error is in comparison with magnitudes that matter. It's easy to show that the correlations corresponding to the error being equal to each of the thresholds are 0.96, 0.74, 0.41, 0.20 and 0.06. But this scale doesn't apply if the magnitude thresholds are defined some other way, as they need to be for competitive performance of athletes.
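
          The correlations quoted above can be reproduced with a few lines, assuming the thresholds are multiples of the true (error-free) between-subject standard deviation, so that ICC = true variance / (true variance + error variance). That assumption and the function name are mine; it is simply the reading that gives back the stated values.

```python
# Default magnitude thresholds, as multiples of the between-subject SD
thresholds = {
    "small": 0.20,
    "moderate": 0.60,
    "large": 1.2,
    "very large": 2.0,
    "extremely large": 4.0,
}

def icc_when_error_equals(ratio):
    """ICC when the error SD equals `ratio` times the true between-subject SD."""
    # With the true SD set to 1: ICC = 1 / (1 + error variance)
    return 1.0 / (1.0 + ratio ** 2)

iccs = {label: round(icc_when_error_equals(r), 2) for label, r in thresholds.items()}
print(iccs)
```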

          David, thanks for the praise of my paper. It's disappointing that the only Hopkins you cited in your paper was not I, considering I addressed the question of combining trials. You have provided formulae for estimating the various variances and the ICC from ANOVA, which was already well documented and is now a bit passé. You can get the errors directly from mixed modeling, which is easier when you know how and allows for imbalance. There is also the issue of confidence limits for the errors and the ICCs... You need to use the Satterthwaite approximation to get the degrees of freedom for combined error variances, which then gives their confidence limits, and the F distribution to get the confidence limits for the ICCs derived from the variances.

          Will



          • #6
            Re: Test-retest reliability: implication of the average or single measures of ICC mod

            Hi,

            I once learned from one of Ton's messages [1] that it is better to give a brief scientific summary in a reply and then cite the work. In a lengthy discussion, the art of simplification (summarizing) becomes important.

            Thanks for your helpful comments.

            Ali

            Ref.:
            1 - http://biomch-l.isbweb.org/threads/2...2689#post22689



            • #7
              Re: Test-retest reliability: implication of the average or single measures of ICC mod

              David, in his paper [1], advocates nested ANOVA as an alternative to the repeated-measures ANOVA scheme. My original question, however, is more basic: how to choose a suitable model within the framework of Shrout and Fleiss [2]. I wish to discuss the following points, although I now know the answer to some of them:

              1-Contrary to some works, I think the (2,k) model is not suitable for day-to-day reliability assessment, as it implies averaging over days, which is not desirable. Therefore the (2,1) model should be used. In this case we have an nx2 matrix of data, where n = number of subjects and the columns are the day 1 and day 2 scores.

              2-To obtain a column that represents a day in the above matrix, one can either average the within-day trials or use the first trial of the day. The choice depends on whether the (2,k) or the (2,1) model, respectively, gives an acceptable value on the nx3 matrix of within-day scores (in this matrix the number of columns is the number of trials, k).

              What do you think of the following steps for reliability assessment on two different days, each consisting of 3 trials for n subjects?

              1-Within-day (test session) reliability assessment (we have an nx3 data matrix):
              a. Perform ICC(2,1) on the whole nx3 matrix.
              b. If reliable, keep only the first trial for further analysis (e.g. step 3)
              c. If not reliable, keep only the first two columns (trials).
              d. Perform ICC(2,k), which is now ICC(2,2), on the nx2 data.
              e. If reliable, average the first two trials to form one column of data and keep it for step 3.
              f. If not reliable, perform ICC(2,k), which is now ICC(2,3), on the whole nx3 data.
              g. If reliable, average all three trials to form one column of data and keep it for step 3.

              2-Within-day (re-test session) reliability assessment (we have an nx3 data matrix):
              a. Same as above

              3-Day-to-day reliability assessment:
              a. Define an n by 2 data matrix whose columns are obtained from the sub-steps b,e, or g above.
              b. Perform ICC(2,1) on this nx2 matrix.
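
              The steps above can be sketched in code as follows. This is only an illustration of the proposed procedure, not a recommendation: the function names are hypothetical, the 0.75 cut-off for "reliable" is an assumed placeholder (no threshold is specified above), the ICC formulas are those of Shrout and Fleiss [2], and the data are simulated.

```python
import numpy as np

def icc2(scores):
    """Return (ICC(2,1), ICC(2,k)) for an n-subjects x k-columns array."""
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    grand = x.mean()
    ss_subjects = k * np.sum((x.mean(axis=1) - grand) ** 2)
    ss_columns = n * np.sum((x.mean(axis=0) - grand) ** 2)
    ss_error = np.sum((x - grand) ** 2) - ss_subjects - ss_columns
    bms = ss_subjects / (n - 1)
    jms = ss_columns / (k - 1)
    ems = ss_error / ((n - 1) * (k - 1))
    single = (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n)
    average = (bms - ems) / (bms + (jms - ems) / n)
    return single, average

def summarize_session(trials, threshold=0.75):
    """Steps 1a-1g: use the fewest trials whose ICC reaches the threshold."""
    trials = np.asarray(trials, dtype=float)
    if icc2(trials)[0] >= threshold:          # a, b: ICC(2,1) on all trials
        return trials[:, 0], "first trial"
    if icc2(trials[:, :2])[1] >= threshold:   # c-e: ICC(2,2) on two trials
        return trials[:, :2].mean(axis=1), "mean of 2 trials"
    if icc2(trials)[1] >= threshold:          # f, g: ICC(2,3) on all trials
        return trials.mean(axis=1), "mean of 3 trials"
    # The steps above do not say what to do in this case
    raise ValueError("no within-day summary reached the threshold")

def day_to_day_icc(day1_trials, day2_trials, threshold=0.75):
    """Step 3: ICC(2,1) on the n x 2 matrix of per-day summary scores."""
    score1, how1 = summarize_session(day1_trials, threshold)
    score2, how2 = summarize_session(day2_trials, threshold)
    single, _ = icc2(np.column_stack([score1, score2]))
    return single, how1, how2

# Simulated example: 10 subjects, 3 trials on each of 2 days (assumed values)
rng = np.random.default_rng(0)
true_score = rng.normal(50.0, 10.0, size=10)
day1 = true_score[:, None] + rng.normal(0.0, 1.0, size=(10, 3))
day2 = true_score[:, None] + rng.normal(0.0, 1.0, size=(10, 3))
print(day_to_day_icc(day1, day2))
```

              Note that, as written, the two sessions may end up summarized differently (for example, first trial on day 1 and a three-trial mean on day 2); the steps do not say how to handle that, so the sketch simply reports which summary was used for each day.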

              Bests,
              Ali

              References:

              1.Christie, A., G. Kamen, J.P. Boucher, J. Greig Inglis, and D.A. Gabriel, A Comparison of Statistical Models for Calculating Reliability of the Hoffmann Reflex. Measurement in Physical Education and Exercise Science, 2010. 14(3): p. 164-175.

              2.Shrout, P.E. and J.L. Fleiss, Intraclass correlations: uses in assessing rater reliability. Psychological bulletin, 1979. 86(2): p. 420-428.
