
Assessing inter- and intra-rater reliability of movement scores and the effects of body-shape using a custom visualisation tool: an exploratory study | BMC Sports Science, Medicine and Rehabilitation


Study design and ethical approval

The study aimed to investigate inter-rater and intra-rater reliability in movement assessments, specifically examining the influence of athletes’ body shapes on these evaluations, and was approved by the research ethics board at the University of Ottawa (file no. H-10-19-4983). Experts in orthopedics, physiotherapy, strength and conditioning, kinesiology, and movement performance were recruited as raters. The study used motion capture data from 542 athletes to create 630 animations of seven common screening movements, allowing a detailed examination of reliability across and within sessions and an exploration of potential weight biases in assessments. Ethical considerations were addressed through informed consent procedures before and after the study; participants were initially unaware of the investigation’s full scope to minimize bias.

Settings and participants

Raters with expertise in orthopedics, physiotherapy, strength and conditioning, kinesiology, and movement performance were recruited for this study. Before data collection began, each rater completed an online form providing demographic information, including age, gender, job title, years of experience, certifications, and the average number of movement assessments performed per day, week, month, or year. The consent form described the purpose of the study as examining the inter-rater reliability of the dataset; to elicit unbiased, natural responses, the additional aims of examining intra-rater reliability between and within sessions were omitted.

Procedures

Animation preparation

To create the animations, motion capture data from 542 athletes (473 males, 69 females) performing seven movement screening tasks (i.e., bird-dog, drop-jump, hop-down, L-hop, lunge, step-down, and T-balance) were collected in the USA between 2012 and 2016. At the time of collection, athletes competed in one of 12 sports (i.e., baseball, basketball, cricket, football, golf, lacrosse, rugby, soccer, squash, tennis, track and field, or volleyball) and ranged in skill level from youth to professional (e.g., NFL, NBA, MLB, FIFA). The average age, height, and weight were 20.2 ± 4.7 years, 183.3 ± 19.3 cm, and 83.1 ± 22.9 kg, respectively. Athletes were included in the study provided they were physically able to compete in practices and games at the time of collection. To collect whole-body kinematics, 42 markers were placed on anatomical landmarks and captured using an 8-camera Raptor-E motion capture system (Motion Analysis, Santa Rosa, CA, USA). All data were labelled and gap-filled in Cortex (Motion Analysis, Santa Rosa, CA, USA).

Once the data were cleaned, MoSh (Motion and Shape capture) was applied. With MoSh, body-shape and kinematic data are encoded separately, so they can be manipulated independently of one another: body-shape is adjusted through the 10 weights that represent it, whereas kinematics are altered by changing joint angles and how they evolve over time. The marker set used, while resembling the ideal marker set proposed by Loper et al., 2014, was not identical; it lacked markers on the breasts, buttocks, and hands. The breast and buttock markers are needed to fit the female body-shape model, so only male motion data were retained for this analysis. The hand markers are needed to create realistic hand movements; since our data did not include them, the hands were removed from the animations. For this study, the 5th, 50th, and 95th percentile body-mass indexes (BMI) of the dataset were calculated and used as the cut-offs for the three body-shape classes: underweight, normal, and overweight (Fig. 1).
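As an aside on the classification step, a minimal sketch of how the BMI-based cut-offs could be computed is shown below, assuming NumPy; the function names are ours, not from the authors’ pipeline.

```python
import numpy as np

def bmi(height_cm: np.ndarray, weight_kg: np.ndarray) -> np.ndarray:
    """BMI = weight (kg) / height (m)^2."""
    return weight_kg / (height_cm / 100.0) ** 2

def body_shape_cutoffs(bmis: np.ndarray) -> dict:
    """Return the 5th, 50th, and 95th percentile BMIs of the dataset,
    used as the cut-offs for the underweight, normal, and overweight
    body-shape classes."""
    p5, p50, p95 = np.percentile(bmis, [5, 50, 95])
    return {"underweight": p5, "normal": p50, "overweight": p95}
```

For scale, an athlete at the reported mean height and weight (183.3 cm, 83.1 kg) would have a BMI of about 24.7 kg/m².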

Fig. 1

An example of the three different body-shapes (underweight, normal, and overweight) used for the intra-rater reliability within session with body-shape modification

A database of 630 animations was created, consisting of 90 animations for each of the seven movements (7 movements × 90 animations = 630 animations). For each movement task, animations were created to test intrasession reliability, intersession reliability, and weight bias (Fig. 2), and to span a diversity of movement competency levels, with approximated scores ranging from 1 to 10 (10 being the best) based on the scoring of two pilot raters. The pilot raters assessed the animations without specific scoring criteria, and movement profiles were considered only when the two raters agreed. Because scoring criteria have been criticized for lacking sensitivity, we opted for a 1–10 scale to enhance the sensitivity of our evaluations. Animations with diverse movement scores between 1 and 10, reflecting the raters’ assessments, were then chosen as the 30 movers, 10 of whom were selected for body-shape manipulation.

To test intrasession reliability, the 30 movers with unique movement patterns and body-shapes were generated and duplicated, creating 60 of the 90 animations (Fig. 2). To test weight bias, three animations were created for each approximated score with identical movement patterns but manipulated body-shape, so that each of the three animations belonged to a different class (underweight, normal, or overweight), making up the remaining 30 animations (10 movement scores × 3 weight classes; Fig. 2). If a movement task was performed bilaterally, only animations for the right side were included.

In the debrief, after the body-shape manipulation was revealed, some raters disclosed their biases. Interestingly, some found it easier to rate individuals with more wobbly mass, citing it as an indicator of stability; others found it challenging, believing that the wobbly-mass motion detracted from the underlying movement pattern.
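The composition of the 90 animations per task described above can be expressed as a short enumeration; this is an illustrative sketch with hypothetical record fields, not the authors’ code.

```python
from itertools import product

MOVEMENTS = ["bird-dog", "drop-jump", "hop-down", "L-hop",
             "lunge", "step-down", "T-balance"]
WEIGHT_CLASSES = ["underweight", "normal", "overweight"]

def build_animation_set(movement: str) -> list[dict]:
    """Assemble the 90 animations for one movement task: 30 unique
    movers duplicated once (60 animations) plus 10 of those movers
    rendered in all three body-shape classes (30 animations)."""
    animations = []
    # 30 unique movers, each duplicated for intrasession reliability
    for mover, copy in product(range(30), (1, 2)):
        animations.append({"movement": movement, "mover": mover,
                           "body_shape": "original", "copy": copy})
    # 10 of those movers re-rendered per weight class for the bias test
    for mover, shape in product(range(10), WEIGHT_CLASSES):
        animations.append({"movement": movement, "mover": mover,
                           "body_shape": shape, "copy": 1})
    return animations

assert len(build_animation_set("lunge")) == 90
assert sum(len(build_animation_set(m)) for m in MOVEMENTS) == 630
```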

Fig. 2

A visual depiction of the animations being compared to assess inter- and intra-rater reliability. InterRater = inter-rater reliability between raters. InterSession = intra-rater reliability between days. IntraSession = intra-rater reliability within session without body-shape modification. BodyShape = intra-rater reliability within session with body-shape modification

Software preparation

A custom online visualisation tool was developed using the Unity game engine (Unity Software Inc., San Francisco, CA, USA), deployed on a Compute Canada server, and linked to a common domain name. The software contained three modules: Training, Day 1, and Day 2. Within each module, raters were able to zoom, rotate, and translate the animation for 360° views; play and replay the animation; score the animation; move between the next and previous animations; view the control shortcut keys; and return to the main menu (Fig. 3). For each animation, the score, the date and time of scoring, the time taken to score, and the number of replays were recorded and stored in a MariaDB database administered with phpMyAdmin.
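The paper lists the fields logged per scored animation but not the underlying schema; a minimal sketch of such a record, with hypothetical field names, might look as follows.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ScoreRecord:
    """One row per scored animation, mirroring the fields the paper
    says were stored; field names are illustrative, as the actual
    MariaDB schema is not published."""
    rater_id: int
    animation_id: int
    module: str           # "Training", "Day 1", or "Day 2"
    score: int            # 1-10 movement-competency score
    scored_at: datetime   # date and time the score was submitted
    seconds_to_score: float
    replays: int          # capped at three in the Day 1/Day 2 modules
```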

Fig. 3

A screenshot of the custom visualisation tool user interface

Protocol and outcome measures

The study consisted of three modules: Training, Day 1, and Day 2. Before scoring any movements, raters completed the training module, in which five animations for each movement were available to study. To select the training-module animations, two pilot raters completed Day 1 of the protocol, and animations on which the two raters agreed completely were chosen. Because the training animations were drawn from the testing database, the training animations for a given movement task had scores of either {1, 3, 5, 7, 9} or {2, 4, 6, 8, 10}, minimizing the number of animations the raters were exposed to before the start of the study. To minimize bias, the pilot raters’ scores were shown, but explanations for the scores were not provided. Raters were asked to use their training and expertise to determine their own scoring criteria based on the whole-body kinematics of the training animations. Raters could return to the training module at any time during the study and could replay the animations as many times as they liked.
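The training-set selection reduces to filtering for exact pilot-rater agreement at alternating odd or even target scores; the helper below is a hypothetical reconstruction of that logic, not the authors’ code.

```python
def pick_training_set(pilot_scores: dict, task_uses_odd: bool) -> list:
    """Select one training animation per target score for a task.
    `pilot_scores` maps animation_id -> (rater_a_score, rater_b_score);
    only animations where the two pilot raters agreed exactly are
    eligible, and targets alternate between odd and even score sets
    across tasks to limit prior exposure to test animations."""
    targets = (1, 3, 5, 7, 9) if task_uses_odd else (2, 4, 6, 8, 10)
    chosen = {}
    for anim_id, (a, b) in pilot_scores.items():
        if a == b and a in targets and a not in chosen:
            chosen[a] = anim_id
    return [chosen[t] for t in targets if t in chosen]
```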

For the Day 1 and Day 2 modules, raters scored each animation from 1 to 10 based on its movement competency for the given movement task, with 10 being the best. To decrease the risk of fatigue, raters did not have to complete all modules in one sitting and could progress at their own pace; they were also free to score the movements in whichever order they chose. The Day 1 and Day 2 modules contained identical animations; however, the order in which the animations were presented within each task differed between the two days. To decrease the risk of a learning effect, raters had to wait a minimum of 48 h after completing the Day 1 module of a movement task before starting the Day 2 module of the same task. Raters could replay each movement only three times, at real-time speed, but could zoom, translate, and rotate the vantage point during the movement. The limit on replays was intended to decrease the risk of recall bias, especially since many of the movements were duplicates. If a rater submitted multiple scores for the same animation, only the last score was registered. After completion of the Day 1 and Day 2 modules, the true purposes of the study were disclosed, and the raters signed a post-study consent form confirming their acknowledgment and understanding of the use of deception in the study and their permission to use their data. All participants completed both modules, except for one female rater who completed only Day 1.
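Two of the protocol rules above, last score wins and the 48 h gate between days, lend themselves to a short sketch; the function and key names are assumptions on our part.

```python
from datetime import datetime, timedelta

def latest_scores(records: list[dict]) -> list[dict]:
    """Keep only the most recent score per (rater, animation, module),
    implementing the rule that a resubmitted score overwrites earlier
    ones. Records are assumed to carry 'rater_id', 'animation_id',
    'module', and 'scored_at' (datetime) keys."""
    latest = {}
    for rec in sorted(records, key=lambda r: r["scored_at"]):
        latest[(rec["rater_id"], rec["animation_id"], rec["module"])] = rec
    return list(latest.values())

def day2_unlocked(day1_completed_at: datetime, now: datetime) -> bool:
    """Day 2 of a movement task opens 48 h after its Day 1 finished."""
    return now - day1_completed_at >= timedelta(hours=48)
```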

Data analysis

To test inter- and intra-rater reliability, arithmetic means of weighted Cohen’s kappa were used. For inter-rater reliability, comparisons between each pair of raters were made, and the mean of the 44 weighted Cohen’s kappa values was calculated for each movement task. For intra-rater reliability between sessions, weighted Cohen’s kappa was calculated for each rater (except Rater 3, who completed only Day 1) between the identical movements on Day 1 and Day 2; both the individual and mean kappa values were retained for each movement task. For intra-rater reliability within session without body-shape manipulation, weighted Cohen’s kappa was calculated between the duplicate pairs of the 30 unique movements for each rater on each day, resulting in 19 kappa values (10 raters for Day 1 + 9 raters for Day 2) for each movement task. To investigate intra-rater reliability within session when body-shape was manipulated, weighted Cohen’s kappa was calculated between each pair of weight classes for the 10 unique movements per day, resulting in three kappa values (Overweight-Normal, Overweight-Underweight, Normal-Underweight) per rater per day; the kappa values were then averaged within raters. Weighted Cohen’s kappa values were interpreted as no (≤ 0), slight (0.01–0.20), fair (0.21–0.40), moderate (0.41–0.60), substantial (0.61–0.80), and almost perfect (0.81–1.00) agreement [12].
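As an illustration of the reliability analysis, the snippet below computes a weighted Cohen’s kappa for one rater’s duplicated scores and maps it onto the interpretation bands cited above, assuming scikit-learn. The paper does not state whether linear or quadratic weights were used, so the linear weighting here is an assumption.

```python
from sklearn.metrics import cohen_kappa_score

# Upper bounds of the agreement bands as cited in the paper [12].
BANDS = [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
         (0.80, "substantial"), (1.00, "almost perfect")]

def interpret(kappa: float) -> str:
    """Map a kappa value onto the agreement bands used in the paper."""
    if kappa <= 0:
        return "no agreement"
    for upper, label in BANDS:
        if kappa <= upper:
            return f"{label} agreement"
    return "almost perfect agreement"

# Hypothetical scores for a few movers and their duplicated animations.
first_pass = [3, 5, 7, 2, 8, 6]
second_pass = [3, 6, 7, 2, 7, 6]
kappa = cohen_kappa_score(first_pass, second_pass, weights="linear")
print(f"kappa = {kappa:.2f} ({interpret(kappa)})")
```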


