IMMT data analysis - re repost (easier to read) : Triathlon Forum: Slowtwitch Forums

1. Dataset

An analysis of the 2014 Ironman Mont Tremblant results dataset follows. The data are on slowtwitch; search for "2014 IMMT full dataset." Please also see "2014 IMMT Finish Time Histograms." Table 1 describes the data used for the rest of these analyses. Times are in hours and fractions of hours and show the means for each gender / age group. These data only include finishers.

                        Women                                Men
Age Group    N  Finish   Swim   Bike    Run      N  Finish   Swim   Bike    Run
18-24        7   13.48   1.29   6.88   5.06     40   12.20   1.26   6.23   4.49
25-29       48   12.67   1.28   6.58   4.56     96   12.21   1.28   6.17   4.52
30-34       85   13.37   1.35   6.88   4.86    181   12.34   1.30   6.21   4.57
35-39      102   13.41   1.38   6.89   4.88    265   12.34   1.30   6.22   4.57
40-44       96   13.52   1.41   6.90   4.91    325   12.37   1.31   6.24   4.57
45-49      101   14.11   1.49   7.12   5.15    289   12.72   1.34   6.34   4.76
50-54       89   14.16   1.46   7.09   5.29    218   13.01   1.37   6.45   4.91
55-59       22   14.22   1.48   7.21   5.18     98   13.34   1.38   6.60   5.03
60-64        6   14.24   1.44   7.14   5.35     47   13.75   1.42   6.80   5.18
65-69        0      NA     NA     NA     NA      8   15.17   1.60   7.17   6.03
70-74        0      NA     NA     NA     NA      1   15.97   1.54   6.93   7.20
Pro          6    9.97   1.07   5.44   3.34     10    8.94   0.91   4.85   3.09
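If you want a single overall mean per gender from Table 1, the group means have to be weighted by N. Here is a minimal sketch of that arithmetic. The numbers are copied from the table; the names WOMEN, MEN, total_finishers, and weighted_mean_finish are just my own labels for this example, and groups with N = 0 are left out.

```python
# Rows copied from Table 1: (group, N, mean finish time in hours).
# Groups with N = 0 (no finishers) are omitted.
WOMEN = [
    ("18-24", 7, 13.48), ("25-29", 48, 12.67), ("30-34", 85, 13.37),
    ("35-39", 102, 13.41), ("40-44", 96, 13.52), ("45-49", 101, 14.11),
    ("50-54", 89, 14.16), ("55-59", 22, 14.22), ("60-64", 6, 14.24),
    ("Pro", 6, 9.97),
]
MEN = [
    ("18-24", 40, 12.20), ("25-29", 96, 12.21), ("30-34", 181, 12.34),
    ("35-39", 265, 12.34), ("40-44", 325, 12.37), ("45-49", 289, 12.72),
    ("50-54", 218, 13.01), ("55-59", 98, 13.34), ("60-64", 47, 13.75),
    ("65-69", 8, 15.17), ("70-74", 1, 15.97), ("Pro", 10, 8.94),
]


def total_finishers(rows):
    """Sum the N column."""
    return sum(n for _, n, _ in rows)


def weighted_mean_finish(rows):
    """Mean finish time in hours, weighting each group mean by its N."""
    return sum(n * finish for _, n, finish in rows) / total_finishers(rows)


for label, rows in (("Women", WOMEN), ("Men", MEN)):
    print(f"{label}: N = {total_finishers(rows)}, "
          f"mean finish = {weighted_mean_finish(rows):.2f} h")
```

A plain average of the group means would give the 7 women in 18-24 the same weight as the 102 women in 35-39, which is why the weighting matters.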

2. Correlations

Below are correlations between disciplines and with finish time:

cor(Swim, Bike) = 0.62
cor(Swim, Run) = 0.44
cor(Swim, Finish) = 0.66
cor(Bike, Run) = 0.70
cor(Bike, Finish) = 0.92
cor(Run, Finish) = 0.91
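For anyone who wants to reproduce numbers like these from the full dataset, here is a minimal sketch of the Pearson correlation. The swim and bike values below are made-up placeholders, not rows from the IMMT data, and the function name pearson is my own.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical split times in hours for five athletes (illustration only).
swim = [1.0, 1.2, 1.3, 1.5, 1.1]
bike = [5.5, 6.0, 6.4, 7.0, 6.1]

print(f"cor(Swim, Bike) = {pearson(swim, bike):.2f}")
```

With the real data you would pass the full swim, bike, run, and finish columns (in hours) instead of the placeholder lists.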

Note:

1. Correlations can be misleading if there are outliers or non-linear relationships, but neither of those things is true here.

2. We can't learn things like "If I go 10 minutes slower on the bike, I'll be 15 minutes faster on the run" from these data. These analyses only talk about what people did. Eventually, it'd be nice to build a dataset with multiple races where we can see how someone did in more than one IM. Optimistically, I think we could learn a lot from those data.

3. All the correlations are positive. On average, if someone is faster (slower) than average in one, they'll be faster (slower) than average in another. On average, fast is fast, and slow is slow. To illustrate this with an arbitrary example, the histograms in Figure 1 compare the run times for people who are within 3 minutes of 5:30 on the bike to the run times of people who are within 3 minutes of 6 hours. People who are faster on the bike also tend to be faster on the run. (Figure posted separately.)

4. Figure 2 extends the idea of Figure 1 and shows the estimated median, 25th, and 75th percentiles of run times for each bike time. (Figure posted separately.)

5. Faster than average swimmers tend to be faster than average runners. Who knew?

6. I didn't expect a correlation of 0.66 between the swim time and the finish time.

7. To help interpret that: if you use a simple linear regression model to predict finish time from swim time, you can account for about 44% (0.66 squared) of the variability in finish times.

8. To help interpretation a little more, the next section has a table summarizing simple linear models to predict finish from each discipline. I also include a 95% prediction interval to illustrate how precisely (or imprecisely) each model can predict. For each prediction interval, I use an arbitrary value for the swim, bike, or run time as the input. (Arbitrary = personal interest!) The estimated equations (in hours) are listed after the table. Analogous tables for multiple linear regressions (2 predictors) are in the section after that.
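The "0.66 squared" step in note 7 works the same way for the bike and the run. A quick sketch of the arithmetic, using the three correlations with finish time listed above (the names CORR_WITH_FINISH and pct_explained are only labels for this example):

```python
# Correlations with finish time, copied from the list in section 2.
CORR_WITH_FINISH = {"Swim": 0.66, "Bike": 0.92, "Run": 0.91}


def pct_explained(r):
    """Percent of variability explained by a one-predictor linear model: 100 * r^2."""
    return round(100 * r * r)


for discipline, r in CORR_WITH_FINISH.items():
    print(f"{discipline}: r = {r:.2f}, about {pct_explained(r)}% explained")
```

This only holds for simple (one-predictor) linear regression, where R-squared equals the squared correlation.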

3. One Factor Models

            % Variability   Predicted              95% Prediction Interval
Covariate   Explained       Finish                 Lower   Upper   Range
Swim        44%             11:05 (1 hr swim)       8:35   13:36    5:01
Bike        85%             11:53 (6 hr bike)      10:35   13:11    2:36
Run         83%             11:29 (4 hr run)       10:05   12:52    2:47

In general, you cannot predict performance too precisely from the time of any discipline, swimming especially.
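One rough way to read the Range column: with about 2,000 finishers, a 95% prediction interval is approximately the prediction plus or minus 1.96 residual standard deviations, so the range divided by 3.92 gives a ballpark for how far a typical finish lands from the fitted line. This is a back-of-envelope sketch that assumes roughly normal residuals and a large sample. It is not how the intervals in the table were computed, and the helper names are my own.

```python
Z_95 = 1.96  # normal quantile; assumes a large sample and roughly normal residuals

# Range column from the one-factor table, as h:mm strings.
INTERVAL_RANGE = {"Swim": "5:01", "Bike": "2:36", "Run": "2:47"}


def hmm_to_hours(text):
    """Convert an 'h:mm' string to decimal hours."""
    hours, minutes = text.split(":")
    return int(hours) + int(minutes) / 60


def implied_residual_sd_minutes(range_text):
    """Approximate residual SD in minutes: interval width / (2 * 1.96)."""
    return hmm_to_hours(range_text) / (2 * Z_95) * 60


for covariate, range_text in INTERVAL_RANGE.items():
    sd = implied_residual_sd_minutes(range_text)
    print(f"{covariate}: residual SD of roughly {sd:.0f} minutes")
```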

Estimated equations:

Predictive model from swim:
Expected Finish = 5.950 + 5.139*Swim

Predictive model from bike:
Expected Finish = -0.4926 + 2.0628*Bike

Predictive model from run:
Expected Finish = 4.253 + 1.806*Run
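As a check, plugging the example inputs (1 hr swim, 6 hr bike, 4 hr run) into the three equations above reproduces the Predicted Finish column of the table. A small sketch, with the coefficients copied from the equations and the function names chosen just for this example:

```python
def hours_to_hmm(hours):
    """Format decimal hours as h:mm, rounded to the nearest minute."""
    total_minutes = round(hours * 60)
    h, m = divmod(total_minutes, 60)
    return f"{h}:{m:02d}"


# One-factor models; all times are in hours.
def finish_from_swim(swim):
    return 5.950 + 5.139 * swim


def finish_from_bike(bike):
    return -0.4926 + 2.0628 * bike


def finish_from_run(run):
    return 4.253 + 1.806 * run


print(hours_to_hmm(finish_from_swim(1.0)))  # 11:05
print(hours_to_hmm(finish_from_bike(6.0)))  # 11:53
print(hours_to_hmm(finish_from_run(4.0)))   # 11:29
```

Swap in your own split (in decimal hours, so 1:15 is 1.25) to get your expected finish. Keep the width of the prediction intervals in mind when reading the result.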

4. Two Factor Models

             % Variability   Predicted                          95% Prediction Interval
Covariates   Explained       Finish                             Lower   Upper   Range
Swim, Bike   87%             11:32 (1 hr swim, 6 hr bike)       10:18   12:45    2:27
Swim, Run    91%             10:48 (1 hr swim, 4 hr run)         9:48   11:48    2:00
Bike, Run    98%             11:29 (6 hr bike, 4 hr run)        11:04   11:54    0:50

The performance of these models is not too surprising or impressive. The first two are not very precise. In the last, you know the bike and run times, so only swim time needs to be predicted. This model essentially means that we can predict a swim time with a range of 50 minutes. Huh.

Estimated equations:

Predictive model from swim and bike:
Expected Finish = -4.3959 + 3.9320*Swim + 2.4124*Bike - 0.4141*Swim*Bike

Predictive model from swim and run:
Expected Finish = 1.3149 + 3.2016*Swim + 1.7103*Run - 0.1404*Swim*Run

Predictive model from bike and run:
Expected Finish = -1.61349 + 1.48228*Bike + 1.33416*Run - 0.04727*Bike*Run
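The two-factor equations each include an interaction term (the product of the two times). Plugging in the example inputs from the table reproduces the Predicted Finish column. A small sketch, with coefficients copied from the equations above and function names chosen for this example:

```python
def hours_to_hmm(hours):
    """Format decimal hours as h:mm, rounded to the nearest minute."""
    total_minutes = round(hours * 60)
    h, m = divmod(total_minutes, 60)
    return f"{h}:{m:02d}"


# Two-factor models with an interaction term; all times are in hours.
def finish_from_swim_bike(swim, bike):
    return -4.3959 + 3.9320 * swim + 2.4124 * bike - 0.4141 * swim * bike


def finish_from_swim_run(swim, run):
    return 1.3149 + 3.2016 * swim + 1.7103 * run - 0.1404 * swim * run


def finish_from_bike_run(bike, run):
    return -1.61349 + 1.48228 * bike + 1.33416 * run - 0.04727 * bike * run


print(hours_to_hmm(finish_from_swim_bike(1.0, 6.0)))  # 11:32
print(hours_to_hmm(finish_from_swim_run(1.0, 4.0)))   # 10:48
print(hours_to_hmm(finish_from_bike_run(6.0, 4.0)))   # 11:29
```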

5. Caveats and Notes

1. I also did analyses stratified by gender and age group. That changed the numbers, but it didn't seem to change the patterns, so I only show overall analyses.

2. These are only pairwise analyses between disciplines and/or finish. The same people who contribute a lot to one pairwise relationship (say swim vs finish) do not necessarily contribute a lot to another (say swim vs run). Maybe I'll do multivariate analyses later.

3. Residual analyses suggest that the linear models fit decently except for very fast or very slow people. The linear models tend to be biased slightly high (prediction is slower than reality) for very fast or very slow people.

4. What did I learn?

(a) The not too surprising stuff wasn't backed up by data before the analysis. It's still only one race...

(b) Speed in one discipline associates strongly with speed in another.

(c) I am a relatively fast swimmer, but the rest of my race is significantly slower than its predicted value based on my swim (i.e. it sucks). There goes one excuse.

(d) Figure 2 shows what good (and bad) runs are for particular bike times.

(e) Figure 2 also illustrates limitations to these data:

Imaginary situation: An individual does the same course under the same conditions a bunch of times. Each time he changes the ride effort. If I plot run time as a function of ride time, I hypothesize that I'd get a line that slopes downward (i.e. longer bike time = shorter run time) within the range of the data. That could be useful to know.

Reality: Data from 1 race for a bunch of people do not show this. Shorter bike times associate with shorter run times.
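The imaginary situation and the reality can both be true at once. Here is a toy simulation to illustrate that, with every number made up: each athlete has an overall "slowness" that pushes bike and run times in the same direction, plus a race-day pacing choice that trades bike time for run time with a hypothetical slope of -1.5. None of this is fitted to the IMMT data.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


TRADEOFF = -1.5  # hypothetical: each extra hour on the bike saves 1.5 h on the run
rng = random.Random(42)

# One race, many athletes: each races once with their own slowness and pacing.
bikes, runs = [], []
for _ in range(2000):
    slowness = rng.gauss(0, 1)    # shared across bike and run
    pacing = rng.gauss(0, 0.1)    # hours added to the bike by riding easier
    bikes.append(6.3 + 0.6 * slowness + pacing)
    runs.append(4.7 + 0.6 * slowness + TRADEOFF * pacing + rng.gauss(0, 0.1))
cross_sectional_r = pearson(bikes, runs)

# One athlete, same course, repeated with different ride efforts.
pacings = [-0.2, -0.1, 0.0, 0.1, 0.2]
one_bike = [6.3 + p for p in pacings]
one_run = [4.7 + TRADEOFF * p for p in pacings]
within_slope = ols_slope(one_bike, one_run)

print(f"across athletes: cor(Bike, Run) = {cross_sectional_r:.2f}")
print(f"within one athlete: slope of run on bike = {within_slope:.2f}")
```

In this toy setup the across-athlete correlation comes out strongly positive even though every simulated athlete's own trade-off slopes downward, which is why data from one race cannot answer the pacing question.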

datadatadata wrote:

Imaginary situation: An individual does the same course under the same conditions a bunch of times. Each time he changes the ride effort. If I plot run time as a function of ride time, I hypothesize that I'd get a line that slopes downward (i.e. longer bike time = shorter run time) within the range of the data. That could be useful to know.

I'm pretty sure that this kind of data would correlate very closely with an equation of the form a*x^(-b) + c.

On a side note, I've done a few analyses of my duathlons this year. Run-run correlations are interesting to look at. Hilly bike courses lead to lower correlations. It's easy to cook the legs.

____________________________________
Pain is inevitable. Suffering is up to you.
