Christof Schwiening's random thoughts and writings: Predicting marathon performance from training data

Those who train for a marathon following a pre-prepared plan, of which there are many available, should have a reasonable expectation of achieving their goal: a 3:15 marathon plan should get you a 3:15 marathon time if you execute both the plan and the race appropriately. Unfortunately, runners train in the 'real world', where sessions get skipped and targets missed. The effect of failing to execute 'The Plan' precisely is hard to predict. Can missing one day a week really have a measurable effect? This lack of predictability presents a serious problem to many runners: it can lead to injury as they attempt to make up for missed sessions, or to bonking badly in the race because they failed to scale back their speed to match their lack of diligence in training.

There are ways of predicting marathon performance from race data - multiple rules of thumb and scalers exist. They all work to a greater or lesser degree depending upon the race distance used and the training performed. Novice marathon runners, slower runners and those who yo-yo between fitness levels most often have the greatest problem with these race predictors, especially when scaling from 5K to marathon distance. For most people the mechanisms that limit performance at 5K are simply not quite the same as those that limit performance at marathon distance.

Using 'data' to predict marathon pace

The literature is scattered with attempts to predict marathon performance from training and other data - each paper having its own focus. There has yet to be a very large cohort study that uses advanced machine learning algorithms to study the complexity of athletic training - the interaction between different types of training done in succession, macro cycles, sub-km efforts and additional loads (thermal, altitude, ground surface, clothing weight, nutritional status etc). That complexity means the first attempts have to be based on simple 'average metrics'. This is a sensible first approach: once one can identify the major drivers of fitness, one can begin to look for individuals who appear to do better than their prediction. In studying these individuals one can then begin to identify the other elements that drive athletic performance.

To give you a flavour of what is out there in terms of predictions based on average metrics here is a list of a few of them.

Hagan et al. (1981) looked at male runners and found that marathon finishing time was predicted using the equation:
Race time (min) = 525.9 + 7.09 * (distance per workout in km) - 0.45 * (workout speed in meters per min) - 0.17 * (distance run in 9 weeks in km) - 2.01 * (VO₂max in ml of O₂ per kg body weight per min) - 1.24 * (age in years)

Whilst the equation may well work (and it doesn't for me - see later for a hint as to why), it contains a few problems that make it less useful than others. First, it contains the parameter VO₂max - the maximal rate of oxygen uptake. Most runners will not know what their VO₂max is at a given time. Second, because the age term is negative, the implication is that you get faster with age. Whilst older marathon runners may well, on average, run a faster marathon than younger ones, it is likely that beyond the age of about 30 performance declines. Ageing may well be associated with better training and race execution, but these are not necessarily directly associated with age - i.e. a dedicated, careful and informed young runner would not fit the equation.
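To make the age point concrete, here is a minimal Python sketch of the Hagan et al. (1981) equation as quoted above. The function name and the runner's values are hypothetical, chosen only for illustration; they are not taken from the paper.

```python
def hagan_1981_minutes(km_per_workout, speed_m_per_min, km_in_9_weeks,
                       vo2max_ml_per_kg_min, age_years):
    """Predicted marathon time (min), Hagan et al. (1981) as quoted in this post."""
    return (525.9
            + 7.09 * km_per_workout
            - 0.45 * speed_m_per_min
            - 0.17 * km_in_9_weeks
            - 2.01 * vo2max_ml_per_kg_min
            - 1.24 * age_years)


# Hypothetical runner: every value here is an illustrative assumption.
runner = dict(km_per_workout=12, speed_m_per_min=200, km_in_9_weeks=600,
              vo2max_ml_per_kg_min=55)

at_30 = hagan_1981_minutes(age_years=30, **runner)
at_40 = hagan_1981_minutes(age_years=40, **runner)

print(f"Age 30: {at_30:.1f} min")
print(f"Age 40: {at_40:.1f} min")
print(f"Change over 10 years: {at_40 - at_30:+.1f} min")
```

With everything else held fixed, the equation takes 1.24 minutes off the prediction for every extra year, so the same runner is predicted to be 12.4 minutes faster at 40 than at 30.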

Hagan et al. (1987) looked at female runners (around 4 hour finishing times) and found the following equation predicts the finishing time:
Race time (min) = 449.88 - 7.61 * (distance run per day in km) - 10.5 * (speed in km per hour)

Schmid et al. (2012) looked at recreational female runners finishing in just over 4 hours and stated that the marathon performance could be estimated (although not terribly well) from the equation:
Race time (min) = 184.4 + 5 * (calf circumference in cm) - 11.9 * (training speed in km per hour)

Barandun et al. (2012) looked at male runners and suggested that marathon finishing time could be 'estimated to some extent' by the formula:
Race time (min) = 326.3 + 2.394 * (body fat percentage) - 12.06 * (training speed in km per hour)
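The Schmid et al. (2012) and Barandun et al. (2012) equations above each pair one body measurement with training speed, and their speed coefficients are almost identical. The sketch below, in Python, shows how the two respond as training speed changes. The function names and the example measurements (a 36 cm calf, 17% body fat) are hypothetical values picked for illustration only.

```python
def schmid_2012_minutes(calf_circumference_cm, training_speed_kph):
    """Predicted marathon time (min), Schmid et al. (2012) as quoted in this post."""
    return 184.4 + 5.0 * calf_circumference_cm - 11.9 * training_speed_kph


def barandun_2012_minutes(body_fat_percent, training_speed_kph):
    """Predicted marathon time (min), Barandun et al. (2012) as quoted in this post."""
    return 326.3 + 2.394 * body_fat_percent - 12.06 * training_speed_kph


# Illustrative assumptions, not data from either paper.
CALF_CM = 36
BODY_FAT_PERCENT = 17

for speed_kph in (8, 10, 12):
    schmid = schmid_2012_minutes(CALF_CM, speed_kph)
    barandun = barandun_2012_minutes(BODY_FAT_PERCENT, speed_kph)
    print(f"{speed_kph:>2} kph  Schmid: {schmid:6.1f} min  Barandun: {barandun:6.1f} min")
```

In both equations each extra 1 kph of training speed is worth about 12 minutes, so for a given runner the two differ mainly through the body measurement term.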

Tanda (2011) looked at mostly male runners and found that gross descriptive data for an 8 week block of training, ending one week before the marathon, could predict marathon performance relatively well compared to the equations above. The equation (although published in a slightly different arrangement) is:
Race time (min) = 12 + 98.5 * exp(-(km per week) / 189) + 1390 / (average speed in km per hour)
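Unlike the linear equations, the Tanda (2011) form has an exponential distance term, so extra weekly distance gives diminishing returns. A minimal Python sketch, using the arrangement quoted above rather than the published one, makes this visible. The function name and the choice of 11 kph as a fixed training speed are assumptions for illustration.

```python
import math


def tanda_2011_minutes(km_per_week, avg_speed_kph):
    """Predicted marathon time (min), Tanda (2011) in the arrangement quoted in this post."""
    return 12 + 98.5 * math.exp(-km_per_week / 189) + 1390 / avg_speed_kph


# Hold training speed fixed (illustrative value) and vary weekly distance.
SPEED_KPH = 11
previous = None
for km_per_week in (40, 80, 120, 160):
    minutes = tanda_2011_minutes(km_per_week, SPEED_KPH)
    gain = "" if previous is None else f"  ({previous - minutes:.1f} min faster)"
    print(f"{km_per_week:>3} km/week: {minutes:6.1f} min{gain}")
    previous = minutes
```

Each additional 40 km per week buys a smaller improvement than the last, while the speed term, 1390 divided by average speed, is unaffected by distance.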

The value of such equations depends on their applicability to other datasets, including those that extrapolate beyond the original dataset, and on the extent to which they are 'over-fitted' to the subject group on which they were based. The two easiest to compare in this regard are Hagan et al. 1987 and Tanda 2011. To test this we can calculate the predicted finishing time for a mild extrapolation. Take an average daily distance of 30 km (210 km per week if run every day) at an average pace of 5 mins per km (12 kph). Hagan 1987 predicts a 1 hour 35 min marathon time - which is obviously ridiculous since it is about 30 mins faster than the World Record. Tanda 2011, evaluated as written above, predicts about 2 hours 40 mins, which is at least possible. If we look at higher pace running, 4 mins per km (15 kph) for 30 km per day (something that an elite runner might do), Hagan predicts a finish in about 1 hour 5 mins and Tanda about 2 hours 17 mins. At the other end of the performance scale is someone who has not trained for a marathon and perhaps covers 5 km per day by walking at 12 mins per km (5 kph). In this case both equations predict a marathon finish in around 6 hours.
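The extrapolation test described above can be reproduced with a short Python sketch. It uses the Hagan et al. (1987) and Tanda (2011) equations as quoted in this post and assumes, since Tanda needs a weekly figure, that weekly distance is seven times the daily distance. The function names are my own labels, and the printed times are rounded to the nearest minute.

```python
import math


def hagan_1987_minutes(km_per_day, speed_kph):
    """Predicted marathon time (min), Hagan et al. (1987) as quoted in this post."""
    return 449.88 - 7.61 * km_per_day - 10.5 * speed_kph


def tanda_2011_minutes(km_per_week, speed_kph):
    """Predicted marathon time (min), Tanda (2011) as quoted in this post."""
    return 12 + 98.5 * math.exp(-km_per_week / 189) + 1390 / speed_kph


def as_h_mm(minutes):
    """Format a time in minutes as h:mm, rounded to the nearest minute."""
    hours, mins = divmod(round(minutes), 60)
    return f"{hours}:{mins:02d}"


# (label, km per day, speed in kph) for the three cases discussed in the text.
scenarios = [
    ("30 km/day at 12 kph", 30, 12),
    ("30 km/day at 15 kph", 30, 15),
    ("5 km/day at 5 kph", 5, 5),
]

for label, km_per_day, speed_kph in scenarios:
    hagan = hagan_1987_minutes(km_per_day, speed_kph)
    # Assumption: training every day, so weekly distance = 7 * daily distance.
    tanda = tanda_2011_minutes(7 * km_per_day, speed_kph)
    print(f"{label:<20}  Hagan 1987: {as_h_mm(hagan)}  Tanda 2011: {as_h_mm(tanda)}")
```

Because Hagan 1987 is linear in both variables, its prediction keeps falling without limit as distance and speed rise, whereas the Tanda form can never drop below 12 minutes plus the speed term.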

Equations fitted to physiological data need to be treated with care. Some are very good at describing the dataset to which they are fitted, but lack meaningful variables and fail dramatically on extrapolation. Others may appear less good but contain variables with some fundamental relevance to the physiology, such that they fail gracefully outside their fitted range. The Tanda (2011) equation could, potentially, be a bit better in this regard - however, it passes the 'extrapolation' test in that it does not produce impossible values at the extreme ends of pace and distance. Indeed, its failure at the edges seems very graceful, lending some confidence that it might be of more general use.

In the next post I will consider the Tanda (2011) formula in more detail.

Next: Tanda (2011)

Wednesday, 20 January 2016
