Exercise in Implementing a MARS Model in R

This post is based on an assignment for a CART course I took. It was meant as a brief exercise in implementing a basic additive multivariate adaptive regression splines (MARS) model in R. I haven’t really found occasion to use MARS outside the classroom, but when I do (and my understanding of it improves), I’ll update this post.


This exercise uses multivariate adaptive regression splines (MARS) to develop an additive model for estimating body fat in men. The body fat data come from the “fat” dataset in the R package “faraway”. The R package “mda” was used as an open-source implementation of MARS. The outcome variable is “siri”, a gold-standard body fat measurement obtained with the Siri underwater technique. Fourteen feature variables were considered during model development: age, weight, height, adiposity index, and the circumference measurements of various limbs and body areas such as the abdomen, biceps, and hip.

1.  Setting Up the Model

We first load the required libraries, and perform some basic data management.
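A minimal sketch of this setup (the original code is not shown in the post, so the object names x and y and the exact column selection are my reconstruction):

    library(faraway)   # provides the "fat" dataset
    library(mda)       # provides mars()

    data(fat)

    # Outcome: Siri body fat percentage.
    y <- fat$siri

    # The fourteen candidate predictors described above.
    x <- as.matrix(fat[, c("age", "weight", "height", "adipos", "neck",
                           "chest", "abdom", "hip", "thigh", "knee",
                           "ankle", "biceps", "forearm", "wrist")])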

We use the default parameters for MARS. Specifically, an additive model is desired, which calls for degree to be one. We also want pruning enabled; thus, prune = TRUE. (Both are the package defaults.)
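A sketch of the fit itself (object names are mine):

    fit <- mars(x, y, degree = 1, prune = TRUE)

    fit$gcv                      # generalized cross-validation error
    sum(fit$residuals^2)         # total sum of squared residuals (SSR)
    length(fit$selected.terms)   # number of basis functions retained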

With the default parameters, MARS generates a model with thirteen basis functions, a generalized cross-validation (GCV) error of 17.62105, and a total sum of squared residuals (SSR) of 3603.156.

These basis functions can be tabulated.
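One way to do this from the mda fit object, a sketch that assumes the additive (degree-one) setting above so each term involves exactly one variable; the cuts and factor components hold the knot locations and hinge directions:

    terms <- fit$selected.terms[-1]                     # drop the intercept
    vars  <- apply(fit$factor[terms, ] != 0, 1, which)  # one variable per term
    data.frame(variable  = colnames(x)[vars],
               direction = fit$factor[cbind(terms, vars)],  # +1 or -1 hinge
               knot      = fit$cuts[cbind(terms, vars)],
               coef      = fit$coefficients[-1])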

This gives us the model: the intercept plus the tabulated hinge terms, each of the form coef × max(0, ±(x − knot)).

These basis functions can also be plotted, individually.
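A sketch of one way to draw these per-term plots from the fit object (each panel shows a basis function’s contribution, coefficient times hinge, against its underlying variable):

    par(mfrow = c(4, 4), mar = c(4, 4, 1, 1))
    for (j in seq_along(fit$selected.terms)[-1]) {   # skip the intercept
      term <- fit$selected.terms[j]
      v    <- which(fit$factor[term, ] != 0)         # the term's variable
      ord  <- order(x[, v])
      plot(x[ord, v], fit$coefficients[j] * fit$x[ord, j], type = "l",
           xlab = colnames(x)[v], ylab = "contribution")
    }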

[Figure: plots of each basis function against its variable (mars_each_var_graph)]

These plots make it easier to interpret how each variable affects body fat percentage. For example, we note that this model considers weight, height, and the circumference of the chest, knee, ankle, and forearm to contribute little to reducing model error. On the other hand, the plots show that adiposity index and the circumference of the neck, abdomen, hip, ankle, biceps, and wrist have some predictive ability. Of these variables, adiposity index and the circumference of the abdomen, hip, and biceps have the strongest effect (largest absolute coefficients) on predicted body fat.

As an aside, adiposity index is a function of height and weight, specifically adipos = weight/height². Removing adipos from the model introduces height (but not weight) as a variable in a basis function in the model (not shown). This is expected behavior, particularly where one variable is a function of another.
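A quick sanity check of that relationship (assuming, per the faraway documentation, weight in pounds and height in inches, so the usual conversion factor of 703 gives kg/m²):

    # adipos should track 703 * weight (lb) / height (in)^2
    head(cbind(fat$adipos, 703 * fat$weight / fat$height^2))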

2.  Model Diagnostics

Normality and homoscedasticity can be quickly checked with a QQ plot and a residual plot.
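A sketch of these checks, using the residuals and fitted values stored on the fit object:

    par(mfrow = c(1, 2))
    qqnorm(fit$residuals)
    qqline(fit$residuals)
    plot(fit$fitted.values, fit$residuals,
         xlab = "fitted values", ylab = "residuals")
    abline(h = 0, lty = 2)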

[Figure: QQ plot and residuals-versus-fitted plot (mars_diagnostics)]

Since the QQ plot appears more or less straight (through the y = x line), it does not provide any evidence against normality, though a few points at the extremes exhibit non-normal behavior. The residual plot shows no apparent evidence against homoscedasticity.

3.  The ‘nk’ Parameter

Similar to previous exercises, I thought it would be interesting to try many different values for some of MARS’s parameters. The parameter of interest here is ‘nk’, which controls the maximum number of model terms.
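A sketch of the experiment (nk_grid and results are my names):

    nk_grid <- 3:50
    results <- t(sapply(nk_grid, function(k) {
      m <- mars(x, y, degree = 1, nk = k, prune = TRUE)
      c(gcv = m$gcv, ssr = sum(m$residuals^2))
    }))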

With nk = 11 and nk = 12, GCV increases to 18.56944 and SSR to 4215.268. With nk = 8, GCV increases to 18.90747 and SSR to 4430.425. Lower GCV and SSR indicate better fit (keeping in mind that SSR does not account for model complexity). For comparison, the MARS model without a maximum term limit specified produced a model with 13 terms, a GCV of 17.62105, and an SSR of 3603.156. So it appears that MARS, left to its defaults, selects a good number of model terms on its own.
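The curves in the figure below can be drawn from the results matrix above (a sketch):

    par(mfrow = c(1, 2))
    plot(nk_grid, results[, "gcv"], type = "b", xlab = "nk", ylab = "GCV")
    plot(nk_grid, results[, "ssr"], type = "b", xlab = "nk", ylab = "SSR")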

[Figure: GCV and SSR versus nk (mars_gcv.and.ssr.vs.nk)]

Overall, error at low ‘nk’ is very high, as expected, since the model can include only a few terms. The reduction in error from the first few additional terms is very large before levelling off. This is also expected: after enough terms, additional terms would not be expected to explain much more of the error.

In these plots, there is no apparent “elbow” to determine the best ‘nk’. I am not sure whether an elbow should be expected, however.

Certainly it is somewhat silly to try large ‘nk’ (such as 50), given that there are only 14 predictor variables! However, there is some fluctuation in error between nk = 30 and nk = 50. I am not sure what causes this; perhaps it is random fluctuation.

Finally, my initial understanding was that ‘nk’ limits the number of variables in the model. But it turns out that each variable can contribute two ‘terms’ (one each for the positive and negative sections of a hinge). This explains the pattern of duplicate errors for every two ‘nk’. It also explains why, though our final model has 13 terms, its error is equivalent to setting ‘nk’ to around 26: the variables in our model were each considered twice, but some of their positive or negative hinge sections were pruned later.
