Introduction

High blood low-density lipoprotein (LDL) cholesterol level has been associated with a greater risk of heart disease (Panlasigui et al., 2003), which is a possibly fatal condition with high yearly mortality rates. According to (Wenger et al., 2010), almost one million Americans experience an heart attack caused by the disease annually. Consequently, analysis of risk factors for high blood LDL cholesterol is a major public health concern. In this study, we will explore a data set from the Sysdimet study (Lankinen et al., 2011) with various nutritional and medical measurements and the associated blood LDL cholesterol levels and attempt to find correlations between certain nutrient intake and the LDL cholesterol level. These correlations can then be utilized by nutritional experts for giving advice to patients suffering from high blood LDL cholesterol. However, this study should be interpreted as a preliminary study, and further research should be conducted on promising effects suggested by this study before using the results for nutritional guidance.

The choice of model in this study is a generalized linear model (GLM) with normal likelihood and identity link function. However, given the exploratory nature of this study, several sets of predictors are experimented with and their relative predictive performance is evaluated and compared. Furthermore, we make a division of the data into two groups: the ones receiving cholesterol mediation (Med group) and the ones without medication (Nomed group). This is because medication likely has strong effect on LDL-levels that might affect other predictors and outweight their relative effect, so it is controlled for instead. Four sets of predictors are tested for each group, two of which are so called reference models and the rest are hypotheses based on initial correlation analysis.

Following the best practices of Bayesian data analysis, several diagnostics checks are performed to detect misaligned models or sampling problems. Furthermore, posterior predictive checking and prior sensitivity analysis are conducted to assess the model sensibility and robustness. This work is inspired by a StanCon talk by (Turkia, 2018), which explores a similar more general problem of finding nutritional effects, even on a personal level. However, here we focus explicitly on LDL cholesterol as the target variable and use a simpler model for the analysis task. We only focus on population level effects in this study. Hierarchical models, which take into account personal variability within patients, are out of the scope of this work. However, they present a splendid research direction for further analysis of the problem.

Description of the Sysdimet data

The data set used in this study is from 131 volunteers from Kuopio area, Finland, who participated in a 12-week parallel controlled dietary intervention study (Lankinen et al., 2011). However, our copy of the data set only includes measurements for 106 participants, possibly due to discontinued participation for some participants or missing data. Each participant took 4 measurements during the study yielding 424 observations (data rows) in total. Despite there being multiple measurements for one participant, this is not explicitly modeled for in this study. Only the pooled observations are considered to keep the model design simple and sampling times reasonable for rapid exploration of the data. More sophisticated models can be built for studying promising effects shown by this preliminary study.

In addition to target variable blood LDL cholesterol, several other variables are present in the data. Most of them are nutrients, but some are other blood measurements as well. Only continuous variables are modeled for, expect the split into two groups is done based on a boolean variable cholesterol_medication. The following code will load this data and perform some renaming of the columns into English and then plot the histogram of the target variable.

cols_to_eng = list(
  sukupuoli = "sex",
  ryhma = "group",
  pituus = "height",
  kolestrolilaakitys = "cholesterol_medication",
  vyotaro = "waist",
  paino = "weight",
  sakkar = "sucrose",
  prot = "protein",
  kuitu = "fibre",
  sellul = "cellulose",
  hhydr = "carbohydrate",
  energia = "energy",
  rasva = "fat",
  linoli = "linoleic_acid",
  linoleeni = "alpha_lipoic_acid",
  ligniini = "lignin",
  alkoholi = "alcohol",
  kol = "cholesterol",
  foolih = "folic_acid",
  fsins = "blood_insulin",
  fskol = "blood_cholesterol",
  fpgluk = "blood_glucose",
  fshdl = "blood_hdl_chol",
  fsldl = "blood_ldl_chol"
)

# Read the data description
datadesc <- read.csv(file=root("data/data_desc.csv"), header = TRUE, sep = ";")
# Read the actual data matching the description
sysdimet <- read.csv(file=root("data/SYSDIMET_diet.csv"), sep=";", dec=",")

# Rename column names to English
for (name in names(cols_to_eng)) {
  names(sysdimet)[names(sysdimet) == name] <- cols_to_eng[name]
}

# change subject ID to numbers (compatibility with brms)
sysdimet$SUBJECT_ID = as.numeric(gsub("S", "", sysdimet$SUBJECT_ID))

ggplot(sysdimet, aes(x=blood_ldl_chol)) + 
  geom_histogram(color="blue", fill="lightblue") + 
  xlab("Blood LDL cholesterol (mmol/l)") + ylab("Participants")

The target variable distribution resembles a normal distribution so the choice of normal likelihood in our linear model is well justified.

Division into groups and data preprocessing

Next we will split the data into Med and Nomed groups and scale the variables to same order of magnitude to improve model stability.

sysdimet_nomed = data.frame(sysdimet[sysdimet$cholesterol_medication == 0,])
sysdimet_med = data.frame(sysdimet[sysdimet$cholesterol_medication == 1,])

sysdimet_nomed = scale(sysdimet_nomed[,5:35])
sysdimet_med = scale(sysdimet_med[,5:35])

# display predictor names
preds =
  setdiff(colnames(sysdimet_med), c("cholesterol_medication", "blood_ldl_chol"))
preds

##  [1] "height"            "rrdias"            "rrsyst"           
##  [4] "waist"             "weight"            "mufa"             
##  [7] "pufa"              "safa"              "sucrose"          
## [10] "protein"           "fibre"             "carbohydrate"     
## [13] "energy"            "fat"               "lignin"           
## [16] "linoleic_acid"     "alpha_lipoic_acid" "cholesterol"      
## [19] "cellulose"         "alcohol"           "dvit"             
## [22] "folic_acid"        "cvit"              "epa"              
## [25] "dha"               "blood_hdl_chol"    "blood_insulin"    
## [28] "blood_cholesterol" "blood_glucose"

cat("\nNum predictors =", length(preds), "\n")

## 
## Num predictors = 29

Here is a list of all predictors used in this study. Given this much information the modeling task should be feasible, which we will see shortly.

Now we are ready to do some analysis on the data.

Methods

In the following we will go though all relevant steps of Bayesian data analysis work flow. We shall begin with forming our hypotheses about the modeling problem and then proceed to test those.

Selecting model candidates

As there are more than 500 million different ways to select an (unordered) subset of 29 predictors \((2^{29})\), we will only focus on a select few of them. First we formulate so called null hypothesis model, which contains no nutrient predictors. This means if it yields the best performance in model selection process, we have some evidence to believe nutrition does not significantly affect blood LDL cholesterol levels. The second extreme model is to use all the predictors, which uses all the available information to make predictions about blood LDL cholesterol and should be likely the most accurate of the models. However there is likely a lot of redundant information in the set of all predictors and lot of overhead comes from making measurements for all of them, so two predictor sets with significantly reduced number of variables are also tested.

The selection criteria used for these two hypothesis sets was to maximize correlation with LDL cholesterol and minimize the correlation with other predictors by looking at correlation matrices computed from the data. High collinearity within predictors is known to hinder predictive performance of linear models and makes the interpretation of individual effect coefficients difficult, which is the motivation behind the second selection criterion. The following code will define all the models and other helper data structures.

# define all predictor sets (i.e. models)
null_pred = c("height")

hypothesis_pred1 = c("weight", "sucrose",
                     "fibre", "alcohol", "rrdias")

hypothesis_pred2 = c("waist", "safa",
                     "lignin", "dvit", "blood_insulin")

complete_pred = c("weight", "height", "alcohol", "sucrose",
                  "waist", "safa", "pufa", "mufa", "protein", 
                  "fibre", "carbohydrate", "energy", "cholesterol",
                  "fat", "cellulose", "dvit", "folic_acid", "linoleic_acid",
                  "alpha_lipoic_acid", "cvit", "epa", "dha", "rrdias", "rrsyst",
                  "blood_insulin", "blood_glucose", "blood_hdl_chol", "lignin")


# define other relevant model information
target = "blood_ldl_chol"
groups = c("med", "nomed")
models = c("null", "hypothesis1", "hypothesis2", "complete")

predictors = hash(
  "null" = null_pred,
  "hypothesis1" = hypothesis_pred1,
  "hypothesis2" = hypothesis_pred2,
  "complete" = complete_pred
)

data_splits = hash(
  "med" = sysdimet_med,
  "nomed" = sysdimet_nomed
)

p0s = hash(
  "null" = 0,
  "hypothesis1" = 3,
  "hypothesis2" = 3,
  "complete" = 5
)

pred_to_modelname = hash(
  "null" = "Null hypothesis Model",
  "hypothesis1" = "Hypothesis 1 Model",
  "hypothesis2" = "Hypothesis 2 Model",
  "complete" = "Complete Model"
)

Correlations between predictors and the target variable

As discussed earlier, it is useful to glance at the correlations between variables to get an idea how to select suitable predictor sets. First let us plot the correlation matrices for both groups with all the predictors included.

par(mfrow=c(1,2))
for (group in groups) {
  for (model in models) {
    if (model %in% c("complete")) {
      fit_key = paste(group, model, sep = "_")
      data = data_splits[[group]][, c(target, predictors[[model]])]
      corrplot(cor(data), tl.pos='n')
    }
  }
}

In the above figure the full data correlations for the Med group (left) and the Nomed group (left) are depicted (variable names omitted for brevity). It is clearly visible that Med group has more collinearity between its variables. The first row of the correlation matrix corresponds to the target variable, so it is beneficial to look at predictors which clearly correlate with the target variable (have bright colors in the correlation plot) as they are more likely to contribute to the predictive performance of the model. It is evident from the plot there is significant amount of collinearity between predictors, which makes interpretation of the modeled effects difficult to interpret as the perceived effect might be aggregated from other correlating variables. Thus it is desirable to select a variable set that only has bright colors on the first row/column and the diagonal of the correlation matrix.

Two hypothesis models for both groups

Now let us analyze the correlations between variables in our hypothesis predictor sets and the target variable.

par(oma=c(0,0,2,0), mfrow=c(2,2))
for (group in groups) {
  for (model in models) {
    if (!(model %in% c("null", "complete"))) {
      fit_key = paste(group, model, sep = "_")
      data = data_splits[[group]][, c(target, predictors[[model]])]
      corrplot(cor(data), mar = c(0,0,2,0), cl.pos = "n",
               title = paste(tools::toTitleCase(group),
                             pred_to_modelname[[model]]))
    }
  }
}

As evident from the shallow colors away from the diagonals, there is no significant collinearity between predictors anymore, but still some correlation with the target variable. We interpret this as a good initial indicator of model performance with this variable selection and proceed to test them in practise. This is justified heuristic as correlation plot values can be thought as fitting separate linear model for each of the variables and looking at the estimated slope parameter, which indicates the effect size of that single variable. However, given the collinearities and limited amount of data, separate model maximum likelihood estimation is not enough and we proceed to fit a full GLM with proper prior information included.

Prior selection

In Bayesian probalistic modeling we typically include some broad prior information about the data generating process into the model. This is beneficial because such a model is less likely to overfit by ruling out some impossible values (like the effect of some nutrient being in the order of one million for example).

For the latent parameters intercept \(\beta_0\) and the residual standard deviation \(\sigma\) we use Student-T distribution with the assumption both parameters are relatively close to zero, as the data has been centered to zero mean and scaled. The Student-T distribution was chosen over equivalent normal distribution as it is more robust to outliers in regression. For the coefficients of the linear model we use regularized horseshoe prior by (Piironen and Vehtari, 2017). This prior encodes our belief that there is significant amount of irrelevant coefficients with minimal effect, and they should be pushed towards zero with the help of this prior. This is useful because there is not enough data to support this hypothesis by itself so the use of informative priors is encouraged. We define the regularized horseshoe prior slab scale parameter as done in (Turkia, 2018) and global scale parameter is defined by our prior guess for the number of non-zero coefficients \(p_0\) as recommended by (Piironen and Vehtari, 2017). The following code defines a list of priors for the brms R package (Bürkner, 2018) used for model fitting.

define_priors <- function(selected_predictors, model, p0 = 3, 
                          mu_offset = 0, sigma_offset = 0) {
  D = length(selected_predictors)
  # guess for number of non-zero coefficients
  p0 = min(D, max(1, p0))
  par_ratio = p0/(D-p0) # used to form tau0 (divided by sqrt(N) internally)
  
  rhs_prior = horseshoe(par_ratio = 1234, 
                        df_slab = 4,  # dof for the half-t priors for lambdas
                        scale_slab = 1, 
                        autoscale=TRUE) # scale tau0 by sigma
  
  rhs_prior = gsub("1234", par_ratio, rhs_prior)
  
  sigma_prior = paste("student_t(7,", 1+mu_offset, 
                      ",", 1+sigma_offset, ")", sep="")
  
  icept_prior = paste("student_t(7,", mu_offset, 
                      ",", 1+sigma_offset, ")", sep="")
  
  if (model == "null") {
    priors = c(set_prior("normal(0, 1)", class = "b"),
               set_prior(sigma_prior,
                          class = "sigma"),
               set_prior(icept_prior,
                          class = "Intercept"))
  } else {
    priors = c(set_prior(rhs_prior, class = "b"),
               set_prior(sigma_prior,
                          class = "sigma"),
               set_prior(icept_prior,
                          class = "Intercept"))
  }
  priors
}

Gaussian linear model definition

A generalized linear model for target variable \(y\) is defined by the following equation

\[ \begin{aligned} {\displaystyle \operatorname {E} (y |\mathbf {X} )={\mu}=g^{-1}(\mathbf {X} {\boldsymbol {\beta }})} \end{aligned} \] where \(g\) is the link function, \(\mathbf {X}\) is a matrix of observations of independent random variables and \(\beta\) is a vector of linear coefficients. Given identity link function for the Gaussian linear model the above equation simplifies to

\[ \begin{aligned} {\displaystyle \operatorname {E} (y |\mathbf {X} )={\mu }=\mathbf {X} {\boldsymbol {\beta }}}. \end{aligned} \] With gaussian linear model we expect \(y\) to be normally distributed with mean \(\mu\). By separating the intercept term \(\beta_0\) and specifying the priors, the full model specification is

\[ \begin{aligned} \operatorname{E} (y |\mathbf {X} ) = {\mu } &\sim N(\mathbf {X} {\boldsymbol {\beta }} + \beta_0, \sigma) \\ \boldsymbol {\beta } &\sim \text{horseshoe}(p_0) \\ \beta_0 &\sim t_7(0, 1) \\ \sigma &\sim t_7(1, 1) \\ \end{aligned} \]

where horseshoe refers to the prior defined in (Piironen and Vehtari, 2017) with a prior guess for the number of non-zero parameters \(p_0\). The only thing varying between the different models compared in this study is data matrix \(\mathbf {X}\) and parameter \(p_0\). We choose different predictors (columns) for each of the models and adjust \(p_0\) accordingly. We do this separately for two sets of observations (rows) from the whole \(\mathbf {X}\).

With the priors specified, it is time to define our statistical model using brms interface to probabilistic programming language framework Stan (Carpenter et al., 2017). The following helper function defines a statistical model for each predictor set and fits it using specific portion of the data and given prior settings. We use four chains with 10000 iterations (4000 warmup) and default initialization settings to run the simulations. Based on the diagnostic values later these settings seem to be sufficient for reliable parameter estimation. We save the fitted model and generated stan code for convenience.

fit_model_with_predictors <- function(predictors, data_group, group, 
                                      model, p0 = 3, priors = NULL) {
  # define linear model with population level effects (pearson correlation test)
  formula_pooled = 
    paste(target, "~", paste(predictors, collapse = "+"))
  
  brms_file_pooled = root(paste("model_fits/brms_fit_",
                                group, "_", model, ".rds", sep=""))
  if (is.null(priors)) {
    priors = define_priors(predictors, model, p0)
  }

  if (!file.exists(brms_file_pooled)) {
    gc()
    fit_pooled = brm(formula_pooled,
                     data = data_group,
                     family = gaussian(),
                     prior = priors,
                     warmup = 4000,
                     seed=SEED, chains=4, iter=10000,
                     refresh = 1000,
                     control = list(max_treedepth = 15, adapt_delta = 0.9999),
                     save_model = gsub("model_fits/", "stan_code/", 
                                       gsub(".rds", ".stan", brms_file_pooled)))
    saveRDS(fit_pooled, brms_file_pooled)
  } else {
    fit_pooled <- readRDS(brms_file_pooled)
  }
  fit_pooled
}

Even though we fit four different models the generated Stan code is the same for all the models (expect for null model that uses slightly different priors), as the data and number of predictors have been parametrized in the resulting code. Here we will briefly review how the Stan code for the Gaussian Linear Model looks like and how it works. We will review the most relevant blocks here and the full source code is available in the Appendix for the interested reader.

transformed data {
  int Kc = K - 1;
  matrix[N, Kc] Xc;  // centered version of X without an intercept
  vector[Kc] means_X;  // column means of X before centering
  for (i in 2:K) {
    means_X[i - 1] = mean(X[, i]);
    Xc[, i - 1] = X[, i] - means_X[i - 1];
  }
}

This block performs preprocessing for the input data X so all columns (variables) have zero mean to make the regression task easier. This would not actually be needed as we already centered the data earlier, but brms does this automatically anyway.

model {
  // priors including all constants
  target += std_normal_lpdf(zb);
  target += student_t_lpdf(hs_local | hs_df, 0, 1)
    - rows(hs_local) * log(0.5);
  target += student_t_lpdf(Intercept | 7, 0, 1);
  target += student_t_lpdf(hs_global | hs_df_global, 0, hs_scale_global * sigma)
    - 1 * log(0.5);
  target += inv_gamma_lpdf(hs_slab | 0.5 * hs_df_slab, 0.5 * hs_df_slab);
  target += student_t_lpdf(sigma | 7, 1, 1)
    - 1 * student_t_lccdf(0 | 7, 1, 1);
  // likelihood including all constants
  if (!prior_only) {
    target += normal_id_glm_lpdf(Y | Xc, Intercept, b, sigma);
  }
}

This block defines the statistical Bayesian model for the linear regression task. As the code was generated by the brms package, it lacks the usual syntactic sugar such as the use of \(\sim\) operator. Instead, it explicitly sums log probability densities to the target variable that acts the response or target variable \(y\), which in this case corresponds to the blood LDL cholesterol level we are trying to model. Majority of the code here is computing prior log probability densities and summing them to the target variable. Summing makes sense because we assume factorized log posterior probability distribution for the target variable here. For the detailed computations used to in regularized horseshoe prior, please refer to (Piironen and Vehtari, 2017). Lastly, the normal log probability density is computed on the data given current parameters to perform the maximum likelihood estimation step in model fitting. The use of log probability distributions is related to numerical stability and is a common trick in statistical computing.

Model fitting

Now we are all set to do the model fitting. The following code will do that for all the model candidates and both data groups.

fits = hash()

for (group in groups) {
  for (model in models) {
    fit = fit_model_with_predictors(predictors[[model]], data_splits[[group]],
                                    group, model, p0s[[model]])
    fits[[paste(group, model, sep = "_")]] = fit
  }
}

Simulation convergence diagnostics checking

Firstly, one must check for common problems happening during MCMC simulations to ensure our further analysis is based on reliable estimates of interest. This can be done in RStan as follows:

for (group in groups) {
  for (model in models) {
    fit_key = paste(group, model, sep = "_")
    cat("\r**", paste(tools::toTitleCase(group),
                  pred_to_modelname[[model]]), "HMC diagnostics **\n")
    check_divergences(fits[[fit_key]]$fit)
    check_treedepth(fits[[fit_key]]$fit)
    check_energy(fits[[fit_key]]$fit)
    diagnostics = as.data.frame(monitor(fits[[fit_key]]$fit, print = FALSE))
    cat("\rMaximum Rhat value     =", max(as.numeric(diagnostics[, "Rhat"])))
    cat("\n\rMinimum Bulk ESS value =", 
        min(as.numeric(diagnostics[, "Bulk_ESS"])))
    cat("\n\rMinimum Tail ESS value =", 
        min(as.numeric(diagnostics[, "Tail_ESS"])), "\n\n")
  }
}

## 
** Med Null hypothesis Model HMC diagnostics **

## 0 of 24000 iterations ended with a divergence.

## 0 of 24000 iterations saturated the maximum tree depth of 15.

## E-BFMI indicated no pathological behavior.

## 
Maximum Rhat value     = 1.00034
## 
Minimum Bulk ESS value = 8952
## 
Minimum Tail ESS value = 12007 
## 
## 
** Med Hypothesis 1 Model HMC diagnostics **

## 0 of 24000 iterations ended with a divergence.

## 0 of 24000 iterations saturated the maximum tree depth of 15.

## E-BFMI indicated no pathological behavior.

## 
Maximum Rhat value     = 1.000433
## 
Minimum Bulk ESS value = 6095
## 
Minimum Tail ESS value = 4617 
## 
## 
** Med Hypothesis 2 Model HMC diagnostics **

## 0 of 24000 iterations ended with a divergence.

## 0 of 24000 iterations saturated the maximum tree depth of 15.

## E-BFMI indicated no pathological behavior.

## 
Maximum Rhat value     = 1.00041
## 
Minimum Bulk ESS value = 5750
## 
Minimum Tail ESS value = 7811 
## 
## 
** Med Complete Model HMC diagnostics **

## 0 of 24000 iterations ended with a divergence.

## 0 of 24000 iterations saturated the maximum tree depth of 15.

## E-BFMI indicated no pathological behavior.

## 
Maximum Rhat value     = 1.000673
## 
Minimum Bulk ESS value = 5385
## 
Minimum Tail ESS value = 10213 
## 
## 
** Nomed Null hypothesis Model HMC diagnostics **

## 0 of 24000 iterations ended with a divergence.

## 0 of 24000 iterations saturated the maximum tree depth of 15.

## E-BFMI indicated no pathological behavior.

## 
Maximum Rhat value     = 1.000511
## 
Minimum Bulk ESS value = 9119
## 
Minimum Tail ESS value = 12674 
## 
## 
** Nomed Hypothesis 1 Model HMC diagnostics **

## 0 of 24000 iterations ended with a divergence.

## 0 of 24000 iterations saturated the maximum tree depth of 15.

## E-BFMI indicated no pathological behavior.

## 
Maximum Rhat value     = 1.000512
## 
Minimum Bulk ESS value = 6168
## 
Minimum Tail ESS value = 8514 
## 
## 
** Nomed Hypothesis 2 Model HMC diagnostics **

## 0 of 24000 iterations ended with a divergence.

## 0 of 24000 iterations saturated the maximum tree depth of 15.

## E-BFMI indicated no pathological behavior.

## 
Maximum Rhat value     = 1.000314
## 
Minimum Bulk ESS value = 5863
## 
Minimum Tail ESS value = 10072 
## 
## 
** Nomed Complete Model HMC diagnostics **

## 0 of 24000 iterations ended with a divergence.

## 0 of 24000 iterations saturated the maximum tree depth of 15.

## E-BFMI indicated no pathological behavior.

## 
Maximum Rhat value     = 1.000659
## 
Minimum Bulk ESS value = 4962
## 
Minimum Tail ESS value = 7975

There are no divergences indicating problems during sampling regions of low density, so we have good reason to believe the simulation estimates are not biased because of sampling problems of the MCMC algorithm. No other Hamiltonian Monte Carlo (HMC) simulation specific problems are reported either.

Furthermore, all \(\widehat{R}\) values are below 1.01 and all effective sample size (ESS) values above 400 as recommended in (Vehtari et al., 2020), so we can conclude the Markov chain simulations have converged properly. Thus, the estimates produced by the simulation should be reliable.

To elaborate, \(\widehat{R}\) is a measure of convergence for multiple Markov chain simulations that measures the within- and between-sequence variances of the Markov chain equilibrium distributions. It summarizes these properties with a single number that declines to 1 as the number of simulation iterations goes to infinity. If the \(\widehat{R}\) measure is not close to 1, there is a reason to believe using more iterations in the simulation may improve our inference about the target distribution of the associated scalar estimand.

Moreover, Effective sample size (ESS) values measure sampling efficiency by approximating how many independent draws are needed from the target posterior distribution to yield the same posterior estimate of interest (e.g. mean) as the dependent draws from the MCMC simulation. According to (Vehtari et al., 2020) the \(\widehat{R}\) diagnostic as an indicator of simulation quality can only be trusted if ESS is high enough. Well, they certainly are in this case.

Nevertheless, all our simulations pass these recommendations with flying colors, so we can proceed to the next phase of the analysis with confidence.

Posterior predictive checking

Convergence diagnostics are not enough to determine whether the model produces sensible results. They merely state that whatever we are simulating likely succeeded, but that does not mean the simulation model itself is sensible. Consequently, more checks are needed, namely the posterior predictive checks (PPC). The following code will iterate through the models and replicate distributions of target variable for each model. Then we can compare the replicated distribution \(y_{\text{rep}}\) to the real data distribution \(y\) and see if the distributions align properly. If this is not the case, there is some reason to believe the model could be misspecified for the regression task.

y_reals = hash()
y_reps = hash()

for (group in groups) {
  for (model in models) {
    fit_key = paste(group, model, sep = "_")
    data_split = as.data.frame(data_splits[[group]])
    y <- data_split$blood_ldl_chol
    y_reals[[fit_key]] = y
    n_y <- nrow(data_split)
    y_reps[[fit_key]] <- posterior_predict(fits[[fit_key]], draws = 100)
  }
}

PPC for Med group

Firstly, we perform PPC checks for the Med group with the following plot.

y = y_reals[["med_null"]]
y_rep = y_reps[["med_null"]]

ppcm1 = ppc_dens_overlay(y, y_rep[1:length(y), ]) + 
  xlab("Null Hypothesis Model")

y = y_reals[["med_hypothesis1"]]
y_rep = y_reps[["med_hypothesis1"]]

ppcm2 = ppc_dens_overlay(y, y_rep[1:length(y), ]) + xlab("Hypothesis 1 Model")

y = y_reals[["med_hypothesis2"]]
y_rep = y_reps[["med_hypothesis2"]]

ppcm3 = ppc_dens_overlay(y, y_rep[1:length(y), ]) + xlab("Hypothesis 2 Model")

y = y_reals[["med_complete"]]
y_rep = y_reps[["med_complete"]]

ppcm4 = ppc_dens_overlay(y, y_rep[1:length(y), ]) + xlab("Complete Model")

bayesplot_grid(ppcm1, ppcm2, ppcm3, ppcm4, legends = FALSE, 
               grid_args = list(nrow = 2))

Based on the above plots, it seems the replicated data sets for \(y\) for all models excite no pathological behavior and the models can be deemed sensible.

PPC for Nomed group

Let us check the other group as well.

y = y_reals[["nomed_null"]]
y_rep = y_reps[["nomed_null"]]

ppcn1 = ppc_dens_overlay(y, y_rep[1:length(y), ]) + 
  xlab("Null Hypothesis Model")

y = y_reals[["nomed_hypothesis1"]]
y_rep = y_reps[["nomed_hypothesis1"]]

ppcn2 =ppc_dens_overlay(y, y_rep[1:length(y), ]) +xlab("Hypothesis 1 Model")

y = y_reals[["nomed_hypothesis2"]]
y_rep = y_reps[["nomed_hypothesis2"]]

ppcn3 = ppc_dens_overlay(y, y_rep[1:length(y), ]) + xlab("Hypothesis 2 Model")

y = y_reals[["nomed_complete"]]
y_rep = y_reps[["nomed_complete"]]

ppcn4 = ppc_dens_overlay(y, y_rep[1:length(y), ]) + xlab("Complete Model")

bayesplot_grid(ppcn1, ppcn2, ppcn3, ppcn4, legends = FALSE, 
               grid_args = list(nrow = 2))

Same situation with the Nomed group, all posterior predictive distributions seem plausible. It is also worth noticing that there is no visible difference between the models and they seem to have almost equivalent predictive performance.

Model selection with leave-one-out cross validation

Now that we have eight reasonable candidate models, it is time to select the best performing models. Visual posterior predictive checking did not reveal a clear difference between model performances, so a more rigorous performance score is needed. One general option is the expected log (posterior) predictive density (elpd) score computed using leave-one-out cross-validation according to (Vehtari, Gelman and Gabry, 2017). This score is given by the equation

\[ \begin{aligned} \text{elpd}_{\text{loo}} &= \frac{1}{n} \sum_{i=1}^n \log {p(y_i|x_i,D_{-i})}. \end{aligned} \] Here \(n\) denotes number of observations and \(D_{-i}\) is the data set with \(i\)th observation left out. It attempts to approximate how much posterior predictive density there is at the location of a new unseen (\(i\)th) data point sampled from the true underlying data generation mechanism on average. The higher the elpd value the better, as it indicates our predictive model is likely closer to the true data generation mechanism.

To compute these performance metrics for each model we use efficient Pareto-smoothed importance-sampling (PSIS) implementation of the leave-one-out cross-validation by (Vehtari, Gelman and Gabry, 2017) that eliminates the need to fit the model for all possible folds of data, which would be extremely slow. This can be done as follows.

loos = hash()
loo_plots = vector(mode = "list", length = 8)
j = 1
par(mfrow=c(4,2))
for (model in models) {
  for (group in groups) {
    fit_key = paste(group, model, sep = "_")
    loo_file = root(paste("model_fits/loo_", fit_key, ".rds", sep=""))
    if (!file.exists(loo_file)) {
      gc()
      loo_i = loo(fits[[fit_key]], cores = 1, model_names = fit_key)
      saveRDS(loo_i, loo_file)
    } else {
      loo_i <- readRDS(loo_file)
    }
    loos[[fit_key]] = loo_i
    (loo_plots[[j]] = plot(loo_i, main = paste(tools::toTitleCase(group), 
                                         pred_to_modelname[[model]])))
    j = j + 1
  }
}

Before comparing computed \(\text{elpd}_{\text{loo}}\) values, we must assess their reliability as PSIS-LOO is only approximating the equation given above and there can be approximation errors. For this we plotted diagnostics known as \(\hat{k}\)-values for each estimated \(\text{elpd}_{\text{loo}}\) value in the figure above. The important part to notice is that no significant amount of \(\hat{k}\)-values for each model performance estimate is above threshold of 0.7 (indicated by horizontal dashed line), otherwise the \(\text{elpd}_{\text{loo}}\) estimate might be biased (too optimistic) according to (Vehtari, Gelman and Gabry, 2017). Using this criterion most of the performance estimates look reliable, however there is one \(\hat{k}\)-value above 0.7 for \(\text{elpd}_{\text{loo}}\) estimate of Med group Complete Model. This indicates there is a slight concern that it may be biased. In the case where many \(\hat{k}\)-values are above 0.7 one could consider a technique called moment matching by (Paananen et al., 1906), but it would require additional computation. However, considering that possibly biased log posterior density estimate is only one element in a sum of more than 100 values (see equation above), the overall bias is likely negligible and we can safely proceed to compare these \(\text{elpd}_{\text{loo}}\) estimates.

Then let’s compare the computed performance metrics for each data group separately.

Med group model comparison

loo_compare(loos[["med_null"]], loos[["med_hypothesis1"]],
            loos[["med_hypothesis2"]], loos[["med_complete"]])

##                 elpd_diff se_diff
## med_complete      0.0       0.0  
## med_hypothesis2 -13.7       5.0  
## med_null        -15.5       6.1  
## med_hypothesis1 -15.8       6.4

Complete model seems to clearly yield the best score for common predictive performance measure, and the difference to next best model is not purely explained by the estimation standard error induced by limited number of data. However, the variables in complete model are highly correlated so its estimates for the effects can be biased, because the effect of other variables is visible in the marginal of a single coefficient estimate. Therefore, it’s meaningful to consider Hypothesis 2 model as well, which has significantly less collinearities between its variables, so the estimates are more reliable (i.e. likely less biased) even though its predictive performance is not as good as that of the complete model.

Nomed group model comparison

loo_compare(loos[["nomed_null"]], loos[["nomed_hypothesis1"]],
            loos[["nomed_hypothesis2"]], loos[["nomed_complete"]])

##                   elpd_diff se_diff
## nomed_complete     0.0       0.0   
## nomed_hypothesis1 -2.8       2.6   
## nomed_null        -8.8       3.6   
## nomed_hypothesis2 -9.1       3.6

Again complete model is the winner, not too surprisingly, but this time the difference to next best model is not too significant. Looking at the standard error for the difference it could possibly be explained by chance as well. Given only slight reduction in predictive performance, Hypothesis 1 model is worth further evaluation as well due to significantly reduced model complexity.

Results

In both groups the null hypothesis model performed significantly worse than other models, thus we can conclude our sets of predictors likely have some effect on our target variable: blood LDL cholesterol. Although, it is still not clear if any of these significant predictors are actually nutrients at this stage.

After the model selection we are down to four models, two for each group. Now let us examine the estimated effects of our selected predictors for both data groups and further refine our understanding of the best fitting models by checking for collinearities and consistency with the existing literature.

Analyzing estimated effects from posterior draws

Now we are ready to analyze the effects given our chosen models and two data splits. The following code produces interval plots for all the models of their top 5 most significant effects.

selected_models = c("med_complete", "med_hypothesis2",
                    "nomed_complete", "nomed_hypothesis1")
coeff_plots = vector(mode="list", length = 8)
i = 1;
for (model in models) {
  for (group in groups) {
    fit_key = paste(group, model, sep = "_")
    results = as.data.frame(monitor(fits[[fit_key]]$fit, print = FALSE))
    
    # Filter out zero parameters
    results = results[abs(results$Q50) > 0.001, ]
    
    # order by median statistic
    (results_sorted = results[order(-abs(results$Q50)), ])
    
    row_names = row.names(results_sorted)
    results_sorted_pars = head(row_names[grepl("b_", row_names)], 5)
    
    p = mcmc_intervals(as.data.frame(fits[[fit_key]]), 
                      pars=results_sorted_pars, prob_outer = 0.95) +
                      xlab(paste(tools::toTitleCase(group), 
                                 pred_to_modelname[[model]]))
    
    # highlight selected models from loo_compare
    if (fit_key %in% selected_models) {
      p = p + theme(axis.title.x = element_text(face = "bold", 
                                                color = "#2967C9"))
    }
    coeff_plots[[i]] = p
    i = i + 1
  }
}

Now let’s put the plots in a grid for comparison.

bayesplot_grid(plots=coeff_plots, xlim = c(-1, 1), legends = FALSE,
               grid_args = list(nrow = 4))

The first column is for Med group and the second is for Nomed group. The plots highlighted bold blue titles correspond to the best models selected using leave-one-out cross-validation earlier. It is evident from these plots that variable selection and set of observations substantially affects which predictors are estimated to have a significant effect. For example Hypothesis 1 has completely different permutation in the effect ranking in Med group when compared to Nomed group. Thus, exploration over sufficiently large set of models is essential for robust statistical estimation of these effects under different conditions. However, looking at the posterior marginals alone is not enough, because of aforementioned collinear effects. Consequently, these effects will be analyzed next. We will only focus on the four best models selected earlier to keep the length of this report reasonable.

Checking for correlated effects

Purely looking at marginal distributions of the posterior draws for the coefficients can be misleading as two coefficients can be highly correlated and such interactions are not shown in the marginals. To better understand the relationships between different effects, we will now analyze the correlations between regression coefficients in the two best models for each group.

Complete Model for Med group

We begin by plotting the full correlation matrix for complete model to visualize all interactions between \(\beta\) parameters.

fit_key = "med_complete"
coeff_pars = gsub("^(.*?)", "b_", predictors[["complete"]])
draws = as.data.frame(fits[[fit_key]])
coeff_draws = as.matrix(draws[, coeff_pars, drop=FALSE])
corrplot(cor(coeff_draws))

Let’s draw a scatter plot for the two most dominant effects.

mcmc_scatter(as.data.frame(fits[["med_complete"]]), 
             pars = c("b_blood_hdl_chol", "b_weight")) +
             geom_vline(xintercept=0) + geom_hline(yintercept=0)

It is clear from the scatter plot that the two most dominant effects correlate with each other and the effect of other is visible in the marginal of the other. Furthermore, it is visible from the correlation matrix that the third most dominant effect systolic blood pressure correlates negatively with weight as well.

Given the highly correlated non-zero coefficients for the complete model of the medication group and the fact that HDL-cholesterol should help reducing LDL cholesterol levels according to (Panlasigui et al., 2003), we reject this model despite the best predictive performance, because the results are ambiguous and misleading. This happens likely because weight and high blood pressure are dominant features in the group needing cholesterol medication and their effects are merely projected onto correlating variable HDL-cholesterol level. In addition, HDL-cholesterol levels likely correlate with LDL-levels given prior knowledge about human body’s attempts to maintain homeostasis so it is not very useful predictor either. Moreover, as the emphasis of this study is on nutritional effects these effects are not very interesting either.

Hypothesis 2 Model for Med group

Now we can analyze the second best candidate model for the medication group, which is the Hypothesis 2 model.

fit_key = "med_hypothesis2"
coeff_pars = gsub("^(.*?)", "b_", predictors[["hypothesis2"]])
draws = as.data.frame(fits[[fit_key]])
coeff_draws = as.matrix(draws[, coeff_pars, drop=FALSE])
corrplot(cor(coeff_draws))

Let’s plot a scatter plot for the two most dominant effects.

mcmc_scatter(as.data.frame(fits[["med_hypothesis2"]]), 
             pars = c("b_blood_insulin", "b_dvit")) +
             geom_vline(xintercept=0) + geom_hline(yintercept=0)

Here the marginal posteriors of two most prominent effects are not correlating significantly anymore. Fixing the blood insulin level does not affect the distribution of D-vitamin coefficient noticeably. Furthermore D-vitamin is an interesting nutritional effect, so we choose to use this model for the Med group. However, care should be taken with the conclusions made from these observations. Leaving out variables to reduce correlations as done in all Hypothesis models can result in the effect of some unknown variable not part of the model manifesting in the marginal posterior of D-vitamin coefficient. It was observed during the initial exploratory analysis that it highly correlates with omega-3 fatty acids such as EPA and DHA, so both of those variables were purposely left out. This is likely because many omega-3 fatty acid suplements include D-vitamin as well.

However, there is some evidence in the literature that D-vitamin production from UVB exposure requires cholesterol as an ingredient according to (Bogh et al., 2010), so the negative correlation is not too surprising. Under this premise, the intake of D-vitamin supplements should not necessarily be interpreted as inhibitor of cholesterol levels, when in fact the exposure to sunlight is the true cause and D-vitamin is just the end product of the process that consumes cholesterol. Although, it is not clear whether the reported D-vitamin value in the data is measured from nutrition intake or directly from blood. In the former case the earlier hypothesis would only partially explain the negative correlation between D-vitamin and LDL cholesterol. Either way, we have established a clear correlation between a nutrient D-vitamin and LDL cholesterol. The study of exact causal relations between the two variables is left as a topic for future research.

We have now successfully identified one nutritional effect on blood LDL cholesterol conditioned on the fact that the participants were taking cholesterol medication. Let’s look at the effects in opposite case were participants are without cholesterol medication.

Complete Model for Nomed Group

Again let’s start with the complete model and its correlation matrix.

fit_key = "nomed_complete"
coeff_pars = gsub("^(.*?)", "b_", predictors[["complete"]])
draws = as.data.frame(fits[[fit_key]])
coeff_draws = as.matrix(draws[, coeff_pars, drop=FALSE])
corrplot(cor(coeff_draws))

Let’s plot a scatter plot for the two most dominant effects as well.

mcmc_scatter(as.data.frame(fits[["nomed_complete"]]), 
             pars = c("b_blood_hdl_chol", "b_rrsyst")) +
             geom_vline(xintercept=0) + geom_hline(yintercept=0)

In case of Nomed group, coefficient for blood HDL cholesterol is clearly negative, as supported by the literature (Panlasigui et al., 2003). There is no significant correlation with systolic blood pressure either nor any of the other variables according to scatter plot and correlation matrix above. Thus we can trust this model to some extent. However, according to posterior predictive performance comparison earlier, the Hypothesis Model 1 is almost as good as the complete model for no medication group, and the elpd difference can largely be explained by standard error induced by finite number of samples. Therefore, it is worthwhile to check that model as well. Furthermore, the complete model gives little emphasis on interesting nutritional factors.

Hypothesis 1 Model for Nomed Group

The correlations between predictors are shown below

fit_key = "nomed_hypothesis1"
coeff_pars = gsub("^(.*?)", "b_", predictors[["hypothesis1"]])
draws = as.data.frame(fits[[fit_key]])
coeff_draws = as.matrix(draws[, coeff_pars, drop=FALSE])
corrplot(cor(coeff_draws))

and scatter plots for the three most dominant effects look like this.

sc1 = mcmc_scatter(as.data.frame(fits[["nomed_hypothesis1"]]),
             pars = c("b_fibre", "b_rrdias")) +
             geom_vline(xintercept=0) + geom_hline(yintercept=0)
sc2 = mcmc_scatter(as.data.frame(fits[["nomed_hypothesis1"]]),
             pars = c("b_fibre", "b_alcohol")) +
             geom_vline(xintercept=0) + geom_hline(yintercept=0)
bayesplot_grid(sc1, sc2, grid_args = list(nrow = 1), ylim = c(-0.2, 0.5))

Fibre intake has no significant correlation with blood pressure and has clear negative effect on blood LDL cholesterol levels as reported by (Theuwissen and Mensink, 2008). There appears to be small positive correlation with alcohol, which means the more you drink alcohol, the less intake of fibre helps to reduce blood LDL cholesterol, although the joint effect is small and might be explained by other covariates as well as indicated by the above correlation matrix.

Although the effect is slightly smaller than that of D-vitamin for Med group, we have successfully identified another possible factor affecting blood LDL cholesterol levels, but this time the sign of correlation is reversed and the effect is more visible in group taking no cholesterol medication. In other words, there is some empirical evidence to suspect that fibre intake might help to reduce blood LDL cholesterol. Furthermore, this finding is in line with the nutritional literature (Theuwissen and Mensink, 2008).

Prior sensitivity analysis

The choice of prior distributions for model parameters that properly reflect our prior knowledge about them is a crucial part of Bayesian data analysis. The use of prior information has a compelling regularizing effect for the parameter inference in contrast to maximum likelihood estimation, especially if little data is available about the phenomenon of interest. However, the flip side is that a poor choice of prior can significantly bias the parameter estimates. Therefore, some analysis about model’s sensitiveness to prior choices should be conducted in retrospect to avoid usage of completely misspecified models and reporting of clearly biased results.

To keep this analysis in within sensible time constraints with limited computational resources, we will focus on Hypothesis Model 2 for Med group and Hypothesis Model 1 for Nomed group and monitor how the effects of D-vitamin and fibre are affected by two alternative prior choices. Firstly let us define the prior choices we are going to compare in the table below.

	`b` (\(\boldsymbol{\beta}\))	`Intercept` (\(\boldsymbol{\beta_0}\))	`sigma` (\(\sigma\))
Orig Priors	`horseshoe(p0 = 3)`	`student_t(7,0,1)`	`student_t(7,1,1)`
Alt Priors 1	`horseshoe(p0 = 2)`	`student_t(7,5,1)`	`student_t(7,6,1)`
Alt Priors 2	`horseshoe(p0 = 4)`	`student_t(7,0,1001)`	`student_t(7,1,1001)`

The three rows correspond to the prior configurations to be tested and columns correspond to the parameter class. The first row is the original prior configuration already fitted. See earlier model specification for details about model parameters.

The fitting the two models with two alternative priors happens as follows.

alt_priors1 = define_priors(predictors[["hypothesis1"]], "hypothesis1", p0=2,
                             mu_offset = 5)
alt_priors2 = define_priors(predictors[["hypothesis1"]], "hypothesis1", p0=4,
                             sigma_offset = 1e3)

nomed_hypothesis1_altfit1 = fit_model_with_predictors(
                                                predictors[["hypothesis1"]], 
                                                data_splits[["nomed"]],
                                                "nomed", "hypothesis1_altfit1",
                                                priors = alt_priors1)

nomed_hypothesis1_altfit2 = fit_model_with_predictors(
                                                predictors[["hypothesis1"]], 
                                                data_splits[["nomed"]],
                                                "nomed", "hypothesis1_altfit2",
                                                priors = alt_priors2)

med_hypothesis2_altfit1 = fit_model_with_predictors(predictors[["hypothesis2"]], 
                                                data_splits[["med"]],
                                                "med", "hypothesis2_altfit1",
                                                priors = alt_priors1)

med_hypothesis2_altfit2 = fit_model_with_predictors(predictors[["hypothesis2"]], 
                                                data_splits[["med"]],
                                                "med", "hypothesis2_altfit2",
                                                priors = alt_priors2)

First, we analyze the effect of prior choices on Hypothesis Model 2 for Med group. More specifically we consider the posterior marginal distribution for the interesting coefficient \(\beta_{\text{dvit}}\) under different prior choices.

combined_df = data.frame(matrix(NA, nrow = 24000, ncol = 0))
combined_df["Original"] = 
  as.data.frame(fits[["med_hypothesis2"]])$b_dvit
combined_df["Alt 1"] = as.data.frame(med_hypothesis2_altfit1)$b_dvit
combined_df["Alt 2"] = as.data.frame(med_hypothesis2_altfit2)$b_dvit

combined_df = combined_df %>%
  gather(Priors, value, "Original", "Alt 1", "Alt 2")

ggplot(data=combined_df) +
  geom_density(aes(x=value, color=Priors, fill=Priors), alpha=0.6) +
  geom_density(aes(x=value, color=Priors, fill=Priors), alpha=0.6) +
  geom_density(aes(x=value, color=Priors, fill=Priors), alpha=0.6) +
  scale_fill_manual(values = c("#77FF77", "#7777FF", "#FF77FF")) +
  scale_color_manual(values = c("#00FF00", "#0000FF", "#FF00FF")) +
  xlab("b_dvit") + theme(legend.position="top") +
  ggtitle('Hypothesis 2 Model for Med Group')

As can be seen from the above visualization, the prior choice does not significantly change where most of the posterior mass is located. Therefore, the model and the estimated effect are not very sensitive to the choice of priors. Thus, we can consider this estimate more reliable after this analysis as slightly poor initial choice for the prior has no significant effect.

Secondly, we analyze the effect of prior choices on Hypothesis Model 1 for Nomed group. More specifically we consider the posterior marginal distribution for the interesting coefficient \(\beta_{\text{fibre}}\) under different prior choices.

combined_df = data.frame(matrix(NA, nrow = 24000, ncol = 0))
combined_df["Original"] = 
  as.data.frame(fits[["nomed_hypothesis1"]])$b_fibre
combined_df["Alt 1"] = as.data.frame(nomed_hypothesis1_altfit1)$b_fibre
combined_df["Alt 2"] = as.data.frame(nomed_hypothesis1_altfit2)$b_fibre

combined_df = combined_df %>%
  gather(Priors, value, "Original", "Alt 1", "Alt 2")

ggplot(data=combined_df) +
  geom_density(aes(x=value, color=Priors, fill=Priors), alpha=0.6) +
  geom_density(aes(x=value, color=Priors, fill=Priors), alpha=0.6) +
  geom_density(aes(x=value, color=Priors, fill=Priors), alpha=0.6) +
  scale_fill_manual(values = c("#77FF77", "#7777FF", "#FF77FF")) +
  scale_color_manual(values = c("#00FF00", "#0000FF", "#FF00FF")) +
  xlab("b_fibre") + theme(legend.position="top") +
  ggtitle('Hypothesis 1 Model for Nomed Group')

Similar conclusion can be drawn here as with the Hypothesis Model 2 for Med group, the estimated effect of fibre is not very sensitive to the choice of priors.

Discussion

Having tested multiple hypotheses and performed series of rigorous data analysis checks, we can now be relatively confident that the observed effects are indeed somewhat related to the true underlying data generation mechanism, and not just artifacts of poor statistical practices, assuming our models are realistic. However, given the limited amount of data and the use of relatively simple models, further research is needed to make any definite conclusions.

Indeed, what comes to model assumptions, the GLM assumes i.i.d. variables, which was only partially satisfied in this study though heuristic variable selection. This variable selection in turn might cause some coefficients to carry the effects of other collinear coefficients, yielding biased estimates for the said effect coefficients. Furthermore, the relationship between predictors and the target variable were assumed to be linear, which might not be true at all in complicated systems such as the human body.

Variable selection was done in a heuristic manner as the number of different combinations is immense and exhaustive search is not an option. Nevertheless, variable selection is a critical design choice of the experiment and might induce significant decision bias to the results if done carelessly. Therefore, more effort should be put into making proper statistically justified variable selection. However, more principled selection methods do exists, such as the complete variable selection method using projection predictive variable selection approach proposed in (Pavone et al., 2020). The method is computationally very expensive, and given the limited data the approximation errors might be too large for it to be practical. Nevertheless, given more data and time budget, it might be a worthwhile research direction.

One more thing to improve would be to link these discovered effects to domain knowledge and compare them to some reference values. For now there has been no discussion of how large or significant these effects are, and for that some reference scale is needed. Indeed, there is no need to alarm the nutrition experts if the effects are insignificant.

In addition, one should perhaps focus on a certain largest quantile of the LDL cholesterol distribution as those people are most at risk for these nutritional effects. In fact, we have indirectly done that by separating the Med group, as it is very likely cholesterol medication is not prescribed for people with healthy levels of LDL cholesterol. Interestingly, the nutritional effects were quite different for that group of people as the results show. According to our results, we would not recommend intake of fibre as means to lower the LDL cholesterol for those people receiving the medication, even when the whole population level effect might suggest otherwise. In fact, the other (discarded) hypothesis model suggests the effect of fibre might be the opposite for the Med group.

Conclusion

In this paper effects of nutrition intake on blood LDL cholesterol were extensively studied given highly multivariate data set of nutritional and medical data. Several model hypotheses were tested and the best descriptive ones were chosen following the best practices of Bayesian data analysis. Two sub population level nutritional effects were identified, namely the intake of D-vitamin and Fibre. The first effect is positive given the person has taken cholesterol medication and the second is negative assuming no medication is taken. The second observation is backed by the literature and therefore not very surprising (Theuwissen and Mensink, 2008). The positive effect of D-vitamin to blood LDL cholesterol given use of cholesterol medication is somewhat surprising on the other hand. The scientific literature states there is no clear consensus on the link between D-vitamin and LDL cholesterol (Ponda et al., 2012). However, this has likely been less studied while controlling for the use of statins (cholesterol medication), so there is a possibility that some novel information has been revealed in this study about the link. Assuming the link is reflecting the true biological causal effect, we could give patients under cholesterol medication the advice to cut down on D-vitamin intake to avoid heart disease based on this study. Nevertheless, as stated earlier, more research is needed, which accounts for the aforementioned shortcomings of this study, to make any definite conclusions or dietary recommendations.

Self-reflection

As this project was quite technical, the most notable thing I learned during this project was the usage of many statistical computer tools, such as R notebooks, R programming and many R packages like brms and rstan. It really takes proficiency in many technologies to use R notebooks efficiently. In addition, I learned basic statistical data analysis work flow steps, at least in Bayesian framework. While many of these steps were introduced during the exercises, combining them in the course project gave more perspective to their meaning in practice. With the many statistical checks I learned the inherent epistemic uncertainty involved in most statistical analysis and that the knowledge extracted from it is only as good as our data and the models. Even if the model is perfect and there are plenty of quality data, there are still many pitfalls, such as using too little MCMC iterations, involved in Bayesian data analysis that I am now able to avoid. Overall, the project was a really fulfilling learning experience.

References

Bogh, M. K., Schmedes, A. V., Philipsen, P. A., Thieden, E. and Wulf, H. C. (2010) ‘Vitamin d production after UVB exposure depends on baseline vitamin d and total cholesterol but not on skin pigmentation’, Journal of Investigative Dermatology. Elsevier, 130(2), pp. 546–553.

Bürkner, P.-C. (2018) ‘Advanced Bayesian Multilevel Modeling with the R Package brms’, The R Journal, 10(1), pp. 395–411. doi: 10.32614/RJ-2018-017.

Carpenter, B. et al. (2017) ‘Stan: A probabilistic programming language’, Journal of statistical software. Columbia Univ., New York, NY (United States); Harvard Univ., Cambridge, MA …, 76(1).

Lankinen, M. et al. (2011) ‘Whole grain products, fish and bilberries alter glucose and lipid metabolism in a randomized, controlled trial: the Sysdimet study’, PLoS ONE, 6(8), p. e22646.

Paananen, T., Piironen, J., Bürkner, P. and Vehtari, A. (1906) ‘Implicitly adaptive importance sampling’, arXiv.

Panlasigui, L. N., Baello, O. Q., Dimatangal, J. M. and Dumelod, B. D. (2003) ‘Blood cholesterol and lipid-lowering effects of carrageenan on human volunteers.’, Asia Pacific Journal of Clinical Nutrition, 12(2).

Pavone, F., Piironen, J., Bürkner, P.-C. and Vehtari, A. (2020) ‘Using reference models in variable selection’, arXiv preprint arXiv:2004.13118.

Piironen, J. and Vehtari, A. (2017) ‘Sparsity information and regularization in the horseshoe and other shrinkage priors’, Electron. J. Statist. The Institute of Mathematical Statistics; the Bernoulli Society, 11(2), pp. 5018–5051. doi: 10.1214/17-EJS1337SI.

Ponda, M. P., Dowd, K., Finkielstein, D., Holt, P. R. and Breslow, J. L. (2012) ‘The short-term effects of vitamin d repletion on cholesterol: A randomized, placebo-controlled trial’, Arteriosclerosis, thrombosis, and vascular biology. Am Heart Assoc, 32(10), pp. 2510–2515.

Theuwissen, E. and Mensink, R. P. (2008) ‘Water-soluble dietary fibers and cardiovascular disease’, Physiology & behavior. Elsevier, 94(2), pp. 285–292.

Turkia, J. (2018) ‘Modeling the Effects of Nutrition with Mixed-Effect Bayesian Network’, StanCon Talks.

Vehtari, A., Gelman, A. and Gabry, J. (2017) ‘Practical bayesian model evaluation using leave-one-out cross-validation and WAIC’, Statistics and computing. Springer, 27(5), pp. 1413–1432.

Vehtari, A., Gelman, A., Simpson, D., Carpenter, B., Bürkner, P.-C. and others (2020) Bayesian Analysis. International Society for Bayesian Analysis.

Wenger, N., Boden, W., Carabello, B., Carney, R., Cerqueira, M. and Criqul, M. (2010) ‘Cardiovascular disability: Updating the social security listings’, National Academy of Sciences.

Appendix

Complete Stan model codes

The Stan codes produced by the brms package used to fit most of the models in this study are shown in this section.

Horseshoe Model

This is the main model code used in most of the experiments. It is slightly complicated thanks to the novel regularized horseshoe prior but for the purposes of this study we treat it as a simple prior distribution similar to the student_t distribution.

// generated with brms 2.13.5
functions {
  /* Efficient computation of the horseshoe prior
   * see Appendix C.1 in https://projecteuclid.org/euclid.ejs/1513306866
   * Args:
   *   z: standardized population-level coefficients
   *   lambda: local shrinkage parameters
   *   tau: global shrinkage parameter
   *   c2: slap regularization parameter
   * Returns:
   *   population-level coefficients following the horseshoe prior
   */
  vector horseshoe(vector z, vector lambda, real tau, real c2) {
    int K = rows(z);
    vector[K] lambda2 = square(lambda);
    vector[K] lambda_tilde = sqrt(c2 * lambda2 ./ (c2 + tau^2 * lambda2));
    return z .* lambda_tilde * tau;
  }
}
data {
  int<lower=1> N;  // number of observations
  vector[N] Y;  // response variable
  int<lower=1> K;  // number of population-level effects
  matrix[N, K] X;  // population-level design matrix
  // data for the horseshoe prior
  real<lower=0> hs_df;  // local degrees of freedom
  real<lower=0> hs_df_global;  // global degrees of freedom
  real<lower=0> hs_df_slab;  // slab degrees of freedom
  real<lower=0> hs_scale_global;  // global prior scale
  real<lower=0> hs_scale_slab;  // slab prior scale
  int prior_only;  // should the likelihood be ignored?
}
transformed data {
  int Kc = K - 1;
  matrix[N, Kc] Xc;  // centered version of X without an intercept
  vector[Kc] means_X;  // column means of X before centering
  for (i in 2:K) {
    means_X[i - 1] = mean(X[, i]);
    Xc[, i - 1] = X[, i] - means_X[i - 1];
  }
}
parameters {
  // local parameters for horseshoe prior
  vector[Kc] zb;
  vector<lower=0>[Kc] hs_local;
  real Intercept;  // temporary intercept for centered predictors
  // horseshoe shrinkage parameters
  real<lower=0> hs_global;  // global shrinkage parameters
  real<lower=0> hs_slab;  // slab regularization parameter
  real<lower=0> sigma;  // residual SD
}
transformed parameters {
  vector[Kc] b;  // population-level effects
  // compute actual regression coefficients
  b = horseshoe(zb, hs_local, hs_global, hs_scale_slab^2 * hs_slab);
}
model {
  // priors including all constants
  target += std_normal_lpdf(zb);
  target += student_t_lpdf(hs_local | hs_df, 0, 1)
    - rows(hs_local) * log(0.5);
  target += student_t_lpdf(Intercept | 7, 0, 1);
  target += student_t_lpdf(hs_global | hs_df_global, 0, hs_scale_global * sigma)
    - 1 * log(0.5);
  target += inv_gamma_lpdf(hs_slab | 0.5 * hs_df_slab, 0.5 * hs_df_slab);
  target += student_t_lpdf(sigma | 7, 1, 1)
    - 1 * student_t_lccdf(0 | 7, 1, 1);
  // likelihood including all constants
  if (!prior_only) {
    target += normal_id_glm_lpdf(Y | Xc, Intercept, b, sigma);
  }
}
generated quantities {
  // actual population-level intercept
  real b_Intercept = Intercept - dot_product(means_X, b);
}

This code is the same for most of the models but the input data and parameters vary.

Null Model

The null model code is much simpler (e.g. empty functions block and some parameters missing) as it does not use the regularized horseshoe prior. It is shown below.

// generated with brms 2.13.5
functions {
}
data {
  int<lower=1> N;  // number of observations
  vector[N] Y;  // response variable
  int<lower=1> K;  // number of population-level effects
  matrix[N, K] X;  // population-level design matrix
  int prior_only;  // should the likelihood be ignored?
}
transformed data {
  int Kc = K - 1;
  matrix[N, Kc] Xc;  // centered version of X without an intercept
  vector[Kc] means_X;  // column means of X before centering
  for (i in 2:K) {
    means_X[i - 1] = mean(X[, i]);
    Xc[, i - 1] = X[, i] - means_X[i - 1];
  }
}
parameters {
  vector[Kc] b;  // population-level effects
  real Intercept;  // temporary intercept for centered predictors
  real<lower=0> sigma;  // residual SD
}
transformed parameters {
}
model {
  // priors including all constants
  target += normal_lpdf(b | 0, 1);
  target += student_t_lpdf(Intercept | 7, 0, 1);
  target += student_t_lpdf(sigma | 7, 1, 1)
    - 1 * student_t_lccdf(0 | 7, 1, 1);
  // likelihood including all constants
  if (!prior_only) {
    target += normal_id_glm_lpdf(Y | Xc, Intercept, b, sigma);
  }
}
generated quantities {
  // actual population-level intercept
  real b_Intercept = Intercept - dot_product(means_X, b);
}

Linear Regression Analysis of Nutritional Effects on Blood LDL Cholesterol Levels

Pasi Pyrrö

2021-02-09