Regressions using completly fake data!
Why can’t I just have all of the data?
I have been updating some code that uses data that I don’t have access to. I used one of the data sets from the survival
package but it annoyed me that the variable names were not what I wanted them to be and I couldn’t be bothered changing them. I was killing time on a train the other day and discovered the simstudy
package and decided to give it a go. It worked beautifully and made updating my code so much easier. It also made me realise how much R I have learnt in the last year so I wanted to show the difference between the old and the new code with you.
Creating data
Start by defining the variables you want in your data set and create the survival part of the data and before creating the full data set.
def <- defData(varname = "v1_6", dist = "binary", formula = 0.6, id = "idnum", link = "logit")
def <- defData(def, varname = "v1_5", dist = "uniformInt", formula = "18;110")
def <- defData(def, varname = "v2_1_1", dist = "binary", formula = 0.5)
def <- defData(def, varname = "v2_1_2", dist = "binary", formula = 0.6)
def <- defData(def, varname = "v2_1_3", dist = "binary", formula = 0.5)
def <- defData(def, varname = "v2_1_4", dist = "binary", formula = 0.55)
def <- defData(def, varname = "v2_1_5", dist = "binary", formula = 0.5)
def <- defData(def, varname = "v2_5", dist = "binary", formula = 0.5)
def <- defData(def, varname = "v7_4", dist = "binary", formula = 0.5)
def <- defData(def, varname = "event", dist = "binary", formula = 0.1)
def <- defData(def, varname = "died", dist = "binary", formula = 0.5)
def <- defData(def, varname = "NIHSS_15", dist = "uniformInt", formula = "0;42", link="identity")
def <- defData(def, varname = "age_group", dist = "categorical", formula = "0.5;0.2;0.3", link="identity")
# Define survival data part
sdef <- defSurv(varname = "survTime", formula = "v1_5/2", scale = 20,
shape = "v2_5*1 + (1-v2_5)*1.5")
sdef <- defSurv(sdef, varname = "censorTime", scale = 80, shape = 1)
# Create data
df <- genData(5000, def)
df <- genSurv(df, sdef)
What a lovely data set I created!
Descriptive statistics
Now that I have data I want to compare the group that experienced the event to the group that didn’t. I use CreateTableOne from the tableone
package. I need to specify which variables I want to include (or all variables will be included) and which ones I want to be factors (will be shown as %) or nonnormal variables (will be shown as median).
table_var<- c("v1_5", "v1_6", "v2_1_1", "v2_1_2", "v2_1_3", "v2_1_4", "v2_1_5", "v2_5","v7_4", "age_group")
median_var <- median_var<- c("v1_5", "NIHSS_15", "v2_2", "v7_4")
factor_var <- c("v1_6","v2_5","v2_1_1", "v2_1_2", "v2_1_3", "v2_1_4", "v2_1_5","age_group")
table1<- CreateTableOne(table_var, data = df, strata = "event", factorVars = factor_var,
testNonNormal = kruskal.test, argsNonNormal = median_var)
table1_print<-print(table1, nonnormal = median_var, contDigits=0,
exact = "LOC", printToggle = TRUE, factorVars = factor_var)
## Stratified by event
## 0 1 p test
## n 4483 517
## v1_5 (median [IQR]) 63 [40, 86] 63 [42, 85] NA nonnorm
## v1_6 = 1 (%) 2939 (65.6) 329 (63.6) 0.412
## v2_1_1 = 1 (%) 2192 (48.9) 262 (50.7) 0.471
## v2_1_2 = 1 (%) 2723 (60.7) 310 (60.0) 0.767
## v2_1_3 = 1 (%) 2260 (50.4) 250 (48.4) 0.401
## v2_1_4 = 1 (%) 2438 (54.4) 276 (53.4) 0.700
## v2_1_5 = 1 (%) 2254 (50.3) 259 (50.1) 0.974
## v2_5 = 1 (%) 2235 (49.9) 265 (51.3) 0.577
## v7_4 (median [IQR]) 1 [0, 1] 0 [0, 1] NA nonnorm
## age_group (%) 0.247
## 1 2187 (48.8) 252 (48.7)
## 2 892 (19.9) 117 (22.6)
## 3 1404 (31.3) 148 (28.6)
You might want to save the table in a different format such as a csv file. This code should help you do that. Note that the print options are slightly different.
table1_print <- print(table1, nonnormal = median_var, contDigits=0,
exact = "LOC", quote = FALSE, noSpaces = TRUE, printToggle = FALSE, factorVars = factor_var)
## Save to a CSV file
write.csv(table1_print, file = "my_table1.csv")
Regressions
Time to move on to the regressions! I wrote the first version of this code about 9 months ago and it looked a bit like this but with more models:
surv_data <- Surv(time=df$survTime, event=df$event, type="right")
coxph.fit.v1.6 <- coxph(surv_data ~ v1_6 , data=df, method="breslow")
coxph.fit.v1.5 <- coxph(surv_data ~ v1_5 , data=df, method="breslow")
coxph.fit.v2.5 <- coxph(surv_data ~ v2_5 , data=df, method="breslow")
As you can imagine this made my code really long and included a lot of copying and pasting.
As I learnt more and more R (and discovered the power of purrr
after an R ladies event where Charlotte Wickham presented) I decided to update the code before giving to the person who will run it. I start by putting all of my predictors in a vector. I specify all the predictors I care about and use map
to fit the model for each one.
predictors <- c("v1_5", "v1_6", "v2_1_1", "v2_1_2", "v2_1_3", "v2_1_4", "v2_1_5", "v2_5","v7_4", "age_group")
surv_data <- Surv(time=df$survTime, event=df$event, type="right")
unadjusted <- map(predictors, ~coxph(as.formula(paste("surv_data", .x, sep="~")), data=df))
I refuse to copy numbers off the screen because I know I will get it wrong so I use the texreg
package to create nice regression tables. I need to override the coefficients as I am interested in exp(coef) and this is how I did it 9 months ago. I spent most of my time terrified that I would accidentally use coefficients from the wrong model and adding new models took forever.
screenreg(list(coxph.fit.v1.6, coxph.fit.v1.5, coxph.fit.v2.5 ),
single.row= TRUE , ci.force= TRUE, include.aic = TRUE, include.rsquared = TRUE, include.maxrs= FALSE,
ci.test = 1,
override.coef = list(summary(coxph.fit.v1.6)$coef[,2],
summary(coxph.fit.v1.5)$coef[,2],
summary(coxph.fit.v2.5)$coef[,2])
)
Once again the map
function saved me. I create lists with the coefficients and upper and lower confidence intervals. This way I don’t have to worry about using the wrong coefficients and it is real easy to add more models.
unadjusted_coef <- map(unadjusted, summary ) %>%
map("coefficients") %>%
map(as.data.frame) %>%
map("exp(coef)")
unadjusted_lower_conf <- map(unadjusted, summary) %>%
map("conf.int") %>%
map(as.data.frame) %>%
map("lower .95")
unadjusted_upper_conf <- map(unadjusted, summary) %>%
map("conf.int") %>%
map(as.data.frame) %>%
map("upper .95")
screenreg(unadjusted, single.row= FALSE, ci.force= TRUE, ci.test=1,
override.coef = unadjusted_coef,
override.ci.low=unadjusted_lower_conf,
overide.ci.up=unadjusted_upper_conf)
##
## =======================================================================================================================================================
## Model 1 Model 2 Model 3 Model 4 Model 5 Model 6 Model 7 Model 8 Model 9 Model 10
## -------------------------------------------------------------------------------------------------------------------------------------------------------
## v1_5 1.00
## [1.00; 1.00]
## v1_6 0.92
## [0.74; 1.10]
## v2_1_1 1.07
## [0.89; 1.24]
## v2_1_2 0.97
## [0.79; 1.14]
## v2_1_3 0.93
## [0.76; 1.10]
## v2_1_4 0.96
## [0.79; 1.13]
## v2_1_5 0.99
## [0.82; 1.16]
## v2_5 1.04
## [0.87; 1.21]
## v7_4 0.86
## [0.68; 1.03]
## age_group 0.97
## [0.87; 1.06]
## -------------------------------------------------------------------------------------------------------------------------------------------------------
## AIC 8712.31 8711.94 8712.19 8712.58 8712.01 8712.49 8712.70 8712.52 8709.54 8712.23
## R^2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## Max. R^2 0.82 0.82 0.82 0.82 0.82 0.82 0.82 0.82 0.82 0.82
## Num. events 517 517 517 517 517 517 517 517 517 517
## Num. obs. 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000
## Missings 0 0 0 0 0 0 0 0 0 0
## PH test 0.98 0.60 0.68 0.78 0.59 0.06 0.38 0.46 0.25 0.24
## =======================================================================================================================================================
## * 1 outside the confidence interval
It is now super easy to add/remove variables to my models. I already know that I want to include two variables - v1_5 and v1_6 - in my full model so I add them to the model. I remove v1_5 from my predictors or I will have two models for just v1_5 and v1_6.
predictors <- c("v1_6", "v2_1_1", "v2_1_2", "v2_1_3", "v2_1_4", "v2_1_5", "v2_5","v7_4", "age_group")
adjusted <- map(predictors, ~coxph(as.formula(paste("surv_data ~ v1_5 + v1_6", .x, sep="+")),
data=df))
adjusted_coef <- map(adjusted, summary ) %>%
map("coefficients") %>%
map(as.data.frame) %>%
map("exp(coef)")
adjusted_lower_conf <- map(adjusted, summary) %>%
map("conf.int") %>%
map(as.data.frame) %>%
map("lower .95")
adjusted_upper_conf <- map(adjusted, summary) %>%
map("conf.int") %>%
map(as.data.frame) %>%
map("upper .95")
screenreg(adjusted, single.row= FALSE, ci.force= TRUE, ci.test=1,
override.coef = adjusted_coef,
override.ci.low=adjusted_lower_conf,
overide.ci.up=adjusted_upper_conf)
##
## =========================================================================================================================================
## Model 1 Model 2 Model 3 Model 4 Model 5 Model 6 Model 7 Model 8 Model 9
## -----------------------------------------------------------------------------------------------------------------------------------------
## v1_5 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
## [1.00; 1.00] [1.00; 1.00] [1.00; 1.00] [1.00; 1.00] [1.00; 1.00] [1.00; 1.00] [1.00; 1.00] [1.00; 1.00] [1.00; 1.00]
## v1_6 0.92 0.92 0.92 0.92 0.92 0.92 0.92 0.92 0.92
## [0.74; 1.10] [0.74; 1.10] [0.74; 1.10] [0.74; 1.10] [0.74; 1.10] [0.74; 1.10] [0.74; 1.10] [0.74; 1.10] [0.74; 1.10]
## v2_1_1 1.06
## [0.89; 1.24]
## v2_1_2 0.97
## [0.79; 1.15]
## v2_1_3 0.93
## [0.76; 1.10]
## v2_1_4 0.96
## [0.79; 1.13]
## v2_1_5 0.99
## [0.82; 1.16]
## v2_5 1.04
## [0.86; 1.21]
## v7_4 0.85
## [0.68; 1.03]
## age_group 0.96
## [0.86; 1.06]
## -----------------------------------------------------------------------------------------------------------------------------------------
## AIC 8713.55 8715.05 8715.43 8714.86 8715.34 8715.54 8715.37 8712.38 8715.02
## R^2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## Max. R^2 0.82 0.82 0.82 0.82 0.82 0.82 0.82 0.82 0.82
## Num. events 517 517 517 517 517 517 517 517 517
## Num. obs. 5000 5000 5000 5000 5000 5000 5000 5000 5000
## Missings 0 0 0 0 0 0 0 0 0
## PH test 0.87 0.93 0.95 0.91 0.27 0.80 0.85 0.66 0.65
## =========================================================================================================================================
## * 1 outside the confidence interval
If you want to save the models in a word document this is how to do it.
htmlreg(adjusted, single.row= FALSE, ci.force= TRUE, ci.test=1, type="html", file="My_regressions.doc",
override.coef = adjusted_coef,
override.ci.low=adjusted_lower_conf,
overide.ci.up=adjusted_upper_conf)
## The table was written to the file 'My_regressions.doc'.
Packages I used
library("tidyverse")
library("survival")
library("simstudy")
library("texreg")
library("tableone")