Why can’t I just have all of the data?

I have been updating some code that uses data that I don’t have access to. I used one of the data sets from the survival package but it annoyed me that the variable names were not what I wanted them to be and I couldn’t be bothered changing them. I was killing time on a train the other day and discovered the simstudy package and decided to give it a go. It worked beautifully and made updating my code so much easier. It also made me realise how much R I have learnt in the last year so I wanted to show the difference between the old and the new code with you.

Creating data

Start by defining the variables you want in your data set and create the survival part of the data and before creating the full data set.

def <- defData(varname = "v1_6", dist = "binary", formula = 0.6, id = "idnum", link = "logit")
def <- defData(def, varname = "v1_5", dist = "uniformInt", formula = "18;110")
def <- defData(def, varname = "v2_1_1", dist = "binary", formula = 0.5)
def <- defData(def, varname = "v2_1_2", dist = "binary", formula = 0.6)
def <- defData(def, varname = "v2_1_3", dist = "binary", formula = 0.5)
def <- defData(def, varname = "v2_1_4", dist = "binary", formula = 0.55)
def <- defData(def, varname = "v2_1_5", dist = "binary", formula = 0.5)
def <- defData(def, varname = "v2_5", dist = "binary", formula = 0.5)
def <- defData(def, varname = "v7_4", dist = "binary", formula = 0.5)
def <- defData(def, varname = "event", dist = "binary", formula = 0.1)
def <- defData(def, varname = "died", dist = "binary", formula = 0.5)
def <- defData(def, varname = "NIHSS_15", dist = "uniformInt", formula = "0;42", link="identity")
def <- defData(def, varname = "age_group", dist = "categorical", formula = "0.5;0.2;0.3", link="identity")
# Define survival data part
sdef <- defSurv(varname = "survTime", formula = "v1_5/2", scale = 20, 
                shape = "v2_5*1 + (1-v2_5)*1.5")
sdef <- defSurv(sdef, varname = "censorTime", scale = 80, shape = 1)
# Create data
df <- genData(5000, def)
df <- genSurv(df, sdef)

What a lovely data set I created!

Descriptive statistics

Now that I have data I want to compare the group that experienced the event to the group that didn’t. I use CreateTableOne from the tableone package. I need to specify which variables I want to include (or all variables will be included) and which ones I want to be factors (will be shown as %) or nonnormal variables (will be shown as median).

table_var<- c("v1_5", "v1_6", "v2_1_1", "v2_1_2", "v2_1_3", "v2_1_4", "v2_1_5", "v2_5","v7_4", "age_group")
median_var <- median_var<- c("v1_5", "NIHSS_15", "v2_2", "v7_4")
factor_var <- c("v1_6","v2_5","v2_1_1", "v2_1_2", "v2_1_3", "v2_1_4", "v2_1_5","age_group")

table1<- CreateTableOne(table_var, data = df,  strata = "event", factorVars = factor_var,
                        testNonNormal = kruskal.test, argsNonNormal = median_var)
table1_print<-print(table1, nonnormal = median_var, contDigits=0, 
                 exact = "LOC",  printToggle = TRUE, factorVars = factor_var)

##                      Stratified by event
##                       0             1            p      test   
##   n                   4483          517                        
##   v1_5 (median [IQR])   63 [40, 86]  63 [42, 85]  NA    nonnorm
##   v1_6 = 1 (%)        2939 (65.6)   329 (63.6)    0.412        
##   v2_1_1 = 1 (%)      2192 (48.9)   262 (50.7)    0.471        
##   v2_1_2 = 1 (%)      2723 (60.7)   310 (60.0)    0.767        
##   v2_1_3 = 1 (%)      2260 (50.4)   250 (48.4)    0.401        
##   v2_1_4 = 1 (%)      2438 (54.4)   276 (53.4)    0.700        
##   v2_1_5 = 1 (%)      2254 (50.3)   259 (50.1)    0.974        
##   v2_5 = 1 (%)        2235 (49.9)   265 (51.3)    0.577        
##   v7_4 (median [IQR])    1 [0, 1]     0 [0, 1]    NA    nonnorm
##   age_group (%)                                   0.247        
##      1                2187 (48.8)   252 (48.7)                 
##      2                 892 (19.9)   117 (22.6)                 
##      3                1404 (31.3)   148 (28.6)

You might want to save the table in a different format such as a csv file. This code should help you do that. Note that the print options are slightly different.

table1_print <- print(table1, nonnormal = median_var, contDigits=0, 
                 exact = "LOC", quote = FALSE, noSpaces = TRUE, printToggle = FALSE, factorVars = factor_var)
## Save to a CSV file
write.csv(table1_print, file = "my_table1.csv")

Regressions

Time to move on to the regressions! I wrote the first version of this code about 9 months ago and it looked a bit like this but with more models:

surv_data <- Surv(time=df$survTime, event=df$event, type="right")

coxph.fit.v1.6 <- coxph(surv_data ~ v1_6 , data=df, method="breslow")
coxph.fit.v1.5 <- coxph(surv_data ~ v1_5 , data=df, method="breslow")
coxph.fit.v2.5 <- coxph(surv_data ~ v2_5 , data=df, method="breslow")

As you can imagine this made my code really long and included a lot of copying and pasting.

As I learnt more and more R (and discovered the power of purrr after an R ladies event where Charlotte Wickham presented) I decided to update the code before giving to the person who will run it. I start by putting all of my predictors in a vector. I specify all the predictors I care about and use map to fit the model for each one.

predictors <- c("v1_5", "v1_6", "v2_1_1", "v2_1_2", "v2_1_3", "v2_1_4", "v2_1_5", "v2_5","v7_4", "age_group")

surv_data <- Surv(time=df$survTime, event=df$event, type="right")

unadjusted <- map(predictors, ~coxph(as.formula(paste("surv_data", .x, sep="~")), data=df))

I refuse to copy numbers off the screen because I know I will get it wrong so I use the texreg package to create nice regression tables. I need to override the coefficients as I am interested in exp(coef) and this is how I did it 9 months ago. I spent most of my time terrified that I would accidentally use coefficients from the wrong model and adding new models took forever.

screenreg(list(coxph.fit.v1.6, coxph.fit.v1.5, coxph.fit.v2.5 ),  
        single.row= TRUE , ci.force= TRUE, include.aic = TRUE, include.rsquared =  TRUE, include.maxrs= FALSE,
        ci.test = 1,
        override.coef = list(summary(coxph.fit.v1.6)$coef[,2],
                             summary(coxph.fit.v1.5)$coef[,2],
                             summary(coxph.fit.v2.5)$coef[,2])
        )

Once again the map function saved me. I create lists with the coefficients and upper and lower confidence intervals. This way I don’t have to worry about using the wrong coefficients and it is real easy to add more models.

unadjusted_coef <- map(unadjusted, summary ) %>% 
  map("coefficients") %>% 
  map(as.data.frame) %>% 
  map("exp(coef)")

unadjusted_lower_conf <- map(unadjusted, summary) %>% 
  map("conf.int") %>% 
  map(as.data.frame) %>% 
  map("lower .95")

unadjusted_upper_conf <- map(unadjusted, summary) %>% 
  map("conf.int") %>% 
  map(as.data.frame) %>% 
  map("upper .95")

screenreg(unadjusted, single.row= FALSE, ci.force= TRUE, ci.test=1,
          override.coef = unadjusted_coef,
          override.ci.low=unadjusted_lower_conf,
          overide.ci.up=unadjusted_upper_conf)

## 
## =======================================================================================================================================================
##              Model 1       Model 2       Model 3       Model 4       Model 5       Model 6       Model 7       Model 8       Model 9       Model 10    
## -------------------------------------------------------------------------------------------------------------------------------------------------------
## v1_5            1.00                                                                                                                                   
##              [1.00; 1.00]                                                                                                                              
## v1_6                          0.92                                                                                                                     
##                            [0.74; 1.10]                                                                                                                
## v2_1_1                                      1.07                                                                                                       
##                                          [0.89; 1.24]                                                                                                  
## v2_1_2                                                    0.97                                                                                         
##                                                        [0.79; 1.14]                                                                                    
## v2_1_3                                                                  0.93                                                                           
##                                                                      [0.76; 1.10]                                                                      
## v2_1_4                                                                                0.96                                                             
##                                                                                    [0.79; 1.13]                                                        
## v2_1_5                                                                                              0.99                                               
##                                                                                                  [0.82; 1.16]                                          
## v2_5                                                                                                              1.04                                 
##                                                                                                                [0.87; 1.21]                            
## v7_4                                                                                                                            0.86                   
##                                                                                                                              [0.68; 1.03]              
## age_group                                                                                                                                     0.97     
##                                                                                                                                            [0.87; 1.06]
## -------------------------------------------------------------------------------------------------------------------------------------------------------
## AIC          8712.31       8711.94       8712.19       8712.58       8712.01       8712.49       8712.70       8712.52       8709.54       8712.23     
## R^2             0.00          0.00          0.00          0.00          0.00          0.00          0.00          0.00          0.00          0.00     
## Max. R^2        0.82          0.82          0.82          0.82          0.82          0.82          0.82          0.82          0.82          0.82     
## Num. events   517           517           517           517           517           517           517           517           517           517        
## Num. obs.    5000          5000          5000          5000          5000          5000          5000          5000          5000          5000        
## Missings        0             0             0             0             0             0             0             0             0             0        
## PH test         0.98          0.60          0.68          0.78          0.59          0.06          0.38          0.46          0.25          0.24     
## =======================================================================================================================================================
## * 1 outside the confidence interval

It is now super easy to add/remove variables to my models. I already know that I want to include two variables - v1_5 and v1_6 - in my full model so I add them to the model. I remove v1_5 from my predictors or I will have two models for just v1_5 and v1_6.

predictors <- c("v1_6", "v2_1_1", "v2_1_2", "v2_1_3", "v2_1_4", "v2_1_5", "v2_5","v7_4", "age_group")

adjusted <- map(predictors, ~coxph(as.formula(paste("surv_data ~ v1_5 + v1_6", .x, sep="+")),
                                   data=df))

adjusted_coef <- map(adjusted, summary ) %>% 
  map("coefficients") %>% 
  map(as.data.frame) %>% 
  map("exp(coef)")

adjusted_lower_conf <- map(adjusted, summary) %>% 
  map("conf.int") %>% 
  map(as.data.frame) %>% 
  map("lower .95")
  
adjusted_upper_conf <- map(adjusted, summary) %>% 
  map("conf.int") %>% 
  map(as.data.frame) %>% 
  map("upper .95")

screenreg(adjusted, single.row= FALSE, ci.force= TRUE, ci.test=1, 
          override.coef = adjusted_coef,
          override.ci.low=adjusted_lower_conf,
          overide.ci.up=adjusted_upper_conf)

## 
## =========================================================================================================================================
##              Model 1       Model 2       Model 3       Model 4       Model 5       Model 6       Model 7       Model 8       Model 9     
## -----------------------------------------------------------------------------------------------------------------------------------------
## v1_5            1.00          1.00          1.00          1.00          1.00          1.00          1.00          1.00          1.00     
##              [1.00; 1.00]  [1.00; 1.00]  [1.00; 1.00]  [1.00; 1.00]  [1.00; 1.00]  [1.00; 1.00]  [1.00; 1.00]  [1.00; 1.00]  [1.00; 1.00]
## v1_6            0.92          0.92          0.92          0.92          0.92          0.92          0.92          0.92          0.92     
##              [0.74; 1.10]  [0.74; 1.10]  [0.74; 1.10]  [0.74; 1.10]  [0.74; 1.10]  [0.74; 1.10]  [0.74; 1.10]  [0.74; 1.10]  [0.74; 1.10]
## v2_1_1                        1.06                                                                                                       
##                            [0.89; 1.24]                                                                                                  
## v2_1_2                                      0.97                                                                                         
##                                          [0.79; 1.15]                                                                                    
## v2_1_3                                                    0.93                                                                           
##                                                        [0.76; 1.10]                                                                      
## v2_1_4                                                                  0.96                                                             
##                                                                      [0.79; 1.13]                                                        
## v2_1_5                                                                                0.99                                               
##                                                                                    [0.82; 1.16]                                          
## v2_5                                                                                                1.04                                 
##                                                                                                  [0.86; 1.21]                            
## v7_4                                                                                                              0.85                   
##                                                                                                                [0.68; 1.03]              
## age_group                                                                                                                       0.96     
##                                                                                                                              [0.86; 1.06]
## -----------------------------------------------------------------------------------------------------------------------------------------
## AIC          8713.55       8715.05       8715.43       8714.86       8715.34       8715.54       8715.37       8712.38       8715.02     
## R^2             0.00          0.00          0.00          0.00          0.00          0.00          0.00          0.00          0.00     
## Max. R^2        0.82          0.82          0.82          0.82          0.82          0.82          0.82          0.82          0.82     
## Num. events   517           517           517           517           517           517           517           517           517        
## Num. obs.    5000          5000          5000          5000          5000          5000          5000          5000          5000        
## Missings        0             0             0             0             0             0             0             0             0        
## PH test         0.87          0.93          0.95          0.91          0.27          0.80          0.85          0.66          0.65     
## =========================================================================================================================================
## * 1 outside the confidence interval

If you want to save the models in a word document this is how to do it.

htmlreg(adjusted, single.row= FALSE, ci.force= TRUE, ci.test=1, type="html", file="My_regressions.doc",
          override.coef = adjusted_coef,
          override.ci.low=adjusted_lower_conf,
          overide.ci.up=adjusted_upper_conf)

## The table was written to the file 'My_regressions.doc'.

Packages I used

library("tidyverse")
library("survival")
library("simstudy")
library("texreg")
library("tableone")

Regressions using completly fake data!

Why can’t I just have all of the data?

Creating data

Descriptive statistics

Regressions

Packages I used

Emma Vestesson