Fits several random forest models on the same data to capture the effect of the algorithm's stochasticity on the variable importance scores, predictions, residuals, and performance measures. The function uses the median to aggregate performance and importance values across repetitions. It is designed to be used after a model has been fitted with `rf()` or `rf_spatial()`, tuned with `rf_tuning()`, and/or evaluated with `rf_evaluate()`.
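The median aggregation described above can be illustrated with a minimal base R sketch. The importance values and predictor names (`climate`, `area`, `isolation`) are hypothetical and chosen only for illustration; this is not the package's internal code, just one way to picture how per-repetition scores collapse into a single ranking.

```r
# Hypothetical permutation importance scores (made-up values):
# one row per repetition, one column per predictor.
importance.per.repetition <- matrix(
  c(0.30, 0.10, 0.05,
    0.28, 0.14, 0.04,
    0.35, 0.09, 0.07),
  nrow = 3,
  byrow = TRUE,
  dimnames = list(NULL, c("climate", "area", "isolation"))
)

# Aggregate across repetitions with the median of each column.
median.importance <- apply(importance.per.repetition, 2, median)

# Order the predictors from most to least important.
median.importance <- sort(median.importance, decreasing = TRUE)
```

The median is less sensitive than the mean to a single repetition with an unusually high or low score, which is helpful when the number of repetitions is small.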

rf_repeat(
  model = NULL,
  data = NULL,
  dependent.variable.name = NULL,
  predictor.variable.names = NULL,
  distance.matrix = NULL,
  distance.thresholds = NULL,
  xy = NULL,
  ranger.arguments = NULL,
  scaled.importance = FALSE,
  repetitions = 10,
  keep.models = TRUE,
  seed = 1,
  verbose = TRUE,
  n.cores = parallel::detectCores() - 1,
  cluster = NULL
)

Argument | Description |
---|---|
model | A model fitted with |
data | Data frame with a response variable and a set of predictors. Default: |
dependent.variable.name | Character string with the name of the response variable. Must be in the column names of |
predictor.variable.names | Character vector with the names of the predictive variables. Every element of this vector must be in the column names of |
distance.matrix | Square matrix with the distances among the records in |
distance.thresholds | Numeric vector with neighborhood distances. All distances in the distance matrix below each value in |
xy | (optional) Data frame or matrix with two columns containing coordinates and named "x" and "y". It is not used by this function, but it is stored in the slot |
ranger.arguments | Named list with ranger arguments (other arguments of this function can also go here). All ranger arguments are set to their default values except for 'importance', which is set to 'permutation' rather than 'none'. Please consult the help file of ranger if you are not familiar with the arguments of this function. |
scaled.importance | Logical. If |
repetitions | Integer, number of random forest models to fit. Default: |
keep.models | Logical, if |
seed | Integer, random seed to facilitate reproducibility. If set to a given number, the results of the function are always the same. Default: |
verbose | Logical, if |
n.cores | Integer, number of cores to use for parallel execution. Creates a socket cluster with |
cluster | A cluster definition generated with |
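As an illustration of the `xy`, `distance.matrix`, and `distance.thresholds` arguments, the sketch below builds a square distance matrix from a small set of coordinates using base R. The coordinates are hypothetical, and plain Euclidean distance is an assumption that only makes sense for planar (projected) coordinates; the thresholds are an arbitrary example, not a recommended setting.

```r
# Hypothetical planar coordinates for five records (made-up values).
xy <- data.frame(
  x = c(0, 3, 0, 6, 3),
  y = c(0, 4, 8, 8, 0)
)

# Pairwise Euclidean distances, converted to a full square matrix
# with one row and one column per record.
distance.matrix <- as.matrix(dist(xy[, c("x", "y")]))

# Example neighborhood distances spanning the observed range
# (an arbitrary choice for illustration).
distance.thresholds <- seq(0, max(distance.matrix), length.out = 4)
```

The rows of the matrix must follow the same order as the rows of the data frame passed to the model, since each row of distances refers to one record.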

A ranger model with several new slots:

- `ranger.arguments`: Stores the values of the arguments used to fit the ranger model.
- `importance`: A list containing a data frame with the predictors ordered by their importance, a ggplot showing the importance values, and local importance scores.
- `performance`: Out-of-bag performance scores: R squared, pseudo R squared, RMSE, and normalized RMSE (NRMSE).
- `pseudo.r.squared`: Computed as the correlation between the observations and the predictions.
- `residuals`: Residuals, normality test of the residuals computed with `residuals_test()`, and spatial autocorrelation of the residuals computed with `moran_multithreshold()`.
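To make two of the performance scores listed above concrete, here is a minimal base R sketch of the pseudo R squared (taken, as stated above, to be the correlation between observations and predictions) and the RMSE. The `observed` and `predicted` vectors are hypothetical values, and this is not the package's own implementation. NRMSE is left out because the normalization it uses is not stated here.

```r
# Hypothetical observed and predicted values (made-up numbers).
observed  <- c(10, 20, 30, 40)
predicted <- c(12, 18, 33, 39)

# Pseudo R squared: correlation between observations and predictions.
pseudo.r.squared <- cor(observed, predicted)

# Root mean squared error, in the units of the response variable.
rmse <- sqrt(mean((observed - predicted)^2))
```

When several repetitions are fitted, each one yields its own pair of scores, and the median across repetitions gives the aggregated value.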

if(interactive()){

  #loading example data
  data(plant_richness_df)
  data(distance_matrix)

  #fitting 5 random forest models
  out <- rf_repeat(
    data = plant_richness_df,
    dependent.variable.name = "richness_species_vascular",
    predictor.variable.names = colnames(plant_richness_df)[5:21],
    distance.matrix = distance_matrix,
    distance.thresholds = 0,
    repetitions = 5,
    n.cores = 1
  )

  #data frame with ordered variable importance
  out$importance$per.variable

  #per repetition
  out$importance$per.repetition

  #variable importance plot
  out$importance$per.repetition.plot

  #performance
  out$performance

  #spatial correlation of the residuals for different distance thresholds
  out$spatial.correlation.residuals$per.distance

  #plot of the Moran's I of the residuals for different distance thresholds
  out$spatial.correlation.residuals$plot

  #using a model as an input for rf_repeat()
  rf.model <- rf(
    data = plant_richness_df,
    dependent.variable.name = "richness_species_vascular",
    predictor.variable.names = colnames(plant_richness_df)[8:21],
    distance.matrix = distance_matrix,
    distance.thresholds = 0,
    n.cores = 1
  )

  #repeating the model 5 times
  rf.repeat <- rf_repeat(
    model = rf.model,
    repetitions = 5,
    n.cores = 1
  )

  rf.repeat$performance
  rf.repeat$importance$per.repetition.plot

}