Global models and predictions of plant diversity based on advanced machine learning techniques

predictors of past and present environmental conditions. (cid:1) Machine learning showed a superior performance, explaining up to 80.9% of species richness and 83.3% of phylogenetic richness, illustrating the great potential of such techniques for disentangling complex and interacting associations between the environment and plant diversity. Current climate and environmental heterogeneity emerged as the primary drivers, while past environmental conditions left only small but detectable imprints on plant diversity. (cid:1) Finally, we combined predictions from multiple modeling techniques (ensemble predictions) to reveal global patterns and centers of plant diversity at multiple resolutions down to 7774 km 2 . Our predictive maps provide accurate estimates of global plant diversity available at grain sizes relevant for conservation and macroecology.

To preserve and manage this important part of global biodiversity, knowledge of its spatial distribution and location of biodiversity centers is critical. Mapping plant distributions and diversity has a long and rich tradition starting in the 19 th century, with the collation of regional plant species numbers and expertdrawn isolines of species richness (Wulff, 1935;Barthlott et al., 2005;reviewed in Mutke & Barthlott, 2005). These maps have since then been refined and scaled to different resolutions (e.g. c. 12 100 km 2 in Kreft & Jetz, 2007) by modeling diversity patterns in response to environmental and spatial variables (Keil & Chase, 2019;Sabatini et al., 2022), allowing for continuous predictions worldwide. The accuracy of such predictive maps depends on the quality and representativeness of available plant diversity data, environmental predictors, and models applied. Recent developments in the availability both of data and of modeling techniques allows for models of plant diversity of hitherto unprecedented resolution and accuracy.
Knowledge of plant distributions worldwide has increased in recent years, thanks to international efforts to mobilize and collate species occurrence records (Enquist et al., 2016;GBIF, 2020) and vegetation plots (Sabatini et al., 2021) along with regional checklists and floras Govaerts et al., 2021). However, these data differ in precision, completeness, and scope (K€ onig et al., 2019). Specifically, fine-grained data such as occurrence records and vegetation plots are often geographically biased and only partially cover regional floras (Meyer et al., 2016;Qian et al., 2022). Despite being coarse-grained and often delimited by artificial, administrative borders, checklists and floras reflect the most complete and authoritative accounts of regional floristic composition to date and are available with near-complete global coverage Govaerts et al., 2021). As such, checklists and floras are useful resources for global-scale modeling of plant diversity-environment relationships (Kreft & Jetz, 2007) and for predicting plant diversity across different grain sizes (Keil & Chase, 2019). Including species identities further allows for the integration of species-level phylogenetic and trait information, offering a unique opportunity to study multiple facets of biodiversity.
Although it is widely accepted that plant diversity reflects the complex interplay of evolutionary, geological, and ecological processes, disentangling the drivers of global plant diversity remains an important topic of modern macroecology (Kreft & Jetz, 2007;Tietje et al., 2022). Several hypotheses related to geography, past and present climate, and environmental heterogeneity of a region have been proposed to explain plant diversity patterns (Currie et al., 2004;Mittelbach et al., 2007;Fine, 2015;Supporting Information Table S1). Large and heterogeneous areas, for example, are hypothesized to support more species by offering a greater diversity of resources and habitats, thus promoting species coexistence (Connor & McCoy, 1979) and offering refugia during environmental fluctuations (Stein et al., 2014). Also, areas with warm, wet, and relatively stable climates such as humid tropical forests should support more species owing to high speciation (Rohde, 1992;Mittelbach et al., 2007;Brown, 2014) and low extinction rates (Gillooly & Allen, 2007;Eiserhardt et al., 2015). Geographic isolation could simultaneously promote extinction (Brown & Kodric-Brown, 1977;Ouborg, 1993) and speciation (Kisel & Barraclough, 2010), by making populations less well-connected. Finally, historical processes like past plate tectonics and climatic change have influenced diversity patterns through altered biotic isolation and exchange or species range shifts (Dynesius & Jansson, 2000;Svenning et al., 2015;Couvreur et al., 2021). However, past environmental conditions remain underrepresented in global models of plant diversity and their legacies in modern plant distributions are still poorly understood (Kissling et al., 2012;Hagen et al., 2021).
Diversity-environment relationships are often complex, nonlinear, and scale-dependent (Francis & Currie, 2003;Keil & Chase, 2019). Many environmental predictors interact and show high levels of collinearity, thus presenting major challenges for conventional statistical models such as generalized linear models (GLMs) and generalized additive models (GAMs). Machine learning approaches represent powerful modeling tools that can effectively deal with multidimensional and correlated data and can reveal nonlinear relationships and interactions of predictors without a priori specification (Olden et al., 2008;Crisci et al., 2012). Therefore, machine learning has become a promising alternative to conventional techniques in ecology (Hengl et al., 2017;Park et al., 2020;Sabatini et al., 2022). However, its performance in modeling global plant diversity has yet to be explored. In addition to relying on one particular model type, combining predictions based on multiple modeling techniques (i.e. ensemble predictions) might decrease prediction uncertainties (Ara ujo & New, 2007) and can thereby further improve predictions of global plant diversity patterns.
Here, we present improved models and predictions of two key facets of vascular plant diversity, that is, species richness and phylogenetic richness, at a global extent using advanced statistical modeling techniques. In addition to nonspatial and spatial GLMs and GAMs, we systematically assess the predictive performance of machine learning methods, including random forests, extreme gradient boosting (XGBoost), and neural networks. Specifically, our aims are as follows: to compare the performance of different modeling techniques in revealing complex diversity-environment relationships and to improve global geo-statistical plant diversity models; to test hypotheses on plant diversity gradients related to geography, environmental heterogeneity, current climate, and past environmental conditions, and to quantify their relative importance for plant species and phylogenetic richness; and, to predict both facets of plant diversity at multiple grain sizes across the globe. Our study is based on c. 300 000 species from checklists and floras for 830 regions across the globe (Fig. S1) collated in the Global Inventory of Floras and Traits  GIFT; Notes S1), and a large, dated megaphylogeny of vascular plants (Jin & Qian, 2019).

Species distribution data and species richness
To calculate species and phylogenetic richness, we used the species composition of native vascular plants in regional checklists and floras from GIFT v.2.1: http://gift.unigoettingen.de). In GIFT, all nonhybrid species names are standardized and validated based on taxonomic information provided by The Plant List (v.1.1, http://www.theplantlist.org) and additional resources available via iPlant's Taxonomic Name Resolution Service (TNRS; Boyle et al., 2013;Weigelt et al., 2020). The original database contains > 3000 geographic regions representing islands, protected areas, biogeographical regions, and administrative units (e.g. countries and provinces). We excluded regions with incomplete native vascular plant checklists, incomplete data for predictor variables, or an area of < 100 km 2 . Furthermore, we coped with overlapping regions in two steps. First, for overlapping regions from one individual literature source, we kept only nonoverlapping regions preferring smaller over larger regions (e.g. the individual states of Brazil instead of the country). Second, for overlapping regions from different literature sources, we retained smaller and larger regions if smaller regions covered only parts of the larger regions. Otherwise, we removed the larger regions. A total of 298 087 vascular plant species from 775 mainland regions and 55 islands or island groups were used to proceed with the calculation of species richness (i.e. taxonomic richness) and phylogenetic richness. The geographic regions in the dataset were distributed representatively across the entire globe, covering all major biomes (Fig. S1).

Phylogeny reconstruction and phylogenetic richness
We used a large, dated megatree of vascular plants, GBOTB_extended (Jin & Qian, 2019), as a backbone to generate a phylogeny for all species in the dataset. The megatree was derived from the GBOTB tree for seed plants by Smith & Brown (2018) and the phylogeny for pteridophytes in Zanne et al. (2014). We excluded taxa not identified to the species level for calculating phylogenetic richness, leading to a dataset including 295 417 species in 466 families of vascular plants. All families and 10 128 out of 14 962 genera (67.7%) in the dataset were included in the megatree. We bound the remaining genera and species into their respective families and genera using 'Scenario 3' in the R package V.PHYLOMAKER (Jin & Qian, 2019). In 'Scenario 3', the weighted positioning of the additional taxa depends on the length and amount of already existing tips per taxon. 91.95% out of the 295 417 species in the dataset were from genera already present in the backbone. It is suggested that patterns of phylogenetic richness are similar regardless of whether the phylogeny used is resolved at the genus or species level (Qian & Jin, 2021). To test for the effect of adding missing genera to the phylogeny on phylogenetic richness, we carried out a sensitivity analysis and found consistent patterns, indicating that our method is robust (see Methods S1 for details).
Several indices exist for capturing different dimensions of phylogenetic diversity including richness, divergence, and regularity (Tucker et al., 2017). Here, we focus on phylogenetic richness, which represents the amount of unique phylogenetic history present in an assemblage (Tucker et al., 2017). We chose Faith's PD, a common measure of phylogenetic richness, calculated as the sum of the branch lengths of all species coexisting in a region (Faith, 1992), which is directly comparable to species richness. Even though highly correlated with species richness (Pearson's r = 0.98), we did not standardize phylogenetic richness (i.e. assessing the deviation of phylogenetic richness from expectations based on species richness) in our main analyses as we were not interested in whether the phylogenetic structure of a region is overdispersed or clustered, but rather aimed to capture both taxonomic and phylogenetic aspects of plant diversity. However, we present an analysis on the drivers of deviations in phylogenetic richness from species richness in Table S2.

Predictor variables
We identified a set of candidate predictor variables hypothesized to affect plant distributions and diversity and classified them into four categories: geography, current climate, environmental heterogeneity, and past environmental conditions. Twenty-five predictors were considered in the original dataset (Table S1). These have been shown or hypothesized to contribute to geographic patterns of plant diversity in previous studies (Kreft & Jetz, 2007;Kissling et al., 2012;Stein et al., 2014;Keil & Chase, 2019). Geographic variables were region area (km 2 ) and the summed proportion of landmass area in the surrounding area of the target region within buffer distances of 100, 1000, and 10 000 km, serving as a measure of geographic isolation (Weigelt & Kreft, 2013). Current climatic variables included 13 biologically relevant temperature and precipitation variables. These variables represent annual averages, seasonality, and limiting climatic factors (e.g. length of the growing season), capturing the main aspects of climate important for plant diversity (Karger et al., 2017). Furthermore, gross primary productivity (Zhao & Running, 2010) was included as a measure of potential plant productivity based on available solar energy and water. Climatic variables were extracted as mean values across the input raster layers per region. The number of soil types (Hengl et al., 2017) and elevational range (Danielson & Gesch, 2011) were calculated for each region as proxies for environmental heterogeneity within regions.
To determine the contribution of past environmental conditions to modern diversity patterns, we calculated biome area variation since the Pliocene and the Middle Miocene, temperature anomaly since the mid-Pliocene warm period, temperature stability since the last glacial maximum (LGM), and velocity of temperature change since the LGM. Terrestrial biomes are affected by multiple drivers containing atmospheric circulation, precipitation, and temperature patterns, and thus, changes in biome distributions represent major environmental changes through geological time. To calculate biome area variation, we used biome distribution maps at present (Olson et al., 2001), the LGM (c. 25-15 ka; Ray & Adams, 2001), the mid-Pliocene warm period (mid-Piacenzian, c. 3.264-3.025 Ma; Dowsett et al., 2016), and the Middle Miocene (c. 17-15 Ma; Henrot et al., 2010). The three paleo-time periods represented particularly different climates compared with present-day conditions and showed distinct biome distributions, which are hypothesized to have left imprints on current plant diversity Sandel et al., 2020). As biome definitions differed across the four datasets, we regrouped biomes to match across datasets and then calculated biome area changes (see Methods S2 for details;

Research
New Phytologist of this approach due to the coarse resolution and uncertainty of the original past biome maps. Because of the coarse resolution of the Middle Miocene map and absent data for some geographic regions, we used biome area variation only since the Pliocene and excluded Miocene biome variation from further analyses.
In addition, we calculated temperature stability from two paleo-time periods until present, that is, the LGM and the mid-Pliocene warm period, representing cooler and warmer climates than the current climate, respectively. Temperature stability since the LGM was calculated using the CLIMATESTABLITY R package (Owens & Guralnick, 2019). It takes temperature differences between 1000 yr time slices expressed as standard deviation and averages the results across all time slices. The stability is then calculated as the inverse of the mean standard deviation rescaled to (0,1). Temperature anomaly since the mid-Pliocene was calculated as the difference in mean annual temperature between the mid-Pliocene warm period and present day. The velocity of temperature change since the LGM was calculated as the ratio between temporal change and contemporary spatial change in temperature, representing the speed with which a species would have to move its range to track analogous climatic conditions (Sandel et al., 2011). For details on paleoclimate estimates, see Methods S2.
An alternative way to evaluate the effects of biogeographic history on plant diversity is to account for predefined discrete geographic regions influencing diversity via differences in diversification history and dispersal barriers. We therefore included floristic kingdoms (Takhtajan, 1986) as an additional categorical variable in the models and compared the performance of models with and without floristic kingdoms to assess whether we managed to model the effect of biogeographic history properly by only including the variables that directly quantify past environmental change.

Statistical models
Predictor variable selection To quantify diversity-environment relationships, we fitted five different types of models with species richness and phylogenetic richness as response variables: GLMs, GAMs, random forests, XGBoost, and neural networks. To compare model performance across model types, we used the same set of predictors across models. As there was significant collinearity between the 23 predictors in the initial dataset, we removed variables with low contribution to predictions until the variance inflation factors (VIFs) of all remaining variables were below a threshold of five. It has been suggested that a VIF value that exceeds five indicates a problematic amount of collinearity (James et al., 2013). The contribution to predictions was based on a preliminary ranking of predictor variables using random forests and a stepwise forward strategy for variable introduction (Genuer et al., 2015). Along these lines, we selected a subset of 15 predictor variables minimizing redundancy and maximizing model performance to fit models (bold in Table S1; Fig. S2). The predictors retained represented all aspects (geography, current climate, environmental heterogeneity, and past environment) that are hypothesized to affect plant diversity patterns.
Modeling To perform GLMs and GAMs, we used a negative binomial error distribution with a log link function for species richness to cope with the overdispersion of the response variable and a Gaussian error distribution with a log link function for phylogenetic richness. For the GLMs, some predictors were logtransformed owing to their skewed distribution (i.e. area, temperature seasonality, number of wet days, precipitation seasonality, precipitation of warmest quarter, gross primary productivity, elevational range, number of soil types, and velocity in temperature since the LGM). After log transformation, all continuous predictor variables were standardized to zero mean and unit variance to aid model fitting and make their parameter estimates comparable. Although fitting GLMs with 15 predictors might seem excessive, it is suggested not to exclude predictors hypothesized to be important when collinearity is minimized and not a hindrance to analysis (Morrissey & Ruxton, 2018). Thus, in our GLMs, we built the full model including 15 predictors and then simplified the model using Akaike's information criterion (AIC). Predictors were tested in turn, and removed if AIC values were larger in the complex models than in the simpler ones (Phillips et al., 2019; Table S4). To account for the interactive effects of environmental predictors on diversity patterns, we fitted GLMs including energywater, energy-environmental heterogeneity, and area-environment interactions, as suggested by previous studies (Kreft & Jetz, 2007;Stein et al., 2014;Keil & Chase, 2019). Models including interactions were simplified based on AIC values. First, all interactions were tested, and then, any main effects (i.e. individual predictors) that were not included in the retained interactions were tested (Phillips et al., 2019). In GAMs, we used penalized regression smoothers (with nine spline bases for species richness and 10 spline bases for phylogenetic richness) for each predictor to estimate the smooth terms. The number of spline bases was selected from values between two and 10 using random cross-validation to optimize model performance (i.e. minimizing the root-mean-square error (RMSE)). Additionally, we used a gamma value of 1.4 to reduce overfitting without compromising model fit (Wood, 2006) and also included a double penalty to variable coefficients. We used the R packages MASS (Venables & Ripley, 2002) to fit negative binomial GLMs and MGCV (Wood, 2006) to fit GAMs.
In addition, we applied machine learning techniques, that is, random forests, XGBoost, and neural networks, to fit global models of plant diversity. Random forests are an ensemble learning method that builds a large collection of decision trees and outputs average predictions of the individual regression trees, while XGBoost is an ensemble model of decision trees trained sequentially fitting the residual errors in each iteration. Several innovations make XGBoost highly effective, including a novel tree learning algorithm for handling sparse data and a theoretically justified weighted quantile sketch procedure enabling handling instance weights in approximate tree learning (Chen & Guestrin, 2016). Neural networks are a machine learning method that comprises a collection of connected units (neurons) and their connections (edges). For these machine learning methods, species and phylogenetic richness were log-transformed before modeling to reduce the skewness of their distributions. A set of tuning parameters (i.e. hyperparameters), which cannot directly be estimated from the data, needs to be set beforehand. These hyperparameters determine the training strategy and related efficiency of the algorithms. It is commonly suggested to tune hyperparameters to maximize model performance before running models for a certain problem (Bergstra & Bengio, 2012). We used the train function from the R package CARET to optimize the model tuning parameters for the three machine learning models used here (Kuhn, 2008). We used repeated random crossvalidation and selected the hyperparameters that produced the lowest RMSE. We then refitted the final models using these optimal hyperparameters. The R package RANGER was used to fit random forests (Wright & Ziegler, 2017), XGBOOST to fit XGBoost (Chen & Guestrin, 2016), and NEURALNET to fit neural networks (G€ unther & Fritsch, 2010). Unlike GLMs and GAMs, machine learning can detect and model interactions of predictors without a priori specification, and we visualized interactions in machine learning models using partial dependence plots. For details on tuning parameters, model fitting using machine learning techniques, and visualization of interactions, see Methods S3.
Spatial terms Species distribution data and environmental predictors are often spatially autocorrelated. On the one hand, this might lead to biased parameter estimates, which need to be accounted for (Dormann et al., 2007). On the other hand, including spatial information in models could increase their predictive power (Keil & Chase, 2019). Because of this, we generated spatial models using different modeling techniques. To account for spatial autocorrelation in GLM residuals, we used simultaneous autoregressive (SAR) models of the spatial error type, which is recommended for use when dealing with spatially autocorrelated species distribution data (Kissling & Carl, 2008). We evaluated SAR models with different neighborhood structures and spatial weights (lag distances between 200 and 3000 km, weighted and binary coding). As the final SAR model, we chose a model with weighted neighborhood structure and 800 km lag distance both for species and for phylogenetic richness, which had the minimal AIC and the best reduction in spatial autocorrelation in the residuals. Species and phylogenetic richness were log-transformed before modeling. In GAMs, we added a two-dimensional spline on geographical coordinates, which accounts for spatial autocorrelation in model residuals (Dormann et al., 2007;Keil & Chase, 2019). To cope with spatial autocorrelation in machine learning models, we included cubic polynomial trend surfaces (i.e. latitude (Y), centered longitude (X) as well as X 2 , XY, Y 2 , X 3 , X 2 Y, XY 2 , and Y 3 ; Bjorholm et al., 2005;Li, 2019). Overall, the spatial models successfully removed spatial autocorrelation from model residuals (Fig. S3).
Comparison with established models To compare our models to published global models of plant species richness, we rebuilt these models for the dataset analyzed here. First, we fitted the best model as in Kreft & Jetz (2007), a combined six-predictor model using GLMs; and second, we built a GAM using the same model structure as Keil and Chase's smooth model (Keil & Chase, 2019), which contained a two-dimensional spline on geographical coordinates, 15 single predictors, and interactions between each individual predictor and area. We ran models including the same 15 predictor variables and floristic kingdom using random forests and XGBoost and compared them with the models without floristic kingdom. Adding floristic kingdom increased collinearity between predictors. However, the two treebased models are able to handle multicollinearity when they are used for prediction. Random forests in the RANGER R package can handle categorical variables automatically; however, XGBoost works only with numeric vectors. We therefore converted all other forms of data into numeric vectors. Here, we used one-hot encoding (0,1) to convert the floristic kingdom into dummy variables for the XGBoost model.
Variable importance To estimate the relative importance of each environmental predictor, we used a consistent method across model types. We randomly reshuffled values of the predictor of interest in the dataset, predicted the response variables based on the modified dataset, and calculated the Spearman rank correlation coefficient between those predictions and the predictions using the original dataset. The relative importance of the predictor of interest was calculated as one minus the correlation coefficient divided by the sum of one minus the correlation coefficients of all predictors . Likewise, to compare the relative importance of different categories of predictor variables (categories in Table S1), we permuted values of a subset of predictors belonging to one category, correlated the predictions of the model using the modified dataset and predictions using the original dataset, and estimated the importance of each category as one minus the Spearman rank correlation coefficient divided by the sum of one minus the correlation coefficients of all predictor categories. Relationships between diversity metrics and predictor variables were visualized as partial dependence plots (see Methods S3 for details).

Cross-validation
To assess the accuracy of model predictions across all different model types, we used random 10-fold cross-validation and spatial 68-fold cross-validation following Ploton et al. (2020; for details, see Methods S4). To quantify model predictive performance, we summarized the cross-validation results using the RMSE and two different pseudocoefficients of determination to quantify the amount of variation explained by the model based on out-of-bag samples. R 2 _CORR is the coefficient of determination of a linear model of the predicted and observed values from all repetitions of the cross-validation. R 2 _Accuracy is the amount of variation explained by the model, calculated as R 2 _Accuracy = (1 À SSE/ SST) (Hengl et al., 2017), where SSE is the sum of the squared error between observation and prediction and SST is the total sum of squares. The model with the lowest RMSE and highest R 2 _CORR/R 2 _Accuracy was identified as the best predictive model. For all models, we calculated cross-validation results for log-transformed observed and predicted species and phylogenetic richness, because species and phylogenetic richness were logtransformed before modeling for machine learning models and fitted with log link functions in GLMs and GAMs.

New Phytologist
Variation explained according to spatial cross-validation was consistently lower than variation explained according to random cross-validation, likely because the former offers biased and pessimistic estimates (Wadoux et al., 2021). Spatial cross-validation excludes entire portions of regions with specific combinations of environmental characteristics and biogeographic histories from the training data and is therefore less representative of the globe and its environmental spectrum, likely causing predictions outside the covariate space within the models. By contrast, random cross-validation is almost unbiased when the sampling design is systematic or random (Wadoux et al., 2021). Because the geographic regions in our dataset were distributed representatively across the entire globe, covering all major biomes (Fig. S1), we argue that random cross-validation offers relatively unbiased assessments of model performance.

Predictions
We used the resulting models to predict vascular plant species and phylogenetic richness across global grids of four different resolutions (i.e. 7774, 23 322, 69 967, and 209 903 km 2 hexagon size). We used the DGGRIDR R package (Barnes & Sahr, 2017) to produce a grid of equal-area and equidistant hexagons across the Earth's surface clipped for global coastlines. Islands smaller than 1.5 times the grid cell size were treated as entire entities instead of subdividing them into several partial grid cells. For each hexagon, we calculated the same predictor variables as for the geographic regions used for fitting the models. We then used the models to predict vascular plant species and phylogenetic richness and mapped the predictions across the hexagon grid. Due to missing values in some predictor variables, a few values had to be interpolated for predictions (see Methods S5 for details).
Besides predictions based on individual models, we used an ensemble prediction procedure, which averages the predictions based on the models fitted by different techniques weighted by model accuracy (the inverse of the model squared error) from the random cross-validation process (Marmion et al., 2009). Because spatial cross-validation was likely biased (Wadoux et al., 2021), we used model accuracy from random cross-validation. In addition to the hexagon grids, we generated plant diversity maps in raster format at a resolution of 30 arc seconds based on predictions for the 7774 km 2 hexagons (see Methods S5; Fig. S4). As centers of plant diversity based on the ensemble predictions, we defined regions with predicted richness values higher than the 90 th quantile, that is, containing at least 1765 plant species and 41 866 Ma of phylogenetic richness at a resolution of 7774 km 2 .

Uncertainty
To assess variation of the predictions across models, we calculated the coefficient of variation of predicted values for each hexagon grid cell. The coefficient of variation is defined as the ratio of the standard deviation to the mean, which accounts for the differences in diversity between regions and thereby avoids artificially high uncertainty of high-diversity regions. Additionally, we calculated standard errors of predictions for GLMs, GAMs, and random forests. For XGBoost and neural networks, we modeled the relationship between model residuals and environmental predictors from the raw data and used this model to predict uncertainty across the hexagon grids.

Performance of plant diversity models
Our results reveal a great potential of machine learning, particularly decision tree methods, for modeling plant diversity-environment relationships and for accurately predicting plant diversity across various scales. Overall, the predictive power of the models was high (Table 1). Machine learning models and GAMs outperformed GLMs, and spatial models (i.e. models containing spatial terms to account for the spatial nonindependence of regions; Dormann et al., 2007) showed an overall better performance than nonspatial models (except GLMs for species richness). Extreme gradient boosting, an ensemble of sequentially trained decision trees, produced the most accurate predictions both for species richness (70.3% variation explained based on spatial cross-validation and 80.9% based on random crossvalidation) and for phylogenetic richness (73.7% and 83.3%, respectively), which was consistent across spatial and nonspatial models.
The good predictive performance of machine learning models can be attributed to their ability to uncover complex, nonlinear diversity-environment relationships (Figs S5, S6) and interactive effects (Figs S7-S18). We found strong interactions between spatial terms and environmental variables (Figs S7-S18). This indicates regional differences in plant diversity and diversityenvironment relationships and shows that different combinations of environmental variables are important when predicting diversity across geographic regions (Keil & Chase, 2019). Moreover, machine learning models revealed strong interactions between energy and water availability, energy and environmental heterogeneity, as well as area and environmental variables (Figs S7-S18). Also, the accuracy of GLMs increased when including the interactions that turned out to be important in machine learning models (70.4% vs 63.6% in species richness based on random cross-validation; 63.5% vs 45.2% in phylogenetic richness), highlighting the role of complex interactive effects among biotic and abiotic factors in shaping global plant diversity patterns (Francis & Currie, 2003;Kreft & Jetz, 2007;Keil & Chase, 2019). By implicitly accounting for grain dependence and complex interactions among spatial and environmental variables, our machine learning models outperform previous models of plant diversity (Kreft & Jetz, 2007;Keil & Chase, 2019 ; Table S4), improving our understanding of diversity-environment relationships and allowing for improved predictions of plant diversity across scales.

Drivers of global patterns of vascular plant diversity
Current climatic variables emerged as the most important drivers of plant diversity, accounting for 34.4-48.1% of the variation in species richness and 39.7-58.2% in phylogenetic richness across models ( Fig. 1; Table S1). High energy and water availability and low seasonality promoted species and phylogenetic richness (Figs S5, S6), supporting other large-scale studies that report strong effects of the current climate on plant diversity (Francis & Currie, 2003;Hawkins et al., 2003;Kreft & Jetz, 2007). Environmental heterogeneity (measured here as elevational range and number of soil types within a region) explained 21.0-40.9% of the variation in species richness and 16.3-27.2% in phylogenetic richness, with increasing heterogeneity leading to higher plant diversity as expected (Stein et al., 2014). Even though species and phylogenetic richness were highly correlated (Pearson's r = 0.98), some differences emerged in diversity-environment relationships. For example, environmental heterogeneity explained less variation in phylogenetic richness than in species richness. This potentially reflects a signal of in situ speciation that is promoted by high environmental heterogeneity, creating clusters of closely related species resulting in relatively low phylogenetic richness compared with species richness (Forest et al., 2007). This notion was also supported by a negative effect of the number of soil types on the residual variation of phylogenetic richness after accounting for species richness (Table S2).
Geographic variables (area and geographic isolation) explained 9.8-23.1% of the variation in species richness and 18.0-24.6% in phylogenetic richness. Larger regions tend to have higher in situ speciation rates owing to more opportunities for geographic isolation within a region and lower extinction rates due to larger populations (Terborgh, 1973;Kisel & Barraclough, 2010). These effects should be most pronounced in selfcontained, isolated regions like islands, mountains, or other isolated habitats and less so in regions that are similar to their surroundings (Rosenzweig, 2003;Testolin et al., 2021). Additionally, larger regions often provide a greater variety of habitats, offering more environmental niches to be occupied by species. Geographic isolation, measured here as the proportion of surrounding landmass, did not explain much variation (0.0-3.9% in species richness; 0.5-3.5% in phylogenetic richness; Fig. S19) for both diversity facets, possibly because our dataset consisted mainly of mainland regions (93.4% of all regions). While geographic isolation is a main driver of insular plant diversity (Weigelt & Kreft, 2013), isolation and peninsular effects seem to play only a minor role on the mainland, where geographic isolation can be expected to be more important for compositional uniqueness of regions and endemism, rather than for richness (Sandel et al., 2020).
We hypothesized that higher plant diversity would accumulate in regions with long-term climate stability because of low extinction and high speciation rates (Fine, 2015;Svenning et al., 2015). We therefore assessed the effects of temperature stability and biome variation as proxies for past climatic change for two paleo-time periods, that is, the LGM and the mid-Pliocene warm period. In contrast to the expected legacy effects of historical variables on modern plant diversity, past Each model was evaluated for its predictive performance using both random 10-fold and spatial 68-fold cross-validation. Nonspatial models were fitted with 15 predictors representing geography, current climate, environmental heterogeneity, and past environment conditions (Supporting Information  Table S1) except for the minimum adequate generalized linear model (GLM) and the GLM with interaction terms. Spatial models in addition contained spatial terms (i.e. simultaneous autoregressive (SAR) models, generalized additive models (GAMs) including splines of geographic coordinates, and machine learning methods including cubic polynomial trend surfaces). The minimum adequate GLM was obtained by simplifying the full GLM based on Akaike's information criterion (AIC). The GLM with interaction terms was fitted including all predictors of the full GLM and interactions of energy-water, energyheterogeneity, and area-environment-related variables and was then simplified based on AIC. Because the response variables (i.e. species and phylogenetic richness) were log-transformed in models, the accuracy statistics are provided on a log scale. Based on all out-of-bag samples, values shown are rootmean-square error (RMSE); the amount of variation explained by the model calculated as one minus the ratio of the sum of the squared error between observation and prediction to the total sum of squares (R 2 ). For more detailed cross-validation results, see Table S4.

Research
New Phytologist environmental conditions contributed only 0.8-5.5% to explaining species richness in most of our models, but up to 23.8% in neural networks. Likewise, past environmental conditions showed higher explanatory power (15.0%) for phylogenetic richness in neural networks than in other models (4.0-8.5%). Models including spatial trend surfaces or discrete biogeographic regions (i.e. floristic kingdoms) to account for regional idiosyncrasies (after statistically controlling for current and past environments) further improved model fits (Tables 1, S4). This suggests that in addition to climate stability since the LGM or mid-Pliocene warm period, biogeographic history predating the Pliocene or regional idiosyncrasies other than climatic changes affected modern plant diversity. These historical regional effects are possibly due to dispersal barriers and idiosyncratic colonization and diversification histories (Qian & Ricklefs, 2004;Ricklefs & He, 2016).

Improved global plant diversity maps
We produced global diversity maps for species and phylogenetic richness of vascular plants, based on individual well-performing models and model ensembles. Because of its outstanding predictive power and ability to handle missing data, we consider XGBoost (including geographic coordinates) the most powerful single model for predicting plant diversity (Figs S20d, S21d). In addition, we present ensemble predictions, which reduce the uncertainty introduced by the choice of one particular modeling technique and therefore improve prediction accuracy (Marmion et al., 2009). Including region area and its interactions with other predictor variables allowed us to predict plant diversity across global grids of equal-area and equidistant hexagons of different grain sizes (i.e. 7774, 23 322, 69 967, and 209 903 km 2 ; Figs S22, S23). All model predictions and their uncertainties are accessible at https://gift.uni-goettingen.de/shiny/predictions/.
Our ensemble predictions (Fig. 2a,d) describe the global patterns of species and phylogenetic richness with unprecedented detail and accuracy. The maps capture how diversity varies along environmental gradients and identify global centers of plant diversity (Fig. 2b,e). The highest concentrations of plant species and phylogenetic richness are predicted in Central America, southern Mexico, Andes-Amazonia, the Caribbean, southeastern Brazil, the Cape region of Southern Africa, Madagascar, Malay Archipelago, Indochina, and southern China (Fig. 2b,e), which is in line with empirical observations and previous studies (Myers et al., 2000;Barthlott et al., 2005;Kreft & Jetz, 2007). While patterns of phylogenetic richness closely resembled species richness (Pearson's r = 0.97), discrepancies occurred, for example, around the Mediterranean, in Central America, the Caucasus, and the Himalayas (Fig. S24). Differences might result from unequal taxonomic efforts (e.g. many closely related species described separately in Europe) or the uneven distribution of evolutionarily old or young clades across the globe (Thorne, 1999;Endress, 2001). The former suggests that predictions of phylogenetic diversity provide a taxonomically less biased representation of global plant diversity patterns.
Thanks to the high-resolution environmental data and modeling techniques that account for complex interactions, regions with steep elevational gradients show finer-tuned variation in predicted effects presented here than in previous studies Kreft & Jetz, 2007). For example, the eastern slopes

Research
New Phytologist maps could derive from the accumulation of knowledge on plant diversity worldwide and the continuously updated species distribution data in GIFT used for modeling.
Regions with high species and phylogenetic richness were found to be distributed mostly in mountainous regions (Fig. S26). Specifically, tropical mountain ranges, including the tropical Andes, eastern African highlands, and various Asian mountains (e.g. in southern China and the Malay Archipelago), are global centers of plant diversity. The high diversity of tropical mountain ranges, as also found in previous studies (Testolin et al., 2021), is linked to warm and wet climates and heterogeneous environments (Antonelli et al., 2018). Multiple biogeographical and evolutionary processes, including speciation, dispersal, and persistence that are driven by long-term orogenic and climatic dynamics in mountains, have led to outstanding regional plant diversity (Antonelli et al., 2018;Rahbek et al., 2019). Orogenic processes constantly change soil composition, nutrient levels, and local climate of mountainous regions, thus creating novel and heterogeneous habitats where plant lineages diversify and colonize from neighboring areas (Antonelli et al., 2018). Moreover, climatic fluctuations stimulate diversification by driving dynamic shifts in habitat connectivity within mountains (Rahbek et al., 2019). Due to their steep environmental gradients and heterogeneous nature, mountain regions provide refugia in times of unfavorable climate (Bennett et al., 1991;Rahbek et al., 2019).
Differences among models (measured as the coefficient of variation) were greatest in regions with extreme environments, such as deserts and Arctic regions (Fig. 2c,f). Arctic regions also consistently showed the highest prediction uncertainty across models (Figs S27, S28). The uncertainties in regions with extreme environments probably stem from two sources. First, extremely species-poor regions might be less well represented in published diversity data. Regions with extreme environments are often part of artificially delimited regions instead of being sampled individually (e.g. Chad and Libya sampled instead of the Sahara). Those artificially delimited regions are more environmentally heterogeneous, which attenuates the extreme values of environmental factors as well as plant diversity. Machine learning models are known to not extrapolate well under such conditions (Elith et al., 2010). Second, even for regions with relatively homogeneous environments, checklists and floras do not only include information on predominant but also azonal vegetation, making them richer than expected from their prevailing conditions and observed at a more local scale (compared to alpha diversity predictions in Sabatini et al., 2022).

Conclusions
We present the most accurate and comprehensive predictive global maps of regional vascular plant species and phylogenetic richness available to date. They are based on significantly improved global models using comprehensive global inventory-based plant distribution data, high-resolution past and current environmental information, and advanced machine learning models. Our findings illustrate that machine learning methods applied to large distribution and environmental datasets help to disentangle underlying complex and interacting associations between the environment and plant diversity. Machine learning methods therefore help to improve both fundamental understanding and quantitative knowledge in biogeography and macroecology. The updated global diversity maps of vascular plant diversity at multiple grain sizes (available at https://gift.uni-goettingen.de/shiny/ predictions/) provide a solid foundation for large-scale biodiversity monitoring and research on the origin of plant diversity and subsequently support future global biodiversity assessments and environmental policies.                           Uncertainty in predicted species richness from the five models used for the ensemble predictions (i.e. spatial models using machine learning methods and generalized additive models and a nonspatial generalized linear model with interactions).

Fig. S28
Uncertainty in predicted phylogenetic richness from the five models used for the ensemble predictions (i.e. spatial models using machine learning methods and generalized additive models and a nonspatial generalized linear model with interactions).
Methods S1 Sensitivity analyses of phylogenetic richness.

Methods S2 Past environmental variables.
Methods S3 Statistical models.

Methods S4 Cross-validation.
Methods S5 Handling of missing values in predictor variables for predicting and calculating predictions in raster format.
Notes S1 References of checklists and floras from the Global Inventory of Floras and Traits (GIFT) used to compile the regional species composition data.

Table S1
List of environmental predictor variables hypothesized to affect plant diversity patterns.

Table S2
Coefficients of a linear model between the residuals (deviation) from the linear regression between species richness and phylogenetic richness, and the 15 predictor variables identified to best explain plant diversity.