Figure 1 provides an overview of the methodology developed to create our harmonized global dataset. Each of the steps is described in detail in the following subsections. In brief, the methodology involves: (1) pre-processing OSM data – in this step, we disaggregate the global unprocessed OSM database to create an individual data file for each country; (2) extraction – in this step, we extract the geospatial location for a selection of CI; (3) rasterization – in this step, we develop a consistent rasterized dataset containing information on the amount of CI; and (4) the composition of CISI – in this step, we summarize the geospatial information from step 3 to calculate an index to express the spatial intensity of CI.

Fig. 1

Schematic display of the workflow. The green panel represents the part of the model that performs calculations at a national scale, and the blue panel represents the part of the model that performs calculations at a global scale. On the right side, the purple-coloured boxes show the specifications required for the model. The yellow box indicates the spatial input required.

Pre-processing of OSM data

Central to the development of the global dataset is the integration of open data collected and provided by OSM. The goal of this platform is to create and distribute free and openly accessible geospatial and attribute information on the world’s features. With 4.5 million map changes per day, the OSM database counted approximately 15.5 billion georeferenced features as of 26 November 2020 [34].

Geographical features in OSM are represented in the form of nodes, ways and relations. A node represents a specific point in space and is defined by its latitude and longitude (e.g. telecommunication tower). A way is a line made up of two or more connected nodes (e.g. road). A polygon (or area) is a closed way: it is created when the last node of a series of line segments is connected to the first node (e.g. hospital). Another datatype, the relation, is an ordered list of features that groups nodes, ways and other relations into a larger unit. An example of unprocessed OSM data, including a breakdown of the basic datatypes, is shown in Fig. 2. Each georeferenced element in OSM has an id number that uniquely identifies it, and includes other details such as the user who last modified the element and the time of the last modification. Elements can be further specified by a list of attribute tags in the form of key-value pairs, whereby the value adds detail to the key. For example, primary roads that often link larger towns are specified under the key ‘highway’ in combination with the value ‘primary’.
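The data model described above can be illustrated with a minimal Python sketch. The class names, ids and coordinates are hypothetical and chosen only for illustration; apart from ‘highway’ = ‘primary’, the tags shown are examples and not necessarily the tags selected in this study.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A point in space with optional key-value tags."""
    id: int
    lat: float
    lon: float
    tags: dict = field(default_factory=dict)


@dataclass
class Way:
    """An ordered list of node ids; a closed way describes a polygon."""
    id: int
    node_ids: list
    tags: dict = field(default_factory=dict)

    def is_closed(self):
        # A polygon needs at least three distinct nodes plus the repeated first node.
        return len(self.node_ids) >= 4 and self.node_ids[0] == self.node_ids[-1]


def has_tag(element, key, value):
    """True if the element carries the given key-value pair."""
    return element.tags.get(key) == value


# Illustrative elements (ids and coordinates are made up).
tower = Node(1, 52.37, 4.90, {"man_made": "tower"})
road = Way(10, [1, 2, 3], {"highway": "primary"})
hospital = Way(11, [4, 5, 6, 4], {"amenity": "hospital"})
```

In this sketch, the road is an open way and the hospital is a closed way, which mirrors the distinction between line and polygon features made above.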

Fig. 2

Visualization of raw OpenStreetMap data of a given area, with a breakdown by the datatypes.

The global OSM dataset, containing all infrastructure mapped to date, is available via https://planet.openstreetmap.org/. We downloaded it in PBF format on 8 January 2021 and subsequently disaggregated the OSM planet file into smaller PBF files at national level using publicly available code [35].

Extraction of critical infrastructure

The second step is to extract all the unique CI assets from the OSM dataset. No clear guiding rules exist on which specific infrastructure assets should be considered critical [36,37], and the way the definition of CI is interpreted may vary per country. In this study, we represent the world’s infrastructure network by seven overarching CI systems: transportation, energy, water, waste, telecommunication, education, and health. This is in line with the classification of infrastructure systems discussed in the literature [1,2,8,21], whereby infrastructure related to education and health has recently started to gain increasing attention [22]. We further subdivided these CI systems into a total of ten subsystems. Each subsystem contains two or more specific infrastructure types; for example, the telecom subsystem contains the infrastructure types communication tower and mast. For an overview of the classification of the seven CI systems, ten subsystems, and the selection of infrastructure types, refer to Section Data Records. From the list of active OSM key-value pairs [38], we selected 81 OSM tags to represent 39 infrastructure types (see Supplementary Table 1 for the categorization of CI, and the reclassification).

The specified infrastructure types are extracted from the pre-processed OSM files at national level. We define \(\iota_{t,xy}\) as a unique CI asset \(\iota\) containing a specific set of xy coordinates, belonging to a specific infrastructure type t. We then define \(I_{t,xy}=\left\{\iota_{t,1},\ldots,\iota_{t,n}\right\}\) as the set of all n CI assets of a specific infrastructure type. This may be, for instance, a set of CI assets that represents the infrastructure type telecom towers. Finally, we clip the set \(I_t\) with administrative boundaries to ensure that we only capture CI assets that fall within the administrative boundaries of a country.
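The selection and clipping step can be sketched as follows for point assets. This is a simplified illustration and not the extraction code used in this study: the lookup table `TAG_TO_TYPE`, the element layout, and the function names are hypothetical, the boundary is a single simple polygon, and a basic ray-casting test stands in for a full GIS clipping operation. Lines and polygons would additionally need to be cut at the boundary, which is not shown.

```python
# Hypothetical lookup from OSM (key, value) pairs to infrastructure types.
TAG_TO_TYPE = {
    ("highway", "primary"): "primary_road",
    ("amenity", "hospital"): "hospital",
}


def point_in_polygon(x, y, polygon):
    """Ray-casting test; polygon is a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray from (x, y) cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def extract(elements, infra_type, boundary):
    """Return the set I_t: assets of one type located inside the boundary."""
    selected = []
    for el in elements:
        types = {TAG_TO_TYPE.get(kv) for kv in el["tags"].items()}
        if infra_type in types and point_in_polygon(el["x"], el["y"], boundary):
            selected.append(el)
    return selected


# Made-up country boundary and elements.
boundary = [(0, 0), (10, 0), (10, 10), (0, 10)]
elements = [
    {"id": 1, "x": 5, "y": 5, "tags": {"amenity": "hospital"}},
    {"id": 2, "x": 15, "y": 5, "tags": {"amenity": "hospital"}},  # outside
    {"id": 3, "x": 5, "y": 5, "tags": {"amenity": "school"}},     # other type
]
hospitals = extract(elements, "hospital", boundary)
```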

The extraction of CI results in almost 153 million unique OSM elements. The data have a global coverage, with the highest number of unique CI assets found in the United States, followed by Germany and Japan (Fig. 3a). The lowest numbers of unique CI assets per country are mainly found in the Small Island Developing States (SIDS) and other small islands spread across the Atlantic and Pacific Oceans. Fig. 3b shows the number of unique elements per main CI system, further specified by income class. Here, we identify a general pattern that holds for all seven main CI systems: the high-income countries have the largest share of unique CI assets for each CI system, whereas the low-income countries have the smallest share. The high-income countries account for 60.8% of the extracted OSM elements, upper-middle-income countries for 21.4%, lower-middle-income countries for 14.5%, and the low-income countries for 3.3% (see details aggregated to the country level in Supplementary Table 2).

Fig. 3

Distribution across space and statistics of the unique CI assets (point, line, and polygon data) that are returned after the extraction of OSM data. Panel (a) presents the number of unique CI assets per country. Panel (b) presents the percentage of unique CI assets for four income classes categorized by the main CI systems. Panel (c) presents the share of unique CI assets per main CI system.

Transportation

The transportation system is sub-divided into three subsystems: roads, railways, and airports. The road network provided by OSM had a completeness level of approximately 83% as of January 2016 [39]. We aggregate the 15 classes originally used by OSM to describe roadways into three classes: primary, secondary and tertiary roads. For the subsystem railways, we selected seven OSM key-value pairs that were aggregated to one common class (see Supplementary Table 1).

The transportation system dominates the dataset: 84% of the extracted CI elements belong to it (Fig. 3c). Fig. 4 provides more detail on the composition of the transportation system by showing the percentage of extracted unique assets per infrastructure type. Tertiary roads account for most of the unique assets (90%). The total length of road infrastructure extracted from OSM is over 51 million kilometers, of which approximately 42.6 million is tertiary, 5.1 million primary, and 3.7 million secondary. We extracted over 2 million kilometers of railway infrastructure, and 17,508 airports worldwide.

Fig. 4

Relative number of unique CI assets extracted for the 39 infrastructure types, categorized by the seven main CI systems: transportation, energy, telecommunication, waste, water, education, and health.

It is worth noting that ports are not explicitly specified in this study, even though they serve as critical hubs of the transportation network. However, several of the CI assets that we included, such as road and energy assets, are typically situated in ports, so many of the CI assets of ports are captured implicitly.

Energy

We selected seven infrastructure types for the representation of the energy system. These infrastructure types are related to the production, conversion and delivery of energy, and include the following: cable, line, minor line, power tower, power pole, plant, and substation.

Cables are described by OSM as insulated assets that allow electrical power transmission or distribution in complex environments, such as indoors, underground, or undersea. In contrast, power lines are energy assets that are built above the surface and are usually carried by supporting structures. Here, OSM distinguishes between power lines that are supported by power towers, and minor power lines that are supported by poles and used for low-voltage transmission. A power plant is an industrial, large-scale facility for the generation (or storage) of electricity. In general, a facility is tagged as a power plant if it generates more than 1 MW. Substations are used for the transmission and distribution of electricity within the energy network, whereby they transform high voltages to low voltages, or vice versa.

The composition of the energy system is presented in Fig. 4. In total, the dataset consists of 28,750 kilometers of power cables, over 4.3 million kilometers of power lines, and 571,416 minor lines. We find over 20 million supporting structures, of which 64% are power towers and the remaining 36% power poles. The dataset contains 16,193 plants and 167,190 substations globally.

Telecommunications

The telecommunications system is represented by two infrastructure types: communication tower and mast. We used a combination of three key-value pairs to extract these infrastructure types (see Supplementary Table 1). Communication towers are used for transmitting a range of radio applications (e.g. television, radio, and mobile phone), are often characterized by a height of over 100 meters, and are usually made of concrete. Masts, in contrast, are usually only used for a single application, and are a couple of meters high. Globally, the dataset counts 141,478 communication towers and 80,750 masts (see Fig. 4).

Waste

For the waste system, we made a distinction between solid waste and water waste. Accordingly, the waste system is sub-divided into two subsystems. The solid waste subsystem is represented by the infrastructure types waste transfer station and landfill. We represent the water waste subsystem using the infrastructure type water waste treatment plant. Solid waste is consolidated and transferred in bulk at waste transfer stations, whereas water waste is treated at water waste treatment plants. Landfills are sites for permanent or long-term storage of consolidated waste materials (that often come from waste transfer stations). We extracted 1,951 waste transfer stations, 34,551 landfill sites, and 15,870 water waste treatment plants at a global scale (see Fig. 4).

Water

The CI system water comprises infrastructure that is critical for the water supply. We selected five infrastructure types that provide services for the extraction, distribution, and storage of both potable and non-potable water: water tower, water well, reservoir covered, reservoir, and water works. A water well is used to extract groundwater. Water works and water towers are both critical for the distribution of water. Here, water works are facilities that are used to supply water to the water pipe network, and water towers are elevated structures that pressurize the distribution network. OSM categorizes large man-made tanks for the storage of water as reservoir covered, whereas reservoir refers to artificial lakes used to store water. The extraction process resulted in a total of 370,218 unique water elements, with the infrastructure type reservoir having the highest contribution of 89%. Globally, we find 14,947 water towers, 4,801 water wells, 12,762 covered reservoirs, and 7,792 water works (see Fig. 4).

Education

The subsystem education is represented by five infrastructure types: college, kindergarten, library, school, and university. We extracted 863,928 education facilities from the OSM database. As is presented in Fig. 4, approximately 74.5% of the extracted education facilities are attributed to schools, followed by kindergartens (13.8%), universities (4.3%), colleges (4.0%), and libraries (3.4%).

Health

During the Ebola epidemic of 2014 in West Africa, a need arose for readily available information on the location of health facilities as well as specifics associated with a health facility (e.g. name of facility, number of doctors). As a result, the Global Healthsites Mapping Project (https://www.healthsites.io) was launched with the aim of collecting and validating a freely accessible global dataset on health facilities, in collaboration with OSM and other partners. Data on health facilities that are contributed via Healthsites.io are written to the OSM database, and vice versa. The types of health facilities included in this research are based on the list of health facilities that is defined by Global Healthsites and partners [40]. This list includes the following facilities: doctor, pharmacy, hospital, clinic, dentist, physiotherapist, alternative, laboratory, optometrist, rehabilitation, blood donation, and birthing center.

We developed a procedure to include all georeferenced health facilities in a uniform way. In general, many infrastructure types can be georeferenced as either point or polygon geometries. We found this inconsistency in georeferencing to be a substantial problem for the spatial completeness of health facilities. An examination of 16 randomly selected countries shed light on how the use of the two datatypes affects the spatial completeness of the georeferenced health facilities. Extracting health facilities only as polygon geometries would exclude the health facilities that are exclusively tagged as point geometries, and vice versa, which reduces the spatial completeness of the health facility dataset.

The procedure entails the following steps. The set of facilities georeferenced as polygon data is merged with the set of facilities georeferenced as point data. However, prior to this, we check whether each polygon spatially intersects with point data in order to avoid double-counting. If a spatial intersection exists, the polygon is only removed from the dataset if the intersecting point concerns the same infrastructure type. This means that, for example, a specific hospital that is tagged as polygon geometry will only be removed from the dataset if: (1) it has a spatial intersection with a point feature; and (2) this point feature is tagged as a hospital. A filtered set of facilities georeferenced as polygon data remains, which is subsequently transformed into point geometries by taking the centroid of each polygon. Finally, this is merged with the set of facilities that are georeferenced as points in the original dataset. The number of health facilities at the global scale is 862,548. The composition of unique CI assets for the twelve infrastructure types defined for the CI system health is illustrated in Fig. 4.
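The de-duplication procedure can be sketched in plain Python as follows. This is an illustrative stand-in rather than the code used in this study: the record layout and function names are hypothetical, polygons are assumed to be simple rings in planar coordinates, and a ray-casting test and the shoelace centroid formula replace the operations a GIS library would normally provide.

```python
def point_in_polygon(x, y, ring):
    """Ray-casting test; ring is a list of (x, y) vertices."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            if x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
                inside = not inside
    return inside


def centroid(ring):
    """Area-weighted centroid of a simple polygon (shoelace formula)."""
    area2 = cx = cy = 0.0
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        cross = x1 * y2 - x2 * y1
        area2 += cross            # twice the signed area
        cx += (x1 + x2) * cross
        cy += (y1 + y2) * cross
    return (cx / (3.0 * area2), cy / (3.0 * area2))


def merge_health_facilities(points, polygons):
    """Drop polygons duplicated by a same-type point, convert the rest to centroids."""
    merged = list(points)
    for poly in polygons:
        duplicate = any(
            pt["type"] == poly["type"]
            and point_in_polygon(pt["xy"][0], pt["xy"][1], poly["ring"])
            for pt in points
        )
        if not duplicate:
            merged.append({"type": poly["type"], "xy": centroid(poly["ring"])})
    return merged


# Made-up facilities.
points = [{"type": "hospital", "xy": (1, 1)}, {"type": "pharmacy", "xy": (5, 5)}]
polygons = [
    {"type": "hospital", "ring": [(0, 0), (2, 0), (2, 2), (0, 2)]},        # duplicate, removed
    {"type": "clinic", "ring": [(4, 4), (6, 4), (6, 6), (4, 6)]},          # other type, kept
    {"type": "hospital", "ring": [(10, 10), (12, 10), (12, 12), (10, 12)]},  # no point, kept
]
facilities = merge_health_facilities(points, polygons)
```

In this example, the first hospital polygon is removed because it contains a point tagged as a hospital, whereas the clinic polygon is kept even though it contains a pharmacy point.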

Rasterization of CI data

The next step is to translate the detailed geospatial information on CI into a consistent rasterized dataset, whereby each grid cell holds information on the estimated amount of infrastructure. We created a consistent raster of the globe with a resolution of 0.10 × 0.10 degrees, which is approximately 11.1 × 11.1 km at the equator, and a second raster with a resolution of 0.25 × 0.25 degrees. We spatially overlay all individual CI assets with each grid cell of the consistent raster of the globe. Each grid cell in this raster can be defined as a rectangle \(p\left(x_1,x_2,y_1,y_2\right)\), and the collection of grid cells can be denoted by the set \(P=\left\{p_1,\ldots,p_z\right\}\). The collection of unique CI assets of a specific infrastructure type within a given grid cell is denoted as \(I_{t,xy}\left(p\right)=\left\{\iota_{t,1}\left(p\right),\ldots,\iota_{t,n}\left(p\right)\right\}\), where \(\left(x_1\le x\le x_2\right)\wedge\left(y_1\le y\le y_2\right)\). For example, this may be the collection of all unique telecom towers that are located within a given cell.

We use this collection of CI assets \(I_t\left(p\right)\) to estimate the total amount of each infrastructure type within a given grid cell. The amount of infrastructure associated with one unique CI asset \(\iota_{t,xy}\) is denoted as a number \(\varphi\left(\iota_j\right)\), where j reflects the datatype of the considered asset \(\iota\), and the unit of spatial measurement depends on the datatype j. Depending on the datatype of an infrastructure type (see Supplementary Table 1), we used the following method to rasterize the global CI. We estimate: (1) the total count if the datatype j is a node; (2) the length in km if the datatype j is a line; and (3) the area in km² if the datatype j is a polygon. We can define the total amount of infrastructure for an infrastructure type in a given grid cell p as \(s\left(I_t\right)=\varphi\left(I_t\left(p\right)\right)\). For example, this could be a given grid cell p that counts four telecom towers. The rasterized data per infrastructure type are then denoted as the set \(S\left(I_t\right)=\left\{\varphi\left(I_t\left(p_1\right)\right),\ldots,\varphi\left(I_t\left(p_z\right)\right)\right\}\). Using this procedure, the detailed geospatial information on the 39 selected infrastructure types is translated into two sets of 39 consistent rasterized layers containing geospatial information on the amount of infrastructure at a global scale (see Section Data Records).
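For the node datatype, the rasterization reduces to counting assets per grid cell, which can be sketched as follows. The function names, the grid origin at (−180°, −90°), and the coordinates are assumptions made for illustration. For lines and polygons, the length or area of the part of each asset that falls inside a cell would have to be computed instead, which is not shown here.

```python
import math
from collections import Counter


def cell_index(lon, lat, res=0.10):
    """Return (column, row) of the grid cell containing a lon/lat point.

    Assumes a global grid whose origin is the south-west corner (-180, -90).
    """
    n_cols = round(360.0 / res)
    n_rows = round(180.0 / res)
    col = min(math.floor((lon + 180.0) / res), n_cols - 1)  # keep lon = 180 in range
    row = min(math.floor((lat + 90.0) / res), n_rows - 1)   # keep lat = 90 in range
    return col, row


def count_nodes(assets, res=0.10):
    """Total count of point assets per grid cell (sparse raster)."""
    return Counter(cell_index(lon, lat, res) for lon, lat in assets)


# Made-up telecom tower coordinates (lon, lat).
towers = [(4.95, 52.35), (4.96, 52.36), (4.94, 52.34), (4.97, 52.33), (5.05, 52.35)]
raster = count_nodes(towers)
```

Here the first four towers fall in the same 0.10 × 0.10 degree cell, so that cell holds a value of four, in line with the example given above.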

Composition of CISI

The final step is to develop the CISI, which is a spatial composite of the rasterized data per infrastructure type. For the development of CISI, a four-fold conversion is needed, which is summarized in Fig. 5, and can be described as follows.

Fig. 5

Schematic representation of the four conversions applied to derive the CISI. The procedure is illustrated for one branch of the CI dataset, starting from landfill assets up to the aggregation of CISI.

Conversion 1: We normalize the 39 consistent rasterized layers at asset level that are described in Subsection Rasterization of CI data. The normalization of rasterized data is a prerequisite to enable comparison between the different infrastructure types, but also to ensure comparison between multiple datatypes. To normalize each of the 39 rasterized layers representing a given infrastructure type \(S(I_t)\), we first detect the grid cell containing the highest amount of infrastructure, whereby the amount of infrastructure in this specific grid cell is denoted as \(\alpha=\mathrm{max}\left(S\left(I_t\right)\right)\). Subsequently, each grid cell of a given rasterized layer representing a given infrastructure type \(S(I_t)\) is divided by the highest amount of infrastructure \(\alpha\), resulting in a normalized layer \(\bar{S}\left(I_t\right)\). For example, for the rasterized layer containing information on global landfills (shown in the first panel of Fig. 5), we detect that the grid cell with the highest amount of infrastructure holds 20 km² of landfill assets. Subsequently, all of the grid cells in the rasterized layer are divided by 20. The first conversion results in a dimensionless value ranging between 0 (no landfill assets) and 1 (highest intensity of landfill assets). The procedure for the normalization at asset level is described by Eq. 1:

$$\bar{S}\left(I_t\right)=\frac{S(I_t)}{\alpha}\quad\quad\mathrm{with}\;\alpha=\mathrm{max}\left(S\left(I_t\right)\right)$$

(1)
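A minimal sketch of the normalization in Eq. 1 is given below, using a small nested list as a stand-in for a rasterized layer. The function name and cell values are illustrative, and returning zeros for an empty layer is an assumption that is not specified in the text.

```python
def normalize(layer):
    """Divide every grid cell by the layer maximum alpha (Eq. 1)."""
    alpha = max(max(row) for row in layer)
    if alpha == 0:
        # Assumption: a layer without any infrastructure stays all zeros.
        return [[0.0 for _ in row] for row in layer]
    return [[value / alpha for value in row] for row in layer]


# Made-up landfill areas in km2; the maximum of 20 follows the example above.
landfill_km2 = [[0.0, 5.0], [20.0, 10.0]]
landfill_norm = normalize(landfill_km2)
```

The cell holding 20 km² becomes 1, and the cell holding 5 km² becomes 0.25.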

Conversion 2: We aggregate the 39 normalized layers at infrastructure asset level into ten normalized layers at subsystem level. An infrastructure type t belongs to a specific subsystem g. Accordingly, the normalized layers for infrastructure types within a given subsystem are combined into an aggregated layer at subsystem level \(\bar{S}\left(I_g\right)\), which represents the spatial intensity of that specific subsystem. The number of infrastructure types within a subsystem g is denoted as \(T^g\). To continue with the example provided in Fig. 5, the solid waste subsystem is represented by two infrastructure types (\(T^g=2\)), namely landfill and waste transfer station. The normalized layers for these infrastructure types are combined into an aggregated layer representing the solid waste subsystem. We use an equal weighting, which means that each infrastructure type is considered equally important. We denote the weighting of a given infrastructure type as \(w(I_t)\). The weighted sum, denoted as \(\sum_{t=1}^{T^g}\bar{S}\left(I_t\right)\ast w\left(I_t\right)\), is normalized using the same method as in the first conversion. The second conversion is expressed by Eq. 2:

$$\bar{S}(I_g)=\frac{\sum_{t=1}^{T^g}\bar{S}(I_t)\ast w(I_t)}{\mathrm{max}\left(\sum_{t=1}^{T^g}\bar{S}(I_t)\ast w(I_t)\right)}\quad\mathrm{with}\;t\in g$$

(2)
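Eq. 2 can be sketched as a weighted sum of normalized layers followed by a division by the maximum of that sum. The function name and cell values below are illustrative. Note that, because of the final normalization, any set of equal weights (for example 1 or 1/\(T^g\)) gives the same result; the sketch uses a weight of 1 by default.

```python
def aggregate(layers, weights=None):
    """Weighted sum of normalized layers, re-normalized by its maximum (Eq. 2)."""
    if weights is None:
        weights = [1.0] * len(layers)  # equal weighting
    n_rows, n_cols = len(layers[0]), len(layers[0][0])
    summed = [
        [sum(w * layer[r][c] for layer, w in zip(layers, weights)) for c in range(n_cols)]
        for r in range(n_rows)
    ]
    peak = max(max(row) for row in summed)
    if peak == 0:
        return summed  # assumption: an empty subsystem stays all zeros
    return [[value / peak for value in row] for row in summed]


# Made-up normalized layers for the two solid waste infrastructure types.
landfill = [[0.0, 0.25], [1.0, 0.5]]
transfer_station = [[1.0, 0.0], [0.5, 0.0]]
solid_waste = aggregate([landfill, transfer_station])
```

The same function applies unchanged to the aggregation from subsystems to systems and from systems to the final index, since those steps have the same form.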

Conversion 3: We aggregate the geospatial data per subsystem into seven layers at system level. This conversion is similar to the previous step, but applied at system level. A given subsystem g is categorized under a specific system k, whereby the total number of subsystems belonging to a system k is expressed by \(G^k\). We aggregate the geospatial information of the subsystems, in additive format with equal weighting, where the weighting for a given subsystem is denoted as \(w(I_g)\). This is then followed by a normalization to derive \(\bar{S}\left(I_k\right)\). For example, the water waste subsystem and the solid waste subsystem comprise the waste system (Fig. 5). In this step, the two subsystems are combined to develop a normalized layer at system level, representing the spatial intensity of the overall waste system. We denote the third conversion as Eq. 3:

$$\bar{S}\left(I_k\right)=\frac{\sum_{g=1}^{G^k}\bar{S}\left(I_g\right)\ast w\left(I_g\right)}{\mathrm{max}\left(\sum_{g=1}^{G^k}\bar{S}\left(I_g\right)\ast w\left(I_g\right)\right)}\quad\mathrm{with}\;g\in k$$

(3)

Conversion 4: The CISI is developed in the final conversion. The CISI is the aggregation of the CI systems k representing the global infrastructure c. Here, the total number of CI systems is expressed by \(K^c\). The composition is again based on equal weighting, denoted as \(w\left(I_k\right)\) for a given system k, followed by a normalization. The last step is represented by Eq. 4:

$$CISI=\bar{S}(I_c)=\frac{\sum_{k=1}^{K^c}\bar{S}(I_k)\ast w(I_k)}{\mathrm{max}\left(\sum_{k=1}^{K^c}\bar{S}(I_k)\ast w(I_k)\right)}\quad\mathrm{with}\;k\in c$$

(4)

We execute conversions 1–4 to derive the CISI at the global scale for two resolutions (0.10 × 0.10 and 0.25 × 0.25 degrees). As mentioned earlier, an equal weighting is applied in this paper for the aggregation of the components at infrastructure type, sub-system and system level. Yet we would like to stress that the developed model allows for adjustments of the weightings of the infrastructure types, sub-systems, and systems. This means that a user can, for example, increase the weight of power plants to emphasize the importance of this infrastructure type to society. The adjustment of weights can be done on the basis of expert judgement or extensive literature reviews, allowing for a tailored CISI dataset with adjusted weightings that meets the specific need of the end user. However, developing the different weightings for the various components is beyond the scope of this study, and indeed the weightings will differ depending on the context of the study in which the data are used.
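The effect of adjusting weights can be illustrated with a small recursive sketch that applies conversions 1–4 to a nested dictionary of layers. This is not the model code itself: the function name, the three-cell layers, the component names used as weight keys (assumed to be unique across levels), and the weight of 3 for landfills are all hypothetical. Components that are not listed in the weights receive a weight of 1.

```python
def combine(node, weights=None):
    """Apply normalization and weighted aggregation over a nested hierarchy.

    Leaves are lists of rasterized cell values (conversion 1); inner
    dictionaries are subsystems, systems, or the full index (conversions 2-4).
    """
    weights = weights or {}
    if not isinstance(node, dict):
        peak = max(node)
        return [v / peak if peak else 0.0 for v in node]
    children = {name: combine(child, weights) for name, child in node.items()}
    n_cells = len(next(iter(children.values())))
    summed = [
        sum(weights.get(name, 1.0) * layer[i] for name, layer in children.items())
        for i in range(n_cells)
    ]
    peak = max(summed)
    return [v / peak if peak else 0.0 for v in summed]


# Made-up amounts for three grid cells, arranged as system > subsystem > type.
hierarchy = {
    "waste": {
        "solid waste": {"landfill": [20.0, 0.0, 10.0], "waste transfer station": [0, 2, 1]},
        "water waste": {"water waste treatment plant": [0, 4, 0]},
    },
    "telecommunication": {
        "telecom": {"communication tower": [3, 0, 0], "mast": [0, 0, 2]},
    },
}
cisi_equal = combine(hierarchy)
cisi_weighted = combine(hierarchy, {"landfill": 3.0})
```

With equal weighting, the first and third cells share the highest index value; tripling the landfill weight shifts the maximum to the first cell only, which holds the largest landfill area.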

The results of the CISI at the global scale with a resolution of 0.10 × 0.10 degrees are presented in Fig. 6, highlighting the disparities between areas where high amounts of CI are located and areas where they are not. The CISI ranges between 0 (no CI) and 1 (highest CI intensity). The CISI normalized at the global scale gives valuable information on where certain amounts of infrastructure are located. However, we would like to emphasize that locations with smaller amounts of infrastructure are not necessarily of less importance to society. In addition to this dataset, we therefore also execute conversions 1–4 at the continental scale. The datasets at continental scale allow for the comparison of the CI intensity, and thus amounts of infrastructure, within each continent in a relative way.

Fig. 6

Global visualization of the Critical Infrastructure Spatial Index (CISI) at a resolution of 0.10 × 0.10 degrees. Panel (a) represents the CISI at the global scale, whereas panels (b–d) provide more detail at regional level for the East Coast of the US (b), Western Europe (c), and East Asia (d).