From the sf vignette:
Simple features or simple feature access refers to a formal standard (ISO 19125-1:2004) that describes how objects in the real world can be represented in computers, with emphasis on the spatial geometry of these objects. It also describes how such objects can be stored in and retrieved from databases, and which geometrical operations should be defined for them.
The standard is widely implemented in spatial databases (such as PostGIS), commercial GIS (e.g., ESRI ArcGIS) and forms the vector data basis for libraries such as GDAL. A subset of simple features forms the GeoJSON standard.
R has well-supported classes for storing spatial data (sp) and interfacing to the above mentioned environments (rgdal, rgeos), but has so far lacked a complete implementation of simple features, making conversions at times convoluted, inefficient or incomplete. The package sf tries to fill this gap, and aims at succeeding sp in the long term.
The sf package is an R implementation of Simple Features. This package incorporates:
Most of the functions in this package starts with prefix st_
which stands for spatial and temporal.
If you’ve used readOGR()
from the rgdal
package, you’ll notice the similarities in arguments for the st_read()
function. We’ll do a quick comparison of the two functions here.
Read in a shapefile of Alaska using readOGR()
from the rgdal
package, and st_read
from the sf
package.
dsn
is the path namelayer
is the name of the fileNOTE: you do not need to add an extension to the layer name
## Read in shapefile using rgdal
system.time(ak_shp_rgdal <- readOGR(dsn="shapefiles", layer="ak_regions"))
## OGR data source with driver: ESRI Shapefile
## Source: "shapefiles", layer: "ak_regions"
## with 13 features
## It has 3 fields
## user system elapsed
## 0.400 0.012 0.412
object.size(ak_shp_rgdal)
## 3046304 bytes
plot(ak_shp_rgdal)
## Read in shapefile using sf
system.time(ak_shp_sf <- read_sf("shapefiles/ak_regions.shp"))
## user system elapsed
## 0.088 0.012 0.101
object.size(ak_shp_sf)
## 2916296 bytes
head(ak_shp_sf)
## Simple feature collection with 6 features and 3 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -179.2296 ymin: 51.15683 xmax: 179.8567 ymax: 71.43957
## epsg (SRID): 4326
## proj4string: +proj=longlat +datum=WGS84 +no_defs
## # A tibble: 6 x 4
## region_id region mgmt_area geometry
## <int> <chr> <dbl> <MULTIPOLYGON [°]>
## 1 1 Aleutian … 3 (((-171.1345 52.44974, -171.1399 52.4431…
## 2 2 Arctic 4 (((-139.9552 68.70597, -139.9588 68.7056…
## 3 3 Bristol B… 3 (((-159.8745 58.62778, -159.8708 58.6229…
## 4 4 Chignik 3 (((-155.8282 55.84638, -155.8226 55.8527…
## 5 5 Copper Ri… 2 (((-143.8874 59.93931, -143.9027 59.9393…
## 6 6 Kodiak 3 (((-151.9997 58.83077, -152.0093 58.8290…
You’ll notice right away that these two objects are being plotted differently. This is because these two objects are of different types.
sf objects usually have two classes - sf
and data.frame
. Two main differences comparing to a regular data.frame
object are spatial metadata (geometry type
, dimension
, bbox
, epsg (SRID)
, proj4string
) and additional column - typically named geom
or geometry
.
class(ak_shp_sf)
## [1] "sf" "tbl_df" "tbl" "data.frame"
Every sf
object needs a coordinate reference system (or crs
) defined in order to work with it correctly. A coordinate reference system contains both a datum and a projection. The datum is how you georeference your points (in 3 dimensions!) onto a spheroid. The projection is how these points are mathematically transformed to represent the georeferenced point on a flat piece of paper. All coordinate reference systems require a datum. However, some coordinate reference systems are “unprojected” (also called geographic coordinate systems). Coordinates in latitude/longitude use a geographic (unprojected) coordinate system. One of the most commonly used geographic coordinate systems is WGS 1984.
You can view what crs
is set by using the function st_crs
st_crs(ak_shp_sf)
## Coordinate Reference System:
## EPSG: 4326
## proj4string: "+proj=longlat +datum=WGS84 +no_defs"
This is pretty confusing looking. Without getting into the details, that long string says that this data has a greographic coordinate system (WGS84) with no projection. A convenient way to reference crs
quickly is by using the EPSG code, a number that represents a standard projection and datum. You can check out a list of (lots!) of EPSG codes here.
You will often need to transform your geospatial data from one coordinate system to another. The st_transform
function does this quickly for us. You may have noticed the maps above looked wonky because of the dateline. We might want to set a different projection for this data so it plots nicer. A good one for Alaska is called the Alaska Albers projection, with an EPSG code of 3338.
ak_shp_sf <- ak_shp_sf %>%
st_transform(crs = 3338)
st_crs(ak_shp_sf)
## Coordinate Reference System:
## EPSG: 3338
## proj4string: "+proj=aea +lat_1=55 +lat_2=65 +lat_0=50 +lon_0=-154 +x_0=0 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs"
plot(ak_shp_sf)
Much better!
sf objects can be used as a regular data.frame
object in many operations
ak_shp_sf
## Simple feature collection with 13 features and 3 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -2175343 ymin: 405519.1 xmax: 1579226 ymax: 2383770
## epsg (SRID): 3338
## proj4string: +proj=aea +lat_1=55 +lat_2=65 +lat_0=50 +lon_0=-154 +x_0=0 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
## # A tibble: 13 x 4
## region_id region mgmt_area geometry
## <int> <chr> <dbl> <MULTIPOLYGON [m]>
## 1 1 Aleutian I… 3 (((-1156666 420855.1, -1157211 420247.…
## 2 2 Arctic 4 (((571289.9 2143072, 571154 2143006, 5…
## 3 3 Bristol Bay 3 (((-339688.6 973904.9, -339520.8 97334…
## 4 4 Chignik 3 (((-114381.9 649966.8, -114015.9 65066…
## 5 5 Copper Riv… 2 (((561012 1148301, 560165.9 1148178, 5…
## 6 6 Kodiak 3 (((115112.5 983293, 114570.4 983086.9,…
## 7 7 Kotzebue 4 (((-678815.3 1819519, -678441.9 181994…
## 8 8 Kuskokwim 4 (((-1030125 1281198, -1030021 1281773,…
## 9 9 Cook Inlet 2 (((35214.98 1002457, 35568.81 1002488,…
## 10 10 Norton Sou… 4 (((-848357 1636692, -847300.5 1635907,…
## 11 11 Prince Wil… 2 (((426007.1 1087250, 426240.8 1087939,…
## 12 12 Southeast 1 (((1287777 744574.1, 1288136 744899.2,…
## 13 13 Yukon 4 (((-375318 1473998, -374946.7 1474058,…
nrow(ak_shp_sf)
## [1] 13
ncol(ak_shp_sf)
## [1] 4
Art by Allison Horst
sf
& the TidyverseSince sf
objects are dataframes, they play nicely with packages in the tidyverse. Here are a couple of simple examples:
select()
ak_shp_sf %>%
select(region)
## Simple feature collection with 13 features and 1 field
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -2175343 ymin: 405519.1 xmax: 1579226 ymax: 2383770
## epsg (SRID): 3338
## proj4string: +proj=aea +lat_1=55 +lat_2=65 +lat_0=50 +lon_0=-154 +x_0=0 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
## # A tibble: 13 x 2
## region geometry
## <chr> <MULTIPOLYGON [m]>
## 1 Aleutian Islan… (((-1156666 420855.1, -1157211 420247.7, -1157636 4198…
## 2 Arctic (((571289.9 2143072, 571154 2143006, 571005.2 2143017,…
## 3 Bristol Bay (((-339688.6 973904.9, -339520.8 973346.6, -339412.7 9…
## 4 Chignik (((-114381.9 649966.8, -114015.9 650665.6, -113611.9 6…
## 5 Copper River (((561012 1148301, 560165.9 1148178, 559780 1148160, 5…
## 6 Kodiak (((115112.5 983293, 114570.4 983086.9, 113915 982920.1…
## 7 Kotzebue (((-678815.3 1819519, -678441.9 1819948, -678018.3 182…
## 8 Kuskokwim (((-1030125 1281198, -1030021 1281773, -1029858 128233…
## 9 Cook Inlet (((35214.98 1002457, 35568.81 1002488, 35869.5 1002469…
## 10 Norton Sound (((-848357 1636692, -847300.5 1635907, -846510 1635203…
## 11 Prince William… (((426007.1 1087250, 426240.8 1087939, 426562.5 108859…
## 12 Southeast (((1287777 744574.1, 1288136 744899.2, 1288443 745136,…
## 13 Yukon (((-375318 1473998, -374946.7 1474058, -374450.7 14739…
Note the sticky geometry column! The geometry column will stay with your sf
object even if it is not called explicitly.
filter()
ak_shp_sf %>%
filter(region == "Southeast")
## Simple feature collection with 1 feature and 3 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: 559475.7 ymin: 722450 xmax: 1579226 ymax: 1410576
## epsg (SRID): 3338
## proj4string: +proj=aea +lat_1=55 +lat_2=65 +lat_0=50 +lon_0=-154 +x_0=0 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
## # A tibble: 1 x 4
## region_id region mgmt_area geometry
## <int> <chr> <dbl> <MULTIPOLYGON [m]>
## 1 12 Southea… 1 (((1287777 744574.1, 1288136 744899.2, 128…
You can also use the sf
package to create spatial joins, useful for when you want to utilize two datasets together. As an example, let’s ask a question: how many people live in each of these Alaska regions?
We have some population data, but it gives the number of people by city, not by region. To determine the number of people per region we will need to:
sf
objectst_join
) to assign each city to a regiongroup_by
and summarize
to calculate the total population by regionFirst, read in the population data as a regular data.frame
.
pop <- read.csv("shapefiles/alaska_population.csv")
The st_join
function is a spatial left join. The arguments for both the left and right tables are objects of class sf
which means we will first need to turn our population data.frame
with latitude and longitude coordinates into an sf
object.
We can do this easily using the st_as_sf
function, which takes as arguments the coordinates and the crs
. The remove = F
specification here ensures that when we create our geometry
column, we retain our original lat
lng
columns, which we will need later for plotting. Although it isn’t said anywhere explicitly in the file, let’s assume that the coordinate system used to reference the latitude longitude coordinates is WGS84, which has a crs
number of 4236.
pop_sf <- st_as_sf(pop,
coords = c('lng', 'lat'),
crs = 4326,
remove = F)
head(pop_sf)
## Simple feature collection with 6 features and 5 fields
## geometry type: POINT
## dimension: XY
## bbox: xmin: -176.6581 ymin: 51.88 xmax: -154.1703 ymax: 62.68889
## epsg (SRID): 4326
## proj4string: +proj=longlat +datum=WGS84 +no_defs
## year city lat lng population geometry
## 1 2015 Adak 51.88000 -176.6581 122 POINT (-176.6581 51.88)
## 2 2015 Akhiok 56.94556 -154.1703 84 POINT (-154.1703 56.94556)
## 3 2015 Akiachak 60.90944 -161.4314 562 POINT (-161.4314 60.90944)
## 4 2015 Akiak 60.91222 -161.2139 399 POINT (-161.2139 60.91222)
## 5 2015 Akutan 54.13556 -165.7731 899 POINT (-165.7731 54.13556)
## 6 2015 Alakanuk 62.68889 -164.6153 777 POINT (-164.6153 62.68889)
Now we can do our spatial join! You can specify what geometry function the join uses (st_intersects
, st_within
, st_crosses
, st_is_within_distance
, …) in the join
argument. The geometry function you use will depend on what kind of operation you want to do, and the geometries of your shapefiles.
In this case, we want to find what region each city falls within, so we will use st_within
.
pop_joined_sf <- st_join(pop_sf, ak_shp_sf, join = st_within)
This gives an error!
Error: st_crs(x) == st_crs(y) is not TRUE
Turns out, this won’t work right now because our coordinate reference systems are not the same. Luckily, this is easily resolved using st_transform
, and projecting our population object into Alaska Albers.
pop_sf <- st_transform(pop_sf, crs = 3338)
pop_joined_sf <- st_join(pop_sf, ak_shp_sf, join = st_within)
head(pop_joined_sf)
## Simple feature collection with 6 features and 8 fields
## geometry type: POINT
## dimension: XY
## bbox: xmin: -1537925 ymin: 472627.8 xmax: -10340.71 ymax: 1456223
## epsg (SRID): 3338
## proj4string: +proj=aea +lat_1=55 +lat_2=65 +lat_0=50 +lon_0=-154 +x_0=0 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
## year city lat lng population region_id region
## 1 2015 Adak 51.88000 -176.6581 122 1 Aleutian Islands
## 2 2015 Akhiok 56.94556 -154.1703 84 6 Kodiak
## 3 2015 Akiachak 60.90944 -161.4314 562 8 Kuskokwim
## 4 2015 Akiak 60.91222 -161.2139 399 8 Kuskokwim
## 5 2015 Akutan 54.13556 -165.7731 899 1 Aleutian Islands
## 6 2015 Alakanuk 62.68889 -164.6153 777 13 Yukon
## mgmt_area geometry
## 1 3 POINT (-1537925 472627.8)
## 2 3 POINT (-10340.71 770998.4)
## 3 4 POINT (-400885.5 1236460)
## 4 4 POINT (-389165.7 1235475)
## 5 3 POINT (-766425.7 526057.8)
## 6 4 POINT (-539724.9 1456223)
plot(pop_joined_sf["region"])
Next we compute the total population for each region. In this case, we want to do a group_by
and summarise
as this were a regular data.frame
- otherwise all of our point geometries would be aggregated by region which is not what we want. We remove the sticky geometry using as.data.frame
, on the advice of the sf::tidyverse
help page.
pop_region <- pop_joined_sf %>%
as.data.frame() %>%
group_by(region) %>%
summarise(total_pop = sum(population))
head(pop_region)
## # A tibble: 6 x 2
## region total_pop
## <chr> <int>
## 1 Aleutian Islands 8840
## 2 Arctic 8419
## 3 Bristol Bay 6947
## 4 Chignik 311
## 5 Cook Inlet 408254
## 6 Copper River 2294
And use a regular left_join
to get the information back to the Alaska region shapefile. Note that we need this step in order to retain our region geometries so that we can make some maps.
ak_pop_sf <- left_join(ak_shp_sf, pop_region)
head(ak_pop_sf)
## Simple feature collection with 6 features and 4 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -2175343 ymin: 405519.1 xmax: 773854 ymax: 2383770
## epsg (SRID): 3338
## proj4string: +proj=aea +lat_1=55 +lat_2=65 +lat_0=50 +lon_0=-154 +x_0=0 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
## # A tibble: 6 x 5
## region_id region mgmt_area total_pop geometry
## <int> <chr> <dbl> <int> <MULTIPOLYGON [m]>
## 1 1 Aleutia… 3 8840 (((-1156666 420855.1, -1157211 4…
## 2 2 Arctic 4 8419 (((571289.9 2143072, 571154 2143…
## 3 3 Bristol… 3 6947 (((-339688.6 973904.9, -339520.8…
## 4 4 Chignik 3 311 (((-114381.9 649966.8, -114015.9…
## 5 5 Copper … 2 2294 (((561012 1148301, 560165.9 1148…
## 6 6 Kodiak 3 8126 (((115112.5 983293, 114570.4 983…
#plot to check
plot(ak_pop_sf["total_pop"])
group_by
and summarize
spatial objectsSo far, we have learned how to use sf
and dplyr
to use a spatial join on two datasets and calculate a summary metric from the result of that join.
The group_by
and summarize
functions can also be used on sf
objects to summarize within a dataset and combine geometries. Many of the tidyverse
functions have methods specific for sf
objects, some of which have additional arguments that wouldn’t be relevant to the data.frame
methods. You can run ?sf::tidyverse
to get documentation on the tidyverse
sf
methods.
Let’s try some out. Say we want to calculate the population by Alaska management area, as opposed to region.
ak_mgmt <- ak_pop_sf %>%
group_by(mgmt_area) %>%
summarize(total_pop = sum(total_pop))
plot(ak_mgmt["total_pop"])
Notice that the region geometries were combined into a single polygon for each management area.
If we don’t want to combine geometries, we can specifcy do_union = F
as an argument.
ak_mgmt <- ak_pop_sf %>%
group_by(mgmt_area) %>%
summarize(total_pop = sum(total_pop), do_union = F)
plot(ak_mgmt["total_pop"])
Save the spatial object to disk using write_sf()
and specifying the filename. Writing your file with the extension .shp will assume an ESRI driver driver, but there are many other format options available.
write_sf(ak_pop_sf, "shapefiles/ak_regions_population.shp", delete_layer = TRUE)
ggplot2
now has integrated functionality to plot sf objects using geom_sf()
.
#simplest plot
ggplot(ak_pop_sf) +
geom_sf()
This is useful to make sure your file looks correct but doesn’t display any information about the data. We can plot these regions and fill each polygon based on the population
ggplot(ak_pop_sf) +
geom_sf(aes(fill = total_pop))
We can clean it up a bit, applying a cleaner theme and assigning a continuous color palette.
ggplot(ak_pop_sf) +
geom_sf(aes(fill = total_pop)) +
theme_bw() +
labs(fill = "Total Population") +
scale_fill_continuous(low = "khaki", high = "firebrick", labels = comma)
We can also plot multiple shapefiles in the same plot. Say if we want to visualize rivers in Alaska, in addition to the location of communities, since many communities in Alaska are on rivers. We can read in a rivers shapefile, doublecheck the crs
to make sure it is what we need, and then plot all three shapefiles.
rivers <- read_sf("shapefiles/ak_rivers.shp")
st_crs(rivers)
## Coordinate Reference System:
## No EPSG code
## proj4string: "+proj=aea +lat_1=55 +lat_2=65 +lat_0=50 +lon_0=-154 +x_0=0 +y_0=0 +ellps=GRS80 +units=m +no_defs"
ggplot() +
geom_sf(data = ak_pop_sf, aes(fill = total_pop)) +
geom_sf(data = rivers, aes(size = StrOrder), color = "black") +
geom_sf(data = pop_sf, aes(), size = .5) +
scale_size(range = c(0.01, 0.2), guide = F) +
theme_bw() +
labs(fill = "Total Population") +
scale_fill_continuous(low = "khaki", high = "firebrick", labels = comma)
We can also make an interactive map using leaflet
.
Leaflet (unlike ggplot) will project data for you. The catch is that you have to give it both a projection (like Alaska Albers), and that your shapefile must use a geographic coordinate system. This means that we need to use our shapefile with the 4326 EPSG code. Remember you can always check what crs
you have set using st_crs
.
Here we define a leaflet projection for Alaska Albers, and save it as a variable to use later.
epsg3338 <- leaflet::leafletCRS(
crsClass = "L.Proj.CRS",
code = "EPSG:3338",
proj4def = "+proj=aea +lat_1=55 +lat_2=65 +lat_0=50 +lon_0=-154 +x_0=0 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs",
resolutions = 2^(16:7))
You might notice that this looks familiar! The syntax is a bit different, but most of this information is also contained within the crs
of our shapefile:
st_crs(ak_pop_sf)
## Coordinate Reference System:
## EPSG: 3338
## proj4string: "+proj=aea +lat_1=55 +lat_2=65 +lat_0=50 +lon_0=-154 +x_0=0 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs"
Since leaflet
requires that we use an unprojected coordinate system, let’s use st_transform
yet again to get back to WGS84.
ak_pop_crs <- ak_pop_sf %>% st_transform(crs = 4326)
m <- leaflet(options = leafletOptions(crs = epsg3338)) %>%
addPolygons(data = ak_pop_crs,
fillColor = "gray",
weight = 1)
m
We can add labels, legends, and a color scale.
pal <- colorNumeric(palette = "Reds", domain = ak_pop_crs$total_pop)
m <- leaflet(options = leafletOptions(crs = epsg3338)) %>%
addPolygons(data = ak_pop_crs,
fillColor = ~pal(total_pop),
weight = 1,
color = "black",
fillOpacity = 1,
label = ~region) %>%
addLegend(position = "bottomleft",
pal = pal,
values = range(ak_pop_crs$total_pop),
title = "Total Population")
m
We can also add the individual communities, with popup labels showing their population, on top of that!
pal <- colorNumeric(palette = "Reds", domain = ak_pop_crs$total_pop)
m <- leaflet(options = leafletOptions(crs = epsg3338)) %>%
addPolygons(data = ak_pop_crs,
fillColor = ~pal(total_pop),
weight = 1,
color = "black",
fillOpacity = 1) %>%
addCircleMarkers(data = pop_sf,
lat = ~lat,
lng = ~lng,
radius = ~log(population/500), # arbitrary scaling
fillColor = "gray",
fillOpacity = 1,
weight = 0.25,
color = "black",
label = ~paste0(pop_sf$city, ", population ", comma(pop_sf$population))) %>%
addLegend(position = "bottomleft",
pal = pal,
values = range(ak_pop_crs$total_pop),
title = "Total Population")
m
There is a lot more functionality to sf
including the ability to intersect
polygons, calculate distance
, create a buffer
, and more. Here are some more great resources and tutorials for a deeper dive into this great package:
Spatial analysis in R with the sf package
Intro to Spatial Analysis
sf github repo
Tidy spatial data in R: using dplyr, tidyr, and ggplot2 with sf
mapping-fall-foliage-with-sf