2 Attribute data operations
Prerequisites
This chapter requires the following packages to be installed and attached:
# Tabular data access and manipulation
using DataFrames
# Vector data access and manipulation
using GeoDataFrames
import GeoInterface as GI
# Raster data access and manipulation (requires ArchGDAL for file I/O)
using Rasters
import ArchGDAL
# "Categorical" / "factor" vectors in Julia
using CategoricalArrays
# CSV file reading
using CSV
# Statistics
using Statistics, StatsBase
# Disambiguate functions exported by multiple packages
const combine = DataFrames.combine
const groupby = DataFrames.groupby
2.1 Introduction
Attribute data is non-spatial information associated with geographic (geometry) data. A bus stop provides a simple example: its position would typically be represented by latitude and longitude coordinates (geometry data), while its name is an attribute. The Elephant & Castle / New Kent Road stop in London, for example, has coordinates of -0.098 degrees longitude and 51.495 degrees latitude, which can be represented as GI.Point(-0.098, 51.495) in the GeoInterface representation described in Chapter @ref(spatial-class). Attributes, such as the name, of the POINT feature (to use simple features terminology) are the topic of this chapter.
TODO: add figure with bus stop
Another example is the elevation value (attribute) for a specific grid cell in raster data. Unlike the vector data model, the raster data model stores the coordinates of each grid cell indirectly, meaning the distinction between attribute and spatial information is less clear. To illustrate the point, think of a pixel in the 3rd row and the 4th column of a raster matrix. Its spatial location is defined by its index in the matrix: move from the origin four cells in the x direction (typically east and right on maps) and three cells in the y direction (typically south and down). The raster’s lookups define the distance for each x- and y-step. Lookups are a vital component of raster datasets, as they specify how pixels relate to spatial coordinates (see also Chapter @ref(spatial-operations)).
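To make the index-to-coordinate idea concrete, here is a minimal sketch in plain Julia. The origin, the cell size, and the function name cell_centre are hypothetical and chosen only for illustration; in practice this information lives in the raster's lookups and packages such as Rasters.jl perform the conversion.

```julia
# Hypothetical raster layout (illustrative values, not taken from any dataset).
# The origin is the top-left corner; a negative y step means rows run north to south.
function cell_centre(row, col; x_origin = 0.0, y_origin = 10.0, x_step = 0.5, y_step = -0.5)
    x = x_origin + (col - 0.5) * x_step   # centre of column `col`
    y = y_origin + (row - 0.5) * y_step   # centre of row `row`
    return (x, y)
end

# Cell in the 3rd row and 4th column (Julia indexing is 1-based)
centre = cell_centre(3, 4)
```

The cell itself stores only a value; its position follows entirely from the index and the step sizes.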
This chapter teaches how to manipulate geographic objects based on attributes such as the names of bus stops in a vector dataset and elevations of pixels in a raster dataset. For vector data, this means techniques such as subsetting and aggregation (see Sections @ref(vector-attribute-subsetting) to @ref(vector-attribute-aggregation)). Sections @ref(vector-attribute-joining) and @ref(vec-attr-creation) demonstrate how to join data onto simple feature objects using a shared ID and how to create new variables, respectively. Each of these operations has a spatial equivalent: the subset function in DataFrames.jl, for example, can filter rows using a predicate on an attribute column or a spatial predicate on the geometry column; you can also join attributes in two geographic datasets using spatial joins. This is good news: skills developed in this chapter are transferable.
After a deep dive into various types of vector attribute operations in the next section, raster attribute data operations are covered. Creation of raster layers containing continuous and categorical attributes and extraction of cell values from one or more layers (raster subsetting; Section @ref(raster-subsetting)) are demonstrated. Section @ref(summarizing-raster-objects) provides an overview of ‘global’ raster operations which can be used to summarize entire raster datasets. Chapter @ref(spatial-operations) extends the methods presented here to the spatial world.
2.2 Vector attribute manipulation
Geographic vector datasets are well supported in Julia, and are usually represented as DataFrames. Unlike R, with its sf class, or Python, with the GeoPandas GeoDataFrame, Julia’s GeoInterface.jl ecosystem does not have a single dedicated class for geospatial tables, and so the package GeoDataFrames.jl extends Julia’s DataFrames.jl package to add spatial metadata and file I/O capabilities.
Geospatial data frames have a geometry column which can contain a range of geographic entities (single and ‘multi’ point, line, and polygon features) per row.
Data frames (and geospatial tables like geographic databases, shapefiles, GeoParquet, GeoJSON, etc.) have one column per attribute variable (such as “name”) and one row per observation or feature (e.g., per bus station).
Many operations are available for attribute data, as shown in the wonderful DataFrames.jl documentation.
The column of a geographic table that holds geometry is typically called geometry or geom, but any name can be used.
You can discover the names of the geometry columns in a geospatial table using GI.geometrycolumns(table) - typically, first(GI.geometrycolumns(table)) is assumed to be the geometry column.
There is a developing convention to indicate the geometry columns in metadata using the GEOINTERFACE:geometrycolumns key.
GeoDataFrames.jl adopts and implements this convention for the DataFrame type.
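As a rough illustration of the convention, the sketch below attaches the metadata key to an ordinary DataFrame using the table-level metadata functions in DataFrames.jl. The toy table, the tuples standing in for geometries, and the choice of `style = :note` are assumptions made for this example; GeoDataFrames.jl normally sets this metadata for you when reading a file.

```julia
using DataFrames

# Toy table; plain coordinate tuples stand in for real geometry objects
df = DataFrame(name = ["A", "B"], geom = [(0.0, 0.0), (1.0, 1.0)])

# Record which column holds the geometry, under the key named in the text.
# `style = :note` asks DataFrames.jl to carry the metadata through transformations.
metadata!(df, "GEOINTERFACE:geometrycolumns", (:geom,), style = :note)

# Read the metadata back
geomcols = metadata(df, "GEOINTERFACE:geometrycolumns")
```

Storing the column name in metadata is what allows the geometry column to have any name.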
There are many table manipulation packages in Julia, all of which are compatible with DataFrame objects.
We provide an abbreviated list here, and you can find more information in the DataFrames.jl documentation on data manipulation frameworks.
They all implement functionality similar to dplyr or LINQ.
- DataFramesMeta.jl provides a convenient yet fast macro-based interface to work with DataFrames, via its @chain, @transform, @select, @combine, and various other macros. The @chain macro is similar to the |> and %>% operators in R. DataFrameMacros.jl is an alternative implementation with better support for multi-column transformations.
- TidierData.jl is heavily inspired by the dplyr and tidyr R packages (part of the R tidyverse), which it aims to implement in pure Julia by wrapping DataFrames.jl. Its entry point is also the @chain macro, and it uses tidy expressions as in the R tidyverse.
- Query.jl is a package for querying Julia data sources. It can filter, project, join and group data from any iterable data source, and is heavily inspired by LINQ.
We also recommend the following resources for further reading:
- https://juliadatascience.io/
- https://github.com/bkamins/JuliaForDataAnalysis
2.2.1 Basic DataFrame operations
Before using these capabilities, it is worth recapping how to discover the basic properties of vector data objects. Let’s start by inspecting the world.gpkg dataset from data/:
world = GeoDataFrames.read("data/world.gpkg")

| Row | geometry | iso_a2 | name_long | continent | region_un | subregion | type | area_km2 | pop | lifeExp | gdpPercap |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | IGeometr… | String? | String | String | String | String | String | Float64 | Float64? | Float64? | Float64? |
| 1 | Geometry: wkbMultiPolygon | FJ | Fiji | Oceania | Oceania | Melanesia | Sovereign country | 19290.0 | 885806.0 | 69.96 | 8222.25 |
| 2 | Geometry: wkbMultiPolygon | TZ | Tanzania | Africa | Africa | Eastern Africa | Sovereign country | 9.32746e5 | 5.22349e7 | 64.163 | 2402.1 |
| 3 | Geometry: wkbMultiPolygon | EH | Western Sahara | Africa | Africa | Northern Africa | Indeterminate | 96270.6 | missing | missing | missing |
| 4 | Geometry: wkbMultiPolygon | CA | Canada | North America | Americas | Northern America | Sovereign country | 1.0036e7 | 3.55353e7 | 81.953 | 43079.1 |
| 5 | Geometry: wkbMultiPolygon | US | United States | North America | Americas | Northern America | Country | 9.51074e6 | 3.18623e8 | 78.8415 | 51922.0 |
| 6 | Geometry: wkbMultiPolygon | KZ | Kazakhstan | Asia | Asia | Central Asia | Sovereign country | 2.72981e6 | 1.72883e7 | 71.62 | 23587.3 |
| 7 | Geometry: wkbMultiPolygon | UZ | Uzbekistan | Asia | Asia | Central Asia | Sovereign country | 4.6141e5 | 3.07577e7 | 71.039 | 5370.87 |
| 8 | Geometry: wkbMultiPolygon | PG | Papua New Guinea | Oceania | Oceania | Melanesia | Sovereign country | 4.6452e5 | 7.75578e6 | 65.23 | 3709.08 |
| 9 | Geometry: wkbMultiPolygon | ID | Indonesia | Asia | Asia | South-Eastern Asia | Sovereign country | 1.81925e6 | 2.55131e8 | 68.856 | 10003.1 |
| 10 | Geometry: wkbMultiPolygon | AR | Argentina | South America | Americas | South America | Sovereign country | 2.78447e6 | 4.29815e7 | 76.252 | 18797.5 |
| 11 | Geometry: wkbMultiPolygon | CL | Chile | South America | Americas | South America | Sovereign country | 8.14844e5 | 1.76138e7 | 79.117 | 22195.3 |
| 12 | Geometry: wkbMultiPolygon | CD | Democratic Republic of the Congo | Africa | Africa | Middle Africa | Sovereign country | 2.32349e6 | 7.37229e7 | 58.782 | 785.347 |
| 13 | Geometry: wkbMultiPolygon | SO | Somalia | Africa | Africa | Eastern Africa | Sovereign country | 4.84333e5 | 1.35131e7 | 55.467 | missing |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 166 | Geometry: wkbMultiPolygon | ET | Ethiopia | Africa | Africa | Eastern Africa | Sovereign country | 1.13239e6 | 9.73668e7 | 64.535 | 1424.53 |
| 167 | Geometry: wkbMultiPolygon | DJ | Djibouti | Africa | Africa | Eastern Africa | Sovereign country | 21880.3 | 912164.0 | 62.006 | missing |
| 168 | Geometry: wkbMultiPolygon | missing | Somaliland | Africa | Africa | Eastern Africa | Indeterminate | 1.6735e5 | missing | missing | missing |
| 169 | Geometry: wkbMultiPolygon | UG | Uganda | Africa | Africa | Eastern Africa | Sovereign country | 2.45768e5 | 3.88333e7 | 59.224 | 1637.28 |
| 170 | Geometry: wkbMultiPolygon | RW | Rwanda | Africa | Africa | Eastern Africa | Sovereign country | 23365.4 | 1.13454e7 | 66.188 | 1629.87 |
| 171 | Geometry: wkbMultiPolygon | BA | Bosnia and Herzegovina | Europe | Europe | Southern Europe | Sovereign country | 50605.1 | 3.566e6 | 76.561 | 10516.8 |
| 172 | Geometry: wkbMultiPolygon | MK | Macedonia | Europe | Europe | Southern Europe | Sovereign country | 25062.3 | 2.0775e6 | 75.384 | 12298.5 |
| 173 | Geometry: wkbMultiPolygon | RS | Serbia | Europe | Europe | Southern Europe | Sovereign country | 76388.6 | 7.13058e6 | 75.3366 | 13112.9 |
| 174 | Geometry: wkbMultiPolygon | ME | Montenegro | Europe | Europe | Southern Europe | Sovereign country | 13443.7 | 621810.0 | 76.712 | 14796.6 |
| 175 | Geometry: wkbMultiPolygon | XK | Kosovo | Europe | Europe | Southern Europe | Sovereign country | 11230.3 | 1.8218e6 | 71.0976 | 8698.29 |
| 176 | Geometry: wkbMultiPolygon | TT | Trinidad and Tobago | North America | Americas | Caribbean | Sovereign country | 7737.81 | 1.35449e6 | 70.426 | 31181.8 |
| 177 | Geometry: wkbMultiPolygon | SS | South Sudan | Africa | Africa | Eastern Africa | Sovereign country | 6.24909e5 | 1.1531e7 | 55.817 | 1935.88 |
We can get a visual overview of the dataset by showing it (simply type the variable name in the REPL). From this we can see an abbreviated view of its contents.
But what is it? We can check the type:
typeof(world)
DataFrame
and the size:
size(world) # a 2-dimensional object, with 177 rows and 11 columns
(177, 11)
We can also use the describe function to get a summary of the dataset:
describe(world)

| Row | variable | mean | min | median | max | nmissing | eltype |
|---|---|---|---|---|---|---|---|
| | Symbol | Union… | Any | Union… | Any | Int64 | Type |
| 1 | geometry | | | | | 0 | IGeometry{wkbMultiPolygon} |
| 2 | iso_a2 | | AE | | ZW | 2 | Union{Missing, String} |
| 3 | name_long | | Afghanistan | | eSwatini | 0 | String |
| 4 | continent | | Africa | | South America | 0 | String |
| 5 | region_un | | Africa | | Seven seas (open ocean) | 0 | String |
| 6 | subregion | | Antarctica | | Western Europe | 0 | String |
| 7 | type | | Country | | Sovereign country | 0 | String |
| 8 | area_km2 | 8.32558e5 | 2416.87 | 1.85004e5 | 1.70185e7 | 0 | Float64 |
| 9 | pop | 4.28158e7 | 56295.0 | 1.04011e7 | 1.36427e9 | 10 | Union{Missing, Float64} |
| 10 | lifeExp | 70.8544 | 50.621 | 72.869 | 83.5878 | 10 | Union{Missing, Float64} |
| 11 | gdpPercap | 17106.0 | 597.135 | 10734.1 | 1.2086e5 | 17 | Union{Missing, Float64} |
This is pretty useful - we can see the type and some descriptive values for each column. describe is incredibly versatile, and you can see the docstring in the Julia REPL by typing ?describe.
Notice that the first column, :geometry, is composed of IGeometry{wkbMultiPolygon} objects. This is the geometry column, and it’s loaded by ArchGDAL.jl, which allows I/O from a truly massive range of geospatial data formats.
We can also get some geospatial information: GI.geometrycolumns(world) returns (:geometry,), and GI.crs(world) returns WellKnownText{GeoFormatTypes.CRS}(GeoFormatTypes.CRS(), "GEOGCS[\"WGS 84\",DATUM[\"WGS_1984\",SPHEROID[\"WGS 84\",6378137,298.257223563,AUTHORITY[\"EPSG\",\"7030\"]],AUTHORITY[\"EPSG\",\"6326\"]],PRIMEM[\"Greenwich\",0,AUTHORITY[\"EPSG\",\"8901\"]],UNIT[\"degree\",0.0174532925199433,AUTHORITY[\"EPSG\",\"9122\"]],AXIS[\"Latitude\",NORTH],AXIS[\"Longitude\",EAST],AUTHORITY[\"EPSG\",\"4326\"]]").
We can drop the geometry column by subsetting the DataFrame, as you’ll see in Section 2.2.2.
world_without_geom = world[:, Not(GI.geometrycolumns(world)...)]

| Row | iso_a2 | name_long | continent | region_un | subregion | type | area_km2 | pop | lifeExp | gdpPercap |
|---|---|---|---|---|---|---|---|---|---|---|
| | String? | String | String | String | String | String | Float64 | Float64? | Float64? | Float64? |
| 1 | FJ | Fiji | Oceania | Oceania | Melanesia | Sovereign country | 19290.0 | 885806.0 | 69.96 | 8222.25 |
| 2 | TZ | Tanzania | Africa | Africa | Eastern Africa | Sovereign country | 9.32746e5 | 5.22349e7 | 64.163 | 2402.1 |
| 3 | EH | Western Sahara | Africa | Africa | Northern Africa | Indeterminate | 96270.6 | missing | missing | missing |
| 4 | CA | Canada | North America | Americas | Northern America | Sovereign country | 1.0036e7 | 3.55353e7 | 81.953 | 43079.1 |
| 5 | US | United States | North America | Americas | Northern America | Country | 9.51074e6 | 3.18623e8 | 78.8415 | 51922.0 |
| 6 | KZ | Kazakhstan | Asia | Asia | Central Asia | Sovereign country | 2.72981e6 | 1.72883e7 | 71.62 | 23587.3 |
| 7 | UZ | Uzbekistan | Asia | Asia | Central Asia | Sovereign country | 4.6141e5 | 3.07577e7 | 71.039 | 5370.87 |
| 8 | PG | Papua New Guinea | Oceania | Oceania | Melanesia | Sovereign country | 4.6452e5 | 7.75578e6 | 65.23 | 3709.08 |
| 9 | ID | Indonesia | Asia | Asia | South-Eastern Asia | Sovereign country | 1.81925e6 | 2.55131e8 | 68.856 | 10003.1 |
| 10 | AR | Argentina | South America | Americas | South America | Sovereign country | 2.78447e6 | 4.29815e7 | 76.252 | 18797.5 |
| 11 | CL | Chile | South America | Americas | South America | Sovereign country | 8.14844e5 | 1.76138e7 | 79.117 | 22195.3 |
| 12 | CD | Democratic Republic of the Congo | Africa | Africa | Middle Africa | Sovereign country | 2.32349e6 | 7.37229e7 | 58.782 | 785.347 |
| 13 | SO | Somalia | Africa | Africa | Eastern Africa | Sovereign country | 4.84333e5 | 1.35131e7 | 55.467 | missing |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 166 | ET | Ethiopia | Africa | Africa | Eastern Africa | Sovereign country | 1.13239e6 | 9.73668e7 | 64.535 | 1424.53 |
| 167 | DJ | Djibouti | Africa | Africa | Eastern Africa | Sovereign country | 21880.3 | 912164.0 | 62.006 | missing |
| 168 | missing | Somaliland | Africa | Africa | Eastern Africa | Indeterminate | 1.6735e5 | missing | missing | missing |
| 169 | UG | Uganda | Africa | Africa | Eastern Africa | Sovereign country | 2.45768e5 | 3.88333e7 | 59.224 | 1637.28 |
| 170 | RW | Rwanda | Africa | Africa | Eastern Africa | Sovereign country | 23365.4 | 1.13454e7 | 66.188 | 1629.87 |
| 171 | BA | Bosnia and Herzegovina | Europe | Europe | Southern Europe | Sovereign country | 50605.1 | 3.566e6 | 76.561 | 10516.8 |
| 172 | MK | Macedonia | Europe | Europe | Southern Europe | Sovereign country | 25062.3 | 2.0775e6 | 75.384 | 12298.5 |
| 173 | RS | Serbia | Europe | Europe | Southern Europe | Sovereign country | 76388.6 | 7.13058e6 | 75.3366 | 13112.9 |
| 174 | ME | Montenegro | Europe | Europe | Southern Europe | Sovereign country | 13443.7 | 621810.0 | 76.712 | 14796.6 |
| 175 | XK | Kosovo | Europe | Europe | Southern Europe | Sovereign country | 11230.3 | 1.8218e6 | 71.0976 | 8698.29 |
| 176 | TT | Trinidad and Tobago | North America | Americas | Caribbean | Sovereign country | 7737.81 | 1.35449e6 | 70.426 | 31181.8 |
| 177 | SS | South Sudan | Africa | Africa | Eastern Africa | Sovereign country | 6.24909e5 | 1.1531e7 | 55.817 | 1935.88 |
Dropping the geometry column before working with attribute data can sometimes be useful: data manipulation can run faster when it works only on the attribute data, and geometry columns are not always needed. For most cases, however, it makes sense to keep the geometry column. Becoming skilled at geographic attribute data manipulation means becoming skilled at manipulating data frames.
2.2.2 Vector attribute subsetting
There are multiple ways to subset data in Julia.
First, and probably most simply, we can index into the DataFrame object using a few kinds of selectors. This can select rows and columns.
Indices are placed inside square brackets placed directly after a data frame object name, and specify the elements to keep.
Rows are referred to using integers, and columns may be referred to using integers or symbols (:name).
Indexing in Julia is 1-based, like R, and unlike Python which is 0-based.
It’s performed using the [inds...] operator. The : operator is used to select all elements in that dimension, and you can select a range using start:stop. You can also pass vectors of indices or boolean values to select specific elements.
In DataFrames.jl, using ! as the row selector, as in world[!, :pop], returns the column exactly as it is stored in the data frame, without copying; world[:, :pop] returns a copy instead. The ! form is also commonly used when replacing an entire column or creating a new one.
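The difference between the two row selectors is easiest to see on a small table. The sketch below uses a made-up DataFrame (the column names and values are illustrative, not taken from world) to show that `!` hands back the stored column while `:` hands back a copy.

```julia
using DataFrames

df = DataFrame(id = 1:3, pop = [10.0, 20.0, 30.0])

stored = df[!, :pop]   # the column as stored in df (no copy)
copied = df[:, :pop]   # a freshly allocated copy

stored[1] = 99.0       # changes df, because `stored` is df's own column
copied[2] = -1.0       # leaves df untouched

# `!` on the left-hand side creates (or replaces) a whole column
df[!, :pop_thousands] = df.pop ./ 1000
```

Prefer `:` when you want a vector you can modify freely, and `!` when you want to avoid the copy or assign a whole column.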
Rows always go in the first position and columns in the second. We can select the first 5 rows of the :pop column like so:
world[1:5, :pop]
5-element Vector{Union{Missing, Float64}}:
885806.0
5.2234869e7
missing
3.5535348e7
3.18622525e8
This returns a vector, since we’ve only selected a single column. We can also select multiple columns by passing a vector of column names:
world[5:end, [:pop, :continent]]

| Row | pop | continent |
|---|---|---|
| | Float64? | String |
| 1 | 3.18623e8 | North America |
| 2 | 1.72883e7 | Asia |
| 3 | 3.07577e7 | Asia |
| 4 | 7.75578e6 | Oceania |
| 5 | 2.55131e8 | Asia |
| 6 | 4.29815e7 | South America |
| 7 | 1.76138e7 | South America |
| 8 | 7.37229e7 | Africa |
| 9 | 1.35131e7 | Africa |
| 10 | 4.60242e7 | Africa |
| 11 | 3.77379e7 | Africa |
| 12 | 1.35694e7 | Africa |
| 13 | 1.05725e7 | North America |
| ⋮ | ⋮ | ⋮ |
| 162 | 9.73668e7 | Africa |
| 163 | 912164.0 | Africa |
| 164 | missing | Africa |
| 165 | 3.88333e7 | Africa |
| 166 | 1.13454e7 | Africa |
| 167 | 3.566e6 | Europe |
| 168 | 2.0775e6 | Europe |
| 169 | 7.13058e6 | Europe |
| 170 | 621810.0 | Europe |
| 171 | 1.8218e6 | Europe |
| 172 | 1.35449e6 | North America |
| 173 | 1.1531e7 | Africa |
Note that this returns a new DataFrame with only the selected columns.
We can also select using negations via the Not function:
world[1:5, Not(:pop)]

| Row | geometry | iso_a2 | name_long | continent | region_un | subregion | type | area_km2 | lifeExp | gdpPercap |
|---|---|---|---|---|---|---|---|---|---|---|
| | IGeometr… | String? | String | String | String | String | String | Float64 | Float64? | Float64? |
| 1 | Geometry: wkbMultiPolygon | FJ | Fiji | Oceania | Oceania | Melanesia | Sovereign country | 19290.0 | 69.96 | 8222.25 |
| 2 | Geometry: wkbMultiPolygon | TZ | Tanzania | Africa | Africa | Eastern Africa | Sovereign country | 9.32746e5 | 64.163 | 2402.1 |
| 3 | Geometry: wkbMultiPolygon | EH | Western Sahara | Africa | Africa | Northern Africa | Indeterminate | 96270.6 | missing | missing |
| 4 | Geometry: wkbMultiPolygon | CA | Canada | North America | Americas | Northern America | Sovereign country | 1.0036e7 | 81.953 | 43079.1 |
| 5 | Geometry: wkbMultiPolygon | US | United States | North America | Americas | Northern America | Country | 9.51074e6 | 78.8415 | 51922.0 |
or
world[Not(1:150), :]

| Row | geometry | iso_a2 | name_long | continent | region_un | subregion | type | area_km2 | pop | lifeExp | gdpPercap |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | IGeometr… | String? | String | String | String | String | String | Float64 | Float64? | Float64? | Float64? |
| 1 | Geometry: wkbMultiPolygon | SI | Slovenia | Europe | Europe | Southern Europe | Sovereign country | 19118.1 | 2.06198e6 | 81.078 | 28417.7 |
| 2 | Geometry: wkbMultiPolygon | FI | Finland | Europe | Europe | Northern Europe | Country | 3.41242e5 | 5.46151e6 | 81.1805 | 39017.5 |
| 3 | Geometry: wkbMultiPolygon | SK | Slovakia | Europe | Europe | Eastern Europe | Sovereign country | 47068.1 | 5.41865e6 | 76.8122 | 27285.3 |
| 4 | Geometry: wkbMultiPolygon | CZ | Czech Republic | Europe | Europe | Eastern Europe | Sovereign country | 81207.6 | 1.05253e7 | 78.8244 | 29119.6 |
| 5 | Geometry: wkbMultiPolygon | ER | Eritrea | Africa | Africa | Eastern Africa | Sovereign country | 1.1932e5 | missing | 64.174 | missing |
| 6 | Geometry: wkbMultiPolygon | JP | Japan | Asia | Asia | Eastern Asia | Sovereign country | 4.0462e5 | 1.27276e8 | 83.5878 | 37337.3 |
| 7 | Geometry: wkbMultiPolygon | PY | Paraguay | South America | Americas | South America | Sovereign country | 4.01336e5 | 6.55258e6 | 72.913 | 8501.54 |
| 8 | Geometry: wkbMultiPolygon | YE | Yemen | Asia | Asia | Western Asia | Sovereign country | 455915.0 | 2.62463e7 | 64.523 | 3766.81 |
| 9 | Geometry: wkbMultiPolygon | SA | Saudi Arabia | Asia | Asia | Western Asia | Sovereign country | 1.92032e6 | 3.07767e7 | 74.234 | 49958.4 |
| 10 | Geometry: wkbMultiPolygon | AQ | Antarctica | Antarctica | Antarctica | Antarctica | Indeterminate | 1.2336e7 | missing | missing | missing |
| 11 | Geometry: wkbMultiPolygon | missing | Northern Cyprus | Asia | Asia | Western Asia | Sovereign country | 3786.36 | missing | missing | missing |
| 12 | Geometry: wkbMultiPolygon | CY | Cyprus | Asia | Asia | Western Asia | Sovereign country | 6207.01 | 1.15231e6 | 80.173 | 29786.4 |
| 13 | Geometry: wkbMultiPolygon | MA | Morocco | Africa | Africa | Northern Africa | Sovereign country | 591719.0 | 3.43181e7 | 75.309 | 7078.88 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 16 | Geometry: wkbMultiPolygon | ET | Ethiopia | Africa | Africa | Eastern Africa | Sovereign country | 1.13239e6 | 9.73668e7 | 64.535 | 1424.53 |
| 17 | Geometry: wkbMultiPolygon | DJ | Djibouti | Africa | Africa | Eastern Africa | Sovereign country | 21880.3 | 912164.0 | 62.006 | missing |
| 18 | Geometry: wkbMultiPolygon | missing | Somaliland | Africa | Africa | Eastern Africa | Indeterminate | 1.6735e5 | missing | missing | missing |
| 19 | Geometry: wkbMultiPolygon | UG | Uganda | Africa | Africa | Eastern Africa | Sovereign country | 2.45768e5 | 3.88333e7 | 59.224 | 1637.28 |
| 20 | Geometry: wkbMultiPolygon | RW | Rwanda | Africa | Africa | Eastern Africa | Sovereign country | 23365.4 | 1.13454e7 | 66.188 | 1629.87 |
| 21 | Geometry: wkbMultiPolygon | BA | Bosnia and Herzegovina | Europe | Europe | Southern Europe | Sovereign country | 50605.1 | 3.566e6 | 76.561 | 10516.8 |
| 22 | Geometry: wkbMultiPolygon | MK | Macedonia | Europe | Europe | Southern Europe | Sovereign country | 25062.3 | 2.0775e6 | 75.384 | 12298.5 |
| 23 | Geometry: wkbMultiPolygon | RS | Serbia | Europe | Europe | Southern Europe | Sovereign country | 76388.6 | 7.13058e6 | 75.3366 | 13112.9 |
| 24 | Geometry: wkbMultiPolygon | ME | Montenegro | Europe | Europe | Southern Europe | Sovereign country | 13443.7 | 621810.0 | 76.712 | 14796.6 |
| 25 | Geometry: wkbMultiPolygon | XK | Kosovo | Europe | Europe | Southern Europe | Sovereign country | 11230.3 | 1.8218e6 | 71.0976 | 8698.29 |
| 26 | Geometry: wkbMultiPolygon | TT | Trinidad and Tobago | North America | Americas | Caribbean | Sovereign country | 7737.81 | 1.35449e6 | 70.426 | 31181.8 |
| 27 | Geometry: wkbMultiPolygon | SS | South Sudan | Africa | Africa | Eastern Africa | Sovereign country | 6.24909e5 | 1.1531e7 | 55.817 | 1935.88 |
You can pass any collection of indices to Not, and it will cause all elements in the dataframe that are not in that collection to be selected.
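A few more forms of Not, sketched on a small made-up table (the column names are illustrative only):

```julia
using DataFrames

df = DataFrame(a = 1:4, b = ["w", "x", "y", "z"], c = [0.1, 0.2, 0.3, 0.4])

no_b = df[:, Not(:b)]             # every column except :b
not_first_two = df[Not(1:2), :]   # every row except rows 1 and 2
neither = df[:, Not([:a, :c])]    # a vector of column names works too
```

In each case the result keeps the original order of the remaining rows or columns.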
Here’s a small exercise: guess the number of rows and columns in the DataFrame objects returned by each of the following commands, then check your answer by executing the commands in Julia.
world[1:6, :] # subset rows by position
world[:, 1:3] # subset columns by position
world[1:6, 1:3] # subset rows and columns by position
world[:, [:name_long, :pop]] # columns by name
world[:, [true, true, false, false, false, false, false, true, true, false, false]] # by logical indices
world[:, 888] # an index representing a non-existent column
We can also drop all rows that have a missing value in a given column using the dropmissing function:
world_with_area = dropmissing(world, :area_km2)

| Row | geometry | iso_a2 | name_long | continent | region_un | subregion | type | area_km2 | pop | lifeExp | gdpPercap |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | IGeometr… | String? | String | String | String | String | String | Float64 | Float64? | Float64? | Float64? |
| 1 | Geometry: wkbMultiPolygon | FJ | Fiji | Oceania | Oceania | Melanesia | Sovereign country | 19290.0 | 885806.0 | 69.96 | 8222.25 |
| 2 | Geometry: wkbMultiPolygon | TZ | Tanzania | Africa | Africa | Eastern Africa | Sovereign country | 9.32746e5 | 5.22349e7 | 64.163 | 2402.1 |
| 3 | Geometry: wkbMultiPolygon | EH | Western Sahara | Africa | Africa | Northern Africa | Indeterminate | 96270.6 | missing | missing | missing |
| 4 | Geometry: wkbMultiPolygon | CA | Canada | North America | Americas | Northern America | Sovereign country | 1.0036e7 | 3.55353e7 | 81.953 | 43079.1 |
| 5 | Geometry: wkbMultiPolygon | US | United States | North America | Americas | Northern America | Country | 9.51074e6 | 3.18623e8 | 78.8415 | 51922.0 |
| 6 | Geometry: wkbMultiPolygon | KZ | Kazakhstan | Asia | Asia | Central Asia | Sovereign country | 2.72981e6 | 1.72883e7 | 71.62 | 23587.3 |
| 7 | Geometry: wkbMultiPolygon | UZ | Uzbekistan | Asia | Asia | Central Asia | Sovereign country | 4.6141e5 | 3.07577e7 | 71.039 | 5370.87 |
| 8 | Geometry: wkbMultiPolygon | PG | Papua New Guinea | Oceania | Oceania | Melanesia | Sovereign country | 4.6452e5 | 7.75578e6 | 65.23 | 3709.08 |
| 9 | Geometry: wkbMultiPolygon | ID | Indonesia | Asia | Asia | South-Eastern Asia | Sovereign country | 1.81925e6 | 2.55131e8 | 68.856 | 10003.1 |
| 10 | Geometry: wkbMultiPolygon | AR | Argentina | South America | Americas | South America | Sovereign country | 2.78447e6 | 4.29815e7 | 76.252 | 18797.5 |
| 11 | Geometry: wkbMultiPolygon | CL | Chile | South America | Americas | South America | Sovereign country | 8.14844e5 | 1.76138e7 | 79.117 | 22195.3 |
| 12 | Geometry: wkbMultiPolygon | CD | Democratic Republic of the Congo | Africa | Africa | Middle Africa | Sovereign country | 2.32349e6 | 7.37229e7 | 58.782 | 785.347 |
| 13 | Geometry: wkbMultiPolygon | SO | Somalia | Africa | Africa | Eastern Africa | Sovereign country | 4.84333e5 | 1.35131e7 | 55.467 | missing |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 166 | Geometry: wkbMultiPolygon | ET | Ethiopia | Africa | Africa | Eastern Africa | Sovereign country | 1.13239e6 | 9.73668e7 | 64.535 | 1424.53 |
| 167 | Geometry: wkbMultiPolygon | DJ | Djibouti | Africa | Africa | Eastern Africa | Sovereign country | 21880.3 | 912164.0 | 62.006 | missing |
| 168 | Geometry: wkbMultiPolygon | missing | Somaliland | Africa | Africa | Eastern Africa | Indeterminate | 1.6735e5 | missing | missing | missing |
| 169 | Geometry: wkbMultiPolygon | UG | Uganda | Africa | Africa | Eastern Africa | Sovereign country | 2.45768e5 | 3.88333e7 | 59.224 | 1637.28 |
| 170 | Geometry: wkbMultiPolygon | RW | Rwanda | Africa | Africa | Eastern Africa | Sovereign country | 23365.4 | 1.13454e7 | 66.188 | 1629.87 |
| 171 | Geometry: wkbMultiPolygon | BA | Bosnia and Herzegovina | Europe | Europe | Southern Europe | Sovereign country | 50605.1 | 3.566e6 | 76.561 | 10516.8 |
| 172 | Geometry: wkbMultiPolygon | MK | Macedonia | Europe | Europe | Southern Europe | Sovereign country | 25062.3 | 2.0775e6 | 75.384 | 12298.5 |
| 173 | Geometry: wkbMultiPolygon | RS | Serbia | Europe | Europe | Southern Europe | Sovereign country | 76388.6 | 7.13058e6 | 75.3366 | 13112.9 |
| 174 | Geometry: wkbMultiPolygon | ME | Montenegro | Europe | Europe | Southern Europe | Sovereign country | 13443.7 | 621810.0 | 76.712 | 14796.6 |
| 175 | Geometry: wkbMultiPolygon | XK | Kosovo | Europe | Europe | Southern Europe | Sovereign country | 11230.3 | 1.8218e6 | 71.0976 | 8698.29 |
| 176 | Geometry: wkbMultiPolygon | TT | Trinidad and Tobago | North America | Americas | Caribbean | Sovereign country | 7737.81 | 1.35449e6 | 70.426 | 31181.8 |
| 177 | Geometry: wkbMultiPolygon | SS | South Sudan | Africa | Africa | Eastern Africa | Sovereign country | 6.24909e5 | 1.1531e7 | 55.817 | 1935.88 |
There is also a mutating version of dropmissing, called dropmissing!, which modifies the input in place.
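The following sketch contrasts the two variants on a toy table with made-up values. Passing a column name restricts the check to that column, while omitting it drops any row that contains a missing value.

```julia
using DataFrames

df = DataFrame(name = ["A", "B", "C"],
               pop = [1.0, missing, 3.0],
               gdp = [missing, 2.0, 5.0])

by_pop = dropmissing(df, :pop)   # new table; only :pop is checked
complete = dropmissing(df)       # new table; any missing value drops the row

dropmissing!(df, :gdp)           # modifies df itself
```

By default the checked columns also lose `Missing` from their element type, so `by_pop.pop` is a plain `Vector{Float64}`.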
We can also subset by a boolean vector, computed on some predicate.
Earlier on, we saw that we could extract a column as a vector using df.columnname.
We can use this vector of values to create a boolean vector (sometimes called a logical vector in R) that we can use to index into the DataFrame.
Let’s select all countries whose surface area is smaller than 10,000 km^2.
countries_to_select = world_with_area.area_km2 .< 10_000
177-element BitVector:
0
0
0
0
0
0
0
0
0
0
⋮
0
0
0
0
0
0
0
1
0
This is a simple vector, with boolean elements and the same length as the number of rows in the DataFrame.
We use it to select all rows in the DataFrame where its value is true.
world_with_area[countries_to_select, :]

| Row | geometry | iso_a2 | name_long | continent | region_un | subregion | type | area_km2 | pop | lifeExp | gdpPercap |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | IGeometr… | String? | String | String | String | String | String | Float64 | Float64? | Float64? | Float64? |
| 1 | Geometry: wkbMultiPolygon | PR | Puerto Rico | North America | Americas | Caribbean | Dependency | 9224.66 | 3.53487e6 | 79.3901 | 35066.0 |
| 2 | Geometry: wkbMultiPolygon | PS | Palestine | Asia | Asia | Western Asia | Disputed | 5037.1 | 4.29468e6 | 73.126 | 4319.53 |
| 3 | Geometry: wkbMultiPolygon | VU | Vanuatu | Oceania | Oceania | Melanesia | Sovereign country | 7490.04 | 258850.0 | 71.709 | 2892.34 |
| 4 | Geometry: wkbMultiPolygon | LU | Luxembourg | Europe | Europe | Western Europe | Sovereign country | 2416.87 | 556319.0 | 82.2293 | 93655.3 |
| 5 | Geometry: wkbMultiPolygon | missing | Northern Cyprus | Asia | Asia | Western Asia | Sovereign country | 3786.36 | missing | missing | missing |
| 6 | Geometry: wkbMultiPolygon | CY | Cyprus | Asia | Asia | Western Asia | Sovereign country | 6207.01 | 1.15231e6 | 80.173 | 29786.4 |
| 7 | Geometry: wkbMultiPolygon | TT | Trinidad and Tobago | North America | Americas | Caribbean | Sovereign country | 7737.81 | 1.35449e6 | 70.426 | 31181.8 |
A more concise way to achieve the same result, without the intermediate array, is world_with_area[world_with_area.area_km2 .< 10_000, :].
This syntax is applicable to columns too!
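As a small sketch of boolean indexing on both dimensions, the example below uses a made-up table rather than world. It also shows one way to handle missing values, which is an addition not covered in the text above: a comparison involving missing yields missing, which cannot be used as an index, so coalesce is used to turn those entries into false.

```julia
using DataFrames

df = DataFrame(name = ["A", "B", "C"],
               area = [500.0, 20_000.0, 7_500.0],
               pop = [1.0, missing, 3.0])

# Rows: keep those where the predicate holds
small = df[df.area .< 10_000, :]

# Columns: keep those whose element type is a plain number
is_numeric = [eltype(col) <: Real for col in eachcol(df)]
numeric = df[:, is_numeric]

# Comparisons with missing give missing; replace those entries with false
mask = coalesce.(df.pop .> 2, false)
populous = df[mask, :]
```

Here :pop is excluded from `numeric` because its element type is `Union{Missing, Float64}`, which is not a subtype of `Real`.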
There are ways to achieve this result using all of the DataFrame manipulation packages mentioned above.
DataFrames.jl also defines a subset function, which is another way to achieve this result:
subset(world_with_area, :area_km2 => ByRow(x -> x < 10_000))

| Row | geometry | iso_a2 | name_long | continent | region_un | subregion | type | area_km2 | pop | lifeExp | gdpPercap |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | IGeometr… | String? | String | String | String | String | String | Float64 | Float64? | Float64? | Float64? |
| 1 | Geometry: wkbMultiPolygon | PR | Puerto Rico | North America | Americas | Caribbean | Dependency | 9224.66 | 3.53487e6 | 79.3901 | 35066.0 |
| 2 | Geometry: wkbMultiPolygon | PS | Palestine | Asia | Asia | Western Asia | Disputed | 5037.1 | 4.29468e6 | 73.126 | 4319.53 |
| 3 | Geometry: wkbMultiPolygon | VU | Vanuatu | Oceania | Oceania | Melanesia | Sovereign country | 7490.04 | 258850.0 | 71.709 | 2892.34 |
| 4 | Geometry: wkbMultiPolygon | LU | Luxembourg | Europe | Europe | Western Europe | Sovereign country | 2416.87 | 556319.0 | 82.2293 | 93655.3 |
| 5 | Geometry: wkbMultiPolygon | missing | Northern Cyprus | Asia | Asia | Western Asia | Sovereign country | 3786.36 | missing | missing | missing |
| 6 | Geometry: wkbMultiPolygon | CY | Cyprus | Asia | Asia | Western Asia | Sovereign country | 6207.01 | 1.15231e6 | 80.173 | 29786.4 |
| 7 | Geometry: wkbMultiPolygon | TT | Trinidad and Tobago | North America | Americas | Caribbean | Sovereign country | 7737.81 | 1.35449e6 | 70.426 | 31181.8 |
DataFramesMeta.jl provides a convenient syntax for subsetting DataFrames using a DSL that closely resembles the tidyverse.
using DataFramesMeta
@chain world_with_area begin
    @subset @byrow (:area_km2 < 10_000)
    select(:name_long, :area_km2)
end
TidierData.jl provides a convenient syntax for subsetting DataFrames using a DSL that closely resembles the tidyverse.
using TidierData
@chain world_with_area begin
    @filter(area_km2 < 10_000)
    @select(name_long, area_km2)
end
Query.jl provides a LINQ-style query syntax for subsetting DataFrames.
using Query
@from row in world_with_area begin
    @where row.area_km2 < 10_000
    @select {name_long = row.name_long, area_km2 = row.area_km2}
    @collect DataFrame
end
2.2.2.1 Subsetting by predicate
We saw how a boolean vector can be used to index into a DataFrame and select the rows where it is true.
However, this requires creating the boolean vector first, which is powerful but can be clunky.
DataFrames.jl offers a more direct route in the subset function, which we just saw above. It takes a column => predicate pair and keeps the rows for which the predicate returns true; here <(10_000) is shorthand for x -> x < 10_000:
small_countries = subset(world_with_area, :area_km2 => ByRow(<(10_000)))
| Row | geometry | iso_a2 | name_long | continent | region_un | subregion | type | area_km2 | pop | lifeExp | gdpPercap |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | IGeometr… | String? | String | String | String | String | String | Float64 | Float64? | Float64? | Float64? |
| 1 | Geometry: wkbMultiPolygon | PR | Puerto Rico | North America | Americas | Caribbean | Dependency | 9224.66 | 3.53487e6 | 79.3901 | 35066.0 |
| 2 | Geometry: wkbMultiPolygon | PS | Palestine | Asia | Asia | Western Asia | Disputed | 5037.1 | 4.29468e6 | 73.126 | 4319.53 |
| 3 | Geometry: wkbMultiPolygon | VU | Vanuatu | Oceania | Oceania | Melanesia | Sovereign country | 7490.04 | 258850.0 | 71.709 | 2892.34 |
| 4 | Geometry: wkbMultiPolygon | LU | Luxembourg | Europe | Europe | Western Europe | Sovereign country | 2416.87 | 556319.0 | 82.2293 | 93655.3 |
| 5 | Geometry: wkbMultiPolygon | missing | Northern Cyprus | Asia | Asia | Western Asia | Sovereign country | 3786.36 | missing | missing | missing |
| 6 | Geometry: wkbMultiPolygon | CY | Cyprus | Asia | Asia | Western Asia | Sovereign country | 6207.01 | 1.15231e6 | 80.173 | 29786.4 |
| 7 | Geometry: wkbMultiPolygon | TT | Trinidad and Tobago | North America | Americas | Caribbean | Sovereign country | 7737.81 | 1.35449e6 | 70.426 | 31181.8 |
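Predicates need a little more care when the tested column contains missing values, as pop does in world: a comparison with missing evaluates to missing rather than true or false, and subset refuses such a result unless told to skip it. The following is a minimal sketch on a made-up table (the toy DataFrame and its values are hypothetical, not part of world), using the skipmissing keyword of subset:

```julia
using DataFrames

# Hypothetical toy table; the values are illustrative only
toy = DataFrame(
    name = ["A", "B", "C", "D"],
    pop  = [5.0e5, missing, 2.0e6, 8.0e6],
)

# `missing > 1.0e6` evaluates to `missing`; with `skipmissing = true`
# rows whose predicate is `missing` are dropped instead of raising an error
big = subset(toy, :pop => ByRow(>(1.0e6)), skipmissing = true)
```

Here big keeps only rows C and D; row B is dropped because its population is unknown, not because it is known to be small.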
2.2.3 Chaining operations
DataFrames.jl functions are mature, stable and widely used, making them a rock solid choice, especially in contexts where reproducibility and reliability are key.
Functions from the DataFrames manipulation packages mentioned earlier (DataFramesMeta.jl, TidierData.jl, and Query.jl) are also available, and quite stable at this point. They offer “tidy” workflows which can sometimes be more intuitive and productive for interactive data analysis, as well as easier to reason about.
The following example demonstrates chaining multiple operations using DataFramesMeta.jl’s @chain macro. We filter Asian countries, select specific columns, and take the first 5 rows:
using DataFramesMeta
asia_sample = @chain world begin
    @subset(:continent .== "Asia")
    @select(:name_long, :continent, :pop)
    first(5)
end
| Row | name_long | continent | pop |
|---|---|---|---|
| | String | String | Float64? |
| 1 | Kazakhstan | Asia | 1.72883e7 |
| 2 | Uzbekistan | Asia | 3.07577e7 |
| 3 | Indonesia | Asia | 2.55131e8 |
| 4 | Timor-Leste | Asia | 1.21281e6 |
| 5 | Israel | Asia | 8.2157e6 |
This is equivalent to the following nested DataFrames.jl operations:
asia_sample2 = first(
    select(
        subset(world, :continent => ByRow(==("Asia"))),
        [:name_long, :continent, :pop]
    ),
    5
)
| Row | name_long | continent | pop |
|---|---|---|---|
| | String | String | Float64? |
| 1 | Kazakhstan | Asia | 1.72883e7 |
| 2 | Uzbekistan | Asia | 3.07577e7 |
| 3 | Indonesia | Asia | 2.55131e8 |
| 4 | Timor-Leste | Asia | 1.21281e6 |
| 5 | Israel | Asia | 8.2157e6 |
Each approach has advantages: chained operations read top-to-bottom like a pipeline, while nested operations are explicit about function composition. For interactive analysis, chaining is often more intuitive.
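A third option, not used elsewhere in this chapter, is Julia's built-in |> operator combined with anonymous functions. It needs no extra package but is more verbose than @chain. A minimal sketch on a made-up table (toy is hypothetical and stands in for world):

```julia
using DataFrames

# Hypothetical stand-in for `world`
toy = DataFrame(
    name_long = ["Kazakhstan", "Fiji", "Israel", "Kenya"],
    continent = ["Asia", "Oceania", "Asia", "Africa"],
)

# Each stage is an anonymous function that receives the previous result
asia = toy |>
    (df -> subset(df, :continent => ByRow(==("Asia")))) |>
    (df -> select(df, :name_long)) |>
    (df -> first(df, 1))
```

The steps read top to bottom as in @chain, at the cost of spelling out the df argument at every stage.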
2.2.4 Vector attribute aggregation
Aggregation involves summarizing data based on one or more grouping variables, typically values in a column of the data frame to be aggregated. Geographic aggregation is covered in the next chapter; here we focus on attribute-based aggregation.
A classic example is calculating the number of people per continent based on country-level data. The world dataset contains the necessary ingredients: the columns pop and continent, the population and the grouping variable, respectively. The aim is to find the sum() of country populations for each continent.
In DataFrames.jl, attribute-based aggregation is achieved using groupby and combine:
world_agg1 = combine(
    groupby(dropmissing(world, :pop), :continent),
    :pop => sum => :total_pop
)
| Row | continent | total_pop |
|---|---|---|
| | String | Float64 |
| 1 | Oceania | 3.77578e7 |
| 2 | Africa | 1.15495e9 |
| 3 | North America | 5.65029e8 |
| 4 | Asia | 4.31141e9 |
| 5 | South America | 4.12061e8 |
| 6 | Europe | 6.69036e8 |
The result is a (non-spatial) table with six rows, one per continent, and two columns reporting the name and total population of each continent.
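Dropping rows with dropmissing before grouping is one way to deal with missing populations. An alternative, sketched here on a made-up table (toy and its values are hypothetical), is to keep all rows and skip missing values inside the aggregation function itself, by composing sum with skipmissing:

```julia
using DataFrames

# Hypothetical toy table with one missing population
toy = DataFrame(
    continent = ["Asia", "Asia", "Africa", "Africa"],
    pop       = [10.0, missing, 3.0, 4.0],
)

agg = combine(
    groupby(toy, :continent),
    :pop => sum∘skipmissing => :total_pop,  # missing values are skipped within each group
    nrow => :n_rows,                         # counts every row, including those with missing pop
)
```

The two approaches give the same sums, but row counts differ: with dropmissing the Asia group would contain one row, whereas here n_rows reports two.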
We can perform more complex aggregations by passing multiple aggregation expressions. The following calculates population sum, area sum, and country count per continent:
world_agg2 = combine(
    groupby(dropmissing(world, [:pop, :area_km2]), :continent),
    :pop => sum => :total_pop,
    :area_km2 => sum => :total_area,
    :name_long => length => :n_countries
)
| Row | continent | total_pop | total_area | n_countries |
|---|---|---|---|---|
| | String | Float64 | Float64 | Int64 |
| 1 | Oceania | 3.77578e7 | 8.50449e6 | 7 |
| 2 | Africa | 1.15495e9 | 2.95633e7 | 48 |
| 3 | North America | 5.65029e8 | 2.44843e7 | 18 |
| 4 | Asia | 4.31141e9 | 3.12143e7 | 45 |
| 5 | South America | 4.12061e8 | 1.77462e7 | 12 |
| 6 | Europe | 6.69036e8 | 2.20224e7 | 37 |
Using DataFramesMeta.jl, the same operation can be written more concisely:
world_agg3 = @chain world begin
    dropmissing([:pop, :area_km2])
    groupby(:continent)
    @combine(
        :total_pop = sum(:pop),
        :total_area = sum(:area_km2),
        :n_countries = length(:name_long)
    )
end
| Row | continent | total_pop | total_area | n_countries |
|---|---|---|---|---|
| | String | Float64 | Float64 | Int64 |
| 1 | Oceania | 3.77578e7 | 8.50449e6 | 7 |
| 2 | Africa | 1.15495e9 | 2.95633e7 | 48 |
| 3 | North America | 5.65029e8 | 2.44843e7 | 18 |
| 4 | Asia | 4.31141e9 | 3.12143e7 | 45 |
| 5 | South America | 4.12061e8 | 1.77462e7 | 12 |
| 6 | Europe | 6.69036e8 | 2.20224e7 | 37 |
Let’s extend this example by calculating population density and selecting the top 3 most populous continents:
top_continents = @chain world begin
    dropmissing([:pop, :area_km2])
    groupby(:continent)
    @combine(
        :total_pop = sum(:pop),
        :total_area = sum(:area_km2),
        :n_countries = length(:name_long)
    )
    @rtransform(:pop_density = :total_pop / :total_area)
    @orderby(-:total_pop)
    first(3)
end
| Row | continent | total_pop | total_area | n_countries | pop_density |
|---|---|---|---|---|---|
| | String | Float64 | Float64 | Int64 | Float64 |
| 1 | Asia | 4.31141e9 | 3.12143e7 | 45 | 138.123 |
| 2 | Africa | 1.15495e9 | 2.95633e7 | 48 | 39.067 |
| 3 | Europe | 6.69036e8 | 2.20224e7 | 37 | 30.3798 |
2.2.5 Vector attribute joining
Combining data from different sources is a common task in data preparation. Joins do this by combining tables based on a shared ‘key’ variable. DataFrames.jl provides several join functions including leftjoin, innerjoin, rightjoin, and outerjoin.
A common type of attribute join on spatial data is to join DataFrames to GeoDataFrames. In the following example, we combine data on coffee production with the world dataset. The coffee data is in a CSV file containing major coffee-producing nations:
using CSV
coffee_data = CSV.read("data/coffee_data.csv", DataFrame)
first(coffee_data, 5)
| Row | name_long | coffee_production_2016 | coffee_production_2017 |
|---|---|---|---|
| | String31 | String7 | String7 |
| 1 | Angola | NA | NA |
| 2 | Bolivia | 3 | 4 |
| 3 | Brazil | 3277 | 2786 |
| 4 | Burundi | 37 | 38 |
| 5 | Cameroon | 8 | 6 |
Its columns are name_long (country name), and coffee_production_2016 and coffee_production_2017 (estimated values for coffee production in units of 60-kg bags per year).
A left join preserves all rows from the first dataset and adds matching columns from the second:
world_coffee = leftjoin(world, coffee_data, on = :name_long)
| Row | geometry | iso_a2 | name_long | continent | region_un | subregion | type | area_km2 | pop | lifeExp | gdpPercap | coffee_production_2016 | coffee_production_2017 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | IGeometr… | String? | String | String | String | String | String | Float64 | Float64? | Float64? | Float64? | String7? | String7? |
| 1 | Geometry: wkbMultiPolygon | TZ | Tanzania | Africa | Africa | Eastern Africa | Sovereign country | 9.32746e5 | 5.22349e7 | 64.163 | 2402.1 | 81 | 66 |
| 2 | Geometry: wkbMultiPolygon | PG | Papua New Guinea | Oceania | Oceania | Melanesia | Sovereign country | 4.6452e5 | 7.75578e6 | 65.23 | 3709.08 | 114 | 74 |
| 3 | Geometry: wkbMultiPolygon | ID | Indonesia | Asia | Asia | South-Eastern Asia | Sovereign country | 1.81925e6 | 2.55131e8 | 68.856 | 10003.1 | 742 | 360 |
| 4 | Geometry: wkbMultiPolygon | KE | Kenya | Africa | Africa | Eastern Africa | Sovereign country | 5.90837e5 | 4.60242e7 | 66.242 | 2753.24 | 60 | 50 |
| 5 | Geometry: wkbMultiPolygon | DO | Dominican Republic | North America | Americas | Caribbean | Sovereign country | 48157.9 | 1.04058e7 | 73.483 | 12663.0 | 1 | NA |
| 6 | Geometry: wkbMultiPolygon | TL | Timor-Leste | Asia | Asia | South-Eastern Asia | Sovereign country | 14714.9 | 1.21281e6 | 68.285 | 6262.91 | 14 | 2 |
| 7 | Geometry: wkbMultiPolygon | MX | Mexico | North America | Americas | Central America | Sovereign country | 1.96948e6 | 1.24222e8 | 76.753 | 16622.6 | 151 | 220 |
| 8 | Geometry: wkbMultiPolygon | BR | Brazil | South America | Americas | South America | Sovereign country | 8.50856e6 | 2.04213e8 | 75.042 | 15374.3 | 3277 | 2786 |
| 9 | Geometry: wkbMultiPolygon | BO | Bolivia | South America | Americas | South America | Sovereign country | 1.08527e6 | 1.05622e7 | 68.357 | 6324.83 | 3 | 4 |
| 10 | Geometry: wkbMultiPolygon | PE | Peru | South America | Americas | South America | Sovereign country | 1.3097e6 | 3.09734e7 | 74.518 | 11547.8 | 585 | 625 |
| 11 | Geometry: wkbMultiPolygon | CO | Colombia | South America | Americas | South America | Sovereign country | 1.15188e6 | 4.77919e7 | 74.022 | 12716.0 | 1330 | 1169 |
| 12 | Geometry: wkbMultiPolygon | PA | Panama | North America | Americas | Central America | Sovereign country | 75265.4 | 3.90399e6 | 77.61 | 20018.0 | 3 | 3 |
| 13 | Geometry: wkbMultiPolygon | CR | Costa Rica | North America | Americas | Central America | Sovereign country | 53832.1 | 4.75758e6 | 79.44 | 14372.4 | 28 | 32 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 166 | Geometry: wkbMultiPolygon | MA | Morocco | Africa | Africa | Northern Africa | Sovereign country | 591719.0 | 3.43181e7 | 75.309 | 7078.88 | missing | missing |
| 167 | Geometry: wkbMultiPolygon | EG | Egypt | Africa | Africa | Northern Africa | Sovereign country | 9.96312e5 | 9.18126e7 | 71.12 | 9879.8 | missing | missing |
| 168 | Geometry: wkbMultiPolygon | LY | Libya | Africa | Africa | Northern Africa | Sovereign country | 1.63372e6 | 6.20411e6 | 71.659 | 16371.9 | missing | missing |
| 169 | Geometry: wkbMultiPolygon | DJ | Djibouti | Africa | Africa | Eastern Africa | Sovereign country | 21880.3 | 912164.0 | 62.006 | missing | missing | missing |
| 170 | Geometry: wkbMultiPolygon | missing | Somaliland | Africa | Africa | Eastern Africa | Indeterminate | 1.6735e5 | missing | missing | missing | missing | missing |
| 171 | Geometry: wkbMultiPolygon | BA | Bosnia and Herzegovina | Europe | Europe | Southern Europe | Sovereign country | 50605.1 | 3.566e6 | 76.561 | 10516.8 | missing | missing |
| 172 | Geometry: wkbMultiPolygon | MK | Macedonia | Europe | Europe | Southern Europe | Sovereign country | 25062.3 | 2.0775e6 | 75.384 | 12298.5 | missing | missing |
| 173 | Geometry: wkbMultiPolygon | RS | Serbia | Europe | Europe | Southern Europe | Sovereign country | 76388.6 | 7.13058e6 | 75.3366 | 13112.9 | missing | missing |
| 174 | Geometry: wkbMultiPolygon | ME | Montenegro | Europe | Europe | Southern Europe | Sovereign country | 13443.7 | 621810.0 | 76.712 | 14796.6 | missing | missing |
| 175 | Geometry: wkbMultiPolygon | XK | Kosovo | Europe | Europe | Southern Europe | Sovereign country | 11230.3 | 1.8218e6 | 71.0976 | 8698.29 | missing | missing |
| 176 | Geometry: wkbMultiPolygon | TT | Trinidad and Tobago | North America | Americas | Caribbean | Sovereign country | 7737.81 | 1.35449e6 | 70.426 | 31181.8 | missing | missing |
| 177 | Geometry: wkbMultiPolygon | SS | South Sudan | Africa | Africa | Eastern Africa | Sovereign country | 6.24909e5 | 1.1531e7 | 55.817 | 1935.88 | missing | missing |
The result is a DataFrame with the same number of rows as world (177), but with two new columns for coffee production. Countries without coffee production data have missing values in these columns.
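We can count how many countries were matched by dropping the rows where the joined column is missing. Note that the coffee columns were read as strings, so entries such as "NA" are ordinary text rather than missing and are still counted:
coffee_countries = dropmissing(world_coffee, :coffee_production_2017)
println("Countries with 2017 coffee data: $(nrow(coffee_countries))")
Countries with 2017 coffee data: 45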
What if we only want to keep countries that have a match in the key variable? An inner join keeps only rows with matches in both datasets:
world_coffee_inner = innerjoin(world, coffee_data, on = :name_long)
println("Rows after inner join: $(nrow(world_coffee_inner))")
Rows after inner join: 45
Note that the inner join has fewer rows than coffee_data (47 rows). This is because some country names don’t match exactly. We can identify the mismatches:
coffee_names = Set(coffee_data.name_long)
world_names = Set(world.name_long)
setdiff(coffee_names, world_names)
Set{String31} with 2 elements:
String31("Others")
String31("Congo, Dem. Rep. of")
The “Congo, Dem. Rep. of” name doesn’t match the world dataset’s naming convention. We can fix this by updating the coffee data before joining:
# Find the correct name in world
drc_name = filter(n -> occursin("Dem", n) && occursin("Congo", n), world.name_long)
println("DRC name in world: $drc_name")
# Create a corrected copy
coffee_fixed = copy(coffee_data)
coffee_fixed.name_long = replace.(coffee_fixed.name_long, "Congo, Dem. Rep. of" => first(drc_name))
# Now the inner join captures one more country
world_coffee_fixed = innerjoin(world, coffee_fixed, on = :name_long)
println("Rows after fix: $(nrow(world_coffee_fixed))")
DRC name in world: ["Democratic Republic of the Congo"]
Rows after fix: 46
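Instead of comparing sets of names by hand, unmatched keys can also be found with antijoin, which returns the rows of the first table that have no match in the second. The sketch below uses two made-up tables (world_toy, coffee_toy and the production values are hypothetical) that mimic the mismatch seen above:

```julia
using DataFrames

# Hypothetical miniature versions of `world` and `coffee_data`
world_toy = DataFrame(
    name_long = ["Brazil", "Democratic Republic of the Congo", "Kenya"],
)
coffee_toy = DataFrame(
    name_long  = ["Brazil", "Congo, Dem. Rep. of", "Others"],
    production = [1, 2, 3],  # placeholder values
)

# Rows of coffee_toy whose key does not occur in world_toy
unmatched = antijoin(coffee_toy, world_toy, on = :name_long)
```

Because the result is a DataFrame, the remaining columns of the unmatched rows stay available for inspection, which a setdiff on the names alone does not provide.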
When the key columns have different names in each dataset, use the on argument with a Pair:
leftjoin(df1, df2, on = :name_in_df1 => :name_in_df2)
2.2.6 Creating attributes and removing spatial information
Often, we want to create new columns based on existing ones. For example, calculating population density for each country requires dividing population by area.
With ordinary DataFrames.jl column assignment, we can add a new column directly:
world2 = copy(world) # don't modify the original
world2.pop_density = world2.pop ./ world2.area_km2
select(world2, :name_long, :pop, :area_km2, :pop_density) |> first
| Row | name_long | pop | area_km2 | pop_density |
|---|---|---|---|---|
| | String | Float64? | Float64 | Float64? |
| 1 | Fiji | 885806.0 | 19290.0 | 45.9205 |
Note the broadcasting operator . in ./ — this is essential for element-wise division in Julia.
Using DataFramesMeta.jl’s @rtransform macro (row-wise transform):
world3 = @chain world begin
    @rtransform(:pop_density = :pop / :area_km2)
    @select(:name_long, :pop, :area_km2, :pop_density)
end
first(world3, 3)
| Row | name_long | pop | area_km2 | pop_density |
|---|---|---|---|---|
| | String | Float64? | Float64 | Float64? |
| 1 | Fiji | 885806.0 | 19290.0 | 45.9205 |
| 2 | Tanzania | 5.22349e7 | 9.32746e5 | 56.0012 |
| 3 | Western Sahara | missing | 96270.6 | missing |
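The same row-wise calculation can be written without macros using transform from DataFrames.jl, which returns a new DataFrame and leaves its input untouched. A minimal sketch on a made-up table (toy and its values are hypothetical):

```julia
using DataFrames

# Hypothetical toy table with one missing population
toy = DataFrame(
    name_long = ["A", "B"],
    pop       = [1000.0, missing],
    area_km2  = [10.0, 20.0],
)

# `[:pop, :area_km2] => ByRow(/)` calls `/` once per row with one value from each column
toy2 = transform(toy, [:pop, :area_km2] => ByRow(/) => :pop_density)
```

As in the output above, a missing population propagates to a missing density rather than causing an error.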
To combine existing columns into a new one, we can use string interpolation or concatenation:
world4 = @chain world begin
    @rtransform(:con_reg = :continent * ":" * :region_un)
    @select(:name_long, :continent, :region_un, :con_reg)
end
first(world4, 3)
| Row | name_long | continent | region_un | con_reg |
|---|---|---|---|---|
| | String | String | String | String |
| 1 | Fiji | Oceania | Oceania | Oceania:Oceania |
| 2 | Tanzania | Africa | Africa | Africa:Africa |
| 3 | Western Sahara | Africa | Africa | Africa:Africa |
The opposite operation — splitting one column into multiple — uses the split function:
# Split the combined column back
world5 = @chain world4 begin
    @rtransform(
        :continent_new = split(:con_reg, ":")[1],
        :region_new = split(:con_reg, ":")[2]
    )
    @select(:name_long, :con_reg, :continent_new, :region_new)
end
first(world5, 3)
| Row | name_long | con_reg | continent_new | region_new |
|---|---|---|---|---|
| | String | String | SubStrin… | SubStrin… |
| 1 | Fiji | Oceania:Oceania | Oceania | Oceania |
| 2 | Tanzania | Africa:Africa | Africa | Africa |
| 3 | Western Sahara | Africa:Africa | Africa | Africa |
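The SubStrin… column types in the output above arise because split returns SubString views into the original string rather than new String objects. The following Base Julia sketch shows the difference and how to convert; the value "a:b:c" is a made-up example used only to illustrate the limit keyword:

```julia
# A value taken from the con_reg column above
con_reg = "Oceania:Oceania"

parts = split(con_reg, ":")   # vector of SubString{String} views into con_reg
continent = String(parts[1])  # materialise an independent String

# Hypothetical value with a second separator: `limit = 2` keeps the remainder intact
first_part, rest = split("a:b:c", ":"; limit = 2)
```

Wrapping each piece in String(...) inside @rtransform would give plain String columns, which some downstream functions or file writers may expect.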
Renaming columns is done with the rename function:
world_renamed = rename(world, :name_long => :name, :pop => :population)
names(world_renamed)
11-element Vector{String}:
"geometry"
"iso_a2"
"name"
"continent"
"region_un"
"subregion"
"type"
"area_km2"
"population"
"lifeExp"
"gdpPercap"
To rename all columns at once, pass a vector of old => new pairs to rename! (built here by broadcasting => over the current names):
world_short = copy(world)
rename!(world_short, names(world_short) .=> [:geom, :iso, :name, :cont, :reg, :subreg, :type, :area, :pop, :life, :gdp])
names(world_short)
11-element Vector{String}:
"geom"
"iso"
"name"
"cont"
"reg"
"subreg"
"type"
"area"
"pop"
"life"
"gdp"
Each of these attribute operations preserves the geometry column. Sometimes, however, it makes sense to remove the geometry — for example, to speed up aggregation or to export just the attribute data.
To drop the geometry column, subset to exclude it:
geom_cols = GI.geometrycolumns(world)
world_df = select(world, Not(geom_cols...))
typeof(world_df)
DataFrame
The result is a regular DataFrame without spatial information.
2.3 Manipulating raster objects
In contrast to the vector data model underlying simple features (which represents points, lines and polygons as discrete entities in space), raster data represent continuous surfaces. This section shows how raster objects work by creating them from scratch, building on Section @ref(an-introduction-to-terra). Because of their unique structure, subsetting and other operations on raster datasets work in a different way, as demonstrated in Section @ref(raster-subsetting).
The following code recreates the raster dataset used in Section @ref(raster-classes), the result of which is illustrated in Figure @ref(fig:cont-raster). This demonstrates how the Raster() constructor works to create an example raster named elev (representing elevations).
vals = reshape(1:36, 6, 6)
elev = Raster(vals, (X(LinRange(-1.5, 1.5, 6)), Y(LinRange(-1.5, 1.5, 6))))
┌ 6×6 Raster{Int64, 2} ┐
├──────────────────────┴───────────────────────────────────────────────── dims ┐
  ↓ X Sampled{Float64} LinRange{Float64}(-1.5, 1.5, 6) ForwardOrdered Regular Points,
  → Y Sampled{Float64} LinRange{Float64}(-1.5, 1.5, 6) ForwardOrdered Regular Points
├────────────────────────────────────────────────────────────────────── raster ┤
  extent: Extent(X = (-1.5, 1.5), Y = (-1.5, 1.5))
└──────────────────────────────────────────────────────────────────────────────┘
  ↓ →  -1.5  -0.9  -0.3   0.3   0.9   1.5
 -1.5     1     7    13    19    25    31
 -0.9     2     8    14    20    26    32
 -0.3     3     9    15    21    27    33
  0.3     4    10    16    22    28    34
  0.9     5    11    17    23    29    35
  1.5     6    12    18    24    30    36
The result is a raster object with 6 rows and 6 columns, and spatial lookup vectors for the dimensions X (horizontal) and Y (vertical). The vals argument sets the values that each cell contains: numeric data ranging from 1 to 36 in this case.
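The relationship between a cell's index and its coordinate can be checked with Base Julia alone, without constructing a Raster. The sketch below reuses the same lookup ranges and values as above; the index pair (3, 4) is an arbitrary choice for illustration:

```julia
# Same lookups as used for the X and Y dimensions of elev
xs = LinRange(-1.5, 1.5, 6)
ys = LinRange(-1.5, 1.5, 6)
vals = reshape(1:36, 6, 6)   # column-major: vals[i, j] == i + 6 * (j - 1)

i, j = 3, 4                  # 3rd position along X, 4th position along Y
coord = (xs[i], ys[j])       # coordinate of that cell, approximately (-0.3, 0.3)
value = vals[i, j]           # value stored in that cell
step_x = step(xs)            # distance between neighbouring X coordinates
```

The cell at X index 3 and Y index 4 holds the value 21, which matches the entry in the row labelled -0.3 and the column labelled 0.3 of the printed raster.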
Raster objects can also contain categorical values, like strings or even values corresponding to categories. The following code creates the raster datasets shown in Figure @ref(fig:cont-raster):
# First, construct a categorical array
using CategoricalArrays
grain_order = ["clay", "silt", "sand"]
grain_char = rand(grain_order, 6, 6)
grain_fact = CategoricalArray(grain_char, levels = grain_order)
using Rasters
# Then, wrap the categorical array in a Raster object
grain = Raster(grain_fact, (X(LinRange(-1.5, 1.5, 6)), Y(LinRange(-1.5, 1.5, 6))))
┌ 6×6 Raster{CategoricalArrays.CategoricalValue{String, UInt32}, 2} ┐
├───────────────────────────────────────────────────────────────────┴──── dims ┐
  ↓ X Sampled{Float64} LinRange{Float64}(-1.5, 1.5, 6) ForwardOrdered Regular Points,
  → Y Sampled{Float64} LinRange{Float64}(-1.5, 1.5, 6) ForwardOrdered Regular Points
├────────────────────────────────────────────────────────────────────── raster ┤
  extent: Extent(X = (-1.5, 1.5), Y = (-1.5, 1.5))
└──────────────────────────────────────────────────────────────────────────────┘
  ↓ →  …  1.5
 -1.5     "clay"
 -0.9     "silt"
 -0.3     "silt"
  0.3     "silt"
  0.9  …  "sand"
  1.5     "sand"
This CategoricalArray is stored in two parts: a matrix of integer codes and a dictionary of levels that maps the integer codes to the string values. We can retrieve the levels of a CategoricalArray using the levels function, and relabel them using the recode function.
levels(grain)
3-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
"clay"
"silt"
"sand"
grain2 = recode(grain, "clay" => "very wet", "silt" => "moist", "sand" => "dry")
┌ 6×6 Raster{CategoricalArrays.CategoricalValue{String, UInt32}, 2} ┐
├───────────────────────────────────────────────────────────────────┴──── dims ┐
  ↓ X Sampled{Float64} LinRange{Float64}(-1.5, 1.5, 6) ForwardOrdered Regular Points,
  → Y Sampled{Float64} LinRange{Float64}(-1.5, 1.5, 6) ForwardOrdered Regular Points
├────────────────────────────────────────────────────────────────────── raster ┤
  extent: Extent(X = (-1.5, 1.5), Y = (-1.5, 1.5))
└──────────────────────────────────────────────────────────────────────────────┘
  ↓ →  …  1.5
 -1.5     "very wet"
 -0.9     "moist"
 -0.3     "moist"
  0.3     "moist"
  0.9  …  "dry"
  1.5     "dry"
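The idea of storing integer codes alongside a small set of levels can be imitated in Base Julia. The sketch below is conceptual only: it does not reproduce the internals of CategoricalArrays.jl, and the 2×2 matrix is a made-up example.

```julia
# Conceptual sketch only; this mimics the idea, not the CategoricalArrays.jl internals
grain_levels = ["clay", "silt", "sand"]
grain_char = ["sand" "clay"; "silt" "sand"]   # hypothetical 2×2 matrix of strings

# Encode: each string becomes the position of its level
codes = [findfirst(==(g), grain_levels) for g in grain_char]

# Decode: index the level pool with the matrix of codes
decoded = grain_levels[codes]

# Relabelling only touches the small level pool, not the matrix of codes
new_levels = replace(grain_levels, "clay" => "very wet", "silt" => "moist", "sand" => "dry")
recoded = new_levels[codes]
```

This is why relabelling categories is cheap even for large rasters: the per-cell codes stay as they are and only the short list of labels changes.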
Rasters.jl does not currently support color tables in rasters. This should come at some point, though. ArchGDAL, the backend, does support these.
2.3.1 Raster subsetting
Raster subsetting is done with the Julia getindex syntax (square brackets), in the same way as we used it to subset DataFrames. Raster selection is, however, far more powerful, since you can use selectors to select various spatial subsets of the raster, like Near, At, Between, and .. (interval).
The Near selector finds the nearest cell to the specified coordinate — this is often the most practical choice:
elev[X(Near(0)), Y(Near(0))]
0x10
The At selector returns the value at an exact coordinate. It requires the coordinate to match precisely, which can be tricky with floating-point values:
# Get exact coordinate values from the lookup
x_coords = lookup(elev, X)
y_coords = lookup(elev, Y)
println("X coordinates: ", collect(x_coords))
X coordinates: [-1.5, -1.0, -0.5, 0.0, 0.5, 1.0]
# Use the exact value from the lookup
elev[X(At(x_coords[2])), Y(At(y_coords[2]))]
0x08
The .. operator (from IntervalSets.jl, re-exported by DimensionalData.jl) selects a range of values:
elev[X(-1..0), Y(0..1)]
┌ 3×3 Raster{UInt8, 2} ┐
├──────────────────────┴──────────────────────────────────────── dims ┐
  ↓ X Projected{Float64} -1.0:0.5:0.0 ForwardOrdered Regular Points,
  → Y Projected{Float64} 1.0:-0.5:0.0 ReverseOrdered Regular Points
├─────────────────────────────────────────────────────────── metadata ┤
  Metadata{Rasters.GDALsource} of Dict{String, Any} with 1 entry:
  "filepath" => "output/elev.tif"
├───────────────────────────────────────────────────────────── raster ┤
  extent: Extent(X = (-1.0, 0.0), Y = (0.0, 1.0))
  crs: GEOGCS["WGS 84",DATUM["WGS_1984",SPHEROID["WGS 84",6378137,...
└─────────────────────────────────────────────────────────────────────┘
  ↓ →    1.0   0.5   0.0
 -1.0   0x02  0x08  0x0e
 -0.5   0x03  0x09  0x0f
  0.0   0x04  0x0a  0x10
This returns a smaller raster containing only the cells within the specified coordinate ranges.
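To build intuition for what the Near, At and .. selectors resolve to, the lookup logic can be imitated on a plain vector of coordinates. This is a conceptual sketch in Base Julia, not how Rasters.jl or DimensionalData.jl implement selectors; the query values 0.2 and -0.99 are arbitrary examples.

```julia
# Coordinates copied from the X lookup printed above
x_coords = [-1.5, -1.0, -0.5, 0.0, 0.5, 1.0]

# Near(0.2): index of the coordinate closest to the requested value
near_idx = argmin(abs.(x_coords .- 0.2))

# At(-1.0): index of an exactly matching coordinate (`nothing` if there is none)
at_idx  = findfirst(==(-1.0), x_coords)
at_miss = findfirst(==(-0.99), x_coords)

# -1 .. 0: indices of all coordinates inside the closed interval
interval_idx = findall(x -> -1 <= x <= 0, x_coords)
```

Near always finds a cell, whereas an exact lookup can come back empty, which is why Near is usually the more forgiving choice for coordinates typed by hand.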
You can also use integer indices directly, just like with arrays:
elev[1, 1] # top-left cell
elev[1:3, 1:3] # 3x3 subset from top-left
┌ 3×3 Raster{UInt8, 2} ┐
├──────────────────────┴───────────────────────────────────────── dims ┐
  ↓ X Projected{Float64} -1.5:0.5:-0.5 ForwardOrdered Regular Points,
  → Y Projected{Float64} 1.0:-0.5:0.0 ReverseOrdered Regular Points
├──────────────────────────────────────────────────────────── metadata ┤
  Metadata{Rasters.GDALsource} of Dict{String, Any} with 1 entry:
  "filepath" => "output/elev.tif"
├────────────────────────────────────────────────────────────── raster ┤
  extent: Extent(X = (-1.5, -0.5), Y = (0.0, 1.0))
  crs: GEOGCS["WGS 84",DATUM["WGS_1984",SPHEROID["WGS 84",6378137,2...
└──────────────────────────────────────────────────────────────────────┘
  ↓ →    1.0   0.5   0.0
 -1.5   0x01  0x07  0x0d
 -1.0   0x02  0x08  0x0e
 -0.5   0x03  0x09  0x0f
Cell values can be modified by combining subsetting with assignment. The following sets the top-left cell to 0:
elev_modified = copy(elev)
elev_modified[1, 1] = 0
elev_modified[1:3, 1] # check the first column
┌ 3-element Raster{UInt8, 1} ┐
├────────────────────────────┴─────────────────────────────────── dims ┐
  ↓ X Projected{Float64} -1.5:0.5:-0.5 ForwardOrdered Regular Points
├──────────────────────────────────────────────────────────── metadata ┤
  Metadata{Rasters.GDALsource} of Dict{String, Any} with 1 entry:
  "filepath" => "output/elev.tif"
├────────────────────────────────────────────────────────────── raster ┤
  extent: Extent(X = (-1.5, -0.5),)
  crs: GEOGCS["WGS 84",DATUM["WGS_1984",SPHEROID["WGS 84",6378137,2...
└──────────────────────────────────────────────────────────────────────┘
 -1.5  0x00
 -1.0  0x02
 -0.5  0x03
Multiple cells can be modified at once:
elev_modified[1, 1:3] .= 0 # set first row, columns 1-3 to 0
elev_modified[1:3, 1:3]
┌ 3×3 Raster{UInt8, 2} ┐
├──────────────────────┴───────────────────────────────────────── dims ┐
  ↓ X Projected{Float64} -1.5:0.5:-0.5 ForwardOrdered Regular Points,
  → Y Projected{Float64} 1.0:-0.5:0.0 ReverseOrdered Regular Points
├──────────────────────────────────────────────────────────── metadata ┤
  Metadata{Rasters.GDALsource} of Dict{String, Any} with 1 entry:
  "filepath" => "output/elev.tif"
├────────────────────────────────────────────────────────────── raster ┤
  extent: Extent(X = (-1.5, -0.5), Y = (0.0, 1.0))
  crs: GEOGCS["WGS 84",DATUM["WGS_1984",SPHEROID["WGS 84",6378137,2...
└──────────────────────────────────────────────────────────────────────┘
  ↓ →    1.0   0.5   0.0
 -1.5   0x00  0x00  0x00
 -1.0   0x02  0x08  0x0e
 -0.5   0x03  0x09  0x0f
Unlike NumPy arrays (Python), where dimensions are typically ordered as (rows, columns) or (y, x), Rasters.jl uses named dimensions. When you write elev[X(...), Y(...)], you explicitly specify which dimension you are selecting on, making the code more readable and less error-prone.
You can check a raster’s dimensions with dims(elev).
2.3.2 Summarizing raster objects
Julia’s standard library and ecosystem provide many functions for summarizing raster values. Since Rasters.jl rasters behave like arrays, standard functions work directly:
using Statistics
mean(elev)
18.5
std(elev)
10.535653752852738
minimum(elev), maximum(elev)
(0x01, 0x24)
For a quick summary, we can combine these:
println("Elevation statistics:")
println(" Min: $(minimum(elev))")
println(" Max: $(maximum(elev))")
println(" Mean: $(round(mean(elev), digits=2))")
println(" Std: $(round(std(elev), digits=2))")
Elevation statistics:
Min: 1
Max: 36
Mean: 18.5
Std: 10.54
When rasters contain missing values (missing in Julia), use the skipmissing function:
# Create a raster with missing values by first converting to a Union type
vals_missing = convert(Array{Union{Missing, Int}, 2}, collect(reshape(1:36, 6, 6)))
elev_with_missing = Raster(vals_missing, (X(LinRange(-1.5, 1.5, 6)), Y(LinRange(-1.5, 1.5, 6))))
elev_with_missing[1, 1] = missing
mean(skipmissing(elev_with_missing))
19.0
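The reason skipmissing is needed is that missing propagates through arithmetic, so a single missing cell turns the whole summary into missing. A small sketch on a plain vector (the values are made up for illustration):

```julia
using Statistics

# Hypothetical values with one missing entry
vals = [1, 2, missing, 4]

m_all  = mean(vals)               # missing propagates through the computation
m_skip = mean(skipmissing(vals))  # (1 + 2 + 4) / 3
n_miss = count(ismissing, vals)   # number of missing entries
```

Reporting the number of skipped cells alongside the statistic, as n_miss does here, helps readers judge how much of the raster the summary actually covers.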
For categorical rasters, we can calculate frequency tables. First, let’s reload the grain raster from file (as an integer-coded raster):
import ArchGDAL
grain_int = Raster("output/grain.tif")
┌ 6×6 Raster{UInt8, 2} ┐
├──────────────────────┴──────────────────────────────────────── dims ┐
  ↓ X Projected{Float64} -1.5:0.5:1.0 ForwardOrdered Regular Points,
  → Y Projected{Float64} 1.0:-0.5:-1.5 ReverseOrdered Regular Points
├─────────────────────────────────────────────────────────── metadata ┤
  Metadata{Rasters.GDALsource} of Dict{String, Any} with 1 entry:
  "filepath" => "output/grain.tif"
├───────────────────────────────────────────────────────────── raster ┤
  extent: Extent(X = (-1.5, 1.0), Y = (-1.5, 1.0))
  crs: GEOGCS["WGS 84",DATUM["WGS_1984",SPHEROID["WGS 84",6378137,...
└─────────────────────────────────────────────────────────────────────┘
  ↓ →    1.0   0.5   0.0  -0.5  -1.0  -1.5
 -1.5   0x01  0x00  0x00  0x00  0x01  0x02
 -1.0   0x00  0x02  0x02  0x00  0x01  0x01
  ⋮                                  ⋮
  0.5   0x02  0x02  0x00  0x01  0x01  0x00
  1.0   0x02  0x01  0x02  0x01  0x01  0x02
To get the frequency of each category:
using StatsBase
counts = countmap(vec(grain_int))
Dict{UInt8, Int64} with 3 entries:
0x00 => 10
0x02 => 13
0x01 => 13
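A Dict has no guaranteed ordering, as the 0, 2, 1 sequence in the output above suggests, so it is worth sorting the keys before tabulating or plotting. The sketch below builds a frequency table in Base Julia on a made-up vector of codes (the values are hypothetical, not read from grain.tif) and converts it to proportions:

```julia
# Hypothetical integer codes standing in for vec(grain_int)
codes = UInt8[0, 2, 1, 1, 2, 0, 2]

# Count occurrences of each code
freq = Dict{UInt8, Int}()
for c in codes
    freq[c] = get(freq, c, 0) + 1
end

# Sort by code so the order is reproducible, then convert to proportions
sorted_codes = sort(collect(keys(freq)))
props = [freq[c] / length(codes) for c in sorted_codes]
```

The same sorting step applies to the dictionary returned by countmap.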
Raster value statistics can be visualized in various ways. A histogram shows the distribution of values in a continuous raster:
using CairoMakie
fig = Figure(size=(600, 400))
ax = Axis(fig[1, 1], xlabel="Elevation", ylabel="Frequency", title="Elevation Distribution")
hist!(ax, vec(elev), bins=10)
fig
For categorical rasters, a bar plot is more appropriate:
grain_labels = ["clay", "silt", "sand"]
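# The integer codes stored in the file are 0, 1 and 2 (see the countmap output above);
# we assume they follow the level order clay, silt, sand
grain_counts = [count(==(i), vec(grain_int)) for i in 0:2]
fig = Figure(size=(600, 400))
ax = Axis(fig[1, 1],
    xlabel="Grain type",
    ylabel="Count",
    title="Grain Type Distribution",
    xticks=(1:3, grain_labels)
)
barplot!(ax, 1:3, grain_counts)
fig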
The summary statistics shown here are global operations: they summarize the entire raster into a single value or set of values. In contrast, local and focal operations (covered in the next chapter) compute a value for every cell, either from the corresponding cells in other layers or from the cell's neighbors.
2.4 Exercises
1. Create a new column in the world dataset called pop_millions that contains the population in millions (divide pop by 1,000,000). Which country has the highest population?
2. Using groupby and combine, calculate the mean life expectancy (lifeExp) by continent. Which continent has the highest average life expectancy?
3. Perform an inner join between world and coffee_data. How many rows are in the result? Why is this different from the number of rows in coffee_data?
4. Create a 10x10 raster with random values between 0 and 100. Calculate its mean, median, and standard deviation.
5. Subset the elev raster to only include cells where the elevation is greater than 20. How many cells remain?