Written by Aurélien Callens, PhD. Data Scientist at Sustaain
Geospatial Data Quality at Scale: A Framework from Detection to Decision
Executive Summary
Most geospatial data quality workflows answer one question: what is wrong? They rarely answer what to do about it, in which order, and why.
We built a framework that does, and applied it to approximately 330,000 geometries from multiple providers across 20 countries, covering several EUDR commodities.
The results are clear: 35.6% of geometries carry quality issues, but less than 1% require field intervention. The majority can be fixed through automation. More importantly, the dominant errors are systemic rather than individual, pointing to pipeline and integration failures that no amount of field retraining will solve.
Data quality, managed this way, becomes a strategic lever rather than a one-off cleaning task.
The Reality of Geospatial Data Quality
Each geometry was scanned for quality issues, scored along two dimensions: impact on analysis (severity) and likelihood of correction (fixability), and placed into a priority matrix that maps directly to a remediation strategy.
Our findings:
- Most geometries carry 0 to 2 flags (quality issues), with a rapidly decreasing tail, but a non-negligible share accumulates multiple compounding issues.
- Severity is skewed toward low to moderate values, meaning most problems do not fully block analysis but still introduce bias. Fixability, however, is heavily concentrated at high values: a large proportion of detected issues can be corrected through automated or semi-automated processes.
- The dominant error types are cross-feature and structural, not geometric or topological. This means that inconsistencies at the dataset level (duplicates, overlaps, encoding problems) are more prevalent than individual shape errors. The root cause is systemic data management, not isolated digitization mistakes. Improving data pipelines and integration processes will yield higher returns than focusing solely on field practices.
For example, near-duplicate polygons (a cross-feature issue) are geometries from the same supplier that overlap with an IoU above 80%, and fake multipolygons (a structural issue) are geometries typed as MULTIPOLYGON that are in reality a single POLYGON; a minimal detection sketch follows the table below.
- Mapped into a priority matrix, the flagged polygons (representing 35.6% of the total dataset) resolve into three actionable segments: a dominant quick wins group (easily fixable at scale), a small critical issues group (concentrating most of the risk), and a high-value automation group (where fixing yields the strongest analytical gains). This reduces a complex distribution of errors to a small number of operational decisions.
| Quadrant | Geometries | % of total | Strategy |
|---|---|---|---|
| Clean polygons | 213,458 | 64.43% | No action |
| Quick wins | 112,397 | 33.92% | Automation |
| Critical issues | 2,723 | 0.82% | Field recollection / manual investigation |
| Low priority / tolerable noise | 1,420 | 0.43% | Tolerate / monitor |
| High-value automation | 1,326 | 0.40% | Priority automation |
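To make the two flag definitions called out above concrete, here is a minimal sketch of how near duplicates and fake multipolygons could be detected with shapely. It assumes the 80% IoU threshold mentioned in the text and is illustrative, not the production implementation.

```python
# Minimal sketch of two flags: near-duplicate polygons (cross-feature) and
# fake multipolygons (structural). The 0.80 IoU threshold mirrors the text;
# everything else is illustrative.
from shapely.geometry import Polygon, MultiPolygon
from shapely.geometry.base import BaseGeometry


def iou(a: BaseGeometry, b: BaseGeometry) -> float:
    """Intersection over union of two geometries."""
    union_area = a.union(b).area
    return a.intersection(b).area / union_area if union_area > 0 else 0.0


def is_near_duplicate(a: BaseGeometry, b: BaseGeometry, threshold: float = 0.80) -> bool:
    """Two geometries from the same supplier overlapping with IoU above the threshold."""
    return iou(a, b) > threshold


def is_fake_multipolygon(geom: BaseGeometry) -> bool:
    """A MULTIPOLYGON that actually contains a single polygon part."""
    return isinstance(geom, MultiPolygon) and len(geom.geoms) == 1


p1 = Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])
p2 = Polygon([(0.05, 0), (1.05, 0), (1.05, 1), (0.05, 1)])  # slightly shifted copy
print(is_near_duplicate(p1, p2))                 # True (IoU ≈ 0.90)
print(is_fake_multipolygon(MultiPolygon([p1])))  # True
```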
Provider-Level Error Profiles and Where to Focus Effort
Aggregating signals by provider shows that data quality issues are not uniform and require tailored solutions rather than a single generic approach.
One provider’s dataset, for example, is dominated by cross-feature inconsistencies, indicating duplication or data integration issues, as well as structural anomalies suggesting problems in data encoding or export pipelines. Targeted interventions for this provider would include deduplication at ingestion, stricter data integration rules, and validation of export formats to enforce consistent geometry structures.
This kind of profiling provides direct guidance for improving data collection and processing: it identifies whether issues originate from field practices or data pipelines, targets training or tooling where error types concentrate, and enables monitoring of improvements over time using consistent metrics.
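As a small illustration of what such a profile looks like, the sketch below aggregates flags by provider with pandas; the column names and counts are hypothetical, not the actual schema or results.

```python
# Hypothetical sketch of provider-level profiling: one row per (geometry, flag),
# aggregated into the share of each error category per provider.
import pandas as pd

flags = pd.DataFrame({
    "provider": ["A", "A", "A", "B", "B"],
    "category": ["cross_feature", "cross_feature", "structural",
                 "geometric", "topological"],
})

counts = flags.groupby(["provider", "category"]).size().unstack(fill_value=0)
profile = counts.div(counts.sum(axis=1), axis=0)  # row-wise shares per provider
print(profile)
```

A provider whose profile is dominated by cross-feature and structural shares, as in the example above, points toward pipeline and integration fixes rather than field retraining.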
The Framework that Generated These Results
Error Categories
Geospatial data quality issues can be grouped into four categories, each capturing a distinct type of failure, from the internal validity of a single geometry to its consistency within a dataset:
| Category | Scope | Question answered | Typical failures |
|---|---|---|---|
| Topological | Internal consistency of a geometry | Is the geometry internally valid? | Self-intersections, unclosed rings |
| Geometric | Single geometry shape | Is the shape physically plausible? | Spikes, slivers, distortions |
| Structural | Data representation | Is the geometry correctly encoded? | Wrong types, fake multipolygons |
| Cross-feature | Between geometries of the same dataset | Are the geometries consistent together? | Overlaps, near duplicates, containment |
From Categories to Checks
We implemented a set of checks across the four categories to detect signals indicative of quality issues:
| Category | Examples of checks | Signal | What it captures |
|---|---|---|---|
| Topology | Is the shape valid? | Invalid geometry | Self-intersections, unclosed rings, invalid topology |
| Geometry | Is the area too small or too large? | Size anomalies | Polygons too small or too large |
| Geometry | Are there spikes? Is the shape elongated? | Shape distortion | Spikes, slivers, low compactness, concavity |
| Geometry | Is the boundary over-digitized or inconsistent? | Boundary noise | Excessive vertices, short segments, duplicate vertices |
| Geometry | Are segment lengths plausible? | Scale inconsistency | Implausible segment lengths |
| Structure | Is the geometry type consistent and usable? | Type inconsistency | GeometryCollection, mixed or invalid geometry types |
| Structure | Is the multipart structure plausible? | Multipart anomaly | Fake or excessive multipolygons |
When a check fails, a flag is assigned to the geometry.
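As an illustration, a minimal version of a few of these checks could look like the sketch below, using shapely. The thresholds are placeholders, not the values used in the analysis.

```python
# Illustrative sketch of how individual checks translate into flags.
# Thresholds (min/max area, vertex budget) are placeholders.
from shapely.geometry import Polygon
from shapely.geometry.base import BaseGeometry
from shapely.validation import explain_validity


def run_checks(geom: BaseGeometry,
               min_area: float = 1e-9,
               max_area: float = 1e4,
               max_vertices: int = 10_000) -> list[str]:
    """Return the list of flags raised by a geometry."""
    flags = []
    # Topology: self-intersections, unclosed rings, invalid topology
    if not geom.is_valid:
        flags.append(f"invalid_geometry: {explain_validity(geom)}")
    # Geometry: size anomalies
    if not (min_area <= geom.area <= max_area):
        flags.append("size_anomaly")
    # Geometry: boundary noise (over-digitized boundaries)
    if geom.geom_type == "Polygon" and len(geom.exterior.coords) > max_vertices:
        flags.append("boundary_noise")
    # Structure: unusable geometry types
    if geom.geom_type == "GeometryCollection":
        flags.append("type_inconsistency")
    return flags


# A bow-tie polygon self-intersects and its net area collapses to zero,
# so it raises both the topology flag and a size anomaly here.
bowtie = Polygon([(0, 0), (1, 1), (1, 0), (0, 1)])
print(run_checks(bowtie))
```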
A Bivariate Scoring System
Detecting errors is insufficient. The objective is to decide what to fix, how, and in which order. Each detected signal is therefore translated into two complementary dimensions.
Severity quantifies how much a given issue affects downstream analysis, ranging from 0 (no impact) to 5 (analysis not possible). Topological errors are blocking, as invalid geometries cannot be used in most operations. Geometric and structural errors introduce varying levels of bias, distorting metrics or affecting interpretation. Severity is computed hierarchically: invalid topology immediately sets the maximum score; other errors are aggregated by category, and only the most severe issue within each category is retained. This avoids over-penalizing geometries with multiple correlated issues.
Fixability measures how likely it is to correct a geometry while preserving its meaning, ranging from 0 (not fixable) to 5 (fully fixable). Some errors are purely technical and can be fixed deterministically. Others require interpretation and may introduce uncertainty. Some require recollection because no reliable correction can be applied. The aggregation follows a bottleneck logic: the least fixable issue dominates, because one blocking problem is enough to invalidate automated correction.
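A simplified sketch of these two aggregation rules is shown below. The per-flag scores are illustrative, and the way per-category severities are combined across categories (here a capped sum) is one plausible reading of the hierarchy, not the exact production logic.

```python
# Sketch of the bivariate scoring: hierarchical severity, bottleneck fixability.
from dataclasses import dataclass


@dataclass
class Flag:
    category: str    # "topological", "geometric", "structural" or "cross_feature"
    severity: int    # 0 (no impact) .. 5 (analysis not possible)
    fixability: int  # 0 (not fixable) .. 5 (fully fixable)


def severity_score(flags: list[Flag]) -> int:
    """Invalid topology is blocking; otherwise only the worst issue in each
    category counts (combined here as a sum capped at 5)."""
    if any(f.category == "topological" for f in flags):
        return 5
    worst_per_category: dict[str, int] = {}
    for f in flags:
        worst_per_category[f.category] = max(worst_per_category.get(f.category, 0),
                                             f.severity)
    return min(5, sum(worst_per_category.values()))


def fixability_score(flags: list[Flag]) -> int:
    """Bottleneck logic: the least fixable issue dominates."""
    return min((f.fixability for f in flags), default=5)


flags = [Flag("geometric", 1, 5), Flag("geometric", 2, 4), Flag("structural", 2, 3)]
print(severity_score(flags), fixability_score(flags))  # 4 3
```

Note how the two geometric flags count only once in the severity score, while the least fixable flag alone sets the fixability score.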
The Priority Matrix
Severity and fixability define a two-dimensional decision space. Each geometry is positioned in a matrix that maps directly to a remediation strategy:
| Quadrant | Interpretation | Strategy |
|---|---|---|
| Quick wins | Non-critical and fixable | Automation |
| Critical issues | Critical and hard to fix | Field recollection / manual investigation |
| Low priority / tolerable noise | Non-critical and hard to fix | Tolerate / monitor |
| High-value automation | Critical and fixable | Priority automation |
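In code, the mapping can be as simple as the sketch below; the cut-off values separating "critical" from "non-critical" and "fixable" from "hard to fix" are placeholders, and the thresholds actually used depend on the dataset and the use case.

```python
# Sketch of the quadrant assignment from the two scores.
def quadrant(severity: int, fixability: int,
             severity_cut: int = 4, fixability_cut: int = 3) -> str:
    if severity == 0:
        return "clean"
    critical = severity >= severity_cut
    fixable = fixability >= fixability_cut
    if critical and not fixable:
        return "critical_issue"         # field recollection / manual investigation
    if critical and fixable:
        return "high_value_automation"  # priority automation
    if fixable:
        return "quick_win"              # automation
    return "tolerable_noise"            # tolerate / monitor


print(quadrant(severity=2, fixability=5))  # quick_win
print(quadrant(severity=5, fixability=1))  # critical_issue
```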
From One-Off Diagnosis to Continuous Quality Monitoring
Most data quality workflows stop at detection: issues are identified after collection, but not prevented or systematically prioritized. This framework addresses that gap by integrating decision-making directly into the data pipeline.
The ultimate goal of this framework is not to diagnose problems long after collection, but to control data quality as it is produced. In practice, this means embedding quality checks into ingestion pipelines, evaluating geometries as they are collected, and surfacing critical issues immediately while remediation is still feasible.
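As a rough sketch of what such an ingestion hook could look like, the function below reuses the scoring and quadrant sketches from the previous sections; `notify_field_team` and `auto_fix` are hypothetical hooks standing in for real remediation steps, not an existing API.

```python
# Rough sketch of applying the framework as data arrives: every incoming
# geometry is scored and routed before it is stored. Flag, severity_score,
# fixability_score and quadrant come from the sketches above;
# notify_field_team and auto_fix are hypothetical.
def ingest(geometry, provider: str, flags: list[Flag]):
    sev, fix = severity_score(flags), fixability_score(flags)
    segment = quadrant(sev, fix)

    if segment == "critical_issue":
        # Surfaced immediately so field teams can remap before moving on
        notify_field_team(provider, geometry, flags)
    elif segment in ("quick_win", "high_value_automation"):
        # Deterministic, automated corrections applied at ingestion time
        geometry = auto_fix(geometry, flags)

    return geometry, segment
```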
This shift enables operational feedback loops. Field teams receive timely signals on critical errors, allowing remapping before they move to new regions. Data quality becomes a constraint of collection rather than a downstream concern.
At scale, this transforms the system: heterogeneous practices converge toward standardized processes, reactive cleaning is replaced by proactive control, and unstructured errors become measurable performance indicators. The transition relies on a simple structure: a limited set of error categories, interpretable signals derived from checks, and a scoring system that prioritizes actions.
The insights and thresholds presented here are derived from the dataset analyzed and therefore reflect its specific characteristics. As new datasets and error patterns emerge, both the checks and the scoring logic will be refined. The objective is not to define static rules, but to build a system that continuously adapts and improves as new data is collected.