Written by Aurélien Callens, PhD. Data Scientist at Sustaain
Geospatial Data Quality at Scale: A Framework from Detection to Decision
Executive Summary
Most geospatial data quality workflows answer one question: what is wrong? They rarely answer what to do about it, in which order, and why.
We built a framework that does, and applied it to approximately 330,000 geometries from multiple providers across 20 countries, covering several EUDR commodities.
The results are clear: 35.6% of geometries carry quality issues, but less than 1% require field intervention. The majority can be fixed through automation. More importantly, the dominant errors are systemic rather than individual, pointing to pipeline and integration failures that no amount of field retraining will solve.
Data quality, managed this way, becomes a strategic lever rather than a one-off cleaning task.
The Reality of Geospatial Data Quality
Each geometry was scanned for quality issues, scored along two dimensions: impact on analysis (severity) and likelihood of correction (fixability), and placed into a priority matrix that maps directly to a remediation strategy.
Our findings:
- Most geometries carry 0 to 2 flags (quality issues), with a rapidly decreasing tail, but a non-negligible share accumulates multiple compounding issues.
- Severity is skewed toward low to moderate values, meaning most problems do not fully block analysis but still introduce bias. Fixability, however, is heavily concentrated at high values: a large proportion of detected issues can be corrected through automated or semi-automated processes.
- The dominant error types are cross-feature and structural, not geometric or topological. This means that inconsistencies at the dataset level (duplicates, overlaps, encoding problems) are more prevalent than individual shape errors. The root cause is systemic data management, not isolated digitization mistakes. Improving data pipelines and integration processes will yield higher returns than focusing solely on field practices.
For example, near-duplicate polygons (a cross-feature issue) are geometries from the same supplier that overlap with an IoU above 80%, and fake multipolygons (a structural issue) are geometries typed as MULTIPOLYGON that are in reality a single POLYGON; a minimal detection sketch follows the table below.
- Mapped into a priority matrix, the flagged polygons (representing 35.6% of the total dataset) resolve into three actionable segments: a dominant quick wins group (easily fixable at scale), a small critical issues group (concentrating most of the risk), and a high-value automation group (where fixing yields the strongest analytical gains). This reduces a complex distribution of errors to a small number of operational decisions.
| Quadrant | Geometries | % of total | Strategy |
|---|---|---|---|
| Clean polygons | 213,458 | 64.43% | No action |
| Quick wins | 112,397 | 33.92% | Automation |
| Critical issues | 2,723 | 0.82% | Field recollection / manual investigation |
| Low priority / tolerable noise | 1,420 | 0.43% | Tolerate / monitor |
| High-value automation | 1,326 | 0.40% | Priority automation |
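To make the two flag definitions called out above concrete, here is a minimal sketch of how near duplicates and fake multipolygons could be detected with shapely. It assumes the 80% IoU threshold mentioned in the text and is illustrative, not the production implementation.

```python
# Minimal sketch of two flags: near-duplicate polygons (cross-feature) and
# fake multipolygons (structural). The 0.80 IoU threshold mirrors the text;
# everything else is illustrative.
from shapely.geometry import Polygon, MultiPolygon
from shapely.geometry.base import BaseGeometry


def iou(a: BaseGeometry, b: BaseGeometry) -> float:
    """Intersection over union of two geometries."""
    union_area = a.union(b).area
    return a.intersection(b).area / union_area if union_area > 0 else 0.0


def is_near_duplicate(a: BaseGeometry, b: BaseGeometry, threshold: float = 0.80) -> bool:
    """Two geometries from the same supplier overlapping with IoU above the threshold."""
    return iou(a, b) > threshold


def is_fake_multipolygon(geom: BaseGeometry) -> bool:
    """A MULTIPOLYGON that actually contains a single polygon part."""
    return isinstance(geom, MultiPolygon) and len(geom.geoms) == 1


p1 = Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])
p2 = Polygon([(0.05, 0), (1.05, 0), (1.05, 1), (0.05, 1)])  # slightly shifted copy
print(is_near_duplicate(p1, p2))                 # True (IoU ≈ 0.90)
print(is_fake_multipolygon(MultiPolygon([p1])))  # True
```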
Provider-Level Error Profiles and Where to Focus Effort
Aggregating signals by provider shows that data quality issues are not uniform and require tailored solutions rather than a single generic approach.
One provider’s dataset, for example, is dominated by cross-feature inconsistencies, indicating duplication or data integration issues, as well as structural anomalies suggesting problems in data encoding or export pipelines. Targeted interventions for this provider would include deduplication at ingestion, stricter data integration rules, and validation of export formats to enforce consistent geometry structures.
This kind of profiling provides direct guidance for improving data collection and processing: it identifies whether issues originate from field practices or data pipelines, targets training or tooling where error types concentrate, and enables monitoring of improvements over time using consistent metrics.
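As a small illustration of what such a profile looks like, the sketch below aggregates flags by provider with pandas; the column names and counts are hypothetical, not the actual schema or results.

```python
# Hypothetical sketch of provider-level profiling: one row per (geometry, flag),
# aggregated into the share of each error category per provider.
import pandas as pd

flags = pd.DataFrame({
    "provider": ["A", "A", "A", "B", "B"],
    "category": ["cross_feature", "cross_feature", "structural",
                 "geometric", "topological"],
})

counts = flags.groupby(["provider", "category"]).size().unstack(fill_value=0)
profile = counts.div(counts.sum(axis=1), axis=0)  # row-wise shares per provider
print(profile)
```

A provider whose profile is dominated by cross-feature and structural shares, as in the example above, points toward pipeline and integration fixes rather than field retraining.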
The Framework that Generated These Results
Error Categories
Geospatial data quality issues can be grouped into four categories, each capturing a distinct type of failure, from the internal validity of a single geometry to its consistency within a dataset:
| Category | Scope | Question answered | Typical failures |
|---|---|---|---|
| Topological | Internal consistency of a geometry | Is the geometry internally valid? | Self-intersections, unclosed rings |
| Geometric | Single geometry shape | Is the shape physically plausible? | Spikes, slivers, distortions |
| Structural | Data representation | Is the geometry correctly encoded? | Wrong types, fake multipolygons |
| Cross-feature | Between geometries of the same dataset | Are the geometries consistent together? | Overlaps, near duplicates, containment |
From Categories to Checks
We implemented a set of checks across the four categories to detect signals indicative of quality issues:
| Category | Examples of checks | Signal | What it captures |
|---|---|---|---|
| Topology | Is the shape valid? | Invalid geometry | Self-intersections, unclosed rings, invalid topology |
| Geometry | Is the area too small or too large? | Size anomalies | Polygons too small or too large |
| Geometry | Are there spikes? Is the shape elongated? | Shape distortion | Spikes, slivers, low compactness, concavity |
| Geometry | Is the boundary over-digitized or inconsistent? | Boundary noise | Excessive vertices, short segments, duplicate vertices |
| Geometry | Are segment lengths plausible? | Scale inconsistency | Implausible segment lengths |
| Structure | Is the geometry type consistent and usable? | Type inconsistency | GeometryCollection, mixed or invalid geometry types |
| Structure | Is the multipart structure plausible? | Multipart anomaly | Fake or excessive multipolygons |
When a check fails, a flag is assigned to the geometry.
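As an illustration, a minimal version of a few of these checks could look like the sketch below, using shapely. The thresholds are placeholders, not the values used in the analysis.

```python
# Illustrative sketch of how individual checks translate into flags.
# Thresholds (min/max area, vertex budget) are placeholders.
from shapely.geometry import Polygon
from shapely.geometry.base import BaseGeometry
from shapely.validation import explain_validity


def run_checks(geom: BaseGeometry,
               min_area: float = 1e-9,
               max_area: float = 1e4,
               max_vertices: int = 10_000) -> list[str]:
    """Return the list of flags raised by a geometry."""
    flags = []
    # Topology: self-intersections, unclosed rings, invalid topology
    if not geom.is_valid:
        flags.append(f"invalid_geometry: {explain_validity(geom)}")
    # Geometry: size anomalies
    if not (min_area <= geom.area <= max_area):
        flags.append("size_anomaly")
    # Geometry: boundary noise (over-digitized boundaries)
    if geom.geom_type == "Polygon" and len(geom.exterior.coords) > max_vertices:
        flags.append("boundary_noise")
    # Structure: unusable geometry types
    if geom.geom_type == "GeometryCollection":
        flags.append("type_inconsistency")
    return flags


# A bow-tie polygon self-intersects and its net area collapses to zero,
# so it raises both the topology flag and a size anomaly here.
bowtie = Polygon([(0, 0), (1, 1), (1, 0), (0, 1)])
print(run_checks(bowtie))
```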
A Bivariate Scoring System
Detecting errors is insufficient. The objective is to decide what to fix, how, and in which order. Each detected signal is therefore translated into two complementary dimensions.
Severity quantifies how much a given issue affects downstream analysis, ranging from 0 (no impact) to 5 (analysis not possible). Topological errors are blocking, as invalid geometries cannot be used in most operations. Geometric and structural errors introduce varying levels of bias, distorting metrics or affecting interpretation. Severity is computed hierarchically: invalid topology immediately sets the maximum score; other errors are aggregated by category, and only the most severe issue within each category is retained. This avoids over-penalizing geometries with multiple correlated issues.
Fixability measures how likely it is to correct a geometry while preserving its meaning, ranging from 0 (not fixable) to 5 (fully fixable). Some errors are purely technical and can be fixed deterministically. Others require interpretation and may introduce uncertainty. Some require recollection because no reliable correction can be applied. The aggregation follows a bottleneck logic: the least fixable issue dominates, because one blocking problem is enough to invalidate automated correction.
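A simplified sketch of these two aggregation rules is shown below. The per-flag scores are illustrative, and the way per-category severities are combined across categories (here a capped sum) is one plausible reading of the hierarchy, not the exact production logic.

```python
# Sketch of the bivariate scoring: hierarchical severity, bottleneck fixability.
from dataclasses import dataclass


@dataclass
class Flag:
    category: str    # "topological", "geometric", "structural" or "cross_feature"
    severity: int    # 0 (no impact) .. 5 (analysis not possible)
    fixability: int  # 0 (not fixable) .. 5 (fully fixable)


def severity_score(flags: list[Flag]) -> int:
    """Invalid topology is blocking; otherwise only the worst issue in each
    category counts (combined here as a sum capped at 5)."""
    if any(f.category == "topological" for f in flags):
        return 5
    worst_per_category: dict[str, int] = {}
    for f in flags:
        worst_per_category[f.category] = max(worst_per_category.get(f.category, 0),
                                             f.severity)
    return min(5, sum(worst_per_category.values()))


def fixability_score(flags: list[Flag]) -> int:
    """Bottleneck logic: the least fixable issue dominates."""
    return min((f.fixability for f in flags), default=5)


flags = [Flag("geometric", 1, 5), Flag("geometric", 2, 4), Flag("structural", 2, 3)]
print(severity_score(flags), fixability_score(flags))  # 4 3
```

Note how the two geometric flags count only once in the severity score, while the least fixable flag alone sets the fixability score.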
The Priority Matrix
Severity and fixability define a two-dimensional decision space. Each geometry is positioned in a matrix that maps directly to a remediation strategy:
| Quadrant | Interpretation | Strategy |
|---|---|---|
| Quick wins | Non-critical and fixable | Automation |
| Critical issues | Critical and hard to fix | Field recollection / manual investigation |
| Low priority / tolerable noise | Non-critical and hard to fix | Tolerate / monitor |
| High-value automation | Critical and fixable | Priority automation |
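In code, the mapping can be as simple as the sketch below; the cut-off values separating "critical" from "non-critical" and "fixable" from "hard to fix" are placeholders, and the thresholds actually used depend on the dataset and the use case.

```python
# Sketch of the quadrant assignment from the two scores.
def quadrant(severity: int, fixability: int,
             severity_cut: int = 4, fixability_cut: int = 3) -> str:
    if severity == 0:
        return "clean"
    critical = severity >= severity_cut
    fixable = fixability >= fixability_cut
    if critical and not fixable:
        return "critical_issue"         # field recollection / manual investigation
    if critical and fixable:
        return "high_value_automation"  # priority automation
    if fixable:
        return "quick_win"              # automation
    return "tolerable_noise"            # tolerate / monitor


print(quadrant(severity=2, fixability=5))  # quick_win
print(quadrant(severity=5, fixability=1))  # critical_issue
```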
From One-Off Diagnosis to Continuous Quality Monitoring
Most data quality workflows stop at detection: issues are identified after collection, but not prevented or systematically prioritized. This framework addresses that gap by integrating decision-making directly into the data pipeline.
The ultimate goal of this framework is not to diagnose problems long after collection, but to control data quality as it is produced. In practice, this means embedding quality checks into ingestion pipelines, evaluating geometries as they are collected, and surfacing critical issues immediately while remediation is still feasible.
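As a rough sketch of what such an ingestion hook could look like, the function below reuses the scoring and quadrant sketches from the previous sections; `notify_field_team` and `auto_fix` are hypothetical hooks standing in for real remediation steps, not an existing API.

```python
# Rough sketch of applying the framework as data arrives: every incoming
# geometry is scored and routed before it is stored. Flag, severity_score,
# fixability_score and quadrant come from the sketches above;
# notify_field_team and auto_fix are hypothetical.
def ingest(geometry, provider: str, flags: list[Flag]):
    sev, fix = severity_score(flags), fixability_score(flags)
    segment = quadrant(sev, fix)

    if segment == "critical_issue":
        # Surfaced immediately so field teams can remap before moving on
        notify_field_team(provider, geometry, flags)
    elif segment in ("quick_win", "high_value_automation"):
        # Deterministic, automated corrections applied at ingestion time
        geometry = auto_fix(geometry, flags)

    return geometry, segment
```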
This shift enables operational feedback loops. Field teams receive timely signals on critical errors, allowing remapping before they move to new regions. Data quality becomes a constraint of collection rather than a downstream concern.
At scale, this transforms the system: heterogeneous practices converge toward standardized processes, reactive cleaning is replaced by proactive control, and unstructured errors become measurable performance indicators. The transition relies on a simple structure: a limited set of error categories, interpretable signals derived from checks, and a scoring system that prioritizes actions.
The insights and thresholds presented here are derived from the dataset analyzed and therefore reflect its specific characteristics. As new datasets and error patterns emerge, both the checks and the scoring logic will be refined. The objective is not to define static rules, but to build a system that continuously adapts and improves as new data is collected.