Geospatial Data Quality at Scale: A Framework from Detection to Decision

Written by Aurélien Callens, PhD, Data Scientist at Sustaain


 

Executive Summary

Most geospatial data quality workflows answer one question: what is wrong? They rarely answer what to do about it, in which order, and why.

We built a framework that does all three, and applied it to approximately 330,000 geometries from multiple providers across 20 countries, covering several EUDR commodities.

The results are clear: 35.6% of geometries carry quality issues, but less than 1% require field intervention. The majority can be fixed through automation. More importantly, the dominant errors are systemic rather than individual, pointing to pipeline and integration failures that no amount of field retraining will solve.

Data quality, managed this way, becomes a strategic lever rather than a one-off cleaning task.

 

The Reality of Geospatial Data Quality

Each geometry was scanned for quality issues, scored along two dimensions (impact on analysis, or severity, and likelihood of correction, or fixability), and placed into a priority matrix that maps directly to a remediation strategy.

 

Our findings:

  • Most geometries carry 0 to 2 flags (quality issues), with a rapidly decreasing tail, but a non-negligible share accumulates multiple compounding issues.

[Figure: distribution of the number of flags per geometry]

 

  • Severity is skewed toward low to moderate values, meaning most problems do not fully block analysis but still introduce bias. Fixability, however, is heavily concentrated at high values: a large proportion of detected issues can be corrected through automated or semi-automated processes.

[Figure: severity and fixability score distributions]

  • The dominant error types are cross-feature and structural, not geometric or topological. This means that inconsistencies at the dataset level (duplicates, overlaps, encoding problems) are more prevalent than individual shape errors. The root cause is systemic data management, not isolated digitization mistakes. Improving data pipelines and integration processes will yield higher returns than focusing solely on field practices.

[Figure: distribution of flag types]

Near-duplicate polygons are geometries from the same supplier that overlap with >80% IoU; fake multipolygons are geometries typed as MULTIPOLYGON that are in reality a single POLYGON.
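As a concrete illustration, the sketch below shows how these two checks could be implemented in Python with shapely. The 0.8 threshold mirrors the >80% IoU definition above; the function names are ours, not the framework's API.

```python
from shapely.geometry import MultiPolygon, Polygon

def iou(a, b):
    """Intersection over union of two geometries."""
    union_area = a.union(b).area
    return a.intersection(b).area / union_area if union_area > 0 else 0.0

def is_near_duplicate(a, b, threshold=0.8):
    """Flag same-supplier pairs whose overlap exceeds the IoU threshold."""
    return iou(a, b) > threshold

def is_fake_multipolygon(geom):
    """Flag a MULTIPOLYGON that wraps a single polygon part."""
    return isinstance(geom, MultiPolygon) and len(geom.geoms) == 1

square = Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])
print(is_fake_multipolygon(MultiPolygon([square])))  # True
print(is_near_duplicate(square, square))             # True (IoU = 1.0)
```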

 

  • Mapped into a priority matrix, the flagged polygons (representing 35.6% of the total dataset) resolve into three actionable segments: a dominant quick wins group (easily fixable at scale), a small critical issues group (concentrating most of the risk), and a high-value automation group (where fixing yields the strongest analytical gains). This reduces a complex distribution of errors to a small number of operational decisions.

 

Quadrant                        Geometries  % of total  Strategy
Clean polygons                  213,458     64.43%      No action
Quick wins                      112,397     33.92%      Automation
Critical issues                 2,723       0.82%       Field recollection / manual investigation
Low priority / tolerable noise  1,420       0.43%       Tolerate / monitor
High-value automation           1,326       0.40%       Priority automation

 

[Figure: priority matrix of flagged polygons]

Provider-Level Error Profiles and Where to Focus Effort

Aggregating signals by provider shows that data quality issues are not uniform and require tailored solutions rather than a single generic approach.

[Figure: error profile for Client 4]

 

One provider’s dataset, for example, is dominated by cross-feature inconsistencies, indicating duplication or data integration issues, as well as structural anomalies suggesting problems in data encoding or export pipelines. Targeted interventions for this provider would include deduplication at ingestion, stricter data integration rules, and validation of export formats to enforce consistent geometry structures.

This kind of profiling provides direct guidance for improving data collection and processing: it identifies whether issues originate from field practices or data pipelines, targets training or tooling where error types concentrate, and enables monitoring of improvements over time using consistent metrics.
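Building such a profile is mostly an aggregation exercise. Here is a minimal sketch with pandas, assuming each detected flag is stored as one row with its provider and error category; the column names and toy data are our own:

```python
import pandas as pd

# One row per detected flag (toy data; real input would come from the checks).
flags = pd.DataFrame({
    "provider": ["A", "A", "A", "B", "B"],
    "category": ["cross-feature", "cross-feature", "structural",
                 "geometric", "topological"],
})

# Count flags per provider and category: rows are providers,
# columns are error categories.
profile = flags.groupby(["provider", "category"]).size().unstack(fill_value=0)
print(profile)
```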

 

The Framework that Generated These Results

 

Error Categories

Geospatial data quality issues can be grouped into four categories, each capturing a distinct type of failure, from the internal validity of a single geometry to its consistency within a dataset:

Category       Scope                                    Question answered                          Typical failures
Topological    Internal consistency of a geometry       Is the geometry internally valid?          Self-intersections, unclosed rings
Geometric      Single geometry shape                    Is the shape physically plausible?         Spikes, slivers, distortions
Structural     Data representation                      Is the geometry correctly encoded?         Wrong types, fake multipolygons
Cross-feature  Between geometries of the same dataset   Are the geometries consistent together?    Overlaps, near duplicates, containment

 

 

From Categories to Checks

We implemented a set of checks across the four categories to detect signals indicative of quality issues:

Category    Examples of checks                                Signal               What it captures
Topology    Is the shape valid?                               Invalid geometry     Self-intersections, unclosed rings, invalid topology
Geometry    Is the area too small or too large?               Size anomalies       Polygons too small or too large
Geometry    Are there spikes? Is the shape elongated?         Shape distortion     Spikes, slivers, low compactness, concavity
Geometry    Is the boundary over-digitized or inconsistent?   Boundary noise       Excessive vertices, short segments, duplicate vertices
Geometry    Are segment lengths plausible?                    Scale inconsistency  Implausible segment lengths
Structure   Is the geometry type consistent and usable?       Type inconsistency   GeometryCollection, mixed or invalid geometry types
Structure   Is the multipart structure plausible?             Multipart anomaly    Fake or excessive multipolygons

When a check fails, a flag is assigned to the geometry.
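To make the mechanics concrete, here is a simplified sketch of how a few of these checks could attach flags to a geometry. The signal names follow the table above; the Flag structure and the area thresholds are illustrative assumptions.

```python
from dataclasses import dataclass
from shapely.geometry import Polygon

@dataclass
class Flag:
    signal: str    # e.g. "invalid_geometry", "size_anomaly"
    category: str  # "topology", "geometry", or "structure"

def run_checks(geom, min_area=1e-8, max_area=1.0):
    """Run a small subset of the checks and return the resulting flags."""
    flags = []
    if not geom.is_valid:                        # topology: self-intersections, unclosed rings
        flags.append(Flag("invalid_geometry", "topology"))
    if not (min_area <= geom.area <= max_area):  # geometry: implausible size
        flags.append(Flag("size_anomaly", "geometry"))
    if geom.geom_type == "GeometryCollection":   # structure: unusable type
        flags.append(Flag("type_inconsistency", "structure"))
    return flags

bowtie = Polygon([(0, 0), (1, 1), (1, 0), (0, 1)])  # self-intersecting "bowtie"
print(run_checks(bowtie))  # invalid_geometry, plus size_anomaly (net area is 0)
```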

 

A Bivariate Scoring System

Detecting errors is insufficient. The objective is to decide what to fix, how, and in which order. Each detected signal is therefore translated into two complementary dimensions.

Severity quantifies how much a given issue affects downstream analysis, ranging from 0 (no impact) to 5 (analysis not possible). Topological errors are blocking, as invalid geometries cannot be used in most operations. Geometric and structural errors introduce varying levels of bias, distorting metrics or affecting interpretation. Severity is computed hierarchically: invalid topology immediately sets the maximum score; other errors are aggregated by category, and only the most severe issue within each category is retained. This avoids over-penalizing geometries with multiple correlated issues.

Fixability measures how likely it is to correct a geometry while preserving its meaning, ranging from 0 (not fixable) to 5 (fully fixable). Some errors are purely technical and can be fixed deterministically. Others require interpretation and may introduce uncertainty. Some require recollection because no reliable correction can be applied. The aggregation follows a bottleneck logic: the least fixable issue dominates, because one blocking problem is enough to invalidate automated correction.
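The two aggregation rules can be summarized in a few lines. In this sketch (reusing the Flag structure from the earlier one), severity_of and fixability_of are per-signal score tables; their values, and the choice to combine category maxima with an overall max, are our assumptions, while the hierarchical and bottleneck logic follow the description above.

```python
def aggregate_severity(flags, severity_of):
    """Hierarchical aggregation: invalid topology is immediately blocking."""
    if any(f.signal == "invalid_geometry" for f in flags):
        return 5  # analysis not possible
    per_category = {}
    for f in flags:
        # Keep only the most severe issue within each category.
        per_category[f.category] = max(per_category.get(f.category, 0),
                                       severity_of[f.signal])
    return max(per_category.values(), default=0)

def aggregate_fixability(flags, fixability_of):
    """Bottleneck aggregation: the least fixable issue dominates."""
    return min((fixability_of[f.signal] for f in flags), default=5)
```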

 

The Priority Matrix

Severity and fixability define a two-dimensional decision space. Each geometry is positioned in a matrix that maps directly to a remediation strategy:

Quadrant                        Interpretation                 Strategy
Quick wins                      Non-critical and fixable       Automation
Critical issues                 Critical and hard to fix       Field recollection / manual investigation
Low priority / tolerable noise  Non-critical and hard to fix   Tolerate / monitor
High-value automation           Critical and fixable           Priority automation
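Mapping a scored geometry to its quadrant then reduces to two threshold comparisons. The cut-off value of 2.5 below is illustrative; the article does not prescribe specific thresholds.

```python
def quadrant(severity, fixability, sev_cut=2.5, fix_cut=2.5):
    """Map a (severity, fixability) pair to a remediation quadrant."""
    if severity == 0:
        return "Clean polygons"  # no flags: no action needed
    critical = severity >= sev_cut
    fixable = fixability >= fix_cut
    if critical and fixable:
        return "High-value automation"
    if critical:
        return "Critical issues"
    if fixable:
        return "Quick wins"
    return "Low priority / tolerable noise"

print(quadrant(severity=1, fixability=5))  # Quick wins
print(quadrant(severity=5, fixability=1))  # Critical issues
```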

 

From One-Off Diagnosis to Continuous Quality Monitoring

Most data quality workflows stop at detection: issues are identified after collection, but not prevented or systematically prioritized. This framework addresses that gap by integrating decision-making directly into the data pipeline.

The ultimate goal of this framework is not to diagnose problems long after collection, but to control data quality as it is produced. In practice, this means embedding quality checks into ingestion pipelines, evaluating geometries as they are collected, and surfacing critical issues immediately while remediation is still feasible.
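Composing the earlier sketches gives an idea of what such an ingestion-time gate could look like; the function and score-table names are assumptions carried over from those sketches, not a prescribed implementation.

```python
def ingest(geometries, severity_of, fixability_of):
    """Quality gate at ingestion: route critical geometries to review."""
    accepted, needs_review = [], []
    for geom in geometries:
        flags = run_checks(geom)
        sev = aggregate_severity(flags, severity_of)
        fix = aggregate_fixability(flags, fixability_of)
        if quadrant(sev, fix) == "Critical issues":
            needs_review.append(geom)  # surface immediately, while remapping is feasible
        else:
            accepted.append(geom)      # clean or correctable downstream
    return accepted, needs_review
```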

This shift enables operational feedback loops. Field teams receive timely signals on critical errors, allowing affected areas to be remapped before teams move on to new regions. Data quality becomes a constraint of collection rather than a downstream concern.

At scale, this transforms the system: heterogeneous practices converge toward standardized processes, reactive cleaning is replaced by proactive control, and unstructured errors become measurable performance indicators. The transition relies on a simple structure: a limited set of error categories, interpretable signals derived from checks, and a scoring system that prioritizes actions.

The insights and thresholds presented here are derived from the dataset analyzed and therefore reflect its specific characteristics. As new datasets and error patterns emerge, both the checks and the scoring logic will be refined. The objective is not to define static rules, but to build a system that continuously adapts and improves as new data is collected.

Ready to turn insights into action? Connect with us.