Domain-Driven Design

What Public Transport Data Teaches You About Architecture

There is a particular kind of mess that you only fully appreciate after you have spent years inside it. Public transport data is that kind of mess. And if you are an architect who has never had to integrate five competing national standards into a single coherent system, I would argue you are missing out on one of the best education programmes in domain-driven design that exist anywhere in industry.

I have spent over twenty years building journey planning systems. The platforms I architect serve operators across multiple UK regions and beyond. They handle millions of journey plans, across web, iOS and Android, with the expectation that the answer comes back in milliseconds. The complexity behind that answer is invisible to the person searching for their morning commute, and that is exactly as it should be. But getting there requires confronting a set of data integration problems that have fundamentally shaped how I think about architecture.

Let me walk through what I mean.

Five standards, five sets of assumptions

In the UK public transport world, there is no single data standard. There are at least five that matter: NaPTAN, ATCO, GTFS, TransXChange and SIRI. Each was designed by a different group, at a different time, to solve a different slice of the same problem. They overlap, they contradict each other, and they all make assumptions about the domain that do not quite hold once you put them side by side.

Take something as fundamental as a bus stop. In NaPTAN, a stop is a precisely defined entity with a specific structure: an access point, a locality, a parent locality, an administrative area, a set of coordinates in OSGB36 (not WGS84, naturally, because why would two standards agree on a coordinate system). In GTFS, a stop is a simpler concept with a different identifier scheme, different assumptions about hierarchy, and coordinates in WGS84. TransXChange describes services and routes with its own model of where vehicles call. SIRI handles real-time information with yet another perspective on what a stop is and how to reference it.
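The contrast is easiest to see in code. The sketch below models the two stop concepts as deliberately separate types; the field names are illustrative stand-ins, not the full NaPTAN or GTFS schemas, and the example values are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NaptanStop:
    """A NaPTAN-style stop: ATCO-coded, placed in a locality hierarchy,
    located by OSGB36 easting/northing. Fields are illustrative only."""
    atco_code: str       # ATCO identifier (hypothetical example: "6400ABC123")
    common_name: str
    locality: str
    admin_area: str
    easting: int         # OSGB36 grid, metres
    northing: int

@dataclass(frozen=True)
class GtfsStop:
    """A GTFS-style stop: a flat identifier and WGS84 coordinates."""
    stop_id: str
    stop_name: str
    stop_lat: float      # WGS84 degrees
    stop_lon: float

# Deliberately no shared base class: code cannot silently treat one
# kind of stop as the other without an explicit translation step.
```

The absence of a common supertype is the design decision, not an omission: it forces every crossing between the two models through a named translation.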

None of them are wrong. They were all designed for a legitimate purpose. But the moment you try to build a system that consumes all of them and produces a single, coherent answer to the question "how do I get from A to B?", you hit a wall that no amount of clever parsing can get you past.

Where domain-driven design stops being theory

This is the point where domain-driven design becomes essential, not optional. I am not talking about reading Eric Evans' book and nodding along. I am talking about sitting down with the actual data, the actual edge cases, the actual operators, and building a domain model that honestly represents the problem space without pretending the contradictions do not exist.

In our systems, a NaPTAN stop and a GTFS stop are not the same thing. They cannot be. Treating them as interchangeable behind a shared interface leads to subtle bugs that only show up at scale: incorrect journey plans, missing connections, operators whose data renders differently depending on which source was ingested first. The architecture has to acknowledge the differences at the boundary, translate cleanly through an anti-corruption layer, and present a unified model on the other side.
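An anti-corruption layer of this kind can be sketched in a few lines. This is an assumption-laden illustration, not our production code: the minimal `NaptanStop` stand-in and the `naptan:` identifier prefix are invented for the example, and the OSGB36 to WGS84 coordinate transform is injected as a function because a real implementation would delegate it to a proper geodesy library (pyproj, for instance) rather than hand-roll it.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

@dataclass(frozen=True)
class NaptanStop:
    """Minimal stand-in for a NaPTAN-context stop."""
    atco_code: str
    common_name: str
    easting: int     # OSGB36
    northing: int

@dataclass(frozen=True)
class UnifiedStop:
    """The core domain's stop: one identifier scheme, WGS84 only."""
    id: str
    name: str
    lat: float
    lon: float

def naptan_to_unified(
    stop: NaptanStop,
    grid_to_wgs84: Callable[[int, int], Tuple[float, float]],
) -> UnifiedStop:
    """Anti-corruption layer: translate at the boundary, so the core
    domain never sees ATCO codes or OSGB36 coordinates directly."""
    lat, lon = grid_to_wgs84(stop.easting, stop.northing)
    return UnifiedStop(
        id=f"naptan:{stop.atco_code}",   # source-qualified id (illustrative)
        name=stop.common_name,
        lat=lat,
        lon=lon,
    )
```

The point of the shape is that the translation is the only door in: the unified model stays free of source-specific vocabulary, and each source gets exactly one place where its quirks are handled.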

We spent a lot of time establishing bounded contexts around each data source. Each standard gets its own context with its own internal model. The translation happens at the boundaries, not inside the core domain. The ubiquitous language matters enormously here. When someone on the team says "stop", everyone needs to know which context they are talking in, because the word means something materially different depending on whether you are looking at NaPTAN data, GTFS data, or the unified model that serves the API.

This is not a theoretical exercise. It is the difference between a system that works reliably across regions and operators, and one that works most of the time but produces silently wrong answers for edge cases that nobody catches until a commuter in Edinburgh misses their connection.

Messy data is the real world

One of the things transport data teaches you early on is that the real world does not conform to schemas. Operators submit data in varying levels of completeness and correctness. Standards have optional fields that some providers populate and others ignore. Updates arrive on different schedules. Some data is weeks old before you see it. Some of it is real-time via SIRI but only for a subset of services.

If you build an architecture that assumes clean, complete, timely data, you will be disappointed before lunchtime. The systems I build assume that data will be partial, late, contradictory and occasionally wrong. The architecture has to be tolerant of that without silently degrading the quality of the answer. That means validation at the ingestion boundary, sensible defaults where data is missing, clear logging when something looks anomalous, and ranking logic in the search layer that can still produce a good answer from imperfect inputs.
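A minimal sketch of that ingestion boundary, under the assumption of a simple stop record with optional fields (the record shape, defaults and thresholds here are invented for illustration):

```python
import logging
from dataclasses import dataclass
from typing import Optional, Tuple

log = logging.getLogger("ingest")

@dataclass
class RawStopRecord:
    """One record as received from an upstream feed: everything optional,
    because that is what real feeds deliver."""
    stop_id: Optional[str] = None
    name: Optional[str] = None
    lat: Optional[float] = None
    lon: Optional[float] = None

def validate(record: RawStopRecord) -> Optional[Tuple[str, str, float, float]]:
    """Reject what is unusable, default what is merely missing, and log
    anomalies loudly instead of silently degrading downstream answers."""
    if not record.stop_id:
        log.warning("dropping record with no stop_id: %r", record)
        return None
    name = record.name or "Unnamed stop"          # sensible default
    if record.name is None:
        log.info("stop %s missing name; defaulted", record.stop_id)
    if record.lat is None or record.lon is None:
        log.warning("stop %s has no coordinates; dropping", record.stop_id)
        return None
    if not (-90 <= record.lat <= 90 and -180 <= record.lon <= 180):
        log.warning("stop %s has out-of-range coordinates; dropping",
                    record.stop_id)
        return None
    return (record.stop_id, name, record.lat, record.lon)
```

Note the asymmetry: a missing name is defaulted and logged, because the stop is still usable; missing or impossible coordinates mean the record cannot serve a journey plan, so it is dropped at the door.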

This lesson transfers directly to every other domain I have seen. Healthcare data, financial data, logistics data, government data. It is all messy. The architects who build robust systems are the ones who design for the mess from day one rather than assuming someone upstream will clean it up for them.

The gazetteer problem

Search sits at the heart of any journey planning system, and it is a perfect example of how domain knowledge changes architectural decisions. The gazetteer is the component that turns a user's typed query into a usable location. Type "Edinburgh Waverley" and the system needs to resolve that to a specific stop, not a vaguely similar street name or a locality three counties away.

The legacy implementation was a performance bottleneck. When I evaluated the options, the standard answers did not fit. Elasticsearch was powerful but operationally heavy for this specific use case and added infrastructure we would need to manage. Database-native full-text search did not give us enough control over ranking. Off-the-shelf geocoding services were too generic. They do not understand NaPTAN stop points. They do not know that "Central Station" in Glasgow and "Central Station" in Newcastle are fundamentally different entities that need different ranking depending on the operator context.

The solution was a bespoke Apache Lucene implementation with domain-specific analysers and custom ranking. It gets us lookup times of approximately four milliseconds with significantly better result accuracy than the legacy implementation. The point is not that Lucene is always the right answer. The point is that understanding the domain deeply enough to know why the standard answers fall short is an architectural skill, not a data engineering skill. The best technology choice depends on a thorough understanding of the problem, and in this case the problem was shaped entirely by the peculiarities of UK public transport data.
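The "Central Station" problem above can be made concrete with a toy ranking sketch. This is emphatically not the production Lucene setup: it is an in-memory illustration of the one domain idea, that operator context should reorder otherwise identical matches, with scoring weights invented for the example.

```python
from typing import List, Optional, Tuple

def rank_stops(
    query: str,
    stops: List[Tuple[str, str]],          # (name, region) pairs
    operator_region: Optional[str] = None,
) -> List[Tuple[str, str]]:
    """Toy gazetteer ranking: exact matches beat prefix matches beat
    substring matches, and stops in the caller's operator region get a
    boost so the 'right' Central Station wins. Weights are illustrative."""
    q = query.strip().lower()
    scored = []
    for name, region in stops:
        n = name.lower()
        if n == q:
            score = 100
        elif n.startswith(q):
            score = 50
        elif q in n:
            score = 25
        else:
            continue                        # no match at all
        if operator_region and region == operator_region:
            score += 40                     # domain context boost
        scored.append((score, name, region))
    scored.sort(key=lambda t: t[0], reverse=True)
    return [(name, region) for _, name, region in scored]
```

With `operator_region="Glasgow"`, a query for "central station" ranks the Glasgow stop above the Newcastle one even though the names are identical: that reordering is the part a generic geocoder cannot do, because it has no notion of operator context.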

What this teaches about architecture more broadly

After twenty years in this domain, a few principles have crystallised that I think apply well beyond transport.

First, model the domain before you model the data. The temptation with complex data integration is to start with the schema and work inwards. That leads to systems that can store the data but cannot reason about it. Start with what the business actually needs to do with the data, model that, and let the storage follow.
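One way to see the difference: schema-first thinking stores whatever fields the feed provides; domain-first modelling names the question the business asks and lets storage follow. The sketch below is a hypothetical example of the latter, with invented names, for a question journey planning genuinely has to answer.

```python
from dataclasses import dataclass
from datetime import time
from typing import Iterable, List

@dataclass(frozen=True)
class Departure:
    """A domain fact: a service leaving a stop at a time."""
    service: str
    stop_id: str
    departs: time

def departures_after(
    timetable: Iterable[Departure], stop_id: str, t: time
) -> List[Departure]:
    """The question the business actually asks ('what leaves this stop
    after this time?'), expressed as a domain operation. How the
    timetable is stored can change underneath it without touching callers."""
    calls = [d for d in timetable if d.stop_id == stop_id and d.departs >= t]
    return sorted(calls, key=lambda d: d.departs)
```

Starting from this operation, the storage design becomes a question of what makes `departures_after` fast and correct, rather than the domain model being whatever shape the feed happened to arrive in.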

Second, bounded contexts are not a luxury. When you have multiple sources of truth that overlap but do not agree, you need clear boundaries with explicit translation between them. This is not architectural over-engineering. It is the minimum viable design for correctness.

Third, design for the data you actually receive, not the data the specification promises. Every standard has a gap between what the documentation says and what the real-world feeds contain. Your architecture needs to handle that gap gracefully.

Fourth, performance is a domain concern, not just an infrastructure concern. The four-millisecond gazetteer was not a generic performance optimisation. It was a domain-specific solution that required understanding the data, the use cases, and the failure modes specific to public transport search. You cannot optimise what you do not understand.

And fifth, simplicity wins in the long run. The systems that have aged best in the platforms I have built are the ones where we chose the simplest design that handled the actual requirements, including the messy ones. The clever architectures are the ones that needed replacing first.

The boring domain advantage

Public transport data is not glamorous. It does not make for exciting conference talks. Nobody is writing breathless blog posts about NaPTAN ingestion pipelines. But as an education in building systems that handle real-world complexity at scale, I have not found a better teacher. The data is messy, the standards conflict, the users have zero tolerance for wrong answers, and the system has to be fast, reliable and available every single day.

If your architecture can handle that, it can handle most things.