I need a better data model.

I've worn many hats over the years. I was an all-source analyst. I served on a vulnerability assessment team for critical infrastructure, where I worked as both a mission and systems analyst, which is a fancy way of saying I spent my time understanding what things do and how they fit in with everything else. I received an education in data science and afterwards worked for a few government agencies as a data scientist before eventually working as an applied scientist in modeling and simulation. Throughout my career, if I'm being honest, I've felt more like a data janitor than an analyst or scientist. A small portion of my time was spent getting spun up on whatever domain I'd been dropped into, simultaneously sprinting to understand the problem and find relevant data. The majority of the time, as I suspect most of you reading this can relate to, was spent wrangling and cleaning that data. In the end, with what little time remained, I performed some analysis, packaged it into a report or brief, and moved on to the next problem. Even later in my career, when project timelines went from being measured in weeks to being measured in months, the pain remained the same. More time just meant higher expectations.

I've seen data collected, used, lost, and collected again by people who had no idea it had ever existed. I've seen data stored as if it were socks in a freshman's dorm room. I've dealt with datasets so broken I had to wonder whether they had been sabotaged on purpose or assembled specifically as cautionary tales. I've repeated the same cycle more times than I can count: discover the data, make assumptions, bake those assumptions into the pipeline, learn something that breaks them, rebuild. George Box famously said all models are wrong but some are useful. The same is true of data. Data and models are abstractions of real-world things, and abstractions are not the things themselves. Most systems and methodologies never reckon with that honestly. They are data-centric. They place emphasis on the data rather than the things the data describes. They reach for gold standard sources and optimize for process, with the assumption baked in that the data model is correct and the pipeline is sound. The reality is that most of them are built on shifting sands and they know it. They just don't have a better way.

Ultimately, the problem wasn't the data itself. It was that every system I worked in was organized around the data as an artifact. The file. The table. The API response. Not the things the data was trying to describe. A bridge. A power plant. A census tract. These things exist independently of any dataset that mentions them. They have histories. They change over time. Different sources describe them differently, sometimes contradicting each other, sometimes filling in gaps the others miss. But the systems I worked in had no concept of the thing itself. They only knew the data. Worse, they didn't care where the data came from or what had been done to it before it arrived. When they did care, it was an afterthought, a column in a metadata table that nobody updated and everyone ignored.

This is the difference between data-centric and entity-centric thinking.

In a data-centric system, a bridge is a row in a table. In an entity-centric system, a bridge is a thing, and the row in the table is just one claim about it, made by one source, at one point in time. That distinction sounds subtle but it changes everything about how you build. It changes what you store, how you query, and what questions you're even able to ask.

If a bridge is a row, conflicting data is a data quality problem. If a bridge is an entity, conflicting data is just multiple sources describing the same thing with varying authority and currency. You don't resolve the conflict by picking a winner and discarding the rest. You preserve all of it, track where it came from, and let your query layer decide what's most trustworthy for the question you're asking right now. Tomorrow the answer might be different, and that's fine, because you still have everything you had yesterday.

I needed a model that thought about data the way I'd learned to think about the world, in terms of things, not tables. So I built one.

I call it the Open Entity State Ontology, or OESO. It is a domain-agnostic ontology for describing real-world entities, and the foundation everything else I'm building rests on. The central concept is simple: rather than describing an entity directly, you describe what a particular source believed about it during a particular window of time. I call these States.

A State is a temporally bounded, immutable description of a thing from a single source. It could be a row in a CSV file describing a power plant, a snapshot from an API call, a field inspection report, or the output of a model. What matters is that every State carries four things: what it describes, where it came from, when it was true of the world, and when we learned about it. Once created, a State is never modified. If a source gets something wrong, you don't correct the State; you create a new one. The old one stays, permanently, because it is part of the record of what you knew and when you knew it. I chose to do this because I want to be able to evaluate sources, processes, and activities over time.
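As a rough illustration, a State can be sketched in Python as a frozen record. This is a minimal sketch under stated assumptions, not OESO's actual vocabulary: the class name `State`, the field names (`entity_id`, `source_id`, `valid_from`, `valid_to`, `recorded_at`, `attributes`), the `supersede` helper, and the example bridge values are all hypothetical.

```python
from dataclasses import dataclass, FrozenInstanceError
from datetime import datetime, timezone
from types import MappingProxyType
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class State:
    """One source's description of one entity over one window of time."""
    entity_id: str                 # what it describes
    source_id: str                 # where it came from
    valid_from: datetime           # when it started being true of the world
    valid_to: Optional[datetime]   # None: no known end to the validity window
    recorded_at: datetime          # when we learned about it
    attributes: Mapping[str, Any]  # hypothetical attribute bag

    def __post_init__(self):
        # Store a read-only copy so the attributes cannot be mutated either.
        object.__setattr__(
            self, "attributes", MappingProxyType(dict(self.attributes))
        )


def supersede(old: State, recorded_at: datetime, **changes: Any) -> State:
    """Build a new State carrying corrected attributes; `old` is left untouched."""
    return State(
        entity_id=old.entity_id,
        source_id=old.source_id,
        valid_from=old.valid_from,
        valid_to=old.valid_to,
        recorded_at=recorded_at,
        attributes={**old.attributes, **changes},
    )


# Made-up example data: a source reports a load rating, then corrects it.
original = State(
    entity_id="bridge-001",
    source_id="state-dot-csv",
    valid_from=datetime(2023, 1, 1, tzinfo=timezone.utc),
    valid_to=None,
    recorded_at=datetime(2023, 3, 15, tzinfo=timezone.utc),
    attributes={"load_rating_tons": 40},
)
corrected = supersede(
    original, datetime(2023, 6, 1, tzinfo=timezone.utc), load_rating_tons=36
)

try:
    original.entity_id = "bridge-002"  # any in-place edit is rejected
except FrozenInstanceError:
    print("States cannot be modified after creation")
```

Both records survive: `original` still says 40 tons as recorded in March, and `corrected` says 36 tons as recorded in June. Keeping the two timestamps separate is what lets you ask both "what was true then?" and "what did we believe then?".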

States can be produced by many kinds of activities: human-in-the-loop additions, automated ETL processes, model outputs, or any process that takes one or more existing States and derives something new from them. The provenance chain follows automatically. You can always trace a derived State back through the activities that produced it, to the sources those activities consumed, to the organizations that published those sources. Nothing disappears. This matters in practice because it allows multiple people, automated processes, or even AI contributions to coexist without overwriting or getting in the way of each other. It also means you can track imputation logic used to fill missing attributes, or capture the emergence of new entities that were inferred rather than directly observed.
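The backward walk described above can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the record types (`Source`, `Activity`, `StateRecord`), their fields, the `trace` function, and the sample IDs and agency names are hypothetical and are not taken from OESO or PROV-O.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Source:
    source_id: str
    publisher: str                 # organization that published the source


@dataclass(frozen=True)
class Activity:
    activity_id: str
    kind: str                      # e.g. "etl", "manual", "model"
    used: Tuple[str, ...]          # IDs of the States the activity consumed


@dataclass(frozen=True)
class StateRecord:
    state_id: str
    source_id: Optional[str]       # set when the State was read from a source
    generated_by: Optional[str]    # set when an activity derived the State


def trace(
    state_id: str,
    states: Dict[str, StateRecord],
    activities: Dict[str, Activity],
    sources: Dict[str, Source],
) -> List[str]:
    """Walk backward from a State and list every hop in its lineage."""
    lineage: List[str] = []
    seen = set()
    stack = [state_id]
    while stack:
        sid = stack.pop()
        if sid in seen:            # a State may feed several activities
            continue
        seen.add(sid)
        state = states[sid]
        lineage.append(f"state:{sid}")
        if state.generated_by is not None:
            activity = activities[state.generated_by]
            lineage.append(f"activity:{activity.activity_id}")
            stack.extend(reversed(activity.used))
        if state.source_id is not None:
            source = sources[state.source_id]
            lineage.append(f"source:{source.source_id}")
            lineage.append(f"publisher:{source.publisher}")
    return lineage


# Made-up example: s3 was imputed from two directly observed States.
sources = {
    "src-a": Source("src-a", "Agency A"),
    "src-b": Source("src-b", "Agency B"),
}
activities = {"impute-1": Activity("impute-1", "model", ("s1", "s2"))}
states = {
    "s1": StateRecord("s1", "src-a", None),
    "s2": StateRecord("s2", "src-b", None),
    "s3": StateRecord("s3", None, "impute-1"),
}

for hop in trace("s3", states, activities, sources):
    print(hop)
```

Because derived States only point at their inputs and never replace them, a new contributor adds records to these dictionaries without touching anyone else's.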

But storing States is only half the problem. If you have ten States from five different sources describing the same bridge attribute, you still need to answer a practical question: which one do you trust right now? That requires assessments. Assessments make claims about how well a source describes reality, how authoritative it is, how current it is, or how well it has held up against ground truth over time. An assessment can be as simple as a user manually flagging a source as authoritative, or as sophisticated as a scoring pipeline that evaluates sources against a defined set of metrics. OESO has no opinion about how you assess. It only strongly suggests that you do, and that you record it alongside everything else. What you do with those assessments is up to you and whatever state resolution logic you bring to the table.
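One way to record assessments so they accumulate rather than overwrite each other is an append-only list with a lookup for "the latest score as of a given date". The sketch below is illustrative only: the `Assessment` fields, the 0.0 to 1.0 score scale, the `latest_scores` function, and the source names are assumptions, since OESO leaves the assessment method up to you.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, Iterable


@dataclass(frozen=True)
class Assessment:
    source_id: str
    metric: str            # e.g. "authority" or "currency"
    score: float           # assumed scale: 0.0 (no trust) to 1.0 (full trust)
    assessed_by: str       # a person or a scoring pipeline
    assessed_at: datetime


def latest_scores(
    assessments: Iterable[Assessment], metric: str, as_of: datetime
) -> Dict[str, float]:
    """Most recent score per source for one metric, ignoring later assessments."""
    latest: Dict[str, Assessment] = {}
    for a in assessments:
        if a.metric != metric or a.assessed_at > as_of:
            continue
        current = latest.get(a.source_id)
        if current is None or a.assessed_at > current.assessed_at:
            latest[a.source_id] = a
    return {source_id: a.score for source_id, a in latest.items()}


def utc(year: int, month: int, day: int) -> datetime:
    return datetime(year, month, day, tzinfo=timezone.utc)


# Made-up history: a manual flag, then a pipeline downgrades one source.
history = [
    Assessment("federal-db", "authority", 0.9, "analyst", utc(2024, 1, 1)),
    Assessment("county-gis", "authority", 0.6, "analyst", utc(2024, 1, 1)),
    Assessment("county-gis", "authority", 0.4, "scoring-pipeline", utc(2024, 6, 1)),
]

early = latest_scores(history, "authority", as_of=utc(2024, 3, 1))
late = latest_scores(history, "authority", as_of=utc(2024, 12, 1))
print(early)
print(late)
```

The downgrade in June does not erase the January assessment, so a question asked "as of March" still gets the answer that was defensible in March.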

Because OESO preserves States, sources, and assessments as first-class records, you have everything you need to build an estimation layer on top of it. Given everything you know about an entity, all of its States, every source, every assessment, you can ask: what is my best current understanding of what this thing actually looks like right now? The answer is rarely a single State. In practice, no one source describes an entity completely. The authoritative federal database might have the most trusted load rating for a bridge but say nothing about its current inspection status. A supplemental source might fill that gap even though you would never trust it for structural ratings. A good estimation layer walks the available States in order of assessed authority and assembles the most complete picture it can, taking the most trusted value for each attribute and falling back to lesser sources only where the better ones are silent. OESO doesn't answer that question for you. It gives you everything you need to answer it yourself, consistently, reproducibly, and without throwing anything away. How you build that estimation layer is a topic for another post.
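The attribute-by-attribute fallback described above can be sketched as a short function. Treat this as one possible toy resolution strategy under loud assumptions, not the estimation layer itself: the `estimate` function, the tuple shapes, the use of `None` to mean "the source is silent", and the authority scores are all hypothetical.

```python
from typing import Any, Dict, List, Tuple

# A State is reduced here to (source_id, attributes) for brevity.
SimpleState = Tuple[str, Dict[str, Any]]


def estimate(
    states: List[SimpleState], authority: Dict[str, float]
) -> Dict[str, Tuple[Any, str]]:
    """Return attribute -> (value, source_id), preferring higher authority.

    A source with no recorded authority score is ranked last (0.0).
    A value of None is treated as the source being silent on that attribute.
    """
    ranked = sorted(
        states, key=lambda s: authority.get(s[0], 0.0), reverse=True
    )
    picture: Dict[str, Tuple[Any, str]] = {}
    for source_id, attributes in ranked:
        for name, value in attributes.items():
            if value is not None and name not in picture:
                picture[name] = (value, source_id)
    return picture


# Made-up example: the federal source is trusted but silent on inspections.
states = [
    ("county-gis", {"load_rating_tons": 35, "inspection_status": "overdue"}),
    ("federal-db", {"load_rating_tons": 40, "inspection_status": None}),
]
authority = {"federal-db": 0.9, "county-gis": 0.4}

best = estimate(states, authority)
print(best["load_rating_tons"])
print(best["inspection_status"])
```

Each value in the result carries the source it came from, so the estimate stays traceable, and rerunning `estimate` with different authority scores gives a different picture from the same underlying States.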

Rather than reinventing the wheel, OESO extends standards the semantic web community already uses. PROV-O provides the provenance vocabulary, including activities, agents, and derivations. DCAT provides the dataset and distribution model. OESO layers the bitemporal State model, the assessment pattern, and the entity resolution framework on top of those foundations. If you already work with either standard, most of OESO will feel familiar.

OESO isn't magic and it isn't trying to be. Think of it as a contract you can use to develop an entity-centric system. It has no opinion about your domain, your sources, or your resolution logic. It only asks that you store knowledge as States, make assessments on the sources of those States, and use those assessments to make honest estimates about the things you're trying to describe. In return it gives you something most data systems can't offer: a complete, auditable, reproducible record of everything you've believed about the world and why. It allows you to assess sources and processes as you go and provides the means to store those assessments as first-class artifacts. So if your starting point is disparate data and your goal is a best estimate of the portion of the world you need to model, OESO gives you the means to map that journey and to prove you took it.

I've published OESO at github.com/MirrorDrop/ontologies. It's open, it's versioned, and it provides examples.
