Over the last few months I've been refining KMDS, a framework for building repeatable and auditable machine learning systems.
The original motivation behind KMDS was simple:
Many machine learning projects fail long before model selection becomes important.
Teams struggle with questions such as:
- What entities are represented in the data?
- What is the unit of analysis?
- What temporal structure exists?
- Which feature engineering strategies are appropriate?
- Which modeling assumptions were made?
- How are these decisions preserved over time?
Most organizations answer these questions at some point. The problem is that the answers often disappear into notebooks, documents, tickets, or the memories of individual contributors.
KMDS is an attempt to make these decisions explicit, structured, and reusable.
What Changed?
Recent updates move KMDS beyond workflow automation and toward analytical governance.
1. Metadata-Driven Semantic Data Understanding
The workflow begins with semantic tagging and metadata generation.
Rather than immediately building features or training models, the system first attempts to understand:
- attribute types
- entities
- temporal structure
- data quality characteristics
The goal is to establish a semantic foundation before modeling begins.
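To make the idea concrete, here is a minimal sketch of what such a semantic foundation might record. It is not the KMDS API: `AttributeProfile`, `infer_type`, and `profile` are hypothetical names, the table is a plain list of dictionaries, and the type rules are deliberately simplistic assumptions.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class AttributeProfile:
    """Hypothetical metadata record for one attribute (not a KMDS class)."""
    name: str
    inferred_type: str   # "numeric", "temporal", "categorical", or "unknown"
    missing_rate: float  # share of rows where the value is None or ""
    n_distinct: int      # number of distinct non-missing values


def _is_number(value) -> bool:
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


def _is_iso_date(value) -> bool:
    try:
        datetime.fromisoformat(str(value))
        return True
    except ValueError:
        return False


def infer_type(values) -> str:
    """Assumed rule: a type applies only if every non-missing value fits it."""
    if not values:
        return "unknown"
    if all(_is_number(v) for v in values):
        return "numeric"
    if all(_is_iso_date(v) for v in values):
        return "temporal"
    return "categorical"


def profile(rows):
    """Build one AttributeProfile per column of a list-of-dicts table."""
    columns = sorted({key for row in rows for key in row})
    profiles = []
    for col in columns:
        raw = [row.get(col) for row in rows]
        present = [v for v in raw if v not in (None, "")]
        profiles.append(AttributeProfile(
            name=col,
            inferred_type=infer_type(present),
            missing_rate=(1 - len(present) / len(raw)) if raw else 0.0,
            n_distinct=len(set(present)),
        ))
    return profiles


# Illustrative data only.
rows = [
    {"customer_id": "C1", "signup": "2024-01-05", "spend": "120.5"},
    {"customer_id": "C2", "signup": "2024-02-11", "spend": ""},
    {"customer_id": "C1", "signup": "2024-03-02", "spend": "80"},
]
for p in profile(rows):
    print(p)
```

A repeated `customer_id` alongside a temporal column is the kind of signal that would prompt the entity and unit-of-analysis questions before any feature is built.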
2. Feature Advisor
One of the new additions is a Feature Advisor service.
Given metadata and project context, the advisor recommends feature engineering strategies for non-numeric attributes.
Examples include:
- hierarchical categorical encoding
- target encoding strategies
- TF-IDF pipelines
- sentence embedding approaches
- native categorical handling in modern gradient boosting libraries
The objective is not automatic feature engineering.
The objective is to provide design guidance and rationale that helps practitioners make better decisions.
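The sketch below shows one way guidance plus rationale could be expressed as simple rules over attribute metadata and project context. It is an assumption-laden illustration, not how the Feature Advisor is implemented: `advise`, `Recommendation`, the dictionary keys, the cardinality threshold of 50, and the one-hot fallback are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Recommendation:
    strategy: str
    rationale: str


def advise(attr: dict, context: dict) -> Recommendation:
    """Hypothetical rule-based advisor; rules and thresholds are illustrative."""
    kind = attr["kind"]  # assumed values: "categorical" or "text"

    if kind == "text":
        if context.get("needs_interpretability"):
            return Recommendation(
                "tfidf", "Term weights can be inspected directly.")
        return Recommendation(
            "sentence_embedding", "Captures meaning beyond shared tokens.")

    if attr.get("hierarchy"):
        return Recommendation(
            "hierarchical_encoding",
            "Sparse levels can fall back to their parent level.")
    if context.get("model_family") == "gradient_boosting":
        return Recommendation(
            "native_categorical",
            "The model can split on categories without manual encoding.")
    if attr.get("n_distinct", 0) > 50:  # assumed cardinality threshold
        return Recommendation(
            "target_encoding",
            "High cardinality makes one-hot wide and sparse; "
            "fit the encoding within folds to limit leakage.")
    return Recommendation(
        "one_hot", "Low cardinality keeps the encoded matrix small.")


rec = advise({"kind": "categorical", "n_distinct": 400},
             {"model_family": "linear"})
print(rec.strategy, "-", rec.rationale)
```

Returning the rationale next to the strategy keeps the decision with the practitioner, who can accept, override, or record it.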
3. Design Governance
A second addition is a Design Governance framework.
Machine learning projects contain many decision points:
- classification vs regression
- handling class imbalance
- interpretability vs predictive performance
- validation strategy
- calibration requirements
- graph-based vs tabular approaches
The Design Governance layer acts as a design-time advisor that captures these considerations and generates implementation guidance.
The output is a structured design blueprint that can be reviewed by humans or supplied to AI coding assistants.
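As a rough sketch of what a structured blueprint could look like, the example below records decisions with their rationale and rejected alternatives, then serializes them to JSON for human review or for use as context by a coding assistant. The class names, fields, and sample decisions are hypothetical and do not reflect the actual KMDS schema.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class DesignDecision:
    topic: str          # e.g. "validation_strategy"
    choice: str
    rationale: str
    alternatives: list = field(default_factory=list)


@dataclass
class DesignBlueprint:
    """Hypothetical container for design-time decisions (not the KMDS schema)."""
    project: str
    decisions: list = field(default_factory=list)

    def record(self, topic, choice, rationale, alternatives=()):
        self.decisions.append(
            DesignDecision(topic, choice, rationale, list(alternatives)))
        return self  # allow chaining

    def to_json(self) -> str:
        # asdict() converts nested dataclasses inside lists as well.
        return json.dumps(asdict(self), indent=2)


# Illustrative project and decisions only.
blueprint = (
    DesignBlueprint("churn-model")
    .record("task_type", "classification",
            "The business action is a yes/no retention offer.",
            alternatives=["regression on days-to-churn"])
    .record("validation_strategy", "time-based split",
            "Random splits would leak future behavior into training.",
            alternatives=["k-fold"])
)
print(blueprint.to_json())
```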
4. Knowledge Preservation
Perhaps the most important change is an increased emphasis on preserving analytical knowledge.
The long-term goal is not simply to create models.
It is to create reusable analytical assets.
Using KMDS tooling, project artifacts can be transformed into a knowledge graph representing:
- data understanding
- feature engineering decisions
- modeling assumptions
- operational considerations
- generated artifacts
This creates a queryable representation of the analytical lifecycle.
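To illustrate what "queryable" can mean here, the following sketch stores project knowledge as subject-predicate-object triples and filters them by any combination of fields. It is a toy in-memory stand-in, not the KMDS tooling: `KnowledgeGraph`, the predicate names, and the sample facts are all hypothetical.

```python
from collections import namedtuple

Triple = namedtuple("Triple", ["subject", "predicate", "obj"])


class KnowledgeGraph:
    """Toy triple store used only to illustrate the idea."""

    def __init__(self):
        self.triples = set()

    def add(self, subject, predicate, obj):
        self.triples.add(Triple(subject, predicate, obj))

    def query(self, subject=None, predicate=None, obj=None):
        """Return triples matching every field that is not None, sorted."""
        return sorted(
            t for t in self.triples
            if (subject is None or t.subject == subject)
            and (predicate is None or t.predicate == predicate)
            and (obj is None or t.obj == obj)
        )


# Illustrative facts only.
kg = KnowledgeGraph()
kg.add("feature:region_encoded", "derived_from", "attribute:region")
kg.add("feature:region_encoded", "uses_strategy", "target_encoding")
kg.add("model:churn_v1", "uses_feature", "feature:region_encoded")
kg.add("model:churn_v1", "assumes", "labels are observed within 30 days")

# What do we know about the model, and what depends on the raw attribute?
for t in kg.query(subject="model:churn_v1"):
    print(t)
for t in kg.query(obj="attribute:region"):
    print(t)
```

Because the facts are plain data rather than prose in a notebook, the same questions can still be asked after the original authors have moved on.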
Why This Matters
Most organizations already have documentation.
What they often lack is accessible institutional knowledge.
Critical analytical decisions are frequently distributed across:
- repositories
- notebooks
- presentations
- tickets
- email threads
- individual contributors
When people leave, much of that context leaves with them.
My view is that the real asset is not the AI agent.
The real asset is the structured analytical knowledge that the agent can access.
If the knowledge is preserved independently of any specific model, tool, or LLM, organizations retain ownership of their analytical reasoning and can recreate capabilities as technology evolves.
Current Direction
The broader goal of KMDS is to make machine learning systems:
- more transparent
- more auditable
- more reproducible
- easier to transfer between teams
Recent work has focused on feature governance, design governance, metadata-driven workflows, and knowledge graph generation.
Future work will continue exploring how analytical context can be captured and preserved as a first-class artifact rather than an afterthought.
I would be interested in hearing how others are approaching analytical governance, reproducibility, and knowledge preservation in their own machine learning workflows.