Design Salesforce Like a Distributed System

Distributed systems engineers don't get to reach across service boundaries whenever it's convenient. The network stops them. A microservice that needs data owned by another service can't just query it. It has to ask, wait, and handle failure. That friction is annoying. It's also the thing that enforces discipline: every team knows exactly where its responsibility ends, because the boundary is physical.

Salesforce gives you no such friction. Every object in your org, every Account, every Order, every custom object owned by another team, is one SOQL query away. Any Apex class can read it. Any trigger can write to it. Any Flow can fire on it. There is no network. There is no boundary. Nothing stops one domain from reaching into another's territory, and nothing makes that reach-in visible until something breaks.

That's the core problem this article addresses: how to design a Salesforce org with the discipline distributed systems engineers are forced to have, in an environment that will never force it on you. Platform Events and a principled trigger architecture are the tools. Bounded contexts, facts vs commands, and choreography or orchestration are the concepts. The prize is the same thing distributed systems get from their hard boundaries: teams that can ship independently without breaking each other.

This article is opinionated and grounded in real patterns from a production org. The goal is to shift how you think about who's allowed to touch what, not just how to get work done async.
It is not an argument that every Salesforce org should adopt distributed systems patterns. Rather, it is a thought experiment exploring what changes when we analyze Salesforce through a distributed systems lens.

1. What Salesforce doesn't force on you (and why that's dangerous)

In a distributed system, the boundary between services is enforced by physics. Service B cannot read Service A's database directly. If it needs data, it calls an API, subscribes to an event stream, or maintains its own projection. These patterns (APIs, events, read models) didn't emerge from clever design thinking. They emerged because there was literally no other option.

The result is that distributed systems engineers develop a reflex: before touching anything, ask "do I own this?" If the answer is no, the question becomes "what's the sanctioned channel for getting what I need?"

Salesforce developers rarely develop this reflex, because the platform never punishes them for not having it. The consequence plays out slowly. In year one, one team writes a trigger on Contact and it works fine. In year two, three teams have logic in the same trigger handler. In year five, no one team can change the Contact object without coordinating with four others, running a full regression suite, and hoping nothing breaks in production. The org has become a non-distributed monolith: all the coupling of a distributed system with none of the boundaries, and no network to make the coupling visible.

Distributed systems thinking is the antidote. Not the mechanics (you don't need service meshes or gRPC), but the mindset: treat domain boundaries as if they were network boundaries, even though they aren't. Refuse cross-domain reach-ins the same way a service refuses direct database access from a neighbour. Use events as the only sanctioned channel for cross-domain communication.

The difference from the distributed world is that on Salesforce you have one significant advantage: the data is already local. You don't need to copy data across service boundaries, because there are no service boundaries at the database level. Every subscriber can re-query the source data directly. It's a SOQL call away. This changes the shape of event design considerably, as we'll get to in §7.

2. Domain bleed: what coupling without a boundary looks like

Distributed systems literature warns about the "distributed monolith": services so entangled they must deploy together, with all the pain of distribution and none of the independence. In Salesforce, you get a mirror-image problem: a non-distributed monolith. One schema, one automation surface, every team in everyone else's business, and no network boundary to make the coupling visible or enforce a fix.

Domain bleed is when one domain's logic executes inside another domain's territory. In Salesforce it shows up as concrete symptoms, all of which you have seen:

A trigger nobody owns. AccountTrigger (or the one handler class) has commits from five teams. Touching it for a Billing change risks a Sales regression. Code review needs three squads.
Automation firing across domains. A Billing flow triggered by a field change on Opportunity, which Sales owns. Sales refactors a picklist; Billing breaks in production; nobody connected the two.
SOQL reaching three domains deep. A Support class queries Case → Contact → Account → Opportunity → OpportunityLineItem, encoding assumptions about four domains' models. Any of them refactoring breaks the query.
Validation rules encoding someone else's invariants. A rule on Contact enforcing a Marketing constraint that fires for every team that ever inserts a contact, including integration users that know nothing about it.
Shared fields with implicit contracts. A field added "just for Marketing" that automation, reports, and tests across the org now silently depend on.

The result is the same coupling a distributed system avoids by construction. Things that should be independent become one fragile thing, concentrated in a single org with a single deploy. You can't ship one domain without regression-testing the others. There is no slice any team fully owns. Velocity collapses as the org ages, and everyone blames "Salesforce technical debt" when the real diagnosis is missing boundaries.

Here's the critical distinction: the shared database is not the bleed. Co-located data is your advantage, not your problem. Distributed systems would kill for it. The bleed is control flow and ownership crossing boundaries: one domain reaching in to query, fire on, or write to another's objects. A distributed system can't do that because the network stops it. In Salesforce, you have to stop it yourself. Events are how you do it.

3. There are two kinds of event-driven Salesforce, and triggers are one of them

Before Platform Events, Salesforce already had an event-driven primitive that every Apex developer uses every day: triggers. Triggers are synchronous events. A record changes; the platform fires a before/after trigger; your code runs. The pattern is event-driven at its core.

The problem is ownership. And this is where the two models diverge sharply.

The trigger ownership problem

When a trigger is owned by everyone, it's owned by nobody. The classic symptom: a single AccountTrigger.cls (or one handler class dispatching to a dozen unrelated features) with commits from Sales, Fulfillment, Finance, and Support teams. The result:

Every team coupling to one file means nobody can deploy safely alone. The trigger becomes a bottleneck: a coordination tax for every change, regardless of which domain is touching it.

The old pattern, a TriggerFactory loading handlers from custom settings metadata, improved things by registering handlers per-object. But the fundamental problem remained: multiple teams could still register multiple handlers for the same object, and the execution order and interactions between them were implicit.

Trigger Actions: single responsibility, metadata-driven composition

A prime example of a trigger framework that solves this issue at the architecture level is Mitch Spano's Trigger Actions Framework. Each action is a focused, single-responsibility class that implements one interface:

// One class, one concern, one owner
public class TA_Order_Placed implements TriggerAction.AfterInsert {
    public void afterInsert(List<SObject> triggerNew) {
        // only the owning domain's logic for this one concern
    }
}

Actions are composed via metadata (Trigger_Action__mdt), not code. The trigger file itself becomes trivial, a one-liner that hands off to the framework:

trigger OrderTrigger on Order (after insert, after update, after delete) {
    new MetadataTriggerHandler().run();
}

What this buys you:

	Legacy `TriggerFactory`	Trigger Actions
Ownership	Multiple handlers, shared registration	One class per concern, clear owner
Composition	Code changes to add/remove	Metadata record, no deploy to wire up
Testing	Handler class tests mixed concerns	Each action tested in isolation
Bypass	Shared disable switch	Per-action bypass via custom permission
Order of execution	Implicit / registration order	Explicit order field on metadata

The metadata-driven composition is the key unlock. Sales can add a TA_Order_Placed action for Order without touching any file owned by Fulfillment. Fulfillment can add its own Order action independently. The trigger file never changes; it's infrastructure, not logic.

Async events: Platform Events

Platform Events are the asynchronous counterpart. Where trigger actions fire synchronously in the DML transaction, Platform Events cross transaction boundaries: publish now, process later, survive failures independently.

The right mental model for on-platform work:

Synchronous (trigger actions): "This just happened in my domain. Process it now, in this transaction."

Asynchronous (Platform Events): "This happened in my domain. Other domains should know about it, at their own pace, in their own transaction, on their own failure boundary."

Both are event-driven. The difference is timing, ownership, and what happens when something fails.

The property that makes it all work: the producer doesn't know its consumers

Here is the single most important thing about messaging, and the reason it cuts coupling where a direct call cannot. It's worth stating on its own because everything downstream depends on it.

When you call something (an Apex method, a @future, a queueable, a REST callout), the caller has to name the callee. The dependency points from producer to consumer. If a second consumer needs to react, you go back and edit the producer to call it too. The producer accumulates knowledge of everyone who depends on it, and every new reaction is a change to the thing that fired the event. That's true even when the call is asynchronous. A @future method is still point-to-point; async timing doesn't decouple anything, it just moves the same hard-wired call off the synchronous path.

Publish/subscribe inverts that arrow. The producer names an event, not a consumer. It announces "this happened" into the bus and is done. It does not know who is listening, and must not care. Consumers register their own interest by subscribing. The dependency now points from consumer to event contract, not from producer to consumer.

Two consequences fall out of this, and both are things a direct call (synchronous or asynchronous) simply cannot give you:

N consumers, zero producer changes. One published fact can fan out to one subscriber or fifty. Adding the fifty-first means deploying the new subscriber. The producer never learns it exists, never recompiles, never gets re-tested. In a point-to-point world, every new reaction is a change to the producer. In pub/sub, new reactions are purely additive and live entirely in the consumer's domain.
The dependency direction is reversed, and that's the whole game. "Who depends on whom" is the question that decides whether you can deploy independently. With a direct call, Sales depends on Fulfillment (it has to know Fulfillment exists to call it). With an event, Fulfillment depends on Sales's event contract (a stable, published fact) and Sales depends on nobody. The domain that owns the data is now ignorant of, and immune to, everyone reacting to it.

This is why "fire an async job" is not the same as "publish an event," even though both run later. The async job still couples the producer to a named consumer. The event doesn't. Decoupling comes from the inverted dependency, not from the asynchrony.
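To make the direction of the arrow concrete, here is a minimal sketch of the two styles side by side. The class name OrderAnnouncer and the two job classes mentioned in the comment are hypothetical; OrderPlaced__e and its fields follow the event used elsewhere in this article and are assumed to exist in the org.

```apex
public with sharing class OrderAnnouncer {

    // Point-to-point, even though it runs later: the producer has to name
    // every consumer, so each new reaction is an edit to this class.
    // (FulfillmentShipmentJob and FinanceInvoiceJob are hypothetical.)
    //
    //   System.enqueueJob(new FulfillmentShipmentJob(orderIds));
    //   System.enqueueJob(new FinanceInvoiceJob(orderIds));

    // Publish/subscribe: the producer names only the event contract.
    public static List<Database.SaveResult> announcePlaced(List<Order> orders) {
        List<OrderPlaced__e> events = new List<OrderPlaced__e>();
        for (Order o : orders) {
            events.add(new OrderPlaced__e(
                OrderId__c   = o.Id,
                AccountId__c = o.AccountId,
                PlacedAt__c  = System.now()
            ));
        }
        // Subscribers are resolved by the platform; none are listed here.
        return EventBus.publish(events);
    }
}
```

Adding a new reaction to the first style means changing OrderAnnouncer. Adding one to the second style means deploying a new subscriber and leaving this class alone.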

4. The cure: notify, don't reach in

Replace every cross-domain reach-in with a published fact. The domain that owns the data publishes "this happened." Domains that care subscribe and react in their own territory, on their own automation, on their own deploy cadence.

Nothing crosses a boundary except a fact. Fulfillment no longer fires on Order; it reacts to OrderPlaced__e and writes to its own Shipment__c. Finance no longer reaches into Sales's object graph; it subscribes to the same fact and manages its own Invoice__c. Sales can refactor everything behind Order as long as the published fact stays stable. You have drawn a bounded-context border inside one org, using events as the only sanctioned channel across it.

This is the reframe the generic posts miss. The bus isn't a data pipe here. It's a membrane: facts pass through, reach-ins don't. And the data stays exactly where it was, one query away, for whoever legitimately owns the context that needs it.

5. A cautionary tale: the order-handoff anti-pattern

Here is what domain bleed looks like when it wears an event-driven costume.

Imagine an order-to-fulfillment flow that uses platform events to chain processing steps: OrderHandoff_STEP1__e → OrderHandoff_STEP2__e → OrderHandoff_SHIPMENT__e → OrderHandoff_STEP4__e → ... Each step publishes the next event from its finally block:

// OrderHandoff_SHIPMENT handler (in the Sales domain)
} finally {
    if (triggerNext) {
        publishEvent('OrderHandoff_STEP4__e', !hasError);
    }
}

And in the shipment step, the Sales domain directly inserts Shipment__c records, which belong to Fulfillment:

// Sales reaching into Fulfillment's territory
orderToShipmentMap.put(
    orderId,
    new Shipment__c(
        Order__c = orderId,
        CarrierCode__c = orderDetail.CarrierCode__c,
        ServiceLevel__c = orderDetail.ServiceLevel__c,
        // ... a dozen more fields from Sales's data model
    )
);
Database.insert(orderToShipmentMap.values(), false);

The platform events look modern. But look at what Sales is actually doing:

It knows Shipment__c exists in the Fulfillment domain.
It knows the exact field-level schema of that object.
It writes directly to it.
It orchestrates the entire sequence. Fulfillment has no say in when or whether its own records get created.

Sales is reaching into Fulfillment's territory. The events just move the coupling off a shared trigger file and onto the data model. This is "domain bleed in a costume." If Fulfillment adds a required validation rule to Shipment__c, Sales's handoff breaks. If Fulfillment renames a field, Sales breaks. The dependency arrow points the wrong way.

The correct design: Sales publishes OrderPlaced__e carrying just the record ID and the timestamp. Fulfillment subscribes, queries the data it needs (same database, one SOQL call away), and creates its own Shipment__c in its own transaction, on its own schedule, under its own validation rules. Fulfillment owns shipment creation. Sales just announces that an order was placed.

This is the "notify, don't reach in" principle from §4. The event should state a fact in the publisher's past tense. OrderHandoff_SHIPMENT__e is not a fact; it's a command telling a later step to create something specific in another domain. If you can name the one subscriber that's supposed to handle it, you've written a command, and the coupling is still there.
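A sketch of what the Fulfillment side of that correct design could look like. The trigger name is hypothetical, and so is the assumption that CarrierCode__c and ServiceLevel__c live on Order (the original snippet reads them from an unspecified orderDetail record). OrderId__c is assumed to be a text field, since platform events cannot hold lookups.

```apex
// Fulfillment-owned subscriber: reacts to Sales's fact and writes only to
// Fulfillment's own object, in its own transaction.
trigger OrderPlacedShipmentTrigger on OrderPlaced__e (after insert) {
    // The thin event carries only the id, so collect ids for one bulk re-query.
    Set<Id> orderIds = new Set<Id>();
    for (OrderPlaced__e evt : Trigger.new) {
        if (evt.OrderId__c != null) {
            orderIds.add(Id.valueOf(evt.OrderId__c));
        }
    }

    // Skip orders that already have a shipment, so a redelivered event
    // does not create a duplicate.
    Set<Id> alreadyShipped = new Set<Id>();
    for (Shipment__c s : [SELECT Order__c FROM Shipment__c WHERE Order__c IN :orderIds]) {
        alreadyShipped.add(s.Order__c);
    }

    // Re-query the source data: same database, one SOQL call away.
    List<Shipment__c> shipments = new List<Shipment__c>();
    for (Order o : [SELECT Id, CarrierCode__c, ServiceLevel__c FROM Order WHERE Id IN :orderIds]) {
        if (!alreadyShipped.contains(o.Id)) {
            shipments.add(new Shipment__c(
                Order__c        = o.Id,
                CarrierCode__c  = o.CarrierCode__c,
                ServiceLevel__c = o.ServiceLevel__c
            ));
        }
    }
    insert shipments;
}
```

The insert now runs under Fulfillment's validation rules and fails on Fulfillment's side of the boundary. Sales never sees the Shipment__c schema.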

6. A well-designed example: the Support domain

By contrast, the Support domain shows what the pattern looks like when it's right.

When a case is escalated, a Trigger Action publishes a CaseEscalated event outbound. Notice what it doesn't do: it doesn't reach into any other domain, it doesn't write to any other object, and it doesn't know who the subscribers are.

public class TA_Case_Escalated implements TriggerAction.AfterUpdate {
    IEventPublisher eventPublisher;

    public void afterUpdate(List<SObject> triggerNew, List<SObject> triggerOld) {
        Map<Id, Account> accountMap = TA_Case_Queries.getInstance().getAccountMap();

        for (Case c : (List<Case>) triggerNew) {
            if (shouldPublishEvent(c, accountMap)) {
                eventPublisher.addMessage('CaseEscalated', c, c.AccountId);
            }
        }
    }

    private Boolean shouldPublishEvent(Case c, Map<Id, Account> accountMap) {
        Account account = accountMap.get(c.AccountId);
        return account != null && account.Tier__c == 'Enterprise';
    }
}

What's clean about this:

Single responsibility. One action class, one concern. The trigger file is infrastructure: it delegates to MetadataTriggerHandler and contains no logic.
Fact, not command. CaseEscalated names something true in the Support domain's past tense. It doesn't tell anyone what to do with it.
Publisher is indifferent to consumers. Support doesn't know whether an account manager, a downstream warehouse, or a notification system will react. It announces; others decide.
Domain stays in its own territory. The action queries its own data (account tier), publishes to an abstract interface (IEventPublisher), and returns. No cross-domain DML.

The event publisher interface keeps the domain agnostic of the transport layer. Whether the platform event gets bridged to Kafka, routed to a middleware, or handled purely on-platform is infrastructure's concern:

public interface IEventPublisher {
    void addMessage(String messageName, SObject record, Id correlationId);
    void publish();
}

The contrast with the order-handoff flow is instructive. Both use events. Both have Trigger Actions wiring the logic. The difference is entirely in what the event means and who controls the downstream writes.

7. The rule that flips at the boundary

Here's the question that immediately follows "publish a fact": how much do you put in the event?

The microservices world has two answers, named by Martin Fowler:

Event notification: a thin signal. "Order 0061x placed." Just an id and what changed. The subscriber goes and looks if it needs more.
Event-carried state transfer: a fat, self-contained event. Everything a consumer could need, baked in, so it never has to call back.

People argue about which is "right." The argument is unresolvable because the right answer depends on whether the subscriber can cheaply query back, and on Salesforce that answer is different inside the org versus outside it. This is the rule:

Inside the org → thin notification events. The subscriber is in the same database. Re-querying is one cheap, current, transactional SOQL call. This is the Salesforce advantage a distributed system never has: the data is already local. Stuffing state into the event would duplicate data that's right there, and the copy could be stale by the time the subscriber reads it. So on-platform, publish OrderPlaced__e carrying little more than the record id and the change; let the Fulfillment trigger query the source data itself. Communication, not duplication.

At the external boundary → fat, self-contained facts. The moment the consumer is outside Salesforce, re-querying is expensive and coupling: a callback over the API, an auth handshake, latency, rate limits, and a hard dependency on Salesforce being up. So the events you publish outward should carry the full business fact: everything a downstream system needs to act without calling back.

A common pattern is a single outbound platform event (OutboundEvent__e) that acts as a generic envelope: on-platform triggers add messages via an event publisher interface, which serializes the payload and routes it outward (e.g. through a middleware layer to Kafka). The fat-vs-thin distinction maps cleanly onto that: inside the org, thin signals on domain-specific events; at the outbound envelope boundary, fat self-contained payloads.
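One way such an envelope publisher might be implemented against the IEventPublisher interface from §6 is sketched below. The class name and the three fields on OutboundEvent__e (MessageName__c, CorrelationId__c, Payload__c) are hypothetical; the article does not specify the envelope's schema.

```apex
// Sketch of an IEventPublisher backed by the generic outbound envelope.
// MessageName__c, CorrelationId__c and Payload__c are assumed field names.
public with sharing class OutboundEventPublisher implements IEventPublisher {
    private List<OutboundEvent__e> buffer = new List<OutboundEvent__e>();

    // Buffer one fat message. JSON.serialize includes whatever fields are
    // populated on the record, so the caller decides how much state travels.
    public void addMessage(String messageName, SObject record, Id correlationId) {
        buffer.add(new OutboundEvent__e(
            MessageName__c   = messageName,
            CorrelationId__c = correlationId,
            Payload__c       = JSON.serialize(record)
        ));
    }

    // Publish everything buffered in one call, log failures, reset the buffer.
    public void publish() {
        if (buffer.isEmpty()) {
            return;
        }
        for (Database.SaveResult result : EventBus.publish(buffer)) {
            if (!result.isSuccess()) {
                System.debug(LoggingLevel.ERROR, 'Outbound publish failed: ' + result.getErrors());
            }
        }
        buffer.clear();
    }
}
```

Buffering and publishing once keeps the number of publish calls per transaction low, which matters for the limits discussed in §11.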

And this is where the distributed-systems instinct to build a local copy finally becomes correct: the read model, the projection, the materialized copy all belong downstream, outside Salesforce, fed by the fat boundary events. Not because projections are wrong, but because that is the place where data locality genuinely breaks and a local copy genuinely earns its keep. Inside the org you'd never build one. Across the boundary, you must.

So the same system uses both styles, and choosing by location dissolves the whole thin-vs-fat debate:

	Inside the org	At the external boundary
Subscriber can re-query?	Yes, cheap SOQL	No, expensive callback
Event style	Thin notification	Fat, self-contained fact
Carry full state?	No, it's already local	Yes, consumer can't fetch it
Build a projection?	Never, data is co-located	Yes, downstream where locality breaks
Event's job	Communication + ownership	Communication + data transfer

If you take one thing from this article, take this table. It's where Salesforce's data-locality advantage changes the rules, and where the distributed-systems playbook applies unchanged.

8. Facts, not commands (the bleed wears a costume)

Domain bleed has a sneaky form that survives even after you've "gone event-driven." It's the command disguised as an event:

// Bleed in a costume. Not an event.
CreateShipment__e evt = new CreateShipment__e(OrderId__c = orderId);
EventBus.publish(evt);

CreateShipment__e is not a fact. It's the publisher reaching across a boundary to tell one specific subscriber what to do, just routed through the bus so it looks decoupled. The publisher still knows the shipment step exists, still knows what it should do, still breaks when that domain changes. You've moved the bleed onto the event bus, not removed it.

A real fact names something true in the publisher's domain, in the past tense, and is indifferent to who reacts:

// A fact. Past tense. Publisher's domain. Reaches into no one.
OrderPlaced__e evt = new OrderPlaced__e(
    OrderId__c   = order.Id,
    AccountId__c = order.AccountId,
    PlacedAt__c  = System.now()
);
EventBus.publish(evt);

The litmus test:

A fact says "X happened." A command says "service Y, do Z." If you can name the one subscriber that's supposed to handle it, you've written a command, and the coupling you were trying to cut is still there, now harder to see.

It's the name, not the transport

Here's the crucial point, and it's where a lot of advice goes wrong: a Platform Event can legitimately carry either a fact or a command. Nothing about the platform forces the distinction. The same EventBus.publish() call ships both. The difference lives entirely in how you name and shape the message, and in whether you're honest about which one it is.

You don't need a separate piece of infrastructure (a queue product, a second bus) to carry commands. The rule "facts go on the event bus, commands go on a queue" is imported from architectures where those are physically different systems. On Salesforce, both are just Platform Events. What keeps the line honest is naming discipline, not a second transport:

	Fact	Command
Name	Past tense: `OrderPlaced__e`	Imperative: `RecalculateDiscount__e`
Answers	"What happened?"	"What should be done?"
Knows the recipient?	No, indifferent to who reacts	Yes, aimed at a specific handler
Consumers	Zero to many	Exactly one (it's a directed instruction)
Coupling	Publisher depends on no one	Publisher depends on the handler existing

Commands aren't evil; they're necessary. Sometimes you genuinely do need to instruct one specific handler to do one specific thing. The mistake is dressing a command up in fact's clothing (a thin event that looks decoupled but names exactly one intended subscriber) and then believing you've cut the coupling. If it's a command, name it like one: imperative, owned by the caller, pointed at a known handler, so the coupling is at least visible and you can decide whether it belongs. The danger isn't commands on the bus; it's commands pretending to be facts.

9. Who owns the cross-boundary flow? Choreography or orchestration

Once you've drawn borders, some business processes still span them, and you have to decide how they're coordinated. The cheap version of this debate is "choreography good, orchestration bad." It's wrong, and on Salesforce you can fall into the failure mode of either one.

Default to choreography. Facts on the bus; whoever cares reacts. Loosely coupled, scales by addition. Right for "a thing happened, others may want to know."
Reach for orchestration deliberately, when a process spans domains and must complete, needs ordering or end-to-end visibility, or has to be undoable (compensation).

The sentence to tattoo on the team:

Someone has to own the multi-step transaction. Pick the owner on purpose.

The order-handoff anti-pattern from §5 illustrates this. Several steps chained via platform events, each step publishing the next, with no single owner tracking whether the whole thing succeeded. It's "Lambda pinball": an emergent chain nobody owns end-to-end, impossible to trace holistically, with no clean answer to "did the handoff finish, and how do we compensate if a middle step failed?" Per-step observability and error logging help, but they are a bandage on a structural problem.

The fix is not a central god-orchestrator micromanaging anemic domains. It's a local coordinator: a state machine that owns the process, issues commands, and tracks completion, while the business logic stays in the domain handlers. For long-running, undoable processes you want a saga, where each step registers a compensating action so a half-finished process rolls back cleanly.
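The shape of such a coordinator can be sketched in a few lines. This is a deliberately simplified, in-memory version with hypothetical names (HandoffSaga, Step). A real on-platform saga spans several transactions, so the list of completed steps would have to be persisted on a process record rather than held in a variable.

```apex
// In-memory sketch of a coordinator with compensation. All names are
// hypothetical; business logic stays inside the Step implementations.
public class HandoffSaga {
    // A step does its work in its own domain and knows how to undo it.
    public interface Step {
        void execute();
        void compensate();
    }

    private List<Step> steps = new List<Step>();
    private List<Step> completed = new List<Step>();

    public HandoffSaga addStep(Step nextStep) {
        steps.add(nextStep);
        return this;
    }

    // Runs the steps in order. If one throws, the steps that already
    // completed are compensated in reverse order and the run reports false.
    public Boolean run() {
        try {
            for (Step s : steps) {
                s.execute();
                completed.add(s);
            }
            return true;
        } catch (Exception e) {
            for (Integer i = completed.size() - 1; i >= 0; i--) {
                completed[i].compensate();
            }
            completed.clear();
            return false;
        }
    }
}
```

The coordinator owns only ordering, completion tracking, and rollback. It can always answer "did the handoff finish?", which the chained-events version cannot.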

One honest caveat: the elegant "just use a local in-process engine" advice from the JVM world assumes a free-form runtime. On Salesforce your coordination primitives are Flow, Platform Events, and async Apex, all under governor limits, so you own the process explicitly within that async model.

The two anti-patterns, both reachable on-platform:

Events that are secretly commands → hidden coupling, no owner.
An orchestrator babysitting steps that are really independent reactions → a bottleneck for nothing.

10. Splitting the monorepo into domains

The architecture above only pays off if you can actually enforce the boundaries it describes. Domain isolation in a Salesforce org isn't just about code style - it's a deployment question.

A repo that takes this seriously reflects it in directory structure. Rather than one flat src/ folder where every team drops files, you organize by bounded context:

src/
├── domain-core/              # Shared infrastructure (trigger …), event bus infrastructure (platform event objects, routing metadata)
├── domain-sales/             # Leads, opportunities, orders
│   ├── lead-management/
│   ├── opportunity/
│   └── order/
├── domain-fulfillment/       # Shipments, inventory, logistics
│   ├── shipment/
│   ├── inventory/
│   └── logistics/
├── domain-finance/           # Invoicing, billing, payments
├── domain-support/           # Case management
└── _common-modules/          # Cross-cutting UI components and utilities

Each domain is a candidate for its own unlocked package (2GP). A domain can refactor its internals freely as long as the events it publishes stay compatible.

Deployment decoupling is the payoff that changes how fast your org can move - and it's only reachable once cross-domain communication goes through facts instead of reach-ins.

In a bled org, the deployment unit is the whole org. With events as the only sanctioned cross-domain channel, each domain can become its own independently-shippable unit. Fulfillment deploys on Tuesday. Sales deploys on Thursday. Neither regression-tests the other, because neither reaches into the other's objects.

11. The field guide: Platform Event mechanics that actually decide success

Everything above is design. This is the part where the platform's specific behaviour makes or breaks it. These are fact-checked against current Salesforce documentation, because half the lore floating around is stale.

Publish behaviour, and a footgun

When you define an event you choose when it publishes:

Publish Immediately (default): fires when the publish() call executes, regardless of whether the transaction commits. A subscriber can receive the event before the publishing transaction commits, or even though it later rolls back. Great for fire-regardless and logging-style events; a footgun if the subscriber expects committed data, because you can emit OrderPlaced for an order that was never committed. It counts against a separate limit of 150 EventBus.publish() calls per transaction (Limits.getPublishImmediateDML()).
Publish After Commit: publishes only after the transaction commits successfully, not at all if it fails. Use it when the subscriber will go query the data you just wrote (which, given thin events, is most of the time on-platform). It counts as one DML statement each (Limits.getDMLStatements()).

That distinction interacts directly with §7: thin notification events usually want Publish After Commit, because the subscriber's whole plan is to re-query, and re-querying a rolled-back record is a bug. If you're using a generic outbound envelope event to bridge to external systems, PublishAfterCommit is the right default for exactly this reason.
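A small sketch of a publisher that checks the immediate result and shows where each publish behaviour is counted. The class name is hypothetical, and OrderPlaced__e is assumed to be defined with the fields used earlier in this article.

```apex
public with sharing class OrderPlacedPublisher {

    // Publishes one thin event and returns whether it was accepted.
    public static Boolean publish(Id orderId) {
        OrderPlaced__e evt = new OrderPlaced__e(
            OrderId__c  = orderId,
            PlacedAt__c = System.now()
        );

        Database.SaveResult result = EventBus.publish(evt);
        if (!result.isSuccess()) {
            for (Database.Error err : result.getErrors()) {
                System.debug(LoggingLevel.ERROR, err.getStatusCode() + ': ' + err.getMessage());
            }
        }

        // Which counter moved depends on the event's publish behaviour.
        // Publish Immediately is tracked here:
        System.debug('Immediate publishes: ' + Limits.getPublishImmediateDML()
            + ' of ' + Limits.getLimitPublishImmediateDML());
        // Publish After Commit is tracked as a DML statement:
        System.debug('DML statements: ' + Limits.getDMLStatements()
            + ' of ' + Limits.getLimitDMLStatements());

        return result.isSuccess();
    }
}
```

Nothing in the Apex call selects the behaviour. It is a property of the event definition, so the same code behaves differently depending on how OrderPlaced__e was configured.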

A successful publish is a successful enqueue, not a delivery

High-volume events (the default for new events since API v45.0) publish asynchronously. After publish() returns success, the request is queued; the actual publish happens when resources free up, and in rare cases the async publish can fail after your synchronous call already returned "success." If you need certainty of the final result, use Apex Publish Callbacks. Don't treat a green publish() as proof of delivery.
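A minimal sketch of a publish callback, assuming the EventBus.EventPublishFailureCallback and EventBus.EventPublishSuccessCallback interfaces as described in the Apex reference. The class name is hypothetical. The event is created with newSObject(null, true) so that its EventUuid is populated and can be matched against the callback result.

```apex
public with sharing class OrderPlacedPublishCallback
    implements EventBus.EventPublishFailureCallback, EventBus.EventPublishSuccessCallback {

    // Called if the asynchronous publish ultimately fails.
    public void onFailure(EventBus.FailureResult result) {
        List<String> eventUuids = result.getEventUuids();
        System.debug(LoggingLevel.ERROR,
            'Async publish failed for ' + eventUuids.size() + ' event(s): ' + eventUuids);
    }

    // Called once the asynchronous publish is confirmed.
    public void onSuccess(EventBus.SuccessResult result) {
        System.debug('Async publish confirmed for ' + result.getEventUuids().size() + ' event(s)');
    }

    // Publishes one event with this callback attached.
    public static Database.SaveResult publishTracked(Id orderId) {
        OrderPlaced__e evt = (OrderPlaced__e) OrderPlaced__e.sObjectType.newSObject(null, true);
        evt.OrderId__c  = orderId;
        evt.PlacedAt__c = System.now();
        return EventBus.publish(evt, new OrderPlacedPublishCallback());
    }
}
```

The SaveResult returned by publishTracked still only reports the enqueue. The final outcome arrives later, in a separate invocation of the callback.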

Subscribers, and who pays the delivery bill

You get a genuinely broad set of subscriber types: Apex triggers, Flows/processes, empApi (an LWC CometD client, for live UI), Pub/Sub API (gRPC over HTTP/2, the modern external path), and CometD (legacy). Worth knowing: Apex triggers, processes, and flows don't count against the event delivery allocation - only streaming clients such as CometD, Pub/Sub API, and empApi do, and for empApi it's metered per channel, per browser session.

Durability - the bus is not an event store

Retention: high-volume events live in the bus for 72 hours; the older standard-volume events (definable only before Spring '19, now being retired) for 24 hours. Then purged.
Max message size: 1 MB. Fat events still have a ceiling - another reason the fat ones belong at the boundary, carrying a business fact, not a kitchen sink.
No SOQL/SOSL on events, no reports, no list views. If you need to see that an event fired, you persist a trace yourself. Events are shouted, not stored.
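Persisting a trace yourself can be as small as one extra subscriber. In this sketch, EventTrace__c and its fields are hypothetical; EventUuid and ReplayId are the standard fields available on the event message.

```apex
// Writes one queryable trace row per delivered event.
// EventTrace__c and its fields are assumed, not part of the platform.
trigger OrderPlacedTraceTrigger on OrderPlaced__e (after insert) {
    List<EventTrace__c> traces = new List<EventTrace__c>();
    for (OrderPlaced__e evt : Trigger.new) {
        traces.add(new EventTrace__c(
            EventType__c = 'OrderPlaced__e',
            EventUuid__c = evt.EventUuid,
            ReplayId__c  = evt.ReplayId,
            RecordId__c  = evt.OrderId__c
        ));
    }
    insert traces;
}
```

This gives you something to report on and query, but it is an audit trail rather than an event store. It records that an event was delivered and does not let you replay it.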

This is the deepest reason §7's projections live downstream: the platform deliberately doesn't durably store events, so your long-term, replayable, queryable copy belongs on a real log (Kafka/Redpanda) that is built for retention - outside Salesforce.

Scaling out: it's broadcast, not consumer groups

The intuitive Kafka model (add consumers, partitions distribute, each message processed once) does not apply to external Salesforce subscribers. Per Salesforce's own docs, if you start multiple Pub/Sub API subscriptions for the same org and topic, each stream is independent and the same events are delivered to every subscriber. Scale a gRPC connector from one pod to three and you get every event three times, not a third each. The burden is deduplication (on the EventUuid / id field), not partition assignment. To actually parallelise, consume once and fan the work onto a real log with real consumer groups.
The one place you get competing-consumer semantics is on-platform: Apex parallel subscribers, where one trigger runs as multiple concurrent instances and Salesforce assigns each event to exactly one, partitioned by a key (default: a hash of EventUUID, which randomises, and therefore does not preserve order). Need ordering within an entity? Set the partition key to that entity's id: ordered within a key, parallel across keys, exactly like a Kafka partition key.

Errors, retries, and the replay asymmetry

Inside an Apex subscriber you get a real retry model: throw EventBus.RetryableException and the batch is resent after an increasing delay, in original ReplayId order, up to 10 total runs (initial + 9 retries) before the subscription moves to an error state. Read the attempt count via EventBus.TriggerContext.currentContext().retries so you cap your own retries:

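trigger OrderPlacedTrigger on OrderPlaced__e (after insert) {
    try {
        FulfillmentService.handle(Trigger.New);
    } catch (Exception e) {
        // retries is 0 on the first run, so this allows up to 6 retries (7 runs in total),
        // which stays under the platform ceiling of 9 retries.
        if (EventBus.TriggerContext.currentContext().retries < 6) {
            // Ask the platform to resend the batch after a delay.
            throw new EventBus.RetryableException(e.getMessage());
        }
        // Own cap reached: log and swallow so the subscription does not move to an error state.
        // Logger stands for whatever logging utility the org uses; it is not a standard Apex class.
        Logger.error('OrderPlaced handler exhausted retries', e);
    }
}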

For partially-processed batches, checkpoint with EventBus.TriggerContext.currentContext().setResumeCheckpoint(replayId) so a retry resumes after the last success instead of redoing it - important because limit exceptions (DML/SOQL governor limits) can't be caught, and an un-checkpointed batch's unprocessed events aren't redelivered.
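The interplay between retries and checkpoints is easier to see in a simulation. The sketch below is plain JavaScript, not Apex: it models a batch that is re-run after a failure and resumes after the last checkpointed event. The function names are hypothetical, the run cap mirrors the 10 total runs mentioned above, and the increasing delay between runs is left out.

```javascript
// One run over the batch. Starts after the checkpointed replay id (if any)
// and moves the checkpoint forward after every successfully handled event.
function runBatch(events, handle, checkpoint) {
    const start = checkpoint === null
        ? 0
        : events.findIndex((e) => e.replayId === checkpoint) + 1;
    for (let i = start; i < events.length; i++) {
        try {
            handle(events[i]);
            checkpoint = events[i].replayId; // plays the role of setResumeCheckpoint
        } catch (err) {
            return { done: false, checkpoint }; // the equivalent of asking for a retry
        }
    }
    return { done: true, checkpoint };
}

// Re-run the batch until it completes or the run cap is reached.
function runWithRetries(events, handle, maxRuns = 10) {
    let checkpoint = null;
    for (let run = 1; run <= maxRuns; run++) {
        const result = runBatch(events, handle, checkpoint);
        checkpoint = result.checkpoint;
        if (result.done) {
            return { ok: true, runs: run, checkpoint };
        }
    }
    return { ok: false, runs: maxRuns, checkpoint };
}
```

With a handler that fails once on the third of five events, the second run starts at the third event: the first two are not handled again. Without the checkpoint, every retry would start from the top, which is only safe if the handler is idempotent.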

The asymmetry to remember: there is no Apex API to replay an arbitrary historical event by ReplayId. From Apex you can resume a suspended subscription (from earliest, or "from tip" to skip poison events), checkpoint within a batch, or publish a new event - but arbitrary replay-by-id within the retention window is an external-subscriber capability (Pub/Sub API and CometD let you subscribe from Earliest, Latest, or a custom ReplayId). Once again: durable replay lives downstream, not in Apex.
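As a mental model of the three external replay options, here is a small plain-JavaScript sketch over an in-memory array standing in for the retained events. It is not the Pub/Sub API client; the function name and event shape are assumptions, and only the three preset names come from the options described above.

```javascript
// retained: events still inside the retention window, oldest first.
// Returns the already-published events a new subscription would receive.
function selectReplay(retained, preset, replayId) {
    switch (preset) {
        case 'EARLIEST':
            return retained.slice(); // everything still retained
        case 'LATEST':
            return []; // nothing historical, only events published from now on
        case 'CUSTOM': {
            const index = retained.findIndex((e) => e.replayId === replayId);
            if (index === -1) {
                throw new Error('replay id is not in the retention window');
            }
            return retained.slice(index + 1); // events after the given id
        }
        default:
            throw new Error('unknown replay preset: ' + preset);
    }
}

const retainedEvents = [{ replayId: 'r1' }, { replayId: 'r2' }, { replayId: 'r3' }];
const afterFirst = selectReplay(retainedEvents, 'CUSTOM', 'r1');
console.log(afterFirst.length); // 2
```

The error branch is the important one operationally: a stored replay id that has aged out of the window cannot be resumed from, which is another reason the durable copy lives downstream.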

12. The one thing the integrated platform gives you that Kafka doesn't

For all the constraints, Platform Events have a genuinely lovely property a raw Kafka setup lacks: one fact can drive backend logic and live UI at the same time, with almost no extra plumbing.

Backend: an Apex trigger on the __e reacts and runs domain logic.
Frontend: an LWC subscribes to the same channel via empApi and updates in real time.

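import { LightningElement } from 'lwc';
import { subscribe, unsubscribe, onError } from 'lightning/empApi';

// OrderDashboard is an illustrative component name; handleEvent is its own method (not shown).
export default class OrderDashboard extends LightningElement {
    subscription;

    connectedCallback() {
        // -1 = receive only events published after subscribing
        subscribe('/event/OrderPlaced__e', -1, (response) => {
            this.handleEvent(response.data.payload);
        }).then((sub) => { this.subscription = sub; });
        onError((error) => console.error('empApi error', JSON.stringify(error)));
    }
    disconnectedCallback() {
        if (this.subscription) { unsubscribe(this.subscription, () => {}); }
    }
}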

A single OrderPlaced__e, published once, can drive backend logic (Apex), refresh an ops team's live dashboard (LWC), and hit Kafka for the rest of the company (Pub/Sub API bridge). This is the producer-knows-no-consumers property from §3 made vivid: three subscribers across completely different runtimes - Apex on the platform, JavaScript in the browser, and an external bridge process - and the publisher knows about none of them. Add a fourth and the publisher still doesn't change. One fact, N consumers, zero coupling between them - provided the fact is a fact and not RefreshOrdersDashboard in disguise. Operationally, remember empApi is a CometD client and meters per browser session, so a thousand open dashboards is a thousand subscriptions.
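The "add a fourth and the publisher still doesn't change" claim can be shown with a toy in-memory bus. This is a plain-JavaScript sketch of the publish/subscribe shape only; createBus, publishOrderPlaced, and the three consumer arrays are hypothetical stand-ins, and delivery here is synchronous, unlike the real event bus.

```javascript
// Minimal publish/subscribe bus: the publisher only knows the channel name.
function createBus() {
    const subscribers = new Map(); // channel -> list of callbacks
    return {
        subscribe(channel, callback) {
            if (!subscribers.has(channel)) {
                subscribers.set(channel, []);
            }
            subscribers.get(channel).push(callback);
        },
        publish(channel, payload) {
            for (const callback of subscribers.get(channel) || []) {
                callback(payload);
            }
        },
    };
}

// The publisher: states a fact and references no consumer.
function publishOrderPlaced(bus, orderId) {
    bus.publish('/event/OrderPlaced__e', { orderId });
}

// Three consumers declare their own interest.
const bus = createBus();
const fulfillmentQueue = []; // stands in for the Apex trigger
const dashboardRows = [];    // stands in for the LWC
const kafkaOutbox = [];      // stands in for the Pub/Sub API bridge
bus.subscribe('/event/OrderPlaced__e', (p) => fulfillmentQueue.push(p.orderId));
bus.subscribe('/event/OrderPlaced__e', (p) => dashboardRows.push(p.orderId));
bus.subscribe('/event/OrderPlaced__e', (p) => kafkaOutbox.push(p.orderId));

publishOrderPlaced(bus, '801-001');
```

Every dependency arrow points from a consumer to the channel. Wiring in a fourth subscriber is one more subscribe call, and publishOrderPlaced is untouched.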

13. The contras, stated plainly

This is not friction-free, and the JVM-world defaults don't all survive contact with the platform:

The bus is not a store. 72h / 24h retention, no SOQL, no native audit. Durability is your job - downstream.
No arbitrary replay from Apex. Suspend/resume and checkpoints only; true replay-by-id is external.
Fan-out, not consumer groups, for external subscribers. Scaling out means deduping, not load balancing.
Governor and async limits constrain orchestration. Local-engine sagas don't port one-to-one.
1 MB cap and limited field types. Fat events have a ceiling.
Publish-immediate footgun. Events can reference data that never committed.
Observability tax. You trade the easy end-to-end call stack for emergent flows. Eventual consistency becomes something you engineer - idempotent handlers, lag monitoring, reconciliation for any fire-and-forget fan-out with an SLA.
Trigger action discipline. Single-responsibility actions only pay off if the team enforces it. Without governance on Trigger_Action__mdt, you end up with the same chaos as the old shared handler class, just in metadata form.

Borders cost discipline. Bleed costs velocity. Pick your bill.

14. Finding the borders: event storming

Everything so far assumes you already know where the boundaries are and which facts matter. In a mature org you usually don't - the knowledge is scattered across the heads of admins, developers, and the business people who actually understand the process. Event storming is the workshop technique for getting it out of those heads and onto a wall before you write a line of Apex.

The mechanics are deliberately low-tech: a long roll of paper (or a digital whiteboard) and sticky notes in a few colours. You get the people who know the business in one room and reconstruct what actually happens, using a fixed vocabulary of colours:

Orange: domain events (facts). Something that happened, past tense: Order Placed, Shipment Dispatched, Invoice Issued. You brainstorm these first and arrange them left-to-right in time. This is the raw material for your Platform Events.
Blue: commands. The action or decision that caused an event: Place Order, Dispatch Shipment. Note the one-to-one feel - a command, when it succeeds, produces an event. This is exactly the fact-vs-command distinction from §8, discovered on the wall before it's encoded in a name.
Yellow: actors / roles. Who issues the command - a sales rep, a warehouse operator, a system integration.
Pink: external systems. Anything outside the org that emits or receives - the payment gateway, the carrier's shipping API. These pink notes are where your fat boundary events from §7 will live.
Purple: policies. "Whenever this event happens, that command should fire." Purple notes are the reactive glue - and crucially, they reveal cross-domain reach-ins. A policy that says "whenever Order Placed, create the Shipment" is the order-handoff anti-pattern from §5, caught on a sticky note.

The borders emerge on their own. As the orange events line up across the timeline, you'll see them cluster - a run of sales facts, then a run of fulfillment facts, then finance. The seams between the clusters are your bounded contexts. You didn't impose them top-down; the language did. Where the business stops using one set of nouns and starts using another is exactly where one domain ends and the next begins. Draw a vertical line there and you've found a border to defend with an event.

What event storming gives you, concretely, before any migration:

A candidate list of Platform Events - every orange note is a fact you might publish.
The bounded contexts - the clusters become your domain-* packages from §10.
The reach-ins to kill - every purple policy that crosses a cluster line is a place a domain is reaching into another's territory. Those are your refactor targets, prioritised by how much pain they cause.
A shared language - the admins and developers now describe the system the same way, which is half the battle in cutting domain bleed.
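Once the wall is transcribed, finding the reach-ins can be mechanical. The sketch below, in plain JavaScript, assumes each orange and blue note has been tagged with the cluster it landed in, and flags every purple policy whose event and command sit in different clusters. The note names, the cluster labels, and the data shape are illustrative assumptions.

```javascript
// Which cluster (candidate bounded context) each note ended up in.
const eventOwner = {
    'Order Placed': 'sales',
    'Shipment Dispatched': 'fulfillment',
    'Invoice Issued': 'finance',
};
const commandOwner = {
    'Create Shipment': 'fulfillment',
    'Notify Warehouse': 'fulfillment',
    'Issue Invoice': 'finance',
};

// Purple notes: "whenever <event>, then <command>".
const policies = [
    { when: 'Order Placed', then: 'Create Shipment' },
    { when: 'Shipment Dispatched', then: 'Notify Warehouse' },
    { when: 'Shipment Dispatched', then: 'Issue Invoice' },
];

// A policy crosses a border when its event and its command have different owners.
function findCrossContextPolicies(policyList, events, commands) {
    const crossings = [];
    for (const policy of policyList) {
        const from = events[policy.when];
        const to = commands[policy.then];
        if (from === undefined || to === undefined) {
            throw new Error('untagged note in policy: ' + policy.when + ' -> ' + policy.then);
        }
        if (from !== to) {
            crossings.push({ when: policy.when, then: policy.then, from, to });
        }
    }
    return crossings;
}

const refactorTargets = findCrossContextPolicies(policies, eventOwner, commandOwner);
console.log(refactorTargets.length); // 2
```

Each crossing is a border candidate: the event on the left is the fact to publish, and the command on the right should end up inside the owning domain's own subscriber.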

You don't need a two-day formal workshop to start. Even a 90-minute "big picture" storm of one painful flow - the order handoff, say - will surface more hidden coupling than a week of reading code, because it pulls in the people who know what the system is supposed to do, not just what it currently does.

15. How to get there without a rewrite

You don't refactor a twelve-year-old org in one go. Use branch by abstraction:

Keep internals as they are.
Define a clean set of facts at each context's edge and publish them - thin on-platform, and bridged (e.g. via Redpanda Connect) to a real log as fat boundary facts for the wider company.
Move downstream consumers onto that durable, properly-partitioned stream instead of subscribing to raw Platform Events.
Migrate trigger logic from shared handler classes to Trigger Actions, one object at a time - the metadata-driven composition means you can wire up new actions without touching the trigger file.
Peel domains into their own packages behind those stable contracts, on your own schedule.
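The second step, thin on-platform and fat at the boundary, comes down to an enrichment function at the bridge. Here is a minimal plain-JavaScript sketch of that idea: the bridge receives a thin event, looks up the current state, and emits a self-contained fact. The field names, the lookup function, and the output shape are assumptions for illustration; a real bridge would query the org and write to the log.

```javascript
// thinEvent: what travels on-platform (an id and what happened).
// lookupOrder: stand-in for re-querying the source of truth in the org.
function toFatFact(thinEvent, lookupOrder) {
    const order = lookupOrder(thinEvent.orderId);
    if (!order) {
        throw new Error('order not found for thin event: ' + thinEvent.orderId);
    }
    // The fat fact carries everything a downstream consumer needs,
    // so nobody outside the org has to call back in.
    return {
        type: 'OrderPlaced',
        eventUuid: thinEvent.eventUuid, // kept for downstream deduplication
        order: {
            id: order.id,
            accountId: order.accountId,
            total: order.total,
            currency: order.currency,
            lines: order.lines.map((line) => ({ sku: line.sku, quantity: line.quantity })),
        },
    };
}

// Usage with an in-memory lookup standing in for the org.
const ordersById = {
    '801-001': {
        id: '801-001',
        accountId: '001-042',
        total: 120,
        currency: 'EUR',
        lines: [{ sku: 'SKU-1', quantity: 2, internalNote: 'not exported' }],
    },
};
const fatFact = toFatFact(
    { eventUuid: 'uuid-1', orderId: '801-001' },
    (id) => ordersById[id]
);
```

The explicit field mapping is the contract: internal fields such as internalNote stay behind the border unless someone deliberately adds them to the published fact.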

The facts are the contract. Everything behind them is yours to modernise; everything in front sees clean borders - even while the engine room is mid-renovation.

The hardest part isn't the platform mechanics - it's the cultural shift. Publishing a fact and trusting other domains to react is slower in the short term than reaching in and doing it yourself. The order-handoff anti-pattern from §5 exists because it works - it's just not owned. Inverting it requires accepting that a domain's records are that domain's responsibility, even when another domain is the trigger for creating them. That discipline is what makes independent deployability real.

16. Takeaways

Treat domain boundaries as if they were network boundaries. Distributed systems engineers get this for free; on Salesforce you have to enforce it yourself. The discipline is identical - only the mechanism differs.
The shared database is your advantage, not your problem. Distributed systems copy data across service boundaries because they have no choice. You can re-query. Use it - keep events thin on-platform, fat only at the external boundary.
There are two kinds of event-driven Salesforce. Triggers are synchronous events; Platform Events are async. Both require the same ownership discipline; the difference is timing and failure boundaries.
Trigger ownership is an architecture decision. A trigger owned by everyone is owned by nobody. Trigger Actions with metadata-driven composition give you single responsibility, independent deploy, and explicit ordering - the same guarantees a service boundary gives a distributed system.
Domain bleed is the core failure mode - one domain reaching into another's objects, queries, and automation. Events are the membrane: facts cross boundaries, reach-ins don't.
Facts, not commands. If you can name the one subscriber meant to handle it, you've written a command. The coupling is still there, just on the bus now.
Thin inside, fat at the external boundary. Inside the org, thin notifications - the data is a query away. Outward, fat self-contained facts. That's the only place a read model projection belongs.
Pick the owner of every cross-domain flow on purpose. Default to choreography; reach for orchestration when a process must complete end-to-end and someone needs to own whether it did.
Split your repo into domains. Directory structure is the first step toward independent packages. Code isolation is the prerequisite for deployment isolation.
Know the mechanics cold - publish behaviour, async publish + callbacks, retention, broadcast vs parallel subscribers, retry/checkpoint, no-arbitrary-replay-from-Apex.
The prize is independent deployability. Distributed systems get it from service boundaries. You get it from event contracts. Same outcome, different path.

The platform is a tool, not an identity. Distributed systems engineers don't get credit for their hard boundaries - the network just forces them. The work on Salesforce is choosing to hold those same boundaries when nothing makes you.

So don't wait for a rewrite, and don't try to redraw the whole org at once. On your next change, find one place where a domain reaches across a boundary it doesn't own - one trigger, one cross-domain write, one query reaching three objects deep - and replace just that reach-in with a published fact. Draw one border. That single event is the first place your org starts behaving like the distributed system it always secretly was.

Glossary

This article borrows vocabulary from distributed systems and domain-driven design. If you came up through the admin/declarative side of Salesforce, here's the plain-English version of each term, in roughly the order it appears.

Domain / subdomain. A cohesive area of the business with its own data and rules - Sales, Fulfillment, Finance. On Salesforce, a domain is the set of objects, automation, and code one team owns. The opposite of "one big shared org where everyone edits everything."
Bounded context. The boundary around a domain inside which a word means exactly one thing. "Account" might mean a prospect in Sales and a billing entity in Finance; a bounded context is where you decide which. The seam between two contexts is exactly where you draw a border and put an event.
Domain bleed. This article's term for one domain's logic running inside another's territory - querying its objects, firing on its records, or writing its fields. The disease the whole article is about.
Coupling. How much one piece of code depends on another. Tight coupling means changing A forces you to change (and re-test) B. The goal is loose coupling - domains that can change independently.
Distributed monolith. A system split into pieces that still have to deploy together because they're too entangled - all the cost of being distributed, none of the independence. The "non-distributed monolith" is the Salesforce mirror image: one org, but the same entanglement.
Fact vs command. A fact (or domain event) states something that already happened, past tense: OrderPlaced. A command instructs a specific handler to do something: CreateShipment. Both can ride on a Platform Event - the difference is naming and intent (see §8).
Producer / consumer (publish / subscribe). The producer publishes a message and doesn't know who's listening. Consumers subscribe to declare their own interest. This is what "the producer doesn't know its consumers" means - the dependency points from consumer to event, not the other way around.
Event notification (thin) vs event-carried state transfer (fat). A thin event carries just an id and what changed; the subscriber re-queries for details. A fat event carries the whole payload so the subscriber never has to call back. Thin inside the org, fat at the external boundary (see §7).
Projection / read model. A pre-built copy of data, shaped for fast reading, kept in sync by listening to events. On-platform you almost never need one (you can just SOQL the source). Downstream of Salesforce - in Kafka, a warehouse - you build them because re-querying across the boundary is expensive.
CQRS (Command Query Responsibility Segregation). The pattern of separating the write side (commands) from the read side (queries/projections). Mentioned mainly to say: it's a downstream concern, not something you build inside the org.