The failure did not come from a selector.
Two workers picked jobs for the same account. Both opened automation tasks against the same persistent browser profile. One task was still waiting on a slow page. The other retried, changed the session state, and wrote a clean-looking failure log.
By the time someone inspected the run, the Playwright error only said:
TimeoutError: page.click failed
That was technically true.
It was also not the root problem.
The root problem was that the queue treated browser profile work like stateless background work.
A worker queue is easy to design when jobs are stateless:
take next job
find available worker
run task
mark done
That model works for screenshots, exports, report generation, image processing, and many API jobs.
It becomes fragile when the job opens a real browser profile.
A Playwright worker may not only be opening a page. It may be operating inside a profile that belongs to one account, one proxy route, one session state, one task history, and one review workflow.
In that case, the fastest available worker is not always the correct worker.
For profile-based browser automation, the queue should route by account context, not worker speed.
The snippets below are pseudo-code. The same model can be implemented with BullMQ, Temporal, Redis, Postgres advisory locks, Sidekiq, Celery, or another queue system.
Why speed-first routing breaks
A simple worker loop often starts like this:
while (true) {
  const job = await queue.next();
  const worker = await pool.firstAvailable();
  await worker.run(job);
}
This is fine if every worker can run every job.
But browser profile automation has hidden state.
A job may require:
{
  "accountId": "store-us-018",
  "profileId": "profile-store-us-018",
  "expectedProxyRegion": "US",
  "sessionMode": "persistent",
  "taskType": "daily_status_check"
}
That job should not run on any free worker.
It should run only when the correct account environment is available, the profile is not already in use, the expected proxy context still matches, and the previous run does not require human review.
A speed-first queue asks:
Which worker is free?
An account-aware queue asks:
Which account environment is safe to operate now?
Those are different scheduling problems.
Make account context part of the job
A queue cannot route by account context unless the job carries enough context.
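At minimum, a browser automation job should include account identity, profile identity, task intent, run mode, risk level, and an idempotency key. Retry rules can travel with the job too, although the sketches in this article derive them from the failure reason instead.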
type BrowserJob = {
  jobId: string;
  accountId: string;
  profileId: string;
  taskType: string;
  expectedProxyRegion?: string;
  expectedLanguage?: string;
  expectedTimezone?: string;
  runMode: "visible" | "headless";
  riskLevel: "low" | "medium" | "review_required";
  idempotencyKey: string;
  createdAt: string;
};
The important shift is simple.
Instead of saying:
Run this browser task.
The queue should know:
Run this task for this account, inside this profile, under this expected environment, unless the account state says to stop.
That sentence is the real routing contract.
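The idempotencyKey field only helps if the same logical run always produces the same key. The sketch below shows one way to derive it. The key format, the per-day window, and the names makeIdempotencyKey and IdempotencyRegistry are assumptions for illustration, and the in-memory set stands in for whatever shared store a real queue would use.

```typescript
// Hypothetical helper types; not part of any queue library.
type JobIdentity = {
  accountId: string;
  profileId: string;
  taskType: string;
  scheduledFor: Date; // when the task is meant to run
};

// Assumption: one logical run per account, profile, task type, and UTC day.
function makeIdempotencyKey(identity: JobIdentity): string {
  const day = identity.scheduledFor.toISOString().slice(0, 10);
  return [identity.accountId, identity.profileId, identity.taskType, day].join("|");
}

// In-memory registry for illustration only.
// A real system would need a shared, durable store with an atomic claim.
class IdempotencyRegistry {
  private readonly seen = new Set<string>();

  // Returns true the first time a key is claimed, false for duplicates.
  claim(key: string): boolean {
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    return true;
  }
}
```

With a key like this, a job that is enqueued twice on the same day for the same account and task can be marked as already handled instead of opening the profile a second time.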
For a broader framing of this problem, this related article explains why account context for browser automation matters beyond the browser window itself.
Route by account or profile
The scheduler needs a routing key.
The safest default is usually the browser profile:
function routingKey(job: BrowserJob) {
  return `profile:${job.profileId}`;
}
That prevents two workers from using the same profile at the same time.
In some systems, the account is the better boundary:
function routingKey(job: BrowserJob) {
  return `account:${job.accountId}`;
}
That prevents two different profiles from touching the same account in parallel.
For read-only tasks, you may eventually allow narrower routing:
function routingKey(job: BrowserJob) {
  return `account:${job.accountId}:task:${job.taskType}`;
}
Start strict.
A slow but safe scheduler is easier to improve than a fast scheduler that corrupts session state.
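A routing key only matters if the dispatcher uses it to hold jobs back. The sketch below plans one dispatch round: it walks pending jobs in arrival order and skips any job whose routing key is already in flight. RoutedJob and planDispatch are hypothetical names, and the in-memory arrays stand in for a real queue.

```typescript
type RoutedJob = { jobId: string; routingKey: string };

// Plan one dispatch round.
// `pending` is assumed to be in arrival order.
// `busyKeys` holds routing keys that already have a job running.
function planDispatch(
  pending: RoutedJob[],
  busyKeys: Set<string>,
  freeWorkers: number
): RoutedJob[] {
  const claimed = new Set<string>(busyKeys);
  const plan: RoutedJob[] = [];

  for (const job of pending) {
    if (plan.length >= freeWorkers) break;
    // Same profile or account is already in flight: leave the job queued.
    if (claimed.has(job.routingKey)) continue;
    claimed.add(job.routingKey);
    plan.push(job);
  }
  return plan;
}
```

Note that a free worker can stay idle in this plan. That is the intended trade: the second job for a profile waits even when capacity is available.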
Add a profile lease
A queue claim and a profile lease are not the same thing.
A queue claim means:
This worker owns this job.
A profile lease means:
This worker is allowed to use this browser profile for a limited time.
A worker can own a job and still fail to get the profile. Another job may already be using it. The account may be paused. A previous run may have left the profile in a review-required state.
A minimal lease can look like this:
async function checkoutProfile(job: BrowserJob) {
  const leaseKey = `profile:${job.profileId}`;

  const lease = await locks.acquire(leaseKey, {
    ttlMs: 15 * 60 * 1000,
    metadata: {
      jobId: job.jobId,
      accountId: job.accountId,
      taskType: job.taskType
    }
  });

  if (!lease.ok) {
    throw new Error("PROFILE_ALREADY_LEASED");
  }

  return lease;
}
The lease should expire automatically. It should also be released explicitly.
const lease = await checkoutProfile(job);

try {
  await runBrowserTask(job);
} finally {
  await lease.release();
}
This gives the queue a clean concurrency boundary.
It also gives the team an audit point: which worker used which profile, for which job, and when.
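To make the lease semantics concrete, here is a minimal in-memory stand-in for the locks service. It is a sketch, not a production lock: InMemoryLeaseStore is a hypothetical name, the clock is passed in as a plain number so expiry is easy to reason about, and a real deployment would need an atomic shared store.

```typescript
type LeaseRecord = { holder: string; expiresAt: number };

// In-memory stand-in for a lock service. Illustration only.
class InMemoryLeaseStore {
  private readonly leases = new Map<string, LeaseRecord>();

  // `now` is injected (milliseconds) instead of read from a real clock.
  acquire(key: string, holder: string, ttlMs: number, now: number): boolean {
    const current = this.leases.get(key);
    if (current !== undefined && current.expiresAt > now) {
      return false; // still leased and not yet expired
    }
    this.leases.set(key, { holder, expiresAt: now + ttlMs });
    return true;
  }

  // Only the current holder may release.
  // A late release from a previous holder is ignored.
  release(key: string, holder: string): boolean {
    const current = this.leases.get(key);
    if (current === undefined || current.holder !== holder) return false;
    this.leases.delete(key);
    return true;
  }
}
```

The holder check in release matters: without it, a worker whose lease already expired could delete the lease that a newer worker now holds.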
Handle lease renewal and stale workers
A fixed lease TTL is only safe for short tasks.
Real browser jobs can take longer than expected. A page may load slowly. A file export may take several minutes. A human review step may pause the flow. If the lease expires while the worker is still running, another worker may acquire the same profile.
That brings back the original problem.
For longer jobs, add lease renewal.
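async function runWithLeaseRenewal(job: BrowserJob) {
  const lease = await checkoutProfile(job);
  let renewalFailed = false;

  // Renew well before the TTL runs out. A failed renewal is recorded
  // instead of being left as an unhandled promise rejection.
  const heartbeat = setInterval(() => {
    lease.extend({ ttlMs: 15 * 60 * 1000 }).catch(() => {
      renewalFailed = true;
      clearInterval(heartbeat);
    });
  }, 60 * 1000);

  try {
    await runBrowserTask(job);
    // Do not report success if ownership of the profile is in doubt.
    if (renewalFailed) {
      throw new Error("PROFILE_LEASE_RENEWAL_FAILED");
    }
  } finally {
    clearInterval(heartbeat);
    await lease.release();
  }
}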
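The renewal loop above runs in the background, so the task itself has no view of how much lease time is left. One hedge is to track the locally known deadline and check it before each state-changing step. The LeaseDeadline class below is a hypothetical helper, not a library API, and it relies on the worker's own clock, which is an assumption that can fail after a long pause.

```typescript
// Tracks the lease deadline as this worker understands it.
class LeaseDeadline {
  private expiresAt: number;
  private readonly ttlMs: number;

  constructor(acquiredAt: number, ttlMs: number) {
    this.ttlMs = ttlMs;
    this.expiresAt = acquiredAt + ttlMs;
  }

  // Call only after the lock service has confirmed an extension.
  recordRenewal(renewedAt: number): void {
    this.expiresAt = renewedAt + this.ttlMs;
  }

  // True when less than `marginMs` remains on the lease,
  // so a state-changing step should not start.
  isUnsafe(now: number, marginMs: number): boolean {
    return this.expiresAt - now < marginMs;
  }
}
```

A task could call isUnsafe(Date.now(), 30_000) before submitting a form or saving session state, and stop early if the answer is true.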
This is still not enough.
You also need to protect against stale workers.
A stale worker is a worker that lost the lease, kept running, and later tried to write results. This can happen after a pause, crash recovery, network split, or slow event loop.
One way to reduce the risk is to use a fencing token.
When the worker acquires the lease, the lock service returns a monotonic token.
type ProfileLease = {
  id: string;
  profileId: string;
  fencingToken: number;
  release: () => Promise<void>;
  extend: (input: { ttlMs: number }) => Promise<void>;
};
Every important write includes that token.
await evidence.write({
  jobId: job.jobId,
  profileId: job.profileId,
  leaseId: lease.id,
  fencingToken: lease.fencingToken,
  result: "completed"
});
The storage layer rejects writes from an older token.
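async function writeEvidence(event: EvidenceEvent) {
  const currentLease = await leases.currentForProfile(event.profileId);

  // A newer lease means this writer is stale. A missing lease is treated
  // as "no newer owner" here; a stricter store could reject that case too.
  if (currentLease && event.fencingToken < currentLease.fencingToken) {
    throw new Error("STALE_WORKER_WRITE_REJECTED");
  }

  // In a real store, the check and the insert should be one atomic operation.
  await evidenceStore.insert(event);
}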
This does not make distributed systems magical.
It does make one important rule explicit:
A worker that no longer owns the profile should not be allowed to write final state for that profile.
That rule matters when browser profiles represent account state.
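The fencing rule can be shown end to end in a few lines. The sketch below issues a strictly increasing token per profile and rejects any write that carries an older token. FencedEvidenceStore is a hypothetical name, and the store is in-memory and single-process, so it illustrates the comparison rather than the distributed parts.

```typescript
type FencedEvent = { profileId: string; fencingToken: number; result: string };

class FencedEvidenceStore {
  // Highest token issued so far, per profile.
  private readonly latestToken = new Map<string, number>();
  readonly events: FencedEvent[] = [];

  // Called when a lease is granted. Tokens only ever increase.
  issueToken(profileId: string): number {
    const next = (this.latestToken.get(profileId) ?? 0) + 1;
    this.latestToken.set(profileId, next);
    return next;
  }

  // Returns false when the writer holds a token older than the latest one issued.
  write(profileId: string, fencingToken: number, result: string): boolean {
    const latest = this.latestToken.get(profileId) ?? 0;
    if (fencingToken < latest) return false;
    this.events.push({ profileId, fencingToken, result });
    return true;
  }
}
```

Comparing against the latest issued token, rather than the currently active lease, means the check still works in the gap after a lease has expired and before the next one is granted.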
Gate account readiness before launching Playwright
Do not wait for Playwright to fail before discovering that the account was not ready.
Before opening the browser, run a readiness check.
async function validateAccountReadiness(job: BrowserJob) {
  const account = await accounts.get(job.accountId);
  const profile = await profiles.get(job.profileId);

  if (account.status === "paused") {
    return { ok: false, reason: "ACCOUNT_PAUSED" };
  }

  if (profile.reviewRequired) {
    return { ok: false, reason: "PROFILE_REVIEW_REQUIRED" };
  }

  if (
    job.expectedProxyRegion &&
    profile.proxyRegion !== job.expectedProxyRegion
  ) {
    return { ok: false, reason: "PROXY_REGION_MISMATCH" };
  }

  return { ok: true };
}
This check is boring.
That is why it is useful.
Many browser automation failures do not need a dramatic fix. They need the system to stop before a predictable mismatch becomes a noisy run.
Classify blocked jobs
A blocked job is not always a failed job.
It may be blocked because:
blocked_by_profile_lease
blocked_by_account_pause
blocked_by_proxy_mismatch
blocked_by_review_required
blocked_by_duplicate_idempotency_key
Each reason should lead to a different action.
A profile lease conflict can be retried later.
A proxy mismatch should stop until configuration is fixed.
A review-required state should move to a human queue.
A duplicate idempotency key should be marked as already handled.
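The reason-to-action mapping above can be written as a typed lookup, so that adding a new blocked reason without deciding its action becomes a compile error. The action names are hypothetical. The action for an account pause is also an assumption, since only the other four are spelled out above.

```typescript
type BlockedReason =
  | "blocked_by_profile_lease"
  | "blocked_by_account_pause"
  | "blocked_by_proxy_mismatch"
  | "blocked_by_review_required"
  | "blocked_by_duplicate_idempotency_key";

type BlockedAction =
  | "retry_later"
  | "hold_until_resumed"
  | "hold_until_config_fixed"
  | "move_to_human_queue"
  | "mark_already_handled";

// Record<BlockedReason, ...> forces every reason to have an entry.
const blockedActions: Record<BlockedReason, BlockedAction> = {
  blocked_by_profile_lease: "retry_later",
  blocked_by_account_pause: "hold_until_resumed", // assumption, not stated in the text
  blocked_by_proxy_mismatch: "hold_until_config_fixed",
  blocked_by_review_required: "move_to_human_queue",
  blocked_by_duplicate_idempotency_key: "mark_already_handled",
};

function actionForBlockedJob(reason: BlockedReason): BlockedAction {
  return blockedActions[reason];
}
```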
A simple scheduling function might look like this:
async function schedule(job: BrowserJob) {
  const readiness = await validateAccountReadiness(job);

  if (!readiness.ok) {
    await queue.moveToBlocked(job, readiness.reason);
    return;
  }

  await queue.enqueueByRoutingKey(job, {
    routingKey: `profile:${job.profileId}`
  });
}
The goal is not to run every job immediately.
The goal is to run the right job in the right account environment.
Make retries context-aware
Blind retries are dangerous in browser automation.
If a task fails because of a temporary navigation issue, a retry may help.
If it fails because the account is logged out, the proxy changed, or a review screen appeared, an automatic retry may make the state worse.
Retry rules should inspect failure reasons.
function shouldRetry(event: TaskFailureEvent) {
  if (event.reason === "TIMEOUT") return true;
  if (event.reason === "TEMPORARY_NAVIGATION_ERROR") return true;

  if (event.reason === "LOGIN_REQUIRED") return false;
  if (event.reason === "PROFILE_MISMATCH") return false;
  if (event.reason === "PROXY_REGION_MISMATCH") return false;
  if (event.reason === "HUMAN_REVIEW_REQUIRED") return false;

  return false;
}
A retry counter is not enough.
The scheduler should know whether the failure belongs to the page, the browser, the profile, the account, or the workflow.
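Retry rules need a failure reason to inspect, which means something has to turn a thrown error into one. The sketch below is one possible classifier. It assumes that Playwright timeouts surface with the error name "TimeoutError", that Chromium network failures carry a "net::ERR_" prefix in the message, and that the task code throws a hypothetical error named "LoginRequiredError" when it lands on a login page. All three are assumptions to verify against your own runs.

```typescript
type FailureReason =
  | "TIMEOUT"
  | "TEMPORARY_NAVIGATION_ERROR"
  | "LOGIN_REQUIRED"
  | "UNKNOWN";

function classifyBrowserFailure(err: unknown): FailureReason {
  if (!(err instanceof Error)) return "UNKNOWN";

  // Hypothetical error the task throws itself after detecting a login screen.
  if (err.name === "LoginRequiredError") return "LOGIN_REQUIRED";

  // Assumption: Playwright timeouts use the error name "TimeoutError".
  if (err.name === "TimeoutError") return "TIMEOUT";

  // Assumption: Chromium network failures include "net::ERR_" in the message.
  if (err.message.includes("net::ERR_")) return "TEMPORARY_NAVIGATION_ERROR";

  return "UNKNOWN";
}
```

Anything the classifier cannot name comes back as "UNKNOWN". Paired with a retry function that returns false by default, an unrecognized failure goes to review instead of being retried.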
Log scheduling evidence, not only Playwright errors
Many systems log Playwright errors but forget to log scheduling decisions.
That makes debugging harder.
When a task stops, the team should be able to inspect a record like this:
{
  "jobId": "job-2026-06-23-001",
  "routingKey": "profile:profile-store-us-018",
  "accountId": "store-us-018",
  "profileId": "profile-store-us-018",
  "workerId": "worker-04",
  "leaseId": "lease-8731",
  "fencingToken": 42,
  "runMode": "headless",
  "expectedProxyRegion": "US",
  "actualProxyRegion": "US",
  "result": "blocked",
  "stopReason": "LOGIN_REQUIRED"
}
This tells a better story than:
TimeoutError: page.click failed
A Playwright error tells you what the script saw.
A scheduling record tells you whether the task should have been running in that account environment at all.
That is the missing layer in many worker queue designs.
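Once scheduling records exist, they can be checked mechanically. The sketch below scans one record for signs that the task should not have been running. The field names follow the example record above. The violation labels and the specific rules are assumptions, meant as a starting point rather than a complete audit.

```typescript
type SchedulingRecord = {
  jobId: string;
  profileId: string;
  workerId: string;
  leaseId?: string;
  fencingToken?: number;
  expectedProxyRegion?: string;
  actualProxyRegion?: string;
  result: "started" | "completed" | "failed" | "blocked";
  stopReason?: string;
};

function schedulingViolations(record: SchedulingRecord): string[] {
  const violations: string[] = [];
  const ran = record.result !== "blocked";

  // A task that touched the profile should have held a lease and a token.
  if (ran && record.leaseId === undefined) violations.push("RAN_WITHOUT_LEASE");
  if (ran && record.fencingToken === undefined) violations.push("RAN_WITHOUT_FENCING_TOKEN");

  // Compare proxy regions only when both sides were recorded.
  if (
    record.expectedProxyRegion !== undefined &&
    record.actualProxyRegion !== undefined &&
    record.expectedProxyRegion !== record.actualProxyRegion
  ) {
    violations.push("PROXY_REGION_MISMATCH");
  }

  // A run that stopped early should say why.
  const stoppedEarly = record.result === "failed" || record.result === "blocked";
  if (stoppedEarly && record.stopReason === undefined) violations.push("MISSING_STOP_REASON");

  return violations;
}
```

An empty result does not prove the run was correct. It only means the scheduling layer did what it claimed to do.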
A minimal account-aware worker flow
Putting the pieces together, the flow looks like this:
async function handleBrowserJob(job: BrowserJob) {
  // 1. Readiness gate: stop before the browser opens.
  const readiness = await validateAccountReadiness(job);

  if (!readiness.ok) {
    await evidence.write({
      jobId: job.jobId,
      accountId: job.accountId,
      profileId: job.profileId,
      result: "blocked",
      reason: readiness.reason
    });
    return;
  }

  // 2. Profile lease. checkoutProfile throws PROFILE_ALREADY_LEASED on a
  //    conflict; the caller should treat that as blocked, not failed.
  //    For long tasks, add the heartbeat from runWithLeaseRenewal here.
  const lease = await checkoutProfile(job);

  try {
    await evidence.write({
      jobId: job.jobId,
      accountId: job.accountId,
      profileId: job.profileId,
      result: "started",
      leaseId: lease.id,
      fencingToken: lease.fencingToken
    });

    // 3. The actual Playwright run.
    await runWithPlaywright(job);

    await evidence.write({
      jobId: job.jobId,
      profileId: job.profileId,
      leaseId: lease.id,
      fencingToken: lease.fencingToken,
      result: "completed"
    });
  } catch (err) {
    // 4. Classify the failure before deciding whether to retry.
    const reason = classifyBrowserFailure(err);

    await evidence.write({
      jobId: job.jobId,
      profileId: job.profileId,
      leaseId: lease.id,
      fencingToken: lease.fencingToken,
      result: "failed",
      reason
    });

    if (shouldRetry({ reason })) {
      await queue.retryLater(job);
    } else {
      await queue.moveToReview(job, reason);
    }
  } finally {
    // 5. Always release the profile, even after a failure.
    await lease.release();
  }
}
The queue library is not the main point.
The important boundary is this:
job
-> readiness gate
-> routing key
-> profile lease
-> lease heartbeat
-> Playwright run
-> evidence log
That boundary keeps profile state from being treated as disposable worker state.
Where the workspace layer fits
This pattern is tool-agnostic.
You can build the scheduler yourself. You can also move more of this logic into a browser automation workspace.
The hard part is keeping account identity, browser profile, proxy context, session state, task history, and review decisions close enough that the scheduler can make safe decisions.
A workspace-style browser automation layer can help when it keeps that state next to the scheduler instead of scattering it across scripts and spreadsheets.
This is also the direction Web4 Browser is exploring, with repeatable workflows kept close to the automation layer as well.
More workers make the system faster.
Account-aware scheduling makes it safer to scale.
Final checklist
Before scaling Playwright workers across persistent browser profiles, check these points:
[ ] Every browser job includes accountId and profileId.
[ ] The queue has a routing key based on account or profile.
[ ] Two workers cannot use the same profile at the same time.
[ ] Long-running jobs renew profile leases.
[ ] Stale workers cannot write final state after losing a lease.
[ ] The scheduler checks account readiness before launching Playwright.
[ ] Environment expectations are visible to the job.
[ ] Retries depend on failure reason, not only retry count.
[ ] Each run records worker ID, profile ID, routing key, lease ID, fencing token, and stop reason.
A browser worker queue should not only ask which worker is free.
It should ask which account context is ready.
That is the difference between running jobs faster and running profile automation with control.










