A raw parse tree captures the structure of SQL text, but not what that structure refers to: whether the token users is a table in some schema, whether the two sides of = have compatible types, whether count is a function or a column. Filling in those answers and turning a raw parse tree into a meaningful query tree is the job of semantic analysis, which PG code commonly shortens to "Analyze". The shape of the resulting node structure is covered in 1.2.3.
Parser and Analyzer
A note on terminology first. In PG code, "parser" is used in two scopes. Narrowly, it refers to the lexer and grammar that turn SQL text into a raw parse tree. Broadly, it includes the next stage, Analyze, as well. Both stages live under src/backend/parser/.
This chapter splits them. The raw parser takes SQL text and produces a raw parse tree (1.2.1). The Analyzer takes a raw parse tree and produces a Query tree by consulting the catalog (this section). The two stages differ in their output shape (raw parse tree → Query tree) and in whether they touch the catalog.
What the raw parse tree leaves out
The raw parser checks only syntax. The fact that SELECT a FROM t WHERE a = 1 is grammatically well-formed is the limit of what a raw parse tree asserts. Whether t actually exists, whether a is a column of that table, whether 1 can be coerced to the type of a: the raw parse tree knows none of this.
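A quick way to see the boundary is to compare the errors each stage raises. A session sketch (the table name no_such_table is hypothetical):

```sql
SELEC a FROM t;
-- ERROR:  syntax error at or near "SELEC"
--   rejected by the raw parser; no catalog access happened
SELECT a FROM no_such_table;
-- ERROR:  relation "no_such_table" does not exist
--   grammatically fine, so it reaches Analyze, which consults the catalog
```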
Why split into two stages? A few reasons.
- Separation of concerns. Mixing syntax checks with semantic checks puts too much responsibility into one stage. Splitting keeps each stage's code simple.
- Raw parse must work in an aborted transaction. When a transaction is in an aborted state, the catalog cannot be touched (covered at the end of 1.2.1). But commands like COMMIT/ROLLBACK still need to be recognized to escape that state. The split lets raw parse identify the command type without catalog access. (A session sketch follows this list.)
- Reuse of raw parse output. A single raw parse tree can be Analyzed multiple times in different contexts (different search_path, different parameter types). Prepared statements rely on this reuse pattern.
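The aborted-transaction point is easy to reproduce:

```sql
BEGIN;
SELECT 1/0;   -- ERROR:  division by zero; the transaction is now aborted
SELECT 1;     -- ERROR:  current transaction is aborted, commands ignored
              --         until end of transaction block
ROLLBACK;     -- accepted: raw parse alone recognizes the command type,
              -- so no catalog access is needed to escape the aborted state
```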
The result is a clean two-stage split. Raw parse is independent of the catalog and only checks syntax. Analyze then aggressively consults the catalog and fills in the blanks. Analyze does four things. It looks up every name in the raw parse tree against the catalog and replaces them with OIDs (object identifiers, the 32-bit integer IDs assigned to every object in the catalog). It determines the type of every expression. It inserts cast nodes wherever types do not match. It packages the result into a new tree called Query.
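The name-to-OID step can be observed from SQL. A sketch with a hypothetical users table (the OID value differs per cluster):

```sql
CREATE TABLE users (id int, name text);
-- The catalog row Analyze finds when it resolves the name users:
SELECT oid, relname FROM pg_class WHERE relname = 'users';
--   oid  | relname
--  16384 | users        (example value)
SELECT 'users'::regclass::oid;   -- the same lookup via the regclass alias type
```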
Name resolution: what does each name in SQL refer to?
Name resolution is the task of figuring out which catalog object each table, column, or function name in the SQL is pointing at.
Start with table names. As Analyze reads down the FROM clause, it encounters items like FROM users, looks up the users table in the catalog, and obtains its OID. It then registers it in an internal list as "the Nth table appearing in this SQL's FROM". PG calls this list the range table. (Why "range"? Covered in 1.2.3.)
A range table holds more than just plain tables. The FROM clause can contain subqueries (FROM (SELECT ...)), JOIN results (FROM t1 JOIN t2 ON ...), and function calls (FROM generate_series(1, 10)). All of these are equivalent in being "an entry that produces rows", so each gets one slot in the range table. The differences between kinds are covered in 1.2.3.
When the schema is not specified (FROM users versus FROM public.users), PG walks the connection's search_path in order and picks the first schema that contains a matching table. This decision is locked in at Analyze time. So the same SQL text can refer to different tables across connections with different search_path settings.
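A session sketch of this behavior, with a hypothetical app schema:

```sql
CREATE SCHEMA app;
CREATE TABLE app.users (id int);
CREATE TABLE public.users (id int);

SET search_path = app, public;
SELECT * FROM users;    -- Analyze resolves this to app.users
SET search_path = public;
SELECT * FROM users;    -- same SQL text, now resolves to public.users
```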
Column resolution comes next. Column names appearing in WHERE, SELECT, ORDER BY, and similar positions are matched against the range table that was just built, deciding "which column of which table this name refers to". The familiar column reference "a" is ambiguous error from SELECT a FROM t1 JOIN t2 ON ... (when a exists in both t1 and t2) is raised at this stage.
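A minimal reproduction, with hypothetical tables t1 and t2:

```sql
CREATE TABLE t1 (a int);
CREATE TABLE t2 (a int);
SELECT a FROM t1 JOIN t2 ON t1.a = t2.a;
-- ERROR:  column reference "a" is ambiguous
SELECT t1.a FROM t1 JOIN t2 ON t1.a = t2.a;   -- qualified, resolves cleanly
```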
Column references can have between one and four dot-separated parts. a (column only), t.a (table.column), s.t.a (schema.table.column), db.s.t.a (database.schema.table.column). Depending on how many dots are present, different parts are interpreted differently, and the priority rules for that interpretation are baked into this stage.
Function calls are slightly more involved. PG allows function overloading, so multiple versions of the same function name can be registered with different argument types. For example, length() has separate versions accepting text and bytea, and the right one is chosen based on the argument types at the call site. Analyze gathers all candidates with the same name and picks the best match by argument-type matching. If multiple candidates match equally well, PG raises function is not unique; if none match, function does not exist.
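The length() case can be checked in any session:

```sql
SELECT length('abc');          -- the literal resolves to text: length(text) → 3
SELECT length('abc'::bytea);   -- argument type picks length(bytea) → 3 bytes
SELECT length(123);
-- ERROR:  function length(integer) does not exist
```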
Type checking and cast insertion
In an expression tree, leaf nodes are one of three things: constants, column references, or parameters. The constant 1 has type integer, a column such as name has whatever type the catalog declares for it (say text), and the parameter $1 has whatever type the caller declared. Analyze walks the expression tree from leaf to root, determining the type of every node along the way. The result type of a + b is determined only after the types of a and b are settled, and the type of (a + b) * 2 is determined from that result.
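The leaf-to-root typing is visible through pg_typeof:

```sql
SELECT pg_typeof(1),              -- integer
       pg_typeof(1 + 2.0),        -- numeric: settled once both sides are known
       pg_typeof((1 + 2.0) * 2);  -- numeric: derived from the inner result
```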
When the two sides of an operation differ in type, an automatic cast may be inserted. In WHERE a = '5' where a is integer and '5' is text, PG inserts a cast node converting '5' to integer so the comparison is well-typed. The authoritative answer for which type-to-type conversions are possible lives in the pg_cast catalog. It records the source type, target type, conversion method, and which context the cast is automatically applied in. A few representative rows look like this.
| source → target | Method | Context |
|---|---|---|
| int4 → int8 | Function call (int8(int4)) | implicit |
| int8 → int4 | Function call (int4(int8)) | assignment |
| int4 → oid | Binary passthrough | implicit |
| json → jsonb | Text I/O detour | assignment |
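These rows come straight out of pg_cast, so the table can be reproduced with a query. A sketch for the int4 rows:

```sql
SELECT castsource::regtype AS source,
       casttarget::regtype AS target,
       castmethod,    -- 'f' = function call, 'b' = binary passthrough, 'i' = text I/O
       castcontext    -- 'i' = implicit, 'a' = assignment, 'e' = explicit
FROM pg_cast
WHERE castsource = 'int4'::regtype
  AND casttarget IN ('int8'::regtype, 'oid'::regtype);
```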
Two of the columns capture the heart of how PG does casts. The method is how the source value is turned into the target type, and the context is where that cast is automatically applied.
The method falls into one of three kinds.
- No conversion needed because the memory representation is the same. The int4 → oid row is this case. Both are 32-bit integers, so the binary representation passes through. The conversion cost is essentially zero.
- A cast function is called. The int4 → int8 row is this case, with a registered conversion function int8(int4) that gets called. The cost is one function call.
- Output as text and read it back. Every type in PG has an output function that serializes its value to a string and an input function that parses a string back into a value of that type. To bridge two types that have no direct cast function, PG runs the source type's output function to produce a string, then feeds that string through the target type's input function. The json → jsonb row works this way: a json value is first emitted as a textual form like {"a":1}, then that text is parsed by the jsonb input function into the binary JSON representation. This route is inefficient and not used when a direct cast function can be defined, but it serves as a fallback for user-defined types or for bridging two types whose representations differ significantly.
If no method is registered, PG raises cannot cast type X to type Y.
The same cast does not apply automatically in every position. PG splits cast contexts into three groups.
- implicit. PG inserts the cast on its own, without a request from the user. Applied in positions like comparisons and arithmetic. The int4 → int8 row is implicit, so WHERE bigint_col = int_col automatically promotes int4 to int8 for the comparison even though the two sides differ in type.
- assignment. Applies when a value is being put into a column. Operates in INSERT, the SET of UPDATE, COPY, and similar positions. Some conversions that are not registered as implicit are registered as assignment. The int8 → int4 row is one such example. Narrowing a wide integer to a narrower one risks precision loss, so applying it automatically in a comparison would be dangerous, but in a position where a value is being placed into a column of a particular type, accepting that loss is closer to the user's intent.
- explicit. The user directly requests the cast with CAST(col AS integer) or col::integer. PG treats this as an unambiguous statement of intent, so even conversions not registered as implicit or assignment are allowed here.
The three contexts nest. A cast registered as implicit is also applied automatically in assignment positions, and of course in explicit positions too. Conversely, a cast registered only at the explicit level never happens automatically in implicit or assignment positions; the user has to spell it out. In set notation, implicit ⊂ assignment ⊂ explicit in terms of which pairs PG converts on your behalf.
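The json → jsonb row from the table makes a compact demonstration of all three contexts. A sketch with a hypothetical docs table:

```sql
CREATE TABLE docs (body jsonb);

-- assignment: the cast applies while a value is placed into a column
INSERT INTO docs VALUES ('{"a":1}'::json);               -- OK
-- implicit: the same cast is not applied to resolve a comparison
SELECT '{"a":1}'::json = '{"a":1}'::jsonb;
-- ERROR:  operator does not exist: json = jsonb
-- explicit: spelling the cast out is always allowed
SELECT '{"a":1}'::json::jsonb = '{"a":1}'::jsonb;        -- true
```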
This split keeps PG away from a common pitfall in other RDBMSs, where implicit conversions happen too eagerly and produce surprising results. PG deliberately keeps implicit casts narrow. Before 8.3, non-text types such as integer were implicitly cast to text, which caused too many unintended comparisons; version 8.3 removed those implicit casts.
Catalog lookups and locks
Every decision above involves catalog lookups. An OID lookup per FROM table, an attribute lookup per column, a pg_type lookup per type in the expression, a pg_proc lookup per function call. Analyzing one SQL statement can trigger dozens of catalog lookups.
To keep the cost from blowing up, PG caches lookup results in memory (the syscache, a system catalog cache). A second lookup with the same key in the same connection hits memory. When another connection mutates the catalog with ALTER TABLE or similar, an invalidation message is delivered to the syscache and just the affected entry is dropped.
The bigger user-visible effect is not the lookup itself but its side effect. When Analyze finds a FROM table in the catalog, it also takes a lock on that table. A plain FROM in a SELECT acquires AccessShareLock (the read lock). Tables locked by FOR UPDATE/SHARE get RowShareLock. The target of an INSERT/UPDATE/DELETE gets RowExclusiveLock. By the time Analyze finishes, every table touched by the statement is locked. The exact semantics of each lock mode and the conflict matrix between them are covered in 3.5.1 (Heavyweight locks).
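The lock taken during Analyze can be watched from inside the same transaction, assuming a users table:

```sql
BEGIN;
SELECT count(*) FROM users;     -- Analyze has already taken AccessShareLock
SELECT relation::regclass AS rel, mode, granted
FROM pg_locks
WHERE relation = 'users'::regclass;
--  rel   |      mode       | granted
--  users | AccessShareLock | t
COMMIT;                          -- the lock is released at transaction end
```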
What is interesting is that locks are taken at the Analyze stage. Other designs are possible. Some other RDBMSs delay lock acquisition until plan or execute time. PG chose to take the lock during the catalog visit instead. The effect is clear: once parse finishes, the schema is guaranteed not to change for the rest of plan and execute, on a per-statement basis. The race in which the catalog mutates under ALTER TABLE while planning is in progress, invalidating the plan, is shut out before it can happen.
After Analyze finishes
Once Analyze is done, PG does two more things.
The first is computing a query identifier. When monitoring tools track slow-running queries, calls like WHERE id = 1 and WHERE id = 2 should be grouped together as instances of the same pattern (WHERE id = ?). Otherwise, a statistic like "this query pattern averages X ms" loses its meaning.
The catch is what key the statistics table groups by: the same kind of key as in a hash table, where matching keys land in the same slot and accumulate. Using the SQL text directly as the key is no good. WHERE id = 1 and WHERE id = 2 are different texts and would land in separate slots, scattering what should be one row of statistics across many. What we want is one row per pattern, not one row per call.
So at the end of Analyze, PG normalizes the Query tree by stripping out the variable parts (constants and the like, with the same effect as replacing them with ?) and then hashes the remainder into a 64-bit identifier. The same pattern produces the same identifier; the same identifier accumulates into the same statistics slot. This process is called query jumbling, and the resulting identifier is the key monitoring tools use to group queries of the same kind.
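Both the identifier and the normalized text are visible through pg_stat_statements. A sketch, assuming the extension is installed and a users table exists (queryid values differ per cluster):

```sql
SELECT * FROM users WHERE id = 1;
SELECT * FROM users WHERE id = 2;

SELECT queryid, query, calls
FROM pg_stat_statements
WHERE query LIKE '%FROM users WHERE id%';
--     queryid     |               query               | calls
--  (64-bit value) | SELECT * FROM users WHERE id = $1 |     2
```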
The second thing is a hook where extensions plug in. PG places a post-analyze hook at the very end of Analyze, so any extension registered there has its code invoked right after each query is analyzed. Statistics-gathering extensions like pg_stat_statements register on this hook to collect each query's identifier and execution time.
What this means in practice
First, the same SQL text can refer to different tables depending on the connection's search_path. This is a direct consequence of table name resolution being locked in at Analyze time and following search_path. In container or deployment environments where the default search_path happens to differ, tests pass while production reports relation does not exist. This is one of the first things to check when debugging such an error.
Second, issuing a SELECT inside a transaction immediately locks every table the statement touches, and those locks are held until the transaction ends. Lock acquisition being baked into the Analyze stage shows up directly in how applications behave. Patterns like BEGIN; SELECT ...; (some other work), where a transaction is opened early, hold AccessShareLock on the SELECT's tables for the entire transaction's duration, which can block ALTER TABLE in other sessions. This is one of the reasons long-running transactions are operationally risky (other reasons are covered in chapter 3 on MVCC). Conversely, from the application's point of view, the moment a SELECT is issued, the schema is protected against change for the remainder of that statement.
Third, the unit of "query" that monitoring tools group on is not the SQL text but the normalized pattern of the analyzed result. When pg_stat_statements collects slow queries, the key it uses to lump same-shaped queries onto one row is the query identifier computed at the end of analysis. Statements that differ only in constant values but share the same structure accumulate into the same row. So calling WHERE id = 1 a thousand times and calling WHERE id = 1, WHERE id = 2, ..., WHERE id = 1000 once each look the same to the monitoring tool: a single row.