GBase 8a Table Design in Practice: Choosing Distribution Keys, Partitions, and Replicated Tables

Many performance issues are baked in the moment a table is created. This guide systematically explains table design decisions in GBase 8a: how to pick distribution keys, when to partition, how to use replicated tables, and how to choose the right data types — with anti‑patterns and a complete example.

1. How Data Is Distributed Across Nodes

GBase 8a uses a Shared‑Nothing architecture. Data is horizontally partitioned and spread across gnodes based on the distribution key:

CREATE TABLE orders (
    order_id    BIGINT NOT NULL,
    customer_id INT    NOT NULL,
    dept_id     INT,
    amount      DECIMAL(18,2),
    order_date  DATE
) DISTRIBUTED BY HASH(customer_id);

A hash function maps every row with the same customer_id to the same gnode. If DISTRIBUTED BY is omitted, the first column is used by default — rarely what you want.

2. Core Principles for Choosing a Distribution Key

High cardinality: The more unique values, the more evenly data is spread. user_id or order_id are ideal; gender or province cause severe skew.
The column used in high‑frequency JOINs: If two tables are often joined on the same key, set that key as the distribution key on both sides. The JOIN then runs locally without cross‑node data shuffle, giving the best performance.
Avoid date or time columns: They have limited unique values and are almost never used in JOIN conditions.

3. Partitioning: How It Differs from Distribution

The distribution key decides which node data goes to; partitioning decides how data is organised inside each node. GBase 8a supports Range partitioning:

CREATE TABLE orders (
    order_id   BIGINT,
    order_date DATE,
    amount     DECIMAL(18,2)
) DISTRIBUTED BY HASH(order_id)
PARTITION BY RANGE(order_date) (
    PARTITION p2023 VALUES LESS THAN ('2024-01-01'),
    PARTITION p2024 VALUES LESS THAN ('2025-01-01'),
    PARTITION p2025 VALUES LESS THAN ('2026-01-01'),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);

Partition pruning: when the query includes a filter on the partition key, only the relevant partitions are scanned. Use partitioning when a single node holds tens of GBs or more, queries frequently filter by time range, or you need fast historical data cleanup (ALTER TABLE DROP PARTITION is orders of magnitude faster than DELETE). Avoid partitioning for tables under 100 million rows, full‑scan workloads, or when the partition count exceeds 1,000 (metadata overhead becomes significant).

4. Replicated Tables: The Best Strategy for Small Dimension Tables

For lookup tables, dictionary tables, and other small, rarely‑updated tables, use replication:

CREATE TABLE dim_product (
    product_id   INT,
    product_name VARCHAR(128),
    category     VARCHAR(64)
) REPLICATED;

A replicated table stores a full copy on every gnode. JOINs between a fact table and a replicated table require zero network transfer — they run entirely locally. Replication is ideal when row count is under 1 million and updates are rare. Between 1–10 million rows with occasional updates, proceed with caution. Beyond 10 million rows or with frequent writes, use a distribution table with a proper key.

5. Data Type Selection

GBase 8a is a columnar store engine. Data types directly affect compression ratio and query performance.

Strings: Store enumerated values as TINYINT/SMALLINT; use VARCHAR only for truly variable‑length descriptions. Low‑cardinality strings compress extremely well.
Numbers: Use INT/BIGINT for integers — never DECIMAL(20,0). Use DECIMAL(18,2) for monetary amounts; never DOUBLE (floating‑point precision issues).
Temporal: Use DATETIME for full timestamps, DATE for date‑only columns. Never store dates as VARCHAR — it prevents partition pruning and date‑function optimisations.

6. Complete Table Design Example

-- Fact table: large, distributed by high‑cardinality customer_id, partitioned by quarter
CREATE TABLE orders (
    order_id     BIGINT      NOT NULL,
    customer_id  INT         NOT NULL,
    product_id   INT         NOT NULL,
    dept_id      SMALLINT    NOT NULL,
    amount       DECIMAL(18,2),
    status       TINYINT     NOT NULL,
    order_date   DATE        NOT NULL,
    create_time  DATETIME
) DISTRIBUTED BY HASH(customer_id)
PARTITION BY RANGE(order_date) (
    PARTITION p2024q1 VALUES LESS THAN ('2024-04-01'),
    PARTITION p2024q2 VALUES LESS THAN ('2024-07-01'),
    PARTITION p2024q3 VALUES LESS THAN ('2024-10-01'),
    PARTITION p2024q4 VALUES LESS THAN ('2025-01-01'),
    PARTITION p2025   VALUES LESS THAN ('2026-01-01'),
    PARTITION pmax    VALUES LESS THAN MAXVALUE
);

-- Dimension table: small, replicated
CREATE TABLE dim_product (
    product_id   INT          NOT NULL,
    product_name VARCHAR(128) NOT NULL,
    category     VARCHAR(64),
    brand        VARCHAR(64)
) REPLICATED;

7. Common Anti‑Patterns

Anti‑Pattern	Consequence	Correct Approach
No distribution key specified	Defaults to first column, often skewed	Explicitly specify `DISTRIBUTED BY HASH(appropriate_column)`
Distribution on low‑cardinality columns	Severe node imbalance	Use high‑cardinality columns
Dimension table as a distribution table	Hash redistribution on every JOIN	Use `REPLICATED`
`VARCHAR(255)` for enumerated values	Poor compression, higher memory	Use `TINYINT`/`SMALLINT`
Excessive partitions (>1,000)	High metadata overhead, slow planning	Partition by quarter or year instead of day

Good table design is the starting point of performance optimisation in a gbase database. Changing a distribution key later requires rebuilding the table — a very expensive operation. During the design phase, answer three questions: what JOIN conditions are used most? Does the query workload have obvious time‑range filters? How large is the table and how frequently is it written? These answers directly determine your distribution key, partitioning strategy, and whether to use replication.