Many performance issues are baked in the moment a table is created. This guide systematically explains table design decisions in GBase 8a: how to pick distribution keys, when to partition, how to use replicated tables, and how to choose the right data types — with anti‑patterns and a complete example.
1. How Data Is Distributed Across Nodes
GBase 8a uses a Shared‑Nothing architecture. Data is horizontally partitioned and spread across gnodes based on the distribution key:
CREATE TABLE orders (
order_id BIGINT NOT NULL,
customer_id INT NOT NULL,
dept_id INT,
amount DECIMAL(18,2),
order_date DATE
) DISTRIBUTED BY HASH(customer_id);
A hash function maps every row with the same customer_id to the same gnode. If DISTRIBUTED BY is omitted, the first column is used by default — rarely what you want.
2. Core Principles for Choosing a Distribution Key
-
High cardinality: The more unique values, the more evenly data is spread.
user_idororder_idare ideal;genderorprovincecause severe skew. - The column used in high‑frequency JOINs: If two tables are often joined on the same key, set that key as the distribution key on both sides. The JOIN then runs locally without cross‑node data shuffle, giving the best performance.
- Avoid date or time columns: They have limited unique values and are almost never used in JOIN conditions.
3. Partitioning: How It Differs from Distribution
The distribution key decides which node data goes to; partitioning decides how data is organised inside each node. GBase 8a supports Range partitioning:
CREATE TABLE orders (
order_id BIGINT,
order_date DATE,
amount DECIMAL(18,2)
) DISTRIBUTED BY HASH(order_id)
PARTITION BY RANGE(order_date) (
PARTITION p2023 VALUES LESS THAN ('2024-01-01'),
PARTITION p2024 VALUES LESS THAN ('2025-01-01'),
PARTITION p2025 VALUES LESS THAN ('2026-01-01'),
PARTITION pmax VALUES LESS THAN MAXVALUE
);
Partition pruning: when the query includes a filter on the partition key, only the relevant partitions are scanned. Use partitioning when a single node holds tens of GBs or more, queries frequently filter by time range, or you need fast historical data cleanup (ALTER TABLE DROP PARTITION is orders of magnitude faster than DELETE). Avoid partitioning for tables under 100 million rows, full‑scan workloads, or when the partition count exceeds 1,000 (metadata overhead becomes significant).
4. Replicated Tables: The Best Strategy for Small Dimension Tables
For lookup tables, dictionary tables, and other small, rarely‑updated tables, use replication:
CREATE TABLE dim_product (
product_id INT,
product_name VARCHAR(128),
category VARCHAR(64)
) REPLICATED;
A replicated table stores a full copy on every gnode. JOINs between a fact table and a replicated table require zero network transfer — they run entirely locally. Replication is ideal when row count is under 1 million and updates are rare. Between 1–10 million rows with occasional updates, proceed with caution. Beyond 10 million rows or with frequent writes, use a distribution table with a proper key.
5. Data Type Selection
GBase 8a is a columnar store engine. Data types directly affect compression ratio and query performance.
-
Strings: Store enumerated values as
TINYINT/SMALLINT; useVARCHARonly for truly variable‑length descriptions. Low‑cardinality strings compress extremely well. -
Numbers: Use
INT/BIGINTfor integers — neverDECIMAL(20,0). UseDECIMAL(18,2)for monetary amounts; neverDOUBLE(floating‑point precision issues). -
Temporal: Use
DATETIMEfor full timestamps,DATEfor date‑only columns. Never store dates asVARCHAR— it prevents partition pruning and date‑function optimisations.
6. Complete Table Design Example
-- Fact table: large, distributed by high‑cardinality customer_id, partitioned by quarter
CREATE TABLE orders (
order_id BIGINT NOT NULL,
customer_id INT NOT NULL,
product_id INT NOT NULL,
dept_id SMALLINT NOT NULL,
amount DECIMAL(18,2),
status TINYINT NOT NULL,
order_date DATE NOT NULL,
create_time DATETIME
) DISTRIBUTED BY HASH(customer_id)
PARTITION BY RANGE(order_date) (
PARTITION p2024q1 VALUES LESS THAN ('2024-04-01'),
PARTITION p2024q2 VALUES LESS THAN ('2024-07-01'),
PARTITION p2024q3 VALUES LESS THAN ('2024-10-01'),
PARTITION p2024q4 VALUES LESS THAN ('2025-01-01'),
PARTITION p2025 VALUES LESS THAN ('2026-01-01'),
PARTITION pmax VALUES LESS THAN MAXVALUE
);
-- Dimension table: small, replicated
CREATE TABLE dim_product (
product_id INT NOT NULL,
product_name VARCHAR(128) NOT NULL,
category VARCHAR(64),
brand VARCHAR(64)
) REPLICATED;
7. Common Anti‑Patterns
| Anti‑Pattern | Consequence | Correct Approach |
|---|---|---|
| No distribution key specified | Defaults to first column, often skewed | Explicitly specify DISTRIBUTED BY HASH(appropriate_column)
|
| Distribution on low‑cardinality columns | Severe node imbalance | Use high‑cardinality columns |
| Dimension table as a distribution table | Hash redistribution on every JOIN | Use REPLICATED
|
VARCHAR(255) for enumerated values |
Poor compression, higher memory | Use TINYINT/SMALLINT
|
| Excessive partitions (>1,000) | High metadata overhead, slow planning | Partition by quarter or year instead of day |
Good table design is the starting point of performance optimisation in a gbase database. Changing a distribution key later requires rebuilding the table — a very expensive operation. During the design phase, answer three questions: what JOIN conditions are used most? Does the query workload have obvious time‑range filters? How large is the table and how frequently is it written? These answers directly determine your distribution key, partitioning strategy, and whether to use replication.







