6x Faster Queries with the New Analytics Engine in EDB Postgres Distributed

Phil Eaton
September 4, 2025

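Introduction

**EDB Postgres Distributed (PGD)** is part of the EDB Postgres AI product family and supports the following:

Replicating data into an analytics format (Iceberg)
Running Postgres queries unchanged through the **EDB Postgres Analytics Accelerator (PGAA)**, which uses Apache DataFusion internally

In this post we set up a PGD cluster and then run a simple analytics query against a business table that stores customer purchase data.

Comparison: the default Postgres query engine vs. the PGAA engine
Test environment: a t2.xlarge EC2 instance running Ubuntu 24.04
Result: PGAA was about 6x faster (6.91x in the hyperfine run at the end of this post)

Note: this experiment is not a rigorous benchmark. The goal is to show how to set up the environment and observe the difference between the two engines.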

Setting up PGD

Log in to your EDB account (or register for one), then export your subscription token as an environment variable:

$ export EDB_SUBSCRIPTION_TOKEN=whatever-it-is
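
The token is interpolated into the repository URLs in the next step, so an unset variable produces a malformed URL rather than a clear error. A minimal sketch of a guard you could run first; require_token is a hypothetical helper name, not part of EDB's tooling:

```shell
# require_token: succeed only if EDB_SUBSCRIPTION_TOKEN is set and non-empty.
# (Hypothetical helper for illustration; not provided by EDB.)
require_token() {
  if [ -z "${EDB_SUBSCRIPTION_TOKEN:-}" ]; then
    echo "EDB_SUBSCRIPTION_TOKEN is not set" >&2
    return 1
  fi
}

# Usage: require_token && curl -1sLf "https://downloads.enterprisedb.com/..." | sudo -E bash
```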

Add the PGD, PGAA, and EDB Postgres Extended repositories:

$ curl -1sLf "https://downloads.enterprisedb.com/$EDB_SUBSCRIPTION_TOKEN/postgres_distributed/setup.deb.sh" | sudo -E bash
$ curl -1sLf "https://downloads.enterprisedb.com/$EDB_SUBSCRIPTION_TOKEN/enterprise/setup.deb.sh" | sudo -E bash

Install the packages:

$ sudo apt-get update -y
$ sudo apt-get install -y edb-pgd6-expanded-pgextended17 edb-postgresextended-17-pgaa hyperfine

Switch to the postgres user and create a single-node PGD cluster:

$ sudo su postgres
$ cd ~
$ PGPASSWORD=secret /usr/lib/edb-pge/17/bin/pgd node db1 setup \
 --dsn 'host=localhost dbname=pgd' \
 --pgdata /var/lib/postgresql/db \
 --log-file logfile \
 --group-name pgd-group

Postgres is now running and PGD is set up. Check the cluster state to confirm that the single node is active:

$ /usr/lib/edb-pge/17/bin/psql -P expanded=auto -h localhost pgd \
 -c 'SELECT * FROM bdr.node_summary'

The output looks like this:

node_name              | db1
node_group_name        | pgd-group
interface_connstr      | host=localhost dbname=pgd
peer_state_name        | ACTIVE
peer_target_state_name | ACTIVE
...

Setting up PGAA

Editing postgresql.conf

Adjust postgresql.conf for the analytics engine (PGAA).

$ echo "
pgaa.max_replication_lag_s = 1
pgaa.flush_task_interval_s = 1
# Where the analytics-format data is stored
pgfs.allowed_local_fs_paths = '/var/lib/postgresql/pgd-analytics'
pgaa.autostart_seafowl_port = 5445
pgaa.seafowl_url = 'http://localhost:5445'" | tee -a \
/var/lib/postgresql/db/postgresql.conf
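
Because tee -a appends every time it runs, repeating this step leaves duplicate lines in postgresql.conf. Postgres uses the last occurrence of a setting, so duplicates are harmless, but they make the file harder to read. A minimal sketch of an idempotent alternative; append_setting is a hypothetical helper, not something PGD or PGAA provides:

```shell
# append_setting FILE LINE: append LINE to FILE only if that exact line
# is not already present. (Hypothetical helper for illustration.)
append_setting() {
  file=$1
  line=$2
  # -x matches whole lines, -F treats LINE as a fixed string, not a regex.
  grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Usage:
#   append_setting /var/lib/postgresql/db/postgresql.conf "pgaa.max_replication_lag_s = 1"
```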

Restart Postgres to apply the changes:

$ /usr/lib/edb-pge/17/bin/pg_ctl -D /var/lib/postgresql/db -l logfile restart

Then create the directory where the analytics data will be stored:

$ mkdir /var/lib/postgresql/pgd-analytics

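Adding the PGAA extension and configuring the data path

The next step is to add the PGAA extension and specify where the data should be stored.
To keep things simple, this post stores the data in Iceberg format on the local file system. (In practice you can also use S3-compatible object storage such as MinIO, or point at an Iceberg REST Catalog endpoint and replicate directly into Iceberg tables.)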

$ /usr/lib/edb-pge/17/bin/psql -h localhost pgd -c "
-- Create the extension on every node in the cluster and wait for it to sync
CREATE EXTENSION pgaa CASCADE;
SELECT bdr.wait_slot_confirm_lsn(NULL, NULL);

-- Create the storage location (must match the path set in postgresql.conf)
-- The dollar signs are escaped so the shell does not expand \$\$ to its process ID
SELECT bdr.replicate_ddl_command(\$\$SELECT pgfs.create_storage_location('local_fs', 'file:///var/lib/postgresql/pgd-analytics')\$\$);

-- Make sure the group has a write leader
SELECT bdr.alter_node_group_option('pgd-group', 'enable_proxy_routing', 'true');

-- Point the current PGD group at the storage location we created
SELECT bdr.alter_node_group_option('pgd-group', 'analytics_storage_location', 'local_fs');
"

Creating the orders table

Now create a table that represents customer orders and insert 100 million rows of test data. (On my t2.xlarge instance this took about 10 minutes.)

$ /usr/lib/edb-pge/17/bin/psql -h localhost pgd -c "
BEGIN;
DROP TABLE IF EXISTS orders CASCADE;
CREATE TABLE orders (
 customer_id INT,
 order_id BIGSERIAL,
 amount_pennies BIGINT,
 PRIMARY KEY (customer_id, order_id)
) WITH (pgd.replicate_to_analytics = TRUE);
INSERT INTO orders (customer_id, amount_pennies)
 SELECT
   random() * 9 + 1,
   random() * 100000 + 1
 FROM generate_series(1, 100_000_000);
COMMIT;"

Wait until Parquet files appear under /var/lib/postgresql/pgd-analytics/public.orders. (This takes about 5 minutes.)
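
Rather than checking the directory by hand, you can poll for it. The sketch below is one way to do that, under two assumptions: the data files carry a .parquet extension, and the first file appearing is a good enough signal to proceed. It does not prove that replication has finished. wait_for_parquet is a hypothetical helper name:

```shell
# wait_for_parquet DIR [TIMEOUT_S] [INTERVAL_S]
# Poll DIR (recursively) until at least one *.parquet file exists.
# Returns 0 once a file is found, 1 if TIMEOUT_S elapses first.
wait_for_parquet() {
  dir=$1
  timeout=${2:-600}
  interval=${3:-5}
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    # Errors are silenced because DIR may not exist yet.
    if [ -n "$(find "$dir" -type f -name '*.parquet' 2>/dev/null | head -n 1)" ]; then
      return 0
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  return 1
}

# Usage:
#   wait_for_parquet /var/lib/postgresql/pgd-analytics/public.orders 900 10
```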

Creating the PGAA table

Once the Parquet files are ready, create a Postgres table that uses PGAA's Table Access Method. Queries against this table run on a vectorized query engine built on Apache DataFusion.

$ /usr/lib/edb-pge/17/bin/psql -h localhost pgd -c "
CREATE TABLE orders_analytics ()
 USING PGAA WITH (pgaa.storage_location = 'local_fs', pgaa.path = 'public.orders', pgaa.format = 'iceberg');
"

⚠️ Note: on a multi-node cluster this statement can fail because of DDL locks until PGD replication has completed. Wait about 3 minutes and run it again.

Finally, run VACUUM ANALYZE to bring the Postgres statistics up to date. (This takes about 4 minutes.)

$ /usr/lib/edb-pge/17/bin/psql -h localhost pgd -c "VACUUM ANALYZE orders;"

Running the analytics query

Now run an analytics query that computes how much each customer has spent:

SELECT
  customer_id,
  SUM(amount_pennies) total
FROM orders
GROUP BY customer_id
ORDER BY total DESC;

For more reliable results, use hyperfine to run each query multiple times and measure the mean run time.

$ hyperfine \
 --warmup 10 \
 "/usr/lib/edb-pge/17/bin/psql -h localhost pgd -c 'SELECT customer_id, SUM(amount_pennies) total FROM orders_analytics GROUP BY customer_id ORDER BY total DESC'" \
 "/usr/lib/edb-pge/17/bin/psql -h localhost pgd -c 'SELECT customer_id, SUM(amount_pennies) total FROM orders GROUP by customer_id ORDER BY total DESC'"

Benchmark 1: /usr/lib/edb-pge/17/bin/psql -h localhost pgd -c 'SELECT customer_id, SUM(amount_pennies) total FROM orders_analytics GROUP by customer_id ORDER BY total DESC'
 Time (mean ± σ):     863.1 ms ±  11.1 ms    [User: 3.6 ms, System: 4.0 ms]
 Range (min … max):   851.6 ms … 891.6 ms    10 runs
Benchmark 2: /usr/lib/edb-pge/17/bin/psql -h localhost pgd -c 'SELECT customer_id, SUM(amount_pennies) total FROM orders GROUP by customer_id ORDER BY total DESC'
 Time (mean ± σ):      5.968 s ±  0.014 s    [User: 0.003 s, System: 0.004 s]
 Range (min … max):    5.939 s …  5.987 s    10 runs
Summary
 /usr/lib/edb-pge/17/bin/psql -h localhost pgd -c 'SELECT customer_id, SUM(amount_pennies) total FROM orders_analytics GROUP by customer_id ORDER BY total DESC' ran
   6.91 ± 0.09 times faster than /usr/lib/edb-pge/17/bin/psql -h localhost pgd -c 'SELECT customer_id, SUM(amount_pennies) total FROM orders GROUP by customer_id ORDER BY total DESC'
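
The speedup in the summary is the ratio of the two mean run times reported above. A quick check of that arithmetic with awk, where speedup is a hypothetical helper name:

```shell
# speedup SLOW_SECONDS FAST_SECONDS: print SLOW / FAST to two decimal places.
speedup() {
  awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.2f\n", slow / fast }'
}

# Mean for orders (5.968 s) vs. mean for orders_analytics (863.1 ms)
speedup 5.968 0.8631
```

This prints 6.91, which matches the figure in the hyperfine summary.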