How PostgreSQL Gracefully Handles 100+ TB of Data Ingestion Every Day

Abstract: This article introduces and benchmarks PostgreSQL's data insert speed on a mid-to-high-end x86 server, to help enterprise users understand PostgreSQL's performance in pure-insert scenarios (e.g., carrier gateway data and financial industry data, which are produced in huge volumes and must be inserted quickly into a large database for persistent storage). The result up front: each 32KB block stores 89 records, each record about 360 bytes.

This article introduces and benchmarks PostgreSQL's insert speed on a mid-to-high-end x86 server (the target table carries an index on a timestamp column), to help enterprise users understand PostgreSQL's performance in this class of scenario.
Typical scenarios: carrier gateway data and financial industry data, produced in huge volumes and requiring fast insertion into a large database for persistent storage.
Separately, if you need streaming real-time processing, see the PostgreSQL-based streaming solution: how real-time stream processing of 1 trillion records a day is achieved.
https://yq.aliyun.com/articles/166

TEST CASE

1. Average record length 360 bytes (a fairly common size).
2. An index on the timestamp column.
3. Each test round inserts 12TB of data; after the 12TB is in, clear the data and keep inserting, in a loop.
4. Stop after the test has run a full 24 hours.
5. Count the number of records inserted within the 24 hours.

TEST Results

12 test rounds were completed in the 24 hours, averaging 7,071 seconds per round.
5.06 million rows/s, 1.78 GB/s; 437.2 billion rows and 154TB of data inserted over the whole day.

Test hardware environment

1. x86 server
2. 3? cores
3. 5??GB memory
4. Several SSDs, 15TB capacity

Software environment

1. CentOS 6.x x64
2. xfs
3. PostgreSQL 9.5

System configuration reference

https://github.com/digoal/pgsql_admin_script/blob/master/pgsql_perf_tuning.md

Database configuration

./configure --prefix=/home/digoal/pgsql9.5.1 --with-blocksize=32 --with-segsize=128 --with-wal-blocksize=32 --with-wal-segsize=64  
make && make install
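
The switches above build PostgreSQL with 32KB data pages and 32KB WAL pages. Once the server is running, the compile-time sizes are exposed as read-only parameters; a minimal sanity check:

show block_size;      -- expect 32768 (32KB data pages)
show wal_block_size;  -- expect 32768 (32KB WAL pages)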

For how to enable huge pages in PostgreSQL, see:
https://yq.aliyun.com/articles/8482
Parameters

listen_addresses = '0.0.0.0'            # what IP address(es) to listen on;
fsync=on
port = 1921                             # (change requires restart)
max_connections = 600                   # (change requires restart)
superuser_reserved_connections = 13     # (change requires restart)
unix_socket_directories = '.'   # comma-separated list of directories
unix_socket_permissions = 0700          # begin with 0 to use octal notation
tcp_keepalives_idle = 60                # TCP_KEEPIDLE, in seconds;
tcp_keepalives_interval = 10            # TCP_KEEPINTVL, in seconds;
tcp_keepalives_count = 10               # TCP_KEEPCNT;
shared_buffers = 256GB                   # min 128kB
huge_pages = on                 # on, off, or try
work_mem = 512MB                                # min 64kB
maintenance_work_mem = 1GB              # min 1MB
autovacuum_work_mem = 1GB               # min 1MB, or -1 to use maintenance_work_mem
dynamic_shared_memory_type = posix      # the default is the first option
bgwriter_delay = 10ms                   # 10-10000ms between rounds
bgwriter_lru_maxpages = 1000            # 0-1000 max buffers written/round
bgwriter_lru_multiplier = 2.0  
synchronous_commit = off                # synchronization level;
full_page_writes = on                  # recover from partial page writes
wal_buffers = 2047MB                    # min 32kB, -1 sets based on shared_buffers
wal_writer_delay = 10ms         # 1-10000 milliseconds
checkpoint_timeout = 55min              # range 30s-1h
max_wal_size = 512GB
checkpoint_completion_target = 0.9      # checkpoint target duration, 0.0 - 1.0
effective_cache_size = 40GB   
log_destination = 'csvlog'              # Valid values are combinations of
logging_collector = on          # Enable capturing of stderr and csvlog
log_directory = 'pg_log'                # directory where log files are written,
log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log' # log file name pattern,
log_file_mode = 0600                    # creation mode for log files,
log_truncate_on_rotation = on           # If on, an existing log file with the
log_checkpoints = off
log_connections = off
log_disconnections = off
log_error_verbosity = verbose           # terse, default, or verbose messages
log_timezone = 'PRC'
log_autovacuum_min_duration = 0 # -1 disables, 0 logs all actions and
datestyle = 'iso, mdy'
timezone = 'PRC'
lc_messages = 'C'                       # locale for system error message
lc_monetary = 'C'                       # locale for monetary formatting
lc_numeric = 'C'                        # locale for number formatting
lc_time = 'C'                           # locale for time formatting
default_text_search_config = 'pg_catalog.english'
autovacuum=off
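
Note that synchronous_commit = off and autovacuum = off deliberately trade durability guarantees and background maintenance for raw insert throughput. A spot-check that the key settings took effect after restart (a hypothetical session, not from the original test):

show shared_buffers;      -- 256GB
show huge_pages;          -- on
show synchronous_commit;  -- off
show autovacuum;          -- off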

Create the test table:
Each 32KB block stores 89 records, each record 360 bytes.

postgres=# select string_agg(i,'') from (select md5(random()::text) i from generate_series(1,10) t(i)) t(i);
                               string_agg                                                                       
----------------------------------------------------------------------
 53d3ec7adbeacc912a45bdd8557b435be848e4b1050dc0f5e46b75703d4745833541b5dabc177db460b6b1493961fc72c478daaaac74bcc89aec4f946a496028d9cff1cc4144f738e01ea36436455c216aa697d87fe1f87ceb49134a687dc69cba34c9951d0c9ce9ca82bba229d56874af40498dca5f
d8dfb9c877546db76c35a3362d6bdba6472d3919289b6eaeeab58feb4f6e79592fc1dd8253fd4c588a29
(1 row)

postgres=# create unlogged table test(crt_time timestamp, info text default '53d3ec7adbeacc912a45bdd8557b435be848e4b1050dc0f5e46b75703d4745833541b5dabc177db460b6b1493961fc72c478daaaac74bcc89aec4f946a496028d9cff1cc4144f738e01ea36436455c216aa697d87fe1f87ceb49134a687dc69cba34c9951d0c9ce9ca82bba229d56874af40498dca5f
d8dfb9c877546db76c35a3362d6bdba6472d3919289b6eaeeab58feb4f6e79592fc1dd8253fd4c588a29');

postgres=# alter table test alter column info set storage plain;

postgres=# insert into test select now() from generate_series(1,1000);

postgres=# select ctid from test limit 1000;
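
The ctid check above confirms the packing: the first component of a row's ctid is its heap block number. A minimal sketch for counting rows per block (expect 89 per full 32KB block):

select (ctid::text::point)[0]::int as block_no, count(*) as rows_in_block
from test
group by 1
order by 1;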

Create three tablespace directories on three separate physical block devices, and create the corresponding tablespaces in the database:
tbs1, tbs2, tbs3.
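
The exact statements are not shown in the original; a minimal sketch, assuming hypothetical mount points (one per SSD device, directories owned by the postgres OS user):

create tablespace tbs1 location '/data01/tbs1';
create tablespace tbs2 location '/data02/tbs2';
create tablespace tbs3 location '/data03/tbs3';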

Create multiple sub-tables to reduce block-extend contention (concurrent writers extending the same heap serialize on the relation extension lock).

do language plpgsql $$
declare
i int;
sql text;
begin
  for i in 1..42 loop
    sql := 'create unlogged table test'||i||' (like test including all) tablespace tbs1';
    execute sql;
    sql := 'create index idx_test'||i||' on test'||i||' using brin (crt_time) with (pages_per_range=512) tablespace tbs1';
    execute sql;
  end loop;
  for i in 43..84 loop
    sql := 'create unlogged table test'||i||' (like test including all) tablespace tbs2';
    execute sql;
    sql := 'create index idx_test'||i||' on test'||i||' using brin (crt_time) with (pages_per_range=512) tablespace tbs2';
    execute sql;
  end loop;
  for i in 85..128 loop
    sql := 'create unlogged table test'||i||' (like test including all) tablespace tbs3';
    execute sql;
    sql := 'create index idx_test'||i||' on test'||i||' using brin (crt_time) with (pages_per_range=512) tablespace tbs3';
    execute sql;
  end loop;
end;
$$;

Black magic again: the BRIN index method

A BRIN range index is used here: PostgreSQL's black-magic indexing for IoT-style streaming data.
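
A BRIN index stores only a small summary (here, the min/max of crt_time per 512 heap pages) instead of one entry per row, so it stays tiny and adds very little write overhead compared with a btree. A hypothetical size comparison after loading (not an output from the original test):

select pg_size_pretty(pg_relation_size('test1'))     as heap_size,
       pg_size_pretty(pg_relation_size('idx_test1')) as brin_size;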

postgres=# \di
                 List of relations
 Schema |    Name     | Type  |  Owner   |  Table
--------+-------------+-------+----------+---------
 public | idx_test1   | index | postgres | test1
 public | idx_test10  | index | postgres | test10
 public | idx_test100 | index | postgres | test100
 public | idx_test101 | index | postgres | test101
 public | idx_test102 | index | postgres | test102
 public | idx_test103 | index | postgres | test103
 public | idx_test104 | index | postgres | test104
 public | idx_test105 | index | postgres | test105
 public | idx_test106 | index | postgres | test106
 ......
 ......
 public | idx_test90  | index | postgres | test90
 public | idx_test91  | index | postgres | test91
 public | idx_test92  | index | postgres | test92
 public | idx_test93  | index | postgres | test93
 public | idx_test94  | index | postgres | test94
 public | idx_test95  | index | postgres | test95
 public | idx_test96  | index | postgres | test96
 public | idx_test97  | index | postgres | test97
 public | idx_test98  | index | postgres | test98
 public | idx_test99  | index | postgres | test99
(128 rows)

Generate the test scripts. Each connection inserts 178 records per call, occupying two 32KB blocks:

vi test.sql 
insert into test(crt_time) values (now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()),(now()); 

for ((i=1;i<=128;i++)) do sed "s/test/test$i/" test.sql > ./test$i.sql; done

Clear the data before starting the test:

do language plpgsql $$  
declare
i int;
sql text;
begin
  for i in 1..128 loop
    sql := 'truncate test'||i;
    execute sql;
  end loop;
end; 
$$;

Test method:
Each round inserts 12TB of data, controlled as follows:
1. Use 128 parallel connections, each executing 1,572,864 transactions.
2. 201,326,592 transactions in total (each transaction inserts 178 records).
3. 35,836,133,376 records in total (35.836 billion rows, about 12TB of data; index space not counted), as cross-checked in the sketch below.
Before the next round starts, the logs are written out and all the data is TRUNCATEd; the above is then repeated.
Run until a full 24 hours have elapsed, then output the statistics.
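
The arithmetic can be cross-checked directly in SQL (plain arithmetic, not part of the test itself):

select 128 * 1572864               as total_transactions,  -- 201,326,592
       128 * 1572864 * 178::bigint as total_rows;          -- 35,836,133,376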
The test script is as follows:

vi test.sh
#!/bin/bash 
if [ $# -ne 5 ]; then
  echo "please use: $0 ip port dbname user pwd"
  exit 1
fi

IP=$1
PORT=$2
DBNAME=$3
USER=$4
PASSWORD=$5

export PGPASSWORD=$PASSWORD

DEP_CMD="psql"
which $DEP_CMD 
if [ $? -ne 0 ]; then
  echo -e "dep commands: $DEP_CMD not exist."
  exit 1
fi

truncate() {
psql -h $IP -p $PORT -U $USER $DBNAME <<EOF
do language plpgsql \$\$  
declare
i int;
sql text;
begin
  for i in 1..128 loop
    sql := 'truncate test'||i;
    execute sql;
  end loop;
end; 
\$\$;
checkpoint;
\q
EOF
}

# truncate data first
truncate

START=`date +%s`
echo "`date +%F%T` $START"

for ((x=1;x>0;x++)) 
do 
# ------------------------------------------------------
echo "Round $x test start: `date +%F%T` `date +%s`"

# one single-connection pgbench per sub-table; all 128 run in the background
for ((i=1;i<=128;i++)) 
do 
  pgbench -M prepared -n -r -f ./test$i.sql -h $IP -p $PORT -U $USER $DBNAME -c 1 -j 1 -t 1572864 >>./$i.log 2>&1 & 
done 

wait
echo "Round $x test end: `date +%F%T` `date +%s`"
# ------------------------------------------------------

# stop once a full 24 hours (86,400 seconds) have elapsed
if [ $((`date +%s`-$START)) -gt 86400 ]; then
  echo "end `date +%F%T` `date +%s`"
  echo "duration second: $((`date +%s`-$START))"
  exit 0
fi

echo "Round $x test end, start truncate `date +%F%T` `date +%s`"
truncate
echo "Round $x test end, end truncate `date +%F%T` `date +%s`"

done

Run the test

nohup ./test.sh xxx.xxx.xxx.xxx 1921 postgres postgres postgres >./test.log 2>&1 &

Test results

12 test rounds were completed in the 24 hours, averaging 7,071 seconds per round.
5.06 million rows/s (360 bytes per row), 1.78 GB/s; 437.2 billion rows and 154TB of data inserted over the whole day.
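
These figures are consistent with one another; a rough cross-check in SQL (plain arithmetic, not taken from the test logs):

select round((5.06e6 * 86400 / 1e9)::numeric, 1) as billion_rows_per_day,  -- 437.2
       round((1.78 * 86400 / 1e3)::numeric, 1)   as tb_per_day;            -- 153.8, i.e. ~154TB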

Query performance

postgres=# select min(crt_time),max(crt_time) from test1;
            min             |            max             
----------------------------+----------------------------
 2016-04-08 00:32:26.842728 | 2016-04-08 02:29:41.583367
(1 row)

postgres=# explain select count(*) from test1 where crt_time between '2016-04-08 00:32:00' and '2016-04-08 00:33:00';
                                                                            QUERY PLAN                                                                             
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1183919.81..1183919.82 rows=1 width=0)
   ->  Bitmap Heap Scan on test1  (cost=14351.45..1180420.19 rows=1399849 width=0)
         Recheck Cond: ((crt_time >= '2016-04-08 00:32:00'::timestamp without time zone) AND (crt_time <= '2016-04-08 00:33:00'::timestamp without time zone))
         ->  Bitmap Index Scan on idx_test1  (cost=0.00..14001.49 rows=1399849 width=0)
               Index Cond: ((crt_time >= '2016-04-08 00:32:00'::timestamp without time zone) AND (crt_time <= '2016-04-08 00:33:00'::timestamp without time zone))
(5 rows)
Time: 0.382 ms

postgres=# select count(*) from test1 where crt_time between '2016-04-08 00:32:00' and '2016-04-08 00:33:00';
  count  
---------
 2857968
(1 row)
Time: 554.474 ms
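
Most of the 554 ms goes to the bitmap heap scan: BRIN is lossy, so every row in the matched block ranges is rechecked against the predicate. A hypothetical way to confirm where the time goes (same query; the output is not reproduced from the original article):

explain (analyze, buffers)
select count(*) from test1
where crt_time between '2016-04-08 00:32:00' and '2016-04-08 00:33:00';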

Summary

1. The main application scenario for this case is real-time ingestion of big data, for example IoT, where huge numbers of sensors generate massive data volumes,
or traditional carrier gateways, which likewise carry very large traffic or business data that must be ingested in real time.
On the indexing side, PostgreSQL's BRIN black magic is used.
2. Beyond real-time ingestion, users who need streaming real-time processing can refer to the PostgreSQL-based streaming solution:

How is real-time stream processing of 1 trillion records a day achieved?

https://yq.aliyun.com/articles/166

3. The bottleneck is still on the I/O side. Several symptoms: top shows a large number of processes in the D state (uninterruptible sleep, waiting on I/O).

w: S  -- Process Status
          The status of the task which can be one of:
             'D' = uninterruptible sleep
             'R' = running
             'S' = sleeping
             'T' = traced or stopped
             'Z' = zombie

Utilization of all block devices reaches 100%.
While clearing data:

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
dfa               0.00     0.00 5807.39 167576.65 1464080.93 1340613.23    16.18   535.69    3.02   0.01 116.77
dfb               0.00     0.00 5975.10 185132.68 1506714.40 1481061.48    15.63   459.46    2.32   0.01 110.62
dfc               0.00     0.00 5715.56 182584.05 1440771.98 1460672.37    15.41   568.02    2.93   0.01 112.37

While inserting data:

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
dfa               0.00     0.00    0.00 235936.00     0.00 1887488.00     8.00  2676.34   11.17   0.00  99.10
dfb               0.00     0.00    0.00 237621.00     0.00 1900968.00     8.00    66.02    0.10   0.00  99.10
dfc               0.00     0.00    0.00 239830.00     0.00 1918632.00     8.00    10.66    0.04   0.00 101.30

I/O-level performance problems can be mitigated by optimizing code (for example, having the PostgreSQL bgwriter write data out as sequentially as possible) so the OS layer can merge I/Os and relieve the pressure. Judging from the iostat output above (avgrq-sz of 8 sectors, i.e. about 4KB per request during inserts), individual write I/Os could still be made larger.
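
One way to see which component does the writing (checkpoints, the background writer's cleaning scan, or the backends themselves) is the pg_stat_bgwriter view; a minimal check:

select checkpoints_timed, checkpoints_req,
       buffers_checkpoint, buffers_clean, buffers_backend, buffers_alloc
from pg_stat_bgwriter;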

A few tools you may find useful: perf, systemtap, and gprof. For a more complete analysis, it is recommended to build PostgreSQL with --enable-profiling for diagnostics.