PostgreSQL13逻辑复制实战 1:N的主从数据库表的复制
从MySQL转投到PostgreSQL已经有5年了,去年开始用到逻辑复制功能, 当时从一台主数据库,通过Wireguard的网络,复制数据到从数据库。 一年的使用很平稳,现在需要增加一台从数据库,所以需要重新梳理一下逻辑复制的实现。
当我使用原来的从数据库的备份,作为新的从数据库时,发现无法正常复制数据,因此,进行了一面的实验,发现了一些问题,这里记录一下。
目前的架构
主数据库有replication_slot mysub, 有mypub,mypub2两个发布
1\c mando
2select * from pg_replication_slots;
3 slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn | wal_status | safe_wal_size
4-----------+----------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------+------------+---------------
5 mysub | pgoutput | logical | 16385 | mando | f | t | 862 | | 21060 | 0/E5B50C0 | 0/E5B50F8 | reserved |
6(1 row)
7
8
9select * from pg_publication;
10 oid | pubname | pubowner | puballtables | pubinsert | pubupdate | pubdelete | pubtruncate | pubviaroot
11-------+---------+----------+--------------+-----------+-----------+-----------+-------------+------------
12 16656 | mypub | 10 | f | t | t | t | t | f
13 41305 | mypub2 | 10 | f | t | t | t | t | f
实验1
建立一个已经存在的mysub上,这是第一个目前正在工作的从数据库的代码,看有什么情况出现
1CREATE SUBSCRIPTION mysub CONNECTION 'dbname=mando host=10.0.0.1 user=traffic password=password' PUBLICATION mypub WITH (connect = false, slot_name = 'mysub');
1
2//从机报错
32024-03-22 21:10:41.749 CST [22072] ERROR: could not start WAL streaming: ERROR: replication slot "mysub" is active for PID 862
42024-03-22 21:10:41.751 CST [457] LOG: background worker "logical replication worker" (PID 22072) exited with exit code 1
5
6
7//主机日志报错
82024-03-22 13:11:06.926 UTC [14310] traffic@mando ERROR: replication slot "mysub" is active for PID 862
92024-03-22 13:11:06.926 UTC [14310] traffic@mando STATEMENT: START_REPLICATION SLOT "mysub" LOGICAL 0/0 (proto_version '1', publication_names '"mypub"')
结论: 不能建立同名的slot,因为原来的mysub已经被占用了
实验2
将从数据库的slot_name改为mysub2,看有什么情况出现, 这里mysub是从数据库的subscription, slot_name=mysub2,是主数据库的slot_name
1ALTER SUBSCRIPTION mysub SET (slot_name = "mysub2");
2ALTER SUBSCRIPTION mysub ENABLE;
1//从机报错
22024-03-22 20:55:43.453 CST [21784] ERROR: could not start WAL streaming: ERROR: replication slot "mysub2" does not exist
32024-03-22 20:55:43.455 CST [457] LOG: background worker "logical replication worker" (PID 21784) exited with exit code 1
4
5//主机日志报错
62024-03-22 12:58:52.756 UTC [14105] traffic@mando ERROR: replication slot "mysub2" does not exist
72024-03-22 12:58:52.756 UTC [14105] traffic@mando STATEMENT: START_REPLICATION SLOT "mysub2" LOGICAL 0/0 (proto_version '1', publication_names '"mypub"')
结论: 主数据库的subscription不能通过alter来建立.
另外,下面的指令,也会提示上面一样的错误.
1CREATE SUBSCRIPTION mysub2 CONNECTION 'dbname=mando host=10.0.0.1 user=traffic password=password' PUBLICATION mypub WITH (connect = false, slot_name = 'mysub2');
删除原来的subscription, 再删除从数据库
为了干净的进行第三次实验,先删除原来的subscription,再重新建立数据库.
1\c mando2
2alter subscription mysub disable;
3alter subscription mysub set (slot_name = none);
4drop subscription mysub;
5\c postgres
6drop database mando2;
7create database mando2;
上面的步骤必须一步一步执行,否则会报错,因为有先后顺序的限制.
实验3
在从机上创建subscription
1CREATE SUBSCRIPTION sub_test
2 CONNECTION 'dbname=mando host=10.0.0.1 user=traffic password=password' PUBLICATION mypub
3 WITH (create_slot = false);
1//主机日志报错
22024-03-22 23:48:03.200 UTC [45382] traffic@mando ERROR: replication slot "sub_test" does not exist
32024-03-22 23:48:03.200 UTC [45382] traffic@mando STATEMENT: START_REPLICATION SLOT "sub_test" LOGICAL 0/0 (proto_version '1', publication_names '"mypub"')
4
5//从机上的日志
62024-03-23 07:49:49.285 CST [9079] LOG: logical replication apply worker for subscription "sub_test" has started
72024-03-23 07:49:50.590 CST [9081] LOG: logical replication table synchronization worker for subscription "sub_test", table "login_devices" has started
82024-03-23 07:49:50.602 CST [9082] LOG: logical replication table synchronization worker for subscription "sub_test", table "users" has started
实验4
在从机上运行, 下面的指令会在主机上创建replication_slot sub_testx
1\c mando2
2CREATE SUBSCRIPTION sub_testx
3 CONNECTION 'dbname=mando host=10.0.0.1 user=traffic password=password' PUBLICATION mypub
4 WITH (create_slot = true);
在主机上,找到了sub_testx
1mando=# SELECT * FROM pg_replication_slots;
2 slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn | wal_status | safe_wal_size
3----------------------------+----------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------+------------+---------------
4 mysub | pgoutput | logical | 16385 | mando | f | t | 862 | | 21073 | 0/E5D7778 | 0/E5DF068 | reserved |
5 sub_test | pgoutput | logical | 16385 | mando | f | f | | | 21073 | 0/E5D76D0 | 0/E5D7708 | reserved |
6 sub_testx | pgoutput | logical | 16385 | mando | f | t | 46668 | | 21073 | 0/E5D7778 | 0/E5DF068 | reserved |
但是,从机的users表格,没有数据.
1//从机上的日志
22024-03-23 08:09:29.609 CST [9415] LOG: logical replication apply worker for subscription "sub_testx" has started
32024-03-23 08:09:30.912 CST [9416] LOG: logical replication table synchronization worker for subscription "sub_testx", table "login_devices" has started
42024-03-23 08:09:30.923 CST [9417] LOG: logical replication table synchronization worker for subscription "sub_testx", table "users" has started
5
62024-03-23 08:18:56.249 CST [9414] postgres@mando2 ERROR: subscription "sub_testx" already exists
72024-03-23 08:18:56.249 CST [9414] postgres@mando2 STATEMENT: CREATE SUBSCRIPTION sub_testx
8 CONNECTION 'dbname=mando host=10.0.0.1 user=traffic password=password' PUBLICATION mypub
9 WITH (create_slot = true);
102024-03-23 08:30:30.764 CST [9416] ERROR: could not start initial contents copy for table "public.login_devices": server closed the connection unexpectedly
11 This probably means the server terminated abnormally
12 before or while processing the request.
13 server closed the connection unexpectedly
1//主机上的日志
22024-03-23 00:09:30.832 UTC [46668] traffic@mando STATEMENT: START_REPLICATION SLOT "sub_testx" LOGICAL 0/0 (proto_version '1', publication_names '"mypub"')
32024-03-23 00:09:30.832 UTC [46668] traffic@mando LOG: logical decoding found consistent point at 0/E5D7708
42024-03-23 00:09:30.832 UTC [46668] traffic@mando DETAIL: There are no running transactions.
52024-03-23 00:09:30.832 UTC [46668] traffic@mando STATEMENT: START_REPLICATION SLOT "sub_testx" LOGICAL 0/0 (proto_version '1', publication_names '"mypub"')
62024-03-23 00:09:32.164 UTC [46671] traffic@mando LOG: logical decoding found consistent point at 0/E5D7740
72024-03-23 00:09:32.164 UTC [46671] traffic@mando DETAIL: There are no running transactions.
82024-03-23 00:09:32.164 UTC [46671] traffic@mando STATEMENT: CREATE_REPLICATION_SLOT "sub_testx_16749_sync_16690" TEMPORARY LOGICAL pgoutput USE_SNAPSHOT
92024-03-23 00:09:32.189 UTC [46672] traffic@mando LOG: logical decoding found consistent point at 0/E5D7778
102024-03-23 00:09:32.189 UTC [46672] traffic@mando DETAIL: There are no running transactions.
112024-03-23 00:09:32.189 UTC [46672] traffic@mando STATEMENT: CREATE_REPLICATION_SLOT "sub_testx_16749_sync_16710" TEMPORARY LOGICAL pgoutput USE_SNAPSHOT
12
132024-03-23 00:26:39.078 UTC [46672] traffic@mando STATEMENT: COPY public.users TO STDOUT
142024-03-23 00:26:45.222 UTC [46671] traffic@mando LOG: could not receive data from client: Connection timed out
152024-03-23 00:26:45.222 UTC [46671] traffic@mando LOG: unexpected EOF on client connection with an open transaction
结论: 实验4创建了一个新的slot,但是,从机上的数据没有同步过来,原因是从机上的连接超时了.经过一段时间分析, 估计是主机的配质问题.
解决问题
从日志可以看出,实验4创建了一个新的slot,经过一段时间的分析,可能是发送的进程资源不够. 修改主数据库 /etc/postgresql/13/main/postgresql.conf 设置如下参数,重启数据库,问题解决.
1max_wal_senders = 20 # max number of walsender processes
2
3max_replication_slots = 20 # max number of replication slots
默认的设置都是10,显然是不够第二个从数据库使用的.
其他有用的设置
Decrease wal_sender_timeout from 30 seconds (default) to 15 seconds to ensure that idle WAL sender processes are terminated more quickly, which frees up resources for active replication.
Decrease wal_receiver_timeout from 30 seconds (default) to 15 seconds to ensure that idle WAL receiver processes are terminated more quickly, which frees up resources for active replication.
Increase max_logical_replication_workers from 4 (default) to 8 to ensure that there are enough logical replication worker processes to keep up with replication demand.
创建发布指令
1
2CREATE PUBLICATION mypub2 WITH (publish = 'insert, update, delete, truncate');
3
4
5ALTER PUBLICATION mypub2 OWNER TO postgres;
6
7--
8-- Name: mypub login_devices; Type: PUBLICATION TABLE; Schema: public; Owner: postgres
9--
10
11ALTER PUBLICATION mypub ADD TABLE ONLY public.login_devices;
12
13
14--
15-- Name: mypub2 login_devices; Type: PUBLICATION TABLE; Schema: public; Owner: postgres
16--
17
18ALTER PUBLICATION mypub2 ADD TABLE ONLY public.login_devices;
用户权限
1GRANT ALL ON TABLE public.login_devices TO traffic;
2
3GRANT ALL ON TABLE public.users TO traffic;