Discussion:
BUG #7914: pg_dump aborts occasionally
(too old to reply)
s***@beingcorp.co.jp
2013-03-04 11:23:17 UTC
Permalink
The following bug has been logged on the website:

Bug reference: 7914
Logged by: Shin-ichi MORITA
Email address: s-***@beingcorp.co.jp
PostgreSQL version: 8.4.16
Operating system: Windows Server 2012
Description:

Hi great team,

I encountered a strange error during pg_dump as below:
----
pg_dump: Dumping the contents of table "t_file_data" failed: PQgetCopyData()
failed.
pg_dump: Error message from server: lost synchronization with server: got
message type "d", length 21861
pg_dump: The command was: COPY public.t_file_data (c_file_info_id,
c_page_no, c_data) TO stdout;
pg_dump: *** aborted because of error
----

It seems that large tables tend to cause this to happen.

In order to figure out what happens, I looked into
src/interfaces/libpq/fe-protocol3.c.
Then I have found that it is possible that PGconn::inBuffer is
unintentionally enlarging during execution of getCopyDataMessage().

Please look at fe-protocol3.c:1432:
1425: avail = conn->inEnd - conn->inCursor;
1426: if (avail < msgLength - 4)
1427: {
:
1432: if (pqCheckInBufferSpace(conn->inCursor + (size_t) msgLength -
4,

When several messages have been already proccessed and avail < msgLength -
4,
inBuffer will be enlarged without left-justification.

I think that pqCheckInBufferSpace() should be called only when inBuffer is
left-justified like this:
1432: if (conn->inStart == 0 && pqCheckInBufferSpace(conn->inCursor +
(size_t) msgLength - 4,

When inBuffer was not left-justified at that moment, the line 1432 will be
executed again after inBuffer is left-justified in pqReadData() called by
pqGetCopyData3().

Please check this out.
Sorry if I have mistaken.

Note that I did not consider the async case.

Thanks and best regards,
--
Sent via pgsql-bugs mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
Tom Lane
2013-03-04 20:37:14 UTC
Permalink
Post by s***@beingcorp.co.jp
----
pg_dump: Dumping the contents of table "t_file_data" failed: PQgetCopyData()
failed.
pg_dump: Error message from server: lost synchronization with server: got
message type "d", length 21861
Hm. Can you create a reproducible test case for this?
Post by s***@beingcorp.co.jp
I think that pqCheckInBufferSpace() should be called only when inBuffer is
1432: if (conn->inStart == 0 && pqCheckInBufferSpace(conn->inCursor +
(size_t) msgLength - 4,
I think changing that code that way would break the considerations
mentioned in pqParseInput3; nor is it clear why this would help, anyway.
It doesn't seem likely that the buffer is so large that it can't be
enlarged by another 20K.

regards, tom lane
--
Sent via pgsql-bugs mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
Shin-ichi MORITA
2013-03-05 10:56:16 UTC
Permalink
Hi Tom,
Post by Tom Lane
Post by s***@beingcorp.co.jp
----
PQgetCopyData()
failed.
pg_dump: Error message from server: lost synchronization with server: got
message type "d", length 21861
Hm. Can you create a reproducible test case for this?
This issue does not always happen.
It seems to depend on the environment and the load.
My test environment is actually a virual one as follows:
Hypervisor: VMware vSphere Hypervisor 5.1
Allocated vCPUs: 1
Allocated Memory: 4GB
OS: Windows Server 2012

1. Create test data using the script below.
2. Run pg_dump:
pg_dump --host localhost --port 5432 --username postgres
--format custom --verbose --file "..."
--table public.t_data test_db

----
CREATE TABLE t_data
(
c_id integer NOT NULL,
c_data bytea,
CONSTRAINT t_data_pkey PRIMARY KEY (c_id)
)
WITH (
OIDS=FALSE
);

CREATE LANGUAGE plpgsql;

CREATE FUNCTION create_data(n INT) RETURNS VOID AS
$$
BEGIN
FOR i IN 1..n LOOP
INSERT INTO t_data VALUES (i, rpad('', 20000, 'x')::bytea);
END LOOP;
END;
$$
LANGUAGE plpgsql;

SELECT create_data(300000);
----
Post by Tom Lane
Post by s***@beingcorp.co.jp
I think that pqCheckInBufferSpace() should be called only when inBuffer is
1432: if (conn->inStart == 0 && pqCheckInBufferSpace(conn->inCursor +
(size_t) msgLength - 4,
I think changing that code that way would break the considerations
mentioned in pqParseInput3; nor is it clear why this would help, anyway.
It doesn't seem likely that the buffer is so large that it can't be
enlarged by another 20K.
I think this issue happens when pg_dump is slower than the backend
for some reason.
Actually, pg_dump's memory consumption glows over 500MB
when this issue is happening.

If pg_dump is slower than the backend for some reason
and each message is not exactly 8K,
inBuffer will frequently overflow in getCopyDataMessage().

This will cause frequent calls to pqCheckInBufferSpace()
without left-justification.
Each call will double inBuffer and inBuffer will rapidly glow
even if each message is 20K.
Post by Tom Lane
regards, tom lane
Regards,
Shin-ichi MORITA
--
Sent via pgsql-bugs mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
Tom Lane
2013-03-05 19:22:26 UTC
Permalink
Post by Shin-ichi MORITA
Hi Tom,
Post by Tom Lane
Hm. Can you create a reproducible test case for this?
I think this issue happens when pg_dump is slower than the backend
for some reason.
If so, perhaps injecting a sleep() delay into the right place in pg_dump
or libpq would make it reproducible? I wouldn't have any problem
crediting a test with such modified code as being a valid test.

regards, tom lane
--
Sent via pgsql-bugs mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
Shin-ichi MORITA
2013-03-06 03:59:27 UTC
Permalink
Post by Tom Lane
Post by Shin-ichi MORITA
Post by Tom Lane
Hm. Can you create a reproducible test case for this?
I think this issue happens when pg_dump is slower than the backend
for some reason.
If so, perhaps injecting a sleep() delay into the right place in pg_dump
or libpq would make it reproducible? I wouldn't have any problem
crediting a test with such modified code as being a valid test.
I think it's a good idea.
Although I can not confirm this by actually modifying the code right now,
somewhere in the loop calling PQgetCopyData() in pg_dump.c would be good.

pg_dump.c:
1348: for (;;)
1349: {
1350: ret = PQgetCopyData(conn, &copybuf, 0);
1351:
1352: if (ret < 0)
1353: break; /* done or error */
1354:
1355: if (copybuf)
1356: {
1357: WriteData(fout, copybuf, ret);
1358: PQfreemem(copybuf);
1359: }
sleep(1); /* maybe */

An alternative way would be running pg_dump with a lower priority.
Actually, I can reproduce this issue by setting the priority of
pg_dump to Low using Windows Task Manager
on the "single processor" environment.

Regards,
Shin-ichi MORITA
--
Sent via pgsql-bugs mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
Tom Lane
2013-03-06 18:33:35 UTC
Permalink
Post by Shin-ichi MORITA
Post by Tom Lane
If so, perhaps injecting a sleep() delay into the right place in pg_dump
or libpq would make it reproducible?
An alternative way would be running pg_dump with a lower priority.
Actually, I can reproduce this issue by setting the priority of
pg_dump to Low using Windows Task Manager
on the "single processor" environment.
I tried nice'ing pg_dump without any success in making it bloat memory.
I'm suspicious that there's something Windows-specific in what you're
seeing, because it's fairly hard to credit that nobody's seen this
problem in the ten years or so that libpq has been doing it like that.

Can anyone else manage to reproduce a similar behavior?

regards, tom lane
--
Sent via pgsql-bugs mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
Shin-ichi MORITA
2013-03-10 11:58:34 UTC
Permalink
Post by Tom Lane
I tried nice'ing pg_dump without any success in making it bloat memory.
I'm suspicious that there's something Windows-specific in what you're
seeing, because it's fairly hard to credit that nobody's seen this
problem in the ten years or so that libpq has been doing it like that.
Can anyone else manage to reproduce a similar behavior?
I tried several other conditions as well including Linux,
32bit OS, non-virtual environments, and so on.

However, so far, I have not been able to reproduce this issue yet
with conditions other than the following combination:

Hypervisor: VMware vSphere Hypervisor 5.1
Allocated vCPUs: 1
Allocated Memory: more than 2GB
OS: Windows Server 2012 or 2008R2 (64bit)
Run pg_dump under high load
or run pb_dump with the Low priority using Windows Task Manager

This might be a rare case.

By the way, we have already worked around this issue by adding --inserts.

Thank you.

Best regards,
Shin-ichi MORITA
--
Sent via pgsql-bugs mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
Loading...