RabbitMQ mailing list: RabbitMQ crashes hard when it runs out of memory

Guest
Posted: Thu Oct 22, 2009 6:04 pm
From your suggestions, it looks like I was on the right track. Output is inline.

Moved to rabbitmq-discuss.

And again, your guidance is greatly appreciated.

-Stephen

On Thu, Oct 22, 2009 at 1:51 AM, Matthias Radestock <matthias@lshift.net> wrote:
> Stephen,
>
> Stephen Day wrote:
> > After running for a few days, limiting queue size on the producer side (no memory backpressure), I have noticed that RabbitMQ has restarted several times due to out-of-memory errors. Currently, I have about 150,000 persistent messages being consumed by 16 clients with qos. Over about 2-6 hours, the RSS and VM size of the Erlang process grow unbounded, without the producers adding any messages to the queue (externally limited to 100,000), and the Erlang process crashes. I am running version 1.7.0.
>
> Memory consumption can increase due to gc effects, but not without bounds.
>
> How many messages does rabbit think there are in the queues - check with 'rabbitmqctl list_queues' - and how big are the messages?

$ ~/rabbitmq/sbin/rabbitmqctl list_queues name messages messages_unacknowledged consumers memory
Listing queues ...
<redacted>
Guest
Posted: Thu Oct 22, 2009 6:34 pm
Stephen,

Stephen Day wrote:
> $ ~/rabbitmq/sbin/rabbitmqctl list_queues name messages
> [...]
> <redacted> 149917 16 16 71106232
> du -b mnesia/rabbit/*
> [...]
> 82515919 mnesia/rabbit/rabbit_persister.LOG
> 82526910 mnesia/rabbit/rabbit_persister.LOG.previous

That all looks reasonable.

> Here is the memory() output:
>
> (rabbit@vs-dfw-ctl11)1> memory().
> [{total,1015558480},
> {processes,161498128},
> {processes_used,161494240},
> {system,854060352},
> {atom,516217},
> {atom_used,492146},
> {binary,778446312},
> {code,3860276},
> {ets,70066180}]

Interesting. There is far more binary data allocated than you have
message data in your system.

We've noticed in the past that binaries can take quite a while to get
collected by the gc. One experiment you may want to try is to run
garbage_collect(whereis(rabbit_persister)).
from the Erlang shell and see what difference that makes to the memory
report.

Also, the output of
process_info(whereis(rabbit_persister)).
might be useful.
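The experiment above can be wrapped in a small helper so the before and after figures are captured together. This is only a sketch: the module name gc_probe and the function binary_freed_by_gc/1 are made up for illustration and are not part of RabbitMQ. It uses only erlang:memory/1, erlang:garbage_collect/1 and whereis/1. The result is approximate, because other processes keep allocating and freeing binaries while it runs.

```erlang
-module(gc_probe).
-export([binary_freed_by_gc/1]).

%% Hypothetical helper (not part of RabbitMQ): force a GC of one process
%% and report the VM-wide binary memory before and after, in bytes.
%% Accepts a registered name (e.g. rabbit_persister) or a pid.
binary_freed_by_gc(Name) when is_atom(Name) ->
    case whereis(Name) of
        undefined -> {error, not_registered};
        Pid       -> binary_freed_by_gc(Pid)
    end;
binary_freed_by_gc(Pid) when is_pid(Pid) ->
    Before = erlang:memory(binary),
    %% Returns false if the process has already exited; ignored here.
    _ = erlang:garbage_collect(Pid),
    After = erlang:memory(binary),
    {Before, After, Before - After}.
```

From the shell of the running node this would be called as gc_probe:binary_freed_by_gc(rabbit_persister). A large third element suggests the persister was holding on to dead binaries.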

> Right now, the system is using around 1GB of memory

And you are saying that figure keeps growing, and eventually RabbitMQ
runs out of memory, even though the figures reported by 'rabbitmqctl
list_queues' and 'du' remain roughly constant? That's strange indeed.


Regards,

Matthias.

_______________________________________________
rabbitmq-discuss mailing list
rabbitmq-discuss@lists.rabbitmq.com
http://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss
Post received from mailinglist
Guest
Posted: Thu Oct 22, 2009 8:51 pm
Unfortunately, the system has crashed since the last outputs I provided, but the behavior remains. There definitely seems to be some memory held up in the persister, but I don't think this is the main source. Below, I printed out the memory for the process, gc'd it, then printed it again:

1> process_info(whereis(rabbit_persister)).
[{registered_name,rabbit_persister},
Guest
Posted: Thu Oct 22, 2009 9:23 pm
As a light experiment, to isolate garbage collection, I ran this:

4> memory().
[{total,367371832},
Guest
Posted: Thu Oct 22, 2009 11:24 pm
Stephen,

Stephen Day wrote:
> (rabbit@vs-dfw-ctl11)5> [erlang:garbage_collect(P) || P <-
> erlang:processes()].
> [true,true,true,true,true,true,true,true,true,true,true,
> true,true,true,true,true,true,true,true,true,true,true,true,
> true,true,true,true,true,true|...]
>
> (rabbit@vs-dfw-ctl11)6>
> memory().
> [{total,145833144},
> {processes,50900752},
> {processes_used,50896864},
> {system,94932392},
> {atom,514765},
> {atom_used,488348},
> {binary,24622512},
> {code,3880064},
> {ets,64745716}]
>
> This really cut down on usage, so it's likely that the binary gc is
> falling behind rabbit's requirements.

Agreed.

> How do I track down the uncollected binary heap usage to a process?

Binaries are shared between processes and ref counted, so no single
process owns them. There is a process_info item called 'binary' that
provides information on the binaries referenced by a process, but I've
never looked at that myself, so don't know how useful the contained info is.
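As a rough illustration of what the 'binary' item can yield, the sketch below totals, per process, the sizes of the off-heap binaries that process references. It assumes the item is a list of {Id, Size, RefCount} tuples, which is how current Erlang documentation describes it while warning that the format is internal and may change without notice. The module name bin_report and the function top/1 are invented for this example. Because binaries are shared, the same binary is counted once for every process that references it, so the per-process totals can add up to more than memory(binary).

```erlang
-module(bin_report).
-export([top/1]).

%% Hypothetical helper: the N processes referencing the most binary data,
%% as {Pid, BinaryCount, TotalBytes}, largest first.
top(N) ->
    Rows = lists:map(fun summarize/1, erlang:processes()),
    lists:sublist(lists:reverse(lists:keysort(3, Rows)), N).

summarize(P) ->
    case erlang:process_info(P, binary) of
        {binary, Bins} ->
            %% Assumed element shape: {Id, Size, RefCount}. Entries that
            %% do not match are skipped rather than crashing the report.
            Sizes = [Size || {_Id, Size, _RefCount} <- Bins],
            {P, length(Sizes), lists:sum(Sizes)};
        undefined ->
            %% The process exited between processes() and process_info/2.
            {P, 0, 0}
    end.
```

Calling bin_report:top(5) in the shell gives a short list of candidates to inspect further with process_info/1.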

One thing you could try is to run the above garbage_collect code
interleaved with the memory reporting code to identify which process
results in the biggest drop in memory usage when gc'ed.
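One possible shape for that interleaving is sketched below. The module name gc_drop and the function rank/0 are hypothetical. Each process is collected in turn and the change in erlang:memory(binary) around that single collection is recorded. The numbers are noisy on a busy node, since unrelated allocations happen between the two readings, so only large drops should be taken seriously.

```erlang
-module(gc_drop).
-export([rank/0]).

%% Hypothetical helper: GC every process one at a time and return
%% [{Pid, BytesFreed}] sorted with the biggest drop first.
rank() ->
    Drops = [{P, drop(P)} || P <- erlang:processes()],
    lists:reverse(lists:keysort(2, Drops)).

%% Binary memory released (VM-wide) by collecting a single process.
%% May be negative if other processes allocated in the meantime.
drop(P) ->
    Before = erlang:memory(binary),
    _ = erlang:garbage_collect(P),
    Before - erlang:memory(binary).
```

lists:sublist(gc_drop:rank(), 5) would then show the five processes whose collection released the most binary memory.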


Regards,

Matthias.

Guest
Posted: Fri Oct 23, 2009 12:48 am
I won't bore you with all the output, but I tracked down the binary usage to these two processes:

[{Pid1, _Info, _Bin}, {Pid2, _Info2, _Bin2} | Other ] = [{P, process_info(P), BinInfo} || {P, {binary, BinInfo}} <- [{P, process_info(P, binary)} || P <- processes()], length(BinInfo) > 100000].

<0.157.0>
Guest
Posted: Fri Oct 23, 2009 12:55 am
I am not quite sure on the function evaluation order, but it might help to know that <0.159.0> is the disk_log process:

<0.159.0>
Guest
Posted: Fri Oct 23, 2009 6:55 am
Stephen,

Stephen Day wrote:
> I won't bore you with all the output, but I tracked down the binary
> usage to these two processes:
>
> [{Pid1, _Info, _Bin}, {Pid2, _Info2, _Bin2} | Other ] = [{P,
> process_info(P), BinInfo} || {P, {binary, BinInfo}} <- [{P,
> process_info(P, binary)} || P <- processes()], length(BinInfo) > 100000].
>
> <0.157.0> gen:init_it/6 1682835 1873131 0
> gen_server2:process_next_msg/8

Do you know what that process is? process_info(P) may have some clues to
figure that out.

> {<0.157.0>,825951032},
> {<0.158.0>,602886872},
> {<0.159.0>,345002144},

Right, so that's some mystery process (probably a queue), the persister
and disk_log. We may have to force a more aggressive gc strategy on the
latter two.


Regards,

Matthias.

Guest
Posted: Thu Nov 05, 2009 9:57 pm
It's been a while since I brought this up, but I made a small patch to the memory supervisor that has fixed most of the memory usage problems. Basically, I just force a garbage collect before checking the memory alarm condition:

diff -r 7b0512cdf3bc src/vm_memory_monitor.erl
--- a/src/vm_memory_monitor.erl
Guest
Posted: Fri Nov 06, 2009 10:07 am
Hi Stephen,

Thanks for the patch, and for digging around enough to come up with a
solution.

On Thu, Nov 05, 2009 at 01:57:31PM -0800, Stephen Day wrote:
> Indeed, this is a bit heinous, but it gets the job done. Unfortunately, I
> don't have the appropriate bug id so I didn't create an hg branch for you to
> pull from.

That's fine. I have to say that it's unlikely this patch will make it
through - the memory management code has gone through a lot of change
recently as we're getting a much better handle on resource management.
Whilst you've obviously been working from the head of our default branch
(many thanks!), there are a couple of issues with garbage collecting
every process like that, for example, it's possible that garbage
collecting vast numbers of processes will take longer than the
memory_check_interval, making messages queue up for the memory manager
process. This would become a problem if the garbage collection is unable
to reclaim any memory at all - eg millions of queues, all of which are
empty.

> As far as overall system effects go, I haven't noticed any (aside from the
> lack of crashes). We have been running this in production for a bit and
> haven't seen any large problems, although the application is low throughput.
> Are there any performance unit tests that I can run to check this?

Yeah, when you garbage collect a process it stops the process. Also, I
*believe* that Erlang uses a generational garbage collector. Normally,
it'll most likely only sweep the young generation, which should be
quick, but I suspect that manually calling garbage_collect will do a
full sweep of all generations, thus potentially taking longer. You may
find that this causes performance to dip.

We tend to measure using the java client. If you get that, and then ant
dist, and then cd build/dist, then start up rabbit and try:

sh runjava.sh com.rabbitmq.examples.MulticastMain -r 20000 -s 0 -a

On my machine, I can bump that 20000 to about 25000 and the sending
rates and receiving rates are about equal (i.e. the queue length doesn't
grow too much). Obviously your hardware may be different, but I suspect
that garbage collection may have a performance impact, obviously
depending on how often it's done. With the default memory_check_interval
of 1 sec, my guess is that it'd be noticeable.

Much better resource management is on its way. However, if your patch
works for you then obviously, please use it.

Matthew

Guest
Posted: Fri Nov 06, 2009 10:44 am
On Fri, Nov 06, 2009 at 10:06:20AM +0000, Matthew Sackman wrote:
> That's fine. I have to say that it's unlikely this patch will make it
> through - the memory management code has gone through a lot of change
> recently as we're getting a much better handle on resource management.
> Whilst you've obviously been working from the head of our default branch
> (many thanks!), there are a couple of issues with garbage collecting
> every process like that, for example, it's possible that garbage
> collecting vast numbers of processes will take longer than the
> memory_check_interval, making messages queue up for the memory manager
> process. This would become a problem if the garbage collection is unable
> to reclaim any memory at all - eg millions of queues, all of which are
> empty.

Some immediate ideas to improve this a little.
1) Only do the GC when you initially hit the memory alarm. I.e. in the
first case when going from non-alarmed to alarmed, put the gc in there,
then maybe recurse again (though you'll likely want another param on the
function to stop infinite recursion).

2) Only put GC in processes that are known to eat lots of RAM. Eg if
it's the persister, then putting in a manual GC right after it does a
snapshot is probably a good idea.
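Idea 1 could look roughly like the sketch below. It is not the vm_memory_monitor code: the module alarm_gc_sketch, the function check/3 and its arguments are all invented for illustration, and the memory reading is passed in as a fun so the logic can be tried in isolation. The extra boolean parameter is the guard against infinite recursion mentioned above: the full GC runs at most once per check, and only when the state is about to go from non-alarmed to alarmed.

```erlang
-module(alarm_gc_sketch).
-export([check/3]).

%% Hypothetical sketch. Alarmed is the current alarm state, Limit the
%% threshold in bytes, MemFun a zero-arity fun returning memory in use.
%% Returns the new alarm state (true | false).
check(Alarmed, Limit, MemFun) ->
    check(Alarmed, Limit, MemFun, true).

check(Alarmed, Limit, MemFun, MayGc) ->
    Over = MemFun() > Limit,
    case {Alarmed, Over, MayGc} of
        {false, true, true} ->
            %% About to raise the alarm: collect everything once, then
            %% re-check with MayGc = false so this cannot recurse again.
            lists:foreach(fun erlang:garbage_collect/1, erlang:processes()),
            check(Alarmed, Limit, MemFun, false);
        _ ->
            Over
    end.
```

In a real monitor MemFun would be something like fun() -> erlang:memory(total) end. While the alarm is already set, no GC is forced, which avoids paying for a full sweep on every memory_check_interval tick.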

Matthew

Guest
Posted: Fri Nov 06, 2009 8:58 pm
On Fri, Nov 6, 2009 at 2:43 AM, Matthew Sackman <matthew@lshift.net> wrote:
> On Fri, Nov 06, 2009 at 10:06:20AM +0000, Matthew Sackman wrote:
> > That's fine. I have to say that it's unlikely this patch will make it
> > through - the memory management code has gone through a lot of change
> > recently as we're getting a much better handle on resource management.

Agreed. This is definitely a workaround rather than a fix for the problem. In the interest of full disclosure, I have gotten rabbitmq to crash with this patch for the same reason, by getting the memory to spike before the excess can be collected, so this isn't a full fix by any means. I will try to dig further into the root cause in the 1.7.0 release when I have time.