Erlang/OTP Forums

<  RabbitMQ mailing list  ~  rabbit disk_mode branch eating up all RAM, including swap, d

Guest
Posted: Sun Oct 04, 2009 1:04 pm
Hi, we're using
Guest
Posted: Mon Oct 05, 2009 10:05 am
Hi Brian,

On Sun, Oct 04, 2009 at 09:03:28AM -0400, Brian Whitman wrote:
> Hi, we're using 184cb96f7846+ (bug20980) and our host alerted us that rabbit
> was eating up all available swap on a 16GB real + 8GB swap machine.

20980 stopped getting development work some time ago. As per my recent
email to the list, development work is currently focussed on moving away
from using mnesia. I wouldn't really recommend using 20980 any more -
the branches that it grew into which then went through QA did catch some
bugs which will be present in 20980.

>
> """
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ SWAP COMMAND
> 18445 rabbitmq 18 0 24.7g 14g 1696 S 1087.1 91.7 2268:18 10g beam.smp
>
> In an effort to prevent kernel panic, we restarted the rabbitmq service,
> freeing up a considerable amount of swap:
>
> However, the rabbitmq server is not starting again as expected, due to the
> following exception:
>
> 2009-10-04 06:26:29.797201500 {"init terminating in
> do_boot",{{nocatch,{error,{cannot_start_application,rabbit,{{timeout_waiting_for_tables,[rabbit_disk_queue]},{rabbit,start,[normal,[]]}}}}},[{init,start_it,1},{init,start_em,1}]}}
> """

Hmm, interesting. Is this in a clustered setup, or unclustered?

> They had to delete the mnesia folder (losing all our disk-backed queues) and
> restart, now it's fine. I would guess that this breakage coincided with us
> storing quite a large number of unacked messages in the queues (job
> instructions for a very large batch)

How many messages did you have in there, and do you know the average
size?

> a) Would upgrading this branch fix this? We were avoiding doing so because
> things were relatively stable.

The work in 20980 went onto other branches but has not gone onto default
yet because of issues we uncovered in the persister design. Thus the
default branch has the same persister as in v1.6. You would probably be
better off using branch bug21444, which has had the benefit of a lot of
QA attention and bug fixes. That said, all the usual warnings about
using unreleased code do apply.

> b) is there anything else I can look at to debug? The logs don't have
> anything of importance.

Not really - the clustering code was wrong in 20980 for a long time, so
if you were in a clustered setup, I'd blame that - and the error that
you've got would support that too. However, if you just had billions of
messages in there, then even in a non-clustered setup, I could believe
mnesia would be taking a long time to start up and that might cause the
above error.

Best wishes,

Matthew

_______________________________________________
rabbitmq-discuss mailing list
rabbitmq-discuss@lists.rabbitmq.com
http://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss
Post received from mailinglist
Guest
Posted: Mon Oct 05, 2009 6:30 pm
Quote:
Hmm, interesting. Is this in a clustered setup, or unclustered?
Guest
Posted: Mon Oct 05, 2009 8:15 pm
> The work in 20980 went onto other branches but has not gone onto default
> yet because of issues we uncovered in the persister design. Thus the
> default branch has the same persister as in v1.6. You would probably be
> better off using branch bug21444, which has had the benefit of a lot of
> QA attention and bug fixes. That said, all the usual warnings about
> using unreleased code do apply.

In your current workflow, is bug21444 basically the stable
accumulation of changes from 21368? It looks like 21444 is mostly
merges from 21368 and default, but sometimes the commits indicate that
21444 gets its own development as well. 21368 has a ton of
informative commit messages; it looks more active and less stable. Is
that accurate?

Guest
Posted: Tue Oct 06, 2009 10:01 am
On Mon, Oct 05, 2009 at 02:23:14PM -0400, Brian Whitman wrote:
> > How many messages did you have in there, and do you know the average
> > size?
> >
>
> Can't know for sure but my guess is about 20 queues, about 500K messages in
> each queue, message sizes are about 1KB each.

Hmm, 10 million rows in mnesia should be fine, but I can definitely
believe that mnesia would take more than 30 seconds to start up, especially if
this is on EC2 which is known to have disk bandwidth issues (correct me
if I'm wrong there). It's very possible that nothing was wrong at all -
it just timed out waiting for mnesia to start up and load in all the
tables, which would in turn stop Rabbit from starting up.
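
To put rough numbers on the estimate above, here is a minimal back-of-the-envelope sketch in Python. It uses only the figures quoted in this thread (about 20 queues, about 500K messages each, about 1KB per message, a 30 second startup timeout). Treating 1KB as 1024 bytes, ignoring any mnesia per-row overhead, and assuming all of the data has to be read from disk during startup are assumptions made for illustration, not measured facts.

```python
# Rough sizing of the workload described in the thread.
# All constants come from the quoted guesses; none are measurements.
QUEUES = 20
MESSAGES_PER_QUEUE = 500_000
MESSAGE_BYTES = 1024          # assumption: "about 1KB" taken as 1024 bytes
STARTUP_TIMEOUT_S = 30        # the table-load timeout mentioned in the thread


def estimate(queues=QUEUES, per_queue=MESSAGES_PER_QUEUE,
             message_bytes=MESSAGE_BYTES, timeout_s=STARTUP_TIMEOUT_S):
    """Return (total messages, payload bytes, bytes/s needed to load in time)."""
    total_messages = queues * per_queue
    payload_bytes = total_messages * message_bytes
    required_bytes_per_s = payload_bytes / timeout_s
    return total_messages, payload_bytes, required_bytes_per_s


if __name__ == "__main__":
    total, payload, rate = estimate()
    print(f"messages:      {total:,}")
    print(f"payload:       {payload / 1e9:.2f} GB")
    print(f"required read: {rate / 1e6:.0f} MB/s to finish within the timeout")
```

Under these assumptions the payload alone is on the order of 10 GB, so loading it inside the timeout would need sustained reads of several hundred MB/s, which is consistent with the suspicion that the startup simply timed out.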

> Does bug21444 have the disk mode auto-pinning stuff?
>
> I understand about the warnings, the stable branch though would crash almost
> immediately with our message load.

Yes. Basically, a lot of the various features of 20980 got split out
into different branches. The main bulk of the work, in terms of the new
persister was in 21368, but that lacked the manual controls for pinning
queues to disk - everything it did was automatic only. 21368 then went
through a lot of QA - the code is in good shape (although flawed due to
issues with mnesia), and 21444 adds to 21368 the manual controls.

Matthew

Guest
Posted: Tue Oct 06, 2009 10:05 am
On Mon, Oct 05, 2009 at 03:15:00PM -0500, tsuraan wrote:
> In your current workflow, is bug21444 basically the stable
> accumulation of changes from 21368? It looks like 21444 is mostly
> merges from 21368 and default, but sometimes the commits indicate that
> 21444 gets its own development as well. 21368 has a ton of
> informative commit messages; it looks more active and less stable. Is
> that accurate?

No, that's not really accurate. It's difficult to say whether a branch
is "stable" or not. Our basic method is to develop on a branch, then the
branch will go through QA which may well result in further changes to
that branch - that may very well create a cluster of small commits over
the code base as bugs are tracked down or refactorings occur. But at
this point, the code is getting more stable and improving in quality.

21368 is the basic new persister - v2. This is being grown on a
different branch into an altered design (v3) which does not use mnesia
for reasons I've outlined before. 21444 is a set of manual controls that
extend 21368. Thus from time to time I merge 21368 into 21444. All
long-standing branches get default merged into them from time to time so that
recent fixes which have been merged into default get propagated out to
branches. This also helps to ensure infrastructure remains working
across all branches, and it makes the eventual merge back into default
easier too.

Matthew

Guest
Posted: Tue Oct 06, 2009 3:32 pm
> No, that's not really accurate. It's difficult to say whether a branch
> is "stable" or not. Our basic method is to develop on a branch, then the
> branch will go through QA which may well result in further changes to
> that branch - that may very well create a cluster of small commits over
> the code base as bugs are tracked down or refactorings occur. But at
> this point, the code is getting more stable and improving in quality.
>
> 21368 is the basic new persister - v2. This is being grown on a
> different branch into an altered design (v3) which does not use mnesia
> for reasons I've outlined before. 21444 is a set of manual controls that
> extend 21368. Thus from time to time I merge 21368 into 21444. All
> long-standing branches get default merged into them from time to time so that
> recent fixes which have been merged into default get propagated out to
> branches. This also helps to ensure infrastructure remains working
> across all branches, and it makes the eventual merge back into default
> easier too.

Ok, thanks for the info. Would it be possible to have a page where
the most active bug numbers are explained, or a blog entry or two on
lshift that explains them? You guys are really responsive on this
list, but sometimes it would be cool to be able to find information
without just asking you :)

Guest
Posted: Tue Oct 06, 2009 4:41 pm
On Tue, Oct 06, 2009 at 10:31:06AM -0500, tsuraan wrote:
> Ok, thanks for the info. Would it be possible to have a page where
> the most active bug numbers are explained, or a blog entry or two on
> lshift that explains them? You guys are really responsive on this
> list, but sometimes it would be cool to be able to find information
> without just asking you :)

The short answer is no. The reason is that as a rule, we don't really
like advertising what is on each branch because we would prefer for
people to stick with official releases, or the default branch, unless
they have very good reasons. As soon as we start putting lists up of
what is in what branch, people will use those branches, which makes life
much much harder for us, in terms of offering meaningful support. People
may then be tempted to start merging branches together to get the mix of
features they want... this would rapidly descend into disaster.

Any branch that has been merged into default is considered QA'd and
finished. There are very few long running branches outside of default -
the new persister work, and the AMQP 0.9.1 work are two obvious
examples.

Matthew

Guest
Posted: Thu Oct 08, 2009 11:21 am
Hi Brian,

On Mon, Oct 05, 2009 at 11:04:21AM +0100, Matthew Sackman wrote:
> However, if you just had billions of
> messages in there, then even in a non-clustered setup, I could believe
> mnesia would be taking a long time to start up and that might cause the
> above error.

I've just come across this thread:
http://www.trapexit.org/forum/viewtopic.php?p=44433
which does indeed seem to support the idea that even modest mnesia
databases can take rather more than the 30 seconds we give them to
start up. Given that I would imagine this could happen even in v1.7
with, say, a couple million durable queues, exchanges and bindings,
this is certainly something we will look at as the 30 second timeout
may well be grossly too short.
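
As an illustration of the failure mode discussed here, the following is a minimal Python sketch of waiting for tables with a configurable deadline. It is not RabbitMQ's or mnesia's actual code: `wait_for_tables`, `is_loaded`, and every parameter are hypothetical names chosen for this example. The point is only that a fixed deadline reports a timeout for whatever is still loading, even when nothing is actually broken.

```python
import time


def wait_for_tables(tables, is_loaded, timeout_s=30.0, poll_s=0.01,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll until every table is loaded or the deadline passes.

    Returns ("ok", []) on success, or ("timeout", pending) listing the
    tables that were still loading when the deadline was reached.
    `is_loaded` is a hypothetical callback: table name -> bool.
    """
    deadline = clock() + timeout_s
    pending = list(tables)
    while True:
        # Drop tables that have finished loading since the last poll.
        pending = [t for t in pending if not is_loaded(t)]
        if not pending:
            return ("ok", [])
        if clock() >= deadline:
            return ("timeout", pending)
        sleep(poll_s)


if __name__ == "__main__":
    # A table that never finishes loading within a very short deadline.
    print(wait_for_tables(["rabbit_disk_queue"], lambda t: False,
                          timeout_s=0.05))
```

With a slow disk or a large table, raising `timeout_s` (or making it configurable) changes the outcome from a startup failure to a slow but successful start, which is the trade-off being considered above.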

Matthew

Guest
Posted: Fri Oct 09, 2009 8:08 pm
Hi Matthew, all, I had a couple people here test out this new branch vs. our old branch on a local VM.

All times are GMT