| Author |
Message |
|
| Guest |
Posted: Wed Sep 16, 2009 5:05 pm |
|
|
|
Guest
|
As RabbitMQ is running, I sometimes get messages saying "Erlang has
closed". These don't seem to be associated with any actual failure or
shutdown; a single long-running rabbit will sometimes have a dozen or
more "Erlang has closed" messages on its standard output. Rabbit is
still running fine, and I don't see any problems, but the message is a
bit strange. I also see, in my logs, this triplet of messages:
=INFO REPORT==== 16-Sep-2009::16:45:48 ===
application: rabbit
exited: stopped
type: temporary
=INFO REPORT==== 16-Sep-2009::16:45:48 ===
application: mnesia
exited: stopped
type: temporary
=INFO REPORT==== 16-Sep-2009::16:45:48 ===
application: os_mon
exited: stopped
type: temporary
I don't think those are associated with any interruption either, but
maybe the system is just recovering from that. Anyhow, some sort of
advice on what these mean, and why I don't need to worry about them
would be great
_______________________________________________
rabbitmq-discuss mailing list
rabbitmq-discuss@lists.rabbitmq.com
http://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss
Post received from mailinglist |
|
|
| Back to top |
|
| 0x6e6562 |
Posted: Thu Sep 17, 2009 10:16 am |
|
|
|
User
Joined: 12 Jul 2007
Posts: 250
|
Tsuraan,
On Wed, Sep 16, 2009 at 6:04 PM, tsuraan <tsuraan@gmail.com> wrote:
> As RabbitMQ is running, I sometimes get messages saying "Erlang has
> closed". |
|
|
| Back to top |
|
| Guest |
Posted: Thu Sep 17, 2009 1:44 pm |
|
|
|
Guest
|
> Is there anything in the sasl log?
Yeah. I've never seen anything in that file before, but there is today:
=CRASH REPORT==== 18-Sep-2009::21:08:20 ===
crasher:
pid: <0.26060.21>
registered_name: []
exception error: bad argument in an arithmetic expression
in function rabbit_load:pick/0
in call from rabbit_reader:compute_redirects/1
in call from rabbit_reader:handle_method0/2
in call from rabbit_reader:handle_method0/3
in call from rabbit_reader:handle_input/3
in call from rabbit_reader:mainloop/3
in call from rabbit_reader:start_connection/3
initial call: rabbit_reader:init(<0.171.0>)
ancestors: [rabbit_tcp_client_sup,rabbit_sup,<0.108.0>]
messages: []
links: [#Port<0.208888>,<0.171.0>]
dictionary: []
trap_exit: true
status: running
heap_size: 610
stack_size: 23
reductions: 699
neighbours:
=SUPERVISOR REPORT==== 18-Sep-2009::21:08:20 ===
Supervisor: {local,rabbit_tcp_client_sup}
Context: child_terminated
Reason: badarith
Offender: [{pid,<0.26060.21>},
{name,tcp_client},
{mfa,{rabbit_reader,start_link,[]}},
{restart_type,temporary},
{shutdown,brutal_kill},
{child_type,worker}]
This is 1.6.0, running on a badly overloaded machine with a degraded
RAID array Either way, I'm guessing that the badarith is actually
a bug of some sort?
_______________________________________________
rabbitmq-discuss mailing list
rabbitmq-discuss@lists.rabbitmq.com
http://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss
Post received from mailinglist |
|
|
| Back to top |
|
| 0x6e6562 |
Posted: Thu Sep 17, 2009 3:24 pm |
|
|
|
User
Joined: 12 Jul 2007
Posts: 250
|
Tsuraan,
On Thu, Sep 17, 2009 at 2:38 PM, tsuraan <tsuraan@gmail.com> wrote:
> |
|
|
| Back to top |
|
| Guest |
Posted: Thu Sep 17, 2009 3:44 pm |
|
|
|
Guest
|
> I've seen this symptom before, but the circumstances were a bit
> contrived - the way the output from uptime was being processed on OSX
> - see http://erlang.org/pipermail/erlang-questions/2008-May/034889.html.
>
> However, I'm not convinced that this has anything to do with what you
> are seeing.
>
> Can you provide us with some more details of the environment? Is it
> possible to isolate the code that can reproduce this?
It's Rabbit 1.6.0 on linux, 2.6.27 kernel. The machine has a degraded
(rebuilding) RAID5 array, so its disk IO is stuck at a max of around
6MB/s. At the time of the errors, the system load was between 8 and
15, for an 8-core machine with 4GB RAM. There's a chance that it was
also swapping, but I wasn't watching that. Its swap isn't empty, so
it certainly has swapped at some point.
The system was processing a new chunk of data, which took around 8
hours to complete. From the MQ point of view, there were probably a
few million messages enqueued (well, at least dispatched to consumers;
I realize that rabbit tries really hard not to actually "enqueue"
messages), but no more than 100,000 messages at any one time. I can
try simulating the message flows on a system with a really high load
to see if I can make the problem occur reliably, but we have rabbit on
a few dozen machines, and it doesn't look like it happens often.
I can give more details if you can think of anything more specific
that would be useful. I can't share our source code, but I'll try to
come up with a repeatable test case.
_______________________________________________
rabbitmq-discuss mailing list
rabbitmq-discuss@lists.rabbitmq.com
http://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss
Post received from mailinglist |
|
|
| Back to top |
|
| Guest |
Posted: Thu Sep 17, 2009 8:25 pm |
|
|
|
Guest
|
> I've seen this symptom before, but the circumstances were a bit
> contrived - the way the output from uptime was being processed on OSX
> - see http://erlang.org/pipermail/erlang-questions/2008-May/034889.html.
>
> However, I'm not convinced that this has anything to do with what you
> are seeing.
I think what's happening is that the os_mon:call (from cpu_sup:avg1())
is errorring out somehow, so its return value is { error, _ } instead
of a number. In the local_load function of rabbit_load.erl, where you
currently have
_Other -> cpu_sup:avg1()
it should probably be
_Other ->
case cpu_sup:avg1() of
{ error, _ } -> 0.0;
Value -> Value
end
I really wish erlang gave line numbers in its traces, but the only
arithmetic line in pick() is LoadAvg * ?FUDGE_FACTOR, and I don't
think the macro's value is changing
Also, avg1()'s man page says it returns a number, but its spec says an
integer or an { 'error', any() } pair. Its timeout is set at
infinity, so it's not a timeout, but it could be some other error.
I'll see if I can get that machine to bomb out some more with prints
for the return value of avg1().
_______________________________________________
rabbitmq-discuss mailing list
rabbitmq-discuss@lists.rabbitmq.com
http://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss
Post received from mailinglist |
|
|
| Back to top |
|
| Guest |
Posted: Fri Sep 18, 2009 12:09 am |
|
|
|
Guest
|
> I think what's happening is that the os_mon:call (from cpu_sup:avg1())
> is errorring out somehow, so its return value is { error, _ } instead
> of a number. In the local_load function of rabbit_load.erl, where you
> currently have
>
> _Other -> cpu_sup:avg1()
>
> it should probably be
>
> _Other ->
> case cpu_sup:avg1() of
> { error, _ } -> 0.0;
> Value -> Value
> end
>
> I really wish erlang gave line numbers in its traces, but the only
> arithmetic line in pick() is LoadAvg * ?FUDGE_FACTOR, and I don't
> think the macro's value is changing
>
> Also, avg1()'s man page says it returns a number, but its spec says an
> integer or an { 'error', any() } pair. Its timeout is set at
> infinity, so it's not a timeout, but it could be some other error.
> I'll see if I can get that machine to bomb out some more with prints
> for the return value of avg1().
>
I stand corrected. It is indeed a timeout; I used the above code, but
in the case of { error, X }, I have a io:format("avg1() gave the value
~p~n", [ X ]), and it's saying timeout. bizarre. Anyhow, assuming
that 0.0 is an acceptable way to handle that error, it looks like the
fix works for me.
_______________________________________________
rabbitmq-discuss mailing list
rabbitmq-discuss@lists.rabbitmq.com
http://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss
Post received from mailinglist |
|
|
| Back to top |
|
| Guest |
Posted: Fri Sep 18, 2009 12:18 am |
|
|
|
Guest
|
> I stand corrected. It is indeed a timeout; I used the above code, but
> in the case of { error, X }, I have a io:format("avg1() gave the value
> ~p~n", [ X ]), and it's saying timeout. bizarre. Anyhow, assuming
> that 0.0 is an acceptable way to handle that error, it looks like the
> fix works for me.
And, a final self-reply before I shut up, it looks like one of my
queues disappeared. I know this was an issue in 1.5.3, but I think it
was supposedly fixed in 1.5.5 and 1.6.0. I don't have any errors in
my rabbit log (other than errors about processes trying to use the
missing queue), my sasl log is empty, and I don't have an erlang crash
dump anywhere. I have no idea what happened to that queue, but it was
present in the past hour or so.
_______________________________________________
rabbitmq-discuss mailing list
rabbitmq-discuss@lists.rabbitmq.com
http://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss
Post received from mailinglist |
|
|
| Back to top |
|
| Guest |
Posted: Fri Sep 18, 2009 5:11 am |
|
|
|
Guest
|
Tsuraan,
tsuraan wrote:
> I think what's happening is that the os_mon:call (from cpu_sup:avg1())
> is errorring out somehow, so its return value is { error, _ } instead
> of a number. In the local_load function of rabbit_load.erl, where you
> currently have
>
> _Other -> cpu_sup:avg1()
>
> it should probably be
>
> _Other ->
> case cpu_sup:avg1() of
> { error, _ } -> 0.0;
> Value -> Value
> end
Well spotted. It should be 'infinity' instead of 0.0 though, and we
shouldn't ignore the error if it is something other than timeout.
I have committed a fix which, subject to qa, will make it into the next
release.
(Btw, the reason for the timeout is that even though the cpu_sup
gen_server:call has a timeout of infinity, internally there is a
measurement_server_call with a timeout of 5s, which presumably gets
triggered under heavy load).
Regards,
Matthias.
_______________________________________________
rabbitmq-discuss mailing list
rabbitmq-discuss@lists.rabbitmq.com
http://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss
Post received from mailinglist |
|
|
| Back to top |
|
| Guest |
Posted: Fri Sep 18, 2009 12:35 pm |
|
|
|
Guest
|
Tsuraan,
tsuraan wrote:
> As RabbitMQ is running, I sometimes get messages saying "Erlang has
> closed". These don't seem to be associated with any actual failure or
> shutdown; a single long-running rabbit will sometimes have a dozen or
> more "Erlang has closed" messages on its standard output.
AFAIK that message is printed when an external program, such as that
used for memory reporting, terminates. Unless you see any other ill
effects it should be harmless.
> I also see, in my logs, this triplet of messages:
>
> =INFO REPORT==== 16-Sep-2009::16:45:48 ===
> application: rabbit
> exited: stopped
> type: temporary
>
> =INFO REPORT==== 16-Sep-2009::16:45:48 ===
> application: mnesia
> exited: stopped
> type: temporary
>
> =INFO REPORT==== 16-Sep-2009::16:45:48 ===
> application: os_mon
> exited: stopped
> type: temporary
>
> I don't think those are associated with any interruption either
That sequence usually indicates that rabbit was stopped, e.g. with
rabbitmqctl stop/stop_and_halt, rabbitmq-multi stop_all, or the various
scripts in the os-specific packages.
Regards,
Matthias.
_______________________________________________
rabbitmq-discuss mailing list
rabbitmq-discuss@lists.rabbitmq.com
http://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss
Post received from mailinglist |
|
|
| Back to top |
|
| Guest |
Posted: Fri Sep 18, 2009 12:43 pm |
|
|
|
Guest
|
Tsuraan,
tsuraan wrote:
> And, a final self-reply before I shut up, it looks like one of my
> queues disappeared. I know this was an issue in 1.5.3, but I think
> it was supposedly fixed in 1.5.5 and 1.6.0.
I am not aware of any problem in that area that got fixed in those
releases. Have you got a reference?
> I don't have any errors in my rabbit log (other than errors about
> processes trying to use the missing queue), my sasl log is empty, and
> I don't have an erlang crash dump anywhere.
In that case the only way for the queue to have disappeared is as part
of auto-deletion due to the 'exclusive' or 'auto-delete' flag being set,
or explicit deletion with 'queue.delete'. Unless you are running in a
cluster, in which case queues can disappear if the node they reside on
disappears.
Regards,
Matthias.
_______________________________________________
rabbitmq-discuss mailing list
rabbitmq-discuss@lists.rabbitmq.com
http://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss
Post received from mailinglist |
|
|
| Back to top |
|
| Guest |
Posted: Fri Sep 18, 2009 2:51 pm |
|
|
|
Guest
|
> I am not aware of any problem in that area that got fixed in those
> releases. Have you got a reference?
http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/2009-July/004033.html
was my queue_disappeared issue. I'm not getting the error message
anymore, but a persistent queue is gone.
> In that case the only way for the queue to have disappeared is as part
> of auto-deletion due to the 'exclusive' or 'auto-delete' flag being set,
> or explicit deletion with 'queue.delete'. Unless you are running in a
> cluster, in which case queues can disappear if the node they reside on
> disappears.
It's a single machine, and all the queues are persistent. There is a
place where queue_delete is called, but the queue is re-created
immediately afterwards (I saw something about queue.purge the other
day, but I haven't looked into it yet). If I do queue_delete and then
a queue_declare in a transaction, will that guarantee that the queue
doesn't disappear?
It looks like right now, if a rabbit_amqqueue_process is terminated,
it removes that queue from mnesia (terminate function of
rabbit_amqqueue_process calls rabbit_amqqueue:internal_delete).
Looking at gen_server2, terminate seems to get called in basically any
case of a process getting an error while handling a call, a supervisor
dying, the process itself dying, I'm not sure what else. Should all
those things cause the queue record to be removed from mnesia?
_______________________________________________
rabbitmq-discuss mailing list
rabbitmq-discuss@lists.rabbitmq.com
http://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss
Post received from mailinglist |
|
|
| Back to top |
|
| Guest |
Posted: Fri Sep 18, 2009 2:56 pm |
|
|
|
Guest
|
> Well spotted. It should be 'infinity' instead of 0.0 though, and we
> shouldn't ignore the error if it is something other than timeout.
For my system it was a timeout; as I guess you probably saw
>
> I have committed a fix which, subject to qa, will make it into the next
> release.
Is that 1.7, or a 1.6.1 release?
> (Btw, the reason for the timeout is that even though the cpu_sup
> gen_server:call has a timeout of infinity, internally there is a
> measurement_server_call with a timeout of 5s, which presumably gets
> triggered under heavy load).
Ok, that makes sense. This is probably the biggest wart I've found
with erlang; timeouts are a great way to avoid deadlock when I get
sloppy, but they also break down pretty badly when a machine is under
absurd load and swapping. I can't think of a better way to handle
things, but in my programs I've gotten pretty tired of things timing
out in weird edge cases where they should just be really slow.
_______________________________________________
rabbitmq-discuss mailing list
rabbitmq-discuss@lists.rabbitmq.com
http://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss
Post received from mailinglist |
|
|
| Back to top |
|
| Guest |
Posted: Fri Sep 18, 2009 3:35 pm |
|
|
|
Guest
|
Tsuraan,
tsuraan wrote:
>> I am not aware of any problem in that area that got fixed in those
>> releases. Have you got a reference?
>
> http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/2009-July/004033.html
> was my queue_disappeared issue.
We did fix some timeout related problems, as indicated in the above
thread. Queues weren't actually disappearing.
> It's a single machine, and all the queues are persistent.
Do you mean 'durable'?
> There is a place where queue_delete is called, but the queue is
> re-created immediately afterwards (I saw something about queue.purge
> the other day, but I haven't looked into it yet). If I do
> queue_delete and then a queue_declare in a transaction, will that
> guarantee that the queue doesn't disappear?
Transactions only span publishes and acks.
> It looks like right now, if a rabbit_amqqueue_process is terminated,
> it removes that queue from mnesia (terminate function of
> rabbit_amqqueue_process calls rabbit_amqqueue:internal_delete).
> Looking at gen_server2, terminate seems to get called in basically any
> case of a process getting an error while handling a call, a supervisor
> dying, the process itself dying, I'm not sure what else. Should all
> those things cause the queue record to be removed from mnesia?
Any abnormal termination of a queue process should result in a sasl log
entry. Since you are not seeing such log entries it is unlikely that
abnormal queue termination is a problem.
As to whether we should remove the queue record when the queue process
terminates abnormally ... we couldn't just leave the record in place
since then there would be no way to ever create a queue with the same
name or get rid of the record. We could restart the queue process, but
that presents some interesting race conditions as well as masking a
potentially serious error from clients (even more than is usually the
case). It's on our todo list to look into this area though.
Regards,
Matthias.
_______________________________________________
rabbitmq-discuss mailing list
rabbitmq-discuss@lists.rabbitmq.com
http://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss
Post received from mailinglist |
|
|
| Back to top |
|
| Guest |
Posted: Fri Sep 18, 2009 4:40 pm |
|
|
|
Guest
|
>> It's a single machine, and all the queues are persistent.
>
> Do you mean 'durable'?
Yeah, that.
> Transactions only span publishes and acks.
Ok, so it's possible that my program did delete the queue that it
needed and then it died before re-creating it. Those two things are
in adjacent lines, but it's still a possibility, especially on a
really heavily overloaded machine.
> Any abnormal termination of a queue process should result in a sasl log
> entry. Since you are not seeing such log entries it is unlikely that
> abnormal queue termination is a problem.
Is the sasl log in any way related to the simple authentication and
security layer? I had assumed it was logs for secured connections to
rabbit, but now I'm beginning to suspect it's something else...
> As to whether we should remove the queue record when the queue process
> terminates abnormally ... we couldn't just leave the record in place
> since then there would be no way to ever create a queue with the same
> name or get rid of the record. We could restart the queue process, but
> that presents some interesting race conditions as well as masking a
> potentially serious error from clients (even more than is usually the
> case). It's on our todo list to look into this area though.
Yeah, handling process failure would be complicated. It seems like
the information required to recover a queue's contents is available
(at least the persistent contents are on disk). I don't have any idea
how you would handle the transactions a queue is engaged in, unless
AMQP had a transaction rollback error like postgres does for
deadlocks. From my POV, losing the messages in a queue is a pain;
it's not a total catastrophe, but it would be cool if queue contents
could be recovered after a process crash.
_______________________________________________
rabbitmq-discuss mailing list
rabbitmq-discuss@lists.rabbitmq.com
http://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss
Post received from mailinglist |
|
|
| Back to top |
|
|
|
All times are GMT
Page 1 of 2
Goto page 1, 2 Next
|
|
|
You cannot post new topics in this forum You cannot reply to topics in this forum You cannot edit your posts in this forum You cannot delete your posts in this forum You cannot vote in polls in this forum You cannot attach files in this forum You cannot download files in this forum
|
|
|