Erlang/OTP Forums


Open Telecom Platform (OTP) ~ Supervisor trees = single point of failure?

oogabooga
Posted: Mon Aug 06, 2007 12:31 am
Joined: 06 Aug 2007 Posts: 2
Hi all,

I'm new to this forum and relatively new to Erlang. The learning curve for the language has actually been fairly reasonable for me -- I've got prior experience with other functional languages and with Prolog, which made the syntactic and semantic pills easier to swallow -- but I'm finding OTP to be a tougher nut to crack. Even though I've been able to get a skeletal OTP application up and running, some of the bigger-picture design issues still elude me.

I understand that supervisor trees, along with the supervisor behavior in OTP, provide a mechanism for building fault-tolerant systems. It's clear to me that, through linking, one process will detect the death of another and restart it. Great stuff.

But the question is this: Doesn't a supervisor tree, with a single root, imply that there's a single point of failure for the whole system? If the system is running on a single node and the root process in the supervisor tree dies, doesn't that bring the whole system down?

In the context of distributed Erlang, what if the root of my supervisor tree is on one node and that node crashes? Doesn't that adversely affect the other nodes? Or is the trick to loosely couple the nodes and not have a single supervisor?

I'm really appreciating Erlang; it's opened my mind in ways that it hasn't been opened in years. But understanding this point would really help me take a big step forward, I think.

Thanks for your time.
anderst
Posted: Mon Aug 06, 2007 5:36 pm
User Joined: 21 Nov 2006 Posts: 37
Quote:
but I'm finding OTP to be a tougher nut to crack. Even though I've been able to get a skeletal OTP application up and running, some of the bigger-picture design issues still elude me.


Erlang is easy to learn on your own, but the consensus is that OTP is not. If you are going to use it in industrial applications, you need to attend a course. I attended one myself a few years ago and realized that I would never have absorbed that much on my own, and certainly not in five days. If you have a generous employer, I would recommend it.

Quote:
But the question is this: Doesn't a supervisor tree, with a single root, imply that there's a single point of failure for the whole system? If the system is running on a single node and the root process in the supervisor tree dies, doesn't that bring the whole system down?


No, that is not how it works. You would only have a single point of failure if you had just one set of machines. Instead, try to picture a pool of two identical nodes (to keep it simple), each running a web server on a separate machine. In front of these machines, you have a load balancer (Cisco / Alteon). These front-end nodes are connected to two back-end nodes with a duplicated database. If a web server node goes down, the Alteon will detect it (its heartbeat fails) and will forward the request, and all subsequent requests, to the node that is still up. Something similar would happen if a back-end node went down.

Quote:
In the context of distributed Erlang, what if the root of my supervisor tree is on one node and that node crashes? Doesn't that adversely affect the other nodes? Or is the trick to loosely couple the nodes and not have a single supervisor?


In distributed Erlang, you would have one supervisor tree for each node, possibly with duplicated processes. You would pick your strategy from there: either a round-robin or hashing approach, or a primary node with a standby.
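
As a rough illustration of "one supervisor tree for each node", here is a minimal sketch of a top-level supervisor that each node could start locally. The module name node_sup, the child name demo_worker, and the restart limits (5 restarts in 10 seconds) are made up for this example; the worker is a trivial loop so the module stands on its own, where a real system would supervise gen_server processes.

```erlang
-module(node_sup).
-behaviour(supervisor).

-export([start_link/0, init/1, start_worker/0]).

%% Start this node's top-level supervisor and register it locally.
start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

%% one_for_one: when a child dies, only that child is restarted.
%% If more than 5 restarts occur within 10 seconds, the supervisor
%% gives up and terminates itself (values chosen arbitrarily here).
init([]) ->
    RestartStrategy = {one_for_one, 5, 10},
    Worker = {demo_worker,                      % child id (hypothetical)
              {?MODULE, start_worker, []},      % {Module, Function, Args}
              permanent,                        % always restart
              5000,                             % shutdown timeout in ms
              worker,
              [?MODULE]},
    {ok, {RestartStrategy, [Worker]}}.

%% Hypothetical worker: a linked process that just waits for messages.
start_worker() ->
    Pid = spawn_link(fun worker_loop/0),
    true = register(demo_worker, Pid),
    {ok, Pid}.

worker_loop() ->
    receive
        _Msg -> worker_loop()
    end.
```

After node_sup:start_link(), killing the worker with exit(whereis(demo_worker), kill) should cause the supervisor to start a fresh one under the same name. Note that once the restart limit is exceeded the supervisor itself terminates, so within a single node the failure does propagate upward; that is why the redundancy has to come from running a tree like this on more than one node.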

The ultimate trick is to create a pool of loosely coupled nodes where you can add and remove nodes on the fly. Together with backup power supplies, redundant networks, and software upgrades at runtime, you get systems with 99.99999% availability (roughly 3 seconds of downtime per year!).
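
To make the availability figure concrete, the arithmetic can be checked with a few lines of Erlang. This is only a sketch: the module name availability and the function name are invented for the example, and it assumes a 365-day year.

```erlang
-module(availability).
-export([downtime_seconds_per_year/1]).

%% Seconds in a non-leap year (assumption: 365 days).
-define(SECONDS_PER_YEAR, (365 * 24 * 60 * 60)).

%% Expected downtime in seconds per year for an availability
%% given as a percentage, e.g. 99.99999.
downtime_seconds_per_year(AvailabilityPercent) ->
    (1 - AvailabilityPercent / 100) * ?SECONDS_PER_YEAR.
```

Calling availability:downtime_seconds_per_year(99.99999) gives a little over 3 seconds per year. Downtime in the tens of milliseconds per year would take nine nines (99.9999999%), which works out to about 0.03 seconds.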

The bottom line is that writing Erlang is easy. Writing distributed, massively concurrent real-time systems with demands for high availability and no downtime is not. That is where OTP comes in.

Regards,
Anders
oogabooga
Posted: Mon Aug 06, 2007 6:27 pm
Joined: 06 Aug 2007 Posts: 2
Quote:
If you have a generous employer, I would recommend it.


Unfortunately, my employer, while generous, isn't in the Erlang business at all. Erlang is for my own side project, so I'll need to have a go at this on my own.

Thanks for the response. It made a lot of sense and helped a lot.
anderst
Posted: Mon Aug 06, 2007 7:15 pm
User Joined: 21 Nov 2006 Posts: 37
You will find that the person who taught me OTP back in 2003 is often on this forum helping out and replying to questions, as is another of his trainers (they run this site). Post your questions and someone will certainly reply.

Regards,
Anders

All times are GMT