Y. Chawathe and E. Brewer
University of California at Berkeley, USA
Presented by Yatin Chawathe
The motivation for the work was given as proliferation of network based
services. As this took place some critical issues emerged: incremental and
linear system scalability, availability and fault tolerance. A reuseable SNS
framework was described. Clusters of w/s are ideal for internet services but
clusters are difficult to manage because services must distribute load
across the cluster. The cluster service must grow the cluster with increasing
load. This was illustrated with a diagram.
In the SNS architecture, workers are grouped into classes. Workers can
recieve tasks from the outside world or from other workers. A task originator
sends a task to the consumer specifying the class and inputs to the tasks. It
is assumed that tasks are short lived, atomic and restartable. Worker driver
provide the interface between the SNS substrate and the worker applicaction.
The SNS manager is a seperate entity which is intentionally centralised
because it makes it easier to reason about and implement. However, it is
necessary to ensure is fault tolerant and not a bottleneck. The SNS manager
has three key functions: resource allocation, fault tolerance and
scalability.
Load balancing is achieved by each worker keeping track of local load.
Workers periodically report their current load to the SNS manager. The
manager piggybacks load reports on its beacons to the rest of the system.
Each worker tries to perform local load balancing decisions using lottery
scheduling with tickets. Auto launch scalability involves worker replication
to handle short bursts so requests are handled in parallel. If the load on
classes of workers gets too high, the SNS manager launches a new one. There
is an overflow pool for long bursts on a non-dedicated set of machines (eg
desktop machines) so that when all dedicated nodes are exhausted this
overflow pool can be used.
Fault Tolerance is achieved through the use of the Starfish model: if an
arm is chopped off, the rest of system grows back to same size. Peer
monitoring is used to monitor performance; no client host monitoring is used.
If the SNS manager dies each worker starts a timer. The first to time out
launches a new SNS manager and rebuilds the list of services. Beaconing
ensures synchronisation and also that multiple SNS managers do not exist.
There are several example applications of the system, including: Transend,
Wingman, Media Pad and MARS. This approach presents a reusable architecture
for building internet service applictions. Application developers program
services to well defined, narrow interfaces. SNS takes care of resource
location, load balancing and fault tolerance and a number of interesting
applications have been developed using it.
One quick question from the floor asked what functionality was lost due
to a crash? Yatin answered that clients experience loss of service for short
periods. For example, a client gets lost if the worker goes away when using
current protocols.