ResiliNets:
Multilevel Resilient and Survivable Networking Initiative
David Hutchison
and
James P.G. Sterbenz
Lancaster University (UK)
and
The University of Kansas (US)
The resilient and survivable networking initiative (ResiliNets)
is investigating the architecture, protocols, and mechanisms to provide
resilient, survivable, and disruption-tolerant networks, services, and
applications.
Scope and Definition
Resilience is the ability of the network to provide and
maintain an acceptable level of service in the face of various
challenges to normal operation:
- unusual but legitimate traffic load (e.g. flash crowds)
- high mobility of nodes and subnetworks
- weak, asymmetric, and episodic connectivity of wireless channels
- unpredictably long delay paths either due to length (e.g. satellite)
or as a result of episodic connectivity
- attacks against the network hardware, software, or protocol
infrastructure (from recreational crackers, industrial espionage,
terrorism, or warfare)
- large-scale natural disasters (e.g. hurricanes, earthquakes, ice storms,
tsunami, floods)
- failures due to mis-configuration or operational errors
- natural faults of network components
Resilient networks aim to provide acceptable service to
applications:
- ability for users and applications to access information when needed, e.g.:
  - Web browsing
  - distributed database access
  - sensor monitoring
  - situational awareness
- maintenance of end-to-end communication association, e.g.:
  - computer-supported cooperative work
  - video conference
  - teleconference (including VoIP calls)
- operation of distributed processing and networked storage, e.g.:
  - ability for distributed processes to communicate with one another
  - ability for processes to read and write networked storage
Resilient network services must:
- remain accessible whenever possible
- degrade gracefully when necessary
- ensure correctness of operation, even if performance is degraded
- rapidly and automatically recover from degradation
Resilient networks are engineered and have emergent behaviour to:
- resist challenges to normal operation
- recognise when challenges and attacks occur and isolate their effects
- ensure resilience in the face of dependence on other infrastructure such as the power grid
- rapidly and autonomically recover to normal operation
- refine future behaviour to better resist, recognise, and recover
Note that while attack detection is an important endeavor, it is in
some sense futile, since a sufficiently sophisticated distributed
denial of service attack is indistinguishable from legitimate
traffic. Thus traffic anomaly detection that attempts to detect
and resist DDoS attacks simply raises, incrementally, the bar over which
crackers must pass. Since both cases (attack and legitimate load) adversely
affect servers and cross traffic, and exhaust network resources, our goal
is resilience regardless of whether or not an attack is occurring.
We are exploiting new architectures, algorithms, and protocols, as well
as techniques in programmable, active, and cognitive networking to
achieve these goals. Three key themes are knobs-and-dials, adaptive
composable protocol mechanisms, and intelligent resource tradeoffs.
- Knobs and dials provide influence downward and
instrumentation upward, respectively, between the layers in the form
of vertical control loops. Thus, we believe in the benefits of
layers as applied to network structure and role (physical/link:
hop-by-hop, network: path, transport: end-to-end, application),
but in softening the boundaries and providing cross-layer
optimisations. Knobs and dials are also necessary between the
data, control, and management planes.
- Context-aware, adaptive and composable protocol
mechanisms understand the current environment, the characteristics
below (via cross-layer dials), and apply the appropriate
mechanisms to achieve resilience and survivability at each
protocol layer. It is essential to keep mechanisms logically
distinct for correct operation, for example discrimination of
congestion (throttle), corruption (retransmit), and delay (wait).
- Resource tradeoffs consist of properly understanding
and trading resources (and constraints) against one another.
These resources include processing, memory, bandwidth, energy, and latency.
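As an illustration of keeping mechanisms logically distinct, the following minimal Python sketch maps a diagnosed cause of a missing acknowledgement to its own response: congestion (throttle), corruption (retransmit), or delay (wait). The `LinkDials` fields, the thresholds, and the function names are hypothetical and chosen only for this example; they do not describe any existing ResiliNets protocol.

```python
from dataclasses import dataclass
from enum import Enum


class Cause(Enum):
    CONGESTION = "congestion"
    CORRUPTION = "corruption"
    DELAY = "delay"


class Action(Enum):
    THROTTLE = "throttle"
    RETRANSMIT = "retransmit"
    WAIT = "wait"


@dataclass
class LinkDials:
    """Hypothetical cross-layer dials exposed upward by the lower layers."""
    queue_occupancy: float  # fraction of buffer in use, 0.0 to 1.0
    bit_error_rate: float   # measured channel bit error rate
    link_up: bool           # False while connectivity is absent


# Illustrative thresholds (assumptions, not measured values).
QUEUE_THRESHOLD = 0.8
BER_THRESHOLD = 1e-5

# Each cause has exactly one response, so the mechanisms stay distinct.
RESPONSE = {
    Cause.CONGESTION: Action.THROTTLE,
    Cause.CORRUPTION: Action.RETRANSMIT,
    Cause.DELAY: Action.WAIT,
}


def diagnose(dials: LinkDials) -> Cause:
    """Guess why an acknowledgement is missing, using only the dials."""
    if not dials.link_up:
        return Cause.DELAY
    if dials.queue_occupancy >= QUEUE_THRESHOLD:
        return Cause.CONGESTION
    if dials.bit_error_rate > BER_THRESHOLD:
        return Cause.CORRUPTION
    # Conservative default when the dials give no clear signal.
    return Cause.CONGESTION


def respond(dials: LinkDials) -> Action:
    """Select the response that matches the diagnosed cause."""
    return RESPONSE[diagnose(dials)]
```

In this sketch the dials carry information upward and the selected action is what a transport mechanism would apply; a real design would need measured thresholds and a way to handle several causes occurring at once.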
Relationship of resilience to survivability and disruption tolerance
The primary difference between our definition of resilience vs.
survivability and disruption tolerance is that resilient networks are
engineered to tolerate legitimate but unpredictably high traffic
loads (such as flash crowds), while maximising the service provided to
other users of the network, as well as being resistant to attack.
Survivability is the capability of a system to fulfill its
mission in a timely manner, even in the presence of attacks or
failures [CMU SEI], including large-scale natural disasters.
Disruption tolerance is the ability for end-to-end
applications to operate even when network connectivity is not strong
(weak, episodic, or asymmetric) and the network is unable to provide
stable end-to-end paths.
Thus survivability and disruption tolerance are necessary but not
sufficient for resilience.
Relationship of resilience to fault tolerance
Fault tolerance is the ability of a system or component
to continue normal operation despite the presence of hardware or software
faults [IEEE].
Fault-tolerant systems are generally engineered only to tolerate isolated
random natural failures. Thus, fault tolerance is necessary but not
sufficient for survivability (and therefore resilience). We do believe
that we can learn from past work in fault tolerance, particularly by
extending work in design methodology and metrics.
Multi-Level Resilience and Survivability
We believe that it is essential to solve the problem of resilience on
all levels, both from a network architectural perspective as
well as from a protocol layering and plane viewpoint. Starting from
the bottom-up, each level is made as resilient as practical
(understanding cost and resource tradeoffs). Higher levels are
themselves organised into resilient structures using the resilient
lower-level building blocks.
Network architecture view
From a network architecture perspective, auto-configured fault
tolerant components are self-organised into resilient network
structures.
- Network components: Individual components must be
fault-tolerant and able to auto-configure their operational
parameters that enable them to be part of a network.
- Network architecture: Auto-configured fault-tolerant
components are the building blocks for the network. These must be
autonomically self-organised into resilient and survivable network
structures that are able to survive the challenges to normal
operation listed above. This is true for individual subnetworks
and ASs, as well as for their composition at all levels of a
hierarchy and ultimately the Global (and Interplanetary) Internet.
Over time, autonomic self-management is responsible for continuing
re-optimisation, self-diagnosis, and self-repair.
Protocol layer view
From a protocol layer perspective, it is essential in a bottom-up
manner to make each layer as resilient and survivable as practical,
given economic and policy constraints. In every case this is a
necessary, but not sufficient condition for resilience at the layer
above. Traditional research has emphasised the lower layers (physical
and link); we believe that new emphasis must be placed on the network
and transport layers, as well as on services and applications.
- Physical and link layer (hop-by-hop): A significant
body of research has been done on improving the capacity and
robustness of physical communication links. Examples include
reliable physical coding in challenging wireless environments and
robust optical links with automatic protection and restoration.
While this line of pursuit has been a critical first step, the
provision of robust links does not result in resilient and
survivable networks. We are reaching diminishing returns on the
benefits of further improvements on the physical and link layers
(including physical coding, MAC algorithms and protocols, and
SONET and other protection mechanisms). An exception is the need
for further research in the emerging area of dynamic spectrum
allocation and management.
- Network layer (information path): The presence of
robust links is desirable, but not sufficient to obtain a
resilient, survivable network. These robust links must be
organised into a resilient network architecture, supported by
resilient network-layer mechanisms, including forwarding, routing,
signalling, and traffic management. While a first goal is to
maintain network connectivity when possible, survivable forwarding
and routing mechanisms are needed that permit communication even
when a stable path cannot be established through the network,
particularly in the presence of episodically-connected links.
Information is forwarded through the network as far as possible,
whenever possible (with store-and-forward when necessary) and may
be physically transferred by mobile nodes (store-and-haul,
also called data muling). Paths must have geographically diverse redundancy to
permit operation after natural disasters or attacks against the
infrastructure.
- Transport layer (end-to-end): The presence of
resilient network structures and the ability to transfer
information through the network even when strongly connected
symmetric paths do not exist is essential for resilience, but does
not ensure resilient end-to-end transport, which requires
resilient and disruption tolerant transport protocols. While we
expect the network to do the best it can, it is the transport
protocol that must provide the appropriate error, flow, and
congestion control mechanisms based on particular
network path characteristics. Existing transport protocols fall
far short of this goal, and make incorrect assumptions about the
network. For example, neither TCP nor SCTP discriminates between
congestion-induced and channel-corruption-based loss, and neither
even attempts to distinguish loss from packets and
acknowledgements that are merely delayed by store-and-forward in the
case of episodic connectivity.
- Applications: The presence of resilient end-to-end
associations (transactions, flows, or connections) is necessary
but not sufficient for resilient and disruption-tolerant
application behaviour. Applications must be aware of, and adapt
to, the characteristics of the end-to-end transport association,
including available bandwidth (symmetric, asymmetric, or
unidirectional), delay, and error rate, as well as the
distributions of these (which may be unpredictably time-varying).
In addition to instrumenting these characteristics to the
application and ultimately to the user via dials, the
user and application should be able to influence the behaviour of
the transport protocol (and ultimately the network below) via
knobs that express desired service characteristics and
quality. This allows the application to mask disruptions and delay
to the degree possible, while allowing the user to make
choices (e.g. resolution vs. frame wait, freshness of information
vs. response time).
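The store-and-forward behaviour described in the network-layer item above can be illustrated with a minimal Python sketch. It assumes a single hypothetical node with one next hop whose link comes and goes; the class and method names are invented for this example, and the `delivered` list merely stands in for the next hop.

```python
from collections import deque


class StoreAndForwardNode:
    """Hypothetical node that holds messages while its next hop is unreachable."""

    def __init__(self, name):
        self.name = name
        self.buffer = deque()  # messages stored until the link is available
        self.delivered = []    # stand-in for messages handed to the next hop
        self.link_up = False   # episodic connectivity: starts disconnected

    def send(self, message):
        """Accept a message; forward it now if possible, otherwise store it."""
        self.buffer.append(message)
        self._drain()

    def set_link(self, up):
        """Record a connectivity change and forward anything that was stored."""
        self.link_up = up
        self._drain()

    def _drain(self):
        # Forward in arrival order for as long as the link stays up.
        while self.link_up and self.buffer:
            self.delivered.append(self.buffer.popleft())


node = StoreAndForwardNode("relay")
node.send("m1")      # link is down, so m1 is stored
node.send("m2")
node.set_link(True)  # contact opportunity: m1 and m2 are forwarded
node.set_link(False)
node.send("m3")      # stored again until the next contact
```

Store-and-haul would follow the same pattern, with the node itself moving between contacts while it holds the buffered messages. Buffer limits, message expiry, and custody transfer are deliberately left out of this sketch.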
Protocol plane view
From a protocol plane perspective, it is necessary that data, control,
and management planes each be resilient, as well as their interactions
and collective behaviour.
- Data plane: The data plane refers to data transfer and
related transfer control (such as error and flow control).
Clearly all of these mechanisms must be resilient and survivable.
- Control plane: The control plane abstracts
signalling and control, for example connection or flow management
and routing at the network layer, along with
auxiliary mechanisms such as name resolution. These mechanisms
must all be designed to be resilient and secure, in contrast to
those in the current Internet (e.g. DNS, BGP, and OSPF).
- Management plane: The management plane is generally
modelled as cutting across all protocol layers. All network
monitoring and management activities must be resilient and
survivable, with autonomic operation driven by policy. A
critical research question is how to properly include human
network operators in the loop when necessary.
ResiliNets Strategy
Resilient and survivable networking depends on a strategy of layers
of resistance (D2R2):
defence / defense, detection, remediation, and recovery.
Defense
It is first essential that the network architecture, protocols, and
service mechanisms be as resistant as possible to either
attack or the effects of large-scale natural disasters and
environmental challenges. For example, secure network infrastructure
protocols are less likely to be compromised; spatial diversity
lowers the impact when part of the infrastructure is attacked.
Automatic Detection
Even though a network is resistant to attacks and challenges, we must
assume that they will occur. Therefore, a resilient survivable network
must be context-aware and automatically detect when it is
threatened or under attack.
Adaptive Remediation
Once the network has detected a challenge, compromise, or attack, it
must remediate the effects and adapt its topology and behavior to
mitigate the effects and minimise the impact as much as possible
on the rest of the network and its users.
Autonomic Recovery
As a particular attack ends or new infrastructure is deployed
after a natural disaster, the mitigation can end and the network
must autonomically self-organise and self-repair back to
normal operation.
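The four steps of the strategy can be read as a simple control loop. The following Python sketch is one possible rendering under stated assumptions: the three states, the two boolean inputs, and the `step` function are hypothetical simplifications made for illustration, and defence is modelled as the standing condition of normal operation.

```python
from enum import Enum


class State(Enum):
    DEFENDING = "defending"      # normal operation with defences in place
    REMEDIATING = "remediating"  # challenge detected, effects being mitigated
    RECOVERING = "recovering"    # challenge over, returning to normal


def step(state, challenge_detected, restored):
    """Advance the loop by one observation.

    challenge_detected: detection reports an ongoing challenge or attack.
    restored: the network has finished self-repair (used while recovering).
    """
    if challenge_detected:
        # Detection always leads to (or continues) remediation.
        return State.REMEDIATING
    if state is State.REMEDIATING:
        # The challenge has ended, so mitigation can stop and recovery begins.
        return State.RECOVERING
    if state is State.RECOVERING and not restored:
        return State.RECOVERING
    return State.DEFENDING


# Example trace: a challenge appears, persists, ends, and repair completes.
trace = [State.DEFENDING]
for detected, restored in [(True, False), (True, False), (False, False), (False, True)]:
    trace.append(step(trace[-1], detected, restored))
```

A new challenge that arrives during recovery sends the loop straight back to remediation, which reflects the assumption that challenges will continue to occur.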
Research Projects and Activities
Details on the activities and research staff in the ResiliNets initiative
are described in the
ResiliNets Wiki.
Last updated 15 August 2006
©2003–2006 James P.G. Sterbenz
<jpgs@ittc.ku.edu>
<jpgs@comp.lancs.ac.uk>
and David Hutchison
<dh@comp.lancs.ac.uk>