Industry Background
Beyond Best Effort, Building the Real-Time IP Network
The Internet and IP networks have experienced tremendous growth over the past decade – fueled primarily by best-effort applications such as e-mail, web browsing and non-critical data. Despite this impressive growth, IP networks have had mixed success in supporting newer real-time applications and converged legacy services. To achieve the cost savings and revenue opportunities of a single IP/MPLS transport network, carriers must address the design shortcomings of traditional IP networks that limit their reliability and stability. This article identifies those specific limitations and outlines a roadmap for upgrading IP networks to achieve the performance required for real-time services.
In contrast to best-effort services, real-time services such as voice, video, streaming media, and interactive gaming require low latency, low jitter, and very high network availability.
For IP networks to support demanding real-time applications and converged legacy networks, their dynamic and unstable nature must be addressed. Four key sources of IP network instability stand in the way of real-time and converged services:
1. Router Reliability
2. Link Protection
3. Non-Disruptive Operations
4. Convergence Time
Step 1: 99.999% Router Availability
The first step in improving IP network reliability is to improve the reliability of the routers themselves. In contrast to traditional Central Office equipment such as voice and ATM switches, IP routers have not historically been designed for 99.999% availability. Recent data from RHK shows that a typical large IP network achieves between 99.95% and 99.99% availability. This corresponds to 10 to 50 times more downtime than the "five nines" availability benchmark of legacy data networks.
Building a router that delivers 99.999% availability requires a comprehensive focus on all dimensions of the routing platform, and a full understanding of the network environment into which the router is placed. Many vendors have claimed to achieve 99.999% availability; however, these claims are generally based on models, apply only to hardware, and are not validated by field data.
There is currently only one core routing platform that delivers 99.999% total system availability. Avici Systems achieves this level of availability through full hardware redundancy, extensive software reliability and stress testing, and NSR® Non-Stop Routing protection of the route controller.
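The availability figures above translate directly into minutes of downtime per year. The following is a minimal sketch of that arithmetic, assuming a 365-day year; the function name `annual_downtime_minutes` is illustrative and not taken from any source.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def annual_downtime_minutes(availability_percent: float) -> float:
    """Return the expected downtime in minutes per year for a given availability."""
    unavailability = 1.0 - availability_percent / 100.0
    return unavailability * MINUTES_PER_YEAR


# Compare the availability levels quoted in the text against five nines.
five_nines = annual_downtime_minutes(99.999)
for pct in (99.95, 99.99, 99.999):
    minutes = annual_downtime_minutes(pct)
    ratio = minutes / five_nines
    print(f"{pct}% -> {minutes:.1f} min/year ({ratio:.0f}x the five-nines figure)")
```

Five nines allows roughly 5.3 minutes of downtime per year, while 99.99% and 99.95% allow about 10 and 50 times that amount, which is where the 10 to 50 times range comes from.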
The route controller is the most difficult router component to protect, and is responsible for the greatest number of failures in traditional networks. Because it maintains connectivity with peer routers in the network and is responsible for all routing changes, failure in the route controller can cause up to 10 minutes of downtime while the backup route controller re-learns routes and establishes connectivity with peers. Avici's NSR® technology eliminates this downtime by enabling a backup route controller to instantaneously take over as the primary and maintain connectivity with all peers for all protocols including BGP.
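To put the 10-minute figure in context, it can be compared with the annual downtime allowance implied by 99.999% availability. The sketch below is only illustrative arithmetic, assuming a 365-day year; `budget_years_consumed` is a hypothetical helper name.

```python
MINUTES_PER_YEAR = 365 * 24 * 60

# Downtime allowance implied by 99.999% availability (about 5.26 minutes/year).
FIVE_NINES_BUDGET_MIN = (1.0 - 0.99999) * MINUTES_PER_YEAR


def budget_years_consumed(outage_minutes: float) -> float:
    """Return how many years of five-nines downtime allowance one outage uses up."""
    return outage_minutes / FIVE_NINES_BUDGET_MIN


# A single worst-case route controller failover of 10 minutes, as cited in the text.
print(f"{budget_years_consumed(10):.1f} years of five-nines allowance")
```

Under these assumptions, one worst-case route controller failover uses nearly two years of the five-nines allowance on its own.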
Step 2: Local Link Protection
A study by the University of Michigan observed that 32% of outages in a large regional IP network were attributed to link failures. Traditionally, IP networks have been built without locally protected links, relying instead on the ability of routers to reconverge and route around failed links.
This approach not only requires significant amounts of costly spare capacity in the network, it also involves service disruption, which is unacceptable for real-time and converged services. The goal of Real-Time network design is to locally protect links to prevent link failures from triggering convergence and impacting services.
Many link protection schemes have been developed with the common objective of locally protecting the link in less than 50 milliseconds to avoid triggering reconvergence. MPLS Fast Reroute offers effective protection and the flexibility to share backup links, but adds complexity due to the need to provision protection at each hop. Link Aggregation provides 1:N local protection of Ethernet links, but does not extend to SONET. SONET APS is effective at protecting SONET links, but is an expensive option as it only offers 1:1 protection.
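The cost difference between 1:1 and 1:N protection can be sketched with simple link counting. The model below is an assumption made for illustration, not a description of any vendor's implementation: every link has equal capacity, 1:1 protection dedicates one standby link per working link, and 1:N protection shares a single standby among N working links. The function name `protection_overhead` is hypothetical.

```python
def protection_overhead(working_links: int, scheme: str) -> float:
    """Return standby links as a fraction of working links.

    Simplified model (an assumption for illustration): all links have
    equal capacity; "1:1" dedicates one standby per working link, and
    "1:N" shares a single standby among all working links.
    """
    if working_links < 1:
        raise ValueError("working_links must be at least 1")
    if scheme == "1:1":
        standby = working_links
    elif scheme == "1:N":
        standby = 1
    else:
        raise ValueError(f"unknown scheme: {scheme!r}")
    return standby / working_links


for n in (1, 4, 8):
    one_to_one = protection_overhead(n, "1:1")
    one_to_n = protection_overhead(n, "1:N")
    print(f"{n} working links: 1:1 overhead {one_to_one:.0%}, 1:N overhead {one_to_n:.0%}")
```

In this model the 1:1 overhead stays at 100% regardless of size, while the 1:N overhead shrinks as more working links share the standby.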
Powerful new SONET/SDH link aggregation mechanisms such as Avici's Composite Links™ provide more cost-effective link protection by enabling up to 64 physical links to be grouped into a single logical link. In the event any member link fails, traffic is redistributed across the surviving links in less than 45 milliseconds. This not only provides cost-effective 1:N protection, it also provides a non-disruptive mechanism for provisioning additional bandwidth.
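The general idea of spreading traffic across member links and redistributing it when one fails can be sketched as follows. This is a generic hash-based illustration, not Avici's algorithm, which the text does not describe; the names `pick_link`, `distribute`, and the flow and link identifiers are all hypothetical. Only the 64-member limit comes from the text.

```python
import zlib

MAX_MEMBERS = 64  # the text states up to 64 physical links per logical link


def pick_link(flow_id: str, active_links: list) -> str:
    """Map a flow to one active member link by hashing its identifier."""
    if not active_links:
        raise RuntimeError("no surviving member links")
    index = zlib.crc32(flow_id.encode("utf-8")) % len(active_links)
    return active_links[index]


def distribute(flows: list, active_links: list) -> dict:
    """Return a mapping of member link -> list of flows placed on it."""
    if len(active_links) > MAX_MEMBERS:
        raise ValueError("too many member links for one logical link")
    placement = {link: [] for link in active_links}
    for flow in flows:
        placement[pick_link(flow, active_links)].append(flow)
    return placement


members = [f"link-{i}" for i in range(4)]
flows = [f"flow-{i}" for i in range(1000)]

before = distribute(flows, members)

# Simulate the failure of one member: its flows move to the survivors.
survivors = [m for m in members if m != "link-2"]
after = distribute(flows, survivors)

for link in survivors:
    print(link, len(before[link]), "->", len(after[link]))
```

A simple modulo hash like this one also moves flows that were not on the failed link. A production design would likely try to limit that churn, but the sketch is enough to show that the logical link keeps carrying all flows with one fewer member.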
Step 3: Eliminate Planned Downtime
Another primary cause of IP network downtime is operational changes such as software and hardware upgrades, link expansion, and configuration changes. These routine tasks cannot be performed on traditional routing platforms without causing service disruption.
Software and hardware upgrades are a prime example. Traditional router platforms do not support in-service hardware or software upgrades, forcing carriers to schedule maintenance windows for upgrades. In the past, some carriers have trumpeted high network availability SLAs, but have excluded as much as 25% of the day as maintenance windows. Customers have discovered this practice, and are now demanding that SLAs apply 7x24.
New routing platforms that offer in-service hardware scalability and hitless in-service software upgrades are beginning to emerge. By migrating to such platforms, carriers can virtually eliminate hardware and software upgrades as a source of disruption in the network. For example, Avici's NSR® Non-Stop Routing technology allows carriers to perform in-service software upgrades and revert to previous stable configurations without any service disruption.
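The effect of excluding maintenance windows from an SLA can be illustrated with a short calculation. The sketch below assumes the worst case, in which the network is down for every excluded window and only just meets the SLA during the remaining time; the function names are hypothetical and the 99.999% SLA value is an example, not a figure from any carrier.

```python
HOURS_PER_DAY = 24


def excluded_hours_per_day(excluded_fraction: float) -> float:
    """Return the hours per day that fall outside the SLA measurement."""
    return excluded_fraction * HOURS_PER_DAY


def worst_case_availability(sla_availability: float, excluded_fraction: float) -> float:
    """Return the lowest 7x24 availability that still satisfies the SLA.

    Worst-case assumption for illustration: the service is unavailable for
    all excluded time and exactly meets the SLA for the covered time.
    """
    if not 0.0 <= excluded_fraction < 1.0:
        raise ValueError("excluded_fraction must be in [0, 1)")
    return sla_availability * (1.0 - excluded_fraction)


hours = excluded_hours_per_day(0.25)
floor = worst_case_availability(0.99999, 0.25)
print(f"Excluded window: {hours:.0f} hours/day")
print(f"Worst-case 7x24 availability: {floor:.4%}")
```

With 25% of the day excluded, six hours per day go unmeasured, so a nominal five-nines SLA guarantees no more than about 75% availability on a 7x24 basis in this worst case.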
Step 4: Rapid Convergence
The three strategies outlined above are designed to dramatically reduce the frequency of disruptions to an IP network. Even with such precautions, however, no network is completely immune from disruptions caused by human error or topology changes.
When failure events or network changes do occur, the network must minimize the impact by recovering rapidly. The best measure of an IP network's recovery time is the routing convergence time: the amount of time it takes for the network to adapt to a change in network conditions. Convergence events can be triggered by changes in the physical network (such as fiber cuts), failure of a line card module or an entire router, or configuration changes.
While the network is adapting to these changes, some data packets are not forwarded to the correct destination. The first-order impact is packet loss, which degrades performance for end users if the outage persists and is expensive for carriers who violate SLA commitments. Packet delays on the order of just a second can be noticed by a user and create sub-par performance in applications such as real-time interactive gaming and video conferencing.
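The scale of the loss during a convergence event can be estimated with a back-of-the-envelope calculation. The sketch below assumes that all traffic on the affected link is lost for the full convergence time; the function name `packets_lost` and the example link rate, packet size, and utilization are illustrative assumptions, not figures from the text.

```python
def packets_lost(link_rate_bps: float, avg_packet_bytes: int,
                 convergence_seconds: float, utilization: float = 1.0) -> float:
    """Estimate packets lost on one link while the network reconverges.

    Simplifying assumption: every packet offered to the affected link
    during the convergence period is dropped.
    """
    if avg_packet_bytes <= 0:
        raise ValueError("avg_packet_bytes must be positive")
    packets_per_second = link_rate_bps * utilization / (avg_packet_bytes * 8)
    return packets_per_second * convergence_seconds


# Example: a 10 Gbit/s link at 50% utilization with 500-byte average packets.
for seconds in (0.05, 1.0, 10.0):
    lost = packets_lost(10_000_000_000, 500, seconds, utilization=0.5)
    print(f"{seconds:>5} s convergence -> about {lost:,.0f} packets lost")
```

Because the estimate scales linearly with convergence time, reducing convergence from seconds to tens of milliseconds reduces the loss by the same factor.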
The Bottom Line
These four steps – achieving 99.999% router availability, implementing local link protection, enabling in-service upgrades, and improving convergence times – are critical to transforming a best-effort IP network into a network capable of supporting real-time and converged services. Leading carriers will be driven to deploy Real-Time networks by the revenue opportunities of new real-time services and the tremendous cost savings of reduced outages and convergence efficiencies. Avici Systems is the only vendor to comprehensively address the needs of Real-Time IP Converged networks with the TSR, SSR and QSR routing platforms.
