Application and service delivery over the public Internet is subject to a variety of network performance challenges. This is because the Internet is composed of many different network fabrics, interconnection points, and management entities, all of which are dynamic, creating unpredictable traffic paths and unreliable conditions. While there is an inherent lack of visibility into end-to-end performance metrics, for the most part the Internet just works, and packets eventually get to their final destination. In this post we'll discuss key challenges affecting application performance, examine the technologies that emerged in response, including multi-CDN designs, and how they affect DNS. Finally, we'll look at Pulsar, our real-time telemetry engine, developed specifically to overcome many of these performance challenges by adding intelligence at the DNS lookup stage.
Historically, cloud applications had fixed, well-known locations and ports, which made front-end service discovery easy. But this has changed: the new era of microservice-based applications is far more dynamic, posing a challenge for DNS, a protocol developed in an era of static endpoints. Cloud applications are now distributed across many locations to reduce latency, improve response times, and enable site resilience. Containers are lightweight and start in milliseconds, VMs in seconds. The number of instances and their locations change constantly. DNS must keep track of these new endpoints and be flexible enough to support the public, private, and hybrid cloud environments where they live. To stay up to par, a new breed of DNS must be accurate, flexible, fast, fault tolerant, and reliable.
A lot of time, effort, and resources go into fine-tuning cloud environments and optimizing the application stacks they serve. Over the last few years we have witnessed dramatic changes to application architectures, which now enable fully distributed platforms spread across multi-data-center cloud environments. But these advancements are wasted when connectivity is severed or load balancing is ineffective. Building a distributed, elastic application is pointless if users can't reach the correct endpoint in a reasonable amount of time. With growing user expectations, performance is paramount. On top of that, building fault tolerance into your application is equally important. Servers break, fiber lines are cut, and databases become overloaded. As such, we also need tools that correctly monitor performance and detect real-time network and infrastructure changes so traffic can be routed around trouble spots.
Generic Traffic Management Challenges
The Internet was designed for connectivity, not performance. Routers have limited end-to-end visibility and no way of knowing how long it takes to reach the next hop. Factors such as bandwidth availability, congestion, underutilization, and fiber cuts are not taken into account. And BGP, the protocol that glues the Internet together, is not a performance-oriented protocol. Packets eventually get to their final destination, but not necessarily over the best-performing path.
A lot of today's traffic management relies on geolocation to determine the proximity of a user to an endpoint. However, proximity does not necessarily correlate with performance. Considering all the other factors that affect application performance, such as latency, jitter, throughput, packet loss, and congestion, a closest-is-best metric does not always fit the dynamic nature of networking. GeoIP works "most of the time," but is "most of the time" good enough for today's endpoints? In the context of millions of users, "most of the time" translates to thousands of suboptimal transactions.
The Internet is complex and geography alone is not the best way to make routing decisions. Application developers need the ability to route users based on user, network, and infrastructure metrics and tune traffic based on specific business goals like reducing latency or jitter. Plain vanilla DNS simply can't do that.
From a business standpoint, overcoming these challenges can have a surprising impact. Consistently sending users on the optimal path improves user experience and customer loyalty, which directly contributes to the top and bottom line.
Network Congestion, Bandwidth, Latency, Jitter, and Packet Loss
A wide range of conditions affect overall network performance. Latency, jitter, throughput, reliability, and packet loss are key factors. High-bandwidth applications such as online video gaming, streaming, and large file transfers require high throughput and capacity, putting pressure on operators to send each user request to the optimal endpoint. Put simply, better endpoint selection means more revenue.
A perfect network would never lose a single packet. In reality, networks are never 100% reliable: outages are caused by human error, routing black holes, and power and hardware failures. Loss of service will happen; it's just a matter of when. Packet loss triggers retransmissions, affecting overall network stability, and if loss rates climb too high, TCP's congestion control backs off and throughput collapses.
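The link between loss and throughput can be illustrated with the well-known Mathis model for steady-state TCP, where achievable rate is bounded by MSS / (RTT × √p). The sketch below uses illustrative MSS, RTT, and loss figures (not measurements from any real network) to show how sharply throughput falls as loss rises:

```python
import math

def mathis_throughput(mss_bytes: int, rtt_s: float, loss_rate: float) -> float:
    """Approximate steady-state TCP throughput in bytes/sec using the
    Mathis model: rate <= MSS / (RTT * sqrt(p))."""
    return mss_bytes / (rtt_s * math.sqrt(loss_rate))

# Illustrative figures: 1460-byte MSS, 80 ms RTT.
low_loss = mathis_throughput(1460, 0.080, 0.0001)   # 0.01% loss
high_loss = mathis_throughput(1460, 0.080, 0.01)    # 1% loss
print(f"0.01% loss: ~{low_loss / 1e6:.2f} MB/s")
print(f"1% loss:    ~{high_loss / 1e6:.2f} MB/s")
```

Because throughput scales with the inverse square root of loss, a 100x increase in loss rate costs a factor of 10 in throughput, which is why even fractional loss percentages matter.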
Limited bandwidth is bad, but latency is the real killer. You can always add capacity to solve bandwidth challenges; you can't do that with latency. The only way to reduce latency is to shorten the path. Network latency (combined with throughput) is essentially the speed of the network. It is commonly measured as round-trip time (RTT), the time a packet takes to travel across the network and back. Generally, latency within a data center is measured in microseconds, while latency between locations is measured in tens of milliseconds.
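A quick way to observe RTT in practice is to time a TCP handshake, since connect() completes in roughly one round trip. This is a rough sketch rather than a production probe (the hostname in the usage note is a placeholder):

```python
import socket
import time

def tcp_rtt_ms(host: str, port: int = 443, timeout: float = 3.0) -> float:
    """Estimate network RTT by timing the TCP three-way handshake:
    connect() returns after roughly one round trip."""
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return (time.monotonic() - start) * 1000.0

# Usage (hypothetical host): print(f"{tcp_rtt_ms('www.example.com'):.1f} ms")
```

Run against a server in the same data center versus one on another continent, and the microseconds-versus-tens-of-milliseconds gap described above shows up immediately.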
Distance kills application performance. Long distances directly result in high RTT and the potential for additional network congestion, and it is very difficult to change the distance without moving the actual content closer to the user. WAN acceleration and compression products sometimes help, but sometimes they actually increase latency. This is why CDNs are a popular choice. Caching works great under many circumstances, but it only suits certain content types. Placing static content on CDN PoPs will certainly reduce RTT and improve network performance; dynamic content is a different story.
Jitter also degrades performance for some applications. Jitter is the variation in packet inter-arrival times at the destination, and how much of it is acceptable depends on the application. It's a major problem for real-time communications such as VoIP and video streaming. Network congestion is one of the main causes of jitter: it turns a steady packet flow into an erratic one and results in some packets being discarded. Some network appliances are equipped with jitter buffers, but these only help to an extent. The best way to combat jitter is to select the most stable path between two endpoints.
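RFC 3550 (RTP) defines a widely used smoothed estimator for interarrival jitter. The sketch below applies that style of estimator to two illustrative arrival sequences (the timestamps are made up) to show how a steady stream scores zero while an erratic one accumulates jitter:

```python
def interarrival_jitter(arrivals_ms, expected_interval_ms):
    """Smoothed interarrival jitter in the style of RFC 3550:
    for each deviation D between consecutive packets, J += (|D| - J) / 16."""
    jitter = 0.0
    for prev, cur in zip(arrivals_ms, arrivals_ms[1:]):
        deviation = abs((cur - prev) - expected_interval_ms)
        jitter += (deviation - jitter) / 16.0
    return jitter

# Packets sent every 20 ms: a steady path vs. a congested, erratic one.
steady = interarrival_jitter([0, 20, 40, 60, 80], 20)    # 0.0
erratic = interarrival_jitter([0, 35, 41, 78, 85], 20)
print(f"steady path:  {steady:.2f} ms")
print(f"erratic path: {erratic:.2f} ms")
```

The 1/16 gain smooths out single outliers, which is exactly why sustained congestion, not the occasional late packet, is what drives the metric up.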
Low bandwidth, high latency, congestion, or packet loss between a source-destination pair is unacceptable. We need a mechanism to proactively detect and diagnose these problems in real time, and then steer traffic to the optimal endpoint.
Multi-CDN / Multi-Datacenter Strategy Challenges
Life would be easy if you had a single website serving content to a defined group of users in one location. If all users were based in a small geographical region around the server sourcing the content, application performance would be fine. Of course that's not the case. Today's businesses are serving a global audience.
Single-server application deployments are no longer the norm. Applications are broken into a number of tiers and spread across multiple physical locations. Cloud and other distributed infrastructure act as enablers, allowing businesses to break into new markets by servicing multiple locations. So now we have application endpoints scattered across the Internet in multiple hybrid environments. That brings an application closer to its users, but still leaves the challenge of routing each user to the optimal endpoint.
One way to serve content is through a Content Delivery Network (CDN). A single CDN will typically be distributed among many regions, but no single CDN performs best everywhere: certain CDNs perform better in specific geographies. For example, in a two-way multi-CDN design, provider A could be better at serving video content in the US West than provider B. There may also be price differences between geographic locations. This is where a multi-CDN strategy comes into play. A multi-CDN approach combines a number of external CDN providers into a single delivery network and uses a mechanism to route based on variables such as performance and cost.
When performance thresholds are exceeded in a multi-CDN environment, a routing decision must be made. As it turns out, DNS is a great place to make that decision: it offers the ability to select the best CDN network in a given scenario. Solutions like Pulsar can analyze data in real time and make routing decisions based on actual user, infrastructure, and network conditions.
Intelligent Traffic Management Using Pulsar
Pulsar offers the ability to control Internet routing during the DNS lookup. It does this by ingesting custom-tailored performance information and beaming it to our edge locations. Pulsar then sorts DNS answers and delivers the best one based on customer-defined criteria: bandwidth, throughput, latency, jitter, or any other metric that can be measured.
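As a rough illustration of this kind of answer sorting (a sketch, not Pulsar's actual algorithm), the snippet below ranks hypothetical CDN endpoints by a weighted score over made-up telemetry; the weights stand in for customer-defined criteria:

```python
# Hypothetical telemetry for one user region, as an internal load
# balancer might report it (hostnames and numbers are made up).
telemetry = {
    "cdn-a.example.net": {"latency_ms": 48, "loss_pct": 0.10},
    "cdn-b.example.net": {"latency_ms": 31, "loss_pct": 0.40},
    "cdn-c.example.net": {"latency_ms": 35, "loss_pct": 0.05},
}

def rank_answers(metrics, w_latency=1.0, w_loss=50.0):
    """Sort candidate DNS answers by a weighted score (lower is better).
    The weights encode the business goal: here latency dominates,
    but heavy packet loss is penalized."""
    def score(item):
        m = item[1]
        return w_latency * m["latency_ms"] + w_loss * m["loss_pct"]
    return [host for host, _ in sorted(metrics.items(), key=score)]

print(rank_answers(telemetry))  # best answer first
```

Note that the lowest-latency endpoint does not necessarily win: with these weights, a slightly slower endpoint with far less loss ranks first, which is precisely the kind of trade-off a plain geolocation lookup cannot express.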
Internal load balancers have been capturing various application performance metrics for years. By using Pulsar, this data can be leveraged globally. Pushing performance information to our network gives better visibility and control over endpoint reachability. Because Pulsar can utilize such telemetry on a macro-level, even a small performance differential can have a tremendous impact on thousands or even millions of users, and your bottom line.