3.1 Network Issue Diagnosis

Diagnose network issues, such as packet loss, congestion, routing, and jitter using collected data

Troubleshooting network issues requires a methodical approach and strong analytical skills. This section focuses on diagnosing common network problems, such as packet loss, congestion, routing issues, and jitter, using data collected from sources such as ThousandEyes, Meraki Dashboard, Catalyst Center, and SD-WAN Manager.

Understanding the data collected, the metrics, and their impact on user experience is crucial for effective diagnosis. For example, high latency can lead to slow application responsiveness, while jitter can disrupt voice and video calls.

When facing an issue, start by determining if it's application or network-related. Then, leverage collected data to identify the problem area and its potential causes.
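
The first triage step described above can be sketched as a small decision function. This is only an illustration: the field names and every threshold below are hypothetical and are not taken from ThousandEyes or any other product, so they would need to be tuned against your own baseline.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One measurement interval; all field names are illustrative."""
    loss_pct: float          # packet loss, in percent
    latency_ms: float        # network latency
    jitter_ms: float         # network jitter
    app_response_ms: float   # response time reported by an application-level test


# Hypothetical thresholds, chosen only for the example.
MAX_LOSS_PCT = 1.0
MAX_LATENCY_MS = 150.0
MAX_JITTER_MS = 30.0
MAX_APP_RESPONSE_MS = 500.0


def triage(sample: Sample) -> str:
    """Return a first guess at where to look: 'network', 'application', or 'healthy'."""
    network_degraded = (
        sample.loss_pct > MAX_LOSS_PCT
        or sample.latency_ms > MAX_LATENCY_MS
        or sample.jitter_ms > MAX_JITTER_MS
    )
    if network_degraded:
        return "network"
    if sample.app_response_ms > MAX_APP_RESPONSE_MS:
        return "application"
    return "healthy"


# Clean network metrics but a slow server response point toward the application.
print(triage(Sample(loss_pct=0.0, latency_ms=40.0, jitter_ms=2.0, app_response_ms=2200.0)))
```

The function checks network metrics first because a degraded network usually inflates application response times as well, so a slow application is only blamed once the network looks clean.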

Key Concepts

This section defines key network metrics, explores their common causes, and explains their impact on user experience. Understanding these concepts is crucial for effective network diagnostics.

Packet Loss

Packet loss occurs when one or more data packets traveling across a network fail to reach their destination. This compromises the integrity of the transmitted data and can lead to various issues depending on the application and protocol being used. Packet loss is typically measured as a percentage of packets lost relative to the total number of packets sent.
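
As a minimal sketch of the percentage calculation described above, the helper below derives loss from sent and received packet counts. The function name and the sample counts are made up for illustration.

```python
def packet_loss_pct(sent: int, received: int) -> float:
    """Packet loss as a percentage of the packets sent."""
    if sent <= 0:
        raise ValueError("sent must be a positive packet count")
    if not 0 <= received <= sent:
        raise ValueError("received must be between 0 and sent")
    return 100.0 * (sent - received) / sent


# 50 probes sent, 47 answered: 3 lost out of 50.
print(packet_loss_pct(sent=50, received=47))  # 6.0
```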

Common causes of packet loss

Network Congestion: When network traffic exceeds the capacity of network devices, packets may be dropped. This is common in scenarios with high-bandwidth applications like video streaming or large file transfers.
Transmission Errors: Problems with the physical media, such as signal degradation, noise, or interference, can corrupt packets and lead to loss. The impact of these factors varies depending on the type of transmission media used (e.g., fiber optic cables are less susceptible to interference than copper wires).
Device Misconfiguration: Incorrect settings on network devices, including firewalls, routers, and switches, can cause packets to be dropped. This could involve misconfigured access control lists (ACLs), Quality of Service (QoS) policies, or routing rules.
Routing Changes: When network routes change, packets may be lost if they are no longer directed to a valid destination. This can happen during network maintenance, outages, or configuration updates.
Hardware Failures and Software Bugs: Malfunctions in network hardware (e.g., faulty network interface cards) or software bugs in network device operating systems can also contribute to packet loss.

Impact of Packet Loss

Choppy audio and video during streaming or video calls
Frozen video frames
Interrupted streaming services
Failed file downloads
Overall reduction in user productivity

Latency

Network latency is the time it takes for a packet to travel from one point in the network to another. It is the delay data experiences as it traverses the network and is commonly measured in milliseconds, either in one direction or as a round-trip time (RTT).

Factors contributing to latency

Signal Propagation Delay: This inherent delay depends on the distance the signal must travel and the propagation speed of the medium. In both fiber optic and copper cabling, signals travel at roughly two-thirds the speed of light, so geographic distance is the dominant factor: the longer the path, the higher the latency.
Network Device Processing: Routers and switches introduce a small delay as they process packet headers and determine the appropriate forwarding path. The complexity of the device's configuration and the volume of traffic it handles can impact processing time.
Traffic Load and Queuing: Packets may experience delays while waiting in queues due to high traffic load or Quality of Service (QoS) prioritization. QoS mechanisms can prioritize certain types of traffic (e.g., voice over data), which can delay other packets.
Inefficient Routing: Each hop between network devices adds to the overall latency. Poorly designed routing paths, such as those with unnecessary hops or congested links, contribute to higher latency.
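
To get a feel for how much of a measured latency is unavoidable, the propagation component can be estimated from distance alone. The sketch below assumes a velocity factor of about 0.67 (a common approximation for optical fiber) and a hypothetical 4,000 km path; real cable routes are longer than the straight-line distance, so treat the result as a lower bound.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458  # speed of light in a vacuum


def propagation_delay_ms(distance_km: float, velocity_factor: float = 0.67) -> float:
    """One-way propagation delay in milliseconds.

    velocity_factor is the fraction of the speed of light at which the signal
    travels in the medium; 0.67 is an assumed, approximate value.
    """
    if distance_km < 0:
        raise ValueError("distance_km must not be negative")
    if not 0 < velocity_factor <= 1:
        raise ValueError("velocity_factor must be in (0, 1]")
    return distance_km / (SPEED_OF_LIGHT_KM_S * velocity_factor) * 1000.0


one_way = propagation_delay_ms(4000)
print(f"one-way: {one_way:.1f} ms, round trip: {2 * one_way:.1f} ms")
```

If the measured round-trip time is far above this floor, the difference comes from the other factors in the list: device processing, queuing, and extra hops.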

Impact of Latency

Sluggish Application Responsiveness: Applications become slow to respond to user actions, impacting productivity and accuracy.
Decreased Voice and Video Quality: Calls suffer from delays, making conversations difficult. This can lead to users talking over each other or experiencing long silences.
Impact on Business Operations: In time-critical environments, such as financial trading, high latency can cause lost revenue due to delayed transactions and slow reaction to market changes.

Jitter

Jitter is the variation in the time it takes for data packets to travel from their source to their destination. Ideally, packets arrive at a consistent pace. When a network experiences high jitter, packets arrive at irregular intervals, and some may arrive out of order or too late to be used by real-time applications.
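
Jitter can be quantified in more than one way, and monitoring tools do not all use the same formula. The sketch below shows two common approaches under that caveat: the mean absolute difference between consecutive delay samples, and the smoothed interarrival jitter estimator defined for RTP in RFC 3550. The sample delay values are invented.

```python
def mean_delay_variation(delays_ms: list[float]) -> float:
    """Mean absolute difference between consecutive delay samples."""
    if len(delays_ms) < 2:
        return 0.0
    diffs = [abs(cur - prev) for prev, cur in zip(delays_ms, delays_ms[1:])]
    return sum(diffs) / len(diffs)


def rfc3550_jitter(transit_ms: list[float]) -> float:
    """Smoothed jitter estimate: J = J + (|D| - J) / 16, per RFC 3550."""
    jitter = 0.0
    for prev, cur in zip(transit_ms, transit_ms[1:]):
        jitter += (abs(cur - prev) - jitter) / 16.0
    return jitter


steady = [20, 20, 20, 20, 20]
bursty = [20, 22, 19, 35, 21]
print(mean_delay_variation(steady))  # 0.0
print(mean_delay_variation(bursty))  # 8.75
```

Both series have a similar typical delay, but only the second would disturb a voice call, which is why jitter is tracked separately from latency.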

Factors contributing to jitter

Network Congestion: When too many data streams overwhelm the available bandwidth of a network, packets can be delayed, causing jitter. This is especially noticeable in real-time applications like video conferencing and VoIP calls.
Bursty Traffic: Sudden spikes in network traffic can cause temporary jitter as the network adjusts to the changing load. This can occur during peak usage periods or when large data transfers are initiated.
Wireless Interference: In wireless networks, interference from other signals can cause delays in packet delivery, leading to jitter. Interference from other wireless devices, microwave ovens, or physical obstructions can contribute to this.
Hardware and Software Issues: Faulty or outdated networking hardware can lead to inconsistent packet delivery times, resulting in jitter. This might include issues with network interface cards, drivers, or firmware on network devices.

Impact of Jitter

High jitter primarily impacts voice and video communications, making conversations confusing and difficult to understand. Users may experience:

Distorted audio
Choppy video
Dropped calls

Routing Issues

Border Gateway Protocol (BGP) is the standard routing protocol used to exchange routing information across the internet. It enables the sharing of reachability information between autonomous systems (AS), which are groups of networks operated under a single administrative organization. BGP allows networks to learn about available routes to different parts of the internet and make decisions about how to forward traffic.

When troubleshooting network issues, analyzing the reachability to prefixes of interest is crucial. Reachability must be considered from both sides:

How you can route to the target.
How the target can route back to you.
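
The two-sided check above can be sketched with Python's standard `ipaddress` module. The routing tables here are plain lists of prefixes invented for the example (using documentation address ranges), not output from any real device or BGP feed.

```python
import ipaddress


def best_route(table: list[str], address: str):
    """Longest-prefix match of address against a list of prefixes, or None."""
    addr = ipaddress.ip_address(address)
    matches = [
        net for net in (ipaddress.ip_network(p) for p in table) if addr in net
    ]
    return max(matches, key=lambda net: net.prefixlen, default=None)


def reachable_both_ways(local_table, remote_table, local_addr, remote_addr):
    """Return (forward_ok, reverse_ok) for a pair of endpoints."""
    forward_ok = best_route(local_table, remote_addr) is not None
    reverse_ok = best_route(remote_table, local_addr) is not None
    return forward_ok, reverse_ok


# Hypothetical tables: we know a route to the target, but the target
# has no route covering our address, so return traffic would fail.
local_table = ["203.0.113.0/24"]
remote_table = ["198.51.100.0/24"]
print(reachable_both_ways(local_table, remote_table, "192.0.2.10", "203.0.113.20"))
# (True, False)
```

A result of `(True, False)` is the case that is easy to miss when only the outbound path is tested.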

Common routing issues

Reachability Issues: These occur when there is no correct route to a prefix. This can happen if the originating AS stops advertising the prefix because of a failure or misconfiguration, or if a malicious AS falsely advertises the prefix.
Convergence and Unstable Conditions: These occur when a route repeatedly toggles between available and unavailable states (route flapping), or when route information changes frequently because of underlying infrastructure changes. The result is intermittent connectivity as routes change and traffic is rerouted.
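
An unstable route can be spotted by counting how often its state changes over a monitoring window. The sketch below assumes the availability of a prefix has already been sampled at regular intervals into a list of booleans; the threshold of three transitions is a hypothetical value for illustration.

```python
def count_transitions(states: list[bool]) -> int:
    """Number of changes between available (True) and unavailable (False)."""
    return sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)


def is_flapping(states: list[bool], max_transitions: int = 3) -> bool:
    """Flag a route whose state changed more often than the allowed threshold."""
    return count_transitions(states) > max_transitions


stable = [True, True, True, True, True, True]
unstable = [True, False, True, False, True, True]
print(is_flapping(stable))    # False
print(is_flapping(unstable))  # True
```

A single transition usually indicates an outage or a planned change, while many transitions in a short window point to the unstable condition described above.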


Sample Questions

3.1 Question 1

Users at a remote corporate site (identified as s30 in the exhibit) are experiencing issues with a critical Enterprise Application hosted in the Data Center. The site connects to the central campus through an MPLS network.

The following exhibits show the network status before and after the issue began. Based on the information presented, what is the most likely cause of the problem and what actions would you take next as a Network Operations Engineer?

Exhibit 3.1-1: Before Issue

Exhibit 3.1-2: After Issue


A) Escalate to the transmission media team and have the optical fiber between 10.84.30.1 and 10.87.16.53 checked.
B) Review the bandwidth utilization at this site.
C) Reach out to the team that owns the Enterprise Application and have the server reviewed.
D) Check the routing tables on the MPLS network devices for any recent changes.

3.1 Question 2

Users at remote sites are reporting voice issues. Based on the following exhibits, what are the possible causes and next steps?

Exhibit 3.1-3: Before Incident #1

Exhibit 3.1-4: Before Incident #2

Exhibit 3.1-5: During Incident #1

Exhibit 3.1-6: During Incident #2


A) Involve the Voice team, as the RTP test does not return any relevant results for the agent located at site 20 (identified as s20 in the exhibit).
B) Verify the routing changes on device 10.87.7.51.
C) Verify the docker host 10.84.50.53 and ensure the agent container is running.
D) Analyze the jitter and latency trends on the affected voice paths to identify potential network congestion.