Layer 2 loop troubleshooting

Check those stupid interface statistics first if things are going god damn wrong

One network guy says

I thought it better to put the summary of the post first. I have had multiple layer 2 loops in my career and they were mostly disastrous. I think there is nothing scarier than a layer 2 loop in a network. Once it happens, depending on the network gear you have, you might either lose all access or see things get out of control very quickly.

Normally we should have spanning tree protocol enabled to avoid a loop, but in the real world there are always exceptions, and sometimes the network cabinet is accessible not only to you but also to field workers. In addition, you are not the only person who configures network devices.

Let’s say you have a setup like the following. The outputs are from a Juniper and a Cisco Meraki box, but it could be another vendor too.

All of a sudden, protocols flap, ping doesn’t work even between directly connected devices, and if there is an MC-LAG, it goes down too. To isolate the issue:

  • Run “monitor interface traffic” on the core switch and look for an excessive PPS (packets per second) counter on any interface. You might see millions of PPS, and it is usually clearly noticeable.

Once you have spotted the interface with the highest PPS counter:

  • Run “show lldp neighbors interface <interface name>” to see which edge switch is generating this traffic.
  • Once you have identified the edge switch, log in to it, run “monitor interface traffic” again and focus on the input PPS, as the output counter may be high on multiple interfaces. On various occasions I have seen people cross-connect two interfaces on the same switch, which creates an enormous number of packets.
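To make the "find the highest input PPS" step concrete, here is a minimal Python sketch. It assumes you have somehow collected two snapshots of per-interface input packet counters a few seconds apart (via SNMP, CLI scraping or anything else); the function name, the interface names and the counter values are all made up for illustration and do not come from any vendor tool.

```python
def rank_by_input_pps(before, after, interval_s):
    """Return (interface, pps) pairs sorted by input PPS, highest first.

    `before` and `after` map interface names to cumulative input packet
    counters taken `interval_s` seconds apart (hypothetical data source).
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    rates = {}
    for name, new_count in after.items():
        old_count = before.get(name)
        if old_count is None:
            continue  # interface missing from the first snapshot
        delta = new_count - old_count
        if delta < 0:
            continue  # counter was reset or wrapped; skip it
        rates[name] = delta / interval_s
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


# Made-up counters: ge-0/0/2 receives ten million packets in five seconds.
before = {"ge-0/0/1": 1_000, "ge-0/0/2": 5_000, "ge-0/0/3": 200}
after = {"ge-0/0/1": 1_500, "ge-0/0/2": 10_005_000, "ge-0/0/3": 260}

ranking = rank_by_input_pps(before, after, interval_s=5)
for name, pps in ranking:
    print(f"{name}: {pps:.0f} pps")
```

The interface at the top of the ranking is the one to follow with the LLDP lookup.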

This top-down approach, i.e. checking the core switch first and walking down until you reach the edge switch, is one method that I believe works.
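The top-down walk can be sketched in a few lines of Python. This is only an illustration under assumptions: the `topology` dictionary stands in for what you would read from the traffic monitor and LLDP on each switch, it lists only downstream-facing ports (uplinks are left out), and the switch names, port names, PPS values and threshold are all hypothetical.

```python
def trace_loop(topology, start, threshold_pps):
    """Follow the busiest input port hop by hop, starting at `start`.

    `topology` maps a switch name to {port: (input_pps, neighbor)}, where
    `neighbor` is the switch seen via LLDP on that port, or None if the
    port has no switch behind it. Returns the list of (switch, port) hops.
    """
    path = []
    visited = set()
    switch = start
    while switch is not None and switch not in visited:
        visited.add(switch)  # guard against walking in circles
        ports = topology.get(switch, {})
        if not ports:
            break
        port, (pps, neighbor) = max(ports.items(), key=lambda item: item[1][0])
        if pps < threshold_pps:
            break  # nothing on this switch looks like loop traffic
        path.append((switch, port))
        switch = neighbor
    return path


# Made-up readings: the storm enters the core from edge-1.
topology = {
    "core": {"xe-0/0/0": (2_000_000, "edge-1"), "xe-0/0/1": (300, "edge-2")},
    "edge-1": {"ge-0/0/10": (1_900_000, None), "ge-0/0/11": (150, None)},
    "edge-2": {"ge-0/0/5": (80, None)},
}

path = trace_loop(topology, "core", threshold_pps=100_000)
for switch, port in path:
    print(f"{switch}: suspect port {port}")
```

The last hop in the path is the edge port to inspect physically, for example for a cable patched back into the same switch.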

If you have a Meraki switch, however, you are luckier. I really like the reporting capabilities of these switches. Although I am not happy about their lack of console and direct access, the dashboard gives you a lot of detail. If you experience something really unexpected, i.e. you receive a different error each time, go to Network-wide->Event log

and search for the event type “Loop detected”

If there is a loop, this will show you the port on which it was detected on the Cisco Meraki switch.

Meraki actually gives you far more user-friendly output than a Juniper box. If you go to Switch->Ports in the dashboard, you can see the current sent/received traffic for each port and sort the ports by it, which you can’t do in the Juniper CLI (at least not that I know of).

If you do this and there really is a loop, you will also see that the dedicated stack ports are forwarding traffic at almost their maximum capacity.

Let me know how you go about finding loops in your network and what you do to mitigate them.


About: rtoodtoo

Worked for more than 10 years as a Network/Support Engineer and also interested in Python, Linux, Security and SD-WAN // JNCIE-SEC #223 / RHCE / PCNSE
