Packetization Layer PMTU Discovery

rtoodtoo linux, tcp-ip August 18, 2015

Path MTU discovery as deployed today relies on ICMP, i.e. you send an oversized packet with the DF bit set, an intermediate hop in the path can’t forward it because the next-hop link has a lower MTU, and that hop notifies the source host. The notification is an ICMP Destination Unreachable “Fragmentation needed and DF set” message. But what happens if these ICMP notifications are blocked? Then we have a big problem, and one that can be difficult to identify.
So in this post I would like to show a mitigation technique for the case where ICMP is blocked in the network. Let’s first reproduce the ICMP block situation and then see how we can mitigate it with the packetization layer MTU discovery method described in RFC4821, “Packetization Layer Path MTU Discovery”.
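To make the ICMP side concrete, here is a minimal Python sketch (mine, not taken from the captures in this post) of where the next-hop MTU sits inside a “Fragmentation needed and DF set” message. It follows the layout from RFC 1191: type 3, code 4, checksum, 16 unused bits and then a 16-bit next-hop MTU. The function names are made up for illustration and the checksum is simply left at zero.

```python
import struct

ICMP_DEST_UNREACH = 3  # ICMP type: Destination Unreachable
ICMP_FRAG_NEEDED = 4   # ICMP code: Fragmentation needed and DF set


def build_frag_needed(next_hop_mtu, original=b""):
    """Build the ICMP message a router would send back (checksum left as 0)."""
    # type, code, checksum, unused (16 bits), next-hop MTU (16 bits)
    header = struct.pack("!BBHHH", ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
                         0, 0, next_hop_mtu)
    # 'original' stands for the IP header + first bytes of the dropped packet
    return header + original


def parse_next_hop_mtu(icmp):
    """Return the advertised next-hop MTU, or None for any other ICMP message."""
    if len(icmp) < 8:
        return None
    icmp_type, code, _checksum, _unused, mtu = struct.unpack("!BBHHH", icmp[:8])
    if icmp_type != ICMP_DEST_UNREACH or code != ICMP_FRAG_NEEDED:
        return None
    return mtu


if __name__ == "__main__":
    print(parse_next_hop_mtu(build_frag_needed(1000)))
```

If this message never reaches the sender, the sender has no way of learning the 16-bit value, which is exactly the situation the rest of this post is about.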

The following is the topology on which we carry out the tests.

Let’s first lower the MTU on segment 2. We do this on Host B (LAB1021-R1).

LAB1021-R1>ip link set vlan2201 mtu 1000
LAB1021-R1>ip link show vlan2201
15: vlan2201@if9:  mtu 1000 qdisc noqueue state UP mode DEFAULT group default 
    link/ether 00:0c:29:42:66:60 brd ff:ff:ff:ff:ff:ff

Yes, we have a lower MTU now.

Now check the path via traceroute from the LAB1021-PC10 device.

root@LAB1021-PC10:~# traceroute -n 10.11.5.2
traceroute to 10.11.5.2 (10.11.5.2), 30 hops max, 60 byte packets
 1  10.11.6.1  2.765 ms  2.735 ms  3.018 ms
 2  10.11.1.1  4.425 ms  4.415 ms  4.863 ms
 3  10.11.2.2  5.000 ms  4.986 ms  5.047 ms
 4  10.11.5.2  5.454 ms  5.464 ms  5.450 ms

Yes, we are following the path as depicted in the topology.

Now let’s download a file (test.dat), the size of which is 3000 bytes, from LAB1021-SRV10.

root@LAB1021-PC10:~# wget -S http://10.11.5.2/test.dat
--2015-08-18 18:18:57--  http://10.11.5.2/test.dat
Connecting to 10.11.5.2:80... connected.
HTTP request sent, awaiting response... 
  HTTP/1.1 200 OK
  Date: Tue, 18 Aug 2015 18:18:57 GMT
  Server: Apache/2.4.7 (Ubuntu)
  Last-Modified: Tue, 18 Aug 2015 18:05:06 GMT
  ETag: "bb8-51d99c215492c"
  Accept-Ranges: bytes
  Content-Length: 3000
  Keep-Alive: timeout=5, max=100
  Connection: Keep-Alive
  Content-Type: application/x-ns-proxy-autoconfig
Length: 3000 (2.9K) [application/x-ns-proxy-autoconfig]
Saving to: ‘test.dat.4’

100%[======================================================================>] 3,000       --.-K/s   in 0s      

2015-08-18 18:18:57 (174 MB/s) - ‘test.dat.4’ saved [3000/3000]

So we have downloaded the file successfully, but look at what happens on the network to adapt to the MTU change we made.

Let’s check the route cache on the server to see if anything changed for the client host network.

root@LAB1021-SRV10:/var/www/html# ip route get 10.11.6.2
10.11.6.2 via 10.11.5.1 dev vlan2206  src 10.11.5.2 
    cache  expires 592sec mtu 1000    <---          MTU for this client network is reduced to 1000

How did this happen? Let's see the packet capture taken on the server side.

This is what we expected. HostB (10.11.2.1) sent a "Frag needed" message to the server 10.11.5.2 and included MTU=1000 in it. Upon receipt of this message the server reduced the size of its IP packets and communication didn't halt. So far these are facts we might already have known, but now, as a network admin who doesn't know what he is doing, I am going to block all ICMP on HostB.

Before this, first flush the cache on the server side in order to demonstrate the issue

root@LAB1021-SRV10# ip route flush cache

and block all ICMP packets on the HostB.

LAB1021-R1>iptables -A OUTPUT -p icmp -j DROP

Now we are trying to download the same file again

root@LAB1021-PC10:~# wget http://10.11.5.2/test.dat   
--2015-08-18 18:33:42--  http://10.11.5.2/test.dat
Connecting to 10.11.5.2:80... connected.
HTTP request sent, awaiting response...

Hmm, it is stuck, we don't get anything. Check the packet capture

It is obvious that this time our server doesn't get any notification, keeps sending IP packets of the same size again and again, and in the end gives up and closes the TCP connection. Now the question is: can this be avoided? According to the implementation explained in RFC4821, yes, it can be avoided by moving MTU discovery up from ICMP to TCP.

As I only know how this is done in Linux, I will explain it on Linux only. The method is called MTU probing and is controlled by the "net.ipv4.tcp_mtu_probing" kernel parameter. Unfortunately it is off by default, maybe out of caution about possible side effects of the setting, for instance lowering the MTU inadvertently.
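For reference, the kernel documentation lists three values for this parameter, and the small Python sketch below just reads the current setting and prints what it means. The helper names are my own, and the /proc path is the usual location on Linux; on other systems the read simply returns None.

```python
# Meaning of net.ipv4.tcp_mtu_probing as listed in the kernel's ip-sysctl docs
MODES = {
    0: "disabled",
    1: "disabled by default, enabled when an ICMP black hole is detected",
    2: "always enabled, starting from an MSS of tcp_base_mss",
}


def describe_mtu_probing(value):
    """Translate the sysctl value into a human readable description."""
    return MODES.get(value, "unknown value")


def read_mtu_probing(path="/proc/sys/net/ipv4/tcp_mtu_probing"):
    """Return the current setting as an int, or None if it cannot be read."""
    try:
        with open(path) as handle:
            return int(handle.read().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    current = read_mtu_probing()
    if current is None:
        print("tcp_mtu_probing is not readable on this system")
    else:
        print("tcp_mtu_probing = %d (%s)" % (current, describe_mtu_probing(current)))
```

So the value 1 used in this post is the more conservative of the two "on" settings: probing only kicks in once the connection looks like it is stuck in a black hole.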

root@LAB1021-SRV10# sysctl net.ipv4.tcp_mtu_probing=1
net.ipv4.tcp_mtu_probing = 1

Now our server has the capability to probe MTU at transport layer. Now try to download the same file once again

root@LAB1021-PC10:~# wget http://10.11.5.2/test.dat
--2015-08-18 18:45:21--  http://10.11.5.2/test.dat
Connecting to 10.11.5.2:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3000 (2.9K) [application/x-ns-proxy-autoconfig]
Saving to: ‘test.dat.5’

100%[======================================================================>] 3,000       --.-K/s   in 0.01s   

2015-08-18 18:45:24 (201 KB/s) - ‘test.dat.5’ saved [3000/3000]

Yes, the download worked this time, but what happened on the network? Look at the packet capture.

See what happens. The server first sends normal full-sized segments but doesn't get any ACK for them. It then tries 5 retransmits, but again gets no response, and then the magic happens: it lowers the segment size starting from the 15th packet. The size you see in the capture is the size of the frame, so the actual IP packet is 564 bytes.
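Where could 564 come from? The sketch below is my own back-of-the-envelope arithmetic, not something read from the capture. It assumes the kernel fell back to a base MSS of 512 bytes (which I believe was the tcp_base_mss default on kernels of that time) and that TCP timestamps were in use, adding 12 bytes of options to every segment. The frame size additionally assumes a plain 14-byte Ethernet header with no VLAN tag or FCS in the capture.

```python
IP_HEADER = 20             # IPv4 header without options
TCP_HEADER = 20            # TCP header without options
TCP_TIMESTAMP_OPTION = 12  # 10 bytes of timestamp option plus 2 bytes of padding
ETHERNET_HEADER = 14       # assumption: untagged frame, FCS not captured


def ip_packet_size(payload, timestamps=True):
    """Size of the IP packet carrying 'payload' bytes of TCP data."""
    options = TCP_TIMESTAMP_OPTION if timestamps else 0
    return payload + TCP_HEADER + options + IP_HEADER


def frame_size(payload, timestamps=True):
    """Size of the Ethernet frame as a capture tool would typically show it."""
    return ip_packet_size(payload, timestamps) + ETHERNET_HEADER


if __name__ == "__main__":
    base_mss = 512  # assumed tcp_base_mss, not confirmed by the post
    print(ip_packet_size(base_mss))  # 564
    print(frame_size(base_mss))      # 578
```

Under these assumptions the numbers line up with the 564-byte IP packet above, but treat it as a plausible explanation rather than a confirmed one.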

Now another question: if we try another download, will the server remember that it should lower the segment size? Apparently not. Each transaction requires a new probe. I also recall that this was mentioned somewhere in the RFC.

One possible issue with MTU probing: if a temporary link problem causes enough consecutive retransmissions to trigger MTU probing, your throughput might suffer until the probe finds the higher MTU value again. (This is just my guess, I don't really have any experience with it.) As this is MTU probing, according to the RFC it should also detect when the MTU is higher, but I haven't really tested this either.

Apparently PLPMTUD is a nice feature, but I wonder when we will see it turned on by default.
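To illustrate the idea of probing upwards, here is a toy Python simulation. It is only a sketch under my own assumptions and not how the Linux kernel does it: the path is modelled as a function that silently drops anything bigger than its MTU (no ICMP at all), and the sender does a plain binary search between a known-good lower bound and an upper bound. The names search_pmtu and make_black_hole are invented for this example, and a real implementation has to be far more careful about telling probe loss from ordinary congestion loss.

```python
def make_black_hole(path_mtu):
    """Model a path that silently drops packets larger than path_mtu."""
    def delivers(size):
        return size <= path_mtu
    return delivers


def search_pmtu(delivers, low, high):
    """Binary search for the largest deliverable packet size.

    'low' must be a size already known to get through, 'high' is the
    largest size worth trying (e.g. the MTU of the local interface).
    Returns the size found and the list of probe sizes that were sent.
    """
    probes = []
    while low < high:
        probe = (low + high + 1) // 2  # round up so the loop always progresses
        probes.append(probe)
        if delivers(probe):
            low = probe        # probe was acknowledged, raise the floor
        else:
            high = probe - 1   # probe was lost, lower the ceiling
    return low, probes


if __name__ == "__main__":
    found, probes = search_pmtu(make_black_hole(1000), low=564, high=1500)
    print(found, probes)
```

With the 1000-byte link from this lab, the search starts from the 564-byte packets seen in the capture and converges on 1000 after ten probes, without a single ICMP message being needed.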
