In this post I would like to write about Path MTU Discovery, IP fragmentation, and the relation between them.
As per the topology above, suppose the host LINUX1 is sending a packet to the LINUX3 device. The packet has to travel a path on which various MTU sizes are involved. In the past I used to think that Path MTU Discovery was something done before the TCP communication starts: it detects the lowest link MTU on the path, and TCP segments are generated accordingly. That isn't how it works. Here is how it actually works:
Assume the packet leaving LINUX1 has a total length of 1450 bytes. Because the link between LINUX1 and LINUX2 has a 1500-byte limit, there is no problem. However, once LINUX2 receives the packet, it sees that the link it must use to forward this packet has a lower maximum packet size than the packet itself. Under normal circumstances, LINUX2 sends an ICMP notification back to LINUX1 saying “Hey dude, I can’t forward this packet as I have a link with an 800-byte MTU on the way; do something and lower your packet size.”
LINUX1 gets this ICMP message and lowers the maximum size of its further packets to 800 bytes, and the packets flow through. Fair enough so far, but there is the concept of IP fragmentation, so why doesn’t it occur? The documents say that if the next link's MTU is lower than the packet being forwarded, the packet is fragmented.
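As a side note, the router reports the bottleneck MTU inside the ICMP message itself: per RFC 1191, a "Fragmentation Needed" message is ICMP type 3, code 4, and bytes 6–7 of the ICMP header carry the next-hop MTU. A minimal Python sketch of parsing such a header (the message bytes here are hand-crafted for illustration; the function name is my own):

```python
import struct

def parse_frag_needed(icmp: bytes):
    """Parse the first 8 bytes of an ICMP Destination Unreachable
    message: type, code, checksum, unused (2 bytes), next-hop MTU
    (RFC 1191 layout). Returns (type, code, next_hop_mtu)."""
    icmp_type, code, _checksum, _unused, mtu = struct.unpack("!BBHHH", icmp[:8])
    return icmp_type, code, mtu

# Hand-crafted example: type 3 (Destination Unreachable),
# code 4 (Fragmentation Needed and DF set), next-hop MTU 800.
# A real message would carry a valid checksum and would be followed
# by the offending packet's IP header plus its first 8 payload bytes.
msg = struct.pack("!BBHHH", 3, 4, 0, 0, 800)
icmp_type, code, mtu = parse_frag_needed(msg)
```

This is how LINUX1 learns the exact value (800) to clamp its packet sizes to, rather than guessing.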
This is where Path MTU Discovery comes in:
net.ipv4.ip_no_pmtu_disc = 0
This setting on the LINUX1 device causes every packet to have the DF (Don’t Fragment) bit set to 1, which means no intermediate router is allowed to fragment the packet. Below is a screenshot indicating how you will see the packet during a capture in Wireshark. Can you see the Don’t fragment bit?
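For the curious, the DF flag lives in the 16-bit flags/fragment-offset field of the IPv4 header (RFC 791): bit 0x4000 is DF (Don't Fragment) and bit 0x2000 is MF (More Fragments). A small Python sketch (the function name is mine, not from any library) showing how that field is composed:

```python
def flags_field(df: bool, mf: bool = False, offset: int = 0) -> int:
    """Compose the 16-bit IPv4 flags + fragment-offset field (RFC 791).

    Bit 0x8000 is reserved, 0x4000 is DF (Don't Fragment),
    0x2000 is MF (More Fragments); the low 13 bits are the
    fragment offset in 8-byte units.
    """
    field = offset & 0x1FFF
    if df:
        field |= 0x4000  # the bit Wireshark decodes as "Don't fragment: Set"
    if mf:
        field |= 0x2000
    return field

df_set = flags_field(df=True)  # an unfragmented packet with DF on
```

An unfragmented packet with DF on carries exactly 0x4000 in this field, which is what the Wireshark capture shows.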
Because fragmentation isn’t allowed by the sending device, the first intermediate router that has a lower-MTU link sends an ICMP response back. Let’s see what kind of ICMP message LINUX1 receives (you can click on the image to see a bigger version).
As you can see, the TCP three-way handshake completes properly, but once LINUX1 tries to send a segment above 800 bytes (it isn’t visible in the output, but packets 289 and 290 have a total length of 1500 bytes each), it receives the ICMP response above (Destination unreachable, Fragmentation needed) from LINUX2 and lowers the size of further packets to fit within the 800-byte limit.
Now the question: does this communication always work like this? I mean, every time a new TCP communication is needed, is this process repeated? Not really. Linux caches this path MTU. Let’s see it:
LINUX1#ip route show cache 192.168.111.2
192.168.111.2 from 172.30.73.219 via 172.30.73.85 dev eth0
cache expires 596sec mtu 800 advmss 1460 hoplimit 64
Can you see it? Now LINUX1 knows that it shouldn’t send any packet bigger than 800 bytes if it wants to reach this destination again. As can be seen in the output, this cache entry expires in 596 seconds. I have noticed that even if you have packets flowing in this direction, the expire value continues to count down to zero. So having an active connection doesn’t mean that this value will be reset to its upper limit again and again.
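The behavior above can be modeled as a tiny cache whose entries expire on a fixed timer that lookups do not refresh. This is only a toy sketch of the kernel behavior I observed (the class and method names are my own invention):

```python
class PmtuCache:
    """Toy model of the kernel's per-destination path-MTU cache:
    entries expire after a fixed interval, and lookups do NOT
    refresh the timer (matching the countdown behavior observed
    above, where active traffic doesn't reset the expire value)."""

    def __init__(self, ttl: float = 600.0):
        self.ttl = ttl
        self._entries = {}  # destination -> (mtu, absolute expiry time)

    def learn(self, dst: str, mtu: int, now: float) -> None:
        # Learned from an ICMP "Fragmentation Needed"; the timer starts here.
        self._entries[dst] = (mtu, now + self.ttl)

    def lookup(self, dst: str, now: float):
        entry = self._entries.get(dst)
        if entry is None or now >= entry[1]:
            return None  # unknown or expired: rediscover the path MTU
        return entry[0]

cache = PmtuCache(ttl=600)
cache.learn("192.168.111.2", 800, now=0)
```

Once an entry expires, the next large packet triggers the whole ICMP exchange again, which is exactly what we saw at the start.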
During my troubleshooting, I asked myself what happens if I just block every ICMP packet sent from the LINUX2 device. The answer is that communication halts! Because LINUX2 doesn’t provide any feedback about the next link’s MTU, LINUX1 keeps sending its packets at 1500 bytes. Since the DF bit is set, fragmentation can’t happen and everything is stuck. This is a very bad thing indeed!
Then I asked another question: what can I do on the LINUX3 side to prevent this from happening if I can’t inform the LINUX1 admin? MSS (Maximum Segment Size) comes in handy in this situation.
MSS isn’t actually a negotiated value, so whatever LINUX3 tells the other peer during TCP connection establishment, LINUX1 must obey. What I did was set the advertised MSS value for this route to 700 on LINUX3:
LINUX3# ip route change 0.0.0.0/0 dev eth0 advmss 700
After this, all subsequent TCP SYN packets will advertise an MSS of 700, and because LINUX1 will obey this and size its packets accordingly, the packet flow will not be disrupted.
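The 700-byte figure leaves some headroom: with minimal 20-byte IP and TCP headers (no options), a full-sized segment becomes a 740-byte packet, comfortably under the 800-byte bottleneck. The arithmetic, as a quick sketch:

```python
IP_HEADER = 20   # minimal IPv4 header, no options
TCP_HEADER = 20  # minimal TCP header, no options

def packet_size_for_mss(mss: int) -> int:
    """Total IP packet length of a full-sized segment for a given MSS."""
    return mss + TCP_HEADER + IP_HEADER

def mss_for_mtu(mtu: int) -> int:
    """Largest MSS whose full segments still fit into the given MTU."""
    return mtu - TCP_HEADER - IP_HEADER

# advmss 700 -> 740-byte packets, under the 800-byte bottleneck;
# the exact fit would have been an MSS of 800 - 40 = 760.
```

Note that TCP or IP options (timestamps, for example) eat into this budget, which is one reason to advertise a conservative value rather than the exact fit.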
I hope I haven’t made a mistake so far in this post. Let me know if you have any contributions or questions.
Path MTU discovery in JunOS:
If you want to enable or disable Path MTU Discovery on an SRX, the following output should be enough I think:
[edit system internet-options]
root@host# set ?
+ apply-groups Groups from which to inherit configuration data
+ apply-groups-except Don't inherit configuration data from these groups
gre-path-mtu-discovery Enable path MTU discovery for GRE tunnels
> icmpv4-rate-limit Rate-limiting parameters for ICMPv4 messages
> icmpv6-rate-limit Rate-limiting parameters for ICMPv6 messages
ipip-path-mtu-discovery Enable path MTU discovery for IP-IP tunnels
ipv6-duplicate-addr-detection-transmits IPv6 Duplicate address detection transmits
ipv6-path-mtu-discovery Enable IPv6 Path MTU discovery
ipv6-path-mtu-discovery-timeout IPv6 Path MTU Discovery timeout (minutes)
ipv6-reject-zero-hop-limit Enable dropping IPv6 packets with zero hop-limit
no-gre-path-mtu-discovery Don't enable path MTU discovery for GRE tunnels
no-ipip-path-mtu-discovery Don't enable path MTU discovery for IP-IP tunnels
no-ipv6-path-mtu-discovery Don't enable IPv6 Path MTU discovery
no-ipv6-reject-zero-hop-limit Don't enable dropping IPv6 packets with zero hop-limit
no-path-mtu-discovery Don't enable Path MTU discovery on TCP connections
no-source-quench Don't react to incoming ICMP Source Quench messages
no-tcp-reset Do not send RST TCP packet for packets sent to non-listening ports
no-tcp-rfc1323 Disable RFC 1323 TCP extensions
no-tcp-rfc1323-paws Disable RFC 1323 Protection Against Wrapped Sequence Number extension
path-mtu-discovery Enable Path MTU discovery on TCP connections
> source-port Source port selection parameters
source-quench React to incoming ICMP Source Quench messages
tcp-drop-synfin-set Drop TCP packets that have both SYN and FIN flags
Note1: I was wondering how TCP keeps its per-connection variables. For example, MSS is only announced during connection establishment and nowhere else, yet the entire connection knows that MSS has that value. I think the Transmission Control Block RFC (TCP Control Block Interdependence, http://www.ietf.org/rfc/rfc2140.txt) is the key to this question.
Note2: I have discovered some behavior one year after this post was published. The net.ipv4.ip_no_pmtu_disc setting only works if you are the TCP sender, at least in my tests on Ubuntu. For example, if you are the web server (responder), the net.ipv4.ip_no_pmtu_disc setting has no effect; no matter what you do, every IP packet carrying a TCP segment (reply) has the DF bit set to 1. I don’t know why we can’t manipulate this behavior. There may be a reason or an option to change it, but I couldn’t find it yet.