Path MTU, IP Fragmentation and MSS

In this post I would like to write about Path MTU discovery and IP fragmentation, and the relation between them.


[Topology diagram: LINUX1 - LINUX2 - LINUX3 (path_mtu_ip_fragmentation_mss)]

As per the topology above, suppose the host LINUX1 is sending a packet to LINUX3. The packet has to traverse a path whose links have different MTU sizes. In the past I used to think that Path MTU discovery was something done before the TCP communication starts: it would detect the lowest link MTU on the path, and TCP segments would be generated accordingly. That isn't how it works. Here is how it actually works:

Assume the packet leaving LINUX1 has a total length of 1450 bytes. Because the LINUX1-LINUX2 link has a 1500-byte MTU, there is no problem on the first hop. However, once LINUX2 receives the packet, it sees that the link it must use to forward it has a lower maximum packet size than the packet itself. Under normal circumstances, LINUX2 sends an ICMP notification back to LINUX1 saying: "Hey dude, I can't forward this packet as there is an 800-byte MTU link on the way; do something and lower your packet size."

LINUX1 receives this ICMP, lowers the maximum size of its subsequent packets to 800 bytes, and the packets flow through. Fair enough so far, but there is also the concept of IP fragmentation; why doesn't it occur here? The documentation says that if the next link's MTU is lower than the packet being forwarded, the packet is fragmented.
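To make the fragmentation scenario concrete, here is a small sketch (my own illustration, not from a capture) of how a router would split a 1450-byte IPv4 packet for an 800-byte MTU link if the DF bit allowed it:

```python
# Sketch: fragmenting a 1450-byte IPv4 packet for an 800-byte MTU link.
# Each fragment carries its own IP header, and every fragment payload
# except the last must be a multiple of 8 bytes, because the fragment
# offset field counts in 8-byte units.
IP_HEADER = 20

def fragment_sizes(total_len, mtu, ip_header=IP_HEADER):
    payload = total_len - ip_header
    # Largest payload per fragment, rounded down to a multiple of 8
    max_frag_payload = (mtu - ip_header) // 8 * 8
    frags = []
    offset = 0
    while payload > 0:
        chunk = min(max_frag_payload, payload)
        # (total fragment length, offset in 8-byte units, more-fragments flag)
        frags.append((chunk + ip_header, offset // 8, payload > chunk))
        offset += chunk
        payload -= chunk
    return frags

print(fragment_sizes(1450, 800))
# -> [(796, 0, True), (674, 97, False)]
```

So without the DF bit, LINUX2 would simply emit two fragments (796 and 674 bytes) instead of sending an ICMP error.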

This is where Path MTU discovery comes in:

net.ipv4.ip_no_pmtu_disc = 0


This setting on LINUX1 causes every packet to have its DF (Don't Fragment) bit set to 1, which tells intermediate routers not to fragment the packet. Below is a screenshot showing how the packet appears in a Wireshark capture. Can you see the Don't Fragment bit?
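If you want to check the DF bit outside Wireshark: in the raw IPv4 header it is bit 0x4000 of the 16-bit flags/fragment-offset field at byte offset 6. A quick sketch in Python (my own example, not from the post):

```python
import struct

# The flags/fragment-offset field is a 16-bit big-endian value at
# offset 6 of the IPv4 header. Mask 0x4000 is the DF (Don't Fragment) bit.
def df_bit_set(ip_header: bytes) -> bool:
    flags_frag = struct.unpack_from("!H", ip_header, 6)[0]
    return bool(flags_frag & 0x4000)

# Minimal 20-byte IPv4 header with DF set (0x4000 in the flags field):
hdr = struct.pack("!BBHHHBBH4s4s",
                  0x45, 0, 1450,      # version/IHL, TOS, total length
                  0x1234, 0x4000,     # ID, flags + fragment offset (DF)
                  64, 6, 0,           # TTL, protocol (TCP), checksum
                  b"\xc0\xa8\x6f\x01", b"\xc0\xa8\x6f\x02")
print(df_bit_set(hdr))  # -> True
```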

Because the sender doesn't allow fragmentation, the first intermediate router with a lower-MTU link sends an ICMP response back. Let's see what kind of ICMP message LINUX1 receives (you can click on the image to see a bigger version):

As you can see, the TCP three-way handshake completes properly, but once LINUX1 tries to send a segment above 800 bytes (it isn't visible in the output, but packets 289 and 290 each have a total length of 1500), it receives the ICMP response above (Destination unreachable: Fragmentation needed) from LINUX2, and it lowers the size of subsequent packets to fit the 800-byte limit.
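On the wire, that "Fragmentation needed" message is ICMP type 3 (Destination Unreachable), code 4, and per RFC 1191 it carries the next-hop MTU in bytes 6-7 of the ICMP header. A sketch of building and parsing one (my own illustration):

```python
import struct

def icmp_checksum(data: bytes) -> int:
    """Standard Internet checksum (RFC 1071): one's-complement sum."""
    if len(data) % 2:
        data += b"\x00"
    s = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    s = (s & 0xFFFF) + (s >> 16)
    s = (s & 0xFFFF) + (s >> 16)
    return ~s & 0xFFFF

def frag_needed(next_hop_mtu: int, original: bytes = b"") -> bytes:
    # Type 3 (dest unreachable), code 4 (fragmentation needed and DF set),
    # 16 unused bits, then the next-hop MTU (RFC 1191), followed by the
    # IP header + first 8 bytes of the offending datagram.
    msg = struct.pack("!BBHHH", 3, 4, 0, 0, next_hop_mtu) + original
    csum = icmp_checksum(msg)
    return struct.pack("!BBHHH", 3, 4, csum, 0, next_hop_mtu) + original

msg = frag_needed(800)
print(struct.unpack("!H", msg[6:8])[0])  # -> 800
```

This next-hop MTU field (800 here) is exactly what LINUX1 uses to shrink its packets.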

Now the question: does this communication always work like this? I mean, is this process repeated every time a new TCP connection is established? Not really. Linux caches the path MTU. Let's see it:

LINUX1#ip route show cache 192.168.111.2
192.168.111.2 from 172.30.73.219 via 172.30.73.85 dev eth0
    cache  expires 596sec mtu 800 advmss 1460 hoplimit 64

Can you see it? Now LINUX1 knows that it shouldn't send any packet bigger than 800 bytes to this destination. The cache entry expires in 596 seconds, as shown in the output. I have noticed that even with packets actively flowing in this direction, the expire value keeps counting down to zero; having an active connection doesn't reset it to its upper limit again and again.
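If you ever need to pull these values out of `ip route show cache` programmatically, a small sketch (assuming the output format shown above):

```python
import re

# The `ip route show cache` output shown earlier in the post:
sample = ("192.168.111.2 from 172.30.73.219 via 172.30.73.85 dev eth0\n"
          "    cache  expires 596sec mtu 800 advmss 1460 hoplimit 64\n")

def route_cache_attrs(text):
    """Extract the learned path MTU and cache expiry from one entry."""
    mtu = re.search(r"\bmtu (\d+)", text)
    exp = re.search(r"\bexpires (\d+)sec", text)
    return {"mtu": int(mtu.group(1)) if mtu else None,
            "expires": int(exp.group(1)) if exp else None}

print(route_cache_attrs(sample))  # -> {'mtu': 800, 'expires': 596}
```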

During my troubleshooting, I asked myself what happens if I simply block every ICMP packet sent from LINUX2. The answer: communication halts! LINUX2 no longer provides any feedback about the next link's MTU, so LINUX1 keeps sending its packets at 1500 bytes. Since the DF bit is set, fragmentation can't happen and everything is stuck. This is a very bad thing indeed!

Then I asked another question: what can I do on the LINUX3 side to prevent this from happening, if I can't reach the LINUX1 admin? This is where MSS (Maximum Segment Size) comes in.

MSS isn't a negotiated value: whatever LINUX3 announces to its peer during TCP connection establishment, LINUX1 must obey. What I did was set the advertised MSS for this link to 700 on LINUX3:

LINUX3# ip route change 0.0.0.0/0 dev eth0 advmss 700


After this, all subsequent TCP SYN packets from LINUX3 will advertise an MSS of 700, and because LINUX1 will obey this and size its packets accordingly, the packet flow will not be disrupted.
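Why 700 is safe: MSS counts only the TCP payload, so the resulting IPv4 packet is MSS + 20 bytes of TCP header + 20 bytes of IP header (assuming no options). A sketch of the arithmetic:

```python
IP_HEADER = 20
TCP_HEADER = 20

def packet_size_for_mss(mss: int) -> int:
    """Total IPv4 packet size produced by a full segment of the given
    MSS, assuming no IP or TCP options."""
    return mss + TCP_HEADER + IP_HEADER

# Advertising MSS 700 keeps every full-sized segment at 740 bytes,
# comfortably under the 800-byte path MTU, so no ICMP feedback is needed.
print(packet_size_for_mss(700))   # -> 740
# The largest MSS that would still exactly fit the 800-byte link is 760:
print(packet_size_for_mss(760))   # -> 800
```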

I hope I haven't made any mistakes so far in this post. Let me know if you have any contributions or questions.

Path MTU discovery in JunOS:

If you want to enable/disable Path MTU discovery on an SRX, the following output should be enough, I think:

[edit system internet-options]
root@host# set ?
Possible completions:
+ apply-groups         Groups from which to inherit configuration data
+ apply-groups-except  Don't inherit configuration data from these groups
  gre-path-mtu-discovery  Enable path MTU discovery for GRE tunnels
> icmpv4-rate-limit    Rate-limiting parameters for ICMPv4 messages
> icmpv6-rate-limit    Rate-limiting parameters for ICMPv6 messages
  ipip-path-mtu-discovery  Enable path MTU discovery for IP-IP tunnels
  ipv6-duplicate-addr-detection-transmits  IPv6 Duplicate address detection transmits
  ipv6-path-mtu-discovery  Enable IPv6 Path MTU discovery
  ipv6-path-mtu-discovery-timeout  IPv6 Path MTU Discovery timeout (minutes)
  ipv6-reject-zero-hop-limit  Enable dropping IPv6 packets with zero hop-limit
  no-gre-path-mtu-discovery  Don't enable path MTU discovery for GRE tunnels
  no-ipip-path-mtu-discovery  Don't enable path MTU discovery for IP-IP tunnels
  no-ipv6-path-mtu-discovery  Don't enable IPv6 Path MTU discovery
  no-ipv6-reject-zero-hop-limit  Don't enable dropping IPv6 packets with zero hop-limit
  no-path-mtu-discovery  Don't enable Path MTU discovery on TCP connections
  no-source-quench     Don't react to incoming ICMP Source Quench messages
  no-tcp-reset         Do not send RST TCP packet for packets sent to non-listening ports
  no-tcp-rfc1323       Disable RFC 1323 TCP extensions
  no-tcp-rfc1323-paws  Disable RFC 1323 Protection Against Wrapped Sequence Number extension
  path-mtu-discovery   Enable Path MTU discovery on TCP connections
> source-port          Source port selection parameters
  source-quench        React to incoming ICMP Source Quench messages
  tcp-drop-synfin-set  Drop TCP packets that have both SYN and FIN flags

Note1: I was wondering how TCP keeps its per-connection variables. For example, MSS is announced only during connection establishment and nowhere else, yet the entire connection knows the MSS value. I think the Transmission Control Block RFC http://www.ietf.org/rfc/rfc2140.txt is the key to this question.

Note2: About a year after this post was published, I discovered some behavior: the net.ipv4.ip_no_pmtu_disc setting only has an effect if you are the TCP sender, at least in my tests on Ubuntu. For example, if you are the web server (the responder), the setting has no effect; no matter what you do, every IP packet carrying TCP segments (replies) has its DF bit set to 1. I don't know why this behavior can't be manipulated. There may be a reason, or an option to change it, but I haven't found it yet.

About: rtoodtoo

Worked for more than 10 years as a Network/Support Engineer; also interested in Python, Linux, Security and SD-WAN. Currently living in the Netherlands and working as a Network Support Engineer. // JNCIE-SEC #223 / RHCE / PCNSE


5 thoughts on “Path MTU, IP Fragmentation and MSS”

  1. Hi, what about UDP?
    On NetScreen you have set flow path-mtu.
    For Junos you have path-mtu-discovery, "Enable Path MTU discovery on TCP connections".

    regards Per

  2. Hi Per,
    I didn't quite get the question, to be honest. Do you mean MSS in UDP or something?
    There is no such thing, I suppose. If you can articulate what you mean, I would be pleased.

    Regards

  3. Hi,
    no, I was not thinking of MSS for UDP. If we start with ScreenOS: we are running traffic through VPN tunnels and the default MTU is 1500.
    So if you configure the Screen with set flow path-mtu, then if you ping with DF set you get ICMP messages back for packets larger than 1418 (ping -f -l 1418). So that means for UDP packets with DF set, we get "ICMP need to fragment" back.

    But if we go to Junos, it seems we don't have that option; if a UDP packet with DF set is larger, Junos will fragment it over the VPN tunnel.

    regards Per

  4. Hi, I found a practice question that mentions you can reduce the MSS in Junos from the edit security flow tcp-mss hierarchy. See the output below from an SRX:

    root# set security flow tcp-mss ?
    Possible completions:
    > all-tcp Enable MSS override for all packets
    + apply-groups Groups from which to inherit configuration data
    + apply-groups-except Don’t inherit configuration data from these groups
    > gre-in Enable MSS override for all GRE packets coming out of an IPSec tunnel
    > gre-out Enable MSS override for all GRE packets entering an IPsec tunnel
    > ipsec-vpn Enable MSS override for all packets entering IPSec tunnel

    FYI – you have a lot of good posts on your site,
    Cheers
    Craig

    1. Thanks Craig. This is a very handy command yes and a life saver in many situations in which you have MTU or some performance issues due to fragmentation.
