SRX cluster ip-monitoring

In an SRX chassis cluster setup, in addition to interface monitoring, you can also use
IP monitoring to check the health of your upstream path.


I have a simple topology to explain how IP monitoring works. In this setup, node0 and node1
are part of an SRX chassis cluster, and the reth0.0 interface is part of redundancy group 1 (RG1).
Currently node0 is the primary for RG1, as you can see from the output below:
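For reference, `show chassis cluster status` output looks roughly like this in a healthy cluster (the priorities and failover count here are illustrative, not from my lab):

```
{primary:node0}
root@srx> show chassis cluster status redundancy-group 1
Cluster ID: 1
Node                  Priority          Status    Preempt  Manual failover

Redundancy group: 1 , Failover count: 0
    node0                   200         primary        no       no
    node1                   100         secondary      no       no
```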

Now let's configure IP monitoring to detect a failure at the network layer.
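In set form, a configuration matching the values discussed below would look roughly like this (the `ip-monitoring` hierarchy under the redundancy group is the relevant one; addresses are from this topology, and the exact statement names may vary slightly by release):

```
set chassis cluster redundancy-group 1 ip-monitoring global-weight 100
set chassis cluster redundancy-group 1 ip-monitoring global-threshold 200
set chassis cluster redundancy-group 1 ip-monitoring retry-interval 3
set chassis cluster redundancy-group 1 ip-monitoring retry-count 5
set chassis cluster redundancy-group 1 ip-monitoring family inet 172.17.11.1 weight 200
set chassis cluster redundancy-group 1 ip-monitoring family inet 172.17.11.1 interface reth0.0 secondary-ip-address 172.17.11.99
```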

The config above instructs the SRX to:

  • Monitor IP address 172.17.11.1 by sending ICMP probes every 3 seconds (retry-interval).
  • If 5 consecutive probes (retry-count) fail, mark the IP address 172.17.11.1 as unreachable.
  • Then deduct its weight (200) from the global-threshold value (also 200).
  • And if the result of this deduction is 0, deduct the global-weight (100) from the RG1 threshold (255).

Note: As far as the documentation is concerned, the configured secondary-ip-address shouldn't be
the primary IP address on reth0.0, but it must be in the same subnet. In a nutshell, the
primary node uses the reth0.0 interface address (172.17.11.100) as the source, and the secondary node
uses the IP 172.17.11.99 (with the MAC address of the secondary node's child interface).

After configuring this ip-monitoring, here is the status; the IP is reachable:
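The output of `show chassis cluster ip-monitoring status` looks roughly like this (columns and formatting may differ slightly between releases):

```
root@srx> show chassis cluster ip-monitoring status

node0:
--------------------------------------------------------------------------
Redundancy group: 1

IP address          Status         Failure count    Reason
172.17.11.1         reachable      0                n/a
```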

Now I disable ICMP responses on the gateway to simulate a failure, and
the IP address is marked as unreachable.

When we check the cluster information, we can see that the threshold is now 155,
because our global-weight is 100 for this monitored IP address (255 - 100 = 155).

You might have noticed that there is no failover yet: for a failover to happen,
the RG1 threshold must reach 0, which isn't the case in this simulation.

Now, I change the config and set the IP weight to 200 and the global-weight to 255 to simulate a failover:
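The change in set form would be roughly the following (assuming "global" here means global-weight):

```
set chassis cluster redundancy-group 1 ip-monitoring family inet 172.17.11.1 weight 200
set chassis cluster redundancy-group 1 ip-monitoring global-weight 255
```

With the IP weight (200) equal to the global-threshold (200), a single monitored-IP failure zeroes the global threshold, and deducting the global-weight of 255 takes the RG1 threshold from 255 straight to 0, which forces the failover.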

After a network failure, node1 becomes primary this time.

You might see the high failure count of 46 in the output. This happened after I made a simulation mistake :)
I thought that I could just block ICMP from 172.17.11.100 on the Linux GW device as such;
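The rule was along these lines (a reconstruction, not the exact command I typed):

```
# drop only echo requests sourced from the primary's reth0.0 address
iptables -A INPUT -s 172.17.11.100 -p icmp --icmp-type echo-request -j DROP
```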

and it should trigger a failover. The assumption was correct in that it does trigger a failover, as node1 is able to
ping from 11.99. But once the failover occurs, node1 becomes primary and starts pinging from 11.100,
which fails again and causes another failover. This caused a sort of deadlock in the cluster, but
I learned something there :)

IP monitoring is much clearer to me after these tests. Let me know if you have any feedback or questions.

UPDATE:
After finishing this post, Steven asked in the comments what happens if we don't configure the secondary-ip-address, and I didn't know the exact answer. According to my test on an 11.4R7.5 release:

If you try to monitor an IP, e.g. 192.168.60.1, without a secondary-ip-address, you receive the following warning on every commit from then on.

Also, an odd thing for me is that the monitored IP (192.168.60.1) always shows as reachable
on the secondary node. Maybe this is what is meant by “limited monitoring functionality”.

11 thoughts on “SRX cluster ip-monitoring”

  1. Steven

    Good article! But is the configuration of a secondary IP address mandatory? If I have a /30 interconnect network with my upstream router, I don’t have any more addresses to assign…

    1. rtoodtoo Post author

      Hi Steven,
      What will possibly happen if you don’t set this secondary address (I am not even sure you are allowed to leave it out; I don’t have the setup to check it right now) is that your RG will become ineligible on the secondary node, since monitoring will fail. Because of this you won’t have any node to fail over to in case an issue occurs on the primary.

      1. Steven

        Hello rtoodtoo,

        That makes perfect sense. But I have just the one assigned IP. What I would need is to be able to ping the upstream router from just the active node. When this fails the redundancy group containing just this reth interface would have to failover to the second node and start pinging from that. I don’t care that much which link to the upstream router is being used, just that a failover takes place as soon as connectivity is lost. Switching back and forth could be restricted with the hold-down-interval.
        But I’m afraid that this isn’t going to work.

        1. rtoodtoo Post author

          I will try to test this as soon as I have a setup for it, but nothing comes to mind at the moment as a workaround. I am not 100% sure about the ineligibility of the secondary node either, but I hope to test it and give you my results.

          1. Steven

            Thanks for your help. I’m going to investigate this further, but I also don’t have a lab setup to try different configurations.

  2. rtoodtoo Post author

    I have updated the end of the post, Steven. I did a test but the result wasn’t as I expected. It should be checked in the X releases, in which the behavior might be different.

  3. salman

    Good article rtoodtoo, can you please clarify how you resolved the issue below:

    “Yes assumption was correct it should trigger a failover as node1 is able to
    ping from 11.99 but once the failover occurs, node1 becomes primary and now it starts pinging from 11.100
    which fails again and causes another failover. This caused a sort of dead lock in the cluster but
    I have learned something in there:)”

    What have you learned here 🙂

    Thanks & Regards

    1. rtoodtoo Post author

      Salman,
      What I learned there is that my testing method was incorrect. I have only a single gateway, and normally you should have at least two if you are really monitoring your uplinks. My test cluster simply couldn’t find a healthy uplink to fail over to.

      Cheers.

  4. Dan Kline

    I am in a similar situation: I have clustered firewalls and two ISPs, but only a /30 for the connection between my primary firewall and the primary ISP. I am familiar with this configuration in ScreenOS (a much more mature operating system, for sure), where you simply pinged a gateway or some upstream device and failed over to the backup if you lost connectivity.

    Is there simply no way to get this to work with a single public IP address?

    dak

  5. Fabian Alfaro

    I haven’t performed a lab recreation of this, but I am concerned about a few things here.
    My understanding is that the ICMP packets (when the interface and the secondary IP are not specified) are sourced from the reth interface and use the reth MAC address as the source MAC. That being said, the downstream or upstream switches will definitely know how to handle those packets. Based on the explanation provided in the SRX Series book (page 342), it appears to me that when you specify this interface and secondary IP, the ICMP packets are going to use the specified secondary IP as the source address and will come from the backup interface. The book says “this will allow the administrator to check the backup path coming from the secondary node”.

    If this is the case then:

    What happens if the specified interface is different from the one included in the routing table as our outgoing interface to reach the monitored IP?

    How is this possible, if the backup interface is supposedly not accepting traffic at all?

    How will the switches know how to forward this packet to the backup interface if the source MAC should be the reth MAC address?

    If the SRX won’t use the reth MAC address, then how will the SRX select the source MAC for those packets?

    I hope you can help me to clarify this shortly.

    Thank you, and BTW, great post (as always)!

    Fabian Alfaro

    1. rtoodtoo Post author

      Not quite sure if I am getting you right. In the post, what I explained is that the secondary sources the packets from the child interface’s MAC, not reth0’s. In a nutshell, you ping the same IP from two different IP addresses and MAC addresses: node0 uses reth0, node1 uses the child (local) interface’s MAC, and the switch knows where these two different MACs are located.

      Genco.

