SRX cluster ip-monitoring

In an SRX chassis cluster setup, in addition to interface monitoring you can also use
IP monitoring to monitor the health of your upstream path.

[Figure: srx-chassis-ip-monitoring topology]

I have a simple topology to explain how IP monitoring works. In this setup, node0 and node1
are part of an SRX chassis cluster. The reth0.0 interface is part of redundancy group 1 (RG1).
Currently node0 is the primary for RG1, as you can see from the output below:

{primary:node0}
root@node0> show chassis cluster status
Cluster ID: 1
Node                  Priority          Status    Preempt  Manual failover

Redundancy group: 0 , Failover count: 1
    node0                   100         primary        no       no
    node1                   99          secondary      no       no

Redundancy group: 1 , Failover count: 1
    node0                   100         primary        no       no
    node1                   99          secondary      no       no

Now let's configure IP monitoring to detect failures at the network layer.

root@node0# show chassis cluster redundancy-group 1 ip-monitoring
global-weight 100;
global-threshold 200;
retry-interval 3;
retry-count 5;
family {
    inet {
        172.17.11.1 {
            weight 200;
            interface reth0.0 secondary-ip-address 172.17.11.99;
        }
    }
}

The config above instructs the SRX to:

  • Monitor IP address 172.17.11.1 by sending ICMP packets every 3 seconds (retry-interval).
  • Mark 172.17.11.1 unreachable if 5 consecutive attempts (retry-count) fail.
  • Then deduct that IP's weight (200) from the global-threshold value (200).
  • If the result of this deduction is 0, deduct global-weight (100) from the RG1 threshold (255).
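
The steps above can be sketched as a small Python model. This is only an illustration of the arithmetic described in this post, not how Junos implements it; the function names are made up for the example, and the detection time is a rough estimate (retry-interval multiplied by retry-count).

```python
RG_INITIAL_THRESHOLD = 255  # starting threshold of a redundancy group, as in the post


def detection_seconds(retry_interval, retry_count):
    """Rough time before an IP is marked unreachable (assumes evenly spaced probes)."""
    return retry_interval * retry_count


def rg_threshold_after_failures(global_weight, global_threshold, ip_weights, unreachable):
    """Return the RG threshold after the listed IPs are marked unreachable."""
    # Each unreachable IP deducts its own weight from global-threshold.
    remaining = global_threshold - sum(ip_weights[ip] for ip in unreachable)
    # Only when global-threshold is used up is global-weight deducted from the RG.
    if remaining <= 0:
        return max(RG_INITIAL_THRESHOLD - global_weight, 0)
    return RG_INITIAL_THRESHOLD


weights = {"172.17.11.1": 200}
print(detection_seconds(3, 5))
print(rg_threshold_after_failures(100, 200, weights, ["172.17.11.1"]))
print(rg_threshold_after_failures(100, 200, weights, []))
```

With the values from the config, the monitored IP is declared unreachable after roughly 15 seconds, and losing it lowers the RG1 threshold from 255 to 155.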

Note: As far as the documentation is concerned, the configured secondary-ip-address must not be
the primary IP address on reth0.0, but it must be on the same subnet. In a nutshell, the
primary node uses the reth0.0 interface address (172.17.11.100) as the source, and the secondary node
uses 172.17.11.99 (with the MAC address of the secondary node's child interface).
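
The note above can be expressed as a quick sanity check in Python. This is a hypothetical helper, not something the SRX runs; the /24 prefix length is an assumption, since the post does not state the subnet mask of reth0.0.

```python
import ipaddress


def check_secondary_ip(reth_interface, secondary_ip):
    """Check the two rules from the note: different address, same subnet."""
    iface = ipaddress.ip_interface(reth_interface)
    secondary = ipaddress.ip_address(secondary_ip)
    if secondary == iface.ip:
        return "invalid: same as the primary reth address"
    if secondary not in iface.network:
        return "invalid: outside the reth subnet"
    return "ok"


def probe_source(role, reth_interface, secondary_ip):
    """Source IP of the ICMP probe: reth address on the primary, secondary IP otherwise."""
    if role == "primary":
        return str(ipaddress.ip_interface(reth_interface).ip)
    return secondary_ip


RETH0 = "172.17.11.100/24"  # prefix length is assumed for this example
print(check_secondary_ip(RETH0, "172.17.11.99"))
print(probe_source("primary", RETH0, "172.17.11.99"))
print(probe_source("secondary", RETH0, "172.17.11.99"))
```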

After configuring ip-monitoring, here is the status. The IP is reachable from both nodes:

{primary:node0}
root@srx0> show chassis cluster ip-monitoring status
node0:
--------------------------------------------------------------------------

Redundancy group: 1

IP address                Status        Failure count  Reason
172.17.11.1               reachable       0            n/a

node1:
--------------------------------------------------------------------------

Redundancy group: 1

IP address                Status        Failure count  Reason
172.17.11.1               reachable       0            n/a

Now I disable ICMP responses on the gateway to simulate a failure, and
the IP address is marked as unreachable.

root@srx0> show chassis cluster ip-monitoring status
node0:
--------------------------------------------------------------------------

Redundancy group: 1

IP address                Status        Failure count  Reason
172.17.11.1               unreachable     1            unknown

node1:
--------------------------------------------------------------------------

Redundancy group: 1

IP address                Status        Failure count  Reason
172.17.11.1               reachable     0              n/a

When we check the cluster information, we can see that the RG1 threshold is now 155,
because the global-weight of 100 has been deducted from it (255-100=155).

{primary:node0}
root@srx0> show chassis cluster information
node0:
--------------------------------------------------------------------------
Redundancy mode:
    Configured mode: active-active
    Operational mode: active-active

Redundancy group: 0, Threshold: 255, Monitoring failures: none
    Events:
        Sep 19 14:34:43.366 : hold->secondary, reason: Hold timer expired
        Sep 19 14:34:59.817 : secondary->primary, reason: Better priority (100/99)

Redundancy group: 1, Threshold: 155, Monitoring failures: ip-monitoring
    Events:
        Sep 19 14:34:43.379 : hold->secondary, reason: Hold timer expired
        Sep 19 14:34:59.852 : secondary->primary, reason: Remote yeild (0/0)

You might have noticed that there is no failover yet. For a failover to happen,
the RG1 threshold must reach 0, which isn’t the case in this simulation.

Now I change the config, keeping the IP weight at 200 and raising global-weight to 255, to simulate a failover:

global-weight 255;
global-threshold 200;
retry-interval 3;
retry-count 5;
family {
    inet {
        172.17.11.1 {
            weight 200;
            interface reth0.0 secondary-ip-address 172.17.11.99;
        }
    }
}
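
To see why this change matters, here is a minimal Python sketch comparing the two global-weight values. It assumes the monitored IP is already unreachable and that its weight (200) has used up the global-threshold (200), so global-weight is deducted from the RG1 threshold of 255; the function name is made up for the example.

```python
RG_INITIAL_THRESHOLD = 255  # RG1 threshold before any monitoring failure


def rg1_outcome(global_weight):
    """Return (remaining RG1 threshold, whether RG1 fails over)."""
    threshold = max(RG_INITIAL_THRESHOLD - global_weight, 0)
    # A failover needs the RG threshold to reach 0.
    return threshold, threshold == 0


for gw in (100, 255):
    threshold, failover = rg1_outcome(gw)
    print(f"global-weight {gw}: RG1 threshold {threshold}, failover={failover}")
```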

After a network failure, node1 becomes primary this time.

{primary:node0}
root@srx0> show chassis cluster status
Cluster ID: 1
Node                  Priority          Status    Preempt  Manual failover

Redundancy group: 0 , Failover count: 1
    node0                   100         primary        no       no
    node1                   99          secondary      no       no

Redundancy group: 1 , Failover count: 46
    node0                   100         secondary      no       no
    node1                   99          primary        no       no

You might notice the high failover count (46) in the output. This happened after I made a mistake in my simulation :)
I thought that I could just block ICMP from 172.17.11.100 on the Linux GW device like this:

iptables -A INPUT -p icmp -s 172.17.11.100 -j DROP

and that it would trigger a failover. The assumption was correct: it does trigger a failover, as node1 is still
able to ping from 11.99. But once the failover occurs, node1 becomes primary and starts pinging from 11.100,
which fails again and causes another failover. This caused a sort of deadlock in the cluster, but
I learned something from it :)
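
The flapping can be reproduced with a toy simulation in Python. It is a simplification under a few assumptions: each cycle stands for one full detection period, a single failed IP is enough to fail RG1 over (global-weight 255), there is no hold-down timer, and a failover only happens when the secondary node's own probe still succeeds. The function and variable names are made up for the example.

```python
RETH_IP = "172.17.11.100"      # source address used by whichever node is primary
SECONDARY_IP = "172.17.11.99"  # secondary-ip-address used by the secondary node


def simulate_flapping(blocked_sources, cycles):
    """Return (final primary node, number of failovers) after the given cycles."""
    primary, secondary = "node0", "node1"
    failovers = 0
    for _ in range(cycles):
        primary_probe_ok = RETH_IP not in blocked_sources
        secondary_probe_ok = SECONDARY_IP not in blocked_sources
        # The primary fails its probe while the secondary still looks healthy,
        # so the roles swap; the new primary then probes from the blocked address.
        if not primary_probe_ok and secondary_probe_ok:
            primary, secondary = secondary, primary
            failovers += 1
    return primary, failovers


print(simulate_flapping({"172.17.11.100"}, 5))
print(simulate_flapping(set(), 5))
```

Blocking only 172.17.11.100 produces a failover in every cycle, because the blocked address always follows the primary role. With nothing blocked, the cluster stays on node0.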

IP monitoring is much clearer to me after these tests. Let me know if you have any feedback or questions.

UPDATE:
After I finished this post, Steven asked in the comments what happens if we don’t configure the secondary-ip-address. I didn’t know the exact answer. According to my test on an 11.4R7.5 release:

If you try to monitor an IP (e.g. 192.168.60.1) without it, you receive the following warning on every commit from then on.

{primary:node0}[edit]
root@srx210-1# commit 
[edit chassis cluster redundancy-group 1 ip-monitoring family inet]
  '192.168.60.1'
    Warning: interface option is not configured. You might get limited monitoring functionality
node0: 
configuration check succeeds
node1: 
[edit chassis cluster redundancy-group 1 ip-monitoring family inet]
  '192.168.60.1'
    Warning: interface option is not configured. You might get limited monitoring functionality
commit complete
node0: 
commit complete

Another odd thing for me is that the status of 192.168.60.1 (the monitored IP) is always reachable
on the secondary node. Maybe this is what is meant by “limited monitoring functionality”.

{secondary:node1}
root@srx210-2> show chassis cluster ip-monitoring status    
node0:
--------------------------------------------------------------------------

Redundancy group: 1

IP address                Status        Failure count  Reason
192.168.60.1              reachable       0            n/a                        

node1:
--------------------------------------------------------------------------

Redundancy group: 1

IP address                Status        Failure count  Reason
192.168.60.1              reachable       0            n/a                        

About: rtoodtoo

Genco has worked for more than 10 years as a Network/Support Engineer. He is also interested in Python, Linux, Security and SD-WAN, currently lives in the Netherlands and works as a Network Support Engineer at Tesla Inc. // JNCIE-SEC #223 / RHCE / PCNSE


11 thoughts on “SRX cluster ip-monitoring”

  1. Good article! But is the configuration of a secondary ip address mandatory? If i have a /30 interconnect network with my upstream router i don’t have any more addresses to assign…

    1. Hi Steven,
      What will possibly happen if you don’t set this secondary address (I am not even sure you are allowed to leave it unset; I don’t have the setup to check it right now) is that your RG will be ineligible on the secondary node, since monitoring will fail. Because of this, you won’t have any node to fail over to in case an issue occurs on the primary.

      1. Hello rtoodtoo,

        That makes perfect sense. But I have just the one assigned IP. What I would need is to be able to ping the upstream router from just the active node. When this fails the redundancy group containing just this reth interface would have to failover to the second node and start pinging from that. I don’t care that much which link to the upstream router is being used, just that a failover takes place as soon as connectivity is lost. Switching back and forth could be restricted with the hold-down-interval.
        But I’m afraid that this isn’t going to work.

        1. I will try to test this as soon as I have a setup for it, but nothing comes to mind at the moment as a workaround. I am not 100% sure about the ineligibility of the secondary node either, but I hope to test it and give you my results.

          1. Thanks for your help. I’m going to investigate this further, but I also don’t have a lab setup to try different configurations.

  2. I have updated the end of the post, Steven. I did a test, but the result wasn’t what I expected. It should be checked on the X releases, where the behavior might be different.

  3. Good article, Rtoodtoo. Can you please clarify how you resolved the issue below?

    “Yes assumption was correct it should trigger a failover as node1 is able to
    ping from 11.99 but once the failover occurs, node1 becomes primary and now it starts pinging from 11.100
    which fails again and causes another failover. This caused a sort of dead lock in the cluster but
    I have learned something in there:)”

    What have you learned here 🙂

    Thanks & Regards

    1. Salman,
      What I have learned there is that my testing method was incorrect. I have only a single gateway, and normally you should have at least two if you are really monitoring your uplinks. My test cluster simply couldn’t find a healthy uplink to fail over to.

      Cheers.

  4. I am in a similar situation: I have clustered firewalls and two ISPs, but only a /30 for the connection between my primary firewall and the primary ISP. I am familiar with this configuration in ScreenOS (a much more mature operating system, for sure), where you simply pinged a gateway, or some upstream device and failed to the backup if you lost connectivity.

    Is there simply no way to get this to work with a single public IP address?

    dak

  5. I haven’t performed a lab recreation for this, but I am concerned about a few things here.
    My understanding is that the ICMP packets (when the interface and the secondary IP are not specified) are sourced from the reth interface and use the reth MAC address as the source MAC. That being said, the downstream or upstream switches will definitely know how to handle those packets. Based on the explanation provided in the SRX series book (page 342), it appears to me that when you specify the interface and secondary IP, the ICMP packets use the specified secondary IP as the source address and come from the backup interface. The book says “this will allow the administrator to check the backup path coming from the secondary node”.

    If this is the case then:

    What happens if the specified interface is different from the one in the routing table as our outgoing interface to reach the monitored IP?

    How is this possible, if the backup interface is supposed to be not accepting traffic at all?

    How will the switches know how to forward this packet to the backup interface if the source MAC should be the reth MAC address?

    If the SRX won’t use the reth MAC address, then how will the SRX select the source MAC for those packets?

    I hope you can help me to clarify this shortly.

    Thank you and BTW, great post (as you always do)!

    Fabian Alfaro

    1. Not quite sure if I am getting you right. In the post, what I explained is that the secondary will source the packets from the child interface’s MAC, not the reth0 MAC. In a nutshell, you ping the same IP from two different IP addresses and MAC addresses: node0 using reth0 and node1 using the child (local) interface’s MAC, and the switch knows where these two different MACs are located.

      Genco.
