Dual ISP failover with RPM ip-monitoring

The Internet isn’t perfect, and link failures happen from time to time. How do we react to them: manually, or automatically? In this post I would like to show how Junos can take action when an upstream gateway becomes unreachable, and how SRX flow behaves in such a scenario. To do this we will use a handful of features available on an SRX box. Before getting started, check my test topology below in order to follow the post. It is a simulated Internet environment with some fake public IP addresses. BranchC is our client-side SRX device with two connected PCs (Debian1 and HostC), and we will do all the config magic on this BranchC device.

[Figure: test topology (dual_isp_rpm_ip_monitoring)]

Test Plan

  • 1) Create a routing instance for each ISP & cross import the interface routes between these two instances
  • 2) Forward Debian1 traffic to ISP1 and HostC traffic to ISP2 by using Filter Based Forwarding
  • 3) Monitor each ISP by using the RPM (Real-Time Performance Monitoring) feature
  • 4) Test the ideal condition traffic flow
  • 5) If any ISP link fails, fail over the default route to the other ISP by using the ip-monitoring feature
  • 6) Analyse the effects of this failover on established TCP/UDP traffic

Now we will go step by step and complete each task.

1) Create a routing instance for each ISP

First we need to create RIB groups so that each ISP routing instance can have the interface routes of the other ISP.

[edit]
root@branchC# show routing-options 
rib-groups {
    ISP1-to-ISP2 {
        import-rib [ ISP1.inet.0 ISP2.inet.0 ];
    }
    ISP2-to-ISP1 {
        import-rib [ ISP2.inet.0 ISP1.inet.0 ];
    }
}

Then create routing instances and activate rib-groups.

[edit]
root@branchC# show routing-instances 
ISP1 {
    instance-type virtual-router;
    interface ge-0/0/0.951;
    routing-options {
        interface-routes {
            rib-group inet ISP1-to-ISP2;
        }
        static {
            route 0.0.0.0/0 next-hop 173.1.1.1;
        }
    }
}
ISP2 {
    instance-type virtual-router;
    interface ge-0/0/0.202;
    routing-options {
        interface-routes {
            rib-group inet ISP2-to-ISP1;
        }
        static {
            route 0.0.0.0/0 next-hop 212.44.1.1;
        }
    }
}

Now the routing tables should be ready, i.e. the interface routes of each instance should be cross-imported into the other.

root@branchC> show route    

inet.0: 2 destinations, 2 routes (2 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

173.63.1.0/24      *[Direct/0] 1d 01:58:44
                    > via ge-0/0/0.963
173.63.1.1/32      *[Local/0] 1d 01:58:45
                      Local via ge-0/0/0.963

ISP1.inet.0: 6 destinations, 6 routes (6 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

0.0.0.0/0          *[Static/5] 1d 01:53:34
                    > to 173.1.1.1 via ge-0/0/0.951
173.1.1.0/24       *[Direct/0] 1d 01:54:14
                    > via ge-0/0/0.951
173.1.1.2/32       *[Local/0] 1d 01:54:14
                      Local via ge-0/0/0.951
173.1.1.10/32      *[Static/1] 1d 01:54:14
                      Receive
212.44.1.0/30      *[Direct/0] 1d 01:37:00   <<<< --- This is the route of ISP2 
                    > via ge-0/0/0.202
212.44.1.2/32      *[Local/0] 1d 01:37:00
                      Local via ge-0/0/0.202

ISP2.inet.0: 5 destinations, 5 routes (5 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

0.0.0.0/0          *[Static/5] 1d 01:54:14
                    > to 212.44.1.1 via ge-0/0/0.202
173.1.1.0/24       *[Direct/0] 1d 01:37:00  <<<< --- This is the route of ISP1
                    > via ge-0/0/0.951
173.1.1.2/32       *[Local/0] 1d 01:37:00
                      Local via ge-0/0/0.951
212.44.1.0/30      *[Direct/0] 1d 01:54:14
                    > via ge-0/0/0.202  
212.44.1.2/32      *[Local/0] 1d 01:54:14
                      Local via ge-0/0/0.202

We have completed the first task. Each routing instance is aware of the other instance’s interface routes. Now we should route traffic from the clients to their respective ISPs.
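
As a rough mental model of what the rib-groups achieve here, the following Python sketch (not Junos code; the function and variable names are made up for illustration) copies only the interface routes (Direct/Local) of each instance into the other instance's table, while static routes such as the default route stay private to their own instance.

```python
# Toy model of interface-routes rib-group sharing between two virtual routers.
# share_interface_routes and own_routes are hypothetical names, not Junos APIs.

own_routes = {
    "ISP1.inet.0": [
        ("0.0.0.0/0", "Static"),
        ("173.1.1.0/24", "Direct"),
        ("173.1.1.2/32", "Local"),
        ("173.1.1.10/32", "Static"),
    ],
    "ISP2.inet.0": [
        ("0.0.0.0/0", "Static"),
        ("212.44.1.0/30", "Direct"),
        ("212.44.1.2/32", "Local"),
    ],
}

# rib-group name -> import-rib list; the first table listed is the source RIB
rib_groups = {
    "ISP1-to-ISP2": ["ISP1.inet.0", "ISP2.inet.0"],
    "ISP2-to-ISP1": ["ISP2.inet.0", "ISP1.inet.0"],
}

def share_interface_routes(own_routes, rib_groups):
    """Return the tables after interface routes are copied to secondary RIBs."""
    tables = {name: list(routes) for name, routes in own_routes.items()}
    for import_rib in rib_groups.values():
        primary, secondaries = import_rib[0], import_rib[1:]
        for prefix, proto in own_routes[primary]:
            if proto in ("Direct", "Local"):  # interface routes only
                for table in secondaries:
                    tables[table].append((prefix, proto))
    return tables

tables = share_interface_routes(own_routes, rib_groups)
for name, routes in tables.items():
    print(name, len(routes), "routes")
```

With the routes from the output above as input, the model ends up with 6 entries in ISP1.inet.0 and 5 in ISP2.inet.0, the same counts that show route reports.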

2) Forward Debian1 traffic to ISP1 and HostC traffic to ISP2

Below, by using a firewall filter, we redirect each client’s traffic to the matching routing instance.

[edit]
root@branchC# show firewall 
family inet {
    filter redirect-to-isp {
        term to-isp1 {
            from {
                source-address {
                    173.63.1.100/32;
                }
            }
            then {
                routing-instance ISP1;
            }
        }
        term to-isp2 {
            from {
                source-address {
                    173.63.1.200/32;
                }
            }
            then {
                routing-instance ISP2;
            }
        }
        term default-allow {  
            then accept;    
        }
    }
}

But the filter isn’t active until we apply it on the incoming interface:

[edit]
root@branchC# show interfaces ge-0/0/0.963 
vlan-id 963;
family inet {
    filter {
        input redirect-to-isp;   <<< --- We are redirecting client traffic.
    }
    address 173.63.1.1/24;
}

Redirecting client traffic to routing instances is also completed. Now we will monitor ISP links.
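
To make the term ordering of the filter explicit, here is a small Python sketch of how the redirect-to-isp filter picks an action. It is only an illustration: evaluate_filter and FILTER_TERMS are hypothetical names, and the model assumes the usual first-match-wins evaluation with an implicit discard at the end of a Junos filter.

```python
import ipaddress

# Terms of the redirect-to-isp filter, in configured order:
# (term name, source prefix or None for "match everything", action)
FILTER_TERMS = [
    ("to-isp1", "173.63.1.100/32", "routing-instance ISP1"),
    ("to-isp2", "173.63.1.200/32", "routing-instance ISP2"),
    ("default-allow", None, "accept"),
]

def evaluate_filter(source_ip):
    """Return (term, action) of the first term matching the source address."""
    addr = ipaddress.ip_address(source_ip)
    for term, prefix, action in FILTER_TERMS:
        if prefix is None or addr in ipaddress.ip_network(prefix):
            return term, action
    # Only reached if no catch-all term exists: implicit discard
    return None, "discard"

for ip in ("173.63.1.100", "173.63.1.200", "173.63.1.50"):
    print(ip, "->", evaluate_filter(ip))
```

Any client other than the two listed hosts falls through to default-allow and is routed by inet.0, which is why that last term matters.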

3) Monitor each ISP by using RPM
Junos has a great real-time monitoring feature. You can continuously check link quality and probe remote hosts. RPM really deserves a dedicated post, but in short, what we do below is probe each ISP gateway with ICMP 5 times per test at 1-second intervals, and wait 3 seconds between tests. If 5 probes are lost in a single test, then the TEST FAILS. What does a test failure mean practically for us? It means we can take an IP monitoring action for this failure.

[edit]
root@branchC# show services rpm 
probe probe-isp1 {
    test test-isp1 {
        probe-type icmp-ping;
        target address 173.1.1.1;
        probe-count 5;
        probe-interval 1;
        test-interval 3;
        source-address 173.1.1.2;
        routing-instance ISP1;
        thresholds {
            total-loss 5;
        }
    }
}
probe probe-isp2 {
    test test-isp2 {
        probe-type icmp-ping;
        target address 212.44.1.1;
        probe-count 5;
        probe-interval 1;
        test-interval 3;
        source-address 212.44.1.2;
        routing-instance ISP2;
        thresholds {
            total-loss 5;
        }
    }
}
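
The pass/fail decision of a single test can be sketched in a few lines of Python. This is a simplified model, not how Junos implements it: rpm_test_failed and loss_percentage are made-up names, and the comparison assumes a test fails once the number of lost probes reaches the total-loss threshold, as described above.

```python
PROBE_COUNT = 5   # probe-count: probes sent per test
TOTAL_LOSS = 5    # thresholds total-loss: lost probes that fail a test

def rpm_test_failed(replies, total_loss=TOTAL_LOSS):
    """replies: one boolean per probe, True if the gateway answered."""
    lost = replies.count(False)
    return lost >= total_loss

def loss_percentage(replies):
    """Whole-number loss percentage for one test."""
    return 100 * replies.count(False) // len(replies)

all_lost = [False] * PROBE_COUNT
four_lost = [False, False, True, False, False]

print(rpm_test_failed(all_lost), loss_percentage(all_lost))
print(rpm_test_failed(four_lost), loss_percentage(four_lost))
```

With probe-count 5 and total-loss 5, only a test in which every probe is lost fails. A badly degraded link that still answers one probe in five (80% loss) keeps passing, so a lower total-loss value is worth considering if partial loss should also trigger the failover.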

To check the probe results:

root@branchC> show services rpm probe-results owner probe-isp1 test test-isp1 
    Owner: probe-isp1, Test: test-isp1
    Target address: 173.1.1.1, Source address: 173.1.1.2, Probe type: icmp-ping
    Routing Instance Name: ISP1
    Test size: 5 probes
    Probe results:
      Response received, Mon Nov 10 23:17:00 2014, No hardware timestamps
      Rtt: 5225 usec
    Results over current test:
      Probes sent: 5, Probes received: 5, Loss percentage: 0   <<<--- Probes are sent and received without any problem, no loss.
      Measurement: Round trip time
        Samples: 5, Minimum: 5212 usec, Maximum: 5307 usec, Average: 5264 usec, Peak to peak: 95 usec,
        Stddev: 39 usec, Sum: 26319 usec
    Results over last test:
      Probes sent: 5, Probes received: 5, Loss percentage: 0
      Test completed on Mon Nov 10 23:17:00 2014
      Measurement: Round trip time
        Samples: 5, Minimum: 5212 usec, Maximum: 5307 usec, Average: 5264 usec, Peak to peak: 95 usec,
        Stddev: 39 usec, Sum: 26319 usec
    Results over all tests:
      Probes sent: 64740, Probes received: 63097, Loss percentage: 2
      Measurement: Round trip time
        Samples: 63097, Minimum: 617 usec, Maximum: 15220 usec, Average: 5399 usec,
        Peak to peak: 14603 usec, Stddev: 631 usec, Sum: 340640344 usec

As we can see, the current test shows no loss. RPM monitoring on its own doesn't do anything for our scenario, though: we need to take an action when a test fails, and that action is IP monitoring, which we configure in step 5.
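
The counters in the probe output above can be cross-checked with a little arithmetic. The helper names below are hypothetical, and the rounding rules are an assumption inferred from the numbers shown (the loss percentage appears to be truncated, the average appears to be rounded); Junos may compute them differently.

```python
def loss_percent(sent, received):
    """Loss percentage as a whole number, truncated."""
    return 100 * (sent - received) // sent

def average_usec(sum_usec, samples):
    """Average round trip time in microseconds, rounded."""
    return round(sum_usec / samples)

def peak_to_peak(min_usec, max_usec):
    """Spread between the slowest and fastest probe."""
    return max_usec - min_usec

# "Results over all tests" from the output above
print(loss_percent(64740, 63097))
print(average_usec(340640344, 63097))
print(peak_to_peak(617, 15220))

# "Results over current test"
print(average_usec(26319, 5))
print(peak_to_peak(5212, 5307))
```

The all-tests counters work out to roughly 2.5% loss, so the 0% of the current test does not mean the link has always been clean.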

4) Test the ideal condition traffic flow
For this test to be successful, you must have source NAT configured, and your security policies should allow the traffic.

I am running traceroute from each host, and the traffic follows a different ISP for each one. This is what we wanted in the first place while both links are functional.

root@debian1:~# traceroute -n 87.1.1.6
traceroute to 87.1.1.6 (87.1.1.6), 30 hops max, 60 byte packets
 1  173.63.1.1  3.857 ms  3.811 ms  4.635 ms
 2  173.1.1.1  13.120 ms  13.130 ms  13.128 ms
 3  87.1.1.6  12.489 ms  13.112 ms  13.106 ms

root@hostC:~# traceroute -n 87.1.1.6
traceroute to 87.1.1.6 (87.1.1.6), 30 hops max, 60 byte packets
 1  173.63.1.1  2.876 ms  2.875 ms  3.493 ms
 2  212.44.1.1  12.244 ms  12.249 ms  12.305 ms
 3  87.1.1.6  12.080 ms  12.154 ms  12.188 ms

5) If any ISP link fails, fail over!
With the following config, we watch the RPM probes for failure. If a probe fails, a preferred default route pointing to the other ISP's gateway is installed in the affected routing instance, which is exactly what we want. This happens automatically on every failure, with no manual intervention.

[edit]
root@branchC# show services ip-monitoring    
policy track-isp1 {
    match {
        rpm-probe probe-isp1;
    }
    then {
        preferred-route {
            routing-instances ISP1 {
                route 0.0.0.0/0 {
                    next-hop 212.44.1.1;
                }
            }
        }
    }
}
policy track-isp2 {
    match {
        rpm-probe probe-isp2;
    }
    then {
        preferred-route {
            routing-instances ISP2 {
                route 0.0.0.0/0 {
                    next-hop 173.1.1.1;
                }
            }
        }
    }
}

Now we will simulate a failure on ISP1, after which Debian1 should also be routed through ISP2 instead of ISP1. Aha, link failed!

[Figure: ISP1 link failure (isp1_link_failure)]

Now check the IP monitoring status and the ISP1 routing table:

root@branchC> show services ip-monitoring status                                 

Policy - track-isp1 (Status: FAIL)
  RPM Probes:
    Probe name             Test Name       Address          Status   
    ---------------------- --------------- ---------------- ---------
    probe-isp1             test-isp1       173.1.1.1        FAIL     <<< --- TEST FAILED 
  Route-Action:
    route-instance    route             next-hop         state
    ----------------- ----------------- ---------------- ------------- 
    ISP1              0.0.0.0/0         212.44.1.1       APPLIED     <<< --- Route action is taken and 0/0 next-hop is set to ISP2.

Policy - track-isp2 (Status: PASS)
  RPM Probes:
    Probe name             Test Name       Address          Status   
    ---------------------- --------------- ---------------- ---------
    probe-isp2             test-isp2       212.44.1.1       PASS     
  Route-Action:
    route-instance    route             next-hop         state
    ----------------- ----------------- ---------------- ------------- 
    ISP2              0.0.0.0/0         173.1.1.1        NOT-APPLIED  

root@branchC> show route table ISP1.inet.0 

ISP1.inet.0: 6 destinations, 7 routes (6 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

0.0.0.0/0          *[Static/1] 00:00:25, metric2 0
                    > to 212.44.1.1 via ge-0/0/0.202 <<< --- New default gateway is ISP2 now. Yupppiii!!!
                    [Static/5] 1d 02:24:50
                    > to 173.1.1.1 via ge-0/0/0.951
173.1.1.0/24       *[Direct/0] 1d 02:25:30
                    > via ge-0/0/0.951
173.1.1.2/32       *[Local/0] 1d 02:25:30
                      Local via ge-0/0/0.951
173.1.1.10/32      *[Static/1] 1d 02:25:30
                      Receive
212.44.1.0/30      *[Direct/0] 1d 02:08:16
                    > via ge-0/0/0.202
212.44.1.2/32      *[Local/0] 1d 02:08:16
                      Local via ge-0/0/0.202
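
Why does the injected route win? Both default routes are static, but the one installed by ip-monitoring has preference 1 while the configured one has preference 5, and Junos activates the route with the lowest preference value. A minimal Python sketch of that selection (active_route and the dictionaries are illustrative names only):

```python
def active_route(candidates):
    """Pick the candidate with the lowest preference value."""
    return min(candidates, key=lambda route: route["preference"])

# Default route configured under routing-instances ISP1
configured = {"next_hop": "173.1.1.1", "via": "ge-0/0/0.951", "preference": 5}
# Default route injected by ip-monitoring policy track-isp1 on failure
injected = {"next_hop": "212.44.1.1", "via": "ge-0/0/0.202", "preference": 1}

before_failure = active_route([configured])
after_failure = active_route([configured, injected])

print("before:", before_failure["next_hop"])
print("after: ", after_failure["next_hop"])
```

Once the probe passes again and the injected route is withdrawn, the configured Static/5 route is the only candidate left and becomes active again.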

Now let's see if this new condition is working for hosts Debian1 and HostC. As you can see below, Debian1 is now following the ISP2 link instead of the failed ISP1 link.

root@debian1:~# traceroute -n 87.1.1.6
traceroute to 87.1.1.6 (87.1.1.6), 30 hops max, 60 byte packets
 1  173.63.1.1  1.165 ms  1.154 ms  1.141 ms
 2  212.44.1.1  10.567 ms  10.929 ms  10.923 ms
 3  87.1.1.6  10.501 ms  10.501 ms  10.486 ms

root@hostC:~# traceroute -n 87.1.1.6
traceroute to 87.1.1.6 (87.1.1.6), 30 hops max, 60 byte packets
 1  173.63.1.1  4.354 ms  4.353 ms  4.980 ms
 2  212.44.1.1  14.263 ms  14.261 ms  14.258 ms
 3  87.1.1.6  13.552 ms  14.179 ms  14.172 ms

6) Analyse the effects of this failover on established TCP/UDP traffic
To investigate how SRX flow behaves upon this route update, I initiated an SSH connection towards the remote box 87.1.1.6 before the link failure. Below is the session entry of this connection.

root@branchC> show security flow session destination-prefix 87.1.1.6    
Session ID: 6786, Policy name: default-policy-00/2, Timeout: 1788, Valid
  In: 173.63.1.100/33127 --> 87.1.1.6/22;tcp, If: ge-0/0/0.963, Pkts: 42, Bytes: 4959
  Out: 87.1.1.6/22 --> 173.1.1.2/5184;tcp, If: ge-0/0/0.951, Pkts: 41, Bytes: 5317
Total sessions: 1

The session is established and working fine. I also enabled flow trace to see what flow tells me once I send a packet after the link failure. My comments are inline in the flow trace.

Nov 10 22:36:39 22:36:39.463629:CID-0:RT:<173.63.1.100/33127->87.1.1.6/22;6> matched filter rpm-ff:
Nov 10 22:36:39 22:36:39.463652:CID-0:RT:packet [100] ipid = 23312, @0x4c4123d2
Nov 10 22:36:39 22:36:39.463660:CID-0:RT:---- flow_process_pkt: (thd 1): flow_ctxt type 15, common flag 0x0, mbuf 0x4c412180, rtbl_idx = 7
Nov 10 22:36:39 22:36:39.463665:CID-0:RT: flow process pak fast ifl 76 in_ifp ge-0/0/0.963
Nov 10 22:36:39 22:36:39.463698:CID-0:RT:  ge-0/0/0.963:173.63.1.100/33127->87.1.1.6/22, tcp, flag 18 
Nov 10 22:36:39 22:36:39.463707:CID-0:RT: find flow: table 0x58f397a0, hash 61800(0xffff), sa 173.63.1.100, da 87.1.1.6, sp 33127, dp 22, proto 6, tok 7 
Nov 10 22:36:39 22:36:39.463715:CID-0:RT:Found: session id 0x1a82. sess tok 7 
Nov 10 22:36:39 22:36:39.463720:CID-0:RT:  flow got session.
Nov 10 22:36:39 22:36:39.463721:CID-0:RT:  flow session id 6786  <<< --- Session is matched
Nov 10 22:36:39 22:36:39.463735:CID-0:RT:flow_ipv4_rt_lkup success 87.1.1.6, iifl 0x0, oifl 0x44 
Nov 10 22:36:39 22:36:39.463740:CID-0:RT:  route lookup failed: dest-ip 87.1.1.6 orig ifp ge-0/0/0.951 output_ifp ge-0/0/0.202 fto 0x53bf21a8 orig-zone 6 out-zone 10 vsd 0   <<< --- route lookup fails as currently this destination is pointing to a different interface
Nov 10 22:36:39 22:36:39.463745:CID-0:RT:  readjust timeout to 6 s  <<< --- flow adjusts the timeout to 6 seconds immediately. 
Nov 10 22:36:39 22:36:39.463748:CID-0:RT:  packet dropped,   pak dropped since re-route failed  <<< --- and the packet is dropped.
Nov 10 22:36:39 22:36:39.463751:CID-0:RT: ----- flow_process_pkt rc 0x7 (fp rc -1)

I have learned something new here. Apparently, in this situation flow immediately drops the session timeout to 6 seconds. After seeing this I ran the flow session command once again and saw that the session timeout was down to 2 seconds.

root@branchC> show security flow session destination-prefix 87.1.1.6    
Session ID: 6786, Policy name: default-policy-00/2, Timeout: 2, Valid
  In: 173.63.1.100/33127 --> 87.1.1.6/22;tcp, If: ge-0/0/0.963, Pkts: 42, Bytes: 4959
  Out: 87.1.1.6/22 --> 173.1.1.2/5184;tcp, If: ge-0/0/0.951, Pkts: 41, Bytes: 5317
Total sessions: 1

Once these two seconds pass as well, flow deletes the session from the session table.

Nov 10 22:36:45 22:36:45.157575:CID-0:RT:jsf sess close notify
Nov 10 22:36:45 22:36:45.157587:CID-0:RT:flow_ipv4_del_flow: sess 6786, in hash 32
Nov 10 22:36:45 22:36:45.157592:CID-0:RT:flow_ipv4_del_flow: sess 6786, in hash 32
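
The sequence seen in the trace can be summarised in a short Python sketch. It is a guess at the decision logic based only on this trace, not a description of the real flow code: the Session class and process_packet are hypothetical, and the assumption that the zone change (orig-zone 6 versus out-zone 10) is what makes the re-route fail is mine.

```python
from dataclasses import dataclass

REROUTE_FAIL_TIMEOUT = 6  # seconds, from "readjust timeout to 6 s" in the trace

@dataclass
class Session:
    session_id: int
    out_ifp: str    # egress interface the session was created with
    out_zone: int   # zone of that egress interface
    timeout: int    # remaining session timeout in seconds

def process_packet(session, route_ifp, route_zone):
    """Return 'forward' or 'drop' for a packet that matches an existing session."""
    if route_ifp == session.out_ifp:
        return "forward"
    if route_zone != session.out_zone:
        # Assumption: the new egress is in another zone, so the re-route fails
        session.timeout = min(session.timeout, REROUTE_FAIL_TIMEOUT)
        return "drop"
    # Assumption: a same-zone change lets the session follow the new route
    session.out_ifp = route_ifp
    return "forward"

ssh = Session(session_id=6786, out_ifp="ge-0/0/0.951", out_zone=6, timeout=1788)
verdict = process_packet(ssh, route_ifp="ge-0/0/0.202", route_zone=10)
print(verdict, ssh.timeout)
```

Under this model the SSH session cannot survive the failover: the client has to reconnect, and the new session is then built over ISP2.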

We have seen the effects of failover on TCP, but I will leave the effects on UDP traffic to the reader 🙂 UDP behaviour is a bit different from this, and if required some measures can be taken to mitigate it. Discover it and share your findings with me.

About: rtoodtoo

Worked for more than 10 years as a Network/Support Engineer and also interested in Python, Linux, Security and SD-WAN // JNCIE-SEC #223 / RHCE / PCNSE


21 thoughts on “Dual ISP failover with RPM ip-monitoring”

    1. Thanks for feedback Leon. To be frank, I don’t have a high end box at home to do my tests:)
      I only have branch SRX and firefly. Hence anything I post is mostly tested on branch or Firefly Perimeter.

      Cheers
      Genco.

  1. hello,

    In the routing table for inet.0 there is no route for 0.0.0.0/0 so the Firewall can’t reach the Internet ?
    Only for devices on the LAN segment(segment where the input filter applies on) has inet connection

    Do you know how to set the route for 0.0.0.0/0 for the Juniper itself ? since he uses the inet.0 routing table..
    I already tried set routing-options import static route 0.0.0.0/0 …… and so on..

    Kind Regards..

      1. you can’t ping internet from the box and you even can’t ping the box itself from Debian1 or HostC, coz ” this just a test system”, right? : )

  2. What if an ISP is flapping and you have ~70% loss ? it will not match your threshold.
    how would you add a second test that force the failover if you reach more than, let’s say 20% loss ?
    I’m not sure 5 ping over 5 sec will be enough for this test. What’s your point of view ?

    1. It all depends on your total-loss / probe-count ratio. You can define up to 15 probes per test. You can of course add multiple tests, but all should fail for the probe to be matched. If the line is flapping outside of your probe period, then it won’t be detected, but if you keep the probe count and interval small, then you increase your chances.

      Genco.

  3. OK this has been driving me crazy. I figured out and implemented all of the concepts in this article on our SRX-210 router quite some time ago.. What I CANNOT figure out is how to automatically revert back to the original default gateway once ISP1 comes back online.

    My experience is that the failover works just fine, but the default gateway for PC’s behind the firewall remains set to ISP2 as reflected by the results of a traceroute command. That causes all kinds of problems with asymmetric routing; replies to inbound traffic from ISP1 are routed out ISP2. I have to manually execute the CLI command ‘restart routing’ to fix it.

    What has been your experience and do you have any suggestions?

    Eric

    1. Eric,
      For your first issue, that doesn’t happen in my lab tests. Once the ICMP replies are received on the failed ISP, default route is restored.
      If you are using an older release, you may upgrade to the JTAC recommended release.
      As for your second issue, asymmetric routing occurs if you don’t set your instance type in your routing instances to virtual-router or
      if you have only one logical interface for both of your ISPs. I would recommend you have a look at http://kb.juniper.net/KB27946 to see why you might be experiencing asymmetric routing.

      1. What might be happening is that Eric’s probe is relying on the static route that is being inserted by the ip-monitoring config when a failover occurs. In that situation when the probe fails and the route changes then the probe might never go back to succeeding again.

        Your ping probes should include the routing-instance and maybe destination-interface as well and not be using the static/default routes that the ip-monitoring is changing.

        PS. Nice blog Genco, I’m studying for my JNCIP-Sec and finding it to be very useful reading.

  4. Can’t we use ecmp with per flow load balancing and get the same auto failover without need to configure Rpm.

  5. Thank you, this is an excellent post that partially addresses something I’m working on now. One question with two parts: I need to “overlay” on this scenario two additional pieces. Maybe you have an idea or two. 1, is the question of failing over redundant ipsec vpns. Do they follow the same process where st0.1 is in vr1 and st0.2 is in vr2, failing over with ip-monitoring? It seems like it would. 2, we are using gre tunnels (also redundant, with gr-0/0/0.0 in vr1 and gr-0/0/0.1 in vr2); we forward all port 80 & 443 requests to the gre. We need to have failover between the two gre’s. If you have an idea, that’d be great. In neither case do I require specific hosts to go out one specific isp; we have a primary and a secondary.

  6. Genco,

    First of all thanks for your wonderful blog. So helpful and insightful. I’ve got a question regarding the RPM probe tests themselves that I haven’t had answered yet: Do you know if the test criteria are read/processed as AND or OR statements? I’ve been trying to figure out how it processes, for example configuring:

    thresholds {
        successive-loss 3;
        total-loss 6;
    }

    If it meets one criteria and not the other, will it still trigger? Thanks in advance! Love the blog.

    1. For this feature make sure you are on the latest stable release. There have been several fixes in this area and, as far as I recall, one of them can trigger the threshold to be successful, but I am not 100% sure as it has been some time since I last touched this.

      Genco.

  7. Hi mate, I need to know how to configure multiple IP in the same interface!

    I have 4 IP in IP ISP1 and 5 in the ISP2, and I need to specify that server need you out for each IP!

  8. i’ve configured the probes as you suggested but it’s failing i’m not sure why?

    root@Vencouver# show services
    rpm {
        probe probe-isp1 {
            test test-1 {
                probe-type icmp-ping;
                target address 8.8.8.8;
                probe-count 5;
                probe-interval 1;
                test-interval 3;
                source-address 202.166.199.34;
                routing-instance routing-table-ISP1;
                thresholds {
                    total-loss 5;
                }
            }
        }
    }
    ip-monitoring {
        policy track-isp1 {
            match {
                rpm-probe probe-isp1;
            }
            then {
                preferred-route {
                    routing-instances routing-table-ISP1 {
                        route 0.0.0.0/0 {
                            next-hop 110.34.25.81;
                        }
                    }
                }
            }
        }
    }

    show routing-instances routing-table-ISP1
    instance-type forwarding;
    routing-options {
        static {
            route 0.0.0.0/0 next-hop 202.166.199.33;
        }
    }

    #run show services rpm probe-results owner probe-isp1 test test-1
        Owner: probe-isp1, Test: test-1
        Target address: 202.79.32.4, Source address: 202.166.199.34, Probe type: icmp-ping
        Routing Instance Name: routing-table-ISP1
        Test size: 5 probes
        Probe results:
          Request timed out, Mon Aug 29 16:52:12 2016
        Results over current test:
          Probes sent: 5, Probes received: 0, Loss percentage: 100
        Results over last test:
          Probes sent: 5, Probes received: 0, Loss percentage: 100
        Results over all tests:
          Probes sent: 355, Probes received: 0, Loss percentage: 100
