Dual ISP failover with RPM ip-monitoring

The Internet isn’t perfect, and links fail from time to time. How do we react to these failures: manually, or automatically? In this post I would like to show how Junos can take action when an upstream gateway becomes unreachable, and how SRX flow behaves in such a scenario. To achieve this we will use a handful of features currently available on an SRX box. The test topology is a simulated Internet environment with some fake public IP addresses. BranchC is our client-side SRX device with two connected PCs (Debian1 and HostC), and we will do all the config magic on this BranchC device.


Test Plan

  • 1) Create one routing instance per ISP & cross-import the interface routes between these two instances
  • 2) Forward Debian1 traffic to ISP1 and HostC traffic to ISP2 by using filter-based forwarding
  • 3) Monitor each ISP by using the RPM (real-time performance monitoring) feature
  • 4) Test the ideal condition traffic flow
  • 5) If either ISP link fails, fail over the default route to the other ISP by using the ip-monitoring feature
  • 6) Analyse the effects of this failover on established TCP/UDP traffic

Now we will go step by step and complete each task.

1) Create a routing instance for each ISP

First we need to create RIB groups so that each ISP routing instance can have the interface routes of the other ISP.
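A minimal sketch of what these RIB groups could look like. The group names are hypothetical, and the instance names routing-table-ISP1 and routing-table-ISP2 are assumed for illustration. Each group lists the owning instance's table first, because the first table in import-rib is treated as the primary one.

```
routing-options {
    rib-groups {
        /* Assumed name: copies ISP1 interface routes into the ISP2 table */
        ISP1-to-ISP2 {
            import-rib [ routing-table-ISP1.inet.0 routing-table-ISP2.inet.0 ];
        }
        /* Assumed name: copies ISP2 interface routes into the ISP1 table */
        ISP2-to-ISP1 {
            import-rib [ routing-table-ISP2.inet.0 routing-table-ISP1.inet.0 ];
        }
    }
}
```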

Then create routing instances and activate rib-groups.
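As a sketch, the two instances might be defined as follows. The interface names and gateway addresses (documentation ranges 192.0.2.0/24 and 198.51.100.0/24) are placeholders, not the values of the test topology, and the rib-group names ISP1-to-ISP2 and ISP2-to-ISP1 are assumed to exist under routing-options.

```
routing-instances {
    routing-table-ISP1 {
        instance-type virtual-router;
        /* Placeholder: the interface facing ISP1 */
        interface ge-0/0/0.0;
        routing-options {
            interface-routes {
                rib-group inet ISP1-to-ISP2;
            }
            static {
                /* Placeholder: the ISP1 gateway address */
                route 0.0.0.0/0 next-hop 192.0.2.1;
            }
        }
    }
    routing-table-ISP2 {
        instance-type virtual-router;
        /* Placeholder: the interface facing ISP2 */
        interface ge-0/0/1.0;
        routing-options {
            interface-routes {
                rib-group inet ISP2-to-ISP1;
            }
            static {
                /* Placeholder: the ISP2 gateway address */
                route 0.0.0.0/0 next-hop 198.51.100.1;
            }
        }
    }
}
```

Applying the rib-group under interface-routes is what activates the cross import of the directly connected routes.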

Now the routing tables should be ready, i.e. the interface routes of each instance should be cross-imported into the other.

We have completed the first task. Each routing instance is aware of its sibling routing instance. Now we should route traffic from the clients to their respective ISPs.

2) Forward Debian1 traffic to ISP1 and HostC traffic to ISP2

Using a firewall filter, we redirect each client's traffic to its routing instance.
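A filter along these lines could do the redirection. This is only a sketch: the filter and term names and the client addresses are hypothetical, and the instance names are assumed to be routing-table-ISP1 and routing-table-ISP2.

```
firewall {
    family inet {
        filter redirect-to-isp {
            term to-isp1 {
                from {
                    source-address {
                        /* Placeholder: address of Debian1 */
                        10.10.10.11/32;
                    }
                }
                then {
                    routing-instance routing-table-ISP1;
                }
            }
            term to-isp2 {
                from {
                    source-address {
                        /* Placeholder: address of HostC */
                        10.10.10.12/32;
                    }
                }
                then {
                    routing-instance routing-table-ISP2;
                }
            }
            /* Anything else is forwarded using the default table */
            term default-allow {
                then accept;
            }
        }
    }
}
```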

The filter doesn’t take effect until we apply it as an input filter on the incoming (client-facing) interface.
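For illustration, attaching a filter to the LAN interface might look like this. The interface name, the address and the filter name redirect-to-isp are assumptions.

```
interfaces {
    /* Placeholder: the interface the client PCs are connected to */
    ge-0/0/2 {
        unit 0 {
            family inet {
                filter {
                    input redirect-to-isp;
                }
                /* Placeholder: LAN gateway address */
                address 10.10.10.1/24;
            }
        }
    }
}
```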

Redirecting client traffic to routing instances is also completed. Now we will monitor ISP links.

3) Monitor each ISP by using RPM
Junos has a great real-time performance monitoring (RPM) feature. You can continuously check link quality and probe remote hosts. RPM really deserves a dedicated post, but in short: we probe each ISP gateway with ICMP five times at 1-second intervals, and if all 5 probes in a single test are lost, the TEST FAILS. What does a test failure mean for us in practice? It means we can take an IP monitoring action in response.
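A probe matching this description could be sketched as below for ISP1 (ISP2 would get an equivalent probe). The probe and test names, the target address and the test-interval value are assumptions. The probe count, probe interval and loss threshold follow the text.

```
services {
    rpm {
        probe probe-isp1 {
            test test-isp1 {
                probe-type icmp-ping;
                /* Placeholder: the ISP1 gateway address */
                target address 192.0.2.1;
                probe-count 5;
                probe-interval 1;
                /* Assumed pause in seconds between two tests */
                test-interval 3;
                /* Send the probes from the ISP1 instance */
                routing-instance routing-table-ISP1;
                thresholds {
                    /* The test fails when all 5 probes are lost */
                    total-loss 5;
                }
            }
        }
    }
}
```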

The probe results can be checked with the show services rpm probe-results command.

As we can see there isn’t any loss at the moment. RPM monitoring on its own, without an action, doesn’t mean much in our scenario. We need to take an action when a test fails, and that is what IP-MONITORING is for. We will configure it in step 5.

4) Test the ideal condition traffic flow
For this test to be successful, you must have SOURCE NAT configured, and the security policies must allow the traffic.

I am running traceroute from each host, and the traffic follows a different ISP for each host. This is what we wanted in the first place while both links are functional.
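As a rough sketch of these prerequisites, an interface-based source NAT rule and a permissive policy could look like the following. The zone names trust and untrust and the rule and policy names are assumptions. The real setup may use different zones or tighter matches.

```
security {
    nat {
        source {
            rule-set lan-to-isp {
                from zone trust;
                to zone untrust;
                rule interface-nat {
                    match {
                        source-address 0.0.0.0/0;
                    }
                    then {
                        source-nat {
                            /* Translate to the egress interface address */
                            interface;
                        }
                    }
                }
            }
        }
    }
    policies {
        from-zone trust to-zone untrust {
            policy allow-all {
                match {
                    source-address any;
                    destination-address any;
                    application any;
                }
                then {
                    permit;
                }
            }
        }
    }
}
```

Translating to the egress interface address means the source address follows whichever ISP link the traffic actually leaves through.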

5) If any ISP link fails, failover!
With an ip-monitoring policy we watch the RPM probes for failure. When a probe fails, we set the preferred default route of the affected routing instance to the other ISP’s default gateway, which is exactly what we want. This happens automatically on every failure event.

Now we will simulate a failure on ISP1, after which the Debian1 device will also be routed through ISP2 instead of ISP1. Aha, link failed!
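A policy for ISP1 could be sketched like this (ISP2 would get a mirrored policy pointing at the ISP1 gateway). The policy name, probe name and next-hop address are assumptions.

```
services {
    ip-monitoring {
        policy track-isp1 {
            match {
                /* Name of the RPM probe that monitors ISP1 */
                rpm-probe probe-isp1;
            }
            then {
                preferred-route {
                    routing-instances routing-table-ISP1 {
                        route 0.0.0.0/0 {
                            /* Placeholder: the ISP2 gateway address */
                            next-hop 198.51.100.1;
                        }
                    }
                }
            }
        }
    }
}
```

Note that the next hop in this sketch sits in ISP2's subnet. The ISP1 instance can resolve it only because the interface routes were cross-imported in step 1.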


Now check the IP monitoring status with the show services ip-monitoring status command.

Now let’s see if this new condition is working for hosts Debian1 and HostC. As the traceroute shows, Debian1 is now following the ISP2 link instead of the failed ISP1 link.

6) Analyse the effects of this failover on established TCP/UDP traffic
To investigate how SRX flow behaves upon this route update, I initiated an SSH connection towards the remote box before the link failure and checked the session entry of this connection with show security flow session.

The session is established and working fine. I also enabled flow trace to see what flow tells me once I send a packet after the link failure. My comments are inline in the flow trace.
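For readers who want to repeat this, flow trace can be enabled with a configuration along these lines. The file name, filter name and prefixes are hypothetical placeholders. The packet filter keeps the trace limited to the SSH test traffic.

```
security {
    flow {
        traceoptions {
            file flow-trace size 1m files 3;
            flag basic-datapath;
            packet-filter ssh-test {
                /* Placeholder: address of the client that opened the SSH session */
                source-prefix 10.10.10.11/32;
                /* Placeholder: address of the remote box */
                destination-prefix 203.0.113.50/32;
            }
        }
    }
}
```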

I have learned something new here. Apparently, in this new situation flow immediately drops the session timeout to 6 seconds. After seeing this I ran the flow session command once again and saw that the session timeout was now 2 seconds.

And once these two seconds have also passed, flow deletes the session from the session table.

We have seen the effects of failover on TCP, but I will leave the effects on UDP traffic to the reader 🙂 UDP behaviour is a bit different, and if required some measures can be taken to mitigate it; feel free to discover them and share them with me.

21 thoughts on “Dual ISP failover with RPM ip-monitoring”

    1. rtoodtoo Post author

      Thanks for feedback Leon. To be frank, I don’t have a high end box at home to do my tests:)
      I only have branch SRX and firefly. Hence anything I post is mostly tested on branch or Firefly Perimeter.


  1. Nico


    In the routing table for inet.0 there is no route for so the Firewall can’t reach the Internet ?
    Only for devices on the LAN segment(segment where the input filter applies on) has inet connection

    Do you know how to set the route for for the Juniper itself ? since he uses the inet.0 routing table..
    I already tried set routing-options import static route …… and so on..

    Kind Regards..

      1. dmitry

        you can’t ping internet from the box and you even can’t ping the box itself from Debian1 or HostC, coz ” this just a test system”, right? : )

  2. Z1

    What if an ISP is flapping and you have ~70% loss ? it will not match your threshold.
    how would you add a second test that force the failover if you reach more than, let’s say 20% loss ?
    I’m not sure 5 ping over 5 sec will be enough for this test. What’s your point of view ?

    1. rtoodtoo Post author

      It all depends on your total-loss / probe-count ratio. You can define up to 15 probes per test. You can of course add multiple tests, but all of them must fail for the probe to be matched. If the line is flapping outside of your probe period, it won’t be detected, but if you keep the probe count and interval small, you increase your chances.


  3. Eric Havemann

    OK this has been driving me crazy. I figured out and implemented all of the concepts in this article on our SRX-210 router quite some time ago.. What I CANNOT figure out is how to automatically revert back to the original default gateway once ISP1 comes back online.

    My experience is that the failover works just fine, but the default gateway for PC’s behind the firewall remains set to ISP2 as reflected by the results of a traceroute command. That causes all kinds of problems with asymmetric routing; replies to inbound traffic from ISP1 are routed out ISP2. I have to manually execute the CLI command ‘restart routing’ to fix it.

    What has been your experience and do you have any suggestions?


    1. rtoodtoo Post author

      For your first issue, that doesn’t happen in my lab tests. Once the ICMP replies are received on the failed ISP, default route is restored.
      If you are using an older release, you may upgrade to the JTAC recommended release.
      As for your second issue, asymmetric routing occurs if you don’t set your instance type in your routing instances to virtual-router or
      if you have only one logical interface for both of your ISPs. I would recommend having a look at http://kb.juniper.net/KB27946 to see why you might be experiencing asymmetric routing.

      1. Kerry Thompson

        What might be happening is that Eric’s probe is relying on the static route that is being inserted by the ip-monitoring config when a failover occurs. In that situation when the probe fails and the route changes then the probe might never go back to succeeding again.

        Your ping probes should include the routing-instance and maybe destination-interface as well and not be using the static/default routes that the ip-monitoring is changing.

        PS. Nice blog Genco, I’m studying for my JNCIP-Sec and finding it to be very useful reading.

  4. Kishan

    Can’t we use ecmp with per flow load balancing and get the same auto failover without need to configure Rpm.

  5. Ryan

    Thank you, this is an excellent post that partially address something I’m working on now. One question with two parts: I need to “overlay” on this scenario two additional pieces. Maybe you have an idea or two. 1, is the question of failing over redundant ipsec vpns. Do they follow the same process where st0.1 in vr1 and st0.2 is in vr2- failing over with ip-monitoring? It seems like it would. 2 we are using gre tunnels (also redundant with gr-0/0/0.0 in vr1 and gr-0/0/0.1 in vr2; we forward all port 80&443 requests to the gre. We need to have failover between the two gre’s. If you have an idea, that’d be great. In neither case to I require specific hosts to go out one specific isps, we have a primary and secondary.

  6. Adharma (@AdharmaMusic)


    First of all thanks for your wonderful blog. So helpful and insightful. I’ve got a question regarding the RPM probe tests themselves that I haven’t had answered yet: Do you know if the test criteria are read/processed as AND or OR statements? I’ve been trying to figure out how it processes, for example configuring:

    thresholds {
        successive-loss 3;
        total-loss 6;
    }

    If it meets one criteria and not the other, will it still trigger? Thanks in advance! Love the blog.

    1. rtoodtoo Post author

      For this feature make sure you are on the latest stable release. There have been several fixes for this and, as far as I recall, one of them can trigger the threshold to be successful, but I am not 100% sure as it has been some time since I last touched this.


  7. MarceFe

    Hi mate, I need to know how to configure multiple IP in the same interface!

    I have 4 IP in IP ISP1 and 5 in the ISP2, and I need to specify that server need you out for each IP!

  8. Bikram Manandhar

    i’ve configured the probes as you suggested but it’s failing i’m not sure why?

    root@Vencouver# show services
    rpm {
        probe probe-isp1 {
            test test-1 {
                probe-type icmp-ping;
                target address;
                probe-count 5;
                probe-interval 1;
                test-interval 3;
                routing-instance routing-table-ISP1;
                thresholds {
                    total-loss 5;
                }
            }
        }
    }
    ip-monitoring {
        policy track-isp1 {
            match {
                rpm-probe probe-isp1;
            }
            then {
                preferred-route {
                    routing-instances routing-table-ISP1 {
                        route {
                        }
                    }
                }
            }
        }
    }

    show routing-instances routing-table-ISP1
    instance-type forwarding;
    routing-options {
        static {
            route next-hop;
        }
    }
    #run show services rpm probe-results owner probe-isp1 test test-1
    Owner: probe-isp1, Test: test-1
    Target address:, Source address:, Probe type: icmp-ping
    Routing Instance Name: routing-table-ISP1
    Test size: 5 probes
    Probe results:
    Request timed out, Mon Aug 29 16:52:12 2016
    Results over current test:
    Probes sent: 5, Probes received: 0, Loss percentage: 100
    Results over last test:
    Probes sent: 5, Probes received: 0, Loss percentage: 100
    Results over all tests:
    Probes sent: 355, Probes received: 0, Loss percentage: 100

