In this post you can find step-by-step instructions to set up an SRX firewall chassis cluster on various branch models. Before starting your cluster configuration, please make sure you have installed the JTAC recommended release, which you can find at http://kb.juniper.net/KB21476
Please note that the instructions below cover several branch models, each of which has a slightly different configuration; pick the one that matches your device. You can also use the HA configuration tool developed by Juniper for easier configuration: http://www.juniper.net/support/tools/srxha/
1) On branch SRX devices (1xx and 2xx models only), Ethernet switching must be disabled before enabling the cluster.
user@host#delete vlans
user@host#delete interfaces vlan
user@host#delete interfaces ge-0/0/0.0 family ethernet-switching
user@host#delete security
user@host#commit
*** Ethernet switching must be disabled on every interface that has it configured, not only ge-0/0/0.0, which is just an example.
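A quick way to spot any remaining switching configuration is to search the candidate config in set format; something like the following should work (ge-0/0/1.0 below is just an example, delete the family from every interface that is listed):
user@host# show | display set | match ethernet-switching
user@host# delete interfaces ge-0/0/1.0 family ethernet-switching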
These changes aren't sufficient on their own; you must also delete the control link and management ports. For example,
in an SRX210 cluster:
To remove the management interface:
#delete interfaces fe-0/0/6
To remove the control link interface:
#delete interfaces fe-0/0/7
In an SRX650 cluster:
Management (fxp0):
#delete interfaces ge-0/0/0
Control link:
#delete interfaces ge-0/0/1
If you don't delete these interfaces, you will receive the following type of warning during boot or commit:
Interface control process: [edit interfaces]
Interface control process: 'fe-0/0/6'
Interface control process: HA management port cannot be configured
mgd: error: configuration check-out failed
Warning: Commit failed, activating partial configuration.
Warning: Edit the router configuration to fix these errors.
2) Now enable clustering by issuing the following commands:
on node 0
host>set chassis cluster cluster-id 1 node 0 reboot
on node 1
host>set chassis cluster cluster-id 1 node 1 reboot
Both nodes will reboot. If there is a configuration error, the cluster may not come up.
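If you need to back out of cluster mode for troubleshooting, clustering can be disabled again with a similar operational command (the node reboots as a standalone device):
host>set chassis cluster disable reboot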
3) After the systems have booted, you should see output like this:
{primary:node0}
root@host1> show chassis cluster status
Cluster ID: 1
Node Priority Status Preempt Manual failover
Redundancy group: 0 , Failover count: 1
node0 1 primary no no
node1 1 secondary no no
If this is the case, configure the management interface (fxp0) on the primary only, as the config will be pushed to the secondary automatically.
Set up the host names and management IP addresses as follows:
set groups node0 system host-name berlin
set groups node0 interfaces fxp0 unit 0 family inet address 172.16.20.1/24
set groups node1 system host-name prague
set groups node1 interfaces fxp0 unit 0 family inet address 172.16.20.2/24
set apply-groups "${node}"
fxp0 is the interface name used for management in a cluster environment, and a dedicated port on each branch device maps to it. For example, in an SRX210 cluster, the fe-0/0/6 interface of each node must be used as the management interface. For other branch devices, see TABLE1.
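Since fxp0 of the secondary node is not reachable through the data plane, you may also want to add a backup router inside the same groups so the secondary stays manageable; a minimal sketch, assuming 172.16.20.254 is your management gateway and 172.16.0.0/16 is the network you manage from:
set groups node0 system backup-router 172.16.20.254 destination 172.16.0.0/16
set groups node1 system backup-router 172.16.20.254 destination 172.16.0.0/16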
The configuration will look like this:
groups {
node0 {
system {
host-name berlin;
}
interfaces {
fxp0 {
unit 0 {
family inet {
address 172.16.20.1/24;
}
}
}
}
}
node1 {
system {
host-name prague;
}
interfaces {
fxp0 {
unit 0 {
family inet {
address 172.16.20.2/24;
}
}
}
}
}
}
apply-groups "${node}"
4) Configure the fabric links (data plane): the fabric interface is a dedicated interface on each node, and on branch SRX devices you pick any available port. It is used to sync RTOs (Real-Time Objects), e.g. sessions, and can also pass traffic.
One thing to mention: taking the SRX240 as an example, ge-5/0/4 is actually the ge-0/0/4 interface of node1. Don't think it is a mistake; look at TABLE2 to see why the numbering changes this way.
SRX240
First, make sure there is no logical unit on the fabric interface:
node0#delete interfaces ge-0/0/4.0
Now configure both fabric interfaces from node0:
node0#set interfaces fab0 fabric-options member-interfaces ge-0/0/4
node0#set interfaces fab1 fabric-options member-interfaces ge-5/0/4
node0#commit
You have to delete the logical unit first; otherwise you will get the following error:
[edit interfaces fab0 fabric-options member-interfaces]
'ge-0/0/4'
Logical unit is not allowed on fabric member
error: commit failed: (statements constraint check failed)
Once committed, the fabric link configuration should propagate to node1 automatically if the cluster is up.
SRX210 (note that node1's fabric member starts with fe-2):
node0#delete interfaces fe-0/0/4.0
node0#set interfaces fab0 fabric-options member-interfaces fe-0/0/4
node0#set interfaces fab1 fabric-options member-interfaces fe-2/0/4
node0#commit
SRX650 (choosing ge-0/0/2 on both nodes as the fabric links):
node0#delete interfaces ge-0/0/2.0
node0#set interfaces fab0 fabric-options member-interfaces ge-0/0/2
node0#set interfaces fab1 fabric-options member-interfaces ge-9/0/2
node0#commit
Here is how the configuration looks for the SRX650:
fab0 {
fabric-options {
member-interfaces {
ge-0/0/2;
}
}
}
fab1 {
fabric-options {
member-interfaces {
ge-9/0/2;
}
}
}
Check the status (this sample output is from an SRX210 cluster, where fe-0/0/5 was chosen as the fabric member):
root@host1> show chassis cluster data-plane interfaces
fab0:
Name Status
fe-0/0/5 up
fab1:
Name Status
fe-2/0/5 up
{primary:node0}
root@berlin> show interfaces terse fxp0.0
Interface Admin Link Proto Local Remote
fxp0.0 up up inet 172.16.20.1/24
{secondary:node1}
root@prague> show interfaces terse fxp0.0
Interface Admin Link Proto Local Remote
fxp0.0 up up inet 172.16.20.2/24
Cluster Interfaces
{primary:node0}
root@host1> show chassis cluster interfaces
Control link 0 name: fxp1
Control link status: Up
Fabric interfaces:
Name Child-interface Status
fab0 fe-0/0/5 up
fab0
fab1 fe-2/0/5 up
fab1
Fabric link status: Up
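Beyond interface status, you can also confirm that RTOs (e.g. sessions) are actually being synced over the fabric link; the statistics command shows heartbeat, probe and per-service RTO counters:
{primary:node0}
root@host1> show chassis cluster statistics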
[REDUNDANCY GROUPS]
Assume we have two uplinks connected to two SRX210 devices; node0 is primary and node1 is secondary.
The topology is deliberately simplistic, as its only purpose is to show how redundancy groups work.
Below is the configuration, which defines two redundancy groups. RG0 is for the
control plane, for which preempt is not available. In RG1, node0 has the higher priority and is primary.
The ge-0/0/0 interface is actively monitored with a weight of 255, which means that if it fails,
its weight is subtracted from the group threshold of 255, leaving 0, and RG1 will fail over.
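Note that monitored weights accumulate against the group threshold of 255. For instance, if you monitored two links with a weight of 128 each (the interface names below are just examples), both would have to fail before the threshold is exhausted and RG1 fails over:
set chassis cluster redundancy-group 1 interface-monitor ge-0/0/2 weight 128
set chassis cluster redundancy-group 1 interface-monitor ge-0/0/3 weight 128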
Redundancy Group Config
The reth-count statement defines how many reth interfaces the cluster has.
{primary:node0}[edit]
root@host1# show chassis cluster
reth-count 1;
redundancy-group 1 {
node 0 priority 100;
node 1 priority 99;
preempt;
interface-monitor {
ge-0/0/0 weight 255;
}
}
redundancy-group 0 {
node 0 priority 100;
node 1 priority 99;
}
Redundant Ethernet Config
According to this config, the ge-0/0/1 and ge-2/0/1 (actually ge-0/0/1 of node1) interfaces
form the reth0 interface. As RG1 also actively monitors ge-0/0/0, if that link fails,
node1 will take over RG1.
{primary:node0}[edit]
root@host1# show interfaces
ge-0/0/0 {
unit 0 {
family inet {
address 212.45.64.1/24;
}
}
}
ge-0/0/1 {
gigether-options {
redundant-parent reth0;
}
}
ge-2/0/1 {
gigether-options {
redundant-parent reth0;
}
}
reth0 {
redundant-ether-options {
redundancy-group 1;
}
unit 0 {
family inet {
address 10.200.2.210/24;
}
}
}
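For reference, the same configuration in set format (derived directly from the hierarchy above):
set chassis cluster reth-count 1
set interfaces ge-0/0/1 gigether-options redundant-parent reth0
set interfaces ge-2/0/1 gigether-options redundant-parent reth0
set interfaces reth0 redundant-ether-options redundancy-group 1
set interfaces reth0 unit 0 family inet address 10.200.2.210/24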
Cluster Status and Failover
Here we can see that node0 is primary for RG1 and preempt is enabled:
{secondary:node1}
root@host2> show chassis cluster status redundancy-group 1
Cluster ID: 1
Node Priority Status Preempt Manual failover
Redundancy group: 1 , Failover count: 0
node0 100 primary yes no
node1 99 secondary yes no
Once ge-0/0/0 fails, the following output is seen:
{secondary:node1}
root@host2> show chassis cluster status redundancy-group 1
Cluster ID: 1
Node Priority Status Preempt Manual failover
Redundancy group: 1 , Failover count: 1
node0 0 secondary yes no
node1 99 primary yes no
As can be seen, the priority of node0 is set to zero once the monitored interface fails. Because preempt is on,
when the ge-0/0/0 link comes back online, RG1 will fail back to node0 and the following output will
be printed (note that the failover count has been incremented, and node1 passes through the secondary-hold state during the transition):
root@host2>show chassis cluster status redundancy-group 1
Cluster ID: 1
Node Priority Status Preempt Manual failover
Redundancy group: 1 , Failover count: 2
node0 100 secondary yes no
node1 99 secondary-hold yes no
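The secondary-hold state seen above is governed by a hold-down timer that dampens back-to-back failovers. It can be tuned per redundancy group if needed (300 seconds below is just an example value):
set chassis cluster redundancy-group 1 hold-down-interval 300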
Hi,
Same experience, I followed the documentation and was not able to set up the cluster.
Thanks for your tips, very useful.
For the SRX650, we can add the following before setting up the cluster:
delete security zones security-zone trust interfaces ge-0/0/1.0
delete security zones security-zone untrust interfaces ge-0/0/0.0
delete interfaces ge-0/0/0
delete interfaces ge-0/0/1
Cheers,
R.
Thank you – just saved me a lot of grief. I'd been trying to set this up for a while and couldn't get node 0 to move from hold to primary. I followed your instructions (then also had to delete unit 0 from each interface) and finally got it working.
I’d also been making the schoolboy error of not using a crossover cable – doh!
Thanks again.
A
You’re welcome. It is good to see that my posts are helpful to someone else!
Hi,
I tried to build the cluster on SRX100s twice, but still no success. Do you have any experience with the SRX100? I only have two SRX100s for labbing.
Hi,
I haven't actually tried it on the SRX100, but I don't think it is much different from the SRX210.
Have you cabled the devices according to tables 1 & 2 at the end of the post?
You must at least be receiving an error when you restart the nodes. If you can share it,
that may improve this post as well, so that we can find out whether there is any mistake.
Thanks so much for this guide; being new to Juniper, I never would have figured this out on my own. What I don't understand is why Juniper doesn't have accurate documentation on how to do this. It's completely ridiculous.
Hi,
Juniper has a wide set of documentation and application notes for its products as well. Because blog posts are more tutorial-like, they are probably easier to follow. You can also take a look at the Junos Security book (http://www.juniper.net/us/en/training/jnbooks/junos_security.html), which is a very good source for those who would like to learn the SRX.
Genco.
Thanks, this guide helped me a lot, unlike Juniper’s documentation. Thanks for putting it in a readable format!
I just wanted to add that I just clustered two SRX3400s, and the ports on the standby unit were renumbered from 0/0/X to 8/0/X. The tables from Juniper don't list every SRX model.
Thanks for this write-up! Let me start off by saying Palo Alto > Juniper.
Secondly….their documentation stinks on their website!! I wish that they would say that you need to use crossover cables. I even called support and they said I didn’t need crossover cable. In fact, the support person I troubleshot with ALL DAY insisted that I didn’t. Then finally after about 6 hours of frustration, I plugged crossover cables into ports 0/0/1—->5/0/1 and 0/0/2—->5/0/2 and rebooted the secondary device and BANG VOI-frickin-LA…..everything is right and working.
If you ever get the chance to use Palo Alto firewalls….DO IT!!! And toss this Juniper junk in the dumpster!!!!
Hi,
I don't know why you couldn't get it working in the first place, but I must tell you that I have never used a crossover cable in any of my setups, so what JTAC told you is right, I suppose.
Genco.
Sorry for the frustrated rant. But yes, after replacing the straight-through cables with crossovers it worked. I even retested with the exact same config and straight-through cables, and it went into "Primary/Disabled".
This is something I can't explain, indeed, as I have tested HA on different branch SRX devices (210, 240, 650)
and as far as I remember I never needed a crossover cable. Besides, on the recommended JUNOS version the cluster comes up properly, as opposed to what is said at the beginning of my post.
Genco.
Hi Genco, excellent write-up. I also struggled a lot while clustering SRX650 devices. It is a very handy doc. Keep it up, and I hope this will be helpful for others.
Thanks for the article – very useful insights into the world of SRX clusters when troubleshooting one last night! If I have a redundant uplink to the WAN as well as a downlink to my LAN should these both be different reth interfaces but both bound to RG1?
Matt,
As you said, you can assign the WAN and LAN to separate reth0 and reth1 interfaces and use the same RG1 for both.
Then, by monitoring the underlying interfaces (e.g. ge-0/0/0, ge-0/0/1, ...) under the interface-monitor section of the redundancy-group 1 configuration,
if a link issue occurs on any of them, Junos will fail RG1 over to the other node, i.e. both the LAN and WAN sides together, which is preferable I suppose.
I have an error. Can you help me?
root@admin# commit
error: interface-range ‘int’ is not defined
error: interface-ranges expansion failed
Please! Thanks!
I don't know if this is a cluster-related config issue, but check this out: http://forums.juniper.net/t5/SRX-Services-Gateway/interface-range-command/td-p/43257 You may be trying to put different types of interfaces under one range, and/or have some unsupported config. Really difficult to comment.
Hi,
I configured it the same way. I connected the two SRX210Hs via fe-0/0/7 and fe-0/0/6. The cluster status was good and there was one primary and one secondary node, but when I configured fab0 and fab1, node1 became disabled. Is there anything wrong in the configuration, or do I have to do something else?
Hi,
Did you connect fe-0/0/7 and fe-0/0/6 to each other, or did you pick an available port on each SRX for the fabric? If you look at the chart, fe-0/0/7 is for the control link, and you should cross-connect the fe-0/0/7 ports of the two nodes. Then pick one port on each SRX for the fabric link, cross-connect them and configure it. You can also take a look at the /var/log/jsrpd log file to see the reason, and provide your "show chassis cluster information" output here so that other readers can benefit as well.
Genco.
Hey, thanks for the nice info!! :))
I have a problem; it would be great if you guys can help me.
I can't bind a ge interface, i.e. ge-0/0/2, to a reth interface; I get this error:
[edit interfaces ge-0/0/2 gigether-options]
‘redundant-parent’
Logical unit is not allowed on redundant-ethernet member
Hi,
You probably left a "unit 0" configuration on ge-0/0/2, which isn't allowed. Remove unit 0, or whatever unit is configured under the interface, and it should be OK.
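For example, assuming the leftover logical unit is unit 0:
delete interfaces ge-0/0/2 unit 0
set interfaces ge-0/0/2 gigether-options redundant-parent reth0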
bah nvm had them pre-defined so deleted em first then it worked 😀 pretty noob here :'(
Hi, sorry, I don't write English well, but I'll try. I have SRX240s in a chassis cluster. When I configured:
ge-0/0/1 {
gigether-options {
redundant-parent reth0;
}
}
ge-2/0/1 {
gigether-options {
redundant-parent reth0;
}
}
I got the following error: Logical unit is not allowed on fabric member
Please, can you help me with this step?
You probably have a logical unit ("unit 0" or another unit number) on your fabric interface. If your fabric interface is ge-0/0/4, delete the logical unit via "delete interfaces ge-0/0/4.0"; do it for both fabric member interfaces. Then it should work.
Hello,
We've configured an HA cluster with two SRX240s, and some issues have occurred. The cluster is working well, but there are some oddities. Please, is it possible for you to answer our questions?
First of all, we configured an active/passive cluster, but when we look at the cluster information we see active-active mode.
Is it possible to change the mode from active-active to active-passive?
root@SRX240A> show chassis cluster information
node0:
————————————————————————–
Redundancy mode:
Configured mode: active-active
Operational mode: active-active
Redundancy group: 0, Threshold: 255, Monitoring failures: none
Events:
Jun 2 14:44:17.534 : hold->secondary, reason: Hold timer expired
Redundancy group: 1, Threshold: 255, Monitoring failures: none
Events:
Jun 2 14:44:17.538 : hold->secondary, reason: Hold timer expired
Jun 2 14:56:17.295 : secondary->primary, reason: Remote node is in secondary hol
Jun 2 15:02:50.195 : primary->secondary-hold, reason: Monitor failed: IF
Jun 2 15:02:51.204 : secondary-hold->secondary, reason: Back to back failover interval
Jun 2 15:02:57.242 : secondary->primary, reason: Remote node is in secondary hol
node1:
————————————————————————–
Redundancy mode:
Configured mode: active-active
Operational mode: active-active
Redundancy group: 0, Threshold: 255, Monitoring failures: none
Events:
Jun 2 13:15:47.986 : hold->secondary, reason: Hold timer expired
Jun 2 14:38:48.843 : secondary->ineligible, reason: Control link failure
Jun 2 14:40:53.831 : ineligible->primary, reason: Control & Fabric links down
Redundancy group: 1, Threshold: 255, Monitoring failures: none
Events:
Jun 2 13:37:04.546 : hold->secondary, reason: Hold timer expired
Jun 2 14:38:48.845 : secondary->ineligible, reason: Control link failure
Jun 2 14:40:53.966 : ineligible->primary, reason: Control & Fabric links down
Jun 2 14:56:11.343 : primary->secondary-hold, reason: Preempt (1/100)
Jun 2 14:56:12.346 : secondary-hold->secondary, reason: Back to back failover interval
Jun 2 15:02:45.139 : secondary->primary, reason: Remote yeild (1/0)
Jun 2 15:02:51.654 : primary->secondary-hold, reason: Preempt (1/100)
Jun 2 15:02:52.658 : secondary-hold->secondary, reason: Back to back failover interval
The second issue is the statuses of the nodes. After committing the configuration, we saw that the HA LEDs on both boxes were amber. We found information that this is because the reth interfaces are down and cannot be monitored (we haven't connected them to anything yet). We then deleted the chassis cluster redundancy-group 1 interface-monitor configuration (which we had configured before) and the HA LEDs became green again, but there is a new problem: the nodes changed their statuses; node0 became secondary (instead of primary) and node1 became primary. Why did this happen, and what should we do to change the statuses back?
root@SRX240A> show chassis cluster status
Cluster ID: 1
Node Priority Status Preempt Manual failover
Redundancy group: 0 , Failover count: 0
node0 100 secondary no no
node1 1 primary no no
Redundancy group: 1 , Failover count: 3
node0 100 primary yes no
node1 1 secondary yes no
Thanks in advance!
Hi,
Please check this link http://forums.juniper.net/t5/SRX-Services-Gateway/How-to-know-that-the-SRX-is-deployed-Active-active-or-Active/td-p/33936
I hadn't asked this question myself, but it seems it is by design, though I still don't understand why it shows active/active even when RG1 is on the other node;
according to the docs and comments, it doesn't seem to be a problem in your case.
For your second question, you can fail RG groups over from one node to another via the "request chassis cluster failover" command; you can also see more details with "show chassis cluster information detail".
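For example, to move RG1 to node0 and then clear the manual failover flag afterwards:
request chassis cluster failover redundancy-group 1 node 0
request chassis cluster failover reset redundancy-group 1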
Good stuff. I wasn’t able to get the fabric links up on the SRX550 until I rebooted both nodes.
Can I build a cluster between an SRX210H and an SRX210HE? I tried it using the above steps, but I get this error: "ha management port cannot be configured".
The "HA management port cannot be configured" error must be a legitimate error, I suppose; you have probably left configuration on fe-0/0/6 or fe-0/0/7. Normally the hardware should be the same on both nodes, so I'm not sure whether you can cluster those two models.
This site is really helpful. I have gone through so many Juniper support KBs and PDFs but couldn't get answers for real scenarios like the ones here.
I have a question about an active/passive SRX650 cluster with two reths:
What will happen (to the status of the nodes, primary & secondary) in the following 5 situations?
1> If node0 goes down (power-down/failure), node1 becomes primary; but what happens when node0 comes back up or is replaced? Will node0 become primary for all redundancy groups after certain timers?
2> What happens when the control plane cable is disconnected and later restored between the two nodes?
3> What happens when the data plane cable is disconnected and later restored between the two nodes?
4> If the node0 physical interface of a reth is disconnected, the RG will fail over to the secondary; then what happens if
a] preempt is on, or we manually reboot node1 after restoring the disconnected physical interface of the reth?
b] preempt is off?
5> If "control plane recovery" is configured, then after restoration of node0 or its control link port, will node1 hand control back to node0 before rebooting itself to become secondary again, or will a temporary network failure occur while node0 becomes primary again?
Request you to please answer all of my questions..
As far as I know:
1) If you set preempt for the redundancy group, node0 will take ownership of it again.
2) If control plane communication is disrupted, the secondary node will go to the disabled state.
3) This depends on whether it is a high-end or a branch device, and also on the software release. A branch device will go to the disabled state, but on high-end devices in recent releases nothing changes.
4) If preempt is on and you restore the interface, the RG should fail back, and a node1 reboot shouldn't affect anything in this case, I suppose.
5) For this question it is better to look here: http://www.juniper.net/techpubs/en_US/junos12.1/topics/concept/chassis-cluster-control-link-failure-recovery-understanding.html according to which all RGs are supposed to move to node0 in this scenario.
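For reference, control link recovery is a single configuration statement; with it set, the disabled secondary reboots and rejoins automatically once the control link is healthy again:
set chassis cluster control-link-recovery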
Please correct me if I am wrong anywhere below;
your confirmation will assure me that I am proceeding with the right concepts.
1] Control link failure:
Scenario: both nodes are on and the data links are communicating (the data plane is active).
Result: For RG0/control plane, the node0 RE will remain active and node0 acts as a standalone/non-cluster device. All RGs of node0 will be master/active.
There is no effect on network operation (no network-down problem).
node1 will go to the disabled state and wait for a manual reboot if control-link-recovery is not configured.
With control-link-recovery configured: when heartbeats are no longer received on the control link, node1 goes to hold and eventually to the disabled state; once it starts receiving heartbeats from node0 again, node1 reboots and rejoins as secondary.
We can't set preempt for RG0, but it is not required, as node0 does not go into the secondary/disabled state in a control link failure. Of course, if you want node1 to be primary for RG0, you can do so by a manual failover command while both nodes and the control plane are working fine.
2] Node0 interface of a reth failure:
Scenario: the node0-side physical interface has failed and the node1 physical interface is active.
Result: If preempt is on for the RG of the reth interface, node0 will become primary again after its interface is restored.
If preempt is off for the RG of the reth interface, node0 will remain secondary after its interface is restored.
3] Data link failure:
Scenario: both nodes are on and the control links are communicating.
Result: If it is a branch SRX, node1 will go into the disabled state; on a high-end device, node1 will remain active and will be
displayed as secondary-hold in the show chassis cluster status/statistics output. node0 will remain primary in either case.
4] Node0 device rebooted:
Result: node1 will become primary for all RGs; node0 will first go into the secondary-hold state and then become the secondary node.
While node0 is in the secondary-hold state, if node1 (the current primary) reboots, or the control link fails, or the data links fail, then:
1> Reboot case: the node in the secondary-hold state transitions to the primary state; the other node goes dead (inactive).
2> Control link failure case: the node in the secondary-hold state transitions to the ineligible state and then to the
disabled state; the other node transitions to the primary state or acts like a standalone firewall.
3> Fabric link failure case: the node in the secondary-hold state transitions directly to the disabled state.
5] Node0 replaced with a new device because the previous node0 became faulty.
Scenario: node1 is both active and primary; we want to make the new SRX node0.
Questions:
a] Do we have to connect it to the existing secondary firewall and perform all the cluster configuration from the beginning?
b] Is there any config backup file stored on node1 (the existing primary) which can be rolled back onto node0 (the new SRX)
and node1 together?
At Juniper, I have developed an online tool that can assist with this
http://www.juniper.net/support/tools/srxha/
Hi Lembit,
I have updated the post with your link. Thanks for sharing.
I have a question. I have two SRX650s with only the 4 built-in ports, so I want to know if I need an extra PIM: since 1 port is for management, 1 is for fabric and 1 is for control, that leaves only 1 port for data traffic, so I'm guessing I will need extra ports to configure the two redundancy groups.
You may need two reths (two pairs), or two interfaces from both SRXs under a single redundancy group. Get an extra module of ports for creating the reth interfaces.
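As an illustration, with an extra port module a second reth could look like the following (the port numbers are hypothetical; remember that on the SRX650 the node1 side is renumbered by an offset of 9 FPC slots):
set chassis cluster reth-count 2
set interfaces ge-2/0/0 gigether-options redundant-parent reth1
set interfaces ge-11/0/0 gigether-options redundant-parent reth1
set interfaces reth1 redundant-ether-options redundancy-group 1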
Hi
How do I remove the below? The configuration has been removed, but I still get the following when doing a commit.
This is on an SRX210H firewall:
blasiuss@custnmsfw1# commit
[edit security nat static]
‘rule-set STATIC-WBSWAKOP-TO-TRUST’
statement does not exist
[edit security nat static rule-set STATIC-WBSWAKOP-TO-TRUST]
‘from’
statement does not exist
[edit security nat static rule-set STATIC-WBSWAKOP-TO-TRUST from]
‘zone WBSWAKOP-NMS’
statement does not exist
[edit security policies]
‘from-zone WBSWAKOP-NMS to-zone trust’
statement does not exist
[edit security policies from-zone WBSWAKOP-NMS to-zone trust]
‘policy 172_28_0_0-TO-TRUST’
statement does not exist
[edit security policies from-zone WBSWAKOP-NMS to-zone trust policy 172_28_0_0-TO-TRUST]
‘match’
statement does not exist
[edit security policies from-zone WBSWAKOP-NMS to-zone trust policy 172_28_0_0-TO-TRUST match]
‘source-address 172.28.0.0/15’
statement does not exist
I have configured two SRX240s in a cluster and so far everything seems to be working fine, except for the fact that if I reboot node 0, which has the higher priority, it does not take over after the reboot.
Can anyone help here, please?
Thanks
To clarify: node 0 does not take over from node 1 after the reboot, even though it has the highest priority.
Please check the preempt option if you want the node to take over after it comes back.
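For example, for redundancy group 1:
set chassis cluster redundancy-group 1 preempt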
Which alerts fall under the chassis category?
What configuration is needed on the switch for connectivity to the SRX reth0?
There is a Juniper KB which describes how the switch should be configured, Kenneth. You can easily find it on the juniper.net site.
For such a setup with the reth interface connected to a switch, what options are available? From what I see, to use LACP on the reth interface, LACP is needed on the ae interface of the switch as well. From your post I am not sure whether additional changes are needed in the SRX configuration. I hope you understand my question.
I can't access web management through fxp0.
I copied the IP address into IE but I am not able to reach it, even though I am in the same subnet.
In my experience with the SRX300 series, the fab interface will not come up until you also delete all interfaces:
# delete interfaces