NSM Cluster Operations & Tuning
I would like to share some of my experience with NSM High Availability service management and NSM server tuning. I have gathered a list of topics:
1) Interpretation of HA status command
2) Relocating NSM services
3) Troubleshoot NSM DB backup process and synchronization of non-db files
4) Synchronization of NSM db manually between NSM peer servers
5) NSM Maintenance
6) Tuning
7) NSM GUI Client in HA mode
In High Availability, NSM consists of two main services, GuiSvr and DevSvr. Both of these services are managed by the HA service and shouldn’t be stopped/started manually. Only one of the NSM servers (i.e. the primary) runs these services (GuiSvr and DevSvr) at any given time; they never run on both nodes together.
GuiSvr handles DB-related operations and GUI client requests.
DevSvr handles device communication. Both of these services must be running for NSM to function properly.
1) Interpretation of HA status command
The “/usr/netscreen/HaSvr/utils/haStatus” command is used to display:
- Local and Peer NSM server addresses
- On which server NSM services are running and in standby
- Network status of each NSM server
- DB replication status. If DB replication isn’t working properly, db-repl will display “dirty” instead of “in-sync”
- High availability process statuses which must be ON during normal operation.
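Since the db-repl field reports a plain “dirty”/“in-sync” string, it lends itself to a simple scripted check. Below is a minimal sketch of a hypothetical watchdog, assuming the output contains the literal word “dirty” as described above:
#!/bin/sh
# Hypothetical check: warn if haStatus reports dirty DB replication.
# Assumes the db-repl field prints "dirty" when replication is broken.
if /usr/netscreen/HaSvr/utils/haStatus | grep -q dirty; then
    echo "WARNING: NSM DB replication is dirty on $(hostname)"
fi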
2) Relocating services
If you want to relocate NSM services from the primary server (192.168.103.21) to the peer standby server (192.168.103.22), simply stop the HA service on the primary:
# /etc/init.d/haSvr stop
You will see that both GuiSvr and DevSvr services have also stopped on the primary automatically.
If you log in to the peer NSM server and run the haStatus command once again, you will see that NSM services are now running on the secondary instead of 192.168.103.21.
The “timed-out” status of the previous primary is normal, as we stopped its HA service. Check the GuiSvr and DevSvr processes on the new primary to make sure they are running, as shown below.
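For example, a quick process check (process names can vary slightly between NSM versions, so treat the grep pattern as an assumption):
newprimary# ps -ef | grep -iE 'guiSvr|devSvr' | grep -v grep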
Once you see both services running on the new primary node, you can start the HA service we stopped earlier on 192.168.103.21 with “/etc/init.d/haSvr start”.
This will be the new standby node. After HA starts, in addition to the HA processes, only the GuiSvrManager process should be running on the standby node; that is the normal state of a properly functioning standby.
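A similar check on the standby should therefore return only GuiSvrManager:
standby# ps -ef | grep -i guiSvrManager | grep -v grep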
When we run the haStatus command once again, we will see that the server 192.168.103.21, which was previously the primary, is now the standby.
3) Troubleshoot NSM DB backup process and synchronization of non-db files
Local Backup
The NSM DB and server files are backed up under the /var/netscreen/dbbackup folder.
NSM runs the backup command on the primary server by default at 02:00 AM and creates a backup folder named after the day of the week: if the command runs on Monday, the folder name will be “backup1”; on Tuesday, “backup2”; and so on.
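In other words, the folder suffix is simply the ISO day-of-week number, which you can preview with the standard date command:
# echo "backup$(date +%u)"    # %u prints 1 for Monday through 7 for Sunday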
If you want to run the backup manually on the primary, you can do so as follows:
[root@nsm2 utils]# su - nsm
-sh-3.00$ cd /usr/netscreen/HaSvr/utils/
-sh-3.00$ ./replicateDb backup
Got arguments: backup. This might take a while to process …
Ha/Backup: SUCCESS
-sh-3.00$ ls /var/netscreen/dbbackup/
backup2 backup3 backup7 completeToken excludeRemote.rsync exclude.rsync startToken
As this command was run on a Sunday, the new backup folder is backup7.
If you would like to monitor the backup process, you can tail the backup.log file:
nsm2# tail -f /usr/netscreen/HaSvr/var/errorLog/backup.log
In the log file you will see that:
a) GuiSvr non-db files are copied via the command:
/usr/bin/rsync --timeout=3600 --delete -avz --exclude-from=/var/netscreen/dbbackup/exclude.rsync /var/netscreen/GuiSvr /var/netscreen/dbbackup/backup7
b) GuiSvr database is copied via the command:
/usr/netscreen/GuiSvr/utils/dbxml-2.3.10/bin/db_hotbackup -D -h /usr/netscreen/GuiSvr/var/xdb/data -b /var/netscreen/dbbackup/backup7/GuiSvr/xdb/data
c) DevSvr non-db files are copied via the command:
/usr/bin/rsync -avz /var/netscreen/dbbackup/completeToken/completeToken /var/netscreen/dbbackup/backup7/
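Judging from step (c) above, a completeToken marker ends up in the backup folder at the end of the run, so you can sanity-check that a given backup finished. A sketch (my assumption being that completeToken is only written on successful completion):
# test -f /var/netscreen/dbbackup/backup7/completeToken && echo "backup7 looks complete"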
Sync of GuiSvr and DevSvr non-db files to the secondary node
In addition to the local backup on the primary, the HA service also copies GuiSvr and DevSvr files to the secondary node, except for the folders specified in the /var/netscreen/dbbackup/exclude.rsync file. For example, if you make a change to the /usr/netscreen/GuiSvr/var/guiSvr.cfg file, it is copied to the secondary server automatically by HA in the next rsync run.
If you want to monitor this backup to the secondary node, you can watch the file /usr/netscreen/HaSvr/var/errorLog/ha.log.
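For example, with the same tail approach used for backup.log:
# tail -f /usr/netscreen/HaSvr/var/errorLog/ha.log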
4) Synchronization of NSM db manually between NSM peer servers
If you see that the db-repl status is dirty and the secondary has been unable to sync with the primary for some time, manual intervention may be needed. To restart the sync manually, simply run the following steps on the secondary:
1) Stop haSvr on the secondary NSM server
#/etc/init.d/haSvr stop
2) Delete everything under the data folder except the DB_CONFIG file:
#mv /var/netscreen/GuiSvr/xdb/data/DB_CONFIG /tmp/
#rm -f /var/netscreen/GuiSvr/xdb/data/*
#mv /tmp/DB_CONFIG /var/netscreen/GuiSvr/xdb/data/
Delete everything under the following directories:
#rm -f /var/netscreen/GuiSvr/xdb/init/*
#rm -f /var/netscreen/GuiSvr/xdb/log/*
3) Start haSvr on the secondary again
#/etc/init.d/haSvr start
4) After a while (depending on the link capacity between the two NSM servers and the size of the DB) you will see the db-repl status turn to “in-sync”
If you want to monitor the sync process, you can tail the file “/var/netscreen/GuiSvr/errorLog/guiDaemon.0” to see the progress on the secondary. Events should appear in the following order:
[Notice] [3062081216-dbHa.c:65] get BDB event DB_EVENT_REP_CLIENT
[Notice] [2636499856-dbHa.c:68] get BDB event DB_EVENT_REP_NEWMASTER
[Notice] [2636499856-dbHa.c:71] get BDB event DB_EVENT_REP_STARTUPDONE
[Notice] [2636499856-haStatus.c:367] haDbClearDirtyFlag
The secondary NSM server takes the client role; when the “DB_EVENT_REP_STARTUPDONE” event appears, synchronization is complete, which is followed by clearing of the dirty flag.
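To watch just these replication events, you can filter the tail output (the grep pattern below simply matches the messages listed above):
# tail -f /var/netscreen/GuiSvr/errorLog/guiDaemon.0 | grep -E 'DB_EVENT_REP|haDbClearDirtyFlag'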
You can run “/usr/netscreen/HaSvr/utils/haStatus” command to confirm the replication status.
5) NSM Maintenance
Maintenance is a must for any system, and NSM needs regular maintenance to improve performance and shrink the database size. Here is a step-by-step maintenance procedure:
- Stop HA services
- Back up the current DB
- Delete audit logs and job manager history (optionally exporting audit logs to a CSV file first)
- Export the DB
- Import the DB
- Start HA services
5.1) Stop HA services (in the following order)
Standby # /etc/init.d/haSvr stop
Don’t stop the HA service on the primary right after running the stop command on the standby. Wait until the HA service has completely stopped on the standby, then stop it on the primary:
Primary # /etc/init.d/haSvr stop
5.2) Backup current DB
Primary# /usr/netscreen/GuiSvr/utils/tech-support.sh db
The command creates a backup file such as /usr/netscreen/GuiSvr/var/GuiSvrDB201212021407.tar.gz (keep it handy in case you need to involve your support engineer).
Copy the backup file displayed in the output to a remote backup server, e.g. 10.1.1.1:
Primary#scp /usr/netscreen/GuiSvr/var/GuiSvrDB201212021407.tar.gz user@10.1.1.1:/var/tmp/
5.3) Deleting audit logs and job manager history
If you have a backup of the audit logs and don’t need the job manager history, you can delete them.
If you would like to export the audit logs to a CSV file, do so as below before deleting:
#/usr/netscreen/GuiSvr/utils/xdbAuditLogConverter.sh /var/netscreen/GuiSvr/xdb csv /tmp/auditlogs-nsm-20121203.csv
Then continue with the removals:
Audit logs removal
rm -f /var/netscreen/GuiSvr/xdb/init/auditlog.init
rm -f /var/netscreen/GuiSvr/xdb/init/auditlogDetails.init
rm -f /var/netscreen/GuiSvr/xdb/data/auditlog
rm -f /var/netscreen/GuiSvr/xdb/data/auditlogDetails
Job Manager history removal
rm -f /var/netscreen/GuiSvr/xdb/init/directive.init
rm -f /var/netscreen/GuiSvr/xdb/data/directive
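If you prefer a more cautious variant, you can move these files aside instead of deleting them outright, and remove them once NSM is confirmed healthy. A sketch (/var/tmp/nsm-trim is an arbitrary scratch directory of my choosing):
# mkdir -p /var/tmp/nsm-trim
# mv /var/netscreen/GuiSvr/xdb/init/auditlog.init /var/netscreen/GuiSvr/xdb/init/auditlogDetails.init /var/tmp/nsm-trim/
# mv /var/netscreen/GuiSvr/xdb/data/auditlog /var/netscreen/GuiSvr/xdb/data/auditlogDetails /var/tmp/nsm-trim/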
5.4) Exporting DB
#/usr/netscreen/GuiSvr/utils/xdbExporter.sh /usr/netscreen/GuiSvr/var/xdb /var/tmp/dbexport.xdif
5.5) Importing DB
#/usr/netscreen/GuiSvr/utils/xdifImporter.sh /var/tmp/dbexport.xdif /usr/netscreen/GuiSvr/var/xdb/init
Note: During the import you will see output similar to “Backup xdb as xdb25961.tar.gz”. This is an extra safety measure that takes a backup just before the import; the tar.gz file is created under the /var/netscreen/GuiSvr/xdb folder. Once you are sure everything works fine, you can delete this file to save space.
5.6) Start HA Services
Primary#/etc/init.d/haSvr start
Then check that the GuiSvr and DevSvr services are running, as shown below.
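A quick way to check (assuming the standard guiSvr/devSvr init scripts are installed; a ps grep as in section 2 works just as well):
Primary# /etc/init.d/guiSvr status
Primary# /etc/init.d/devSvr status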
Once the services on the primary are up, start the HA service on the secondary:
Secondary # /etc/init.d/haSvr start
Because the maintenance modified the DB on the primary, it will take some time for the changes to be synced to the secondary. If you run the haStatus command right after the maintenance, you will therefore see a dirty db-repl status at first. After a while it will clear and turn to in-sync.
6) Tuning
Below you can find several configuration changes that may be necessary in some NSM installations.
1) Backup timeouts
/usr/netscreen/HaSvr/var/haSvr.cfg
highAvail.rsyncCommandBackupTimeout 3600
highAvail.rsyncCommandReplicationTimeout 7200
Both of the above settings are 1800 seconds in a default configuration, which may not be sufficient in some setups. The first timeout is for the local backup and the second for the remote rsync. Suitable values depend on the size of the DBs and the capacity of the link between the NSM servers. There is no harm in increasing these timeouts.
2) GuiSvr & DevSvr heap sizes (recommended values)
/var/netscreen/GuiSvr/guiSvr.cfg
guiSvrDirectiveHandler.max.heap 1536000000
/var/netscreen/DevSvr/devSvr.cfg
devSvrDirectiveHandler.max.heap 1536000000
3) XDB configuration file
/var/netscreen/GuiSvr/var/xdb/data/DB_CONFIG
set_data_dir .
set_lg_dir ../log
set_lg_regionmax 600000
set_lk_max_lockers 200000
set_lk_max_locks 200000
set_lk_max_objects 200000
set_cachesize 0 1024000000 4
(set_cachesize takes gigabytes, bytes, and the number of cache regions; the line above requests a roughly 1 GB Berkeley DB cache, 0 GB plus 1,024,000,000 bytes, split across 4 regions.)
4) Shared memory
/etc/sysctl.conf
kernel.shmmax = 1073741824
Once this value is changed, the new setting must be loaded with the “sysctl -p” command.
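To apply the change and verify the running value afterwards:
# sysctl -p
# cat /proc/sys/kernel/shmmax
1073741824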
5) NSM client configuration
C:\Program Files\Network and Security Manager\NSM.lax
lax.nl.java.option.java.heap.size.max=1280m
7) NSM GUI Client in HA Mode
One important point about the NSM GUI client in an HA environment: the server address you enter in the login screen may no longer be the active node if a failover has occurred. For example, assume 192.168.103.21 was our primary server, but a failover occurred and the NSM service is now on 192.168.103.22; unaware of this, we still try to connect to the previous primary. The NSM client will detect that 192.168.103.21 is no longer running the NSM service and will try the other server, 192.168.103.22. This is handled transparently by the client, and in the end you will be connected to the active node, which you can see at the bottom left corner of the NSM GUI screen after login.
Now we have reached the end of the post. I hope you have liked it! :)
Nice information. Having worked with NSM for many years, I find your blog a fine review of all the commands! Thank you very much.
You’re welcome. Glad to see that you liked it!
It is the clearest post about NSM HA I have ever read. My two NSM servers are both in the dirty state.
I tried the manual sync, but it did not work for me. When I check the NSM database, there is no shadow_server record. Does anyone know what to do?
Many thanks,
Hi there, my NSM days are a bit in the past now; sync issues can be caused by various things, actually. If you find the reason, please update your comment.
Thanks