NSM Cluster operations & Tuning

I would like to share some of my experience with NSM High Availability service management and tuning of the NSM server. I have gathered a list of items;

 

1)      Interpretation of HA status command

2)      Relocating NSM services

3)      Troubleshooting the NSM DB backup process and synchronization of non-db files

4)      Synchronization of NSM db manually between NSM peer servers

5)      NSM Maintenance

6)      Tuning

7)      NSM GUI Client in HA mode

In High Availability mode, NSM consists of two main services, GuiSvr and DevSvr. Both are managed by the HA service and shouldn’t be stopped or started manually. Only one of the NSM servers (the primary) runs these services at any one time; they never run on both servers at once.

   GuiSvr handles DB-related operations and GUI client requests.
   DevSvr handles device communication.

Both of these services must be running for NSM to function properly.


1) Interpretation of HA status command

 

The “/usr/netscreen/HaSvr/utils/haStatus” command is used to display;

  • Local and peer NSM server addresses
  • Which server the NSM services are running on and which is standby
  • Network status of each NSM server
  • DB replication status; if DB replication isn’t working properly, db-repl will display “dirty” instead of “in-sync”
  • High availability process statuses, which must all be ON during normal operation
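If you just want to keep an eye on the replication state rather than read the whole output each time, a small wrapper such as the one below can help (a minimal sketch; the db-repl field name is taken from the haStatus output described above):

# watch -n 60 "/usr/netscreen/HaSvr/utils/haStatus | grep db-repl"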


2) Relocating NSM services

If you want to relocate the NSM services from the primary server (192.168.103.21) to the peer standby server (192.168.103.22), simply stop the HA service on the primary.

# /etc/init.d/haSvr stop

Both GuiSvr and DevSvr services stop on the primary automatically as well.
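If you want to confirm this yourself, you can query the service init scripts directly (assuming the standard guiSvr and devSvr init scripts that a typical NSM Linux install provides alongside haSvr):

# /etc/init.d/guiSvr status
# /etc/init.d/devSvr status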


If you log in to the peer NSM server and run the haStatus command once again, you will see that the NSM services are now running on the secondary instead of 192.168.103.21. The “timed-out” status of the previous primary is normal, as we stopped its HA service. Check the GuiSvr and DevSvr processes on the new primary to make sure they are running;
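A generic process check works for this (the grep pattern below is an assumption; exact process names can differ between NSM versions):

# ps -ef | egrep 'guiSvr|devSvr' | grep -v grep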

 

Both services should now be running on the new primary node. We had stopped the HA service on the 192.168.103.21 node; now we can start it again;
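# /etc/init.d/haSvr start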


This will be the new standby node. After HA starts, only the GuiSvrManager process (in addition to the HA services themselves) should be running on the standby node; that is the normal state of a properly functioning standby.

When we run the haStatus command once again, we will see that server 192.168.103.21, which was previously the primary, is now the standby.


3) Troubleshooting the NSM DB backup process and synchronization of non-db files

 

Local Backup

NSM DB and server files are backed up under the /var/netscreen/dbbackup folder.

 

NSM runs the backup command on the primary server at 02:00 AM by default and creates a backup folder named after the day of the week: if the command is run on Monday, the folder name will be “backup1”; if on Tuesday, “backup2”; and so on.
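The schedule and other backup-related settings live in the HA configuration file (the same haSvr.cfg used in the Tuning section below); parameter names vary between versions, so a broad grep is a safe way to locate them:

# grep -i backup /usr/netscreen/HaSvr/var/haSvr.cfg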

If you want to run the backup manually on the primary, you can simply do it as follows;

[root@nsm2 utils]# su - nsm
-sh-3.00$ cd /usr/netscreen/HaSvr/utils/
-sh-3.00$ ./replicateDb backup
Got arguments: backup.  This might take a while to process …
Ha/Backup: SUCCESS
-sh-3.00$ ls /var/netscreen/dbbackup/
backup2  backup3  backup7  completeToken  excludeRemote.rsync  exclude.rsync  startToken

As this command was run on Sunday, the new backup folder is backup7.

If you would like to monitor the backup process, you can tail the backup.log file;

nsm2# tail -f /usr/netscreen/HaSvr/var/errorLog/backup.log

 

In the log file you will see that;

a)      GuiSvr non-db files are copied via the command;

/usr/bin/rsync --timeout=3600 --delete -avz --exclude-from=/var/netscreen/dbbackup/exclude.rsync /var/netscreen/GuiSvr /var/netscreen/dbbackup/backup7

b)      GuiSvr database is copied via the command;

/usr/netscreen/GuiSvr/utils/dbxml-2.3.10/bin/db_hotbackup -D -h /usr/netscreen/GuiSvr/var/xdb/data -b /var/netscreen/dbbackup/backup7/GuiSvr/xdb/data

 

c)      DevSvr non-db files are copied via the command;

/usr/bin/rsync -avz /var/netscreen/dbbackup/completeToken/completeToken /var/netscreen/dbbackup/backup7/
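A quick sanity check after a run is to compare the sizes of the daily backup folders; backup7 below matches the Sunday run above:

# du -sh /var/netscreen/dbbackup/backup*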

 

Sync of GuiSvr and DevSvr non-db files to the secondary node

In addition to the local backup on the primary, the HA service also copies GuiSvr and DevSvr files to the secondary node, except for the folders specified in the /var/netscreen/dbbackup/exclude.rsync file. For example, if you make a change in the /usr/netscreen/GuiSvr/var/guiSvr.cfg file, it is copied to the secondary server automatically in the next rsync run.

If you want to monitor this backup to the secondary node, you can tail the /usr/netscreen/HaSvr/var/errorLog/ha.log file;
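# tail -f /usr/netscreen/HaSvr/var/errorLog/ha.log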

 

4) Synchronization of NSM db manually between NSM peer servers

 

If you see that the db-repl status is dirty and the secondary has been unable to sync with the primary for some time, manual intervention may be needed. To restart the sync manually, run the following steps on the secondary;

 

1)      Stop haSvr on the secondary NSM server

#/etc/init.d/haSvr stop

2)      Delete everything under the data folder except the DB_CONFIG file

#mv /var/netscreen/GuiSvr/xdb/data/DB_CONFIG /tmp/

#rm -f /var/netscreen/GuiSvr/xdb/data/*

#mv /tmp/DB_CONFIG /var/netscreen/GuiSvr/xdb/data/

 

Delete everything under the following directories;

#rm -f  /var/netscreen/GuiSvr/xdb/init/*

#rm -f  /var/netscreen/GuiSvr/xdb/log/*

 

 

3)      Start haSvr on the secondary again

#/etc/init.d/haSvr start

 

4)      After a while (depending on the link capacity between the two NSM servers and the size of the DB), the db-repl status will turn to “in-sync”.

 

If you want to monitor the sync process, you can tail the “/var/netscreen/GuiSvr/errorLog/guiDaemon.0” file to see the progress on the secondary. Events should appear in the following order:
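# tail -f /var/netscreen/GuiSvr/errorLog/guiDaemon.0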

 

[Notice] [3062081216-dbHa.c:65] get BDB event DB_EVENT_REP_CLIENT
[Notice] [2636499856-dbHa.c:68] get BDB event DB_EVENT_REP_NEWMASTER
[Notice] [2636499856-dbHa.c:71] get BDB event DB_EVENT_REP_STARTUPDONE
[Notice] [2636499856-haStatus.c:367] haDbClearDirtyFlag

The secondary NSM server takes the client role; when the “DB_EVENT_REP_STARTUPDONE” log appears, synchronization is complete, which is followed by the clearing of the dirty flag.

You can run the “/usr/netscreen/HaSvr/utils/haStatus” command to confirm the replication status.

5) NSM Maintenance

Maintenance is a must for everything, and NSM needs regular maintenance to improve performance and shrink the database size. Here is a step-by-step maintenance procedure;

 

  • Stop HA services
  • Back up the current DB
  • Delete audit logs and job manager history (optionally exporting the audit logs to a CSV file first)
  • Export the DB
  • Import the DB
  • Start HA services

 

5.1) Stop HA services (in the following order)

Standby # /etc/init.d/haSvr stop

Don’t stop the HA service on the primary right after running the stop command on the standby. Wait until the HA service has completely stopped on the standby, then stop it on the primary.

Primary # /etc/init.d/haSvr stop

 

5.2) Backup current DB

 

Primary# /usr/netscreen/GuiSvr/utils/tech-support.sh db

Provide the /usr/netscreen/GuiSvr/var/GuiSvrDB201212021407.tar.gz to your Support Engineer

 

Copy the backup file displayed in the output to a remote backup server, e.g. 10.1.1.1

Primary#scp /usr/netscreen/GuiSvr/var/GuiSvrDB201212021407.tar.gz user@10.1.1.1:/var/tmp/

 

5.3) Deleting audit logs and job manager history

If you have a backup of the audit logs and don’t need the job manager history, you can delete them.
If you would like to export the audit logs to a CSV file, do it as below before deleting;

#/usr/netscreen/GuiSvr/utils/xdbAuditLogConverter.sh /var/netscreen/GuiSvr/xdb csv /tmp/auditlogs-nsm-20121203.csv

Then continue with removals;

Audit logs removal

rm -f /var/netscreen/GuiSvr/xdb/init/auditlog.init
rm -f /var/netscreen/GuiSvr/xdb/init/auditlogDetails.init
rm -f /var/netscreen/GuiSvr/xdb/data/auditlog
rm -f /var/netscreen/GuiSvr/xdb/data/auditlogDetails

 

Job Manager history removal

rm -f /var/netscreen/GuiSvr/xdb/init/directive.init
rm -f /var/netscreen/GuiSvr/xdb/data/directive

 

5.4) Exporting DB

#/usr/netscreen/GuiSvr/utils/xdbExporter.sh /usr/netscreen/GuiSvr/var/xdb /var/tmp/dbexport.xdif

 

5.5) Importing DB

#/usr/netscreen/GuiSvr/utils/xdifImporter.sh /var/tmp/dbexport.xdif /usr/netscreen/GuiSvr/var/xdb/init

Note: During the import you will see output similar to “Backup xdb as xdb25961.tar.gz”. This is an extra safety measure that takes a backup just before the import; you can see the tar.gz file created under the /var/netscreen/GuiSvr/xdb folder. If you are sure that everything works fine, you can delete this file later to save space.

5.6) Start HA Services

 

Primary#/etc/init.d/haSvr start

Then check that the GuiSvr and DevSvr services are running, for example with the ps check shown in section 2.

 

Once the services on the primary are up, start the HA service on the secondary

Secondary # /etc/init.d/haSvr start

As we have modified the DB on the primary during maintenance, it will take some time for the changes to be synced to the secondary. Because of this, if you run the haStatus command right after the maintenance, you will see a dirty db-repl status at first. After a while, it will clear and turn to in-sync.


6) Tuning

 

Below you can find several configuration changes which may be necessary in some NSM installations.

1)      BACKUP TIMEOUTS

/usr/netscreen/HaSvr/var/haSvr.cfg

highAvail.rsyncCommandBackupTimeout           3600
highAvail.rsyncCommandReplicationTimeout      7200

Both settings default to 1800 seconds, which may not be sufficient in some setups. The first timeout is for the local backup and the second for the remote rsync. Suitable values depend on the size of the DBs and the capacity of the link between the NSM servers; there is no harm in increasing these timeouts.
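Before editing, you can check which values your installation currently uses (parameter names as above):

# grep rsyncCommand /usr/netscreen/HaSvr/var/haSvr.cfg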

2)      GUI & DEV SERVER HEAP SIZES (recommended values)

/var/netscreen/GuiSvr/guiSvr.cfg

guiSvrDirectiveHandler.max.heap 1536000000

/var/netscreen/DevSvr/devSvr.cfg

devSvrDirectiveHandler.max.heap 1536000000
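To see the values currently in effect on a given server, grep both configuration files (paths as above):

# grep max.heap /var/netscreen/GuiSvr/guiSvr.cfg /var/netscreen/DevSvr/devSvr.cfg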

3)      XDB CONFIGURATION FILE

 

/var/netscreen/GuiSvr/var/xdb/data/DB_CONFIG

set_data_dir .

set_lg_dir ../log

set_lg_regionmax 600000

set_lk_max_lockers 200000

set_lk_max_locks 200000

set_lk_max_objects 200000

set_cachesize 0 1024000000 4
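For reference, Berkeley DB’s set_cachesize directive takes three arguments (gigabytes, bytes and the number of cache segments), so the line above allocates a cache of roughly 1 GB split across 4 regions.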

 

4)      Shared memory

/etc/sysctl.conf

kernel.shmmax = 1073741824

 

Once this value is changed, the “sysctl -p” command must be run.
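You can then confirm the value in effect;

# sysctl kernel.shmmax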

 

5)      NSM client configuration

 

C:\Program Files\Network and Security Manager\NSM.lax

lax.nl.java.option.java.heap.size.max=1280m

 

7) NSM GUI Client in HA Mode

One important point about the NSM GUI client in an HA environment is that the server address you enter in the login screen may not be the active node if a failover has occurred. For example;

 

Assume 192.168.103.21 was our primary server, but a failover occurred and the NSM service is now on 192.168.103.22; not knowing this, we still try to connect to the previous primary. The NSM client will detect that 192.168.103.21 is no longer running the NSM service and will try the other server, 192.168.103.22. This is handled transparently by the client, and in the end you are connected to the active node, which is shown at the bottom left corner of the NSM GUI screen after login.


Now we have reached the end of the post. I hope you have liked it so far :)

