Crashing routed on 1100 Appliances
A few weeks ago I was asked to take a look at a Check Point 1100 cluster running Gaia Embedded R77.20.31 which had one member in a Down state. This is not a production cluster, so when a reboot solved the problem we did not spend more time investigating the issue. Up to that point it had only happened once.
But yesterday it happened again…
This time I started by investigating /var/log/messages, and this is what I saw:
2017 Jan 31 14:16:35 FWL-001 user.err rdcfg: unable to connect to routed No such file or directory
2017 Jan 31 14:16:38 FWL-001 daemon.info routed[9096]: Start routed[9096] version routed-08.09.2016-15:42:48 instance -1
2017 Jan 31 14:16:38 FWL-001 daemon.notice routed[9096]: rt_instance_init: routed manager id -1 initialized itself
2017 Jan 31 14:16:38 FWL-001 daemon.notice routed[9096]: parse_instance_only: my_instance_id -1 parsing instance default
2017 Jan 31 14:16:38 FWL-001 daemon.info routed[9111]: task_cmd_init(152): command subsystem initialized.
2017 Jan 31 14:16:38 FWL-001 daemon.info routed[9111]: Start routed[9111] version routed-08.09.2016-15:42:48 instance 0
2017 Jan 31 14:16:38 FWL-001 user.err rdcfg: unable to connect to routed No such file or directory
2017 Jan 31 14:16:39 FWL-001 daemon.notice routed[9096]: Commence routing updates
2017 Jan 31 14:16:39 FWL-001 daemon.notice routed[9111]: vrrp_set_fw: Command is fw_is_running_vrrp and value is 0
2017 Jan 31 14:16:39 FWL-001 daemon.notice routed[9111]: vrrp_set_fw: Command is fw_is_running_on_cbs and value is 0
2017 Jan 31 14:16:39 FWL-001 daemon.notice routed[9111]: vrrp_set_fw: Command is fwha_cbs_which_member_is_running_gated and value is 0
2017 Jan 31 14:16:40 FWL-001 daemon.notice routed[9096]: rt_cluster_pnote_job: Registered routed_watchdog pnote.
2017 Jan 31 14:16:40 FWL-001 daemon.warn routed[9111]: task_get_proto: getprotobyname("icmp") failed, using proto 1
2017 Jan 31 14:16:40 FWL-001 daemon.notice routed[9111]: Commence routing updates
2017 Jan 31 14:17:24 FWL-001 daemon.info routed[9302]: trace_do_dump: processing dump to /tmp/routed_dump
2017 Jan 31 14:17:24 FWL-001 daemon.info routed[9302]: trace_do_dump: dump completed to /tmp/routed_dump
2017 Jan 31 14:17:24 FWL-001 daemon.info routed[9303]: trace_do_dump: processing dump to /tmp/routed_dump
2017 Jan 31 14:17:24 FWL-001 daemon.info routed[9303]: trace_do_dump: dump completed to /tmp/routed_dump
In fact, the first thing I saw was an admin user logging in, immediately followed by the lines above. So I needed to check this user: what was he doing? I was able to relate the login to our third-party backup software, which logs in with this admin account.
The next step was finding out which of the commands sent by the backup software might have caused this problem. First I rebooted the cluster again to confirm it was healthy. From that point I tried to replicate the issue by manually starting a backup job in our third-party backup software and watching the commands being sent to the 1100 Appliance. The backup software issued a ‘show route’ command in CLISH, and next thing you know… lines similar to the ones above were added to /var/log/messages.
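If you want to watch this happen yourself, a simple approach (just a sketch, assuming you have expert-mode shell access on the appliance; the log path is taken from the output above) is to follow the log in one session while triggering the suspected command from a second session:

# Session 1: expert mode, follow routed-related log lines
tail -f /var/log/messages | grep routed

# Session 2: CLISH on the same member, issue the suspected command
show route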
The cluster remained active/standby, though. So I logged in to the CLI and issued the ‘show route’ command myself. Again, /var/log/messages filled up with similar messages regarding routed, and this time the member registered a pnote for routed_watchdog and went Down. The funny thing is that when you issue the command again, the member rejoins the cluster, but after about one minute it automatically goes down again.
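The member state and the pnote can be checked from expert mode. This is a sketch assuming the standard ClusterXL commands are available on Gaia Embedded as well:

cphaprob state      # shows the state of each cluster member (Active / Standby / Down)
cphaprob -l list    # lists registered pnotes; routed_watchdog appears here in a problem state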
To make sure this wasn’t a bug in the version of Gaia Embedded we were running, we upgraded the cluster to the latest available version, R77.20.40. Unfortunately this did not solve the problem: we were able to replicate the issue on R77.20.40 as well.
For now we have created an SR at TAC and described every step we took to make routed crash. The cpinfo and routed_dump files were uploaded to the SR. Within two hours we heard that TAC was able to replicate the issue in their own lab and that the details had been forwarded to R&D.
To be continued…