The case of revoked SIC certificates
Last week one of our domains on an MDS server (running R80.20) displayed SIC errors for all VSX clusters (running R77.30). In SmartConsole the Trust State of the VSX objects was Uninitialized, yet on the VSX members themselves SIC was installed. So what happened?
When looking at the specific DMS on the MDS, the command cpca_client lscert -kind SIC -stat Valid displayed only the valid SIC certificates for the DMS and DLS. Running cpca_client lscert -kind SIC without the status filter also showed Revoked SIC certificates. Apparently the SIC certificates for the VSX clusters had reached 75% of their lifetime and a renewal was invoked, but somehow the new SIC certificates were not installed and were revoked as well. This was also the case for the Virtual Switches and Virtual Systems on all of the VSX cluster members.
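To get a quick list of which objects are affected, the lscert output can be filtered for revoked entries. The sketch below is only an illustration: the helper name list_revoked_subjects is made up, and it assumes lscert prints a "Subject = ..." line followed by a line containing "Status = <state>", so compare it with the real output on your own system first.

```shell
#!/bin/sh
# Sketch: print the subject of every certificate whose status is Revoked.
# ASSUMPTION: each certificate is printed as a "Subject = ..." line followed
# by a line containing "Status = <state>". Verify this on your own MDS.
list_revoked_subjects() {
    awk '
        # Remember the most recent subject, without the "Subject = " prefix.
        /^Subject = /      { subject = $0; sub(/^Subject = /, "", subject) }
        # A revoked status line belongs to the subject remembered above.
        /Status = Revoked/ { print subject }
    '
}

# Usage in the context of the affected domain:
#   cpca_client lscert -kind SIC | list_revoked_subjects
```

The function only reads from standard input, so it can also be run against a saved copy of the output.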
To be able to install a policy we had to fix this issue, but we did not want any downtime. Normally you would reset SIC with cpconfig, which is automatically followed by a cpstop;cpstart and leaves you with a firewall running the InitialPolicy. So this is not the way to go.
For VS0 we could fix it with sk86521, which explains how to reset SIC without downtime:
cp_conf sic init New_Activation_Key norestart
cpwd_admin stop -name CPD -path "$CPDIR/bin/cpd_admin" -command "cpd_admin stop"
cpwd_admin start -name CPD -path "$CPDIR/bin/cpd" -command "cpd"
Whenever you perform this SIC reset, make sure you continue by initializing SIC in the VSX cluster object. If you do not, and you encounter a power failure or unplanned reboot, then the VSX Gateway will load the InitialPolicy. That is expected behavior, just like when resetting the SIC with the cpconfig command.
For the Virtual Switches and Virtual Systems we had to use another procedure. The problem is that the InitialPolicy will be loaded when you reset the SIC that way. But since this is a cluster, we can fix it without any downtime. We followed some parts of sk34098 to reset the SIC certificates for all VSes other than VS0.
With vsx stat -v you can see the SIC state of all VSes other than VS0. In my case they were all Trusted, but I already knew the certificates were revoked on the DMS. To reset the SIC for a particular VS (R75.40VS and above) you need to enter:
vsenv <VS ID>
fw vsx sicreset
Whether the VS is a Virtual Switch or a Virtual System, in both cases you will end up with an InitialPolicy. The next step is to initialize SIC again in the Virtual Switch object or Virtual System object in SmartConsole. This is very simple: just open the object, do not change anything and click OK. The SIC certificate is generated and pushed to the VS.
From that point we could install the policy on the Virtual System again. Make sure you deselect the option “For gateway clusters, if installation on a cluster member fails, do not install on that cluster”. That way, when the installation fails on the member that has no SIC yet, it still continues on the member that received a new SIC. There is no additional action for the Virtual Switch.
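If you prefer the Management API over SmartConsole, the install-policy call has a parameter, install-on-all-cluster-members-or-fail, that corresponds to this checkbox. We used SmartConsole ourselves, so treat the following as an untested sketch: the function name, package name and target name are placeholders, and login and domain options (for example -d on an MDS) are left out. The function only prints the command so you can review it before running anything.

```shell
#!/bin/sh
# Sketch: print (do not run) an install-policy call that keeps installing
# even when one cluster member fails.
# ASSUMPTION: "Standard" and "VS1" are placeholder names, and session
# handling (login, -d <domain> on an MDS) is deliberately omitted.
print_install_cmd() {
    package=$1   # policy package name
    target=$2    # Virtual System (cluster) object to install on
    printf 'mgmt_cli install-policy policy-package "%s" access true targets.1 "%s" install-on-all-cluster-members-or-fail false\n' \
        "$package" "$target"
}

print_install_cmd "Standard" "VS1"
```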
But at this point: DO NOT REBOOT! If you do reboot, the member will start, but all Virtual Systems will go Down as it discovers that the other member does not have a valid SIC. The Virtual Systems will stay Down due to a Full Sync pnote. While BGP routes are synced, the connection table is not.
Once all Virtual Systems have a new SIC, you can safely fail over each VS with the clusterXL_admin down command. Then follow the same SIC reset procedure on the other cluster member. If all went well, you did not have any downtime. And when you’re finished and there is a need to reboot, now is the time.
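Before failing over, it can help to confirm that no VS on the member is still without trust after the reset. The helper below is a hypothetical sketch, not part of any SK. It assumes the data rows of vsx stat -v are pipe-separated, start with the numeric VS ID and end with the SIC state, and that the trusted state is printed as "Trust" (pass another string as the first argument if your version prints something else). Keep in mind that this only reflects the gateway's local view: as described above, it showed everything as trusted while the certificates were already revoked on the DMS.

```shell
#!/bin/sh
# Sketch: list the IDs of all VSes whose SIC column is not the trusted state.
# ASSUMPTION: rows look like "  1 | S VS1 | Standard | <date> | Trust".
# Check the column layout and the exact status string on your own version.
vs_without_trust() {
    awk -F'|' -v ok="${1:-Trust}" '
        # Only rows whose first column is a number are VS data rows.
        $1 ~ /^[ \t]*[0-9]+[ \t]*$/ {
            id = $1; sic = $NF
            gsub(/^[ \t]+|[ \t]+$/, "", id)    # trim the VS ID
            gsub(/^[ \t]+|[ \t]+$/, "", sic)   # trim the SIC state
            if (sic != ok) print id
        }
    '
}

# Usage on a cluster member:
#   vsx stat -v | vs_without_trust
```

An empty result means every listed VS reports the trusted state.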
If you want to try to find the root cause of why the SIC certificates were not renewed, you can have a look at sk30579. I tried it, but everything seemed to be OK. I consulted TAC too, but for now it seems easier to reset the SIC than to investigate it, as it is not very easy to reproduce.
Please do read the mentioned SKs and follow the safety procedures, like creating backups, before you actually follow mine.