Overcoming Sophos UTM HA Cluster Logging and Reporting Issues

A Sophos UTM HA Cluster (Active/Passive Failover) has some great advantages and is super simple to deploy. The two UTMs monitor each other and establish which one is the master and which is the slave. If the slave detects the master is no longer available, it becomes the master and starts handling the traffic.

The two UTM nodes sync certain information, but not everything, and as you will see, this can cause chaos when it comes to logging and reporting.

Let's have a look at some reporting limitations caused by deploying a Sophos UTM HA cluster, and more importantly, how to overcome them.

Only One Active Logging Node

The individual nodes of the cluster each maintain their own discrete set of logs. If node 1 was active for Monday and Tuesday but node 2 was active for Wednesday and Thursday, each node's logs show activity only for the days it was active and nothing for the days the other node was in charge.

This is very obvious when running a weekly network usage report and looking at the graph. The currently active node runs the report from information in its own local log store. If node 1 is the active node when the report runs, the report is therefore missing activity for Wednesday and Thursday.

Looking at the image below you can clearly see something's not right.

[Image: Sophos UTM Reporting Limitations]

Less obvious is when you are looking at daily reports. If the cluster switches for 15 minutes, such as during an Up2Date reboot, the rebooting node has no log data for that time. Unless you're looking for it, you may not even know it's missing.

If you are really unlucky, your report runs while the master is offline and your report returns almost no information! If you tend to schedule maintenance events for midnight (when the reports generate), this is not as unlikely as it sounds.
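If you suspect a log has holes in it, one way to check is to scan the timestamps for unusually long silences. The following is a minimal Python sketch, not a polished tool. It assumes each log line starts with a timestamp in the form 2015:06:15-23:58:41, and the function names and the five-minute threshold are simply my own choices for illustration.

```python
from datetime import datetime, timedelta

# Assumption: every log line begins with a "YYYY:MM:DD-HH:MM:SS" timestamp.
TS_FORMAT = "%Y:%m:%d-%H:%M:%S"
TS_LENGTH = 19  # number of characters in the timestamp


def parse_timestamp(line):
    """Return the leading timestamp of a log line, or None if it has none."""
    try:
        return datetime.strptime(line[:TS_LENGTH], TS_FORMAT)
    except ValueError:
        return None


def find_gaps(lines, threshold=timedelta(minutes=5)):
    """Yield (last_seen, next_seen) pairs that are further apart than threshold."""
    previous = None
    for line in lines:
        current = parse_timestamp(line)
        if current is None:
            continue  # skip lines that do not start with a timestamp
        if previous is not None and current - previous > threshold:
            yield previous, current
        previous = current
```

To use it, open a decompressed log file and loop over find_gaps(file_handle), printing each pair. On a quiet network a long silence may simply mean nobody was browsing, so treat the results as hints rather than proof of a failover.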

Problems With Missing Log Data

Producing reports with holes in the data totally skews your view of what is happening on your UTM. The reports are simply incomplete and inaccurate.

If you are trying to troubleshoot a Sophos UTM connectivity issue, web access logs can be very useful. If the nodes keep switching because of issues such as a failed or unstable HA link, your logs will jump between nodes, becoming fragmented very quickly (yes, unfortunately I am speaking from experience on this one!).

When logs are used for audit trails or legal proceedings, they need to be complete and accurate. Having time gaps in your log data is not good, and in many cases could be enough to render the logs inadmissible.

Merging Logs Together

Unfortunately, there is no easy way to merge the log data from the two UTM nodes. The only way to do this is to manually retrieve the files and splice them together.

Since direct file access to the slave node is not possible, retrieving its log files needs to be done from a shell session: SSH onto the master first. You can then use ha_utils ssh to connect to the slave node, access the logs, and push the files to a common location.

Retrieving the logs from the master node is much simpler. You can download the archive files from the web interface, FTP, or SCP directly to the node.

Alternatively, you could fail the cluster over again and download the files from the slave, but this would obviously cause another disruption.

Avoiding HA Cluster Logging Problems

To some extent, you can limit missing information by specifying a preferred master. As long as it comes back quickly and resumes the master role, the window of missing data is kept to a minimum. But in reality, nodes fail and you can't rely on one to always be alive. This is, after all, why you have a cluster.

Fortunately, Sophos UTM can be configured to send its log data to an external syslog server. When using a syslog server, the active node is always pushing out the log data as it comes in. In this scenario, a cluster failover will not affect the completeness of the log, because the former slave takes over as master and continues to push out the log data. The syslog server sees no difference in the incoming log messages.
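To make the idea concrete, here is a bare-bones Python sketch of what a syslog receiver does: listen for UDP datagrams, strip the standard priority header, and append each message to one file. It is an illustration only, not a replacement for a real syslog server. The port number 5514 and the function names are my own assumptions, and the UTM would need to be pointed at whatever address and port you actually use.

```python
import socket


def split_priority(datagram):
    """Split a syslog datagram into (facility, severity, message).

    The standard "<PRI>" header encodes facility * 8 + severity.
    Returns (None, None, text) when no valid header is present.
    """
    text = datagram.decode("utf-8", errors="replace").rstrip("\r\n")
    if text.startswith("<"):
        pri, sep, rest = text[1:].partition(">")
        if sep and pri.isdigit() and len(pri) <= 3:
            value = int(pri)
            return value // 8, value % 8, rest
    return None, None, text


def collect(out_path, host="0.0.0.0", port=5514, max_messages=None):
    """Append received syslog messages to out_path; return how many were written.

    port=5514 is a hypothetical unprivileged port chosen for this sketch.
    With max_messages=None the loop runs until the process is stopped.
    """
    received = 0
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock, \
            open(out_path, "a", encoding="utf-8") as out:
        sock.bind((host, port))
        while max_messages is None or received < max_messages:
            datagram, _sender = sock.recvfrom(65535)
            _facility, _severity, message = split_priority(datagram)
            out.write(message + "\n")
            out.flush()  # keep the file current if the process is interrupted
            received += 1
    return received
```

Whichever node is active, its messages end up in the same file in arrival order, which is the whole point. Bear in mind that UDP gives no delivery guarantee, so anything sent while the receiver is down is simply lost.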

If the syslog server is really good, it will also be able to recover missing log data should the syslog server itself become unavailable for some reason.

Fastvue Sophos Reporter to the rescue!

Fastvue Sophos Reporter meets all of the requirements. Unlike simple syslog servers, it not only collects all of the log data, but also analyses it in real time. If your cluster switches nodes, it does not affect the completeness of the log data. The logs from each node are automatically merged, regardless of how often the switches occur.

And unlike Sophos iView, Fastvue Sophos Reporter also has the ability to import historic log data, and uses it to fill any gaps the log data may have. If the syslog stream is interrupted for any reason, this is an essential feature to maintain the completeness and integrity of your log data. For more information, see our article - Never miss reporting data with Sophos UTMs remote log archive.

To find out more about why Fastvue Sophos Reporter is simply the best reporting tool available for the Sophos UTM, visit the Fastvue Sophos Reporter website. I say that as a hands-on user of the software in a large corporate environment where I administer numerous Sophos UTM clusters.