Palo Alto Firewall and ESX Session Time-Outs (Management)

Did you ever wonder how PAN firewalls calculate session timeouts? I did, and it took me a while to find out. As you can see from my last post, this wasn't the first time I had problems with session timeouts, and I'm pretty sure a lot of people out there have similar problems, so I want to share our results.

Scenario


There is an ESX server located in the Inside zone of our firewall (let's call it ESX-management). It opens SSL management connections to remote ESX hosts located in the Outside zone (ESX-remote). The session timeout was set to 4 hours (14400 seconds). Unfortunately, these sessions kept running into timeouts because the PAN firewall dropped them (we verified this in the Monitor tab, where we could watch the timeout counter run down from 14400 to 0). At the same time, we could see a constant flow of heartbeat packets between the hosts: the ESX-management host kept sending heartbeat packets every 5-10 minutes on average.

Result and Issue

As a result, the ESX-remote host ran out of sockets (ESX allows a maximum of 321 management connections) and could no longer be managed. (HINT: restart the ESX management agent with ‘/etc/init.d/mgmt-vmware restart’ to reset all connections.)

We identified the core problem easily by running the Linux command ‘netstat -an’ on the ESX-remote host: the output showed hundreds of established connections. Running the same command on the ESX-management server, however, showed only 3-4 active connections to the remote host. We then checked these sessions in the PAN traffic monitor and found that they were marked ‘ended’.
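To compare the connection counts on both hosts without scrolling through hundreds of lines, the netstat output can be tallied per remote address. This is a minimal sketch, assuming the usual Linux ‘netstat -an’ column layout (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State). The sample output, the IP addresses (documentation ranges) and the function name established_by_peer are hypothetical, not taken from our hosts.

```python
from collections import Counter

# Hypothetical excerpt of `netstat -an` output on the ESX-remote host
# (addresses are from the documentation ranges, not our real hosts).
SAMPLE = """\
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:443             0.0.0.0:*               LISTEN
tcp        0      0 192.0.2.20:443          198.51.100.10:51234     ESTABLISHED
tcp        0      0 192.0.2.20:443          198.51.100.10:51235     ESTABLISHED
"""


def established_by_peer(netstat_output: str) -> Counter:
    """Count ESTABLISHED TCP connections per remote (foreign) IP address."""
    counts: Counter = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Data lines look like: proto recv-q send-q local foreign state
        if len(fields) >= 6 and fields[0].startswith("tcp") and fields[5] == "ESTABLISHED":
            peer_ip = fields[4].rsplit(":", 1)[0]  # strip the port number
            counts[peer_ip] += 1
    return counts


print(established_by_peer(SAMPLE))  # Counter({'198.51.100.10': 2})
```

On a real host you would pass in the captured output of ‘netstat -an’ instead of SAMPLE; doing this on both ESX hosts makes the mismatch (hundreds of connections on one side, a handful on the other) obvious at a glance.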

Conclusion: the ESX-management host considered the sessions ended, while the ESX-remote host kept the connections in the established state.

Why does the PAN firewall drop sessions on timeout even though they carry a constant flow of (heartbeat) packets?

The problem is related to the fact that PAN firewalls offload sessions, with the data processed by the data plane (see additional info in the PAN community: https://live.paloaltonetworks.com/docs/DOC-3950 or Wikipedia: http://en.wikipedia.org/wiki/SSL_acceleration). For an offloaded session, the PAN firewall needs a flow of 16 packets in one direction of the data flow (unidirectional) to refresh the timeout timer. These values apply to TCP sessions; as far as my information goes, UDP sessions need half as many packets (8 packets in one direction).
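To make this counting rule concrete, here is a small simulation. It is a minimal sketch of the behavior as described above, not Palo Alto's actual implementation: it assumes the timer is reset only on every Nth packet in one direction (16 for offloaded TCP, 8 for UDP according to my information, and 1 to model a session without offloading) and that heartbeat packets arrive at a perfectly regular interval. The function name session_expiry is hypothetical.

```python
from typing import Optional


def session_expiry(timeout_s: float, heartbeat_s: float,
                   packets_per_refresh: int, horizon_s: float) -> Optional[float]:
    """Model a session timer that is refreshed only on every
    `packets_per_refresh`-th packet in one direction.

    Assumption (model only): one heartbeat packet arrives every
    `heartbeat_s` seconds, starting at t = heartbeat_s.
    Returns the time at which the session times out, or None if it is
    still alive at `horizon_s`.
    """
    last_refresh = 0.0  # the timer starts when the session is created
    k = 1
    while k * heartbeat_s <= horizon_s:
        t = k * heartbeat_s
        # The timer ran out before this packet arrived: the session is gone.
        if t - last_refresh > timeout_s:
            return last_refresh + timeout_s
        # Only every Nth packet resets the timer (N = 1 means every packet).
        if k % packets_per_refresh == 0:
            last_refresh = t
        k += 1
    # No more packets before the horizon: check whether the timer ran out.
    if horizon_s - last_refresh > timeout_s:
        return last_refresh + timeout_s
    return None


print(session_expiry(1200, 75, 16, 10000))  # None (session survives)
print(session_expiry(1200, 76, 16, 10000))  # 1200.0 (session times out)
```

Under this model, a 1200-second timeout survives a 75-second heartbeat but not a 76-second one, because the 16th packet then arrives after the timer has already run out. With packets_per_refresh=1 (no offloading), a 600-second heartbeat is enough for the same timeout.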

Conclusion and Fix

From the behavior described above, we can draw the following conclusion:

  • The application's heartbeat interval has to be at most 1/16 of the session timeout set on the PAN firewall.

Or the other way around:

  • The session timeout configured on the PAN firewall has to be at least 16 times the application's heartbeat interval. (You can change the timeout value for each application in the Objects tab.)

For example, in our case the average heartbeat interval was 10 minutes: 16 × 10 = 160 minutes, so we would have to set the session timeout for the application ‘SSL’ to at least 9600 seconds (160 minutes). Conversely, for a predefined application timeout of 1200 seconds, we would have to configure a heartbeat interval of at most 75 seconds (1200 / 16 = 75).
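The same arithmetic as two small helper functions, handy if you have several applications to check. This is a sketch based on the 16-packet (TCP) and 8-packet (UDP) values mentioned above; the function and constant names are made up for illustration, and passing packets_per_refresh=1 models a session without offloading.

```python
# Packets per direction needed to refresh an offloaded session's timer,
# as described in this post (the UDP value is less certain).
REFRESH_PACKETS_TCP = 16
REFRESH_PACKETS_UDP = 8


def min_session_timeout(heartbeat_s: float,
                        packets_per_refresh: int = REFRESH_PACKETS_TCP) -> float:
    """Smallest firewall session timeout (seconds) that a heartbeat
    every `heartbeat_s` seconds can keep alive."""
    return heartbeat_s * packets_per_refresh


def max_heartbeat_interval(timeout_s: float,
                           packets_per_refresh: int = REFRESH_PACKETS_TCP) -> float:
    """Longest heartbeat interval (seconds) that keeps a session with
    the given firewall timeout alive."""
    return timeout_s / packets_per_refresh


print(min_session_timeout(10 * 60))  # 9600
print(max_heartbeat_interval(1200))  # 75.0
```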

As mentioned, the problem is related to the session offloading done by PAN firewalls. With this in mind, there is a second way to avoid session timeouts: turn off session offloading. This can only be done via the command line.

To turn off hardware offload temporarily, you can use the following command in operational mode:
> set session offload no
To turn it off permanently, use configuration mode and commit the change:
# set deviceconfig setting session offload no
# commit

As a result, every heartbeat packet refreshes the session timeout timer, since the packets are no longer offloaded, so there is no need to send 16 packets to refresh the timer. In this case you only have to make sure that the application's heartbeat interval is shorter than the session timeout configured on the PAN firewall. However, there is a big downside: the utilization of the data plane increases. Somewhere in the forums I read that disabling session offloading can reduce the total throughput by 15%.