This document describes the symptoms of Postfix SMTP server overload and how to avoid it under normal conditions. When the overload is caused by botnets or other malware, the document suggests configuration settings that help to minimize the impact on legitimate mail. Finally, the document describes stress-adaptive behavior, introduced with Postfix 2.5, and how to use it to switch configuration settings automatically under overload.
Under normal conditions, Postfix responds immediately when a remote SMTP client connects. The time needed to deliver mail should be noticeable only with very large messages. Performance degrades dramatically when the number of remote SMTP clients exceeds the number of Postfix SMTP server processes. When a client connects while all server processes are busy, the client must wait until a server process becomes available.
Overload may be caused by a surge of legitimate mail (example: a DNS registrar opens a new zone for registrations), by mistake (mail explosion caused by a forwarding loop) or by illegitimate mail (worm outbreak, botnet, or other malware activity). Symptoms of Postfix SMTP mail server overload are:
Remote SMTP clients experience a long delay before Postfix sends the "220 hostname.example.com ESMTP Postfix" greeting. If this affects end-user mail clients, enable the "submission" service entry in master.cf (present since Postfix 2.1), and tell users to connect to this instead of the public SMTP service.
The Postfix SMTP server logs an increased number of "lost connection after CONNECT" events. This happens because remote SMTP clients disconnect before Postfix answers the connection.
Postfix 2.3 and later logs a warning that all server ports are busy:
    Oct 3 20:39:27 spike postfix/master[28905]: warning: service "smtp" (25) has reached its process limit "30": new clients may experience noticeable delays
    Oct 3 20:39:27 spike postfix/master[28905]: warning: to avoid this condition, increase the process count in master.cf or reduce the service time per client
NOTE: The first two symptoms may also happen without overload, for example:
Broken DNS also causes lengthy delays before "220 hostname.example.com ..." while the Postfix SMTP server tries to look up the client's hostname.
A portscan for open SMTP ports also results in "lost connection ..." logfile messages.
Legitimate mail that doesn't get through during an episode of overload is not necessarily lost. It should still arrive once the situation returns to normal, as long as the overload condition is temporary.
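The log symptoms described above can be counted with a small script. The following is a minimal sketch, not a Postfix tool: the function name, the counter keys, and the second sample line (including its process ID and client address) are made up for illustration. Only the message texts come from this document.

```python
import re
from collections import Counter

# Message fragments taken from the symptoms described above.
LOST_CONNECT = "lost connection after CONNECT"
PROCESS_LIMIT = re.compile(
    r'service "([^"]+)" \(\d+\) has reached its process limit "(\d+)"'
)

def count_overload_symptoms(lines):
    """Count overload-related events in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        if LOST_CONNECT in line:
            counts["lost_connection_after_connect"] += 1
        if PROCESS_LIMIT.search(line):
            counts["process_limit_reached"] += 1
    return counts

# Illustrative input; the second line is a made-up example.
sample_log = [
    'Oct 3 20:39:27 spike postfix/master[28905]: warning: service "smtp" (25) '
    'has reached its process limit "30": new clients may experience noticeable delays',
    "Oct 3 20:39:28 spike postfix/smtpd[28911]: lost connection after CONNECT "
    "from unknown[192.0.2.1]",
]
symptom_counts = count_overload_symptoms(sample_log)
print(dict(symptom_counts))
```

Keep in mind the NOTE above: a high "lost connection" count alone may also point to a portscan rather than overload.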
To service more SMTP clients simultaneously, you need to increase the number of SMTP server processes. This will improve the responsiveness for remote SMTP clients, as long as the server machine has enough hardware and software resources to run the additional processes, and as long as the file system can keep up with the additional load.
You increase the number of SMTP server processes either by increasing the default_process_limit in main.cf (line 3 below), or by increasing the SMTP server's "maxproc" field in master.cf (line 10 below). Either way, you need to issue a "postfix reload" command to make the change effective.
Process limits above 1000 require Postfix version 2.4 or later, and an operating system that supports kernel-based event filters (BSD kqueue(2), Linux epoll(4), or Solaris /dev/poll).
You can reduce the Postfix memory footprint by using cdb: lookup tables instead of Berkeley DB.
     1 /etc/postfix/main.cf:
     2     # Raise the global process limit, 100 since Postfix 2.0.
     3     default_process_limit = 200
     4
     5 /etc/postfix/master.cf:
     6     # =============================================================
     7     # service type  private unpriv  chroot  wakeup  maxproc command
     8     # =============================================================
     9     # Raise the SMTP service process limit only.
    10     smtp      inet  n       -       n       -       200     smtpd
NOTE: older versions of the SMTPD_POLICY_README document contain a mistake: they configure a fixed number of policy daemon processes. When you raise the SMTP server's "maxproc" field in master.cf, SMTP server processes will report problems when connecting to policy server processes, because there aren't enough of them. Examples of errors are "connection refused" or "operation timed out". To fix, edit master.cf and specify a zero "maxproc" field in all policy server entries; see line 6 in the example below. Issue a "postfix reload" command to make the change effective.
     1 /etc/postfix/master.cf:
     2     # =============================================================
     3     # service type  private unpriv  chroot  wakeup  maxproc command
     4     # =============================================================
     5     # Disable the policy service process limit.
     6     policy    unix  -       n       n       -       0       spawn
     7         user=nobody argv=/some/where/policy-server
When increasing the number of SMTP server processes is not practical, you can improve Postfix server responsiveness by eliminating unnecessary work. When Postfix spends less time per SMTP session, the same number of SMTP server processes can service more clients in the same amount of time.
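The relation between session time and capacity can be made concrete with simple arithmetic. The sketch below assumes that each SMTP server process handles one client at a time and that all processes are busy. The function name and the numbers are hypothetical, chosen only to illustrate the point.

```python
def sessions_per_second(process_limit, seconds_per_session):
    """Upper bound on completed SMTP sessions per second.

    With every process busy, the service completes process_limit
    sessions every seconds_per_session seconds.
    """
    if process_limit <= 0 or seconds_per_session <= 0:
        raise ValueError("both arguments must be positive")
    return process_limit / seconds_per_session

# Hypothetical numbers: 100 processes, average session cut from 10s to 5s.
before = sessions_per_second(100, 10.0)
after = sessions_per_second(100, 5.0)
print(before, after)
```

Under these assumptions, halving the time per session doubles the number of clients served, without starting a single additional process.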
Eliminate non-functional RBL lookups (blocklists that are no longer in operation). These lookups can degrade performance. Postfix logs a warning when an RBL server does not respond.
Eliminate redundant RBL lookups (people often use multiple Spamhaus RBLs that include each other). To find out whether RBLs include other RBLs, look up the websites that document the RBL's policies.
Eliminate header_checks and body_checks, and keep just a few emergency patterns to block the latest worm explosion or backscatter mail. See BACKSCATTER_README for examples of the latter.
Group your header_checks and body_checks patterns to avoid unnecessary pattern matching operations.
     1 /etc/postfix/header_checks:
     2     if /^Subject:/
     3     /^Subject: virus found in mail from you/ reject
     4     /^Subject: ..../ ....
     5     endif
     6
     7     if /^Received:/
     8     /^Received: from (postfix\.org) / reject forged client name in received header: $1
     9     /^Received: from .../ ....
    10     endif
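The effect of grouping can be illustrated by counting pattern tests. The following toy model is not Postfix code; it only mimics the rule that patterns between "if" and "endif" are tried when the guard pattern matches. The second pattern in each group is a placeholder invented for this sketch.

```python
import re

# Toy model of the table above; the second pattern per group is a placeholder.
GROUPS = [
    (re.compile(r"^Subject:"), [
        re.compile(r"^Subject: virus found in mail from you"),
        re.compile(r"^Subject: another unwanted subject"),
    ]),
    (re.compile(r"^Received:"), [
        re.compile(r"^Received: from (postfix\.org) "),
        re.compile(r"^Received: from (example\.invalid) "),
    ]),
]

def attempts_flat(line):
    """Pattern tests needed when all patterns are listed without if/endif."""
    attempts = 0
    for _guard, patterns in GROUPS:
        for pattern in patterns:
            attempts += 1
            if pattern.search(line):
                return attempts
    return attempts

def attempts_grouped(line):
    """Pattern tests needed when patterns sit inside if/endif guards."""
    attempts = 0
    for guard, patterns in GROUPS:
        attempts += 1              # test the "if" guard
        if not guard.search(line):
            continue               # skip every pattern in this group
        for pattern in patterns:
            attempts += 1
            if pattern.search(line):
                return attempts
    return attempts

header = "To: user@example.com"
print(attempts_flat(header), attempts_grouped(header))
```

For a header that belongs to neither group, the grouped table needs one test per group instead of one test per pattern. The saving grows with the number of patterns inside each group.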
Under conditions of overload you can improve Postfix SMTP server responsiveness by hanging up on suspicious clients, so that other clients get a chance to talk to Postfix.
Use "421" reply codes for botnet-related RBLs or for selected non-RBL restrictions. This causes Postfix 2.3 and later to disconnect immediately without waiting for the remote SMTP client to send a QUIT command.
You can set individual reject codes for RBLs, and for individual responses from a specific RBL. We'll use zen.spamhaus.org as an example; by the time you read this document, details may have changed. Right now, their documents say that a response of 127.0.0.10 or 127.0.0.11 indicates a dynamic client IP address, which means that the machine is probably running a bot of some kind. To give a 421 response instead of the default 554 response, use something like:
     1 /etc/postfix/main.cf:
     2     smtpd_client_restrictions =
     3        permit_mynetworks
     4        reject_rbl_client zen.spamhaus.org=127.0.0.10
     5        reject_rbl_client zen.spamhaus.org=127.0.0.11
     6        reject_rbl_client zen.spamhaus.org
     7
     8     rbl_reply_maps = hash:/etc/postfix/rbl_reply_maps
     9
    10 /etc/postfix/rbl_reply_maps:
    11     zen.spamhaus.org=127.0.0.10 421 4.7.1 Service unavailable;
    12        $rbl_class [$rbl_what] blocked using
    13        $rbl_domain${rbl_reason?; $rbl_reason}
    14
    15     zen.spamhaus.org=127.0.0.11 421 4.7.1 Service unavailable;
    16        $rbl_class [$rbl_what] blocked using
    17        $rbl_domain${rbl_reason?; $rbl_reason}
Although the above shows three RBL lookups (lines 4-6), Postfix will still only do a single DNS query, so the performance difference is negligible.
The down-side of sending 421 instead of the default 554 is that it works only for zombies and other malware. If the client is running a real MTA, then it may connect again several times until the mail expires in its queue. When this is a problem, stick with the default 554 reply, and use "smtpd_hard_error_limit = 1" as described below.
With Postfix 2.5, or with earlier releases that contain the stress-adaptive behavior patch, you can turn on the above under overload by replacing line 8 with:
     8     rbl_reply_maps = ${stress?hash:/etc/postfix/rbl_reply_maps}
More information about automatic stress-adaptive behavior is at the end of this document.
The following measures will still allow most legitimate clients to connect and send mail, but may affect some legitimate clients.
Reduce smtpd_timeout (default: 300s). Experience on the postfix-users list from a variety of sysadmins shows that reducing the "normal" smtpd_timeout to 60s is unlikely to affect legitimate clients. However, it is unlikely to become the Postfix default because it's not RFC compliant. Setting smtpd_timeout to 10s (line 2 below) or even 5s under stress will still allow most legitimate clients to connect and send mail, but may delay mail from some clients. No mail should be lost, as long as this measure is used only temporarily.
Reduce smtpd_hard_error_limit (default: 20). Setting this to 1 under stress (line 3 below) helps by disconnecting clients after a single error, giving other clients a chance to connect. However, this may cause significant delays with legitimate mail, such as a mailing list that contains a few no-longer-active addresses whose owners didn't bother to unsubscribe. No mail should be lost, as long as this measure is used only temporarily.
Disable remote SMTP client hostname lookups, so that all SMTP client hostnames become "unknown" (line 5 below). This feature was introduced with Postfix 2.3. Unfortunately, this measure is more problematic than the other ones proposed so far. First, this will result in loss of mail when you use hostname-based access rules that reject mail from "unknown" SMTP clients (examples: reject_unknown_client_hostname, reject_unknown_reverse_client_hostname). Second, this may result in loss of mail when you subject "unknown" SMTP clients to additional restrictions such as reject_unverified_sender.
     1 /etc/postfix/main.cf:
     2     smtpd_timeout = 10
     3     smtpd_hard_error_limit = 1
     4     # Caution: line 5 may trigger REJECTs by hostname-based access rules
     5     smtpd_peername_lookup = no
Except with the last measure, no mail should be lost, as long as these measures are used only temporarily. The next section of this document introduces a way to automate this process.
Postfix version 2.5 introduces automatic stress-adaptive behavior. This is also available as an add-on patch for Postfix versions 2.4 and 2.3 from the mirrors listed at http://www.postfix.org/download.html.
It works as follows. When a "public" network service runs into an "all server ports are busy" condition, the master(8) daemon logs a warning, restarts the service (without interrupting existing network sessions), and runs the service with "-o stress=yes" on the command line. Normally, it runs a stress-adaptive service with "-o stress=" on the command line (i.e. with an empty parameter value). Other services never have "-o stress" parameters on the command line, including services that listen on a loopback interface only.
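The rule described above can be summarized as a small decision function. This is a toy model written for this document, not code from master(8); the function and parameter names are made up.

```python
def stress_option(accepts_remote_clients, all_ports_busy):
    """Return the "-o stress" argument described above, or None.

    Toy model only; this is not how master(8) is implemented.
    """
    if not accepts_remote_clients:
        # Loopback-only listeners and other services get no stress option.
        return None
    return "-o stress=yes" if all_ports_busy else "-o stress="

print(stress_option(True, False))   # normal operation
print(stress_option(True, True))    # all server ports are busy
print(stress_option(False, True))   # e.g. a loopback-only listener
```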
The stress pseudo-parameter value is the key to making main.cf parameter settings stress adaptive:
     1 /etc/postfix/main.cf:
     2     smtpd_timeout = ${stress?10}${stress:300}
     3     smtpd_hard_error_limit = ${stress?1}${stress:20}
Translation:
Line 2: under conditions of stress, use an smtpd_timeout value of 10 seconds instead of the default 300 seconds,
Line 3: under conditions of stress, use an smtpd_hard_error_limit of 1 instead of the default 20.
The syntax of ${name?value} and ${name:value} is explained at the beginning of the postconf(5) manual page.
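To see how the two conditional forms combine, the following sketch mimics the expansion for the simple, non-nested case: ${name?value} yields the value when the parameter is non-empty, and ${name:value} yields the value when it is empty. This is an illustration written for this document, not the Postfix parser, and it ignores every other expansion feature.

```python
import re

# Matches only the non-nested forms ${name?value} and ${name:value}.
CONDITIONAL = re.compile(r"\$\{(\w+)([?:])([^{}]*)\}")

def expand(text, params):
    """Expand ${name?value} and ${name:value} using the params dict."""
    def substitute(match):
        name, operator, value = match.groups()
        is_set = bool(params.get(name, ""))
        if operator == "?":
            return value if is_set else ""
        return "" if is_set else value
    return CONDITIONAL.sub(substitute, text)

setting = "${stress?10}${stress:300}"
print(expand(setting, {"stress": "yes"}))  # value used under stress
print(expand(setting, {"stress": ""}))     # value used normally
```

Exactly one of the two halves produces text, so the setting expands to 10 under stress and to 300 otherwise.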
NOTE: Please keep in mind that the stress-adaptive feature is a fairly desperate measure to keep some legitimate mail flowing under overload conditions. If a site is reaching the SMTP server process limit when there isn't an attack or bot flood occurring, then either the process limit needs to be raised or more hardware needs to be added.
To find out if your Postfix installation supports stress-adaptive behavior, use the "ps" command, and look for the smtpd processes. Postfix has stress-adaptive support when you see "-o stress=" or "-o stress=yes" command-line options. Remember that Postfix never enables stress-adaptive behavior on servers that listen on local addresses only.
The following example is for FreeBSD or Linux. On Solaris, HP-UX and other System-V flavors, use "ps -ef" instead of "ps ax".
    $ ps ax|grep smtpd
    83326  ??  S      0:00.28 smtpd -n smtp -t inet -u -c -o stress=
    84345  ??  Ss     0:00.11 /usr/bin/perl /usr/libexec/postfix/smtpd-policy.pl
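The same check can be scripted. The sketch below inspects lines of "ps" output such as the ones above and reports the stress option of each smtpd process. It is a hypothetical helper: the function name is made up, and it assumes that the command column looks like the example ("smtpd -n ...").

```python
import re

# Matches "-o stress=" with an optional value, as in the ps output above.
STRESS_OPTION = re.compile(r"-o stress=(\S*)")

def stress_state(ps_line):
    """Return "" (normal), "yes" (under stress), or None.

    None means the line is not an smtpd process line, or the process
    was started without any "-o stress" option.
    """
    if "smtpd -n " not in ps_line:
        return None
    match = STRESS_OPTION.search(ps_line)
    return match.group(1) if match else None

ps_output = [
    "83326 ?? S 0:00.28 smtpd -n smtp -t inet -u -c -o stress=",
    "84345 ?? Ss 0:00.11 /usr/bin/perl /usr/libexec/postfix/smtpd-policy.pl",
]
states = [stress_state(line) for line in ps_output]
print(states)
```

An empty string or "yes" for at least one smtpd process indicates stress-adaptive support; the policy server line is ignored because it is not an smtpd process.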
You can't use postconf(1) to detect stress-adaptive support. The postconf(1) command ignores the existence of the stress parameter in main.cf, because the parameter has no effect there. Command-line "-o parameter" settings always take precedence over main.cf parameter settings.
If you configure stress-adaptive behavior in main.cf when it isn't supported, nothing bad will happen. The processes will run as if the stress parameter always has an empty value.
You can manually force stress-adaptive behavior on, by adding a "-o stress=yes" command-line option in master.cf. This can be useful for testing overrides on the SMTP service. Issue "postfix reload" to make the change effective.
Note: setting the stress parameter in main.cf has no effect for services that accept remote connections.
     1 /etc/postfix/master.cf:
     2     # =============================================================
     3     # service type  private unpriv  chroot  wakeup  maxproc command
     4     # =============================================================
     5     #
     6     smtp      inet  n       -       n       -       -       smtpd
     7         -o stress=yes
     8         -o . . .
To permanently force stress-adaptive behavior off with a specific service, specify "-o stress=" on its master.cf command line. This may be desirable for the "submission" service. Issue "postfix reload" to make the change effective.
Note: setting the stress parameter in main.cf has no effect for services that accept remote connections.
     1 /etc/postfix/master.cf:
     2     # =============================================================
     3     # service type  private unpriv  chroot  wakeup  maxproc command
     4     # =============================================================
     5     #
     6     submission inet n       -       n       -       -       smtpd
     7         -o stress=
     8         -o . . .