Performance tuning SSSD for large IPA-AD trust deployments

Written by Alexander Bokovoy and Jakub Hrozek

This blog post describes several sssd.conf options for tuning SSSD performance, focusing on deployments where an IPA server has a trust established with an AD server. Some of the options are useful in other scenarios as well, but diverging from the defaults requires understanding the implications of each change. This post was written with SSSD versions 1.12 and 1.13 in mind; later versions might have different tunables available or might not need these at all as we continuously improve performance.

The anatomy of a trusted identity lookup

The typical use case this post is trying to address is “a login to my IPA client as an AD user with AD credentials is way too slow”. Later in the article, we’ll be setting several options in the IPA server’s sssd.conf. That might be confusing if you’re not familiar with the architecture of IPA-AD trust lookups and the authentication data flow, so let’s illustrate it briefly.

The two basic request types the client can perform are identity lookups and authentication. For IPA users, identity lookups are normal LDAP searches against the IPA server, and authentication is performed using an internal kinit-like tool that also connects to the IPA server. Trusted AD users work differently: identity lookups from the client machines reach the IPA server through an LDAP extended operation. The IPA server’s extdom plugin queries the SSSD instance running on the server for the AD user information, that SSSD instance runs a search against the Active Directory server using the LDAP protocol, and the data is passed back to the IPA client. For password-based authentication requests, the IPA clients connect directly to the AD servers using Kerberos.

Keep in mind that because the correct set of groups must be established at authentication time, the authentication request must also perform an identity lookup, typically the initgroups operation, before the authentication itself. The following diagram illustrates the data flow:

[Diagram: ipa-passwordless-auth]

The diagram illustrates the data flow when a Windows user logs in to an IPA client using their Kerberos credentials (passwordless):

  1. The Windows user opens an SSH client (like PuTTY) and logs in to an IPA client.
  2. The SSH daemon initiates processing of the Kerberos ticket.
  3. The Kerberos tickets of Active Directory users have a blob attached to them, called MS-PAC (a privilege attribute certificate, see https://msdn.microsoft.com/en-us/library/cc237917.aspx for details of the specification). The MS-PAC (or just PAC going forward) contains the complete and precise list of group memberships at the time the user logged in to the Active Directory domain, signed by the domain controller. It is actually the mechanism Windows clients use to set the list of groups the user is a member of. If there is a PAC present, then a special libkrb5 plugin invokes the sssd_pac responder.
  4. The sssd_pac responder parses the group memberships from the PAC blob during login. This processing needs to resolve internal identifiers of groups in Active Directory (SIDs) to their names and POSIX IDs because Linux cannot yet deal with SIDs natively. For each SID identifier that is not stored in cache yet, the sssd_pac process asks the sssd_be process to translate the SID into a name and/or GID.
  5. The sssd_be process calls an LDAP extended operation towards the IPA server, asking for information about a particular SID.
  6. The extended operation is received by the IPA server’s Directory Server plugin, which in turn queries the SSSD instance running on the IPA server through the standard NSS interface, reaching the sssd_nss process on the IPA server.
  7. If the information about this object’s SID is not present in the cache on the IPA server itself either, then the sssd_be process is asked to resolve this SID and store the object that corresponds to the SID into the SSSD cache on the server.
  8. The sssd_be process on the server searches the Active Directory server’s LDAP store for the corresponding LDAP objects and stores them into the cache. When the object is stored, the flow reverses – the sssd_be process on the server tells the sssd_nss process the cache is up-to-date. The sssd_nss process returns data to the DS plugin on the server, which in turn returns data in the extdom-extop operation reply to the client.
  9. Finally, all the groups from the PAC object are processed. When the SSH daemon on the client opens the session for the user, it would call the initgroups() function to set up group memberships. When this function’s equivalent reaches the sssd_nss responder, all the data would be cached from processing the PAC blob and returned from the SSSD cache.

Please note that for simplicity, this diagram omits additional actions that might happen during login, such as HBAC access control or setting the SELinux label for the user.

If password-based authentication is used, the data flow is quite similar, except the sssd_pac responder is not involved until later in the process when the ticket is acquired on the IPA client itself. With identity lookup (for example id aduser@ad_domain), the PAC responder is not invoked at all, but all lookups are driven completely by the sssd_nss responder. But even here, the general flow between IPA client and IPA server is the same.

Also keep in mind that retrieving members of trusted groups or retrieving user groups prior to login requires RHEL-7.1 or RHEL-6.7 packages on the client side. Additionally, the IPA server must be running RHEL-7.1 or newer at the same time.

Not all IPA masters are able to resolve AD users this way. In FreeIPA < 4.2, only IPA masters where the ipa-adtrust-install command was executed can retrieve information from Active Directory domain controllers. Such IPA masters were also used for establishing the actual cross-forest trust to Active Directory, and AD domain controllers were talking back to IPA masters as part of that flow. FreeIPA 4.2 introduced the concept of ‘AD trust agents’, which allows other IPA masters to resolve AD users and groups without the need to configure them as a full domain controller.

The data flow also has implications for debugging. If it’s not possible to fetch data about a trusted user on the IPA client (i.e. getent passwd aduser@ad_domain doesn’t display the user), first make sure the same command works on the IPA server. The error messages on the client would normally show up in functions that contain s2n in the name, such as:

[ipa_s2n_exop_done] (0x0040): ldap_extended_operation result: Operations error(1), Failed to handle the request.

or:

[ipa_s2n_get_user_done] (0x0040): s2n exop request failed.
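
A quick way to check whether a client log contains such failures is to count the lines produced by the s2n functions. The sketch below is only an illustration: the helper name count_s2n_errors is made up for this post, and the log path in the usage comment assumes the usual /var/log/sssd/sssd_<domain>.log naming and a debug level high enough for these messages to be logged.

```shell
#!/bin/sh
# Hypothetical helper: count log lines emitted by functions whose name
# starts with ipa_s2n_ (the s2n exop code path on the IPA client).
count_s2n_errors() {
    # grep -c prints the number of matching lines. It exits non-zero when
    # nothing matches, so "|| true" keeps the helper usable under "set -e".
    grep -c '\[ipa_s2n_' "$1" || true
}

# Example usage on an IPA client ("domname" is a placeholder domain name):
#   getent passwd aduser@ad_domain || \
#       count_s2n_errors /var/log/sssd/sssd_domname.log
```

A non-zero count on the client is a hint to repeat the getent lookup on the IPA server before digging further on the client.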

The most data-intensive operation is usually retrieving the information about the user’s groups on the server side. If the group objects are not cached yet, that operation can amount to downloading a large amount of data, storing it in the server’s SSSD cache, transferring the data to the client and storing it again in the client’s SSSD cache. Subsequent lookups for the same LDAP objects, even from another client, would then just read the objects from the server-side SSSD cache, eliminating the server-side costs, but the very first search might be costly in a huge environment.

Available tunables

Now that we know how the flow works, we can start optimizing it. Because the heavy lifting of the lookups is done on the server side, changes to the server’s sssd.conf also have the biggest impact on lookup performance.

Server options

The following options should be added to the /etc/sssd/sssd.conf file on the IPA masters.

ignore_group_members
Normally the most data-intensive operation is downloading the groups including their members. Usually, we are interested in which groups a user is a member of (id aduser@ad_domain) as the initial step rather than in which members specific groups include (getent group adgroup@ad_domain). Setting the ignore_group_members option to True makes all groups appear as empty, thus downloading only information about the group objects themselves and not their members, providing a significant performance boost. Please note that id aduser@ad_domain would still return all the correct groups!

  • Minimal version: For users from IPA domain, this option was introduced in sssd-1.10. However, it’s only possible to use it for trusted domain users via the subdomain_inherit option (see below).
  • Recommended value: ignore_group_members=True unless your environment requires the output of getent group to also contain members
  • Pros: getgrnam/getgrgid calls are significantly faster
  • Cons: getgrnam/getgrgid calls only return the group information, not the members
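
To see the effect of this option, one can count the members listed in the output of getent group before and after the change. The following is a minimal sketch, assuming the standard name:password:GID:members layout of a group entry; the helper name count_group_members is invented for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print how many members a "getent group" style line
# lists in its 4th colon-separated field.
count_group_members() {
    members=$(printf '%s\n' "$1" | cut -d: -f4)
    if [ -z "$members" ]; then
        # An empty member field is what ignore_group_members = True yields.
        echo 0
    else
        # One member per line, then count the lines (strip wc padding).
        printf '%s\n' "$members" | tr ',' '\n' | wc -l | tr -d ' '
    fi
}

# Example usage on a host with SSSD configured:
#   count_group_members "$(getent group adgroup@ad_domain)"
#   id aduser@ad_domain    # still lists the user's groups
```
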

ldap_purge_cache_timeout
SSSD 1.12.x runs a cache cleanup operation on startup and then periodically. The cleanup removes cached entries that have not been used for a while to make sure the cache doesn’t grow too large. But since SSSD can also operate in offline mode, we must be very conservative about what is removed, otherwise we might lock a user out of access to a file. At the same time, searching the cache and examining the entries can take a lot of CPU time. Therefore, the cache cleanup operation was of limited use and was disabled by default in 1.13. You can disable it in previous versions with a config change as well.

  • Minimal SSSD version: All supported SSSD versions have this option.
  • Recommended value: ldap_purge_cache_timeout = 0
  • Pros: the cache cleanup operation was not particularly useful as a periodic task, but took a long time to execute
  • Cons: the cache can grow in size.

subdomain_inherit
The options above can be passed to the trusted AD domains’ configuration. At the moment, the only supported method is using the subdomain_inherit option in the sssd.conf’s domain section. Either of the two option names above can be listed as a value of subdomain_inherit, and they will then apply to both the main (IPA) domain and the AD subdomain. In the future, we would prefer to add support for sub-sections in sssd.conf, but we’re not there yet. The complete list of options that can be inherited by a subdomain is given in the sssd.conf manual page.

  • Minimal version: Upstream 1.12.5. However, this option has been backported to both RHEL-6.7 and RHEL-7.1 updates.
  • Pros-Cons: N/A, this is just a control option to extend influence of the options above to domains from a trusted AD forest.

Mount the cache in tmpfs

Quite a bit of the time spent processing a request goes into writing LDAP objects to the cache. Because the cache maintains full ACID properties, it does a disk sync with every internal SSSD transaction, which causes data to be written to disk. On the plus side, this makes sure the cache is always available in case of a network outage and will be usable after a machine crash, but on the other hand, writing the data takes time. It’s possible to mount the cache in a ramdisk, eliminating the disk I/O cost, by adding the following to /etc/fstab as a single line:

tmpfs /var/lib/sss/db/ tmpfs size=300M,mode=0700,rootcontext=system_u:object_r:sssd_var_lib_t:s0 0 0

Then mount the directory and restart SSSD afterwards:

# mount /var/lib/sss/db/
# systemctl restart sssd

Please tune the size parameter according to your IPA and AD directory size. As a rule of thumb, you can use 100 MB per 10,000 LDAP entries.
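
The rule of thumb can be turned into a small calculation. The sketch below is an illustration only: the helper names tmpfs_size_mb and fstab_line are invented here, and rounding up to the next 100 MB step is an assumption of this example rather than an SSSD requirement.

```shell
#!/bin/sh
# Hypothetical helper: suggest a tmpfs size in MB from an estimated number
# of LDAP entries, using 100 MB per 10000 entries and rounding up to the
# next 100 MB step (the rounding is an assumption of this sketch).
tmpfs_size_mb() {
    entries=$1
    echo $(( (entries + 9999) / 10000 * 100 ))
}

# Print the fstab line from above with the computed size filled in.
fstab_line() {
    printf 'tmpfs /var/lib/sss/db/ tmpfs size=%sM,mode=0700,rootcontext=system_u:object_r:sssd_var_lib_t:s0 0 0\n' \
        "$(tmpfs_size_mb "$1")"
}

# Example usage:
#   fstab_line 30000
```

With an estimate of 30000 entries, the helper reproduces the size=300M value used in the fstab line above.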

Making this change on the IPA server is a bit safer than making it on IPA clients, because the SSSD instance on the server will never lose connectivity to the IPA server, so the cache can always be rebuilt. But if the cache was lost after a reboot and the AD server is not reachable due to a network error or a similar condition, the node will not be able to fall back to cached data about AD users.

  • Pros: I/O operations on the cache are much faster.
  • Cons: The cache does not persist across reboots. Practically, this means the cache must be rebuilt after a machine reboot, and also that cached passwords are lost after a reboot.

All in all, we’re looking at adding the following changes to the server side’s sssd.conf:

[domain/domname]
subdomain_inherit = ignore_group_members, ldap_purge_cache_timeout
ignore_group_members = True
ldap_purge_cache_timeout = 0

and optionally an fstab change to remount the database directory in a ramdisk.

Client options

The following options should be added to /etc/sssd/sssd.conf on the IPA clients.

pam_id_timeout
On a Linux system, user group membership is set for processes at login time. Therefore, during the PAM conversation, SSSD has to prefer precision over speed and contact the server for accurate information. However, a single login can span multiple PAM requests, as PAM processing is split into several stages; for example, there might be a request for authentication and a separate request for the account check (HBAC). It’s not beneficial to contact the server separately for both requests, therefore we set a very short timeout for the PAM responder during which the requests will be answered from the in-memory cache. The default value of 5 seconds might not be enough in cases where complex group memberships are populated on the server and client side. The recommended value of this option is as long as a single un-cached login takes.

  • Recommended value: pam_id_timeout = login_time_in_seconds
  • Pros: The remote server would only be contacted once during a login session
  • Cons: If the group memberships change rapidly on the server side, SSSD might still only use the cached values

krb5_auth_timeout
If the Kerberos ticket has a PAC blob attached to it (see above) and password authentication is used, the krb5_child process handles the PAC blob to help establish the group memberships. This processing might take longer than the time the krb5_child process is allowed to run. Therefore, for environments where users are members of a large number of groups, the krb5_auth_timeout value should be increased to allow the groups to be processed. In future SSSD versions, we aim at making the processing much faster, so the default 6-second timeout would suffice. Contacting the PAC responder can also be avoided completely by disabling the krb5_validate option; however, disabling that option has security implications, as we can no longer verify that the TGT has not been spoofed in a MITM attack.

  • Recommended value: krb5_auth_timeout = login_time_in_seconds
  • Pros: The PAC responder has enough time to process user’s group memberships
  • Cons: Detecting legitimate offline situations might take too long

Mount the cache in tmpfs
Please see the server side section. On the client side, there is an additional caveat: the risk that the client would reboot and then lose connectivity to the server is higher. Please do not make this change on roaming clients where you rely on offline logins!

To sum up the client side login changes, we’re looking at these additions to the config file:

[pam]
pam_id_timeout = N

[domain/domname]
krb5_auth_timeout = N
# Disabling the validation is dangerous!!
# krb5_validate = false
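
One rough way to pick a value for N is to time an un-cached lookup. The sketch below is an approximation only: the helper name seconds_taken is invented for illustration, timing id covers the identity lookup rather than a complete login, and the extra second of headroom is an assumption of this example. If the IPA server’s own cache is already warm, the measured time will be lower than a truly cold login.

```shell
#!/bin/sh
# Hypothetical helper: run a command, discard its output, and print the
# elapsed wall-clock time in whole seconds plus one second of headroom.
seconds_taken() {
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    echo $(( end - start + 1 ))
}

# Example usage on an IPA client (as root):
#   sss_cache -E                        # invalidate cached entries first
#   seconds_taken id aduser@ad_domain   # candidate value for N
```
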

Don’t forget to restart the SSSD instance for the new settings to take effect!