Anatomy of SSSD user lookup

This blog post describes how a user lookup request is handled in SSSD. It should help you understand what the SSSD architecture looks like and how data flows through SSSD, and as a result help you identify which part might not be functioning correctly on your system. It is aimed mostly at users and administrators. For developers, there is a separate document about SSSD internals on the SSSD wiki, written by Yassir Elley. This document re-uses some of the info from the internals one.

We’ll look at the most common operation, looking up user info on a remote server. I won’t go into server-specific details, so most of the info should be equally true for LDAP, Active Directory or FreeIPA servers. There’s also more functionality in SSSD than looking up users, such as sudo or autofs integration, but they are out of scope of this post as well.

Before going into SSSD details, let’s do a really quick intro into what happens on the system in general when you request a user from a remote server. Let’s say the admin configured SSSD and tests the configuration by requesting the admin user:

$ getent passwd admin

When information about a user is requested (with getent, id or similar), typically one of the Name Service Switch functions in glibc, such as getpwnam() or initgroups(), is called. There’s lots of information about the Name Service Switch in the libc manual, but for our purposes, it’s enough to know that libc opens and reads the config file /etc/nsswitch.conf to find out which modules should be contacted and in which order. The module that all of us have on our Linux machines is files, which reads user info from /etc/passwd and group info from /etc/group. There is also an ldap module that reads the info directly from an LDAP server and of course an sss module that talks to SSSD. So how does that work?
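To make the module ordering concrete, here is a minimal sketch that extracts the module list for one database from nsswitch.conf-style text. The function name nss_sources and the sample contents are made up for illustration, and the parsing is deliberately simplified compared to what libc really does.

```python
import re


def nss_sources(nsswitch_text, database):
    """Return the ordered module names configured for one database."""
    for raw_line in nsswitch_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        name, sep, rest = line.partition(":")
        if not sep or name.strip() != database:
            continue
        # Drop bracketed action items such as [NOTFOUND=return];
        # only the module names are of interest here.
        rest = re.sub(r"\[[^\]]*\]", " ", rest)
        return rest.split()
    return []


# Hypothetical file contents, matching the setup discussed in this post.
SAMPLE = """\
# sample nsswitch.conf
passwd: files sss
group:  files sss
hosts:  files dns
"""

print(nss_sources(SAMPLE, "passwd"))
```

With a passwd line of "files sss", libc would consult the files module first and the sss module second.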

The first thing to keep in mind is that, unlike nss_ldap or pam_ldap, SSSD is not just a module that is loaded in the context of the application, but rather a daemon that the modules communicate with. Almost no logic is implemented in the modules; all the functionality happens in the daemon. A user-visible effect during debugging is that strace is not too helpful, as it would only show whether the request made it to SSSD. For debugging the rest, the SSSD debug logs should be used.

Earlier I said that SSSD is a daemon. That’s not quite precise: SSSD is actually a set of daemons that communicate with one another. There are three kinds of SSSD processes. One is the sssd process itself. Its purpose is to read the config file after startup and spawn the other processes according to the config file. Then there are responder or front end processes that listen to queries from the applications, like the query that would come from the getent command. If the responder process needs to contact the remote server for data, it talks to the last SSSD process type, which is the data provider or back end process. This architecture allows for a pluggable setup where different back end processes talk to different remote servers, while all these remote servers can be accessed from a range of applications or subsystems by the same lookup code in the responders.

Each process is represented by a section in the sssd.conf config file. The main sssd process is represented by the [sssd] section. The front end processes are listed on the services line in the [sssd] section and each can be configured in a section named after the service. And finally, the back end processes are those configured in the [domain/NAME] sections, one per domain. Each process also writes to its own log file.
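The mapping from config sections to processes can be sketched with Python’s configparser. The sample config below is hypothetical, and the log file names are an assumption based on the usual layout under /var/log/sssd, so check them against your own system.

```python
import configparser

# Hypothetical sssd.conf contents; the values are illustrative only.
SAMPLE_CONF = """\
[sssd]
services = nss, pam
domains = ipa.example.com

[nss]
debug_level = 7

[domain/ipa.example.com]
id_provider = ipa
"""


def split_list(value):
    """Split a comma-separated option value into clean items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def describe_processes(conf_text):
    """Return (section, role, log file) for each process the config implies."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)

    rows = [("sssd", "main process", "sssd.log")]
    for service in split_list(parser.get("sssd", "services", fallback="")):
        rows.append((service, "responder", f"sssd_{service}.log"))
    for domain in split_list(parser.get("sssd", "domains", fallback="")):
        # Assumed naming: one back end log per domain.
        rows.append((f"domain/{domain}", "back end", f"sssd_{domain}.log"))
    return rows


for section, role, logfile in describe_processes(SAMPLE_CONF):
    print(f"[{section}] -> {role}, logs to {logfile}")
```

Knowing which section belongs to which process tells you where to raise debug_level and which log to read afterwards.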

Let’s continue with the getent passwd admin example. To illustrate the flow, there is a diagram that the text follows. The full arrows represent local IO operations (like opening a file), the empty arrows represent local IPC over UNIX sockets and the dotted arrow represents network IO.

[Diagram: sssd-lookup]

The user issued the getent command, which calls libc’s getpwnam (diagram step 1). Then libc loads the nss_sss module as per nsswitch.conf and passes in the request. First, the nss_sss memory-mapped cache is consulted; that’s step 2 on the diagram. If the data is present in that cache, it is returned without even contacting SSSD, which is extremely fast. Otherwise, the request is passed to SSSD’s responder process (step 3), in particular sssd_nss. The responder first looks into the SSSD on-disk cache (step 4). If the data is present in the cache and still valid, the nss responder reads the data from the cache and returns it to the application.

If the data is not present in the cache at all or if it has expired, sssd_nss queries the appropriate back end process (step 5) and waits for a reply. The back end process connects to the remote server, runs the search (step 6) and stores the resulting data in the cache (step 7). When the search request is finished, the provider process signals back to the responder process that the cache has been updated (step 8). At that point, the front end responder process checks the cache again. If there is any data in the cache at that point, it is returned to the application. This holds even when the back end failed to update the cache for some reason, because it is better to return stale data than none. Of course, if no data is found in the cache after the back end has finished, an empty result is returned. This final cache check is represented by step 9 in the diagram.
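The responder and back end steps above can be condensed into a toy model. This is not SSSD code: the class, its attributes and the TTL are invented for illustration, the memory-mapped cache from step 2 is left out, and dropping a cached user that the server no longer knows is an assumption of this sketch.

```python
import time


class ToyLookup:
    """Toy model of steps 4 to 9; the names here are hypothetical."""

    def __init__(self, backend, ttl=60.0):
        self.disk_cache = {}    # name -> (entry, expiry timestamp)
        self.backend = backend  # callable: name -> entry or None, may raise OSError
        self.ttl = ttl

    def getpwnam(self, name, now=None):
        now = time.time() if now is None else now

        # Step 4: a valid cached entry is returned right away.
        cached = self.disk_cache.get(name)
        if cached is not None and cached[1] > now:
            return cached[0]

        # Steps 5 to 8: ask the back end, which updates the cache.
        try:
            entry = self.backend(name)
        except OSError:
            pass  # back end failed or is offline; cache left untouched
        else:
            if entry is None:
                # Assumption: a user unknown to the server is dropped.
                self.disk_cache.pop(name, None)
            else:
                self.disk_cache[name] = (entry, now + self.ttl)

        # Step 9: return whatever the cache holds now, even if stale.
        cached = self.disk_cache.get(name)
        return cached[0] if cached is not None else None


def demo_backend(name):
    """Stand-in for the remote server search."""
    return {"admin": "admin:*:1000:1000"}.get(name)


lookup = ToyLookup(demo_backend)
print(lookup.getpwnam("admin"))
print(lookup.getpwnam("nosuchuser"))
```

The part worth noticing is the last step: the responder never uses the back end’s answer directly, it always re-reads the cache, which is why stale entries can still be served while the back end is offline.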

When I said the back end “runs a search” against the server, I really simplified the matter a lot. The search can involve many different steps, such as resolving the server to connect to, authenticating to the server, performing the search itself and storing the resulting data in the database. Some of the steps might even require a helper process. For instance, authenticating against a remote server using a keytab is done in a helper process called ldap_child, which writes to its own log file called /var/log/sssd/ldap_child.log.

Since most steps happen in the back end itself, the problem or misconfiguration most often lies in the back end part. But it is still very important to know the overall architecture and to be able to identify if and how the request made it to the back end at all. In the next part, we’ll apply this new information in a small case study and repair a buggy SSSD setup.

Troubleshooting a failing SSSD user lookup

With the SSSD architecture in mind, we can try a case study. Suppose we have an IPA client where no users, not even the admin, show up:

$ getent passwd admin
$ echo $?
2

The admin user was not found! Given our knowledge of the architecture, let’s first see if the system is configured to query sssd for user information at all:

$ grep passwd /etc/nsswitch.conf
passwd: files sss

It is. Then the request was passed on to the nss responder process, since the only other possibility is a successful return from the memory cache. We need to raise the debug_level in the [nss] section like this:

[nss]
debug_level = 7

and restart sssd:

# systemctl restart sssd

Then we’ll request the admin user again and inspect the NSS logs:

[sssd[nss]] [accept_fd_handler] (0x0400): Client connected!
[sssd[nss]] [sss_cmd_get_version] (0x0200): Received client version [1].
[sssd[nss]] [sss_cmd_get_version] (0x0200): Offered version [1].
[sssd[nss]] [nss_cmd_getbynam] (0x0400): Running command [17] with input [admin].
[sssd[nss]] [sss_parse_name_for_domains] (0x0200): name 'admin' matched without domain, user is admin
[sssd[nss]] [nss_cmd_getbynam] (0x0100): Requesting info for [admin] from []
[sssd[nss]] [nss_cmd_getpwnam_search] (0x0100): Requesting info for [admin@ipa.example.com]
[sssd[nss]] [sss_dp_issue_request] (0x0400): Issuing request for [0x4266f9:1:admin@ipa.example.com]
[sssd[nss]] [sss_dp_get_account_msg] (0x0400): Creating request for [ipa.example.com][4097][1][name=admin]
[sssd[nss]] [sss_dp_internal_get_send] (0x0400): Entering request [0x4266f9:1:admin@ipa.example.com]
[sssd[nss]] [sss_dp_get_reply] (0x1000): Got reply from Data Provider - DP error code: 1 errno: 11 error message: Fast reply - offline
[sssd[nss]] [nss_cmd_getby_dp_callback] (0x0040): Unable to get information from Data Provider
Error: 1, 11, Fast reply - offline
Will try to return what we have in cache

Well, apparently the request for the admin user was received and passed on to the back end process, but the back end replied that it had switched to offline mode. That means we also need to enable debugging in the domain section and continue the investigation there. We need to add debug_level to the [domain/ipa.example.com] section and restart sssd again. Then run the getent command and inspect the file called /var/log/sssd/sssd_ipa.example.com.log, starting at the time that corresponds to the NSS responder sending the request (as indicated by sss_dp_issue_request in the nss log). In the domain log we see:

[sssd[be[ipa.example.com]]] [fo_resolve_service_done] (0x0020): Failed to resolve server 'master.ipa.example.com:389': Domain name not found
[sssd[be[ipa.example.com]]] [set_server_common_status] (0x0100): Marking server 'master.ipa.example.com:389' as 'not working'
[sssd[be[ipa.example.com]]] [be_resolve_server_process] (0x0080): Couldn't resolve server (master.ipa.example.com:389), resolver returned (11)
[sssd[be[ipa.example.com]]] [be_resolve_server_process] (0x1000): Trying with the next one!
[sssd[be[ipa.example.com]]] [fo_resolve_service_send] (0x0100): Trying to resolve service 'IPA'
[sssd[be[ipa.example.com]]] [get_server_status] (0x1000): Status of server 'master.ipa.example.com:389' is 'not working'
[sssd[be[ipa.example.com]]] [get_port_status] (0x1000): Port status of port 0 for server '(no name)' is 'not working'
[sssd[be[ipa.example.com]]] [get_server_status] (0x1000): Status of server 'master.ipa.example.com:389' is 'not working'
[sssd[be[ipa.example.com]]] [fo_resolve_service_send] (0x0020): No available servers for service 'IPA'
[sssd[be[ipa.example.com]]] [be_resolve_server_done] (0x1000): Server resolution failed: 5
[sssd[be[ipa.example.com]]] [sdap_id_op_connect_done] (0x0020): Failed to connect, going offline (5 [Input/output error])
[sssd[be[ipa.example.com]]] [be_ptask_create] (0x0400): Periodic task [Check if online (periodic)] was created
[sssd[be[ipa.example.com]]] [be_ptask_schedule] (0x0400): Task [Check if online (periodic)]: scheduling task 70 seconds from now [1426087775]
[sssd[be[ipa.example.com]]] [be_run_offline_cb] (0x0080): Going offline. Running callbacks.
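At higher debug levels the logs get long, so it can help to filter for failure messages first. The sketch below assumes the line layout seen in the excerpts above, and that the hexadecimal value is the message level where lower values are more severe (0x0010 through 0x0080 being the failure levels described in the sssd.conf manual page). The regular expression and function name are my own.

```python
import re

# Matches "[process] [function] (0xNNNN): message"; anything before the
# first bracket, for example a timestamp, is ignored.
LOG_LINE = re.compile(
    r"\[(?P<process>.+?)\] \[(?P<function>\w+)\] "
    r"\((?P<level>0x[0-9a-fA-F]+)\): (?P<message>.*)"
)


def failures(lines, max_level=0x0080):
    """Return (function, level, message) for lines at or below max_level."""
    found = []
    for line in lines:
        match = LOG_LINE.search(line.strip())
        if match is None:
            continue  # continuation lines or unrelated text
        level = int(match.group("level"), 16)
        if level <= max_level:
            found.append((match.group("function"), level, match.group("message")))
    return found


# A few lines taken from the domain log above.
SAMPLE_LOG = """\
[sssd[be[ipa.example.com]]] [fo_resolve_service_done] (0x0020): Failed to resolve server 'master.ipa.example.com:389': Domain name not found
[sssd[be[ipa.example.com]]] [set_server_common_status] (0x0100): Marking server 'master.ipa.example.com:389' as 'not working'
[sssd[be[ipa.example.com]]] [be_resolve_server_process] (0x0080): Couldn't resolve server (master.ipa.example.com:389), resolver returned (11)
[sssd[be[ipa.example.com]]] [be_resolve_server_process] (0x1000): Trying with the next one!
[sssd[be[ipa.example.com]]] [sdap_id_op_connect_done] (0x0020): Failed to connect, going offline (5 [Input/output error])
""".splitlines()

for function, level, message in failures(SAMPLE_LOG):
    print(f"{level:#06x} {function}: {message}")
```

Applied to the excerpt, the filter keeps the resolution failure and the “going offline” message and drops the trace lines in between.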

OK, that gets us somewhere. Indeed, our /etc/resolv.conf file was pointing to a bad nameserver. And after fixing the resolver settings and restarting SSSD, everything seems to be working:

$ getent passwd admin
admin:*:1546600000:1546600000:Administrator:/home/admin:/bin/bash
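The name resolution failure reported in the domain log can also be double-checked outside of SSSD. The helper below is a hypothetical sketch that goes through the system resolver by way of Python’s socket module, which may not behave exactly like SSSD’s own resolution code. The hostname and port are the ones from the log above.

```python
import socket


def can_resolve(hostname, port=389):
    """Return True if the system resolver finds an address for hostname."""
    try:
        socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return False
    return True


# The server name comes from the domain log; substitute your own.
print(can_resolve("master.ipa.example.com"))
```

If this prints False on the client, the resolver configuration is the first thing to look at before digging further into the SSSD logs.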

Awesome, we were able to repair a broken SSSD setup!
