原文:https://ops.tips/blog/how-linux-tcp-introspection/

在本文中,我们将为读者介绍套接字在准备接受连接之前,系统在幕后做了哪些工作,以及“准备好接受连接”倒底意味着什么 。为此,我们将深入介绍bind(2)listen(2)accept(2)等函数的内部运行机制,看看它们为构造套接字数据结构做了哪些方面的工作。

如果您想了解netstat命令幕后的故事,请一定坚持读到最后!

创建TCP套接字


为了建立TCP连接,我们必须创建相应的TCP套接字。所以,我们首先要做的,就是在服务器端创建一个套接字,同时,在客户端创建另一个套接字。

这一步至关重要,因为这实际上就是为通信的双方创建相应的端点。

在这一步中,通信双方调用的函数是相同的,即都是socket(2)函数。

int 
main (int argc, char** argv)
{
        // Create a socket in the AF_INET (ipv4) 
        // communication domain, of type `SOCK_STREAM`
        // (sequenced, reliable, two-way, connection-based
        // stream - yeah, tcp), using the most adequate
        // protocol (that last argument - 0).
        int fd = socket(AF_INET, SOCK_STREAM, 0); 

        if (fd == -1) {
                perror("socket");
                return 1;
        }

        return 0;
}

在上面的代码中,唯一让人感到神秘的地方恐怕就是socket(2)了!

请一定先阅读 How Linux creates Sockets

一旦用适当的协议族和具体协议创建好套接字之后,我们就可以考察如何利用一个与具体协议有关的调用来建立有效连接了。

将套接字绑定到地址


在绑定之前,我们创建的套接字虽然存在于给定的命名空间中,但尚未为其分配相应的地址——底层数据结构已经就绪(已分配),但还没有定义相关的语义。

函数bind(2) 的作用是,将在用户空间指定的地址分配给从socket(2)函数接收的文件描述符所引用的套接字。

/*
 * bind - bind a name to a socket.
 */
int bind(
        // The socket that we created before and which we 
        // want to associate an address with.
        int sockfd, 

        // Address varies depending on the address family.
        //
        // `struct sockaddr` defines a generic socket address,
        // but given that each family carries its own address
        // definition, we need to specialize this struct with
        // a `struct` that suits our protocol.
        const struct sockaddr *addr,

        // Size of the address structure pointed to by addr.
        socklen_t addrlen);

这个函数特别有趣:在接收地址数据时,为了做到尽可能地通用,它预期收到的参数有两个,一个参数是指向某内存块的指针,另一个参数表示内存块的大小。

USERSPACE:

        bind( socket, [ ..... piece of memory ...... ], size of the piece of memory)

                 "hey kernel, here's some chunk of memory that corresponds
                  to something that the socket's family understand; and
                  btw, its size if `N`".


KERNELSPACE:

        oh, thx! I'll let `af_inet` know about it!

                >> grabs the memory by copying it to kernelspace;
                >> forwards it to the af_inet implementation of `bind`.

例如,就这里来说(IPv4),我们可以监听所有接口(0.0.0.0)上的1337端口,为此,只需为结构体sockaddr_in填充相应的协议族、端口和地址信息即可:

/*
 * server_bind - binds a given socket to a specific
 *               port and address.
 * @listen_fd: the socket to bind to an address.
 */
int
server_bind(int listen_fd)
{
        // Structure describing an internet socket
        // address (ipv4).
    struct sockaddr_in server_addr = { 0 };
        int err = 0;

        // The family that the address belongs to.
        // AF_INET: ipv4 "internet" addresses.
    server_addr.sin_family      = AF_INET;

        // The IPV4 address in network byte order (big endian)
    server_addr.sin_addr.s_addr = htonl(INADDR_ANY);

        // The service port that we're willing to
        // bind the socket to (in network byte order).
    server_addr.sin_port        = htons(PORT);

        err = bind(listen_fd, 
                (struct sockaddr*)&server_addr, 
                sizeof(server_addr));
    if (err == -1) {
        perror("bind");
        fprintf(stderr, "Failed to bind socket to address\n");
        return err;
    }
}

虽然我们没有在堆中为结构体sockaddr_in分配内存空间(可以通过malloc或类似函数),但是,我们仍然创建了一块具有相应的大小的内存供其引用(因此,我们可以将其传递给bind函数)。

此外,虽然bind(2)预期接收的数据为一个sockaddr结构体,但实际上,我们几乎可以将任何具有addrlen规定大小的内存块传递给它,即其大小不必与sockaddr相同。

例如,我们不妨考察一下IPv4和IPv6协议族之间的差异。

在这两种情况下,都需调用bind(2)函数来为套接字分配地址,当然,与IPv4相比,IPv6具有更大的地址空间,因此,自然需要更大的结构体。

// Generic `sockaddr`. A pointer to
// a struct like this is expected by the
// `bind` syscall.
//
// size: 16B
struct sockaddr {
    sa_family_t sa_family;   // 2B
    char        sa_data[14]; // 14B
}


// The socket address representation of an
// IPv4 address.
//
// size: 16B
struct sockaddr_in {
    sa_family_t    sin_family; // 2B
    in_port_t      sin_port;   // 2B
    struct in_addr sin_addr;   // 4B

    // Just add the rest that is left
    // (padding it for reasons I don't know).
    unsigned char sin_zero[8]; // 8B
}


// The socket address representation of an
// IPv6 address.
//
// size: 28B
struct sockaddr_in6 {
    sa_family_t     sin6_family;   // 2B
    in_port_t       sin6_port;     // 2B
    uint32_t        sin6_flowinfo; // 4B
    struct in6_addr sin6_addr;     // 16B
    uint32_t        sin6_scope_id; // 4B
};

最重要的是,相关协议系列低层的绑定操作的具体实现是如何处理这样的内存块的——该内存块中应该是地址(无论协议系列认为地址是什么,这里都应该是地址)。

内核如何处理传递给bind函数的地址


跟踪上面程序的绑定操作,我们可以看到AF_INET套接字的bind(2)函数的堆栈跟踪信息:

# Trace the `inet_bind` method, the one that
# gets called whenever a `bind` is called
# on a socket that has been created for
# the `af_inet` family (regardless of the
# type - SOCK_STREAM or SOCK_DATAGRAM).
trace -K inet_bind
PID     TID     COMM            FUNC
28700   28700   bind.out        inet_bind
        inet_bind+0x1 [kernel]
        sys_bind+0xe [kernel]
        do_syscall_64+0x73 [kernel]
        entry_SYSCALL_64_after_hwframe+0x3d [kernel]

仔细考察上面的内容,我们就能理解整个过程是如何进行的。

首先,让我们从提供bind(2)函数的系统调用功能的方法sys_bind开始,我们可以看到:

  1. 查找为进程文件描述符保存的底层套接字;
  2. 将内存从用户空间复制到内核空间;然后
  3. 让低层的套接字协议族来处理绑定操作。

这个系统调用的定义位于net/socket.c文件中:

/*
 * Bind a name to a socket. Nothing much to do here since it's
 * the protocol's responsibility to handle the local address.
 * 
 * We move the socket address to kernel space before we call
 * the protocol layer (having also checked the address is ok).
 */
SYSCALL_DEFINE3(bind, 
        int, fd, 
        struct sockaddr __user*, umyaddr, 
        int, addrlen)
{
        // Reference to the underlying `struct socket`
        // associated with the file descriptor `fd` passed
        // from userspace.
    struct socket*          sock;
    struct sockaddr_storage address;
    int                     err, fput_needed;

    // Retrieve the underlying socket from the
    // file descriptor.
    sock = sockfd_lookup_light(fd, &err, &fput_needed);
    if (sock) {
        // Copy the address to kernel space.
        err = move_addr_to_kernel(umyaddr, addrlen, &address);
        if (err >= 0) {
            // Call any security hooks registered for
            // the `bind` operation
            err = security_socket_bind(
              sock, (struct sockaddr*)&address, addrlen);
            if (!err) {
                // Perform the underlying family's
                // bind operation.
                err = sock->ops->bind(
                  sock, (struct sockaddr*)&address, addrlen);
            }
        }
        fput_light(sock->file, fput_needed);
    }
    return err;
}

实际上,这里最能引起我关注的是:所有这些系统调用都提供了audit_或security_方法。

看起来,Linux安全模块的“命”是这些hook给的(参见 Linux Security Modules: General Security Hooks for Linux):

Linux安全模块LSM)框架提供了一种机制,可以通过新的内核扩展来hook各种安全检查

这太有意思了!

有了内核中的套接字地址,以及与找到的文件描述符关联的套接字结构,现在是时候调用sock->ops->bind了(就本例来说,即inet_bind)。

考虑到bind函数基本上就是一个“修改器”函数,从某种意义上说,它的用途就是改变socket结构中的某些内部字段,所以,在考察这些修改是如何进行的之前,先让我们来弄清楚修改的是哪些字段。

/**
 * Higher-level interface for any type of sockets
 * that we end up creating through `sockfs`.
 */
struct socket {
        // The state of the socket (not to confuse with
        // the transport state).
        // 
        // Note.: this is an enumeration of five possible
        // states: 
        // - SS_FREE = 0        not allocated
        // - SS_UNCONNECTED,    unconnected to any socket
        // - SS_CONNECTING, in process of connecting
        // - SS_CONNECTED,  connected to socket
        // - SS_DISCONNECTING   in process of disconnecting
        //
        // Given that we have already called `socket(2)`, at
        // this point, we're clearly not in the `SS_FREE` state.
    socket_state        state;

        // File description (kernelspace) associated with 
        // the file descriptor (userspace).
    struct file     *file;

         // Family-specific implementation of a
         // network socket.
    struct sock     *sk;
        // ...
};


/** 
 * AF_INET specialized representation of network sockets.sockets.
 *
 * struct inet_sock - representation of INET sockets
 *
 * @sk - ancestor class
 * @inet_daddr          - Foreign IPv4 addr
 * @inet_rcv_saddr      - Bound local IPv4 addr
 * @inet_dport          - Destination port
 * @inet_num            - Local port
 * @inet_saddr          - Sending source
 * @inet_sport          - Source port
 * @saddr               - Sending source
 */
 struct inet_sock {
    struct sock     sk;
#define inet_daddr      sk.__sk_common.skc_daddr
#define inet_rcv_saddr      sk.__sk_common.skc_rcv_saddr
#define inet_dport      sk.__sk_common.skc_dport
#define inet_num        sk.__sk_common.skc_num
    __be32          inet_saddr;
    // ...
};

根据这些定义,我们可以推断出哪些字段需要进行修改:与源地址和源端口相关的字段。

一旦开始进行检查并调用各种安全相关的hook,bind函数就会开始修改sock中的相关字段。

通过阅读下面的代码,我们可以明白倒底进行了哪些修改(详见net/ipv4/af_inet.c文件):

// `AF_INET` specific implementation of the
// `bind` operation (called by `sys_bind` after
// retrieving the underlying `struct socket`
// associated with the file descriptor supplied
// by the user from `userspace`).
int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
{
    struct sockaddr_in *addr = (struct sockaddr_in *)uaddr;

        // Retrieves the `struct sock` associated with the non-family
        // specific representation of a socket (`struct socket`).
    struct sock *sk = sock->sk;


        // Cast the socket to the `inet-specific` definition 
        // of a socket.
    struct inet_sock *inet = inet_sk(sk);
    struct net *net = sock_net(sk);
    unsigned short snum;


        // ...
    // Make sure that the address supplied is
    // indeed of the size of a `sockaddr_in`
    err = -EINVAL;
    if (addr_len < sizeof(struct sockaddr_in))
        goto out;


    // Make sure that the address contains the right family
    // specified in its struct.
    if (addr->sin_family != AF_INET) {
        err = -EAFNOSUPPORT;
        if (addr->sin_family != AF_UNSPEC || addr->sin_addr.s_addr != htonl(INADDR_ANY))
            goto out;
    }


        // ...
        // Grab the service port as set in the address 
        // struct supplied from userspace.
    snum = ntohs(addr->sin_port);
    err = -EACCES;

    // Here is where we perform the check to make sure 
        // that the user has the necessary privileges to
    // bind to a privileged port.
    if (snum && snum < inet_prot_sock(net) &&
        !ns_capable(net->user_ns, CAP_NET_BIND_SERVICE))
        goto out;


    // Can't bind after the socket is already active,
    // of if it's already bound.
    err = -EINVAL;
    if (sk->sk_state != TCP_CLOSE || inet->inet_num)
        goto out_release_sock;

    // Set the source address of the socket to the
    // one that we've supplied.
    inet->inet_rcv_saddr = inet->inet_saddr = addr->sin_addr.s_addr;
    if (chk_addr_ret == RTN_MULTICAST || chk_addr_ret == RTN_BROADCAST)
        inet->inet_saddr = 0;

    /* Make sure we are allowed to bind here. */
    // CC: this is where you can retrieve a "random" port
    //     if you don't specify one.
    if (
                (snum || !inet->bind_address_no_port) && // has a port set?
            sk->sk_prot->get_port(sk, snum)          // was able to grab the port
        ) {
        inet->inet_saddr = inet->inet_rcv_saddr = 0;
        err = -EADDRINUSE;
        goto out_release_sock;
    }

        // ...
        // Set the source port to the one that we've
        // specified in the address supplied.
    inet->inet_sport = htons(inet->inet_num);
    inet->inet_daddr = 0;
    inet->inet_dport = 0;
        // ...

    return err;
}

搞定!

至此,bind(2)就完成了它的使命:为内核中的底层套接字结构提供了一个地址(在用户空间指定)。

注意,目前我们仍然没有在/proc/net/tcp中看到任何套接字。关于这一点,我们稍后再详细介绍!

让套接字进入被动状态


一旦我们为套接字设置了地址,接下来就是让它扮演服务器或客户端的角色。

也就是说,它需要将自己设置为侦听传入的连接,或者启动与正在侦听的其他人的连接。

在这里,我们通过listen(2)来扮演服务器角色,即让自己来监听客户端连接。

// "listen for connections on a socket."
int listen(
        // File descriptor refering to a
        // socket of type SOCK_STREAM or
        // SOCK_SEQPACKET.
        int sockfd, 

        // Maximum length to which the queue
        // of pending connections for `sockfd`
        // can grow.
        int backlog);

至于我们是如何在用户空间完成这一任务的,具体见下列代码:

int
server_listen(int listen_fd)
{
        int err = 0;

    err = listen(listen_fd, BACKLOG);
    if (err == -1) {
        perror("listen");
        return err;
    }
}

根据手册页的说法:

listen()sockfd引用的套接字标记为被动套接字,即使用accept(2)接受传入的连接请求的套接字。

那么,“标记”究竟意味着什么呢?

此外,如果传入“backlog”参数的话,会出现什么情况呢?

解开系统调用listen的神秘面纱


与bind(2)非常相似,系统调用listen(2)的实现代码的作用是,找出与用户空间文件描述符相关联的套接字(与调用该系统调用的进程相关联),进行一些必要的检查,然后让协议族的实现代码来处理相关的语义。

该系统调用的实现代码,详见net/socket.c文件:

SYSCALL_DEFINE2(listen, 
        int, fd, 
        int, backlog)
{
    struct socket *sock;
    int err, fput_needed, somaxconn;

        // Retrieve the underlying socket from
        // the userspace file descriptor associated
        // with the process.
    sock = sockfd_lookup_light(fd, &err, &fput_needed);
    if (sock) {
                // Gather the `somaxconn` paremeter globally set
                // (/proc/sys/net/ipv4/somaxconn) and make use of it
                // so limit the size of the backlog that can be
                // specified.
                //
                // See https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt
        somaxconn = sock_net(sock->sk)->core.sysctl_somaxconn;
        if ((unsigned int)backlog > somaxconn)
            backlog = somaxconn;

                // Run the security hook associated with `listen`.
        err = security_socket_listen(sock, backlog);
        if (!err) {
                        // Call the ipv4 implementation of 
                        // `listen` that has been registered
                        // before at `socket(2)` time.
            err = sock->ops->listen(sock, backlog);
                }

        fput_light(sock->file, fput_needed);
    }
    return err;
}

非常有趣的是,我发现backlog只会受到网络命名空间的SOMAXCONN参数集的限制,而不是整个系统的限制。

我们能验证一下吗?当然!

小结

在本文中,我们为读者介绍套接字在准备接受连接之前,系统在幕后做了哪些工作,以及“准备好接受连接”倒底意味着什么 。由于篇幅较长,这里分为两个部分进行翻译,更多精彩内容,将在下篇中继续为读者介绍。

点击收藏 | 0 关注 | 1
  • 动动手指,沙发就是你的了!
登录 后跟帖