
System calls make the world go round!


I hate to break it to you, but a user application is a helpless brain in a vat:

Its every interaction with the outside world goes through the kernel, via system calls. An application cannot save a file, write to a terminal, or open a TCP connection without the kernel's help. And the kernel is deeply suspicious of the application: it assumes it is full of bugs, maybe even full of evil thoughts.

These system calls are function calls from an application into the kernel. For security reasons they use a specific mechanism, but really you're just calling the kernel's API. The term "system call" can refer to a particular function the kernel offers (for example, the open() system call) or to the calling mechanism itself. You can also simply say: syscall.

This article explains system calls, how they differ from library calls, and tools for spying on the operating system/application interface. If you thoroughly understand what happens between your application and the operating system, you can turn an impossible problem into a quick and fun one.

So, here is a running application, a user process:

It has a private virtual address space, its very own memory sandbox. This sandbox is the "vat" in our metaphor: the entire system, as far as the process can see. The program binary, along with the libraries it uses, is all mapped into that memory. The kernel itself is also mapped in as part of the address space.

Here is the code for our pid program, which obtains its process id via getpid(2):

#include <sys/types.h>
#include <unistd.h>
#include <stdio.h>

int main(void)
{
    pid_t p = getpid();
    printf("%d\n", p);
}
pid.c download

In Linux, a process isn't born knowing its PID. To find out, it must ask the kernel, and that request is itself a system call:

The first step is a call to the C library's getpid(), which is a wrapper for the system call. When you call functions like open(2), read(2), and friends, you are calling these wrappers. In fact, the native methods of most programming languages that talk to the OS are ultimately implemented on top of libc.

Wrappers offer convenience on top of these spartan OS APIs, and that helps keep the kernel lean. All kernel code runs in privileged mode, and a buggy kernel line can have fatal consequences. Anything that can be done in user mode should stay in user mode. The libraries provide friendly interfaces and handle parameter massaging, as printf(3) does.

Compare this to a web API: the kernel's wrappers are like a service exposing the simplest possible HTTP interface, on top of which libraries and helper methods are provided for specific languages. A wrapper can also do some caching, which is what libc's getpid() does: the first call really performs a system call, after which the PID is cached so that subsequent calls avoid the syscall overhead.

Once the wrapper has done its thing, it's time to jump into kernel hyperspace. The mechanism for this transition varies by processor architecture. On Intel processors, the arguments and the system call number are loaded into registers, then an instruction is executed that puts the CPU into privileged mode and immediately transfers control to a global system call entry point in the kernel. If you're interested in the details, David Drysdale has two excellent articles on LWN (one, two).

The kernel then uses the system call number as an index into sys_call_table, an array of function pointers, one per system call. In this case it calls sys_getpid:

In Linux, system calls are implemented mostly as architecture-independent C functions, sometimes trivial ones, insulated from the system call mechanism itself thanks to the kernel's good design. They are ordinary code working on general kernel data structures. Well, apart from the completely paranoid parameter validation.

Once their work is done, they return normally, the architecture-specific code takes over to transition back to user mode, and the wrapper does some post-processing. In our example, getpid(2) now caches the PID returned by the kernel. If the kernel returned an error, the wrapper may also set the global errno variable. These details show you how GNU handles it.

If you want a raw call, glibc provides the syscall(2) function, which produces a system call without any wrapping. You can also use it to roll your own wrappers. There is nothing magical or sacred about the C library here.

This system call design has far-reaching consequences. Let's start with the incredibly useful strace(1), a tool you can use to spy on the system calls made by Linux processes (on Macs, see dtruss(1m) and the amazing dtrace; on Windows, see sysinternals). Here's a trace of the pid program:

~/code/x86-os$ strace ./pid

execve("./pid", ["./pid"], [/* 20 vars */]) = 0
brk(0)                                  = 0x9aa0000
access("/etc/", F_OK)      = -1 ENOENT (No such file or directory)
mmap2(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7767000
access("/etc/", R_OK)      = -1 ENOENT (No such file or directory)
open("/etc/", O_RDONLY|O_CLOEXEC) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=18056, ...}) = 0
mmap2(NULL, 18056, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb7762000
close(3)                                = 0


getpid()                                = 14678
fstat64(1, {st_mode=S_IFCHR|0600, st_rdev=makedev(136, 1), ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7766000
write(1, "14678\n", 614678
)                  = 6
exit_group(6)                           = ?

Each line of output shows a system call, its arguments, and the return value. If you run getpid(2) in a loop 1000 times, you will still find only one getpid() system call, because the PID is cached. We can also see that printf(3) calls write(2) after formatting the output string.

strace can start a new process or attach to one that's already running. You can learn a lot by looking at the system calls made by different programs. For example, what does the sshd daemon do all day?

~/code/x86-os$ ps ax | grep sshd
12218 ?        Ss     0:00 /usr/sbin/sshd -D

~/code/x86-os$ sudo strace -p 12218
Process 12218 attached - interrupt to quit
select(7, [3 4], NULL, NULL, NULL

  ... nothing happens ...
  No fun, it's just waiting for a connection using select(2)
  If we wait long enough, we might see new keys being generated and so on, but
  let's attach again, tell strace to follow forks (-f), and connect via SSH

~/code/x86-os$ sudo strace -p 12218 -f

[lots of calls happen during an SSH login, only a few shown]

[pid 14692] read(3, "-----BEGIN RSA PRIVATE KEY-----\n"..., 1024) = 1024
[pid 14692] open("/usr/share/ssh/blacklist.RSA-2048", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
[pid 14692] open("/etc/ssh/blacklist.RSA-2048", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
[pid 14692] open("/etc/ssh/ssh_host_dsa_key", O_RDONLY|O_LARGEFILE) = 3
[pid 14692] open("/etc/protocols", O_RDONLY|O_CLOEXEC) = 4
[pid 14692] read(4, "# Internet (IP) protocols\n#\n# Up"..., 4096) = 2933
[pid 14692] open("/etc/hosts.allow", O_RDONLY) = 4
[pid 14692] open("/lib/i386-linux-gnu/", O_RDONLY|O_CLOEXEC) = 4
[pid 14692] stat64("/etc/pam.d", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
[pid 14692] open("/etc/pam.d/common-password", O_RDONLY|O_LARGEFILE) = 8
[pid 14692] open("/etc/pam.d/other", O_RDONLY|O_LARGEFILE) = 4

Digesting SSH's calls is a tall order, but if you can follow them you have learned to trace. Being able to see which files an application opens is useful ("where is this config coming from?"). If you have a misbehaving process, you can strace it and see what it is doing via its system calls. When an application quits unexpectedly without a decent error message, check whether a system call failed. You can also use filters, count the time spent in each call, and so on:

~/code/x86-os$ strace -T -e trace=recv curl -silent > /dev/null

recv(3, "HTTP/1.1 200 OK\r\nDate: Wed, 05 N"..., 16384, 0) = 4164 <0.000007>
recv(3, "fl a{color:#36c}a:visited{color:"..., 16384, 0) = 2776 <0.000005>
recv(3, "adient(top,#4d90fe,#4787ed);filt"..., 16384, 0) = 4164 <0.000007>
recv(3, "gbar.up.spd(b,d,1,!0);break;case"..., 16384, 0) = 2776 <0.000006>
recv(3, "$),a.i.G(!0)),"..., 16384, 0) = 1388 <0.000004>
recv(3, "margin:0;padding:5px 8px 0 6px;v"..., 16384, 0) = 1388 <0.000007>
recv(3, "){window.setTimeout(function(){v"..., 16384, 0) = 1484 <0.000006>

I encourage you to experiment with these tools on your operating system. Using them well makes you feel like you have superpowers.

But useful as these tools are, they also teach us something about the design. We have seen that a user-space application is strictly confined to its own virtual address space, running in Ring 3 (unprivileged mode). In general, tasks that involve only computation and memory access do not require any system calls. For example, C library functions like strlen(3) and memcpy(3) don't need the kernel to do anything; they run entirely inside the application.

The man page sections (the numbers in parentheses, like the 2 and 3 above) also offer clues: Section 2 is for system call wrappers, while Section 3 contains other C library functions. But, as we saw with printf(3), a library function may ultimately produce one or more system calls.

If you are curious, here are complete lists of system calls for Linux (also Filippo's list) and Windows. They have approximately 310 and 460 system calls, respectively. It is fun to look at these calls because they represent everything software can do on a modern computer. Plus, you may find gems to help with interprocess communication and performance. This is an area where "those who do not understand Unix are condemned to reinvent it, poorly." (Translator's note: this famous quote is from Henry Spencer and reflects the Unix design philosophy.)

Many system calls perform tasks that take a long time compared to CPU cycles, for example reading from a hard drive. In these situations the calling process is put to sleep until the underlying work completes. Because CPUs run so fast, typical programs are I/O bound and spend most of their lives sleeping, waiting on system calls. By contrast, if you trace a computationally intensive task, you will often see no system calls at all. In that case, top(1) shows heavy CPU usage.

The overhead of a system call can itself be a problem. For example, solid-state drives are much faster than spinning disks, but the operating system's overhead might now cost more than the I/O operation itself. A program doing a huge number of reads and writes can be bottlenecked on OS overhead. Vectored I/O can help with that. So can memory-mapped files, which allow a program to read or write disk files using only memory accesses. Analogous mappings exist for things like video card memory. Ultimately, the economics of cloud computing may drive kernels to eliminate or minimize user mode/kernel mode switches.

Finally, system calls have interesting security implications. One is that no matter how obfuscated a binary is, you can inspect its behavior by watching its system calls. This can be used to detect malware, for example. We can also record the syscall profile of an unknown program and alert on anomalous behavior, or whitelist the calls a program is allowed to make, which makes exploitation harder. There is a lot of research and tooling in this area, but no "killer" solution yet.

And that's it for system calls. Sorry this article ran a bit long; I hope it was useful. More (and shorter) articles to come; you can follow me via RSS and Twitter. This article is dedicated to the glorious Clube Atlético Mineiro.


Author: Gustavo Duarte; Translator: qhwdw; Proofreader: wxy

This article was translated by LCTT and proudly presented by Linux China.