How To Store Address In A Register X86

TL;DR

This blog post explains how Linux programs call functions in the Linux kernel.

It will outline several different methods of making system calls, how to handcraft your own assembly to make system calls (examples included), kernel entry points into system calls, kernel exit points from system calls, glibc wrappers, bugs, and much, much more.

What is a system call?

When you run a program which calls open, fork, read, write (and many others) you are making a system call.

System calls are how a program enters the kernel to perform some task. Programs use system calls to perform a variety of operations such as: creating processes, doing network and file IO, and much more.

You can find a list of system calls by checking the man page for syscalls(2).

There are several different ways for user programs to make system calls and the low-level instructions for making a system call vary among CPU architectures.

As an application developer, you don't typically need to think about how exactly a system call is made. You simply include the appropriate header file and make the call as if it were a normal function.

glibc provides wrapper code which abstracts you away from the underlying code which arranges the arguments you've passed and enters the kernel.
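
To make the wrapper idea a bit more concrete, here is a minimal sketch that issues the same write twice: once through the glibc wrapper and once through glibc's generic syscall function with an explicit system call number. The helper names write_with_wrapper and write_with_syscall are made up for this illustration.

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/syscall.h>   /* SYS_write */
#include <unistd.h>        /* write, syscall */

/* Hypothetical helper: the usual way, through the glibc wrapper. */
long write_with_wrapper(int fd, const char *msg)
{
    return write(fd, msg, strlen(msg));
}

/* Hypothetical helper: the same request, but naming the system call
 * number explicitly and letting glibc's generic syscall() enter the kernel. */
long write_with_syscall(int fd, const char *msg)
{
    return syscall(SYS_write, fd, msg, strlen(msg));
}
```

Both helpers return the number of bytes written, or -1 with errno set, because in both cases glibc translates the kernel's raw result before handing it back.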

Before we can dive into the details of how system calls are made, we'll need to define some terms and examine some core ideas that will appear later.

Prerequisite information

Hardware and software

This blog post makes the following assumptions:

  • You are using a 32-bit or 64-bit Intel or AMD CPU. The discussion about the methods may be useful for people using other systems, but the code samples below contain CPU-specific code.
  • You are interested in the Linux kernel, version 3.13.0. Other kernel versions will be similar, but the exact line numbers, organization of code, and file paths will vary. Links to the 3.13.0 kernel source tree on GitHub are provided.
  • You are interested in glibc or glibc derived libc implementations (e.g., eglibc).

x86-64 in this blog post will refer to 64bit Intel and AMD CPUs that are based on the x86 architecture.

User programs, the kernel, and CPU privilege levels

User programs (like your editor, terminal, ssh daemon, etc) need to interact with the Linux kernel so that the kernel can perform a set of operations on behalf of your user programs that they can't perform themselves.

For example, if a user program needs to do some sort of IO (open, read, write, etc) or modify its address space (mmap, sbrk, etc) it must trigger the kernel to run to complete those actions on its behalf.

What prevents user programs from performing these actions themselves?

It turns out that x86-64 CPUs have a concept called privilege levels. Privilege levels are a complex topic suitable for their own blog post. For the purposes of this post, we can (greatly) simplify the concept of privilege levels by saying:

  1. Privilege levels are a means of access control. The current privilege level determines which CPU instructions and IO may be performed.
  2. The kernel runs at the most privileged level, called "Ring 0". User programs run at a lesser level, typically "Ring 3".

In order for a user program to perform some privileged operation, it must cause a privilege level change (from "Ring 3" to "Ring 0") so that the kernel can execute.
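
If you want to see the ring a user program runs in, the sketch below reads the CS segment register and keeps its low two bits, which on x86 hold the current privilege level. The function name current_privilege_level is made up for this example, and on a non-x86 build it simply returns -1.

```c
/* Returns the current privilege level (0-3) on x86, or -1 elsewhere.
 * Relies on the low two bits of the CS selector holding the level
 * the CPU is currently running at. */
int current_privilege_level(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned short cs;
    __asm__ volatile ("mov %%cs, %0" : "=r" (cs));
    return cs & 3;
#else
    return -1;
#endif
}
```

Called from an ordinary Linux process on x86, this should report Ring 3.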

There are several ways to cause a privilege level change and trigger the kernel to perform some action.

Let's start with a common way to cause the kernel to execute: interrupts.

Interrupts

You can think of an interrupt as an event that is generated (or "raised") by hardware or software.

A hardware interrupt is raised by a hardware device to notify the kernel that a particular event has occurred. A common example of this type of interrupt is an interrupt generated when a NIC receives a packet.

A software interrupt is raised by executing a piece of code. On x86-64 systems, a software interrupt can be raised by executing the int instruction.

Interrupts usually have numbers assigned to them. Some of these interrupt numbers have a special meaning.

You can imagine an array that lives in memory on the CPU. Each entry in this array maps to an interrupt number. Each entry contains the address of a function that the CPU will begin executing when that interrupt is received along with some options, like what privilege level the interrupt handler function should be executed in.

Here's a diagram from the Intel CPU manual showing the layout of an entry in this array:

Screenshot of Interrupt Descriptor Table entry diagram for x86_64 CPUs

If you look closely at the diagram, you can see a 2-bit field labeled DPL (Descriptor Privilege Level). The value in this field determines the minimum privilege level the CPU will be in when the handler function is executed.

This is how the CPU knows which address it should execute when a particular type of event is received and what privilege level the handler for that event should execute in.
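
As a rough model of one entry in that array, here is a sketch of a 64-bit mode gate descriptor as a C struct, with helpers to pack and unpack the handler address and the DPL field. The struct and function names are illustrative and are not the kernel's own definitions.

```c
#include <stdint.h>

/* One 16-byte gate descriptor in 64-bit mode. Field names are made up
 * for this sketch. */
struct idt_gate {
    uint16_t offset_low;    /* handler address bits 15:0 */
    uint16_t selector;      /* code segment selector */
    uint8_t  ist;           /* interrupt stack table index (bits 2:0) */
    uint8_t  type_attr;     /* type (bits 3:0), DPL (bits 6:5), present (bit 7) */
    uint16_t offset_mid;    /* handler address bits 31:16 */
    uint32_t offset_high;   /* handler address bits 63:32 */
    uint32_t reserved;
};

/* Builds a present 64-bit interrupt gate (type 0xE) with the given DPL. */
struct idt_gate make_gate(uint64_t handler, uint16_t selector, unsigned dpl)
{
    struct idt_gate g = {0};
    g.offset_low  = (uint16_t)(handler & 0xffff);
    g.selector    = selector;
    g.type_attr   = (uint8_t)(0x80 | ((dpl & 3) << 5) | 0xE);
    g.offset_mid  = (uint16_t)((handler >> 16) & 0xffff);
    g.offset_high = (uint32_t)(handler >> 32);
    return g;
}

/* Extracts the 2-bit DPL field. */
unsigned gate_dpl(const struct idt_gate *g)
{
    return (g->type_attr >> 5) & 3;
}

/* Reassembles the handler address from its three pieces. */
uint64_t gate_handler(const struct idt_gate *g)
{
    return ((uint64_t)g->offset_high << 32) |
           ((uint64_t)g->offset_mid << 16) |
           g->offset_low;
}
```

For example, make_gate(handler, 0x10, 3) describes a handler reached through selector 0x10 with a DPL of 3, and gate_dpl returns that 3 again.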

In practice, there are lots of different ways to deal with interrupts on x86-64 systems. If you are interested in learning more, read about the 8259 Programmable Interrupt Controller, Advanced Interrupt Controllers, and IO Advanced Interrupt Controllers.

There are other complexities involved with dealing with both hardware and software interrupts, such as interrupt number collisions and remapping.

We don't need to concern ourselves with these details for this discussion about system calls.

Model Specific Registers (MSRs)

Model Specific Registers (also known as MSRs) are control registers that have a specific purpose to control certain features of the CPU. The CPU documentation lists the addresses of each of the MSRs.

You can use the CPU instructions rdmsr and wrmsr to read and write MSRs, respectively.

There are also command line tools which allow you to read and write MSRs, but doing this is not recommended as changing these values (especially while a system is running) is dangerous unless you are really careful.

If you don't mind potentially destabilizing your system or irreversibly corrupting your data, you can read and write MSRs by installing msr-tools and loading the msr kernel module:

    % sudo apt-get install msr-tools
    % sudo modprobe msr
    % sudo rdmsr
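
If you'd rather read an MSR from C, the msr kernel module exposes one device file per CPU at /dev/cpu/N/msr, and the register address is passed as the file offset. The sketch below is read-only; the function name read_msr is made up for this example, and it assumes the module is loaded and you are running as root.

```c
#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64   /* MSR addresses can exceed a 32-bit off_t */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Reads one MSR on the given CPU through the msr driver's device file.
 * Returns 0 on success, -1 if the device cannot be opened or read
 * (module not loaded, not root, no such CPU, or no such MSR). */
int read_msr(int cpu, uint32_t msr, uint64_t *value)
{
    char path[64];
    snprintf(path, sizeof(path), "/dev/cpu/%d/msr", cpu);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* The register address is the offset; every MSR is 8 bytes wide. */
    ssize_t n = pread(fd, value, sizeof(*value), (off_t)msr);
    close(fd);
    return n == (ssize_t)sizeof(*value) ? 0 : -1;
}
```

A call such as read_msr(0, 0x176, &value) would ask CPU 0 for the MSR at address 0x176.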

Some of the system call methods we'll see later make use of MSRs.

Calling system calls with assembly is a bad idea

It's not a great idea to call system calls by writing your own assembly code.

One big reason for this is that some system calls have additional code that runs in glibc before or after the system call runs.

In the examples below, we'll be using the exit system call. It turns out that you can register functions to run when exit is called by a program by using atexit.

Those functions are called from glibc, not the kernel. So, if you write your own assembly to call exit as we show below, your registered handler functions won't be executed since you are bypassing glibc.
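
You can watch this happen without writing any assembly. The sketch below forks a child that registers an atexit handler and then leaves either through glibc's exit or through the raw exit system call (issued with glibc's generic syscall function). The helper name atexit_handler_ran is made up for this example.

```c
#define _GNU_SOURCE
#include <stdlib.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

static int report_fd = -1;

/* atexit handler: tells the parent that glibc's exit path ran. */
static void on_exit_handler(void)
{
    ssize_t n = write(report_fd, "x", 1);
    (void)n;
}

/* Forks a child that registers an atexit handler, then exits through
 * glibc's exit() (bypass_glibc == 0) or the raw exit system call.
 * Returns 1 if the handler ran, 0 if it did not, -1 on error. */
int atexit_handler_ran(int bypass_glibc)
{
    int p[2];
    if (pipe(p) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        close(p[0]);
        report_fd = p[1];
        atexit(on_exit_handler);
        if (bypass_glibc)
            syscall(SYS_exit, 0);   /* straight into the kernel */
        exit(0);                    /* glibc runs the handlers first */
    }

    close(p[1]);
    char c;
    ssize_t n = read(p[0], &c, 1);  /* 0 bytes means the handler never ran */
    close(p[0]);
    waitpid(pid, NULL, 0);
    return n == 1 ? 1 : 0;
}
```

atexit_handler_ran(0) reports that the handler ran, while atexit_handler_ran(1) reports that it did not.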

Nevertheless, manually making system calls with assembly is a good learning experience.

Legacy system calls

Using our prerequisite knowledge we know two things:

  1. We know that we can trigger the kernel to execute by generating a software interrupt.
  2. We can generate a software interrupt with the int assembly instruction.

Combining these two concepts leads us to the legacy system call interface on Linux.

The Linux kernel sets aside a specific software interrupt number that can be used by user space programs to enter the kernel and execute a system call.

The Linux kernel registers an interrupt handler named ia32_syscall for the interrupt number: 128 (0x80). Let's take a look at the code that actually does this.

From the trap_init function in the kernel 3.13.0 source in arch/x86/kernel/traps.c:

    void __init trap_init(void)
    {
            /* ..... other code ... */

            set_system_intr_gate(IA32_SYSCALL_VECTOR, ia32_syscall);

Where IA32_SYSCALL_VECTOR is defined as 0x80 in arch/x86/include/asm/irq_vectors.h.

But, if the kernel reserves a single software interrupt that userland programs can raise to trigger the kernel, how does the kernel know which of the many system calls it should execute?

The userland program is expected to put the system call number in the eax register. The arguments for the syscall itself are to be placed in the remaining general purpose registers.

One place this is documented is in a comment in arch/x86/ia32/ia32entry.S:

     * Emulated IA32 system calls via int 0x80.
     *
     * Arguments:
     * %eax System call number.
     * %ebx Arg1
     * %ecx Arg2
     * %edx Arg3
     * %esi Arg4
     * %edi Arg5
     * %ebp Arg6    [note: not saved in the stack frame, should not be touched]
     *

Now that we know how to make a system call and where the arguments should live, let's try to make one by writing some inline assembly.

Using legacy system calls with your own assembly

To make a legacy system call, you can write a small bit of inline assembly. While this is interesting from a learning perspective, I encourage readers to never make system calls by crafting their own assembly.

In this example, we'll try calling the exit system call, which takes a single argument: the exit status.

First, we need to find the system call number for exit. The Linux kernel includes a file which lists each system call in a table. This file is processed by various scripts at build time to generate header files which can be used by user programs.

Let's look at the table found in arch/x86/syscalls/syscall_32.tbl:

The exit syscall is number 1. According to the interface described above, we just need to move the syscall number into the eax register and the first argument (the exit status) into ebx.

Here's a piece of C code with some inline assembly that does this. Let's set the exit status to "42":

(This example can be simplified, but I thought it would be interesting to make it a bit more wordy than necessary so that anyone who hasn't seen GCC inline assembly before can use this as an example or reference.)

    int
    main(int argc, char *argv[])
    {
      unsigned int syscall_nr = 1;
      int exit_status = 42;

      asm ("movl %0, %%eax\n"
           "movl %1, %%ebx\n"
           "int $0x80"
        : /* output parameters, we aren't outputting anything, so none */
          /* (none) */
        : /* input parameters mapped to %0 and %1, respectively */
          "m" (syscall_nr), "m" (exit_status)
        : /* registers that we are "clobbering", unneeded since we are calling exit */
          "eax", "ebx");
    }

Next, compile, execute, and check the exit status:

    $ gcc -o test test.c
    $ ./test
    $ echo $?
    42

Success! We called the exit system call using the legacy system call method by raising a software interrupt.

Kernel-side: int $0x80 entry point

So now that we've seen how to trigger a system call from a userland program, let's see how the kernel uses the system call number to execute the system call code.

Recall from the previous section that the kernel registered a syscall handler function called ia32_syscall.

This function is implemented in assembly in arch/x86/ia32/ia32entry.S and we can see several things happening in this function, the most important of which is the call to the actual syscall itself:

    ia32_do_call:
            IA32_ARG_FIXUP
            call *ia32_sys_call_table(,%rax,8) # xxx: rip relative

IA32_ARG_FIXUP is a macro which rearranges the legacy arguments so that they may be properly understood by the current system call layer.

The ia32_sys_call_table identifier refers to a table which is defined in arch/x86/ia32/syscall_ia32.c. Note the #include line toward the end of the code:

    const sys_call_ptr_t ia32_sys_call_table[__NR_ia32_syscall_max+1] = {
            /*
             * Smells like a compiler bug -- it doesn't work
             * when the & below is removed.
             */
            [0 ... __NR_ia32_syscall_max] = &compat_ni_syscall,
    #include <asm/syscalls_32.h>
    };

Recall earlier we saw the syscall table defined in arch/x86/syscalls/syscall_32.tbl.

There are a few scripts which run at compile time which take this table and generate the syscalls_32.h file from it. The generated header file is comprised of valid C code, which is simply inserted with the #include shown above to fill in ia32_sys_call_table with function addresses indexed by system call number.
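
Here is a toy version of that pattern, to show how a generated list of entries can fill in a table of function pointers. Everything in it (the TOY_ names, the two stand-in "system calls", and the numbers they are assigned) is invented for illustration. Unlike the kernel, this sketch leaves unused slots as NULL and checks for that at dispatch time instead of pre-filling every slot with a "not implemented" function.

```c
#include <errno.h>
#include <stddef.h>

typedef long (*sys_call_ptr_t)(long arg);

/* Two stand-in "system calls" for the sketch. */
static long toy_exit(long status) { return status; }
static long toy_double(long x)    { return 2 * x; }

#define TOY_SYSCALL_MAX 60

/* A build script would normally write these lines into a header that is
 * pulled in with #include; a macro list plays that role here. */
#define TOY_SYSCALLS \
    TOY_SYSCALL(1,  toy_exit)   \
    TOY_SYSCALL(60, toy_double)

/* Each generated entry becomes a designated initializer: [nr] = function. */
#define TOY_SYSCALL(nr, sym) [nr] = sym,
static const sys_call_ptr_t toy_sys_call_table[TOY_SYSCALL_MAX + 1] = {
    TOY_SYSCALLS
};
#undef TOY_SYSCALL

/* Index the table by system call number, as the entry code does with rax. */
long toy_dispatch(unsigned long nr, long arg)
{
    if (nr > TOY_SYSCALL_MAX || toy_sys_call_table[nr] == NULL)
        return -ENOSYS;
    return toy_sys_call_table[nr](arg);
}
```

With this table, toy_dispatch(1, 42) lands in toy_exit, and any number without an entry comes back as -ENOSYS.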

And this is how you enter the kernel via a legacy system call.

Returning from a legacy system call with iret

We've seen how to enter the kernel with a software interrupt, but how does the kernel return back to the user program and drop the privilege level after it has finished running?

If we turn to the (warning: large PDF) Intel Software Developer's Manual we can find a helpful diagram that illustrates how the program stack will be arranged when a privilege level change occurs.

Let's take a look:

Screenshot of the Stack Usage on Transfers to Interrupt and Exception-Handling Routines

When execution is transferred to the kernel function ia32_syscall via the execution of a software interrupt from a user program, a privilege level change occurs. The result is that the stack when ia32_syscall is entered will look like the diagram above.

This means that the return address, the CPU flags which encode the privilege level (and other stuff), and more are all saved on the program stack before ia32_syscall executes.

So, in order to resume execution the kernel just needs to copy these values from the program stack back into the registers where they belong and execution will resume back in userland.

OK, so how do you do that?

There are a few ways to do that, but one of the easiest ways is to use the iret instruction.

The Intel instruction set manual explains that the iret instruction pops the return address and saved register values from the stack in the order they were prepared:

As with a real-address mode interrupt return, the IRET instruction pops the return instruction pointer, return code segment selector, and EFLAGS image from the stack to the EIP, CS, and EFLAGS registers, respectively, and then resumes execution of the interrupted program or procedure.
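
To picture what sits on the stack, here is a sketch of the saved values as a C struct for the 64-bit case, lowest address first, which is also the order iret pops them. The struct and function names are invented for this example. The helper uses the fact that the low two bits of the saved code segment selector hold the privilege level that iret will return to.

```c
#include <stdint.h>

/* Values saved on the kernel stack for a 64-bit interrupt entry,
 * lowest address first. Names are illustrative. */
struct iret_frame {
    uint64_t ip;      /* return instruction pointer */
    uint64_t cs;      /* return code segment selector */
    uint64_t flags;   /* saved flags image */
    uint64_t sp;      /* stack pointer to restore */
    uint64_t ss;      /* stack segment selector to restore */
};

/* The privilege level iret will return to: low two bits of saved CS. */
int frame_privilege_level(const struct iret_frame *f)
{
    return (int)(f->cs & 3);
}
```

A frame saved from a user program has a code segment selector whose low bits are 3, so the helper reports Ring 3 for it.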

Finding this code in the Linux kernel is a bit difficult as it is hidden beneath several macros and there is extensive care taken to deal with things like signals and ptrace system call exit tracking.

Eventually all the macros in the assembly stubs in the kernel reveal the iret which returns from a system call back to a user program.

From irq_return in arch/x86/kernel/entry_64.S:

    irq_return:
      INTERRUPT_RETURN

Where INTERRUPT_RETURN is defined in arch/x86/include/asm/irqflags.h as iretq.

And now you know how legacy system calls work.

Fast system calls

The legacy method seems pretty reasonable, but there are newer ways to trigger a system call which don't involve a software interrupt and are much faster than using a software interrupt.

Each of the two faster methods is comprised of two instructions: one to enter the kernel and one to leave. Both methods are described in the Intel CPU documentation as "Fast System Call".

Unfortunately, Intel and AMD implementations have some disagreement on which method is valid when a CPU is in 32bit or 64bit mode.

In order to maximize compatibility across both Intel and AMD CPUs:

  • On 32bit systems use: sysenter and sysexit.
  • On 64bit systems use: syscall and sysret.
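
If you are curious what your own CPU advertises, the CPUID instruction reports both instruction pairs as feature bits. The sketch below uses the __get_cpuid helper from the compiler's cpuid.h header (shipped with GCC and Clang); the function names cpu_has_sysenter and cpu_has_syscall are made up for this example, and on a non-x86 build they return -1.

```c
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* Each helper returns 1 if the CPU advertises the instruction pair,
 * 0 if it does not, and -1 when not built for x86. */
int cpu_has_sysenter(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    return (edx >> 11) & 1;   /* leaf 1, EDX bit 11: sysenter/sysexit */
#else
    return -1;
#endif
}

int cpu_has_syscall(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx))
        return 0;
    return (edx >> 11) & 1;   /* leaf 0x80000001, EDX bit 11: syscall/sysret */
#else
    return -1;
#endif
}
```

The answer can depend on the mode the program runs in, which is the disagreement between vendors mentioned above.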

32-bit fast system calls

sysenter/sysexit

Using sysenter to make a system call is more complicated than using the legacy interrupt method and involves more coordination between the user program (via glibc) and the kernel.

Let's take it one step at a time and sort out the details. First, let's see what the documentation in the Intel Instruction Set Reference (warning: very large PDF) says about sysenter and how to use it.

Let's take a look:

Prior to executing the SYSENTER instruction, software must specify the privilege level 0 code segment and code entry point, and the privilege level 0 stack segment and stack pointer by writing values to the following MSRs:

• IA32_SYSENTER_CS (MSR address 174H): The lower 16 bits of this MSR are the segment selector for the privilege level 0 code segment. This value is also used to determine the segment selector of the privilege level 0 stack segment (see the Operation section). This value cannot indicate a null selector.

• IA32_SYSENTER_EIP (MSR address 176H): The value of this MSR is loaded into RIP (thus, this value references the first instruction of the selected operating procedure or routine). In protected mode, only bits 31:0 are loaded.

• IA32_SYSENTER_ESP (MSR address 175H): The value of this MSR is loaded into RSP (thus, this value contains the stack pointer for the privilege level 0 stack). This value cannot represent a non-canonical address. In protected mode, only bits 31:0 are loaded.

In other words: in order for the kernel to receive incoming system calls with sysenter, the kernel must set 3 Model Specific Registers (MSRs). The most interesting MSR in our case is IA32_SYSENTER_EIP (which has the address 0x176). This MSR is where the kernel should specify the address of the function that will execute when a sysenter instruction is executed by a user program.

We can find the code in the Linux kernel which writes to the MSR in arch/x86/vdso/vdso32-setup.c:

    void enable_sep_cpu(void)
    {
            /* ... other code ... */

            wrmsr(MSR_IA32_SYSENTER_EIP, (unsigned long) ia32_sysenter_target, 0);

Where MSR_IA32_SYSENTER_EIP is defined as 0x00000176 in arch/x86/include/uapi/asm/msr-index.h.

Much like the legacy software interrupt syscalls, there is a defined convention for making system calls with sysenter.

One place this is documented is in a comment in arch/x86/ia32/ia32entry.S:

     * 32bit SYSENTER instruction entry.
     *
     * Arguments:
     * %eax System call number.
     * %ebx Arg1
     * %ecx Arg2
     * %edx Arg3
     * %esi Arg4
     * %edi Arg5
     * %ebp user stack
     * 0(%ebp) Arg6

Recall that the legacy system call method includes a mechanism for returning back to the userland program which was interrupted: the iret instruction.

Capturing the logic needed to make sysenter work properly is complicated because unlike software interrupts, sysenter does not store the return address.

How, exactly, the kernel does this and other bookkeeping prior to executing a sysenter instruction can change over time (and it has changed, as you will see in the Bugs section below).

In order to protect against future changes, user programs are intended to use a function called __kernel_vsyscall which is implemented in the kernel, but mapped into each user process when the process is started.

This is a bit odd; it's code that comes with the kernel, but runs in userland.

It turns out that __kernel_vsyscall is part of something called a virtual Dynamic Shared Object (vDSO) which exists to allow programs to execute kernel code in userland.

We'll examine what the vDSO is, what it does, and how it works in depth later.

For now, let's examine the __kernel_vsyscall internals.

__kernel_vsyscall internals

The __kernel_vsyscall function that encapsulates the sysenter calling convention can be found in arch/x86/vdso/vdso32/sysenter.S:

    __kernel_vsyscall:
    .LSTART_vsyscall:
            push %ecx
    .Lpush_ecx:
            push %edx
    .Lpush_edx:
            push %ebp
    .Lenter_kernel:
            movl %esp,%ebp
            sysenter

Since __kernel_vsyscall is part of a Dynamic Shared Object (also known as a shared library), how does a user program locate the address of that function at runtime?

The address of the __kernel_vsyscall function is written into an ELF auxiliary vector where a user program or library (typically glibc) can find it and use it.

There are a few methods for searching ELF auxiliary vectors:

  1. By using getauxval with the AT_SYSINFO argument.
  2. By iterating to the end of the environment variables and parsing them from memory.

Option 1 is the simplest option, but does not exist on glibc prior to 2.16. The example code shown below illustrates option 2.

As we can see in the code above, __kernel_vsyscall does some bookkeeping before executing sysenter.

So, all we need to do to manually enter the kernel with sysenter is:

  • Search the ELF auxiliary vectors for AT_SYSINFO where the address of __kernel_vsyscall is written.
  • Put the system call number and arguments into the registers as we would normally for legacy system calls
  • Call the __kernel_vsyscall function
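
The first step, searching the auxiliary vector, is a one-liner when getauxval is available (glibc 2.16 and later). The sketch below wraps it in two helpers with made-up names. It also looks up AT_SYSINFO_EHDR, an entry not discussed in this post, which holds the base address of the vDSO image itself. Note that AT_SYSINFO is only supplied to 32-bit x86 programs, so the first helper returns 0 elsewhere.

```c
#include <elf.h>        /* AT_SYSINFO, AT_SYSINFO_EHDR, ELFMAG, SELFMAG */
#include <string.h>
#include <sys/auxv.h>   /* getauxval, glibc 2.16+ */

/* Address of __kernel_vsyscall, or 0 if the kernel supplied no
 * AT_SYSINFO entry for this process. */
unsigned long kernel_vsyscall_address(void)
{
    return getauxval(AT_SYSINFO);
}

/* Base address of the vDSO image, or 0 if there is none. */
unsigned long vdso_base(void)
{
    return getauxval(AT_SYSINFO_EHDR);
}

/* Returns 1 if the mapping at vdso_base() starts with the ELF magic. */
int vdso_looks_like_elf(void)
{
    unsigned long base = vdso_base();
    return base != 0 && memcmp((const void *)base, ELFMAG, SELFMAG) == 0;
}
```

In a program built with -m32, the value from kernel_vsyscall_address() is the address you would hand to the call instruction.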

You should absolutely never write your own sysenter wrapper function as the convention the kernel uses to enter and leave system calls with sysenter can change and your code will break.

You should always start a sysenter system call by calling through __kernel_vsyscall.

So, let's do that.

Using sysenter system calls with your own assembly

Keeping with our legacy system call example from earlier, we'll call exit with an exit status of 42.

The exit syscall is number 1. According to the interface described above, we just need to move the syscall number into the eax register and the first argument (the exit status) into ebx.

(This example can be simplified, but I thought it would be interesting to make it a bit more wordy than necessary so that anyone who hasn't seen GCC inline assembly before can use this as an example or reference.)

    #include <stdlib.h>
    #include <elf.h>

    int
    main(int argc, char* argv[], char* envp[])
    {
      unsigned int syscall_nr = 1;
      int exit_status = 42;
      Elf32_auxv_t *auxv;

      /* auxiliary vectors are located after the end of the environment
       * variables
       *
       * check this helpful diagram: https://static.lwn.net/images/2012/auxvec.png
       */
      while(*envp++ != NULL);

      /* envp is now pointed at the auxiliary vectors, since we've iterated
       * through the environment variables.
       */
      for (auxv = (Elf32_auxv_t *)envp; auxv->a_type != AT_NULL; auxv++)
      {
        if( auxv->a_type == AT_SYSINFO) {
          break;
        }
      }

      /* NOTE: in glibc 2.16 and higher you can replace the above code with
       * a call to getauxval(3):  getauxval(AT_SYSINFO)
       */

      asm(
          "movl %0,  %%eax    \n"
          "movl %1, %%ebx    \n"
          "call *%2          \n"
          : /* output parameters, we aren't outputting anything, so none */
            /* (none) */
          : /* input parameters mapped to %0 and %1, respectively */
            "m" (syscall_nr), "m" (exit_status), "m" (auxv->a_un.a_val)
          : /* registers that we are "clobbering", unneeded since we are calling exit */
            "eax", "ebx");
    }

Next, compile, execute, and check the exit status:

    $ gcc -m32 -o test test.c
    $ ./test
    $ echo $?
    42

Success! We called the exit system call using the legacy sysenter method without raising a software interrupt.

Kernel-side: sysenter entry point

So now that we've seen how to trigger a system call from a userland program with sysenter via __kernel_vsyscall, let's see how the kernel uses the system call number to execute the system call code.

Recall from the previous section that the kernel registered a syscall handler function called ia32_sysenter_target.

This function is implemented in assembly in arch/x86/ia32/ia32entry.S. Let's take a look at where the value in the eax register is used to execute the system call:

    sysenter_dispatch:
            call    *ia32_sys_call_table(,%rax,8)

This is the same code we saw in the legacy system call mode: a table named ia32_sys_call_table which is indexed into with the system call number.

After all the needed bookkeeping is done, both the legacy system call model and the sysenter system call model use the same mechanism and system call table for dispatching system calls.

Refer to the int $0x80 entry point section to learn where the ia32_sys_call_table is defined and how it is constructed.

And this is how you enter the kernel via a sysenter system call.

Returning from a sysenter system call with sysexit

The kernel can use the sysexit instruction to resume execution back to the user program.

Using this instruction is not as straightforward as using iret. The caller is expected to put the address to return to into the rdx register, and to put the pointer to the program stack to use in the rcx register.

This means that your software must compute the address where execution should be resumed, preserve that value, and restore it prior to calling sysexit.

We can find the code which does this in arch/x86/ia32/ia32entry.S:

    sysexit_from_sys_call:
            andl    $~TS_COMPAT,TI_status+THREAD_INFO(%rsp,RIP-ARGOFFSET)
            /* clear IF, that popfq doesn't enable interrupts early */
            andl  $~0x200,EFLAGS-R11(%rsp)
            movl    RIP-R11(%rsp),%edx              /* User %eip */
            CFI_REGISTER rip,rdx
            RESTORE_ARGS 0,24,0,0,0,0
            xorq    %r8,%r8
            xorq    %r9,%r9
            xorq    %r10,%r10
            xorq    %r11,%r11
            popfq_cfi
            /*CFI_RESTORE rflags*/
            popq_cfi %rcx                           /* User %esp */
            CFI_REGISTER rsp,rcx
            TRACE_IRQS_ON
            ENABLE_INTERRUPTS_SYSEXIT32

ENABLE_INTERRUPTS_SYSEXIT32 is a macro which is defined in arch/x86/include/asm/irqflags.h which contains the sysexit instruction.

And now you know how 32-bit fast system calls work.

64-bit fast system calls

Next up on our journey are 64-bit fast system calls. These system calls use the instructions syscall and sysret to enter and return from a system call, respectively.

syscall/sysret

The documentation in the Intel Instruction Set Reference (very large PDF) explains how the syscall instruction works:

SYSCALL invokes an OS system-call handler at privilege level 0. It does so by loading RIP from the IA32_LSTAR MSR (after saving the address of the instruction following SYSCALL into RCX).

In other words: for the kernel to receive incoming system calls, it must register the address of the code that will execute when a system call occurs by writing its address to the IA32_LSTAR MSR.

We can find that code in the kernel in arch/x86/kernel/cpu/common.c:

    void syscall_init(void)
    {
            /* ... other code ... */
            wrmsrl(MSR_LSTAR, system_call);

Where MSR_LSTAR is defined as 0xc0000082 in arch/x86/include/uapi/asm/msr-index.h.

Much like the legacy software interrupt syscalls, there is a defined convention for making system calls with syscall.

The userland program is expected to put the system call number in the rax register. The arguments to the syscall are expected to be placed in a subset of the general purpose registers.

This is documented in the x86-64 ABI in section A.2.1:

  1. User-level applications use as integer registers for passing the sequence %rdi, %rsi, %rdx, %rcx, %r8 and %r9. The kernel interface uses %rdi, %rsi, %rdx, %r10, %r8 and %r9.
  2. A system-call is done via the syscall instruction. The kernel destroys registers %rcx and %r11.
  3. The number of the syscall has to be passed in register %rax.
  4. System-calls are limited to six arguments, no argument is passed directly on the stack.
  5. Returning from the syscall, register %rax contains the result of the system-call. A value in the range between -4095 and -1 indicates an error, it is -errno.
  6. Only values of class INTEGER or class MEMORY are passed to the kernel.

This is also documented in a comment in arch/x86/kernel/entry_64.S.
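
The rules above can be sketched as a small helper that issues a system call with up to three arguments and hands back the kernel's raw result. The names raw_syscall3 and raw_result_is_error are made up for this example. The inline assembly only applies to x86-64 builds; on anything else the sketch falls back to glibc's syscall function and rebuilds the same return convention.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Issues a system call with up to three arguments and returns the
 * kernel's raw result (a value in -4095..-1 means -errno). */
long raw_syscall3(long nr, long a1, long a2, long a3)
{
#if defined(__x86_64__)
    long ret;
    __asm__ volatile ("syscall"
        : "=a" (ret)                               /* result comes back in rax */
        : "a" (nr), "D" (a1), "S" (a2), "d" (a3)   /* rax, rdi, rsi, rdx */
        : "rcx", "r11", "memory");                 /* destroyed by the kernel */
    return ret;
#else
    /* Not x86-64: let glibc enter the kernel, then undo its errno handling. */
    long ret = syscall(nr, a1, a2, a3);
    return ret == -1 ? -errno : ret;
#endif
}

/* Rule 5: results between -4095 and -1 are negated errno values. */
int raw_result_is_error(long ret)
{
    return ret < 0 && ret >= -4095;
}
```

For example, raw_syscall3(SYS_getpid, 0, 0, 0) returns the process ID, while closing an invalid descriptor with raw_syscall3(SYS_close, -1, 0, 0) returns -EBADF rather than -1, since no glibc wrapper is there to move the error into errno.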

Now that we know how to make a system call and where the arguments should live, let's try to make one by writing some inline assembly.

Using syscall system calls with your own assembly

Building on the previous example, let's build a small C program with inline assembly which executes the exit system call passing the exit status of 42.

First, we need to find the system call number for exit. In this case we need to read the table found in arch/x86/syscalls/syscall_64.tbl:

The exit syscall is number 60. According to the interface described above, we just need to move 60 into the rax register and the first argument (the exit status) into rdi.

Here's a piece of C code with some inline assembly that does this. Like the previous example, this example is more wordy than necessary in the interest of clarity:

    int
    main(int argc, char *argv[])
    {
      unsigned long syscall_nr = 60;
      long exit_status = 42;

      asm ("movq %0, %%rax\n"
           "movq %1, %%rdi\n"
           "syscall"
        : /* output parameters, we aren't outputting anything, so none */
          /* (none) */
        : /* input parameters mapped to %0 and %1, respectively */
          "m" (syscall_nr), "m" (exit_status)
        : /* registers that we are "clobbering", unneeded since we are calling exit */
          "rax", "rdi");
    }

Next, compile, execute, and check the exit status:

    $ gcc -o test test.c
    $ ./test
    $ echo $?
    42

Success! We called the exit system call using the syscall system call method. We avoided raising a software interrupt and (if we were timing a micro-benchmark) it executes much faster.

Kernel-side: syscall entry point

Now we've seen how to trigger a organisation phone call from a userland program, let'southward see how the kernel uses the arrangement call number to execute the system call code.

Recall from the previous department we saw the accost of a function named system_call go written to the LSTAR MSR.

Permit'south take a look at the code for this function and see how information technology uses rax to actually hand off execution to the organisation phone call, from arch/x86/kernel/entry_64.Due south:

                          telephone call *sys_call_table(,%rax,8)  # Xxx:    rip relative          

Much like the legacy system call method, sys_call_table is a tabular array defined in a C file that uses #include to pull in C code generated by a script.

From arch/x86/kernel/syscall_64.c, note the #include at the bottom:

            asmlinkage const sys_call_ptr_t sys_call_table[__NR_syscall_max+1] = {         /*          * Smells like a compiler problems -- it doesn't work          * when the & below is removed.          */         [0 ... __NR_syscall_max] = &sys_ni_syscall, #include <asm/syscalls_64.h> };          

Earlier we saw the syscall table defined in arch/x86/syscalls/syscall_64.tbl. Exactly like the legacy interrupt mode, a script runs at kernel compile time and generates the syscalls_64.h file from the table in syscall_64.tbl.

The code above simply includes the generated C code, producing an array of function pointers indexed by system call number.

And this is how you enter the kernel via a syscall system call.

Returning from a syscall system call with sysret

The kernel can use the sysret instruction to resume execution back where it left off when the user program used syscall.

sysret is simpler than sysexit because the address where execution should be resumed is copied into the rcx register when syscall is used.

As long as you preserve that value somewhere and restore it to rcx before calling sysret, execution will resume where it left off before the call to syscall.

This is convenient because sysenter requires that you compute this address yourself in addition to clobbering an additional register.

We can find the code which does this in arch/x86/kernel/entry_64.S:

    movq RIP-ARGOFFSET(%rsp),%rcx
    CFI_REGISTER    rip,rcx
    RESTORE_ARGS 1,-ARG_SKIP,0
    /*CFI_REGISTER  rflags,r11*/
    movq    PER_CPU_VAR(old_rsp), %rsp
    USERGS_SYSRET64

USERGS_SYSRET64 is a macro defined in arch/x86/include/asm/irqflags.h which contains the sysret instruction.

And now you know how 64-bit fast system calls work.

Calling a syscall semi-manually with syscall(2)

Great, we've seen how to call system calls manually by crafting assembly for a few different system call methods.

Usually, you don't need to write your own assembly. Wrapper functions are provided by glibc that handle all of the assembly code for you.

There are some system calls, however, for which no glibc wrapper exists. One example of a system call like this is futex, the fast userspace locking system call.

But wait, why does no system call wrapper exist for futex?

futex is intended only to be called by libraries, not application code, and thus in order to call futex you must do it by either:

  1. Generating assembly stubs for every platform you want to support
  2. Using the syscall wrapper provided by glibc

If you find yourself in the situation of needing to call a system call for which no wrapper exists, you should definitely choose option 2: use the function syscall from glibc.

Let's use syscall from glibc to call exit with an exit status of 42:

    #include <unistd.h>

    int
    main(int argc, char *argv[])
    {
      unsigned long syscall_nr = 60;
      long exit_status = 42;

      syscall(syscall_nr, exit_status);
    }

Next, compile, execute, and check the exit status:

    $ gcc -o test test.c
    $ ./test
    $ echo $?
    42

Success! We called the exit system call using the syscall wrapper from glibc.

glibc syscall wrapper internals

Let's have a look at the syscall wrapper function we used in the previous example to see how it works in glibc.

From sysdeps/unix/sysv/linux/x86_64/syscall.S:

    /* Usage: long syscall (syscall_number, arg1, arg2, arg3, arg4, arg5, arg6)
       We need to do some arg shifting, the syscall_number will be in
       rax.  */

            .text
    ENTRY (syscall)
            movq %rdi, %rax         /* Syscall number -> rax.  */
            movq %rsi, %rdi         /* shift arg1 - arg5.  */
            movq %rdx, %rsi
            movq %rcx, %rdx
            movq %r8, %r10
            movq %r9, %r8
            movq 8(%rsp),%r9        /* arg6 is on the stack.  */
            syscall                 /* Do the system call.  */
            cmpq $-4095, %rax       /* Check %rax for error.  */
            jae SYSCALL_ERROR_LABEL /* Jump to error handler if error.  */
    L(pseudo_end):
            ret                     /* Return to caller.  */

Earlier we showed an excerpt from the x86_64 ABI document that describes both userland and kernel calling conventions.

This assembly stub is cool because it shows both calling conventions. The arguments passed into this function follow the userland calling convention, but are then moved to a different set of registers to obey the kernel calling convention prior to entering the kernel with syscall.

This is how the glibc syscall wrapper works when you use it to call system calls that do not come with a wrapper by default.

Virtual system calls

We've now covered all the methods of making a system call by entering the kernel and shown how you can make those calls manually (or semi-manually) to transition the system from userland to the kernel.

What if programs could call certain system calls without entering the kernel at all?

That's precisely why the Linux virtual Dynamic Shared Object (vDSO) exists. The Linux vDSO is a set of code that is part of the kernel, but is mapped into the address space of a user program to be run in userland.

The idea is that some system calls can be used without entering the kernel. One such call is gettimeofday.

Programs calling the gettimeofday system call do not actually enter the kernel. They instead make a simple function call to a piece of code that was provided by the kernel, but is run in userland.

No software interrupt is raised, no complicated sysenter or syscall bookkeeping is required. gettimeofday is simply a normal function call.

You can see the vDSO listed as the first entry when you use ldd:

    $ ldd `which bash`
      linux-vdso.so.1 =>  (0x00007fff667ff000)
      libtinfo.so.5 => /lib/x86_64-linux-gnu/libtinfo.so.5 (0x00007f623df7d000)
      libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f623dd79000)
      libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f623d9ba000)
      /lib64/ld-linux-x86-64.so.2 (0x00007f623e1ae000)

Let's see how the vDSO is set up in the kernel.

vDSO in the kernel

You can find the vDSO source in arch/x86/vdso/. There are a few assembly and C source files along with a linker script.

The linker script is a cool thing to take a look at.

From arch/x86/vdso/vdso.lds.S:

    /*
     * This controls what userland symbols we export from the vDSO.
     */
    VERSION {
            LINUX_2.6 {
            global:
                    clock_gettime;
                    __vdso_clock_gettime;
                    gettimeofday;
                    __vdso_gettimeofday;
                    getcpu;
                    __vdso_getcpu;
                    time;
                    __vdso_time;
            local: *;
            };
    }

Linker scripts are pretty useful, but not particularly well known. This linker script arranges the symbols that are going to be exported in the vDSO.

We can see that the vDSO exports 4 different functions, each with two names. You can find the source for these functions in the C files in this directory.

For example, the source for gettimeofday is found in arch/x86/vdso/vclock_gettime.c:

    int gettimeofday(struct timeval *, struct timezone *)
            __attribute__((weak, alias("__vdso_gettimeofday")));

This is defining gettimeofday to be a weak alias for __vdso_gettimeofday.

The __vdso_gettimeofday function in the same file contains the actual source which will be executed in userland when a user program calls the gettimeofday system call.

Locating the vDSO in memory

Due to address space layout randomization, the vDSO will be loaded at a random address when a program is started.

How can user programs find the vDSO if it's loaded at a random address?

If you recall, earlier when examining the sysenter system call method we saw that user programs should call __kernel_vsyscall instead of writing their own sysenter assembly code themselves.

This function is part of the vDSO, too.

The sample code provided located __kernel_vsyscall by searching the ELF auxiliary headers to find a header with type AT_SYSINFO which contained the address of __kernel_vsyscall.

Similarly, to locate the vDSO, a user program can search for an ELF auxiliary header of type AT_SYSINFO_EHDR. It will contain the address of the start of the ELF header for the vDSO that was generated by a linker script.

In both cases, the kernel writes the address into the ELF auxiliary header when the program is loaded. That's how the correct addresses always end up in AT_SYSINFO_EHDR and AT_SYSINFO.

Once that header is located, user programs can parse the ELF object (perhaps using libelf) and call the functions in the ELF object as needed.

This is nice because this means that the vDSO can take advantage of some useful ELF features like symbol versioning.

An example of parsing and calling functions in the vDSO is provided in the kernel documentation in Documentation/vDSO/.

vDSO in glibc

Most of the time, people access the vDSO without knowing it because glibc abstracts this away from them by using the interface described in the previous section.

When a program is loaded, the dynamic linker and loader loads the DSOs that the program depends on, including the vDSO.

glibc stores some data about the location of the vDSO when it parses the ELF headers of the program that is being loaded. It also includes short stub functions that will search the vDSO for a symbol name prior to making an actual system call.

For example, the gettimeofday function in glibc, from sysdeps/unix/sysv/linux/x86_64/gettimeofday.c:

    void *gettimeofday_ifunc (void) __asm__ ("__gettimeofday");

    void *
    gettimeofday_ifunc (void)
    {
      PREPARE_VERSION (linux26, "LINUX_2.6", 61765110);

      /* If the vDSO is not available we fall back on the old vsyscall.  */
      return (_dl_vdso_vsym ("gettimeofday", &linux26)
              ?: (void *) VSYSCALL_ADDR_vgettimeofday);
    }
    __asm (".type __gettimeofday, %gnu_indirect_function");

This code in glibc searches the vDSO for the gettimeofday function and returns the address. This is wrapped up nicely with an indirect function.

That's how programs calling gettimeofday pass through glibc and hit the vDSO, all without switching into kernel mode, incurring a privilege level change, or raising a software interrupt.

And that concludes the showcase of every single system call method available on Linux for 32-bit and 64-bit Intel and AMD CPUs.

glibc system call wrappers

While we're talking about system calls ;) it makes sense to briefly mention how glibc deals with system calls.

For many system calls, glibc just needs a wrapper function where it moves arguments into the proper registers and then executes the syscall or int $0x80 instructions, or calls __kernel_vsyscall.

It does this by using a series of tables defined in text files that are processed with scripts and output C code.

For example, the sysdeps/unix/syscalls.list file describes some common system calls:

    access          -       access          i:si    __access        access
    acct            -       acct            i:S     acct
    chdir           -       chdir           i:s     __chdir         chdir
    chmod           -       chmod           i:si    __chmod         chmod

To learn more about each column, check the comments in the script which processes this file: sysdeps/unix/make-syscalls.sh.

More complex system calls, like exit, which invokes handlers, have actual implementations in C or assembly code and will not be found in a templated text file like this.

Future blog posts will explore the implementation in glibc and the Linux kernel for interesting system calls.

It would be unfortunate not to take this opportunity to mention two fabulous bugs related to system calls in Linux.

So, let's take a look!

CVE-2010-3301

This security exploit allows local users to gain root access.

The cause is a small bug in the assembly code which allows user programs to make legacy system calls on x86-64 systems.

The exploit code is pretty clever: it generates a region of memory with mmap at a particular address and uses an integer overflow to cause this code:

(Remember this code from the legacy interrupts section above?)

            call *ia32_sys_call_table(,%rax,8)          

to hand execution off to an arbitrary address which runs as kernel code and can escalate the running process to root.

Android sysenter ABI breakage

Remember the part about not hardcoding the sysenter ABI in your application code?

Unfortunately, the android-x86 folks made this mistake. The kernel ABI changed and suddenly android-x86 stopped working.

The kernel folks ended up restoring the old sysenter ABI to avoid breaking the Android devices in the wild with stale hardcoded sysenter sequences.

Here's the fix that was added to the Linux kernel. You can find a link to the offending commit in the android source in the commit message.

Remember: never write your own sysenter assembly code. If you have to implement it directly for some reason, use a piece of code like the example above and go through __kernel_vsyscall at the very least.

Conclusion

The system call infrastructure in the Linux kernel is incredibly complex. There are many different methods for making system calls, each with their own advantages and disadvantages.

Calling system calls by crafting your own assembly is generally a bad idea as the ABI may break underneath you. Your kernel and libc implementation will (probably) choose the fastest method for making system calls on your system.

If you can't use the glibc provided wrappers (or if one doesn't exist), you should at the very least use the syscall wrapper function, or try to go through the vDSO provided __kernel_vsyscall.

Stay tuned for future blog posts investigating individual system calls and their implementations.

Source: https://blog.packagecloud.io/the-definitive-guide-to-linux-system-calls/

Posted by: johnsonrunt1953.blogspot.com
