December | 2016 | My Technical Blog

17 Dec

Understanding FPU usage in the Linux kernel

Posted by Peter Teoh in Android.

I wanted to learn how the Linux kernel handles the FPU and SIMD registers (x87/MMX, and the XMM registers used by SSE, SSE2, etc.). This matters for security: if the values of these registers are not properly initialized, they become a channel for leaking memory contents from one task to another. On the other hand, this register state is so large that saving and restoring all of it on every context switch would seriously slow down the kernel.

To understand this, read this comment:

http://lxr.free-electrons.com/source/arch/x86/kernel/fpu/init.c

/*
 * FPU context switching strategies:
 *
 * Against popular belief, we don't do lazy FPU saves, due to the
 * task migration complications it brings on SMP - we only do
 * lazy FPU restores.
 *
 * 'lazy' is the traditional strategy, which is based on setting
 * CR0::TS to 1 during context-switch (instead of doing a full
 * restore of the FPU state), which causes the first FPU instruction
 * after the context switch (whenever it is executed) to fault - at
 * which point we lazily restore the FPU state into FPU registers.
 *
 * Tasks are of course under no obligation to execute FPU instructions,
 * so it can easily happen that another context-switch occurs without
 * a single FPU instruction being executed. If we eventually switch
 * back to the original task (that still owns the FPU) then we have
 * not only saved the restores along the way, but we also have the
 * FPU ready to be used for the original task.
 *
 * 'lazy' is deprecated because it's almost never a performance win
 * and it's much more complicated than 'eager'.
 *
 * 'eager' switching is by default on all CPUs, there we switch the FPU
 * state during every context switch, regardless of whether the task
 * has used FPU instructions in that time slice or not. This is done
 * because modern FPU context saving instructions are able to optimize
 * state saving and restoration in hardware: they can detect both
 * unused and untouched FPU state and optimize accordingly.

http://hypervsir.blogspot.sg/2014/10/an-os-kernel-bug-in-windows-81-32-bit-os.html



In summary:

a. LAZY mode: the FPU state is not restored on every context switch. Instead, the kernel sets CR0.TS during the switch, so the first FPU instruction the new task executes faults, and only then is that task's FPU state restored into the registers. A task that never touches the FPU never pays for a restore. Note that, per the comment, only restores are lazy, not saves. This mode is deprecated and is not the default: it is almost never a performance win, and it is much more complicated than eager mode.

b. EAGER mode: this is the default on all CPUs. The FPU state is switched on every context switch, whether or not the task used the FPU in that time slice. This is affordable because modern FPU save/restore instructions are optimized in hardware: they detect which parts of the FPU state are unused or untouched and skip them, so only the state that actually needs it is written out or reloaded.
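The difference between the two strategies can be sketched as a toy user-space model. This is not kernel code: the struct toy_cpu type and all function names are hypothetical, the "ts" field only imitates CR0.TS, and the model counts save/restore operations rather than their real cost.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one CPU; every name here is hypothetical, not kernel code. */
struct toy_cpu {
    bool ts;        /* models CR0.TS: the next FPU instruction will fault */
    int fpu_owner;  /* task whose state is live in the registers, -1 if none */
    int current;    /* task that is currently running */
    int saves;      /* FPU state saves performed so far */
    int restores;   /* FPU state restores performed so far */
};

void toy_init(struct toy_cpu *cpu)
{
    cpu->ts = true;
    cpu->fpu_owner = -1;
    cpu->current = 0;
    cpu->saves = 0;
    cpu->restores = 0;
}

/* Eager: save the outgoing state and restore the incoming one on every switch. */
void eager_switch(struct toy_cpu *cpu, int next)
{
    cpu->saves++;
    cpu->restores++;
    cpu->current = next;
    cpu->fpu_owner = next;
    cpu->ts = false;
}

/* Lazy restore: the save is not deferred, but the restore is. */
void lazy_switch(struct toy_cpu *cpu, int next)
{
    if (!cpu->ts)       /* outgoing task used the FPU in this time slice */
        cpu->saves++;
    cpu->current = next;
    cpu->ts = true;     /* arm the fault for the incoming task */
}

/* Called when the current task executes an FPU instruction. */
void lazy_fpu_use(struct toy_cpu *cpu)
{
    if (!cpu->ts)
        return;                          /* TS clear: no fault, nothing to do */
    cpu->ts = false;                     /* the fault handler clears TS */
    if (cpu->fpu_owner != cpu->current) {
        cpu->restores++;                 /* registers hold someone else's state */
        cpu->fpu_owner = cpu->current;
    }
}

/* Usage: task 0 uses the FPU, tasks 1 and 2 do not, then task 0 runs again. */
void toy_demo(void)
{
    struct toy_cpu lazy, eager;

    toy_init(&lazy);
    toy_init(&eager);

    lazy_fpu_use(&lazy);
    lazy_switch(&lazy, 1);
    lazy_switch(&lazy, 2);
    lazy_switch(&lazy, 0);
    lazy_fpu_use(&lazy);                 /* task 0 still owns the registers */
    assert(lazy.saves == 1 && lazy.restores == 1);

    eager_switch(&eager, 1);
    eager_switch(&eager, 2);
    eager_switch(&eager, 0);
    assert(eager.saves == 3 && eager.restores == 3);
}
```

In this model the lazy variant does fewer operations, which is the original motivation for it. The kernel comment's point is that on modern CPUs each eager operation is cheap enough in hardware that the extra bookkeeping and fault handling of the lazy variant rarely pays off.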

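In the kernel, in arch/x86/boot/cpuflags.c, has_fpu() checks whether an FPU is present at all (not whether it is currently in use). It starts by testing the EM and TS bits of CR0, either of which would make FPU instructions trap: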

if (cr0 & (X86_CR0_EM|X86_CR0_TS)) {

which is called by get_cpuflags():

void get_cpuflags(void)
{
        /* ... */
        if (has_fpu())
                set_bit(X86_FEATURE_FPU, cpu.flags);
        /* ... */
}
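To make the CR0 test above concrete, here is a minimal user-space sketch of the same bit arithmetic. The bit positions (EM is bit 2, TS is bit 3) are architectural; the helper names cr0_blocks_fpu() and cr0_unblock_fpu() and the sample CR0 value are assumptions made for illustration, and the code only manipulates an ordinary integer since real CR0 can only be read in ring 0.

```c
#include <assert.h>
#include <stdbool.h>

/* Architectural CR0 bit positions: EM is bit 2, TS is bit 3. */
#define X86_CR0_EM (1UL << 2)   /* emulation: x87 instructions trap */
#define X86_CR0_TS (1UL << 3)   /* task switched: next FPU instruction traps */

/* Hypothetical helper: true if FPU instructions would trap with this CR0. */
bool cr0_blocks_fpu(unsigned long cr0)
{
    return (cr0 & (X86_CR0_EM | X86_CR0_TS)) != 0;
}

/* Hypothetical helper: the CR0 value with both blocking bits cleared. */
unsigned long cr0_unblock_fpu(unsigned long cr0)
{
    return cr0 & ~(X86_CR0_EM | X86_CR0_TS);
}

/* Usage with an illustrative CR0 value (PG, AM, WP, NE, ET, MP, PE set). */
void cr0_demo(void)
{
    unsigned long cr0 = 0x80050033UL;

    assert(!cr0_blocks_fpu(cr0));
    assert(cr0_blocks_fpu(cr0 | X86_CR0_TS));
    assert(cr0_blocks_fpu(cr0 | X86_CR0_EM));
    assert(cr0_unblock_fpu(cr0 | X86_CR0_TS | X86_CR0_EM) == cr0);
}
```

The same TS bit is what lazy mode sets on a context switch, which is why a boot-time probe has to make sure it is clear before touching the FPU.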

The following is a series of 208 patches from 2015 for the FPU code in the kernel:

https://lwn.net/Articles/643235/

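The instructions that save the FPU state are FSAVE, FNSAVE and FXSAVE. FSAVE and FNSAVE save the x87 state (which the MMX registers alias), while FXSAVE also saves the XMM registers used by SSE, SSE2, etc.: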

http://x86.renejeschke.de/html/file_module_x86_id_128.html

The overhead in the Linux kernel is benchmarked at 87 cycles:

https://lwn.net/Articles/643235/

This optimized way of saving is also described in the kernel comment below:

 * When executing XSAVEOPT (or other optimized XSAVE instructions), if
 * a processor implementation detects that an FPU state component is still
 * (or is again) in its initialized state, it may clear the corresponding
 * bit in the header.xfeatures field, and can skip the writeout of registers
 * to the corresponding memory layout.
 *
 * This means that when the bit is zero, the state component might still contain
 * some previous - non-initialized register state.
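The consequence described in that comment (a cleared xfeatures bit means the memory image may hold stale data) can be illustrated with a toy model. Everything here is hypothetical: the component count, the component size, the function names, and the simplification that the "initialized" state is all zeros, which is not true of the real x87 and SSE init states. It is a sketch of the idea, not of the actual XSAVEOPT behaviour or of fpstate_sanitize_xstate().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_NCOMP 3   /* hypothetical number of state components */
#define TOY_WORDS 4   /* hypothetical size of each component, in words */

struct toy_regs {                 /* the live registers */
    uint32_t comp[TOY_NCOMP][TOY_WORDS];
};

struct toy_save_area {            /* the in-memory image */
    uint64_t xfeatures;           /* models header.xfeatures */
    uint32_t comp[TOY_NCOMP][TOY_WORDS];
};

/* Simplification: a component is "initialized" when it is all zeros. */
static int toy_comp_is_init(const struct toy_regs *r, int i)
{
    static const uint32_t zero[TOY_WORDS];
    return memcmp(r->comp[i], zero, sizeof(zero)) == 0;
}

/* Optimized save: skip initialized components and clear their header bit. */
void toy_save_optimized(const struct toy_regs *r, struct toy_save_area *a)
{
    for (int i = 0; i < TOY_NCOMP; i++) {
        if (toy_comp_is_init(r, i)) {
            a->xfeatures &= ~(UINT64_C(1) << i);  /* memory left untouched */
        } else {
            memcpy(a->comp[i], r->comp[i], sizeof(a->comp[i]));
            a->xfeatures |= UINT64_C(1) << i;
        }
    }
}

/* Sanitize: overwrite the memory of every skipped component with init values. */
void toy_sanitize(struct toy_save_area *a)
{
    for (int i = 0; i < TOY_NCOMP; i++)
        if (!(a->xfeatures & (UINT64_C(1) << i)))
            memset(a->comp[i], 0, sizeof(a->comp[i]));
}

/* Usage: a component is used, saved, reset, and saved again. */
void toy_xsave_demo(void)
{
    struct toy_regs regs;
    struct toy_save_area area;

    memset(&regs, 0, sizeof(regs));
    memset(&area, 0, sizeof(area));

    regs.comp[1][0] = 0xdeadbeef;
    toy_save_optimized(&regs, &area);
    assert(area.xfeatures == 0x2 && area.comp[1][0] == 0xdeadbeef);

    memset(regs.comp[1], 0, sizeof(regs.comp[1]));   /* back to init state */
    toy_save_optimized(&regs, &area);
    assert(area.xfeatures == 0);
    assert(area.comp[1][0] == 0xdeadbeef);           /* stale data remains */

    toy_sanitize(&area);
    assert(area.comp[1][0] == 0);
}
```

This is why the memory image has to be cleaned up before it is exposed, for example when it is copied to user space: code that reads the buffer without consulting xfeatures would otherwise see the stale values.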

To see when the kernel acts on FPU usage, we can set a breakpoint on fpstate_sanitize_xstate in KGDB. The kernel stack trace is as follows:

Thread 441 hit Breakpoint 1, fpstate_sanitize_xstate (fpu=0xffff8801e7a2ea80) at /build/linux-FvcHlK/linux-4.4.0/arch/x86/kernel/fpu/xstate.c:111
111 {
#0  fpstate_sanitize_xstate (fpu=0xffff8801e7a2ea80) at /build/linux-FvcHlK/linux-4.4.0/arch/x86/kernel/fpu/xstate.c:111
#1  0xffffffff8103b183 in copy_fpstate_to_sigframe (buf=0xffff8801e7a2ea80, buf_fx=0x7f73ad4fe3c0, size=) at /build/linux-FvcHlK/linux-4.4.0/arch/x86/kernel/fpu/signal.c:178
#2  0xffffffff8102e207 in get_sigframe (frame_size=440, fpstate=0xffff880034dcbe10, regs=, ka=) at /build/linux-FvcHlK/linux-4.4.0/arch/x86/kernel/signal.c:247
#3  0xffffffff8102e703 in __setup_rt_frame (regs=, set=, ksig=, sig=) at /build/linux-FvcHlK/linux-4.4.0/arch/x86/kernel/signal.c:413
#4  setup_rt_frame (regs=, ksig=) at /build/linux-FvcHlK/linux-4.4.0/arch/x86/kernel/signal.c:627
#5  handle_signal (regs=, ksig=) at /build/linux-FvcHlK/linux-4.4.0/arch/x86/kernel/signal.c:671
#6  do_signal (regs=0xffff880034dcbf58) at /build/linux-FvcHlK/linux-4.4.0/arch/x86/kernel/signal.c:714
#7  0xffffffff8100320c in exit_to_usermode_loop (regs=0xffff880034dcbf58, cached_flags=4) at /build/linux-FvcHlK/linux-4.4.0/arch/x86/entry/common.c:248
#8  0xffffffff81003c6e in prepare_exit_to_usermode (regs=) at /build/linux-FvcHlK/linux-4.4.0/arch/x86/entry/common.c:283

You can use “info thread 441” (the thread number shown above) to check which process the stack trace belongs to. One of them is “Xorg”; the majority of processes do not use the FPU.

From the stack trace, “get_sigframe()” is the first function that appears to check for FPU usage:

static void __user *
get_sigframe(struct k_sigaction *ka, struct pt_regs *regs, size_t frame_size,
             void __user **fpstate)
{
        /* ... */
        if (fpu->fpstate_active) {
                unsigned long fx_aligned, math_size;

                sp = fpu__alloc_mathframe(sp, 1, &fx_aligned, &math_size);
                *fpstate = (struct _fpstate_32 __user *) sp;
                if (copy_fpstate_to_sigframe(*fpstate, (void __user *)fx_aligned,
                                    math_size) < 0)
                        return (void __user *) -1L;
        }
        /* ... */
}

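So essentially, when a signal is delivered to a task that has active FPU state, the kernel reserves room for that state on the userspace stack (pointed to by “sp”) and copies the FPU state into the signal frame there.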
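The stack reservation step can be sketched in user space as plain pointer arithmetic. This is a simplified model and an assumption about what fpu__alloc_mathframe() does, not a copy of it: the helper name reserve_fpu_frame() is hypothetical, and the sketch only captures the idea that the stack grows downward, so the frame is carved out below sp and rounded down to the alignment that the save instruction needs (64 bytes for an XSAVE area, 16 bytes for FXSAVE).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: reserve frame_size bytes below sp and round the
 * result down to a power-of-two alignment. Returns the new stack pointer,
 * which is also the start address of the reserved FPU frame.
 */
uintptr_t reserve_fpu_frame(uintptr_t sp, uintptr_t frame_size, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);  /* power of two */
    assert(sp >= frame_size);                          /* no underflow */

    return (sp - frame_size) & ~(align - 1);
}

/* Usage: a 512-byte FXSAVE-sized frame, aligned as an XSAVE area would be. */
void frame_demo(void)
{
    uintptr_t sp = 0x1007;
    uintptr_t frame = reserve_fpu_frame(sp, 512, 64);

    assert(frame == 0xE00);          /* 0x1007 - 0x200 = 0xE07, rounded down */
    assert(frame % 64 == 0);
    assert(frame + 512 <= sp);       /* the frame lies entirely below sp */
}
```

Rounding down rather than up matters here: it can only move the frame further below the old stack pointer, so the reserved region never overlaps data already on the stack.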