Internals of ptrace(): from the malware perspective

After reading a few questions on ptrace():

https://stackoverflow.com/questions/24573247/ptrace-how-can-i-stop-getting-traced-in-child-process

https://stackoverflow.com/questions/29804846/where-would-the-cpu-context-interrupted-by-ptrace-be-userspace-stack-or-kernel?rq=1

https://stackoverflow.com/questions/27930589/when-using-getregs-does-ptrace-get-only-userspace-stack-rsp-or-both-kernel-and/48626668#48626668

I realized there is a widespread lack of understanding of the power and depth of ptrace().

First I shall describe the power of ptrace():

a. When you ptrace() another program (called the debuggee), the debuggee is essentially stopped completely (in userspace), letting you do whatever you want with its registers, its memory, and essentially everything else in userspace.

http://man7.org/linux/man-pages/man2/ptrace.2.html

b. You can safely do anything that is read-only. But if you modify the registers or memory, be aware of the consequences.

c. Malware writers like to use ptrace(), but why and how?

This is because with ptrace() you can redirect the debuggee to call malloc(), and, given the newly allocated memory, insert a new program into it (making sure the pages are marked readable and executable), redirect some existing code to that memory for execution, and then let the debuggee continue execution.

There are a lot more details than that, and a complete implementation requires many ptrace() calls. Various names are given to this technique; library injection is one of them (a rough sketch follows the two links below).

https://github.com/gaffe23/linux-inject

https://shunix.com/shared-library-injection-in-android/   (Android and other Linux systems share essentially the same kernel.)
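As a rough illustration of the redirection step only, here is a minimal sketch, assuming x86-64 Linux. The target_addr and payload arguments are hypothetical placeholders: a real injector like linux-inject first makes the debuggee call mmap()/dlopen() to obtain executable memory, and restores the saved registers afterwards.

#include <string.h>
#include <sys/ptrace.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <sys/types.h>

/* copy a payload into the stopped debuggee and point rip at it */
static void inject_and_redirect(pid_t pid, unsigned long target_addr,
                                const unsigned char *payload, size_t len)
{
    struct user_regs_struct regs;
    int status;
    size_t i;

    ptrace(PTRACE_ATTACH, pid, NULL, NULL);     /* debuggee stops... */
    waitpid(pid, &status, 0);                   /* ...we wait for the stop */
    ptrace(PTRACE_GETREGS, pid, NULL, &regs);   /* save its context */

    for (i = 0; i < len; i += sizeof(long)) {   /* write one word at a time */
        long word = 0;
        size_t n = (len - i < sizeof(long)) ? len - i : sizeof(long);
        memcpy(&word, payload + i, n);
        ptrace(PTRACE_POKETEXT, pid, (void *)(target_addr + i), (void *)word);
    }

    regs.rip = target_addr;                     /* redirect execution (x86-64) */
    ptrace(PTRACE_SETREGS, pid, NULL, &regs);
    ptrace(PTRACE_DETACH, pid, NULL, NULL);     /* debuggee resumes at the payload */
}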

d. A syscall, as seen from userspace, is an atomic operation: if it starts in userspace, it will end up back in userspace. It should not remain stuck inside the kernel, unless it encounters some blocking kernel function along the way. An example is kmalloc(), which can sleep; when it does, the executing thread is immediately put on a wait queue and the next runnable task is scheduled on the CPU.

e. For security reasons, the kernel executes at ring 0 and userspace at ring 3. So by design, userspace will never be able to see any of the register or memory values IN THE KERNEL. Anything to be displayed must be explicitly read and transferred from the kernel to userspace.

f. The relationship between the debugger and the debuggee is a highly interlocked one: when one is executing, you have to ensure the other is blocked and waiting for an event. That is, after ptrace()-ing the debuggee, the debugger immediately goes into a waiting mode (blocking), so that when the debuggee hits a stopping condition, control returns to the debugger; only if the debugger is already waiting will it catch that event. And vice versa; it can get very complicated alternating between waiting and continuing execution, as the sketch below shows.
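Here is a minimal sketch of that handshake, assuming x86-64 Linux: the child volunteers to be traced with PTRACE_TRACEME and stops at execve(), while the parent blocks in waitpid() and then does a read-only inspection of the stopped child's registers (the safe case of point b).

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ptrace.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <sys/types.h>

int main(void)
{
    pid_t child = fork();

    if (child == 0) {
        /* debuggee: volunteer to be traced, then exec; the execve()
         * delivers a SIGTRAP and the child stays stopped until the
         * tracer lets it continue */
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);
        execl("/bin/ls", "ls", (char *)NULL);
        _exit(1);
    }

    int status;
    /* debugger: block until the debuggee stops (the interlocking wait) */
    waitpid(child, &status, 0);
    if (WIFSTOPPED(status)) {
        struct user_regs_struct regs;
        /* read-only inspection of the stopped debuggee */
        ptrace(PTRACE_GETREGS, child, NULL, &regs);
        printf("debuggee stopped at rip = 0x%llx\n",
               (unsigned long long)regs.rip);
    }
    /* resume the debuggee and wait for it to finish */
    ptrace(PTRACE_CONT, child, NULL, NULL);
    waitpid(child, &status, 0);
    return 0;
}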

References:

c – How to detect if the current process is being run by GDB? – Stack Overflow

https://stackoverflow.com/questions/3596781/how-to-detect-if-the-current-process-is-being-run-by-gdb

Accessing data in the RAM of one process from second process

https://stackoverflow.com/questions/27853198/accessing-data-in-the-ram-of-one-process-from-second-process/27853327#27853327
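The first reference above boils down to the classic self-ptrace trick: a process can have only one tracer, so if PTRACE_TRACEME fails, a debugger (gdb, strace, etc.) is already attached. A minimal version:

#include <stdio.h>
#include <sys/ptrace.h>

int main(void)
{
    /* only one tracer may attach at a time, so failure here
     * means some debugger is already tracing us */
    if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1)
        printf("being traced: debugger detected\n");
    else
        printf("not being traced\n");
    return 0;
}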


How to extend the filesystem size in CentOS (default is XFS) running as a VMware guest?

sudo fdisk /dev/sdc -> /dev/sdc1

Then issue "pvcreate /dev/sdc1".

This will add the disk as an LVM physical volume; checking with pvdisplay, notice that the "VG Name" is not assigned yet.

vgextend centos /dev/sdc1 -> after this, the additional 59.5 GB is displayed.

And "sudo pvdisplay" will show that /dev/sdc1 is assigned to the VG group "centos".

But lvdisplay still shows the old values.

Finally, "lvextend -l +100%FREE /dev/centos/root" and this will extend the VG extents to the 57G limit.

Finally, boot into rescue mode and do the following (this is for an ext4 filesystem):

resize2fs /dev/mapper/centos-root

But if your filesystem is XFS, which is the default for CentOS 7.2 (1511):

The rootfs is running XFS, and XFS does allow online resizing.

So, without any reboot or rescue mode, the rootfs can be properly resized by the command:

xfs_growfs -d /dev/mapper/centos-root

The option "-d" grows the filesystem to the maximum possible size.

So in summary, with LVM only the last step differs between the XFS and ext4 filesystems: the xfs_growfs or resize2fs command.

How does kcovtrace work (in the Linux kernel)?

We are all familiar with how strace works: it uses the ptrace() system call to attach to a process and then stops it at the entry and exit of every system call.

So how does kcovtrace work?

The tool is available here:

https://github.com/google/syzkaller/blob/master/tools/kcovtrace/kcovtrace.c

It requires the kernel to be compiled with CONFIG_KCOV.

When we apply it to a binary like "/bin/ls" and let it execute, we get a list of kernel addresses as output:

0xffffffff81a109e4
0xffffffff81a109da
0xffffffff81a109e4
0xffffffff81a109da
0xffffffff81a109e4

To identify the address 0xffffffff81a109e4, take the Linux kernel image file "vmlinux" and disassemble it via "objdump -d vmlinux":

[screenshot: objdump disassembly around 0xffffffff81a109e4, showing the call to <__sanitizer_cov_trace_pc>]

As we can see, the address is the return address immediately following the call to the function <__sanitizer_cov_trace_pc>.

This call is inserted into every basic block via the GCC plugin mechanism when the Linux kernel is compiled with GCC plugins enabled and CONFIG_KCOV=y.

Inside kernel/kcov.c, you can see that the address of each basic block executed in the kernel is appended to the "area[]" array, which is shared with userspace and set up and retrieved through the /sys/kernel/debug/kcov interface (via ioctl() and mmap()).

The recording hook itself lives in kernel/kcov.c:
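Below is a simplified sketch of that hook (illustrative only, adapted from kernel/kcov.c; the real function has additional mode checks and compiler barriers):

#include <linux/kernel.h>
#include <linux/sched.h>
#include <linux/kcov.h>

/* compiler-inserted hook: record the caller's PC into the per-task
 * coverage buffer shared with userspace; area[0] holds the PC count */
void notrace __sanitizer_cov_trace_pc(void)
{
        struct task_struct *t = current;
        unsigned long *area, pos;

        if (!t || !t->kcov_area)
                return;
        area = t->kcov_area;
        pos = area[0] + 1;
        if (pos < t->kcov_size) {
                area[pos] = _RET_IP_;   /* the address right after the call */
                area[0] = pos;
        }
}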

Since only the process that explicitly opens the /sys/kernel/debug/kcov interface receives the output, KCOV coverage is therefore specific to the process that opened the file descriptor.

(Unlike FTRACE, where one process can open the interface, another process can generate trace events in the kernel, and a final process can retrieve them by reading the trace buffer.)
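Putting it together, this is essentially what kcovtrace does in userspace; the sketch below follows the usage pattern in the kernel's KCOV documentation (Documentation/dev-tools/kcov.rst), assuming CONFIG_KCOV=y and debugfs mounted at /sys/kernel/debug:

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

#define KCOV_INIT_TRACE _IOR('c', 1, unsigned long)
#define KCOV_ENABLE     _IO('c', 100)
#define KCOV_DISABLE    _IO('c', 101)
#define COVER_SIZE      (64 << 10)   /* buffer size, in recorded PCs */

int main(void)
{
    int fd = open("/sys/kernel/debug/kcov", O_RDWR);
    if (fd == -1)
        exit(1);
    /* size the per-task coverage buffer */
    if (ioctl(fd, KCOV_INIT_TRACE, COVER_SIZE))
        exit(1);
    /* the buffer is shared with the kernel via mmap; cover[0] is the PC count */
    unsigned long *cover = mmap(NULL, COVER_SIZE * sizeof(unsigned long),
                                PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (cover == MAP_FAILED)
        exit(1);
    /* coverage is collected only for this task, between ENABLE and DISABLE */
    if (ioctl(fd, KCOV_ENABLE, 0))   /* 0 == KCOV_TRACE_PC */
        exit(1);
    __atomic_store_n(&cover[0], 0, __ATOMIC_RELAXED);

    read(-1, NULL, 0);               /* the syscall whose kernel path we trace */

    unsigned long n = __atomic_load_n(&cover[0], __ATOMIC_RELAXED);
    for (unsigned long i = 0; i < n; i++)
        printf("0x%lx\n", cover[i + 1]);   /* kernel PCs, like kcovtrace's output */
    ioctl(fd, KCOV_DISABLE, 0);
    munmap(cover, COVER_SIZE * sizeof(unsigned long));
    close(fd);
    return 0;
}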

Having fun with funcgraph, functrace, and funccount

The utility tool "funcgraph" is a script based on FTRACE, and in Ubuntu it is available through the following package:

apt-file search funcgraph

perf-tools-unstable: /usr/bin/funcgraph
perf-tools-unstable: /usr/share/doc/perf-tools-unstable/examples/funcgraph_example.txt.gz

And you can list all the binaries inside perf-tools-unstable to see the rest of the tool collection.

First, trying to understand kmem_cache_alloc() calls (execute via "sudo", as root privilege is needed):

(sudo funcgraph -m 100 "kmem_cache_alloc")

or understanding "scheduler":

(funcgraph -m 100 "schedule*")

Or trying to understand ext4* calls:

(funcgraph -m 100 "ext4*")

And then there are the "functrace" calls:

functrace 'tcp*'

The output of the command "funccount -t 10 'tcp*'" could not be generated, as my kernel lacks the required kernel configuration.

And then there is "tpoint", for tracing kernel tracepoints.

References:

http://www.brendangregg.com/Slides/LISA2014_LinuxPerfAnalysisNewTools.pdf

How to understand a crashdump and identify the source code in the Linux kernel?

Problem:

I got this in my crash dump; what does it mean? How do I identify exactly where in the assembly listing this dump comes from?

[   90.984131] dump_stack+0xe6/0x15b
[   90.984346] ? _atomic_dec_and_lock+0x165/0x165
[   90.984628] ? do_raw_spin_lock+0xf7/0x1e0
[   90.984899] ubsan_epilogue+0x12/0x53
[   90.985131] __ubsan_handle_type_mismatch+0x1f1/0x2c6
[   90.985443] ? ubsan_epilogue+0x53/0x53
[   90.985683] ? debug_lockdep_rcu_enabled+0x26/0x40
[   90.985979] ? ubsan_epilogue+0x53/0x53
[   90.986219] ? __build_skb+0x87/0x320
[   90.986452] dev_gro_receive+0x1b2b/0x1c00
[   90.986805] ? trace_event_raw_event_fib_table_lookup_nh+0x8/0x4b0
[   90.987205] napi_gro_receive+0xa9/0x410
[   90.987446] e1000_clean_rx_irq+0x76f/0x1670
[   90.987709] ? e1000_alloc_jumbo_rx_buffers+0xd80/0xd80
[   90.988052] e1000_clean+0x8c2/0x2b60
[   90.988284] ? trace_hardirqs_on+0xd/0x10
[   90.988527] ? __lock_acquire+0xf4/0x39c0
[   90.988794] ? e1000_clean_tx_ring+0x3c0/0x3c0
[   90.989076] ? __ubsan_handle_type_mismatch+0x29e/0x2c6
[   90.989396] ? time_hardirqs_on+0x2a/0x380
[   90.989712] ? ubsan_epilogue+0x53/0x53
[   90.989947] ? net_rx_action+0x22e/0xe90
[   90.990207] net_rx_action+0x818/0xe90
[   90.990438] ? napi_complete_done+0x440/0x440
[   90.990726] ? debug_lockdep_rcu_enabled+0x26/0x40
[   90.991050] ? __do_softirq+0x18a/0xbc4
[   90.991293] ? time_hardirqs_on+0x2a/0x380
[   90.991560] ? do_raw_spin_trylock+0xd0/0xd0
[   90.991818] ? __do_softirq+0x18a/0xbc4
[   90.992063] __do_softirq+0x1ae/0xbc4
[   90.992336] irq_exit+0x149/0x1d0
[   90.992540] do_IRQ+0x92/0x1b0
[   90.992730] common_interrupt+0x90/0x90
[   90.992963] RIP: 0010:ext4_da_write_begin+0x5fa/0xee0

First, I will assume you know that the addresses in the crashdump above are all return addresses of a "call xxx" instruction. So if you map each one to the assembly listing of the "vmlinux" file generated when the Linux kernel is built, you will see a "call xxx" just before each of the addresses above.

The cause of the crashdump is somewhere near the top: it is what triggered the check in the UBSAN-compiled kernel, and thus led to the dump_stack() call.

Therefore start reading from the top:

First, we need to convert the "vmlinux" file into its assembly listing:

objdump -d vmlinux > vmlinux.S

Then go directly to the "dump_stack" symbol:

So the address is “0xffffffff81e0ae65” for dump_stack():

dump_stack+0xe6/0x15b ==> 0xffffffff81e0ae65 + 0xe6 = 0xffffffff81e0af4b

So now looking at “0xffffffff81e0af4b”:

So we know the dumped address is the address immediately after the "call show_stack" instruction.

Therefore, using “addr2line -e vmlinux 0xffffffff81e0af4b” we get:

/home/tteikhua/linux-4.12/lib/dump_stack.c:54

Now we go directly to the source:

So line 54 is exactly the line after the "__dump_stack()" call, and we can now do the same for all the other calls, automated by the script below:

# extract "function+offset" tokens from the crash log, one per line:
# strip everything up to the last space, carriage returns, the "/0x..."
# size suffix, and split "func+0xNN" into two fields
cat crash.addr | sed 's/.* //' | tr -d '\r' | sed 's/\/.*//' | sed 's/+/ /' > /tmp/tmp$$
backtrace=20
forwardtrace=40
cat /tmp/tmp$$ | while read func offset
do
    # locate the symbol's base address in the objdump listing,
    # e.g. "ffffffff81e0ae65 <dump_stack>:"
    addr=`grep "<${func}>:$" vmlinux.S | sed 's/ .*//'`
    # add the crash line's offset to the base address
    # (printf %x prints the sum back as unsigned hex)
    addr1=`printf '%x' $((0x$addr + offset))`
    output=`addr2line -e ./vmlinux 0x$addr1 | sed 's/ .*//'`
    echo "${output}"
    source=`echo "$output" | tail -1`
    filename=`echo $source | sed 's/:.*//'`
    line=`echo $source | sed 's/.*://'`
    line1=`expr ${line} - ${backtrace}`
    echo "==========================================================="
    # print the surrounding source with line numbers, marking the
    # faulting line with ">>"
    cat $filename | nl -ba | sed -n "${line1},+${forwardtrace}p" | sed "${backtrace}s/^/>>/"
    echo "==========================================================="
done

As the original version of this script contained some control codes, the attached file is better (output as below):

https://drive.google.com/open?id=0B1hg6_FvnEa8cGpvZEdxTFNCVDA

https://drive.google.com/open?id=0B1hg6_FvnEa8MHEwNU10VDRXQzA

References:

http://helenfornazier.blogspot.sg/2015/07/linux-kernel-memory-corruption-debug.html

https://serverfault.com/questions/605946/kernel-stack-trace-to-source-code-lines

https://stackoverflow.com/questions/6151538/addr2line-on-kernel-module

http://elinux.org/Addr2line_for_kernel_debugging

http://www.linuxdevcenter.com/cmd/cmd.csp?path=a/addr2line

https://plus.google.com/+TheodoreTso/posts/h8H9z4EopkS

How to extend the disk size in my VMware guest running CentOS 7

To increase the disk size inside a VMware guest running CentOS 7 (7.3, 1611), these are the steps needed:

a. Remove all snapshots from your VMware image, or, if you cannot afford to remove the snapshots, shut down the guest OS and use "clone from snapshot" (you can select any specific snapshot to clone) to create a new VMware image. This new image is totally independent and will not have any of the snapshots.

b. Now go to VMware Settings -> Hardware -> "Hard Disk" and select the "Expand Disk" option.

c. After expanding disk, reboot into your CentOS.

d. Issue "sudo fdisk /dev/sda" to repartition /dev/sda inside the guest OS.

And here is the partition deletion and recreation part (be careful): the starting block number must stay the same, but the end will default to the largest possible block number.

Now remember to REBOOT, as the partition table will not be updated until after reboot.

And now check using pvdisplay: it is not updated yet, still 75 GB. Use "pvresize" to extend it.

So now the PV occupies 200 GB.

Now check with lvdisplay: it still shows the old LV size.

So we need to extend the LV to 200 GB, again with "lvextend -l +100%FREE /dev/centos/root".

Followed by xfs_growfs (on the root logical volume, as in the earlier XFS post).

Check now, and the filesystem shows the new size.


Linux kernel command-line bootup options

In your /boot/grub/grub.cfg kernel bootup setup, there is a line where you can add many kernel options (the line beginning with "linux").

Or, if you start your kernel using QEMU, the place to add the kernel options is the "-append" argument:

qemu-system-x86_64 -m 3048 -net nic -net user,host=10.0.2.10,hostfwd=tcp::11398-:22 -display none -serial stdio -no-reboot -numa node,nodeid=0,cpus=0-1 -numa node,nodeid=1,cpus=2-3 -smp sockets=2,cores=2,threads=1 -enable-kvm -usb -usbdevice mouse -usbdevice tablet -soundhw all -hda /home/user/syz4123/wheezy_4.12.img -snapshot -initrd /home/tteikhua/syz4129/initrd -kernel /home/tteikhua/syz4129/bzImage -append "console=ttyS0 slub_debug=FZ vsyscall=native rodata=n oops=panic nmi_watchdog=panic panic_on_warn=1 panic=86400 ftrace_dump_on_oops=orig_cpu earlyprintk=serial net.ifnames=0 biosdevname=0 kvm-intel.nested=1 kvm-intel.unrestricted_guest=1 kvm-intel.vmm_exclusive=1 kvm-intel.fasteoi=1 kvm-intel.ept=1 kvm-intel.flexpriority=1 kvm-intel.vpid=1 kvm-intel.emulate_invalid_guest_state=1 kvm-intel.eptad=1 kvm-intel.enable_shadow_vmcs=1 kvm-intel.pml=1 kvm-intel.enable_apicv=1 root=/dev/sda"

Or you can add it inside the /etc/default/grub file by modifying the "GRUB_CMDLINE_LINUX_DEFAULT" parameter, then run "update-grub" to regenerate the /boot/grub/grub.cfg file.

In all these scenarios, what are the available kernel options you can enable?

At the latest count, there are about 460+ kernel options available (the full list is here: https://tthtlc.wordpress.com/a-list-of-kernel-command-line-option-as-collected-from-kernel-source-code/).

That is too many, so which are the most useful ones?

I cannot answer that myself either. But for debugging, the options used in the QEMU "-append" line above (slub_debug, panic_on_warn, ftrace_dump_on_oops, earlyprintk, and so on) seemed to be useful.

In particular, "slub_debug" was used for debugging a recently discovered vulnerability (CVE-2017-7533):

https://bugzilla.redhat.com/attachment.cgi?id=1296934

[screenshot: slub_debug in use, from the Red Hat Bugzilla attachment above]

The kernel function that handles "slub_debug" is setup_slub_debug() (inside mm/slub.c), from which we can read and understand how the option is handled by the kernel:

[screenshot: setup_slub_debug() source in mm/slub.c]
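As an illustrative sketch only (not the verbatim kernel source), the registration uses the kernel's __setup() macro; per the kernel's SLUB documentation, the flag letters include F (sanity checks), Z (red zoning), P (poisoning), U (user tracking) and T (tracing):

#include <linux/init.h>

/* sketch: __setup() registers a handler for the "slub_debug" option;
 * str receives the text after "slub_debug", e.g. "=FZ" or "=FZ,dentry"
 * (flag letters, then an optional comma-separated list of slab caches).
 * The real setup_slub_debug() in mm/slub.c parses these into flags. */
static int __init setup_slub_debug(char *str)
{
        return 1;   /* 1 == option consumed */
}
__setup("slub_debug", setup_slub_debug);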
