As applications grow in complexity, memory stability is often neglected, causing problems to appear over time. When applications experience consequences of problematic memory implementations, developers may find it difficult to pinpoint the root cause. While there are tools available that automate detecting memory issues, those tools often require re-running the application in special environments, resulting in additional performance costs. Pinpointing the root cause efficiently, in real-time, without re-running the application in special environments requires an understanding of how memory flows in the Linux operating system.

We will explore both types of tracing tools and how they can be used in Linux production and development environments. This will start with Memcheck, a sophisticated tool that analyzes how memory is used, but cannot be used inexpensively in production. Next, we will explore BPF and perf prerequisites, such as: stack unwinding techniques, how memory flows in Linux, what event sources are, and differences between BPF and perf. Lastly, we will tie the principles together with a simulated workload to demonstrate each tools’ power.

At DoorDash, our server applications mainly consist of Java/Kotlin (JVM applications), and dynamically tracing them is a bit trickier than C/C++ applications. This article will serve as the first introductory step by introducing how tracing works, how the system allocates memory, and how modern tracing tools could help pinpoint issues in C/C++. Many of the principles described in this article are fundamental to instrumenting JVM applications, which we will write about in an upcoming article.

The dangers of memory growth and memory leaks

Applications growing in memory usage could strain or overwhelm hardware resources. When applications grow beyond available memory, operating systems like Linux will attempt to evacuate memory pages to free up space. If it fails to do so, the operating system will be forced to terminate offending applications with the help of Linux’s OOM killer. Furthermore, in applications that use garbage collection, continuous memory growth could prove decremental to an application’s performance.

Most common types of memory leaks

Memory leaks generally fall into two categories: reachability leaks and staleness leaks.

  • Reachability leaks are the classic leak scenario when a program allocates memory without ever calling free, an function to release memory. In languages with automatic reference counting, these cases are generally uncommon. However, other cases of reachability leaks still exist, such as reference cycles. Nonetheless, both of these scenarios result in the program losing the pointer to the object and becoming unable to free up the allocated memory.
  • Staleness leaks occur when a program unintentionally holds a pointer to an object that it will never access again. These could be cases of, for example, statically initialized memory that would live throughout the program’s lifetime.

A naive approach to tracing unexpected memory growth 

A naive or simplistic approach to tracing reachability leaks would be to trace where malloc, a memory allocation function, was called without a subsequent free call within a given interval of time. This approach would provide an overview of a possible leak, however, each code path would need to be examined to say for sure. This approach is also likely to include some false positives since long-lived memory, like singletons, are not considered leaks but would be misconstrued as leaks. Another problem with the naive approach is that malloc and free calls can happen very frequently, and tracing each call could slow down performance significantly.

Tracing staleness leaks accurately could be more difficult, as it is generally hard to differentiate between valid or invalid stale memory. Coincidentally, staleness leaks are comparable to tracing memory growths, as the two methodologies require studying code paths to reveal verifiable leaks.

Keeping tracing simple with Memcheck

Valgrind’s Memcheck is an outstanding and thorough tool that requires little to no knowledge of how memory works to use. As a bonus, callgrind_annotate, a post-processing tool, can be combined with Memcheck to indicate which code lines were responsible for leaking memory. Furthermore, Memcheck works brilliantly in automated testing environments, smaller programs, and as a sanity test before release.

The downside to using Memcheck in production applications is that we can’t attach it to an already running program. This is because Memcheck works by interposing memory events before launching the application that would ultimately be very expensive for an application that executes millions of events per second, causing performance to slow down substantially.

Installing Memcheck

Memcheck can be installed on many Linux systems by running

sudo apt-get install valgrind

or by following the instructions here

Simulated workload

Let’s start with a simple example of using Memcheck to detect a memory leak in the following program written in C. Compiling and running the following code will leak memory every second until terminated.

// function that leaks memory every k seconds
void leak(int secs) {
   while (1) {
      int *ptr = (int *)malloc(sizeof(int));
      printf("Leaked: %p\n", ptr);
      sleep(secs);
   }
}

void someFunc() {
   leak(1); // leak every second
}

void main() {
   someFunc();
}

Running Memcheck

Running the simulated workload with Memcheck would look something like this:

valgrind --tool=memcheck ./LoopLeak.out

This command launches the program in Memcheck’s special environment, allowing Memcheck to pinpoint potential leaks.

Analyzing output

After running the program for a brief period of time, the “leak summary” section tells us that we “definitely lost 8 bytes in 2 blocks” overall. However, by default Memcheck does not tell us where in the code we lost those bytes, which we could solve by including debug symbols.

# valgrind --tool=memcheck ./LoopLeak.out 
==1846== Memcheck, a memory error detector
==1846== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==1846== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==1846== Command: ./LoopLeak.out
==1846== 
==1846== Process terminating with default action of signal 2 (SIGINT)
==1846==    at 0x4F21A04: nanosleep (nanosleep.c:28)
==1846==    by 0x4F21909: sleep (sleep.c:55)
==1846==    by 0x108730: leak (in /root/Examples/Leaks/LoopLeak.out)
==1846==    by 0x108740: someFunc (in /root/Examples/Leaks/LoopLeak.out)
==1846==    by 0x108751: main (in /root/Examples/Leaks/LoopLeak.out)
==1846== 
==1846== HEAP SUMMARY:
==1846==     in use at exit: 12 bytes in 3 blocks
==1846==   total heap usage: 4 allocs, 1 frees, 1,036 bytes allocated
==1846== 
==1846== LEAK SUMMARY:
==1846==    definitely lost: 8 bytes in 2 blocks
==1846==    indirectly lost: 0 bytes in 0 blocks
==1846==      possibly lost: 0 bytes in 0 blocks
==1846==    still reachable: 4 bytes in 1 blocks
==1846==         suppressed: 0 bytes in 0 blocks
==1846== Rerun with --leak-check=full to see details of leaked memory
==1846== 
==1846== For lists of detected and suppressed errors, rerun with: -s
==1846== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

Source-level debugging with Memcheck

Debug symbols enable debuggers to access information from the binary source code, such as the names of identifiers, including variables and procedure calls. Combining Memcheck with callgrind_annotate lets us see where the binary is leaking memory in its respective source code.

Including debug symbols

GCC, and other compilers, have the ability to include debug symbols at compile time. To do so with GCC, we would simply pass in an additional flag:

gcc -g3 leak.c

For more information about other flags available, see the GCC documentation.

Memcheck options

Memcheck’s user manual explains the following options in a bit more detail. However, imagine these options as enabling extra bits of information to feed to the output memory file.

--leak-check: Setting to full enables us to show each leak in detail.

--track-origins: Setting to “yes” helps us track down uninitialized values.

--xtree-memory-file: Defaults to `xtmemory.kcg.%p`.

--xtree-memory: When set to full, the memory execution tree gives 6 different measurements: the current number of allocated bytes and blocks (same values as for allocs), the total number of allocated bytes and blocks, the total number of freed bytes and blocks.

Running Memcheck with options

Next up, we will re-run Memcheck with the aforementioned options to generate an output memory file for further analysis.

valgrind --leak-check=full --track-origins=yes --xtree-memory=full --xtree-memory-file=leaks.kcg ./LoopLeak.out

Memcheck + callgrind_annotate

callgrind_annotate takes in the output memory file along with the source code of the program and maps them together:

callgrind_annotate leaks.kcg leak.c

We can see exactly where we leaked memory, what line, and how much was leaked in this output:

# callgrind_annotate leaks.kcg leak.c 
…
curB curBk totB  totBk totFdB totFdBk 
   .     .     .     .      .       .  
   .     .     .     .      .       .  // function that leaks memory every k seconds
   .     .     .     .      .       .  void leak(size_t secs) {
   .     .     .     .      .       .     while (1) {
  12     3    12     3      0       0        int *ptr = (int *)malloc(sizeof(int));
   .     .     .     .      .       .        if (ptr == NULL) {
   .     .     .     .      .       .          printf("Ran out of memory");
   .     .     .     .      .       .        } else {
   0     0 1,024     1      0       0          printf("Leaked: %p\n", ptr);
   .     .     .     .      .       .        }
   0     0     0     0  1,024       1        sleep(secs);
   .     .     .     .      .       .     }
   .     .     .     .      .       .  }
   .     .     .     .      .       .  
   .     .     .     .      .       .  void someFunc() {
  12     3 1,036     4  1,024       1     leak(1); // leak every second
   .     .     .     .      .       .  }
   .     .     .     .      .       .  
   .     .     .     .      .       .  void main() {
  12     3 1,036     4  1,024       1     someFunc();
   .     .     .     .      .       .  }

--------------------------------------------------------------------------------
curB curBk totB  totBk totFdB totFdBk 
--------------------------------------------------------------------------------
  36     9 3,108    12  3,072       3  events annotated

Introducing the prerequisites to BPF and perf

At this point, we have explained the basics of tracing leaks and growths with Memcheck. This may be a sufficient stopping point for applications with infrequent allocations or automated testing environments. For applications with frequent allocations, we first need to cover fundamental bits of tracing to use BPF and perf efficiently to our advantage. These topics include:

  • Linux’s Memory Model
    • Operating system internals, a high-level look at how the system works 
  • Tracing Event Sources
    • How echoing events in the kernel could hint at the root cause. What event sources are, how to trace them, and the general differences between them

Following these topics, we will dive into a brief introduction of BPF and perf, uncover differences between the two, and explore the capabilities of BPF and perf in simulated workloads.

Unwinding stack traces to gather historical information  

Stack traces help examine and understand code paths that led up to an event and are also commonly used for performance analysis in measuring where an application is spending most of its time. However, stack traces are not accessible out-of-the-box and can only be collected by unwinding the stack.

Several techniques are available for unwinding the stack. However, some tracing tools (like BPF) depend on using a specific unwinding technique to function properly. At the time of writing, this is the current state of the user-land unwinding techniques supported by BPF and perf.

ToolUser-land Unwinding Technique Supported
BPFFrame pointers only
PerfFrame pointers (–call-graph fp)
DWARF (–call-graph dwarf)
LBR (–call-graph lbr)
ORC (possible, not yet supported)

Frame pointers

Frame pointers can be thought of as a linked list of historical callers, and let us review the historical code path that led to a given event, such as a memory allocation.

How frame pointers work

From a high level, when the CPU executes instructions, it stores data in its registers, and the frame pointer is typically stored in the RBP register. For example, if foo() calls bar(), which in turn calls baz(), we could interrupt the execution at baz() and unwind the stack by iterating the linked list (frame pointer), yielding a historical call tree of: foo->bar->baz.

How to deal with compilers that omit frame pointers

Unfortunately, when code is compiled with modern compilers, the frame pointer register is often omitted as a performance improvement, breaking our ability to unwind the stack. From here it’s possible to pursue the other ways of unwinding the stack, as seen below, or re-enable frame pointers on the modern compiler. It should be noted that other techniques such as unwinding the stack with DWARF, a sophisticated debugging format, have their own downsides, which is why BPF insists on re-enabling the frame pointer.

How to decide whether to omit frame pointers or not 

Modern compilers generally omit frame pointers to boost performance. Understanding the reasoning for doing so will help determine whether it is best to use an alternative method of unwinding the stack or accept minor performance degradation in return for re-enabled frame pointers.

The number of general purpose-registers and the size of the program binary are two of the main metrics to consider when deciding this.

Omitting frame pointers for CPUs with fewer registers, such as 32-bit CPUs, is a good optimization, as occupying an entire register for frame pointers can be a significant performance sacrifice. However, dedicating a register for frame pointers on 64-bit CPUs with 16 general-purpose registers is not as big of a sacrifice, especially when it easily enables tools like BPF.

Omitting frame pointers can also yield smaller binaries, as function prologues and epilogues would not need to configure frame pointers.

Enabling frame pointers

At the time of writing, using BPF requires enabling the frame pointers on a modern compiler that will otherwise omit them by default. For example, the GCC compiler can revert the default to omit the frame pointer and keep the frame pointer with the -fno-omit-frame-pointer option. For Java, the JVM provides -XX:+PreserveFramePointer to re-enable the frame pointer.

Resolving frame pointers without a dedicated register

Resolving frame pointers without using a dedicated register can be tricky. The following section describes the three most common stack unwinding techniques that could resolve the frame address, i.e., frame pointer, for a given address, starting with DWARF.

Using DWARF to unwind the stack

DWARF, the predecessor of STABS, was designed as a standardized debugging data format. The DWARF unwinder, which utilizes this format, is a Turing-complete state machine that generates lookup tables on demand to resolve register values, such as the frame address. The unwinder works by parsing instructions from sections in the binary, or a standalone debuginfo file, in the DWARF format and executes those instructions in its custom virtual machine. An abstract example table of what the unwinder generates can be seen below in Figure 1.

Location CFArbxrbpr12r13r14r15ra
0xf000f000rsp+8uuuuuuc-8
0xf000f002rsp+ 16uuuuuc-16c-8
0xf000f004rsp+24uuuuc-24c-16c-8
Figure 1: Example of address locations to equations needed to resolve register values (simplified from the DWARF debugging standard).

The problem with unwinding the stack with DWARF

The abstract example table seen in Figure 1 is, in reality, compressed in size and utilizes the DWARF unwinder to compute register values on demand. For example, the canonical frame address (CFA) could be retrieved with the DW_CFA_def_cfa_expression instruction combined with arbitrary arithmetic and logical operations that act as steps the unwinder needs to perform.

The problem with this is that the aforementioned steps’, a.k.a. interpreting DWARF bytecode on demand, is slow, despite libunwind’s aggressive caching strategies. It is especially problematic in interrupt handlers where tools like BPF and perf strive to log data as fast as possible, in order to not impact performance. Linux’s perf tool works around this performance cost by copying unresolved user-land stack dumps to disk and post-process them at a future point. The same trick would defeat the purpose of BPF because BPF aims to aggregate stacks directly in the kernel and could only do so if the stacks are complete.

Using Last Branch Record to unwind the stack

Last Branch Record (LBR) is an exclusive Intel processor feature that allows the CPU to record function branches in a hardware buffer. The number of hardware registers available differs from generation to generation. However, it typically ranges from 4 to 32, which means that the stack trace’s depth is limited to the number of on-chip LBR registers available. Nonetheless, LBR is a great alternative for applications with shallow stacks. LBR, at the time of writing, is supported by perf and unsupported by BPF.

Using Oops Rewind Capability to unwind the stack

Oops Rewind Capability (ORC), a clever backronym in response to DWARF, stemmed from difficulties of including hardworking dwarves in the Linux kernel in 2012. DWARF was originally built to handle multiple architectures, language features, and more, which is partially why it was made Turing-complete, but it also came with several challenges to including it in the kernel.

ORC was designed to simplify, minimize, and diverge from the standard debugging format to something that kernel developers would own. Josh Poimboeuf initially proposed ORC in 2017, and early results measure 20 times faster than DWARF. Since then, ORC has made its way into the kernel. Although theoretically compatible with perf, perf would need to be taught to read ORC data to be used as a user-land stack unwinding technique.

Introducing Linux’s memory model

By default, Linux follows an optimistic memory allocation strategy based on the assumption that a program may not need to use all the memory it asks for. The strategy allows Linux processes to oversubscribe physical memory, which is administered with a pageout daemon, optional physical swap devices, and, as a last resort, the out-of-memory (OOM) killer. By understanding how memory flows within an operating system, we can target specific areas of concern. Figure 2, below, represents a rough illustration of applications running with the libc allocator in Linux.

Figure 2: How memory events flow within the operating system based on applications running the libc allocator.

We will now go over each section in the illustration above, starting with the libc allocator.

1. Application’s allocator (libc in this example)

libc provides functions for interfacing with hardware memory. The malloc() call allocates a memory block and returns a pointer to the beginning of the block. The free() call takes in the pointer returned and frees the memory block. When memory is freed, libc tracks its location to potentially fulfill a subsequent malloc() call. The realloc() function attempts to change the size of the memory block if the pointer passed in is valid. If not, realloc() is equivalent to a malloc() call. Lastly, calloc() allocates memory similarly to malloc, and  initializes the allocated memory block to zero.

2. brk()

Memory is stored on a dynamic segment of the process’s virtual address space called the heap. When an application is first launched, the system reserves a certain amount of memory for the application and could request to increase or create a new memory segment through, for example, the brk() syscall. brk() events are less frequent than malloc and should make the overhead of tracing them negligible.

3. mmap()

mmap() is commonly used during initialization or at the start of the application to load data files and create working segments. It can be called directly (as a syscall) and by the allocator for larger memory blocks instead of malloc (see MMAP_THRESHOLD). However, unlike brk(), frequent mmap() calls do not necessarily guarantee growth, as they may be returned to the system shortly after using munmap(). This makes tracing mmap() somewhat analogous to malloc()/free() tracing, where the mapping addresses and code paths must be examined in closer detail to pinpoint potential growths.

4. Page faults

The operating system measures the size of memory in memory pages. In Linux, if you were to request 50Mb of memory, the system might only give you five pages of physical memory worth 4KiB each, while the rest is “promised” (a.k.a. virtual) memory. The exact page size varies between systems. The consequences of this promise appear when the system tries to access a page through the memory management unit, MMU, that is not backed by physical memory. At that point, the system triggers a page fault and traps the kernel to map to physical memory before continuing.

5. Paging

Paging comes in pairs: page-ins and page-outs. Page-ins are common, normal, and generally not a cause for concern. They typically occur when the application first starts up by paging-in its executable image and data. However, page-outs are slightly more worrisome, as they occur when the kernel detects that the system is running low on physical memory and will attempt to free up memory by paging it out to disk. If page-outs are frequent, the kernel may spend more time managing paging activity than running applications (also known as thrashing), causing performance to drop considerably.

Tracing Event Sources

Tracing event sources enables us to listen for events in an application or kernel that occur without any necessary changes to the target process or system. Event sources can be thought of as radio channels where each channel reports respective events. For example, tuning in to this metaphorical radio for “malloc” in “kprobes” would instrument kernel malloc probes as opposed to user-land (which would be uprobes). Tools such as BPF and perf could be employed to perform custom actions for when an event occurs, construct histograms and summaries of data, and more. This section will start by exploring some of the event sources available in Linux, starting with kprobes.

kprobes

kprobes, merged in the Linux 2.6.9 kernel released in Oct 2004, allowed us to create instrumentation events for kernel functions dynamically. The kprobes technology also has a counterpart, kretprobes, for instrumenting when a function returns and its return value. Combining both kprobe and kretprobe could give you important metrics, for example, measuring the duration of a function in the kernel.

uprobes

uprobes was merged in the Linux 3.5 kernel, released in July 2012, and gives us the ability to create instrumentation events for user-space functions dynamically. Uprobes’ counterpart to monitor function returns are uretprobes, which could be used to measure the duration of a function in user-space.

USDT

User Statically-Defined Tracing (USDT) probes can be thought of as user-space tracepoints that were manually added by developers in given software. The BPF compiler collection (BCC) includes tplist <pid>, a tool for listing USDTs in a process. What makes USDT different is that it relies on an external system tracer to activate these tracepoints. Many applications do not compile USDT probes by default and require a configuration option such as “–enable-dtrace-probes” or “–with-dtrace”.

PMCs

Performance monitoring counters (PMCs), a.k.a. PICs, CPCs, and PMU events, refer to the same thing, programmable hardware counters on the processor. These hardware counters enable us to trace the true efficiency of CPU instructions, hit ratio of CPU caches, stall cycles, and more. For example, ARMv8-1m, under the section “6. PMU accessibility and restrictions”, demonstrates examples of measuring the number of instructions retired and level 1 data cache misses. As opposed to software counters, PMCs have a fixed number of registers available to measure simultaneously. Measuring more than the number of registers available would require cycling through a set of PMCs (as a way of sampling). Linux’s perf uses this form of cycling automatically.

Hijacking methodology

For those familiar with Frida or other instrumentation toolkits, both uprobes and kprobes use a similar hijacking methodology of replacing the target address with a breakpoint instruction (int3 on x86_64, or jmp, if optimization is possible), then reverting that change when tracing the given function finishes. Similarly, return probes, such as uretprobe and kretprobe, replace the return address with a trampoline function and revert the change when finished.

A brief introduction to perf

perf, a.k.a. perf_events, was brought to life in Linux kernel version 2.6.31 in September of 2009. Its purpose was to use perfcounters (performance counters subsystem) in Linux. Since then, perf has had continuous enhancements over the years to add to its tracing capabilities. Furthermore, perf opened the door to many things outside of observability purposes, like the Trinity fuzzer, that helped expose kernel bugs and exploits throughout Linux’s history.
perf is a remarkable tool for targeting older versions of the kernel, low-frequency activities (such as measuring brk events), and covering use cases that BPF may not be capable of addressing. However, unlike perf, BPF aggregates events directly in the kernel instead of pushing them to user-space for further analysis, yielding better performance, as shown in Figure 3, below:

Figure 3: perf pushes data to user-space, which could be expensive for measuring high-frequency events, such as malloc.

Installing perf

The perf tool is included in the `linux-tools-common` package. Install it by running:

sudo apt-get install linux-tools-common

Once installed, try running perf. It may require `linux-tools-{kernel version}`, which can be installed by running:

sudo apt-get install linux-tools-generic # install for latest kernel

or

sudo apt-get install linux-tools-`uname -r` # install for current kernel

It’s also possible to compile perf from the Linux kernel source. See these instructions for further details.

A brief introduction to BPF

BPF was first developed in 1992 to improve capturing and filtering network packets that matched specific rules by avoiding copying them from kernel to user-space. At that time, BPF was primarily used in the kernel, but that later changed in 2014 by exposing BPF directly to user-space with commit daedfb22. This helped turn BPF into a universal in-kernel virtual machine that was seen as a perfect fit for advanced performance analysis tools.

Nowadays, BPF has been extended to do much more than network packet filtering. It allows us to arbitrarily attach to tracepoints in the kernel or user-space, pull arguments from functions, construct in-kernel summarization of events, and more.

BPF works by compiling the user-defined BPF program (for example, BCC’s disksnoop) and passes the instructions to the in-kernel BPF verifier, as shown in Figure 4, below. This ensures the safe execution of instructions to prevent panicking or corrupting the kernel. The verifier passes the instructions to the interpreter or JIT compiler, which turns the instructions into machine-executable code. Some distros have disabled the interpreter to prevent potential security vulnerabilities (like spectre) and explicitly pass through the JIT compiler.

Figure 4: BPF aggregates events in-kernel instead of pushing out events to user-space.

Companies, such as Cloudflare, have adopted BPF to reject packets for DDoS protection immediately after a packet is received at the kernel level. The eXpress Data Path project is another great example of bare metal packet processing at the system’s lowest level.

Installing BPF

BPF is generally used indirectly, through one of its front-ends, but is also available as libraries for interacting with BPF directly. In this article, we will mainly use BPF programs from BCC but also explore the simplicity of writing powerful one-liners in bpftrace.

BCC 

BCC is a toolkit of over a hundred ready-made programs to use efficiently in production environments. See the installation page for instructions on how to install BCC.

bpftrace

bpftrace is a high-level tracing language for BPF derived from predecessor tracers such as DTrace and SystemTap. It is useful for composing powerful one-liners, executing custom actions, and quickly visualizing data. To install bpftrace, follow the instructions here.

Triaging memory usage with help of the USE method

It can be difficult to pinpoint a specific process causing problems if the system hosts multiple processes at the same time. The Utilization Saturation and Errors (USE) method, created by Brendan Gregg, is a methodology for analyzing the performance of any system. In this case, it’s about characterizing how the system is performing in relationship to memory usage.

The method has three classifications for analyzing performance: Utilization, Saturation, and Errors. Utilization is attributed to the amount of memory used. If memory usage reaches 100%, the system responds by attempting to salvage memory, for example, by paging it out to disk. Errors appear when the system has exhausted all attempts to salvage memory.

To analyze memory usage in Linux, we can look at the USE acronym’s three components: Utilization, Saturation, and Errors. Each section includes a series of questions that we can ask the system by using statistical tools. By analyzing the output of these tools, we can narrow down our search to only target-specific problematic areas.

Simulated workload

It is generally not recommended to learn how tracing tools work at the time of need. For that reason, the included workload is an XOR encryption program written in C++ that contains two memory bugs:

  • One bug has the consequence of scaling proportionally in memory to the size of the input passed in.
  • The other bug retains the input until the application exits, instead of discarding out-of-use memory as soon as possible.

The source code of the workload and instructions on how to get started can be found here.

For the following sections, feel free to use this workload or a suitable alternative. Armed with the knowledge of operating system internals, next, we will use perf and BPF to investigate system memory growth.

Utilization

Quantifying memory utilization will help explain how physical memory is distributed in the system. Oftentimes, as alluded to before, the system may host multiple processes at the same time, which could be slightly misleading in certain memory usage reports.

Answering the following questions about the system will give us an idea of overall memory capacity utilization:

QuestionTool
How much memory is currently free/used?free -m
How was/is memory used?sar -r
How much memory is each process using?top -o %MEM

How much memory is currently free/used?

free reports real-time statistics for quickly checking the amount of available and used memory.

# free -m
               total        used       free      shared  buff/cache   available
Mem:           3934         628        2632         1         674        3075
Swap:          1988          0         1988
How was/is memory used?

sar is a system activity monitoring tool that can run both as a scheduled cron job (cat /etc/cron.d/sysstat) and as a real-time tool (with second intervals). In real-world applications and monitoring systems, sar is often used as a cron job scheduled to run every X minutes. To review sar’s historical data, simply run sar without a time interval (sar -r). With our short-lived workload, we’ll use sar with a time interval of 1 to output reports every second.

# sar -r 1
kbmemfree   kbavail kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit  kbactive   kbinact   kbdirty
2124620     3011832    721960    17.92     6636     1071076   1346768     22.20      1212336    509060     36
2020040     2907252    826540    20.51     6636     1071076   1346768     22.20      1317012    509060     36
1915712     2802924    930868    23.10     6636     1071076   1346768     22.20      1420968    509060     36
1811132     2698344   1035448    25.70     6636     1071076   1346768     22.20      1525428    509060     36
1706048     2593260   1140532    28.31     6636     1071076   1346768     22.20      1630188    509060     36

The kbcommit and %commit column show estimates of how much RAM/swap is needed to guarantee that the system won’t run out of memory. The %memused column, in this case, shows continuous growths in memory over the last 5 seconds, while the other columns touch on available/free memory more verbosely than the output in the free tool.

How much memory is each process using?

top provides a real-time list of processes running on the system along with a summary of system information. The main columns of interest are; VIRT, RES, and %MEM. The top man page explains these columns as:

VIRT:
The total amount of virtual memory used by the task. It
includes all code, data and shared libraries plus pages that
have been swapped out and pages that have been mapped but not
used.
RES (Resident Memory Size):
A subset of the virtual address space (VIRT) representing the
non-swapped physical memory a task is currently using. It is also 
the sum of the RSan, RSfd and RSsh fields.
%MEM: 
Simply RES divided by total physical memory.

Running top -o %MEM sorts the highest memory usage processes.

# top -o %MEM
PID   USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+   COMMAND
1179  root      20   0  9287460  3.6g   1804 R  100.0 92.6    0:34.78  XOREncryptor

In our case, we’re only running one application on the system, so the output isn’t surprising, but it’s always a good idea to check.

Saturation

The next step in the USE method is to characterize saturation. The first step, utilization, reveals that we’re using a ton of memory and it’s due to a single process. Saturation, in relationship to memory, is generally caused by high memory utilization that orbits close to the limited bounds of the hardware. The system’s response is to reduce memory use by either swapping out memory, attempting to terminate offending processes, or, if enabled, compressing memory.

Asking the system the following questions will enable us to get a better understanding of potential memory saturation:

QuestionTool
How much of our swap space is used?sar -S
How much memory is swapped in/out?vmstat
How many page faults is the system encountering per second?sar -B

How much of our swap space is used?

When the system runs out of physical memory, and swapping is enabled, it attempts to claim memory back by paging out pages of memory to disk. This is a sign of concern, and checking the swap-space reveals how much pain the system has been or currently is in. The same rules with sar apply here as explained above, namely that running sar without an interval reports historic data captured by when the tool runs as a cron job.

Running sar with option “-S” reports swap space utilization statistics. The main columns to look for is the percentage of the swap space used (%swpused) and the percentage of memory that was swapped out but was swapped back in (%swpcad).

# sar -S 1
kbswpfree kbswpused   %swpused  kbswpcad   %swpcad
1860348    176384        8.66     1852       1.05
1767540    269192       13.22     2484       0.92
1701132    335600       16.48     3604       1.07
1616124    420608       20.65     4324       1.03
1535484    501248       24.61     4164       0.83

Reading this report tells us that the system was forced to swap out memory every second in the last five-second reports.

There are other tools available for simplified real-time statistics such as swapon.

How much memory is swapped in/out?

vmstat reports virtual memory statistics, page-ins/page-outs, block IO, CPU activity, and general memory usage. For memory usage, we are generally interested in the following columns; r, b, swpd, free, buff, cache, si, and so.

The vmstat man page describes these columns as:

r: The number of runnable processes (running or waiting for run time).
b: The number of processes blocked waiting for I/O to complete.
swpd: The amount of virtual memory used.
free: The amount of idle memory.
buff: The amount of memory used as buffers.
cache: The amount of memory used as cache.
si: Amount of memory swapped in from disk (/s).
so: Amount of memory swapped to disk (/s).

Running vmstat 1 5 will sample data every second for a total of five times, whereas the first report is the average since the last reboot, which is useful for spotting trends.

# vmstat 1 5
procs   ---------------------memory------------------   ----swap---   -------io------    ---system---   ---------cpu--------
 r  b        swpd      free         buff    cache        si      so     bi      bo         in   cs       us   sy   id   wa st
 2  0        55564    104076        1392    41092         0       7    332      73         34   18        0    0   100   0  0
 1  0       172556    147324         640    38776        96  117016    924   117016      26130 421       11    5    84   0  0
 1  0       226828    113852         640    39140         0   54268    480    54268      14209 237       10    3    87   0  0
 1  0       307980    111836         640    39692        80   80988    876    80988      21176 352       12    4    84   0  0
 1  0       410380    140868         620    39032       180  102588    3996  102588      26482 520       12    5    83   0  0

The output tells us that we have one process (r) that is aggressively increasing in memory usage (free) and is forcing the system to swap out memory (so) to disk. The problem with this goes back to what was previously explained under Paging, which is that we might end up spending more time paging out memory to disk than doing work, a.k.a. thrashing.

We can also see that the system is busy doing I/O (bi/bo), which makes sense given that our program is encrypting files by first reading them in. However, none of these columns answers why we are decreasing in available memory, which we’ll solve by gathering stack traces for related memory events.

How many page faults is the system encountering per second?

Page faults, explained under the page fault section, occur when promised memory, a.k.a. virtual memory, traps the kernel to make good on the promise before continuing. Checking how many page faults occur per second could be done using sar -B.

For page faults, we are mainly interested in the two columns: fault/s and majflt/s. The sar man page explains these columns as:

fault/s:
Number of page faults (major + minor) made by the
system per second.  This is not a count of page
faults that generate I/O, because some page faults
can be resolved without I/O.
majflt/s:
Number of major faults the system has made per
second, those which have required loading a memory
page from disk.
# sar -B 1
pgpgin/s      pgpgout/s    fault/s     majflt/s   pgfree/s   pgscank/s   pgscand/s    pgsteal/s    %vmeff
 15260.00      8000.00    11949.00     132.00    2071.00       0.00        0.00         0.00        0.00
192832.00     24000.00    78203.00     355.00    6066.00       0.00        0.00         0.00        0.00
334506.93        0.00     98736.63       0.00       4.95       0.00        0.00         0.00        0.00
     0.00        0.00     25340.00       0.00       5.00       0.00        0.00         0.00        0.00
     0.00        0.00     23987.00       0.00       5.00       0.00        0.00         0.00        0.00

Starting up the workload and sar simultaneously show consistent behavior with what we expect will happen. It starts by paging in memory from disk (pgpgin/s) which is the file that the workload will encrypt, and saves a temporary file to disk (pgpgout/s) which is the output encrypted file.

If we ran sar -B for a longer period of time, allowing the workload to eventually exhaust available memory, we see an increase of pages scanned per second (pgscank/s, pgscand/s), reclaimed from cache (pgsteal/s), and pages put on the free list per second (pgfree/s). These are attempts made by the system to try to keep up with memory demands as quickly as possible.

# sar -B 1
pgpgin/s    pgpgout/s     fault/s     majflt/s  pgfree/s   pgscank/s   pgscand/s    pgsteal/s    %vmeff
421376.00   132.00       121021.00      0.00    52176.00    26414.00      0.00       26057.00     98.65
449408.00   136.00       126368.00      0.00   220041.00   460758.00    254.00      216361.00     46.93
406416.00   144.00       114882.00      4.00   322224.00   247949.00      0.00      178409.00     71.95
407308.00   236.00       114173.00      8.00    85091.00   221639.00      0.00       84854.00     38.28
 97024.00     8.00        44830.00      3.00    70633.00   103415.00      0.00       70592.00     68.26

Errors

Gathering errors, although the last word in the USE acronym, is actually a good step to take first. We will primarily be looking at processes that were killed by Linux’s OOM killer. However, other failures, such as failed mallocs (although rare in default Linux kernel configurations) and physical RAM failures, could still occur.
If the system was forced to terminate a process with the out-of-memory manager’s help, we could be lucky enough to find clues of those events in dmesg. As an example of a report:

# dmesg
[ 1593.811912] Out of memory: Killed process 1179 (XOREncryptor) total-vm:11335464kB, anon-rss:3764656kB, file-rss:1808kB, shmem-rss:0kB, UID:0 pgtables:19348kB oom_score_adj:0

In this output, we can see that we had a process (PID 1179) that grew out of bounds and was terminated by Linux’s OOM killer. When systems host multiple processes at a time, this information could narrow down the search for memory hogs.

Targeted tracing with BPF and perf

The benefit of BPF and perf is their ability to compose queries on-the-fly without recompiling or restarting an application in any special or sandbox environments. Given those capabilities, we could simply ask the system to show us stack traces that led up to an event, without specifying the exact process we are interested in looking at. 

While the USE method revealed fairly straightforward information, if targeting a different workload, the results would be different and likely not as apparent. That is why it is still a good idea to go through the USE-method in its full form before diving into specific areas to trace. In our case, the answers to each question are summed up in Figure 5, below:

QuestionTool Answer
How much memory is currently free/used?free -mAt the time of execution, there was some memory left. However, sar -r reveals continuous drops of available memory – soon to cross into swap space.
How was/is memory used?sar -rContinuous decrease of available memory.
How much memory is each process using?top -o %MEMXOREncryptor hogs the majority of available memory.
How much of our swap space is used?sar -Ssar -S shows a continuous increase of swap space capacity. This is an indication that the system has run out of available memory and is forced to page out memory to swap space.
How much memory is swapped in/out?vmstatvmstat shows increases of virtual memory, and that a single process is forcing the system to swap pages to disk. The report also shows signs of thrashing.
How many page faults is the system encountering per second?sar -BRunning sar -B for an extended period of time shows the increase of memory reclaims, page outs to disk, and general hardship in keeping up to the memory demand.
Were any processes killed by the out-of-memory manager?dmesgXOREncryptor was eventually killed by the OOM killer.
Figure 5: The color code represents each part of the USE-method in action. Blue is utilization, yellow is saturation, and red is errors.

The summary of questions and answers tells a story. A process, XOREncryptor, was scaling beyond available resources which forced the system to keep up with high memory demands. Eventually, when the process scaled beyond the system’s capabilities to reclaim memory, the process was killed by the OOM killer.

This gives us the impetus to trace the following areas shown in Figure 6, below, derived from the illustration of memory flow in Figure 2.

WhatWhyMemory Event
Heap expansionCapturing a stack trace that led up to heap
expansion would reveal why the heap was expanded.
brk()
Large allocationsLarger allocations are often contracted to
mmap as opposed to malloc() for efficiency reasons. Tracing mmap() would reveal events
that led up to large allocations.
mmap()
Major & minor page faultsOften cheaper to trace than malloc(),
and would give us an idea as to why page faults are
triggered so frequently in the workload application.
page-faults, major-page-faults
Figure 6: Above is a list of memory events paired with reasons as to why they might be beneficial to developers. Each row includes a memory event to help the investigation for problematic memory.

With this list in mind, in the next section, we’ll start by examining the syscalls made to brk(), page-faults, and mmap(), and gather conclusive evidence throughout the process to pinpoint potential memory issues. This is where ready-made tools from the BCC collection, powerful one-liners from BPFTrace, and the classic Linux perf shine. Each tool enables us to transform raw output data into interactive flame graphs for an interactive debugging experience.

Tracing the brk() syscall

The dynamic memory segment, the heap, can expand in size for many reasons. brk() is one of the ways memory will expand, and tracing brk() may indicate clues as to why. Also, brk() is generally called less frequently than malloc, or other high-frequency events, making it less computationally intensive and a perfect place to start.

Picking an appropriate interval

brk() events are usually infrequent. For the workload included, we’ve chosen 30 seconds as enough time to capture events, since that interval should provide a sufficient amount of samples. However, we recommend measuring an appropriate interval when targeting other workloads. To measure what an appropriate interval is in other workloads is easily accomplishable with perf stat, for example:

# perf stat -e syscalls:sys_enter_brk -I 1000 -a
1.000122133                  3      syscalls:sys_enter_brk                                      
2.000984485                  0      syscalls:sys_enter_brk

Measure brk() with perf

To record the brk() syscalls:

# perf record -e syscalls:sys_enter_brk -a -g -- sleep 30

Copy the recorded buffer to an output file:

# perf script > out.stacks

Transform to a flame graph:

# ./stackcollapse-perf.pl < out.stacks | ./flamegraph.pl --color=mem  --title="Heap Expansion Flame Graph" --countname="calls" > out.svg

Measure brk() with BCC

 # /usr/share/bcc/tools/stackcount -f PU t:syscalls:sys_enter_brk > out.stacks
 # ./flamegraph.pl --hash --color=mem --title="Heap Expansion Flame Graph" --countname="calls" < out.stacks > out.svg 

Measure brk() with BPFTrace

 # bpftrace -e 'tracepoint:syscalls:sys_enter_brk { @[ustack, comm] = count(); }' > out.stacks
 # ./stackcollapse-bpftrace.pl < out.stacks | ./flamegraph.pl --color=mem --title="Heap Expansion Flame Graph" --countname="calls" > out.svg 

Heap expansion flame graph analysis

The flame graph reveals that we have six calls to brk() in total (ignoring the events from `sleep`). It shows that we are expanding the heap’s size by brk() in two functions, read_in_chunks and process_chunk. Those code paths could be examined further to reveal a potential memory issue.

Figure 7: The heap expanded in size using the brk() syscall from the process_chunk and read_in_chunks functions. Analyzing these code paths could help us diagnose a problem.

Tracing the mmap() syscall

The mmap() syscall is often used during initialization for creating working segments and loading data files. However, we previously explained that mmap() can also be contracted by the applications allocator for large allocations. Tracing mmap() will help reveal code paths that exceed the MMAP_THRESHOLD, showing us code paths that led up to delegations to mmap().

Tracing mmap() with perf

 # perf record -e syscalls:sys_enter_mmap -a -g -- sleep 30
 # perf script > out.stacks
 # ./stackcollapse-perf.pl < out.stacks | ./flamegraph.pl --color=mem \ --title="mmap() Flame Graph" --countname="calls" > out.svg 

Tracing mmap() with BPF

 # /usr/share/bcc/tools/stackcount -f -PU t:syscalls:sys_enter_mmap > out.stacks
 # ./flamegraph.pl --hash --color=mem \ --title="mmap() Flame Graph" --countname="calls" < out.stacks > out.svg 

Tracing mmap() with BPFTrace

 # bpftrace -e 'tracepoint:syscalls:sys_enter_mmap { @[ustack, comm] = count(); }'
 # ./stackcollapse-bpftrace.pl < out.stacks | ./flamegraph.pl --color=mem --title="mmap() Flame Graph" --countname="calls" > out.svg 

mmap() flame graph analysis

Analyzing the flame graph shown in Figure 8, we could see that large allocations were contracted from malloc() to mmap() in the read_in_chunks and process_chunk functions.

Where both functions, contrary to their function name (chunk), seem to allocate memory above the mmap() threshold. This, in our case, points directly at the problem, because we meant for the chunk suffixed functions to allocate and process memory in smaller chunks, which would certainly not exceed the MMAP_THRESHOLD.

Figure 8: The process_chunk and read_in_chunk functions allocate memory above the mmap() threshold.

Back to the code, we can see that on line 27 we allocate a buffer that is equal to the size of the input. Replacing that line with a fixed buffer of 4096 corrects the problem. Figure 11, below, now only calls mmap() for the input/output streams in the boost library. This fixes the bug where we would scale proportionally in memory to the size of the input passed in.

Figure 9: Replacing line 27 with a fixed buffer corrected our problem of scaling proportionally to the size of the input that was passed in.

Tracing page faults

Briefly summarized, page faults occur when the system has backed memory with virtual memory and is forced to map to physical memory before continuing. Examining call paths that led up to a page fault is another effective way of explaining memory growth.

Tracing page faults with perf

# perf record -e page-faults -a -g -- sleep 30
# perf script > out.stacks
# ./stackcollapse-perf.pl < out.stacks | ./flamegraph.pl --color=mem \ --title="Page Faults" --countname="pages" > out.svg

Tracing page faults with BCC

# /usr/share/bcc/tools/stackcount -f -PU t:exceptions:page_fault_user > out.stacks
# ./flamegraph.pl --hash --color=mem \ --title="Page Faults" --countname="pages" < out.stacks > out.svg

Tracing page faults with BPFTrace

# bpftrace -e 'tracepoint:exceptions:page_fault_user { @[ustack, comm] = count(); }' > out.stacks
# ./stackcollapse-bpftrace.pl < out.stacks | ./flamegraph.pl --color=mem --title="Page Faults" --countname="pages" > out.svg

Page faults flame graph analysis

Investigating the flame graph below, we see a large number of pages (~80k) in a 30-second interval. The code paths that led up to each page fault are not surprising given that we expect to allocate new data in our processing functions. However, what is surprising is the contradiction that this flame graph manifests against the expected behavior discussed under malloc(). Long story short, the allocator is capable of recycling pages by fulfilling a subsequent malloc with a previously free, but that doesn’t seem to be the case in this output. Instead, we seem to always allocate new pages, which could be valid in some programs, but a definite problem in our case.

Figure 10: There are a large number of page faults originating from the process_chunk function.

After walking through the code path that led up to page faults and cross-examining this with the code, we can see that on line 35 we never seem to deallocate the chunk that was processed. Correcting this will fix the bug that retains the input until the application exits.

Now, when we trace page faults, the flame graph in Figure 11, below, shows a significant decrease (~40k) in page fault events and seems to take full advantage of the allocator’s capabilities, shown by the disappearance of “operator new”. That is a huge win in both performance and memory!

Figure 11: Correcting the bug led to ~40k fewer page fault events and less memory used.

Summary

The power of dynamic tracing is incredible as it allows us to instrument any part of a computer such as functions, syscalls, I/O, network, and virtual memory. Most importantly, it allows us to efficiently pinpoint abnormalities in programs in real-time. Hopefully this article goes the extra mile by also describing frameworks to debug and better understand the potential problems and tools to examine and correct them. The included example should also help anyone looking to better master this material by providing a sandbox example to try all the tools.

Overall, this article should serve as an introduction to memory problems and how to detect them with traditional and newer, more sophisticated tools. Stay tuned for our upcoming article on using BPF/perf on JVM applications.

Additional Reading

Brendan Gregg

Brendan Gregg’s books, BPF Performance Tools and Systems Performance: Enterprise and the Cloud, 2nd Edition, served as great inspiration for research topics, material, and an unmatched level of detail. I can highly recommend them both, starting with BPF Performance Tools for beginners, and the second edition for more advanced usage. Aside from books, much of Brendan’s material published online has been incredibly helpful as well.

Memleak and argdist

Sasha Goldshtein wrote a fantastic article on two tools that Sasha wrote for BPF that could be incredibly useful in pinpointing leaks, collecting argument values, and more. I highly recommend reading it here.

Vector, Netflix

Frequent measurements provide historical insights into how a system has behaved in the past, which is invaluable information needed in figuring out why it behaves a certain way in the future. Netflix has created Vector for one of these purposes, and I highly recommend checking it out.