In this blog post we give an overview of Intel’s new processor feature, Processor Trace, and describe how it can enhance our 3rd generation threat detection technology. We outline how Processor Trace is integrated into VMRay Analyzer to identify and trigger hidden malware functionality. VMRay can also use it to detect code-reuse attacks effectively and to reconstruct the relevant control flow without gaps. Because Processor Trace lets us look back at what executed before an attack was detected, the reconstructed control flow leads us directly to the exploited vulnerability.
Intel’s new 6th Generation Processor Family (formerly code-named Skylake) has a new hardware feature called Processor Trace that can be used to collect run-time branch information efficiently. At first glance it is similar to existing technologies such as Last Branch Recording (LBR) and Branch Trace Messages (BTM). However, Processor Trace is much faster and more flexible in terms of the type and amount of trace information that can be recorded. LBR is also fast, but it only stores information about the most recent branches (up to 32). Its applicability is therefore limited when it comes to reconstructing extensive control flows that span more than a few functions. BTM, on the other hand, has no real limit on the number of branches that can be recorded, but this comes at the cost of a large performance impact on the whole system.
Processor Trace eliminates these drawbacks: it has no size limit on branch storage and is fast at the same time. Instead of writing all data to virtual memory, it records trace information directly in physical memory. This keeps the caches clean and avoids translation lookaside buffer (TLB) pollution. Additionally, it uses a packet-based output scheme that greatly reduces the required memory bandwidth. Instead of storing full source and destination addresses for every taken branch, it captures only the minimum information needed to reconstruct the control flow completely. Interpreting the recorded data therefore requires an additional decoder with disassembling capabilities, but the method is very efficient with respect to performance and memory consumption. Intel provides the library libipt, which contains a reference decoder implementation for this technology.
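To give a rough feeling for the bandwidth savings, the following Python sketch compares the two output schemes for a given branch mix. The byte sizes are simplifying assumptions made for this illustration, not measurements: 16 bytes for a full source/destination record, one byte per six taken/not-taken bits, and a worst case of nine bytes for each recorded indirect target.

```python
# Back-of-the-envelope comparison; all sizes below are illustrative assumptions.
FULL_RECORD_BYTES = 16    # assumed: two 64-bit addresses (source, destination)
TNT_BITS_PER_BYTE = 6     # assumed: one byte carries up to six taken/not-taken bits
TARGET_PACKET_BYTES = 9   # assumed worst case: 1 header byte + 8 address bytes


def full_address_bytes(conditional, indirect):
    """Bytes needed if every branch stores full source and destination."""
    return (conditional + indirect) * FULL_RECORD_BYTES


def packet_bytes(conditional, indirect):
    """Bytes needed if conditional branches cost one bit and only
    indirect branches store a target address."""
    tnt_bytes = (conditional + TNT_BITS_PER_BYTE - 1) // TNT_BITS_PER_BYTE
    return tnt_bytes + indirect * TARGET_PACKET_BYTES


if __name__ == "__main__":
    full = full_address_bytes(1_000_000, 50_000)
    packed = packet_bytes(1_000_000, 50_000)
    print(full, packed, round(full / packed, 1))
```

Under these assumptions, one million conditional and fifty thousand indirect branches need about 16.8 MB as full records but only about 0.6 MB as packets, a reduction by a factor of roughly 27. Real traces will differ, since actual packet sizes vary and additional packet types are emitted.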
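Once tracing is enabled, the processor records the current instruction pointer and processor mode (16-, 32-, or 64-bit) to tell the decoder where to start. After that, additional trace information is recorded only when the decoder needs it to reconstruct the actual control flow. For a conditional branch, a single taken/not-taken bit is enough. For indirect jumps, indirect calls, and (in general) returns, the target address is recorded. For asynchronous events such as interrupts and exceptions, both the source and the target address are recorded. Direct unconditional branches produce no output at all, because the decoder can read their targets from the disassembly.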
Besides this branch data, Processor Trace can be configured to record much more, such as performance information about code execution. For our purposes this is not relevant yet; we refer the reader to the Intel documentation for further information.
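To make the reconstruction step concrete, here is a minimal Python sketch of a toy decoder. It is not libipt and does not parse real packets: the `PROGRAM` table is a hypothetical stand-in for the disassembly, and the trace is modeled as a list of taken/not-taken bits plus a list of indirect branch targets. The sketch only shows the principle that the decoder walks the static code and consults the trace where the next address cannot be derived statically.

```python
from collections import deque

# Hypothetical toy "disassembly": block address -> (terminator kind, static info).
PROGRAM = {
    0x100: ("cond", {"taken": 0x200, "fallthrough": 0x110}),
    0x110: ("jmp", {"target": 0x300}),
    0x200: ("indirect", {}),
    0x300: ("cond", {"taken": 0x100, "fallthrough": 0x310}),
    0x310: ("stop", {}),
    0x400: ("jmp", {"target": 0x300}),
}


def reconstruct(start, tnt_bits, targets):
    """Replay the control flow from a start address and the recorded trace."""
    tnt = deque(tnt_bits)    # one bit per executed conditional branch
    tips = deque(targets)    # one address per executed indirect branch
    path = []
    addr = start
    while True:
        path.append(addr)
        kind, info = PROGRAM[addr]
        if kind == "cond":
            # Direction is unknown statically: consume one recorded bit.
            addr = info["taken"] if tnt.popleft() else info["fallthrough"]
        elif kind == "jmp":
            # Direct jump: the target comes from the disassembly, no trace data needed.
            addr = info["target"]
        elif kind == "indirect":
            # Target is unknown statically: consume one recorded address.
            addr = tips.popleft()
        else:
            return path


if __name__ == "__main__":
    path = reconstruct(0x100, [False, True, True, False], [0x400])
    print([hex(a) for a in path])
```

In this toy example, four bits and a single recorded address are enough to recover a path of eight basic blocks, which is the effect the packet scheme relies on.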
The already small performance impact can be reduced even further by limiting the amount of collected data: tracing can be filtered by the current value of CR3, the privilege level, or the current memory region. One also has to specify how and where the CPU should store the recorded branch data by setting up a Table of Physical Addresses (ToPA). Each table entry defines the location and size of a corresponding output memory region. The last entry in each table specifies the location of the next table. As soon as all regions defined in a table are filled, the processor automatically switches to the next table. Since a table can point to itself, circular output regions are possible as well.
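The table mechanism can be modeled in a few lines of Python. This is a simplified simulation for illustration only: the class and function names are hypothetical, the addresses are made up, and the real hardware entry layout and control bits are not modeled. It shows how output moves from region to region and how a table that points to itself yields a circular buffer.

```python
class Region:
    """One output region: a physical base address and a size in bytes."""
    def __init__(self, base, size):
        self.base = base
        self.size = size


class Table:
    """A list of regions plus a link to the next table (the last entry)."""
    def __init__(self, regions, next_table=None):
        self.regions = regions
        self.next_table = next_table


def fill(first_table, total_bytes):
    """Return the (base address, length) chunks in the order they are written."""
    chunks = []
    table = first_table
    remaining = total_bytes
    while remaining > 0 and table is not None:
        for region in table.regions:
            if remaining == 0:
                break
            n = min(region.size, remaining)
            chunks.append((region.base, n))
            remaining -= n
        # All regions of this table were visited: follow the link to the next table.
        table = table.next_table
    return chunks


if __name__ == "__main__":
    ring = Table([Region(0x10000, 4096), Region(0x80000, 8192)])
    ring.next_table = ring   # the table points to itself: circular output
    for base, length in fill(ring, 16384):
        print(hex(base), length)
```

With 12 KiB of buffer space and 16 KiB of trace data, the third chunk lands in the first region again and overwrites the oldest data, so the buffer always holds the most recent part of the trace.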
VMRay Analyzer is based on unique hypervisor-based 3rd generation threat detection technology. In short, an arbitrary executable or document is detonated within a virtualized environment and its execution is monitored completely externally, i.e., from the hypervisor. Because not a single bit within the analysis environment has to be modified, it is nearly impossible for the analyzed software to detect and evade the analysis process.
During the analysis, the hypervisor intercepts every single interaction between the monitored application and any other part of the system. For example, all API calls and direct system calls are intercepted, and their parameters and return values are recorded. This analysis data can be greatly enhanced with the help of the trace information obtained by Processor Trace. And since Processor Trace can be configured entirely from the hypervisor, neither the analyzed malware sample nor the guest operating system itself is aware of the tracing.
However, malware may rely on environmental conditions and execute relevant code paths only under circumstances that may not be present during analysis. For example, it may execute only on Wednesdays or on machines in Eastern Europe. Processor Trace can be used to identify this kind of dormant functionality very efficiently: by combining the recorded trace information with the static disassembly of the (unpacked and deobfuscated) binary in memory, we are able to reconstruct the exact control flow. With that data, we locate every non-executed portion of the code and identify the corresponding condition checks. Using our symbolic execution engine, we then determine which output values of the previously called APIs are needed to steer the control flow to the code in question. With that information at hand, we re-execute the sample and patch the relevant API results from the hypervisor to force the execution of the dormant functionality.
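The first part of this process, locating non-executed code and the checks that guard it, can be sketched in Python as follows. This is an illustrative simplification and not VMRay's implementation: the control-flow graph, the block names, and the function are hypothetical, and the symbolic execution and API patching steps are not shown.

```python
# Hypothetical static control-flow graph: block -> possible successors.
CFG = {
    "entry": ["check_day"],
    "check_day": ["payload", "exit"],
    "payload": ["check_locale"],
    "check_locale": ["encrypt", "exit"],
    "encrypt": ["exit"],
    "exit": [],
}


def dormant_regions(cfg, executed_path):
    """Map each (executed block, never-taken successor) pair to the
    unexecuted blocks that only become reachable through that successor."""
    executed = set(executed_path)
    report = {}
    for block, successors in cfg.items():
        if block not in executed:
            continue
        for succ in successors:
            if succ in executed:
                continue
            # Depth-first search over unexecuted blocks behind this branch.
            hidden, stack = set(), [succ]
            while stack:
                current = stack.pop()
                if current in hidden or current in executed:
                    continue
                hidden.add(current)
                stack.extend(cfg[current])
            report[(block, succ)] = sorted(hidden)
    return report


if __name__ == "__main__":
    # Path reconstructed from the trace: the day check failed, the sample exited.
    print(dormant_regions(CFG, ["entry", "check_day", "exit"]))
```

For the example path, the only finding is the branch from `check_day` to `payload`, which hides the blocks `payload`, `check_locale`, and `encrypt`. The condition evaluated in `check_day` is then the one to examine in order to decide which API result has to be changed on re-execution.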
Another powerful use case of integrating Processor Trace into VMRay Analyzer is the effective identification of ROP (and similar code-reuse) attacks. It not only detects such attacks but also allows us to reconstruct the exact control flow leading back to the exploited vulnerability. To that end, we use a circular buffer that contains the branch trace information of the latest execution steps before each API call. Every time a function is executed that is typically involved in code-reuse attacks, we examine the recorded trace information and check where the call originates from. As an effective heuristic, we look for an unusually high number of short code sequences that end with a free control-flow instruction, e.g., a return or a jump through a register. In such cases, there is a high probability of an ongoing exploitation attempt. Further evidence of an attack is provided by return instructions that “violate” the internal call stack maintained by the CPU’s branch prediction unit. Fortunately, these prediction misses are also stored in the trace information generated by the processor. An interesting read on a related technique is Taming ROP on Sandy Bridge by Georg Wicherski. The method described there uses performance counters for branch mispredictions to detect ROP attacks. However, it can only detect an ongoing attack; it cannot deliver any information about the particular control flow or the vulnerability that has been exploited.
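The gadget-counting heuristic can be illustrated with a short Python sketch. It is a simplified model under stated assumptions, not the detection logic used in the product: the record layout, the branch kind names, and both thresholds are hypothetical values chosen for the example, and the call-stack check based on return mispredictions is not modeled.

```python
from collections import deque, namedtuple

# One decoded trace record: branch kind and the number of instructions
# executed since the previous branch (hypothetical layout).
Branch = namedtuple("Branch", "kind insns")

FREE_BRANCHES = {"ret", "jmp_reg"}   # free control-flow instructions
MAX_GADGET_INSNS = 5                 # assumed upper bound for a "short" sequence
SUSPICIOUS_RATIO = 0.6               # assumed threshold, would need tuning


def looks_like_code_reuse(window):
    """Flag a trace window in which most records are short sequences
    ending in a return or a register-based jump."""
    if not window:
        return False
    gadgets = sum(
        1 for b in window
        if b.kind in FREE_BRANCHES and b.insns <= MAX_GADGET_INSNS
    )
    return gadgets / len(window) >= SUSPICIOUS_RATIO


if __name__ == "__main__":
    benign = [Branch("call", 12), Branch("cond", 4), Branch("cond", 7),
              Branch("ret", 20), Branch("call", 9), Branch("cond", 3),
              Branch("ret", 15), Branch("cond", 6)]
    chain = [Branch("ret", 2), Branch("ret", 3), Branch("ret", 1),
             Branch("jmp_reg", 2), Branch("ret", 2), Branch("ret", 4),
             Branch("cond", 8), Branch("ret", 2)]

    # The circular buffer keeps only the latest records before the API call.
    buffer = deque(maxlen=8)
    buffer.extend(benign)
    print(looks_like_code_reuse(buffer))
    buffer.extend(chain)
    print(looks_like_code_reuse(buffer))
```

In the benign window no record counts as a gadget, while in the second window seven of eight records do, which exceeds the assumed threshold. Because the buffer still holds the branches that preceded the first gadget, walking it backwards gives the starting point for locating the vulnerability.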
Branch tracing is a powerful method for control flow reconstruction and can therefore be used to enhance malware and exploit analysis. Unfortunately, all runtime monitoring impacts system performance and thus influences the effectiveness of dynamic analysis. Special processor support for tracing already existed in the form of LBR and BTM, but their limitations and performance drawbacks made them impractical for malware analysis. Intel’s new Processor Trace feature can be used to enhance 3rd generation threat detection technology significantly.