Monday, May 5, 2008

A Survey of Anti-Tamper Technologies

This article surveys the various anti–tamper (AT) technologies used to protect software. The primary objective of AT techniques is to protect critical program information by preventing unauthorized modification and use of software. This protection goal applies to any program that requires protection from unauthorized disclosure or inadvertent transfer of leadingedge technologies and sensitive data or systems. In this article, we review the various approaches to AT techniques, their strengths and weaknesses, their advantages and disadvantages, and briefly discuss a process for developing program protection plans. We also survey the tools that are typically used to circumvent AT protections, and techniques that are commonly used to make these protections more resilient against such attack.

The unauthorized modification and subsequent misuse of software is often referred to as software cracking. Usually, cracking requires disabling one or more software features that enforce policies (of access, usage, dissemination, etc.) related to the software. Because there is value and/or notoriety to be gained by accessing valuable software capabilities, cracking continues to be common and is a growing problem.

To combat cracking, anti–tamper (AT) technologies have been developed to protect valuable software. Both hardware and software AT technologies aim to make software more resistant against attack and protect critical program elements. However, before discussing the various AT technologies, we need to know the adversary's goals. What do software crackers hope to achieve. Their purposes vary, and typically include one or more of the following:

  • Gaining unauthorized access.
    The attacker's goal is to disable the software access control mechanisms built into the software. After doing so, the attacker can make and distribute illegal copies whose copy protection or usage control mechanisms have been disabled — this is the familiar software piracy problem. If the cracked software provides access to classified data, then the attacker's real goal is not the software itself, but the data that is accessible through the software. The attacker sometimes aims at modifying or unlocking specific functionality in the program, e.g., a demo or export version of software is often a deliberately degraded version of what is otherwise fully functional software. The attacker then seeks to make it fully functional by re–enabling the missing features.
  • Reverse engineering.
    The attacker aims to understand enough about the software to steal key routines, to gain access to proprietary intellectual property, or to carry out code–lifting, which consists of reusing a crucial part of the code (without necessarily understanding the internals of how it works) in some other software. Good programming practices, while they facilitate software engineering, also tend to simultaneously make it easier to carry out reverse engineering attacks. These attacks are potentially very costly to the original software developer as they allow a competitor (or an enemy) to nullify the developer's competitive advantage by rapidly closing a technology gap through insights gleaned from examining the software.
  • Violating code integrity.
    This familiar attack consists of either injecting malicious code (malware) into a program, injecting code that is not malevolent but illegally enhances a program's functionality, or otherwise subverting a program so it performs new and unadvertised functions (functions that the owner or user would not approve of). While AT technology is related to anti–virus protection, it has some crucial differences. AT technology is similar to virus protection in that it impedes malware infection of an AT–protected executable. However, AT technology differs from virus protection in that the AT technology's goal is not only to protect the client's software from unauthorized modification by malevolent outsiders (infection by malware written by others), but also to protect the software from modification by an authorized client. In many situations, it is important that only authorized applications execute (e.g., in a taximeter, odometer, or any situation where tampering is feared), using only authorized functionality, and that only valid data is used.

It should be clear by now that AT technology is not only about anti–piracy, it has an equal and broader aim of policy enforcement. That aim is to enforce the policies of the software publisher about the proper use of the software, even as the software is running in a potentially hostile environment where the user owns the processor and is intent on violating those policies.

There is a plethora of AT protection mechanisms. These include encryption wrappers, code obfuscation, guarding, and watermarking/fingerprinting in addition to various hardware techniques. While these techniques are discussed separately for pedagogical purposes, the reader should bear in mind that software is best protected when several protection techniques are used together in a mutually supportive manner. No technique is invulnerable or even clearly superior to the others in all circumstances; therefore, a mix of protection techniques allows the defense to capitalize on the strengths of each technique while also masking the shortfalls of other techniques. In the following paragraphs we present a brief overview of these techniques.

Hardware–Based Protections

The most common hardware approach uses a trusted processor. The trusted, tamper–resistant hardware checks and verifies every piece of hardware and software that exists — or that requests to be run on a computer — starting at the boot–up process [1]. This hardware could guarantee integrity by checking every entity when the machine boots up, and every entity that will be run or used on that machine after it boots up. The hardware could, for example, store all of the keys necessary to verify digital signatures, decrypt licenses, decrypt software before running it, and encrypt messages during any online protocols it may need to run (e.g., for updates) with another trusted remote entity (such as the software publisher).

Software downloaded onto a machine would be stored in encrypted form on the hard drive and would be decrypted and executed by the hardware, which would also encrypt and decrypt information it sends and receives from its random access memory. The same software or media could be encrypted in a different way for each trusted processor that would execute it because each processor would have a distinctive decryption key. This would put quite a dent in the piracy problem, as disseminating your software or media files to others would not do them much good (because their own hardware would have different keys).

A less drastic protection than using a separate, trusted, hardware computational device also involves hardware, but is more lightweight such as a smart card or physically secure token. These lightweight hardware protection techniques usually require that the hardware be present for the software to run, to have certain functionality, to access a media file, etc. Defeating this kind of protection usually requires working around the need for the hardware rather than duplicating the hardware. The difficulty of this work–around depends on the role that the tamper–resistant hardware plays in the protection. A device that just outputs a serial number is trivially vulnerable to a replay attack (e.g., an attacker replays a valid serial number to the software, without the presence of the hardware device), whereas a smart card that engages in a challenge–response protocol (different data each time) prevents the simple replay attack but is still vulnerable (e.g., to modification of the software interacting with the smart card). A device that decrypts content or that provides some essential feature of a program or media file is even harder to defeat.

Advantages and Drawbacks

The chief advantage of hardware–based protection techniques is that they run on a trusted CPU and can be made arbitrarily complex — hence, difficult to defeat while inflicting minimal computational cost on the protected software once it has been decrypted within the hardware and is running. However, there is a cost to decrypt it in the first place, and also to encrypt everything that goes out to the non–protected part of the system, and then decrypt it when it comes back into the trusted hardware.

In addition, it is generally more difficult to successfully attack tamper–resistant hardware and make the exploit directly available to others than a software–only protection scheme. This point holds only for a properly designed system. A compromise of hardware that imprudently contains the same secret keys as all other hardware of the same type would lead to widely reproducible exploits.

The advantages of hardware protection also include its capability to enforce such rules as «only approved peripherals can be a part of this computer system,» or «only approved (through digital signatures) software and contents are allowed», etc.

Nevertheless, hardware–based protection also has its drawbacks. There is the usual problem of inflexibility: hardware–based protections are more awkward to modify, port, and update than software–based ones. They are also less secure than commonly assumed and can be broken; see, e.g., [2]. To date, it has not been demonstrated that hardware protections can scale to grid computing or to smallscale computing. In addition, there is no guarantee that all avenues of attack are closed by hardware protection, and there is a significant cost attached to using hardware protection; the cost is driven mainly by the time needed to assemble, integrate, and test the hardware protection technique.

Additional drawbacks to the hardware protection approach include its expense and general fragility to accidents (an electric power surge that fries the processor also renders the hard drive contents unusable because the key that decrypts them is destroyed). The potential implications for censorship are also chilling. Another disadvantage of hardware protection is the boot–up time and the time spent encrypting and decrypting, which makes the approach problematic for low–end machines and embedded systems (unless the whole system lies within tamper–resistant hardware).

Using trusted hardware also incurs many indirect costs as a result of the earlier–mentioned limitations it imposes (e.g., the restriction to only certain approved hardware, software, and media creates a barrier to competition that leads to higher prices). Due to the imperfect protection offered by hardware protection, a more robust approach to software security interweaves hardware protection with other protection techniques such as those discussed in the following sections.

The rest of this article discusses the various software–based protection mechanisms. The reader should keep in mind that hardware and software protection techniques are not mutually exclusive. A judicious combination can serve to increase the security of the system more than any of its individual component techniques.

Encryption Wrappers

With encryption wrapper software security, critical portions of the software (or possibly all of it) are encrypted and decrypted dynamically at run–time. The encryption wrapper approach works well against a static attack, and forces the attacker to run the program in order to get an unencrypted image of it. To make the attacker's task harder, at no time during execution is the whole software in the clear; code decrypts just before it executes, leaving other parts of the program still encrypted. Therefore, no single snapshot of memory can expose the whole decrypted program. Of course, the attacker can take many such snapshots, compare them, and piece together the unencrypted program.

Another avenue of attack is to figure out the various decryption keys that are present in the software. One defensive technique that can be used to delay the attacker is to include defensive mechanisms in the program that deprive the attacker of using run–time attack tools, e.g., anti–debugger, anti–memory dump, and other defensive mechanisms, which make it more difficult for the attacker to run and analyze the program in a synthetic (virtual machine) environment. Yet, a determined attacker can usually defeat these protections (e.g., through the use of virtual machines that faithfully emulate a PC, including the most rarely used instructions, cache behavior, etc).

Encryption wrappers often use lightweight encryption to minimize the computational cost of executing the protected program. The encryption can be advantageously combined with compression: Not only does this result in a smaller amount of storage usage, but it also makes the encryption harder to defeat by cryptanalysis (of course one compresses before encryption, not the other way around).

An encryption wrapper's chief advantage is that it effectively hinders an attacker's ability to statically analyze a program. The attacker is then forced to perform more sophisticated types of dynamic attacks, which can significantly increase the amount of time needed to defeat the protection. The main disadvantage of encryption wrappers is the performance penalty caused by the decryption overhead, and its weakness to memory dumps: before it can run, encryption–protected software must be decrypted, at which point it becomes exposed.

Code Obfuscation

Code obfuscation consists of transforming code so it becomes less intelligible to a human, thus making it not only harder to reverse engineer, but also harder to tamper with. In software that has specific areas where policy checks are made, these areas will be harder to identify and disable after the software has been obfuscated. Obfuscation is usually carried out by inserting or performing obfuscating transformations. It is a requirement that these transformations do not damage a program's functionality, and it must have only a moderate impact on code performance, and on the storage space used on the disk and at run–time (of the two, speed is more important).

The obfuscation must also be resilient to attack, and for this reason it is desirable to maximize the obscurity of the obfuscated software. The obfuscating transformations need to be resilient against tools designed to automatically undo them, and to not be easily detectable by statistical analysis of the resulting code (resilience to statistical analysis makes it harder for automatic tools to find the locations where these transformations were applied).

The different types of obfuscation transformations that have been proposed [3] include the following:

  • Layout obfuscation.
    This modifies the physical appearance of the code, e.g., replacing important variables with random strings, removing all formatting (making nested conditional statements harder to read), etc. Such transformations are easy to make but are effective only when combined with other transformation techniques.
  • Data obfuscation.
    This obscures the data structures used within a program, e.g., the representation and the methods of using that data, independent data merging (and vice–versa — splitting up data that is dependent), etc. Data obfuscation serves to delay the attacker because data structures contain important information that any attacker needs to comprehend before launching an attack.
  • Control obfuscation.
    This manipulates the control flow of a program to make it difficult to discern its original structure, e.g., through merging (or splitting) various fragments of code, reordering expressions, loops, or blocks, etc. It is similar to creating a spurious program that is entangled with the original program so as to obscure the important control features of that program.
  • Preventive transformations.
    These aim at making it difficult for a deobfuscation tool to extract the true program from the obfuscated version of it. Preventive transformations can be implemented by using what Collberg [4] calls opaque predicates, an example of which is a conditional statement that always evaluates as true, but in a manner that is hard to recognize.

Obfuscation can be done at the source–code level (source–to–source translation) or at the assembly level. Although most obfuscators are of the former kind (source–to–source), assembly level obfuscation is better because it effectively hides the operation of the binary. If the source–code level transformations hide information by adding crude and inefficient ways of doing simple tasks, then the code optimizer in the compiler may undo them. If, on the other hand, the transformations are clever enough to fool the optimizer, then it can fail to properly do its job, and the performance of the resulting code suffers. Low– level obfuscation does not prevent the code optimizer from doing its job, but if done carelessly it runs the risk of producing code that looks so different from the kind produced by the compiler that it inadvertently flags the areas where the transformations were applied.

Obfuscation transformations are classified according to several criteria: how much obscurity they add to the program (potency), how difficult they are to break for a de–obfuscator (resilience), and how much computational overhead they add to the obfuscated application (cost). In [4], software complexity metrics are used to formalize the notion of transformation potency and resilience.

The potency of a transformation measures how much more difficult the obfuscated code is to understand for a human than the original code. On the other hand, the resilience of a transformation measures how well it stands up to attack by an automatic de–obfuscator. The resilience measurement takes two factors into account: the programmer effort required to construct the de–obfuscator and the execution time and space required by the de–obfuscator to reduce the potency of the transformation. The best obfuscation is usually achieved by a combination of the above three mentioned transformations. The combination of the three approaches provides a well–balanced mix of highly potent and resilient transformations.

Like all software–only protections, obfuscation can delay — but not prevent — a determined attacker intent on reverse engineering the software. Barak [5] presents a family of functions that are provably impossible to completely and successfully obfuscate. For more information and a discussion of code obfuscation, refer to [3,4,6,7].

Software Watermarking and Fingerprinting

The goal of watermarking is to embed information into software in a manner that makes it hard to remove by an adversary without damaging the software's functionality. The information inserted could be purchaser information, or it could be an integrity check to detect modification, the placing of caption–type information, etc. A watermark need not be stealthy; visible watermarks act as a deterrent (against piracy, for example), but most of the literature has focused on stealthy watermarks. In steganography (the art of concealing the existence of information within seemingly innocuous carriers), the mark is required to be stealthy: its very existence must not be detectable [8].

A specific type of watermarking is fingerprinting, which embeds a unique message in each instance of the software for traitor tracing. This has consequences for the adversary's ability to attack the watermark: two differently marked copies often make possible a diff attack that compares the two differently marked copies and can enable the adversary to create a usable copy that has neither one of the two marks. Thus, in any fingerprinting scheme, it is critical to use techniques that are resilient against such comparison attacks.

A watermark is generally required to be robust (hard to remove). In some situations, however, a fragile watermark is desirable; it is destroyed if even a small alteration is made to the software (e.g., this is useful for making the software tamper–evident).

Software watermarks can be static, i.e., readable without running the software, or could appear only at run–time (preferably in an evanescent form). In either case, reading the watermark usually requires knowing a secret key, without which the watermark remains invisible.

Watermarks may be used for proof of software authorship or ownership, fingerprinting for identifying the source of illegal information/software dissemination, proof of authenticity, tamper–resistant copyright protection, and captioning to provide information about the software. When software watermarks are used for proof of authorship or ownership (culprit–tracing), it is important to use a very resilient scheme. Recall that this is when the watermark contains information about the copyright owner as well as the entity that is licensed to use the software, thus allowing trace–back to the culprit if the item were to be illegally disseminated to others. Breaking the security of such a scheme can enable the attacker to frame an innocent victim.

As you can see, while watermarks can demonstrate authorized possession and the fact that software has been pirated, they do not address the reverse engineering or authorized execution issues of the other schemes discussed; therefore, we advocate the development and use of a spectrum of software protection techniques.

Guarding

A guard is code that is injected into the software for the sake of AT protection. A guard must not interfere with the program's basic functionality unless that program is tampered with — it is the tampering that triggers a guard to take action that deviates from normal program behavior. Examples of guard functionality range from tasks as simple as comparing a checksum of a code fragment to its expected value, to repairing code (in case it was maliciously damaged), to complex and indirect forms of protection through subtle side effects.

The preferred use of the guarding approach consists of injecting into the code to be protected a large number of guards that mutually protect each other as well as the software program in which they now reside. Guards can also be used to good effect in conjunction with hardware–based protection techniques to further ensure that the protected software is only executed in an authorized environment.

The number, types, and stealthiness of guards; the protection topology (who protects who); and where the guards are injected in the original code and how they are entangled with it are some of the parameters in the strength of the resulting protection: They are all tunable in a manner that depends on the type of code being protected, the desired level of protection, etc.

Manually installing such a tangled web of protection is impractical (as it must be done every time the software is updated), so it is important that this protection be done in a highly automated fashion using high–level scripts that specify the protection guidelines and parameters. It should be thought of as a part of the compilation process where an anti–tamper option results in code that is guarded and tamper–resistant.

The rationale for this approach is that a single (even if elaborate) AT protection scheme for a software application is insufficient because a single defense results in a single point of attack that can be located and compromised. To make self–protection robust, the defense must not rely on a single complex protection technique no matter how effective it might be. Instead, there needs to be a multitude of (possibly simple) protection techniques installed in the program that cooperatively enforce the code's integrity as well as protect the other against tampering.

The guard's response when it detects tampering is flexible and can range from a mild response to the disruption of normal program execution through injection of run–time errors (crashes or even subtle errors in the answers computed); the reaction chosen depends on the software publisher's business model and the expected adversary. Generally, it is better for a guard's reaction to be delayed rather than to occur immediately upon detection so that tracing the reaction back to its true cause is as difficult as possible and consumes a great deal of the attacker's time. More on guarding can be found in [9].

AT Process

This section explores the various AT guidelines expressed in the «Defense Acquisition Guidebook» [10], and the recommended process for developing a program protection plan. The «Defense Acquisition Guidebook» specifies the AT measures that should be considered for use on any system with critical program information (CPI), developed with allied partners, likely to be sold or provided to United States allies and friendly foreign governments, or likely to fall into enemy hands. The first step in the recommended AT methodology is to develop a program protection plan. The process of developing this plan includes the following:

  • Develop a list of critical technologies.
  • Develop a threat analysis.
  • Develop a list of identified vulnerabilities.
  • Develop a preliminary AT requirement.
  • Perform an analysis of AT methods that applies to the system, including cost/benefit assessments.
  • Provide an explanation of which AT method(s) will be implemented; develop a plan for validating the AT implementation.

The standard approach of validating AT protections is done via red–teaming. A red team consists of individuals who are well versed in security methods and their corresponding weaknesses. Their primary objectives are to attempt to defeat the protection, to assess the protection's strengths and weaknesses, and to make recommendations for improvement. While this is an effective method of evaluation, a major problem with red teams is that the validation is done by humans, and may not be totally reliable or repeatable. Furthermore, as the need for AT technologies grows, red teams are becoming increasingly called upon to perform evaluations. The teams are overwhelmed with assignments, significant delays in product evaluations, and release results. To improve this process, there is a clear and present need for automated testing and validation tools. Such tools could be used to define a standard set of metrics and guidelines to evaluate software protections.

Conclusion

This article has surveyed the motivation for using AT technology, the hardware and software AT techniques in use today, and the strengths and weaknesses of AT technologies. We also briefly introduced the process and documentation used to develop a program protection plan. The motivation for and primary objective of AT technology is to protect CPI by preventing unauthorized modification and use of software. The main software AT techniques include encryption wrappers, code obfuscation, watermarking/fingerprinting, and guarding.

A fundamental challenge faced by software AT technology is that the protected application is running on a host that is not trusted, and thus cannot be assured to be secure. Guards address this shortfall to a degree and in a flexible and extensible manner. However, in light of the need for robust, seamless, comprehensive software defense, using both software and hardware AT solutions in combination often offers an appealing alternative to using them individually (especially if economic considerations are factored in).

At this time, indications are that if strong software AT technology (e.g., in the form of judiciously constructed guards) is added to an application so that it requires the presence of a lightweight tamper–resistant hardware device in order to execute properly, the result is a strong yet economical software protection capability.

References

  1. Arbaugh, W., D. Farber, and J. Smith. A Secure and Reliable Bootstrap Architecture. Proc. of the IEEE Symposium on Security and Privacy, Oakland, CA, 1997.
  2. Anderson, R., and M. Kuhn. Tamper Resistance — A Cautionary Note. Proc. of Second Usenix Workshop on Electronic Commerce, Oakland, CA, Nov. 1996: 1–11.
  3. Collberg C., and C. Thomborson. «Watermarking, Tamper–Proofing, and Obfuscation Tools for Software Protection». IEEE Transactions on Software Engineering 28.8 (2002): 735–746, 2002.
  4. Collberg, C., C. Thomborson, and D. Low. «A Taxonomy of Obfuscating Transformations». Department of Computer Science, University of Auckland, New Zealand, 1997.
  5. Barak, B., et al. «On the (Im)possibility of Obfuscating Programs». Electronic Colloquium on Computational Complexity. Report No. 57, 2001.
  6. Wang, C., et al. «Software Tamper Resistance: Obstructing Static Analysis of Programs». University of Virginia, Computer Science Technical Report CS–2000–12, Dec. 2000.
  7. Wroblewski, G. «General Method of Program Code Obfuscation». Diss. Wroclaw University of Technology, Institute of Engineering Cybernetics, 2002.
  8. Johnson, N. «Introduction to Steganography and Steganalysis». Workshop on Statistical and Machine Learning Techniques in Computer Intrusion Detection, Johns Hopkins University, 11–13 June 2002.
  9. Chang, H., and M. Atallah. Protecting Software Code By Guards. Proc. of ACM Workshop on Security and Privacy in Digital Rights Management, Philadelphia, PA, Nov. 2001: 160–175.
  10. Office of the Secretary of Defense. Interim Defense Acquisition Guidebook. Washington, D.C.: OSD, 30 Oct. 2002 http://dod5000.dau.mil/DoD5000Interactive/InterimGuidebook.asp

Additional Reading

  1. Lipton, R.J., S. Rajagopalan, and D.N. Serpanos. «Spy: A Method to Secure Clients for Network Services». IEEE Distributed Computing Systems Workshops 2002: 23–28.
  2. Anderson, R., and M. Kuhn. «Low Cost Attacks on Tamper Resistant Devices». 5th International Workshop on Security Protocols, Apr. 1997: 125–136.
Dr. Mikhail J. Atallah, Eric D. Bryant, and Dr. Martin R. Stytz Arxan Technologies, Inc.

Monday, April 28, 2008

Static Disassembly of Obfuscated Binaries

Disassembly is the process of recovering a symbolic representation of a program's machine code instructions from its binary representation. Recently, a number of techniques have been proposed that attempt to foil the disassembly process. These techniques are very effective against state–of–the–art disassemblers, preventing a substantial fraction of a binary program from being disassembled correctly. This could allow an attacker to hide malicious code from static analysis tools that depend on correct disassembler output (such as virus scanners).

The paper presents novel binary analysis techniques that substantially improve the success of the disassembly process when confronted with obfuscated binaries. Based on control flow graph information and statistical methods, a large fraction of the program's instructions can be correctly identified. An evaluation of the accuracy and the performance of our tool is provided, along with a comparison to several state–of–the–art disassemblers.

1. Introduction

Software applications are often distributed in binary form to prevent access to proprietary algorithms or to make tampering with licensing verification procedures more difficult. The general assumption is that understanding the structure of a program by looking at its binary representation is a hard problem that requires substantial resources and expertise.

Software reverse–engineering techniques provide automated support for the analysis of binary programs. The goal of these techniques is to produce a higher–level representation of a program that allows for comprehension and possibly modification of the program's structure.

The software reverse–engineering process can be divided into two parts: disassembly and decompilation. The task of the disassembly phase is the extraction of the symbolic representation of the instructions (assembly code) from the program's binary image [12]. Decompilation [5,6] is the process of reconstructing higher–level semantic structures (and even source code) from the program's assembly–level representation.

A number of approaches have been proposed to make the reverse–engineering process harder [8,9,17]. These techniques are based on transformations that preserve the program's semantics and functionality and, at the same time, make it more difficult for a reverse–engineer to extract and comprehend the program's higher–level structures. The process of applying one or more of these techniques to an existing program is called obfuscation.

Most previouswork on program obfuscation has focused on the decompilation phase. To this end, researchers have proposed to use constructs such as indirect jumps or indirect memory references via pointers that are difficult to analyze [14]. In [13], Linn and Debray introduce novel obfuscation techniques that focus on the disassembly phase instead. Their techniques are independent of and complementary to previous approaches to make decompilation harder. The main idea is to transform the binary such that the parsing of instructions becomes difficult. The approach exploits the fact that the Intel x86 instruction set architecture contains variable length instructions that can start at arbitrary memory address. By inserting padding bytes at locations that cannot be reached during run–time, disassemblers can be confused to misinterpret large parts of the binary. Although their approach is limited to Intel x86 binaries, the obfuscation results against current state–of–the–art disassemblers are remarkable.

Linn and Debray state that their obfuscation techniques can enhance software security by making it harder for an attacker to steal intellectual property, to make unauthorized modifications to proprietary software or to discover vulnerabilities. On the other hand, program obfuscation could also be used by attackers to hide malicious code such as viruses or Trojan Horses from virus scanners [3,16]. Obfuscation also presents a serious threat to tools that statically analyze binaries to isolate or to identify malicious behavior [2,11]. The reason is that if relevant program structures were incorrectly extracted, malicious code could be classified as benign.

This paper presents static analysis techniques to correctly disassemble Intel x86 binaries that are obfuscated to resist static disassembly. The main contribution are general control–flow–based and statistical techniques to deal with hard–to–disassemble binaries. Also, a mechanism is presented that is specifically tailored against the tool implemented by Linn and Debray [13]. An implementation based on our approach has been developed, and the results show that our tool is able to substantially improve the disassembly of obfuscated binaries.

The paper is structured as follows. In Section 2, the principal techniques used in binary disassembly are reviewed, together with a discussion of Linn and Debray's recently proposed obfuscation techniques. In Section 3, we outline the disassembly approach and present our assumptions. Section 4 and Section 5 provide an indepth description of our disassembly techniques. In Section 6, a quantitative evaluation of the accuracy and performance of our disassembler is presented. Finally, in Section 7, we briefly conclude and outline future work.

2. Related Work and Background

Disassembly techniques can be categorized into two main classes: dynamic techniques and static techniques.

Approaches that belong to the first category rely on monitored execution traces of an application to identify the executed instructions and recover a (partial) disassembled version of the binary. Approaches that belong to the second category analyze the binary structure statically, parsing the instruction opcodes as they are found in the binary image.

Both static and dynamic approaches have advantages and disadvantages. Static analysis takes into account the complete program, while dynamic analysis can only operate on the instructions that were executed in a particular set of runs. Therefore, it is impossible to guarantee that the whole executable was covered when using dynamic analysis. On the other hand, dynamic analysis assures that only actual program instructions are part of the disassembly output. In this paper, we focus on static analysis techniques only.

In general, static analysis techniques follow one of two approaches. The first approach, called linear sweep, starts at the first byte of the binary's text segment and proceeds from there, decoding one instruction after another. It is used, for example, by GNU's objdump [10]. The drawback of linear sweep disassemblers is that they are prone to errors that result from data embedded in the instruction stream. The second approach, called recursive traversal, fixes this problem by following the control flow of the program [6,15]. This allows recursive disassemblers to circumvent data that is intertwined with the program instructions. The problem with the second approach is that the control flow cannot always be reconstructed precisely. When the target of a control transfer instruction such as a jump or a call cannot be determined statically (e.g., in case of an indirect jump), the recursive disassembler fails to analyze parts of the program's code. This problem is usually solved with a technique called speculative disassembly [4], which uses a linear sweep algorithm to analyze unreachable code regions.

Linn and Debray's approach [13] to confuse disassemblers are based on two main techniques. First, junk bytes are inserted at locations that are not reachable at run–time. These locations can be found after control transfer instructions such as jumps where control flow does not continue. Consider the example in Figure 1, where a function is presented in source form and in its corresponding assembly representation. At address 0x8048012, two junk bytes are added after the jump instruction at address 0x8048010. Inserting junk bytes at unreachable locations should not effect recursive disassemblers, but has a profound impact on linear sweep implementations.

Example disassembly obfuscated function Figure 1: Example function

The second technique relies on a branch function to change the way regular procedure calls work. This creates more opportunities to insert junk bytes and misleads both types of disassemblers. A normal call to a subroutine is replaced with a call to the branch function. This branch function uses an indirect jump to transfer control to the original subroutine. In addition, an offset value is added to the return address of the subroutine. When the subroutine is done, control is not transfered to the address directly after the call instruction. Instead, the instruction that is offset number of bytes after the call instruction is executed. In the example in Figure 1, two junk bytes are inserted after the call to the branch function at address 0x8048003. During run–time, the branch function modifies the return address such that the next instruction that is executed after the call is at address 0x804800a.

Figure 2 shows the disassembly results for the example function when using a linear sweep and a recursive traversal disassembler. The linear sweep disassembler is successfully confused in both cases where junk bytes are inserted. The two junk bytes at 0x8048008 are interpreted as or instruction, causing the the following four bytes (which are actually a cmp and a jne instruction) as being parsed as a 32–bit argument value. A similar problem occurs at address 0x8048012, resulting in only 5 out of 12 correctly identified instructions.

Traditional disassemblers Figure 2: Traditional disassemblers

This recursive disassembler is not vulnerable to the junk bytes inserted at address 0x8048012 because it recognizes instruction 0x8048010 as an unconditional jump. Therefore, the analysis can continue at the jump target, which is at address 0x8048019. However, the junk bytes after the call instruction at 0x8048003 lead to incorrect disassembly and the subsequent failure to decode the jump at 0x804800c with its corresponding target at 0x8048014. In this example, the recursive traversal disassembler succeeds to correctly identify 9 out of 12 instructions. However, the situation becomes worse when dealing with real binaries. Because calls are redirected to the branch function, large parts of the binary become unreachable for the recursive traversal algorithm. The results in Section 6 demonstrate that recursive traversal disassemblers, such as IDA Pro, perform worse on obfuscated binaries than linear sweep disassemblers, such as objdump.

3. Disassembling Obfuscated Binaries

Our disassembler performs static analysis on Intel x86 binaries. When analyzing an obfuscated binary, one cannot assume that the code was generated by a wellbehaved compiler. In fact, the obfuscation techniques introduced by Linn and Debray [13] precisely exploit the fact that standard disassemblers assume certain properties of compiler–generated code that can be violated without changing the program's functionality. By transforming the binary into functionally equivalent code that does not possess all the assumed properties, standard disassemblers are confused and fail to correctly translate binary code into its corresponding assembly representation. In general, certain properties are easier to change than others and it is not straightforward to transform (i.e., obfuscate) a binary into a functionally equivalent representation in which all the compiler–related properties of the original code are lost. When disassembling obfuscated binaries, we require that certain assumptions are valid. These assumptions (some of which constitute limiting factors for our ability to disassemble obfuscated binaries) are described in the following subsections.

  1. Valid instructions must not overlap. An instruction is denoted as valid if it belongs to the program, that is, it is reached (and executed) at run–time as part of some legal program execution trace. Two instructions overlap if one or more bytes in the executable are shared by both instruction. In other words, the start of one instruction is located at an address that is already used by another instruction. Overlapping instructions have been suggested to complicate disassembly in [7]. However, suitable candidate instructions for this type of transformation are difficult to find in real executables and the reported obfuscation effects were minimal [13].
  2. Conditional jumps can be either taken or not taken. This means that control flow can continue at the branch target or at the instruction after the conditional branch. In particular, it is not possible to insert junk bytes at the branch target or at the address following the branch instruction. Linn and Debray [13] discuss the possibility to transform unconditional jumps into conditional branches using opaque predicates. Opaque predicates are predicates that always evaluate to either true or false, independent of the input. This would allow the obfuscator to insert junk bytes either at the jump target or in place of the fall–through instruction. However, it is not obvious how to generate opaque predicates that are not easily recognizable for the disassembler. Also, the obfuscator presented in [13] does not implement this transformation.
  3. An arbitrary amount of junk bytes can be inserted at unreachable locations. Unreachable locations denotes locations that are not reachable at run–time. These locations can be found after instructions that change the normal control flow. For example, most compilers arrange code such that the address following an unconditional jump contains a valid instruction. However, we assume that an arbitrary number of junk bytes can be inserted there.
  4. The control flow does not have to continue immediately after a call instruction. Thus, an arbitrary number of padding bytes can be added after each call. This is different from the standard behavior where it is expected that the callee returns to the instruction following a call using the corresponding return instruction. More specifically, in the x86 instruction set architecture, the call operation performs a jump to the call target and, in addition, pushes the address following the call instruction on the stack. This address is then used by the corresponding ret instruction, which performs a jump to the address currently on top of the stack. However, by redirecting calls to a branch function, it is trivial to change the return address.

Our disassembly techniques can be divided into two classes: general techniques and tool–specific techniques.

General techniques are techniques that do not rely upon any knowledge on how a particular obfuscator transforms the binary. It is only required that the transformations respect our assumptions. Our general techniques are based on the program's control flow, similar to a recursive traversal disassembler. However, we use a different approach to construct the control flow graph, which is more resilient to obfuscation attempts. Program regions that are not covered by the control flow graph are analyzed using statistical techniques. The general techniques are described in more detail in Section 4.

An instance of an obfuscator that respects our assumptions is presented by Linn and Debray in [13]. By tailoring the static analysis process against a particular tool, it is often possible to reverse some of the performed transformations and improve the analysis results. Section 5 discusses potential modifications to our general techniques to take advantage of tool–specific knowledge when disassembling binaries transformed with Linn and Debray's obfuscator.

In Section 6, we show that the general techniques presented in the next section offer a significant improvement over previous approaches. When combined with tool–specific knowledge, the obfuscated binary is almost completely disassembled.

4. General Techniques

This section discusses the general techniques to reconstruct the program's control flow. Regions in the binary that are not covered by the control flow graph are analyzed using statistical methods.

4.1 Function Identification

The first step when disassembling obfuscated programs is to divide the binary into functions that can then be analyzed independently. The main reason for doing so is run–time performance; it is necessary that the disassembler scales well enough such that the analysis of large real–world binaries is possible.

An important part of our analysis is the reconstruction of the program's control flow. When operating on the complete binary, the analysis does not scale well for large programs. Therefore, the binary is broken into smaller regions (i.e., functions) that can be analyzed consecutively. This results in a run–time overhead of the disassembly process that is linear in the number of instructions (roughly, the size of the code segment).

A straightforward approach to obtain a function's start addresses is to extract the targets of call instructions. When a linker generates an ordinary executable, the targets of calls to functions located in the binary's text segment are bound to the actual addresses of these functions. Given the call targets and assuming that most functions are actually referenced from others within the binary, one can obtain a fairly complete set of function start addresses. Unfortunately, this approach has two drawbacks. One problem is that this method requires that the call instructions are already identified. As the objective of our disassembler is precisely to provide that kind of information, the call instructions are not available at this point. Another problem is that an obfuscator can redirect all calls to a single branching function that transfers control to the appropriate targets. This technique changes all call targets to a single address, thus removing information necessary to identify functions.

We use a heuristic to locate function start addresses. This is done by searching the binary for byte sequences that implement typical function prologs. When a function is called, the first few instructions usually set up a new stack frame. This frame is required to make room for local variables and to be able restore the stack to its initial state when the function returns. In the current implementation, we scan the binary for byte sequences that represent instructions that push the frame pointer onto the stack and instructions that increase the size of the stack by decreasing the value of the stack pointer. The technique works very well for regular binaries and also for the obfuscated binaries used in our experiments. The reason is that the used obfuscation tool [13] does not attempt to hide function prologs. It is certainly possible to extend the obfuscator to conceal the function prolog. In this case, our function identification technique might require changes, possible using tool–specific knowledge.

Note that the partitioning of the binary into functions is mainly done for performance reasons, and it is not crucial for the quality of the results that all functions are correctly identified. When the start point of a function is missed, later analysis simply has to deal with one larger region of code instead of two separate smaller parts. When a sequence of instructions within a function is misinterpreted as a function prolog, two parts of a single function are analyzed individually. This could lead to less accurate results when some intra–procedural jumps are interpreted as inter–procedural, making it harder to reconstruct the intra–procedural control flow graph as discussed in the following section.

4.2 Intra–Procedural Control Flow Graph

To find the valid instructions of a function (i.e., the instructions that belong to the program), we attempt to reconstruct the function's intra–procedural control flow graph. A control flow graph (CFG) is defined as a directed graph G = (V,E) in which vertices u, v ∈ V represent basic blocks and an edge e ∈ E : u → v represents a possible flow of control from u to v. A basic block describes a sequence of instructions without any jumps or jump targets in the middle. More formally, a basic block is defined as a sequence of instructions where the instruction in each position dominates, or always executes before, all those in later positions, and no other instruction executes between two instructions in the sequence. Directed edges between blocks represent jumps in the control flow, which are caused by control transfer instructions (CTIs) such as calls, conditional and unconditional jumps, or return instructions.

The traditional approach to reconstruct the control flow graph of a function works similar to a recursive disassembler. The analysis commences at the function's start address and instructions are disassembled until a control transfer instruction is encountered. The process is then continued recursively at all jump targets that are local to the procedure and, in case of a call instruction or a conditional jump, at the address following the instruction. In case of an obfuscated binary, however, the disassembler cannot continue directly after a call instruction. In addition, many local jumps are converted into non–local jumps to addresses outside the function to blur local control flow. In most cases, the traditional approach leads to a control flow graph that covers only a small fraction of the valid instructions of the function under analysis. This claim is supported by the experimental data shown in Section 6 that includes the results for a state–of–the–art recursive disassembler.

We developed an alternative technique to extract a more complete control flow graph. The technique is composed of two phases: in the first phase, an initial control flow graph is determined. In the following phase, conflicts and ambiguities in the initial CFG are resolved. The two phases are presented in detail in the next two sections.

4.2.1 Initial Control Flow Graph

To determine the initial control flow graph for a function, we first decode all possible instructions between the function's start and end addresses. This is done by treating each address in this address range as the begin of a new instruction. Thus, one potential instruction is decoded and assigned to each address of the function. The reason for considering every address as a possible instruction start stems from the fact that x86 instructions have a variable length from one to fifteen bytes and do not have to be aligned in memory (i.e., an instruction can start at an arbitrary address). Note that most instructions take up multiple bytes and such instructions overlap with other instructions that start at subsequent bytes. Therefore, only a subset of the instructions decoded in this first step can be valid. Figure 3 provides a partial listing of all instructions in the address range of the sample function that is shown in Figure 1. For the reader's reference, valid instructions are marked by an x in the «Valid» column. Of course, this information is not available to our disassembler. An example for the overlap between valid and invalid instructions can be seen between the second and the third instruction. The valid instruction at address 0x8048001 requires two bytes and thus interferes with the next (invalid) instruction at 0x8048002.

Partial instruction listing Figure 3: Partial instruction listing

The next step is to identify all intra–procedural control transfer instructions. For our purposes, an intraprocedural control transfer instruction is defined as a CTI with at least one known successor basic block in the same function. Remember that we assume that control flow only continues after conditional branches but not necessarily after call or unconditional branch instructions. Therefore, an instruction is an intra–procedural control transfer instruction if either (i) its target address can be determined and this address is in the range between the function's start and end addresses or (ii) it is a conditional jump.

Note that we assume that a function is represented by a contiguous sequence of instructions, with possible junk instructions added in between. However, it is not possible that the basic blocks of two different functions are intertwined. Therefore, each function has one start address and one end address (i.e., the last instruction of the last basic block that belongs to this function). However, it is possible that a function has multiple exit points.

In case of a conditional jump, the address that immediately follows the jump instruction is the start of a successor block, and thus, every conditional jump is also an intra–procedural control transfer operation. This is intuitively plausible, as conditional branches are often used to implement local branch (e.g., if–else) and loop (e.g., while, for) statements of higher–level languages, such as C.

To find all intra–procedural CTIs, the instructions decoded in the previous step are scanned for any control transfer instructions. For each CTI found in this way, we attempt to extract its target address. In the current implementation, only direct address modes are supported and no data flow analysis is performed to compute address values used by indirect jumps. However, such analysis could be later added to further improve the performance of our static analyzer. When the instruction is determined to be an intra–procedural control transfer operation, it is included in the set of jump candidates. The jump candidates of the sample function are marked in Figure 3 by an x in the «Candidate» column. In this example, the call at address 0x8048003 is not included into the set of jump candidates because the target address is located outside the function.

Given the set of jump candidates, an initial control flow graph is constructed. This is done with the help of a recursive disassembler. Starting with an initial empty CFG, the disassembler is successively invoked for all the elements in the set of jump candidates. In addition, it is also invoked for the instruction at the start address of the function.

The key idea for taking into account all possible control transfer instructions is the fact that the valid CTIs determine the skeleton of the analyzed function. By using all control flow instructions to create the initial CFG, we make sure that the real CFG is a subgraph of this initial graph. Because the set of jump candidates can contain both valid and invalid instructions, it is possible (and also frequent) that the initial CFG contains a superset of the nodes of the real CFG. These nodes are introduced as a result of argument bytes of valid instructions being misinterpreted as control transfer instructions. The Intel x86 instruction set contains 26 singlebyte opcodes that map to control transfer instructions (out of 219 single–byte instruction opcodes). Therefore, the probability that a random argument byte is decoded as CTI is not negligible. In our experiments (for details, see Section 6), we found that about one tenth of all decoded instructions are CTIs. Of those instructions, only two thirds were part of the real control flow graph. As a result, the initial CFG contains nodes and edges that represent invalid instructions. Most of the time, these nodes contain instructions that overlap with valid instructions of nodes that belong to the real CFG. The following section discusses mechanisms to remove these spurious nodes from the initial control flow graph. It is possible to distinguish spurious from valid nodes because invalid CTIs represent random jumps within the function while valid CTIs constitute a well–structured CFG with nodes that have no overlapping instructions.

Creating an initial CFG that includes nodes that are not part of the real control flow graph can been seen as the opposite to the operation of a recursive disassembler. A standard recursive disassembler starts from a known valid block and builds up the CFG by adding nodes as it follows the targets of control transfer instructions that are encountered. This technique seems favorable at first glance, as it makes sure that no invalid instructions are incorporated into the CFG. However, most control flow graphs are partitioned into several unconnected subgraphs. This happens because there are control flow instructions such as indirect branches whose targets often cannot be determined statically. This leads to missing edges in the CFG and to the problem that only a fraction of the real control flow graph is reachable from a certain node. The situation is exacerbated when dealing with obfuscated binaries, as inter–procedural calls and jumps are redirected to a branching function that uses indirect jumps. This significantly reduces the parts of the control flow graph that are directly accessible to a recursive disassembler, leading to unsatisfactory results.

Although the standard recursive disassembler produces suboptimal results, we use a similar algorithm to extract the basic blocks to create the initial CFG. As mentioned before, however, the recursive disassembler is not only invoked for the start address of the function alone, but also for all jump candidates that have been identified. An initial control flow graph is then constructed according to the code listing shown in Algorithm 1.

There are two differences between a standard recursive disassembler and our implementation. First, we assume that the address after a call or an unconditional jump instruction does not have to contain a valid instruction. Therefore, our recursive disassembler cannot continue at the address following a call or an unconditional jump. Note, however, that we do continue to disassemble after a conditional jump (i.e., branch). This can be seen at Label 5 of Algorithm 1 where the disassembler recursively continues after conditional branch instructions.

The second difference is due to the fact that it is possible to have instructions in the initial call graph that overlap. In this case, two different basic blocks in the call graph can contain overlapping instructions starting at slightly different addresses. When following a sequence of instructions, the disassembler can arrive at an instruction that is already part of a previously found basic block. In the regular case, this instruction is the first instruction of the existing block. The disassembler can complete the instruction sequence of the current block and create a link to the existing basic block in the control flow graph.

Algorithm disassemble

When instructions can overlap, it is possible that the current instruction sequence starts to overlap with another sequence in an existing basic block for some instructions before the two sequences eventually merge. At the point where the two sequences merge, the disassembler finds an instruction that is in the middle (or at the end) of a sequence associated with an existing basic block. In this case, the existing basic block is split into two new blocks. One block refers to the overlapping sequence up to the instruction where the two sequences merge, the other refers to the instruction sequence that both have in common. All edges in the control flow graph that point to the original basic block are changed to point to the first block, while all outgoing edges of the original block are assigned to the second. In addition, the first block is connected to the second one. The reason for splitting the existing block is the fact that a basic block is defined as a continuous sequence of instructions without a jump or jump target in the middle. When two different overlapping sequences merge at a certain instruction, this instruction has two predecessor instructions (one in each of the two overlapping sequences). Therefore, it becomes the first instruction of a new basic block. As an additional desirable side effect, each instruction appears at most once in a basic block of the call graph.

The functionality of splitting an existing basic block is implemented by the split procedure referenced at Label 3 of Algorithm 1. Whenever an instruction is found that is already associated with a basic block (check performed at Label 1), the instruction sequence of the current basic block is completed. When the instruction is in the middle of the existing block (check performed at Label 2), it is necessary to split the block. The current block is then connected either to the existing basic block or, after a split, to the newly created block that contains the common instruction sequence. The check performed at Label 4 takes care of the special case where the recursive disassembler starts with an instruction that is part of an existing basic block. In this case, the current block contains no instructions and a reference to the old block is returned instead.

The situation of two merging instruction sequences is a common phenomenon when disassembling x86 binaries. The reason is called self–repairing disassembly and relates to the fact that two instruction sequences that start at slightly different addresses (that is, shifted by a few bytes) synchronize quickly, often after a few instructions. Therefore, when the disassembler starts at an address that does not correspond to a valid instruction, it can be expected to re–synchronize with the sequence of valid instructions after a few steps [13].

Initial control flow graph Figure 4: Initial control flow graph

The initial control flow graph that is created by Algorithm 1 for our example function is shown in Figure 4. In this example, the algorithm is invoked for the function start at address 0x8048000 and the four jump candidates (0x8048006, 0x804800c, 0x8048010, and 0x8048017). The nodes in this figure represent basic blocks and are labeled with the start address of the first instruction and the end address of the last instruction in the corresponding instruction sequence. Note that the end address denotes the first byte after the last instruction and is not part of the basic block itself. Solid, directed edges between nodes represent the targets of control transfer instructions. A dashed line between two nodes signifies a conflict between the two corresponding blocks. Two basic blocks are in conflict when they contain at least one pair of instructions that overlap. As discussed previously, our algorithm guarantees that a certain instruction is assigned to at most one basic block (otherwise, blocks are split appropriately). Therefore, whenever the address ranges of two blocks overlap, they must also contain different, overlapping instructions. Otherwise, both blockswould contain the same instruction, which is not possible. This is apparent in Figure 4, where the address ranges of all pairs of conflicting basic blocks overlap. To simplify the following discussion of the techniques used to resolve conflicts, nodes that belong to the real control flow graph are shaded. In addition, each node is denoted with an uppercase letter.

4.2.2 Block Conflict Resolution

The task of the block conflict resolution phase is to remove basic blocks from the initial CFG until no conflicts are present anymore. Conflict resolution proceeds in five steps. The first two steps remove blocks that are definitely invalid, given our assumptions. The last three steps are heuristics that choose likely invalid blocks. The conflict resolution phase terminates immediately after the last conflicting block is removed; it is not necessary to carry out all steps. The final step brings about a decision for any basic block conflict and the control flow graph is guaranteed to be free of any conflicts when the conflict resolution phase completes.

The five steps are detailed in the following paragraphs.

Step 1: We assume that the start address of the analyzed function contains a valid instruction. Therefore, the basic block that contains this instruction is valid. In addition, whenever a basic block is known to be valid, all blocks that are reachable from this block are also valid.

A basic block v is reachable from basic block u if there exists a path p from u to v. A path p from u to v is defined as a sequence of edges that begins at u and terminates at v. An edge is inserted into the control flow graph only when its target can be statically determined and a possible program execution trace exists that transfers control over this edge. Therefore, whenever a control transfer instruction is valid, its targets have to be valid as well.

We tag the node that contains the instruction at the function's start address and all nodes that are reachable from this node as valid. Note that this set of valid nodes contains exactly the nodes that a traditional recursive disassembler would identify when invoked with the function's start address. When the valid nodes are identified, any node that is in conflict with at least one of the valid nodes can be removed.

In the initial control flow graph for the example function in Figure 4, only node A (0x8048000) is marked as valid. That node is drawn with a stronger border in Figure 4. The reason is that the corresponding basic block ends with a call instruction at 0x8048003 whose target is not local. In addition, we do not assume that control flow resumes at the address after a call and thus the analysis cannot directly continue after the call instruction. In Figure 4, node B (the basic block at 0x8048006) is in conflict with the valid node and can be removed.

Step 2: Because of the assumption that valid instructions do not overlap, it is not possible to start from a valid block and reach two different nodes in the control flow graph that are in conflict. That is, whenever two conflicting nodes are both reachable from a third node, this third node cannot be valid and is removed from the CFG. The situation can be restated using the notion of a common ancestor node. A common ancestor node of two nodes u and v is defined as a node n such that both u and v are reachable from n.

In Step 2, all common ancestor nodes of conflicting nodes are removed from the control flow graph. In our example in Figure 4, it can be seen that the conflicting node F and node K share a common ancestor, namely node J. This node is removed from the CFG, resolving a conflict with node I. The resulting control flow graph after the first two steps is shown in Figure 5.

CFG after two steps of conflict resolution Figure 5: CFG after two steps of conflict resolution

The situation of having a common ancestor node of two conflicting blocks is frequent when dealing with invalid conditional branches. In such cases, the branch target and the continuation after the branch instruction are often directly in conflict, allowing one to remove the invalid basic block from the control flow graph.

Step 3: When two basic blocks are in conflict, it is reasonable to expect that a valid block is more tightly integrated into the control flow graph than a block that was created because of a misinterpreted argument value of a program instruction. That means that a valid block is often reachable from a substantial number of other blocks throughout the function, while an invalid block usually has only a few ancestors.

The degree of integration of a certain basic block into the control flow graph is approximated by the number of its predecessor nodes. A node u is defined as a predecessor node of v when v is reachable by u. In Step 3, the predecessor nodes for pairs of conflicting nodes are determined and the node with the smaller number is removed from the CFG.

In Figure 5, node K has no predecessor nodes while node F has five. Note that the algorithm cannot distinguish between real and spurious nodes and thus includes node C in the set of predecessor nodes for node F. As a result, node K is removed. The number of predecessor nodes for node C and node H are both zero and no decision is made in the current step.

Step 4: In this step, the number of direct successor nodes of two conflicting nodes are compared. A node v is a direct successor node of node u when v can be directly reached through an outgoing edge from u. The node with less direct successor nodes is then removed. The rationale behind preferring the node with more outgoing edges is the fact that each edge represents a jump target within the function and it is more likely that a valid control transfer instruction has a target within the function than any random CTI.

In Figure 5, node C has only one direct successor node while node H has two. Therefore, node C is removed from the control flow graph. In our example, all conflicts are resolved at this point.

Step 5: In this step, all conflicts between basic blocks must be resolved. For each pair of conflicting blocks, one is chosen at random and then removed from the graph. No human intervention is required at this step, but it would be possible to create different alternative disassembly outputs (one output for each block that needs to be removed) that can be all presented to a human analyst.

It might also be possible to use statistical methods during Step 5 to improve the chances that the «correct» block is selected. However, this technique is not implemented and is left for future work.

The result of the conflict resolution step is a control flow graph that contains no overlapping basic blocks. The instructions in these blocks are considered valid and could serve as the output of the static analysis process. However, most control flow graphs do not cover the function's complete address range and gaps exist between some basic blocks.

4.3 Gap Completion

The task of the gap completion phase is to improve the results of our analysis by filling the gaps between basic blocks in the control flow graph with instructions that are likely to be valid. A gap from basic block b1 to basic block b2 is the sequence of addresses that starts at the first address after the end of basic block b1 and ends at the last address before the start of block b2, given that there is no other basic block in the control flow graph that covers any of these addresses. In other words, a gap contains bytes that are not used by any instruction in the control flow graph.

Gaps are often the result of junk bytes that are inserted by the obfuscator. Because junk bytes are not reachable at run–time, the control flow graph does not cover such bytes. It is apparent that the attempt to disassemble gaps filled with junk bytes does not improve the results of the analysis. However, there are also gaps that do contain valid instructions. These gaps can be the result of an incomplete control flow graph, for example, stemming from a region of code that is only reachable through an indirect jump whose target cannot be determined statically. Another frequent cause for gaps that contain valid instructions are call instructions. Because the disassembler cannot continue after a call instruction, the following valid instructions are not immediately reachable. Some of these instructions might be included into the control flow graph because they are the target of other control transfer instructions. Those regions that are not reachable, however, cause gaps that must be analyzed in the gap completion phase.

The algorithm to identify the most probable instruction sequence in a gap from basic block b1 to basic block b2 works as follows. First, all possibly valid sequences in the gap are identified. A necessary condition for a valid instruction sequence is that its last instruction either (i) ends with the last byte of the gap or (ii) its last instruction is a non intra–procedural control transfer instruction. The first condition states that the last instruction of a valid sequence has to be directly adjacent to the first instruction of the second basic block b2. This becomes evident when considering a valid instruction sequence in the gap that is executed at run–time. After the last instruction of the sequence is executed, the control flow has to continue at the first instruction of basic block b2. The second condition states that a sequence does not need to end directly adjacent to block b2 if the last instruction is a non intra–procedural control transfer. The restriction to non intra–procedural CTIs is necessary because all intra–procedural CTIs are included into the initial control flow graph. When an intra–procedural instruction appears in a gap, it must have been removed during the conflict resolution phase and should not be included again.

Instruction sequences are found by considering each byte between the start and the end of the gap as a potential start of a valid instruction sequence. Subsequent instructions are then decoded until the instruction sequence either meets or violates one of the necessary conditions defined above. When an instruction sequence meets a necessary condition, it is considered possibly valid and a sequence score is calculated for it. The sequence score is a measure of the likelihood that this instruction sequence appears in an executable. It is calculated as the sum of the instruction scores of all instructions in the sequence. The instruction score is similar to the sequence score and reflects the likelihood of an individual instruction. Instruction scores are always greater or equal than zero. Therefore, the score of a sequence cannot decrease when more instructions are added. We calculate instruction scores using statistical techniques and heuristics to identify improbable instructions.

The statistical techniques are based on instruction probabilities and digraphs. Our approach utilizes tables that denote both the likelihood of individual instructions appearing in a binary as well as the likelihood of two instructions occurring as a consecutive pair. The tables were built by disassembling a large set of common executables and tabulating counts for the occurrence of each individual instruction as well as counts for each occurrence of a pair of instructions. These counts were subsequently stored for later use during the disassembly of an obfuscated binary. It is important to note that only instruction opcodes are taken into account with this technique; operands are not considered. The basic score for a particular instruction is calculated as the sum of the probability of occurrence of this instruction and the probability of occurrence of this instruction followed by the next instruction in the sequence.

In addition to the statistical technique, a set of heuristics are used to identify improbable instructions. This analysis focuses on instruction arguments and observed notions of the validity of certain combinations of operations, registers, and accessing modes. Each heuristic is applied to an individual instruction and can modify the basic score calculated by the statistical technique. In our current implementation, the score of the corresponding instruction is set to zero whenever a rule matches. Examples of these rules include the following:

  • operand size mismatches;
  • certain arithmetic on special–purpose registers;
  • unexpected register–to–registermoves (e.g., moving from a register other than %ebp into %esp);
  • moves of a register value into memory referenced by the same register.

When all possible instruction sequences are determined, the one with the highest sequence score is selected as the valid instruction sequence between b1 and b2.

Gap completion and disassembler output Figure 6: Gap completion and disassembler output

The instructions that make up the control flow graph of our example function and the intermediate gaps are shown in the left part of Figure 6. It can be seen that only a single instruction sequence is valid in the first gap, while there is none in the second gap. The right part of Figure 6 shows the output of our disassembler. All valid instructions of the example function have been correctly identified.

5. Tool–Specific Techniques

The techniques discussed in the previous section can disassemble any binary that satisfies our assumptions with reasonable accuracy (see Section 6 for detailed results). As mentioned previously, however, the results can be improved when taking advantage of available tool–specific knowledge. This section introduces a modification to our general techniques that can be applied when disassembling binaries transformed with Linn and Debray's obfuscator.

A significant problem for the disassembler is the fact that it cannot continue disassembling at the address following a call instruction. As discussed in Section 2, Linn and Debray's obfuscator replaces regular calls with calls to a branch function. The branch function is responsible for determining the real call target, that is, the function that is invoked in the original program. This is done using a perfect hash function, using the location of the call instruction as input. During run–time, the location of the call instruction can be conveniently determined from the top of the stack. The reason is that the address following the call instruction is pushed on the stack by the processor as part of the x86 call operation.

Besides finding the real target of the call and jumping to the appropriate address, the branch function is also responsible for adjusting the return address such that control flow does not return directly to the address after the call instruction. This is achieved by having the branch function add a certain offset to the return address on the stack. This offset is constant (but possibly different) for each call instruction and obtained in a way similar to the target address by performing a table lookup based on the location of the caller. When the target function eventually returns using the modified address on the stack, the control flow is transfered to an instruction located at offset bytes after the original return address. This allows the obfuscator to fill these bytes with junk.

By reverse engineering the branch function, the offset can be statically determined for each call instruction. This allows the disassembler to skip the junk bytes and continue at the correct instruction. One possibility is to manually reverse engineer the branch function for each obfuscated binary. However, the process is cumbersome and error prone. A preferred alternative is to automatically extract the desired information.

We observe that the branch function is essentially a procedure that takes one input parameter, which is the address after the call instruction that is passed on the top of the stack. The procedure then returns an output value by adjusting this address on the stack. The difference between the initial value on the stack and the modified value is the offset that we are interested in. It is easy to simulate the branch function because its output only depends on the single input parameter and several static lookup tables that are all present in the binary's initialized data segment. As the output does not depend on any input the program receives during run–time, it can be calculated statically.

To this end, we have implemented a simple virtual processor as part of the disassembler that simulates the instructions of the branch function. Because the branch function does not depend on dynamic input, all memory accesses refer to addresses in the initialized data segment and can be satisfied statically. The execution environment is set up such that the stack pointer of the virtual processor points to an address value for which we want to determine the offset. Then, the simulator executes instructions until the input address value on the stack is changed. At this point, the offset for a call is calculated by subtracting the old address value from the new one.

Whenever the disassembler encounters a call instruction, the value of the address following the call is used to invoke our branch function simulator. The simulator calculates the corresponding offset, and the disassembler can then skip the appropriate number of junk bytes to continue at the next valid instruction.

6. Evaluation

Linn and Debray evaluated their obfuscation tool using the SPECint 95 benchmark suite, a set of eight benchmark applications written in C. These programs were compiled with gcc version egcs–2.91.66 at optimization level –O3 and then obfuscated.

To measure the efficacy of the obfuscation process, the confusion factor for instructions was introduced. This metric measures how many program instructions were incorrectly disassembled. More formally, let V be the set of valid program instructions and O the set of instructions that a disassembler outputs. Then, the confusion factor CF is defined as CF = V-OV . Because our work focuses on the efficacy of the disassembler in identifying valid instructions, we define the disassembler accuracy DA as DA = 1-CF.

Linn and Debray used three different disassemblers to evaluate the quality of their obfuscator. The first one was the GNU objdump utility, which implements a standard linear sweep algorithm. The second disassembler was implemented by Linn and Debray themselves. It is a recursive disassembler that uses speculative linear disassembly (comparable to our gap completion) for regions that are not reachable by the recursive part. This disassembler was also provided with additional information about the start and end addresses of all program functions. The purpose of this disassembler was to serve as an upper bound estimator for the disassembler accuracy and to avoid reporting «unduly optimistic results» [13]. The third disassembler was IDA Pro 4.3x, a commercial disassembler that is often considered to be among the best commercially available disassemblers. This belief is also reflected in the fact that IDA Pro is used to provide disassembly as input for static analysis tools such as [3].

We developed a disassembler that implements the general techniques and the tool–specific modification presented in the two previous sections. Our tool was then run on the eight obfuscated SPECint 95 applications. The results for our tool and a comparison to the three disassemblers used by Linn and Debray are shown in Table 1. Note that we report two results for our disassembler. One shows the disassembler accuracy when only general techniques are utilized. The second result shows the disassembler accuracy when the tool–specific modification is also enabled.

ProgramObjdumpLinn⁄DebrayIDA ProOur tool
generaltool‑specific
compress9556.0769.9624.1991.0498.07
gcc65.5482.1845.0988.4595.17
go66.0878.1243.0191.8196.80
ijpeg60.8274.2331.4691.6097.53
li56.6572.7829.0789.8697.35
m88ksim58.4275.6629.5690.3997.49
perl57.6672.0131.3686.9396.28
vortex66.0276.9742.6590.7196.65
Mean60.9175.2434.5590.1096.92
Table 1: Disassembler accuracy

These results demonstrate that our disassembler provides a significant improvement over the best disassembler used in the evaluation by Linn and Debray. Even without using tool–specific knowledge, the disassembler accuracy is higher than their recursive disassembler used to estimate the upper bound for the disassembler accuracy. When the tool–specific modification is enabled, the binary is disassembled almost completely. The poor results for IDA Pro can be explained with the fact that the program only disassembles addresses that can be guaranteed (according to the tool) to be instructions. As a result, many functions that are invoked through the branch function are not disassembled at all. In addition, IDA Pro continues directly after call instructions and is frequently mislead by junk bytes there.

Given the satisfying results of our disassembler, the disassembly process was analyzed in more detail. It is interesting to find the ratio between the number of valid instructions identified by the control flow graph and the number of valid instructions identified by the gap completion phase. Although the gap completion phase is important in filling regions not covered by the CFG, our key observation is the fact that the control transfer instructions and the resulting control flow graph constitute the skeleton of an analyzed function. Therefore, one would expect that most valid instructions can be derived from the control flow graph, and only small gaps (e.g., caused by indirect calls or unconditional jumps) need to be completed later. Table 2 shows the fraction (in percent) of correctly identified, valid instructions that were obtained using the control flow graph and the fraction obtained in the gap completion phase. Because the numbers refer to correctly identified instructions only, the two fractions sum up to unity. Both the results with toolspecific support and the results with the general techniques alone are provided. When tool specific support is available, the control flow graph contributes noticeable more to the output. In this case, the disassembler can include all regions following call instructions into the CFG. However, in both experiments, a clear majority of the output was derived from the control flow graph, confirming our key observation.

Programgeneraltool‑specific
CFGGapCFGGap
compress9587.0912.9196.363.64
gcc85.1214.8893.106.90
go89.1310.8795.114.89
ijpeg87.0212.9895.034.97
li85.6314.3795.114.89
m88ksim87.1812.8296.004.00
perl86.2213.7895.574.43
vortex88.0411.9694.675.33
Mean86.9313.0795.124.88
Table 2: CFG vs. gap completion

Because most of the output is derived from the control flow graph, it is important that the conflict resolution phase is effective. One third of the control transfer instructions that are used to create the initial control flow graphs are invalid. To achieve a good disassembler accuracy, it is important to remove the invalid nodes from the CFG. The first two steps of the conflict resolution phase remove nodes that are guaranteed to be invalid, given our assumptions. The third and forth step implement two heuristics and the fifth step randomly selects one of two conflicting nodes. It is evident that it is desirable to have as many conflicts as possible resolved by the first and second step, while the fifth step should never be required.

Table 3 shows for each program the number of basic blocks in the initial control flow graphs (column Initial Blocks) and the number of basic blocks in the control flow graphs after the conflict resolution phase (column Final Blocks). In addition, the number of basic blocks that were removed in each of the five steps of the conflict resolution phase are shown. The numbers given in Table 3 were collected when the tool–specific modification was enabled. The results were very similar when only general techniques were used.

ProgramInitial BlocksConflict ResolutionFinal Blocks
Step 1Step 2Step 3Step 4Step 5
compress9554674702146934242934838577
gcc245586217622568029801900565166878
go91140106678934940523115461749
ijpeg702559414606952991409549238
li634598350529749521257844657
m88ksim77344100616933693817710153134
perl10484110940114421175029115270266
vortex1187031500492211342440737380274
Table 3: Conflict resolution

It can be seen that most conflicts were resolved after the first three steps. About two thirds of the removed basic blocks were guaranteed to be invalid. This supports our claim that invalid control flow instructions, caused by the misinterpretation of instruction arguments, often result in impossible control flows that can be easily detected. Most of the remaining blocks are removed by the first heuristic that checks how tight a block is connected with the rest of the CFG. Invalid blocks are often loosely coupled and can taken out during this step. The last two steps were only responsible for a small fraction of the total removed blocks. The heuristic in step four was sometimes able to provide an indication of which block was valid. Otherwise, a random node had to be selected.

Static analysis tools are traditionally associated with poor scalability and the inability to deal with real–world input. Therefore, it is important to ascertain that our disassembler can process even large real–world binaries in an acceptable amount of time. In Section 4, we claimed that the processing overhead of the program is linear in the number of instructions of the binary. The intuitive reason is the fact that the binary is partitioned into functions that are analyzed independently. Assuming that the average size of an individual function is relatively independent of the size of the binary, the amount of work per function is also independent of the size of the binary. As a result, more functions have to be analyzed as the size of the binary increases. Because the number of functions increases linearly with the number of instructions and the work per function is constant (again, assuming a constant average function size), the overhead of the static analysis process is linear in the number of instructions.

ProgramSize (Bytes)InstructionsTime (s)
openssh263,68446,3434
compress951,768,42092,1379
li1,820,768109,6527
ijpeg1,871,776127,0129
m88ksim2,001,616127,3588
go2,073,728145,95311
perl2,176,268169,05415
vortex2,340,196204,23016
gcc2,964,740387,28928
emacs4.765,512405,53538
Table 4: Disassembler processing times

To support this claim with experimental data, the time for a complete disassembly of each evaluation binary was taken. The size of obfuscated programs of the SPECint 95 benchmark are in the range of 1.77 MB to 2.96 MB. To obtain more diversified results, we also disassembled one smaller (openssh 3.7) and one larger binary (emacs 21.3). The processing times were taken as the average of ten runs on a 1.8 GHz Pentium IV system with 512 MB of RAM, running Gentoo Linux 2.6. The results (in seconds) for the disassembler are listed in Table 4. There was no noticeable difference when using tool–specific modification.

Processing times and linear regression Figure 7: Processing times and linear regression

Figure 7 shows a plot of the processing times and the corresponding number of instructions for each binary. The straight line represents the linear regression line. The close proximity of all points to this line demonstrates that the processing time increases proportional to the number of instructions, allowing our disassembler to operate on large binaries with acceptable cost.

7. Conclusions

Correct disassembler output is crucial for many security tools such as virus scanners [3] and intrusion detection systems [11]. Recently, Linn and Debray [13] presented obfuscation techniques that successfully confuse current state–of–the–art disassemblers. We developed and implemented a disassembler that can analyze obfuscated binaries. Using the program's control flow graph and statistical techniques, we are able to correctly identify a large fraction of the program's instructions.

Obfuscation and de–obfuscation is an arms race. It is possible to devise obfuscation techniques that will make the disassembly algorithms describe in this paper less effective. However, this arms race is usually in favor of the de–obfuscator. The obfuscator has to devise techniques that transform the program without seriously impacting the run–time performance or increasing the binary's size or memory footprint while there are no such constraints for the de–obfuscator. Also, the de–obfuscator has the advantage of going second. That is, the obfuscator must resist all attacks, while the de–obfuscator can tailor the attack to a specific obfuscation technique. In this direction, a recent theoretical paper [1] also proved that obfuscation is impossible in the general case, at least for certain properties.

Acknowledgments

This research was supported by the Army Research Office under agreement DAAD19–01–1–0484 and by the National Science Foundation under grants CCR–0209065 and CCR–0238492.

References

  1. B. Barak, O. Goldreich, R. Impagliazzo, S. Rudich, A. Sahai, S. Vadhan, and K. Yang. On the (Im)possibility of Software Obfuscation. In Crypto, 2001.
  2. J. Bergeron, M. Debbabi, M.M. Erhioui, and B. Ktari. Static Analysis of Binary Code to Isolate Malicious Behaviors. In 8thWorkshop on Enabling Technologies: Infrastructure for Collaborative Enterprises, 1999.
  3. M. Christodorescu and Somesh Jha. Static Analysis of Executables to Detect Malicious Patterns. In 12th USENIX Security Symposium, 2003.
  4. C. Cifuentes and M. Van Emmerik. UQBT: Adaptable binary translation at low cost. IEEE Computer, 40(2–3), 2000.
  5. C. Cifuentes and A. Fraboulet. Intraprocedural Static Slicing of Binary Executables. In International Conference on Software Maintenance (ICSM'97), Bari, Italy, October 1997.
  6. C. Cifuentes and K. Gough. Decompilation of Binary Programs. Software Practice & Experience, 25(7):811–829, July 1995.
  7. F. B. Cohen. Operating System Protection through Program Evolution. http://all.net/books/IP/evolve.html.
  8. C. Collberg and C. Thomborson. Watermarking, Tamper–Proofing, and Obfuscation — Tools for Software Protection. IEEE Transactions on Software Engineering, 28(8):735–746, August 2002.
  9. C. Collberg, C. Thomborson, and D. Low. A Taxonomy of Obfuscating Transformations. Technical Report 148, Department of Computer Science, University of Auckland, July 1997.
  10. Free Software Foundation. GNU Binary Utilities, Mar 2002. http://www.gnu.org/software/binutils/manual/.
  11. J.T. Giffin, S. Jha, and B.P. Miller. Detecting manipulated remote call streams. In 11th USENIX Security Symposium, 2002.
  12. W.C. Hsieh, D. Engler, and G. Back. Reverse–Engineering Instruction Encodings. In USENIX Annual Technical Conference, pages 133–146, Boston, Mass., June 2001.
  13. C. Linn and S. Debray. Obfuscation of executable code to improve resistance to static disassembly. In 10th ACM Conference on Computer and Communications Security (CCS), pages 290–299, October 2003.
  14. T. Ogiso, Y. Sakabe, M. Soshi, and A. Miyaji. Software obfuscation on a theoretical basis and its implementation. IEICE Transactions on Fundamentals, E86–A(1), 2003.
  15. R. Sites, A. Chernoff, M. Kirk, M. Marks, and S. Robinson. Binary Translation. Digital Technical Journal, 4(4), 1992.
  16. Symantec. Understanding and Managing Polymorphic Viruses. http://www.symantec.com/avcenter/whitepapers.html.
  17. G. Wroblewski. General Method of Program Code Obfuscation. In Proceedings of the International Conference on Software Engineering Research and Practice (SERP), Las Vegas, NV, June 2002.
Christopher Kruegel, William Robertson, Fredrik Valeur and Giovanni Vigna. Reliable Software Group. University of California Santa Barbara. {chris,wkr,fredrik,vigna}@cs.ucsb.edu

Saturday, April 26, 2008

Obfuscation, Weird Languages, and Code Aesthetics

The standard idea of code aesthetics, when such an idea manifests itself at all, allows for programmers to have elegance and clarity as their standards. This paper explores programming practices in which other values are at work, showing that the aesthetics of code must be enlarged to accommodate them. The two practices considered are obfuscated programming and the creation of «weird languages» for coding. Connections between these two practices, and between these and other mechanical and literary aesthetic traditions, are discussed.

1. Introduction

Programmers write code in order to cause the computer to function in desired ways. But modern computer programs are written in a form, usually textual, that is also meant to be manipulable and understandable by human beings. For a programmer to understand what she herself is writing, and to incorporate code that others have written, and to simply learn how to program with greater facility and on a larger, more complex scale, code has been made legible to people. While a computer system may compile or interpret code, it is important to the nature of code that it is interpreted by people as well.

A typical perspective on code would be that clarity and elegance are the only possible values that programmers can have when writing it, although they may succeed to a greater or lesser extent at writing clear and elegant code. But if this were the case, how is it possible to explain the way that people sometimes intentionally obfuscate their code, making its functioning more or less impenetrable, even when there is no commercial or practical reason to do so? The existence of obfuscated programming as a software development practice, and as an aesthetic practice, throws a wrench into the simplified theory of coding that would claim that coders must always strive for clarity. An additional complication is seen in programming languages that are themselves designed as jokes or parodies, sometimes called «weird programming languages» or «esoteric programming languages.» Such languages are designed to make legibility of any program difficult. Obfuscated code and weird languages highlight the importance of the human reading of code in software development. If some code is only to be read by a machine, it can be neither obfuscated nor clear: it can only function properly or not.

This paper suggests some ways to enlarge an aesthetics of code to account for the existence of obfuscated programming and «weird languages». Such consideration shows that a previously neglected layer of computing and new media is available for rich aesthetic understanding.

2. Reading Code

Version 2.1 of the online lexical reference system WordNet gives 11 senses for «read», including «look at, interpret, and say out loud something that is written or printed» and «interpret the significance of, as of palms, tea leaves, intestines, the sky, etc.; also of human behavior». [14] This discussion is about a fairly literal application of the most common sense, «interpret something that is written or printed», although of course code that appears on a screen (rather than being written or printed out) can also be read.

The understanding of behavior is certainly involved in reading code in the primary sense of «read», however. It is essential to any ordinary human reading of a computer program to develop an understanding of how the computer will behave, and what it will compute, when it runs the code that is being examined. In a popular book on the history of software, one of the developers of FORTRAN is characterized as «an extraordinary programmer who could "execute" a program in his head, as a machine would, and then write error–free code with remarkable frequency». [7] Actually, all programmers must do this to some extent, using some internal model of what code will do. Just as understanding what a program does, and why, is critical on a practical level for the programmer, it is important to the aesthetics of code as well. Because code functions, «the aesthetic value of code lies in its execution, not simply its written form. To appreciate it fully we need to "see" the code to fully grasp what it is we are experiencing and to build an understanding of the code's actions». [2]

The analysis of a computer program or system often involves examining how the program behaves and «reading» (in this other sense, «interpreting the significance of») the intention behind the program, the structure of the program, or the more fundamental causes for the outputs observed. This is very frequently done in reverse–engineering in «black–box» situations, where code and other internals are not available for inspection. A network administrator might also be able to «read» the behavior of a malfunctioning router and figure out the problem without looking at any code. But these types of analysis also apply to systems that are not governed by legible code at all, and are not, by themselves, examples of the phenomenon under consideration, the human reading and interpretation of particular texts, computer programs.

Reading in the main sense is about looking at something abstract. «Reading a photograph» sounds odd, perhaps because the photograph is not printed matter but also because it represents a framed perspective rather directly, with little abstraction. It is much more usual to read a diagram or map, because these are abstract representations. The development of software brought code into a legible condition. Cables patched into the ENIAC were not themselves legible, but assembly language for the stored–program EDSAC was. Human readability of programs was further enhanced as high–level programming languages, beginning with FORTRAN, were developed.

int i;main(){for(;i["]<i;++i){--i;}"];read('-'-'-',i+++"hell\
o, world!\n",'/'/'/'));}read(j,i,p){write(j/p+p,i---j,i/i);}

// Figure 1. An anonymous entry to the 1984 International Obfuscated C Code Contest that prints «hello, world!»

In the question and answer period after a lecture, Donald Knuth, the famous computer scientist who is author of The Art of Computer Programming, recalls reading the program SOAP from Stan Poley: «absolutely beautiful. Reading it was just like hearing a symphony, because every instruction was sort of doing two things and everything came together gracefully». He also remembers reading the code to a compiler written by Alan Perlis and others: «plodding and excruciating to read, because it just didn't possess any wit whatsoever. It got the job done, but its use of the computer was very disappointing». Knuth says of the aesthetics of reading programs and the reader's pleasure: «I do think issues of style do come through and make certain programs a genuine pleasure to read. Probably not, however, to the extent that they would give me any transcendental emotions». [6]

This discussion is not about any sentimental effects that code may have on the human reader, but does consider in detail the issues of programming style and the ways in which human readers read code. An aesthetic of code is suggested by Knuth's comments, one that is typified by beauty and grace and is clearly identified by Maurice Black in his dissertation, «The Art of Code»:

Computing culture ... has adopted a traditional model of literary aesthetics as a means of effecting change, finding political utility and social value in the wellcrafted product that is at once entirely usable and wholly beautiful to contemplate. The distinctions are clearly evident in the respective disciplines' discourses: whereas terms such as «elegant» and «beautiful» circulate freely in computer culture to describe wellcrafted code, elegance, beauty, and all their synonyms have been effectively exiled from the vocabulary of literary and cultural theory ... [1]

Black devotes a section to Knuth's aesthetic views and his concept of «literate programming,» and another section to John Lions's book–length commentary on the beautiful, elegant Unix operating system. «The Art of Code» clearly establishes the classical aesthetic of programming as the dominant one in the discourse of software development. More recent articles, such as one entitled «Beautiful Code» that appeared in Dr. Dobbs, show that this aesthetic is still going strong: «Instead of searching for some automated measure ... perhaps we should be striving for beauty in our work because we believe that beautiful things are better.» [3] It is fairly easy to find programmers extolling the beauty of programs and code snippets online, and also easy to find suggestions for writing elegant, clearly–written code in introductory programming textbooks.

There is a dark side to coding, however, one in which, even though a person can see into what would otherwise be the black box of the program, the source code itself is obscure, contrived to foil human legibility rather than enhance it.

3. Hello, Obfuscation

In 1984 Landon Curt Noll and Larry Bassel held the first International Obfuscated C Code Contest. The contest was a success that has been repeated many times; judging of the 18th IOCCC was underway when this article was written. Only small, complete C programs can be entered in the contest, which rewards originality and the aesthetic abuse of the C language. The contest's stated goals include demonstrating the importance of programming style («in an ironic way») and illustrating «some of the subtleties of the C language.» [4]

An anonymous entry in the first IOCCC (Figure 1) accomplishes these goals in only two lines, and also plays on the conventional «hello, world!» program, a program which is typically used as a simple first example when learning a programming language. Brian Kernighan and Dennis Ritchie (the creator of C) begin their classic book The C Programming Language [5] with such a program:

#include <stdio.h>
main()
{
    printf("hello, world\n");
}

The obfuscated program prints «hello, world!» as it is supposed to, but in a very tortuous way. To see how this program comments on C programming style and the subtleties of C, it is necessary to discuss the program in detail, and to discuss the C programming language in detail. The explication that follows will be most easily followed by those who know how to program and will be best understood by those who have had some experience programming in C. However, the connection between the obfuscations seen in this code and the particular nature of C should be evident to some extent even to those who are not able, or do not wish, to follow all the details.

To begin, here is a clearer C program that prints «hello, world!»:

main()
{
    write(0,"hello, world!\n",14);
}

Even this simple program comes with a bit more baggage than the BASIC equivalent, 10 PRINT "hello, world!", and it is more complex than the program Kernighan and Ritchie use to introduce C. The system call write is used in this code with three arguments: 0 means the writing will be done to standard output; the second argument is the string to write, which includes a newline character encoded as \n at the end; and the third argument, 14, is the length of the string, the number of characters in it. The following program adds one layer of obfuscation, by using a function to print out the "hello, world!\n" string one character at a time:

int i;

main()
{
    for(i=0 ; i<14 ; i++)
    {
        write_one_letter("hello, world!\n" + i);
    }
}

write_one_letter(letter)
{
    write(0,letter,1);
}

This makes it harder to see how the program works, but it makes visible some of the trickery that is possible, some would even say encouraged, in C. Notice that part of this program involves adding a string constant and a number, an operation which cannot be done in many strongly typed programming languages. In Java, where addition of String objects is defined as concatenation, evaluating the expression ("string" + 17) involves constructing a String out of the number, then adding the two: the result is "string17". A string constant in C is «really» a number, however, which means that adding a string and a number has an entirely different meaning. The string, seen as a number, is the address in memory where the first character resides. Add one to this number, and the result is the location of the second character. So this for loop, starting at position 0 and finishing at 13, has the effect of sending each character in the string to the write_one_letter function for printing.

To obfuscate the for loop a bit more, the i<14 condition is written in a more elaborate way. Oddly enough, this condition could be written "xxxxxxxxxxxxxx"[i], which has the effect of returning character number i from a string that has 14 characters in it. This yields a positive number (meaning TRUE) until i reaches 14, which corresponds to the end of the string; when the end of the string is reached it returns FALSE. This happens to be the case because strings in C are terminated with NULL, which, in C, means the same thing as FALSE. Now, to make things more puzzling, any array reference in C can either be written a[b] or b[a]. The values of a and b are added together and their sum is used to look up the array entry, so it doesn't matter which one is inside the brackets and which one comes before them. Thus, the condition can be written even more confusingly as i["xxxxxxxxxxxxxx"]. Also, any string that is 14 characters long can be used in this condition. To create additional confusion about the program's syntax, the fully–obfuscated program uses a different string to create the condition i["]<i;++i){--i;}"]. This makes it difficult to see where the data of the string ends and the code of the program begins.

The function write_one_letter is also given two additional, superfluous parameters and its name is changed to read. Redefining read to be a function that writes one letter is a particularly gruesome move, but this is allowed in C; read is a system call, not a keyword.

int i;

main()
{
    for(i=0 ; i["]<i;++i){--i;}"] ; i++)
    {
    read(0,"hello, world!\n" + i,1);
    }
}

read(j,letter,p)
{
    write(0,letter,1);
}

The meaningful name letter can be changed to i to make it seem as if this is the same i that was used previously — it is not. And, within the read function, i is written as i--, which suggests that the i up above might be getting decremented when this happens — it is not; this decrementing has no effect because this variable i «expires» immediately, at the end of the function. The call to read can be crammed into the increment part of the for statement, with the ++ operator is placed after i, to increment its value after the statement has been executed; then another + can be added to perform addition and make the puzzling–looking +++. The initialization of i to 0 can be left out. Integer variables in C are set to zero when they are defined, so the i=0 in the program actually has no effect, except to make the program easier to understand. With these changes, the code looks like this:

int i;

main()
{
    for( ; i["]<i;++i){--i;}"] ;
        read(0,i+++"hello, world!\n",1));
}

read(j,i,p)
{
    write(0,i--,1);
}

There are only two differences between this code and the final obfuscated program: the formatting of the text and the use of some confusing ways to write zero and one. To turn to the second of these, one fancy way to write zero is '-'-'-', that is, the numerical value of the '-' character subtracted from itself. Similarly, '/'/'/' divides the numerical value of the '/' character by itself, giving one. (Doing arithmetic with characters, like adding numbers and strings, is also not the most standard programming practice, although programmers are of course aware that characters have numerical representations.) The fancy zero and fancy one values that are obtained by doing this are passed to the read function as the variables j and p; that function then uses other elaborate ways to write zero and one. j/p+p is always 0/2 in this code and thus always zero. i/i is always one. i---j is a way of writing (i--)-j, and, since j has the value zero, this does a meaningless subtraction and is the same as just writing i--. Adding in these elaborate ways of expressing zero and one, the code looks like this:

int i;

main()
{
    for( ; i["]<i;++i){--i;}"] ;
        read('-'-'-',i+++"hello, world!\n",'/'/'/'));
}

read(j,i,p)
{
    write(j/p+p,i---j,i/i);
}

The final program is the above code with all unnecessary whitespace removed and with the resulting line broken in two, using a backslash in the middle of the "hello, world!\n" string.

This example suffices to explain what obfuscations are and how they relate to the programming language in which they are written, although most IOCCC entries do far more elaborate things. Gavin Barraclough's 2004 entry, which won best of show, is exemplary. His program, less than 3600 characters in length, is actually formatted in a «friendly» way, but is cryptically scattered with one–letter variable names. The approximately two and a half pages of code provide, as the hint file explains,

a 32–bit multitasking operating system for x86 computers, with GUI and filesystem, support for loading and executing user applications in elf binary format, with ps2 mouse and keyboard drivers, and vesa graphics. And a command shell. And an application — a simple text–file viewer. [4]

4. The Comedian As The Language C

Some of the obfuscations that are seen in IOCCC, and some that can be seen in the «hello, world!» program, can be more or less universally applied by programmers, regardless of language. The use of meaningless variable names such as j and p is always possible. The deceptively–named variable i (which looks like an earlier variable i) and the misleadingly–named read function are other examples of a universal programming pitfall. Whenever variable and function names can be freely chosen, there is always the potential for the coder's choice to be uninformative or misleading. This can be intensified in C, where variable names are case sensitive; some programs take advantage of this to name variables o and O, for instance, inviting additional confusion with the number zero. This play, which can be called naming obfuscation, shows one very wide range of choices that programmers have. Such play refutes the idea that the programmer's task is automatic, value–neutral, and disconnected from the meanings of words in the world.

While these programs often critique or play with programming in general, the winning IOCCC programs also strongly assert their Cness. a[b] and b[a] do not mean the same thing in other languages, so a programmer could not choose the more confusing of the two. Other languages do not define the addition of strings and numbers, or they define it in a way that seems more intuitive, at least to beginning programmers. But C, by giving the programmer the power to use pointers into memory as numbers and to perform arithmetic with them, particularly enables this sort of pointer confusion. By showing how much room there is to program in perplexing ways — and yet accomplishing astounding results at the same time — obfuscated programs demonstrate that C is powerful, and also that clarity in C code is achieved only with effort.

The «fake ending» to the for loop in the hello world program, which is achieved by embedding a deceptive string "] <i;++i){--i;}", is an example of data/code confusion. This is actually a mild example meant to fool a reader for a moment into thinking that this (meaningless) string is code; other obfuscated programs may transgress the code/data boundary in other ways, by consuming their source code as input, by generating their own code as output, or by modifying themselves as they run.

There is also an Obfuscated Perl contest, run annually by The Perl Journal since 1996. While Perl is quite unlike C, even beginning Perl programmers will be quick to realize the great potential for obfuscation that lies within the language. For one thing, Perl offers a dazzling variety of extremely useful special variables, represented with pairs of punctuation marks; this feature of the language nearly merits an obfuscation category of its own. Perl's powerful pattern–matching abilities also enable cryptic and deft string manipulations. Perl is sometimes deacronymized as «Practical Extraction and Report Language,» but has also been said to stand for «Pathologically Eclectic Rubbish Lister.» The language is ideal for text processing, which means that printing «hello, world!» and other short messages can be done in even more interesting ways. Thus, the tradition of writing an obfuscated Perl program that prints «Just another Perl hacker,» arose on USENET and became common enough that a program to do this is known simply as a JAPH. The popularity of these programs is attested to by the first section of the Perl FAQ, which answers the question «What is a JAPH?» [10]

More generally, Perl has as its mantra «there are many ways to do it.» A half–dozen Perl programmers may easily know eight or ten different ways to code exactly the same thing. Because of this, obscure ways of doing fairly common tasks are lurking everywhere. A common, high–level obfuscation technique that is seen in obfuscated Perl and also in obfuscated C (however differently it may be expressed there) involves choosing the least likely way to do it. This could mean using a strange operator, a strange special variable, or an unusual function (or an ordinary function in an unusual way). It could also involve treating data that is typically seen as being one type as some other type, a view that is permissible according to the language but not intuitive.

Perl and C are distinguished by having obfuscated programming contests, but they are not widely despised languages — unlike, for instance, COBOL or Visual Basic. Why are these hateful programming languages not the targets of obfuscatory ridicule? The most obvious explanation is that the programmers who write obfuscated code are Perl and C hackers, often professional ones. They enjoy hacking in these languages, as do many free software developers and creative coders, and would not choose to program in COBOL or Visual Basic for fun. Their play with Perl and C is not pure pillory. In addition to making fun of some «misfeatures» or abusable features of the languages, obfuscated code shows how powerful, flexible programming languages allow for creative coding, not only in terms of the output but in terms of the legibility and appearance of the source code.

What all obfuscations have in common — naming obfuscations and language–specific ones, such as choosing the least well–known language construct to accomplish something — is that they explore the play in a language, the free space that is available to programmers. If something can only be done one way, it cannot be obfuscated. The play in a programming language can also be used to make the program signify something else, besides being valid code that compiles or is interpreted to some running form.

5. Multiple Coding

Recent IOCCC programs include a racing game in the style of Pole Position, a CGI–enabled Web server, and a maze displayer with code in the shape of a maze. It is common for obfuscated programs to be of unusual visual appearance. The code may spell out the name of the program, or the name of the contest, in large letters, or be in the form of some other ASCII art picture. This is a type of double coding, or, more generally, multiple coding, which can also be seen in Perl poetry and in «bilingual» programs.

The classic example of double coding in natural languages is the sentence «Jean put dire comment on tape,» which is grammatical English and grammatical French, although each word has a different meaning in each language. (In French, the sentence means "Jean [male name] is able to say how one types.") Harry Mathews contributed to further French/English double coding by assembling the Mathews Corpus, a list of words which exist in both languages but have different meanings. In programming, an important first step was the 1968 Algol by Noel Arnaud, a book of poems composed from keywords in the Algol programming language. However, these poems are not executable programs; they are English poems that were assembled from a very restricted vocabulary. [8]

A notable modern ancestor of Arnaud's Algol is Perl poetry, in which texts that can be read as poems are devised so as to also be valid Perl programs. As critics of code aesthetics have noted, even award–winning Perl poetry is often little more than an exercise of «porting» existing song lyrics into Perl, and the practice «does little to articulate the language of perl itself.» [2] While it is possible to obfuscate a program, in the sense of the IOCCC or the Obfuscated Perl Contest, by fashioning it in the form of an English poem, the goals of competitive obfuscators and Perl poets appear to be quite different. Although a Perl poem must be a valid program, what the program actually does is often an afterthought in Perl poetry. For instance, the winning program in the first Perl Poetry Contest does nothing. In contrast, a program's function is essential to obfuscated programming. So, while Perl poetry is an interesting phenomenon to many new media scholars, there are reasons, quite apart from any possible distaste for poetry, that this practice seems less interesting to programmers. The interesting phenomenon of multiple coding can be found in obfuscated programs, too, while these programs also feature impressive, intricate workings that are essential to their aesthetics.

Some other and quite extreme examples of multiple coding are also seen in programs that are «bilinguial» or «multilingual» and are analogous to «Jean put dire comment on tape» — they are valid computer programs in more than one computer language. These can be achieved by the re–use of keywords and operators or by using comments in one program to include code in another language.

6. Hello, Weird

In the field of weird languages, also known as esoteric languages, the programmer moves up a level to exploit not just the play of a particular language, but the play that is possible in programming language design itself. Weird programming languages are not designed for any real–world application or normal educational use; rather, they are intended to test the boundaries of programming language design. A quality they share with obfuscated code is that they often ironically comment on features of existing, traditional languages.

There are literally dozens of weird languages, commenting on many different aspects of language design, programming history and programming culture. A representative selection is considered here, with an eye towards understanding what these languages have to tell us about programming aesthetics.

Languages are considered in terms of four dimensions of analysis: 1) parody, spoof, or explicit commentary on language features, 2) a tendency to reduce the number of operations and strive toward computational minimalism, 3) the use of structured play to explicitly encourage and support doublecoding, and 4) the goal of creating a puzzle, and of making programming difficult. These dimensions are not mutually exclusive categories, nor are they meant to be exhaustive. Any one weird language may be interesting in several of these ways, though one particular dimension will often be of special interest.

7. Abandon All Sanity, Yewho Enter Here: Intercal

INTERCAL is the canonical example of a language that parodies other programming languages. It is also the first weird language, and is highly respected in the weird language community. It was designed in 1972 at Princeton University by two students, Don Woods and James Lyon. (Later, while at Stanford, Woods was the co–author of the first interactive fiction, Adventure.) The explicit design goal of INTERCAL is

...to have a compiler language which has nothing at all in common with any other major language. By "major" we meant anything with which the author's were at all familiar, e.g., FORTRAN, BASIC, COBOL, ALGOL, SNOBOL, SPITBOL, FOCAL, SOLVE, TEACH, APL, LISP and PL/I.» [13]

INTERCAL borrows only variables, arrays, text input/output, and assignment from other languages. All other statements, operators and expressions are unique (and uniquely weird). INTERCAL has no simple if construction for doing conditional branching, no loop constructions, and no basic math operators — not even addition. Effects such as these must be achieved through composition of non–standard and counterintuitive constructs. In this sense INTERCAL also has puzzle aspects.

However, despite the claim that this language has «nothing at all in common with any other major language», INTERCAL clearly spoofs the features of contemporaneous languages, combining multiple language styles together to create an ungainly, unaesthetic style. From COBOL, INTERCAL borrows a verbose, English–like style, including optional syntax that increases the verbosity; all statements can be prepended with PLEASE. Sample INTERCAL statements in this COBOL style include FORGET, REMEMBER, ABSTAIN and REINSTATE. From FORTRAN, INTERCAL borrows the use of optional line numbers, which can appear in any order, to mark lines, and the DO construct, which in FORTRAN is used to initiate loops. In INTERCAL, however, every statement must begin with DO. Like APL, INTERCAL makes heavy use of single characters with special meaning, requiring even simple programs to be liberally sprinkled with non alphanumeric characters. In a sense, INTERCAL exaggerates the worst features of many languages and combines them together into a single language.

The compiler, appropriately called «ick,» continues the parody. Anything the compiler can't understand, which in a normal language would result in a compilation error, is just skipped. This «forgiving» feature makes finding bugs very difficult; it also introduces a unique system for adding program comments. The programmer merely inserts non–compileable text anywhere in the program, being careful not to accidentally embed a bit of valid code in the middle of their comment.

The language manual hammers home the parody. After explaining that INTERCAL stands for «Compiler Language with No Pronounceable Acronym,» the manual proceeds with a series of in jokes on language design. At one point the reader is presented with a logic diagram that claims to provide a simpler way of understanding the SELECT operation (SELECT being one of INTERCAL's two non–intuitive math operators): «The gates used are Warmenhovian logic gates, which means the outputs have four possible values: low, high, undefined ..., and oscillating ...» The reader is presented with a maze–like logic diagram in which lines needlessly zig–zag, sometimes dead–end, and all eventually connect at the system bus, the BUS LINE; of the many lines heading off diagram from the BUS LINE, all go «TO NEW YORK» except for the one «TO PHILIDELPHIA.» All non–alphanumeric characters are given special names: tail (,), hybrid (;), mesh (#), worm (-) and so forth.

Thirty–three years later, INTERCAL still has a devoted following. Eric Raymond, the current maintainer of INTERCAL, revived the language in 1990 with his implementation CINTERCAL, which added the COME FROM construct to the language — the inverse of the much–reviled GO TO.

8. Minimalism: Brainfuck

Languages that parody comment on other programming languages; languages in the minimalist vein, on the other hand, comment on the space of computation. Specifically, they call attention to the very small amount of structure needed to create a universal computational system. (A «system» in this sense can be as varied as a programming language, a formal mathematical system, or a physical processes, such as a machine.) A universal system can perform any computation that it is theoretically possible to perform; such a system can do anything that any other formal system is capable of doing, including emulating any other system. This property is what allows one to implement one language, such as Perl, in another language , such as C, or to implement an interpreter or compiler for a language directly in hardware (using logic gates), or to write a program that runs on some specific hardware to provide a platform for yet other programs (as the Java Virtual Machine does). Universality in a programming language is obviously a desired trait, since it means that the language places no limits on the processes that can be specified in the language. There are less powerful ways to compute, some of which are used often — for instance, regular expressions, of the sort found in the Find and Replace dialog of word processors, are powerful enough to tell whether a string has an even number of characters in it, but cannot determine whether the length of a string is a prime number, as a universal computer can.

Universal computation was discovered by Alan Turing and described in his 1937 investigation of the limits of computability, «On Computable Numbers.» While his paper proved the counter–intuitive result that there exist formally specified problems for which there exists no computational process (that is, no program) for finding a solution, the important result for this paper was his definition of a notional machine, the Turing Machine, to specify what he meant by computation.

A Turing Machine consists of 1) an infinite tape, divided into cells (memory locations), along which a read/write head moves reading and writing symbols to and from the tape, and 2) a single state register that can store a symbol indicating the machine's current state. A Turing Machine is governed by a rule table which specifies, for each possible combination of state symbol and symbol read from the tape, what symbol the head will write to the tape, whether the head will move left or right, and what new symbol is stored in the state register. While it is easy to imagine that one could define a TM to compute a specific function, Turing proved that there exist TMs that can simulate the activity of any arbitrary TM; these are universal Turing Machines. The structure necessary to achieve universality is surprisingly small; for example, a universal TM can be defined using only 2 state symbols and 18 tape symbols (2x18).

Minimalist languages strive to achieve universality while providing the smallest number of language constructs possible. Such languages also often strive for syntactic minimalism, making the textual representation of programs minimal as well. Minimal languages are sometimes called Turing Tarpits, after epigram 54 in Alan Perli's Epigrams of Programming: «54. Beware the Turing tar–pit in which everything is possible but nothing of interest is easy.» [11].

Brainfuck is an archetypically minimalist language, providing merely seven commands, each represented by a single character. These commands operate on an array of 30,000 byte cells initialized to 0. The commands are:

> Increment the pointer (point to the memory cell to the right) < Decrement the pointer (point to the memory cell to the left) + Increment the byte pointed to - Decrement the byte pointed to . Output the byte pointed to , Accept a byte of input and write it into the byte pointed to [ Jump forward to the corresponding ] if pointing to 0 ] Jump back to the command after the corresponding [ if pointing to a non–zero value.

A Brainfuck «hello, world» program follows:

++++++++++[>+++++++>++++++++++>+++>+<<<<>++.>+.++
+++++..+++.>++.<<+++++++++++++++.>.+++.------.---
-----.>+.>.

Minimalist languages also comment on computer architectures as well the nature of computation, and can have the flavor of a minimal assembly language. The language OISC explicitly parodies assembly language, for example. OISC stands for the «One Instruction Set Computer», referencing the standard acronyms RISC (Reduced Instruction Set Computer) and CISC (Complex Instruction Set Computer). OISC consists of a single instruction, subtract–and–branch–unless–positive. subleq(a, b, c) subtracts the contents of memory location a from the contents of memory location b, stores the result in b, and, if the result of the subtraction was 0 or negative, jumps to the address stored in memory location c. Assembly languages commonly contain separate arithmetic operations (add and subtract), as well as various branch operations that test a memory location and branch if the memory location is, for example, positive, or negative, or zero. OISC parodies assembly by combining an arithmetic and branch operation into a single instruction and providing that to the programmer as the only instruction.

9. Structured Play: Shakespeare

Some weird languages encourage double coding by structuring the play within the language such that valid programs can also be read as a literary artifact. As was previously described, double–coding is certainly possible in languages such as C and Perl, and in fact is an important skill in the practice of obfuscated programming. But where C and Perl leave the space of play relatively unstructured, forcing the programmer to shoulder the burden of establishing a double coding, structured play languages, through their choice of keywords and their treatment of programmer defined names (e.g. variable names), support double coding within a specific genre of humanreadable textual production. The language Shakespeare exemplifies this structured play aspect.

Here is a fragment of a Shakespeare program that reads input and prints it out in reverse order:

[Enter Othello and Lady Macbeth]

Othello:
You are nothing!
        Scene II: Pushing to the very end.

Lady Macbeth:
Open your mind! Remember yourself.
Othello:

You are as hard as the sum of yourself and a stone
wall. Am I as horrid as a flirt-gill?
Lady Macbeth:

If not, let us return to scene II. Recall your
imminent death!

Othello:
You are as small as the difference between
yourself and a hair!

Shakespeare structures the play of the language so as to double–code all programs as stage plays, specifically, as spoofs on Shakespearean plays. This is done primarily by structuring the play (that is, the free space) that standard languages provide in the naming of variables and constants. In standard languages, variable names are a free choice left to the programmer, while numeric constants (e.g. 1) are either specified by the textual representation of the number, or through a name the programmer has given to select constants. In contrast, Shakespeare Dramatis Personae (variables) must be the name of a character from some Shakespeare play, while constants are represented by nouns. The two fundamental constants in Shakespeare are -1 and 1. The dictionary of nouns recognized by the Shakespeare compiler have been divided into positive, negative, and neutral nouns. All positive (e.g. «lord», «angel», «joy») and neutral (e.g. «brother», «cow», «hair») nouns have the value 1. All negative nouns (e.g. «bastard», «beggar», «codpiece») have the value -1. Constants other than -1 and 1 are created by prefixing them with adjectives; each adjective multiplies the value by 2. So sorry little codpiece denotes the number -4.

The overall structure of Shakespeare follows that of a stageplay. Variables are declared in the Dramatis Personae section. Named acts and scenes become labeled locations for jumps; let us return to scene II is an example of a jump to a labeled location. Enter and exit (and exeunt) are used to declare which characters (variables) are active in a given scene; only two characters may be on stage at a time. Statements are accomplished through dialog. By talking to each other, characters set the values of their dialog partner and themselves, compare values, execute jumps, and so forth. Conditional jumps are accomplished by one character posing a true or false question, and the second character describing what action to take based on the truth value. Such a jump appears in the previous code sample, where Othello asks Lady Macbeth Am I as horrid as a flirt–gill? (is the value of the variable Othello equal to -1), and Lady Macbeth responds If not, let us return to scene II.

In a programming language, keywords are words that have special meaning for the language, indicating commands or constructs, and thus can't be used as names by the programmer. An example from C is the keyword for used to perform iteration; for can not be used by the programmer as the name of a variable or function. In standard languages, keywords typically limit or bound play, as the keywords are generally not selected by language designers to facilitate double–coding. This is, in fact, what makes code poetry challenging; the code poet must hijack the language keywords in the service of a doublecoding. In contrast, weird languages that structure play provide keywords to facilitate the double–coding that is generally encouraged by the language. Shakespeare keywords maintain a stylistic consistency with a melodramatic spoof of Shakespearean plays. Output is accomplished via Open your heart (output value as number) and Speak your mind (output value as character), input by Listen to your heart (input value as number) and Open your mind (input value as character). A number of comparative synonyms are provided for accomplishing inequality tests. For example, friendlier and jollier perform the greater–than test, as in are you friendlier than a fatherless bastard?, while punier and worse perform the less–than test, as in are you punier than a gentle king?

Another language, Chef, illustrates different design decisions for structuring play. Chef facilities double–coding programs as recipes. Variables are declared in an ingredients list, with amounts indicating the initial value (e.g., 114 g of red salmon). The type of measurement determines whether an ingredient is wet or dry; wet ingredients are output as characters, dry ingredients are output as numbers. Two types of memory are provided, mixing bowls and baking dishes. Mixing bowls hold ingredients which are still being manipulated, while baking dishes hold collections of ingredients to output. What makes Chef particularly interesting is that all operations have a sensible interpretation as a step in a food recipe. Where Shakespeare programs parody Shakespearean plays, and often contain dialog that doesn't work as dialog in a play («you are as hard as the sum of yourself and a stone wall»), it is possible to write programs in Chef that might reasonably be carried out as a recipe. Chef recipes do have the unfortunate tendency to produce huge quantities of food, however, particularly because the sous–chef may be asked to produce sub–recipes, such as sauces, in a loop.

A number of languages structuring play have been based on other weird languages. Brainfuck is particularly popular in this regard, spawning languages such as FuckFuck (operators are replaced with curse words) and Cow (instructions are all the word «moo» with various capitalizations).

10. The Sun The Sun, His Mind Puzzle: Malbolge

Languages that have a puzzle aspect explicitly seek to make programming difficult by providing unusual, counter–intuitive control constructs and operators. While INTERCAL certainly has puzzle aspects, its dominant feature is its parody of 1960s language design. Malbolge, named after the eighth circle of hell in Dante's Inferno, is a much more striking example of the puzzle language. Where INTERCAL sought to merely have no features in common with any other language, Malbolge had a different motivation, as author Ben Olmstead writes:

It was noticed that, in the field of esoteric programming languages, there was a particular and surprising void: no programming language known to the author was specifically designed to be difficult to program in. Certainly, there were languages which were difficult to write in, and far more were difficult to read (see: Befunge, False, TWDL, RUBE...). But even INTERCAL and BrainF***, the two kings of mental torment, were designed with other goals ... Hence the author created Malbolge. ... It was designed to be difficult to use, and so it is. It is designed to be incomprehensible, and so it is. So far, no Malbolge programs have been written. Thus, we cannot give an example. [9]

Malbolge was designed in 1998. It was not until 2000 that Andrew Cooke, using AI search techniques, succeeded in generating the first Malbolge program, the «hello, world!» program — actually, it prints HEllO WORld — that follows:

(=<`$9]7<5YXz7wT.3,+O/o'K%$H"'~D|#z@b=`{^Lx8%$Xmr
kpohm-kNi;gsedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA
@?>=<;:9876543s+O<oLm

The writing of complex Malboge programs was enabled by Lou Scheffer's cryptanalysis of Malbolge and discovered «weaknesses» that the programmer can systematically exploit:

The correct way to think about Malboge, I'm convinced, is as a cryptographer and not a programmer. Think of it as a complex code and/or algorithm that transforms input to output. Then study it to see if you can take advantage of its weaknesses to forge a message that produced the output you want. [12]

His analysis proved that the language allowed for universal computation. The «practical» result was the production of a Brainfuck to Malbolge compiler.

What makes Malbolge so difficult? Like many minimalist languages, Malbolge is a machine language written for a fictitious and feature–poor machine, and thus gains some difficulty of writing and significant difficulty of reading from the small amount of play provided to the programmer for expressing human, textual meanings. However, as Olmstead points out, the mere difficulty of machine language is not enough to produce a truly devilish language. The machine model upon which Malbolge runs has the following features which contribute to the difficulty of the language:

Trinary machine model

Programmers are used to all number representations bottoming out in binary representation at the machine–level. By making trits rather than bits the fundamental representation, this de–familiarizes the machine. This trinary orientation is borrowed from tri–INTERCAL, a trinary variant of INTERCAL.

Minimalism

Malbolge provides a minimal computational model. There are three registers, two of which are a data pointer and a code pointer, and seven instructions, represented by the ASCII characters (j i * p < / v). j and i manipulate the data and code pointer, * and p perform two trinary operations, < and / read and write characters from the A (accumulator) register, and v stops the machine.

Counterintuitive operations

Like INTERCAL, Malbolge does not provide standard constructs, such as conditional branching or arithmetic. Instead those operations must be built from two operations. * rotates the trinary cell pointed to by the D pointer 1 trit to the right. (Actually, bit–wise rotation is a standard operation on most computers — by providing this construct, Malbolge is being uncharacteristically forgiving.) p performs a tritwise operation on the contents of the A register and the number pointed to by D register. The p operation, often referred to as the crazy op, purposefully corresponds to no natural operation. In presenting the table that describes how trits are combined by the crazy op, Olmstead writes «don't look for a pattern, it's not there.»

Indirect instruction decoding

In standard machine models of computation, the code that will be executed next is determined by a program counter. Usually, after executing one instruction, the program counter is simply incremented so that it points to the next one. The only other thing that can happen is a «branch,» which corresponds, for instance, to if and GOTO statements. In this case, the execution of the current instruction causes the program counter's value to change, so that it points to some other location in memory. In either situation, the code that runs next is sitting somewhere in memory; it is directly fetched and run. In standard machine models, the instructions as laid out in memory are exactly the instructions the machine will execute.

Malbolge, in contrast, performs a complicated transformation on the instruction pointed at by the code pointer before executing it. As the manual states:

When the interpreter tries to execute a program, it first checks to see if the current instruction is a graphical ASCII character (33 through 126). If it is, it subtracts 33 from it, adds C [the code pointer] to it, mods it by 94, then uses the result as an index into the following table of 94 characters:
+b(29e*j1VMEKLyC})8&m#~W>qxdRp0wkrUo[D7,XTcA"lI
.v%{gJh4G\-=O@5`_3i<?Z';FNQuY]szf$!BS/|t:Pn6^Ha

If the character indexed in the table is one of the seven characters corresponding to Malbolge operations, the operation is executed. Otherwise the machine does nothing, except to increment both the code pointer and the data pointer (the constant incrementing of the data pointer provides another annoyance for the programmer). Note that the transformation depends on where the instruction resides in memory because C (the code pointer) is added as part of this step; the same value would execute as two different instructions at two different locations in memory. AMalbolge programmer cannot lay out the instructions she wants executed, but must lay out instructions so that after they have been taken through this complicated transformation, the eventual result will be the instructions that were supposed to be executed in the first place. To make matters more difficult, Malbolge programs can only consist of the seven characters that correspond to operations; the programmer can't simply write a program consisting of non–operation characters that will transform to operations.

Mandatory self–modifying code

In standard programming practice, code is treated as immutable. Though both code and data reside as patterns in memory, the block of memory patterns corresponding to code remains fixed, while the block of memory patterns corresponding to data is manipulated by the executing code. Self–modifying code treats its code block as mutable, literally changing its own operations as it runs. Self–modifying code is notoriously difficult to read and write; where the textual representation of the program is by necessity static, the structure of the process dynamically changes over time. In Malbolge, the programmer is forced to write self–modifying code, as code modification is built into the definition of code execution:

After the instruction is executed, 33 is subtracted from the instruction at C, and the result is used as an index in the table below. The new character is then placed at C, and then C is incremented.
5z]&gqtyfr$(we4{WP)H-Zn,[%\3dL+Q;>U!pJS72FhOA1C
B6v^=I_0/8|jsb9m<.TVac`uY*MK'X~xDl}REokN:#?G"i@

So, in addition to the complexities added by the indirect instruction decoding, the instructions are constantly changed by an arbitrary transformation. It is therefore impossible to write code in Malboge that does the same thing twice in a row. These factors account for the two years that passed before the first Malbolge «hello, world» program appeared.

Scheffer, in his cryptanalytic treatment of Malbolge, discovered a number of «weaknesses» that made it possible to write arbitrary programs in Malbolge — proving, therefore, that is is capable of universal computation. The most notable weaknesses are as follows: The permutation table used to modify code exhibits short cycles — that is, if one chooses carefully, instructions can be selected that turn back into themselves before very long. Specifically, a permutation cycle is a sequence of code transformations that comes back to itself. For example, the p instruction (the crazy op), when located at memory location 20, will turn into the j instruction (to store a value in memory) the first time it is executed, then into a «no op» (do nothing) once the j instruction is executed, then into another no op when the no op is executed, and finally, after this no op is executed, back to the p instruction. Another forgiving aspect of Malboge is that the branch instruction, i, is not modified, nor is its target. Exploiting these regularities allowed Scheffer to develop general Malbolge code constructs that, for example, allow one to create a block of code that performs a given function every other time it is executed, one that safely does nothing the alternate times. These discoveries paved the way for the creation of a BrainFuck to Malbolge compiler.

11. Toward A Broader Code Aestehtics

Programs in weird languages generally have the property of being difficult to read. This suggests that weird languages may be «auto–obfuscating,» requiring obfuscation from programmers. But obfuscated code contests are not about merely producing code that is hard to read; they are about exploiting the syntax and semantics of the language to comment on the language itself. Weird languages emphasizing minimalism and puzzles are «merely» hard to read in the same way that assembly language is hard to read; they provide so little play that it is virtually impossible to double–code interestingly. Languages structuring play, in contrast, are hard to read because of the insistence of the enforced double–coding. The textual meaning of the program is inevitably not about the procedural meaning of the program, but about some unrelated domain. Of the weird languages described here, it may be only INTERCAL that is truly auto–obfuscating. Since INTERCAL parodies several languages, resulting in a language in which nothing can be expressed cleanly or elegantly, the difficulty of reading INTERCAL programs is a result of such programs being about the parody languages, and thus in some sense about INTERCAL itself.

By commenting on the nature of programming itself, weird languages point the way towards a refined understanding of the nature of everyday coding practice. In their parody aspect, weird languages comment on how different language constructions influence programming style, as well as on the history of programming language design. In their minimalist aspect, weird languages comment on the nature of computation and the vast variety of structures capable of universal computation. In their puzzle aspect, weird languages comment on the inherent cognitive difficulty of constructing effective programs. And in their structured play aspect, weird languages comment on the nature of double–coding, how it is the programs can simultaneously mean something for the machine and for human readers.

All of these aspects are seen in everyday programming practice. Programmers are extremely conscious of language style, of coding idioms that not only «get the job done», but do it in a way that is particularly appropriate for that language. Programmers actively structure the space of computation for solving specific problems, ranging from implementing subuniversal abstractions such as finite–state machines for solving problems such as string searching, up to writing interpreters and compilers for custom languages tailored to specific problem domains, such as Perl for string manipulation. All coding inevitably involves double–coding. «Good» code simultaneously specifies a mechanical process and talks about this mechanical process to a human reader. Finally, the puzzle–like nature of coding manifests not only because of the problem solving necessary to specify processes, but because code must additionally, and simultaneously, double–code, make appropriate use of language styles and idioms, and structure the space of computation. Weird languages thus tease apart phenomena present in all coding activity, phenomena that must be accounted for by any theory of code.

Programming has already been connected to literature in an interesting way, albeit without deep consideration of obfuscation and weird languages as programming practices.[1] Obfuscation and weird languages invite us to join programming contexts to the literary contexts that must obviously be considered when evaluating literary code. They also suggest that coding can resist clarity and elegance to strive instead for complexity, can make the familiar unfamiliar, and can wrestle with the language in which it is written, just as much contemporary literature does. When a program is double–coded to have some literary meaning, or indeed, any human meaning, this meaning can play with what programming language researchers call the semantics of the code: what the code actually does as it executes. A very simple case of such play can even be seen in the obfuscated C «hello, world!» program, in which read is used to name a function that writes one letter. In such play, the levels of human meaning and machine meaning must both be considered.

As the name «Turing Machine» suggests, the computer is a machine. Whether it is realized as a physical device or imagined and abstract, it is made up of parts and performs tasks. A tradition of overcomplicated machinery has manifested itself in art in several ways, but perhaps most strikingly in Alfred Jarry's Pataphysics, «the science of imaginary solutions,» which involves the design of complicated physical machinery and also the obfuscation of information and standards. As a joke, and as a parody of the complex French calendar, Jarry introduced a new calendar. It begins on his birthday and is divided into thirteen months, each of 29 days. Each day has an obscure name in the pataphysical calendar, and the last day of the month is, in all but two cases, an imaginary day. The second month, for instance, is «Haha,» and its second day is «Dissolution of Edgar Allan Poe, dinomythurge.» The College de 'pataphysique revises the calendar once in a while, changing the names of days.

An aesthetic of mechanical obfuscation is also seen in the kinetic installations of Peter Fischli and David Weiss and in their film «The Way Things Go» (1987–1988), as well as in the earlier visual art of Robert Storm Petersen, Heath Robinson, and Rube Goldberg. (The weird language RUBE was so named as a tribute to Goldberg.) These depictions and realizations of mechanical ecstasy comment on engineering practice and physical possibility, much as obfuscated coding and weird languages comment on programming and computation. These «art machines,» like obfuscated programs, are interesting because they do something in a very complex way, but to be worth anyone's attention they must actually do something and have a machine meaning as well as a human one.

Perhaps most oddly, obfuscated programs and weird languages are inviting. They ask for the full engagement of those who read them or program in them, and offer to show how strangely things can be done. They invite theorists and critics of new media to look into the dark box of the machine and see how creativity is at work in there, too. To understand how programmer–artists, programmer–authors, game developers, and hackers of other stripes achieve what they do, it will be necessary to understand the full range of programming practices, to not just play with the finished, executable file, but to also consider the play that happens in programming it.

References

  1. Black, M. J. The Art of Code. Ph.D. Dissertation, University of Pennsylvania. 2002.
  2. Cox, G., A. McLean, and A. Ward. The Aesthetics of Generative Code. http://www.generative.net/papers/aesthetics/ 2000.
  3. Heusser, M. Beautiful Code. Dr. Dobb's. www.ddj.com/documents/ddj1122411683430/ 2005.
  4. International Obfuscated C Code Contest. http://www.ioccc.org/
  5. Kernighan, B. W. and D. M. Ritchie. The C Programming Language. 2nd Ed. Prentice Hall, Englewood Cliffs, New Jersey. 1988.
  6. Knuth, D. E. Things a Computer Scientist Rarely Talks About. Center for the Study of Language and Information, Stanford, California. 2001.
  7. Lohr, Steve. Go To. Basic Books, New York. 2001.
  8. Mathews, H. and A. Brotchie, eds. Oulipo Compendium. Atlas Press, London. 1998.
  9. Olmstead, B. Malboge. http://www.antwon.com/other/malbolge/malbolge.txt 1998.
  10. Perl 5.6 FAQ. 23 May 1999. http://www.perldoc.com/perl5.6/pod/perlfaq1.html
  11. Perlis, A. Epigrams on Programming. SIGPLAN Notices, 17(9), September 1982. http://www.bio.cam.ac.uk/~mw263/Perlis_Epigrams.html
  12. Scheffer, L. http://www.lscheffer.com/malbolge.html
  13. Woods, D. and J. Lyon, The INTERCAL Programming Language Revised Reference Manual. 1st Ed. 1973, CINTERCAL revisions, L. Howell and E. Raymond, 1996.
  14. WordNet 2.1. http://wordnet.princeton.edu/
Michael Mateas and Nick Montfort. "A Box, Darkly: Obfuscation, Weird Languages, and Code Aesthetics." In Proceedings of the 6th Digital Arts and Culture Conference, IT University of Copenhagen, 1-3 Dec 2005, pp. 144-153.

Wednesday, April 23, 2008

Experiment With Control Code Obfuscation

Control code obfuscation is intended to prevent malicious reverse engineering of software by masking the program control flow. The idea for further advancing the state of the art was presented in 2000 by WANG C. An obfuscating system for Java based on the ideas of WANG C is implemented and experimented. The experiment results show that obfuscation can be done efficiently with moderate increases in code size, execution times, while making the obfuscated code resilient to a variety of reverse engineering attacks.

1. Introduction

In distributed computing surroundings, a software may be deployed on several computers which can not be trusted. In malicious host surroundings, a software can be reverse engineered and also can be tampered with. The confidentiality, integrality and availability of software are threatened, so convenient and effective methods are needed to protect the intellectual property of software. Code obfuscation transformation emerges as a defense technique to protect software from reverse engineering. The advent and widely use of Java language brings the widely research on code obfuscation because the class file is easy to be decompiled.

The goal of code obfuscation is to make it difficult for an attacker from understanding the inner workings of a program by making the obfuscated programs too difficult to understand, that is, by making the task of reverse engineering the program too expensive in terms of the resources or time required.

The obfuscating transformations can be classified according to the kind of information they target [1]. Layout transformations remove source code formatting and scramble identifiers. Control transformations affect the control flow of the program, while data transformations operate on the data structures used in the program [2]. The aim of control obfuscators is to obscure the code by affecting its control flow, in order to make it more difficult to understand for a malicious user (attacker).This paper is concerned primarily with control transformations and concretely considers the approach taken form WANG Chen–xi's dissertation [3]. This choice is motived by two factors. First based on our experiments, it seems more difficult to break than those we had considered earlier. Second, it is a key component of an industrial obfuscation tool by Cloakware Inc [4].

We have implemented and experimented with an obfuscation system for Java which relies on the control flow flattening transformation. Our experiments show that obfuscation can be done efficiently with moderate increases in code size, execution times, while making the obfuscated code resilient to a variety of reverse engineering attacks.

2. Implementation and experimental results

2.1 Control flow flattening algorithm

Control flow flattening aims to obscure the control flow logic of a program by «flattening» the control flow graph so that all basic blocks appear to have the same set of predecessors and successors. The actual control flow during execution is guided by a dispatcher variable. At runtime, each basic block assigns to this dispatcher variable a value indicating which next basic block should be executed next. A switch block then uses the dispatcher variable to jump indirectly, through a jump table, to the intended control flow successor. As an example, considering the program shown in Figure 1.

An example program and its control flow graph Figure 1: An example program and its control flow graph

Basic control flow flattening of this program results in the control flow graph shown in Figure 2, where S is the switch block and x the dispatcher variable.1 The initial assignment to the dispatcher variable x in the block Init is intended to route control to A, the original entry block of f (), when control first enters the function; after this, control flow is guided by assignments to x in the various basic blocks.

Control flow graph after flattening Figure 2: Control flow graph after flattening

2.2 Byte code obfuscation

We choose the Java class files (those ending with .class) as objects being obfuscated. Since Java is a platform–independent language, source code is compiled into byte code, which stored in a .class file, and Java byte code is the form of instructions that the Java virtual machine executes. It is possible to create class files which no Java compiler can produce, and yet, which pass the Java Verifier with flying colors. Such class files are said to be deviant. The goal of byte code obfuscation is to produce deviant class files, such that the byte code file gets stronger protection against the reverse engineering. Here we use the BCEL to analyze, create, and manipulate (binary) Java class files [5].

Classes are represented by objects which contain all the symbolic information of the given class: methods, fields and byte code instructions, in particular. Such objects can be read from an existing file, transformed by a program (e.g. a class loader in run–time) and dumped to a file again.

To do the byte code obfuscation, we should know the format of class files and the byte code instruction set (which are described in more detail in the Java Virtual Machine Specification). Especially, we will deal with the security constraints that the Java Virtual Machine has to check at run–time, i.e. the byte code verifier.

When we manipulate Java class files, the code must meet the following conditions so that the JVM can accept it [6].

  1. Type correctness: the arguments of an instruction are always of the types expected by the instruction.
  2. No stack overflow or underflow: an instruction never pops an argument off an empty stack, nor pushes a result onto a full stack.
  3. Code constraints: the program counter must always point within the code for the method, to the beginning of a valid instruction encoding.
  4. Register initialization: a load from a register must always follow at least one store in this register.
  5. Object initialization: when an instance of a class C is created, one of the initialization method for class C must be invoked before the class instance can be used.

Having learned the constraints from JVM, we can do the obfuscation described in 2.1. Control flow flattening is performed in two steps. First, Partition basic blocks; Second, Flatten the control flow. The details are discussed in 2.3 and 2.4.

2.3 Partition basic blocks

The standard algorithm for grouping a sequence of instructions into basic blocks relies on identifying which instructions are leaders [7]. An instruction is a leader if:

  • It is the first instruction in the sequence, or;
  • It is the destination of a branch instruction, or;
  • It follows a branch instruction.

The instructions are partitioned into basic blocks that start with a leader and contain no other leaders. All of the instructions which follow this leader will be in a single basic block. A basic block includes all of the instructions up to the instruction before the next leader. But this method of partitioning basic blocks will not work when facing the java class file.

The modified code may be rejected by the Java verifier when the code violate the constraint, all the execution paths which lead to a particular point must have operand stacks with the same number and type of items on them. So, we use the modified basic block algorithm [8].

FindLeaders: This process returns a list of the leaders in a method code. A leader is the first instruction in a basic block. For all possible execution paths through a method's code, we keep track of the operand stack size after each instruction is executed. Supposed that after the current instruction executed, the operand stack is empty. Then any instructions that would be executed after the current one are leaders.

PartitionBasicBlocks: partitions the instructions in a methods code so that each partition starts with a leader and contains no other leaders.

As an example, consider the program shown in Figure 3. The Java source code is compiled to class file, and we get the instruction set through the BCEL. Every piece of instruction has a number that denotes its position in the set. Using the algorithm described above, we get 6 basic blocks.

Partition basic blocks
#InstructionsBasic blocks #
0iconst_01
1istore_11
2goto 172
5getstatic 73
8aload_03
9iload_13
10aaload3
11invokevirtual 53
14iinc 1,14
17iload_15
18aload_05
19arraylength5
20if_icmplt 55
23return6
Figure 3: Partition basic blocks

2.4 Flatten the control flow

Once get the basic blocks, we can flatten the control flow using the following algorithm:

Get all the entrances of basic blocks. We use two arrays to record the entrances. The first is an integer array which records the number of every entrance. The second is an array composed of Instructions which record the instruction entrance, then scan all of the basic blocks and record the value into the two arrays.

Register initialization. A load from a register must always follow at least one store in the register. During the process of partitioning basic blocks, the store instruction and the load instruction to the same variable may be put into two basic blocks, and the verifier will reject the modified file. Here, we scan all of the instructions in the method and record all the store instructions. Then insert corresponding store instructions to the beginning of the modified file and all the load instruction will be accepted as a result.

Create the switch case statement. From the step 1, we have known the entrances of basic blocks and at the same time, give a number to each entrance. The number is used by the switch statement to determine which block is to be executed next. Here we insert a switch case statement to the beginning of the method to control the execution flow.

Insert goto. We insert a goto statement to the end of every basic block, and the goto is used to which jump to the switch statement so that the execution flow will go to the next basic block.

Dump to the file. The result after step 4 is a interior form of BCEL and we should dump it to the class file to finish the obfuscation transformation.

2.5 Experimental results

Our experiments were run under Windows xp on AMD Sempron 1.8GHz, with 512MB RAM. We used the Sun Java 2 SDK Version 1.2.1 Windows Release Virtual Machine with the default Just–In–Time compiler turned on.

We tested our obfuscation tool on three type Java programs, In table 1.There are three programs: jump–intensive program, loop–intensive program and sequence–process program. The jump–intensive program includes many jump statements, while the loop–intensive program includes many for or while statements. There is little jump or loop statement in sequence–process program. Each test was done twelve times. The numbers in the table are the averages. Dimensions: code size is in bytes; execution time is in milliseconds.

The experiments show that the obfuscation makes the code size, the branch number and execution time increased. The more branch number is, the more cost. We should make the balance between performance and security.

We have tried to attack the obfuscated program with several java reverse engineering tools and not get the right control flow.

3. Conclusions

Although obfuscation results in less efficient code, it is relatively cheap to perform and has aroused increased interest in the past two years. This paper experimented with a obfuscation system, and the result shows that obfuscation can be done efficiently with moderate increases in code size, execution times, while making the obfuscated code resilient to a variety of reverse engineering attacks.

References

  1. D. Low C. Collberg, C. Thomborson. A taxionomy of obduscating transformations. Technical Report 148. Dept. of Computer Science, The Univ. of Auckland, 1997.
  2. D. Low. Java Control Flow Obfuscation (Master of Science Thesis). Department of Computer Science, The University of Auckland, 1998.
  3. WANG C. A Security Architecture for Survivability Mechanisms (Ph.D. Thesis). Department of Computer Science, University of Virginia, October 2000.
  4. S. Chow, Y. Gu, H. Johnson, and V. Zakharov. An approach to the obfuscation of control–flow of sequential computer programs. In G. Davida and Y. Frankel. (Eds.), Information Security, ISC 2001, volume 2200 of Lectures Notes in Computer Science (LNCS): Springer Verlag, 2001, 68.
  5. Available at: http://jakarta.apache.org/bcel/manual.html.
  6. Xavier Leroy. Java bytecode verification: Algorithms and formalizations. Journal of Automated Reasoning.
  7. Alfred V. Aho, Ravi Sethi and Jeffrey D. Ullman. Compilers, Principles, Techniques and Tools. Addison–Wesley, ISBN 0–201–10088–6, 1986.
  8. Douglas Low. Java Control Flow Obfuscation. June 3, 1998.
SONG Ya-qi, LI Li (Computer Department, North China Electric Power University, Baoding 071003, China) (Edited by Rachel, Yunflyer), Journal of Communication and Computer, ISSN1548-7709, USA.

Tuesday, April 22, 2008

Deobfuscation — Reverse Engineering Obfuscated Code

In recent years, code obfuscation has attracted attention as a low cost approach to improving software security by making it difficult for attackers to understand the inner workings of proprietary software systems. This paper examines techniques for automatic deobfuscation of obfuscated programs, as a step towards reverse engineering such programs. Our results indicate that much of the effects of code obfuscation, designed to increase the difficulty of static analyses, can be defeated using simple combinations of straightforward static and dynamic analyses.

Our results have applications to both software engineering and software security. In the context of software engineering, we show how dynamic analyses can be used to enhance reverse engineering, even for code that has been designed to be difficult to reverse engineer. For software security, our results serve as an attack model for code obfuscators, and can help with the development of obfuscation techniques that are more resilient to straightforward reverse engineering.

Introduction

In recent years, code obfuscation has attracted some attention as a low cost approach to improving software security [4, 7, 8, 21, 22, 29]. The goal of code obfuscation is to make it difficult for an attacker to reverse engineer programs. The idea is to prevent an attacker fromunderstanding the inner workings of a programby making the obfuscated program «too difficult» to understand — that is, by making the task of reverse engineering the program «too expensive» in terms of the resources or time required to do so. Obfuscation has also been used to protect «software watermarks» and fingerprints, which are designed to thwart software piracy [1, 7, 8]. The presumption is that making it difficult for attackers to understand the internal workings of a program prevents them from discovering vulnerabilities in the code, and serves to protect the program owner's intellectual property.

It is important to note, however, that code obfuscation is merely a technique. Just as it can be used to protect software against attackers, so too it can be used to hide malicious content. For example, certain kinds of sophisticated computer viruses, e.g., polymorphic viruses, have resorted to using obfuscation techniques to prevent detection by virus scanners [27].

This raises two closely related questions. The first question, from a software engineering pespective, is: What sorts of techniques are useful for understanding obfuscated code? For example, suppose we have downloaded, from a web site, a file purporting to be a security patch for some application. Before applying the patch, we may want to verify that the file does not contain any malicious payload. How can we verify this if the contents of the file have been obfuscated? The second question, from a security perspective, is: what are the weaknesses of current code obfuscation techniques, and how can we address them? If our obfuscation schemes are ineffective in thwarting attackers from reverse engineering the code, then they are not only useless, but are in fact worse than useless: they increase the time and space requirements of the program, and can contribute to a false sense of security that keeps other security measures from being deployed. Thus, identifying any weaknesses in current obfuscation schemes by developing and testing attack models can lead to better obfuscation schemes and concomitant improvements in software security.

This paper aims to address the questions raised above, regarding techniques for understanding obfuscated code and the strengths and weaknesses of sophisticated obfuscation algorithms. We describe a suite of code transformations and program analyses that can be used to identify and remove obfuscation code and thereby help reverse engineer obfuscated programs. We use these techniques to examine the resilience of the control flow flattening obfuscation technique, which has been proposed in the research literature [2, 29] and used in a commercial code obfuscation product by Cloakware [5], against attacks based on combinations of static and dynamic analyses. Our results indicate that from the perspective of reverse engineering, simple dynamic techniques can often be very useful in coping with code obfuscation. From a software security perspective, we show that many obfuscation techniques can be largely neutralized using combinations of simple and well known static and dynamic analyses.

2. Obfuscating Transformations

Conceptually, we can distinguish between two broad classes of obfuscating transformations. The first, surface obfuscation, focuses on obfuscating the concrete syntax of the program. An example of this is changing variable names or renaming different variables in different scopes to the same identifier , as carried out by the «Dotfuscator» tool for obfuscating .NET code [25]. The second, deep obfuscation, attempts to obfuscate the actual structure of the program, e.g., by changing its control flow or data reference behavior [4, 8]. While the former may make it harder for a human to understand the source code, it does nothing to disguise the semantic structure of the program. It therefore has no effect on algorithms used for reverse engineering, such as program slicing, that rely on code structure and semantics rather than the concrete syntax. For example, it is straightforward to undo most of the effects of Dotfuscator–style variable renaming simply by using a parser to resolve variable references using the scope rules of the language and rename variables accordingly. Deep obfuscation, by contrast, changes the actual structure of the program, and therefore affects the efficacy of semantic tools for program analyses and reverse engineering. Space constraints preclude a more detailed elaboration of different kinds of deep obfuscation techniques, but the interested reader is referred to a discussion and more detailed taxonomy by Collberg et al. [9]. For the purposes of this paper, it suffices to note that working around deep obfuscation — which requires reasoning about semantic aspects of the program — is intuitively more difficult than working around surface obfuscation, which is essentially a syntactic issue. This paper is concerned primarily with deep obfuscation techniques that attempt to disguise the control flow logic of a program.

In prior work, we considered the problem of deobfuscating programs that had been subjected to a number of control flow obfuscations based on opaque predicates; we found that for the obfuscations considered (a set of control flow obfuscations implemented in Collberg's Sandmark obfuscation tool for Java programs [6]), most of the obfuscation could be removed using a combination of fairly straightforward static and dynamic analyses [3]. This paper considers a different approach to control flow obfuscation, taken from Chenxi Wang's dissertation [29, 30]. This choice is motivated by three factors. First, based on our experiments, it seems more difficult to break than those we had considered earlier [3]. Second, this approach has been considered by other researchers as well [2], and its resilience is therefore of interest to the research community. Finally, it is a key component of an industrial obfuscation tool by Cloakware Inc. [5].

This section describes the basic control flow obfuscation technique as well as two enhancements that aim to make the basic approach harder to break.

2.1 Basic Control Flow Flattening

Control flow flattening aims to obscure the control flow logic of a programby «flattening» the control flow graph so that all basic blocks appear to have the same set of predecessors and successors. The actual control flow during execution is guided by a dispatcher variable. At runtime, each basic block assigns to this dispatcher variable a value indicating which next basic block should be executed next. A switch block then uses the dispatcher variable to jump indirectly, through a jump table, to the intended control flow successor.

An example program and its control flow graph Figure 1: An example program and its control flow graph
Control flow graph after basic flattening Figure 2: Control flow graph after basic flattening

As an example, consider the programshown in Figure 1. Basic control flow flattening of this program results in the control flow graph shown in Figure 2, where S is the switch block and x the dispatcher variable ?. The initial assignment to the dispatcher variable x in the block Init is intended to route control to A, the original entry block of f(), when control first enters the function; after this, control flow is guided by assignments to x in the various basic blocks.

2.2 Enhancement I: Interprocedural Data Flow

In the basic control flow flattening transformation discussed in Section 2.1, the values assigned to the dispatch variable are available within the function itself. Because of this, while the control flow behavior of the obfuscated code is not obvious, it can be reconstructed by examining the constants being assigned to the dispatch variable. This, in turn, requires only intraprocedural analysis.

Enhancing flattening with Interprocedural Data Flow Figure 3: Enhancing flattening with Interprocedural Data Flow

The resilience of the obfuscation technique can be improved using interprocedural information passing. The idea is to use a global array to pass the dispatch variable values. At each call site to the function, these values are written into the global array starting at some random offset within the array (appropriately adjusted to avoid buffer overflows). The offset so chosen may be different at different call sites for the function, and is passed to the obfuscated callee either as a global or as an argument. The obfuscated code then assigns values to the dispatch variable from the global array. Neither the actual locations accessed, nor the contents of these locations, are constant values, and are not evident by examining the obfuscated code of the callee. The code resulting from applying this obfuscation to the program in Figure 1 is illustrated in Figure 3.

2.3 Enhancement II: Artificial Blocks and Pointers

The obfuscation technique detailed above can be extended by adding artificial basic blocks to the control flow graph. Some of these artificial blocks are never be executed, but this is difficult to determine by a static examination of the program because of the dynamically computed indirect branch targets in the obfuscated code. We then add indirect loads and stores, through pointers, into these unreachable blocks. These have the effect of confusing static analyses about the possible values taken on by the dispatch variable. Figure 4 shows the result of applying this to the program of Figure 1.

Enhancing flattening with artificial blocks and pointers (Click on image for full size) Figure 4: Enhancing flattening with artificial blocks and pointers

In our implementation, we add two artificial basic blocks corresponding to each block in the original function: one of these blocks is actually executed at runtime, while the other is simply a decoy added to mislead static analysis. Given a block B in the original program, let the corresponding artificial block that gets executed be denoted by B|, and the decoy artificial block be B||. Indirect assignments through pointers are added to both these artificial blocks. However, only the assignments in the block B| set the dispatch variable to the appropriate values so as to give the right control flow during execution; the decoy block B||, by contrast, sets the dispatch variable to other values, so as to give a misleading picture of control flow. In the original block B, the value of the dispatch variable that gets loaded is that previously assigned in the artificial block B|. Hiding the starting value of the switch variable makes it harder for a static analyzer to deduce which blocks are executed and hence find out the valid definitions of the switch variable.

3. Deobfuscation

This section describes a number of analyses and program transformations that we have found useful for reverse engineering obfuscated code.

3.1 Cloning

Many obfuscation techniques rely on introducing spurious execution paths into the program to thwart static program analyses [4, 8]. These paths that can never be taken at runtime, but cause bogus information to be propagated along them during program analyses, thereby reducing the precision of information so obtained and making it harder to understand the program logic. This is illustrated in Figure 5(a), where information is propagated between basic blocks A and B along the «actual» control flow path 1 as well as the spurious control flow path 2, the latter having been introduced by the obfuscator. The bogus data flow information propagated along 2 then has the effect of introducing imprecision in the results of program analyses at points where execution paths come together. In Figure 5(a), the results of forward dataflow analyses, such as reaching definitions, are tainted at the entry to B, while those of backward analyses, such as liveness analyses, are affected at the exit from A.

Code Cloning Figure 5: Code Cloning

One way to address this problem is to clone portions of the program in such a way that the spurious execution paths no longer join the original execution paths and taint the information obtained from analysis. The result of applying cloning to basic block B in Figure 5(a) is shown in Figure 5(b). In this case, this results in improved forward dataflow information available at the entry to B. In this example, however, cloning does not eliminate the spurious control flow edge AB|, and so does not improve the backward dataflow information available at the exit from A.

This transformation obviously has to be applied judiciously, since otherwise it can cause large increases in code size and further exacerbate the reverse engineering problem. Moreover, since the goal of deobfuscation is to try to identify and remove obfuscation code, this means that in general, cloning has to be applied without knowing, ahead of time, which execution paths are spurious and which are not. One possible approach, in such situations, would be to apply cloning selectively at pointswheremultiple control flow paths join, and where the dataflow information propagated along some paths is significantly less precise than that propagated along others. Alternatively, if we know something about the kind of obfuscation that has been applied, it may be possible to apply cloning in a way that exploits this information. For example, it is relatively straightforward to infer that control flow flattening has been applied, because of the distinctive control flow graphs it produces.

Code Cloning for Control Flow Flattening Figure 6: Code Cloning for Control Flow Flattening

For the purposes of this paper, we use cloning in the context of one of our deobfuscator implementations (see Section 4.1), as illustrated in Figure 6. Consider the obfuscated program fragment shown in Figure 6(a), where the basic blocks A, B, and C all transfer control to a switch–block S. Cloning creates three copies S1, S2, and S3 of the switch–block S, corresponding to the successors A, B, and C respectively. The control flow successors of each of these copies is the set of control flow successors of the original switch–block, i.e., each of the copies S1, S2, and S3 has an edge to each of the blocks A, B, and C. In the resulting program, shown in Figure 6(b), the dataflow information entering the switch–block S1 is not commingled with that entering the switch–block S2 from B or that entering the switch–block S3 from C.

3.2 Static Path Feasibility Analysis

We use the term static path feasibility analysis to refer to constraint–based static analyses to determine whether an (acyclic) execution path is feasible. Given an acyclic execution path π with χ the set of variables live at entry to π, the idea is to construct a constraint Cπ such that the logical formula (∃ χ)Cπ is unsatisfiable only if, for all possible executions of the program, π is never executed. Cπ is thus a conservative approximation to the effects of the execution of the instructions along π. If (∃ χ)Cπ can be shown to be unsatisfiable, we can conclude that π is unfeasible.

In principle, we can imagine many different ways to construct the constraint Cπ corresponding to a path p. For the purposes of this paper, our goal is to take into account the effects of arithmetic operations on the values of variables, effectively obtaining an analysis that resembles constant propagation, but propagates information along a single execution path rather than along all execution paths. To this end, we use linear arithmetic constraints to reason about variable values. The discussion below assumes a low–level programrepresentation, e.g., as three–address code, RTL, or even machine instructions.

Assume that each instruction in the program has a unique name Ik. The value of a variable χ at the beginning of π is denoted by χ0, while at intermediate points along the path, the value of χ immediately after instruction Ik is denoted by χk. An unknown value is denoted by . The constraint Cπ is constructed as a conjunction of a constraint Ck for each instruction Ik in π, as follows:

  1. Assignment: Ik ≡ ‘x = y’. Then, Ckχk = Yk , where Ij refers to the most recent instruction in π that defined y ( j = 0 if there is no definition of y in π before Ik).
  2. Arithmetic: Ik ≡ ‘x = y⊕z’ for some operation , where Ii and Ij refer to the most recent instructions defining y and z respectively (i = 0 if y has not yet been defined along π, and similarly with j). Then, Ckχk = ƒ⊕(yi, zj). Here, ƒ⊕ expresses the semantics of the operation . If the semantics of is not known to the analyzer, or if either yi = ⊥ or zj = ⊥, then Ck ≡ x = ⊥.
  3. Indirection: Pointers can be modelled at different levels of precision, with a concomitant tradeoff in analysis speed [14]. A full discussion of pointer analysis is beyond the scope of this paper; we require only that the treatment of pointers be conservative, i.e., that the set of possible targets for a pointer during analysis be a superset of the actual set of targets during any execution.
  4. Branches: Ik ≡ ‘if e goto L for some Boolean expression e. In this case,
    Unconditional branches can be treated as a special case where e ≡ true, while multi–way branches such as those arising from switch statements, can be modelled as a semantically equivalent series of conditional branches.
  5. Otherwise, the effects of instruction Ik cannot be modelled by the analyzer. The analysis is aborted in this case, and our systemconservatively assumes that π is a feasible path.

Once the constraint Cπ has been constructed in this way, a constraint solver is used to determine its satisfiability. The constraints so generated can be simplified, using path slicing techniques [15], to reduce the cost of testing for satisfiability; our current implementation, which uses the Omega calculator [23] for satisfiability testing, does not currently do such simplifications.

An example of static path feasibility analysis Figure 7: An example of static path feasibility analysis

Figure 7 illustrates the use of constraints for static path feasibility analysis. The parenthetical figures to the right of each basic block serve to identify different instructions. Consider the path π = B0 → B2 → B3 → B5. The only relevant live variable at the entry to this path is u. The corresponding constraint Cπ is therefore:

(∃u0)[x1 = 1 ∧ y2 = 2 ∧ u0 > 0 ∧ z5 = x1 — y2 ∧ z5 > 0]

It is not difficult to see that this constraint is unsatisfiable, which means that the path π is unfeasible. Note that conventional constant propagation would obtain z = ⊥ at entry to block B3, and thereby conclude that the path π is feasible.

Note that this example could also have been handled by cloning block B3, which would have the effect of preventing the loss of information resulting from the control flow join of edges B1 → B3 and B2 → B3, after which constant propagation would give the expected results. Thus, path feasibility analysis and cloning can be seen as complementary techniques.

3.3 Combining Static and Dynamic Analyses

Conventional static analyses, such as that of Section 3.2, are inherently conservative ?, so the set of edges resulting from purely static deobfuscation techniques are, in general, a superset of the actual set of edges. Conversely, dynamic analyses, such as program tracing or edge profiling, cannot take into account all the possible input values to a program, and therefore are able to observe only a subset of all its possible execution paths.

The dual natures of these two approaches to program analysis suggests that we try to combine them. This can be done in two ways. We can begin with an underapproximation to the set of control flow edges obtained via dynamic analysis, then use static analysis to add back some control flow edges that could be taken. Alternatively, we can begin with an overapproximation to the set of control flow edges edges obtained via static analysis, then use dynamic analysis to remove some control flow edges (or paths) that are not actually taken. In either case, the result may contain either more or less edges than the original program, i.e., when we combine static and dynamic analyses the result cannot be guaranteed to be either sound or precise. Nevertheless, from the perspective of reverse engineering and program understanding, such combined analyses can be very useful for overcoming the limitations of purely static and purely dynamic analyses.

For the work described in this paper, we used a static analysis to improve the results of dynamic analysis by adding back some control flow edges that could possibly be taken. The essential idea behind our approach is based on the following gedankenexperiment: suppose we know, somehow, which control flow edges can actually be taken during execution. Then, we can simply mark these edges and propagate dataflow information only along such marked edges, thereby avoiding the imprecision resulting from propagating information along edges that can never be taken at runtime. Conventional static analyses can then be thought of as the degenerate case where all edges are marked. We can improve on this situation by using dynamic analyses to identify edges that are actually taken during execution andmarking only these edges, then propagating dataflow information along these marked edges, as follows:

  1. Initially mark only those edges that are identified as taken by the dynamic analysis.
  2. Carry out constant propagation on the program, propagating information only along marked edges. If a conditional branch is encountered where only one the outgoing control flow edges is taken during execution, but where the outcome of the branch cannot be uniquely determined from the constant propagation, add the branch that is not taken during execution into the set of control flow edges that can be taken, and mark it.

In our implementation, the effect of this approach is to prune the dataflow information propagated into switch blocks. As an example, consider the following control flow fragment, where solid arrows represent control flow edges that are taken during execution,while dashed arrows correspond to edges that are never taken:

In this example, basic block B is never executed, so the control flow edges S → B and B → S are not marked and have no information propagated along them. The assignment ‘x = 2’ in block B is therefore not considered for static analysis; this results in the value 2 not being considered to be a possible value for the variable x at the switch.

4. Experimental Evaluation

We evaluated our ideas using two different binary rewriting systems for the Intel x86 platform: PLTO [24] and DIABLO [10]. We implemented three control flow flattening obfuscations described inWang's dissertation and discussed in Section 2 in these tools, and used these to obfuscate ten programs from the SPECint–2000 benchmark suite. While these programs happen to be written in C, our experiments were carried out on program binaries.

ProgramOriginalObfuscatedEffects of Obfuscation
Functions (Forig)Edges (Eorig)Functions (Fobf)Edges (Eobf)Fobf/ForigEobf/Eorig
bzip2422,65530157,1920.71459.21
crafty10412,172894,309,5020.855352.05
gap82543,0797681,973,9800.93045.82
gcc1,79299,5161,3988,816,0580.78088.59
gzip732,91659107,8820.80837.00
mcf197991916,7561.00020.97
parser18012,299174684,9040.96655.69
twolf16514,7991571,277,4100.95186.32
vortex63839,2296151,969,7340.96350.21
vpr2528,948211310,2100.83734.67
GEOM. MEAN:0.87659.43
(a) PLTO
ProgramOriginalObfuscatedEffects of Obfuscation
Functions (Forig)Edges (Eorig)Functions (Fobf)Edges (Eobf)Fobf/ForigEobf/Eorig
bzip2352,16734168,0320.97177.54
crafty10211,853862,701,6000.843227.92
gap80944,4317382,963,7370.91266.70
gcc1,07180,1686851,801,5530.63922.47
gzip441,8713599,4860.79553.17
mcf186051816,9081.00027.97
parser18510,301174714,2230.94069.34
twolf16512,7721561,553,1170.945121.60
vortex62032,0485991,298,4390.96640.52
vpr1032,3058444,2880.81519.21
GEOM. MEAN:0.87655.1
(b) DIABLO Table 1: Static characteristics of original and obfuscated benchmark programs

Each of our benchmarks was compiled using gcc version 3.2.2, at optimization level–O3, with additional command–line flags to produce statically linked relocatable binaries, and the resulting binaries processed using the obfuscators mentioned above. Functions containing (indirect jumps resulting from) switch statements were not obfuscated because our obfuscators currently are not able to process the resulting control flow. Library functions were also excluded, because in most cases such functions contain nonstandard control flow, e.g., where control jumps from one function into another without using the normal call/return mechanism for inter–procedural control transfers. Static characteristics of these benchmarks are shown in Table 1, which compares the original programs with those resulting from basic control flow flattening ?. Overall, Table 1 shows that our tools obfuscate most user functions in the program (on average, about 88%). As expected, obfuscation causes the number of control flow edges to increase, though the scale of the increase — a factor of roughly 55Χ to 60Χ — is larger than we had expected.

ProgramPLTODIABLO
AddedRemoved% Over% UnderAddedRemoved% Over% Under
bzip2154,537154,5370.000.00165,865164,6570.730.00
crafty4,297,3304,297,3300.000.002,689,7472,685,3740.160.00
gap1,930,9011,930,9010.000.002,919,3062,900,5640.640.00
gcc8,716,5428,716,5420.000.001,801,55390,8930.600.00
gzip104,996104,9960.000.0097,61596,8210.810.00
mcf15,95715,9570.000.0016,30315,9442.200.00
parser672,605672,6050.000.00703,922698,7000.740.00
twolf1,262,6111,262,6110.000.001,540,3451,533,7740.430.00
vortex1,930,5051,930,5050.000.001266,3911,255,6630.850.00
vpr301,262301,2620.000.0041,98341,2261.800.00
GEOM. MEAN:0.000.000.720.00
(a) Basic Flattening
ProgramAddedRemoved% Over% Under
bzip2154,537116,89623.950.00
crafty4,297,3303,051,10528.920.00
gap1,930,9011,177,85038.150.00
gcc8,716,5424,936,99342.870.00
gzip104,99674,11128.630.00
mcf15,95715,1984.500.00
parser672,605464,09830.440.00
twolf1,262,611820,69834.590.00
vortex1,930,5051,351,35429.400.00
vpr301,262165,69543.700.00
GEOM. MEAN:26.890.00
(b) Flattening with Interprocedural Data Flow
ProgramAddedRemoved% Over% Under
bzip2165,639130,74321.760.56
crafty4,403,7503,169,69728.210.01
gap2,365,9551,655,98331.230.03
gcc9,609,6465,830,09739.940.01
gzip125,50897,53923.690.36
mcf22,33522,3751.601.69
parser786,423590,21526.090.02
twolf1,401,063973,94931.180.03
vortex2,275,7091,735,78725.000.02
vpr386,508259,88934.210.08
GEOM. MEAN:21.400.06
(c) Flattening with Artificial Blocks and Pointers Table 2: Deobfuscation results

Control flow deobfuscation involves deleting spurious control flow edges that have been added by the obfuscator. To evaluate the efficacy of various deobfuscation techniques, therefore, we compare the deobfuscated program Pdeobf with the original program Porig to classify any errors made by the deobfuscator in deleting control flow edges. In principle, there are two kinds of such errors that can occur: first, Pdeobf may contain some edge that does not appear in Porig; and second, Pdeobf may not contain some edge that appears in Porig. We term the first kind of error overestimation errors (written Δover), and the second kind of errors underestimation errors (written Δunder):

Δover = |{e | ePdeobf and ePorig}|
Δunder = |{e | ePdeobf and ePorig}|

Since the input to the deobfuscator is the obfuscated program, we express the overestimation and underestimation errors relative to the number of edges in the input obfuscated program.

4.1 Basic Flattening

We first consider programs obfuscated using the basic control flow flattening technique described in Section 2.1. This turns out to be straightforward to deobfuscate using purely static techniques. We considered two different approaches: the DIABLO implementation used cloning followed by conventional constant propagation to disambiguate control flow; the PLTO implementation used Constraint–based Path Feasibility Analysis.

The results of deobfuscation are shown in Table 2(a). For each of our implementations, we consider two metrics: the extent of deobfuscation, i.e., the number of obfuscation edges that we were able to remove via the deobfuscation process; and precision, which gives the number of overestimated and underestimated edges, as discussed above. It can be seen that the PLTO implementation, using constraint–based path feasibility analysis, is able to recover the original programs completely, without any error. The DIABLO implementation, which uses code cloning followed by constant propagation, is able to remove over 99% of the obfuscation edges. The resulting programs still have a small amount of overestimation errors (0.72% on average), due to edges that did not appear in the original programs. This is to a great extent an artifact of the program transformation used: the cloning process introduces a number of additional control flow edges into the program, and these are not all eliminated by the constant propagation. It turns out that most of them could be eliminated quite easily by an additional phase of liveness analysis and jump–chain collapsing (i.e., where a jump to a jump is replaced by a single jump to the final target). However, we did not do this for the purposes of this paper.

4.2 Flattening with Interprocedural Data Flow

For flattening with interprocedural data flow, we used only the PLTO implementation, using static path feasibility analysis by itself as well as in combination with dynamic execution tracing.

In this case, because our path feasibility analysis is purely intra–procedural in nature, it is unable to achieve any deobfuscation.

We do somewhat better when the static analysis is combined with dynamic tracing. The results are shown in Table 2(b). The resulting deobfuscated programs have some overestimation errors, ranging from 4.5% for the mcf benchmark to 43.7% for vpr, with an overall mean of 26.9%. There is no underestimation error for any of the benchmarks. It is significant that even though the underlying static analysis is purely intraprocedural, and has no deobfuscation effect by itself, the effect of combining it with dynamic analysis is to remove 100 − 26.9 ≃ about 73% of the obfuscation edges. Note that the combination of static and dynamic analyses makes a difference only for functions that are actually executed: for functions that are not executed on our test inputs, we do not consider any edges to be removed, and all of their obfuscation edges are counted towards the overestimation error in Table 2(b).

4.3 Flattening with Artificial Blocks and Pointers

For flattening with dummy blocks and pointers, we again used only the PLTO implementation, using static path feasibility analysis by itself as well as in combination with dynamic execution tracing.

The static path feasibility analysis is unable to deobfuscate this case, because it currently does not handle indirect memory accesses through pointers.

Deobfuscation improves when static and dynamic analyses are combined. The results are shown in Table 2(c). In this table, the values in the column labelled «Added» differ from the corresponding values in Table 2(b) because the addition of artificial blocks introduces some additional control flow edges in this case. As in the case of flattening with interprocedural data flow, all of the obfuscation edges for functions that are not executed are counted towards the overestimation error. Overestimation error ranges from 1.6% for mcf to just under 40% for gcc, with an overall mean of 21.4%. There is a small amount of underestimation error as well in this case, ranging from 0.01% for crafty and gcc to 1.7% for mcf, with an overall mean of 0.06%. In other words, deobfuscation removes 100 − (21.4 + 0.06) ≃ 78% of the obfuscation edges.

4.4 Deobfuscation Time

The total time taken by the PLTO–based deobfuscator for basic control flow flattening ranges from about 7 seconds for mcf (constraint generation: 2.5 sec; constraint solution: 4.5 sec) to about 21 minutes for gcc (constraint generation: 631.5 sec; constraint solution: 640.1 sec). The times for the two enhanced obfuscations are similar, ranging from 7 sec to 22 mins for the case of interprocedural data flow, and from 8.5 sec to 24 mins for the case of artificial blocks and indirection.

Related Work

There does not appear to be a great deal of prior work on reverse engineering obfuscated code. Kapoor [16] and Kruegel et al. [18] discuss algorithms for disassembling obfuscated binaries. Lakhotia and Kumar discuss techniques to handle obfuscated procedure calls in binaries [19, 20]. The focus of these works, as well as the techniques used, are very different from those described here.

A number of researchers have considered the use of dynamic analysis–either by itself, or in conjunction with static analysis–for reverse engineering [17, 26, 28]; Stroulia and Systa give an overview [26]. Much of this work focuses on dealing with legacy software, e.g., for determining modularization and semantic clustering or understanding high level design patterns, and for visualizing dynamic system behavior. All of this is fundamentally different from the work described here, which has the dual aims of identifying techniques to help reverse engineer obfuscated code, and for evaluating the strengths and weaknesses of code obfuscation techniques. In particular, our work focuses on using simple static and dynamic analyses to reverse engineer programs that have specifically been engineered to make reverse engineering difficult.

The idea of combining static and dynamic analyses is discussed by Ernst [11].

Conclusions

Code obfuscation has been proposed by a number of researchers as a means to make it difficult to reverse engineer software. Obfuscating transformations typically rely on the theoretical difficulty of reasoning statically about certain kinds of program properties. This paper shows, however, that it may be possible to bypass much of the effects of some obfuscations by a combination of static and dynamic analyses. In particular, we examine the problem of deobfuscating the effects of control flow flattening, a control obfuscation technique proposed in the research literature and used in commercial code obfuscation tools. Our results show that basic control flow flattening can be removed in a relatively straightforward way using purely static techniques, while enhancements to the basic technique can be largely deobfuscated using a combination of static and dynamic techniques.

References

  1. G. Arboit. A method for watermarking java programs via opaque predicates. In Proc. 5th. International Conference on Electronic Commerce Research (ICECR–5), 2002.
  2. L. Badger, L. D'Anna, D. Kilpatrick, B. Matt, A. Reisse, and T. Van Vleck. Self–protecting mobile agents obfuscation techniques evaluation report. Technical Report Report No. #01–036, NAI Labs, March 2002.
  3. S. Chandrasekharan. An evaluation of the resilience of control flow obfuscations. Undergraduate Honors Thesis, Dept. of Computer Science, The University of Arizona, Dec. 2003.
  4. W. Cho, I. Lee, and S. Park. Against intelligent tampering: Software tamper resistance by extended control flow obfuscation. In Proc. World Multiconference on Systems, Cybernetics, and Informatics, 2001.
  5. S. Chow, Y. Gu, H. Johnson, and V. A. Zakharov. An approach to the obfuscation of control–flow of sequential computer programs. In Proc. 4th. Information Security Conference (ISC 2001), Springer LNCS vol. 2000, pages 144–155, 2001.
  6. C. Collberg, G. Myles, and A. Huntwork. Sandmark – a tool for software protection research. IEEE Security and Privacy, 1(4):40–49, July/August 2003.
  7. C. Collberg and C. Thomborson. Software watermarking: Models and dynamic embeddings. In Proc. 26th. ACM Symposium on Principles of Programming Languages, pages 311–324, January 1999.
  8. C. Collberg and C. Thomborson. Watermarking, tamper–proofing, and obfuscation – tools for software protection. IEEE Transactions on Software Engineering, 28(8), August 2002.
  9. C. Collberg, C. Thomborson, and D. Low. A taxonomy of obfuscating transformations. Technical Report 148, Department of Computer Sciences, The University of Auckland, July 1997.
  10. B. De Bus, B. De Sutter, L. Van Put, D. Chanet, and K. De Bosschere. Link–time optimization of arm binaries. In Proc. 2004 ACM Conf. on Languages, Compilers, and Tools for Embedded Systems (LCTES'04), pages 211–220, 7 2004.
  11. Michael D. Ernst. Static and dynamic analysis: Synergy and duality. In WODA 2003: ICSE Workshop on Dynamic Analysis, Portland, OR, pages 24–27,May 2003.
  12. D. Evans and D. Larochelle. Improving security using extensible lightweight static analysis. IEEE Software, 19(1):42–51, January/February 2002.
  13. S. Guyer and K. McKinley. Finding your cronies: Static analysis for dynamic object colocation. In Proc. OOPSLA'04, pages 237–250, October 2004.
  14. M. Hind and A. Pioli. Which pointer analysis should I use? In Proc. 2000 ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 113–123, 2000.
  15. R. Jhala and R. Majumdar. Path slicing. In Proc. ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 38–47, June 2005.
  16. A. Kapoor. An approach towards disassembly of malicious binaries. Master's thesis, University of Louisiana at Lafayette, 2004.
  17. R. Kazman and S. J. Carri`ere. Playing detective: Reconstructing software architecture from available evidence. Automated Software Engineering: An International Journal, 6(2):107–138, April 1999.
  18. C. Kruegel,W. Robertson, F. Valeur, and G. Vigna. Static disassembly of obfuscated binaries. In Proc. 13th USENIX Security Symposium, August 2004.
  19. E. U. Kumar, A. Kapoor, and A. Lakhotia. DOC – answering the hidden «call» of a virus. Virus Bulletin, April 2005.
  20. A. Lakhotia and E. U. Kumar. Abstract stack graph to detect obfuscated calls in binaries. In Proc. 4th. IEEE InternationalWorkshop on Source Code Analysis and Manipulation, pages 17–26, September 2004.
  21. C. Linn and S.K. Debray. Obfuscation of executable code to improve resistance to static disassembly. In Proc. 10th. ACM Conference on Computer and Communications Security (CCS 2003), pages 290–299, October 2003.
  22. T. Ogiso, Y. Sakabe, M. Soshi, and A. Miyaji. Software obfuscation on a theoretical basis and its implementation. IEEE Trans. Fundamentals, E86–A(1), January 2003.
  23. W. Pugh. The Omega test: a fast and practical integer programming algorithm for dependence analysis. Comm. ACM, 35:102–114, August 1992.
  24. B. Schwarz, S. K. Debray, and G. R. Andrews. Plto: A link–time optimizer for the Intel IA–32 architecture. In Proc. 2001 Workshop on Binary Translation (WBT–2001), 2001.
  25. Preemptive Solutions. Dotfuscator. www.preemptive.com/products/dotfuscator.
  26. E. Stroulia and T. Systa. Dynamic analysis for reverse engineering and program understanding. ACM SIGAPP Applied Computing Review, 10(1):8–17, 2002.
  27. Symantec Corp. Understanding and managing polymorphic viruses. Technical report, 1996.
  28. T. Systa. Static and Dynamic Reverse Engineering Techniques for Java Software Systems. PhD thesis, Dept. of Computer and Information Sciences, University of Tampere, Finland, 2000.
  29. C. Wang, J. Davidson, J. Hill, and J. Knight. Protection of software–based survivability mechanisms. In Proc. International Conference of Dependable Systems and Networks, July 2001.
  30. Chenxi Wang. A Security Architecture for Survivability Mechanisms. PhD thesis, Department of Computer Science, University of Virginia, October 2000.
Sharath K. Udupa, Saumya K. Debray - Department of Computer Science, The University of Arizona; Matias Madou - Ghent University

Monday, April 21, 2008

Code Obfuscation Using Genetic Algorithms

We intend to present a method of code obfuscation, in combination with the use of genetic algorithms, which may help make the challenge of breaking software less appealing to malicious users, without forcing developers to spend too much time and effort on the obfuscation process.

Before describing the details of our theory, we give a brief background introduction to software copy protection, as well as to code obfuscation and genetic algorithms.

Introduction

Looking back at software development over the years, there has been many different ways in which the software developers have tried to protect their software from piracy.

In the older days, preventing piracy could be attempted by marking a tape as copy protected, and when marked as such, the computer would refuse to copy it using the tape player / recorder. It soon became obvious that if you had a double deck regular tape player, you could produce a perfect copy of the tape there, thus getting past the copy protection without hassle. Other methods included extracting the program from the original tape and moving it onto another tape where it could be loaded using generic loader programs, or even moving it to floppy discs. Some of the methods used to fool the copy protection were indeed so easy that anyone, even people without any computer knowledge, could use them. All you needed to do was own a good stereo, and your problems were gone already.

Later on, more advanced copyright protection measures were developed to stop piracy. In some cases, bad sectors on disc were added, and checked at start–up, knowing that even though the readable content may be copied correctly, the bad sectors would not. Checking these sectors would be a dead give–away to if the disc had been copied or not.

In other cases, advanced paper wheel constructions were included in the software package, where the user was prompted to give the correct answer to a given challenge by turning the wheels around and enter the given output.

Yet another common copy protection method included asking the user to open a given page in the software manual, find a specific paragraph and enter one or more words found there.

All of these methods should have been rather effective protection measures against piracy, since they were not easily solvable by the regular user. Copying entire manuals or trying to memorize or reproduce paper wheels meant a lot of work and trouble, and in the end it may be cheaper and more convenient to just buy the original product. Yet piracy flourished. Why? Because there were users who managed to alter the program itself and remove the copy protection totally. Adding a very advanced anti piracy function to your software to see if the user has actually purchased the product or not, means nothing in the end if the method is never actually executed. The key to avoiding protection is simply to remove the part of the program which does the check, or to make sure that it always returns true ( tell the rest of the program that the user did indeed manage to enter the secret word from page 43, paragraph 5 ).

So how is it possible? The reason that software can so easily be cracked and copy protection removed, is that in order for the computer to be able to run the program, the program executable has to contain all the operations to be carried out during the execution. This is rather straightforward, as the program cannot be run without them. However, if the computer can read the instructions, then is it not possible for the user to do so as well? Yes and no. Most people would have a hard time reading the instructions from the file since they are at a very low level. What they can do is to try to reverse engineer the instructions inside the executable file into a more readable list of instructions, which may be understood by the user. Normally, reverse engineering will transform the executable code 'one step back' into assembly language instructions. As high level languages are becoming more and more frequently used, most people never bother to learn the assembly language. There is simply no use anymore.

Many years ago, using the assembly language for software developing was the only actual way to go, if you wanted to make sure that your program would not be a lot slower than needed be. Back then, the difference between executing a program generated by a higher language compiler and one written by hand in assembler, in terms of execution time, were tremendous. CPU cycles were very limited, so saving a few instructions here, and a few instructions there, would have a large impact on performance. Also, the uniformity of the computing environment (as all computers contained the same chipsets, memory size, CPU model, etc) would allow the software developers to directly access the hardware in their assembly code to achieve the results they wanted in a single instruction or two, rather than the 25 or 50 or even more than 100 instructions which may be needed if it were done through higher level library calls. Interestingly enough, this kind of "dirty programming", was in fact industry standard, as computers were mostly used as game machines, and games would require as good performance as possible to look good, feel new and run smoothly.

However, despite the fact that fewer and fewer people are nowadays learning the assembly language, this does not mean that piracy has ended, or that reverse engineering is not a threat to software piracy prevention methods. Rather on the contrary. The globalization of the internet, in combination with higher connection speeds, leads to that few people actually need to know assembler. It is enough that a few people know it, reverse engineer the software and remove the copy protection function, then release it on the internet for everyone to enjoy. Most of them have little actual interest in the software product itself, but work for fame and glory, or possibly for the challenge of breaking others software. The key problem with placing any kind of copy protection inside the software is that it will always be possible to see exactly what the program does, and as long as you can see what the program does, you can either alter the program to remove unwanted parts (cracks), or figure out other ways to get past it (key generators, etc). An executable file is simply not as secret and well protected as some people may believe, and many may wish.

Protecting Software

So how is it possible to protect software, knowing that everything placed inside the program will be accessible to smart end–side users? There are many different ways to try to address the problem, some of which include:

  • Hardware challenges, where the user must have a piece of hardware installed to use the software correctly.
  • Serial key protection mechanisms and license key file protection.
  • Program encryption, with run–time decryption, which effectively prevents anyone in hold of the executable file to reverse engineer it into assembly code.
  • Server–sided client copy protection, where the client 'calls home' to check if a license key is correct, or if the software's file integrity is still intact.

All copy protection methods have drawbacks. Hardware challenges require that the developer creates custom made hardware, preferably a new version for each program / version of program, adding a new dimension of work and expense to the development, often raising the price while lowering user freedom, as the software cannot be run if the hardware is missing or broken.

Serial key protection and license files are not truly safe, as the validation of serial numbers must be located somewhere inside the software, and as such, can be found by anyone who decides to reverse engineer the program to look for it.

Program encryption is a very good way of preventing hackers from seeing the code easily, but the execution time of the program is greatly affected by the need of runtime decryption, and if a kernel level debugger is used, it may still be possible to see exactly what the program is doing at any given time, as this shows the actual instructions being executed.

Server–sided client security is becoming increasingly common as more and more computers can be expected to have an internet connection. However, there are two different kinds of scenarios possible when it comes to this solution. The first scenario being that the user has a firewall, and simply blocks the program. Experienced users will probably see no reason to let any program access the internet unless it is part of the programs key functionality to do so. The second case scenario is when the program must be allowed to access the internet for one reason or another, and here the server sided security may actually work. But that brings us back to our previous conclusion, that as long as the user has the executable file, it is possible to simply remove the protection completely. If the client software does not call home to see if it is pirated or not, it will never discover the truth.

Throughout the years, software developers have made many brave attempts to thwart pirates, but there are few examples of cases where it has actually been successful for any longer period of time. Preventing users from finding out exactly what software does is in fact quite impossible. As long as the end user of a program should be able to run the software, there will always be ways to reverse engineer it.

The growth of the internet has also helped piracy become more widely spread, as it is possible to quickly distribute pirated software across the planet using high speed connections, peer to peer networks, and so on.

Code Obfuscation

A common saying is that "there is no such thing as clientside security". As we have demonstrated in the introduction above, this is in many ways true. There is no way to prevent users who have gotten in hold of the executable file to find out exactly what the file does and how to alter it to remove copy protection functions, or to create solutions to get past them. Instead, we decided to take a look a one method for making it harder for the malicious user to understand the program after it has been reverse engineered, namely code obfuscation.

Code obfuscation is the process of facing the fact that end users will be able to reverse engineer your software but deciding to give them a hard time understanding exactly what it does. Basic code obfuscation methods may include:

  • Adding code that does nothing or is never actually executed.
  • Find different, more complex and less readable, ways to carry out the same operations.
  • Move the code around, placing related functions as far away from each other as possible, so it becomes harder to follow the execution of the program.
  • Encode data in different ways. For instance, encrypt all data that is stored in memory to make it more difficult to see them when debugging the program

Code obfuscation does not focus on preventing users from seeing what the program does. Instead it takes a different approach and creates a needle–in–the–haystack kind of scenario for the would–be hacker. If the code is obfuscated enough, it may take days or weeks to understand what you are looking for and where to find it inside all the code, and perhaps even then, the code contains too many trick operations to be able to make sense of it.

Naturally, it is not possible to prevent the most dedicated reverse engineers from eventually finding some way to break the software, but if the work becomes too slow and tiresome, most of them just may give up before finding their way.

There are, of course, drawbacks with using code obfuscation. If we add too much nonsense code into the program, it may damage the execution time, making the program a lot slower than it was intended to be. Another drawback would be the file size of the executable file.

Adding too many instructions into it may make the size of the executable grow to an undesirable size, but this problem is not as urgent, as HD producers are constantly making bigger and bigger discs while prices are remaining stable or even dropping.

Code obfuscation focuses on making it harder for the malicious user to make sense of what the program actually does, by adding code, encoding strings, and so on. Drawbacks may include performance issues and file size.

Genetic Algorithms

Genetic algorithms are a commonly used optimization technique, which can be considered as a random beam search of the search space. It aims to find an optimal (or locally optimal) solution to a problem as quickly as possible.

In genetic algorithms, the possible solutions are made up from a collection of individuals, called the population. Each individual in the population holds one possible solution, and the initial population consists of randomly generated individuals. Normally the solution will be stored inside the individual as a bit string value, which will be translated into actual values at a later stage, for instance when trying to determine how well an individual performs.

For each individual and generation, a fitness value is calculated. The fitness value is a measure of how well the individual's solution solves the problem in question, and is calculated using a fitness function. The fitness function must, of course, be individual for each problem, and may be some kind measure of performance, solution size, and so on.

After calculating the fitness value of each individual for the generation, different methods are used to copy the best individuals into the next generation, and to let two good individuals create an offspring (also called crossover), which hopefully will be even better than the individual parents. Crossover is normally carried out by copying parts of the bit string from each parent into the child, using a policy which determines how to choose which bits to copy. Please see figure 1 below for a very simple example of single point crossover.

Single Point Crossover Example Figure 1. Single Point Crossover Example

There are many different ways of selecting individuals for crossover, but as genetic algorithms are inspired by biology, and the goal is to create the best possible individual, most of the selection models somehow involve the previously mentioned fitness value of the individuals, trying to make sure that the chance to reproduce is higher for individuals who represent betters solutions to the given problem.

Normally, around 40 percent of the individuals, namely the best ones, will be copied directly into the next generation, while the remaining 60 percent will be created using crossover. Something called mutation will also be used to randomly change some value for less than 1% of the population. As mentioned above, most selection methods, both for crossover and for copying, are strongly linked to the fitness of the individuals, so the reason for using mutation is to prevent premature convergence and to help increase the diversity of the population. Normally it involves the inversion of the value of one (or more) bits, as shown in figure 2.

Single Bit Mutation Example Figure 2. Single Bit Mutation Example

Genetic algorithms are an efficient way of finding a good solution to a given problem within in a short period of time. Starting out with a random population, and using fitness proportionate selection methods, genetic algorithms are able to quickly home in on a good solution, and use random mutation as a safe–guard to prevent the search from getting stuck in a local optima, with too many individuals converging around a single solution.

Anyone interested in reading more about genetic algorithms, and how to implement them, should take a look at [1]. This paper merely gives a very brief introduction, in order for the reader who has not yet come across genetic algorithms to get a general idea of what they are.

Genetic Algorithms and Code Obfuscation

As we mentioned above, code obfuscation is when software developers intentionally change the code to make it more complex than it needs to be, in order to make it less readable for anyone trying to reverse engineer their product.

Genetic algorithms, on the other hand, are an automated search / optimization function for finding good solutions to a problem in a short time.

We decided to take a closer look at the possibilities of combining the two into an application that would automatically help create code obfuscation using genetic algorithms. We mostly focused on the first two versions of code obfuscation, namely adding code that does nothing or is never run, and finding different, more complex (and less readable) ways to carry out the same operations. In our study, we had in mind the option of extracting a section of the source code, obfuscating it, and then inserting the obfuscated code in the stead of the original code. However, other obfuscation techniques, such as moving code around, creating new functions that may or may not be run at some point in the execution, etc, should be far from impossible to implement using the same kind of obfuscation engine that we discuss in the rest of this paper.

General Aspects

First of all, one would have to consider the goal of the task at hand. We start out with a section of source code, and we would like to create a new section of source code which does the same but which is harder to understand, and possibly carries out some operations that are not actually needed. Another goal would be to avoid letting the code obfuscation slow down the final application too much. In table 1 below, we have listed a few of the goals we would like to achieve, as well as things we would like to avoid.

OutcomeResult
More codeGood
More complex codeGood
Longer execution timeBad
More functionsGood
Large output fileBad
Losing original functionalityVery, very bad
Table 1. List of possible outcomes

Naturally, some of the different possible outcomes are in conflict with each other. For instance, we would like to have more code, and more complex code, but we would probably not like to have code that is infinitely longer and more complex than the original code, or the program will no longer be able to function properly.

Also, one very important aspect is that the original functionality of the code that is being obfuscated must be preserved. In case the functionality is altered, the obfuscation becomes a problem. Thus, the method for obfuscating the code must be well tested and flawless in its execution.

Given all the information above, one could draw the conclusion that the following aspects would be important to the design of the obfuscation engine:

  • Computational effort must be calculated and compared to the original computational effort for the code being replaced. If the difference is too large, the solution is inadequate.
  • Original functionality must be preserved at all times. It is of paramount importance that the code obfuscation does not introduce bugs or altered behavior to the application, or the obfuscation will end up creating useless code that is neither readable nor useable.
  • The relationship between the original code size and the output code size must not be too large. Although we want more code, to be able to better hide our original – important – code, we do not want it to be grossly oversized.

One more thing to consider would be that the solution produced by the obfuscation engine would have to be created in such a way that it would not be overly obvious to anyone reading the output code that it was created by the obfuscator, lest the reader may choose to simply ignore it and move on. However, in some cases this may actually be a positive effect, in case the code created by the obfuscating engine actually contains both valid and invalid code, as the user may miss important code, judging it to be automated code rather than real.

Representation Scheme

The representation scheme of the genetic algorithms is what determines how the bit string should be interpreted into an actual solution. In our case, it would be feasible to use a rather straightforward representation, where the bit strings are translated into different programming statements and operations, using both randomly generated variables and variables that have been found by analyzing the given code.

However, we feel that in this case, a bit string would be insufficient to use to try to contain the entire solution. Instead we would suggest using an object oriented design, where each individual keeps a list of "lines of code". It would not be impossible to use bit strings but for functional reasons, as well as programming simplicity, it would probably be better to use a more complex representation. Translation between real code and the individual's data, as well as verification of code functionality and calculation of computational effort should be more intuitive and easily implemented that way.

Creating the Fitness Function

When designing the fitness function for the code obfuscator, one would have to take great care to the aspects we have mentioned earlier. First of all, the most important criteria must be that the code contains the original functionality. It must in no way alter the values of variables in such a way that the output given by the obfuscated code would differ from the output given by the original code. This is rather complex, given that one of the points with obfuscating the code must be to make it harder for a hacker to understand, and as such, it should preferably be working with variables and data that is important in the real program. Regardless of complexity, any individual who contains a solution that does not fully represent the original source code, must have a fitness value of 0. It must be deemed as unsuitable and either be removed, to be replaced by a new individual, or allowed to remain in the hope that crossover or mutation would once again restore the functionality.

Another important characteristic of the individual that should affect the fitness value would be faulty code, which cannot be run or exhibits faulty behavior. Same as above, it should be given a fitness value of 0, since our goal is to create obfuscated code, not defect code.

Apart from these two important safeguards against getting useless output, the fitness function should try to calculate the computational effort needed to run the code, as well as the code size and code complexity of the output. Longer, more complex code would give a better fitness value, as long as the time needed to execute it would not have grown too much, or the code size would not be by far too big. It would be important for the user of the obfuscation engine to be able to tweak these kinds of settings to make it better suit his needs. In some cases, execution time may be of great importance, while file size is not, and so on.

Code Analysis and Verification

One very important part of the obfuscation engine would be the code analysis. It must be able to correctly read and interpret the code it is given to be able to create new code that does exactly the same thing but in a different way. If the code analysis is not completely correct, slight variances may occur, and the output code will lead to unpredictable behavior.

Problems with the code analysis would also lead to problems during the verification phase: if the code analysis failed to properly interpret the given code, it may believe that the newly created code is functionally equal to the original code, while in fact, it is only equal to its own interpretation of the original code.

Challenges

There are a number of serious challenges that would arise when using automated code obfuscation. First of all, when calculating the computational effort, the system would have to either generalize to estimate the time needed for execution of the given set of instructions, or it would have to be able to run and measure the real time actually needed. In case of the latter, either by being able to run the original program and injecting the code into the expected location, or by creating a new program where the code has been replaced by the obfuscated code, then comparing the time of the two. Even though estimating the computational effort may be more time saving, to actually execute the new code would possibly give other advantages.

For instance, newly created bugs would be clearly visible if the program would be unable to execute the newly created part of the code. Another benefit of running the code would be the possibility of functionality verification, where the new code could be run, along with the original version, and the exact outputs from the two could be compared.

One way of solving all these things would be if the obfuscation engine contained a debugger or similar that would be able to run the original program liked during normal execution, to a breakpoint, namely where the code to be obfuscated starts. It could then save all the relevant values, run to the next breakpoint, where the code to be obfuscated ends, save all the relevant values once more, go back to the first breakpoint and then restore the values contained there. Then for every individual to validate, proceed with the execution from the first breakpoint using the new code stored in the individual currently being evaluated. Once the execution of the newly created code was complete, the computational effort would be known by comparing the new execution time with the original execution time, the number of bugs would be known, hopefully none, and the number of changes in the output compared to the original code would be known, by simply comparing the new state with that obtained during the initial run.

Even though the implementation of such a debugger would be a considerable task in itself, it would probably help save time in the end, as well as add reliability to the obfuscator. For instance, time would no longer be estimated but measured, and changes to the environment during the run of the obfuscated code would be clearly visible rather than predicted.

Conclusions

Code obfuscation is a good way of deterring hackers from bothering to hack your software. It does not try to prevent them from seeing what is inside the executable but instead it makes the task harder and more time consuming by adding more code and more complexity.

Manual code obfuscation is rather time consuming, and it is hard to verify if the manually created code is effective or not, as we cannot tell for sure exactly how it affects the execution time of the program, or the state of the system, compared to the original code.

Using genetic algorithms to find solutions that are big enough, yet not too big, complex enough, yet not too slow, and 100% surely do not alter the execution of the important parts of the original software, would be a big challenge but far from impossible, and very useful indeed. It is very difficult and time consuming to manually obfuscate the code, but with an automated code obfuscating engine that actually runs the newly created code, it should be possible to make it a lot harder for malicious users to remove for instance copy protection functions from software, and yet be able to know for sure that the new code is functioning properly. Given enough thought to the design of the obfuscator, the software developers using it should be able to finish all their work on their software, then feed the output to the obfuscation engine, which in turn creates a new application which is verified to function as the original did, contains a lot more complex code, yet runs at an acceptable speed, and is a lot harder to reverse engineer.

Considerable challenges in creating such an obfuscation engine would involve:

  • The analysis of the original source code.
  • Selecting a good representation scheme for the individuals.
  • Finding good ways to create the new random code, given the results from the code analysis.
  • Creating good functions for crossover of individuals.
  • Implementing a reliable debugger in which to run the newly created code.
  • Verifying the output and comparing it to the original results.

Although there are a number of difficulties to address, and pitfalls to avoid, we believe that it should be possible to create a good code obfuscation engine, using genetic algorithms, and that there may be industrial interest in such a product. Protecting software has been a battle between hackers and developers going on for many years, and in the end the hackers always find a way to break that new, 'unbreakable' protection system. By obfuscating the code enough, it does not prevent hackers from reading it, but it makes the task less interesting. If much enough of the code has been obfuscated, it is no longer amusing for the hacker to try to find what he is looking for.

The reason why we believe that genetic algorithms would be a good choice for this task is because of the flexibility and speed. Since the initial generation is randomly created, and the size of the possible search space is very large, then if the creation function is good, the genetic algorithms should be able to find good obfuscated versions of the source code in a rather short time. Using a good debugging engine to test it should ensure that the resulting code is reliable and functional.

In the end, the obfuscation engine could both save time for the developers, who no longer need to manually obfuscate their code, or spend time on extensive testing to verify the new code once it is done, as well as lower the risk of that people who reverse engineering the code find anything useful unless spending a lot of time and effort on it, thus raising the security of the software.

References

  1. Reeves, Colin R.and Rowe, Jonathan E., Genetic Algorithms — Principles and Perspectives : A Guide to GA Theory. Kluwer Academic Publishers, ISBN 1–40–207240–6 (2002)
Niklas Linden - Blekinge Institute of Technology

Sunday, April 13, 2008

Code Obfuscation Literature Survey

In this paper we survey the current literature on code obfuscation and review current practices as well as applications. We analyze the different obfuscation techniques in relation to protection of intellectual property and the hiding of malicious code. Surprisingly, the same techniques used to thwart reverse engineers are used to hide malicious code from virus scanners. Additionally, obfuscation can be used to protect against malicious code injection and attacks. Though obfuscation transformations can protect code, they have limitations in the form of larger code footprints and reduced performance.

Introduction

In the last decade, distribution of code in an architecturally–neutral format has increased the ability to reverse engineer source code from executables. This practice has greatly concerned software companies who desire to protect the intellectual property of their products. Though copyright laws forbid direct piracy of software, developers are worried by possible theft of proprietary data structure and algorithmic design. Though there are several methods for protecting software, such as encryption, server–side execution and native code, obfuscation has been found to be the cheapest and easiest solution to this problem[6].

Code obfuscation is the practice of making code unintelligible, or at the very least, hard to understand. The process of code obfuscation is the application of transformations to the code, which changes the physical appearance of the code, while preserving the black–box specifications of the program. In this project we have surveyed the different obfuscation techniques that are currently being researched, particularly those that can be implemented via a compiler.

Not only can code obfuscation be used to protect intellectual property, it is also used extensively by authors of malicious code to avoid detection. Many viruses utilize obfuscation techniques to subvert virus scanners by continually changing their code signature with obfuscating transformations. We review the types of viruses that commonly use obfuscation and the techniques they employ.

The paper is laid out as follows: in Section 2 we review the general methods used in common obfuscation techniques, then in Section 3 we discuss techniques used to thwart static analysis. Section 4 talks about obfuscation in the disassembly phase, and Section 5 addresses obfuscation techniques used by viruses. Finally, we conclude in section 6.

General Methods for Obfuscation

General code obfuscation techniques aim to confuse the understanding of the way in which a program functions. These can range from simple layout transformations to complicated changes in control and data flow. The techniques described below summarize the work by Collberg et al. [4].

Obfuscation Quality

The quality of an obfuscating transformation is evaluated according to four criteria:

  1. Potency: How much obscurity it adds to the program
  2. Resilience: How difficult is it to break for an automatic deobfuscator
  3. Stealth: How well the obfuscated code blends with the rest of the program
  4. Cost: How much computation overhead it adds to the obfuscated application

Obfuscating control transformations

The control flow transformations used for obfuscation can be described as affecting the aggregation, ordering or computations of the flow of control. Aggregation transformation breaks up computations that are logically related and merges computations that are not. Control ordering transformations randomize the order in which the computations are carried out. Computation transformations insert new code or make algorithmic changes to source application.

The key to the success of control transformations is the resilience of opaque predicates and variables[5]. Opaque predicates and variables are constructs whose values are known to the obfuscator, but are difficult for the deobfuscator to deduce. An opaque predicate is trivial if a deobfuscator can deduce it by static local analysis, and weak if a deobfuscator can deduce it by static global analysis.

The key to the success of control transformations is the resilience of opaque predicates and variables[5]. Opaque predicates and variables are constructs whose values are known to the obfuscator, but are difficult for the deobfuscator to deduce. An opaque predicate is trivial if a deobfuscator can deduce it by static local analysis, and weak if a deobfuscator can deduce it by static global analysis.

Computation Transformation

There is a strong relation between the complexity of the code and the number of predicates it contains. As the number of predicates increase in a body of code, insertion of dead or irrelevant code into the program becomes easier. Therefore, the presence of opaque predicates provides opportunities to further obfuscate a program. For instance, a basic block, B can be split into two basic blocks by inserting an opaque predicate PT (which always evaluates to true) in the center of B. Different obfuscations can then be applied to each half of B. We can also obfuscate a loop by making the termination condition complex. This can be done by introducing an opaque predicate PT (always evaluates to true) or PF (always evaluates to false) that does not affect the number of times a loop will execute.

Obfuscating data abstractions

In the following sections we describe transformations that obscure data abstractions used in source code. There are two types of such obfuscation: namely, modifying inheritance relations and restructuring data arrays. This information is gleaned from a further study done by Collberg et al. [7].

Modifying inheritance relations

According to the Chidamber metric1, the complexity of a program increases with greater depth of the inheritance tree. Along these lines, we can artificially increase the complexity of a program either by splitting up a class or by inserting a new bogus class. Suppose C is the class we want to factor into C1 and C2, we should make sure that this factoring respects the scope of the variables V in C. In other words, if we split a class in two, one half must not end up with all the variables, while the other half contains all the methods.

Inheritance relations can also be modified by false refactoring. Refactoring takes place in two steps. Firstly, identify two independent classes that implement the same behavior. Secondly, move the features common to both the classes to a parent class. False refactoring is similar to refactoring except that false refactoring is performed on two independent classes that do not have a common behavior.

Restructure Arrays

There are many different types of transformations that can be devised to obscure the operations performed on an array. These transformations include: splitting an array, merging two or more arrays, flattening an array (i.e. decrease the dimensions of the array), and folding the array (i.e. increase the dimensions of the array).

Obfuscating Procedural Abstractions

In the following section we discuss different obfuscating transformations that obscure the procedural abstractions in a program. These break up user–defined abstractions or introduce new abstractions, thereby destroying the original structure of the source code.

Table Interpretation

This is one of the most efficient, but expensive, transformations used for obfuscation of code. The fundamental idea is to convert a portion of the code to a different machine code. This new code is then executed by a new virtual machine interpreter included in the obfuscated code. There is usually one to two orders of magnitude slowdown for each level of interpretation, and so this transformation should only be used for those sections of code that consumes a small part of the total runtime.

While this transformation is highly potent, it is not very resilient. The deobfuscator could inline code for each bytecode instruction prior to decompilation. One way to thwart this is to replace the bytecode string by a program that produces it or obfuscate the program itself by introducing bogus predicates and dead code.

Inline and Outline Methods

Inlining, a code optimization technique also finds a use in code obfuscation. Once code is inlined and the procedure itself has been removed there is no trace of that abstraction left in the code. Conversely, outlining is the process of selecting a group of statements in a procedure and using them to create a sub–procedure. One example of applying inlining and outlining could be to inline two procedures A and B which are called one after the other, and then outline a portion of their combined code into a new procedure.

Clone Methods

While examining the purpose of a subroutine, a reverse engineer would examine its body and signature. Also the environments where these functions are called play an important role in the understanding of the code. We can make this process difficult by obfuscating a method’s call site to make it appear like different routines are called.

Obfuscating built–in data types

In the following sections we discuss different obfuscating transformations that obscure basic data types used in the source application. Designing such transformations is difficult because these data types form an integral part of programming languages. These transformations are very high in cost and low in stealth. Cost is high because we are obscuring the data type, but it is not very stealthy because the transformation is easy to identify.

Split variables

Variables of restricted range can be split up into two or more variables. In order to split a variable V of type T into two variables p and q of type U, we need to provide three pieces of information:

  1. A function f(p,q) that maps the values of p and q into the corresponding value of V.
  2. a function g(V) that maps the value of V into the corresponding values of p and q.
  3. new operations cast in terms of operations on p and q.

The potency, resilience and cost of this transformation all grow with the number of variables into which the original variable is split.

Convert static to procedural data

Static strings contain much useful information to a reverse engineer. A simple way of obfuscating them would be to convert the string to a program that computes the string. Aggregating the computation of all static string data into one function, however, may be un–stealthy. We can achieve much higher resilience and potency if the program that computes the string is broken into smaller components and embedded in the normal control flow of a program.

Merge scalar variables

This method of obfuscation involves merging two or more scalar variables into a single variable. The variables v1,v2...vk can be merged into one variable Vm provided the the combined ranges of v1,v2...vk fit within the precision of Vm. Arithmetic on individual variables would be transformed into arithmetic on Vm. The resilience of merging variables, however, is too low to make the transformation profitable.

Code obfuscation by obstructing static analysis of programs

Effective alias detection is shown to be NP Hard and this method of code obfuscation aims to introduce non trivial aliases into the program such that data flow analysis is rendered slow or less precise. The transformations discussed below introduce aliases into the program which further hinders the systematic analysis of the program by breaking down control flow. We focus here on techniques used for obstructing static analysis of programs. Static analysis can be either controlflow sensitive or control–flow insensitive. Flowinsensitive analysis does not provide as much information with which to tamper, as does flowsensitive analysis. Control–flow sensitive analysis first constructs a control flow graph of a program and conducts data flow analysis. The techniques to thwart static analysis increase the data dependency of the control flow graph. This results in increased complexity and decreased precision of the analysis. In the following sections we summarize the work of Wang et al. [13].

Control–flow Transformations

Control flow transformations are accomplished in two steps. The first step involves decomposing high level control transfers into a series of ifthen–goto statements. The second step is to then modify the goto statements such that the target addresses are determined dynamically. This can be achieved by branching on a data variable instead of direct jumps. These transformations make direct branches dependent on data and results in the flattening of the control flow graph. Now, given that there is no direct information about branch target addresses, the problem of building a flow–graph for static analysis of the program depends on identifying the latest definition of the variable at each branching point.

Data–flow Transformations

Once control–flow transformations have been applied, the problem of constructing a flow–graph now depends on data–flow analysis. Data–flow problems have been proved to be NP complete or harder. The fundamental difficulty in data flow analysis arises from the presence of aliases in the program. The second set of transformations aims to introduce aliases into the program that determine the computation and hence the analysis, of branch targets. The transformation is done in the following steps:

  1. In each function, introduce an arbitrary number of pointer variables.
  2. Insert artificial blocks, or code in existing blocks, that assign pointers to data variables.
  3. Replace references to variables and array elements with indirection through these pointers. Previously meaningful computation on data quantities can be replaced with semantically equivalent computation through the pointers.
  4. A definition of a pointer is placed in a different block than its use.

The effect of these transformations is that the static analyzer does not know which blocks to execute and since definitions to pointers and their uses are placed in different blocks the analyzer will not be able to deduce which definition applies to each use of the pointer. This approach of thwarting static disassembly is capable of expanding analysis times and reducing the precision of analysis to useless levels.

Code Obfuscation in Disassembly Phase

Most of the obfuscation techniques discussed so far were applied during the decompilation phase. We can also obfuscate code at the disassembly phase. In this section we discuss the two widely used static disassembly algorithms and techniques to thwart each of them.

There are two methods of disassembly: static disassembly, where the file being disassembled by the disassembler is not executed during the course of disassembly; and dynamic disassembly, where the file is executed on some input while the disassembly happens. We focus only on code obfuscation techniques for thwarting static disassembly proposed in the literature. There are two kinds of static disassembly: linear sweep and recursive traversal. A linear sweep disassembler sweeps through the program disassembling instructions as they are encountered. A recursive traversal disassembler would disassemble the instructions according to the flow of control. That is, when a branch instruction is encountered, it determines the possible control flow successors of that instruction and proceeds with disassembly at those instructions. In the following sections we summarize the work of Linn et al. [9].

Thwarting disassembly

A machine code file consists of different sections that contain various details about the program. One such instruction stored in the machine code of a program is the program entry point, i.e the location in the machine code where the machine instructions begin. To thwart a disassembler, we have to confuse as much as possible its notion of where the instruction boundaries in a program lie. On some instruction sets the disassembly process is self–repairing. That is even when a disassembly error occurs, the disassembler eventually ends up resynchronizing with the actual instruction. So the efforts to confuse disassembly should take this factor into account. Some techniques used to thwart disassembly are discussed in the following sections.

Junk Insertion

We can thwart disassembly by inserting junk bytes at specific places in the instruction stream. To confuse the disassembler, the junk instructions should be partial instructions and in order to preserve program semantics the junk instructions inserted should be unreachable at run time. To do so we choose candidate blocks, which are blocks that will allow such junk bytes to be inserted before it. To ensure that these junk instructions are unreachable during program execution, these candidate blocks must not allow the execution to fall through to them. Once this is done, we have to determine which junk instructions to insert so as to confuse the disassembler as much as possible and delay resynchronization for as long as possible.

Thwarting Linear Sweep

A linear sweep disassembly cannot distinguish data in the text section. This can be exploited in thwarting disassembly by inserting junk bytes at selected locations in the instruction stream. The junk bytes can only be inserted before candidate blocks that do not have execution fall through to it. In programs obtained from a typical optimizing compiler candidate blocks are around 30 instructions apart. To enable introduction of more junk bytes, we have to find more candidate blocks and this can be done with the help of branch flipping. The idea is to invert the sense of conditional jumps, by converting code of the form

bcc Addr

where cc represents a condition to

bcc Ľ
jmp Addr
Ľ:

where cc is the complementary condition to cc.

Thwarting Recursive Traversal

Recursive traversal deals intelligently with control flow and this can be exploited to confuse the disassembly process. Once a branch is encountered, recursive traversal continues disassembly at the possible targets of this branch. Recursive traversal assumes that the normal branches and call functions work reasonably. That is, a conditional branch has two possible targets and a call to a function returns to the point from where it was called. Another aspect of recursive traversal is the difficulty in identifying possible targets of indirect control transfer functions.

Branch functions

The assumption that the call function returns to the instruction immediately following the call instruction can be exploited using branch functions. Given a finite map Φ over locations in a program

Φ = {a1b1,a2b2 ... anbn}

a branch function ƒΦ is a function that, whenever it is called from one of the locations ai, causes control to be transferred to the corresponding location bi. Given such a branch function we can replace n conditional statements by the following lines:

A1:jmp b1
A2:jmp b2
An:jmp bn

By calls to branch function

A1:jmp ƒΦ
A2:jmp ƒΦ
An:jmp ƒΦ

Such branch functions have two purposes. One which obscures the computation of the target address within branch function, and the second creates opportunities for misguiding the disassembler. Such branch functions can be implemented in a number of ways. One implementation would be to use the return addresses to look up the table. Another, better, implementation is to have the callee pass, as an argument the offset from the instructions immediately after it to the target bi.

Call conversion

A variation of the branch function scheme can be used to insert junk bytes of instructions immediately following call instructions as well. This is done by rerouting call instructions through a specialized branch function, which branches to its appropriate target function, but returns to some predetermined offset from the original call instruction. This can obscure control flow information and also makes it difficult to decipher function entry points while increasing the potential to mislead the disassembler.

Opaque predicates

The assumption that a conditional branch has two targets can be exploited by converting an unconditional branch to a conditional branch, that always goes in one direction. This can be achieved by the use of opaque predicates. Once the unconditional branch is replaced by a conditional branch that uses an opaque predicate, we have a candidate block before which we can insert junk bytes to mislead disassembly.

Jump Table Spoofing

In addition to inserting junk bytes to mislead disassembly we can also insert artificial jump tables to subvert recursive traversal disassembly. Recursive traversal disassembly attempts to determine the size of the jump table to identify possible targets of an indirect jump. This behavior can be exploited by introducing a jump table that is unreachable at runtime.

Code Obfuscation as it Relates to Viruses

Not only is code obfuscation used to protect commercial software against reverse engineering; malicious code writers can also use obfuscation to hide their creations from virus scanners. In the end, both uses have the same basic intent: to obscure understanding of the program. Of course each task involves different limiting properties. Viruses need to deal with the fact that their obfuscations must continually change, or else scanners will have a known signature to match against. This brings up time and space restrictions. Conversely, commercial software has performance requirements, which limit the types and extent of obfuscation.

In this section we discuss the differences between obfuscation of commercial software and malicious code. We first introduce the particular types of viruses that use obfuscation techniques; we then discuss the transformations they typically use in subverting detection. Finally, we explore the possibility of learning from these techniques and perhaps transferring them to the task of protecting commercial software. We restrict our survey to viruses in the interest of time and space.

Virus Types

The two main types of malicious code that use obfuscation techniques to hide themselves from virus scanners are polymorphic and metamorphic viruses. Simple viruses do not change from generation to generation, so virus scanners can rely on simplistic string matching detection to determine if a file has been infected or not. Polymorphic and metamorphic viruses were developed so as to subvert general scanner methods. They rely on techniques that change their code signature each infection generation, making it impossible for string matching algorithms to detect their presence.

Polymorphic

Polymorphic viruses are an extension to encrypted viruses [1, 14]. An encrypted virus simply encrypts its body and then attaches a decryption key to itself, which changes from generation to generation. A polymorphic virus also encrypts its body, however, it randomly changes its decryption algorithm each time the virus infects a host. Because the decryption scheme is never encrypted, and it is generally tacked on to the end of the virus code, it might be source of detection by virus scanners. Therefore, when a polymorphic virus changes its decryption routine, it changes the part of the signature that scanners could detect.

Virus scanners can still get around this obfuscation by basically waiting for a polymorphic virus to deobfuscate itself. Virus scanners have the ability to run a virus in a “sandbox” or emulator, which allows the virus to run without causing harm. As the virus runs, it must decrypt itself to attempt to infect a file and therefore the emulator can scan for the virus signature when the virus decides to execute. Obviously this technique has limitations, one of which is how to know how long to run.

Metamorphic

Metamorphic viruses, like polymorphic viruses encrypt themselves to hide their signatures from virus scanners. Metamorphic viruses, however, can change their source code and then recompile themselves if compilers are available on the victim machine. This allows the virus to add and remove junk code, which evolves their signature over time. Metamorphic viruses are, of course, an advance in technology beyond the techniques of polymorphic viruses. Whereas, polymorphic viruses will decrypt themselves in memory to propagate and infect hosts, metamorphic viruses do not. In fact, a metamorphic virus never reveals its entire virus body at once, making it almost impossible for a simple signature–matching scanner to detect it.

As a response, virus scanners have evolved to use heuristics to detect metamorphic viruses. Common techniques still involve emulation, or running on a Virtual Machine, which simulates some of the functionality of a operating system. Tracking system calls is a technique that has seen much research in the last few years [12, 8]. M. Christodorescu et al. have recently developed a static analysis technique to detect metamorphic viruses [3]. Techniques to detect a metamorphic virus can be very much like those that are used to reverse engineer commercial software. Static analysis, decompilation and the like can recognize obfuscation efforts such as dead code insertion.

Obfuscation Techniques

Obfuscations used by virus writers are generally utilized for the purpose of changing the signature of the code, i.e. the sequence of instructions so that a virus scanner will be unable to use search strings to detect the presence of the virus. Common obfuscation techniques, which change the layout of code, are: dead code insertion (also called junk or trash insertion, code transposition, register reassignment and instruction substitution. These obfuscations are explained briefly below, though they generally expand on the earlier descriptions of obfuscation transformations.

Dead Code Insertion

This transformation acts just like it sounds, it involves placing ineffectual instructions in amongst existing instructions. These new instructions can be no–ops, change the state of a program, only to change it right back, or can involve jump instruction placement which will skip the new instructions all together. Though changes such as these most likely will not fool a reverse engineer, it has been shown by M. Christodorescu et al. [3] that these simple transformations can subvert many virus scan engines.

Code Transposition

Code transposition is simply the cosmetic movement of code within a file. For instance the location of procedures could be changed around, or instructions within a procedure, which are independent of one another, can be reordered.

Register Reassignment

Register reassignment refers to the change of registers used by live variables. If a particular register R1 is not used during the live range of a variable, then the register R2 used currently to store the live variable can be replaced by R1. Again, this transformation would not stand up to reverse engineering techniques, but it does have the effect of changing the signature of the program.

Instruction Substitution

When compared to the above transformations, this obfuscation technique is the most resilient, even when it comes to reverse engineering. Instruction substitution is implemented using a library of equivalent instructions. It replaces an instruction in the body of code with one that is equivalent. This can greatly change the signature of the code, and is difficult to deobfuscate, especially if the library of equivalent instructions is unavailable.

Comparisons

Virus writers obfuscate code to change the signature of the code, so that virus scanners cannot recognize it. Which makes sense because the scanners are searching for an infection that may not be there. This is obviously different than obfuscation from a reverse engineering standpoint. In the latter case the obfuscation is meant to hide important or sensitive data, which the reverse engineer knows is present. In general, obfuscation techniques used by malicious code writers are not as potent as those used for commercial software.

There are practices, however, which malicious code authors use, that may have applications to obfuscation of commercial software. Such an example is a technique used by the Zmist virus to hide its polymorphic decryptor[12]. The decryptor is placed within the host program in chunks separated by pieces of the host code and connected together by jumps. It might be possible to apply this type of transformation on commercial software. An important or sensitive algorithm could be broken into chunks and spread throughout the code. For the most part, however, transformations applied to viruses will not stand up against reverse engineering attacks and so cannot directly be applied to commercial software.

Another Angle

There is of course another angle to obfuscation of commercial software, which is the use of obfuscation to protect against malicious attacks. Such attacks are generally in the form of code injection, where a malicious entity attempts to insert code into an existing program with the intent that the code will be run and transfer control of the system to the attacker [10]. One can apprehend that to insert code into a program and have it successfully run without notice requires an indepth understanding of the victim program. To thwart such an attack, clearly it would help if it were difficult to understand the inner workings of the victimized program. And here code obfuscation can take on yet another role.

There is current research in the area of address obfuscation, which aims to vary the location of data and procedures through dynamic change of the program’s address space during execution [11, 2]. Not only does this technique subvert understanding of a program, but it also has the ability to protect against repeated attacks because of its ability to change dynamically. This is an especially interesting technique because it seems to influenced by the way in which viruses themselves can mutate to avoid detection.

Conclusion

Though code obfuscation can thwart many attacks, given enough time and effort any of the techniques discussed above can be overcome by reverse engineers. No obfuscation has yet been found that can completely resist reverse engineering. In addition to this drawback, code obfuscation increases the code footprint, decreases performance, and can hinder certain compiler optimizations. In spite of these limitations, obfuscation techniques, when used sparingly, and combined appropriately, can add a layer of protection against theft and insertion of malicious code. The literature surveyed produces an inconclusive result as to whether obfuscation techniques used by viruses can be applied to commercial software. Clearly, obfuscation does help to protect against virus infections, but further study is needed to ascertain which techniques produce the best results.

References

  1. Understanding and managing polymorphic viruses, 1996.
  2. Sandeep Bhatkar, Daniel C. DuVarney, and R. Sekar. Address obfuscation: an efcient approach to combat a broad range of memory error exploits.
  3. M. Christodorescu and S. Jha. Static analysis of executables to detect malicious patterns. In 12th USENIX Security Symposium, pages 169–186, August 2003.
  4. Christian Collberg, Clark Thomborson, and Douglas Low. A taxonomy of obfuscating transformations, July 1997.
  5. Christian Collberg, Clark Thomborson, and Douglas Low. Manufacturing cheap, resilient, and stealthy opaque constructs. In Principles of Programming Languages 1998, POPL’98, pages 184–196, 1998.
  6. Christian S. Collberg and Clark Thomborson. Watermarking, tamper–proofing, and obfuscation — tools for software protection. In IEEE Transactions on Software Engineering, volume 28, pages 735–746, August 2002.
  7. Christian S. Collberg, Clark D. Thomborson, and Douglas Low. Breaking abstractions and unstructuring data structures. In International Conference on Computer Languages, pages 28–38, 1998.
  8. A. Lakhotia and E. U. Kumar. Abstract stack graph to detect obfuscated calls in binaries. In IEEE International Workshop on Source Code Analysis and Manipulation, September 2004.
  9. C. Linn and S. Debray. Obfuscation of Executable Code to Improve Resistance to Static Disassembly, 2003.
  10. C. M. Linn, M. Rajagopalan, S. Baker, C. Collberg, S. K. Debray, J. H. Hartman, and P. Moseley. A multi–faceted defence mechanism against code injection attacks.
  11. M. Madou, B. Anckaert, P. Moseley, S. Debray, B. De Sutter, and K. De Bosschere. Software protection through dynamic code mutation.
  12. Peter Szor and Peter Ferrie. Hunting for Metamorphic, September 2001.
  13. Chenxi Wang, Jonathan Hill, John Knight, and Jack Davidson. Software tamper resistance: Obstructing static analysis of programs. Technical Report CS–2000–12, 12 2000.
  14. T. Yetiser. Polymorphic Viruses: Implementation, Detection, and Protection. VDS Advanced Research Group, January 1993.

Arini Balakrishnan, Chloe Schulze
CS701 Construction of Compilers, Instructor: Charles Fischer Computer Sciences Department University of Wisconsin, Madison
December 19th, 2005