Pointer Analysis
In this article, we begin the discussion of a very simple flow-insensitive pointer alias analysis assuming that there are no procedure calls. We shall show in subsequent sections how to handle procedures first context insensitively, then context sensitively. Flow sensitivity adds a lot of complexity, and is less important to context sensitivity for languages like Java where methods tend to be small. The fundamental question that we wish to ask in pointer-alias analysis is whether a given pair of pointers may be aliased. One way to answer this question is to compute for each pointer the answer to the question what objects can this pointer point to?" If two pointers can point to the same object, then the pointers may be aliased.
Why is Pointer Analysis Difficult?
Pointer-alias analysis for C programs is particularly difficult, because C programs can perform arbitrary computations on pointers. In fact, one can read in an integer and assign it to a pointer, which would render this pointer a potential alias of all other pointer variables in the program. Pointers in Java, known as references, are much simpler. No arithmetic is allowed, and pointers can only point to the beginning of an object. Pointer-alias analysis must be interprocedural. Without interprocedural analysis, one must assume that any method called can change the contents of all accessible pointer variables, thus rendering any intraprocedural pointer-alias analysis ineffective. Languages allowing indirect function calls present an additional challenge for pointer-alias analysis. In C, one can call a function indirectly by calling a dereferenced function pointer. We need to know what the function pointer can point to before we can analyze the function called. And clearly, after analyzing the function called, one may discover more functions that the function pointer can point to, and therefore the process needs to be iterated. While most functions are called directly in C, virtual methods in Java cause many invocations to be indirect. Given an invocation x . M () in a Java program, there may be many classes to which object x might belong and that have a method named m. The more precise our knowledge of the actual type of x, the more precise our call graph is. Ideally, we can determine at compile time the exact class of x and thus know exactly which method m refers to. Consider the following sequence of Java statements:
Object o;
o = new String ();
n = o.length () ;
Here o is declared to be an Object. Without analyzing what o refers to, all possible methods called "length" declared for all classes must be considered as possible targets. Knowing that o points to a String will narrow interprocedural analysis to precisely the method declared for String. It is possible to apply approximations to reduce the number of targets. For example, statically we can determine what all the types of objects created are, and we can limit the analysis to those. But we can be more accurate if we can discover the call graph on the fly, based on the points-to analysis obtained at the same time. More accurate call graphs lead not only to more precise results but also can reduce greatly the analysis time otherwise needed. Points-to analysis is complicated. It is not one of those "easy" data flow problems where we only need to simulate the effect of going around a loop of statements once rather, as we discover new targets for a pointer, all statements
Assigning the contents of that pointer to another pointer need to be re-analyzed. For simplicity, we shall focus mainly on Java. We shall start with flow insensitive and context-insensitive analysis, assuming for now that no methods are called in the program. Then, we describe how we can discover the call graph on the fly as the points-to results is computed. Finally, we describe one way of handling context sensitivity.
A Model for Pointers and References
Let us suppose that our language has the following ways to represent and manipulate References:
- Certain program variables are of type "pointer to T" or "reference to T," where T is a type. These variables are either static or live on the run-time stack. We call them simply variables.
- There is a heap of objects. All variables point to heap objects, not to other variables. These objects will be referred to as heap objects.
- A heap object can have fields, and the value of a field can be a reference to a heap object (but not to a variable). Java is modeled well by this structure, land we shall use Java syntax in examples. Note that C is modeled less well, since pointer variables can point to other pointer variables in C, and in principle, any C value can be coerced into a pointer.
Since we are performing an insensitive analysis, we only need to assert that a given variable v can point to a given heap object h; we do not have to address the issue of where in the program v can point to h, or in what contexts v can point to h. Note, however, that variables can be named by their full name. In Java, this name can incorporate the module, class, method, and block within a method, as well as the variable name itself. Thus, we can distinguish many variables that have the same identifier. Heap objects do not have names. Approximation often is used to name the objects, because an unbounded number of objects may be created dynamically. One convention is to refer to objects by the statement at which they are created. As a statement can be executed many times and create a new object each time, an assertion like "v can point to h" really means "v can point to one or more of the objects created at the statement labeled h."
The goal of the analysis is to determine what each variable and each field of each heap object can point to. We refer to this as points-to analysis; two pointers are aliased if their points-to sets intersect. We describe here an inclusion-based analysis; that is, a statement such as v = w causes variable v to point to all the objects w points to, but not vice versa. While this approach may seem obvious, there are other alternatives to how we define points-to analysis. For example, we can define an equivalence-based analysis such that a statement likes v = w would turn variables v and w into one equivalence class, pointing to all the variables that each can point to. While this formulation does not approximate aliases well, it provides a quick, and often good, answer to the question of which variables point to the same kind of objects.
Flow Insensitivity
We start by showing a very simple example to illustrate the effect of ignoring control flow in points-to analysis.
Let us now formalize a flow-insensitive pointer-alias analysis based on the discussion above. We shall ignore procedure calls for now and concentrate on the four kinds of statements that can affect pointers:
- Object creation. H: T v = new T O; this statement creates a new heap object, and variable v can point to it.
- Copy statement. v = w; Here, v and w are variables. The statement makes v point to whatever heap object w currently points to; i.e., w is copied into v.
- Field store. V. f = w; the type of object that v points to must have a field f, and this field must be of some reference type. Let v point to heap object h, and let w point to g. This statement makes the field f, in h now point to g. Note that the variable v is unchanged.
- Field load. v = w. f ; Here, w is a variable pointing to some heap object that has a field f , and f points to some heap object h. The statement makes variable v point to h.
Note that compound field accesses in the source code such as v = w. f .g will be broken down into two primitive field-load statements: Let us now express the analysis formally in Data log rules. First, there are only two IDB predicates we need to compute:
- 1. ~ t s (VH,) means that variable V can point to heap object H.
- 2. Hats (H, F, G) means that field F of heap object H can point to heap object G.
The EDB relations are constructed from the program itself. Since the location of statements in a program is irrelevant when the analysis is flow insensitive, we only have to assert in the EDB the existence of statements that have certain forms. In what follows, we shall make a convenient simplification. Instead of defining EDB relations to hold the information garnered from the program, we shall use a quoted statement form to suggest the EDB relation or relations that represent the existence of such a statement. For example, "H : T V = new T" is an EDB fact asserting that at statement H there is an assignment that makes variable V point to a new object of type T. We assume that in practice, there would be a corresponding EDB relation that would be populated with ground atoms, one for each statement of this form in the program.
Effects of a Method Invocation
The effects of a method call such as x = y. n (z) in Java on the points-to relations can be computed as follows:
- Determine the type of the receiver object, which is the object that y points to. Suppose its type is t. Let run be the method named n in the narrowest super class of t that has a method named n. Note that, in general, which method is invoked can only be determined dynamically.
- The formal parameters of m are assigned the objects pointed to by the actual parameters. The actual parameters include not just the parameters passed in directly, but also the receiver objects itself. Every method invocation assigns the receiver object to the t h I s ~an r I a b l e .We refer to the t h I s variables as the 0th formal parameters of methods. In x = y. n (2), there is two formal parameters: the object pointed to by y is assigned to variable t h i s, and the object pointed to by x is assigned to the first declared formal parameter of m.
- The returned object of m is assigned to the left-hand-side variable of the assignment statement. In context-insensitive analysis, parameters and returned values are modeled by copy statements. The interesting question that remains is how to determine the type of the receiver object.
\We can conservatively determine the type according to the declaration of the variable; for example, if the declared variable has type t, then only methods named n in subtypes of t can be invoked. Unfortunately, if the declared variable has type Object, then all methods with name n are all potential targets. In real-life programs that use object hierarchies extensively and include many large libraries, such an approach can result in many spurious call targets, making the analysis both slow and imprecise. We need to know what the variables can point to in order to compute the call targets; but unless we know the call targets, we cannot find out what all the variables can point to. This recursive relationship requires that we discover the call targets on the fly as we compute the points-to set. The analysis continues until no new call targets and no new points-to relations are found.
Dynamic Loading and Reflection
Languages like Java allow dynamic loading of classes. It is impossible to analyze all the possible code executed by a program, and hence impossible to provide any conservative approximation of call graphs or pointer aliases statically. Static analysis can only provide an approximation based on the code analyzed. Remember that all the analyses described here can be applied at the Java byte code level, and thus it is not necessary to examine the source code. This option is especially significant because Java programs tend to use many libraries. Even if we assume that all the code to be executed is analyzed, there is one more complication that makes conservative analysis impossible: reflection. Reflection allows a program to determine dynamically the types of objects to be created, the names of methods invoked, as well as the names of the fields accessed. The type, method, and field names can be computed or derived from user input, so in general the only possible approximation is to assume the universe.
The forename method in the Class library takes a string containing the class name and returns the class. The method new Instance returns an instance of that class. Instead of leaving the object o with type Objact, this object is cast to a super class T of all the expected classes.
While many large Java applications use reflection, they tend to use common idioms, such as the one shown in Example . As long as the application does not redefine the class loader, we can tell the class of the object if we know the value of class Name. If the value of class Name is defined in the program, because strings are immutable in Java, knowing what class Name points to will provide the name of the class. This technique is another use of points-to analysis. If the value of class Name is based on user input, then the points-to analysis can help locate where the value is entered, and the developer may be able to limit the scope of its value.
Context-Sensitive Pointer Analysis
Context sensitivity can improve greatly the precision of interprocedural analysis. We talked about two approaches to interprocedural analysis, one based on cloning and one on summaries. Which one should we use? There are several difficulties in computing the summaries of points-to information. First, the summaries are large. Each method's summary must include the effect of all the updates that the function and all its callers can make, in terms of the incoming parameters. That is, a method can change the points-to sets of all data reachable through static variables, incoming parameters and all objects created by the method and its callers. While complicated schemes have been proposed, there is no known solution that can scale to large programs. Even if the summaries can be computed in a bottom-up pass, computing the points-to sets for all the exponentially many contexts in a typical top-down presents an even greater problem. Such information is necessary for global queries like finding all points in the code that touch a certain object.
In this article, we discuss a cloning-based context-sensitive analysis. A cloning-based analysis simply clones the methods, one for each context of interest. We then apply the context-insensitive analysis to the cloned call graph. While this approach seems simple, the devil is in the details of handling the large number of clones. How many contexts are there, it is not uncommon to find 1014 contexts in a Java application? Representing the results of these many contexts is the challenge. We separate the discussion of context sensitivity into two parts:
- How to handle context sensitivity logically? This part is easy, because we simply apply the context-insensitive algorithm to the cloned call graph.
- How to represent the exponentially many contexts? One way is to represent the information as binary decision diagrams (BDD's), a highly optimized data structure that has been used for many other applications. This approach to context sensitivity is an excellent example of the importance of abstraction. As we are going to show, we eliminate algorithmic complexity by leveraging the years of work that went into the BDD abstraction.
We can specify context-sensitive points-to analysis in just a few lines of Data log, which in turn takes advantage of many thousands of lines of existing code for BDD manipulation. This approach has several important advantages. First, it makes possible the easy expression of further analyses that use the results of the points-to analysis. After all, the points-to results on their own are not interesting. Second, it makes it much easier to write the analysis correctly, as it leverages many lines of well-debugged code.
Contexts and Call Strings
The context-sensitive points-to analysis described below assumes that a call graph has been already computed. This step helps make possible a compact representation of the many calling contexts. To get the call graph, we first run context-insensitive points-to analysis that computes the call graph on the fly. We now describe how to create a cloned call graph. A context is a representation of the call string that forms the history of the active function calls. Another way to look at the context is that it is a summary of the sequence of calls whose activation records are currently on the run-time stack. If there are no recursive functions on the stack, then the call string - the sequence of locations from which the calls on the stack were made - is a complete representation. It is also an acceptable representation, in the sense that there are only a finite number of different contexts, although that number may be exponential in the number of functions in the program. However, if there are recursive functions in the program, then the number of possible call strings is infinite, and we cannot allow all possible call strings to represent distinct contexts.