SEARCHING & SORTING
Searching and sorting are fundamental operations, which are frequently used in many applications. Identifying a particular item or a record among a set of elements or records is called as searching. If the item or record is found then search is said to be successful; otherwise unsuccessful. Sorting is a process of arranging a set of elements or records in a specific way. In sorting the elements are arranged in increasing or decreasing order of their key values. In this article you can read about various searching and sorting techniques. Searching Notations Before the main article starts, you should read some terms, which are frequently used in searching process. A table or a file is a group of records, each record having one or more fields, known as keys. These keys are used to differentiate among different records. For instance we may regard a telephone directory as a file of records, each record having three fields : name, address and phone number. The key may be either the person name or the phone number. In some other applications one may desire the phone number at a particular address, so this field too could be the key. We can say that there is a strong relationship between the key and the record. The association between the key and the record may be simple or complex. In the simplest form, a key may be contained within the record at a specific offset from the start of the record. Such a key is called an internal key or an embedded key as shown in figure-1(a). On the other hand, in the complex form, there may be separate table of keys that include pointers to the records. Such keys are called external keys as shown in figure-1(b). For the sake of simplicity we will adopt the rule throughout this article, which is given a key there should be at most one record have the same key value. Such a key is called a primary key. For example, the telephone number is a primary key because there can not be two records with the same telephone number. However since any field of a record can serve as the key in a particular applications, keys need not be unique. For example, if we are using name as a key in a telephone directory, there may be more than one person with the same name in the directory, such a key is called a secondary key. Figure 1 When we search a table or a file for a specific record, it will be a time-consuming action in a program and when we write searching techniques we must consider this point. Before writing any searching techniques we must see how the records are arranged. A table or file may be organized as an array of records, a linked list, a tree or even a graph. If there are large number of records then it will be necessary to store the records in the file on some secondary media, which are external to the computer memory, such as tape or disk. This case is called external searching. On the other hand if the records to be searched are stored entirely within the computer memory, it is called internal searching. In this article we will study internal searching in detail. Three are three basic searching techniques – sequential, binary and hash searching. Let us study these searching techniques in details. Sequential (Linear) Searching
Introduction
A general searching algorithm accepts two arguments – the target ‘item’ and the list to be searched. A list may contain ‘n’ number of records. For the sake of simplicity each record is assumed to consist of a key field only. Our goal is to find an item in the list whose key is equal to the ‘item’. The searching algorithm will indicate whether it has found ‘item’ or not. If the ‘item’ is found in a contiguous list then it returns the index of the list where the ‘item’ is found or in case of a linked list it returns any positive value, say 1. Unfortunately if the ‘item’ was not found, it returns a value –1.
Basic Searching Techniques
The simplest form of a search is the sequential searching or linear searching. In sequential searching we search for a target ‘item’ in the list by examining the records in the list one by one. Initially the process starts with the first record. If this first comparison results in false then the algorithm proceeds to the next record. This process continues until either the desired record is found or the list is exhausted. Thus we can say that in sequential search we start at one end of the list and scan down it until the desired item is found or the other end is reached. This technique is suitable for a table or a linked list.
Let us assume that ‘A’ is an array of 10 keys; A[0], A[1], through A[9] and we have following list of numbers:
A[0] |
A[1] |
A[2] |
A[3] |
A[4] |
A[5] |
A[6] |
A[7] |
A[8] |
A[9] |
28 |
24 |
36 |
29 |
14 |
75 |
80 |
31 |
60 |
49 |
And let we want to find the item ‘75’. Initially we will start searching with the very first element of the list, that is A[0]. If item ‘75’ is not found at A[0] then we will take next element, that is A[1], as shown in figure-2(a) and compare it with the item ‘75’. If item ‘75’ is not found at A[0] then we will take next element as shown in figure-2(b). And this process goes on until the item is found or the list gets exhausted. Figure-2 shows that the item is found at 6th location.
Here is the sequential search algorithm.
Algorithm : SequentialSearch
SequentialSearch(a, n, item)
Here ‘a’ is an array of ‘n’ number of elements and ‘item’ represents the item to be searched in array ‘a’
- Step-1 : Initialize i = 0
- Step-2 : Repeat through step-4 while (i < n)
- Step-3 : Compare a[i] and item
- If (a[i] == item) then return (i) and go to step-6
- Step-4 : Increment the value of ‘i’ as: i = i+1
- Step-5 : If item not found then just print a message – “Item not found” and return (-1)
- Step-6 : Exit
The C implementation of this sequential search algorithm is as:
Implementation : SequentialSearch
SequentialSearch(int a[], int n, int item)
{
int i;
for (i = 0; i < n; i++)
{
if (a[i] == item)
return (i);
}
return (-1);
}
This function examines each element in turn, using the for loop; upon finding one that matches the search argument, its index is returned. Unfortunately if there is no such element, it returns –1. Thus this algorithm checks each entry in turn. When the values in the array ‘a’ are not distinct then this function will return the first index ‘i’ of the array ‘a’ which is equal to the ‘item’.
Figure 2
Sequential Search in Linked List
We can also store a table as a linked list. Let us assume that the table is organized as a linear linked list point to by an external pointer variable ‘first’ and linked by a pointer field ‘link’. For this we will assume the type declaration of the node of the linked list is as:
struct Node
{
int data;
struct Node *link;
};
typedef struct Node Node;
Node *first;
A version of sequential search for linked list is equally easy , as follows:
SequentialSearch(Node *first, int item)
{
while (first != NULL)
{
if (first->data == item)
return (1);
first = first->link;
}
return (-1);
}
As far as the comparisons of keys is concerned, both versions are same, but the running time of the two versions may be different due to their implementation. So whenever we will compare this searching technique with another method we will consider the concept of number of keys of comparisons.
Analysis of Sequential Search
Initially when we start searching we really do not know how many times the loop will be iterated because the number of comparisons required for a successful search is not fixed. It all depends upon the position of item within the array. And we have no way to know in advance whether or not the search will be successful. The target ‘item’ will be the first element in the array, the last or some where between. If it is the first element then only one comparison is sufficient and if it is the last element then ‘n’ comparison are required, where ‘n’ is the length of the list. If the search is unsuccessful then ‘n’ comparisons are required. Then we can say that the best time for successful search is ‘1’ and the worst case is ‘n’ comparison.
And in general if it is assumed that if the search is successful, the average behavior of the algorithm is calculated by adding the number of comparisons needed for all successful search and divide it by ‘n’ as follows:
(1+2+3+….+n)
n
n*(n+1)*1
2 n
(n+1)
2
Hence the average number of key comparisons done by sequential search in the successful case is (n+1)/2. However in any case the number of comparison is O(n).
- Sequential Searching on an Ordered list
The efficiency of sequential searching can be greatly improved if the list is stored in ascending or descending order. In such case the expected number of comparisons required for an unsuccessful search can be reduced. Assuming that the elements appear in the ascending order, the search can be terminated as soon as we encounter an element that is greater than or equal to ‘item’ is found. When the element is equal to ‘item’ then the search is successful. When the element is greater than ‘item’ it can be concluded that ‘item’ is not present in the array and the search ends unsuccessfully.
Here is the algorithm of sequential search on an ordered list.
Algorithm : SequentialSearchOrder
SequentialSearchOrder(a, n, item)
Here ‘a’ is an array of ‘n’ number of elements and ‘item’ represents the item to be searched in array ‘a’
- Step-1 : Initialize i = 0
- Step-2 : Repeat through step-4 while (i < n)
- Step-3 : Compare a[i] and item
- If (a[i] == item) then return (i) and go to step-6
- Else if (item < a[i]) go to step-5
- Step-4 : Increment the value of ‘i’ as: i = i+1
- Step-5 : return (-1) as item was not found in the list
- Step-6 : Exit
The C implementation of this sequential search algorithm is as:
Implementation : SequentialSearchOrder
SequentialSearchOrder(int a[], int n, int item)
{
int i;
for (i = 0; i < n; i++)
{
if (a[i] == item)
return (i);
else if (a[i] > item)
break;
}
return (-1);
}
In this case there is no change in the expected number of comparisons for a successful search. The average number of comparisons for a successful search is still O(n). But there is definitely a reduction of the expected number of comparison required for an unsuccessful search. In fact this value may be ‘1’ in the best case and ‘n+1’ in the worst case.
- Indexed Sequential Searching
Indexed sequential searching is another improved searching technique for a sorted list. In this we maintained an auxiliary array ‘aux’ in addition to the sorted list itself. Each element in the array aux consists of a value (key) and a link to the record in the list that corresponds to the key. The elements in the array ‘aux’ as well as the elements in the original list must be sorted on the key. If the index is one fifth the size of the original list, every tenth record of the list is represented in the array ‘aux’. Figure-3 illustrates this.
Figure 3
Here is the indexed sequential search algorithm.
Algorithm : IndexSequentialSearch
IndexSequentialSearch(a, n, item)
Here ‘a’ is an array of ‘n’ number of elements and ‘item’ represents the item to be searched in array ‘a’
- Step-1 : Initialize an auxiliary array ‘aux’ of size n/5 such that every fifth element of the array ‘a’ is represented in ‘aux’.
- Step-2 : Find the appropriate index using an auxiliary array ‘aux’. Let ‘start’ and ‘end’ be the starting and ending index of the reduced list.
- Step-3 : Initialize i = start
- Step-4 : Repeat through step-6 while (i <= end)
- Step-5 : Compare a[i] and item
- If (a[i] == item) then return (i) and go to step-8
- Step-6 : Increment the value of ‘i’ as: i = i+1
- Step-7 : If item not found then return (-1)
- Step-8 : Exit
The C implementation of this indexed sequential search algorithm is as:
Implementation : IndexedSequentialSearch
IndexSequentialSearch(int a[], int n, int item)
{
int aux[10];
int i, value, start, end;
for(i=0; i
aux[i] = i*5;
for(i=0; i
{
value = aux[i];
if (a[value] > item)
break;
}
start = aux[i-1];
if (i==n/5)
end = n-1;
else
end = aux[i]-1;
for(i=start; i<=end; i++)
{
if (a[i] == item)
return (i);
}
return(-1);
}
The major advantage of this indexed sequential searching technique is that the elements in the list are examined sequentially even if all the elements in the list are accessed, yet the search time for a particular item is sharply reduced. In this case the sequential search is performed on the smaller index rather than on the lager one. Once the correct index position has been found in the original list, a second sequential search is performed on a smaller portion of the original list itself.
One main point to note here is that if the original list is so large that even the use of an index does not achieve the sufficient efficiency we will use another secondary index which will act as an index to the primary index, which points to the entries in the sequential table.
Binary Searching
No doubt the sequential search is easy and simple to write. It is efficient for the short lists but highly inefficient for long lists. Imaging try to find the name “Suman” in a large telephone directory by reading one name at a time starting at the front of the directory. We can improve this search time if the elements in the list are sorted in some order. One of the better known techniques for searching an ordered list is called binary searching.
In binary searching we compare the item with the element in the center of the list and then restrict our attention to only the first or second half of the list, depending on whether the ‘item’ comes before or after the central one. If the ‘item’ is in the first half of the list, the procedure is repeated on the first half of the list. In this way at each step we reduce the length of the list to be searched by half. And this process continues until either the required ‘item’ is found or the search intervals become empty.
Binary search is not good for linked list since it requires jumping back and forth from one end of the list to middle. That’s why we will discuss only contiguous lists.
Let us assume that the array being searched is sorted. Initially we will make a comparison between the ‘item’ and the middle element of the array, then based on the result of the comparison with the middle element one can draw one of the following conclusion:
- if (item == a[mid] then the middle element is the one being searched for
- if (item < a[mid] then the ‘item’ being searched for may be in the first half of the array
- if (item > a[mid] then the ‘item’ being searched for may be in the second half of the array
In the case when the ‘item’ is not found the process is repeated on the half in which the desired ‘item’ is likely to be present. This process goes on until either the search terminates successfully or the final division leads to a half consisting of ‘1’ number of elements. Let we have following list of numbers:
A[0] |
A[1] |
A[2] |
A[3] |
A[4] |
A[5] |
A[6] |
A[7] |
A[8] |
A[9] |
28 |
33 |
36 |
49 |
64 |
75 |
80 |
91 |
95 |
98 |
And let we want to find the item ‘49’. Initially we will start searching with the middle element of the list, that is A[5]. If item ‘49’ is not found at A[5] then we will search the first half of the list as shown in figure-4(a) and compare it with the item ‘49’. If item ‘49’ is not found at A[5] then we will take the middle element of the first half of the list as shown in figure-4(b). And this process goes on. Figure-4 shows that the item is found at 4th location.
Figure 4
Here is the binary search algorithm.
Algorithm : BinarySearch
BinarySearch(a, n, item)
Here ‘a’ is an array of ‘n’ number of elements and ‘item’ represents the item to be searched in array ‘a’
- Step-1 : Set lower = 0
- Step-2 : Set upper = n-1
- Step-3 : Repeat through step-5 while (lower <= upper)
- Step-4 : mid = (lower+upper)/2
- Step-5 : Compare a[mid] and item
- If (a[mid] == item) then return (mid) and go to step-7
- Else if (a[mid] > item ) then upper = mid-1
- Else lower = mid+1
- Step-6 : If item not found then return (-1)
- Step-7 : Exit
The C implementation of this Binary search algorithm is as:
Implementation : BinarySearch
BinarySearch(int a[], int n, int item)
{
int mid, lower, upper;
lower = 0;
upper = n-1;
while (lower <= upper)
{
mid = (lower+upper)/2;
if (a[mid] == item)
return (mid);
else if (a[mid] > item)
upper = mid-1;
else
lower = mid+1;
}
return (-1);
}
Analysis of Binary Search
Each comparison in the binary searching reduces the number of possible candidates by a factor of 2. In other words we can say that after each comparison either the search terminates successfully or the size of the array remaining to be searched is about one half of the original size.
And in worst case, the expected number of comparison require O(log2n) to search the target ‘item’, even if the search is unsuccessful.
Comparison Trees : How fast is it possible to Search
Till now we have studied sequential searching and binary searching. However the process of searching for ‘item’ in an array ‘a’ by means of comparing ‘item’ with the elements of ‘a’ can be conveniently expressed in terms of a binary tree. We trace the algorithm by representing each comparison of keys by a node (vertex) of the tree. Inside the circle we put the index of the key against which we are comparing the ‘item’. Two edges emanating out of a node represents the result of the comparison and are labeled accordingly. When the search is successful, we put the location where the ‘item’ is found at the end of the appropriate branch which we call a leaf and draw as a square as shown in figure-5.
Note that the search starts by the comparison at the root node of the tree. When we search for an ‘item’ into a tree then there are two possibilities – either the item is found or not found. When the search algorithm terminates we put either ‘F’ (for failure) or the location where the ‘item’ is found at the end of the appropriate branch. A successful search terminates in some internal nodes while an unsuccessful search terminate in an external node. We often call such trees as comparison trees.
The comparison tree for a sequential search is especially simple as shown in figure-5.
Figure 5
Figure-6 shows the sequence of comparisons required when binary search is applied to an array of 10 elements.
The number of comparison done in the comparison tree is the number of internal nodes traversed in going from the top of the tree (which is called its root) down the appropriate path to a leaf. The root itself has level 0, the nodes immediately below it has level 1, and so on. The largest level that occurs in a tree is called the height of the tree. The minimum height of a tree having ‘n’ nodes is log2n. So if we search for an ‘item’ in an array of ‘n’ elements then there are at least ‘n’ internal nodes.
Now we can say that the minimum worst case complexity of search algorithm must be equal to minimum height of a comparison tree having ‘n’ internal nodes, which is of course log2n. Thus there can not be comparison based searching technique where worst case complexity is less than O(log2n). On the same track we find that the average number of comparison based searching technique can not be less than O(log2n).
Figure 6 Comparison Of Binary Search For n=10