Ensuring data integrity is a very important part of computer science, from the Web page to software that is being ordered off the Web. If files are not periodically checked, then viruses and Trojans can be inserted into the data without the organization's being aware of it. This can be very serious because if a customer buys the virused software from the organization, he or she may not be a repeat customer; also, the media reports stating that the software is virused could shut down an organization.
Another scenario where corrupt data could affect a company is in its communication with third-party vendors. For example, if an online mortgage company that looks for the best rates from banks receives a rate from one bank that is more than 200% higher than the others, a hacker might have altered the message. It is obvious that that bank will not be getting the business.
Protocols and software can be used to prevent thesesituations. The basic strategy to assist an organization is to establish security requirements, a policy for software, and steps to accommodate the plan. An organization should dedicate resources specifically for handling security issues and employ software architects and developers to specifically ensure the security of the organization. Many organizations are focused on getting the product finished, but if the product is deployed on an insecure framework, the entire product is compromised. Hackers spend their time knowing the market, and there have been many cases where a product was damaged before it reached the market.
Understanding the Hash Function
The secure hash is an algorithm that takes a stream of data and creates a fixed-length digest from it. The digest is a fingerprint of the data. No message digest is perfect, but theoretically it should have a low collision rate, if any, and be a quick, secure algorithm that provides a unique fingerprint for every message. If even one single bit of data is changed in the message, the digest should change as well.
Notice, however, that there is a very remote probability that two different arbitrary messages can have the same fingerprint. When two or more messages can have the same fingerprint, it is known as a collision. When the same exact message is hashed twice, it should generate the same digest. These are just some of the requirements that the hash function is based on and they should be the criteria for which hash algorithm to choose.
The hash functions will generally fall into three types of algorithms based on their uses. There are hashes that don't require a key, those that require a secret key, and those that require a key pair. The algorithms that don't require a key are known as message digests. Those algorithms that require a secret key are known as message authentication codes, and those that require a key pair are known as digital signatures.
Understanding the Message Digest
A message digest (MD) is an algorithm that uses a hash function to create a digest. The digest is simply the fingerprint of the original message. The digest is used to validate that the message has not been altered. In order to check the integrity of a digest, it must be compared against the original digest, which must be trusted by the receiver as being untampered with. For instance, if the message is M, and a message digest is used (MD), a digest (D) is produced. This is illustrated in the following equation.
MD1(M)1 = D1
When the message needs to be validated again at a later time, the message is hashed to a new digest. If any data is changed in the message, even by one bit, the message digest must produce a different digest as illustrated in the next equation:
MD2(M)2 = D2
Now the two digests are compared, and if there is a difference between the digests, D2 is considered invalid or altered.
Note In the D1 and D2 comparison, D1 must be trusted by the receiver as being the original digest and so it is up to the organization to keep it safe. One suggestion is to put D1 in an LDAP server.
Encryption and digests
Another use of the digest is that it is encrypted in a message such as SSL or X.509 to be unencrypted by a public key and checked for corruption of the data. Since the private key is needed to encrypt the digest, only the owner of the private key can generate the digest. The owner of the private key is usually the initiator of the message, so this scenario works well. Any user that has a copy can decrypt the message, but cannot encrypt the message without a private key. This private-public key scenario is an example of a key pair.
If you are familiar with Serial Communications and TCP/IP, this type of message integrity check may look familiar. In TCP/IP, there is a Cyclic Redundancy Code (CRC) to ensure that the receiver received the message in its entirety. If the receiver calculates the CRC and it doesn't match the message, the TCP/IP packet is retransmitted. The CRC code uses a 12-bit, 16-bit, or 32-bit CRC size. First, the CRC uses a polynomial calculation to sum the bits in the message into the desired bit-size CRC digest. Then, the CRC is used to detect errors in a transmission. The idea of using a digest for messages has been around for quite some time in other protocols; the algorithms have evolved over time.
Many algorithms can be used for checking the message digest, such as MD2, MD4, MD5, SHA-0, SHA-1, RIPEMD-160, Tiger, and many more. When testing the message, the tester must be aware of the algorithm that is being used. If the digest was hashed using MD5 and the message to be validated was hashed using SHA-1, then the digests is different even if the messages are the same. An organization needs to establish standards for which algorithms it uses for the MD.
Differentiating MDs
Many characteristics are used to differentiate MDs. Each MD usually has an initialization registers set of four or five values that will be the first values used in the hash. The registers were originally optimized for 32-bit processing machines and are the values that will initialize the registers. The initialization values are important to ensure that the input data is not the firstof the initialization variables, so that even less can be known about the input data. When the algorithm is initialized, buffers need to be zeroed out. When the digest is returned, the algorithm needs to be initialized again to start a new digest. Many algorithms use temporary buffers and have the capability to add input data through an update method.
One of the characteristics of the message digest is referred to as a one-way hash. A one-way hash means that the input data cannot be recovered by looking at the digest or hash. After the initialization of the message, data can be inputted for the algorithm to compute. The data must not exceed the message digest's maximum size. The message digest breaks down the input data into blocks. Most algorithms use a 512-bit block size, but the block size is algorithm-specific. If the data input is smaller than the block size, the algorithm must pad the data to reach the correct block size. Lengths are added inside many of the blocks to contain the length of the original message. After the input data is entered and formatted to the correct block size, each block will go through the algorithm's computations.
Breaking down the algorithm
The algorithm is normally broken down into rounds and operations. The rounds are a set of like operations performed on the data block. For example, SHA-1 has four rounds, and each round has 20 steps. The step is the number of times that the data is transformed. A round is the number of completely different transformations on the data. After the data has been hashed upon, the result needs to be compressed into a digest. The compression will take the 512-bit block and put it into a 160-bit digest in SHA-1; other algorithms have different sizes. An example of the padding, initialization, and updates for SHA-1. Many of the message digests have different values, different operations in the computation, and several other factors; but the basic flow remains the same.
The initial variables in the five registers in SHA-1 are variables to initialize the chaining variables. The initial variables are hashed with the input message block. The result of the hash is used as initial variables in the next input message block that will be hashed. Then the result of that hash is used next as chaining variables, and this process continues until the final phase is called by the application to change the hash into the hash digest. The hash in SHA-1 has five integer registers until the final phase, and when the entering the final phase, the hash is converted to 20 bytes.
The general steps of a message digest algorithm can be described as:
Step 1: Initialization.
Step 2: Break the data input into the appropriate block size, padding if necessary.
Step 3: Append the length.
Step 4: Pass each block through the algorithm's rounds and operations.
Step 5: Compress to digest the data.
Implementing the Different Message Digest Algorithms in Java
To understand which message digests are supported in Java, you can get a listing of the properties of the service providers.
The update method of the MessageDigest class adds the input data to the algorithm. Multiple updates can be done to the message digest to be hashed. The final phase will not complete until the digest method is executed. The variable chain starts at the point of the getInstance or digest method and ends at the next digest method. What this means is that if the program does a getInstance, then does several updates, and finally a digest call, the digest is the total on all messages passed through the updates.
When the digest method is called, the variables and buffers will be reset to an initial state so that a new digest can start from that point on. Being able to provide multiple updates is one of the features that Java provides in addition to abstracting the algorithms. Another feature worth noting is the java.security.
DigestInputStream class, which associates a message digest with an input stream. When data is read into the input stream, it is sent directly to the update of the message digest. Classes such as the DigestInputStream can alleviate several method calls that would be required to read and call the updates.
This article discussed the use of the message digest for ensuring message integrity. There are many message digest algorithms that can be used and more are evolving every day. Most of the modern-day algorithms are based on Ron Rivest's MD4. Some of the algorithms such as RIPEMD-160 have become much more complex in the computations of the hash, which means that the execution time of the algorithm is higher.
The algorithm that Ron Rivest designed after MD4 was MD5, the successor of MD4 because of MD4 collisions. Collisions occur when multiple messages can generate the same digest. MD5 is much faster than RIPEMD-160 but can also generate some collisions because the algorithm is not as complex as RIPEMD-160. In the middle of MD5 and RIPEMD-160 is SHA-1, which is faster than RIPEMD-160 but slower than MD5 because of its computational power. So there are several choices for the message digest algorithm. The algorithms supported by the Sun JDK 1.4 are MD5 and SHA-1.