Cryptographic Hash Functions

The cryptographic hash functions are well-known cryptographic primitives. They are widely used, either as stand-alone or as a building block in other, more complex constructions. Their applicability covers a wide area, some popular examples being password storage, electronic signatures, and blockchain. Please note that cryptographic hash functions must satisfy some (of the) security notions I will discuss later on, while non-cryptographic hash functions do not necessarily have to. Therefore, not all hash functions are cryptographically strong. Non-cryptographic hash functions are also valuable, e.g., in data structures. However, for simplicity, I will sometimes refer to cryptographic hash functions as simply hash functions. As a note, there are also hash functions with a secret key, known as keyed hash functions. I will not refer to these keyed hash functions here; I will only explain the keyless (or unkeyed) hash functions.

Cryptographic hash functions accept binary inputs of arbitrary length (i.e., of any size, to the extreme infinite). We denote this by {0,1}*. Regardless of the length of the input, the length of the output is of fixed size, (usually) dependent on the security parameter n. We denote this by {0,1}^l(n), where l(n) is a function of the security parameter n.

Because the output space (finite, of fixed length) is smaller than the input space (in theory, infinite), we say that hash functions compress. In practice, we are also normally interested in hash functions that compress. Nevertheless, note that, in theory, the input value x can be of any length; hence, there are input values x that are shorter or equal in size to the output (however, we normally represent all inputs on the same number of bits/same data type). The graphical representation I use here is adapted from the literature and suggests a longer input x that compresses to a shorter (fixed-length) output H(x). The hash function H itself is illustrated as an isosceles trapezoid with the long base towards the input and the short base towards the output, to highlight its compressing property.

Hash functions are deterministic, meaning that for an input x a hash function H always returns the same output, denoted by H(x). So, regardless of how many times one wants to compute H(x), he/she will always get the same value.

Finally, a cryptographic hash function is "easy" (or "efficient") to compute but "difficult" (or "inefficient") to reverse. This holds when the function definition (in our case H) is public (i.e., we are in the settings of Kerckhoffs's principle, which I have explained in a previous post - see Principles) and the function does not rely on any other secret information. Such a function is called a one-way function. One-way functions are heavily used in cryptography. I will come back to this notion later when I will explain the cryptographic properties of the hash functions.

The first pre-image resistance asks for the function H to be one-way. Formally, the property means that for any PPT adversary

, the probability of

winning the first pre-image game is negligible (see the previous post Unconditional vs. Conditional Security for the definitions of PPT and negligible). We denote the first pre-image game for the adversary

, the hash function H, and the security parameter n by

. We say that the adversary wins the game (the output is 1) if given X = H(x') for a random x', the adversary outputs a bit-string x s.t. H(x)=X. Otherwise, the adversary fails (the output is 0).

H is second pre-image resistant if, for any PPT adversary

, the probability of

winning the second pre-image game is negligible. We denote the second pre-image game for the adversary

, the hash function H, and the security parameter n by

. We say that the adversary wins the game (the output is 1) if given a random x, the adversary outputs a bit-string y ≠ x s.t. H(x)=H(y). Otherwise, the adversary fails (the output is 0).

In their very famous paper entitled New Directions in Cryptography [1], Diffie and Hellman ask for the hash function to be one-way in the sense of the first, but also second, pre-image. They are thus the first to introduce the concept of one-way hash functions and the ones that show the potential of these functions in the world of asymmetric cryptography [2]. However, a formal definition has only been given later by Merkle and Rabin. You can read about this and the history of formal definitions of different security notions related to the hash functions in [2].

H is a collision-resistant hash function if, for any PPT adversary

, the probability of

winning the collision game is negligible. We denote the collision game for the adversary

, the hash function H, and the security parameter n by

. We say that the adversary wins the game (the output is 1) if he/she outputs two distinct values x ≠ y such that their hash value is equal, i.e., H(x) = H(y). Otherwise, the adversary fails (the output is 0). Note that there is no hash function with 0 collision resistance, i.e., without collisions. This is because the hash functions compress (i.e., the input is larger than the output), so there exist two different inputs leading to the same output.

Ivan Damgård, a remarkable professor at Aarhus University, Denmark, was the first to formally introduce the collision resistant hash functions [2,3]. As a note, he is well known for many significant results in the field, including the Merkle-Damgård construction.

Informally, the three security definitions can be thought of as follows:

first pre-image resistance: given a hash value X =H(x') for a random x', a bounded adversary ( is PPT) should fail in finding a valid input x s.t. H(x)=X. In simple words, the function is easy to compute but difficult to invert.

Note that the adversary can output either the initial value x = x' or a distinct value x ≠ x'.

second pre-image resistance: given a random x, a bounded adversary ( is PPT) should fail in finding a different input y ≠ x with the same hash H(x)=H(y).

Note that, as a difference from the first pre-image property, the adversary cannot output the initial value y = x (formally, x is now even as input to the game) but a distinct value y ≠ x.

collision-resistance: a bounded adversary ( is PPT) should fail in finding two different inputs x ≠ y with the same hash H(x)=H(y).

Traditionally we see the collision resistance property as being the strongest. This is because (under some constraints such as the input of the hash function large enough compared to its output), the collision resistance implies the second pre-image, which in turn implies the first pre-image. This means that (1) if H is collision-resistant, then H is second pre-image resistant, and (2) if H is second pre-image resistant, then H is first pre-image resistant or one-way. Informally, (1) holds because if for a given x the adversary can find a second pre-image y ≠ x s.t. H(y)=H(x), the adversary breaks the collision game by outputting x, y. Informally, (2) holds because if the adversary can break the first pre-image then it will compute y so that H(y) = X and y ≠ x with high probability. Both claims are proved by negation: (1) if H is not second pre-image resistant, then H is not collision resistant and (2) if H is not first-image resistant, then H is not second pre-image resistant.

The current literature on hash functions is vast. The interested reader will find good presentations of hash functions in [4, 5].

The easy way an adversary might try to break a hash function is by brute force, e.g., trying different inputs until he/she succeeds in his/her goal. Assume the adversary wants to break the strongest property and find a collision. If the adversary tries 2^l(n)+1 distinct inputs (i.e., the number of different output values + 1), he/she will find a collision for sure. This is because the number of input values exceeds the possible number of outputs, so at least two inputs will lead to the same output.

This simple brute force attack can be probabilistically improved by considering the birthday paradox. The birthday paradox says that in a group of 23 people, the probability that two are born on the same day is higher than 50%. Generalizing the birthday paradox for a set with N elements, the adversary has a 50% probability of choosing twice the same element twice if he/she randomly (and independently) chooses at least q > 1.2*sqrt(N) elements from the set. In the case of the traditional birthday paradox, N = 365, so q > 1.2*sqrt(365) = 22.926 ... This leads to q =23.

In consequence, the birthday paradox allows an adversary to find a collision after ca. O(2^{l(n)/2}) evaluations of the hash function, where l(n) is the dimension of the output in bits. The interested reader can find more information about the birthday paradox in e.g., [4,5].

Breaking the security definitions of hash functions can have severe consequences. For example, an adversary breaking the second pre-image resistance could replace a signed document D with a different document D'. This works because the digital signature is normally computed on the hash H(D) (for several reasons, including efficiency) and not the document D itself. Hence, if H(D') = H(D), the adversary could simply replace D with D', thus forging the digital signature for D'. Similarly, if the adversary finds a collision D ≠ D' such that H(D) = H(D'), he/she might convince someone to sign D (that looks acceptable) and thus obtain a valid signature for D' too. Think about D' as being a document that brings some prejudice to the signing party, so that the signing party would never sign freely. Back in 2017, researchers and practitioners in CWI and Google, found the first collision in SHA-1 [6]. Their findings were popular, especially because they show how to output two different .pdf files with valid (and, in fact, not much too different) content with the same hash value. Their attack required significant computation on high-tech infrastructure, but it was 100 000 times faster than a brute force attack based on the birthday paradox, which made it feasible. Read more about the attack on the dedicated webpage shattered.io.

Another clear exemplification of the severe consequences of breaking the one-way property is the derivation of session keys (in key hierarchies). Suppose the simplified scenario in which a session key k_i (corresponding to the session i) is simply derived from a fixed-length master key msk using a one-way function H: k_i = H(msk,i). Then, leaking the session key k_i and inverting H results in exposing the master key msk.

Nevertheless, there are many useful implementations of hash functions and in fact, hash functions are nowadays one of the pillars of secure digitalization, being continuously used as a primitive in more complex constructions. From probably the most well-known family of hash functions - Secure Hash Algorithm (SHA), SHA-256 and SHA-3 are still secure today. In fact, SHA-3 is the first standardized hash algorithm chosen after a public competition [7, 8] and based on the winning Keccak [9]. Finally, there are hash functions that are highly used in specific applications due to their particular properties. Just to mention a few: bcrypt [10] is widely used for password storage due to its ability to be (configurable) slow and thus prevent brute-force attacks, and poseidon [11] is widely used in zero-knowledge proofs for blockchain.

[1] Diffie W., Hellman, M.E. (1976) New directions in cryptography. IEEE Transactions on Information Theory, 22(6), 644–654.

[2] Van Tilborg, H. C., & Jajodia, S. (Eds.). (2014). Encyclopedia of cryptography and security. Springer Science & Business Media. Available at: https://www.researchgate.net/profile/Krzysztof-Kryszczuk/publication/230674947_Springer_Encyclopedia_of_Cryptography_and_Security

[3] Damgård, I.B. (1988). “Collision free hash functions and public key signature schemes.” Advances in Cryptology—EUROCRYPT’87, Lecture Notes in Computer Science, vol. 304, eds. D. Chaum and W.L. Price. Springer-Verlag, Berlin, 203–216.

[4] Kats, J., Lindell, Y. Introduction to Modern Cryptography, CRC Press. More info: http://www.cs.umd.edu/~jkatz/imc.html

[5] Boneh, D., Shoup, V. A Graduate Course in Applied Cryptography. Available at: https://toc.cryptobook.us/

[6] Stevens, M., Bursztein, E., Karpman, P., Albertini, A., & Markov, Y. (2017). The first collision for full SHA-1. In Advances in Cryptology–CRYPTO 2017: 37th Annual International Cryptology Conference, Santa Barbara, CA, USA, August 20–24, 2017, Proceedings, Part I 37 (pp. 570-596). Springer International Publishing.

[7] NIST Releases SHA-3 Cryptographic Hash Standard. Available at: https://www.nist.gov/news-events/news/2015/08/nist-releases-sha-3-cryptographic-hash-standard

[8] Hash functions - SHA-3 Project. Available at: https://csrc.nist.gov/projects/hash-functions/sha-3-project

[9] Team Keccak. Available at: https://keccak.team/index.html

[10] Provos, N., & Mazieres, D. (1999). Bcrypt algorithm. In USENIX.

[11] Grassi, L., Khovratovich, D., Rechberger, C., Roy, A., & Schofnegger, M. (2021). Poseidon: A new hash function for {Zero-Knowledge} proof systems. In 30th USENIX Security Symposium (USENIX Security 21) (pp. 519-535).

Search This Blog

Pages on Security

Cryptographic Hash Functions

Comments

Post a Comment

Popular posts from this blog

Stream Ciphers

Symmetric vs. Asymmetric Encryption

Unconditional vs. Conditional Security