Sometimes people use the terms encryption and encoding interchangeably. Hashing, too, is often mistaken for an encryption mechanism. Confusing these concepts may lead to mistakes in the way security is implemented.
Let's take a high-level overview of these concepts and clarify the differences.
What Is Encoding?
Let's start with encoding. You can define it as a technique to transform data from one format to another so that it can be understood and consumed by different systems.
Basically, encoding has to do with information representation. When you have some information, say the name of the mineral that weakens Superman, you can represent it through letters, as in kryptonite. This is a handy representation for humans, but not one that computers can manipulate easily. What usually happens in this case is that the sequence of characters is transformed into a sequence of bits like this:
01101011 01110010 01111001 01110000 01110100 01101111 01101110 01101001 01110100 01100101
You have two representations for the same information. The letter-based representation is usually understood by human systems; the bit-based representation is more suitable for computer systems. Commonly, you say that the sequence of letters has been encoded into a sequence of bits.
So, encoding is just a transformation from one data representation to another, keeping the same information. Usually, it involves a conversion table, such as an ASCII table in our example, that maps a representation item in one system to the corresponding representation item in the other system.
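As a minimal sketch of the ASCII conversion described above, the following Python snippet turns the word into the same sequence of bits and back again. The function names `encode_to_bits` and `decode_from_bits` are made up for this illustration.

```python
def encode_to_bits(text: str) -> str:
    """Encode ASCII text as space-separated groups of 8 bits."""
    return " ".join(f"{byte:08b}" for byte in text.encode("ascii"))


def decode_from_bits(bits: str) -> str:
    """Reverse the encoding: turn groups of 8 bits back into ASCII text."""
    return bytes(int(group, 2) for group in bits.split()).decode("ascii")


bits = encode_to_bits("kryptonite")
print(bits)                    # the bit sequence shown above
print(decode_from_bits(bits))  # kryptonite
```

No key or secret is involved: anyone who knows the ASCII table can go in either direction.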
You can find several encoding mechanisms out there. To mention just a few in the character encoding space, apart from the dear old ASCII, you have:
- Unicode, which allows you to represent more complex items than letters, such as emoji and other symbols.
- Base64, which lets you represent binary data, such as an image, through text.
- URL encoding, useful to represent arbitrary data in a URL, where some characters are reserved or cannot be used (think of spaces or colons, for example).
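To make the three mechanisms above a bit more concrete, here is a small Python sketch that uses only the standard library. The sample strings and bytes are invented for the example.

```python
import base64
from urllib.parse import quote, unquote

# Unicode: text with an emoji, stored as bytes using the UTF-8 encoding.
text = "kryptonite 💎"
utf8_bytes = text.encode("utf-8")
print(len(text), len(utf8_bytes))  # the emoji needs more than one byte

# Base64: arbitrary binary data represented with plain text characters.
raw = bytes([0, 255, 16, 128])  # made-up binary payload
b64_text = base64.b64encode(raw).decode("ascii")
print(b64_text)

# URL encoding: reserved characters replaced with percent escapes.
url_value = quote("green rock: kryptonite?", safe="")
print(url_value)

# Every transformation can be reversed without any key.
print(utf8_bytes.decode("utf-8") == text)
print(base64.b64decode(b64_text) == raw)
print(unquote(url_value))
```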
Consider JSON Web Tokens (JWT), for example. The three parts that compose a token are encoded using Base64-URL, a variant of Base64 encoding combined with URL encoding. The following is an example of an encoded JWT:
You can see the decoded version of this JWT using the jwt.io debugger.
This encoding mechanism allows the token to be easily passed in HTML and HTTP environments without fear of clashes with reserved or unrepresentable characters. If you want to learn more about JWTs, you can download the JWT Handbook.
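The sketch below shows how Base64-URL encoding might be applied to the parts of a token and then reversed. It is only an illustration: the header and payload values are hypothetical, the third segment is a placeholder rather than a real signature, and the helpers `b64url_encode` and `b64url_decode` are names chosen for this example.

```python
import base64
import json


def b64url_encode(data: bytes) -> str:
    """Base64-URL encode and drop the trailing '=' padding, as JWTs do."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def b64url_decode(segment: str) -> bytes:
    """Restore the padding that was dropped, then decode."""
    padding = "=" * (-len(segment) % 4)
    return base64.urlsafe_b64decode(segment + padding)


# Hypothetical token content, invented for this example.
header = {"alg": "HS256", "typ": "JWT"}
payload = {"sub": "clark.kent", "weakness": "kryptonite"}

segments = [
    b64url_encode(json.dumps(header, separators=(",", ":")).encode("utf-8")),
    b64url_encode(json.dumps(payload, separators=(",", ":")).encode("utf-8")),
    b64url_encode(b"placeholder-signature"),  # NOT a real signature
]
token = ".".join(segments)
print(token)

# Anyone can decode the first two parts: this is encoding, not encryption.
decoded_header = json.loads(b64url_decode(token.split(".")[0]))
decoded_payload = json.loads(b64url_decode(token.split(".")[1]))
print(decoded_header, decoded_payload)
```

Note that the payload is readable by anyone who holds the token, which is why sensitive data should not be placed in it unprotected.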
Encoding ensures interoperability between systems. It allows systems that use different data representations to share information.
Encoding has no security purpose. Anyone who knows the conversion algorithm can encode and decode data. The conversion algorithm is not kept secret. On the contrary, it is public in order to facilitate interoperability between systems.
Finally, encoding is a reversible process. You can transform a piece of data from one representation to another and then go back to the original representation without information loss.
"Every decoding is another encoding."
— David Lodge
What Is Encryption?
Encryption is a technique that makes your data unreadable and hard to decode for an unauthorized user.
So, basically, encryption is a mechanism that transforms data into a different representation so that prying eyes cannot understand it. Wait! Isn't this transformation the same as encoding after all? 🤔
How can a human being understand that the following sequence of bits represents the word kryptonite?
01101011 01110010 01111001 01110000 01110100 01101111 01101110 01101001 01110100 01100101
In fact, the question is not far-fetched. In a way, encryption is a form of encoding. It transforms data from one representation to another. For this reason, people sometimes use the terms encryption and encoding interchangeably. However, the purpose of encryption is different from that of encoding.
Look at the definition above. The encryption technique aims at making data unreadable and hard to decode. If you think about it for a moment, this is the opposite of the goal of pure encoding: encoding aims at making data as understandable as possible across systems, while encryption tries to make it undecipherable unless you are authorized.
The main goal of encryption is to ensure data confidentiality, i.e., protecting data from being accessed by unauthorized parties.
So, while encoding makes its conversion algorithms as public as possible, encryption should keep such algorithms private. Actually, it's not really like that. Relying on secret algorithms is not the best choice to protect data in the long run. Better solutions rely on well-known algorithms whose data transformation is based on sequences of numbers or letters called keys.
Do not create your own encryption algorithm unless you are a math expert with long experience in the cryptography field.
The best mechanisms to encrypt data are based on mathematical algorithms whose output can be reversed only by someone who holds the key or who has an impractical amount of computational power. Two families of key-based encryption algorithms exist:
- Symmetric-key algorithms: these algorithms use the same key to encrypt and decrypt data. The Advanced Encryption Standard (AES) algorithm is an example from this family.
- Asymmetric-key algorithms: these algorithms use different keys to encrypt and decrypt data. The two keys are bound by a complex mathematical relationship. RSA is an example of an algorithm in this family. Check out this blog post for a gentle introduction to asymmetric-key algorithms.
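The difference between the two families can be sketched with toy code. Neither part below is AES or production-grade RSA: Python's standard library ships no AES implementation, so the symmetric part uses a one-time-pad style XOR as a stand-in, and the asymmetric part uses textbook RSA with tiny primes and no padding. All names and values are illustrative only; real systems should rely on a vetted cryptography library.

```python
import secrets


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR every data byte with the matching key byte (toy symmetric cipher)."""
    if len(key) != len(data):
        raise ValueError("key must be as long as the data")
    return bytes(d ^ k for d, k in zip(data, key))


# Symmetric: the SAME key encrypts and decrypts.
message = b"kryptonite"
shared_key = secrets.token_bytes(len(message))
ciphertext = xor_bytes(message, shared_key)
recovered = xor_bytes(ciphertext, shared_key)
print(recovered)

# Asymmetric: DIFFERENT keys, bound by a mathematical relationship.
# Textbook RSA with tiny primes, far too small to be secure.
p, q = 61, 53
n = p * q                    # part of both keys
phi = (p - 1) * (q - 1)
e = 17                       # public exponent: (e, n) is the public key
d = pow(e, -1, phi)          # private exponent (requires Python 3.8+)

m = 65                       # a message represented as a number smaller than n
c = pow(m, e, n)             # encrypt with the public key
m_back = pow(c, d, n)        # decrypt with the private key
print(m_back == m)
```

In both cases the transformation is reversible, but only for whoever holds the right key.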
Like pure encoding, encryption is a reversible process as well, although just for authorized people. Authorized people are the ones in possession of a decryption key.
The challenge is to make decrypting data without the key as hard as possible for unauthorized people. This calls for a mix of precautions, such as using complex mathematical relationships between the keys, keeping them secret, changing them frequently, and so on.
"Anyone, from the most clueless amateur to the best cryptographer, can create an algorithm that he himself can't break."
— Bruce Schneier
What Is Hashing?
Let's take a look at hashing now. Basically, it's a technique to generate a fixed-length string (the hash) that depends strictly on the specific input data.
Since the generated hash depends on the specific input data, any small change to the input data generates a different hash. So, having the hash of a given piece of data, you can verify if that data has been altered by calculating its hash and comparing it with the one you already have. In other words, hashing ensures data integrity.
Suppose you like Lex Luthor's sarcastic line, "That's Kryptonite, Superman. Little souvenir from the old home town." You like it so much that you want it to always be quoted in exactly this way. No character changed.
You calculate its hash string as follows:
By the way, I used SHA-256 as the algorithm to calculate that hash string. But don't worry about this for now. I'll spend a few words on algorithms in a moment.
Assume someone writes that quote slightly differently, such as, "That's kryptonite, Superman. Little souvenir from the old home town." The following is the hash of this second version:
You can see that the two hashes are different just by looking at the first characters. This tells you that the two sentences are not written the same way. Did you notice the difference between the two sentences? The second version has a lower-case k instead of the capital K in kryptonite.
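If you want to try this comparison yourself, the following sketch computes SHA-256 for both versions of the quote with Python's `hashlib`. The digests depend on the exact bytes of the input (apostrophe style, spacing, character encoding), so the values you get may not match the ones computed for this article; UTF-8 is an assumption made here.

```python
import hashlib

original = "That's Kryptonite, Superman. Little souvenir from the old home town."
altered = "That's kryptonite, Superman. Little souvenir from the old home town."

# Hash the UTF-8 bytes of each sentence and render the result as hex.
original_hash = hashlib.sha256(original.encode("utf-8")).hexdigest()
altered_hash = hashlib.sha256(altered.encode("utf-8")).hexdigest()

print(original_hash)
print(altered_hash)
print("identical" if original_hash == altered_hash else "different")
```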
You may say: why do you compare the two hash strings when you could compare the two sentences? Good point! Imagine that you have a document of thousands of words or a high-quality picture instead of a sentence. In that case, comparing two short hash strings is more efficient than comparing the entire content.
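For large content, the hash is usually computed by reading the data in chunks. The following is one possible sketch; the function names `sha256_of_file` and `same_content` and the chunk size are choices made for this example.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 65536) -> str:
    """Hash a file piece by piece so it never has to fit in memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(first: Path, second: Path) -> bool:
    """Compare two files through their hashes instead of byte by byte."""
    return sha256_of_file(first) == sha256_of_file(second)
```

The benefit is greatest when one of the hashes is already known, for example a checksum published next to a download: you hash your copy once and compare two short strings.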
So, how can you get a hash? A hashing algorithm must have the following features:
- The resulting hash has a fixed length.
- The same input always produces the same output.
- Multiple different inputs should not produce the same output.
- It must not be possible to obtain the input from the output data.
- Any change to the input data implies a different resulting hash.
As you can see, the fourth point implies that hashing is not a reversible process, unlike encoding and encryption. Also, the third point seems to say that, while you should get different hashes for distinct input data, this can't be guaranteed. Indeed, since inputs of any size are mapped to a fixed-length output, collisions must exist; what distinguishes hashing algorithms is how hard it is to find them. For example, MD5 was a very common hashing algorithm in the past, but by 2008 it was considered broken because practical collision attacks had been demonstrated. The same happened to some early algorithms of the Secure Hash Algorithm (SHA) family.
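The following sketch illustrates some of these features with SHA-256, and shows why collisions become a practical problem when the output is too short. The 16-bit `tiny_hash` is a deliberately weak function invented for this demonstration (it just truncates SHA-256); it is not a real algorithm and does not reproduce the attacks on MD5.

```python
import hashlib
from typing import Tuple

# Fixed length and determinism with a real algorithm.
short_input = hashlib.sha256(b"k").hexdigest()
long_input = hashlib.sha256(b"kryptonite" * 10_000).hexdigest()
print(len(short_input), len(long_input))                  # same length
print(short_input == hashlib.sha256(b"k").hexdigest())    # same input, same hash


def tiny_hash(data: bytes) -> bytes:
    """Deliberately weak 16-bit hash: the first 2 bytes of SHA-256."""
    return hashlib.sha256(data).digest()[:2]


def find_collision() -> Tuple[bytes, bytes]:
    """Try inputs '0', '1', '2', ... until two share the same tiny hash."""
    seen = {}
    counter = 0
    while True:
        candidate = str(counter).encode("ascii")
        digest = tiny_hash(candidate)
        if digest in seen:
            return seen[digest], candidate
        seen[digest] = candidate
        counter += 1


first, second = find_collision()
print(first, second, tiny_hash(first).hex())
```

With only 65,536 possible outputs, a collision is guaranteed after at most 65,537 inputs and typically shows up after a few hundred. Real algorithms use much longer outputs so that this kind of search is not feasible.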
The Right Tool for the Right Goal
As you've seen, encoding, hashing, and encryption have their specific purposes and features. Confusing their capabilities and roles in your system may lead to disastrous consequences.
For example, you may think that encrypting passwords is the best security option. Actually, it's a very bad idea. That's what the Adobe engineers learned in a data breach in 2013. The attackers who got access to their user database could break the encryption. Remember that encryption is a reversible process. Even if attackers don't have the decryption key, they may have enough time to guess it. Adobe reset users' passwords as a countermeasure, but users often reuse the same password for multiple services. So, even if access to Adobe services was safe again, access to other websites was potentially compromised.
They should have used hashing instead of encryption to store users' passwords securely. You know that hashing is not a reversible process, so attackers can't directly compute the password from which a hash was generated. However, relying on plain hashing alone is not the best option either, as the LinkedIn breach teaches us.
I know, it's a hard world! 🙄
If you want, you can learn more about storing passwords using hashing and how to use salt to store them properly. Also, for a more technical comparison of encoding, encryption, and hashing, read this article.
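As a rough idea of what salted password hashing can look like, here is a sketch based on PBKDF2 from Python's standard library. The choice of PBKDF2, the iteration count, the salt size, and the function names are assumptions made for this example; the resources mentioned above may recommend other algorithms, and real applications should follow current guidance or use a dedicated library.

```python
import hashlib
import hmac
import secrets
from typing import Tuple

ITERATIONS = 200_000  # assumed work factor, chosen for illustration only


def hash_password(password: str) -> Tuple[bytes, bytes]:
    """Return (salt, derived_hash); a fresh random salt is used every time."""
    salt = secrets.token_bytes(16)
    derived = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, derived


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Recompute the hash with the stored salt and compare in constant time."""
    derived = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return hmac.compare_digest(derived, expected)


salt, stored = hash_password("kryptonite")
print(verify_password("kryptonite", salt, stored))   # True
print(verify_password("Kryptonite", salt, stored))   # False
```

Because each user gets a different salt, two users with the same password end up with different stored hashes, which makes precomputed lookup tables far less useful to an attacker.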
Now you know the difference between encoding, encryption, and hashing. To briefly recap what you learned throughout this article, take a look at the following cheat sheet: