CPUs and GPUs have different architectures and have different use-cases.
CPUs are the “brain” behind most of our electronic devices and are good at doing lots of different things. They are very general-purpose processors, coordinating the wide range of tasks your computer will do.
GPUs, on the other hand, are much more specialized computing engines. They are designed to handle the processing of 3D images efficiently, but rendering graphics is pretty much computing loads of matrix arithmetic.
The types of computations that GPUs excel at are called parallel computations: tasks that can be broken up into smaller, independent computations and be performed simultaneously.
The concept can be visualized in this 1:30 minute video, where Mythbusters Adam Savage and Jamie Hyneman show a painter-robot they built to illustrate how parallel computing works.
Generally speaking, the more a task can be broken up into smaller independent pieces, the faster, the larger calculation will happen. And the number of pieces the original task can be split into is largely determined by the number of cores present on the GPU.
If you look at the number of GPU cores over the years, you will see that they have been increasing exponentially in the past few years. And while most CPUs top out at a few dozen cores, GPUs have thousands of cores. The latest high-end GPUs from Nvidia has more than 10,000 cores.
This unparalleled growth led developers to start exploring the capabilities of GPUs to more than graphical processing, in what is now known as “general-purpose computing on GPUs” (GPGPU).
To aid developers in making use of their GPUs full capacity and allow the CPU and GPU to process data bidirectionally, Nvidia created CUDA, a software platform that makes it easier for developers to build applications using APIs.
They are always collaborating with developers to refactor the algorithms used in their applications to make the most use of Nvidia’s hardware. The result is very good integration with widely-used applications, such as Adobe Photoshop CC, Adobe Premiere CC, and Autodesk Maya.
CUDA can be downloaded for free, but it requires an Nvidia GPU to work. There is also an open-source alternative, OpenCL, which is the primary framework for graphics processing on AMD GPUs. Nvidia supports OpenCL, but with CUDA being developed to run with Nvidia hardware in mind, it has better performance when run on Nvidia GPUs.
These platforms allow developers to make extensive use of GPUs in applications other than graphic processing that would highly benefit from parallel computing, and password cracking is one of them. Password cracking is what we call an embarrassingly parallel task. An embarrassingly parallel task is a task that can be trivially broken up into smaller pieces, where the computations on each piece are independent of the results of their "neighbors'".
Consider a scenario where you want to add two 20x20 matrices by hand. You summon up 19 more friends and tell them that each one of you will be responsible for calculating the results of a single row, and then you’ll merge the results back together into a single matrix.
Since the arithmetic of each element in a matrix can be performed independently, you can carry on your own calculations without interfering with the calculations’ of your friends; all of you can make your calculations “in parallel.”
Password cracking is an embarrassingly parallel task because it works under the same principle. I’ve written about how passwords are cracked in more detail, but here’s how it works in simple terms.
Say an adversary obtains a database with 1 million different hashed passwords that they wish to crack.
For simplicity’s sake, let’s assume:
- They know the application limits user passwords to 10 characters.
- They can only try to brute-force passwords (assume no dictionary, no masks, and no rainbow tables)
To crack all passwords, they’ll need to load a list of all hashes they’re trying to crack on memory, hash some character combination using the same hashing function that was used to hash the passwords on the list and see if it matches up with one of the hashes on memory.
They will have to repeat this for all character combinations, up to all 10 valid characters.
If they instruct the CPU to go through a list of all possible combinations, the CPU will calculate the hashes roughly linearly, – even though it's possible to make computations in parallel on a CPU, the number of cores are very limited when compared to GPUs.
However, the result of a single hash calculation doesn’t affect any other hash calculation, so we can speed up this process thousands of times by computing them in parallel. Not only that, you can increase the speed even more by utilizing a cluster with multiple GPUs.
Parallel computing has many other applications, many of which are extremely valuable to society, such as weather prediction for agriculture, public transportation, and identifying spam.
This is why Nvidia and other manufacturers have been investing heavily into this field in the past decade, and the trickle-down effect of these advancements will continue to affect password cracking.
FPGAs and ASICs
Increasing the hashing speed by a factor of 100 by using GPUs is bad news for people who are trying to store passwords securely.
But not all hashing algorithms are vulnerable to GPU-accelerated attacks.
To make hashing algorithms more resistant to such attacks, developers try to make algorithms use a lot of memory as a way to increase the attack’s computational cost.
bcrypt is one of these algorithms.
When a computer is calculating a bcrypt hash, most of the time is wasted waiting for the next instruction to come from the slower memory to the faster memory (more on this in a bit) rather than computing the hashes.
FPGAs and ASICs, are two other types of computing devices that play a crucial role in parallel computing.
FPGAs, or field-programmable gate arrays, are integrated circuits that can be reprogrammed for specific tasks after manufacturing. Its hardware can be optimized for a specific application.
This, in turn, allows for massive parallelism while using much less power, even though the clock speeds on these chips are much slower than those of modern CPUs.
ASICs, application-specific integrated circuits, on the other hand, are integrated circuits that are “fixed,” but are application-specific. The hardware is designed with one application in mind and can do nothing else, but they do what it was designed to do extremely well.
GPUs are, actually, an example of specialized ASIC chips. In the beginning, they were created to deal strictly with the kinds of computation a chip would have to deal with rendering tasks. It was only after some time, in the late 2000s, when CUDA came out, that they started to become more programmable.
The downside is that designing custom chips is expensive since you’ll have to eat the costs for research and development on top of the manufacturing costs.
To justify the investment for a manufacturer, there must be a large enough demand for chips like these. Alternatively the chips must aid in some kind of profitable activity to the buyer, so they can have a return on their money.
That’s why you’ve probably heard of Bitcoin mining ASICs. These chips can only do one thing: calculate double-hashes of SHA256, and there’s enough profit in mining Bitcoins currently, to justify its manufacturing.
Somewhere in the early 2010s, mining ASICs became the de facto standard hardware of Bitcoin mining rigs, dethroning the clusters of multiple high-end GPUs because they were much more effective in doing that one thing.
Can Bitcoin Mining ASICs Be Used to Crack Passwords?
As I said, ASICs are designed from the ground up with one task in mind, and in the case of Bitcoin mining rigs, that task is calculating double-hashes of SHA256.
Those rigs can’t do anything else but that. Unless the passwords your adversary is trying to crack have been double-hashed with SHA256, which is unlikely, those rigs are useless.
It’s an impractical (but not impossible, and may be a reality in the future as technology becomes more commoditized) attack model to have a rig like this for cracking passwords because the ASIC chip configuration might not match the parameters that the application developer used to store the passwords.
An adversary would need one ASIC rig set up for each hashing function, and for configurable functions such as bcrypt, each probable parameter combination. Not only that is hard to do from a development point of view, but the attack would also have to make sense financially.
FPGA Rigs for Cracking Bcrypt Passwords
Even though FPGAs are less efficient than ASICs for a specific use such as calculating hashes of a specific algorithm, they make up for it by being cheaper and more versatile.
Let’s see how FPGAs can go beyond the limitations of GPUs when cracking bcrypt.
For general hashing functions, such as SHA1 and MD5, a GPU will probably farewell, since these algorithms don’t force memory bottlenecks as they weren’t intended for storing passwords.
To compute a bcrypt hash, the processor will have to constantly access instructions that are stored in memory. Each core has a fast, but small amount of memory (L1 cache). If there’s insufficient space on the L1 memory to compute the hash, the data will have to be stored on the slower, main GPU RAM that’s shared amongst all cores.
There is a cost to transferring this data, and this creates a huge bottleneck in the number of hashes per second a computer can calculate. Most of the time, calculating bcrypt hashes is wasted waiting for data to come from the main memory to the faster memory so that the hash can be computed.
This is not a problem for FPGAs because they have more than enough L1 memory to deal with bcrypt hashes.
The team behind Scattered Secrets, a password breach notification service, wanted to test this, so they built a cluster of FPGAs capable of cracking bcrypt about 35 to 40 times faster than current high-end GPUs, and using only about 5% of the power.
Using an RTX 2080Ti as a benchmark, they got 54 thousand bcrypt hashes per second on their testing rig, while the rig they built with 18 quad Spartan-6 LX150 FPGA boards and cranked out 2.1 million bcrypt hashes per second.
Mind you, these are not state-of-the-art FPGA chips: they were introduced back in 2011. These boards were the tool of choice because they supported Jack The Ripper, a popular password cracking software. As password cracking applications begin to support newer chips, the speed of these attacks will only increase.
Looking into the Future
Elon Musk already said that machine learning and AI will be at the forefront of our technological development at least for the next decade, and they rely heavily on parallel computing.
There will always be financial incentives to develop new technologies to solve complex problems we face as a society. The numbers of GPU cores grew from 1 core in 1995 to 24 cores in 2006, then exploded to 128 cores in 2009 and continued to grow exponentially to over 10,000 by 2021.
As new technology arises, we have to keep an eye on how they will change the security landscape, because it’s a game of cat-and-mouse between the good and bad guys.
One of the most promising future developments in parallel computing is quantum computing.
As we approach the end of Moore’s Law (which states that the number of transistors in microchips doubles every two years), researchers are trying to find ways to continue to increase exponentially the pace at which we can make complex calculations. Quantum computers can completely revolutionize artificial intelligence, drug research, and, of course, the security industry.
In my next article, we are going to talk about how quantum computing works and what it means to the security industry.