What features of scrypt() make Tenebrix GPU-resistant?

There's a fork of Bitcoin called Tenebrix that claims to be CPU-friendly and GPU-resistant (with regard to mining). They say this is because they use scrypt instead of SHA256. From what I'm reading, scrypt seems similar to bcrypt or PBKDF2 in that it's just a multi-round scheme. If it's simply more rounds of something that's already faster on a GPU than on a CPU, wouldn't N rounds still be faster on a GPU? Or am I missing something inherent to scrypt that ruins GPU efficiency?

David Perry

Posted 2011-09-30T18:59:58.273

Reputation: 13 848

Keep in mind that Tenebrix is GPU-resistant, not custom-hardware-resistant. A custom PCB covered with lots of cheap pipelined DRAM chips and memory buses will still give vastly more hashes-per-dollar than a normal computer. I still don't understand why the Tenebrix creators think it's so important to hobble GPU miners. – eldentyrell – 2011-09-30T20:03:06.043

I poked through the PDF on scrypt, and while I didn't understand ALL of what was written, there was a section where the author combined the CPU & memory requirements for custom hardware to break a bunch of algorithms in terms of "die space" and compared the costs that way, and scrypt still came out on top. I'm assuming these requirements also hobble FPGAs and at the very least would make ASICs much more expensive than for Bitcoin? – David Perry – 2011-09-30T20:32:56.853

Oh, I see what you're saying. Building the memory into an ASIC would be expensive, but building the ASIC and buying already-cheap RAM on the side wouldn't be. How much RAM does Tenebrix's implementation of scrypt actually require? – David Perry – 2011-09-30T20:35:06.867

David, the maximum size of the table (see my answer below) is a pseudorandom function of the message being hashed. There is no "required" amount of memory, although if you don't have enough to hold the table you'll end up repeating a lot of expensive computation. – eldentyrell – 2011-09-30T21:48:23.580

These two questions might interest you: "Are there algorithms that could have been chosen for mining that balance CPU/GPU?" and "Are there algorithms that could be used for mining that resist acceleration with ASICs?" – nmat – 2011-09-30T23:20:55.773

@Dave Perry: So long as the algorithm is well-constrained by RAM speed, buying already-cheap RAM on the side won't help you. CPUs already do that. If your performance is largely RAM-limited and you use the same RAM as CPUs, you can't significantly outperform them. (That's the theory, anyway; whether a particular implementation achieves it, I don't know.) – David Schwartz – 2011-10-01T00:43:08.397

@David, you need to separate "speed" into "latency" and "bandwidth". Since mining is inherently parallel (you can work on multiple blocks), latency is never the problem. Regarding bandwidth, it's not hard to design a custom machine with massively more memory bandwidth than a top-of-the-line PC: just keep adding more memory buses and DRAM slots. – eldentyrell – 2011-10-01T21:55:23.427

@eldentyrell: Sure, but that doesn't bring the cost down. You can do the same thing just by adding more cheap motherboard/CPU/RAM combos. The issue is whether there's a cost advantage to custom hardware. If it's RAM-limited, there's very little since cheap PCs are disproportionately powerful in the RAM department. – David Schwartz – 2011-10-02T00:03:46.760

DRAM isn't fast enough. The L2 cache inside a CPU has roughly 3 clock cycles of latency. Main RAM has roughly 240 clock cycles of latency. Taking an FPGA and hooking up a bunch of DRAM buses would provide horrible performance. For a huge number of clock cycles the CPU would sit idle waiting for data to come SLLLLLLLLLLLLLOOOOOOOOOOOWWWWWWWWWWWWLLLLLLLL out of main memory. It would then very quickly perform a couple of operations and then SLLLLLLLLLLLLLOOOOOOOOOOOOOWWWWWWWWWWLLLLLLLLLYYYYY push the results back into main memory. Main Memory =/= Cache when it comes to latency. – DeathAndTaxes – 2011-10-10T13:12:59.727

@theUnhandledException: Thanks for clarifying all that. It makes a lot more sense how the implementation is custom-hardware resistant when you consider that the memory requirements of scrypt are not so high as to exceed the amount of L2 cache available on the CPU. It would probably be cheaper to buy obscene amounts of CPU resources than to attempt a custom hardware attack. – David Perry – 2011-10-11T00:20:27.660

Answers

The scrypt() algorithm has at its core a routine called ROMix. Basically, it defines

  V(1) = hash(message)
  V(2) = hash(hash(message))
  V(3) = hash(hash(hash(message)))
  ...

and it calculates

  V(V(V(V(...(message)...))))

Since computing V(n+1) requires computing V(n) first, the most efficient way to do this is to cache all of the previously-computed values. Once you've generated a large enough table, the V(V(V(...))) is just a bunch of lookups.

Caching all the previously computed values requires lots of memory, and since each lookup depends on the previous one it's sensitive to memory latency (although if you're mining you can work on several blocks in parallel and pipeline the requests to get around this).
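To make the table-versus-recomputation trade-off concrete, here is a minimal Python sketch of a ROMix-style loop. It is an illustration under stated assumptions, not Tenebrix's or scrypt's actual code: SHA-256 stands in for scrypt's real mixing function, the lookup phase follows the general shape given in the scrypt paper (XOR the running value with a pseudorandomly chosen table entry, then hash), and the names `H`, `integerify`, `romix`, and `romix_no_table` are made up for this example.

```python
import hashlib


def H(data: bytes) -> bytes:
    """Stand-in hash (assumption): real scrypt uses a Salsa20/8-based mix."""
    return hashlib.sha256(data).digest()


def integerify(block: bytes) -> int:
    """Interpret the last 8 bytes of a block as a little-endian integer."""
    return int.from_bytes(block[-8:], "little")


def xor(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def romix(message: bytes, n: int) -> bytes:
    """Cache the whole hash chain, then do n data-dependent lookups."""
    x = H(message)
    table = []
    for _ in range(n):          # table[i] is H applied (i + 1) times to message
        table.append(x)
        x = H(x)
    for _ in range(n):          # each index depends on the previous result
        j = integerify(x) % n
        x = H(xor(x, table[j]))
    return x


def romix_no_table(message: bytes, n: int) -> bytes:
    """Same result with no table: every lookup re-walks the chain."""
    seed = H(message)
    x = seed
    for _ in range(n):
        x = H(x)
    for _ in range(n):
        j = integerify(x) % n
        v = seed
        for _ in range(j):      # recompute the entry the table would have held
            v = H(v)
        x = H(xor(x, v))
    return x


if __name__ == "__main__":
    digest = romix(b"block header", 64)
    print(digest.hex())
    print(digest == romix_no_table(b"block header", 64))
```

Both functions return the same digest. The second needs almost no memory but performs on the order of n*n/2 extra hash calls, which is the "repeating a lot of expensive computation" cost of not holding the table.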

GPUs can perform far more integer operations per second than a normal CPU, but have roughly the same memory bandwidth/latency as a CPU. So an algorithm that is memory-dominated should "level the playing field" between CPUs and GPUs.
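A back-of-the-envelope model of why a chain of dependent lookups is latency-bound, and how interleaving independent hashes can hide that latency. The cycle counts are the rough figures quoted in the comments on this question (about 3 cycles for L2 cache, about 240 for main memory). The lookup count of 1024 and the function name `cycles_per_hash` are assumptions for illustration, and the model deliberately ignores bandwidth limits and compute time.

```python
def cycles_per_hash(lookups: int, latency_cycles: int, in_flight: int = 1) -> float:
    """Idealised memory-wait cycles per hash.

    Each lookup in one hash must wait for the previous one, so a single hash
    costs lookups * latency_cycles. If in_flight independent hashes are
    interleaved, their waits overlap and the per-hash cost drops accordingly.
    """
    return lookups * latency_cycles / in_flight


LOOKUPS = 1024  # assumed number of dependent lookups per hash

if __name__ == "__main__":
    for label, latency in (("L2 cache", 3), ("main memory", 240)):
        for k in (1, 8, 80):
            cost = cycles_per_hash(LOOKUPS, latency, k)
            print(f"{label:12s} in_flight={k:3d} -> {cost:10.1f} cycles/hash")
```

In this simplified model, main memory only matches a single cache-resident hash once about 80 hashes are in flight, and each of those hashes needs its own copy of the table.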

I still don't understand why the Tenebrix folks consider this to be an important goal. It just "equalizes" GPUs and CPUs, but you can still build custom hardware that does scrypt() much faster and cheaper than a CPU. So it's just going from "GPUs are best" to "custom printed circuit boards covered in memory buses are best". Nobody's been able to explain why this change is worth all the trouble.

eldentyrell

Posted 2011-09-30T18:59:58.273

Reputation: 1 221

I don't think it's that they're GPU hostile so much as they want to be CPU friendly. The current state of Bitcoin is that semi-specialized hardware is required just to participate. You can CPU mine Bitcoins but you're likely to never see a bitcent from it. It may be possible to bring in custom hardware, but I don't think it's cheap or likely and so CPU mining has someplace to flourish now. – David Perry – 2011-09-30T23:26:11.413

I don't think you can build custom hardware that does scrypt much faster and cheaper than a CPU. CPUs already have pretty close to the fastest, cheapest memory paths we are capable of making. – David Schwartz – 2011-10-01T00:46:40.483

@DavidSchwartz, you're confusing latency with bandwidth. PCs are optimized for memory latency, but since mining is inherently parallel (just work on several blocks in parallel) you can always compensate for that by pipelining. It is not hard to build a machine with way, way, way more memory bandwidth than a PC: just devote all of the motherboard space to DIMM slots and memory bus traces instead of stuff like PCI slots and southbridge chips. Sun's Niagara machines were a midpoint along this continuum, and they had obscene memory bandwidth compared to then-top-of-the-line PCs. – eldentyrell – 2011-10-01T21:59:39.113

@David Perry writes "I don't think it's cheap or likely and so CPU mining has someplace to flourish now." Trust me, it is likely. If the Tenebrix currency ever becomes valuable, somebody (probably even me) will build the machine described above and start selling it. The situation will rapidly become worse than it is for Bitcoin: not only won't you be able to mine with CPUs, you won't be able to mine profitably with any device sold at a retail electronics store. – eldentyrell – 2011-10-01T22:01:29.847

@eldentyrell: It would be much cheaper just to use cheap motherboards and cheap CPUs. An x86 CPU is about the cheapest memory controller you can get. Yes, a fully custom solution might provide some advantage, but it would be very, very small, since the algorithm needs something CPUs do well rather than something they do poorly. – David Schwartz – 2011-10-02T00:05:27.987

@eldentyrell - you have it exactly backwards: LATENCY, not BANDWIDTH, is what is important in scrypt. It doesn't require much bandwidth, but it is nearly constantly either fetching or writing data to memory. This creates a bottleneck, as the CPU has to idle until the operation is complete. Pipelining is useless because the next operation is based on inputs in memory. Latency is what kills scrypt performance. L2 cache has roughly a 2-5 cycle latency (depending on CPU design) while main memory has 240+ cycle latency. – DeathAndTaxes – 2011-10-10T13:16:10.037

@this_site_is_a_cesspool it's quite easy: Tenebrix creators are silly kids who wanted to play mining and were angry they couldn't do it with Bitcoin, so created a custom currency which allowed them to do it with CPUs. Of course this is a very short-sighted and misguided operation, doomed to fail horribly. – o0'. – 2013-02-01T17:17:54.030

This might interest you: https://bitcointalk.org/index.php?topic=45849.0. An FPGA implementation was investigated too, and it appears to be somewhat lacking in performance compared to a CPU at about 1/7th the price, which makes it quite cost-ineffective.

Dude

Posted 2011-09-30T18:59:58.273

Reputation: 41

Tenebrix uses scrypt as its proof-of-work algorithm. Scrypt was said to be GPU-resistant because of the in-memory lookup tables the algorithm uses. Most people no longer mine Tenebrix with CPUs and use GPUs instead. Mining Tenebrix with a GPU, you can expect to hash at approximately 1/1000th the rate you would get on the same card mining bitcoins. For example, if you had a 5970 mining Bitcoin at ~700 Mhash/s, you could expect to get ~700 khash/s mining Tenebrix. This is by no means an exact correlation, and in reality the Tenebrix hash rate will probably be lower, but it is close enough for rough hardware comparisons where litecoin mining data is lacking.

Swurl

Posted 2011-09-30T18:59:58.273

Reputation: 31

I think I may have located my own answer (on a different StackExchange beta even!)

The answer to question "Why can't one implement bcrypt in Cuda?" seems similar enough to apply to scrypt/OpenCL as well (since they're basically the same technologies with different names) but I'd like verification from someone with a hair more crypto knowledge than myself. Here's the accepted answer from the other question:

It is not impossible, only harder. This is because of RAM. In a GPU, you have a number of cores which can do 32-bit operations. They will run at one operation per cycle and per core, as long as they operate on their respective registers. RAM access, however, is more troublesome. Each group of cores has access to a small amount of shared RAM, and all cores can read and write the GPU main RAM, but there are access restrictions: not all cores can read from or write to RAM simultaneously (constraints are stricter for main RAM).

Now bcrypt is a variant of the Blowfish key scheduling, which is defined over a table (a few kilobytes) which is constantly accessed and modified throughout the algorithm. Due to the size of the table, each core will have to store it in the GPU main RAM, and they will compete for usage of the memory bus. So bcrypt will run -- but not with full parallelism. At any time, most cores will be stalled, waiting for the memory bus to become free. This comes from the type of elementary operation bcrypt consists in, not from the fact that bcrypt is derived from the key schedule of a block cipher.

For SHA-1 or SHA-256, computation entirely consists in 32-bit operations on a handful of registers, so a password cracker will run without doing any memory access at all, and full parallelism is easily achieved (I did it on my GeForce 9800 GTX+, and I got about 98% of the theoretical maximum speed with a straightforward unrolled SHA-1 implementation).

For details on the programming model in CUDA, have a look at the CUDA C Programming Guide. Also, the author of bcrypt now proposes scrypt, which is even heavier on the memory accesses, exactly so that implementation is hard on GPU and FPGA.
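For a sense of scale, scrypt's lookup table occupies 128 * r * N bytes. The sketch below assumes N = 1024 and r = 1, which are the values commonly reported for Tenebrix; treat them as an assumption, since nothing on this page states them. The function name `scrypt_table_bytes` and the figure of 1600 concurrent work items are hypothetical and chosen only for illustration.

```python
def scrypt_table_bytes(n: int, r: int) -> int:
    """Size of scrypt's V table: n blocks of 128 * r bytes each."""
    return 128 * r * n


ASSUMED_N = 1024  # assumed Tenebrix cost parameter
ASSUMED_R = 1     # assumed Tenebrix block-size parameter

if __name__ == "__main__":
    per_hash = scrypt_table_bytes(ASSUMED_N, ASSUMED_R)
    print(per_hash // 1024, "KiB per hash in flight")

    work_items = 1600  # hypothetical number of hashes a GPU keeps in flight
    total_mib = per_hash * work_items / 2**20
    print(total_mib, "MiB if every work item holds its own table")
```

Under these assumptions a single table fits comfortably in a CPU's L2 cache, while a GPU running many hashes at once has to keep one table per hash in its main memory, where the cores contend for the memory bus as the quoted answer describes for bcrypt.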

David Perry

Posted 2011-09-30T18:59:58.273

Reputation: 13 848

I know I am late to the party, but what is the incentive to use custom hardware when economies of scale make CPUs cheap?

A sane Rich Attacker would just buy hundreds of 4-CPU servers with, say, Bulldozer CPUs. (In the case of Bitcoin, a Rich Attacker could buy a screaming ton of GPUs or whichever equipment gives him the best bang per buck.)

You can't defend yourself against a Rich Attacker by technological means because, no matter whether you use CPUs, GPUs, or Babbage engines, he will just buy more tech than you can afford, end of line.

Black

Posted 2011-09-30T18:59:58.273

Reputation: 31