The scrypt() algorithm has at its core a routine called ROMix. Basically, it defines
V(1) = hash(message)
V(2) = hash(hash(message))
V(3) = hash(hash(hash(message)))
...
and it calculates
V(V(V(V(...(message)...))))
Since computing V(n+1) requires computing V(n) first, the most efficient way to do this is to cache all of the previously-computed values. Once you've generated a large enough table, the V(V(V(...))) is just a bunch of lookups.
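A minimal sketch of the table-then-lookups idea described above. It is not the real scrypt code: SHA-256 stands in for scrypt's actual mixing function, and the names `H`, `integerify`, and `romix` as well as the table size `n` are illustrative assumptions. The table here is indexed from 0, so `table[0]` plays the role of V(1).

```python
import hashlib


def H(block: bytes) -> bytes:
    # Stand-in hash (assumption): real scrypt uses a Salsa20/8-based BlockMix.
    return hashlib.sha256(block).digest()


def integerify(block: bytes, n: int) -> int:
    # Simplified: read the trailing 8 bytes as a little-endian integer, mod n.
    return int.from_bytes(block[-8:], "little") % n


def romix(message: bytes, n: int) -> bytes:
    x = H(message)
    table = []
    # Phase 1: build the chain hash(message), hash(hash(message)), ...
    # Each entry needs the previous one, so the chain is computed in order.
    for _ in range(n):
        table.append(x)
        x = H(x)
    # Phase 2: n lookups whose index depends on the current value,
    # so the next address is unknown until the previous step finishes.
    for _ in range(n):
        j = integerify(x, n)
        x = H(bytes(a ^ b for a, b in zip(x, table[j])))
    return x
```

Calling `romix(b"message", 1024)` keeps 1024 table entries alive for the whole second phase, which is where the memory cost comes from.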
Caching all the previously computed values requires lots of memory, and since each lookup depends on the previous one it's sensitive to memory latency (although if you're mining you can work on several blocks in parallel and pipeline the requests to get around this).
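The memory-versus-recomputation tradeoff can be sketched as follows. This is a toy model under stated assumptions: SHA-256 replaces scrypt's real mixing function, and `romix_sparse` and its `stride` parameter are hypothetical names. Keeping only every `stride`-th table entry cuts memory by that factor, but each lookup may have to re-hash forward from the nearest stored entry.

```python
import hashlib


def H(block: bytes) -> bytes:
    # Stand-in hash (assumption): real scrypt uses a Salsa20/8-based BlockMix.
    return hashlib.sha256(block).digest()


def romix_sparse(message: bytes, n: int, stride: int):
    """Return (result, hash_calls), storing only every stride-th table entry."""
    calls = 0
    x = H(message)
    calls += 1
    checkpoints = []
    # Phase 1: walk the whole chain but keep only a subset of it.
    for i in range(n):
        if i % stride == 0:
            checkpoints.append(x)
        x = H(x)
        calls += 1
    # Phase 2: data-dependent lookups; missing entries are recomputed.
    for _lookup in range(n):
        j = int.from_bytes(x[-8:], "little") % n
        v = checkpoints[j // stride]
        for _step in range(j % stride):
            v = H(v)  # extra work caused by not caching this entry
            calls += 1
        x = H(bytes(a ^ b for a, b in zip(x, v)))
        calls += 1
    return x, calls


full, full_calls = romix_sparse(b"message", 64, 1)    # full table
small, small_calls = romix_sparse(b"message", 64, 8)  # 1/8 of the memory
```

Both calls return the same value; the second one stores 8 entries instead of 64 and pays for it in additional hash calls.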
GPUs can perform far more integer operations per second than a normal CPU, but have roughly the same memory bandwidth/latency as a CPU. So an algorithm that is memory-dominated should "level the playing field" between CPUs and GPUs.
I still don't understand why the Tenebrix folks consider this to be an important goal. It just "equalizes" GPUs and CPUs, but you can still build custom hardware that does scrypt() much faster and cheaper than a CPU. So it's just going from "GPUs are best" to "custom printed circuit boards covered in memory buses are best". Nobody's been able to explain why this change is worth all the trouble.
Keep in mind that Tenebrix is GPU-resistant, not custom-hardware-resistant. A custom PCB covered with lots of cheap pipelined DRAM chips and memory buses will still give vastly more hashes-per-dollar than a normal computer. I still don't understand why the Tenebrix creators think it's so important to hobble GPU miners. – eldentyrell – 2011-09-30T20:03:06.043
I poked through the PDF on scrypt and while I didn't understand ALL of what was written, there was a section where the author combined the CPU & memory requirements for custom hardware to break a bunch of algorithms in terms of "die space" and compared the costs that way and scrypt still came out on top. I'm assuming these requirements also hobble FPGAs and at the very least would make ASICs much more expensive than for Bitcoin? – David Perry – 2011-09-30T20:32:56.853
Oh, I see what you're saying. Building the memory into an ASIC would be expensive, but building the ASIC and buying already-cheap RAM on the side wouldn't be. How much RAM does Tenebrix's implementation of scrypt actually require? – David Perry – 2011-09-30T20:35:06.867
David, the maximum size of the table (see my answer below) is a pseudorandom function of the message being hashed. There is no "required" amount of memory, although if you don't have enough to hold the table you'll end up repeating a lot of expensive computation. – eldentyrell – 2011-09-30T21:48:23.580
These two questions might interest you: Are there algorithms that could have been chosen for mining that balance CPU/GPU? and Are there algorithms that could be used for mining that resist acceleration with ASICs?
– nmat – 2011-09-30T23:20:55.773
@Dave Perry: So long as the algorithm is well-constrained by RAM speed, buying already-cheap RAM on the side won't help you. CPUs already do that. If your performance is largely RAM-limited and you use the same RAM as CPUs, you can't significantly outperform them. (That's the theory anyway, whether a particular implementation realizes that, I don't know.) – David Schwartz – 2011-10-01T00:43:08.397
@David, you need to separate "speed" into "latency" and "bandwidth". Since mining is inherently parallel (you can work on multiple blocks), latency is never the problem. Regarding bandwidth, it's not hard to design a custom machine with massively more memory bandwidth than a top-of-the-line PC: just keep adding more memory buses and DRAM slots. – eldentyrell – 2011-10-01T21:55:23.427
@eldentyrell: Sure, but that doesn't bring the cost down. You can do the same thing just by adding more cheap motherboard/CPU/RAM combos. The issue is whether there's a cost advantage to custom hardware. If it's RAM-limited, there's very little since cheap PCs are disproportionately powerful in the RAM department. – David Schwartz – 2011-10-02T00:03:46.760
DRAM isn't fast enough. The L2 cache inside a CPU has roughly 3 clock cycle latency. Main RAM has roughly 240 clock cycle latency. Taking an FPGA and hooking up a bunch of DRAM buses would provide horrible performance. For a huge number of clock cycles the CPU would sit idle waiting for data to come SLLLLLLLLLLLLLOOOOOOOOOOOWWWWWWWWWWWWLLLLLLLL out of main memory. It would then very quickly perform a couple of operations and then SLLLLLLLLLLLLLOOOOOOOOOOOOOWWWWWWWWWWLLLLLLLLLYYYYY push the results back into main memory. Main Memory =/= Cache when it comes to latency. – DeathAndTaxes – 2011-10-10T13:12:59.727
@theUnhandledException: Thanks for clarifying all that. It makes a lot more sense how the implementation is custom-hardware resistant when you consider that the memory expenses of scrypt are not so high as to exceed the amount of L2 cache available on the CPU. It would probably be cheaper to buy obscene amounts of CPU resources than to attempt a custom hardware attack. – David Perry – 2011-10-11T00:20:27.660