We have a mathematical proof that to tolerate n malicious nodes, you need 2n + 1 good nodes. The full proof is found in G. Bracha and T. Rabin, Optimal Asynchronous Byzantine Agreement, TR#92-15, Computer Science Department, Hebrew University. It's also well known in the industry. It is not possible for an asynchronous system to provide both safety (the guarantee that all non-malicious nodes will eventually agree on what progress was made) and liveness (the ability to continue to make forward progress) with more than this number of malicious failures.
You can trivially ensure safety by simply making no forward progress at all. And you can trivially make forward progress unsafely by just letting each node do whatever it wants. Neither of these modes of operation is useful.
Let's take a step back to make this answer more helpful:
Why do you need a distributed agreement algorithm at all? Well, you need one in cases where there is more than one way a system could validly make forward progress and you need all the participants in the system to agree on which one to take.
Consider a simple example: I have $10 in the bank, and I write two $10 checks, one to Alice and one to Bob. Either one alone is valid, but we can't let them both go through.
If we had a central authority, they could just clear whichever one they saw first. But what if we don't want a central authority or don't want a single point of failure? And what if we have potentially malicious participants?
Well, you could just sort the checks after representing them as binary data. But that's where the asynchronous component bites us. When do we sort them? Say I see both checks and sort them. How do I know that one second later I won't see a third check that sorts first? And maybe someone else already saw that one. Ouch!
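To make the timing problem concrete, here is a minimal Python sketch. The byte encodings and the names pick_winner, check_alice, check_bob and check_carol are made up for illustration; they are not part of any real protocol.

```python
def pick_winner(seen_checks):
    """Return the check that sorts first among those seen so far."""
    return min(seen_checks)

# Hypothetical binary encodings of three conflicting $10 checks.
check_alice = b"\x02pay-alice"
check_bob = b"\x03pay-bob"
check_carol = b"\x01pay-carol"  # sorts first, but reaches node 1 too late

# Node 1 decides after seeing only two checks.
node1_choice = pick_winner([check_alice, check_bob])

# Node 2 happened to see the third check before deciding.
node2_choice = pick_winner([check_alice, check_bob, check_carol])

print(node1_choice == node2_choice)  # False: the two nodes disagree
```

Both nodes follow the same deterministic rule, yet they clear different checks, because neither can know when it has seen everything.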
So, we have the following requirements:
1) Our system is asynchronous.
2) Some participants may be malicious.
3) We want safety, that is, we do not want one honest participant honoring one check and one honest participant honoring the other.
4) We want liveness, that is, it's not enough to say we never clear any checks. Sure, that's safe, but not useful. We want to be sure that we eventually agree on which checks to clear.
So, now the question arises -- how many dishonest participants can we tolerate in our asynchronous system and still guarantee both safety and liveness?
Here is a simple, non-rigorous way to get the gist of the proof:
Suppose we have n nodes of which h are honest and d are dishonest. Obviously, n = h + d. Now the system needs to come to consensus on which of two checks to clear.
Think about the case where all the honest nodes are evenly split about the two directions the system could make forward progress. The malicious nodes could tell all the honest nodes that they agree with them. That would give h/2 + d nodes agreeing on each of two conflicting ways the system could make forward progress.
In this case, the honest nodes must not make forward progress or they will go in different directions, losing safety. Thus, the number of nodes required to agree before we can make forward progress must be greater than half the number of honest nodes plus the number of malicious nodes, or we lose safety.
If we call t the threshold required to make forward progress, that gives us: t > (h/2) + d. This is the requirement for safety.
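The worst case behind this safety condition can be written down as a small Python sketch. The function names votes_seen_per_side and is_safe are illustrative only, and the numbers are an arbitrary example.

```python
def votes_seen_per_side(h, d):
    """Worst case from the text: the honest nodes split evenly and every
    dishonest node tells each side that it agrees with it."""
    return h / 2 + d

def is_safe(t, h, d):
    """Safe only if neither side can reach the threshold t in that case."""
    return t > votes_seen_per_side(h, d)

h, d = 6, 2
print(votes_seen_per_side(h, d))  # 5.0
print(is_safe(5, h, d))           # False: both sides could reach 5 votes
print(is_safe(6, h, d))           # True
```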
But the malicious nodes could also fail to agree at all. So the number of nodes required to agree before we can make forward progress must be no more than the number of honest nodes or we lose liveness.
This gives us t <= h. Or h >= t. This is the condition for liveness.
Combining the two results, we get:
h >= t > (h/2) + d
h > (h/2) + d
(h/2) > d
d < (h/2)
Thus the number of dishonest nodes we can tolerate is less than half the number of honest nodes. Since n = h + d, d < (h/2) is the same as 3d < n, or d < n/3. Thus we cannot tolerate 1/3 or more of the nodes being dishonest or we lose either safety or liveness.
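As a sanity check on the combined condition h >= t > (h/2) + d, the following Python sketch lists the integer thresholds that satisfy it for given h and d. The helper names feasible_thresholds and max_dishonest are hypothetical, and this only checks the arithmetic of the informal argument, not a real consensus protocol.

```python
def feasible_thresholds(h, d):
    """Integer thresholds t with h >= t > h/2 + d (liveness and safety)."""
    # 2*t > h + 2*d is t > h/2 + d without fractions.
    return [t for t in range(1, h + 1) if 2 * t > h + 2 * d]

def max_dishonest(n):
    """Largest d with d < h/2 where h = n - d, which is the same as 3d < n."""
    return (n - 1) // 3

print(feasible_thresholds(7, 3))  # [7]
print(feasible_thresholds(6, 3))  # []
print(max_dishonest(10))          # 3
```

With 10 nodes, 3 dishonest nodes leave exactly one workable threshold (all 7 honest nodes), while with 4 dishonest nodes no threshold gives both safety and liveness.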
Can you provide a link? Your description could not clarify it for me. I also do not know what is meant by directions. In a peer to peer network there is no direction. – Ini – 2017-08-30T13:08:46.400
What do you want to tell me with that? – Ini – 2017-08-30T13:13:46.140
yes. I was asking for a link to "G. Bracha and T. Rabin, Optimal Asynchronous Byzantine Agreement, TR#92-15, Computer Science Department, Hebrew University". "Think about the case where all the non-faulty nodes are evenly split about two directions the system could make forward progress." I do not know what is a DIRECTION in a peer to peer system. – Ini – 2017-08-30T13:16:39.037
Can you maybe assign a number or letter to the different kinds of non-faulty nodes in your example? In your example one half does something different than the other half, but you do not assign different labels to them, which makes it pretty hard to understand. – Ini – 2017-08-30T13:30:58.117
I cannot figure out what this sentence means: "Think about the case where all the non-faulty nodes are evenly split about two directions the system could make forward progress." If one non-faulty node goes in a different direction than another non-faulty node, isn't it then a faulty node? – Ini – 2017-08-30T13:34:01.137
Take bitcoin for example. Suppose I have one bitcoin and I compose two transactions, one sending it to Bob and one sending that same bitcoin to Charlie. The system must somehow agree on either sending them to Bob or sending them to Charlie. Before that, non-faulty nodes have no magic way to pick a winning transaction. Some may not even know the transaction to Charlie exists, so they believe the system should make forward progress in the Bob direction, but they won't actually make that progress until there's a consensus. – David Schwartz – 2017-08-30T13:41:31.657
Why would they disagree? Because they received a different message from a speaker? Like in Antshares examples? https://github.com/neo-project/docs/blob/master/en-us/node/consensus.md – Ini – 2017-08-30T13:41:59.603
There can be a lot of reasons. Maybe they saw the transactions in the opposite order. Maybe they only saw one transaction. Maybe they saw one transaction really close to a cutoff and some think it arrived before and some after. It all depends on the design of the rest of the system. But if you can magically ensure non-faulty nodes never disagree, you don't need a consensus algorithm. Consensus algorithms are needed when there's some way honest nodes can disagree and they need to agree to safely make forward progress. – David Schwartz – 2017-08-30T13:42:43.690
Ok, I'll study that a bit now. Give me some time. Maybe I'll have another question, but I'll confirm this as the right answer after I've thought a bit about it. – Ini – 2017-08-30T13:44:13.940
How is forward progress defined? Is it any progress regardless of whether it is disruptive or not? Can you also define the terms safety and liveness? Maybe with a reference to a speaker scenario like in Antshares? – Ini – 2017-08-30T13:51:22.253
Is this a correct definition of forward progress: forward progress is made after consensus is reached? – Ini – 2017-08-30T14:02:49.560
@Invader For formal proof purposes, forward progress is usually defined as ruling out at least one future state of the system that was previously possible. For purposes of a particular algorithm, it's usually defined in an algorithm specific way such as confirming one transaction or whatever. – David Schwartz – 2017-08-30T14:13:50.900
In the example where faulty nodes agree with all non-faulty nodes (which go in two directions), safety is not there, because half of the non-faulty nodes have not agreed with the other half of the non-faulty nodes. "safety - the guarantee that all non-malicious nodes will eventually agree on what progress was made". Can you explain what I did not grasp there? What happens if the non-faulty nodes do not agree? Would the system make forward progress anyway because the threshold is reached? – Ini – 2017-08-30T14:15:29.663
Also I do not understand why a non-faulty node should go in the same direction as a faulty one. A non-faulty node would always reject non-valid transactions/blocks. Or in other words, if a faulty node agrees with a non-faulty node's direction, why is that a malicious action of the faulty node? – Ini – 2017-08-30T14:20:01.337
Let us continue this discussion in chat. – David Schwartz – 2017-08-30T14:22:09.550
How does this relate to PoW? I mean in PoW we can have d > (h/2) and everything still works fine, but it arguably takes longer to achieve finality. In PoW if d is too big, then there is the possibility of a chain-rewrite. – Ini – 2018-04-27T17:24:55.830
@Invader Everything doesn't still work fine. It may still work fine or it may fail. There's a statistical chance you're fine and a statistical chance the dishonest nodes will win. PoW sacrifices safety to tolerate more dishonest nodes, and that's a perfectly reasonable choice to make. – David Schwartz – 2018-04-27T17:34:15.637
So this formula d < (h/2) actually says that you can create a system that has safety (does not produce forks) and liveness if d < (h/2), but you cannot create such a system with the same properties with d >= (h/2)? – Ini – 2018-04-27T18:07:20.437
Or in other words: you cannot create a system that has safety and liveness with d >= (h/2) Byzantine nodes. True? – Ini – 2018-04-27T18:20:35.130
@Invader Right. You can pick a threshold and then compute when you lose safety and when you lose liveness. But no threshold will preserve both safety and liveness once you exceed 1/3 failed nodes. – David Schwartz – 2018-07-07T23:12:03.387
Thank you. So in this case, the belief that Bitcoin is secure as long as a majority (= 1/2) of mining power is honest should be wrong, shouldn't it? Thanks – Questioner – 2018-11-08T10:37:55.383
@sas Bitcoin isn't secure even if all mining power is honest. Consider, for example, if a network failure splits the network in two. Both sides will produce blocks every 20 minutes on average and both sides will eventually have sufficient confirmations to rely on transactions that could conflict from one side to the other. The applicability of these theoretical results to practical systems is not always simple or direct. – David Schwartz – 2018-11-08T17:55:46.433