How do Golomb Coded Sets work?

4

From my understanding, Golomb coded sets are a probabilistic data structure that encodes the deltas of an ordered set of elements.

For values like txids, which are uniformly distributed, Golomb coded sets are an efficient way to encode the contents of a block.

What is the ordering used to determine how txids are encoded in the set? It seems we need a base txid, and then compute deltas from that original txid? Is it just as simple as the lowest big endian numeric interpretation of a txid in a block?

Chris Stewart

Posted 2018-10-28T16:44:04.883

Reputation: 865

Answers

3

The data structure encodes a set, which means it does not encode an order.

If you draw numbers uniformly at random from some universe and then sort the result (thus destroying the order information), the differences between consecutive numbers approximately follow an exponential distribution (strictly a geometric one, since the values are integers). The GCS uses a Golomb coder to store these differences efficiently. The sort method used is irrelevant so long as it is consistent with the differencing method used.
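To make the sort-then-difference idea concrete, here is a minimal sketch of Golomb-Rice coding (the power-of-two special case of Golomb coding) applied to the deltas of a sorted set. The function names and the use of a `'0'`/`'1'` string as the bit stream are illustrative assumptions, not taken from any particular implementation.

```python
def golomb_rice_encode(values, p):
    """Encode a set of non-negative integers as Golomb-Rice coded deltas.

    Returns a string of '0'/'1' characters (a stand-in for a real bit stream).
    The parameter p is the number of remainder bits and is assumed to be >= 1.
    """
    assert p >= 1
    bits = []
    prev = 0
    for v in sorted(set(values)):          # sorting discards the original order
        delta = v - prev                   # store only the gap to the previous value
        prev = v
        q, r = delta >> p, delta & ((1 << p) - 1)
        bits.append("1" * q + "0")         # quotient in unary, terminated by a 0
        bits.append(format(r, f"0{p}b"))   # remainder in exactly p bits
    return "".join(bits)


def golomb_rice_decode(bits, n, p):
    """Decode n values from a bit string produced by golomb_rice_encode."""
    values = []
    pos = 0
    prev = 0
    for _ in range(n):
        q = 0
        while bits[pos] == "1":            # read the unary quotient
            q += 1
            pos += 1
        pos += 1                           # skip the terminating 0
        r = int(bits[pos:pos + p], 2)      # read the p-bit remainder
        pos += p
        prev += (q << p) | r               # undo the differencing
        values.append(prev)
    return values


if __name__ == "__main__":
    sample = [90, 3, 57, 20, 41]
    encoded = golomb_rice_encode(sample, p=3)
    print(encoded, len(encoded), "bits")
    print(golomb_rice_decode(encoded, n=len(sample), p=3))
```

Small gaps cost few bits and large gaps cost more, which suits the skewed distribution of the differences. Note that the decoder returns the values in sorted order, whatever order they were supplied in.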

There is nothing probabilistic about the set encoding itself. But for BIP158 the set being encoded is not TXIDs but short hashes of relevant outputs. Because the hashes are short they can have collisions, making the result probabilistic.
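The sketch below illustrates where the probabilistic behaviour comes from: items are reduced to short hashes in the range [0, N·M) before being put in the set, so an item that was never inserted can land on the same value as one that was. Keyed BLAKE2b is used here purely as a stand-in because it is in the Python standard library; BIP158 specifies SipHash. The function names are hypothetical.

```python
import hashlib


def short_hash(item: bytes, key: bytes, n: int, m: int) -> int:
    """Map an item to an integer in [0, n*m).

    Assumption: keyed BLAKE2b replaces the SipHash that BIP158 specifies.
    """
    digest = hashlib.blake2b(item, key=key, digest_size=8).digest()
    h = int.from_bytes(digest, "big")      # 64-bit hash value
    return (h * (n * m)) >> 64             # scale into the range without a modulo


def build_hashed_set(items, key: bytes, m: int):
    """Return the sorted short hashes that a GCS would then delta-encode."""
    n = len(items)
    return sorted({short_hash(i, key, n, m) for i in items})


def maybe_contains(hashed_set, item: bytes, key: bytes, n: int, m: int) -> bool:
    """True for every inserted item; also true, with probability of roughly 1/m,
    for an item that was never inserted (a false positive)."""
    return short_hash(item, key, n, m) in hashed_set


if __name__ == "__main__":
    scripts = [b"script-a", b"script-b", b"script-c"]   # placeholder data
    key = bytes(16)                                      # placeholder key
    m = 784931                                           # BIP158 basic-filter M
    hs = build_hashed_set(scripts, key, m)
    print(hs)
    print(maybe_contains(hs, b"script-b", key, len(scripts), m))
```

A larger M lowers the false-positive rate but widens the gaps between values, so each encoded delta costs more bits.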

G. Maxwell

Posted 2018-10-28T16:44:04.883

Reputation: 6 039

Can you elaborate on short hashes for "relevant" outputs? This may warrant a separate question, more related to BIP158, but what are "short hashes of relevant outputs"? Isn't the entire block encoded into a Golomb coded set and then relayed to your client peer? How does the server compute "relevant outputs" when it doesn't understand what that is? Chris Stewart 2018-10-28T20:38:31.877

2 The GCSs in BIP158 encode, for each block, which scripts are used in it (whether the block contains at least one output to a given script, or whether it spends at least one output assigned to that script). It does that by encoding short hashes of those scripts. A client will download the whole filter for the block and try to match it against the short hashes of all scripts it is interested in. If it finds a match, it downloads the full block; otherwise it can skip the block and continue with the next one. The filters are generally only a few kilobytes in size. Pieter Wuille 2018-10-28T21:28:58.027

1 Pieter's answer covers it; by saying "relevant" I was thinking about empty/OP_RETURN outputs not being included. G. Maxwell 2018-10-29T08:37:56.747
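The client-side matching step described in the comments can be sketched as follows. This assumes the filter has already been decoded into its list of deltas and that the client's scripts have already been reduced to short hashes with the same parameters as the filter; in BIP158 the hash is keyed per block, so those query hashes would have to be recomputed for every filter. The function name is hypothetical.

```python
from itertools import accumulate


def filter_matches_any(deltas, query_hashes):
    """Return True if any query hash is present in the set described by deltas.

    Walks the reconstructed (sorted) set values and the sorted query hashes
    together, so the filter is scanned only once.
    """
    queries = sorted(set(query_hashes))
    qi = 0
    for value in accumulate(deltas):       # running sum turns deltas back into values
        while qi < len(queries) and queries[qi] < value:
            qi += 1                        # this query is smaller than every remaining value
        if qi == len(queries):
            return False                   # no queries left to check
        if queries[qi] == value:
            return True                    # possible match: download the full block
    return False                           # no match: the block can be skipped


if __name__ == "__main__":
    deltas = [3, 17, 21, 16, 33]           # encodes the set {3, 20, 41, 57, 90}
    print(filter_matches_any(deltas, [5, 41]))
    print(filter_matches_any(deltas, [4, 58, 100]))
```

A `True` result only means the block might be relevant, because short hashes can collide; a `False` result means none of the queried scripts are in the filter.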