Parse UTXO of a transaction from chainstate

I'm doing some analysis on the UTXO set by reading from the chainstate database.

I was following the format documentation in https://github.com/bitcoin/bitcoin/blob/d4a42334d447cad48fb3996cad0fd5c945b75571/src/coins.h#L19-L34

/** pruned version of CTransaction: only retains metadata and unspent transaction outputs
 *
 * Serialized format:
 * - VARINT(nVersion)
 * - VARINT(nCode)
 * - unspentness bitvector, for vout[2] and further; least significant byte first
 * - the non-spent CTxOuts (via CTxOutCompressor)
 * - VARINT(nHeight)
 *
 * The nCode value consists of:
 * - bit 1: IsCoinBase()
 * - bit 2: vout[0] is not spent
 * - bit 4: vout[1] is not spent
 * - The higher bits encode N, the number of non-zero bytes in the following bitvector.
 *   - In case both bit 2 and bit 4 are unset, they encode N-1, as there must be at
 *     least one non-spent output).

The parser worked fine when the number of UTXOs was small. However, it failed for the following tx (which has 2501 outputs):

2540b961f4a0b231db3bc5a23608307394eae037d8afd0462e9b794e02f00000

For the key 'c' + 2540b961f4a0b231db3bc5a23608307394eae037d8afd0462e9b794e02f00000, the (deobfuscated) value in chainstate looks like this:

01907050e140254150443a0c280004...

Here 01 is the version and 9070 is the nCode, which encodes whether it is a coinbase tx, the unspentness of vout[0] and vout[1], and the length of the following unspentness bitvector for vout[2:]. According to blockchain.info there are 2501 outputs, so I expected about (2501 - 2)/8 ≈ 312 bytes to follow. However, parsing 9070 as a varint, dropping the lowest three bits, and adding 1 only gives me 2288 / 8 + 1 = 287. (I got 2288 from (0x90 - 0x80 + 1) * 0x80 + 0x70, following the MSB base-128 varint encoding that Bitcoin Core uses.)

Did I miss something here? How exactly does one parse the varint?

h__

Posted 2017-05-11T18:51:03.293

Reputation: 433

Answers

Your arithmetic is right, but you are not interpreting the result correctly. The value is indeed 287, that is (2288 >> 3) + 1. However, this does not mean that the bitvector is 287 bytes long; it means that it contains 287 non-zero bytes. So when parsing the bitvector, you should decrease the byte counter only when you find a non-zero byte. Here is a piece of code that deals with this (n is 287 in this case):

# `utxo` is the deobfuscated value as a hex string and `offset` is the index (in hex characters) of the
# first bitvector byte. If n is set, the encoded value contains a bitvector. The following bytes are parsed
# until n non-zero bytes have been extracted. (If a 00 is found, the parsing continues but n is not decreased)
bitvector = ""
while n > 0:
    data = utxo[offset:offset+2]  # one byte = two hex characters
    if data != "00":
        n -= 1
    bitvector += data
    offset += 2

Notice that this is just a fragment of the code. I recently wrote a full decoder in Python; you can check out the code on GitHub.
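
To tie the pieces together, here is a minimal, self-contained sketch of how the varint, the nCode and the bitvector could be decoded for the per-transaction format quoted in the question. It is not taken from the linked decoder: it assumes Python 3 and works on bytes rather than hex strings, the names read_varint, decode_ncode and parse_header are hypothetical, and the 5-byte value in the last usage line is a made-up example, not real chainstate data.

```python
def read_varint(data, pos=0):
    """Decode one MSB base-128 VARINT from `data` starting at `pos`.

    Returns (value, next_pos).
    """
    n = 0
    while True:
        byte = data[pos]
        pos += 1
        n = (n << 7) | (byte & 0x7F)
        if byte & 0x80:
            n += 1  # continuation bit set: add 1 before reading the next byte
        else:
            return n, pos


def decode_ncode(code):
    """Split nCode into (is_coinbase, vout0_unspent, vout1_unspent, n_nonzero_bytes)."""
    is_coinbase = bool(code & 1)
    vout0 = bool(code & 2)
    vout1 = bool(code & 4)
    n = code >> 3
    if not (vout0 or vout1):
        n += 1  # both bits unset: the higher bits encode N - 1
    return is_coinbase, vout0, vout1, n


def parse_header(data):
    """Parse version, nCode and bitvector; return the indices of unspent outputs.

    Returns (version, is_coinbase, unspent_indices, next_pos).
    """
    version, pos = read_varint(data, 0)
    code, pos = read_varint(data, pos)
    is_coinbase, vout0, vout1, n = decode_ncode(code)
    unspent = [i for i, flag in enumerate((vout0, vout1)) if flag]
    index = 2  # the bitvector starts at vout[2]
    while n > 0:
        byte = data[pos]
        pos += 1
        if byte != 0:
            n -= 1  # only non-zero bytes count towards N
        for bit in range(8):  # least significant bit first
            if byte & (1 << bit):
                unspent.append(index + bit)
        index += 8
    return version, is_coinbase, unspent, pos


# The nCode bytes from the question:
print(read_varint(bytes.fromhex("9070")))    # (2288, 2)
print(decode_ncode(2288))                    # (False, False, False, 287)
# Made-up value: version 1, vout[0] unspent, two non-zero bitvector bytes
print(parse_header(bytes.fromhex("0112010080")))  # (1, False, [0, 2, 25], 5)
```

The position returned by parse_header is where the compressed CTxOuts would start; decoding those, and the trailing VARINT(nHeight), is left out of this sketch.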

sr-gi

Posted 2017-05-11T18:51:03.293

Reputation: 2 382

Hmm, that makes sense. I've read lots of questions and answers by you on this topic. Thanks so much! – h__ – 2017-05-11T22:37:53.887

May I also ask how CTxOuts are serialized? I get that most of them are of the form CompressedAmount + 00 + hash160 of pubkey, but there are lots of nonstandard transactions which are hard to parse. – h__ – 2017-05-11T22:46:07.087

https://github.com/bitcoin/bitcoin/blob/v0.14.1/src/compressor.h#L17L27 – Pieter Wuille – 2017-05-11T22:51:17.663

It is also part of the code I've linked in the answer; let me extend the answer to include that part of the code (check the comments). – sr-gi – 2017-05-11T22:53:23.220

Found it, thanks. https://github.com/sr-gi/bitcoin_tools/blob/d679c41183e315686729ca6b8bd79c7daa499b4d/utils/utils.py#L363-L388 – h__ – 2017-05-11T22:57:28.333

Exactly, there it is. – sr-gi – 2017-05-11T22:59:41.100