There is a JSON file of unit tests for Japanese mnemonics that I want to validate using Python, specifically with this fork of pybitcointools, which has BIP-39 functionality.
The test vectors from Trezor's python-mnemonic work fine (in Python 2.7, in my experience). That case is straightforward, however: all the mnemonics are lowercase English, so there is no normalization of Unicode diacritics and the like to worry about.
The Japanese fields are:
- Entropy (hex)
- Mnemonic (Japanese)
- Password (Japanese, appears to be the same for all tests)
- Seed (hex, 64 bytes)
- xprv
So the entropy determines the mnemonic (BIP-39?), then mnemonic | password hashes to the Seed, and the Seed then acts as the master key for the BIP-32 xprv? (Correct me if I'm wrong!)
So, assuming it's that straightforward...
- how is the Japanese Unicode text "normalized"? (Is it just NFKD Unicode normalization, which Electrum 2.0 does?)
- what does "normal" mean for Japanese?
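For reference, the pipeline described above can be sketched as follows. This is a minimal Python 3 sketch, not the fork's actual code: the function names `bip39_seed` and `bip32_master` are made up for illustration. It assumes the BIP-39 rule that both the mnemonic and the passphrase are NFKD-normalized and UTF-8 encoded before PBKDF2-HMAC-SHA512 (2048 iterations, salt "mnemonic" + passphrase), and the BIP-32 rule that the master key comes from HMAC-SHA512 keyed with "Bitcoin seed".

```python
import hashlib
import hmac
import unicodedata


def bip39_seed(mnemonic, passphrase=""):
    """Derive the 64-byte BIP-39 seed from a mnemonic sentence (hypothetical helper)."""
    # NFKD also maps the ideographic space U+3000 to a plain space U+0020.
    norm_mnemonic = unicodedata.normalize("NFKD", mnemonic)
    norm_passphrase = unicodedata.normalize("NFKD", passphrase)
    salt = ("mnemonic" + norm_passphrase).encode("utf-8")
    return hashlib.pbkdf2_hmac(
        "sha512", norm_mnemonic.encode("utf-8"), salt, 2048, dklen=64
    )


def bip32_master(seed):
    """Split HMAC-SHA512(key="Bitcoin seed", seed) into (private key, chain code)."""
    digest = hmac.new(b"Bitcoin seed", seed, hashlib.sha512).digest()
    return digest[:32], digest[32:]
```

Serializing the key and chain code into an xprv string (version bytes, Base58Check) is left out here. Range checks on the master private key are also omitted.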
Thanks for the response. I've got gaps in how UTF-8 fits in, though I understand pretty well what NFKD does. Why not just encode Unicode directly? Also, I do understand the ideographic space, I think (\u3000 is the Japanese "equivalent" of \u0020). It was actually Electrum 2.0's implementation of the seed preparation which was confusing me. It really is strange that Electrum deviated in such obscure places. – Wizard Of Ozzie – 2015-06-04T11:18:33.150
Scratch that. You can't encode Unicode 4 byte code points as well as UTF-8, right? – Wizard Of Ozzie – 2015-06-04T11:22:01.300
@WizardOfOzzie Unicode strings are simply sequences of integers in the range [0,0x10FFFF]. Before you can hash a string, you need to convert it into a sequence of bytes. A simple way is to just take each 4-byte int, and use those bytes as-is (UTF-32LE encoding), but this is inefficient from a space point of view (considering that English only needs 1 byte per character). UTF-8 is more complicated, but more space efficient most of the time. – Christopher Gurnee – 2015-06-04T11:27:29.953
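To make the size comparison concrete, here is a small Python 3 sketch (the helper name `encoded_sizes` is made up for illustration) that counts code points and encoded bytes for an ASCII string, a hiragana character, and a code point outside the Basic Multilingual Plane. It also shows that UTF-8 can encode every code point up to 0x10FFFF, using at most 4 bytes.

```python
def encoded_sizes(text):
    """Return the code point count and the byte length in two encodings."""
    return {
        "code_points": len(text),
        "utf-8": len(text.encode("utf-8")),
        "utf-32-le": len(text.encode("utf-32-le")),
    }


# ASCII text, one hiragana character, one code point above U+FFFF.
samples = ["abc", "\u3042", "\U0001F600"]
for sample in samples:
    # ascii() keeps the output printable on any terminal encoding.
    print(ascii(sample), encoded_sizes(sample))
```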
@WizardOfOzzie Agreed that Electrum 2.x's normalization is much more complex, but it's helpful to minimize the chance of loss from mistyped mnemonics given the lack of any specific wordlists. (And BIP-39's requirement for specific wordlists was something Electrum 2.x's dev really disagreed with.) – Christopher Gurnee – 2015-06-04T11:34:16.163
Does this look right? norm = lambda d: (' '.join(unicodedata.('NFKD', unicode(d)).split('\u3000'))).encode('utf-8') – Wizard Of Ozzie – 2015-06-04T22:26:15.627
Assuming d is a mnemonic sentence of Python 2 type string or unicode? I'd think something like this for most languages (different for Chinese, which might not have spaces in d): norm = lambda d: (u' '.join(unicodedata.normalize('NFKD', unicode(d)).split())).encode('utf-8') (I added the missing "normalize" and changed split to split on all whitespace). Note that if you're accepting input from a user, BIP-39 requires that you verify the checksum. – Christopher Gurnee – 2015-06-04T22:47:31.220
Thx, I'll try it now. Hmmm, assuming py 2/3 so let's say both Unicode and str. Yep, I'm pretty well versed with bip39 so my mn2hex function asserts check_bip39 – Wizard Of Ozzie – 2015-06-04T23:25:20.207
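The checksum rule mentioned above can be sketched without any word list by working on word indices (0 to 2047). This is a Python 3 sketch under the BIP-39 definition (checksum = first ENT/32 bits of SHA-256 of the entropy, appended to the entropy and split into 11-bit groups). The function names are hypothetical and are not the `mn2hex` / `check_bip39` functions from the fork.

```python
import hashlib


def entropy_to_indices(entropy):
    """Map entropy bytes to BIP-39 word indices (11 bits each)."""
    ent_bits = len(entropy) * 8
    if ent_bits % 32 != 0 or not 128 <= ent_bits <= 256:
        raise ValueError("entropy must be 128-256 bits, in multiples of 32")
    cs_bits = ent_bits // 32
    checksum = hashlib.sha256(entropy).digest()[0] >> (8 - cs_bits)
    value = (int.from_bytes(entropy, "big") << cs_bits) | checksum
    n_words = (ent_bits + cs_bits) // 11
    return [(value >> (11 * (n_words - 1 - i))) & 0x7FF for i in range(n_words)]


def indices_to_entropy(indices):
    """Recover the entropy from word indices, raising if the checksum is wrong."""
    n_words = len(indices)
    if n_words % 3 != 0 or not 12 <= n_words <= 24:
        raise ValueError("mnemonic must have 12, 15, 18, 21 or 24 words")
    cs_bits = n_words * 11 // 33
    ent_bits = n_words * 11 - cs_bits
    value = 0
    for index in indices:
        value = (value << 11) | index
    entropy = (value >> cs_bits).to_bytes(ent_bits // 8, "big")
    expected = hashlib.sha256(entropy).digest()[0] >> (8 - cs_bits)
    if value & ((1 << cs_bits) - 1) != expected:
        raise ValueError("checksum mismatch")
    return entropy
```

Looking the indices up in a language's 2048-word list, and the reverse lookup after normalization, would sit on either side of these two functions.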
I've gotten all the unit tests to work except for the comparison of bip39_hex_to_mn with VECTOR['mnemonic'], as the function returns a standard space (i.e. \u0020) whereas the test vectors use \u3000. Electrum (2.x) solves the ideographic space (\u3000) problem by concatenating all CJK words without spaces. See https://github.com/simcity4242/pybitcointools/blob/master/test_bip39.py#L123 (I've only got it to work with u' '.join(v['mnemonic'].split()), whereas it should just be v['mnemonic']) – Wizard Of Ozzie – 2015-06-09T04:35:23.577
Tangentially related, why do the word lists for English and Japanese have the correct format (each word is a unicode object consisting of unicode characters), whereas the Spanish, Chinese and French (https://github.com/simcity4242/pybitcointools/blob/master/bitcoin/_bip39.py#L2349 and below) wordlists are of this format: '\xe7\x9a\x84'? – Wizard Of Ozzie – 2015-06-09T04:42:25.387
@WizardOfOzzie By my reading of the standard, bip39_hex_to_mn() should return mnemonics with ideographic spaces for Japanese, so that the result can be used for both display purposes and for calculating the binary seed. As it's currently written, the returned value is only suitable for the latter purpose. Given this change, you could remove the "hack" you made to TestBIP39JAP. – Christopher Gurnee – 2015-06-09T14:04:26.803
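One way to read that suggestion, sketched in Python 3: join the words with U+3000 when the language is Japanese, and rely on NFKD normalization to turn the separator back into U+0020 before the seed is computed. The names `join_words` and `for_seed` and the `language` argument are hypothetical, not the fork's API, and the two Japanese words are only sample input.

```python
import unicodedata

IDEOGRAPHIC_SPACE = "\u3000"


def join_words(words, language):
    """Build a display mnemonic; Japanese uses the ideographic space."""
    separator = IDEOGRAPHIC_SPACE if language == "japanese" else " "
    return separator.join(words)


def for_seed(mnemonic):
    """NFKD-normalize a mnemonic; this maps U+3000 to U+0020."""
    return unicodedata.normalize("NFKD", mnemonic)


sample_words = ["あいこくしん", "あいさつ"]  # sample input only
display = join_words(sample_words, "japanese")
```

With this shape, the same returned string can be compared directly against the test vector and also passed on to the seed derivation.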
@WizardOfOzzie Regarding your word lists, I agree with you that I'd prefer they all be unicode objects for consistency. Where did you get your word lists? Perhaps you should download the official BIP-39 word list text files directly from the repo, and load them with something like with io.open(language+'.txt', encoding='utf_8_sig') as words_file: words[language] = tuple(word.strip() for word in words_file). – Christopher Gurnee – 2015-06-09T14:22:39.610
I loaded them using iPython, then wrote the variable to a file with %store var >> file.py. I'll update both recommendations, thanks! – Wizard Of Ozzie – 2015-06-10T03:24:10.353
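The loading suggestion can be expanded into a small Python 3 sketch. `load_wordlist` and `decode_words` are hypothetical helper names, and the two-word demo file only stands in for a real 2048-word BIP-39 list. The `utf_8_sig` codec strips a leading byte order mark if one is present, and `decode_words` shows how byte strings such as '\xe7\x9a\x84' turn back into text (here the single character U+7684).

```python
import io
import os
import tempfile


def load_wordlist(path):
    """Read one word per line as text, ignoring a UTF-8 BOM and blank lines."""
    with io.open(path, encoding="utf_8_sig") as words_file:
        return tuple(line.strip() for line in words_file if line.strip())


def decode_words(words):
    """Turn any UTF-8 byte strings in a word list into text."""
    return tuple(w.decode("utf-8") if isinstance(w, bytes) else w for w in words)


# Demo with a tiny stand-in file written with a BOM.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with io.open(demo_path, "w", encoding="utf_8_sig") as demo_file:
    demo_file.write("\u7684\n\u4e00\n")
demo_words = load_wordlist(demo_path)
```

A length check such as `len(words) == 2048` after loading a real list would catch truncated or concatenated files.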