Benchmark results before:

num_operations: 1000000
insert      : 191.090 s (  5238.3/s) CPU; 191.106 s (  5237.7/s) wall clock
lookup      : 30.430 s ( 33068.8/s) CPU; 30.428 s ( 33064.0/s) wall clock
lookup_range: 85.700 s ( 11694.5/s) CPU; 85.700 s ( 11693.6/s) wall clock
insert2     : 143.320 s (  6986.7/s) CPU; 143.315 s (  6986.6/s) wall clock
remove      : 246.010 s (  4068.0/s) CPU; 246.013 s (  4067.9/s) wall clock
remove_range: 225.210 s (  4444.0/s) CPU; 225.286 s (  4442.4/s) wall clock

Those are insert, lookup, and remove operations (plus their range variants) on a B-tree with one million keys; insert2 inserts the keys in sorted order. All nodes are kept in memory, not on disk.
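A timing harness that reports both CPU and wall-clock figures, like the tables above, could look roughly like this. This is an illustrative sketch, not the actual benchmark code; the function and parameter names are made up for the example.

```python
import time

def benchmark(name, func, num_operations):
    # Run a batch of operations and measure both CPU time
    # (time.process_time) and wall-clock time (time.perf_counter),
    # then report seconds and operations per second for each,
    # in the style of the tables above.
    cpu0 = time.process_time()
    wall0 = time.perf_counter()
    func()
    cpu = max(time.process_time() - cpu0, 1e-9)
    wall = max(time.perf_counter() - wall0, 1e-9)
    print('%-12s: %.3f s (%8.1f/s) CPU; %.3f s (%8.1f/s) wall clock' %
          (name, cpu, num_operations / cpu, wall, num_operations / wall))
```

On a mostly idle machine the two numbers track each other closely, as they do above; a large gap would suggest the process was waiting on I/O or being scheduled out.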

After some micro-optimization (replacing my own flexible bsearch implementation with the Python standard library one):

num_operations: 1000000
insert      : 97.850 s ( 10237.5/s) CPU; 97.860 s ( 10237.1/s) wall clock
lookup      : 12.430 s ( 81566.1/s) CPU; 12.429 s ( 81609.8/s) wall clock
lookup_range: 52.670 s ( 19047.6/s) CPU; 52.669 s ( 19050.0/s) wall clock
insert2     : 72.600 s ( 13806.4/s) CPU; 72.595 s ( 13808.6/s) wall clock
remove      : 134.510 s (  7443.8/s) CPU; 134.512 s (  7444.0/s) wall clock
remove_range: 169.740 s (  5897.3/s) CPU; 169.795 s (  5895.6/s) wall clock
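The micro-optimization itself, swapping a hand-rolled binary search for the standard library's bisect module, can be sketched roughly like this. The helper names and the sorted-key-list layout are assumptions for illustration, not my actual node code.

```python
import bisect

def find_key_index(keys, key):
    # Insertion point for key in a node's sorted key list.
    # bisect_left is implemented in C, so it beats a pure-Python
    # binary search by a wide margin.
    return bisect.bisect_left(keys, key)

def contains(keys, key):
    # Membership test built on the same insertion point.
    i = bisect.bisect_left(keys, key)
    return i < len(keys) and keys[i] == key
```

The payoff shows up in every operation, since insert, lookup, and remove all search within nodes on the way down the tree.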

Not bad, for a train ride's work, with some additional improvements late at night in a hotel room.

The code is also clearer now, at least mostly, and there's less of it: duplicating standard library functionality always makes me feel icky.

The numbers for removals look bad, really bad. Range removal, especially, should be much better. If anyone wanted to get familiar with my btree code base, that would be an excellent place to start. See the README for instructions.