NAME List::Insertion - Binary search a list for insertion point SYNOPSIS Export a subroutine to locate the index of the first instance (left) of value in a sorted simple scalar list: # Exports 'search_numeric_left' only # use List::Insertion {type=>"numeric", duplicate=>"left"}; # Data has value 30 duplcated my @data=(10, 20, 30, 30, 40, 50, 60, 70); #index 0 1 2 3 4 5 6 7 # The insert position will be index 2, which is left of duplicates # my $key= 30; my $pos=search_numeric_left $key, \@data; if($data[$pos] == $key){ # Exact match, $pos is the index of the searched $key in the array (left) } else{ # $pos is the index where $key should be inserted } Same as above with string comparison: use List::Insertion {type=>"string", duplicate=>"left"}; my @data=qw<10 20 30 30 40 50 60 70>; #index 0 1 2 3 4 5 6 7 my $key= "30"; my $pos=search_string_left $key, \@data; if($data[$pos] eq $key){ } else{ } Export a subroutine to search for insertion point, of array of hashes, right of duplicates, with accessor to comparison value: # Data has value=>3 duplcated. Elements are hash refs so use an accessor snippet # use List::Insertion {type=>"numeric", duplicate=>"right", accessor=>'->{value}'}; my @data=( {value=>1, other=>"a"}, {value=>2, other=>"b"}, {value=>3, other=>"c"}, {value=>3, other=>"d"}, {value=>4, other=>"e"}, {value=>5, other=>"f"}, ); # The insert position will be index 4, which is right of duplicates # my $key= 3; my $pos=search_numeric_right $key, \@data; Export 'make_search' and create anonymous subroutine instead of injecting named subroutines to name space use List:::Insertion "make_search"; my @data=(1, 2, 3); my $search=make_search type=>"numeric", duplicate=>"left"; my $key=2; my $pos=$search->($key, \@data); DESCRIPTION List::Insertion implements binary search algorithms to locate the insertion point of a sorted list for a given value. If the value were to be inserted at that position, the list would remain sorted. A distinction is made between inserting left or right of an exact match or duplicated entry. This allows to insert data before contiguous equal items or after. Performance rather than flexibility is favoured in the implementation and necessitates the following restrictions on the data stored in the array/list and how it's compared: Data must be sorted in asscending order only This simplifies the combinations of subroutines exported No code blocks can be specified While code blocks are very flexible, 90% of the functionality can be achieved with 'accessor' snippets. Element comparision is implicitly numerical or string No object methods for comparison. Only basic "<=>", ">=", "le" and "ge" operators are used for internally for string and numeric comparison. List element must be homogeneous and defined All items in the list are expected to have the same data structure. The structure can be a simple scalar like a string or number, in which case no 'acceesor' snippet is used. On the other hand it could be complex, like an array of arrays or hashes of objects to any level. The only restriction in this case is the 'accessor' snippet is able to access the value for comparison via post dereferencing or method calls Although intended for searching for a insertion point, this module can also be used for general searching of elements. A simple check of equality between the search key and value at the found index will determine if the value was actually found or not. The returned index will never indicate 'not found' as there is always a insert location in a list. No symbols are exported by default. Specifications given at import time are used to generate and export the search routine (s) needed. Anonymous search subroutines can also be generated. PERFORMANCE General Binary Search Performance Thanks to the lack of a CODE blocks and more streamlined comparison, this module is on par with List::BinarySearch::XS for at least one level of structured data. Not bad for pure perl. Please see the associated benchmarking script bench.pl in this distribution: Results for searching an array with 1000 random numeric elements stored in a key/value hash. The keys searched are also randomly generated and may not existing inthe sorted list. Comparision between List::BinarySearch::PP, List::BinarySearch::XS, and List::Insertion Rate L::BS::PP L::I L::BS::XS L::BS::PP 14902/s -- -80% -81% L::I 74796/s 402% -- -6% L::BS::XS 79644/s 434% 6% -- Building/Updating an Already Sorted List Perl is pretty fast when sorting bulk array data in a single sort call. However adding a single new value and resorting an existing array is not nearly as fast. Normally to update an already sorted array with a new value, you would push the new value to the array and resort. This module can improve performance by first locating the insert point at which you would splice in the new value. For example: eg. my @data= map rand(10), 1..1000; my $key=4.3; # Instead of this ... # my @perl_sort; push @perl_sort, $key; @perl_sort=sort {$a <=> $b} @perl_sort; # Do this ... # my @sorted; my $pos; if(@sort){ $pos=search_numeric_left $key, \@sorted; splice @sorted, $pos, 0, $key; } else { push @sorted, $key; } The benchmarking script build-sorted.pl in this distribution demonstrates building a list where each element is a key value pair (hash) storing a random value. This module is within approx 20% of the XS implementation of List::BinarySearch::XS and is much faster than the pure perl List::BinarySearch::PP. Results for constructing/sorting an array with 1000 random numeric elements one element at a time. Each element is key/value hash. Comparision between List::BinarySearch::PP, List::BinarySearch::XS, perl sort, and List::Insertion Rate perl_sort_update L_BS_PP_update L_I_update L_BS_XS_update perl_sort_update 25.5/s -- -86% -97% -97% L_BS_PP_update 179/s 603% -- -76% -81% L_I_update 760/s 2884% 325% -- -18% L_BS_XS_update 922/s 3523% 416% 21% -- API Importing When importing this module, the type of comparison (string or numeric) and the behaviour in dealing with duplicate values (left or right) is specified along with optional accessor and prefix options. Combinations of these options are generated via "Data::Combination", allowing multiple subroutines to be configured and returned with minimal typing. Consider the following examples: use List::Insertion {type=>"numeric", position=>left}; #(1) use List::Insertion {type=>"numeric", position=>["left", "right"]}; #(2) use List::Insertion { #(3) type=>"numeric", duplicate=>["left", "right"], accessor=>'->{hash_key}' }; use List::Insertion { #(4) type=>"numeric", duplicate=>["left", "right"], accessor=>'->{hash_key}->method', prefix=>"find"}; 1. Imports the subroutine "search_numeric_left" 2. Imports the subroutines "search_numeric_left" and "search_numeric_right" 3. Imports the subroutines "search_numeric_left" and "search_numeric_right" and uses the accessor '->{hash_key} when accessing elements. Elements must all be hash references and the hash key will be compared in numeric fashion. 4. Imports the subroutines "find_numeric_left" and "find_numeric_right" and uses the accessor '->{hash_key}->method' when accessing elements. Elements must all respond to 'method', with its return value compared in numeric fashion. The default values of supported options are used if no matching options are present in an import specification. The defaults are: type=>"string", duplicate=>"left", accessor=>"", prefix=>"search" Supported options during importing include: type type=>NAME or type=[NAME,...] A plain scalar or array ref of comparison type names. The type is implicitly used as the second part of an exported subroutines name. Supported values for NAME are: numeric Numerical comparison string String comparison duplicate duplicate=>SIDE or pos=>[SIDE,...] A plain scalar or array ref of side names. The side to choose when duplicate values are encountered. This is used implicitly as the last part of an exported subroutines name. Supported values for SIDE are: left, lesser Choose the lower index when the duplicate items are encountered. eg my @list=( 10, 20, 20, 20, 30) /\ || A 'left' search for 20 will result in a index of 1 right, greater Choose the greater index (after duplicates) when the duplicate items are encountered. eg my @list=( 10, 20, 20, 20, 30) /\ || A 'right' search for 20 will result in a index of 4 accessor accessor=>STRING A string consisting of perl post dereferencing/method call syntax, which is used is to access internal levels of a element's data structure. Internally this string is literally appending to the array element indexing code: eg Acessor: ->{hash_ref}[array_deref]->method Interal code: ... $array[$index] ... Resulting code: ... $array[$index]->{hash_ref}[array_deref]->method ... If not specified, it is treated as an empty string, and elements in the list are treated as numeric/string simple scalars. Their value are used directly in comparisons in the search algorithm. If specified, elements are dereferenced/called with the accessor. The resulting value is used in the comparison in the search algorithm. The value of the search key is NOT subject to the accessor and is used directly in comparison. prefix prefix=>STRING A string which becomes the start of the imported subroutines name. If unspecified, the string "search" is used. eg use List::Insertion {prefix=>"my_searcher", ...}; # The subrotine imported will start with my_searcher my $pos=my_searcher_... Anonymous Subroutines Instead of importing named subroutines into your namespace, anonymous subroutines can be generated by importing the "make_search" subroutine: use List::Insertion "make_search"; make_search my $sub=make_search {options} Creates a search subroutine configured with a options hash ref. Each option is a key value pair, as described in the importing section. Only simple scalars key/values are allowed, as only a single subroutine is returned per call. Multiple calls to this subroutine will need to be used to generate multiple search subroutines. This is the subroutine called internally during import to generate the named subroutines. The option prefix has no effect as the routine is anonymous. Using Generated or Exported subrotines The generated/imported subroutines are named in the format: prefix_type_duplicate where prefix, type and duplicate represent the prefix, data type ( string or numeric) and duplicated entry handling (left or right) configuration These routines are called with two arguments, the search key and reference to the sorted data: my $insert=find_nv_left $key, \@data; The return value is the index in the @data, which if inserting $key will keep the list sorted. The value of the element located at $insert my be equal to the search key. NOTE: Search routines never return less then 0 or otherwise indicate 'element not found'. The index is always the point when data can be inserted. So an empty list will always return a found index of 0, as this where an element would be inserted. FUTURE WORK Make an XS version That could be tricky with the accessor feature. Validate accessor Add a 'validator' for testing/confirmation of accessor syntax SEE ALSO List::BinarySearch and the List::BinarySearch::PP(pure perl) and List::BinarySearch::XS (XS enhanced) 'sub modules' provide more flexibility than this module thanks to the use of code blocks for element comparison. REPOSITORY and BUG REPORTING Please report any bugs and feature requests on the repo page: GitHub AUTHOR Ruben Westerberg, COPYRIGHT AND LICENSE Copyright (C) 2023 by Ruben Westerberg This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, or under the MIT license DISCLAIMER OF WARRANTIES THIS PACKAGE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.