Horspool’s Algorithm is a simplified version of the Boyer-Moore string searching algorithm, designed to find the occurrence of a “needle” (pattern) within a text. It improves on the efficiency of simpler search methods primarily through the use of a bad character shift table. This table allows the algorithm to skip several characters in the text after a mismatch, significantly reducing the number of comparisons needed to find the pattern.
Table of Contents
Horspool's Algorithm
From the concept of Computer Science, The Boyer-Moore Horspool algorithm, also known as Horspool’s algorithm is used for locating substrings within strings. Nigel Horspool published it as SBM in 1980.
Consider, as an example, searching for the pattern BARBER in some text:
s0 … c … sn−1
B A R B E R
s0 … c … sn−1
B A R B E R
- Starting with the last R of the pattern and moving right to left, we compare the corresponding pairs of characters in the pattern and the text.
- If all the pattern’s characters match successfully, a matching substring is found.
- Then the search can be either stopped altogether or continued if another occurrence of the same pattern is desired.
If a mismatch occurs, we need to shift the pattern to the right. Clearly, we would like to make as large a shift as possible without risking the possibility of missing a matching substring in the text.
Horspool’s algorithm determines the size of such a shift by looking at the character c of the text that is aligned against the last character of the pattern. This is the case even if character c itself matches its counterpart in the pattern.
In general, the following four possibilities can occur.
Case 1: If there are no c’s in the pattern – e.g., c is letter S in our example – we can safely shift the pattern by its entire length (if we shift less, some character of the pattern would be aligned against the text’s character c that is known not to be in the pattern):
s0 … S … sn−1
||
B A R B E R
B A R B E R
s0 … S … sn−1
||
B A R B E R
B A R B E R
Case 2: If there are occurrences of character c in the pattern but it is not the last one there – e.g., c is letter B in our example – the shift should align the rightmost occurrence of c in the pattern with the c in the text:
s0 … B … sn−1
||
B A R B E R
B A R B E R
s0 … B … sn−1
||
B A R B E R
B A R B E R
Case 3: If c, happens to be the last character in the pattern but there are no c’s among its other m − 1 characters – e.g., c is letter R in our example – the situation is similar to that of Case 1 and the pattern should be shifted by the entire pattern’s length m:
s0 … M E R … sn−1
|| || ||
L E A D E R
L E A D E R
s0 … M E R … sn−1
|| || ||
L E A D E R
L E A D E R
Case 4: Finally, if c happens to be the last character in the pattern and there are other c’s among its first m − 1 characters—e.g., c is letter R in our example— the situation is similar to that of Case 2 and the rightmost occurrence of c among the first m − 1 characters in the pattern should be aligned with the text’s c:
s0 … A R … sn−1
|| ||
R E O R D E R
R E O R D E R
s0 … A R … sn−1
|| ||
R E C O D E R
R E C O D E R
These examples clearly demonstrate that right-to-left character comparisons can lead to further shifts of the pattern than the shifts by only one position always made by the brute-force algorithm.
For example, for the pattern BARBER, all the table’s entries will be equal to 6, except for the entries for E, B, R, and A, which will be 1, 2, 3, and 4, respectively.
ALGORITHM
ShiftTable(P[0..m − 1])
//Fills the shift table used by Horspool’s and Boyer-Moore algorithm
//Input: Pattern P[0..m − 1] and an alphabet of possible characters
//Output: Table[0..size − 1] indexed by the alphabet’s characters and filled with shift sizes computed by formula (1)
for i ← 0 to size − 1 do
Table[i]← m
for j ← 0 to m − 2 do
Table[P[j]]← m − 1 − j
return Table
Summarization of Horspool's Algorithm
- Step 1 For a given pattern of length m and the alphabet used in both the pattern and text, construct the shift table as described above.
- Step 2 Align the pattern against the beginning of the text.
- Step 3 Repeat the following until either a matching substring is found or the pattern reaches beyond the last character of the text.
Starting with the last character in the pattern, compare the corresponding characters in the pattern and text until either all m characters are matched (then stop) or a mismatching pair is encountered.
Pseudocode of Horspool's Algorithm in DAA
HorspoolMatching(P[0..m − 1], T[0..n − 1])
//Implements Horspool’s algorithm for string matching
//Input: Pattern P[0..m − 1] and text T[0..n − 1]
//Output: The index of the left end of the first matching substring
// or −1 if there are no matches
ShiftTable(P[0..m − 1]) //generate Table of shifts
i ← m − 1 //position of the pattern’s right end
while i ≤ n − 1 do
k ← 0 //number of matched characters
while k ≤ m − 1 and P[m − 1 − k] = T [i − k] do
k ← k + 1
if k = m
return i − m + 1
else i ← i + Table[T [i]]
return −1
Horspool's Visualization Tool
Custom Test Cases
3 Major Reasons to Use Horspool's Algorithm
- Efficiency in Average Case Scenarios: In average case scenarios, Horspool’s algorithm usually beats more straightforward algorithms such as the string search. The number of comparisons between the pattern and text is decreased because it avoids areas of the text where a match is not possible. This effectiveness is particularly clear when working with lengthy texts and irregularly recurring patterns.
- Implementation Simplicity: It is easier to learn and apply Horspool’s algorithm than the complete Boyer-Moore algorithm. It keeps the Boyer-Moore approach’s effective bad-character shift but improves the good-suffix shift to make it easier to implement without substantially sacrificing the performance.
- Effectiveness with English Text: The algorithm performs best when applied to English text or in situations where the size of the alphabet is not large. In these cases, the algorithm’s processing stage is fast, and the bad-character rule significantly enhances the ability to skip irrelevant comparisons.
Comparison with Other Algorithms
While similar to Boyer-Moore, Horspool simplifies by using fewer heuristics, making it faster for small alphabets but less optimal in larger datasets. Compared to Knuth-Morris-Pratt, it excels in practice but lacks theoretical guarantees of best-case performance.
Key Applications:
Horspool’s algorithm is widely used in:
- Text search utilities
- Data compression
- Plagiarism detection tools
Related Articles
- Huffman Coding Algorithm
- 0/1 Knapsack Problem using Branch and Bound
- Kruskal’s Algorithm Minimum Spanning Tree
- Floyd-Warshall Algorithm
- Radix Sort Algorithm
- Heap Sort Algorithm
- Regex Generator
- Whack-a-Mole Game using JavaScript
- Karnaugh Map Solver
- Prim’s Algorithm in DAA
- Clipdrop AI
- ISTQB Certification cost in India
- Microsoft Copilot vs GitHub Copilot
Frequently Asked Questions
For computing the shifts in the Boyer-Moore algorithm’s space complexity by using only the bad-character shift of the window’s rightmost character.
Compared to many other string search algorithms, the Boyer-Moore algorithm has a lower constant factor because it skips parts of the text using the information gathered during the preprocess step. As the pattern length grows, the algorithm generally operates more quickly.
Horspool displays an O(m+σ) time complexity for the preprocessing phase and an O(σ) space complexity for the searching phase. 1. The algorithm of Zhu Takaoka: Rule 2 is used in an algorithm by Zhu-Takaoka titled “On improving the average case of the Boyer-Moore string matching algorithm, 1987.”
It is a simplified version of the Knuth–Morris–Pratt algorithm’s Boyer–Moore string-search method.
Horspool’s algorithm has an average-case time complexity of O(n/m), where n is the length of the text and m is the length of the pattern.
It depends. Horspool is simpler but may be slightly less efficient than Boyer-Moore on larger datasets.
Awesome! This has been a really wonderful post. Many thanks for providing the algorithm.