Alg-Code: Check is string is a substring of another string

Tuesday, February 8, 2011

Check is string is a substring of another string

# rolling hash method (Rabin-Karp)
--> searching for multiple target strings
The Rabin–Karp algorithm is a string searching algorithm created by Michael O. Rabin and Richard M. Karp in 1987 that uses hashing to find any one of a set of pattern strings in a text. For text of length n and p patterns of combined length m, its average and best case running time is O(n+m) in space O(p), but its worst-case time is O(nm). In contrast, the Aho–Corasick string matching algorithm has asymptotic worst-time complexity O(n+m) in space O(m).
Rather than pursuing more sophisticated skipping, the Rabin–Karp algorithm seeks to speed up the testing of equality of the pattern to the substrings in the text by using a hash function. A hash function is a function which converts every string into a numeric value, called its hash value; for example, we might have hash("hello")=5. Rabin–Karp exploits the fact that if two strings are equal, their hash values are also equal. Thus, it would seem all we have to do is compute the hash value of the substring we're searching for, and then look for a substring with the same hash value.

The algorithm is as shown:

 1 function RabinKarp(string s[1..n], string sub[1..m])
 2     hsub := hash(sub[1..m]);  hs := hash(s[1..m])
 3     for i from 1 to n-m+1
 4         if hs = hsub
   5             if s[i..i+m-1] = sub
 6                 return i
 7         hs := hash(s[i+1..i+m])
 8     return not found

Lines 2, 5, and 7 each require Θ(m) time. However, line 2 is only executed once, and line 5 is only executed if the hash values match, which is unlikely to happen more than a few times. Line 4 is executed n times, but only requires constant time. So the only problem is line 7.

If we naively recompute the hash value for the substring s[i+1..i+m], this would require Θ(m) time, and since this is done on each loop, the algorithm would require Ω(mn) time, the same as the most naive algorithms. The trick to solving this is to note that the variable hs already contains the hash value of s[i..i+m-1]. If we can use this to compute the next hash value in constant time, then our problem will be solved.

We do this using what is called a rolling hash. A rolling hash is a hash function specially designed to enable this operation. One simple example is adding up the values of each character in the substring. Then, we can use this formula to compute the next hash value in constant time:

 s[i+1..i+m] = s[i..i+m-1] - s[i] + s[i+m]

This simple function works, but will result in statement 5 being executed more often than other more sophisticated rolling hash functions such as those discussed in the next section.

Notice that if we're very unlucky, or have a very bad hash function such as a constant function, line 5 might very well be executed n times, on every iteration of the loop. Because it requires Θ(m) time, the whole algorithm then takes a worst-case Θ(mn) time.

# KMP

# Boyer-Moore (en.wikipedia.org/wiki/Boyer-Moore_string_search_algorithm)
the average-case performance of the algorithm, for a text of length $n$ and a fixed pattern of length $m$ , is $n / m$ : in the best case, only one in $m$ characters needs to be checked.
3n comparision, worst case O(n)

The algorithm preprocesses the target string (key) that is being searched for, but not the string being searched in (unlike some algorithms that preprocess the string to be searched and can then amortize the expense of the preprocessing by searching repeatedly). The execution time of the Boyer-Moore algorithm, while still linear in the size of the string being searched, can have a significantly lower constant factor than many other search algorithms: it doesn't need to check every character of the string to be searched, but rather skips over some of them. Generally the algorithm gets faster as the key being searched for becomes longer. Its efficiency derives from the fact that with each unsuccessful attempt to find a match between the search string and the text it's searching, it uses the information gained from that attempt to rule out as many positions of the text as possible where the string cannot match.

# naive solution which is O(n^2) and requires checking for the target string at every position in the source string.
When we could measure the length(e.g. m) of the "match str" (using strlen or advance the prt + counter), this can be improved to (n-m)^2

1 function NaiveSearch(string s[1..n], string sub[1..m])
  2    for i from 1 to n-m+1
3       for j from 1 to m
4          if s[i+j-1] ≠ sub[j]
5             jump to next iteration of outer loop
  6       return i
7    return not found

#Fast String Searching With Suffix Trees
Notes:very nice post on introducing the Suffix Trees

Alg-Code

Pages

Welcome~~~

Tuesday, February 8, 2011

Check is string is a substring of another string

No comments:

Post a Comment