It is easier to write an incorrect program than understand a correct one.

Sunday, February 27, 2011

Edit Distance problem

Edit Distance problem (http://en.wikipedia.org/wiki/Edit_distance):
In information theory and computer science, the edit distance between two strings of characters is the minimum number of operations required to transform one of them into the other. There are several different ways to define an edit distance, depending on which edit operations are allowed: replace, delete, insert, transpose, and so on. There are algorithms to calculate its value under the various definitions:

Most popular:
a) Levenshtein distance
 [http://en.wikipedia.org/wiki/Levenshtein_distance]
- Take two strings: S of length m and T of length n.
- Allocate a matrix d of size (m+1) × (n+1).
- Fill the first column with 0...m and the first row with 0...n.
- Propagate through the matrix (i = 1...m, j = 1...n):
           *  if S[i] = T[j]:   d[i, j] = d[i-1, j-1]    // its upper-left element;
                                                         // no operation required
           *  otherwise:   d[i, j] = min(
                                      d[i-1, j] + 1,     // a deletion
                                      d[i, j-1] + 1,     // an insertion
                                      d[i-1, j-1] + 1    // a substitution
                                   )
The final result is the bottom-right element d[m, n]. Each cell takes the smallest of its three "previous" neighbours plus 1 (or the diagonal value unchanged on a match).
Draw the picture below and it becomes very easy to see. While drawing, keep three different arrows denoting which neighbour each cell's value came from; the arrows correspond to the operations (deletion, insertion, substitution) listed above.

        k  i  t  t  e  n
     0  1  2  3  4  5  6
s    1  1  2  3  4  5  6
i    2  2  1  2  3  4  5
t    3  3  2  1  2  3  4
t    4  4  3  2  1  2  3
i    5  5  4  3  2  2  3
n    6  6  5  4  3  3  2
g    7  7  6  5  4  4  3
        S  a  t  u  r  d  a  y
     0  1  2  3  4  5  6  7  8
S    1  0  1  2  3  4  5  6  7
u    2  1  1  2  2  3  4  5  6
n    3  2  2  2  3  3  4  5  6
d    4  3  3  3  3  4  3  4  5
a    5  4  3  4  4  4  4  3  4
y    6  5  4  4  5  5  5  4  3
Filling the table takes O(mn) operations. Since cell (i, j) depends only on row i-1 and the current row up to j-1, we can keep just two rows instead of the full matrix, reducing the space to O(n).
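The recipe above fits in a few lines of Python. A minimal sketch (the function name `levenshtein` is mine), keeping only two rows as the space-saving note suggests:

```python
def levenshtein(s, t):
    """Edit distance between strings s and t, storing only two rows."""
    m, n = len(s), len(t)
    prev = list(range(n + 1))              # row 0: distance from "" to t[:j]
    for i in range(1, m + 1):
        curr = [i] + [0] * n               # d[i][0] = i deletions
        for j in range(1, n + 1):
            if s[i - 1] == t[j - 1]:
                curr[j] = prev[j - 1]      # match: copy the diagonal
            else:
                curr[j] = 1 + min(prev[j],      # deletion
                                  curr[j - 1],  # insertion
                                  prev[j - 1])  # substitution
        prev = curr
    return prev[n]
```

Both worked examples above check out: `levenshtein("kitten", "sitting")` and `levenshtein("Saturday", "Sunday")` each give 3.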

b) Longest common subsequence problem 
[http://en.wikipedia.org/wiki/Longest_common_subsequence_problem]

The difference between Levenshtein distance and the longest common subsequence: the former measures approximate string matching, and is often used in spell checking to find the cheapest way to correct a word in a sentence. I personally think the best way to study this problem is to follow the instructions in the wiki and draw the picture:

Traceback example
     Ø    A    G    C    A    T
Ø    0    0    0    0    0    0
G    0   ←↑0   ↖1   ←1   ←1   ←1
A    0   ↖1   ←↑1  ←↑1   ↖2   ←2
C    0   ↑1   ←↑1   ↖2  ←↑2  ←↑2


Then, as an exercise, draw the table and trace the path for:

     | 0 1 2 3 4 5 6 7
     |   M Z J A W X U
-----|-----------------
0    | 0 0 0 0 0 0 0 0
1  X | 0 0 0 0 0 0 1 1
2  M | 0 1 1 1 1 1 1 1
3  J | 0 1 1 2 2 2 2 2
4  Y | 0 1 1 2 2 2 2 2
5  A | 0 1 1 2 3 3 3 3
6  U | 0 1 1 2 3 3 3 4
7  Z | 0 1 2 2 3 3 3 4
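Tables like this one can be checked with a short Python sketch of the standard LCS DP plus traceback (the names are mine, not from the wiki):

```python
def lcs(s, t):
    """Build the LCS length table, then trace back one common subsequence."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s[i - 1] == t[j - 1]:
                d[i][j] = d[i - 1][j - 1] + 1            # diagonal arrow
            else:
                d[i][j] = max(d[i - 1][j], d[i][j - 1])  # up / left arrow
    # traceback from the bottom-right cell, following the arrows
    out, i, j = [], m, n
    while i > 0 and j > 0:
        if s[i - 1] == t[j - 1]:
            out.append(s[i - 1])
            i, j = i - 1, j - 1
        elif d[i - 1][j] >= d[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))
```

For the table above, `lcs("XMJYAUZ", "MZJAWXU")` returns a common subsequence of length 4, matching the bottom-right entry.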

Others :

Hamming distance
Longest common subsequence problem
Levenshtein distance
Damerau–Levenshtein distance
Jaro–Winkler distance
Wagner–Fischer edit distance
Ukkonen's algorithm
Hirschberg's algorithm

# DP: maximum sum submatrix of given matrix
From : http://www.artofproblemsolving.com/Forum/viewtopic.php?p=385650#385650

There are two algorithms for it: one O(N^3) and one O(N^4).
N^4: keep a table S[a][b] holding the sum of all entries from (0,0) to (a,b). Then run the N^4 loop: choose a top-left corner and a bottom-right corner (N^2 choices each) and compute each rectangle's sum in O(1) from S.
N^3: keep column prefix sums: S[m][n] = sum(i=1 to m) a[i][n] (m = row, n = col). Now select any two columns and go down the rows, keeping a running sum of the strip between those columns (Kadane's algorithm): whenever the running sum becomes negative, reset it (drop the rows accumulated so far), and update the maximum whenever it improves. The same idea works for the minimum.


The best algorithm I can think of has a running time of about N^3, which is much better than the N^6 of the trivial algorithm that considers all possible rectangles and computes the sum of each in the most straightforward way, but still unacceptably long for N=8000 (at least for my PC, which can do about 10^6 simple operations per second). So, the algorithm I'm going to describe doesn't solve the problem completely.

Compared to the trivial algorithm, one can make two major improvements. The first one is to replace the given matrix (a_{ij})_{1\leq i,j\leq N} by the matrix of partial sums (S_{mn})_{0\leq m,n\leq N} where S_{mn}=\sum_{i\leq m,j\leq n}a_{ij}. This can be done in N^2 time by replacing the row elements with row partial sums first and then doing the same operation with columns. The point is that the sum of a_{ij} over a rectangle m'< i\leq m'', n'< j\leq n'' can now be computed as S_{m''n''}+S_{m'n'}-S_{m''n'}-S_{m'n''}. This improvement alone reduces the running time of the trivial algorithm from N^6 to N^4.

Now, assume that m' and m'' are fixed (about N^2 possible choices) and our task is to find the best n' and n''. Let B_n=S_{m''n}-S_{m'n}. We need to find n'\leq n'' such that B_{n''}-B_{n'} is as large as possible. To this end, for every k, find M_k=\max_{n\geq k}B_n and \mu_k=\min_{n\leq k}B_n. Also, record the positions where the maximum and the minimum are attained. This can be done in linear time. After that it remains to find the maximum of M_k-\mu_k, which can also be done in linear time. So, the total time is about N^2\cdot N=N^3.




Code:
for i = 1 to N
   sum[i][0] = 0
   for j = 1 to N
      sum[i][j] = matrix[i][j] + sum[i][j - 1]   // row prefix sums
best = -infinity
for i = 1 to N                     // left column of the strip (inclusive)
   for j = i to N                  // right column of the strip (inclusive)
      result = 0                   // prefix sum down the rows of the strip
      minPrefix = 0                // smallest prefix seen so far
      for k = 1 to N
         result = result + sum[k][j] - sum[k][i - 1]
         if result - minPrefix > best then best = result - minPrefix
         if result < minPrefix then minPrefix = result
return best
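The column-pair idea can also be written as runnable Python (a sketch; the names are mine): fix a pair of columns, then scan down the rows keeping the minimum prefix sum seen so far.

```python
def max_sum_submatrix(matrix):
    """Maximum sum over all non-empty rectangular submatrices, O(rows * cols^2)."""
    if not matrix or not matrix[0]:
        return 0
    nrows, ncols = len(matrix), len(matrix[0])
    # row_pref[r][c] = sum of matrix[r][0..c-1]
    row_pref = [[0] * (ncols + 1) for _ in range(nrows)]
    for r in range(nrows):
        for c in range(ncols):
            row_pref[r][c + 1] = row_pref[r][c] + matrix[r][c]
    best = float("-inf")
    for left in range(ncols):                    # left column (inclusive)
        for right in range(left + 1, ncols + 1): # right column (exclusive)
            running, min_prefix = 0, 0
            for r in range(nrows):
                running += row_pref[r][right] - row_pref[r][left]
                best = max(best, running - min_prefix)
                min_prefix = min(min_prefix, running)
    return best
```

On a square N×N matrix this is the O(N^3) algorithm described above: N^2 column pairs times an O(N) scan over rows.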


Also: [Ref] "A sub-cubic time algorithm for the k-maximum subarray problem", which achieves O(n^3 √(log log n / log n) + k log n) for the general problem where overlapping solution arrays are allowed. This complexity is sub-cubic when k = o(n^3 / log n). The best previously known complexities for this problem are O(n^3 + k log n), which is cubic when k = O(n^3 / log n), and O(k n^3 √(log log n / log n)), which is sub-cubic when k = o(√(log n) / log log n).


# To find the Maximum size square sub-matrix for a 1-0 matrix 
Given a binary matrix, find out the maximum size square sub-matrix with all 1s.
For example, consider the below binary matrix.

0  1  1  0  1
1  1  0  1  0
0  1  1  1  0
1  1  1  1  0
1  1  1  1  1
0  0  0  0  0

The maximum square sub-matrix with all set bits is

1  1  1
1  1  1
1  1  1

Algorithm:
Let the given binary matrix be M[R][C]. The idea of the algorithm is to construct an auxiliary matrix S[][] of the same size, in which each entry S[i][j] is the size of the largest square sub-matrix of all 1s that has M[i][j] as its bottom-right entry.

1) Construct a sum matrix S[R][C] for the given M[R][C].
   a) Copy the first row and first column as-is from M[][] to S[][].
   b) For other entries, use the following expressions to construct S[][]:
          If M[i][j] is 1 then
             S[i][j] = min(S[i][j-1], S[i-1][j], S[i-1][j-1]) + 1
          Else /* If M[i][j] is 0 */
             S[i][j] = 0
2) Find the maximum entry in S[R][C].
3) Using the value and coordinates of the maximum entry in S[][], print the
   sub-matrix of M[][].
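These steps can be sketched directly in Python (the helper `max_square` and its return convention are mine): it builds S[][] and reports the side length plus the bottom-right corner of a largest all-1s square, assuming a non-empty matrix.

```python
def max_square(M):
    """Return (size, bottom_row, bottom_col) of a largest all-1s square in M."""
    R, C = len(M), len(M[0])
    S = [row[:] for row in M]          # first row/column stay as copied from M
    for i in range(1, R):
        for j in range(1, C):
            if M[i][j] == 1:
                S[i][j] = min(S[i][j - 1], S[i - 1][j], S[i - 1][j - 1]) + 1
            else:
                S[i][j] = 0
    # step 2: locate the maximum entry of S
    size, bi, bj = 0, 0, 0
    for i in range(R):
        for j in range(C):
            if S[i][j] > size:
                size, bi, bj = S[i][j], i, j
    return size, bi, bj
```

Given the value and coordinates, the square spans rows bi-size+1..bi and columns bj-size+1..bj of M, which is what step 3 prints. On the example matrix above this finds the 3×3 square.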
Some explanations from stackoverflow:
For each of the cells we will keep a counter of how big a square can be made using that cell as top left. Clearly all cells with 0 will have 0 as the count.
Start iterating from bottom right cell and go to bottom left, then go to one row up and repeat.
At each scan do this:
  1. If the cell has 0 assign count=0
  2. If the cell has 1 and is an edge cell (bottom or right edge only), assign count=1
  3. For all other cells, check the count of the cell on its right, right-below, and below. Take the min of them and add 1 and assign that to the count. Keep a global max_count variable to keep track of the max count so far.
At the end of traversing the matrix, max_count will have the desired value.
Complexity is no more than the cost of traversing the matrix.
This is how the matrix looks after the traversal. Values in parentheses are the counts, i.e. the biggest square that can be made using that cell as top-left.
1(1) 0(0) 1(1) 0(0) 1(1) 0(0)
1(1) 0(0) 1(4) 1(3) 1(2) 1(1)
0(0) 1(1) 1(3) 1(3) 1(2) 1(1)
0(0) 0(0) 1(2) 1(2) 1(2) 1(1)
1(1) 1(1) 1(1) 1(1) 1(1) 1(1)

Implementation in Python:

def max_size(mat, ZERO=0):
    """Size of the largest square of ZERO's in the matrix `mat`."""
    nrows, ncols = len(mat), (len(mat[0]) if mat else 0)
    if not (nrows and ncols):
        return 0  # empty matrix or rows
    counts = [[0] * ncols for _ in range(nrows)]
    for i in reversed(range(nrows)):          # for each row, bottom-up
        assert len(mat[i]) == ncols           # matrix must be rectangular
        for j in reversed(range(ncols)):      # for each element in the row
            if mat[i][j] == ZERO:
                counts[i][j] = (1 + min(counts[i][j + 1],       # east
                                        counts[i + 1][j],       # south
                                        counts[i + 1][j + 1])   # south-east
                                if i < nrows - 1 and j < ncols - 1
                                else 1)       # bottom or right edge
    return max(c for row in counts for c in row)
--


♥ ¸¸.•*¨*•♫♪♪♫•*¨*•.¸¸♥

1 comment:

  1. really cool....


    can anyone comment about this book?

    http://www.careermonk.com/?qa=data-structures-and-algorithms-made-easy
