Edit Distance problems (http://en.wikipedia.org/wiki/Edit_distance):
In information theory and computer science, the edit distance between two strings of characters is the minimum number of operations required to transform one of them into the other. There are several different ways to define an edit distance, depending on which edit operations are allowed: replace, delete, insert, transpose, and so on. There are algorithms to calculate its value under the various definitions:
Most popular:
a) Levenshtein distance
[http://en.wikipedia.org/wiki/Levenshtein_distance]
- given 2 strings, one (S) of length m, one (T) of length n,
- generate a matrix d[(m+1)*(n+1)]
- fill up the first column with 0 -> m and the first row with 0 -> n
- do the following propagation over the matrix (i = 1...m, j = 1...n):
* if S[i] == T[j]: d[i, j] = d[i-1, j-1] // its upper-left element;
// no operation required
* if not: d[i, j] = min(
d[i-1, j] + 1, // a deletion
d[i, j-1] + 1, // an insertion
d[i-1, j-1] + 1 // a substitution
)
i.e. the smallest of the 3 "previous directions" plus 1.
The final result is the bottom-right element of the matrix.
Draw the table and it becomes very easy to see; when drawing, we can keep 3 different arrows denoting which neighbor each cell's value was taken from, since these correspond to the operations (deletion, insertion, substitution) shown above.
Filling the table takes O(m*n) operations, and since row i only depends on row i-1, we can keep just the previous row and the current row to save space.
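The recurrence above, with the two-row space saving, can be sketched in Python (an illustrative sketch; `levenshtein` is just a name chosen here, not a library function):

```python
def levenshtein(s, t):
    """Edit distance between s and t, keeping only two rows of the table."""
    m, n = len(s), len(t)
    prev = list(range(n + 1))           # row 0: distance from "" to t[:j]
    for i in range(1, m + 1):
        curr = [i] + [0] * n            # first column: distance from s[:i] to ""
        for j in range(1, n + 1):
            if s[i - 1] == t[j - 1]:
                curr[j] = prev[j - 1]              # no operation required
            else:
                curr[j] = min(prev[j] + 1,         # deletion
                              curr[j - 1] + 1,     # insertion
                              prev[j - 1] + 1)     # substitution
        prev = curr
    return prev[n]
```

For example, `levenshtein("kitten", "sitting")` gives 3 (two substitutions and one insertion).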
b) Longest common subsequence problem
[http://en.wikipedia.org/wiki/Longest_common_subsequence_problem]
The difference between Levenshtein distance and the longest common subsequence is that the former computes approximate string matching; it is often used in spell checking to find a way to change the words in a sentence or two at the least cost.
Personally, I think the best way to study this problem is to follow the instructions in the wiki and draw the picture:
   | Ø | A | G | C | A | T |
---|---|---|---|---|---|---|
 Ø | 0 | 0 | 0 | 0 | 0 | 0 |
 G | 0 | 0 | 1 | 1 | 1 | 1 |
 A | 0 | 1 | 1 | 1 | 2 | 2 |
 C | 0 | 1 | 1 | 2 | 2 | 2 |
Then move on to the exercise of drawing the path for XMJYAUZ vs MZJAWXU:
| 0 1 2 3 4 5 6 7
| M Z J A W X U
-----|-----------------
0 | 0 0 0 0 0 0 0 0
1 X | 0 0 0 0 0 0 1 1
2 M | 0 1 1 1 1 1 1 1
3 J | 0 1 1 2 2 2 2 2
4 Y | 0 1 1 2 2 2 2 2
5 A | 0 1 1 2 3 3 3 3
6 U | 0 1 1 2 3 3 3 4
7 Z | 0 1 2 2 3 3 3 4
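The tables above can be generated with a short DP (an illustrative sketch; `lcs_length` is a name chosen here):

```python
def lcs_length(s, t):
    """Length of the longest common subsequence of s and t."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]  # d[i][j]: LCS of s[:i] and t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s[i - 1] == t[j - 1]:
                d[i][j] = d[i - 1][j - 1] + 1            # extend the match
            else:
                d[i][j] = max(d[i - 1][j], d[i][j - 1])  # drop one character
    return d[m][n]
```

This reproduces the bottom-right cells of the two tables: `lcs_length("GAC", "AGCAT")` is 2 and `lcs_length("XMJYAUZ", "MZJAWXU")` is 4.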
Others :
Hamming distance
Longest common subsequence problem
Levenshtein distance
Damerau–Levenshtein distance
Jaro–Winkler distance
Wagner–Fischer edit distance
Ukkonen's algorithm
Hirschberg's algorithm
# DP: maximum sum submatrix of given matrix
From: http://www.artofproblemsolving.com/Forum/viewtopic.php?p=385650#385650
There are 2 algorithms for it: one O(N^3) and another O(N^4).
N^4: keep S[a][b], the sum of all elements from position (0,0) to (a,b) (2D prefix sums). Then run the N^4 algorithm: select the top-left node and the bottom-right node of the rectangle (about N^4 pairs) and compute its sum in O(1).
N^3: keep track of column prefix sums over the rows, i.e. S[m][n] (m = row, n = col) = sum(i = 1 to m) M[i][n]. Now select any two columns and go through the rows, accumulating the sum of the strip between those columns (Kadane's algorithm): if the running sum becomes negative at any point, discard the rows so far; update your maximum whenever the running sum exceeds it, then continue with the next row. The same idea works for the minimum!
The best algorithm I can think of has running time about O(n^3), which is much better than the O(n^6) of the trivial algorithm that just considers all possible rectangles and computes the sum in each rectangle in the most straightforward way, but still unacceptably long for large n (at least for my PC). So, the algorithm I'm going to describe doesn't solve the problem completely.
Compared to the trivial algorithm, one can make two major improvements. The first one is to replace the given matrix a by the matrix of partial sums P, where P[m][n] = sum of a[i][j] over all i <= m, j <= n. This can be done in O(n^2) time, replacing the row elements by row partial sums first and then doing the same operation with columns. The point is that the sum of a over a rectangle r1 <= i <= r2, c1 <= j <= c2 can now be computed as P[r2][c2] - P[r1-1][c2] - P[r2][c1-1] + P[r1-1][c1-1]. This improvement alone reduces the running time of the trivial algorithm from O(n^6) to O(n^4).
Now, assume that the columns c1 and c2 are fixed (about n^2/2 possible choices) and our task is to find the best rows r1 and r2. Let f(m) = P[m][c2] - P[m][c1-1], the sum of the first m rows restricted to columns c1..c2. We need to find r1 <= r2 such that f(r2) - f(r1-1) is as large as possible. To this end, for every m, find M(m) = max of f(k) over k <= m and m(m) = min of f(k) over k <= m. Also, record the positions where the maximum and the minimum are attained. This can be done in linear time. After that it remains to find the maximum of f(m) - m(m-1) over m, which can also be done in linear time. So, the total time is about O(n^3).
Code:
for i = 1 to N
    sum[i][0] = 0
    for j = 1 to N
        sum[i][j] = matrix[i][j] + sum[i][j - 1]   // row prefix sums
best = 0
for i = 1 to N                        // left column of the rectangle
    for j = i to N                    // right column of the rectangle
        min[0] = 0
        result = 0                    // prefix sum over rows, columns i..j
        for k = 1 to N
            result = result + sum[k][j] - sum[k][i - 1]
            if result - min[k - 1] > best then best = result - min[k - 1]
            if result < min[k - 1] then min[k] = result else min[k] = min[k - 1]
return best
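The pseudocode above can also be written as runnable Python (a sketch of the same O(N^3) scheme: row prefix sums, then a running minimum of the vertical prefix for each column pair; `max_sum_submatrix` is a name chosen here):

```python
def max_sum_submatrix(matrix):
    """Maximum sum over all non-empty rectangular submatrices, in O(N^3)."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    # row_pref[r][c] = sum of matrix[r][0..c-1] (row prefix sums)
    row_pref = [[0] * (n_cols + 1) for _ in range(n_rows)]
    for r in range(n_rows):
        for c in range(n_cols):
            row_pref[r][c + 1] = row_pref[r][c] + matrix[r][c]
    best = float('-inf')
    for left in range(n_cols):                     # left column (inclusive)
        for right in range(left + 1, n_cols + 1):  # right column (exclusive)
            running = 0       # f(k): sum of rows 0..k over columns left..right-1
            min_prefix = 0    # minimum prefix seen so far (empty prefix = 0)
            for r in range(n_rows):
                running += row_pref[r][right] - row_pref[r][left]
                best = max(best, running - min_prefix)
                min_prefix = min(min_prefix, running)
    return best
```

Note this version returns the best non-empty rectangle, so an all-negative matrix yields a negative answer rather than 0.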
Also: [Ref] A sub-cubic time algorithm for the k-maximum subarray problem, which achieves O(n^3 * sqrt(log log n / log n) + k log n) for a general problem where overlapping is allowed for solution arrays. This complexity is sub-cubic when k = o(n^3 / log n). The best known complexities for this problem are O(n^3 + k log n), which is cubic when k = O(n^3 / log n), and O(k * n^3 * sqrt(log log n / log n)), which is sub-cubic when k = o(sqrt(log n / log log n)).
# To find the Maximum size square sub-matrix for a 1-0 matrix
Given a binary matrix, find out the maximum size square sub-matrix with all 1s.
For example, consider the below binary matrix.
0 1 1 0 1
1 1 0 1 0
0 1 1 1 0
1 1 1 1 0
1 1 1 1 1
0 0 0 0 0
The maximum square sub-matrix with all set bits is
1 1 1
1 1 1
1 1 1
Algorithm:
Let the given binary matrix be M[R][C]. The idea of the algorithm is to construct an auxiliary size matrix S[][] in which each entry S[i][j] represents size of the square sub-matrix with all 1s including M[i][j] and M[i][j] is the rightmost and bottommost entry in sub-matrix.
1) Construct a sum matrix S[R][C] for the given M[R][C]:
   a) Copy the first row and first column as-is from M[][] to S[][].
   b) For the other entries, use the following expressions to construct S[][]:
      If M[i][j] is 1 then
         S[i][j] = min(S[i][j-1], S[i-1][j], S[i-1][j-1]) + 1
      Else /* If M[i][j] is 0 */
         S[i][j] = 0
2) Find the maximum entry in S[R][C].
3) Using the value and coordinates of the maximum entry in S[][], print the sub-matrix of M[][].
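The three steps can be sketched directly in Python (an illustrative sketch; `max_square_size` is a name chosen here, and it returns only the side length, not the coordinates from step 3):

```python
def max_square_size(M):
    """Side length of the largest all-1s square sub-matrix of binary matrix M."""
    R, C = len(M), len(M[0])
    S = [row[:] for row in M]        # first row and column copied as-is
    for i in range(1, R):
        for j in range(1, C):
            if M[i][j] == 1:
                # smallest of left, top, top-left neighbors, plus this cell
                S[i][j] = min(S[i][j - 1], S[i - 1][j], S[i - 1][j - 1]) + 1
            else:
                S[i][j] = 0
    return max(max(row) for row in S)
```

On the example matrix above this returns 3, the side of the all-1s square.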
Some explanations from Stack Overflow:
For each cell we keep a count of how big a square can be made using that cell as the top-left corner. Clearly, all cells with 0 will have a count of 0. Start iterating from the bottom-right cell, go to the bottom-left, then move one row up and repeat. At each cell:
- If the cell has 0, assign count = 0.
- If the cell has 1 and is an edge cell (bottom or right edge only), assign count = 1.
- For all other cells, check the counts of the cells to its right, below-right, and below; take the min of them, add 1, and assign that to the count. Keep a global max_count variable to track the maximum count so far.
At the end of traversing the matrix, max_count will have the desired value. The complexity is no more than the cost of traversing the matrix. This is how the matrix will look after the traversal; values in parentheses are the counts, i.e. the biggest square that can be made using that cell as top-left:
1(1) 0(0) 1(1) 0(0) 1(1) 0(0)
1(1) 0(0) 1(4) 1(3) 1(2) 1(1)
0(0) 1(1) 1(3) 1(3) 1(2) 1(1)
0(0) 0(0) 1(2) 1(2) 1(2) 1(1)
1(1) 1(1) 1(1) 1(1) 1(1) 1(1)
Implementation in Python
def max_size(mat, ZERO=0):
    """Find the largest square of ZERO's in the matrix `mat`.
    (Pass ZERO=1 to find the largest square of 1s instead.)"""
    nrows, ncols = len(mat), (len(mat[0]) if mat else 0)
    if not (nrows and ncols):
        return 0  # empty matrix or rows
    counts = [[0] * ncols for _ in range(nrows)]
    for i in reversed(range(nrows)):          # for each row, bottom-up
        assert len(mat[i]) == ncols           # matrix must be rectangular
        for j in reversed(range(ncols)):      # for each element in the row
            if mat[i][j] == ZERO:
                if i < nrows - 1 and j < ncols - 1:
                    counts[i][j] = 1 + min(counts[i][j + 1],      # east
                                           counts[i + 1][j],      # south
                                           counts[i + 1][j + 1])  # south-east
                else:
                    counts[i][j] = 1          # bottom/right edge cells
    return max(c for row in counts for c in row)