Longest Common Subsequence

...with an unconventional approach

Page content

The longest common subsequence (LCS) problem is typically solved using the classic dynamic programming approach: bottom-up with memo. Here, we attempt an atypical implementation which solves the LCS problem across three sequences using a sparse memo.

Problems amenable to dynamic programming have to meet two important criteria:

  • Optimal substructure: Solving a subproblem optimally, contributes to the optimal solution of the larger problem. The longest subsequence terminating at a particular point in the string, is made up of ‘smaller’ longest subsequences terminating at successively preceding points.

  • Overlapping subproblems: The longest subsequence is made up of constituent subsequences which are the longest at their respective points of termination. As the subsequence grows, we recursively or iteratively search for the longest preceding subsequence and end up solving the same subproblems over and over.

The Implementation

We share the implementation in four parts. The first part below is the function definition of LCS and reading in the arguments.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
#! /usr/bin/env python3

import random

def longest_subsequence(s1, s2, s3):
    '''Input: three sequences.  Repeat one sequence if you have only two.
       Output: tuple with length of LCS, matching subsequence indices in
               sequence 1,2 and 3 respectively.
    '''

    ls1 = [None] + list(s1).copy()   # padding to help with edge conditions
    ls2 = [None] + list(s2).copy()
    ls3 = [None] + list(s3).copy()
        
    m, n, r = len(ls1), len(ls2), len(ls3)

Populating the solution space

The second part below implements the sparse data structure (memo), the solution search (max3d) and the nested loop that implements the recurrence relation for LCS.

Our memo is sparse in the sense we keep track of only the intersections and the length of the longest subsequence terminating at that point as a 4-tuple. Each 4-tuple: (l,i,j,k) represents the longest subsequence terminating at the intersection (i,j,k) with its length l. These are the solutions to our subproblems.

A subsequence has to necessarily end at the matching intersection of strings (3 in our case). Thus, finding the longest subsequence at an intersection (i,j,k) involves finding the longest subsequence among all preceding/prior intersections and adding 1 to it.

Intuition: Finding this longest preceding subsequence involves locating the 4-tuple in the cuboid space (because we have 3 strings) enclosed by the planes x=0,y=0,z=0 and x=i,y=j,z=k, that corresponds to the longest preceding subsequence. This is an implementation of our recurrence relation.

Note that x,y,z axes represent the 1-based indices of the three respective strings layed out along these axes. It is also important to have a base-case tuple (0,0,0,0) to initialize our sparse memo.

16
17
18
19
20
21
22
23
24
25
26
    memo = [(0,0,0,0)] # sparse--will have info only on non-zero matches.    
    # Searches for and returns the longest substring in the space preceding a triple intersection.
    max3d = lambda i,j,k: max([tup for tup in memo if (
                          tup[1] < i  and tup[2] < j and tup[3] < k)],
                          default=(0,0,0,0))   

    for i in range(1,m):
        for j in range(1,n):
            for k in range(1,r):
                if (ls1[i] == ls2[j]) and (ls2[j] == ls3[k]):  # match at i,j,k
                    memo.append((max3d(i,j,k)[0]+1, i, j, k))

Identifying the solution

We have the memo filled up now, with tuples that have all longest subsequence terminations at each intersection (i,j,k). All we have to do is find the tuple with the maximum LCS length. However, we do more: we identify the indices of the LCS in each of the three strings and thus the literal LCS as well.

The while loop starts with the maximum value of the LCS and works its way back to the previous termination iteratively until it reaches the first element in the sequence using the same solution search approach (max3d).

27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
    res = []   # Structure to return result
    p,q,v = m,n,r
    
    last_term = max(memo) # tuple with longest subseq length and coordinates
    res.append(last_term) # append to result data

    p,q,v = last_term[1], last_term[2], last_term[3]

    while (maxtup := max3d(p,q,v))[0] > 0:
        res.append(maxtup)
        p,q,v = maxtup[1],maxtup[2],maxtup[3]
        
    if len(res) == 1:  # No common sequence found
        return (0,[],[],[])
        
    res = res[::-1]   # Reverse to get ascending order
    # array of subseq indices in s1,s2,s3
    _, is1, is2, is3 = list(zip(*res)) 
    
    # Convert these to 0 based indices
    is1 = [e-1 for e in is1]
    is2 = [e-1 for e in is2]
    is3 = [e-1 for e in is3]
        
    return(len(is1), is1, is2, is3)

To exercise our code, we create three random substrings and get the LCS common to them.

52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
                        
if __name__ == "__main__":
    # Generate three random alphanumeric sequences
    # numerals, upper and lower case    
    chars = [chr(c+48) for c in range(10)] + \
            [chr(c+65) for c in range(26)] + \
            [chr(c+97) for c in range(26)]

    rnd_str = lambda length: [random.choice(chars) for _ in range(length)]
    s1,s2,s3 = rnd_str(70), rnd_str(75), rnd_str(80)

    # Test case for non-overlapping strings
    # s1,s2,s3 = "abcdefg","hijklmn","opqrstu"
    
    print("The three strings:")
    print(''.join(s1),''.join(s2),''.join(s3), sep='\n' )
    
    print("\nLength of longest subsequence, its indices in s1,s2,s3")
    print(result := longest_subsequence(s1,s2,s3))
    # To get the literal subsequence, filter (say) s1 using its lcs indices
    print("Longest subsequence in s1,s2,s3: ",[s1[i] for i in result[1]])    

The output of the program is:

The three strings:
aYM7zR76GWMyWmdwpb98syIZ3UgDP9OUKdgQuqM4JqGNpAdnpe9qtvZX3gidinQqxqa7vC
cVTJwuwMXl4sO1AopWUdjlyhrsdztMRhsSnjmEe5bFce9VjxIqLg2zagKNcnkcZHlqdsFPMM8aT
r4WTlMZnA2MRIydpRgfh5oxnBWrHxhsi20IPhAzXDjBN9k41rPNGXHum2AvRvUtZgVOT8LjUYZLmVAci

Length of longest subsequence, its indices in s1,s2,s3
(7, [10, 11, 14, 20, 29, 43, 54], [7, 22, 26, 32, 44, 57, 62], [10, 13, 14, 30, 44, 50, 73])
Longest subsequence in s1,s2,s3:  ['M', 'y', 'd', 's', '9', 'N', 'Z']

Takeaways

We have here, an implementation of LCS that is frugal on memory and extensible across two or more strings. The bottleneck is the nested loop that checks for intersection across the lists. We have also illustrated an intuitive conceptualization of the LCS problem and the exploration of its solution space.