Robert Eisele
Engineer, Systems Architect and DBA

Codingame: Genome Sequencing

Original Problem

The Goal

You are working as a computer scientist in a laboratory seeking to sequence the genome. A DNA sequence is represented by a character string (of A, C, T and G) such as GATTACA. The problem is that biologists are only able to extract sub-sequences of the complete sequence. Your role is to combine these partial sub-sequences to recover the original sequence.

In this exercise you are asked to calculate the length of the shortest sequence that contains all the sub-sequences of the input data.

Rules

You are given N sub-sequences and you must return the length of the shortest sequence that contains all the sub-sequences. There may be several sequences of the same minimum length and which fit the requirement. We are not asking you to list these, but only to return their length.

Note that there is always a solution. One can indeed simply concatenate all the sub-sequences to obtain a valid sequence. But by nesting (even partially) the sub-sequences, it is generally possible to obtain a shorter sequence (see the example).

Example

For example, you have three sub-sequences AGATTA, GATTACA, and TACAGA. The sequence AGATTACAGA is the shortest sequence that contains all the sub-sequences.

Note that in this example, there are other sequences which contain all of the sub-sequences such as TACAGATTACAGATTA.. However, we prefer the former because it is shorter (10 characters instead of 16).
Example
AGATTACAGA contains the sub-sequences AGATTA, GATTACA, and TACAGA.

Game Input

Input

Line 1: The number N of sub-sequences

N following lines: one sub-sequence by line, represented by a string of characters from A, C, T and G. Each sub-sequence ranges from 1 to maximum 10 characters long.

Output
The length of the shortest sequence containing all the sub-sequences.
Constraints
0 < N < 6

Solution

I had a hard time finding a solution for this problem, since I wanted to generalize it and walked from a naive approach to a suffix-tree approach. I then started googling for substring problems and eventually found that our problem is known as Shortest common supersequence problem. This problem was proofed to be NP-complete and even dynamic programming solutions still remain exponentially [1]. The trick here is the rather small constraint of \(0<N<6\). Using it, it allows us to brute-force a solution by permuting all possible orders and find the smallest matching solution. We start by implementing a function to permute an array:

function permutate(arr) {
  var res = [];

  function worker(arr, tmp) {

    for (var i = 0; i < arr.length; i++) {
      var cur = arr.splice(i, 1);
      if (arr.length === 0) {
        res.push(tmp.concat(cur));
      }
      worker(arr.slice(), tmp.concat(cur));
      arr.splice(i, 0, cur[0]);
    }
    return res;
  }
  return worker(arr, []);
}

For the AGATTACAGA example this gives the following permutations:

AGATTA,GATTACA,TACAGA
AGATTA,TACAGA,GATTACA
GATTACA,AGATTA,TACAGA
GATTACA,TACAGA,AGATTA
TACAGA,AGATTA,GATTACA
TACAGA,GATTACA,AGATTA

When we take the first element of the first row and try to find the first offset to match the second element, we see easily it's one. When we append the remaining characters CA to the first string and continue with third element, we see that the matching pattern is TACA with an offset of 5 and the not matching characters GA that need to be appended. We need to take care of one special case, where the string from the list is already contained in the first string, like GAT in AGATTA. Implementing the routine is then not that hard anymore:

function analyze(perm) {

  var minLen = Infinity;
  // Loop over all permutations
  for (var cur of perm) {

    // Take the first element of a permutation
    var str = cur[0];
    // Try to append it to the first possible position
    for (var j = 1; j < cur.length; j++) {

      // Loop over the cumulative string
      for (var i = 0; i < str.length; i++) {

        // If an element is fully embedded in the already known string, nothing to do
        if (str.indexOf(cur[j]) !== -1) {
          break;
        } else if (str.slice(i) === cur[j].slice(0, str.length - i)) {
          // Append the characters, that are not part of the string already
          str+= cur[j].slice(str.length - i);
          break;
        }
      }
      // If we couldn't find a match, append it to the end
      if (i === str.length) {
        str+= cur[j];
      }
    }

    // Update the total minimum length
    minLen = Math.min(str.length, minLen);
  }
  return minLen;
}

var seqs = [];

var N = +readline();
for (var i = 0; i < N; i++) {
  seqs.push(readline());
}

print(analyze(permutate(seqs)));

[1] T. Jiang and M. Li On the Approximation of Shortest Common Supersequences and Longest Common Subsequences (1995)

« Back to problem overview

All images are copyright to Codingame