# Codingame Solution: Genome Sequencing

## The Goal

In this exercise you are asked to calculate the length of the shortest sequence that contains all the sub-sequences of the input data.

## Rules

`N`sub-sequences and you must return the length of the shortest sequence that contains all the sub-sequences. There may be several sequences of the same minimum length and which fit the requirement. We are not asking you to list these, but only to return their length.

Note that there is always a solution. One can indeed simply concatenate all the sub-sequences to obtain a valid sequence. But by nesting (even partially) the sub-sequences, it is generally possible to obtain a shorter sequence (see the example).

## Example

Note that in this example, there are other sequences which contain all of the sub-sequences such as TACAGATTACAGATTA.. However, we prefer the former because it is shorter (10 characters instead of 16).

## Game Input

**Input**

Line 1: The number `N` of sub-sequences

`N` following lines: one sub-sequence by line, represented by a string of characters from A, C, T and G. Each sub-sequence ranges from 1 to maximum 10 characters long.

**Output**

`N`< 6

## Solution

I had a hard time finding a solution for this problem, since I wanted to generalize it and walked from a naive approach to a suffix-tree approach. I then started googling for substring problems and eventually found that our problem is known as **Shortest common supersequence problem**. This problem was proofed to be NP-complete and even dynamic programming solutions still remain exponentially [1]. The trick here is the rather small constraint of \(0<N<6\). Using it, it allows us to brute-force a solution by permuting all possible orders and find the smallest matching solution. We start by implementing a function to permute an array:

function permutate(arr) { var res = []; function worker(arr, tmp) { for (var i = 0; i < arr.length; i++) { var cur = arr.splice(i, 1); if (arr.length === 0) { res.push(tmp.concat(cur)); } worker(arr.slice(), tmp.concat(cur)); arr.splice(i, 0, cur[0]); } return res; } return worker(arr, []); }

For the `AGATTACAGA` example this gives the following permutations:

AGATTA,GATTACA,TACAGA AGATTA,TACAGA,GATTACA GATTACA,AGATTA,TACAGA GATTACA,TACAGA,AGATTA TACAGA,AGATTA,GATTACA TACAGA,GATTACA,AGATTA

When we take the first element of the first row and try to find the first offset to match the second element, we see easily it's one. When we append the remaining characters `CA` to the first string and continue with third element, we see that the matching pattern is `TACA` with an offset of 5 and the not matching characters `GA` that need to be appended. We need to take care of one special case, where the string from the list is already contained in the first string, like `GAT` in `AGATTA`. Implementing the routine is then not that hard anymore:

function analyze(perm) { var minLen = Infinity; // Loop over all permutations for (var cur of perm) { // Take the first element of a permutation var str = cur[0]; // Try to append it to the first possible position for (var j = 1; j < cur.length; j++) { // Loop over the cumulative string for (var i = 0; i < str.length; i++) { // If an element is fully embedded in the already known string, nothing to do if (str.indexOf(cur[j]) !== -1) { break; } else if (str.slice(i) === cur[j].slice(0, str.length - i)) { // Append the characters, that are not part of the string already str+= cur[j].slice(str.length - i); break; } } // If we couldn't find a match, append it to the end if (i === str.length) { str+= cur[j]; } } // Update the total minimum length minLen = Math.min(str.length, minLen); } return minLen; } var seqs = []; var N = +readline(); for (var i = 0; i < N; i++) { seqs.push(readline()); } print(analyze(permutate(seqs)));

[1] T. Jiang and M. Li On the Approximation of Shortest Common Supersequences and Longest Common Subsequences (1995)

All images are copyright to Codingame