Back to Sliding Window
Repeated DNA Sequences
Repeated DNA Sequences
š§© Problem Statement
The DNA sequence is composed of a series of nucleotides abbreviated as 'A', 'C', 'G', and 'T'.
Given a string s that represents a DNA sequence, return all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule. You may return the answer in any order.
š¹ Example 1:
Input:
s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"
Output:
["AAAAACCCCC","CCCCCAAAAA"]
š¹ Example 2:
Input:
s = "AAAAAAAAAAAAA"
Output:
["AAAAAAAAAA"]
š Approaches
1. Sliding Window with Hash Set
We need to check every substring of length 10.
- Use a sliding window of size
L = 10. - Maintain a
seenhash set to track substrings we've encountered. - Maintain a
repeatedhash set to track the answers (to avoid duplicates in the output). - Iterate through the string, extracting the substring at each position
i. - If substring is in
seen, add torepeated. - Else, add to
seen.
⨠Intuition
This is a direct application of checking "have I seen this before?". Since the window size is fixed and small (10), extracting and hashing the substring is efficient enough for typical constraints ($N \le 10^5$).
š„ Algorithm Steps
- Create
seenset andrepeatedset. - Iterate
ifrom0ton - 10. - Extract
sub = s.substring(i, i + 10). - If
subinseen:
- Add
subtorepeated.
- Else:
- Add
subtoseen.
- Return elements of
repeated.
ā³ Time & Space Complexity
- Time Complexity: $O(N \cdot L)$, where $N$ is string length and $L=10$.
- Space Complexity: $O(N \cdot L)$, to store the substrings in the sets.
š Code Implementations
C++
#include <iostream>
#include <vector>
#include <string>
#include <unordered_set>
using namespace std;
class Solution {
public:
vector<string> findRepeatedDnaSequences(string s) {
unordered_set<string> seen, repeated;
if (s.length() < 10) return {};
for (int i = 0; i <= s.length() - 10; ++i) {
string sub = s.substr(i, 10);
if (seen.count(sub)) {
repeated.insert(sub);
} else {
seen.insert(sub);
}
}
return vector<string>(repeated.begin(), repeated.end());
}
};
Python
class Solution:
def findRepeatedDnaSequences(self, s: str) -> List[str]:
seen, repeated = set(), set()
for i in range(len(s) - 9):
sub = s[i : i+10]
if sub in seen:
repeated.add(sub)
else:
seen.add(sub)
return list(repeated)
Java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
class Solution {
public List<String> findRepeatedDnaSequences(String s) {
Set<String> seen = new HashSet<>();
Set<String> repeated = new HashSet<>();
for (int i = 0; i <= s.length() - 10; i++) {
String sub = s.substring(i, i + 10);
if (seen.contains(sub)) {
repeated.add(sub);
} else {
seen.add(sub);
}
}
return new ArrayList<>(repeated);
}
}
š Real-World Analogy
Plagiarism Detection:
Imagine scanning a long essay. You look at every sequence of 10 words.
- You write down every 10-word phrase you see.
- If you encounter a phrase you've already written down, you flag it as "repeated" (potential plagiarism or just a common phrase).
šÆ Summary
ā
Fixed Window: Window size is constant (10).
ā
Hashing: Using hashing allows $O(1)$ (amortized) lookup to check for duplicates.
Solution Code
O(n) TimeO(1) Space
#include <algorithm>
#include <iostream>
#include <string>
#include <unordered_set>
#include <vector>
using namespace std;
class Solution {
public:
vector<string> findRepeatedDnaSequences(string s) {
unordered_set<string> seen, repeated;
if (s.length() < 10)
return {};
for (int i = 0; i <= s.length() - 10; ++i) {
string sub = s.substr(i, 10);
if (seen.count(sub)) {
repeated.insert(sub);
} else {
seen.insert(sub);
}
}
return vector<string>(repeated.begin(), repeated.end());
}
};
int main() {
Solution sol;
string s1 = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT";
vector<string> res1 = sol.findRepeatedDnaSequences(s1);
cout << "Repeated 1: ";
for (const string &seq : res1)
cout << seq << " "; // Expected: AAAAACCCCC CCCCCAAAAA (order varies)
cout << endl;
string s2 = "AAAAAAAAAAAAA";
vector<string> res2 = sol.findRepeatedDnaSequences(s2);
cout << "Repeated 2: ";
for (const string &seq : res2)
cout << seq << " "; // Expected: AAAAAAAAAA
cout << endl;
return 0;
}
SYSTEM STABLEUTF-8 ⢠STATIC_RENDER