A colleague emailed me asking for help with an informatics problem, and I thought it might be useful to post it here. I have been thinking about starting a section of this blog called “Technical Help Requests” or something like that, so I guess this is a test.
Here is the request:
I have the sequences of more than 1 million short repeat elements from a large eukaryotic genome, and I need to find the ~60 bp region that is most conserved among these elements. While they are “repeat elements”, they can be fairly diverse in specific sequence, but I only need the subset that contains the (near-perfect) conserved sequence. What method or software would you recommend to find this region? All the ones I usually use can’t handle that many lines of input.
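To make the question a bit more concrete, here is a rough sketch of one naive way to frame it: stream through the sequences and count how many elements contain each exact 60-mer, then look at the most widely shared ones. This is only an illustration under some assumptions, not a recommendation. It assumes the input is FASTA-like, the function names `read_sequences` and `most_common_kmers` are made up for this post, and it only counts exact matches, so "near-perfect" copies would need something smarter (for example, a shorter `k` as a seed followed by alignment). At the scale described, an in-memory counter like this would probably run out of RAM, so a subsample or a purpose-built k-mer counter would likely be needed.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


def read_sequences(lines: Iterable[str]) -> Iterator[str]:
    """Yield uppercase sequences from FASTA-like lines (headers start with '>')."""
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if chunks:
                yield "".join(chunks).upper()
                chunks = []
        else:
            chunks.append(line)
    if chunks:
        yield "".join(chunks).upper()


def most_common_kmers(
    seqs: Iterable[str], k: int = 60, top: int = 5
) -> List[Tuple[str, int]]:
    """Return the `top` k-mers ranked by how many sequences contain them."""
    counts: Counter = Counter()
    for seq in seqs:
        # Use a set so each element votes at most once per distinct k-mer.
        # Sequences shorter than k contribute nothing.
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return counts.most_common(top)
```

With a (hypothetical) file called `repeats.fa`, usage would look something like `with open("repeats.fa") as fh: print(most_common_kmers(read_sequences(fh)))`. The elements containing the top-ranked 60-mer would then be the subset the request is after, at least for exact copies.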