Post by Clément Goubert on Wed Jan 13, 2016 4:44 pm

I have one question about setting thresholds. I have been playing around a bit with the number of input reads (0.1X / 0.25X / 0.5X). Is there a good strategy to find out whether an analysis with a given threshold is OK or should be re-run with a different threshold? Typically, if I increase the threshold, more TEs are found. But in your experience, is there some sort of observable plateau (while still staying below 1X) that might help to determine the best threshold?

Concerning the threshold, I think exploring several depths of coverage is a good strategy. However, the more reads you put in, the more repeats it will find, because dnaPipeTE assumes that at coverage under 1X only repeated sequences are assembled (which is not always true). Depending on your model species, one way to discriminate is to check whether the amount of annotated TEs plateaus, which would mean you have reached the threshold. But if you work on a species for which the references are weak, this can give false positives (it could wrongly keep annotating non-repeated sequences).
What I did in my case was to compute the N50 of the assembly after each run, using sample sizes from 0.01X to 0.5X (I also compared 1 vs 2 Trinity iterations). My assumption is that, at the beginning, the N50 will increase with the sample size. However, once all the repeated DNA has been assembled, it will begin to assemble non-repeated DNA; the number of contigs will then quickly increase and the N50 should begin to decrease, until you have put in enough reads to assemble the non-repeated fraction well. In my experience, the N50 quickly increased before plateauing, and then began to decrease. So I used the threshold that maximized the N50 (you can find it in the supplementary data of the paper). I also found that the threshold is really variable depending on your model species: 0.1X fits well with large and highly repeated genomes, while 0.25X is a good value for genomes like D. melanogaster. So it is an empirical procedure, and I have not yet found the time to test it precisely on model species for which I have a good expectation of the repeat content.
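
For anyone who wants to try the same N50 scan, here is a minimal Python sketch. It is not part of dnaPipeTE: the function names (fasta_lengths, n50, best_sample_size) are hypothetical, and it assumes you have already collected the contig FASTA produced by each run yourself. N50 is computed with the usual definition: the length of the contig at which the cumulative length, counted from the longest contig down, reaches half of the total assembly length.

```python
from typing import Dict, Iterable, List


def fasta_lengths(lines: Iterable[str]) -> List[int]:
    """Return the length of each sequence found in FASTA-formatted lines."""
    lengths: List[int] = []
    current = None  # length of the record being read, None before the first header
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths: Iterable[int]) -> int:
    """Length of the contig at which the cumulative sum (longest first)
    reaches at least half of the total assembled length; 0 if empty."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if 2 * running >= total:
            return length
    return 0


def best_sample_size(n50_by_coverage: Dict[float, int]) -> float:
    """Return the coverage whose assembly has the highest N50.
    On ties, the first coverage inserted in the dict wins."""
    return max(n50_by_coverage, key=n50_by_coverage.get)


if __name__ == "__main__":
    # Made-up contig lengths per sample size, for illustration only.
    toy_runs = {
        0.05: [900, 700, 400, 300],
        0.10: [1500, 1200, 600, 200],
        0.25: [1500, 800, 500, 300, 200, 200, 100, 100],
    }
    scan = {cov: n50(lens) for cov, lens in toy_runs.items()}
    print(scan, "-> best:", best_sample_size(scan))
```

In practice you would replace the toy lengths with something like fasta_lengths(open(path)) for the contig file of each run, then plot N50 against sample size to look at the whole curve rather than trusting the maximum alone, since a flat plateau can make the exact peak rather arbitrary.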
