RepeatMasker Libraries [SOLVED]

Post by Clément Goubert on Wed Jan 13, 2016 4:33 pm

Here is a discussion about the RepeatMasker library issue. In short, using this library (ftp://pbil.univ-lyon1.fr/pub/divers/goubert/specieslib) or one in the same format will work. See the rest of the post for more information.


I encountered some problems with RepeatMasker and also have a question about the input.

Is there a way to give dnaPipeTE a relative value? Let's say I know my library was sequenced at 5X. Can I tell dnaPipeTE to use 20% of the input reads (this would be 1X), or 10% (0.5X), or 5% (0.25X)? This would save the effort of counting reads.
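The thread does not say whether dnaPipeTE accepts a fraction directly, so one workaround is to compute the read count yourself. The following is a minimal sketch under that assumption; the function names `count_fastq_reads` and `reads_for_coverage` are hypothetical helpers, not part of dnaPipeTE, and the FASTQ file is assumed to use exactly four lines per read.

```python
def count_fastq_reads(path: str) -> int:
    """Count reads in an uncompressed FASTQ file (assumes 4 lines per read)."""
    with open(path) as handle:
        return sum(1 for _ in handle) // 4


def reads_for_coverage(total_reads: int,
                       library_coverage: float,
                       target_coverage: float) -> int:
    """Number of reads to sample so that the sample has target_coverage.

    Example: a 5X library sampled at 1X needs 1/5 = 20% of its reads.
    """
    if library_coverage <= 0:
        raise ValueError("library_coverage must be positive")
    if not 0 < target_coverage <= library_coverage:
        raise ValueError("target_coverage must be in (0, library_coverage]")
    return round(total_reads * target_coverage / library_coverage)


if __name__ == "__main__":
    total = 10_000_000  # illustrative read count, not taken from real data
    for target in (1.0, 0.5, 0.25):
        print(target, reads_for_coverage(total, 5.0, target))
```

The resulting number can then be passed wherever the pipeline expects a read count for sampling.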

And then I have a problem which might be an incompatibility with the latest RepeatMasker version, or rather with the libraries. I used the latest DFAM/RepBase libraries and configured RepeatMasker accordingly.

My RepeatMasker Libraries folder (downloaded DFAM1.4 and the RepBase RepeatMasker Library):

Code:
-bash-4.1$ ls
Dfam.hmm                 RepeatMasker.lib.nhr  RepeatMaskerLib.embl
RepeatAnnotationData.pm  RepeatMasker.lib.nin  old
RepeatMasker.lib         RepeatMasker.lib.nsq  taxonomy.dat
-bash-4.1$ pwd



Now I cannot get dnaPipeTE to run with this library. It always reports that no library is specified.

If I give dnaPipeTE only the library folder, it creates /20140131/general/ within the Library folder, containing only simple sequence repeats and E. coli IS sequences, and then checks for them.
/data/home/btw826/.linuxbrew/Cellar/repeatmasker/4.0.5/libexec/Libraries/20140131/general/

If I specify RepeatMaskerLib.embl directly, the result is the same. If I leave the value empty or give other values, it fails with:
NCBIBlastSearchEngine::search: Error...compressed subject database (/data/home/btw826/programs/dnapipeTE/dnaPipeTE/RM_37361.SatAug151717362015/) does not exist!
at /data/home/btw826/.linuxbrew/Cellar/repeatmasker/4.0.5/bin/RepeatMasker line 2018

Therefore I cannot get it running with this RepeatMasker/Libraries.

Do you have any idea how to fix it? Maybe the program needs to be updated in order to work with newer RepeatMasker versions (I read that older versions made species-specific libraries, whereas newer versions create a single .embl library file).


I tried these config.ini values (among others):

Code:
repeatmasker_library =
DEFAULT
[DEFAULT]
/data/home/btw826/.linuxbrew/Cellar/repeatmasker/4.0.5/libexec/Libraries
/data/home/btw826/.linuxbrew/Cellar/repeatmasker/4.0.5/libexec/Libraries/RepeatMaskerLib.embl
/data/home/btw826/.linuxbrew/Cellar/repeatmasker/4.0.5/libexec/Libraries/RepeatMaskerLib
/data/home/btw826/.linuxbrew/Cellar/repeatmasker/4.0.5/libexec/Libraries/RepeatMaskerLib/



The resulting output when the RepeatMasker step was started:

Code:
RepeatMasker version open-4.0.5
Search Engine: NCBI/RMBLAST [ 2.2.27+ ]
RepeatMasker::setspecies: Could not find user specified library /data/home/btw826/.linuxbrew/Cellar/repeatmasker/4.0.5/libexec/Libraries/RepeatMaskerLib.
cat: /data/home/btw826/archive/dnapipeTEout/Trinity.fasta.out: No such file or directory
Done

NCBIBlastSearchEngine::search: Error...compressed subject database (/data/home/btw826/programs/dnapipeTE/dnaPipeTE/RM_62450.SatAug151458302015/RepeatMaskerLib.embl) does not exist!
 at /data/home/btw826/.linuxbrew/Cellar/repeatmasker/4.0.5/bin/RepeatMasker line 2018.
WARNING: Retrying batch ( 2 ) [ 2,, 79435]...
/data/home/btw826/.linuxbrew/Cellar/repeatmasker/4.0.5/libexec/Libraries/RepeatMaskerLib.embl

RepeatMasker version open-4.0.5
Search Engine: NCBI/RMBLAST [ 2.2.27+ ]
RepeatMasker::setspecies: Could not find user specified library [DEFAULT].
cat: /data/home/btw826/archive/dnapipeTEout/Trinity.fasta.out: No such file or directory
Done

RepeatMasker version open-4.0.5
Search Engine: NCBI/RMBLAST [ 2.2.27+ ]
Master RepeatMasker Database: /data/home/btw826/.linuxbrew/Cellar/repeatmasker/4.0.5/libexec/Libraries/RepeatMaskerLib.embl ( Complete Database: 20140131 )
Custom Repeat Library: /data/home/btw826/.linuxbrew/Cellar/repeatmasker/4.0.5/libexec/Libraries

RepeatMasker version open-4.0.5
Search Engine: NCBI/RMBLAST [ 2.2.27+ ]
RepeatMasker::setspecies: Could not find user specified library DEFAULT.
cat: /data/home/btw826/archive/dnapipeTEout/Trinity.fasta.out: No such file or directory
Done




Concerning your RepeatMasker issue: by default, dnaPipeTE runs RM with a custom library through the -lib option, which points to the "path to RM library" set in dnaPipeTE's config.ini file. This lets you configure config.ini either with the RepBase library (but it needs to be in fasta format) or with your own custom library in fasta format (using the RepeatMasker nomenclature). So to fix your problem, you should have the RepBase library in fasta format and set its path in the config.ini file (as if it were a custom library). Usually the path is /home/crazyname/RepeatMasker-version/Libraries/20XXXXXX/root/specieslib; however, I have not checked recently and it could be different now… If you need it, I can provide the one I am currently using (this is the library version used in the dnaPipeTE paper).
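Before pointing the config at a fasta library, it can help to check that its headers follow the RepeatMasker naming convention for custom libraries, `>name#class/family` (the `/family` part being optional). The sketch below is only an illustration; `check_library_headers` is a hypothetical helper, not something shipped with dnaPipeTE or RepeatMasker.

```python
import re
from typing import Iterable, List

# Matches headers such as ">AluY#SINE/Alu" or ">Sat1#Satellite".
HEADER_RE = re.compile(r"^>(\S+)#([^\s/]+)(?:/(\S+))?")


def check_library_headers(lines: Iterable[str]) -> List[str]:
    """Return the fasta headers that do not look like '>name#class/family'."""
    bad = []
    for line in lines:
        # Only header lines are inspected; sequence lines are ignored.
        if line.startswith(">") and not HEADER_RE.match(line):
            bad.append(line.rstrip("\n"))
    return bad


if __name__ == "__main__":
    example = [
        ">AluY#SINE/Alu\n",
        "ACGTACGT\n",
        ">noclass some description\n",
        "ACGTACGT\n",
    ]
    print(check_library_headers(example))
```

To check a real file, pass an open file handle, for example `check_library_headers(open(path))`, where `path` is the library you intend to set in config.ini.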


The exact format and source of the repeat library used by RM is still quite unclear to me (from both the documentation and your email).

I would indeed appreciate a copy of the library you are using. Then I will have an idea of what it needs to look like.


The only source of a fasta file with the recent RepBase is the REPET edition of RepBase (it contains repbase20.05_ntSeq_cleaned_TE.fa). The newest download, RepBase20.07.fasta, contains only .ref files and cannot be used. The same goes for the RepeatMasker edition (a very old version).


I am not sure where else to get a fasta file with RepeatMasker nomenclature, as the latest version of the RepeatMasker program does not produce one. Or is there a third-party tool that can be used for this?


Using /repbase20.05_ntSeq_cleaned_TE.fasta, it did something, but came up with errors as well:

Code:
awk: (FILENAME=- FNR=1469) fatal: division by zero attempted


A more detailed part of the output:

Code:
Checking for E. coli insertion elements
identifying Simple Repeats in batch 9 of 11
identifying matches to repbase20.05_ntSeq_cleaned_TE.fasta sequences in batch 9 of 11
identifying Simple Repeats in batch 8 of 11
identifying Simple Repeats in batch 7 of 11
identifying Simple Repeats in batch 6 of 11

Checking for E. coli insertion elements
identifying Simple Repeats in batch 10 of 11
identifying Simple Repeats in batch 2 of 11
identifying matches to repbase20.05_ntSeq_cleaned_TE.fasta sequences in batch 10 of 11

Checking for E. coli insertion elements
identifying Simple Repeats in batch 3 of 11
identifying Simple Repeats in batch 11 of 11
identifying Simple Repeats in batch 1 of 11
identifying matches to repbase20.05_ntSeq_cleaned_TE.fasta sequences in batch 11 of 11
identifying Simple Repeats in batch 4 of 11
identifying Simple Repeats in batch 11 of 11
identifying Simple Repeats in batch 9 of 11
identifying Simple Repeats in batch 10 of 11
processing output:
cycle 1 .
cycle 2 .
cycle 3 .
cycle 4 .
cycle 5
cycle 6 .
cycle 7 .
cycle 8 .
cycle 9 .
cycle 10 .
Generating output... .
masking
done
awk: (FILENAME=- FNR=1469) fatal: division by zero attempted
Done
#########################################
### Making contigs annotation from RM ###
#########################################
Done


Making blast sample...
sampling file found, skipping sampling...
number of reads to sample :  550000
fastq :  /data/home/btw826/f1BR1.fastq
total number of reads : 27970177
sampling 1 samples of 550000 reads...
s_f1BR1.fastq_blast done.
#######################################################
### Blast 1 : raw reads against all repeats contigs ###
#######################################################
Blast 1 files found, skipping Blast 1 ...
###################################################
### Blast 2 : raw reads against annoted repeats ###
###################################################
Blast 2 files found, skipping Blast 2 ...
#####################################################
### Blast 3 : raw reads against unannoted repeats ###
#####################################################
Blast 3 files found, skipping Blast 3 ...
#######################################################
### Estimation of Repeat content from blast outputs ###
#######################################################
parsing blastout and adding RM annotations for each read...
Done, results in: blast_out/blastout_final_fmtd_annoted
#########################################
### OK, lets build some pretty graphs ###
#########################################
Drawing graphs...
null device
          1
null device
          1
null device
          1
null device
          1
Error in library(ggplot2) : there is no package called 'ggplot2'
Execution halted
Done


You'll find my library in .fasta format here; this is the one I always use! I actually found it after installing RM and its libraries, in the folder that is, for me:

/panhome/goubert/RepeatMasker/Libraries/20140131/root/specieslib

I will check whether this still exists in the new RM version.

The errors you have are normal: if there is no annotation for a contig, it tries to compute a percentage by dividing by an empty column :) but it still works well!
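The awk command itself is not shown in this thread, so the following is only a hypothetical Python illustration of the failure mode described: a percentage whose denominator column is empty or zero. The function name and the two-column layout are assumptions made for the example.

```python
from typing import Optional


def percent_annotated(annotated_bp: str, contig_bp: str) -> Optional[float]:
    """Return 100 * annotated / contig length, or None if it cannot be computed.

    Both values arrive as text fields; an empty or zero denominator is
    reported as None instead of raising a division-by-zero error.
    """
    try:
        numerator = float(annotated_bp)
        denominator = float(contig_bp)
    except ValueError:
        # An empty or non-numeric column cannot be used in the ratio.
        return None
    if denominator == 0:
        return None
    return 100.0 * numerator / denominator


if __name__ == "__main__":
    print(percent_annotated("50", "200"))
    print(percent_annotated("50", ""))
```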

Also, I recommend removing the output folder or changing the output name when a first attempt did not work. We implemented a check for "already done" files, but sometimes it does not work well (for example, it can see a file that is actually empty or bad and still skip the step). So to be sure that dnaPipeTE runs properly, either change the output name or remove the output folder before a new try.
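If you would rather inspect a previous run before deleting it, a small check for empty files can show which steps probably failed. This is a sketch only; `find_empty_files` is a hypothetical helper, and which files dnaPipeTE actually tests for is not described in this thread.

```python
import os
from typing import List


def find_empty_files(out_dir: str) -> List[str]:
    """Return the paths of all zero-byte files under out_dir, sorted."""
    empty = []
    for root, _dirs, files in os.walk(out_dir):
        for name in files:
            path = os.path.join(root, name)
            if os.path.getsize(path) == 0:
                empty.append(path)
    return sorted(empty)


if __name__ == "__main__":
    # "dnapipeTEout" is an illustrative folder name; use your own output path.
    for path in find_empty_files("dnapipeTEout"):
        print("empty:", path)
```

To start from scratch instead, the whole folder can be removed with `shutil.rmtree(out_dir)` from the standard library before launching a new run.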

