[SOLVED] Sample larger than population


Post by t_schell on Fri Mar 25, 2016 9:17 pm

Dear Clément,
I have an issue when running dnaPipeTE with the options -genome_size and -genome_coverage.
The "number of reads to sample" remains unchanged at the default value 500,000. Shouldn't dnaPipeTE determine the "number of reads to sample" automatically with the genome size and coverage options?
I guess one can change the "number of reads to sample" with the option -sample_size, but in the documentation you state that it should be used
Clément Goubert wrote:without -genome_size and -genome_coverage
The STDOUT of dnaPipeTE is:
Code:
            +@@@@@@@@@@#
             @@@@@@#+;.
              +@@;

           Let's go !!!

Start time: Wed Mar 23 17:53:35 2016
number of reads to sample :  500000
fastq :  /home/tilman/results/2016-03-22_dnapipete/all_trimmed_for_and_unpaired_wo_mt.fq
total number of reads : 499553406
The STDERR is:
Code:
Traceback (most recent call last):
  File "./dnaPipeTE.py", line 625, in <module>
    Sampler = FastqSamplerToFasta(args.input_file, args.sample_size, args.genome_size, args.genome_coverage, args.sample_number, args.output_folder, False)
  File "./dnaPipeTE.py", line 152, in __init__
    self.get_sampled_id(self.fastq_R1)
  File "./dnaPipeTE.py", line 208, in get_sampled_id
    tirages = random.sample(range(np), self.number*self.sample_number)
  File "/usr/local/lib/python3.5/random.py", line 315, in sample
    raise ValueError("Sample larger than population")
ValueError: Sample larger than population
I think the error occurs because I need ca. 3,250,000 reads (genome size 1.3 Gb; coverage 0.25X), but dnaPipeTE should determine this number automatically depending on the lengths of the subsampled reads.
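For reference, that estimate can be sketched with a quick back-of-the-envelope calculation (the 100 bp average read length is my assumption; dnaPipeTE derives the read length from the data itself):

```python
# Rough estimate of the number of reads needed, using the values
# discussed in this thread (1.3 Gb genome, 0.25X coverage) and an
# ASSUMED average read length of 100 bp.
genome_size = 1300000000
genome_coverage = 0.25
read_length = 100  # assumption, not a value from the thread

bases_to_sample = int(genome_size * genome_coverage)  # 325,000,000 bases
reads_to_sample = bases_to_sample // read_length
print(reads_to_sample)  # 3250000, i.e. ca. 3,250,000 reads
```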
I saw in the figure of the publication that there may be a sampling step between the raw reads and the <1X subsampling for the Trinity assembly. Is it true that dnaPipeTE first draws a random sample of a fixed number of reads and then subsamples it again for assembly, instead of sampling at least three times (-sample_number) directly from the raw reads?
Am I doing something wrong?
Btw: I found your suggestion to use the maximum N50 to find the best coverage for subsampling very helpful.


Post by Clément Goubert on Thu Mar 31, 2016 5:23 pm

Hi Tilman,

I also ran into this issue recently. It happens because dnaPipeTE uses the shortest read of the dataset as the basis to estimate the maximum number of reads to sample before sampling. So if you have even a few very short reads, it can conclude that the whole dataset is not large enough.
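To illustrate the failure mode (a toy sketch with made-up numbers, not dnaPipeTE's actual code): when the per-read size estimate comes from one unusually short read, the computed sample size can exceed the number of reads actually present, and random.sample() then raises exactly the error you saw:

```python
import random

total_reads = 1000      # reads actually in the dataset
size_min = 10           # one unusually short read drives the estimate
genome_base = 50000     # genome_size * genome_coverage, in bases

number = genome_base // size_min  # 5000 reads "needed", only 1000 available
try:
    random.sample(range(total_reads), number)
except ValueError as err:
    print("sampler fails:", err)  # "Sample larger than population" on Python 3.5
```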
We have fixed this in the dev branch of the pipeline; however, some other parts of dev are not yet stable enough for a release. Enclosed is the part of the code to modify to fix the bug. In dnaPipeTE.py (v1.1), replace lines 116 to 281 with the following code:

Code:

class FastqSamplerToFasta:
    def __init__(self, fastq_files, number, genome_size, genome_coverage, sample_number, output_folder, blast):
        self.number = int(number)
        self.genome_size = int(genome_size)
        self.genome_coverage = float(genome_coverage)
        self.use_coverage = False
        self.fastq_total_size = 0
        if self.genome_size != 0 and self.genome_coverage > 0.0:
            self.use_coverage = True
            self.genome_base = int(float(self.genome_size) * self.genome_coverage)
        self.sample_number = int(sample_number)
        self.output_folder = output_folder
        if not os.path.exists(self.output_folder):
            os.makedirs(self.output_folder)
        self.fastq_R1 = fastq_files[0]
        self.blast_sufix = ""
        if blast:
            self.blast_sufix = "_blast"
        if len(fastq_files) == 1:
            self.paired = False
        else:
            self.paired = True
            self.fastq_R2 = fastq_files[1]
        self.R1_gz = False
        self.R2_gz = False
        if self.fastq_R1[-3:] == ".gz":
            print("gz compression detected for "+self.fastq_R1)
            self.R1_gz = True
            self.fastq_R1 = self.fastq_R1[:-3]

        if self.paired:
            if self.fastq_R2[-3:] == ".gz":
                print("gz compression detected for "+self.fastq_R2)
                self.R2_gz = True
                self.fastq_R2 = self.fastq_R2[:-3]

        self.files = list()
        if not self.test_sampling(blast):
            self.get_sampled_id(self.fastq_R1)
            print("sampling "+str(self.sample_number)+" samples of "+str(self.number)+" reads...")
            for i in range(self.sample_number):
                if self.R1_gz:
                    self.sampling_gz(self.fastq_R1, i)
                else:
                    self.sampling(self.fastq_R1, i)
                self.files.append("s"+str(i)+"_"+self.path_leaf(self.fastq_R1)+str(self.blast_sufix)+".fasta")
            if self.paired:
                for i in range(self.sample_number):
                    if self.R2_gz:
                        self.sampling_gz(self.fastq_R2, i)
                    else:
                        self.sampling(self.fastq_R2, i)
                    self.files.append("s"+str(i)+"_"+self.path_leaf(self.fastq_R2)+str(self.blast_sufix)+".fasta")

    def result(self):
        return self.files

    def path_leaf(self, path):
        head, tail = ntpath.split(path)
        return tail or ntpath.basename(head)

    def get_sampled_id(self, file_name):
        self.tirages = list()
        tirages = list()
        sys.stdout.write("counting reads number ...")
        sys.stdout.flush()
        if self.use_coverage:
            np = 0
            size_min = 10 ** 30
            if self.R1_gz:
                with gzip.open(file_name+".gz", 'rt') as file1:
                    for line in file1:
                        np += 1
                        if np % 4 == 0:  # one line per read (same length as the sequence)
                            self.fastq_total_size += len(str(line))
                            if len(str(line)) < size_min:
                                size_min = len(str(line))
            else:
                with open(file_name, 'r') as file1:
                    for line in file1:
                        np += 1
                        if np % 4 == 0:  # one line per read (same length as the sequence)
                            self.fastq_total_size += len(str(line))
                            if len(str(line)) < size_min:
                                size_min = len(str(line))
        else:
            if self.R1_gz:
                with gzip.open(file_name+".gz", 'rt') as file1:
                    np = sum(1 for line in file1)
            else:
                with open(file_name, 'r') as file1:
                    np = sum(1 for line in file1)
        np = int(np / 4)
        sys.stdout.write("\rtotal number of reads : "+str(np)+"\n")
        sys.stdout.flush()
        if self.use_coverage:
            if self.fastq_total_size <= self.genome_base:
                sys.exit("not enough bases to sample: "+str(self.fastq_total_size)+" available vs "+str(self.genome_base)+" requested")
            self.number = int(float(self.genome_base)/float(size_min))
        if int(self.number)*int(self.sample_number) > np:
            # cap the sample size so it never exceeds the population
            self.number = int(float(np)/float(self.sample_number))
        print("number of reads to sample : ", str(int(self.number)*int(self.sample_number)), "\nfastq : ", file_name)
        tirages = random.sample(range(np), self.number*self.sample_number)
        for i in range(self.sample_number):
            tirages_sample = tirages[(self.number*i):(self.number*(i+1))]
            tirages_sample.sort()
            self.tirages.extend(tirages_sample)

    def sampling(self, fastq_file, sample_number):
        sys.stdout.write(str(0)+"/"+str(self.number))
        sys.stdout.flush()
        with open(fastq_file, 'r') as fastq_handle:
            self.sampling_sub(fastq_file, fastq_handle, sample_number)

    def sampling_gz(self, fastq_file, sample_number):
        sys.stdout.write(str(0)+"/"+str(self.number))
        sys.stdout.flush()
        with gzip.open(fastq_file+".gz", 'rt') as fastq_handle:
            self.sampling_sub(fastq_file, fastq_handle, sample_number)

    def sampling_sub(self, fastq_file, fastq_handle, sample_number):
        i = 0
        j = self.number*sample_number
        tag = "/s"+str(sample_number)+"_"
        with open(self.output_folder+tag+self.path_leaf(fastq_file)+str(self.blast_sufix)+".fasta", 'w') as output:
            base_sampled = 0
            for line in fastq_handle:
                if (i-1) % 4 == 0 and (i-1)/4 == self.tirages[j]:  # sequence line of read number self.tirages[j]
                    output.write(">"+str(j+sample_number*self.number)+"\n"+str(line))  # write the corresponding fasta record
                    if self.use_coverage:
                        base_sampled += len(str(line))
                if (i-1)/4 == self.tirages[j]:
                    j += 1  # move on to the next sampled read
                    if j % 100 == 0:
                        sys.stdout.write("\r"+str(j)+"/"+str(self.number*self.sample_number))
                        sys.stdout.flush()
                i += 1
                if self.use_coverage:
                    if base_sampled >= self.genome_base:
                        break
                    if base_sampled >= self.fastq_total_size:
                        sys.exit("not enough bases to sample: "+str(self.fastq_total_size)+" available vs "+str(self.genome_base)+" requested")
                if j >= len(self.tirages):
                    break
        sys.stdout.write("\r"+"s_"+self.path_leaf(fastq_file)+str(self.blast_sufix)+" done.\n")

    def test_sampling(self, blast):
        sampling_done = True
        for sample_number in range(self.sample_number):
            tag = "/s"+str(sample_number)+"_"
            if not os.path.isfile(self.output_folder+tag+self.path_leaf(self.fastq_R1)+".fasta") or not os.path.getsize(self.output_folder+tag+self.path_leaf(self.fastq_R1)+".fasta") > 0:
                sampling_done = False
            if self.paired:
                if not os.path.isfile(self.output_folder+tag+self.path_leaf(self.fastq_R2)+".fasta") or not os.path.getsize(self.output_folder+tag+self.path_leaf(self.fastq_R2)+".fasta") > 0:
                    sampling_done = False
        for i in range(self.sample_number):
            self.files.append("s"+str(i)+"_"+self.path_leaf(self.fastq_R1)+".fasta")
        if self.paired:
            for i in range(self.sample_number):
                self.files.append("s"+str(i)+"_"+self.path_leaf(self.fastq_R2)+".fasta")
        if sampling_done:
            print("sampling file found, skipping sampling...")
        else:
            self.files = list()
        if blast:
            return False
        return sampling_done

Tell me if it fixed your issue!

Best,

Clément

Post by t_schell on Fri Apr 01, 2016 12:26 am

Dear Clément,
I replaced lines 116 to 269 in v.1.1_02-2016 from GitHub. I assume I should replace the part up to
Code:
return sampling_done
and before
Code:
class Trinity:
It took me some time to insert the right number of tabs in front of every line, and I hope I didn't make a mistake, since I am not familiar with Python.
After some trial and error the script worked, at least up to the estimation of the number of reads to subsample. If I understood correctly, you calculate the number of reads to subsample as
(genome_size*genome_coverage*sample_number)/<read_length_estimation>
What happens afterwards? Is the sample split randomly into <sample_number> parts?
From my naive point of view, I would calculate the number of bases to subsample per sample (genome_size*genome_coverage) and randomly draw sequences from the input file; once the summed lengths of these reads exceed the number of bases to subsample, the sample is finished. Then I would repeat this drawing of sequences <sample_number> times. My questions here are: Is there a reason for subsampling the reads needed for all Trinity runs together from the input file, rather than subsampling for every Trinity run directly? And why do you estimate the number of reads to sample instead of counting the number of bases already subsampled?
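The alternative I describe could be sketched like this (hypothetical code, not part of dnaPipeTE; the name `sample_by_bases` is my own):

```python
import random

def sample_by_bases(reads, genome_base, sample_number, seed=None):
    # Hypothetical sketch of the proposal above: for each Trinity run,
    # draw reads at random and stop once the summed read lengths reach
    # genome_base (= genome_size * genome_coverage).
    rng = random.Random(seed)
    samples = []
    for _ in range(sample_number):
        order = rng.sample(range(len(reads)), len(reads))  # shuffled indices
        picked, base_sampled = [], 0
        for idx in order:
            picked.append(reads[idx])
            base_sampled += len(reads[idx])
            if base_sampled >= genome_base:
                break
        samples.append(picked)
    return samples

# Toy reads; each sample stops once >= 30 bases have been drawn.
reads = ["A" * 10, "C" * 12, "G" * 9, "T" * 11, "AC" * 7, "GT" * 8]
for s in sample_by_bases(reads, genome_base=30, sample_number=3, seed=1):
    print(len(s), sum(len(r) for r in s))
```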
However, I started dnaPipeTE like this:
Code:
python3.5 dnaPipeTE_fix.py -input ~/results/2016-03-22_dnapipete/all_trimmed_for_and_unpaired_wo_mt.fq -output ~/results/2016-03-22_dnapipete/test/ -cpu 30 -genome_size 1300000000 -genome_coverage 0.1
The STDOUT is:
Code:
              @@@@@@@@@@@@#
              .@@@@@@@@@@@,
            +@@@@@@@@@@#
            @@@@@@#+;.
              +@@;

          Let's go !!!

Start time: Thu Mar 31 22:39:23 2016
total number of reads : 499553406
sampling 2 samples of 65000000 reads...
s_all_trimmed_for_and_unpaired_wo_mt.fq done.
s_all_trimmed_for_and_unpaired_wo_mt.fq done.
###################################
### TRINITY to assemble repeats ###
###################################

***** TRINITY iteration 1 *****

Selecting reads for Trinity iteration number 1...
Done

Trinity iteration 1 Done'
###################################
### TRINITY to assemble repeats ###
###################################

***** TRINITY iteration 2 *****

Selecting reads for Trinity iteration number 2...
Done

Trinity iteration 2 Done'

The STDERR is:
Code:
awk: fatal: cannot open file `/home/tilman/results/2016-03-22_dnapipete/test//Trinity_run0/chrysalis/readsToComponents.out.sort' for reading (No such file or directory)
/bin/sh: /home/tilman/tools/dnaPipeTE_git/dnaPipeTE-master/bin/trinityrnaseq_r2013_08_14/Trinity.pl: No such file or directory
awk: fatal: cannot open file `/home/tilman/results/2016-03-22_dnapipete/test//Trinity_run1/chrysalis/readsToComponents.out.sort' for reading (No such file or directory)
/bin/sh: /home/tilman/tools/dnaPipeTE_git/dnaPipeTE-master/bin/trinityrnaseq_r2013_08_14/Trinity.pl: No such file or directory
/bin/sh: /home/tilman/tools/dnaPipeTE_git/dnaPipeTE-master/bin/trinityrnaseq_r2013_08_14/Trinity.pl: No such file or directory
Traceback (most recent call last):
  File "dnaPipeTE_fix.py", line 639, in <module>
    Trinity(config['DEFAULT']['Trinity'], config['DEFAULT']['Trinity_memory'], args.cpu, args.output_folder, sample_files, args.sample_number)
  File "dnaPipeTE_fix.py", line 295, in __init__
    self.new_version_correction()
  File "dnaPipeTE_fix.py", line 323, in new_version_correction
    year = re.search('\d{4}', str(out)).group(0)
AttributeError: 'NoneType' object has no attribute 'group'

Furthermore, all fasta files created by dnaPipeTE (reads_run0.fasta, reads_run1.fasta, s0_..., s1_...) remain empty.

Like I said, I don't know whether the error is related to my replacement of some lines in the code.
Thank you very much!

Best regards
Tilman


Post by Clément Goubert on Fri Apr 01, 2016 11:02 am

Hi Tilman,

Try with this:

ftp://pbil.univ-lyon1.fr/pub/divers/goubert/dnaPipeTE.py

This is the whole dnaPipeTE.py script with the change. If this works, I will update the 1.1 branch.

Thanks,

Clément

Post by t_schell on Fri Apr 01, 2016 3:50 pm

Hi Clément,
I started the script from the ftp with the same parameters as above.
The script was running until
Code:
Trinity iteration 2 Done'
Afterwards there is this error message:
Code:
awk: fatal: cannot open file `/home/tilman/results/2016-03-22_dnapipete/test//Trinity_run0/chrysalis/readsToComponents.out.sort' for reading (No such file or directory)
/bin/sh: /home/tilman/tools/dnaPipeTE_git/dnaPipeTE-master/bin/trinityrnaseq_r2013_08_14/Trinity.pl: No such file or directory
awk: fatal: cannot open file `/home/tilman/results/2016-03-22_dnapipete/test//Trinity_run1/chrysalis/readsToComponents.out.sort' for reading (No such file or directory)
/bin/sh: /home/tilman/tools/dnaPipeTE_git/dnaPipeTE-master/bin/trinityrnaseq_r2013_08_14/Trinity.pl: No such file or directory
/bin/sh: /home/tilman/tools/dnaPipeTE_git/dnaPipeTE-master/bin/trinityrnaseq_r2013_08_14/Trinity.pl: No such file or directory
Traceback (most recent call last):
  File "dnaPipeTE_fix2.py", line 639, in <module>
    Trinity(config['DEFAULT']['Trinity'], config['DEFAULT']['Trinity_memory'], args.cpu, args.output_folder, sample_files, args.sample_number)
  File "dnaPipeTE_fix2.py", line 295, in __init__
    self.new_version_correction()
  File "dnaPipeTE_fix2.py", line 323, in new_version_correction
    year = re.search('\d{4}', str(out)).group(0)
AttributeError: 'NoneType' object has no attribute 'group'

The improvement is that the s0_... and s1_... fasta files are no longer empty.
Both files contain sequences, matching the 0.1X target nicely (~0.099X per file).

Thank you for your help so far.
Tilman


Post by Clément Goubert on Fri Apr 01, 2016 4:38 pm

Hi!

It's great that the sampler is now fixed!
I noticed this path in the error:

Code:
home/tilman/tools/dnaPipeTE_git/dnaPipeTE-master/bin/trinityrnaseq_r2013_08_14/Trinity.pl:

The path in the config may be wrong; check the current path to Trinity.pl. If you renamed the folder dnaPipeTE-master (previously dnaPipeTE) after having already done some runs, the "config.ini" file generated at the first run is no longer up to date. So I would first remove config.ini and re-run dnaPipeTE.
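A quick way to check for this (a sketch; it assumes config.ini uses the [DEFAULT] "Trinity" entry that the traceback above suggests):

```python
import configparser
import os

# Does the Trinity path recorded in config.ini still exist on disk?
# (config.ini is generated by dnaPipeTE on its first run)
config = configparser.ConfigParser()
config.read("config.ini")
trinity_path = config["DEFAULT"].get("Trinity", "")
if not os.path.isfile(trinity_path):
    print("stale or missing Trinity path in config.ini:", trinity_path)
    print("-> delete config.ini and re-run dnaPipeTE to regenerate it")
```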

Best,

Clément

Post by t_schell on Fri Apr 01, 2016 11:04 pm

Hi,
thank you for the fast answer!
That was actually quite stupid on my side... I had renamed the dnaPipeTE directory.
After deleting the config.ini everything worked very well. You can mark this thread as "solved"!

The number of CPUs was not always limited to the specified value: I requested 30 but saw spikes of >180 and an average of 40-50 running tasks at the same time. If I remember correctly, this happened during the Chrysalis steps. But I think this is more or less a third-party problem.

However, I really appreciate your help.
Best regards
Tilman


Post by Clément Goubert on Mon Apr 25, 2016 2:28 pm

Hi Tilman,

Sorry for my late response! I'm glad you finally managed to make it run!
Concerning the Trinity issue with CPUs, I have already heard about that, but I have no clue how to fix it...

Feel free to report your comments on the pipeline!

Cheers,

Clément