simulate_sublibrary takes a library file and generates a sublibrary file. A sublibrary contains only a small number of unique sequences (usually 200-300), with many copies of each. It is useful as a test data set for evaluating the performance of your model, and as input to totalinfo for calculating the best possible performance of any model.
After you install `sortseq_tools`_, this program will be available to run at the command line.
usage: sortseq simulate_sublibrary [-h] [-d {constant,exponential}]
                                   [-ns NUMSUB] [-nc NUMCOPIES]
                                   [-dc DECAYCONSTANT] [-t {dna,rna,protein}]
                                   [-o OUT]
-d, --disttype
    How the number of times each sequence appears should be distributed. Possible choices: constant, exponential. Default: constant.
-ns, --numsub
    Number of different sublibrary sequences. Default: 200.
-nc, --numcopies
    Number of times to measure a given sequence. If the exponential distribution type is selected, this is the maximum number of times any sequence is measured. Default: 1000.0.
-dc, --decayconstant
    Decay constant of the distribution, used only if the exponential distribution type is selected. Default: 100.
-t, --type
    Undocumented. Possible choices: dna, rna, protein. Default: dna.
-o, --out
    Undocumented.
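To make the two distribution types concrete, here is a minimal Python sketch of how per-sequence copy numbers could be assigned. It is not taken from the sortseq_tools source: the function name `copy_counts` is hypothetical, and the exponential form (count of the i-th sequence equal to numcopies * exp(-i / decayconstant), rounded up) is an assumption chosen only to match the option descriptions, in which numcopies acts as the maximum.

```python
import math


def copy_counts(numsub=200, numcopies=1000.0, disttype="constant",
                decayconstant=100.0):
    """Return one copy number per sublibrary sequence (illustrative only)."""
    if disttype == "constant":
        # Every sequence is measured the same number of times.
        return [int(numcopies)] * numsub
    if disttype == "exponential":
        # ASSUMPTION: counts decay with rank i as exp(-i / decayconstant),
        # so the first sequence gets numcopies and later ones get fewer.
        return [
            max(1, math.ceil(numcopies * math.exp(-i / decayconstant)))
            for i in range(numsub)
        ]
    raise ValueError("disttype must be 'constant' or 'exponential'")


constant = copy_counts(numsub=3, disttype="constant")
decaying = copy_counts(numsub=5, disttype="exponential")
print(constant)
print(decaying)
```

Under this assumed form, a larger decay constant gives a flatter distribution, and the constant type is the limiting case in which every sequence receives numcopies.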
The input should be a sequence library.
Example Input Table:
seq ct
ATTAG 1
ACCTA 15
GGATT 9
...
Example Output Table:
seq ct
ATTAG 6000
AGGAT 6000
...
By default (the constant distribution type), every chosen sequence is given the same number of counts.
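The relationship between the input and output tables can be sketched in a few lines of Python. This is only an illustration of the idea, not the program's implementation: the function name `simulate_sublibrary_sketch` is hypothetical, and it assumes that sequences are chosen uniformly at random from the unique sequences in the library and that the input is a whitespace-separated table with a `seq ct` header.

```python
import random


def simulate_sublibrary_sketch(library_text, numsub=2, numcopies=1000, seed=0):
    """Pick numsub unique sequences and give each the same count."""
    rows = [line.split() for line in library_text.strip().splitlines()]
    data_rows = rows[1:]  # skip the "seq ct" header
    unique_seqs = sorted({row[0] for row in data_rows})

    # ASSUMPTION: sequences are drawn uniformly at random, without
    # replacement; the real program may choose them differently.
    rng = random.Random(seed)
    chosen = rng.sample(unique_seqs, min(numsub, len(unique_seqs)))

    # Constant distribution: every chosen sequence gets numcopies counts.
    out_lines = ["seq ct"] + [f"{seq} {int(numcopies)}" for seq in chosen]
    return "\n".join(out_lines)


library = """seq ct
ATTAG 1
ACCTA 15
GGATT 9"""

print(simulate_sublibrary_sketch(library))
```

The output has the same two columns as the input, but with far fewer distinct sequences and a much larger count for each one.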
Example command to perform the analysis:
sortseq simulate_sublibrary -i my_library.txt -o my_sublibrary.txt