Explainable Generative Deep Learning for plant transcription factor binding sites discovery



Fill out this form to request the standalone program

Running the standalone program

Create and activate a virtual environment, then install the other required dependencies:

conda create -n ptfvac python=3.8.10 -y

conda activate ptfvac

chmod a+x install.sh

./install.sh

cd deeptfactor

sh meme.sh    # installs MEME

Note: Use only AlphaFold2-generated PDB files for the TF (an example is provided). This standalone version was developed on Linux and has been tested on Ubuntu 20.04 with Python 3.8.10.

Input

fastafile = File containing the input sequences in FASTA format.

example.pdb = Example AlphaFold2-generated PDB file.
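
Before a run it can help to confirm that the FASTA input parses cleanly. The following is a minimal sketch and not part of PTF-Vac: the names `read_fasta` and `invalid_symbols`, and the accepted alphabet (A, C, G, T, N), are assumptions made for illustration.

```python
from typing import Dict, Iterable, Set


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text into {header: sequence}, joining wrapped lines."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines between or inside records
        if line.startswith(">"):
            name = line[1:].strip()
            if not name:
                raise ValueError("empty FASTA header")
            header = name.split()[0]  # keep the first token as the ID
            if header in records:
                raise ValueError(f"duplicate header: {header}")
            records[header] = ""
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            records[header] += line.upper()
    return records


def invalid_symbols(sequence: str, alphabet: str = "ACGTN") -> Set[str]:
    """Return the characters that fall outside the assumed alphabet."""
    return set(sequence) - set(alphabet)


# Toy input, not taken from the PTF-Vac example data.
example = [">seq1", "ACGTAC", "GGTTNA", "", ">seq2", "ACGXT"]
records = read_fasta(example)
for name, seq in records.items():
    bad = invalid_symbols(seq)
    status = "ok" if not bad else f"unexpected symbols {sorted(bad)}"
    print(name, len(seq), status)
```

To check a real file, pass an open file handle to `read_fasta` in place of the list of lines.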

Running script

To detect TF binding sites, run the following command from the parent directory:

Usage: ./ptfvac "fastafile" "folderpath" "AlphaFold2-generated PDB file" "folderpath of deeptfactor inside PTF-Vac"

Example: ./ptfvac fastafile folderpath example.pdb /home/user/ptfvac

To generate a line plot, execute the following command, where plot/seq1.txt is the input filename (the generated plots are interactive):

python3 plot.py plot/seq1.txt

Output

results.txt = TF binding site information: the binding site sequence and its start and end coordinates in the input sequence.

motiflogopwm.csv = Position weight matrix of binding site sequences.

motiflogopwm.pdf = Sequence logo of binding site sequences.

plots = A folder containing the importance score plots.
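
For readers unfamiliar with the matrix behind motiflogopwm.csv and the sequence logo, the sketch below shows how per-position base probabilities can be derived from a set of equal-length binding site sequences. It is an illustration only: the column layout of motiflogopwm.csv is not described here, the function name `position_probability_matrix` is hypothetical, the toy sites are invented, and PTF-Vac may apply pseudocounts or a log-odds transform that this sketch omits.

```python
from typing import Dict, List

BASES = "ACGT"


def position_probability_matrix(
    sites: List[str], pseudocount: float = 0.0
) -> List[Dict[str, float]]:
    """Return one {base: probability} column per position of the aligned sites."""
    if not sites:
        raise ValueError("at least one site is required")
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")

    matrix = []
    for position in range(width):
        counts = {base: pseudocount for base in BASES}
        for site in sites:
            base = site[position].upper()
            if base not in counts:
                raise ValueError(f"unexpected base {base!r} at position {position + 1}")
            counts[base] += 1
        total = sum(counts.values())
        matrix.append({base: counts[base] / total for base in BASES})
    return matrix


# Invented toy sites, not PTF-Vac output.
sites = ["ACGT", "ACGA", "ACTT", "TCGT"]
ppm = position_probability_matrix(sites)
for position, column in enumerate(ppm, start=1):
    print(position, " ".join(f"{base}={column[base]:.2f}" for base in BASES))
```

Each column sums to 1; a sequence logo draws taller letters at positions where a single base dominates the column.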

Python script to build model implementing hyperparameter tuning

miWords: Module 1 execution script (sequences shorter than 400 bases)

folderpath: /home/user/miWords

File containing RNA sequences

>seq_1
UGAUAAACAAAGUGUGUAACAUCACCUCAUCUACAUGUGUGAUUUUUUUUUUGAAUAUAGACAACUUUUUAGUCAGAGUUUACAUGAGUUUUCACCUAAUUUGUGGUUUAAUUACACCGCAUAUUU
GCCCAAUUUAGUGAGUAUAGUGAGUUUCUGUAGAGAAGCUCAUCUUAGAAUUAUUCAUGUAUUCCACUACUAAA
>seq_2
AGAUCUACAAGAGAAGAUAAGUUUGAGGCAAAUUCGAGAUCUGGAAGCUGGUUUUCUCUUUACAAAUAACACUAACCCUACCAUCAAAUCAAGAAAGGAGGCUUUGAACAAAUAGCUUGAUUGAAG
UAUGAAGUGGCUCGGUGGGCGACGAUGACGGGCGAGCUCCGGCGAGGGCCUGGGGGCCUGAGCGACGAUGACGG
>seq_3
AACGGGUCGUGCCGGCACGGCCCACGAGCGGGCGUGCCGUGCCGUUCCUGGGCCGGCUACAGUGCUGCCGUGCUCGGGCCGGCACGCCUUGGCCCGGCCCAUUUGGCCAGGUAUACUAGUCGGCUC
CAGUCCUCCUCCCCCCAGCGACCUAAGCCGCCACCGCCCUCGCCGCGCUACCGCCAGCGCCGCCUGCCGUCCCU
>seq_4
UUUUCCCCUUGAUUUUAGGGUUAGGGUUUCAUGAUUUGGGGAAAAAUUUGGGAUCUUACUGUAGCUAGGGUUUCGGUUCUUGGGGAUUUGUCUGAGAUUUGCAUGAACUUUUGCUUUCCCCUUCUU
CUAAUUUCUUCAACCCGAAACCCUAGAAGGCCUAAUUCCAUUUCUUAUAUUUCGGGAUUGCAUGAUUUGGGCUU
>seq_5
CCGGACGAUAGCAAGCGCUGGCAGUAGAGUAGGCUAGAGUCAUGAGUCUGAGUCAUGCUGGCUUUAUAUAGACAAAAAAUGGUACUACACACAAAUGAAAUUUCUAGCAAAAAUAAUCAAUGCACU
UUCCUUGAUUACACACCAACUUUAUGUAUAUAUAGGCUGGAAUAAUCCAUUGUGCAUGUACAUGAAUAUAGAUU

miWords: Module 2 execution script (sequences longer than 400 bases)

File containing genomic sequences

>seq_Ath
UGAUAAACAAAGUGUGUAACAUCACCUCAUCUACAUGUGUGAUUUUUUUUUUGAAUAUAGACAACUUUUUAGUCAGAGUUUACAUGAGUUUUCACCUAAUUUGUGGUUUAAUUACACCGCAUAUUU
GCCCAAUUUAGUGAGUAUAGUGAGUUUCUGUAGAGAAGCUCAUCUUAGAAUUAUUCAUGUAUUCCACUACUAAAAGAUCUACAAGAGAAGAUAAGUUUGAGGCAAAUUCGAGAUCUGGAAGCUGGU
UUUCUCUUUACAAAUAACACUAACCCUACCAUCAAAUCAAGAAAGGAGGCUUUGAACAAAUAGCUUGAUUGAAGUAUGAAGUGGCUCGGUGGGCGACGAUGACGGGCGAGCUCCGGCGAGGGCCUG
GGGGCCUGAGCGACGAUGACGGAACGGGUCGUGCCGGCACGGCCCACGAGCGGGCGUGCCGUGCCGUUCCUGGGCCGGCUACAGUGCUGCCGUGCUCGGGCCGGCACGCCUUGGCCCGGCCCAUUU
GGCCAGGUAUACUAGUCGGCUCCAGUCCUCCUCCCCCCAGCGACCUAAGCCGCCACCGCCCUCGCCGCGCUACCGCCAGCGCCGCCUGCCGUCCCUUUUUCCCCUUGAUUUUAGGGUUAGGGUUUC
AUGAUUUGGGGAAAAAUUUGGGAUCUUACUGUAGCUAGGGUUUCGGUUCUUGGGGAUUUGUCUGAGAUUUGCAUGAACUUUUGCUUUCCCCUUCUUCUAAUUUCUUCAACCCGAAACCCUAGAAGG
CCUAAUUCCAUUUCUUAUAUUUCGGGAUUGCAUGAUUUGGGCUUCCGGACGAUAGCAAGCGCUGGCAGUAGAGUAGGCUAGAGUCAUGAGUCUGAGUCAUGCUGGCUUUAUAUAGACAAAAAAUGG
UACUACACACAAAUGAAAUUUCUAGCAAAAAUAAUCAAUGCACUUUCCUUGAUUACACACCAACUUUAUGUAUAUAUAGGCUGGAAUAAUCCAUUGUGCAUGUACAUGAAUAUAGAUU
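
The two execution scripts differ in the sequence length they accept: Module 1 is described for sequences shorter than 400 bases and Module 2 for longer ones. A minimal sketch of that routing follows. The function name `choose_module` and its return labels are hypothetical, and because the text does not say which module handles a sequence of exactly 400 bases, that case is reported separately rather than guessed.

```python
LENGTH_THRESHOLD = 400  # bases, taken from the module descriptions above


def choose_module(sequence: str) -> str:
    """Pick a module label from the sequence length alone."""
    length = len(sequence)
    if length < LENGTH_THRESHOLD:
        return "module1"
    if length > LENGTH_THRESHOLD:
        return "module2"
    return "unspecified"  # exactly 400 bases: not covered by the description


# Toy sequences; only their lengths matter here.
for name, seq in {"short": "AUGC" * 50, "long": "AUGC" * 220}.items():
    print(name, len(seq), choose_module(seq))
```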
Figure panels: (A) T-Score CNN; (B) Bimodal CNN

An example of a FASTQ file
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
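
Each FASTQ record has four lines: an `@` header, the read sequence, a `+` separator, and a quality string with one character per base. The sketch below parses the record shown above and decodes its qualities. It assumes the Phred+33 encoding (quality = ASCII code minus 33), which is consistent with the `!` character in the example but is not stated in the text; the name `parse_fastq_record` is hypothetical.

```python
from typing import List, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str
    sequence: str
    qualities: List[int]


def parse_fastq_record(lines: List[str], offset: int = 33) -> FastqRecord:
    """Parse one four-line FASTQ record and decode its quality string."""
    if len(lines) != 4:
        raise ValueError("a FASTQ record has exactly four lines")
    header, sequence, separator, quality = (line.rstrip("\n") for line in lines)
    if not header.startswith("@"):
        raise ValueError("header line must start with '@'")
    if not separator.startswith("+"):
        raise ValueError("third line must start with '+'")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality strings differ in length")
    # Assumed Phred+33: subtract the offset from each character code.
    qualities = [ord(symbol) - offset for symbol in quality]
    return FastqRecord(header[1:], sequence, qualities)


record = parse_fastq_record([
    "@SEQ_ID",
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT",
    "+",
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65",
])
print(record.identifier, len(record.sequence), min(record.qualities), max(record.qualities))
```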
Sequence ID    Start    End    Name    Score    Strand
chr1           0        100    abcd    255      +
chr1           10       125    abcd    255      -
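
The six columns above correspond to the BED6 layout. BED coordinates are 0-based and half-open, so the length of an interval is End minus Start. The sketch below parses the two rows shown; the class and function names (`Bed6`, `parse_bed6`) are hypothetical.

```python
from typing import NamedTuple


class Bed6(NamedTuple):
    chrom: str
    start: int
    end: int
    name: str
    score: int
    strand: str

    @property
    def length(self) -> int:
        # 0-based, half-open interval: start is included, end is excluded.
        return self.end - self.start


def parse_bed6(line: str) -> Bed6:
    """Parse one whitespace- or tab-separated BED6 line."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, found {len(fields)}")
    chrom, start, end, name, score, strand = fields
    if strand not in ("+", "-", "."):
        raise ValueError(f"unexpected strand: {strand!r}")
    interval = Bed6(chrom, int(start), int(end), name, int(score), strand)
    if interval.start < 0 or interval.end < interval.start:
        raise ValueError("invalid coordinates")
    return interval


rows = [parse_bed6("chr1 0 100 abcd 255 +"), parse_bed6("chr1 10 125 abcd 255 -")]
for row in rows:
    print(row.chrom, row.start, row.end, row.strand, row.length)
```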
sam2bed: script that parses SAM data from standard input and prints sorted BED to standard output.

Elaboration: cat read1.bed read2.bed read3.bed read4.bed > read.bed

HISAT2: a fast and sensitive alignment program for mapping next-generation sequencing reads against a reference genome.

genome: file with reference sequences.

index: index filename prefix (minus trailing .X.ht2).

read.fastq: sRNA-seq data.

read.sam: read alignment data in SAM format.

read.bed: read alignment data in BED6 format.

Sequence Alignment Map (SAM): a text-based format originally for storing biological sequences aligned to a reference sequence.

CSV files: seq_Ath_0.csv, seq_Ath_8.csv, seq_Ath_12.csv
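
The file names above suggest a three-step flow: build a HISAT2 index from the genome, align the reads to produce SAM, and convert SAM to BED with sam2bed. The sketch below only assembles the command lines and does not run them. It assumes single-end reads (HISAT2's `-U` option) and default settings everywhere else, neither of which is stated in the text; the function name `pipeline_commands` is hypothetical.

```python
import shlex
from typing import List


def pipeline_commands(genome: str, index: str, reads: str, sam: str, bed: str) -> List[str]:
    """Return shell command lines for index building, alignment, and conversion."""
    q = shlex.quote  # protect paths that contain spaces or shell metacharacters
    return [
        # Build the HISAT2 index files (<index>.X.ht2) from the reference.
        f"hisat2-build {q(genome)} {q(index)}",
        # Align unpaired reads (assumption: single-end) and write SAM.
        f"hisat2 -x {q(index)} -U {q(reads)} -S {q(sam)}",
        # sam2bed reads SAM on standard input and writes sorted BED to standard output.
        f"sam2bed < {q(sam)} > {q(bed)}",
    ]


commands = pipeline_commands("genome", "index", "read.fastq", "read.sam", "read.bed")
for command in commands:
    print(command)
```

Printing the commands first makes it easy to review them before pasting them into a shell.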