Explainable Generative Deep Learning for plant transcription factor binding sites discovery



The standalone version of the program is hosted on GitLab.

Running the standalone program

Create and activate a virtual environment, then install the required dependencies:

conda create -n ptfvac python=3.8.10 -y

conda activate ptfvac

unzip software.zip

chmod a+x INSTALL

./INSTALL (this step installs all dependencies in one go and takes some time)

cd deeptfactor

sh meme.sh (this step installs the MEME Suite and takes some time)

Note: Use only AlphaFold2-generated TF PDB files (an example is provided). This standalone version was developed on Linux and has been tested on Ubuntu 20.04 with Python 3.8.10.

Input

fastafile = File containing the input sequences in FASTA format.

example.pdb = Example AlphaFold2-generated PDB file.
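
Before a long run, it can help to check that the FASTA input is well formed. The following is a minimal sketch, not part of PTF-Vac: the function names `read_fasta` and `invalid_characters` are hypothetical, and the allowed alphabet (A, C, G, T, N) is an assumption that the input is DNA.

```python
from typing import Dict, Iterable, List

# Assumption: the input is DNA, with N allowed for ambiguous bases.
DNA_ALPHABET = set("ACGTN")


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA lines into {sequence ID: sequence}, joining wrapped lines."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines between records
        if line.startswith(">"):
            parts = line[1:].split()
            if not parts:
                raise ValueError("header line without a sequence ID")
            header = parts[0]
            if header in records:
                raise ValueError(f"duplicate sequence ID: {header}")
            records[header] = ""
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            records[header] += line.upper()
    return records


def invalid_characters(records: Dict[str, str]) -> Dict[str, List[str]]:
    """Return {sequence ID: unexpected characters} for sequences outside the alphabet."""
    problems: Dict[str, List[str]] = {}
    for header, seq in records.items():
        bad = sorted(set(seq) - DNA_ALPHABET)
        if bad:
            problems[header] = bad
    return problems


# Small in-memory example; for a real file, pass an open file handle instead.
example = """>seq1
ACGTAC
GTTN
>seq2
ACGUXA
""".splitlines()

records = read_fasta(example)
print(records["seq1"])
print(invalid_characters(records))
```

For a file on disk, `read_fasta(open("fastafile"))` would work the same way, since the function accepts any iterable of lines.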

Running script

To detect TF binding sites, run the following command from the parent directory:

Usage: ./ptfvac "fastafile" "folderpath" "AlphaFold2-generated PDB file" "folderpath of deeptfactor inside PTF-Vac"

e.g.: ./ptfvac fastafile folderpath example.pdb /home/user/ptfvac
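
When several FASTA files need to be processed, the command can be driven from Python. This is a hedged sketch: `build_ptfvac_command` and `run_ptfvac` are hypothetical helpers, and the only thing taken from this README is the order of the four positional arguments in the usage line.

```python
import subprocess
from typing import List


def build_ptfvac_command(fasta: str, folder: str, pdb: str, deeptfactor_dir: str) -> List[str]:
    """Return the argument list in the order given by the usage line."""
    return ["./ptfvac", fasta, folder, pdb, deeptfactor_dir]


def run_ptfvac(fasta: str, folder: str, pdb: str, deeptfactor_dir: str, cwd: str = "."):
    """Run ptfvac from `cwd`, which should be the parent directory of the installation."""
    cmd = build_ptfvac_command(fasta, folder, pdb, deeptfactor_dir)
    # check=True raises CalledProcessError if ptfvac exits with a non-zero status.
    return subprocess.run(cmd, cwd=cwd, check=True)


# Build (but do not execute) the example command from this README.
cmd = build_ptfvac_command("fastafile", "folderpath", "example.pdb", "/home/user/ptfvac")
print(" ".join(cmd))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when paths contain spaces.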

To generate a line plot, run the following command, passing the file to plot as the argument (the generated plots are interactive):

python3 plot.py plot/seq1.txt

To compare the motif identified by PTF-Vac with an already known motif (in MEME format; an example, "known.meme", is provided):

Step 1: Convert PTF-Vac motif to MEME format

python3 pwm_meme.py folderpath
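
For readers unfamiliar with the target format, the sketch below shows what a conversion from a position weight matrix to a minimal MEME motif file can look like. It is not the code of pwm_meme.py. The CSV layout (a header row `A,C,G,T` and one row per motif position), the uniform background frequencies, the motif name, and the `nsites` value are all assumptions made for illustration.

```python
import csv
import io
from typing import List


def read_pwm_csv(text: str) -> List[List[float]]:
    """Read a PWM with header 'A,C,G,T' and one row per position (assumed layout)."""
    reader = csv.DictReader(io.StringIO(text))
    return [[float(row[base]) for base in "ACGT"] for row in reader]


def pwm_to_meme(pwm: List[List[float]], name: str = "motif1", nsites: int = 20) -> str:
    """Render a PWM as a minimal MEME-format motif with a uniform background."""
    lines = [
        "MEME version 4",
        "",
        "ALPHABET= ACGT",
        "",
        "strands: + -",
        "",
        "Background letter frequencies",
        "A 0.25 C 0.25 G 0.25 T 0.25",  # assumption: uniform background
        "",
        f"MOTIF {name}",
        f"letter-probability matrix: alength= 4 w= {len(pwm)} nsites= {nsites} E= 0",
    ]
    for row in pwm:
        total = sum(row)
        if total <= 0:
            raise ValueError("each PWM row must have a positive sum")
        # Normalize so that the probabilities at each position sum to 1.
        lines.append(" ".join(f"{value / total:.6f}" for value in row))
    return "\n".join(lines) + "\n"


# Hypothetical three-position matrix given as counts.
example_csv = """A,C,G,T
8,0,0,0
2,2,2,2
0,0,6,2
"""

meme_text = pwm_to_meme(read_pwm_csv(example_csv))
print(meme_text)
```

Rows are normalized inside `pwm_to_meme`, so the same function accepts either counts or probabilities.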

Step 2: Run TOMTOM

tomtom motiflogopwm.meme known.meme -o result/folder

Output

results.txt = TF binding site information: the binding site sequence and its start and end coordinates in the input sequence. The results also provide the importance score of the binding site sequence, its binomial test based p-value, and its hypergeometric test based p-value.
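
To make the two p-values easier to interpret, here is a general illustration of one-sided binomial and hypergeometric tail probabilities using only the standard library. It is a sketch of the textbook tests, not PTF-Vac's implementation: this README does not specify which counts the program uses or whether its tests are one-sided, and the function names and example numbers are hypothetical.

```python
from math import comb


def binomial_sf(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p): chance of k or more hits in n trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def hypergeom_sf(k: int, population: int, successes: int, draws: int) -> float:
    """P(X >= k) when `draws` items are sampled without replacement from a
    population containing `successes` marked items."""
    upper = min(successes, draws)
    favourable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return favourable / comb(population, draws)


# Hypothetical numbers: a motif seen in 30 of 100 sequences when the
# background rate is 10% (binomial), and 30 of the 40 motif-containing
# sequences falling among 100 bound sequences out of 1000 (hypergeometric).
print(binomial_sf(30, 100, 0.10))
print(hypergeom_sf(30, 1000, 40, 100))
```

In both cases a small value means the observed count would be unlikely under the background assumption.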

motiflogopwm.csv = Position weight matrix of binding site sequences.

motiflogopwm.pdf = Sequence logo of binding site sequences.

plots = A folder containing the importance score plots.

Note

For the HADDOCK implementation, see the README_HADDOCK.md file.

Python script to build model implementing hyperparameter tuning
miWords: Module 1 execution script (sequences shorter than 400 bases)
folderpath: /home/user/miWords
File containing RNA sequences
>seq_1
UGAUAAACAAAGUGUGUAACAUCACCUCAUCUACAUGUGUGAUUUUUUUUUUGAAUAUAGACAACUUUUUAGUCAGAGUUUACAUGAGUUUUCACCUAAUUUGUGGUUUAAUUACACCGCAUAUUU
GCCCAAUUUAGUGAGUAUAGUGAGUUUCUGUAGAGAAGCUCAUCUUAGAAUUAUUCAUGUAUUCCACUACUAAA
>seq_2
AGAUCUACAAGAGAAGAUAAGUUUGAGGCAAAUUCGAGAUCUGGAAGCUGGUUUUCUCUUUACAAAUAACACUAACCCUACCAUCAAAUCAAGAAAGGAGGCUUUGAACAAAUAGCUUGAUUGAAG
UAUGAAGUGGCUCGGUGGGCGACGAUGACGGGCGAGCUCCGGCGAGGGCCUGGGGGCCUGAGCGACGAUGACGG
>seq_3
AACGGGUCGUGCCGGCACGGCCCACGAGCGGGCGUGCCGUGCCGUUCCUGGGCCGGCUACAGUGCUGCCGUGCUCGGGCCGGCACGCCUUGGCCCGGCCCAUUUGGCCAGGUAUACUAGUCGGCUC
CAGUCCUCCUCCCCCCAGCGACCUAAGCCGCCACCGCCCUCGCCGCGCUACCGCCAGCGCCGCCUGCCGUCCCU
>seq_4
UUUUCCCCUUGAUUUUAGGGUUAGGGUUUCAUGAUUUGGGGAAAAAUUUGGGAUCUUACUGUAGCUAGGGUUUCGGUUCUUGGGGAUUUGUCUGAGAUUUGCAUGAACUUUUGCUUUCCCCUUCUU
CUAAUUUCUUCAACCCGAAACCCUAGAAGGCCUAAUUCCAUUUCUUAUAUUUCGGGAUUGCAUGAUUUGGGCUU
>seq_5
CCGGACGAUAGCAAGCGCUGGCAGUAGAGUAGGCUAGAGUCAUGAGUCUGAGUCAUGCUGGCUUUAUAUAGACAAAAAAUGGUACUACACACAAAUGAAAUUUCUAGCAAAAAUAAUCAAUGCACU
UUCCUUGAUUACACACCAACUUUAUGUAUAUAUAGGCUGGAAUAAUCCAUUGUGCAUGUACAUGAAUAUAGAUU
miWords: Module 2 execution script (sequences longer than 400 bases)
File containing genomic sequences
>seq_Ath
UGAUAAACAAAGUGUGUAACAUCACCUCAUCUACAUGUGUGAUUUUUUUUUUGAAUAUAGACAACUUUUUAGUCAGAGUUUACAUGAGUUUUCACCUAAUUUGUGGUUUAAUUACACCGCAUAUUU
GCCCAAUUUAGUGAGUAUAGUGAGUUUCUGUAGAGAAGCUCAUCUUAGAAUUAUUCAUGUAUUCCACUACUAAAAGAUCUACAAGAGAAGAUAAGUUUGAGGCAAAUUCGAGAUCUGGAAGCUGGU
UUUCUCUUUACAAAUAACACUAACCCUACCAUCAAAUCAAGAAAGGAGGCUUUGAACAAAUAGCUUGAUUGAAGUAUGAAGUGGCUCGGUGGGCGACGAUGACGGGCGAGCUCCGGCGAGGGCCUG
GGGGCCUGAGCGACGAUGACGGAACGGGUCGUGCCGGCACGGCCCACGAGCGGGCGUGCCGUGCCGUUCCUGGGCCGGCUACAGUGCUGCCGUGCUCGGGCCGGCACGCCUUGGCCCGGCCCAUUU
GGCCAGGUAUACUAGUCGGCUCCAGUCCUCCUCCCCCCAGCGACCUAAGCCGCCACCGCCCUCGCCGCGCUACCGCCAGCGCCGCCUGCCGUCCCUUUUUCCCCUUGAUUUUAGGGUUAGGGUUUC
AUGAUUUGGGGAAAAAUUUGGGAUCUUACUGUAGCUAGGGUUUCGGUUCUUGGGGAUUUGUCUGAGAUUUGCAUGAACUUUUGCUUUCCCCUUCUUCUAAUUUCUUCAACCCGAAACCCUAGAAGG
CCUAAUUCCAUUUCUUAUAUUUCGGGAUUGCAUGAUUUGGGCUUCCGGACGAUAGCAAGCGCUGGCAGUAGAGUAGGCUAGAGUCAUGAGUCUGAGUCAUGCUGGCUUUAUAUAGACAAAAAAUGG
UACUACACACAAAUGAAAUUUCUAGCAAAAAUAAUCAAUGCACUUUCCUUGAUUACACACCAACUUUAUGUAUAUAUAGGCUGGAAUAAUCCAUUGUGCAUGUACAUGAAUAUAGAUU
A: T-Score CNN
B: Bimodal CNN
An example of a FASTQ file
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
Sequence ID  Start  End  Name  Score  Strand
chr1         0      100  abcd  255    +
chr1         10     125  abcd  255    -
sam2bed: script that parses SAM data from standard input and prints sorted BED to standard output.
Elaboration: cat read1.bed read2.bed read3.bed read4.bed > read.bed
HISAT2: a fast and sensitive alignment program for mapping next-generation sequencing reads against a reference genome.
genome: file with reference sequences.
index: index filename prefix (minus trailing .X.ht2).
read.fastq: sRNA-seq data.
read.sam: read alignment data in SAM format.
read.bed: read alignment data in BED6 format.
Sequence Alignment Map (SAM): a text-based format originally for storing biological sequences aligned to a reference sequence.
CSV files: seq_Ath_0.csv, seq_Ath_8.csv, seq_Ath_12.csv