
This file gives several examples of how to use the Dinucleotide
Weight Tensor (DWT) tools.

The example runs cover:
1. Preparing an initial DWT model from a position-specific weight
matrix (PWM).

2. Training a DWT model on a set of sequences thought to contain
binding sites.

3. Predicting binding sites for a given DWT in a set of sequences.

4. Calculating posterior probabilities for dependencies in a DWT.

5. Creating a DiLogo picture for a DWT.







----------------------------------------------------------------------
1. Preparing a DWT from a PWM model.
We provide a simple Perl script that takes a PWM file as input and
writes the corresponding DWT file to standard output.

INPUT FILES: 
../PWM_models/Myers_HudsonAlpha-BG_10-CEBP.wm

COMMAND: 
perl ../Source/init_dwt_from_pwm.pl ../PWM_models/Myers_HudsonAlpha-BG_10-CEBP.wm > CEBP_init.dwt

OUTPUT FILES: 
CEBP_init.dwt

DESCRIPTION:
The input file is a position-specific weight matrix for the human TF
CEBP, which was inferred (using our CRUNCH tool, at crunch.unibas.ch)
from an ENCODE ChIP-seq dataset generated by the Myers lab. The output
file is a DWT file that represents the input PWM. That is, by
construction this DWT motif has no dependencies between positions.
The file serves as a starting point for DWT inference.
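
To illustrate what "no dependencies between the positions" means, here
is a minimal Python sketch. It is not the logic of init_dwt_from_pwm.pl
and it does not reproduce the DWT file format, neither of which is shown
in this file. It only shows that, for two independent PWM columns, the
joint probability of a dinucleotide is the product of the single-base
probabilities. The function name and the column values are hypothetical.

```python
from itertools import product

BASES = "ACGT"

def independent_pair_probs(col_i, col_j):
    """Joint dinucleotide probabilities for two independent PWM columns.

    col_i and col_j map each base to its probability at positions i and j.
    Under independence the joint is the product of the two marginals.
    """
    return {a + b: col_i[a] * col_j[b] for a, b in product(BASES, repeat=2)}

# Two made-up PWM columns (illustrative values, not the CEBP motif).
col_1 = {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}
col_2 = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

pair = independent_pair_probs(col_1, col_2)

# The 16 joint probabilities sum to one, and summing over the second
# base recovers the first column, so no dependency has been introduced.
marginal_A = sum(pair["A" + b] for b in BASES)
```

Training (example 2) can then move the model away from this factorized
starting point wherever the data support a dependency.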



---------------------------------------------------------------------
2. Training a DWT model.

INPUT FILES:
a. CEBP_init.dwt
The initial DWT file to start the training from.

b. ../PeakSequences/Myers_HudsonAlpha-BG_10-CEBP.fa
A FASTA file with the sequences of the binding peaks from the ChIP-seq
experiment.

c. params_dwt_training
The parameter file for training the DWT model.

COMMAND:
../Source/DWT_model CEBP_init.dwt ../PeakSequences/Myers_HudsonAlpha-BG_10-CEBP.fa params_dwt_training

OUTPUT FILES:
a. Results/Myers_HudsonAlpha-BG_10-CEBP.trained.dwt
The file containing the trained DWT.

b. Results/Myers_HudsonAlpha-BG_10-CEBP.sites.dwt
The file with the predicted binding sites. Each line corresponds to
one binding site and has the following fields:
The name of the sequence the site occurs in.
The start position of the site in the sequence.
The end position of the site in the sequence.
The strand the site occurs on.
The sequence of the site.
The score (log-likelihood ratio under the motif and background models).
The posterior of the site in its sequence (the fraction of time the TF
will be bound at this position in the sequence).

c. Myers_HudsonAlpha-BG_10-CEBP.seq_scores.dwt
A file that gives, for each sequence s, its total binding energy
E_s = log[sum_w exp(E_w)], where the sum runs over all windows w in
sequence s and E_w is the score of window w. The third column gives
the probability that the sequence will be bound by the TF.
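
The total binding energy formula can be sketched in Python as follows.
This is an illustration only, assuming the logarithm is the natural
logarithm. The function name and the example scores are hypothetical,
and the program's own implementation may differ. Subtracting the
maximum score before exponentiating avoids overflow for large scores.

```python
import math

def total_binding_energy(window_scores):
    """Return E_s = log(sum_w exp(E_w)) for the window scores of one sequence."""
    if not window_scores:
        raise ValueError("need at least one window score")
    # Shift by the maximum so that the largest exponent is exp(0) = 1.
    m = max(window_scores)
    return m + math.log(sum(math.exp(e - m) for e in window_scores))

# Made-up window scores E_w for a single sequence.
scores = [2.0, -1.0, 0.5]
E_s = total_binding_energy(scores)
```

Because of the exponentials, E_s is dominated by the best-scoring
windows and is never smaller than the highest single window score.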

DESCRIPTION:
The DWT_TFBS_prediction program (invoked above as ../Source/DWT_model),
when run in motif training mode, takes as input an initial DWT motif
and a set of sequences (in FASTA format) that are thought to contain
binding sites for the TF. It trains the DWT with an EM-like algorithm
and writes the trained DWT to a file. It also predicts binding sites
in all input sequences and writes these to a second file. Finally, it
calculates a total binding energy for each sequence and writes these
to a third file.
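
The predicted sites file can be read with a few lines of Python. The
sketch below assumes that the fields listed under OUTPUT FILES appear
in that order and are separated by whitespace. The delimiter is not
stated in this file, so check it against your own output. The example
line and all names in the code are made up for illustration.

```python
from typing import NamedTuple

class Site(NamedTuple):
    seq_name: str     # name of the sequence the site occurs in
    start: int        # start position in the sequence
    end: int          # end position in the sequence
    strand: str       # strand the site occurs on
    site_seq: str     # sequence of the site
    score: float      # log-likelihood ratio, motif vs background
    posterior: float  # posterior of the site in its sequence

def parse_site_line(line):
    """Parse one line of a sites file (assumed whitespace-separated)."""
    name, start, end, strand, site_seq, score, posterior = line.split()
    return Site(name, int(start), int(end), strand, site_seq,
                float(score), float(posterior))

# Hypothetical line, not taken from real program output.
example = "peak_1 12 21 + TTGCGCAATA 9.73 0.84"
site = parse_site_line(example)
```

A sequence name that itself contains whitespace would break this
simple split and would need a stricter parser.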


----------------------------------------------------------------------
3. Predicting sites with a DWT model.

INPUT FILES:
a. ../DWT_models/Snyder_Stanford-StandardControl-CTCF.trained.dwt
The DWT file previously trained on CTCF ChIP-seq data.

b. ../PeakSequences/Snyder_Stanford-StandardControl-CTCF.fa
Binding peaks from the ChIP-seq experiment with CTCF.

c. params_dwt_site_prediction
The parameter file for site prediction.

COMMAND:
../Source/DWT_model ../DWT_models/Snyder_Stanford-StandardControl-CTCF.trained.dwt ../PeakSequences/Snyder_Stanford-StandardControl-CTCF.fa params_dwt_site_prediction 

OUTPUT FILES:
a. Snyder_Stanford-StandardControl-CTCF.sites.dwt
Output file with the predicted binding sites.

b. Snyder_Stanford-StandardControl-CTCF.seq_scores.dwt
Output file with total binding energies for all sequences.

DESCRIPTION:
When the DWT_TFBS_prediction program (invoked above as
../Source/DWT_model) is run in site prediction mode, i.e. with motif
training turned off, it simply predicts sites in all input sequences
using the provided DWT motif. As in training mode, it also calculates
a total binding energy for each sequence.
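
Training and site prediction use the same command with three
positional arguments (DWT file, FASTA file, parameter file), so the
call is easy to script. The Python sketch below only assembles and
prints the command. The helper name is hypothetical. Presumably the
parameter file selects the mode, but its contents are not shown in
this file.

```python
import shlex

def dwt_model_command(dwt_file, fasta_file, params_file,
                      binary="../Source/DWT_model"):
    """Build the argument list: DWT file, FASTA file, parameter file."""
    return [binary, dwt_file, fasta_file, params_file]

cmd = dwt_model_command(
    "../DWT_models/Snyder_Stanford-StandardControl-CTCF.trained.dwt",
    "../PeakSequences/Snyder_Stanford-StandardControl-CTCF.fa",
    "params_dwt_site_prediction",
)
print(shlex.join(cmd))

# To actually run it (requires the compiled binary and the input files):
# import subprocess
# subprocess.run(cmd, check=True)
```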

----------------------------------------------------------------------
4. Calculating posterior probabilities for dependencies in a DWT.

INPUT FILES:
a. Results/Myers_HudsonAlpha-BG_10-CEBP.trained.dwt
The DWT file whose posteriors for positional dependencies we want to
determine. 

COMMAND:
../Source/dependence_posterior Results/Myers_HudsonAlpha-BG_10-CEBP.trained.dwt > Results/Myers_HudsonAlpha-BG_10-CEBP.posteriors.dwt

OUTPUT FILES:
a. Results/Myers_HudsonAlpha-BG_10-CEBP.posteriors.dwt
Posterior probabilities for interaction between each pair of
positions. Each line has a pair of positions, followed by their
posterior probability of interaction, and finally the raw logR value
for the pair of positions.

DESCRIPTION:
The program positional_dependency_posterior (invoked above as
../Source/dependence_posterior) calculates the posterior probability
of a dependency between each pair of positions in the sites.
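
The posteriors file can be inspected with a short Python sketch such
as the one below. It assumes four whitespace-separated columns per
line (two integer positions, the posterior, and the logR value), which
follows the description under OUTPUT FILES. The delimiter is an
assumption, and the example lines are made up.

```python
def parse_posteriors(lines):
    """Parse lines of the assumed form: pos_i pos_j posterior logR."""
    records = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        pos_i, pos_j, posterior, log_r = line.split()
        records.append((int(pos_i), int(pos_j),
                        float(posterior), float(log_r)))
    return records

# Hypothetical content, not real program output.
example_lines = [
    "1 2 0.12 -0.8",
    "3 4 0.97 5.1",
    "2 9 0.55 0.3",
]
records = parse_posteriors(example_lines)

# Rank position pairs from strongest to weakest evidence for dependency.
ranked = sorted(records, key=lambda r: r[2], reverse=True)
```

To read a real file, pass the open file object instead of
example_lines.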


----------------------------------------------------------------------
5. Creating a DiLogo picture for a DWT.

INPUT FILES: 
a. Results/Myers_HudsonAlpha-BG_10-CEBP.trained.dwt
The DWT file to draw a dilogo for.

b. Results/Myers_HudsonAlpha-BG_10-CEBP.posteriors.dwt
File with posterior probabilities for the input DWT file.

COMMAND:
python ../Source/diLogo.py -i Results/Myers_HudsonAlpha-BG_10-CEBP.trained.dwt -p Results/Myers_HudsonAlpha-BG_10-CEBP.posteriors.dwt -c 0.0

OUTPUT FILES:
a. Myers_HudsonAlpha-BG_10-CEBP.pdf
A PDF with the dilogo representation of the motif.

DESCRIPTION:
The diLogo.py script produces a PDF with a dilogo from a DWT file and
the corresponding file with posterior probabilities. The option
"-c prob" restricts the dilogo to dependencies with posterior
probability at least prob. Here we set prob=0.0, so all dependencies
are included.
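
The effect of the cutoff can be sketched in Python as follows. This is
not the code of diLogo.py. It only illustrates the rule stated above,
that a dependency is shown when its posterior is at least prob. The
function name and the posterior values are hypothetical.

```python
def dependencies_to_show(posteriors, cutoff):
    """Keep the position pairs whose posterior is at least the cutoff."""
    return [(pair, p) for pair, p in posteriors if p >= cutoff]

# Made-up posteriors for three position pairs.
posteriors = [((1, 2), 0.95), ((3, 7), 0.40), ((4, 5), 0.0)]

shown_all = dependencies_to_show(posteriors, 0.0)     # cutoff as in the example
shown_strong = dependencies_to_show(posteriors, 0.5)  # a stricter cutoff
```

With a cutoff of 0.0 every pair passes, including a pair with
posterior exactly 0.0. With a cutoff of 0.5 only the pair (1, 2)
remains in this example.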

