***************************
* DWT-Toolbox User Manual *
***************************

CONTENTS:
1. Creating an initial DWT file.
   1.1 Input PWM format.
   1.2 Output DWT motif format.
2. Using DWT_model
   2.1 Syntax
   2.2 Input files:
       2.2.1 DWT input file.
       2.2.2 Input sequence file.
       2.2.3 Parameter file
   2.3 DWT_model parameters
   2.4 Output files:
       2.4.1 The fitted DWT.
       2.4.2 Site file.
       2.4.3 Sequence score file.
3. Using dependence_posterior
   3.1 Syntax
   3.2 Input file.
   3.3 Output file.
4. Creating a dilogo
   4.1 Syntax
   4.2 Input files.
   4.3 Cutoff parameter.
   4.4 Output file.
5. Contact and credits.

*********************************************************************



1. Creating an initial DWT file.

To train a DWT model on input sequences the DWT_model program needs to
be provided with an initial position-specific weight matrix (PWM)
motif, and this initial PWM needs to be provided in DWT motif format. 

To transform an initial PWM motif into a DWT motif we provide a simple
perl script in the DWT-toolbox: 
init_dwt_from_pwm.pl

USAGE: perl init_dwt_from_pwm.pl input_pwm_file > output_dwt_file

1.1 Input PWM format
Our code uses the standard PWM format initially introduced by
TRANSFAC. Example PWM files can be found in the directory PWM_models.
The format is:
An initial line:
//
A name line:
NA motif_name
A format line:
PO A	C	G	T
Followed by one line for each motif position with format:
position n_A(position)	n_C(position)	n_G(position)	n_T(position)
Here the n_x(position) correspond to the number of times letter x was
observed at this position.
Finally, a closing line:
//
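
As an illustration of the format above, the following Python sketch
reads one motif from PWM text. It is not part of the DWT-toolbox; the
function name parse_pwm and the toy motif are made up for this
example, and fields are assumed to be separated by whitespace (tabs
or spaces).

```python
def parse_pwm(text):
    """Parse one motif in the TRANSFAC-style format described above."""
    name = None
    counts = []  # one [n_A, n_C, n_G, n_T] row per motif position
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0] == "//":
            continue  # skip blank lines and the opening/closing markers
        if fields[0] == "NA":
            name = fields[1]
        elif fields[0] == "PO":
            continue  # format line naming the four count columns
        else:
            # fields[0] is the position label, the next four are counts
            counts.append([float(x) for x in fields[1:5]])
    return name, counts


# Toy two-position motif, invented for illustration only.
example = """//
NA toy_motif
PO A C G T
01 10 0 0 0
02 0 5 5 0
//
"""
name, counts = parse_pwm(example)
print(name, counts)
```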

1.2 Output DWT format

The file format for DWT motifs is very similar to the format of the
PWM files. Instead of each line corresponding to a single position in
the site, each line corresponds to a pair of positions (i,j), with
i<j. Each line has 18 fields, with the first 2 fields giving the
positions i and j, and the remaining 16 fields giving the number of
observations n_xy(i,j) where the pair of letters (x,y) is observed at
the pair of positions (i,j). Here (x,y) are ordered: AA, AC, AG, AT,
CA, CC, CG, ..., TT. Examples of DWT files can be found in the
directory DWT_models.
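
The 18-field pair line described above can be read with a short
Python sketch like the one below. The helper name parse_dwt_line and
the toy line are hypothetical, and whitespace-separated fields are an
assumption; compare with the files in DWT_models before relying on it.

```python
from itertools import product

# Column order stated above: AA, AC, AG, AT, CA, CC, CG, ..., TT.
DINUCLEOTIDES = ["".join(pair) for pair in product("ACGT", repeat=2)]


def parse_dwt_line(line):
    """Return (i, j, counts) for one pair line of a DWT file."""
    fields = line.split()
    if len(fields) != 18:
        raise ValueError("expected 18 fields, got %d" % len(fields))
    i, j = int(fields[0]), int(fields[1])
    # Map each dinucleotide xy to its count n_xy(i,j).
    counts = dict(zip(DINUCLEOTIDES, (float(x) for x in fields[2:])))
    return i, j, counts


# Toy line for positions (1,2) with counts 0..15, for illustration only.
toy_line = "1 2 " + " ".join(str(k) for k in range(16))
i, j, counts = parse_dwt_line(toy_line)
print(i, j, counts["AA"], counts["CA"], counts["TT"])
```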

2. Using DWT_model

The executable DWT_model is the main executable of the DWT-toolbox and
can be used to either train a DWT from a set of input sequences, or
use a DWT to predict binding sites and the overall binding energies of
a set of input sequences. 

2.1 Syntax
USAGE: DWT_model dwt_file sequence_file parameter_file

2.2 Input files.
The DWT_model program takes 3 input files: a file with a DWT motif, a
file with input sequences, and a parameter file. All 3 input files are
required. 

2.2.1 DWT input file.
The DWT input file has the format just described in section 1.2.
Example DWT model files are provided in the directory DWT_models. When
DWT_model is run to train a motif the DWT input file gives an initial
starting motif. Such a starting DWT motif can, for example, be obtained
from a PWM motif as explained in section 1. 
Alternatively, when DWT_model is run to predict sites for a given DWT
motif, then this file contains the DWT motif model.

2.2.2 Input sequence file.
The input sequence file should be in FASTA format. Example input
sequence files (sequences from binding peaks of ChIP-seq experiments)
are provided in the directory PeakSequences.
When DWT_model is used to train a motif the input sequences should all
be sequences that are thought to contain one or more sites for the
motif.
When no motif training is done, i.e. when DWT_model is just used to
predict sites in sequences, the input sequence file may of course
also contain sequences that do not contain any sites.

2.2.3 Parameter file
The parameter file is required and tells DWT_model how it should
analyze the input sequences. The parameters are each discussed in the
following sections.

2.3 DWT_model parameters
The parameter file for DWT_model has a number of mandatory variables
that need to be specified, and also a number of optional parameters
(for which default values are used when not specified). 
Format: Each line in the parameter file that specifies a parameter
starts with the character #, followed by a keyword that indicates
which parameter is specified, followed by the value of the
parameter. We now discuss all parameters in turn.
Examples of parameter files for training a DWT or predicting sites
with a DWT can be found in the directory Examples.

2.3a Mandatory parameters:
#train_DWT (0 or 1)
This is a binary variable (0 or 1) that specifies whether a DWT motif
should be trained, or whether only sites should be predicted. When
train_DWT is set to 1, a DWT motif is trained starting from the input
DWT file. If train_DWT is set to 0, only predictions of sites and
sequence binding energies are made using the input DWT motif. 

#mode (DWT, ADJ, or PWM)
This parameter specifies what kind of motif model to use. In general
the user will want to use the DWT model and set mode to DWT. However,
for comparison it is also possible to either specify the ADJ mode, in
which only dependencies between adjacent positions are allowed, or to
specify PWM to use a traditional position-specific weight matrix
model. 

#background_type (0, 1, or 2)
The background type specifies what kind of background model to use. It
can be set to 0, 1, or 2. When set to 0, a uniform background model
(in which each nucleotide has probability 0.25) is used. If 1, the
background probabilities will be taken from the parameter file. In
that case, an additional five lines:
#background_probs
A probA
C probC
G probG
T probT
have to be specified, specifying the background probabilities for each
of the 4 nucleotides. 
When background_type is set to 2, the background model is set from the
input sequences, i.e. the probability of each nucleotide is set to its
frequency in the input sequences. 
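
To make the background options concrete, the Python sketch below
computes nucleotide frequencies from a set of sequences (the idea
behind background_type 2) and prints them as the five-line
#background_probs block used with background_type 1. This is an
illustration, not the DWT_model implementation: the function names
are made up, and ignoring characters other than A, C, G, T is an
assumption.

```python
from collections import Counter


def background_from_sequences(sequences):
    """Frequency of each nucleotide over all given sequences."""
    counts = Counter()
    for seq in sequences:
        # Assumption: anything other than A, C, G, T (e.g. N) is ignored.
        counts.update(base for base in seq.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no A, C, G or T found in the input sequences")
    return {base: counts[base] / total for base in "ACGT"}


def background_block(probs):
    """Format probabilities as the #background_probs block."""
    lines = ["#background_probs"]
    lines += ["%s %.4f" % (base, probs[base]) for base in "ACGT"]
    return "\n".join(lines)


probs = background_from_sequences(["ACGT", "AANN"])
print(background_block(probs))
```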

2.3b Optional parameters

#output_dir (string, default ./)
Directory where the output files should be written.

#minimum_score (floating point number, default -50)
This gives the minimal score (log-likelihood ratio of the site
sequence under the motif model vs under the background model) for a
site to be printed out. Only sites with score over the minimum score
are printed out. When the default of -50 is used, a score is printed
for every sequence segment. This can lead to very large output files. 

#precision (floating point number, default 25.0)
To ensure numerical stability of the determinant calculations that
occur in DWT_model, the range of values in the R matrix is limited to
a maximal range of exp(precision). By default precision is set to
25.0. This value is chosen so as to make precision as large as
possible while still ensuring numerical stability. Setting
significantly higher values of precision could cause numerical
instability. 

#padding_fraction (floating point number between 0 and 1, default 0)
For some datasets (such as HT-SELEX or protein binding arrays) the
input sequences can be short, i.e. on the order of the same size as
the motif length. In this case it can be desirable to score sites that
only partially overlap the input sequences. To allow this, DWT_model
provides the option to pad the input sequences with N nucleotides on
both sides so that partial sites containing these N nucleotides are
also scored. The parameter padding_fraction specifies how many
nucleotides should be added. The number of added N nucleotides on each
side equals the motif length times the padding fraction. 
By default the padding fraction is zero.
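
Putting the parameters above together, the following Python sketch
assembles the text of a parameter file. It is only an illustration:
the function name is made up, and a single space between keyword and
value is an assumption, so compare the result with the files in the
directory Examples. The #background_probs block needed for
background_type 1 is not handled here.

```python
def parameter_file_text(train_dwt, mode, background_type, **optional):
    """Build parameter-file text from the mandatory and optional values."""
    if train_dwt not in (0, 1):
        raise ValueError("train_DWT must be 0 or 1")
    if mode not in ("DWT", "ADJ", "PWM"):
        raise ValueError("mode must be DWT, ADJ, or PWM")
    if background_type not in (0, 1, 2):
        raise ValueError("background_type must be 0, 1, or 2")
    lines = [
        "#train_DWT %d" % train_dwt,
        "#mode %s" % mode,
        "#background_type %d" % background_type,
    ]
    # Optional parameters, e.g. output_dir, minimum_score, precision,
    # padding_fraction; omitted ones fall back to the program defaults.
    for keyword, value in optional.items():
        lines.append("#%s %s" % (keyword, value))
    return "\n".join(lines) + "\n"


text = parameter_file_text(
    1, "DWT", 2,
    output_dir="./results/",
    minimum_score=0.0,
    padding_fraction=0.5,
)
print(text)
```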

2.4 Output files.

DWT_model provides as output a file with predicted sites, a file
with total binding energies for each input sequence and, when
training is enabled (train_DWT set to 1), a trained DWT motif file.

2.4.1 The fitted DWT.
The name of the trained DWT motif model file equals
name.trained.mode,
where mode is either dwt, adj, or pwm depending on the motif model,
and 'name' is the name of the input DWT motif (as specified in the NA
line of the input DWT file). The format of the DWT motif file is
identical to that of the input DWT motif file.

2.4.2 Site file.
The name of the output file with predicted binding sites equals
name.sites.mode,
where mode is either dwt, adj, or pwm depending on the motif model,
and 'name' is the name of the input DWT motif (as specified in the NA
line of the input DWT file).
The site file contains one line for each binding site with a score
over the minimal score. Each site line has 7 fields corresponding to:
-The name of the sequence in which the site occurs.
-The start position in the sequence.
-The end position in the sequence.
-The strand on which the site occurred.
-The site sequence.
-The site score (log-likelihood ratio of the sequence under motif and
background models).
-The site posterior probability (expected fraction of the time that
the TF is bound at this position of the sequence).
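
For downstream analysis, a site line with the seven fields listed
above could be read as in the Python sketch below. The names Site and
parse_site_line, the toy line, and the "+" strand symbol are
invented for illustration, and whitespace-separated fields (with no
spaces inside sequence names) are an assumption.

```python
from collections import namedtuple

# Field order follows the list above.
Site = namedtuple(
    "Site", "seq_name start end strand sequence score posterior"
)


def parse_site_line(line):
    """Turn one line of a site file into a Site record."""
    fields = line.split()
    if len(fields) != 7:
        raise ValueError("expected 7 fields, got %d" % len(fields))
    name, start, end, strand, sequence, score, posterior = fields
    return Site(name, int(start), int(end), strand, sequence,
                float(score), float(posterior))


# Toy line, invented for illustration only.
example = "peak_1 12 23 + ACGTACGTACGT 8.31 0.92"
site = parse_site_line(example)
print(site.seq_name, site.start, site.end, site.score, site.posterior)
```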

2.4.3 Sequence score file.
The name of the output file with predicted total binding energies for
all sequences equals
name.seq_scores.mode,
where mode is either dwt, adj, or pwm depending on the motif model,
and 'name' is the name of the input DWT motif (as specified in the NA
line of the input DWT file). The seq_scores file has one line for each
input sequence, which has 3 fields:
-The name of the sequence.
-Its total binding energy E_s, which equals E_s = Log[sum_c exp(E_c)],
where the sum is over all sequence segments c within sequence s.
-The fraction of the time a TF will be bound to the sequence. 
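
The formula E_s = Log[sum_c exp(E_c)] can be evaluated as in the
Python sketch below, assuming the E_c are available as a list of
per-segment scores. Subtracting the largest value before
exponentiating is a standard trick to avoid overflow; it is used here
for illustration and is not necessarily how DWT_model computes E_s.

```python
import math


def total_binding_energy(segment_energies):
    """E_s = log(sum_c exp(E_c)), computed in a numerically stable way."""
    if not segment_energies:
        raise ValueError("need at least one segment energy")
    largest = max(segment_energies)
    # exp(E_c - largest) is at most 1, so the sum cannot overflow.
    shifted_sum = sum(math.exp(e - largest) for e in segment_energies)
    return largest + math.log(shifted_sum)


# Two equally good segments and one very poor one (toy values).
energies = [2.0, 2.0, -30.0]
print(total_binding_energy(energies))
```

Note that E_s is dominated by the highest-scoring segments: the
segment with energy -30 in the toy example contributes almost nothing.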

3. Using dependence_posterior
The executable dependence_posterior calculates, for each pair of
positions (i,j), the posterior probability p(i,j) for a direct
dependence to occur between these positions.

3.1 Syntax
USAGE: dependence_posterior dwt_file

3.2 Input file
The input file is simply a DWT file for which posterior probabilities
of dependence are to be calculated.

3.3 Output
Note that dependence_posterior writes its output to standard output,
i.e. one would typically redirect the results to a file. The output
consists of one line for each pair of positions (i,j), with i<j. Each
line has 4 fields:
-Position i.
-Position j.
-The posterior of dependence p(i,j).
-The value log[R(i,j)] of the direct evidence of dependence between
positions i and j. One can think of log[R(i,j)] as a measure of the
mutual information between the letters occurring at positions i and j.

4. Creating a dilogo
The python program diLogo.py can be used to produce a dilogo motif
picture (PDF) for a given DWT motif.

4.1 Syntax
USAGE: python diLogo.py -i dwt_file -p dependence_posterior_file -c cutoff
   
4.2 Input files.
The dwt_file is the input DWT motif file for which a dilogo is to be
produced. The dependence_posterior_file should contain the output of
the dependence_posterior program (described in section 3) run on the
same DWT motif. 

4.3 Cutoff parameter
The cutoff parameter (floating point number between 0 and 1) gives a
cutoff on the posterior probability of dependence for a connection to
be included in the dilogo. Dependencies will only be shown in the output
dilogo for pairs of positions (i,j) with posterior p(i,j) larger than
cutoff. The default, -c 0.0, shows all dependencies.

4.4 Output file.
The name of the output file (which is produced in the working
directory) is:
name.pdf,
where name is the name of the motif as specified in the DWT motif file.
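
The two steps of sections 3 and 4 can be chained from Python, as in
the sketch below. It assumes that dependence_posterior is on the
PATH and that diLogo.py is in the working directory; the function
names and file names are hypothetical. Only the command line is
printed here; make_dilogo actually runs the programs.

```python
import subprocess


def dilogo_command(dwt_file, posterior_file, cutoff=0.0):
    """Argument list for diLogo.py, following the USAGE line in 4.1."""
    return ["python", "diLogo.py", "-i", dwt_file,
            "-p", posterior_file, "-c", str(cutoff)]


def make_dilogo(dwt_file, posterior_file, cutoff=0.0):
    """Run dependence_posterior, save its output, then run diLogo.py."""
    # dependence_posterior writes to standard output, so redirect it.
    with open(posterior_file, "w") as out:
        subprocess.run(["dependence_posterior", dwt_file],
                       stdout=out, check=True)
    subprocess.run(dilogo_command(dwt_file, posterior_file, cutoff),
                   check=True)


# Show the diLogo.py command without running anything (toy file names).
print(dilogo_command("my_motif.dwt", "my_motif.post", 0.5))
```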

5. Contact and credits.

The DWT-toolbox was developed by Saeed Omidi, Mihaela Zavolan, and
Erik van Nimwegen.
The method was first presented in:
Automated Incorporation of Pairwise Dependency in Transcription Factor
Binding Site Prediction Using Dinucleotide Weight Tensors.
Saeed Omidi, Mihaela Zavolan, Mikhail Pachkov, Jeremie Breda, Severin
Berger, and Erik van Nimwegen
http://biorxiv.org/content/early/2016/09/28/078212

Please cite our original publication whenever you make use of the
toolbox.

Source code can be obtained from:
http://dwt.unibas.ch

Contact email address:
erik.vannimwegen@unibas.ch

Physical address:
Prof. Erik van Nimwegen
Biozentrum, University of Basel
Klingelbergstrasse 50-70
4056 Basel, Switzerland
