Database description:
    The database was downloaded on 22 May 2015 and saved as elm_original_20150522.csv in the folder 'database'. The version history is at http://elm.eu.org/infos/news.html. The file contains 210 motifs.
    
    We make the following simplifications:
        ^ (at start of protein) - we ignore this
        $ (at end of protein) - we ignore this
        | (OR) - we take the shorter version
        {0,1} - we take the shorter version
        (A) (amino acid modification) - we ignore this
    
    We also exclude the following two motifs:
        *TRG_PTS2	^.{1,40}R[^P][^P][^P][LIV][^P][^P][HQ][LIF]
        *LIG_PCNA_PIPBox_1	((^.{0,3})|(Q)).[^FHWY][ILM][^P][^FHILVWYP][HFM][FMY]..
    
    This leaves a list of 208 motifs. The file we work with is "database/elm_input.txt".
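
The anchor and {0,1} simplifications listed above can be sketched in Python. This is only an illustration under stated assumptions, not the procedure that was actually used to produce elm_input.txt: the function name `simplify` and the example motif are made up, "shorter version" of {0,1} is read as zero occurrences, and the OR and modification rules are not covered.

```python
import re

# One motif position: a character class such as [^P] or [LIV],
# the wildcard '.', or a single residue letter.
ATOM = r"(?:\[[^\]]+\]|\.|[A-Z])"

def simplify(motif):
    """Hypothetical helper: drop the ^ / $ anchors and optional positions."""
    # Only a leading '^' is an anchor; '^' inside [...] is a negation and is kept.
    if motif.startswith("^"):
        motif = motif[1:]
    if motif.endswith("$"):
        motif = motif[:-1]
    # Assumption: the shorter version of X{0,1} is the one without X.
    return re.sub(ATOM + r"\{0,1\}", "", motif)

# Made-up motif, not taken from the ELM database.
print(simplify("^M{0,1}[ST]P$"))
```

Motifs with a leading range such as ^.{1,40} are left alone apart from the anchor, since only the exact quantifier {0,1} is rewritten.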

Analysis description:
    We will take the following approach to the analysis:
    Trim start and end positions with more than 10 characters.
        
    We identified motifs that:
        1. Have the same biological role
        2. Are described as "minor variants"

    The file 'database/marked_motifs.txt' lists the motifs from elm_input_modif.txt that we do not take into account.


Software:
    Elm_processing:
        Compile the code:
            cd elm_processing
            make
        After compilation, the binary is placed in the folder elm_processing/bin.
    
    Generate files for Python scripts:
        The input files for the Python scripts are already generated. If needed, they can be regenerated by running the following commands from the directory elm_processing/bin:
        ./main -d ../../database/elm_input.txt -c -m ../../database/marked_motifs.txt -r 6 >> structure_empiric_frequency.txt
        ./main -d ../../database/elm_input.txt -c -m ../../database/marked_motifs.txt -r 7 >> structure_theoretic_probabilities.txt

    Other available reports:
        Number of amino acids per coordinate:
            ./main -d ../../database/elm_input.txt -c -m ../../database/marked_motifs.txt -r 1
        Amino acids per coordinate:
            ./main -d ../../database/elm_input.txt -c -m ../../database/marked_motifs.txt -r 2
        Number of k discriminant motifs:
            ./main -d ../../database/elm_input.txt -c -m ../../database/marked_motifs.txt -r 3 -k NUMBER
        Pairs of k discriminant motifs:
            ./main -d ../../database/elm_input.txt -c -m ../../database/marked_motifs.txt -r 4 -k NUMBER
        For each k, outputs the number of pairs of motifs with k discriminant positions:
            ./main -d ../../database/elm_input.txt -c -m ../../database/marked_motifs.txt -r 5
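
The report commands above all share the same arguments and differ only in the report number and the optional -k value, so they can be scripted. The following is a minimal sketch, assuming it is run from elm_processing/bin. The helper names `report_command` and `write_report` are made up for illustration; the flags and paths are copied from the commands above.

```python
import subprocess

# Paths as used in the commands above, relative to elm_processing/bin.
DATABASE = "../../database/elm_input.txt"
MARKED = "../../database/marked_motifs.txt"

def report_command(report, k=None):
    """Build the argument list for ./main with the given report number."""
    cmd = ["./main", "-d", DATABASE, "-c", "-m", MARKED, "-r", str(report)]
    if k is not None:
        cmd += ["-k", str(k)]  # only reports 3 and 4 take -k above
    return cmd

def write_report(report, out_file, k=None):
    """Run ./main and append its standard output to out_file, like '>>'."""
    with open(out_file, "a") as out:
        subprocess.run(report_command(report, k), stdout=out, check=True)

# Usage, from elm_processing/bin after compiling:
#   write_report(6, "structure_empiric_frequency.txt")
#   write_report(7, "structure_theoretic_probabilities.txt")
```

Because the output file is opened in append mode, as with '>>', an existing file should be removed first if a fresh report is wanted.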
        
    Python scripts:
        Each Python script has an input directory and an output directory. To execute a script, go into its folder and run 'python main.py'; the output is generated and placed in the output directory. A script with "empiric" in its name must have "structure_empiric_frequency.txt" in its input folder, while a script with "theoretic" in its name must have "structure_theoretic_probabilities.txt" in its input folder.
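
The naming rule above can be checked before running a script. The sketch below is an illustration only: the helper names are made up, and the input directory is assumed to be called 'input', which is not stated above and may differ in the repository.

```python
from pathlib import Path

# Substring of the script name -> input file it needs, as stated above.
REQUIRED_INPUT = {
    "empiric": "structure_empiric_frequency.txt",
    "theoretic": "structure_theoretic_probabilities.txt",
}

def required_input(script_name):
    """Return the input file a script needs, or None if the rule does not apply."""
    for key, file_name in REQUIRED_INPUT.items():
        if key in script_name:
            return file_name
    return None

def has_required_input(script_dir, input_dir_name="input"):
    """Check the script folder's input directory ('input' is an assumed name)."""
    script_dir = Path(script_dir)
    needed = required_input(script_dir.name)
    return needed is None or (script_dir / input_dir_name / needed).is_file()
```

For example, a folder with "empiric" in its name would be reported as ready only once structure_empiric_frequency.txt is present in its input directory.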