"" 
This repository contains the scripts and database used for the paper "Thousands of protein linear motif classes may still be undiscovered".
The usage example below reproduces the results presented in the paper.
For questions regarding the use of these scripts, please contact the authors of the paper.

Copyright (C) 2021 by Denys Bulavka <dbulavka@gmail.com>

Permission to use, copy, modify, and/or distribute this software for any purpose with or without fee is hereby granted.

THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF
THIS SOFTWARE.

""

Database description:
    The ELM database was downloaded on 22 May 2015 and saved as elm_original_20150522.csv (version history: http://elm.eu.org/infos/news.html). This file is in the folder 'database' and contains 210 motifs.
    
    We make the following simplifications:
        ^ (at protein start) - we ignore this
        $ (at protein end) - we ignore this
        | (OR) - we take the shorter version
        {0,1} - we take the shorter version
        (A) (amino acid modification) - we ignore this
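
A minimal Python sketch of two of these simplifications (dropping the anchors and the {0,1} quantifier) is given below. It is not the code used for the paper: the function name `simplify_motif` is hypothetical, and reading "the shorter version" of X{0,1} as "zero occurrences of X" is an assumption. The OR rule and the modification rule are not handled here.

```python
import re

# One regex "token": a bracket class such as [^P], a dot, or a single letter.
_TOKEN = r"(?:\[[^\]]+\]|\.|[A-Za-z])"


def simplify_motif(pattern: str) -> str:
    """Drop start/end anchors and optional ({0,1}) tokens from a motif regex."""
    # A leading ^ means "protein start"; a ^ inside [...] is a negation and is kept.
    if pattern.startswith("^"):
        pattern = pattern[1:]
    # A trailing $ means "protein end".
    if pattern.endswith("$"):
        pattern = pattern[:-1]
    # Assumption: the shorter version of X{0,1} has zero occurrences of X,
    # so the token is removed together with its quantifier.
    return re.sub(_TOKEN + r"\{0,1\}", "", pattern)


# Made-up patterns for illustration only (not taken from the database).
print(simplify_motif("^M[^P]{0,1}K$"))
print(simplify_motif("R.{0,1}[ST]"))
```

The sketch only recognises {0,1} after a single token; a quantifier after a parenthesised group would need a proper parser.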
    
    We also exclude the following two motifs:
        *TRG_PTS2	^.{1,40}R[^P][^P][^P][LIV][^P][^P][HQ][LIF]
        *LIG_PCNA_PIPBox_1	((^.{0,3})|(Q)).[^FHWY][ILM][^P][^FHILVWYP][HFM][FMY]..
    
    This leaves a list of 208 motifs. The file we will work with is "database/elm_input.txt".

Analysis description:
    We take the following approach to the analysis:
    Trim start and end positions with more than 10 characters.

    We identified motifs that:
        1. Have the same biological role
        2. Are described as "minor variants"

    The file 'database/marked_motifs.txt' lists the motifs from elm_input_modif.txt that we do not take into account.


Software:
    Elm_processing:
        Compile the code:
            cd elm_processing
            make
        After compilation the binary will be placed in the folder elm_processing/bin.
    
    Generate files for Python scripts:
        The input files for the Python scripts are already generated. If needed, they can be regenerated by running the following commands from the directory elm_processing/bin. Because >> appends to an existing file, remove any old copy of the output file first.
        ./main -d ../../database/elm_input.txt -c -m ../../database/marked_motifs.txt -r 6 >> structure_empiric_frequency.txt
        ./main -d ../../database/elm_input.txt -c -m ../../database/marked_motifs.txt -r 7 >> structure_theoretic_probabilities.txt

    Other available reports:
        Number of amino acids per coordinate:
            ./main -d ../../database/elm_input.txt -c -m ../../database/marked_motifs.txt -r 1
        Amino acids per coordinate:
            ./main -d ../../database/elm_input.txt -c -m ../../database/marked_motifs.txt -r 2     
        Number of k discriminant motifs:
            ./main -d ../../database/elm_input.txt -c -m ../../database/marked_motifs.txt -r 3 -k NUMBER     
        Pairs of k discriminant motifs:
            ./main -d ../../database/elm_input.txt -c -m ../../database/marked_motifs.txt -r 4 -k NUMBER
        For each k, outputs the number of pairs of motifs with k discriminant positions:
            ./main -d ../../database/elm_input.txt -c -m ../../database/marked_motifs.txt -r 5
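
For running several reports in a row, the command lines above can be assembled from Python. The sketch below is a convenience wrapper, not part of the repository: the names `build_command` and `run_report` are hypothetical, and it assumes it is run from elm_processing/bin after compilation. It only uses the flags shown above (-d, -c, -m, -r, -k). Unlike the >> redirection, `run_report` overwrites the output file.

```python
import subprocess
from typing import List, Optional

DATABASE = "../../database/elm_input.txt"
MARKED = "../../database/marked_motifs.txt"


def build_command(report: int, k: Optional[int] = None) -> List[str]:
    """Return the argument list for one report (1 to 7) of ./main."""
    if report not in range(1, 8):
        raise ValueError("report must be between 1 and 7")
    cmd = ["./main", "-d", DATABASE, "-c", "-m", MARKED, "-r", str(report)]
    # Reports 3 and 4 take the number of discriminant positions via -k.
    if report in (3, 4):
        if k is None:
            raise ValueError("reports 3 and 4 need a value for k")
        cmd += ["-k", str(k)]
    return cmd


def run_report(report: int, output: str, k: Optional[int] = None) -> None:
    """Run one report and write its standard output to a file (overwriting it)."""
    with open(output, "w") as handle:
        subprocess.run(build_command(report, k), stdout=handle, check=True)


if __name__ == "__main__":
    # Show the two commands that produce the inputs for the Python scripts.
    print(" ".join(build_command(6)))
    print(" ".join(build_command(7)))
```

Calling `run_report(6, "structure_empiric_frequency.txt")` and `run_report(7, "structure_theoretic_probabilities.txt")` would then regenerate the two input files.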
        
    Python scripts:
        Each Python script has an input directory and an output directory. To execute a script, change into its folder and run 'python main.py'; the output is generated and placed in the output directory. The rule is that a script with "empiric" in its name needs "structure_empiric_frequency.txt" in its input folder, while a script with "theoretic" in its name needs "structure_theoretic_probabilities.txt" in its input folder.
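
The naming rule above can be checked before running a script. The sketch below is illustrative only: the function names are hypothetical, and the input subfolder name `input` is an assumption, so pass the real folder name if it differs.

```python
from pathlib import Path
from typing import Optional


def required_input(script_name: str) -> str:
    """Map a script folder name to the input file it expects."""
    name = script_name.lower()
    if "empiric" in name:
        return "structure_empiric_frequency.txt"
    if "theoretic" in name:
        return "structure_theoretic_probabilities.txt"
    raise ValueError(f"cannot infer the input file for {script_name!r}")


def missing_input(script_dir: str, input_subdir: str = "input") -> Optional[Path]:
    """Return the expected input path if it is absent, otherwise None."""
    folder = Path(script_dir)
    # Assumption: the input folder sits directly inside the script folder.
    expected = folder / input_subdir / required_input(folder.name)
    return None if expected.is_file() else expected
```

For a hypothetical script folder named `plot_empiric`, `required_input("plot_empiric")` gives `structure_empiric_frequency.txt`.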