"" This repository contains the scripts and database used for the paper "Thousands of protein linear motif classes may still be undiscovered" Below is an example of the usage to reproduce the results presented in the paper For questions regarding the use of these scripts, please contact the authors of the paper. Copyright (C) 2021 by Denys Bulavka Permission to use, copy, modify, and/or distribute this software for any purpose with or without fee is hereby granted. THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE. "" Database description: The database was downloaded on the 22nd of May of 2015, named elm_original_20150522.csv, version history in http://elm.eu.org/infos/news.html. This file can be found in the folder 'database'. This file has 210 motifs. We make the following simplifications: ^(at protein start) - we ignore this $ (at end of protein) - we ignore this | OR - take the shorter version {0,1} - take the shorter version (A) aminoacid modification - we ignore this We also ignore the following ones: *TRG_PTS2 ^.{1,40}R[^P][^P][^P][LIV][^P][^P][HQ][LIF] *LIG_PCNA_PIPBox_1 ((^.{0,3})|(Q)).[^FHWY][ILM][^P][^FHILVWYP][HFM][FMY].. This generated a list of 208 motifs. The file we will be working with is "database/elm_input.txt". Analysis description: We will take the following approach to the analysis: Trim start and end positions with more than 10 characteres. We identified motifs that: 1. Have the same biological role 2. 
Described as "minor variants" The file 'database/marked_motifs.txt' has the list of motifs we will not take into account from elm_input_modif.txt. Software: Elm_processing: Compile the code: cd elm_processing make After compilation the binary will be placed in the folder elm_processing/bin. Generate files for python sripts: The corresponding input files for python scripts are already generated. But if needed, they can be generated using the following commands from the directory elm_processing/bin ./main -d ../../database/elm_input.txt -c -m ../../database/marked_motifs.txt -r 6 >> structure_empiric_frequency.txt ./main -d ../../database/elm_input.txt -c -m ../../database/marked_motifs.txt -r 7 >> structure_theoretic_probabilities.txt Other available reports: Number of aminoacids per coordinate: ./main -d ../../database/elm_input.txt -c -m ../../database/marked_motifs.txt -r 1 Aminoacids per coordinate: ./main -d ../../database/elm_input.txt -c -m ../../database/marked_motifs.txt -r 2 Number of k discriminant motifs: ./main -d ../../database/elm_input.txt -c -m ../../database/marked_motifs.txt -r 3 -k NUMBER Pairs of k discriminant motifs: ./main -d ../../database/elm_input.txt -c -m ../../database/marked_motifs.txt -r 4 -k NUMBER For each k, outputs the number of pairs of motifs with k discriminant positions: ./main -d ../../database/elm_input.txt -c -m ../../database/marked_motifs.txt -r 5 We can as well generate any of the previous reports using a randomly selected subset of the database, for example to select 50% of motifs at random and calculate the number of aminoacids per coordinate: ./main -d ../../database/elm_input.txt -c -m ../../database/marked_motifs.txt -r 1 -a 0.5 We can as well select a random sample of motifs, and in each of them uniformly at random we select uniformly at random a coordinate and set it to wildcard "." with the option -b %. 
For example, to select each motif class for modification with probability 0.5 and then calculate the number of amino acids per coordinate over this modified database:

    ./main -d ../../database/elm_input.txt -c -m ../../database/marked_motifs.txt -r 1 -b 0.5

Python scripts:

Each Python script has an input directory and an output directory. To execute a Python script, it is enough to be inside the script folder and run 'python main.py', which generates the output and places it in the output directory. The rule is that a script with "empiric" in its name should have "structure_empiric_frequency.txt" in its input folder, while a script with "theoretic" in its name should have "structure_theoretic_probabilities.txt" in its input folder.
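
The naming rule for the Python scripts can be checked before running them. The following is a minimal sketch and not part of the repository: the helper names `required_input_file` and `missing_inputs` are hypothetical, and it assumes that the script folders sit directly under one root folder and that each input directory is called "input". Adjust both to the actual layout.

```python
from pathlib import Path

# Rule taken from this README: keyword in the script name -> required input file.
REQUIRED_INPUT = {
    "empiric": "structure_empiric_frequency.txt",
    "theoretic": "structure_theoretic_probabilities.txt",
}


def required_input_file(script_folder_name):
    """Return the input file a script folder needs, or None if no rule applies."""
    for keyword, file_name in REQUIRED_INPUT.items():
        if keyword in script_folder_name:
            return file_name
    return None


def missing_inputs(scripts_root, input_dir_name="input"):
    """Return (folder name, file name) pairs whose required input file is absent.

    ASSUMPTION: script folders are direct children of scripts_root and the
    input directory is named input_dir_name; neither is stated in this README.
    """
    missing = []
    for folder in sorted(Path(scripts_root).iterdir()):
        if not folder.is_dir():
            continue
        file_name = required_input_file(folder.name)
        if file_name is None:
            continue  # folder name matches neither "empiric" nor "theoretic"
        if not (folder / input_dir_name / file_name).is_file():
            missing.append((folder.name, file_name))
    return missing
```

Calling `missing_inputs(".")` from the folder that holds the script folders returns an empty list when every script has the file it needs, and otherwise lists the folders whose input file still has to be generated with the -r 6 or -r 7 report.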