RNAmining is a web tool for predicting the coding potential of nucleotide sequences. It takes user-provided sequences in FASTA format. The tool was implemented using the XGBoost machine learning algorithm. Machine learning is a subfield of computer science that developed from the study of pattern recognition and computational learning theory in artificial intelligence. The tool operates through a model obtained from training data: the algorithm analyzes the training data and produces an inferred function, which can then be used to map new examples.

What files do I need to provide?

You need to upload your RNA sequences in FASTA format; see the example image below:
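As a text-only illustration of the format, here is a minimal Python sketch that reads FASTA records, which may help when checking a file before upload. It is not RNAmining's own code. The function name `parse_fasta` and the example sequences are hypothetical, and the sketch assumes each record starts with a `>` header line followed by one or more sequence lines.

```python
from typing import Dict, Iterable

# Hypothetical example records; the sequences are made up for illustration.
EXAMPLE_FASTA = """\
>seq1 example transcript
AUGGCCAUUGUAAUGGGCCGC
UGAAAGGGUGCCCGAUAG
>seq2
ACGUACGUACGU
"""


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {sequence_id: sequence} from FASTA-formatted lines."""
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:].split()
            if not header:
                raise ValueError("empty FASTA header line")
            # Use the first word after '>' as the record identifier.
            current_id = header[0]
            records[current_id] = ""
        elif current_id is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            # A sequence may span several lines; join them together.
            records[current_id] += line.upper()
    return records


if __name__ == "__main__":
    for seq_id, seq in parse_fasta(EXAMPLE_FASTA.splitlines()).items():
        print(seq_id, len(seq))
```

The sketch keeps only the first word of each header as the identifier and joins multi-line sequences into one string.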

How does the algorithm used work?

The algorithm begins by reading the RNA sequences provided in the uploaded file. Thereafter, it proceeds in two main parts: preprocessing and prediction. In preprocessing, the tool computes the trinucleotide frequencies of each RNA sequence and normalizes them by the sequence's length. The result is saved to a file, which is used as input for the second part. In prediction, the tool uses the organism type provided by the user (e.g. Homo sapiens) to select the corresponding organism-specific model trained with XGBoost and performs the prediction. The result is shown on the platform and can be downloaded as a .zip file.
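The preprocessing step can be illustrated with a short Python sketch. This is not RNAmining's actual implementation, and several details are assumptions: trinucleotides are counted with an overlapping sliding window of step 1, counts are divided by the full sequence length, T is converted to U, and windows containing ambiguous characters (e.g. N) are skipped. The function name `trinucleotide_frequencies` is hypothetical.

```python
from itertools import product
from typing import Dict

ALPHABET = "ACGU"
# All 64 possible trinucleotides, in a fixed order.
TRINUCLEOTIDES = ["".join(p) for p in product(ALPHABET, repeat=3)]


def trinucleotide_frequencies(sequence: str) -> Dict[str, float]:
    """Count overlapping trinucleotides and divide by the sequence length.

    Assumptions (not confirmed by the tool's description): sliding window
    of step 1, normalization by the full sequence length, T treated as U.
    """
    seq = sequence.upper().replace("T", "U")
    counts = {tri: 0 for tri in TRINUCLEOTIDES}
    for i in range(len(seq) - 2):
        window = seq[i:i + 3]
        if window in counts:  # skip windows with ambiguous characters
            counts[window] += 1
    length = len(seq)
    if length == 0:
        # Avoid division by zero for an empty sequence.
        return {tri: 0.0 for tri in TRINUCLEOTIDES}
    return {tri: n / length for tri, n in counts.items()}


if __name__ == "__main__":
    features = trinucleotide_frequencies("AUGGCCAUG")
    print(features["AUG"])
```

Under these assumptions each sequence becomes a vector of 64 values, which is the kind of input the prediction part could read. The organism-specific XGBoost models are not reproduced here.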

How can RNAmining help?

Non-coding RNAs (ncRNAs) are untranslated RNA molecules that are nonetheless important players in cellular regulation in organisms from different kingdoms. Research interest in non-coding RNAs has therefore increased dramatically in recent years. Their investigation is routine in every transcriptome or genome project, since mutations in them or their misregulation can result in disorders such as tumor formation (cancerous or otherwise), cardiovascular and neurological diseases, and other human illnesses. An important step in ncRNA research is therefore the ability to distinguish coding from non-coding sequences.

Thus, RNAmining was built to give researchers without programming experience easy access to coding potential prediction for nucleotide sequences. Additionally, the results are easy to interpret.