Working with custom transcriptomes using factR2

Introduction

Many eukaryotic genes undergo alternative splicing to produce multiple RNA isoforms, thus expanding the coding and regulatory potential of the genome. The transcriptome of specific biological samples are constructed from high throughput sequencing experiments by mapping short or long reads to a reference genome and assembling alignments into transcript architectures. The resulting Gene Transfer Format (GTF) files describe newly identified transcripts as sets of exonic coordinates, but they do not contain information about the coding sequences (CDSs; also known as ORFs) or possible biological functions of these transcripts. We previously published {factR}, an R package for the construction of CDSs for novel RNA isoforms. {factR} also predicts the domain organisation of the protein products of these isoforms, and identifies transcripts that are susceptible to nonsense-mediated decay (NMD; a pathway destabilizing mRNAs with premature translation termination codons).

This new version of {factR}, called {factR2}, extends the functionality of its predecessor by adding tools to probe for the regulatory potential of alternative exons and enhances its user-friendliness by adopting an R object class. {factR2} pinpoints alternative splicing events that trigger NMD, and it tests this regulation by correlating exon splicing inclusion with gene expression level. Additional features of {factR2} include direct quantification of exonic sequence conservation scores, an interactive visualisation of transcript and domain architectures and a two-way plot of expression or exon splicing levels.

Getting started

Installing {factR2}

{factR2} is currently available on GitHub and can be installed using {devtools}:

# install.packages("devtools")
devtools::install_github("f-hamidlab/factR2")

Materials needed

{factR2} only requires a custom transcriptome file in GTF format as input. This file is typically generated by short-read assemblers such as StringTie2 (Kovaka et al. 2019)
and Cufflinks, (Trapnell et al. 2012) or by long-read aligners such as minimap2 (Li 2018) and Isoquant (Prjibelski et al. 2023). We provide a sample custom transcriptome file assembled from PacBio HiFi data using Isoquant that can be used to test {factR2} functionalities (see “Creating a factRObject” for more details).

Several {factR2} functions require reference genomic sequences and transcript annotations. To make it easy for users to obtain these files, we provide a convenient system to retrieve reference files of common model organisms from the GENCODE and Ensembl databases. Users may optionally use their own reference files as inputs.

Using factR2

The following packages are required to perform this workflow:

library(factR2)
library(dplyr)

Creating a factRObject

User-input custom transcriptome data are stored and analysed as part of an S4 class structure called factRObject. To create a factRObject, users can provide the local path to their custom GTF file, or a {GenomicRanges} object containing transcript architecture information of custom transcriptome. {factR2} comes with a sample GTF file assembled from long-read sequences of transcripts from postnatal mouse brain cortex (Joglekar et al. 2023). The local path to this GTF file can be retrieved as follows:

gtf.file <-  system.file("extdata/pb_custom.gtf.gz", package = "factR2")

In addition to a custom GTF file, users are required to define the genome used for read assembly. This is done by providing input to the reference argument and accepted inputs include the name of the organism (“Homo sapiens” or “Human”) or the specific ID of {factR2} supported genomes. This ID list can be retrieved by running the listSupportedGenomes() in console:

listSupportedGenomes()
#>     ID      species database release.date
#> 1  v44 Homo sapiens  GENCODE       7.2023
#> 2  v43 Homo sapiens  GENCODE       2.2023
#> 3  v42 Homo sapiens  GENCODE      10.2022
#> 4  v41 Homo sapiens  GENCODE       7.2022
#> 5  v40 Homo sapiens  GENCODE       4.2022
#> 6  v39 Homo sapiens  GENCODE      12.2021
#> 7  v38 Homo sapiens  GENCODE       5.2021
#> 8  v37 Homo sapiens  GENCODE       2.2021
#> 9  v36 Homo sapiens  GENCODE      10.2020
#> 10 v35 Homo sapiens  GENCODE       8.2020
#> 11 v34 Homo sapiens  GENCODE       4.2020
#> 12 v33 Homo sapiens  GENCODE       1.2020
#> 13 v32 Homo sapiens  GENCODE       9.2019
#> 14 v31 Homo sapiens  GENCODE       6.2019
#> 15 v30 Homo sapiens  GENCODE       4.2019
#> 16 v29 Homo sapiens  GENCODE      10.2018
#> 17 v28 Homo sapiens  GENCODE       4.2018
#> 18 v27 Homo sapiens  GENCODE       8.2017
#> 19 v26 Homo sapiens  GENCODE       3.2017
#> 20 v25 Homo sapiens  GENCODE       7.2016
#> 21 v24 Homo sapiens  GENCODE      12.2015
#> 22 v23 Homo sapiens  GENCODE       7.2015
#> 23 v22 Homo sapiens  GENCODE       3.2015
#> 24 v21 Homo sapiens  GENCODE      10.2014
#> 25 v20 Homo sapiens  GENCODE       8.2014
#>  [ reached 'max' / getOption("max.print") -- omitted 11 rows ]

With these inputs, a factRObject can be created as follows:

fobj <- createfactRObject(gtf.file, reference = "vM33")
#> 🡆  Checking inputs
#> 🡆  Checking factRObject
#> 🡆  Adding custom transcriptome
#>     ℹ Importing from local directory
#> 🡆  Adding annotation
#>     ℹ Importing from URL
#> 🡆  Adding genome sequence
#>     ℹ Using BSgenome object
#> 🡆  Matching chromosome names
#> 🡆  Matching gene information
#>     Number of mismatched gene_ids found: 83520
#>     --> Attempting to match ensembl gene_ids...
#>     --> No ensembl gene ids found in query
#>     ---> Attempting to match gene_ids by finding overlapping coordinates...
#>     ---> 65203 gene_id matched
#>     Total gene_ids corrected: 65203
#>     Remaining number of mismatched gene_ids: 18317
#> 🡆  Creating factRset objects
#> 
#>     ℹ Adding gene information
#> 
#>     ℹ Adding transcript information
#> 
#>     ℹ Adding alternative splicing information
#> 
#> 🡆  Annotating novel transcripts
#> 
#> 🡆  Annotating novel AS events
#> 
#> ℹ factRobject created!

The function above extracts and stores gene-, transcript- and exon-level information from a custom GTF file and, by default, corrects for inconsistencies in gene annotation to the reference transcriptome.

Interacting with a factRObject

{factR2} comes with wrapper functions to access the information stored within a factRObject. A preview of the object’s contents can be printed by typing the variable name:

fobj
#> class: factRObject [version 0.99.0]
#> # transcriptome:
#>    36931 genes; 
#>    115055 transcripts [99855 novel]; 
#>    0 coding transcripts 
#> # active set: AS
#> # samples (0):

Transcripts that contain distinct sets of internal exons are defined as “novel”. Note that the sample GTF lack coding sequence information and thus, none of the transcripts are currently annotated as “coding”.

The metadata of genes, transcripts and exons are segregated into individual “set” data in a factRObject. One of these “set” data will be assigned as the active “set” and users may switch the active “set” using the activeSet() function:

activeSet(fobj) <- "transcript"

The exact names of the “set” can be printed out using listSets():

listSets(fobj)
#> [1] "gene"       "transcript" "AS"

The metadata associated with the current active “set” can be displayed as a dataframe using the features() function:

features(fobj)
#> # A tibble: 115,055 × 8
#>    transcript_id              gene_id   gene_name strand width novel cds   nmd  
#>    <chr>                      <chr>     <chr>     <fct>  <int> <chr> <chr> <chr>
#>  1 transcript51190.chr3.nnic  ENSMUSG0… Gnai3     -       1333 yes   no    no   
#>  2 transcript62347.chr16.nnic ENSMUSG0… Cdc45     -       2279 yes   no    no   
#>  3 transcript62383.chr16.nnic ENSMUSG0… Cdc45     -        679 yes   no    no   
#>  4 transcript62392.chr16.nnic ENSMUSG0… Cdc45     -       2282 yes   no    no   
#>  5 transcript45445.chrX.nnic  ENSMUSG0… Scml2     +       1237 yes   no    no   
#>  6 transcript45454.chrX.nnic  ENSMUSG0… Scml2     +        677 yes   no    no   
#>  7 transcript45461.chrX.nnic  ENSMUSG0… Scml2     +        451 yes   no    no   
#>  8 transcript45463.chrX.nnic  ENSMUSG0… Scml2     +        669 yes   no    no   
#>  9 transcript45465.chrX.nnic  ENSMUSG0… Scml2     +        464 yes   no    no   
#> 10 transcript45467.chrX.nnic  ENSMUSG0… Scml2     +        425 yes   no    no   
#> # ℹ 115,045 more rows

To show more columns as an R dataframe, set show_more argument to TRUE:

features(fobj, show_more = TRUE)
#>                                         transcript_id               gene_id
#> transcript51190.chr3.nnic   transcript51190.chr3.nnic  ENSMUSG00000000001.5
#> transcript62347.chr16.nnic transcript62347.chr16.nnic ENSMUSG00000000028.16
#> transcript62383.chr16.nnic transcript62383.chr16.nnic ENSMUSG00000000028.16
#> transcript62392.chr16.nnic transcript62392.chr16.nnic ENSMUSG00000000028.16
#> transcript45445.chrX.nnic   transcript45445.chrX.nnic ENSMUSG00000000037.18
#> transcript45454.chrX.nnic   transcript45454.chrX.nnic ENSMUSG00000000037.18
#> transcript45461.chrX.nnic   transcript45461.chrX.nnic ENSMUSG00000000037.18
#> transcript45463.chrX.nnic   transcript45463.chrX.nnic ENSMUSG00000000037.18
#> transcript45465.chrX.nnic   transcript45465.chrX.nnic ENSMUSG00000000037.18
#> transcript45467.chrX.nnic   transcript45467.chrX.nnic ENSMUSG00000000037.18
#> transcript93435.chr11.nnic transcript93435.chr11.nnic ENSMUSG00000000049.12
#> transcript93455.chr11.nnic transcript93455.chr11.nnic ENSMUSG00000000049.12
#>                            gene_name strand width novel cds nmd
#> transcript51190.chr3.nnic      Gnai3      -  1333   yes  no  no
#> transcript62347.chr16.nnic     Cdc45      -  2279   yes  no  no
#> transcript62383.chr16.nnic     Cdc45      -   679   yes  no  no
#> transcript62392.chr16.nnic     Cdc45      -  2282   yes  no  no
#> transcript45445.chrX.nnic      Scml2      +  1237   yes  no  no
#> transcript45454.chrX.nnic      Scml2      +   677   yes  no  no
#> transcript45461.chrX.nnic      Scml2      +   451   yes  no  no
#> transcript45463.chrX.nnic      Scml2      +   669   yes  no  no
#> transcript45465.chrX.nnic      Scml2      +   464   yes  no  no
#> transcript45467.chrX.nnic      Scml2      +   425   yes  no  no
#> transcript93435.chr11.nnic      Apoh      +  1389   yes  no  no
#> transcript93455.chr11.nnic      Apoh      +  1020   yes  no  no
#>  [ reached 'max' / getOption("max.print") -- omitted 115043 rows ]

To display transcripts expressed from a particular gene, users can provide IDs or names of genes as input to the second argument of features(). For example, to query for the transcripts from U2af1 gene, we can type:

features(fobj, "U2af1", show_more = TRUE)
#>                                         transcript_id               gene_id
#> transcript80929.chr17.nnic transcript80929.chr17.nnic ENSMUSG00000061613.14
#> transcript80960.chr17.nnic transcript80960.chr17.nnic ENSMUSG00000061613.14
#> transcript80965.chr17.nnic transcript80965.chr17.nnic ENSMUSG00000061613.14
#> transcript81001.chr17.nnic transcript81001.chr17.nnic ENSMUSG00000061613.14
#> transcript81044.chr17.nnic transcript81044.chr17.nnic ENSMUSG00000061613.14
#>                            gene_name strand width novel cds nmd
#> transcript80929.chr17.nnic     U2af1      -  1038    no  no  no
#> transcript80960.chr17.nnic     U2af1      -   971    no  no  no
#> transcript80965.chr17.nnic     U2af1      -   971    no  no  no
#> transcript81001.chr17.nnic     U2af1      -   842   yes  no  no
#> transcript81044.chr17.nnic     U2af1      -   785    no  no  no

Multiple genes can be displayed by providing additional gene names:

features(fobj, "U2af1", "U2af2", show_more = TRUE)
#>                                         transcript_id               gene_id
#> transcript2976.chr7.nnic     transcript2976.chr7.nnic ENSMUSG00000030435.17
#> transcript2982.chr7.nnic     transcript2982.chr7.nnic ENSMUSG00000030435.17
#> transcript3019.chr7.nnic     transcript3019.chr7.nnic ENSMUSG00000030435.17
#> transcript3021.chr7.nnic     transcript3021.chr7.nnic ENSMUSG00000030435.17
#> transcript80929.chr17.nnic transcript80929.chr17.nnic ENSMUSG00000061613.14
#> transcript80960.chr17.nnic transcript80960.chr17.nnic ENSMUSG00000061613.14
#> transcript80965.chr17.nnic transcript80965.chr17.nnic ENSMUSG00000061613.14
#> transcript81001.chr17.nnic transcript81001.chr17.nnic ENSMUSG00000061613.14
#> transcript81044.chr17.nnic transcript81044.chr17.nnic ENSMUSG00000061613.14
#>                            gene_name strand width novel cds nmd
#> transcript2976.chr7.nnic       U2af2      +  1191    no  no  no
#> transcript2982.chr7.nnic       U2af2      +  1800   yes  no  no
#> transcript3019.chr7.nnic       U2af2      +   790   yes  no  no
#> transcript3021.chr7.nnic       U2af2      +  1354   yes  no  no
#> transcript80929.chr17.nnic     U2af1      -  1038    no  no  no
#> transcript80960.chr17.nnic     U2af1      -   971    no  no  no
#> transcript80965.chr17.nnic     U2af1      -   971    no  no  no
#> transcript81001.chr17.nnic     U2af1      -   842   yes  no  no
#> transcript81044.chr17.nnic     U2af1      -   785    no  no  no

To quickly display the metadata of a different “set”, users may modify the set argument:

features(fobj, "U2af1", set = "AS", show_more = TRUE)
#>           AS_id               gene_id gene_name                   coord AStype
#> AS06864 AS06864 ENSMUSG00000061613.14     U2af1 chr17:31870620-31870686     CE
#> AS06865 AS06865 ENSMUSG00000061613.14     U2af1 chr17:31871476-31871542     CE
#> AS06866 AS06866 ENSMUSG00000061613.14     U2af1 chr17:31873953-31874040     CE
#>         strand width novel
#> AS06864      -    67    no
#> AS06865      -    67    no
#> AS06866      -    88    no

or use convenient wrapper functions (see ?factR-meta for more):

ase(fobj, "U2af1", show_more = TRUE)
#>           AS_id               gene_id gene_name                   coord AStype
#> AS06864 AS06864 ENSMUSG00000061613.14     U2af1 chr17:31870620-31870686     CE
#> AS06865 AS06865 ENSMUSG00000061613.14     U2af1 chr17:31871476-31871542     CE
#> AS06866 AS06866 ENSMUSG00000061613.14     U2af1 chr17:31873953-31874040     CE
#>         strand width novel
#> AS06864      -    67    no
#> AS06865      -    67    no
#> AS06866      -    88    no

Each alternative splicing event has been assigned a internal unique ID for easy referencing and this can be found in the column “AS_id”.

The above features() function and other similar assessor functions outputs a dataframe that can be saved as a new variable for further wrangling. Alternatively, the output dataframe can be directly piped to downstream functions. In the example below, we show how to quickly count the number of events by the splicing type (AStype):

ase(fobj) %>% 
  group_by(AStype) %>% 
  tally()
#> ℹ Set `show_more to TRUE to show more info`
#> # A tibble: 7 × 2
#>   AStype     n
#>   <fct>  <int>
#> 1 CE      8728
#> 2 AD      2136
#> 3 AA      1611
#> 4 AF      2202
#> 5 AL      1505
#> 6 RI      3620
#> 7 NA       985

{factR2} is pre-built with functions to visualise transcript architectures on an interactive plot:

plotTranscripts(fobj, "U2af1")

Exon structures may sometimes appear thin and inconspicuous especially when flanking introns are extremely long. To focus on the exon architectures, users may set the rescale_introns argument to TRUE:

plotTranscripts(fobj, "U2af1", rescale_introns = T)

Running factR pipeline

{factR2} contains 5 key functions to probe for biological function of custom transcripts:

buildCDS(): to construct CDS information
getAAsequence(): to translate amino acid sequences
predictNMD(): to predict for NMD-sensitive transcripts
testASNMDevents(): to pinpoint splicing events leading to NMD
getAScons(): to quantify exon-level sequence conservation scores

The above functions can be simultaneously performed by running the runfactR() pipeline and this will update the metadata stored within the factRObject:

fobj <- runfactR(fobj)
#> 🡆  Building CDS information
#>     Searching for reference mRNAs in query
#>     16 reference mRNAs found and its CDS were assigned
#>     Building database of annotated ATG codons
#>     Selecting best ATG start codon for remaining transcripts and determining open-reading frame
#>     44009 new CDSs constructed
#> 
#>     Summary: Out of 115056 transcripts in `gtf`,
#>     44025 transcript CDSs were built
#>     ℹ Updating transcript feature data
#> 
#> 🡆  Running transcript-level NMD prediction
#> 
#>     Predicting NMD sensitivities for 44025 mRNAs
#>     ℹ Updating transcript feature data
#> 
#> 🡆  Translating amino acid sequences
#> 
#> 🡆  Running AS-NMD testing
#> 
#>     ℹ Getting best reference for AS-NMD testing
#> 
#>     ℹ Testing AS-NMD exons
#> 
#>     ℹ Updating AS feature data
#> 
#> 🡆  Quantifiying conservation scores
#> 
#>     ℹ Getting phastCons35way.UCSC.mm39 database
#> 
#>     ℹ Quantifying `exon` conservation scores with 0 padding

The number of newly-identified coding transcripts is now reflected in the object summary:

fobj
#> class: factRObject [version 0.99.0]
#> # transcriptome:
#>    36931 genes; 
#>    115055 transcripts [99855 novel]; 
#>    44025 coding transcripts 
#> # active set: transcript
#> # samples (0):

Coding transcripts that are predicted to be NMD-sensitive are now annotated as “yes” under the column “nmd” in the transcript metadata:

features(fobj, "U2af1", show_more = TRUE)
#>                                         transcript_id               gene_id
#> transcript80929.chr17.nnic transcript80929.chr17.nnic ENSMUSG00000061613.14
#> transcript80960.chr17.nnic transcript80960.chr17.nnic ENSMUSG00000061613.14
#> transcript80965.chr17.nnic transcript80965.chr17.nnic ENSMUSG00000061613.14
#> transcript81001.chr17.nnic transcript81001.chr17.nnic ENSMUSG00000061613.14
#> transcript81044.chr17.nnic transcript81044.chr17.nnic ENSMUSG00000061613.14
#>                            gene_name strand width novel cds nmd stop_to_lastEJ
#> transcript80929.chr17.nnic     U2af1      -  1038    no yes yes            417
#> transcript80960.chr17.nnic     U2af1      -   971    no yes  no           -142
#> transcript80965.chr17.nnic     U2af1      -   971    no yes  no           -142
#> transcript81001.chr17.nnic     U2af1      -   842   yes yes yes            430
#> transcript81044.chr17.nnic     U2af1      -   785    no yes yes            342
#>                            num_of_downEJs 3'UTR_length is_NMD      PTC_coord
#> transcript80929.chr17.nnic              5          689   TRUE chr17:31870658
#> transcript80960.chr17.nnic              0          130  FALSE           <NA>
#> transcript80965.chr17.nnic              0          130  FALSE           <NA>
#> transcript81001.chr17.nnic              5          702   TRUE chr17:31870671
#> transcript81044.chr17.nnic              4          614   TRUE chr17:31867832

The total number of NMD-sensitive transcripts can be queried as such:

features(fobj) %>% 
  group_by(nmd) %>% 
  tally()
#> ℹ Set `show_more to TRUE to show more info`
#> # A tibble: 2 × 2
#>   nmd        n
#>   <chr>  <int>
#> 1 no    111665
#> 2 yes     3390

Now, plotTranscripts() will distinguish between CDS and untranslated region (UTR) segments:

plotTranscripts(fobj, "U2af1", rescale_introns = T)

By default, the runfactR pipeline will compute the sequence conservation scores of the entire exon. To determine the conservation of intronic sequences flanking these alternative exons (50 base pairs on each side), users may rerun the getAScons() function and the newly-computed scores will be reflected as a separate column in the “AS” set metadata:

fobj <- getAScons(fobj, type = "flanks", padding = 50)
#> 🡆  Quantifiying conservation scores
#>     ℹ Getting phastCons35way.UCSC.mm39 database
#>     ℹ Quantifying `flanks` conservation scores with 50 padding
ase(fobj, show_more = TRUE)
#>           AS_id                     gene_id          gene_name
#> AS00001 AS00001 novel_gene_JH584304.1_72773               <NA>
#> AS00002 AS00002        ENSMUSG00000095041.8 ENSMUSG00000095041
#> AS00003 AS00003        ENSMUSG00000095041.8 ENSMUSG00000095041
#> AS00004 AS00004        ENSMUSG00000095041.8 ENSMUSG00000095041
#> AS00005 AS00005        ENSMUSG00000095041.8 ENSMUSG00000095041
#> AS00006 AS00006        ENSMUSG00000095041.8 ENSMUSG00000095041
#> AS00007 AS00007        ENSMUSG00000095041.8 ENSMUSG00000095041
#> AS00008 AS00008        ENSMUSG00000095041.8 ENSMUSG00000095041
#>                          coord AStype strand width novel ASNMDtype ASNMD.in.cds
#> AS00001 JH584304.1:51100-51324     AF      +   225   yes      <NA>           NA
#> AS00002 JH584304.1:57057-57151     AL      -    95   yes      <NA>           NA
#> AS00003 JH584304.1:58564-58692   <NA>      -   129   yes      <NA>           NA
#> AS00004 JH584304.1:58693-58835   <NA>      -   143   yes      <NA>           NA
#> AS00005 JH584304.1:58836-59092     AA      -   257   yes      <NA>           NA
#> AS00006 JH584304.1:59334-59480     AF      -   147   yes      <NA>           NA
#> AS00007 JH584304.1:59592-59773     CE      -   182   yes      <NA>           NA
#> AS00008 JH584304.1:60252-60266     CE      -    15   yes      <NA>           NA
#>         Cons.exon.pad0 Cons.flanks.pad50
#> AS00001             NA                NA
#> AS00002             NA                NA
#> AS00003             NA                NA
#> AS00004             NA                NA
#> AS00005             NA                NA
#> AS00006             NA                NA
#> AS00007             NA                NA
#> AS00008             NA                NA
#>  [ reached 'max' / getOption("max.print") -- omitted 20779 rows ]

Testing regulatory function of AS-NMD events

AS-NMD events are classified as “Stimulating” if the longer form (exon inclusion) of the transcript is NMD-sensitive or “Repressing” if the shorter form (exon skipping) of the transcript is NMD-sensitive. Both AS-NMD event types were detected in the U2af1 example:

ase(fobj, "U2af1", show_more = TRUE)
#>           AS_id               gene_id gene_name                   coord AStype
#> AS06864 AS06864 ENSMUSG00000061613.14     U2af1 chr17:31870620-31870686     CE
#> AS06865 AS06865 ENSMUSG00000061613.14     U2af1 chr17:31871476-31871542     CE
#> AS06866 AS06866 ENSMUSG00000061613.14     U2af1 chr17:31873953-31874040     CE
#>         strand width novel   ASNMDtype ASNMD.in.cds Cons.exon.pad0
#> AS06864      -    67    no        <NA>           NA              1
#> AS06865      -    67    no Stimulating         TRUE              1
#> AS06866      -    88    no  Repressing         TRUE              1
#>         Cons.flanks.pad50
#> AS06864             1.000
#> AS06865             1.000
#> AS06866             0.653

This can be visually verified by running plotTranscripts() with coordinates as input:

plotTranscripts(fobj, "chr17:31870294-31873065")

To further test the regulatory function of these AS-NMD events, we can correlate the exon inclusion levels (PSI or Percent Spliced In) of AS-NMD events to its gene expression level. {factR2} can compute gene expression counts and PSI values from transcript-level expression count matrices. To demonstrate this, we provide a sample transcript count matrix of all detected isoforms in postnatal mouse brain at two developmental timepoints. This count matrix can be added to the factRObject using the addTxCounts() function:

counts <- system.file("extdata/pb_expression.tsv.gz", package = "factR2")
fobj <- addTxCounts(fobj, counts)
#> 🡆  Adding expression data
#>     ℹ Importing from local file
#> 🡆  Creating samples metadata
#> 🡆  Processing expression data
#>     ℹ Adding gene counts
#>     ℹ Adding spliced-event counts
#>     ℹ Normalizing counts

Users may choose to use pre-computed PSI values from other splicing quantification tool by providing the path to the PSI matrix to the psi argument.

To test the correlation between exon PSI values and gene expression levels, users may use the testGeneCorr() function. By default, this function will perform a two-sided Pearson’s product moment correlation on variance-stabilised PSI and normalised gene expression expression. Users may customise the parameters of the correlation test by parsing addition arguments to the R cor.test() function. In the example below, we show how one would modify the testGeneCorr() function to carry out one-sided test:

fobj <- testGeneCorr(fobj, alternative="greater")
#> 🡆  Running correlation on 20784 events

The above function creates 2 new columns in AS metadata: gene.cor.estimate which returns the correlation coefficient of the test and gene.cor.pval which returns the unadjusted significant level. For AS-NMD events of type “Stimulating”, the PSI values have been transformed to 1-PSI so that the correlation coefficient of values for gene.cor.estimate are positive. In our example below, the AS-NMD stimulating event AS06865 showed significant correlation with U2af gene expression.

ase(fobj, "U2af1", show_more = TRUE)
#>           AS_id               gene_id gene_name                   coord AStype
#> AS06864 AS06864 ENSMUSG00000061613.14     U2af1 chr17:31870620-31870686     CE
#> AS06865 AS06865 ENSMUSG00000061613.14     U2af1 chr17:31871476-31871542     CE
#> AS06866 AS06866 ENSMUSG00000061613.14     U2af1 chr17:31873953-31874040     CE
#>         strand width novel   ASNMDtype ASNMD.in.cds Cons.exon.pad0
#> AS06864      -    67    no        <NA>           NA              1
#> AS06865      -    67    no Stimulating         TRUE              1
#> AS06866      -    88    no  Repressing         TRUE              1
#>         Cons.flanks.pad50 gene.cor.estimate gene.cor.pval
#> AS06864             1.000       0.813098344  0.0006501011
#> AS06865             1.000       0.834649627  0.0003661430
#> AS06866             0.653      -0.001520686  0.5018711510

We can plot a two-way plot of U2af1 expression against AS10419 PSI values using the plot2way() function:

plot2way(fobj, x="AS06865", y="U2af1", plot_trend=TRUE)

Predicting protein domains

factR2 can also inspect domain structure of protein products encoded by newly identified mRNA isoforms using its predictDomains() function. This function requires connection to the online PFAM database and may require a substantial amount of time and stable internet connection to query multiple transcripts simultaneously. To quickly explore and visualise the output of the predictDomains() tool, user may prefer to use the plotDomains() function. For example, the following will predict domain structure for all transcripts of the U2af1 gene:

plotDomains(fobj, "U2af1")

If you want to predict domains for all new transcripts and do not mind waiting for some time depending on the connection speed and the PFAM server load, run the following:

fobj <- predictDomains(fobj)

Export output objects

The factRObject can be saved as an RDS file and restored later for further analysis:

saveRDS(fobj, "factRObject.rds")

Key data stored in a factRObject (updated GTF file and metadata files) can be exported as text files in the local workding directory using the exportAll function:

exportAll(fobj)

Alternatively, users may use the exportGTF() and exportTable() functions to export individual data:

exportGTF(fobj)
exportTable(fobj, data = "transcript")  # data can be "gene", "transcript" or "AS"

Session Information

This workflow was conducted on:

sessionInfo()  
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
#>  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
#>  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Europe/London
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] phastCons35way.UCSC.mm39_3.16.0    AnnotationHub_3.10.1              
#>  [3] BiocFileCache_2.10.2               dbplyr_2.5.0                      
#>  [5] GenomicScores_2.14.3               BSgenome.Mmusculus.UCSC.mm39_1.4.3
#>  [7] BSgenome_1.70.2                    BiocIO_1.12.0                     
#>  [9] Biostrings_2.70.3                  XVector_0.42.0                    
#> [11] rtracklayer_1.62.0                 GenomicRanges_1.54.1              
#> [13] GenomeInfoDb_1.38.8                IRanges_2.36.0                    
#> [15] S4Vectors_0.40.2                   BiocGenerics_0.48.1               
#> [17] dplyr_1.1.4                        factR2_0.99.3                     
#> 
#> loaded via a namespace (and not attached):
#>   [1] RColorBrewer_1.1-3            rstudioapi_0.16.0            
#>   [3] jsonlite_1.8.8                magrittr_2.0.3               
#>   [5] GenomicFeatures_1.54.4        farver_2.1.1                 
#>   [7] rmarkdown_2.26                fs_1.6.3                     
#>   [9] zlibbioc_1.48.2               ragg_1.3.0                   
#>  [11] vctrs_0.6.5                   memoise_2.0.1                
#>  [13] Rsamtools_2.18.0              RCurl_1.98-1.14              
#>  [15] htmltools_0.5.8.1             S4Arrays_1.2.1               
#>  [17] progress_1.2.3                curl_5.2.1                   
#>  [19] Rhdf5lib_1.24.2               SparseArray_1.2.4            
#>  [21] rhdf5_2.46.1                  sass_0.4.9                   
#>  [23] bslib_0.7.0                   htmlwidgets_1.6.4            
#>  [25] desc_1.4.3                    plotly_4.10.4                
#>  [27] cachem_1.0.8                  GenomicAlignments_1.38.2     
#>  [29] mime_0.12                     lifecycle_1.0.4              
#>  [31] pkgconfig_2.0.3               Matrix_1.6-5                 
#>  [33] R6_2.5.1                      fastmap_1.1.1                
#>  [35] GenomeInfoDbData_1.2.11       MatrixGenerics_1.14.0        
#>  [37] shiny_1.8.1.1                 digest_0.6.35                
#>  [39] colorspace_2.1-0              factR_1.4.0                  
#>  [41] patchwork_1.2.0               AnnotationDbi_1.64.1         
#>  [43] DESeq2_1.42.1                 textshaping_0.3.7            
#>  [45] crosstalk_1.2.1               RSQLite_2.3.6                
#>  [47] filelock_1.0.3                labeling_0.4.3               
#>  [49] fansi_1.0.6                   mgcv_1.9-1                   
#>  [51] httr_1.4.7                    abind_1.4-5                  
#>  [53] compiler_4.4.1                bit64_4.0.5                  
#>  [55] withr_3.0.0                   BiocParallel_1.36.0          
#>  [57] DBI_1.2.2                     highr_0.10                   
#>  [59] HDF5Array_1.30.1              biomaRt_2.58.2               
#>  [61] rappdirs_0.3.3                DelayedArray_0.28.0          
#>  [63] rjson_0.2.21                  tools_4.4.1                  
#>  [65] interactiveDisplayBase_1.40.0 httpuv_1.6.15                
#>  [67] glue_1.7.0                    restfulr_0.0.15              
#>  [69] nlme_3.1-164                  promises_1.3.0               
#>  [71] rhdf5filters_1.14.1           grid_4.4.1                   
#>  [73] generics_0.1.3                gtable_0.3.4                 
#>  [75] tidyr_1.3.1                   data.table_1.15.4            
#>  [77] hms_1.1.3                     xml2_1.3.6                   
#>  [79] utf8_1.2.4                    BiocVersion_3.18.1           
#>  [81] pillar_1.9.0                  stringr_1.5.1                
#>  [83] later_1.3.2                   splines_4.4.1                
#>  [85] lattice_0.22-6                bit_4.0.5                    
#>  [87] tidyselect_1.2.1              locfit_1.5-9.9               
#>  [89] pbapply_1.7-2                 knitr_1.46                   
#>  [91] SummarizedExperiment_1.32.0   xfun_0.43                    
#>  [93] Biobase_2.62.0                matrixStats_1.2.0            
#>  [95] stringi_1.8.3                 lazyeval_0.2.2               
#>  [97] yaml_2.3.8                    evaluate_0.23                
#>  [99] codetools_0.2-20              tibble_3.2.1                 
#>  [ reached getOption("max.print") -- omitted 21 entries ]

References

Joglekar, Anoushka, Wen Hu, Bei Zhang, Oleksandr Narykov, Mark Diekhans, Jennifer Balacco, Lishomwa C Ndhlovu, et al. 2023. “Single-Cell Long-Read mRNA Isoform Regulation Is Pervasive Across Mammalian Brain Regions, Cell Types, and Development.” bioRxivorg, April.

Kovaka, Sam, Aleksey V. Zimin, Geo M. Pertea, Roham Razaghi, Steven L. Salzberg, and Mihaela Pertea. 2019. “Transcriptome Assembly from Long-Read RNA-Seq Alignments with StringTie2.” Genome Biology 20 (1): 278. https://doi.org/10.1186/s13059-019-1910-1.

Li, Heng. 2018. “Minimap2: Pairwise Alignment for Nucleotide Sequences.” Bioinformatics 34 (18): 3094–3100. https://doi.org/10.1093/bioinformatics/bty191.

Prjibelski, Andrey D, Alla Mikheenko, Anoushka Joglekar, Alexander Smetanin, Julien Jarroux, Alla L Lapidus, and Hagen U Tilgner. 2023. “Accurate Isoform Discovery with IsoQuant Using Long Reads.” Nat. Biotechnol. 41 (7): 915–18.

Trapnell, Cole, Adam Roberts, Loyal Goff, Geo Pertea, Daehwan Kim, David R Kelley, Harold Pimentel, Steven L Salzberg, John L Rinn, and Lior Pachter. 2012. “Differential Gene and Transcript Expression Analysis of RNA-seq Experiments with TopHat and Cufflinks.” Nat. Protoc. 7 (3): 562–78.

Fursham Hamid

2024-07-03