---
title: "Introduction to rgbio"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to rgbio}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Overview

`rgbio` reads and writes GenBank (.gb/.gbk/.gbff) files in R through an interface to the [gb-io](https://github.com/moshe/gb-io) Rust crate. It is designed to be fast and memory-efficient while providing R-friendly data structures.

## Why `rgbio`?

* the only way to directly *write* GenBank files from R (to my knowledge)
* much faster *reading* of GenBank files (~10x-30x faster than other packages in my benchmarks)
* reading into and writing from both tidy objects (e.g. tibbles/data.frames) and "Bioconductor Sequence Infrastructure" objects (e.g. `DNAStringSet`)
* robust parsing via the [gb-io](https://github.com/moshe/gb-io) Rust crate
* extensively tested on ~50 diverse GenBank files with many edge cases

## Installation

The `rgbio` package is not available on CRAN (for now), because it depends on a Rust crate. You can install it from the R-universe repository without installing Rust or any Rust toolchain, because pre-built binaries are available for Windows, macOS, and Linux.

```r
install.packages(
  "rgbio",
  repos = c("https://richardstoeckl.r-universe.dev", "https://cloud.r-project.org")
)
```

If no pre-built binary is available for your system, or you want the latest development version, you can install `rgbio` from GitHub, provided you have the Rust toolchain installed. Instructions for installing Rust are available at https://github.com/r-rust/hellorust.

```r
# install.packages("remotes")
remotes::install_github("richardstoeckl/rgbio")
```

## Basic Usage

### Loading the Package

```{r setup}
library(rgbio)
```

### Writing and Reading (Tidy Workflow)

To write a GenBank file in tidy mode, you typically provide:

1. **Sequences**: A named character vector (or `DNAStringSet`).
2. **Features**: A `data.frame` with columns `type`, `start`, `end`, `strand`, `qualifiers`.
3. **Metadata**: A list, `data.frame`, or `DataFrame` with record-level attributes.

Let's create a minimal example sequence.

```{r example-data}
# 1. The sequence
seq_dna <- "ATGCGTACGTTAGC"

# 2. Metadata
metadata <- list(
  definition = "Synthetic Example Sequence",
  accession = "EX0001",
  version = "1",
  molecule_type = "DNA",
  topology = "linear",
  division = "SYN",
  date = "01-JAN-2023"
)

# 3. Features
# Note: 'qualifiers' must be a list column where each element is a
# named character vector.
features_df <- data.frame(
  type = c("source", "gene", "CDS"),
  start = c(1L, 1L, 1L),
  end = c(14L, 14L, 14L),
  strand = c("+", "+", "+"),
  stringsAsFactors = FALSE
)
features_df$qualifiers <- list(
  c(organism = "Synthetic Organism", mol_type = "genomic DNA"),
  c(gene = "exampleGene"),
  # ATG CGT ACG TTA translates to M R T L
  c(gene = "exampleGene", product = "hypothetical protein", translation = "MRTL")
)

# Preview features
print(features_df)
```

Now, write it to a temporary file:

```{r write}
tmp_file <- tempfile(fileext = ".gb")

write_gbk(
  file = tmp_file,
  sequences = c(EX0001 = seq_dna),
  features = features_df,
  metadata = metadata
)
```

### Reading Back in Tidy Format

Reading is straightforward: `read_gbk()` parses the file and, with `format = "tidy"`, returns tidy tables.

```{r read}
records <- read_gbk(tmp_file, format = "tidy")
names(records)
```

### Inspecting the Data

The returned object has three components matching what we wrote.

**Metadata:**

```{r inspect-meta}
str(records$metadata)
```

**Sequence:**

```{r inspect-seq}
records$sequences$sequence[[1]]
```

**Features:**

The features are returned as a tidy `data.frame`.

```{r inspect-feat}
print(records$features)
```

### Writing and Reading (Bioconductor Workflow)

You can also use Bioconductor-native classes for input and output.

```{r bioc-workflow}
seqs_bioc <- Biostrings::DNAStringSet(c(EX0002 = "ATGCGGTTAA"))

gr <- GenomicRanges::GRanges(
  seqnames = "EX0002",
  ranges = IRanges::IRanges(start = c(1L, 1L), end = c(10L, 10L)),
  strand = c("+", "+")
)
S4Vectors::mcols(gr)$type <- c("source", "gene")
S4Vectors::mcols(gr)$qualifiers <- list(
  c(organism = "Synthetic Organism", mol_type = "genomic DNA"),
  c(gene = "exampleGene2")
)

meta_bioc <- S4Vectors::DataFrame(
  definition = "Bioconductor input example",
  accession = "EX0002",
  molecule_type = "DNA"
)

tmp_bioc <- tempfile(fileext = ".gb")
write_gbk(
  file = tmp_bioc,
  sequences = seqs_bioc,
  features = gr,
  metadata = meta_bioc
)

bioc_out <- read_gbk(tmp_bioc, format = "bioconductor")
class(bioc_out$sequences)
class(bioc_out$features)
class(bioc_out$metadata)
```

### Minimum Required Information for `write_gbk()`

Absolute minimum required inputs:

- `file`: output file path.
- `sequences`: non-empty named character vector or `DNAStringSet` with non-empty sequence strings.

Everything else is optional:

- `features`: optional (`NULL` is valid).
- `metadata`: optional (`NULL` is valid).

If `metadata` is omitted, `rgbio` fills the required record-level fields from the sequence names:

- `name`, `definition`, `accession` default to the record name.
- `molecule_type` defaults to `"DNA"`.

Practical note:

- `append = TRUE` requires that `file` already exists and is a valid GenBank file.
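
The minimal call and the `append` behaviour described above can be sketched as follows. This is an illustrative, non-evaluated example: the record names `REC1` and `REC2` are made up, and the sketch assumes that `append = TRUE` adds the new record after the existing ones, so that reading the file back yields both records.

```r
library(rgbio)

# Minimal call: only `file` and a named, non-empty sequence vector.
# REC1 and REC2 are hypothetical record names used for illustration.
tmp_min <- tempfile(fileext = ".gb")
write_gbk(
  file = tmp_min,
  sequences = c(REC1 = "ATGCATGC")
)

# Append a second record. The file already exists from the call above,
# which is what `append = TRUE` requires.
write_gbk(
  file = tmp_min,
  sequences = c(REC2 = "GGCCTTAA"),
  append = TRUE
)

# Read everything back and inspect the defaults that rgbio filled in
# (assumed here: name, definition, and accession taken from the record name).
minimal <- read_gbk(tmp_min, format = "tidy")
str(minimal$metadata)
```

Because neither `features` nor `metadata` is supplied, this is also a quick way to check which record-level defaults end up in the written file.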

### Supported Metadata Fields

The following metadata fields are supported by `write_gbk()` and returned by `read_gbk()`:

- `name` (Locus name)
- `definition`
- `accession`
- `version`
- `keywords` (character vector)
- `source`
- `organism`
- `molecule_type` (e.g., "DNA")
- `division`
- `topology` ("linear" or "circular")
- `date` (format: `DD-MON-YYYY`)
- `references` (list of references; each reference may include `description`, `authors`, `consortium`, `title`, `journal`, `pubmed`, `remark`)

## Advanced: Complex Locations

When reading GenBank files, `rgbio` preserves feature locations as the GenBank location expressions produced by the parser, including patterns such as:

- `join(1..10,20..30)`
- `complement(100..200)`
- fuzzy bounds such as `<5..>120`

In other words, `rgbio` focuses on **faithful I/O** of location syntax rather than fully symbolic location algebra. For advanced manipulations (interval arithmetic, set operations, transcript/CDS composition), use Bioconductor range tooling on the `GRanges` output or parse the location strings with specialized utilities.

## Performance

`rgbio` leverages Rust's zero-copy parsing where possible and efficient string handling to outperform pure R implementations, especially for large multi-record GenBank files.

See the [benchmarks article](https://richardstoeckl.github.io/rgbio/articles/benchmarks.html) for full benchmark details and methodology.

## Disclaimer

**Important note:** This is a project I started in order to play around with agentic coding, and it was written primarily by LLMs ("AI") under my direction. Nevertheless, it provides real value: it is one of the few ways to write GenBank files from R, and one of the fastest ways to read GenBank files into R. It uses the very robust Rust `gb-io` crate and is tested against ~50 diverse GenBank files with many edge cases.

This library is provided under the MIT License.
The gb-io Rust crate was written by David Leslie and is licensed under the terms of the MIT License. This project is in no way affiliated with, sponsored by, or otherwise endorsed by the original gb-io authors.