
## 7.13B: Annotating Genomes


Genome annotation is the identification and understanding of the genetic elements of a sequenced genome.

## LEARNING OBJECTIVES

Define genome annotation

## Key Takeaways

• Once a genome is sequenced, the resulting sequences must be analyzed to understand what they mean.
• Critical to annotation is the identification of the genes in a genome, the structure of the genes, and the proteins they encode.
• Once a genome is annotated, further work is done to understand how all the annotated regions interact with each other.
• BLAST : In bioinformatics, Basic Local Alignment Search Tool, or BLAST, is an algorithm for comparing primary biological sequence information, such as the amino-acid sequences of different proteins or the nucleotides of DNA sequences.
• in silico : Performed on a computer or via computer simulation.

Genome projects are scientific endeavors that ultimately aim to determine the complete genome sequence of an organism (be it an animal, a plant, a fungus, a bacterium, an archaean, a protist, or a virus). They annotate protein-coding genes and other important genome-encoded features. The genome sequence of an organism includes the collective DNA sequences of each chromosome in the organism. For a bacterium containing a single chromosome, a genome project will aim to map the sequence of that chromosome.

Once a genome is sequenced, it needs to be annotated to make sense of it. An annotation (irrespective of the context) is a note added by way of explanation or commentary. Since the 1980s, molecular biology and bioinformatics have created the need for DNA annotation. DNA annotation or genome annotation is the process of identifying the locations of genes and all of the coding regions in a genome and determining what those genes do.

Genome annotation is the process of attaching biological information to sequences. It consists of two main steps: identifying elements on the genome, a process called gene prediction, and attaching biological information to these elements. Automatic annotation tools try to perform all of this by computer analysis, as opposed to manual annotation (a.k.a. curation) which involves human expertise. Ideally, these approaches co-exist and complement each other in the same annotation pipeline (process). The basic level of annotation is using BLAST for finding similarities, and then annotating genomes based on that. However, nowadays more and more additional information is added to the annotation platform. The additional information allows manual annotators to deconvolute discrepancies between genes that are given the same annotation. Some databases use genome context information, similarity scores, experimental data, and integrations of other resources to provide genome annotations through their Subsystems approach. Other databases rely on both curated data sources as well as a range of different software tools in their automated genome annotation pipeline.

Structural annotation consists of the identification of genomic elements: ORFs and their localization, gene structure, coding regions, and the location of regulatory motifs. Functional annotation consists of attaching biological information to genomic elements: biochemical function, biological function, involved regulation and interactions, and expression.
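As a rough illustration of the structural side of annotation, the sketch below scans the forward strand of a DNA string for open reading frames (ORFs) in all three frames. It is a teaching example only: the function name `find_orfs`, the `min_codons` threshold, and the toy sequence are assumptions introduced here, and real gene predictors also handle the reverse strand, introns, and statistical models of coding sequence.

```python
# Minimal forward-strand ORF scan. Names and thresholds are illustrative
# assumptions, not part of any annotation pipeline described in the text.
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) tuples for ORFs on the forward strand.

    start is the 0-based index of the ATG and end is the index just past
    the stop codon. min_codons counts the start codon but not the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START:
                start = i  # open a candidate ORF at the first ATG
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # close the ORF and look for the next ATG
    return sorted(orfs)


example = "CCATGAAATTTGGGTAACC"
print(find_orfs(example))
```

Functional annotation would then attach information (for example, a BLAST hit description) to each reported interval.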

These steps may involve both biological experiments and in silico analysis. Proteogenomics-based approaches utilize information from expressed proteins, often derived from mass spectrometry, to improve genomics annotations. A variety of software tools have been developed to permit scientists to view and share genome annotations. Genome annotation is the next major challenge for the Human Genome Project, now that the genome sequences of human and several model organisms are largely complete. Identifying the locations of genes and other genetic control elements is often described as defining the biological “parts list” for the assembly and normal operation of an organism. Scientists are still at an early stage in the process of delineating this parts list and in understanding how all the parts “fit together.”

Open Access

## Twelve quick steps for genome assembly and annotation in the classroom

* E-mail: [email protected] (HJ); [email protected] (SE)

Affiliations School of Biological Sciences, The University of Queensland, St Lucia, Queensland, Australia, Centre for Agriculture and Bioeconomy, Queensland University of Technology, Brisbane, Queensland, Australia

Affiliation Genecology Research Centre, School of Science and Engineering, University of the Sunshine Coast, Sippy Downs, Queensland, Australia

Affiliation Institute of Marine and Environmental Technology, University of Maryland Center for Environmental Science, Baltimore, Maryland, United States of America

Affiliation Genetics and Breeding Research Center, National Institute of Fisheries Science, Geoje, Korea

Affiliation Biotechnology Research Division, National Institute of Fisheries Science, Busan, Korea

Affiliation Department of Life Science, Chung-Ang University, Seoul, Korea

• Hyungtaek Jung,
• Tomer Ventura,
• J. Sook Chung,
• Woo-Jin Kim,
• Bo-Hye Nam,
• Hee Jeong Kong,
• Young-Ok Kim,
• Min-Seung Jeon,
• Seong-il Eyun

Published: November 12, 2020

• https://doi.org/10.1371/journal.pcbi.1008325

Eukaryotic genome sequencing and de novo assembly, once the exclusive domain of well-funded international consortia, have become increasingly affordable, thus fitting the budgets of individual research groups. Third-generation long-read DNA sequencing technologies are increasingly used, providing extensive genomic toolkits that were once reserved for a few select model organisms. Generating high-quality genome assemblies and annotations for many aquatic species still presents significant challenges due to their large genome sizes, complexity, and high chromosome numbers. Indeed, selecting the most appropriate sequencing and software platforms and annotation pipelines for a new genome project can be daunting because tools often only work in limited contexts. In genomics, generating a high-quality genome assembly/annotation has become an indispensable tool for better understanding the biology of any species. Herein, we state 12 steps to help researchers get started in genome projects by presenting guidelines that are broadly applicable (to any species), sustainable over time, and cover all aspects of genome assembly and annotation projects from start to finish. We review some commonly used approaches, including practical methods to extract high-quality DNA and choices for the best sequencing platforms and library preparations. In addition, we discuss the range of potential bioinformatics pipelines, including structural and functional annotations (e.g., transposable elements and repetitive sequences). This paper also includes information on how to build a wide community for a genome project, the importance of data management, and how to make the data and results Findable, Accessible, Interoperable, and Reusable (FAIR) by submitting them to a public repository and sharing them with the research community.

Citation: Jung H, Ventura T, Chung JS, Kim W-J, Nam B-H, Kong HJ, et al. (2020) Twelve quick steps for genome assembly and annotation in the classroom. PLoS Comput Biol 16(11): e1008325. https://doi.org/10.1371/journal.pcbi.1008325

Editor: Francis Ouellette, University of Toronto, CANADA

Copyright: © 2020 Jung et al. This is an open access article distributed under the terms of the Creative Commons Attribution License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: This work was supported by the Korean Ministry of Agriculture, Food, and Rural Affairs (918010042HD030, Strategic Initiative for Microbiomes in Agriculture and Food) to SE. This work was also supported by a grant from the National Institute of Fisheries Science (R2020001) to WK. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

## Introduction

Genome projects employ state-of-the-art DNA sequencing, mapping, and computational technologies (including cross-disciplinary experimental designs) to expand our knowledge and understanding of molecular/cellular mechanisms, gene repertoires, genome architecture, and evolution. The revolution in new sequencing technologies and computational developments has allowed researchers to drive advances in genome assembly and annotation to make the process better, faster, and cheaper with key model organisms [ 1 , 2 ].

Such technical advantages and established recommendations and strategies have been widely applied in humans [ 3 – 6 ], terrestrial animals [ 7 – 12 ], and plants and crops [ 13 – 18 ]. Genomic applications in aquatic species of potential importance to aquaculture have advanced more slowly than in humans, livestock, and crops [ 19 – 21 ], a lag compounded by greater species diversity, a lack of reference genomes, and less established aquaculture industries. Given that aquaculture is the most rapidly expanding food sector, with the widest diversity of species cultured, it is poised for rapid adoption of genomics applications as these become more accessible. For specific advice on the application of genomics to aquaculture, please refer to previous works [ 19 – 25 ].

Before genome sequencing, an essential step is RNA sequencing (RNA-seq), which has provided significant insights into biological functions [ 26 – 30 ]. RNA-seq plays a key role in genome annotation [ 31 – 36 ] through the identification of protein-coding genes based on transcriptome sequencing data and ab initio or homology-based prediction. However, the use of RNA-seq for genome assembly is limited to genome scaffolding [ 37 ]. While RNA-seq is a powerful technology that will likely remain a key asset in the biologist’s toolkit, recent single-molecule mRNA sequencing approaches (e.g., Pacific Biosciences [PacBio] and Oxford Nanopore Technologies [ONT]) have provided significant improvements in gene and genome annotation, making them appealing alternatives or complementary techniques for genome annotation [ 38 – 40 ].

Restriction site–associated DNA sequencing and diversity array technology are cost-effective methods that mainly focus on the detection of loci and the segregation of variants or genome-wide single nucleotide polymorphisms. The generation of genetic linkage maps has been successfully applied to recognize key components in the sustainable production of aquaculture species [ 41 , 42 ]. These attempts have resulted in the emphasis of genomic evaluations/selections or advanced selective breeding programs for desirable traits, such as growth, sex determination, sex markers, and disease resistance [ 42 ]. While these inexpensive techniques have been powerful tools for understanding the genetics of adaptation, recent studies have indicated their limitations for genome scans because they will likely miss many loci under selection, particularly for species with short linkage disequilibrium [ 43 ]. However, the widespread use of whole-genome sequencing (WGS) allows the detection of a full range of common and rare/hidden genetic variants of different types across almost entire genomes.

Many seminal biological discoveries in the 20th century were made using only a genetic analysis of a few selected model organisms because they were readily available for genetic analysis [ 44 ]. However, in the 21st century, a high-quality and well-annotated genome assembly is increasingly becoming an essential tool for applied and basic research across many biological disciplines, one that can turn any organism into a model organism. Thus, securing more complete and accurate reference genomes and annotations before undertaking post-genome studies such as genome-wide association studies, structural variations, and posttranslational studies (methylation or histone modification) has become a cornerstone of modern genomics. Chromosome-level high-quality genomes (including structural and functional annotations) are differentiated from draft genomes by their completeness (low number of gaps and ambiguous Ns), low number of assembly errors, and a high percentage of sequences assembled into chromosomes. Advances in next-generation sequencing (NGS) technologies and their analytical tools have made assembling and annotating the genomic sequence of most organisms both more feasible and affordable [ 33 , 45 , 46 ]. Table 1 shows recent chromosome-level genome assemblies and provides a rough estimate of the sequencing depth and costs for beginners to achieve a chromosome-level genome assembly. For diploids, a minimum depth of 60× for PacBio or ONT, 60× for Illumina (San Diego, California, United States of America), and 100× for Hi-C data (Phase Genomics, Seattle, Washington, USA; Hi-C is an extension of chromosome conformation capture, 3C) is recommended. High-quality end-to-end genome assembly and annotation of small eukaryotic (approximately 1 Gb diploid) and prokaryotic organisms have become achievable with small-to-medium financial resources and limited time, labor, and skill commitments. However, genome assembly and annotation still represent a significant challenge for most aquatic species, which have large and complex genomes and no reference genomes.
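To make the depth recommendation concrete, the amount of sequence to order can be approximated as genome size multiplied by target depth. The sketch below is a back-of-the-envelope calculation only. It assumes the depths are applied to the haploid genome size and reads the 60× long-read figure as covering either PacBio or ONT; the dictionary and function names are introduced here for illustration.

```python
# Rough data-volume planning: gigabases to order = genome size (Gb) x depth.
# The depth values come from the recommendation in the text; treating the
# first entry as "PacBio or ONT" is an assumption about how to read it.
RECOMMENDED_DEPTH = {"PacBio or ONT": 60, "Illumina": 60, "Hi-C": 100}


def required_gigabases(genome_size_gb, depth):
    """Return the gigabases of raw sequence needed for a given depth."""
    return genome_size_gb * depth


def sequencing_plan(genome_size_gb, depths=RECOMMENDED_DEPTH):
    """Return a {data type: gigabases} plan for one genome."""
    return {name: required_gigabases(genome_size_gb, depth)
            for name, depth in depths.items()}


# Example: a genome of approximately 1 Gb, as mentioned in the text.
for name, gigabases in sequencing_plan(1.0).items():
    print(f"{name}: {gigabases:.0f} Gb of raw data")
```

Actual orders usually include a margin above these minimums, because some reads are lost to quality filtering.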


https://doi.org/10.1371/journal.pcbi.1008325.t001

Furthermore, the following fundamental questions should be addressed: Why are genome projects or WGS necessary? What is the aim of a genome project? What kind of information is the research community expected to gather? Even from the beginning of a genome project, describing the expected end product, including project duration/budget, chromosome end-to-end completion, genome browser, and research paper, is required. In particular, if budget is a major obstacle, the best option to raise funds to support the genome project (e.g., industry or government support) must be determined. In addition to the abovementioned limitations, another essential element is bioinformatics, which has become a common denominator to produce and use software that can be applied to biological data in different contexts. As big data and multi-omics analyses are becoming mainstream, computational proficiency and literacy are indispensable skills in a biologist’s toolkit in modern scientific society. All “omics” studies require a certain degree of computational biology: The implementation of analyses requires programming skills and knowledge of computer languages, while experimental design and interpretation require a solid understanding of analytical approaches [ 47 , 48 ]. These could be daunting tasks for biologists who are unfamiliar with computational standards (e.g., codes, pipelines, and system environments) and resources (e.g., SourceForge, Bitbucket, GitLab, and GitHub). While academic cores, commercial services, and collaborations can aid in the implementation of analyses, the computational literacy required to design and interpret omics studies cannot simply be replaced or supplemented [ 47 , 48 ].

In the absence of a standard approach for genome projects, this paper aims to provide practical steps to facilitate project completion before embarking upon a genome assembly and annotation project (mainly for eukaryotic genomes). The target audience is anyone entering this field for the first time, particularly those who do not specialize in genomics research. While we can strive to answer questions in a manner that considers the beginner’s perspective, certain aspects (e.g., assembly algorithms and computer environments) might require further reading for an in-depth understanding.

## Step 1: Build a wide community for the project if possible

All genome projects have a common but monumental goal: sequencing the entire target genome for a wide range of genomics applications. While genomics is a rich field, one of the most prominent scientific objectives is probably securing the future of sustainable food sources by harnessing the power of genomics (i.e., desirable traits) [ 19 – 21 , 23 – 25 ], particularly for agriculture. If the species of interest is distinct from the wild, cultured, or harvested, it necessitates networking and building a scientific or stakeholder community to support the project. This usually requires a multi-institutional effort to both initiate and—more importantly—complete the genome project and then interpret the vast quantities of sequencing information produced for any given organism. As expected, WGS/genome projects’ infrastructure demands are particularly high as varying interpretations may require facilities, personnel (skill intensive), and software (knowledge intensive) that suit the needs of immediate analyses, ongoing reanalyses, and the integration of genomic and other phenotype information (or desirable traits). Data storage, maintenance, transfer, and analysis costs will also likely remain substantial and represent an increasing proportion of overall sequencing costs in the future. Moreover, professional groups (including students), expert panels, and field farmers acknowledge that there is a need for educational programs specific to WGS demands. Addressing these needs will likely require substantial investment by agriculture production care systems. Thus, the real cost of WGS—including ongoing maintenance—could be even higher. Despite these burdens, most genome projects bring together leading researchers to work together and build large datasets of DNA from target genomes, which has significantly benefited the research community. These efforts facilitate the sharing of sequence data and help research advance. 
In particular, smaller research groups that have less experience and are poorly equipped for raw read sequencing, assembly, and annotation should consider addressing the main features and steps outlined here via community collaboration. In the case of funding for genome projects, applying for government grants and receiving corporate sponsorships as a consortium could be considered potential solutions, as these avenues have been successful for humans, livestock (cow, pig, and sheep), crops (Arabidopsis, rice, and tomato), and aquaculture (salmon, oyster, tilapia, and prawn).

## Step 2: Gather information about the target genome

Every genome sequencing, assembly, and annotation project is different due to each subject genome’s distinctive properties. There are four fundamental aspects that must be considered when embarking on a new genome project: the genome size, levels of ploidy and heterozygosity, GC content, and complexity. These will directly affect the overall quality and cost of genome sequencing, assembly, and annotation [ 14 , 49 ].

How big is the genome? The genome size will greatly influence the amount of data that must be ordered and analyzed. To assemble a genome, determining how many sequences (called reads) are needed to reach the target depth of coverage is the first step before ordering sequence data. To get an idea of the size and complexity of a genome, publicly available databases for approximate genome sizes are accessible for fungi ( http://www.zbi.ee/fungal-genomesize ), animals ( http://www.genomesize.com ), and plants ( http://data.kew.org/cvalues ). Selecting a closely related species is a practical option if the information on a target species is unavailable from a public database. Alternatively, two widely used methods, flow cytometry and k-mer frequency distribution, can provide reliable genome size estimates, and the k-mer approach can also predict repeat content and heterozygosity rates. Flow cytometry is a fast, easy, and accurate method for the simultaneous multiparametric analysis of nuclear DNA content, including ploidy level, in isolated nuclei stained with a fluorescent dye [ 50 , 51 ]. K-mer frequency distribution, a pseudo-normal/Poisson distribution around the mean coverage in the histogram of k-mer counts, is a powerful and straightforward approach that uses raw Illumina DNA shotgun reads to infer genome size, preprocess data for de Bruijn graph assembly methods (tuning runtime parameters for analysis tools), detect repeats, estimate sequencing coverage, and measure sequencing error rates and heterozygosity [ 52 , 53 ]. It is highly recommended to use both flow cytometry and k-mer methods (the gold standard for genome size measures when designing genomic sequencing projects) because no single sequence-based method performs well for all species, and they all tend to underestimate genome sizes [ 54 ].

Is it a diploid, polyploid, or highly heterozygous hybrid species? If possible, it is better to use a single individual and sequence a haploid, a highly inbred diploid organism [ 20 , 23 , 55 ], or an isogenic line [ 56 ] because this will minimize potential heterozygosity problems for genome assembly. While most genome assemblers run in haploid mode (some offer a diploid-aware mode) and collapse allelic differences into one consensus sequence, using complex polyploid or less inbred diploid genomes can greatly increase the number of alleles present, which will likely result in a more fragmented assembly or create uncertainties about the contigs’ homology [ 14 , 49 ]. In such cases, polyploid and highly repetitive genomes may require 50% to 100% more sequence data than their diploid counterparts [ 14 ].
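The k-mer approach mentioned above can be sketched in a few lines: count every k-mer in the reads, find the main peak of the k-mer count histogram (the k-mer depth), and divide the total number of k-mers by that depth. The code below is a simplified illustration under several assumptions: it ignores reverse complements, does not model heterozygosity or repeats, and uses a simulated error-free read set from a random circular toy genome. The function names and the `min_depth` cutoff are introduced here and are not taken from any specific tool.

```python
from collections import Counter
import random


def kmer_counts(reads, k):
    """Count every k-mer occurrence across all reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def estimate_genome_size(reads, k=21, min_depth=2):
    """Estimate genome size as total k-mers divided by the peak k-mer depth.

    min_depth skips low-multiplicity k-mers, which in real data are mostly
    sequencing errors; the value 2 is an arbitrary illustrative cutoff.
    """
    counts = kmer_counts(reads, k)
    histogram = Counter(counts.values())  # multiplicity -> distinct k-mers
    solid = {d: n for d, n in histogram.items() if d >= min_depth}
    if not solid:
        raise ValueError("no k-mers at or above min_depth")
    peak_depth = max(solid, key=solid.get)
    total_kmers = sum(counts.values())
    return total_kmers / peak_depth, peak_depth


# Simulated data: a random 2,000 bp circular genome tiled with 100 bp reads
# every 4 bp, so each 21-mer position is covered by the same number of reads.
random.seed(0)
genome = "".join(random.choice("ACGT") for _ in range(2000))
read_len, step = 100, 4
circular = genome + genome[:read_len - 1]
reads = [circular[p:p + read_len] for p in range(0, len(genome), step)]

size, depth = estimate_genome_size(reads, k=21)
print(f"peak k-mer depth: {depth}, estimated genome size: {size:.0f} bp")
```

With real Illumina reads, dedicated k-mer counters and histogram-fitting tools are normally used instead, and the estimate should be cross-checked against flow cytometry as the text recommends.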

Is there high/low GC content in a genomic region? Extremely low or high GC content in a genomic region is known to cause problems for second-generation sequencing (SGS) technologies (also called short-read sequencing; mainly referring to Illumina sequencing), resulting in low or no coverage in those regions [ 57 ]. While this can be compensated for by increasing the coverage, we would recommend using third-generation sequencing (TGS) technologies (PacBio and ONT) that do not exhibit this bias [ 14 , 49 ].
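When a related reference or a draft assembly is available, regions of extreme GC content can be located with a simple sliding-window scan. The sketch below is illustrative only: the window size, step, and the `low`/`high` cutoffs are arbitrary assumptions chosen for a toy sequence, not thresholds from the text, and real scans typically use windows of hundreds to thousands of bases.

```python
def gc_fraction(seq):
    """Return the fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / counted


def windowed_gc(seq, window=10, step=10):
    """Return (window_start, gc_fraction) for each full window."""
    return [(start, gc_fraction(seq[start:start + window]))
            for start in range(0, len(seq) - window + 1, step)]


def flag_extreme_windows(seq, window=10, step=10, low=0.25, high=0.75):
    """Return the windows whose GC fraction falls outside [low, high]."""
    return [(start, gc) for start, gc in windowed_gc(seq, window, step)
            if gc < low or gc > high]


toy = "ATATATATAT" + "GCGCGCGCGC" + "ATGCATGCAT"
print(windowed_gc(toy))
print(flag_extreme_windows(toy))
```

Windows flagged this way are candidates for the low or missing short-read coverage described above.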

How many repetitive sequences (or transposable elements) will likely be present in the genome? The amount and distribution of repetitive sequences, potentially occurring at different locations in the genome, can hugely influence genome assembly results, simply because reads from these different repeats are very similar and the assemblers’ algorithms cannot distinguish them effectively. This may eventually lead to misassembly and misannotation. This is particularly true for SGS reads and assemblies, and a high repeat content will often lead to a fragmented assembly because the assemblers cannot effectively determine the correct assembly of these regions and simply stop extending the contigs at the border of the repeats [ 58 ]. To resolve the assembly of repeats (or if the subject genome has a high repeat content), using TGS reads that are sufficiently long to include the unique sequences flanking the repeats is an effective strategy [ 14 , 49 ]. Thus, understanding the target genome and generating sufficient sequence data/read coverage is a crucial starting point in a genome assembly and annotation project.
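The point about read length can be expressed as a simple coordinate check: a read resolves a repeat only if it covers the whole repeat plus some unique sequence on both sides. The sketch below uses hypothetical coordinates, and the `min_flank` value of 50 bp is an arbitrary assumption made for illustration.

```python
def spans_repeat(read_start, read_length, repeat_start, repeat_length,
                 min_flank=50):
    """Return True if the read covers the repeat plus min_flank bases
    of unique sequence on each side (all coordinates are 0-based)."""
    read_end = read_start + read_length
    repeat_end = repeat_start + repeat_length
    left_flank = repeat_start - read_start
    right_flank = read_end - repeat_end
    return left_flank >= min_flank and right_flank >= min_flank


# Hypothetical 6 kb repeat starting at position 10,000.
repeat = {"repeat_start": 10_000, "repeat_length": 6_000}

# A 150 bp short read landing inside the repeat cannot anchor it.
print(spans_repeat(read_start=12_000, read_length=150, **repeat))

# A 20 kb long read starting upstream covers the repeat and both flanks.
print(spans_repeat(read_start=5_000, read_length=20_000, **repeat))
```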

## Step 3: Design the best experimental workflow

To meet the experimental goals and answer various biological questions, each application must come with different experimental designs. Above all, the development of high-quality chromosomally assigned reference genomes constitutes a key feature for understanding a species’ genome architecture and is critical for the discovery of the genetic blueprints for biologically significant traits. Once the reference genome has been completed, follow-up post-genome studies can be substantially completed with high accuracy.

While NGS is a useful tool for determining DNA sequences, certain parameters need to be considered prior to running an NGS experiment, such as quality control, SGS versus TGS, read length, read quality/error rate, number of reads, genome read coverage/depth, library preparation, and downstream applications. Recent papers have provided useful recommendations and strategies to ensure the success of NGS experiments by selecting the correct products/technologies and methods for the project [ 14 , 59 – 61 ]. If money is no obstacle, using TGS data (PacBio and ONT) and Hi-C data is recommended [ 14 ], which are also widely accepted approaches for reaching a chromosome-level genome assembly ( Table 1 ) for aquaculture or any other species. While a hybrid approach using Illumina/10x Genomics Chromium (10xGC) and Hi-C data has been proposed as a cost-effective method, this approach’s contiguity could be lower than that of the combination of TGS data and Hi-C data [ 14 ].

Another important point to consider is whether genome assembly should be de novo or reference guided/assisted ( Table 2 ). De novo assembly is the most widely adopted, but when complete genomes of closely related species are available, reference-guided/assisted genome assembly could be an attractive option because of its lower requirements for coverage data and computational memory [ 14 ]. However, early works have warned against its applications in genome assembly because the resultant assemblies may contain biases toward errors and chromosomal rearrangements in the existing reference genome [ 62 – 64 ]. No matter which assembly approaches and technologies are taken, genome assembly’s purpose is to construct a consensus haploid or haploid-phased chromosome-level assembly. Most extensively used genome assemblers typically collapse the 2 sequences into 1 haploid consensus sequence and thus fail to capture the diploid nature of target organisms. While this has been a key challenge in the bioinformatics and biology community, recent works have demonstrated the effectiveness of generating accurate and complete haplotype-resolved assemblies for diploid and polyploid species ( Table 2 ). While we have provided a brief summary of commonly used tools ( Table 2 ), the comprehensive program list focused on TGS reads can be accessed at LRS-DB ( https://long-read-tools.org ). Thus, selecting the appropriate tools and pipelines is important to achieve accurate chromosome-scale assemblies in a timely manner by leveraging speed and sensitivity in the contiguity and quality of genome assemblies.

https://doi.org/10.1371/journal.pcbi.1008325.t002

## Step 4: Choose the best sequencing platforms and library preparations

To sequence an organism’s entire genome (WGS), it must be prepared into a sample library from high-quality genomic DNA. A library is a collection of randomly sized DNA fragments that represent the sample input; its size can vary depending on the choice of sequencing technology. Sample library preparation for WGS is dependent on two considerations: (1) the genome size of the target sample organism; and (2) the amount of sample available to be sequenced. Given the vast range of library preparation products, we can only provide general suggestions for library preparations. For more platform-specific library preparation and sequencing guides, refer to the vendor’s products and/or services page. The recommended procedure is to select the best and most cost-effective library preparation and sequencing technology after considering the given research goal and budget.

The rapid adoption of WGS has been facilitated by the development of SGS and TGS technologies, which have dramatically reduced sequencing costs and simplified genome assembly. It is possible to select short reads (Illumina, 454, SOLiD, and Ion Torrent), long reads (ONT and PacBio), or a combination of both (hybrid). Comprehensive guidelines (including pros and cons) for selecting the correct sequencing technology have been extensively described in previous works [ 14 , 59 , 61 , 65 ]. Briefly, while SGS technologies can produce high-throughput, fast, cheap, and highly accurate reads of lengths in the range 75 to 700 bp, they show limited ability to resolve complex regions with repetitive or heterozygous sequences, which results in incomplete or heavily fragmented genome assemblies. According to Illumina, the TruSeq PCR-free Library Preparation Kit, a widely used SGS option, is ideal for genomes of any size when a large sample input (2 μg of genomic DNA) is available. The Nextera DNA Library Prep Kit (Illumina), by contrast, is suited to large and complex genomes with a small sample input. The TruSeq Nano DNA Library Prep Kit (Illumina) is ideal for genomes of any size when only a small sample input (200 ng of genomic DNA) is available, whereas the Nextera XT DNA Library Preparation Kit (Illumina) is suited to small genomes, plasmids, and amplicons. Additional Illumina library preparation methods and sequencing platforms for high throughput have been extensively reviewed [ 66 , 67 ].

Meanwhile, TGS technologies can produce long single-molecule reads (averaging >30 kb) with complete contiguity, facilitating assembly. However, long-read technologies suffer from both high costs per base and high error rates. To overcome these disadvantages, the PacBio SEQUEL II system (Pacific Biosciences, Menlo Park, California, USA) has been released; it can generate 10 to 15 times more data than the original SEQUEL system, with even more accurate long reads (HiFi reads can reach ABI Sanger quality at lengths of up to 40 kb). According to PacBio, the SMRTbell Template Prep Kit (Pacific Biosciences) with 20 to 40 kb template preparation using BluePippin Size Selection is recommended for WGS [ 14 , 68 ]. For ONT, a combination of ligation sequencing, PCR sequencing, and rapid sequencing has been optimized for WGS [ 60 , 69 ]. In particular, the Rapid Sequencing Kit (SQK-RAD004) can produce even longer reads, some of which exceed 2 Mb [ 70 ].

Combining data from both SGS and TGS in a “hybrid approach/assembly” can compensate for the downsides of both approaches and is a cost-effective method because SGS data can correct errors in TGS reads [ 33 , 71 – 75 ]. Alternatively, an advanced “hybrid” approach, such as incorporating 10xGC data or medium-size single-molecule DNA fragment selection and tagging before short-read sequencing, could be a practical strategy to increase the continuity and accuracy of long reads [ 14 ]. While recent studies have highlighted the efficacy and cost-effectiveness of 10xGC linked-reads in diploid aquatic species’ genomes [ 76 – 79 ], the utility of this technology for complex and/or polyploid aquatic species is still being investigated. According to 10xGC, the Chromium Genome Reagent Kit is ideal for preparing linked-read libraries.

Regardless of the sequencing technology and approach (SGS, TGS, or hybrid), incomplete and/or unfinished assemblies can still occur (e.g., those with gaps and fragments). Thus, additional techniques such as optical mapping (BioNano, San Diego, California, USA) and chromatin association (Hi-C) are highly recommended to facilitate contig joining and genome assembly completion [ 46 , 80 – 83 ]. Use of the Hi-C method over BioNano has been observed in aquaculture species ( Table 1 ). The most widely used kit is the Proximo Hi-C Kit provided by Phase Genomics ( https://www.phasegenomics.com/hi-c-kits ).

## Step 5: Select the best possible DNA source and DNA extraction method

The extraction of high-quality DNA is the most important aspect of a successful genome project. Given the potential breadth of aquaculture species, each with their own peculiarities, extracted high-molecular-weight DNA should be free of contaminants from either the source material itself or the DNA extraction procedure (e.g., polysaccharides, proteoglycans, proteins, secondary metabolites, polyphenols/polyphenolics, humic acids, carbohydrates, and pigments). While recent publications and commercial kits have provided valuable guidance [ 84 – 86 ], DNA extraction methodologies can be explored and adapted along the lines provided by the literature. In general, the minimum required DNA input is >3 ng for Illumina and 10xGC, >20 μg for PacBio, >1 μg for ONT, >200 ng for BioNano, and >5 μg for Dovetail [ 14 ]. Depending on the project budget and sequencing platform accessibility, SGS and/or TGS technologies can be considered; we recommend TGS, provided that DNA with an average fragment size >25 kb can be delivered. Certain species (e.g., mollusks containing high levels of polysaccharide) warrant more careful planning than others. A modified low-salt cetyltrimethylammonium bromide extraction protocol has produced excellent quality DNA of high molecular weight that is free from contaminants and shearing [ 87 ]. Other important considerations are the heterozygosity rate, amplification, and presence of other tissues/organisms [ 14 , 49 ]. The heterozygosity rate can be reduced by using a single individual for extraction. However, certain organisms require a pool of individuals to retrieve a sufficient amount of DNA, which will increase the genetic variability and lead to a more fragmented assembly. Attractive strategies include generating an inbred line of individuals for low-heterozygosity pooled sequencing and/or sequencing of haploid tissues as the foundation for filtering out paralogous sequence variants.
These have been successful for cost-effective WGS and for optimizing the precision of allele and haplotype frequency estimates in aquaculture breeding [ 19 , 20 , 24 , 42 , 55 ]. When few cells are available, the genomic DNA must be amplified before sequencing, but this can often result in uneven coverage due to artificial effects (chimeric and/or fused unrelated sequences). The introduction of unwanted/unrelated organisms (e.g., contaminants and/or symbionts) and/or tissues (e.g., mitochondria and/or chloroplasts) should be minimized at the extraction and library preparation stages. This requires using tissue with a higher ratio of nuclear over organelle DNA because this can lead to higher coverage of the nuclear genome in the sequences. Whichever approach is adopted, there will be a need to refine the method to achieve several important quality metrics for genome sequencing.
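The minimum DNA inputs quoted above can be turned into a quick feasibility check before choosing a platform. The following is a minimal sketch, assuming the thresholds stated in the text [ 14 ] converted to nanograms; the function name `platforms_supported` is hypothetical and chosen only for illustration.

```python
# Minimum DNA input per platform in nanograms, as quoted in the text [14].
MIN_INPUT_NG = {
    "Illumina": 3,
    "10xGC": 3,
    "PacBio": 20_000,   # 20 ug
    "ONT": 1_000,       # 1 ug
    "BioNano": 200,
    "Dovetail": 5_000,  # 5 ug
}


def platforms_supported(available_ng):
    """Return, in sorted order, the platforms whose quoted minimum input is exceeded."""
    return sorted(
        name for name, minimum in MIN_INPUT_NG.items() if available_ng > minimum
    )


if __name__ == "__main__":
    # Example: 1.5 ug of extracted genomic DNA.
    print(platforms_supported(1_500))
```

Such a check addresses quantity only; purity and fragment size still need to be assessed separately.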

Care should be taken with quality parameters (e.g., the chemical purity and structural integrity of DNA), and two recent works have made the recommendations outlined below for long-read technologies [ 14 , 49 ]. Generally, the measurement/quantification of purified DNA should be performed using both spectrophotometric and fluorescence-based methods (e.g., Qubit). Samples with optical density (OD260:OD280) ratios of 1.8 to 2.0 are usually free of protein contamination. A concentration ratio close to 1:1 between the spectrophotometric and fluorimetric measurements is a very good indicator that a sample will be sequenced efficiently. To determine the integrity of DNA samples, contour-clamped homogeneous electric field or pulsed-field gel electrophoresis is appropriate when used with TapeStation or Fragment Analyzer (Agilent Technologies, Santa Clara, California, USA). Analyzing isolated DNA in this manner also facilitates decisions regarding shearing DNA to attain an optimal size range for sequencing. Thus, it is always worth investing time in obtaining high-quality DNA, which will result in high-quality data and assembly and ultimately save time and money.
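The two purity indicators described above (an OD260:OD280 ratio of 1.8 to 2.0 and a roughly 1:1 agreement between spectrophotometric and fluorimetric concentrations) can be checked programmatically. This is a minimal sketch; the function name `dna_qc` and the `ratio_tolerance` of 0.2 around the 1:1 ratio are assumptions made for illustration and are not values given in the text.

```python
def dna_qc(od260, od280, spectro_ng_ul, fluoro_ng_ul, ratio_tolerance=0.2):
    """Summarize the purity indicators for one DNA sample.

    ratio_tolerance is an assumed cutoff for how far the
    spectrophotometry/fluorimetry ratio may deviate from 1:1.
    """
    purity = od260 / od280
    agreement = spectro_ng_ul / fluoro_ng_ul
    return {
        "od260_280": round(purity, 2),
        "protein_free": 1.8 <= purity <= 2.0,
        "spectro_to_fluoro": round(agreement, 2),
        "concentrations_agree": abs(agreement - 1.0) <= ratio_tolerance,
    }


if __name__ == "__main__":
    print(dna_qc(od260=1.9, od280=1.0, spectro_ng_ul=100, fluoro_ng_ul=95))
```

A sample that fails either flag is a candidate for re-extraction or clean-up before library preparation.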

## Step 6: Check the computational resources and requirements

Installing open-source tools in one’s computational environment is not always either straightforward or trivial. It generally poses three potential problems: (1) the prerequisites of the tools created by diverse developers employing diverse programming frameworks differ; (2) the installation of various software items in one environment can lead to hard-to-resolve software dependency conflicts; and (3) upon successful installation, maintaining the environment and ensuring that all tools (including changes and updates) are working as expected remain difficult. Therefore, managing the data analysis environment becomes increasingly complex when a project requires many tools for genomic data analysis. While addressing the importance of the appropriate data and computing infrastructure to genome projects is difficult, the two following options (see Step 7: maximizing in-house workers or collaboration and outsourcing from the service provider) can be considered.

Access to high-performance computing or cloud-based computing systems is crucial for genome projects that require a large number of computing resources. As a general guide, the successful assembly of a moderately sized diploid genome (approximately 1 Gb) using software pipelines (Tables 1 and 2 ) requires a minimum computing resource of 96 physical central processing unit (CPU) cores, 1 TB of high-performance random-access memory (RAM), 3 TB of local storage, and 10 TB of shared storage [ 14 ]. However, the guide is scalable based on the amount of data, genome size, heterozygosity rate, and ploidy. Please note that runtimes, memory requirements, number of CPUs, and computational costs will increase geometrically because genome assembly is an all-by-all comparison. However, hard drive space to store raw and/or intermediate data (e.g., storage space) will increase linearly as the total amount/depth of coverage required does not dramatically change as genomes increase in size. In addition, the recommendations stated here will likely apply to larger and more complex genomes (e.g., crustaceans with numerous chromosomes) but at a slower rate and with higher computing resources and costs (obtaining more computing resources will increase costs). If participants’ or collaborators’ institutions are equipped with large in-house high-performance computing resources, they will likely have more direct access and practical assistance in their genome project. Otherwise, cloud-based computing is a potential solution that has been widely emphasized in previous works including easy-to-follow steps [ 88 – 90 ]. While cloud computing provides flexibility, competitive pricing, and continually updated hardware and software, it still requires assistance from information technology (IT) specialists to set up suitable cloud-based software. Thus, users should consider all possible options (including their research budget) to achieve the best outcome.
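The linear growth of raw-data storage noted above follows from simple arithmetic: the number of sequenced bases is the genome size multiplied by the depth of coverage. The sketch below illustrates this; the function name `raw_fastq_gb` is hypothetical, and the default of roughly 2 bytes per base (one sequence character plus one quality character, ignoring headers and compression) is a rough assumption rather than a figure from the text.

```python
def raw_fastq_gb(genome_size_bp, coverage, bytes_per_base=2.0):
    """Rough uncompressed FASTQ size in gigabytes for a given depth of coverage.

    bytes_per_base is an assumed approximation (sequence + quality characters).
    """
    total_bases = genome_size_bp * coverage
    return total_bases * bytes_per_base / 1e9


if __name__ == "__main__":
    for size_gb in (1, 2, 4):
        genome_bp = size_gb * 1_000_000_000
        print(f"{size_gb} Gb genome at 60x: ~{raw_fastq_gb(genome_bp, 60):.0f} GB raw reads")
```

Doubling the genome size doubles this estimate, whereas assembly memory and runtime grow faster than linearly and cannot be estimated this simply.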

## Step 7: Choose the best computational design and pipeline

Optimizing a computational design and securing sufficient computer resources are essential steps to succeed in a genome assembly and annotation project. In addition, computational proficiency and literacy have become vital skills for biologists to design and interpret big data analyses and multi-omics studies [ 48 ]. Given the vast range of computational tools and requirements (different resource demands between assembly and annotation for each species), general suggestions are provided on the computational aspect. However, when establishing the best and most cost-effective computational design and requirement, it is important to consider three options: (1) maximizing in-house workers or collaboration; (2) outsourcing from a service provider; and (3) simulating data with different settings. Ultimately, the most suitable and practical approach in methodological computational biology research is recommended because there is no perfect computational design for genome assembly and annotation.

Before embarking on any actual data analyses, the overall goals should first be defined with a clear understanding of the in-house workers and facilities, because computational design requires extensive knowledge of both computing and biology, which is a great challenge for most wet lab researchers/groups. If in-house workers and computer facilities are not ready to deliver successful outcomes, cross-disciplinary collaborations (computer science, data science, bioinformatics, and biology) could present great solutions. Initiating and successfully maintaining cross-disciplinary collaborations can be challenging but is highly rewarding because the combination of methods, data, and interdisciplinary expertise can achieve more than the sum of the individual parts alone [ 91 ].

Alternatively, work can be outsourced to a service provider. Outsourcing has the following benefits: (1) there is no need to hire more employees for computational design and analysis, which reduces labor costs; and (2) well-equipped companies that specialize in specific research fields offer access to more expertise. However, outsourcing also has the following disadvantages: (1) a lack of control over the contractor's work; (2) limited methods of communication (e.g., phone, e-mail, or online chat); and (3) the potential danger of poor quality work due to the inability to optimize pipelines (e.g., parameters) and outcomes.

No matter which approach is taken, the essential part is to have firsthand experience to select proper computational design and pipeline and to accurately interpret analyzed genome data. Due to its extensive range of analytical tools and application areas, employing an effective simulator (from the quality of raw reads to assembly evaluation) has become an essential step for benchmarking genomic and bioinformatics analyses [ 92 – 94 ]. In simulations, considering a (very) large number of datasets is generally not a problem, except when the analysis of each dataset is hugely computationally expensive (e.g., in the genome assembly stage). In practice, one should generate and analyze as many datasets as computationally feasible before embracing real empirical studies, particularly before undertaking real assemblies. In large genome assembly, simulating assemblies of down-sampled real data (e.g., 30× coverage/depth of genome) would be very useful for selecting the best pipeline and parameters without requiring too much computational time or cost. Ultimately, a simulation’s practical relevance depends on the similarity between the considered simulation settings and the real datasets in the area of application. The new method may be assessed in different ways depending on the context (e.g., by conducting simulations, applying the method to several real datasets, applying flexible parameter settings, and checking the underlying assumptions in practical examples). Therefore, simulations should not be limited to artificial datasets that correspond exactly to the assumptions underlying the new method as this would favor the new method [ 61 , 95 – 98 ].
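The down-sampling of real reads mentioned above (e.g., to 30× coverage) can be sketched as a random subsample in which each read is kept with probability equal to the target number of bases divided by the total number of bases. The code below is a minimal illustration under stated assumptions: the function names `parse_fastq` and `downsample` are hypothetical, records are held in memory (which would not scale to a real dataset), and the input is plain four-line FASTQ.

```python
import random


def parse_fastq(lines):
    """Yield (header, sequence, plus, quality) tuples from an iterable of FASTQ lines."""
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            yield tuple(record)
            record = []


def downsample(records, genome_size_bp, target_coverage, seed=0):
    """Keep each read with probability target_bases / total_bases.

    The fixed seed makes the subsample repeatable between runs.
    """
    records = list(records)
    total_bases = sum(len(seq) for _, seq, _, _ in records)
    if total_bases == 0:
        return []
    keep_fraction = min(1.0, target_coverage * genome_size_bp / total_bases)
    rng = random.Random(seed)
    return [rec for rec in records if rng.random() < keep_fraction]


if __name__ == "__main__":
    # Toy data: 1,000 reads of 100 bp over a 1 kb "genome" (100x coverage).
    lines = []
    for i in range(1000):
        lines += [f"@read{i}", "A" * 100, "+", "I" * 100]
    kept = downsample(parse_fastq(lines), genome_size_bp=1000, target_coverage=30)
    print(f"kept {len(kept)} of 1000 reads")
```

Running candidate pipelines on several such subsamples with different seeds gives a sense of how sensitive each pipeline is to the particular reads retained.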

## Step 8: Assemble the genome

Regardless of which pathway/strategy is chosen, the TGS approach is recommended over the SGS or hybrid approaches. In general, using multiple programs at each stage to predict the best assembly and annotation ( Table 2 ) is also recommended because each approach and tool has limitations based on the problems inherent in the different algorithms and assumptions used. If the abovementioned steps (Steps 1–7) are met, the recommended flowchart and/or guideline for genome assembly, annotation, maintenance, and community effort would be as shown in Fig 1 , which could be broadly applicable to any species. The rationale of each computational design, workflow, and decision tree is well described in Jung and colleagues [ 14 ], including the background information for each of their steps and the spectrum of available analytical options. Following the workflow and decision tree described by Jung and colleagues, the recommended tools herein are the TGS pipeline: PacBio/ONT read sequencing (remove all contaminated DNA; plastids/bacterial contamination) → read quality assessment, evaluation, and filtering → assembly → error correction and polishing using SGS reads → assessment → chromosome-level assembly using BioNano and Hi-C data. Several recent assemblies adopted from this pipeline (or similar) have shown notable improvements in the assembly of intergenic spaces and centromeres [ 33 , 72 ]. A potential assembly outcome from the new SEQUEL II (HiFi) reads would be even more promising (see Step 4) compared to its early version SEQUEL. In the SGS pipeline, if the target is a diploid organism, starting from 10xGC read sequencing over Illumina reads is ideal. Based on the results of the hybrid-based assemblies, the recommended pipeline starts from PacBio/ONT, and 10xGC read sequencing greatly helps build a highly accurate contiguous genome [ 78 ]. 
However, all assembly approaches/designs derived only from sequence reads will still contain misassemblies (inversions and translocations). These are mainly caused by the inability of both sequencing and assembly pipelines to cope with long tracts of repeat sequences or high levels of heterozygosity and polyploidization. Thus, using BioNano and Hi-C data is highly recommended for reaching chromosome-level assembly because these two methodologies/technologies can improve the assembly quality by validating the integrity of the initial assembly, correcting misorientations, and ordering the scaffolds.

Fig 1. NGS, next-generation sequencing. https://doi.org/10.1371/journal.pcbi.1008325.g001

## Step 9: Check the assembly quality before annotation

In the shotgun sequencing era, assembling a new genome mostly relies on computational algorithms and experimental designs (see Steps 6 and 7). The performance of such algorithms and designs, read lengths, insertion size of sequencing libraries, read accuracy, and genome complexity determine the accuracy and continuity of the genome assembly. Therefore, while estimating assembly quality is an unpredictable and challenging task that requires several statistical and biological validations, it remains an important step for a high-quality genome. Typically, the quality assessment for draft assemblies is carried out via statistical measurements and alignment to a reference genome (if available) [ 99 ]. These include overall assembly size (determining the match to the estimated genome size), measures of assembly contiguity (N50, NG50, NA50, or NGA50; the number of contigs; contig length; and contig mean length), assembly likelihood scores (calculated by aligning reads against each candidate assembly), and the completeness of the genome assembly (Benchmarking Universal Single-Copy Orthologs [BUSCO] scores and/or RNA-seq mapping) [ 100 , 101 ]. In computational biology, N50 is a widely used metric for assessing an assembly’s contiguity; it is defined as the length of the shortest contig such that contigs of that length or longer cover at least 50% of the assembly. NG50 is computed in the same way, except that the 50% threshold is taken relative to the genome size rather than the assembly size. NA50 and NGA50 are analogous to N50 and NG50 where the contigs are replaced by blocks aligned to the reference [ 99 ]. Thankfully, recent bioinformatics tools offer an automated pipeline to compute and evaluate the new genome quickly and accurately in a practical setting [ 44 , 102 , 103 ].
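The N50 and NG50 definitions above translate directly into a short calculation: sort the contigs from longest to shortest and accumulate their lengths until half of the reference total is reached. The following is a minimal sketch; the function name `nx50` is hypothetical, and dedicated assessment tools should be preferred in practice.

```python
def nx50(contig_lengths, reference_size=None):
    """Return N50 when reference_size is None, or NG50 when a genome size is given.

    Returns None if the contigs cover less than half of the reference size.
    """
    total = reference_size if reference_size is not None else sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return None


if __name__ == "__main__":
    contigs = [80, 70, 50, 40, 30, 20]            # total assembly size: 290
    print("N50:", nx50(contigs))                  # threshold is 145
    print("NG50:", nx50(contigs, reference_size=400))  # threshold is 200
```

In this toy example the assembly is smaller than the assumed genome size, so NG50 is lower than N50, which is why NG50 is the more conservative figure when comparing assemblies of different total sizes.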

## Step 10: Genome annotation

Unlike advanced and revolutionized genome sequencing and assembly, getting genome annotation correct remains a challenge. Annotation is the process of identifying and describing regions of biological interest within a genome (both functionally and structurally). While there are various online annotation servers ( Table 3 ), the intended use of the curated data needs to be clearly defined after considering the two options addressed in Step 7 (maximizing in-house workers/collaboration and outsourcing) because the gene-finding problem in eukaryotes is far more difficult than that in prokaryotes such as bacteria. This procedure requires advanced bioinformatics skills, pipelines, and computing resources and consists of three main steps: (1) identifying noncoding regions; (2) identifying coding regions (called gene prediction); and (3) attaching the biological information of these elements.

Table 3. https://doi.org/10.1371/journal.pcbi.1008325.t003

Recent works have described genome annotations well [ 13 , 105 – 109 ]. However, it is highly recommended that beginners select automatic or semiautomatic annotation methods (including the workflow and guideline in Fig 1 ) because manual annotation can be very time- and labor-intensive and expensive. Note that while automatic procedures help accelerate the annotation process, they decrease the confidence and reliability of the outcomes because results from different servers and/or databases are often dissimilar [ 106 , 110 , 111 ]. Furthermore, automatic annotation algorithms, frequently based on orthologs from distantly related model organisms, cannot yet correctly identify all genes within a genome and manual annotation is often necessary to obtain accurate gene models and gene sets [ 106 , 110 , 111 ]. Thus, a scheme to obtain consensus annotations by integrating different results, a semiautomatic method, is in demand because this could balance automatic and manual approaches, which would increase the reliability of the annotation while accelerating the process [ 106 , 110 , 111 ]. In general, the identification of noncoding regions includes small and long sequences including repetitive and transposable elements ( Fig 1 and Table 3 ). Despite an explosion of interest in noncoding data and the massive volume of scientific data, selecting the best strategy to annotate and characterize noncoding RNAs is a daunting task because of the strengths and weaknesses of each computational and empirical approach [ 112 ]. After screening noncoding regions (e.g., repeat masking and transposable elements), elements of the gene structure (e.g., introns, exons, coding sequences [CDSs], and start and end coordinates) can be predicted for coding regions.
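As a small illustration of how the elements of gene structure mentioned above relate to one another, intron coordinates can be derived from the exon coordinates of a single transcript. This is a sketch under stated assumptions: coordinates are 1-based and inclusive (the convention used in GFF3 files), all exons belong to one transcript on one strand, and the function name `introns_from_exons` is hypothetical.

```python
def introns_from_exons(exons):
    """Derive intron intervals from (start, end) exon intervals of one transcript.

    Coordinates are assumed to be 1-based and inclusive.
    """
    ordered = sorted(exons)
    introns = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start > prev_end + 1:  # skip directly adjacent exons
            introns.append((prev_end + 1, next_start - 1))
    return introns


if __name__ == "__main__":
    exons = [(100, 200), (301, 400), (501, 650)]
    print(introns_from_exons(exons))
```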

Both ab initio and evidence-based prediction approaches are widely used as each approach has pros and cons. While Augustus and SNAP are the most popular tools for ab initio prediction, they still necessitate the information of the closely related gene and genome model for screening against the newly sequenced genome. By contrast, evidence-based prediction usually uses results obtained by aligning ESTs, protein sequences, and RNA-seq data (results are even better with full-length Iso-Seq data from PacBio or ONT) to a genome assembly as external evidence. Trained gene predictors (training with Augustus and SNAP to obtain more accurate annotation results is highly recommended) can be used in MAKER, BRAKER, and StringTie ( Fig 1 and Table 3 ). When extrinsic evidence from RNA-seq and protein homology information is available, any program/pipeline could be useful for the de novo annotation of novel genomes. In particular, if any RNA-seq data and a genome sequence are available, starting from MAKER and BRAKER over StringTie would be a better choice for a first-time user because MAKER and BRAKER include ab initio prediction (e.g., Augustus training) unlike StringTie (evidence-based prediction only). However, MAKER could be a better choice for updating existing annotations to reflect new evidence. If various gene prediction methods and tools are used to derive the gene structure from a genome, combining these results to obtain the single consensus gene structure via Evidence Modeler, GLEAN, Evigan, or GAAP is essential ( Table 3 ). In particular, BRAKER, StringTie, PASA, and GAAP can update any gene structure annotation by correcting exon boundaries and adding untranslated regions and alternatively spliced models based on assembled transcriptomic data. The evolutionary rapid emergence of new genes (which quickly respond to changing selection pressures) could give rise to orphan genes that might share no sequence homology to genes in closely related genomes [ 113 ]. 
Combining the methods and results (especially MAKER, BRAKER and StringTie) could therefore prove effective in increasing the number and accuracy of annotation predictions assigned to orphan and any other young genes.

Subsequently, functional annotation (the process of attaching biological information to gene or protein sequences) must be performed. This can be carried out through homology search and gene ontology (GO) term mapping. To investigate gene function or predict evolutionary associations, newly assembled sequences should be compared with gene sequences with known functions to find sequences with high homology using BLAST, Cufflinks, TopHat, GSNAP, Blast2GO/OmicsBox (referred to here as Blast2GO), and GAAP ( Fig 1 and Table 3 ). To label more diverse biological information, GO term mapping should be performed, which allows information about gene-related terms and relations between genes to be stored in three categories: biological processes, molecular functions, and cellular components. Mapping is the process of retrieving GO terms associated with hits (mapping sequences) obtained via a previous homology search (mainly BLAST) that are accessible from AmiGO, Blast2GO, GO-FEAT, and eggNOG-Mapper. Starting from Blast2GO would be a practical choice for a complete novice because it offers a more extensive graphical user interface with explanations.
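The mapping step described above amounts to a lookup: each gene's best homology hit is used to retrieve the GO terms associated with that hit, grouped by the three GO categories. The sketch below illustrates the idea only. The hit identifiers, the `HIT_TO_GO` table, and the function name `map_go_terms` are hypothetical; in a real project the associations come from the annotation tools and databases named in the text.

```python
from collections import defaultdict

# Hypothetical lookup table: homology hit -> list of (GO id, category).
HIT_TO_GO = {
    "hit_A": [
        ("GO:0006412", "biological_process"),   # translation
        ("GO:0003735", "molecular_function"),   # structural constituent of ribosome
        ("GO:0005840", "cellular_component"),   # ribosome
    ],
    "hit_B": [
        ("GO:0005524", "molecular_function"),   # ATP binding
    ],
}


def map_go_terms(best_hits, hit_to_go):
    """Attach GO terms, grouped by category, to each gene via its best hit."""
    annotations = {}
    for gene, hit in best_hits.items():
        by_category = defaultdict(list)
        for go_id, category in hit_to_go.get(hit, []):
            by_category[category].append(go_id)
        annotations[gene] = dict(by_category)
    return annotations


if __name__ == "__main__":
    hits = {"gene1": "hit_A", "gene2": "hit_B", "gene3": "no_known_hit"}
    for gene, terms in map_go_terms(hits, HIT_TO_GO).items():
        print(gene, terms)
```

Genes whose best hit has no GO association end up with an empty annotation, which is one way candidate orphan genes surface during functional annotation.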

While Fig 1 and Table 3 provide a summary of useful tools with key features, it is highly recommended to keep up with the regular updates of public databases and pipelines. In addition, understanding the performance and capabilities of the various analyses, based on detailed comparisons and instructions covering the common features of annotation tools, could be a very important factor for successful genome annotation, both structurally [ 7 , 111 , 114 – 117 ] and functionally [ 8 , 118 – 123 ].

## Step 11: Build a searchable and sharable output format

Research papers and data products (researchers are usually required to submit raw sequencing data to appropriate repositories such as Sequence Read Archive [SRA]) are key outcomes of the scientific enterprise, including most successful genome projects. In addition, most genomic projects/data potentially have value beyond their initial purpose but only if shared with the scientific community, including refining assembly and annotation (see Step 12). In recent years, genomic studies have involved complex datasets such that biologists have become “big data practitioners” [ 124 ] because of improvements in high-throughput DNA sequencing and cost reductions. As a result, genomic studies have become routine procedures, and there is widespread demand for tools that can assist in the deliberative analytical review of genomic information. What happens to the data after such projects end? In general, data or data management plans have become the central currency of science because open access, open data, and software are critical for advancing science and enabling collaboration across multiple institutions and throughout the world and increasing public awareness [ 125 ]. For example, when archiving sequencing data, repositories such as those run by the National Center for Biotechnology Information (NCBI) and European Bioinformatics Institute (EBI) both provide locations for data archiving and encourage a set of practices related to consistent data formatting and the inclusion of appropriate metadata. However, this is a difficult task for an individual research group due to the wide variety of data formats, dataset sizes, data complexity, data use cases, ethical questions, and data collection/storage/sharing practices [ 124 , 126 – 128 ]. 
Despite its importance, major barriers remain to sharing data, software, and research products throughout the scientific community because of the difficulties that interdisciplinary and/or translational researchers face when engaging in collaborative research [ 124 , 125 , 127 ]. To this end, recent works have provided principles that can be applied in genomic data/database projects, including data sharing and archiving via collaborations [ 124 – 128 ].

The following three fundamental questions on this topic should be considered: (1) Do you want to share your data? (2) Do you have enough in-house expertise and infrastructure to maintain and improve the data, including data storage space? (3) Do you want to form internal and external collaborations to increase research productivity? While each research group has different experiences and criteria in collaborations that included data sharing, engaging with multisite collaborations is highly recommended to overcome more pitfalls, including open-ended questions/concerns on genomic data. In addition, sharing open genomic data can easily facilitate reproducibility and repeatability by reusing the same genomic data.

## Step 12: Reach out to the community to refine the assembly and annotation

Falling whole-genome shotgun sequencing costs and improvements in bioinformatics pipelines and computer capabilities have created a situation in which a small lab can undertake genome projects (assembly and annotation), and any organism can become a model species. Ironically, the ease of sequencing and assembly presents another challenge for annotation: contamination of the assembly itself, because errors in assembly can cause errors in the annotation (structural and functional). In addition, it is important to ensure that methods are computationally repeatable and reproducible because there have been numerous reports of instability arising from a mere change of Linux platform, even when using the exact same versions of genomic analysis tools [ 49 ]. When including new data, it is also necessary to provide software infrastructure to assist in genomic data updating. Hence, assembled genomes and curated annotations should not and cannot be considered perfect, static, or “final products.” Data must be maintained, refreshed, and updated to ensure their reuse and discovery.
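Because results can differ between platforms even with identical tool versions, one practical habit is to store a small manifest of the computing environment alongside each assembly or annotation run. The following is a minimal sketch using only the Python standard library; the function name `environment_manifest` and the tool names and version strings in the example are placeholders, not real releases.

```python
import json
import platform


def environment_manifest(tool_versions):
    """Record the operating platform, Python version, and user-supplied tool versions."""
    return {
        "platform": platform.platform(),
        "python": platform.python_version(),
        "tools": dict(sorted(tool_versions.items())),
    }


if __name__ == "__main__":
    # Placeholder tool names and versions, supplied by the user for each run.
    manifest = environment_manifest({"assembler": "x.y.z", "polisher": "x.y.z"})
    print(json.dumps(manifest, indent=2))
```

Keeping such a file with the output data makes it easier to trace whether a discrepancy between two runs stems from the data, the tools, or the platform.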

Manual and continuous annotation is critical to achieving reliable gene models and elements; however, this process can be daunting and cost prohibitive for small research communities. While some genome consortia choose to manually review and edit sets via time- and resource-intensive meetings that often require substantial expertise, this still provides opportunities for community building, education, and training. In contrast, for small research groups, it has been proposed that involving undergraduates in community genome annotation consortiums can be mutually beneficial for both education and genomic resources [ 106 ]. Alternatively, a collaborative approach using web portals such as Apollo, JBrowse, G-OnRamp (Galaxy-based platform), and ORCAE [ 129 – 133 ] could be sufficiently robust and flexible to enable the members of a group to work simultaneously or at different times to improve the biological accuracy of annotation.

Regardless of the community-based participatory research approach taken, the recruitment and coordination of researchers are central to any research project due to the requirement of diverse expertise and collective learning. The ideal way would be to form a national/international collaborative research partnership with diverse organizations [ 19 , 134 – 136 ]. Alternatively, active promotion via social networks and/or web portal setup could be the most effective way (e.g., Twitter, the Ensemble website, and blogs). Finally, building collective research solidarity by attending conferences would be plausible. There have been previous successful community efforts and involvement in plant ( https://nbenth.com/annotator/index , https://solgenomics.net , and https://www.helmholtz-muenchen.de/pgsb ) and animal genome projects ( http://www.slimsuite.unsw.edu.au/servers/apollo.php , https://bovinegenome.elsiklab.missouri.edu , http://www.gmgi.org/genomics-fish-shellfish , and https://www.sanger.ac.uk/science/data/vertebrate-genomes-sequencing ). Using an Apollo instance with JBrowse is an attractive and effective route because it is always online, curators can log in whenever they have time, and some minor revisions (e.g., confirming gene models) require only a few seconds, whereas others (UTR boundaries and other structural alterations) require up to 20 minutes.

After the initial setup, tasks include maintaining momentum and morale, according to the recommendations described by Pedro and colleagues [ 137 ]. Participants bring their own experiences and strengths into this effort. Availability of a training webinar (e.g., https://bit.ly/3gauwn7 and https://bit.ly/36iNQds ) would greatly help kick-start the process, alongside a clear set of starting tasks (e.g., a list of genes/families or regions assigned to each curator) and engagement by the community leader. The leader—an enthusiastic champion—can (1) drum up support from their collaborators; (2) fuse community expertise with resources; (3) oversee the project; and (4) act as a liaison between new members wanting to join, the infrastructure provider, and existing annotators. Considering that the collective expertise within a group may be extensive but diverse, it is necessary to standardize the curation for quality control of annotations. To minimize any conflicts that may arise during the annotation process, it is important (1) to have the initial training webinar by laying out clear rules and guidelines; (2) to select a small subset of genes and ask a group of experienced curators to evaluate whether the decisions taken in each case were uniform and sensible; (3) to record webinar training and comments regarding consensus or disagreements for reporting back to the curation team and to edit the tutorial and guidelines; (4) to address this by automated checks and controls (Apollo does not allow this for now or makes it extremely difficult); and (5) to ask multiple reviewers to check each region by reviewing the annotation history in Apollo (labor-intensive method).

Pooling the expertise, resources, and time of active communities could enable a wide range of geographically distant members to participate in a common process and to share and validate the identification of contradictions and misrepresentations of data on the genomes [ 137 ]. After corrections, the datasets (manually verified gene sets) that emerge from these projects can be used to improve the gene sets for closely related genomes and downstream analysis. Dialog and collaboration between community members have an enormous impact. The result of an entire community agreeing on and taking ownership of a single gene set is a major stepping-stone to accelerating the field. Handling the mammoth task of manual gene annotation in the absence of dedicated funding or teams is a great challenge. However, our guidelines could offer a manageable way for this approach to become commonplace, and we will continue to engage in community-driven curation efforts.

## Advice for new genomic users to select a basic assembly and annotation pipeline

For a complete novice, our recommendation is as follows (we do not recommend starting from an assembly based only on Illumina short reads).

• Pure long-read assembly: PacBio or ONT read sequencing (if combined, PacBio 40X and ONT 25X, or 60X for a single platform) → CANU assembler (alternatively Flye) → BUSCO assessment → Decide whether to add more sequencing data or proceed to the next step (see Confirm and Refine in Fig 1 ) → Optional BioNano with RefAligner (still expensive compared to Hi-C data) → Hi-C with 3D-DNA (alternatively HiRise or AllHiC) → Gap closing with LR_Gapcloser → Polishing with Arrow using long reads (alternatively Racon) or with Pilon using short reads → BUSCO assessment.
• Hybrid assembly: 10xGC reads with Supernova → PacBio or ONT reads with CANU (alternatively MaSuRCA) → The remaining steps are the same as in “Pure long-read assembly”, from the first BUSCO assessment through the final BUSCO assessment.
• Annotation: NCBI or EBI (a web-based automatic pipeline) → If not, proceed with a semiautomatic pipeline starting from structural annotation → RepeatMasker → Ab initio Augustus training with MAKER (alternatively BRAKER) → Evidence-based prediction (RNA-seq) with MAKER (alternatively BRAKER) → Noncoding RNA prediction with NONCODE → Functional annotation with Blast2GO (alternatively AmiGO) → Genome Browser.
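
The coverage targets in the long-read recommendation translate directly into a sequencing yield once a genome size is assumed. The sketch below is only illustrative arithmetic: the 1,000 Mb genome size and the function name `required_yield_gb` are assumptions made for the example, not values from the text.

```python
def required_yield_gb(genome_size_mb: float, coverage: float) -> float:
    """Return the sequencing yield in gigabases needed for a target depth.

    Depth (X) = total sequenced bases / genome size, so
    yield = genome size * depth.
    """
    return genome_size_mb * coverage / 1000.0


# Hypothetical genome size, chosen only to illustrate the calculation.
genome_size_mb = 1000.0

# Depths quoted in the recommendation above.
combined_plan = {"PacBio": 40, "ONT": 25}
single_platform_depth = 60

for platform, depth in combined_plan.items():
    needed = required_yield_gb(genome_size_mb, depth)
    print(f"{platform}: {depth}X needs about {needed:.1f} Gb")

single = required_yield_gb(genome_size_mb, single_platform_depth)
print(f"Single platform: {single_platform_depth}X needs about {single:.1f} Gb")
```

In practice the usable depth is lower than the raw yield after read filtering, so a margin on top of these numbers is usually sensible.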

## Conclusions

There are no gold standards for genome assembly and annotation. However, the availability of NGS data (particularly TGS data) and their analytical tools has enabled the sequencing of several high-quality genomes of species of importance in aquaculture in recent years. Beginners and small research groups still face challenges, because genome assembly and annotation are usually complex analytical procedures (or pipelines) requiring interdisciplinary collaborations (from biology to computer science) and hefty costs for refining/maintaining the genome. The recommendations addressed here are broad guidelines that could be considered to avoid common pitfalls throughout the whole-genome assembly and annotation process. However, the comprehensive features (e.g., advantages and disadvantages) of each step and/or technology have not been extensively discussed.

Finally, newly emerging technologies and analytical tools could dramatically improve end-to-end genome assemblies and annotations in the future by replacing the years-long efforts of the past with rapid and low-cost solutions. Meanwhile, emphasis should be placed upon the following: First, define the achievable research aim. Second, avoid the trap of trying to secure a perfect/complete genome assembly and annotation, which could lead to a never-ending project. Third, perform assembly and annotation to gain firsthand experience, including in bioinformatics. Fourth, seek internal and external help and advice from experts. Lastly, be open to sharing genomic data to both increase research productivity and promote public awareness.

## Acknowledgments

The authors are grateful to their colleagues, collaborators, and field/technical specialists from each company for their valuable comments.

## Genome Annotation

Under development.

This tutorial is not in its final state. The content may change a lot in the next months. Because of this status, it is also not listed in the topic pages.

Genome annotation is the process of attaching biological information to sequences. It consists of three main steps:

• identifying portions of the genome that do not code for proteins
• identifying elements on the genome, a process called gene prediction, and
• attaching biological information to these elements.

Agenda

In this tutorial, we will deal with:

• Introduction into File Formats
• Structural Annotation
• Sequence Features
• Gene Prediction
• Functional Annotation
• Similarity Searches (BLAST)
• More Similarity Search Tools in Galaxy
• Identification of Gene Clusters

## Introduction into File Formats

DNA and protein sequences are written in FASTA format: the first line of each record starts with “>” followed by a description, and the sequence itself begins on the second line.
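
To make the FASTA layout concrete, here is a minimal Python sketch that reads records into a dictionary. The function name `parse_fasta` and the two example records are made up for illustration; real projects would normally use an established parser rather than this simplified one.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each record identifier to its (possibly multi-line) sequence."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # The identifier is the first word after ">"; the rest is description.
            parts = line[1:].split()
            header = parts[0] if parts else ""
            records[header] = ""
        elif header is not None:
            # Sequence lines are concatenated until the next ">" line.
            records[header] += line
    return records


# Hypothetical example records.
example = """>seq1 hypothetical example record
ATGCGT
ACGTTA
>seq2
GGGCCC
""".splitlines()

records = parse_fasta(example)
for name, seq in records.items():
    print(name, len(seq), seq)
```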

The general feature format (gene-finding format, generic feature format, GFF) is a file format used for describing genes and other features of DNA, RNA and protein sequences.
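
A GFF3 feature line has nine tab-separated columns (sequence ID, source, feature type, start, end, score, strand, phase, attributes). The sketch below parses one such line; the `Feature` type, the function name, and the example line are hypothetical and only meant to show the column layout.

```python
from typing import Dict, NamedTuple


class Feature(NamedTuple):
    seqid: str
    source: str
    feature_type: str
    start: int          # 1-based, inclusive
    end: int            # 1-based, inclusive
    score: str
    strand: str
    phase: str
    attributes: Dict[str, str]


def parse_gff3_line(line: str) -> Feature:
    """Split one GFF3 feature line into its nine columns."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    # Column 9 holds semicolon-separated key=value pairs.
    attributes = dict(
        item.split("=", 1) for item in cols[8].split(";") if "=" in item
    )
    return Feature(cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]),
                   cols[5], cols[6], cols[7], attributes)


# Hypothetical feature line.
feature = parse_gff3_line("chr1\tAUGUSTUS\tgene\t1000\t2500\t0.95\t+\t.\tID=g1")
print(feature.feature_type, feature.end - feature.start + 1, feature.attributes)
```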

The GenBank sequence format is a rich format for storing sequences and associated annotations.

## Structural Annotation

For the genome annotation we use a piece of the Aspergillus fumigatus genome sequence as input file.

## Sequence Features

First we want to get some general information about our sequence.

Hands-on: Sequence composition

• Count the number of bases in your sequence (compute sequence length).
• Check the sequence composition and GC content (geecee).
• Plot the sequence composition as a bar chart.
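
Outside Galaxy, the same three quantities can be computed in a few lines of Python. This is a minimal sketch on a made-up eight-base sequence, not the implementation used by the Galaxy tools named above.

```python
from collections import Counter


def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


# Hypothetical toy sequence.
sequence = "ATGCGCTA"

length = len(sequence)                   # sequence length
composition = Counter(sequence.upper())  # base counts, the data behind a bar chart
gc = gc_content(sequence)

print(f"length={length}")
print(f"composition={dict(composition)}")
print(f"GC content={gc:.2f}")
```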

## Gene Prediction

First, you need to identify those structures of the genome which code for proteins. This step of annotation is called “structural annotation”. It includes the identification and location of open reading frames (ORFs), the identification of gene structures and coding regions, and the location of regulatory motifs. Galaxy contains several tools for structural annotation. Tools for gene prediction are Augustus (for eukaryotes and prokaryotes) and glimmer3 (only for prokaryotes).
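
To illustrate what locating an ORF means at its simplest, the following sketch scans the three forward reading frames for an ATG followed by an in-frame stop codon. It is a deliberately naive example under stated assumptions (forward strand only, no introns, standard stop codons); Augustus and glimmer3 use statistical models and do far more than this.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 2) -> List[Tuple[int, int]]:
    """Return (start, end) pairs of forward-strand ORFs, 0-based, end exclusive.

    min_codons counts every codon of the ORF, including start and stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                   # close it and keep scanning
    return orfs


# Hypothetical toy sequence containing one short ORF.
toy = "CCATGAAATAGCC"
for start, end in find_orfs(toy):
    print(start, end, toy[start:end])
```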

Hands-on: Gene prediction

• We use Augustus for gene prediction. Use the genome sequence (FASTA file) as input.
• Choose the right model organism and gff format output.
• Select all possible output options.

Augustus will provide three output files: gff3, coding sequences (CDS) and protein sequences.

Question: How many genes are predicted?

Solution: Check the output: augustus_output

Hands-on: tRNA and tmRNA prediction

• Use Aragorn for tRNA and tmRNA prediction.
• As input file, use the Aspergillus genome sequence.
• You can choose the genetic code (e.g. bacteria).
• Select the topology of your genome (circular or linear).

Question: Are there tRNAs or tmRNAs in the sequence?
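
One way to answer the question about the number of predicted genes is to count the feature lines of type "gene" in the GFF output. The sketch below assumes the output contains one "gene" line per predicted gene; the example lines and the function name are hypothetical.

```python
from typing import Iterable


def count_features(gff_lines: Iterable[str], feature_type: str = "gene") -> int:
    """Count GFF feature lines whose third column equals feature_type."""
    count = 0
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.split("\t")
        if len(cols) >= 3 and cols[2] == feature_type:
            count += 1
    return count


# Hypothetical excerpt of a gene prediction output.
example_gff = [
    "##gff-version 3",
    "contig1\tAUGUSTUS\tgene\t100\t900\t0.8\t+\t.\tID=g1",
    "contig1\tAUGUSTUS\tmRNA\t100\t900\t0.8\t+\t.\tID=g1.t1;Parent=g1",
    "contig1\tAUGUSTUS\tgene\t1500\t2400\t0.6\t-\t.\tID=g2",
]

n_genes = count_features(example_gff)
print(f"predicted genes: {n_genes}")
```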

## Functional Annotation

## Similarity Searches (BLAST)

Functional gene annotation means the description of the biochemical and biological function of proteins. Possible analyses to annotate genes can be for example:

• similarity searches
• gene cluster prediction for secondary metabolites
• identification of transmembrane domains in protein sequences
• finding gene ontology terms
• pathway information

For similarity searches we use NCBI BLAST+ blastp to find similar proteins in a protein database.

Hands-on: Similarity search

• As input file, select the protein sequences from Augustus.
• Choose the protein BLAST database SwissProt and the output format xml.
• Parse the XML output (Parse blast XML output) to convert it into tabular format.

Question: What information do you see in the BLAST output?

From BLAST search results we want to get only the best hit for each protein.

Therefore, apply the tool BLAST top hit descriptions with number of descriptions = 1 to the XML output file.

Question: For how many proteins do we not get a BLAST hit?

Choose the tool Select lines that match an expression and enter the following information: Select lines from [select the BLAST top hit descriptions result file]; that [not matching]; the pattern [gi].

Comment: Results file. The result file will contain all proteins which do not have an entry in the second column and therefore have no similar protein in the SwissProt database.

Comment: Obtaining unannotated proteins for analysis. For a functional description of those proteins, we want to search for motifs or domains which may classify them further. To get a protein sequence FASTA file with only the unannotated proteins, use the tool Filter sequences by ID from a tabular file; for Sequence file to filter on the identifiers select [Augustus protein sequences], and for Tabular file containing sequence identifiers select the file with the unannotated proteins. The output file is a FASTA file with only those sequences without a description.
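
The two filtering steps above amount to finding identifiers whose second column is empty and then keeping only those sequences. A minimal sketch, assuming a two-column tab-separated table (query ID, top hit description) and a dictionary of protein sequences; all identifiers and values here are invented for illustration.

```python
from typing import Dict, Iterable, List


def unannotated_ids(rows: Iterable[str]) -> List[str]:
    """Return query IDs whose second column (top hit) is missing or empty."""
    ids = []
    for row in rows:
        if not row.strip():
            continue  # ignore blank lines
        cols = row.rstrip("\n").split("\t")
        if len(cols) < 2 or not cols[1].strip():
            ids.append(cols[0])
    return ids


def filter_sequences(proteins: Dict[str, str], keep: Iterable[str]) -> Dict[str, str]:
    """Keep only the sequences whose identifier is listed in keep."""
    wanted = set(keep)
    return {name: seq for name, seq in proteins.items() if name in wanted}


# Hypothetical top-hit table and protein set.
top_hits = [
    "g1.t1\tgi|0000001|sp|EXAMPLE1| hypothetical hit description",
    "g2.t1\t",
    "g3.t1\tgi|0000002|sp|EXAMPLE2| hypothetical hit description",
]
proteins = {"g1.t1": "MKT", "g2.t1": "MSL", "g3.t1": "MAV"}

no_hit = unannotated_ids(top_hits)
unannotated = filter_sequences(proteins, no_hit)
print(no_hit, unannotated)
```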

This file will be the input for more detailed analysis:

Interproscan is a functional prediction tool. Select all applications and run it on your protein file.

WolfPSort predicts eukaryote protein subcellular localization. Filter the result file for the best ranked localization hit. Use Filter data on any column using simple expressions with c4==1 . The expression c4==1 means: keep all rows in which column 4 contains a “1”.

TMHMM finds transmembrane domains in protein sequences. The number of amino acids in transmembrane helices should be >18. This information can be found in column 3. Filter the result file c3>17.99 .
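
The column expressions c4==1 and c3>17.99 can be mimicked in plain Python as a row filter on a tab-separated table. The column layouts and values in this sketch are hypothetical and are not the exact output formats of WolfPSort or TMHMM in Galaxy.

```python
from typing import Callable, Iterable, List


def filter_rows(rows: Iterable[str], column: int,
                predicate: Callable[[str], bool]) -> List[str]:
    """Keep rows whose 1-based column satisfies predicate (like c3>17.99)."""
    kept = []
    for row in rows:
        cols = row.rstrip("\n").split("\t")
        if len(cols) >= column and predicate(cols[column - 1]):
            kept.append(row)
    return kept


# Hypothetical rows: id, protein length, residues in transmembrane helices.
tmhmm_rows = [
    "g1.t1\t350\t22.50",
    "g2.t1\t120\t0.00",
    "g3.t1\t410\t18.00",
]
# Hypothetical rows: id, localization, score, rank.
wolfpsort_rows = [
    "g1.t1\tnucl\t14.0\t1",
    "g1.t1\tcyto\t9.0\t2",
]

transmembrane = filter_rows(tmhmm_rows, 3, lambda v: float(v) > 17.99)  # c3>17.99
best_ranked = filter_rows(wolfpsort_rows, 4, lambda v: int(v) == 1)     # c4==1

print(transmembrane)
print(best_ranked)
```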

BLAST2GO maps BLAST results to GO annotation terms.

## BLAST Programs

Details: Organism not available in a BLAST database. If you have an organism which is not available in a BLAST database, you can use its genome sequence in a FASTA file for BLAST searches “sequence file against sequence file”. If you need to search these sequences on a regular basis, you can create your own BLAST database from the sequences of the organism. The advantage of having your own database for your organism is that BLAST searches run much faster.

NCBI BLAST+ makeblastdb creates a BLAST database from your own FASTA sequence file. Molecule type of input is protein or nucleotide.
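
On the command line, the same step uses the `makeblastdb` program from BLAST+ with its `-in`, `-dbtype` and `-out` options. The sketch below only assembles and prints the command; the file and database names are placeholders, and actually running it would require BLAST+ to be installed.

```python
import shlex
from typing import List


def makeblastdb_command(fasta_path: str, dbtype: str, out_name: str) -> List[str]:
    """Build the argument list for creating a BLAST database from a FASTA file."""
    if dbtype not in ("nucl", "prot"):
        raise ValueError("dbtype must be 'nucl' or 'prot'")
    return ["makeblastdb", "-in", fasta_path, "-dbtype", dbtype, "-out", out_name]


# Placeholder file and database names.
cmd = makeblastdb_command("my_organism_proteins.fasta", "prot", "my_organism_db")
print(shlex.join(cmd))

# To execute it (assumes BLAST+ is installed and the FASTA file exists):
# import subprocess
# subprocess.run(cmd, check=True)
```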

Details: Further Reading about BLAST Tools in Galaxy Cock et al. (2015): NCBI BLAST+ integrated into Galaxy Cock et al. (2013): Galaxy tools and workflows for sequence analysis with applications in molecular plant pathology

## More Similarity Search Tools in Galaxy

• VSEARCH : For processing metagenomic sequences, including searching, clustering, chimera detection, dereplication, sorting, masking and shuffling. VSEARCH stands for vectorized search, as the tool takes advantage of parallelism in the form of SIMD vectorization as well as multiple threads to perform accurate alignments at high speed. VSEARCH uses an optimal global aligner (full dynamic programming Needleman-Wunsch), in contrast to USEARCH which by default uses a heuristic seed and extend aligner. This results in more accurate alignments and overall improved sensitivity (recall) with VSEARCH, especially for alignments with gaps.
Details: vsearch in depth Documentation for vsearch see here .
• Diamond : Diamond is a high-throughput program for aligning a file of short reads against a protein reference database such as NR, at 20,000 times the speed of Blastx, with high sensitivity.
Details: Diamond in depth Buchfink et al. (2015): Fast and sensitive protein alignment using Diamond.
• Kraken : Kraken BLAST is a highly scalable, extremely fast, commercial, parallelized implementation of the NCBI BLAST application.
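
The full dynamic programming global alignment mentioned for VSEARCH can be illustrated with a small Needleman-Wunsch scorer. This is a textbook sketch with a simple linear gap penalty and arbitrary scoring values chosen for the example; it is not VSEARCH's implementation or its default scoring scheme.

```python
def needleman_wunsch_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -1) -> int:
    """Optimal global alignment score of a and b with a linear gap penalty."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diagonal, up, left))
        prev = curr
    return prev[-1]


print(needleman_wunsch_score("GATTACA", "GATTACA"))
print(needleman_wunsch_score("ACGT", "ACT"))
```

Because every cell of the matrix is filled, the result is guaranteed optimal for the chosen scores, which is the property contrasted with heuristic seed-and-extend aligners in the list above.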

## Identification of Gene Clusters

For identification of gene clusters, antiSMASH is used. The tool takes a GenBank file as input and predicts gene clusters. The outputs are an HTML visualization and the gene cluster proteins.

Hands-on: antiSMASH analysis

Import this dataset into your Galaxy history and run antiSMASH to detect gene clusters. The GenBank file contains a part of the Streptomyces coelicolor genome sequence.

Question: Which gene clusters are identified?

When you have a whole genome antiSMASH analysis, your result may look like this:

At the end, you can extract a reproducible workflow out of your history. The workflow should look like this:

## Citing this Tutorial

• Anika Erxleben, Björn Grüning, Genome Annotation (Galaxy Training Materials) . https://training.galaxyproject.org/training-material/topics/genome-annotation/tutorials/genome-annotation/tutorial.html Online; accessed TODAY
• Hiltemann, Saskia, Rasche, Helena et al., 2023 Galaxy Training: A Powerful Framework for Teaching! PLOS Computational Biology 10.1371/journal.pcbi.1010752
• Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012

Galaxy Administrators: Install the missing tools

You can use Ephemeris's shed-tools install command to install the tools used in this tutorial.

shed-tools install [-g GALAXY] [-a API_KEY] -t <(curl https://training.galaxyproject.org/training-material/api/topics/genome-annotation/tutorials/genome-annotation/tutorial.json | jq .admin_install_yaml -r)

Alternatively, you can copy and paste the following YAML:

---
install_tool_dependencies: true
install_repository_dependencies: true
install_resolver_dependencies: true
tools: []

## What is genome annotation?

Genome annotation is the process of finding and designating locations of individual genes and other features on raw DNA sequences, called assemblies. Annotation gives meaning to a given sequence and makes it much easier for researchers to view and analyze its contents. To visualize what annotation adds to our understanding of the sequence, you can compare the raw sequence (in FASTA format) with the GenBank or Graphics formats, both of which contain annotations. In both instances note the placement of individual genes and other features on the sequence.

When a group of researchers assemble a genome, they may also, with processes they establish themselves, annotate it at the same time. In the past, an assembly with annotation was known as a build. These days, the term build is rarely used, as the genome assembly process and its annotation process are often completely uncoupled. They can be conducted at different times by different parties.

For example, the Genome Reference Consortium (GRC) is maintaining and updating the human reference assembly. GRC releases assembly (sequence) updates and deposits these to the International Nucleotide Sequence Database Collaboration (INSDC) without annotation. GRC prepared the latest major assembly update (major release designated as GRCh38) in December 2013 and it has since followed with several minor updates (patches). In further processing of an assembly update, the NCBI staff creates a RefSeq version of the submitted INSDC assembly. Following that, NCBI annotates the RefSeq version of the assembly. Each annotation release has its own designation and time stamp. For example, the latest (as of August 2023) NCBI annotation release is designated as GCF_000001405.40-RS_2023_03.

In addition to the human reference genome, NCBI staff annotate numerous eukaryotic genomes via the powerful Eukaryotic Genome Annotation Pipeline. Visit the Eukaryotic Genome Annotation at NCBI page to start exploring extensive documentation on the annotation process, and to follow the progress of individual genome annotation. NCBI staff have also developed the Prokaryotic Genome Annotation Pipeline that is available as a service to GenBank submitters and also as a stand-alone software package.

## Why is genome annotation important?

Genome annotation is no simple feat, but it’s incredibly important in identifying the functional elements of DNA. Building the appropriate tools and pipelines is key.

With expertise gleaned from working with a diverse range of genomes - from aphids to wheat, and protists to fish - Earlham Institute scientists explain how genome annotation has advanced over the years, and why it is so important.

Gemy Kaithakottil is celebrating his tenth anniversary at the Earlham Institute this February, having helped to oversee a decade of dramatic transformation in how we annotate genomes.

“We’ve come a long way in the last ten years. Back then, before we began, the process was more like running software and scripts one by one. Now, we’ve merged everything into a streamlined pipeline and shaved a lot of time off the process.”

Throughout that time,  Kaithakottil  has worked as a Senior Bioinformatician and Software Developer in the  Swarbreck Group , developing a suite of tools and pipelines that help us to more accurately map where genes lie along a genome.

## What is a genome annotation?

Simply put, genome annotation involves taking genomic data - DNA or RNA sequences - and mapping the correct genes (or more accurately, functional elements) to the correct locations. It gives the genome meaning.

According to Kaithakottil, this is an essential step that is frustratingly undervalued.

“People often spend a lot of effort on genome assembly, but eventually the research is going to work with the protein or the functional parts of it. If you're not going to give any effort to that part, then what's the point?

“You have to put equal - or even more effort - into the annotation.”

This can be done manually, by looking directly at the data and identifying the precise starting points of genes, but that takes a lot of time.

“When I was working at the Sanger Institute on the Human Genome Project, we annotated the human genome by going gene by gene across every chromosome,” says Dr David Swarbreck, Core Bioinformatics Group Leader at the Earlham Institute.

“We used manual annotation tools to visually examine alignments of cDNAs and proteins and, based on these, we could construct gene-models to define a gene's structure. This was a huge team effort and manual curation to this extent is not possible for most newly-sequenced and assembled genomes.

“I wanted something computational that would work in a similar way to a manual annotator, enable us to assess alternative gene models, generate metrics to aid that comparison and make choices over the models we include or exclude: allowing us to shape the annotation for specific projects but without us having to do it manually.”

## Genome Annotation Workshop 2022

You can learn from the experts how to annotate a genome at this year’s training workshop, delivered by the Core Bioinformatics Group here at the Earlham Institute.

Date:  17 - 19 May 2022

Register By:  17 April 2022

## Moving towards genome annotation pipelines

One of the first genomes annotated by Swarbreck and his group was that of the  green peach aphid ,  Myzus persicae , in a collaboration with the  Hogenhout Group  at the John Innes Centre that continues to this day.

“We played around with all the available tools at the time, to see what was available,” says Swarbreck. “The problem was, many of the more comprehensive pipelines weren’t easy to translate to running in your own environment.”

Kaithakottil adds that, “with any new pipeline, you need to understand the software, then work out what parameters you need to use, or tweak, for a particular species. It's not one size fits all. You need to understand the species that you're working with.”

As those early, non-human genomes were being assembled and annotated, RNA-seq data from transcriptome sequencing - the potentially expressed functional elements - was becoming more prevalent. So, too, were longer reads from the sequencers.

“We found that there was quite a bit of variation between different methods and different ways of dealing with that data,” says Swarbreck. “We concluded that there was no single tool out there that we found worked for all situations.

“We were looking for something that would allow us to try to integrate results of all these different transcriptome assemblers.”

## Genome annotation pipelines at the Earlham Institute

That led to the development of  Mikado and Portcullis , which were essential tools in the huge global effort to sequence, assemble and annotate the genome of bread wheat - a major milestone for such a crucial source of food.

Mikado is a tool, according to a former developer Dr Luca Venturini, based on the traditional stick game that is its namesake, aiming to “imagine genes as sticks and to capture the ones with the highest value without getting the others.”

What Mikado does, essentially, is find more real genes - filtering out false positives and identifying where there might have been false negatives. An example would come from gene duplications, whereby some software may have accidentally identified two very similar genes as only one.

“At the time we were working as part of the International Wheat Genome Sequencing Consortium and wanted a transparent approach that would enable us to integrate two alternative gene sets created by our collaborators,” says Swarbreck. “We made some tweaks to Mikado, and used it as a method to bring these two gene sets together, essentially cherry picking the ‘better’ models from the two annotations.”

Since then, the group has been integral to many genome sequencing projects, from various plant and tree species through to insects, fish, rodents, fungi and numerous others.

Now, the aim is to integrate these tools to tackle the biggest prize of all - the Darwin Tree of Life Project that aims to sequence the DNA of all eukaryotic life in the UK.

## Reat: an all-encompassing, easy-to-use genome annotation pipeline

“We've got a growing number of projects and collaborations that use these tools, but what we have aimed for all along is an easy to run, all-encompassing annotation pipeline,” says Swarbreck. “The solution to that was to develop what we’ve called the  reat  toolkit.”

That toolkit was most recently used in an effort  to produce the best ever reference genome for tilapia  - a fish of exceptional importance in global aquaculture. It helped bioinformatician Dr Will Nash produce a comprehensive genome annotation.

Reat contains a module for dealing with a whole variety of transcriptome data, including cDNA, PacBio, nanopore, and short reads, and there are different workflows for those different types of data as well.

There's also a module for dealing with homology data from protein alignments, together with a gene prediction module - and at the end of that a consolidated gene annotation across all these different methods.

“Rather than try to generate a single set of models for each project, we generate lots of different gene models through different routes,” explains Swarbreck. “We then use our  Minos  pipeline to bring these all together and select the final ‘best’ models.

“Rather than putting all our eggs in one basket, we accept that it’s best to vary parameters, the choice of tools, and the inputs into these tools. We can achieve a higher quality final annotation and have an approach that is more robust across projects by generating alternative gene models.

“Ultimately, you need to have some way of making a final selection. Minos provides us with a means of making that selection that we can control, allowing us to review and tweak as required.”
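
The idea of generating alternative gene models and then selecting one per locus can be illustrated with a toy example. This sketch is purely hypothetical: the `Model` type, the scores and the "highest score wins" rule are assumptions for illustration and do not describe how Minos or Mikado actually score or choose models.

```python
from typing import Dict, Iterable, NamedTuple


class Model(NamedTuple):
    locus: str     # hypothetical locus identifier
    source: str    # route that produced the model
    score: float   # hypothetical quality metric, higher is better


def pick_best(models: Iterable[Model]) -> Dict[str, Model]:
    """Keep the highest-scoring model for each locus."""
    best: Dict[str, Model] = {}
    for model in models:
        current = best.get(model.locus)
        if current is None or model.score > current.score:
            best[model.locus] = model
    return best


# Hypothetical alternative models from three routes.
candidates = [
    Model("locus1", "transcriptome", 0.91),
    Model("locus1", "homology", 0.78),
    Model("locus2", "ab_initio", 0.55),
    Model("locus2", "homology", 0.83),
]

selected = pick_best(candidates)
for locus, model in selected.items():
    print(locus, model.source, model.score)
```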

## How can I access genome annotation tools and expertise at the Earlham Institute?

The  reat  pipeline is available to anyone who would like to use it, open-source, on GitHub. The same is true of mikado, portcullis, minos and a range of other tools and pipelines for genome annotation.

“We are more than happy to keep in touch with anyone interested in using these pipelines,” says Kaithakottil. “People tend to ask questions in the issues section of GitHub, and we gladly help them there. We’ll reply to emails, too, of course!”

There’s also a training course on  Genome Annotation , which Kaithakottil would encourage those looking to make use of these pipelines to sign up for.

“The workshop starts from the very basics,” he explains. “We introduce participants to genome annotation and then go through a number of pipelines, including our own, such as Mikado and Minos, as well as some external pipelines.

“We go through using metrics, best practice, explore the parameters you should be using - and then how to run and install these tools, and how to use them most effectively. Thanks to  CyVerse UK , trainees can also access the resources via virtual machines to see their outputs, tweak parameters, and modify them to improve their results.”

If you’d like to sign up for the workshop in 2022, registrations end on April 17. Keep your eyes on the  events calendar,  which is regularly updated, for future events, or sign up to the Earlham Institute  monthly newsletter.

This 3-day course will help to provide scientists with an overview of eukaryotic genome annotation approaches, covering advances in Next Generation Sequencing (NGS) technologies, transcriptome assembly, best practice guidance for building gene models utilising short and long read sequencing data or cross species proteins, how to integrate and assess different gene models and create a publication/release ready gene set.

• Course Dates:  17 - 19 May 2022
• Time : 09.30 - 15.00
• Venue : Online (via Zoom)
• Registration Deadline:  17 April 2022
• Registration Cost : £150.00

## The first high-quality genome assembly and annotation of Lantana camara , an important ornamental plant and a major invasive species

• Research Article
• Open access
• Published: 10 May 2024
• Volume 2, article number 14 (2024)

• S. Brooks Parrish 1 &
• Zhanao Deng   ORCID: orcid.org/0000-0002-7338-3298 1

This study presents the first annotated, haplotype-resolved, chromosome-scale genome of Lantana camara , a flowering shrub native to Central America and known for its dual role as an ornamental plant and an invasive species. Despite its widespread cultivation and ecological impact, the lack of a high-quality genome has hindered investigation of both its ornamental and invasive traits. This research bridges the gap in genomic resources for L. camara , which is crucial for both ornamental breeding programs and invasive species management. Whole-genome and transcriptome sequencing were utilized to elucidate the genetic complexity of a diploid L. camara breeding line UF-T48. The genome was assembled de novo using HiFi and Hi-C reads, resulting in two phased genome assemblies with high Benchmarking Universal Single-Copy Orthologs (BUSCO) scores of 97.7%, indicating their quality. All 22 chromosomes were assembled with pseudochromosomes averaging 117 Mb. The assemblies revealed 29 telomeres and an extensive presence of repetitive sequences, primarily long terminal repeat transposable elements. The genome annotation identified 83,775 protein-coding genes, with 83% functionally annotated. In particular, the study mapped 42 anthocyanin and carotenoid candidate gene clusters and 12 herbicide target genes to the assembly, identifying 38 genes spread across the genome that are integral to flower color development and 53 genes for herbicide targeting in L. camara . This comprehensive genomic study not only enhances the understanding of L. camara’s genetic makeup but also sets a precedent for genomic research in the Verbenaceae family, offering a foundation for future studies in plant genetics, conservation, and breeding.

## Introduction

Lantana camara , commonly known as lantana, is a flowering shrub native to Central America. It has been introduced to various parts of the world, including India, Australia, Africa, and the United States, where it has become invasive in certain regions (Sharma et al. 2005 ; Bhagwat et al. 2012 ; Taylor et al. 2012 ; Shackleton et al. 2017 ). Despite its status as an invasive species, lantana remains a popular ornamental plant, contributing significantly to the flowering plant market in the United States. Its dual role as both an attractive ornamental and a problematic invasive species makes it a subject of interest for both ecological and economic reasons.

L. camara is a polyploid species with a base chromosome number of 11 (1x = 11). Ploidy levels in this species can range from diploid (2x) to hexaploid (6x), particularly in commercial varieties and breeding lines (Czarnecki et al. 2014; Parrish et al. 2021). It is believed that L. camara is an autopolyploid species, capable of increasing its ploidy levels due to the presence of unreduced female gametes (Czarnecki and Deng 2009). This polyploid nature potentially contributes to its adaptability and invasiveness, as well as its appeal as an ornamental plant.

Despite the rich genetic diversity inherent to L. camara , there is a conspicuous lack of comprehensive genomic resources to guide both breeding programs aimed at enhancing its ornamental traits and conservation efforts to manage its invasive characteristics. While a handful of transcriptome studies have been conducted focusing on aspects such as unreduced female gamete production genes and genes involved in phenylpropanoid biosynthesis (Peng et al. 2019 ; Shah et al. 2020 ), these offer only a partial view of the species’ genetic landscape. Moreover, a 2013 study that used chloroplast spacers and microsatellites to explore the population structure of lantana in India found high levels of genetic diversity at the examined loci (Ray and Quader 2014 ). This study suggested multiple introductions of the species into India, but it also underscored the need for more extensive genomic data. Given the high genetic diversity observed at just a few loci, there is a compelling case for a more comprehensive genomic exploration to unlock the full scope of lantana’s genetic makeup.

Whole-genome sequencing and de novo assembly have become indispensable tools for bioinformaticians and geneticists seeking to elucidate the traits inherent to plant species. The availability of such comprehensive genomic data empowers researchers to identify genes associated with key traits, develop molecular markers for breeding programs, and explore the phylogenetic relationships among plant species. For L. camara , the first step in this genomic exploration was the assembly and annotation of its chloroplast genome (Yaradua and Shah 2020 ). This study reported a chloroplast genome length of 154,388 bp and identified 90 protein-coding genes. Furthermore, a comparative analysis with other chloroplast genomes in the Verbenaceae family positioned L. camara as a sister taxon to Lippia origanoides . While this initial study laid important groundwork, it also highlighted the need for a more comprehensive genomic analysis to fully understand the genetic diversity and potential of this complex species.

Subsequent to the initial assembly of Lantana camara 's chloroplast genome, two de novo genome assemblies have been published, both utilizing short-read sequencing data. The first, by Shah et al. ( 2022 ), was part of a broader study aimed at identifying gene targets for herbicide development across seven weed species. For L. camara , the study focused on a wild population in Queensland, Australia, and generated over 870 million 2 × 150 bp paired-end Illumina reads. These were assembled into 1,053,782 scaffolds with an N50 of 3 kb, resulting in a fragmented 1.57 Gb genome. This assembly had a Benchmarking Universal Single-Copy Orthologs (BUSCO) score of 79.8% and contained 18,369 protein-coding genes. Based on a k-mer estimated genome size of 6.36 Gb and other genome estimates, it can be inferred that the sequenced accession was tetraploid (Parrish et al. 2021 ). In the same year, Joshi et al. ( 2022 ) took a similar approach but used an accession with a 2.59 pg/2C DNA content. They generated over 500 million paired-end reads and assembled a 1.89 Gb genome with a notably higher BUSCO score of 99.3%. Although the total number of scaffolds was not reported, 26,057 were greater than 10 kb in size. While these two genomes provide valuable genomic data for the species, a chromosome-scale assembly is needed for more accurate and reliable genomics studies.

In the present study, a significant step forward is taken in the genomic exploration of L. camara . The first annotated, haplotype-resolved, chromosome-scale genome is presented, not only for this species but also for the Verbenaceae family as a whole. This comprehensive genomic resource aims to fill existing gaps in the understanding of lantana’s genetic diversity and complexity. By providing such a detailed genomic map, the study offers valuable insights that could be leveraged for both conservation efforts to control its invasive spread with new herbicides and breeding programs to enhance its ornamental traits. The work sets a new standard for genomic research in the Verbenaceae family and offers a robust foundation for future studies.

## Materials and methods

## Plant material and DNA extraction

Lantana breeding line UF-T48 plants were subjected to etiolation by enclosing them in dark cardboard boxes within a temperature-controlled greenhouse environment for a duration of three weeks. Subsequently, etiolated leaves were harvested, snap-frozen in liquid nitrogen, and preserved at -80°C. The frozen tissue samples were then shipped to CD Genomics (Shirley, New York, USA) for genomic DNA extraction and sequencing. The cetyl trimethylammonium bromide (CTAB) method was used to isolate high molecular weight DNA suitable for subsequent sequencing processes.

## Library preparation and sequencing

The high molecular weight DNA was utilized to prepare SMRT-bell libraries following the protocol provided by Pacific Biosciences (Menlo Park, California, USA). Additionally, Arima-HiC libraries were prepared (Arima, Carlsbad, California, USA) for chromatin conformation capture sequencing. The PacBio libraries were sequenced using three 8 M SMRT cells on a PacBio Sequel II system. The Hi-C libraries underwent sequencing on an Illumina NovaSeq 6000 platform (Illumina, San Diego, California, USA). Validation of the Hi-C libraries was conducted using 12 Gb of Illumina paired-end reads, analyzed with qc3c v0.5 software (DeMaere and Darling 2021 ) in the absence of a reference genome.

## RNA extraction and sequencing

For transcriptomic analysis, approximately 100 mg of tissue was collected from leaves, roots, green stems, and green fruits. The samples were immediately frozen in liquid nitrogen and stored at -80°C. Collection occurred at the University of Florida Institute of Food and Agricultural Sciences (UF/IFAS) Gulf Coast Research and Education Center in Wimauma, Florida, USA, between 8:00 and 9:00 AM in October 2022. RNA extraction was performed using the RNeasy Plant Mini Kit by Qiagen (Hilden, Germany). The extracted RNA was then sent to Novogene (Beijing, China) for library preparation and Illumina sequencing, targeting a yield of 6 Gb per sample.

## Genome size estimation

The nuclear DNA content of the UF-T48 lantana breeding line was assessed following the protocol established by Doležel et al. (2007). Fresh leaf tissue was thoroughly rinsed with tap water. Approximately 30 mg of leaf tissue from both lantana and the internal standard, tomato (Solanum lycopersicum L. ‘Stupické polni rané’, 1.96 pg·2C⁻¹), were co-chopped in 1 mL of LB01 buffer. To this mixture, 50 µL of RNase (Sigma-Aldrich, St. Louis, Missouri, USA; 1 mg·mL⁻¹) was added. The chopping was performed with a sharp razor blade to release the nuclei into the solution. The nuclei suspension was then filtered through a 50 µm pore nylon mesh filter to remove debris. Subsequently, 50 µL of the DNA fluorochrome propidium iodide (Sigma-Aldrich, St. Louis, Missouri, USA; 1 mg·mL⁻¹) was added to stain the DNA. The stained nuclei were analyzed using a Cyflow® Ploidy Analyser (Sysmex Europe GmbH, Norderstedt, Germany) flow cytometer. Each leaf sample was subjected to three flow cytometric analyses, and three separate clonal plants were evaluated to ensure accuracy. The DNA content for each sample was calculated using the formula provided by Doležel et al. (2007): nuclear DNA content of lantana = nuclear DNA content of internal standard × (mean fluorescence value of lantana sample ÷ mean fluorescence value of the internal standard). K-mer counting was performed on the raw DNA sequencing reads using KMC v3.2.1 (Kokot et al. 2017). The resulting k-mers were plotted in R v4.3.1 (R Core Team 2023) to estimate the genome size.
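The flow cytometry formula above can be written as a short function. This is a minimal sketch: the function name and the fluorescence values in the example are hypothetical and chosen only for illustration, while the 1.96 pg/2C value for the tomato standard comes from the text.

```python
def nuclear_dna_content_pg(sample_mean_fluorescence,
                           standard_mean_fluorescence,
                           standard_2c_pg):
    """Sample 2C DNA content (pg) from peak means and a known internal standard."""
    if standard_mean_fluorescence <= 0:
        raise ValueError("standard fluorescence must be positive")
    ratio = sample_mean_fluorescence / standard_mean_fluorescence
    return standard_2c_pg * ratio


# 2C DNA content of the tomato internal standard, as stated in the text.
TOMATO_2C_PG = 1.96

# Hypothetical peak means (arbitrary fluorescence units), for illustration only.
example_pg = nuclear_dna_content_pg(
    sample_mean_fluorescence=154.0,
    standard_mean_fluorescence=100.0,
    standard_2c_pg=TOMATO_2C_PG,
)
print(f"Estimated 2C DNA content: {example_pg:.2f} pg")
```

In practice the function would be applied to each of the replicate runs described above and the results averaged.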

## De novo assembly

For quality assessment, PacBio and Hi-C sequencing reads were analyzed using FastQC v0.11.7 (Andrews 2010 ). Hi-C reads underwent trimming at the GATC restriction enzyme site with HOMER v4.11 (Heinz et al. 2010 ). The genome assembly was performed de novo using hifiasm, integrating the PacBio data sets and trimmed Hi-C reads with default parameters on a 50-thread computational setup (Cheng et al. 2021 ). The processed Hi-C reads were mapped to the draft genome following the Arima-HiC mapping pipeline protocol (Arima Genomics 2019 ). BWA v0.7.17 (Li and Durbin 2009 ) was used for the mapping, and the mapped reads were filtered using SAMtools v1.15 (Li et al. 2009 ) and BEDtools v2.30.0 (Quinlan and Hall 2010 ). The yahs v1.1 tool (Zhou et al. 2023 ) utilized the mapped reads and draft assembly for scaffolding. To fill gaps in the chromosome assemblies, raw PacBio sequencing reads were applied using TGS GapCloser v1.2.1 (Xu et al. 2020 ), which is tailored for closing gaps in third-generation sequencing assemblies.

## Assembly quality evaluation

The integrity and quality of both draft and final genome assemblies were evaluated using Quast v5.0.2 (Gurevich et al. 2013 ), which provided essential statistics such as contig number, N50, and total assembly length. To estimate the assembly quality value (QV), Merqury v1.3 (Rhie et al. 2020 ) was employed, offering a k-mer based quantification of accuracy. The completeness of the assemblies was gauged using the BUSCO database v5.3.0 (Simão et al. 2015 ). For the spatial organization of the genome, trimmed Hi-C reads were aligned to the phased assemblies with HiC-Pro v3.0.0 (Servant et al. 2015 ) and the resulting contact maps were visualized using Juicebox v1.11.08 (Durand et al. 2016 ), providing a chromosomal interaction overview. The two phased assemblies were aligned to each other and plotted to assess synteny using D-GENIES (Cabanettes and Klopp 2018 ). To further assess the assembly quality, the Long Terminal Repeat Assembly Index (LAI) (Ou et al. 2018 ) was calculated for each chromosome using LTR-retriever v2.5 (Ou and Jiang 2018 ). This index offers a measure of the completeness of long terminal repeat retrotransposons, which is indicative of the overall assembly quality, particularly in repeat-rich regions.
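For readers unfamiliar with N50, the statistic can be computed from a list of contig lengths as sketched below. This illustrates the standard definition only (the length of the contig at which the cumulative sum of lengths, sorted from longest to shortest, first reaches half of the total assembly length); it is not Quast's implementation, and the contig lengths in the example are made up.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("at least one length is required")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half of the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Hypothetical contig lengths (in Mb), for illustration only.
toy_contigs = [10, 8, 6, 4, 2]
print(n50(toy_contigs))
```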

## Repetitive sequence annotation

Transposable elements (TEs), which are crucial components of the genomic landscape, were annotated using EDTA v1.9.6 (Ou et al. 2019 ). This tool was employed with its default parameters to systematically identify and catalog the various classes of TEs within the assembly. Following the annotation, the identified TE regions were masked to mitigate their impact on subsequent analyses, utilizing RepeatMasker v4.1.1 (Tarailo-Graovac and Chen 2009 ). In parallel, the assembly was scanned for simple sequence repeats (SSRs) using PERF v0.4.6 (Avvaru et al. 2018 ), which extracted microsatellite sequences, a resource valuable for genetic mapping and marker development. Additionally, the search for telomeric sequences was conducted using tidk v0.2.31 (Brown et al. 2023 ), a specialized tool for identifying the repetitive DNA sequences that cap the ends of chromosomes, providing insights into chromosome structure and stability.

## Gene annotation

For the prediction of protein-coding genes in the UF-T48 lantana genome, a comprehensive approach was employed utilizing RNA-seq data. These data encompassed a diverse range of tissues, including leaves, green stems, roots, and green fruits, ensuring a broad representation of the gene expression profile. Additionally, publicly available RNA-seq reads specific to UF-T48 flowers were incorporated, sourced from the NCBI project PRJNA956917 (Parrish et al. 2024). RNA-seq reads were trimmed using Trimmomatic v0.39 (Bolger et al. 2014) prior to input for gene prediction. The gene prediction was conducted using Braker v3.0.3 (Gabriel et al. 2023), a tool known for its accuracy in predicting gene structures in eukaryotic genomes, especially when guided by RNA-seq data. Following the prediction of protein-coding genes, functional annotation was carried out using eggNOG mapper v2.1.6 (Cantalapiedra et al. 2021). This tool categorizes genes into functional groups based on orthology and provides insights into potential gene functions by mapping them to known gene families and biological pathways.

## Anthocyanin/Carotenoid pathway and herbicide target genes

Candidate genes with differential expression in anthocyanin and carotenoid pathways between white, yellow, and red flower colors were retrieved from NCBI project PRJNA956917 (Parrish et al. 2024 ). Herbicide target gene queries were obtained from the study published by Shah et al. ( 2022 ). To locate these candidate genes within the assembled UF-T48 genome, a DIAMOND search v2.1.8 (Buchfink et al. 2021 ) was employed.

## Tissue specific RNA analysis

Trimmed RNA-seq reads were aligned to the assembled genome using HISAT2 v2.2.1 (Kim et al. 2019 ). Raw gene counts were obtained from the alignment files by employing HTSeq v2.0.3 (Anders et al. 2015 ).

## Genome and transcriptome sequencing

Ploidy analysis revealed that the somatic nuclei of the UF-T48 lantana breeding line contained approximately 3.02 ± 0.02 pg/2C of nuclear DNA, which equates to approximately 2.95 Gb (3.02 pg/2C × 0.978 Gb/pg) (Doležel et al. 2007). To achieve 60× coverage, three 8 M single-molecule, real-time (SMRT) cells were utilized on a PacBio Sequel II sequencer (Table 1). This approach yielded 94.86 Gb of HiFi reads, generated from 5.6 million reads with an average read length of 16,816 bp. For Hi-C sequencing, an Illumina NovaSeq 6000 was employed, resulting in 30.92 Gb of data. This dataset comprised 103 million paired-end reads, each with an average length of 150 bp. Prior to scaling up the Hi-C sequencing to achieve 10× coverage, the quality of Hi-C cross-linking was assessed using 12 Gb of Illumina paired-end reads. Analysis of these reads indicated that 80% were true products of proximity ligation, confirming the quality of the Hi-C data. To further assist with genome annotation, RNA-seq data were also generated for the UF-T48 breeding line. This resulted in 258 Gb of data, produced from 347 million paired-end reads, each 150 bp in length (Table 1).
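The size conversion and the coverage figures in this paragraph can be checked with simple arithmetic. The sketch below is illustrative: the variable and function names are hypothetical, and because the text does not state whether coverage was computed against the 2C estimate (about 2.95 Gb) or the haploid 1C size (half of that), both denominators are shown. It also assumes that "103 million paired-end reads" means 103 million read pairs.

```python
PG_TO_GB = 0.978  # conversion factor quoted in the text (Gb per pg of DNA)


def pg_to_gb(picograms):
    """Convert a DNA mass in picograms to gigabases."""
    return picograms * PG_TO_GB


def coverage(yield_gb, genome_gb):
    """Sequencing depth as total bases divided by genome size."""
    return yield_gb / genome_gb


genome_2c_gb = pg_to_gb(3.02)    # both haplotypes of the diploid
genome_1c_gb = genome_2c_gb / 2  # assumed haploid size

hifi_cov_1c = coverage(94.86, genome_1c_gb)
hifi_cov_2c = coverage(94.86, genome_2c_gb)
hic_cov_2c = coverage(30.92, genome_2c_gb)

# Assumption: 103 million read pairs, two 150 bp reads per pair.
hic_yield_gb = 103e6 * 2 * 150 / 1e9

print(f"2C genome size: {genome_2c_gb:.2f} Gb")
print(f"HiFi depth: {hifi_cov_1c:.1f}x (1C) or {hifi_cov_2c:.1f}x (2C)")
print(f"Hi-C depth: {hic_cov_2c:.1f}x (2C); yield from read counts: {hic_yield_gb:.1f} Gb")
```

Under these assumptions the HiFi yield is consistent with the 60× target only when measured against the haploid size, whereas the Hi-C yield is close to 10× when measured against the 2C size.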

## Phased genome assembly

The genome of UF-T48 was assembled de novo, without the use of parental genomic data, by utilizing HiFi and Hi-C reads. K-mer analysis for genome size estimation aligned well with flow cytometry estimates, reporting an estimated genome size of 2.95 Gb. The k-mer frequency distribution (k = 23) exhibited a bimodal pattern, characteristic of a diploid organism with both homozygous and heterozygous genomic regions (Fig. 1). Numerical integration of the areas under the respective peaks of the distribution yielded an estimated heterozygosity of 72.31%. The assembly was phased into two separate datasets: the phased 1 assembly contained 1,295 contigs with an N50 of 104.99 Mb, while the phased 2 assembly had 426 contigs with an N50 of 85.09 Mb (Table 2). The largest contigs in the phased 1 and phased 2 assemblies measured 146.15 Mb and 170.47 Mb, respectively. Notably, 75% of the phased 2 assembly was composed of just 11 contigs, suggesting that most of each chromosome is represented by a single contig. Both phased assemblies achieved a complete BUSCO score of 97.7%, indicating high-quality genome assemblies.
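A common way to turn a k-mer histogram into a genome size estimate is to divide the total number of counted k-mers by the depth of a coverage peak. The sketch below illustrates that general idea with a made-up histogram; it is not the authors' R script, and the function name, the `min_depth` error cutoff, and the toy values are all assumptions.

```python
def genome_size_from_kmer_histogram(histogram, peak_depth, min_depth=5):
    """Estimate genome size as (total k-mers) / (peak depth).

    histogram maps k-mer depth -> number of distinct k-mers at that depth.
    Depths below min_depth are treated as sequencing errors and ignored.
    """
    if peak_depth <= 0:
        raise ValueError("peak_depth must be positive")
    total_kmers = sum(depth * count
                      for depth, count in histogram.items()
                      if depth >= min_depth)
    return total_kmers / peak_depth


# Toy histogram with an error spike at depth 1, a heterozygous peak at 33x
# and a homozygous peak at 66x (peak depths taken from the Fig. 1 caption).
toy_histogram = {1: 1000, 33: 200, 66: 400}

haploid_estimate = genome_size_from_kmer_histogram(toy_histogram, peak_depth=66)
diploid_estimate = genome_size_from_kmer_histogram(toy_histogram, peak_depth=33)
print(haploid_estimate, diploid_estimate)
```

Dividing by the homozygous peak depth gives an estimate of the haploid size, while dividing by the heterozygous peak depth gives roughly the combined size of both haplotypes, which is the quantity comparable to a 2C flow cytometry value.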

Fig. 1 The k-mer (k = 23) distribution of the UF-T48 Lantana camara genome. The leftmost peak (~33×) represents the heterozygous region of the genome and the rightmost peak (~66×) represents the homozygous region of the genome.

Hi-C reads were utilized to scaffold the phased genome assemblies. Of these, 99.65% of the first set of Hi-C reads (read 1) and 99.38% of the second set (read 2) were successfully mapped to the assembled genome. After filtering out unmapped reads, low-quality reads, and singletons, 37.20% of the uniquely mapped reads were retained for scaffolding. These filtered Hi-C read pairs were visualized using a Hi-C contact map, which revealed 11 chromosomes in both phased assemblies (Fig. 2). The density of the Hi-C pairs on the contact map suggests a low likelihood of mis-assemblies in the genome. Furthermore, a high degree of collinearity was observed between the two phased assemblies, with only a few small inversions and rearrangements evident (Fig. 3). All chromosomes were assembled gap-free except chromosome 5, which has one gap of unknown length in each haplotype: at 26.17 Mb on chromosome 5A and at 23.88 Mb on chromosome 5B (Fig. 4).

Fig. 2 Hi-C contact map of phased 1 (a) and phased 2 (b) genome assemblies of UF-T48 Lantana camara. Each square corresponds to the chromosome listed along the horizontal axis. The color scale bar represents interaction frequencies. Higher values indicate more frequent interactions.

Fig. 3 Dotplot of aligned Lantana camara UF-T48 phased 1 and phased 2 genome assemblies.

Fig. 4 Circos plot displaying the characteristics of the UF-T48 Lantana camara genome assembly. Concentric circles from outside to inside show the following: 1) 22 assembled pseudomolecules (Mb); 2) heatmap of locations of predicted gene models with gene density increasing with darker shading; 3) heatmap of locations of predicted long terminal repeat (LTR) transposable elements (TEs) with LTR density increasing with darker shading; 4) locations of telomeric repeats; and 5) locations of gaps in the assembly.

A total of 29 telomeres were identified using the telomeric motif (5'-AAACCCT-3') at the terminal ends of pseudo-chromosomes (Fig.  4 ). Telomere-to-telomere assembly was achieved for pseudo-chromosomes 1, 6, and 7 in both phased assemblies, as well as for pseudo-chromosomes 3B and 9B. Telomeres were identified at either the 5' or 3' end for all other pseudo-chromosomes, with the exception of chromosome 4A, which had no telomeric repeats detected. The Long Terminal Repeat (LTR) Assembly Index (LAI) for individual pseudo-chromosomes ranged from 18.51 to 23.1 (Fig.  5 ). The overall LAI scores were 19.61 for the phased 1 assembly and 19.12 for the phased 2 assembly, indicating high-quality genome assemblies.
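The idea behind the telomere search can be illustrated by counting the motif, and its reverse complement, in a window at each end of a sequence. This is a simplified sketch and not tidk's algorithm: the function names, the window size, and the toy sequence are assumptions, and only the motif 5'-AAACCCT-3' comes from the text.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_terminal_repeats(sequence, motif="AAACCCT", window=1000):
    """Count motif occurrences (either strand) in the first and last `window` bases."""
    seq = sequence.upper()
    motif_rc = reverse_complement(motif)

    def count(region):
        # str.count returns non-overlapping matches, which suits tandem repeats.
        return region.count(motif) + region.count(motif_rc)

    return count(seq[:window]), count(seq[-window:])


# Toy pseudo-chromosome: 5 repeat units at the start, 3 reverse-complement units at the end.
toy_chromosome = "AAACCCT" * 5 + "ACGTACGTAC" * 20 + "AGGGTTT" * 3
start_count, end_count = count_terminal_repeats(toy_chromosome, window=50)
print(start_count, end_count)
```

A pseudo-chromosome with a high repeat count at both ends would be a telomere-to-telomere candidate; a real analysis would also need a threshold for how many repeat units count as a telomere.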

Fig. 5 The Long Terminal Repeat (LTR) Assembly Index (LAI) distribution in the UF-T48 Lantana camara genome assembly.

## Genome annotation

Repetitive sequences constitute 85.82% of the combined phased 1 and phased 2 lantana assemblies. Among these, long terminal repeat (LTR) transposable elements represent the majority, accounting for 70.23% of the repetitive sequences (Fig.  5 ; Supplementary Table  1 ). Simple sequence repeats (SSRs) comprise 2.12% of the genome, totaling 3,710,838 repeats (Supplementary Table  2 ). The genome contains 83,775 protein-coding genes, which give rise to 95,239 transcripts (Fig.  4 ; Supplementary Table  3 ). The average gene length is 2,415 bp, with a mean coding sequence length of 1,212 bp and an average of 4.5 exons per gene. Protein-coding genes span approximately 8.2% of the UF-T48 genome, equivalent to 202,281,879 bp. Out of the identified protein-coding genes, 83% were functionally annotated. A BUSCO analysis of these annotated genes revealed the presence of 2,176 complete core eudicot genes, accounting for 93.6%. Only 1.6% of these genes were fragmented, and 4.8% were missing.

Parrish et al. ( 2024 ) identified 40 anthocyanin and 2 carotenoid pathway genes that were differentially expressed in red and white flowers, respectively. Alignment of these clusters to the assembled UF-T48 genome revealed 38 genes located throughout the genome (Fig.  6 ). All of the gene clusters were representative of two alleles per locus. Chromosomes one, five, and seven contained the highest number of candidate genes with three candidates per chromosome.

Fig. 6 Locations of differentially expressed anthocyanin and carotenoid genes in the UF-T48 Lantana camara genome assembly.

## Common herbicide gene targets

Invasive lantana genotypes can be eradicated from landscapes with herbicide applications, but the plant is hardy and can survive many applications before it dies. More specialized herbicides are therefore needed to control this invasive plant. To support this research, 12 common gene targets for herbicide development identified in the study by Shah et al. (2022) were extracted from the genome (Supplementary Table 4). All 12 gene targets were identified in full length within the genome, including the two previously missing targets, beta-isopropylmalate dehydrogenase (IMDH) and acetyl-CoA carboxylase 1 (accA). These genes, which belong to the branched-chain amino acid (BCAA) pathway and encode the target of acetyl-CoA carboxylase (ACCase) inhibitors, respectively, are crucial for the development of targeted herbicides.

In the process of annotating the Lantana camara UF-T48 genome, RNA reads from various tissue types were aligned to the genome to quantify gene expression across different tissues. Out of the total 83,775 predicted genes in the genome, 41,729 genes (49.81%) were detected in the RNA-seq data derived from the six tissue types analyzed (Fig.  7 ). Notably, a significant number of genes, 22,344, were found to be expressed across all tissue types, indicating a broad spectrum of shared genetic activity. Among the different tissues, unopened flowers exhibited the highest number of expressed genes, with 40.45% of all predicted genes in the genome showing some level of expression in this tissue. In contrast, green fruit tissue had the fewest number of unique genes expressed, with only 465 genes uniquely expressed in this tissue type.
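The tissue-level summary described above (genes detected in any tissue, genes shared by all tissues, and genes unique to one tissue) can be derived from per-tissue count tables with set operations. The sketch below uses a made-up table of four genes and three tissues; the function name and the `min_count` threshold of one read are assumptions, since the text does not state the cutoff used to call a gene detected.

```python
def summarize_expression(counts_by_tissue, total_genes, min_count=1):
    """Summarise gene detection across tissues from raw count tables.

    counts_by_tissue maps tissue name -> {gene_id: raw read count}.
    """
    if not counts_by_tissue:
        raise ValueError("at least one tissue is required")
    expressed = {
        tissue: {gene for gene, count in counts.items() if count >= min_count}
        for tissue, counts in counts_by_tissue.items()
    }
    detected = set().union(*expressed.values())
    shared = set.intersection(*expressed.values())
    unique = {}
    for tissue, genes in expressed.items():
        others = set().union(*(g for t, g in expressed.items() if t != tissue))
        unique[tissue] = genes - others
    return {
        "percent_detected": 100 * len(detected) / total_genes,
        "shared_by_all": shared,
        "unique": unique,
    }


# Hypothetical counts for illustration only.
toy_counts = {
    "leaf": {"g1": 12, "g2": 0, "g3": 5, "g4": 0},
    "root": {"g1": 3, "g2": 7, "g3": 0, "g4": 0},
    "flower": {"g1": 9, "g2": 0, "g3": 2, "g4": 0},
}
summary = summarize_expression(toy_counts, total_genes=4)
print(summary)

# The percentage reported in the text follows from the two gene totals it gives.
reported_percent = 100 * 41729 / 83775
print(f"{reported_percent:.2f}%")
```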

Fig. 7 RNA gene expression counts from 6 tissue types that were used in the annotation of the UF-T48 Lantana camara genome assembly. This image was generated by ChatGPT-4 DALL·E 3, https://chat.openai.com

The genome of the UF-T48 lantana breeding line, as revealed by this study, offers significant insights into the genetic composition of this ornamental plant. The findings align with prior research regarding genome size estimation techniques, with the ploidy analysis closely mirroring the k-mer analysis, a consistency observed in other plant genomes (Doležel et al. 2007).

A high level of genome heterozygosity as estimated by k-mers underscores the importance of having a haplotype phased assembly to capture the full genetic diversity present. The phased genome assembly, achieved without parental genomic data, underscores the advancements in sequencing technologies. The high N50 values of both phased assemblies, especially when compared to other plant genomes, indicate a high level of contiguity and completeness (Kersey 2019 ). The utilization of Hi-C reads for scaffolding further enhanced the quality of the assembly, as evidenced by the high mapping rates and the clear visualization of chromosome pairs on the Hi-C contact map. This approach, combined with the high BUSCO scores, suggests that the UF-T48 genome assembly is of superior quality and can serve as a reference for future lantana genomic studies.

The identification of telomeres in the UF-T48 genome is crucial for understanding chromosome stability and integrity. The presence of telomeres in most pseudo-chromosomes, and the achievement of telomere-to-telomere assembly in several, is indicative of a comprehensive and high-quality assembly. The LAI scores further corroborate the quality of the assembly, aligning with scores observed in other high-quality plant genome assemblies.

Repetitive sequences, particularly LTR transposable elements, dominate the UF-T48 genome. This high proportion of repetitive sequences is consistent with other complex plant genomes and underscores the challenges of assembling such genomes (Mehrotra and Goyal 2014 ; Macas et al. 2015 ).

Despite these challenges, the successful annotation of a significant number of protein-coding genes, with a high percentage being functionally annotated, is a testament to the robustness of the sequencing and annotation methodologies employed. While only half of the predicted protein-coding genes were supported by RNA-seq data, this likely reflects the limited depth of RNA-seq data coverage and the restricted range of tissue types analyzed. Nevertheless, the RNA-seq data proved adequate for training the ab initio model, enabling the prediction of the remaining genes in the genome. The BUSCO analysis results further emphasize the completeness of the UF-T48 genome assembly. The high percentage of complete core eudicot genes, coupled with a minimal number of fragmented or missing genes, places the UF-T48 genome among the top-tier of plant genome assemblies in terms of quality and completeness.

The alignment of anthocyanin and carotenoid biosynthetic pathway genes, previously identified in a de novo transcriptome study (Parrish et al. 2024), to the UF-T48 genome represents a significant step forward in connecting functional genomics with structural genomics in this species. The successful localization of genes such as anthocyanidin synthase (ANS), basic helix-loop-helix 42 (BHLH42), cinnamate-4-hydroxylase (C4H), and others not only emphasizes the UF-T48 assembly’s role as a robust scaffold for integrating transcriptomic and genomic data but also showcases its utility in diverse genomic explorations. This is further exemplified by the mapping of the 12 common gene targets for herbicide development originally identified by Shah et al. (2022). The identification of these gene targets, including the previously missing IMDH and accA genes, within the UF-T48 genome is a parallel and equally significant stride in supporting herbicide development for lantana control.

This dual achievement underscores the UF-T48 genome assembly’s versatility, serving both ornamental breeding programs and herbicide research. While the precise localization of biosynthetic pathway genes facilitates the manipulation of genes for vibrant coloration in lantana flowers, the mapping of herbicide-target genes offers a genetic blueprint for developing more effective herbicides. Thus, the UF-T48 genome emerges as a comprehensive tool, aiding in the creation of new floral varieties with desired characteristics and in controlling invasive genotypes, addressing both aesthetic and ecological concerns associated with lantana.

This study showcases the successful assembly of the UF-T48 lantana breeding line, a complex genome, using a combination of PacBio HiFi long-read sequencing and Hi-C data. This approach facilitated the creation of the first chromosome-scale, haplotype-phased assembly for Lantana camara . Remarkably, this high-quality assembly was achieved without the need for parental sequence data. The resulting genome provides a comprehensive genetic blueprint of this ornamental plant species. The availability of this UF-T48 genome assembly will undoubtedly pave the way for the identification of genes associated with key ornamental and invasive traits, furthering the development of advanced breeding tools and strategies for Lantana camara and the Verbenaceae family.

## Availability of data and materials

The genome assembly files of the UF-T48 lantana genome are available from the NCBI Sequence Read Archive BioProject database with the accession numbers PRJNA1065478, PRJNA1069082, and PRJNA1069083.

## References

Anders S, Pyl PT, Huber W. HTSeq – a Python framework to work with high-throughput sequencing data. Bioinformatics. 2015;31:166–9. https://doi.org/10.1093/BIOINFORMATICS/BTU638 .

Andrews S. Babraham Bioinformatics - FastQC A quality control tool for high throughput sequence data. 2010. https://www.bioinformatics.babraham.ac.uk/projects/fastqc/ . Accessed 7 Mar 2023.

Arima Genomics. Arima-HiC mapping pipeline. 2019. https://github.com/ArimaGenomics/mapping_pipeline/tree/master . Accessed 7 Nov 2023.

Avvaru AK, Sowpati DT, Mishra RK. PERF: an exhaustive algorithm for ultra-fast and efficient identification of microsatellites from large DNA sequences. Bioinformatics. 2018;34:943–8. https://doi.org/10.1093/BIOINFORMATICS/BTX721 .

Bhagwat SA, Breman E, Thekaekara T, Thornton TF, Willis KJ. A battle lost? Report on two centuries of invasion and management of Lantana camara L. in Australia, India and South Africa. PLoS One. 2012. https://doi.org/10.1371/journal.pone.0032407 .

Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014;30:2114–20. https://doi.org/10.1093/BIOINFORMATICS/BTU170 .

Brown M, González De la Rosa PM, Blaxter M. A Telomere Identification toolkit. 2023. Zenodo. https://doi.org/10.5281/zenodo.10091385 .

Buchfink B, Reuter K, Drost HG. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nat Methods. 2021;18:366–8. https://doi.org/10.1038/s41592-021-01101-x .

Cabanettes F, Klopp C. D-GENIES: dot plot large genomes in an interactive, efficient and simple way. PeerJ. 2018. https://doi.org/10.7717/PEERJ.4958 .

Cantalapiedra CP, Hernández-Plaza A, Letunic I, Bork P, Huerta-Cepas J. eggNOG-mapper v2: functional annotation, orthology assignments, and domain prediction at the metagenomic scale. Mol Biol Evol. 2021;38:5825–9. https://doi.org/10.1093/MOLBEV/MSAB293 .

Cheng H, Concepcion GT, Feng X, Zhang H, Li H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat Methods. 2021;18:170–5. https://doi.org/10.1038/s41592-020-01056-5 .

Czarnecki DM, Deng Z. Occurrence of unreduced female gametes leads to sexual polyploidization in lantana. J Am Soc Hortic Sci. 2009;134:560–6. https://doi.org/10.21273/JASHS.134.5.560 .

Czarnecki DM, Hershberger AJ, Robacker CD, Clark DG, Deng Z. Ploidy levels and pollen stainability of Lantana camara cultivars and breeding lines. HortScience. 2014;49:1271–6. https://doi.org/10.21273/HORTSCI.49.10.1271 .

DeMaere MZ, Darling AE. qc3C: Reference-free quality control for Hi-C sequencing data. PLoS Comput Biol. 2021. https://doi.org/10.1371/JOURNAL.PCBI.1008839 .

Doležel J, Greilhuber J, Suda J. Estimation of nuclear DNA content in plants using flow cytometry. Nat Protoc. 2007;2:2233–44. https://doi.org/10.1038/nprot.2007.310 .

Durand NC, Robinson JT, Shamim MS, Machol I, Mesirov JP, Lander ES, et al. Juicebox provides a visualization system for Hi-C contact maps with unlimited zoom. Cell Syst. 2016;3:99–101. https://doi.org/10.1016/J.CELS.2015.07.012 .

Gabriel L, Brůna T, Hoff KJ, Ebel M, Lomsadze A, Borodovsky M, et al. BRAKER3: fully automated genome annotation using RNA-seq and protein evidence with GeneMark-ETP, AUGUSTUS and TSEBRA. bioRxiv.:2023.06.10.544449 [Preprint]. 2023 [cited 2024 Mar 5]: [21 p.]. Available from: https://doi.org/10.1101/2023.06.10.544449 .

Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinformatics. 2013;29:1072–5. https://doi.org/10.1093/BIOINFORMATICS/BTT086 .

Heinz S, Benner C, Spann N, Bertolino E, Lin YC, Laslo P, et al. Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol Cell. 2010;38:576–89. https://doi.org/10.1016/J.MOLCEL.2010.05.004 .

Joshi AG, Praveen P, Ramakrishnan U, Sowdhamini R. Draft genome sequence of an invasive plant Lantana camara L. Bioinformation. 2022;18:739–41. https://doi.org/10.6026/97320630018739 .

Kersey PJ. Plant genome sequences: past, present, future. Curr Opin Plant Biol. 2019;48:1–8. https://doi.org/10.1016/J.PBI.2018.11.001 .

Kim D, Paggi JM, Park C, Bennett C, Salzberg SL. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat Biotechnol. 2019;37:907–15. https://doi.org/10.1038/s41587-019-0201-4 .

Kokot M, Dlugosz M, Deorowicz S. KMC 3: counting and manipulating k-mer statistics. Bioinformatics. 2017;33:2759–61. https://doi.org/10.1093/BIOINFORMATICS/BTX304 .

Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25:1754–60. https://doi.org/10.1093/BIOINFORMATICS/BTP324 .

Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The sequence alignment/map format and SAMtools. Bioinformatics. 2009;25:2078–9. https://doi.org/10.1093/BIOINFORMATICS/BTP352 .

Macas J, Novak P, Pellicer J, Cizkova J, Koblizkova A, Neumann P, et al. In depth characterization of repetitive DNA in 23 plant genomes reveals sources of genome size variation in the legume tribe Fabeae . PLoS ONE. 2015. https://doi.org/10.1371/JOURNAL.PONE.0143424 .

Mehrotra S, Goyal V. Repetitive sequences in plant nuclear DNA: types, distribution, evolution and function. Genom Proteom Bioinform. 2014;12:164–71. https://doi.org/10.1016/J.GPB.2014.07.003 .

Ou S, Jiang N. LTR_retriever: A highly accurate and sensitive program for identification of long terminal repeat retrotransposons. Plant Physiol. 2018;176:1410–22. https://doi.org/10.1104/PP.17.01310 .

Ou S, Chen J, Jiang N. Assessing genome assembly quality using the LTR Assembly Index (LAI). Nucleic Acids Res. 2018. https://doi.org/10.1093/NAR/GKY730 .

Ou S, Su W, Liao Y, Chougule K, Agda JRA, Hellinga AJ, et al. Benchmarking transposable element annotation methods for creation of a streamlined, comprehensive pipeline. Genome Biol. 2019;20:1–18. https://doi.org/10.1186/S13059-019-1905-Y .

Parrish SB, Qian R, Deng Z. Genome size and karyotype studies in five species of Lantana (Verbenaceae). HortScience. 2021;56:352–6. https://doi.org/10.21273/HORTSCI15603-20 .

Parrish SB, Paudel D, Deng Z. Transcriptome analysis of Lantana camara flower petals reveals candidate anthocyanin biosynthesis genes mediating red flower color development. G3-Genes Genom Genet. 2024. https://doi.org/10.1093/G3JOURNAL/JKAD259 .

Peng Z, Bhattarai K, Parajuli S, Cao Z, Deng Z. Transcriptome analysis of young ovaries reveals candidate genes involved in gamete formation in Lantana camara . Plants. 2019. https://doi.org/10.3390/PLANTS8080263 .

Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26:841–2. https://doi.org/10.1093/BIOINFORMATICS/BTQ033 .

R Core Team. R: A Language and Environment for Statistical Computing. 2023. https://www.R-project.org/ . Accessed 7 Nov 2023.

Ray A, Quader S. Genetic diversity and population structure of Lantana camara in India indicates multiple introductions and gene flow. Plant Biol. 2014;16:651–8. https://doi.org/10.1111/plb.12087 .

Rhie A, Walenz BP, Koren S, Phillippy AM. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biol. 2020;21:1–27. https://doi.org/10.1186/S13059-020-02134-9 .

Servant N, Varoquaux N, Lajoie BR, Viara E, Chen CJ, Vert JP, et al. HiC-Pro: an optimized and flexible pipeline for Hi-C data processing. Genome Biol. 2015;16:1–11. https://doi.org/10.1186/S13059-015-0831-X .

Shackleton RT, Witt ABR, Aool W, Pratt CF. Distribution of the invasive alien weed, Lantana camara , and its ecological and livelihood impacts in eastern Africa. Afr J Range Forage Sci. 2017;34:1–11. https://doi.org/10.2989/10220119.2017.1301551 .

Shah S, Lonhienne T, Murray CE, Chen Y, Dougan KE, Low YS, et al. Genome-guided analysis of seven weed species reveals conserved sequence and structural features of key gene targets for herbicide development. Front Plant Sci. 2022. https://doi.org/10.3389/FPLS.2022.909073 .

Shah M, Alharby HF, Hakeem KR, Ali N, Rahman IU, Munawar M, et al. De novo transcriptome analysis of Lantana camara L. revealed candidate genes involved in phenylpropanoid biosynthesis pathway. Sci Rep. 2020. https://doi.org/10.1038/S41598-020-70635-5 .

Sharma GP, Raghubanshi AS, Singh JS. Lantana invasion: an overview. Weed Biol Manag. 2005;5:157–65. https://doi.org/10.1111/J.1445-6664.2005.00178.X .

Simão FA, Waterhouse RM, Ioannidis P, Kriventseva EV, Zdobnov EM. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics. 2015;31:3210–2. https://doi.org/10.1093/BIOINFORMATICS/BTV351 .

Tarailo-Graovac M, Chen N. Using RepeatMasker to identify repetitive elements in genomic sequences. Curr Protoc Bioinformatics. 2009. https://doi.org/10.1002/0471250953.BI0410S25 .

Taylor S, Kumar L, Reid N. Impacts of climate change and land-use on the potential distribution of an invasive weed: a case study of Lantana camara in Australia. Weed Res. 2012;52:391–401. https://doi.org/10.1111/J.1365-3180.2012.00930.X .

Xu M, Guo L, Gu S, Wang O, Zhang R, Peters BA, et al. TGS-GapCloser: A fast and accurate gap closer for large genomes with low coverage of error-prone long reads. Gigascience. 2020;9:1–11. https://doi.org/10.1093/GIGASCIENCE/GIAA094 .

Yaradua SS, Shah M. The complete chloroplast genome of Lantana camara L. (Verbenaceae). Mitochondrial DNA Part B. 2020;5:918–9. https://doi.org/10.1080/23802359.2020.1719920 .

Zhou C, McCarthy SA, Durbin R. YaHS: yet another Hi-C scaffolding tool. Bioinformatics. 2023. https://doi.org/10.1093/BIOINFORMATICS/BTAC808 .

## Acknowledgements

The authors would like to express their gratitude to their lab members for their valuable assistance and to the anonymous reviewers for their thorough and constructive feedback on the manuscript. Thank you to CD Genomics (Shirley, NY, USA) for PacBio and Hi-C sequencing services and Novogene (Beijing, China) for RNA sequencing services.

This work was supported in part by the U.S. Department of Agriculture Hatch projects (Projects No. FLA-GCC-005065 and No. FLA-GCC-005507).

## Author information

Authors and affiliations.

Gulf Coast Research and Education Center, Department of Environmental Horticulture, University of Florida, IFAS, Wimauma, FL, 33598, USA

S. Brooks Parrish & Zhanao Deng


## Contributions

ZD designed and planned the project. SBP performed all RNA extractions, genome size estimations, and computational analysis. All authors read and approved the final manuscript.

## Corresponding author

Correspondence to Zhanao Deng .

## Ethics declarations

Ethics approval and consent to participate.

Not applicable.

## Consent for publication

The datasets analyzed during the current study are available from the corresponding author on reasonable request.

## Competing interests

The authors declare that they have no competing interests. The corresponding author, ZD, is a member of this journal’s editorial team and was not involved in the journal's review of, or decisions related to, this manuscript.

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Supplementary Information

Supplementary material 1.

Parrish, S.B., Deng, Z. The first high-quality genome assembly and annotation of Lantana camara, an important ornamental plant and a major invasive species. Hortic. Adv. 2, 14 (2024). https://doi.org/10.1007/s44281-024-00043-6

Revised : 13 March 2024

Accepted : 14 March 2024

Published : 10 May 2024

DOI : https://doi.org/10.1007/s44281-024-00043-6


• Chromosome-length genome assembly
• Open access
• Published: 06 May 2024

## Teaching transposon classification as a means to crowd source the curation of repeat annotation – a tardigrade perspective

• Valentina Peona 1 , 2 , 3   na1 ,
• Jacopo Martelossi 4   na1 ,
• Dareen Almojil 5 ,
• Julia Bocharkina 6 ,
• Ioana Brännström 7 , 8 ,
• Max Brown 9 ,
• Alice Cang 10 ,
• Tomàs Carrasco-Valenzuela 11 , 12 ,
• Jon DeVries 13 ,
• Meredith Doellman 14 , 15 ,
• Daniel Elsner 16 ,
• Pamela Espíndola-Hernández 17 ,
• Guillermo Friis Montoya 18 ,
• Bence Gaspar 19 ,
• Danijela Zagorski 20 ,
• Paweł Hałakuc 21 ,
• Beti Ivanovska 22 ,
• Christopher Laumer 23 ,
• Robert Lehmann 24 ,
• Ljudevit Luka Boštjančić 25 ,
• Rahia Mashoodh 26 ,
• Sofia Mazzoleni 27 ,
• Alice Mouton 28 ,
• Maria Anna Nilsson 25 ,
• Yifan Pei 1 , 29 ,
• Giacomo Potente 30 ,
• Panagiotis Provataris 31 ,
• José Ramón Pardos-Blas 32 ,
• Ravindra Raut 33 ,
• Tomasa Sbaffi 34 ,
• Florian Schwarz 35 ,
• Jessica Stapley 36 ,
• Lewis Stevens 37 ,
• Nusrat Sultana 38 ,
• Mohadeseh S. Tahami 40 ,
• Alice Urzì 41 ,
• Heidi Yang 42 ,
• Abdullah Yusuf 43 ,
• Carlo Pecoraro 44 &
• Alexander Suh 1 , 45 , 46

Mobile DNA volume 15, Article number: 10 (2024)

The advancement of sequencing technologies results in the rapid release of hundreds of new genome assemblies a year providing unprecedented resources for the study of genome evolution. Within this context, the significance of in-depth analyses of repetitive elements, transposable elements (TEs) in particular, is increasingly recognized in understanding genome evolution. Despite the plethora of available bioinformatic tools for identifying and annotating TEs, the phylogenetic distance of the target species from a curated and classified database of repetitive element sequences constrains any automated annotation effort. Moreover, manual curation of raw repeat libraries is deemed essential due to the frequent incompleteness of automatically generated consensus sequences.

Here, we present an example of a crowd-sourcing effort aimed at curating and annotating TE libraries of two non-model species built around a collaborative, peer-reviewed teaching process. Manual curation and classification are time-consuming processes that offer limited short-term academic rewards and are typically confined to a few research groups where methods are taught through hands-on experience. Crowd-sourcing efforts could therefore offer a significant opportunity to bridge the gap between learning the methods of curation effectively and empowering the scientific community with high-quality, reusable repeat libraries.

## Conclusions

The collaborative manual curation of TEs from two tardigrade species, for which there were no TE libraries available, resulted in the successful characterization of hundreds of new and diverse TEs in a reasonable time frame. Our crowd-sourcing setting can be used as a teaching reference guide for similar projects: A hidden treasure awaits discovery within non-model organisms.

In-depth analyses of repetitive elements, particularly transposable elements (TEs), are becoming increasingly fundamental to understanding genome evolution and the genetic basis of adaptation [ 1 ]. While there is a wealth of bioinformatic tools available for the identification and annotation of TEs ( https://tehub.org/en/resources/repeat_tools ), any automated annotation effort is limited by the phylogenetic distance of the target species to a database of curated and classified repetitive element sequences [ 2 ]. For example, in birds, where zebra finch and chicken have well-characterized repetitive elements because their genomes were first sequenced in large consortia during the pre-genomics era [ 3 , 4 ], automated annotation of other bird genomes will classify most repeats correctly [ 5 , 6 ]. On the other hand, in taxa as diverse and divergent as insects, up to 85% of repetitive sequences can remain of “unknown” classification in non-Drosophila species [ 7 ]. This is problematic: inferences about the mobility and accumulation of TEs, as well as their potential effects on the host, are not feasible for unclassified repeats, nor for incorrectly classified repeats if the automated classification is based on short, spurious nucleotide sequence similarity [ 8 , 9 ].

The reference bias in TE classification reflects the history of the TE field in the genomics era: In the 1990s and 2000s, there were usually multiple people tasked with TE identification, classification, and annotation for each genome project, yielding manually curated TE consensus sequences (namely representative sequences whose quality was manually controlled and improved) and fully classified TE libraries deposited in databases such as Repbase [ 2 ]. Over the last ten years, however, the number of genome projects both of individual labs as well as large consortia has increased exponentially and so have speed and number of automated TE annotation efforts [ 10 , 11 , 12 ], while time and personnel have remained limited for curated TE annotation efforts. Similar to taxonomic expertise required for identifying and classifying organisms, TE identification and classification need hands-on experience with manual curation for months or even years per genome [ 1 ] which is usually taught through knowledge passed within genome projects and research groups. Recent efforts [ 13 , 14 , 15 ] have started to make manual curation accessible to a broader scientific audience, with the aim to increase reproducibility and comparability. However, what cannot be changed is that there are hundreds if not thousands of genomes per TE-interested researcher with more or less pressing priority for time-consuming manual curation.

Low scalability and people power are major obstacles that need to be overcome by the many facets of computational biology where curation is essential. Annotation efforts of other genomic features have shown that crowd sourcing through teaching [ 16 , 17 , 18 , 19 , 20 , 21 , 22 ], or “course sourcing” as we call it, has the benefit of providing participants with hands-on skills for curation and experience on how to reconcile biology with technical limitations, while simultaneously sharing the workload of time-consuming curation across multiple people working on different parts at the same time. Thus, we argue that a TE curation effort that would take months or years for a single person may fit into a few days or weeks of teaching, of course as long as reproducibility and comparability are ensured throughout course duration.

Here, we present our “course sourcing” experience from two iterations of a Physalia Course on TE identification, classification, and annotation. We focused on two species of tardigrades as a case study to motivate student-centered learning through direct contribution to scientific knowledge: Tardigrades are, to our knowledge, the highest-ranking animal phylum without curated TE annotation, as clearly illustrated by the fact that in previous genome analyses almost all repeats remained of “unknown” classification [ 23 ]. Tardigrades are a diverse group of aquatic and terrestrial animals that show an extraordinary ability to survive extreme environments by entering the state of cryptobiosis [ 24 ]. This animal clade comprises almost 1,200 described species belonging to Panarthropoda [ 25 ], and the two species used in the courses are closely related and belong to the Hypsibiidae family [ 23 ].

The first course took place in person in June 2018 in Berlin across five full-time work days: The first three days familiarized the 13 participants with the biology of TEs, concepts for classification, and methods for annotation using the tardigrade Hypsibius exemplaris genome (formerly identified as Hypsibius dujardini), while the last two days had a student-centered learning format in which each participant curated as many TEs as possible from the target species. The second course took place virtually in June 2021 due to the Covid-19 pandemic and comprised five afternoons in the Berlin time zone to minimize Zoom fatigue. The overall format was similar to the prior in-person course but with 24 participants and a focus on another tardigrade, Ramazzottius varieornatus, which the participants found to share not a single TE family with H. exemplaris, the tardigrade curated in the 2018 course. Between the two courses, the participants uncovered a vast diversity of TEs and successfully curated over 400 consensus sequences. We therefore demonstrate that a collaborative approach is a valuable means to achieve significant results for the scientific community, and we hope to share with the community a teaching reference for future similar efforts, because: A hidden treasure always awaits discovery in non-model organisms.

## Results and discussion

Incorporating crowd sourcing efforts within a classroom setting (“course sourcing”) can represent an invaluable opportunity for teaching, while simultaneously contributing to the scientific community. However, course sourcing also presents its own unique challenges, particularly in terms of minimizing errors and maximizing reproducibility and student engagement. Drawing from our experience in both in-person and virtual settings, we identified several crucial factors in teaching TE manual curation that must be considered during the organization and supervision of such courses: (a) establishing a standardized approach for curation and classification of TE consensus sequences; (b) implementing a peer-review process between participants to check the quality of the curation of each TE consensus sequence; (c) maintaining meticulous version control of the libraries. Here, we describe how we addressed these points.

First, to establish a standard approach to manual curation, we implemented methods widely used in the TE community that have been recently reviewed in detail [ 13 , 14 ]. Briefly, the approach consists of producing and inspecting multi-sequence alignments for each of the consensus sequences automatically generated by RepeatModeler [ 10 ]. Each nucleotide position of the “alignable part” of the alignment is carefully inspected to identify the correct termini of the TE while correcting for any ambiguous base or gap. To correct for ambiguous bases in the curated consensus sequence, we applied the majority rule and assigned the most representative IUPAC nucleotide character for each position in the alignment (see Methods). To correct the consensus sequences where gaps of different lengths are present, we considered each insertion/deletion length as an independent event so that a majority rule was applicable to these regions as well.
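The majority rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the course's actual procedure: the function name `majority_consensus` is hypothetical, ties between bases are resolved with the standard IUPAC ambiguity code, and gaps are voted on per column (a simplification of the per-indel-length rule described in the text).

```python
from collections import Counter

# Standard IUPAC ambiguity codes, keyed by the sorted set of tied bases.
IUPAC = {
    "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
    "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V", "ACGT": "N",
}


def majority_consensus(aligned_seqs):
    """Return a majority-rule consensus for equal-length aligned sequences."""
    if len({len(s) for s in aligned_seqs}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    consensus = []
    for column in zip(*aligned_seqs):
        # Count only unambiguous bases and gaps; other symbols are ignored.
        counts = Counter(c.upper() for c in column if c.upper() in "ACGT-")
        if not counts:
            consensus.append("N")
            continue
        top = max(counts.values())
        winners = sorted(b for b, n in counts.items() if n == top)
        if winners == ["-"]:
            # Assumption: a column where gaps are the strict majority is dropped.
            continue
        bases = [b for b in winners if b != "-"]
        # A single winner is emitted as-is; ties become an ambiguity code.
        consensus.append(bases[0] if len(bases) == 1 else IUPAC["".join(bases)])
    return "".join(consensus)
```

For instance, `majority_consensus(["ACGT", "ACGA", "ACGT"])` keeps `T` in the last column because two of the three sequences support it.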
When very complex regions could not be unambiguously resolved, stretches of 10 N nucleotides were inserted as a placeholder (gap) in the consensus sequence. The TE classification followed the nomenclature used by RepeatMasker to ensure direct compatibility with the tool and its suite of scripts for downstream analysis.

Second, when participants completed the curation of their consensus sequences, their results went through a peer-review process in which both the quality of each consensus sequence and its classification were reviewed by other participants (or course faculty). During the in-person edition, a random set of consensus sequences curated by one participant was assigned to another participant, while in the second, online edition, all sequences were reviewed by the two instructors and one participant (Fig. 1). The review of the TE sequences continued after the official conclusion of the course. To ensure reproducibility and the documentation of the entire decision-making process for classification, all steps and details of classification were recorded in a shared Google Sheet. The tables include the changes in consensus sequence names, the names of the curators and reviewers, and additional comments (Fig. 1, Table S1). Whenever a change was introduced in a consensus sequence (either in the nucleotide sequence itself or in the classification), the new version was directly added to the multi-sequence alignment file used for curation together with the original one. Keeping all the versions of a consensus in the same alignment file, with the respective notes in the tables, provides a basic form of version control that is useful for checking the steps leading to a particular decision.

Across the two iterations of the course, we noticed three points that are particularly challenging for beginners and need extra supervision.
The most challenging points are the identification of the correct termini, the identification of target site duplications (a hallmark of transposition for the vast majority of TEs), if any, and the correct spelling of the TE categories for classification in accordance with the RepeatMasker nomenclature rules. The last point is particularly important if the repeat annotation is visualized as a landscape using the RepeatMasker scripts (e.g., calcDivergenceFromAlign.pl and createRepeatLandscape.pl), because misspelled categories can cause computing errors and downstream misinterpretations.
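Spelling problems in classification labels can be caught before any downstream script runs. The following is a minimal sketch, assuming library headers of the form `>name#Class/Superfamily`. The function `check_header` and the short `KNOWN_LABELS` set are hypothetical and purely illustrative; a real check would use the full list of labels accepted by the RepeatMasker version in use.

```python
import re

# Illustrative subset of class/superfamily labels (assumption: not exhaustive).
KNOWN_LABELS = {
    "DNA/hAT", "DNA/TcMar-Tc4", "DNA/PiggyBac", "DNA/PIF-Harbinger",
    "LINE/L1", "LINE/CR1", "LTR/Gypsy", "LTR/Pao", "Unknown",
}

# A FASTA header is expected to look like ">name#Class/Superfamily".
HEADER = re.compile(r"^>(?P<name>[^#\s]+)#(?P<label>\S+)")


def check_header(header, known=KNOWN_LABELS):
    """Return None if the label is recognized, else a short problem description."""
    match = HEADER.match(header)
    if match is None:
        return "malformed header (expected >name#Class/Superfamily)"
    label = match.group("label")
    if label in known:
        return None
    # Detect labels that differ from a known one only by letter case.
    lowered = {k.lower(): k for k in known}
    if label.lower() in lowered:
        return f"case mismatch: '{label}' should be '{lowered[label.lower()]}'"
    return f"unrecognized label '{label}'"
```

Running such a check over every header of a curated library, and reviewing the flagged entries by hand, is one way to keep misspelled categories out of the landscape scripts.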

Finally, all the tutorials to obtain and curate a TE library are available on the GitHub repository linked to this paper: https://github.com/ValentinaPeona/TardigraTE.

Schematic representation of the peer-reviewed process of TE curation

## Improvement of the transposable element libraries

To generate the TE libraries, we first ran RepeatModeler and RepeatModeler2 on H. exemplaris and R. varieornatus, respectively, and obtained 519 and 898 consensus sequences (Table 1). Then the course participants manually curated as many consensus sequences as possible. In about three course days plus voluntary efforts by some participants after each course, the participants were able to curate 274 consensus sequences (53%) of the H. exemplaris library and 139 consensus sequences (15%) of the R. varieornatus library (Tables S1-S3). Given the lack of previously curated libraries from closely related species, most of the consensus sequences were automatically classified as “Unknown” by RepeatModeler, but the thorough process of manual curation successfully reclassified 296 unknown consensus sequences (out of a total of 413 curated sequences, 71%) into known categories of elements. After manual curation, we found that the two species’ libraries consist mostly of DNA transposons, with a minority of retrotransposons (Table 1). Since many consensus sequences remained uncurated and unclassified, the relative percentages of the categories may change in the future, but we expect, especially from the composition of the H. exemplaris library, to find mostly additional (non-autonomous) DNA transposons among the unclassified sequences.

The process of manual curation improved the overall level of TE classification of the libraries but also the quality of the individual consensus sequences by correctly identifying their termini and in general by extending their sequence. Indeed, by comparing the lengths of the consensus sequences for the same element, we can notice a marked increase in length after curation (Fig.  2 ).

Comparison of the length of the consensus sequences before and after manual curation

## Diversity of transposable elements

When looking at the diversity of repeats in the curated libraries (combined libraries comprising curated and uncurated consensus sequences), we identified a total of 437 Class II DNA consensus sequences belonging to the superfamilies/clades CMC, MULE, TcMar, Sola, PiggyBac, PIF-Harbinger, Zator, hAT, Maverick, P and Zisupton. Many of these elements are non-autonomous and show a remarkable diversity and complexity of internal structures (Fig.  3 ) which emphasizes the need for complete, curated consensus sequences to be able to properly classify internal repeat structures and infer their mode of accumulation in the genome. For Class I retrotransposons, we found 47 LINEs belonging to the superfamilies/clades L1, I, CR1, CRE, R2, R2-NesL, L2, RTE-X and RTE-BovB and another 40 LTRs belonging to the superfamilies/clades DIRS, Gypsy, Ngaro and Pao. The REPET library generated for R. varieornatus (Table S4 ) consists of a total of 130 consensus sequences, with the majority classified as DNA transposons (129), similar to the curated consensus generated from RepeatModeler output (Table S3 ). However, several superfamilies manually identified in the RepeatModeler library, including Zator, MULE, and P, were not detected in the REPET one. These differences may be attributed to variations in underlying software and also to differences in curation and decision-making processes.

Dotplots of six DNA transposons from the library of Hypsibius exemplaris produced with the MAFFT online server. These elements were selected by course participants for aesthetic reasons

To highlight the importance of generating, curating, and using custom repeat libraries for the organisms of interest, we masked the two tardigrade genomes and compared how the annotation and accumulation patterns change when using general repeat libraries (in this case the Repbase library for Arthropoda) and species-specific ones before and after curation (Fig. 4; Tables 2 and S5). The known repeats for Arthropoda available on Repbase provided a poor and insufficient annotation for both species (all the following percentages are given first for H. exemplaris and then for R. varieornatus): only 1.95% and 0.26% of the assemblies were annotated as interspersed repeats, and the accumulation patterns were characterized only by likely old insertions. The use of species-specific, albeit uncurated, libraries completely changed the percentage of TEs annotated (16.38% and 15.66%) and their accumulation patterns, which showed many recently accumulated insertions. While the shape and percentages of the repeat landscapes did not drastically change after the manual curation of the libraries, the curated libraries clearly highlighted a large accumulation of DNA transposons in recent and ancient times alike that were either not present in the other landscapes or were hidden among the “unknown” repeats. Especially for R. varieornatus, the curation highlighted a higher accumulation of repeats in very recent times (1–5% of divergence). This higher accumulation of DNA transposons in recent times is also in line with the finding of multiple putatively active transposable element subfamilies (Table 3). Finally, the use of the repeat library of one species to annotate the other species (reciprocal masking) proved almost as insufficient as the use of the Repbase library for Arthropoda, stressing once again how important it is to have detailed knowledge of the repeatome for correct biological interpretations.

Repeat landscapes of the genomes of H. exemplaris and R. varieornatus annotated with the Repbase library (Arthropoda clade), the uncurated and curated combined libraries of both tardigrades, and the library of the reciprocal species (only species-specific repeats). The divergence from consensus calculated with the Kimura 2-parameter distance model is shown on the x-axis. The percentage of genome annotated is shown on the y-axis
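For readers unfamiliar with the divergence measure on the x-axis, the textbook Kimura 2-parameter distance is d = -1/2 · ln((1 - 2P - Q) · sqrt(1 - 2Q)), where P and Q are the proportions of transitions and transversions between two aligned sequences. The sketch below implements only this plain formula; the function name `kimura_2p` is hypothetical, and the RepeatMasker utility scripts may apply additional adjustments that are not modelled here.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def kimura_2p(seq_a, seq_b):
    """Kimura 2-parameter distance between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguous bases
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1    # A<->G or C<->T
        else:
            transversions += 1  # purine <-> pyrimidine
    if compared == 0:
        raise ValueError("no comparable positions")
    p = transitions / compared
    q = transversions / compared
    # math.log raises ValueError when the sequences are too diverged
    # (saturated) for the model to give a finite distance.
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))
```

A copy that differs from its consensus by one transition in ten sites has a raw difference of 10% but a corrected distance of about 11.2%, which illustrates how the model accounts for multiple substitutions at the same site.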

As a demonstrative example of how the collaborative curation process provides novel insights into TE diversity, taxonomic distribution, and biology, we decided to characterize in depth the consensus sequences that we classified as Tc4. These elements have a rather limited taxonomic distribution, few references to them exist in the literature, and they incompletely duplicate the target site upon transposition [ 26 ], which can pose challenges for their classification. The Tc4 transposons are DDD elements first discovered in Caenorhabditis elegans [ 26 ], where they recognize the interrupted palindrome CTNAG as the target site for insertion and cause duplication of only the central TNA trinucleotide. Regarding their taxonomic distribution, consensus sequences for Tc4 elements are known and deposited only for nematodes and arthropods in RepeatPeps, Repbase and DFAM. Phylogenetic analyses based on DDD segments confidently placed the four tardigrade Tc4 consensus sequences identified in R. varieornatus within the Tc4 clade, in a sister relationship with arthropod elements and with a branching pattern that resembles the Panarthropoda group (tardigrades + onychophorans + arthropods) within Ecdysozoa [ 27 ] (Fig. 5 A). The DDD catalytic domain is highly conserved between different phyla (Fig. 5 B) and the target site of tardigrades mirrors what was previously observed in nematodes (i.e., C|TNA|G where “|” marks the transposase cut site; Fig. 5 C-D). We could therefore hypothesize that these elements first originated during the diversification of Ecdysozoa. However, broader comparative analyses involving more early-diverging Metazoa clades are necessary to confirm this lineage-specific origin.
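The CTNAG target site and its TNA duplication lend themselves to a simple pattern search. The following is a minimal sketch with a hypothetical helper, `find_tc4_target_sites`, that lists candidate sites in a sequence together with the trinucleotide that would be duplicated. It is meant only to make the site structure concrete, not to reproduce the analysis in the paper.

```python
import re

# CTNAG with the central TNA captured; N is any of A, C, G, T.
TARGET_SITE = re.compile(r"C(T[ACGT]A)G")


def find_tc4_target_sites(sequence):
    """Return (start, site, duplicated_trinucleotide) for each CTNAG match."""
    return [
        (m.start(), m.group(0), m.group(1))
        for m in TARGET_SITE.finditer(sequence.upper())
    ]
```

Because CTNAG reads as CTNAG on the reverse strand as well, scanning the forward strand alone finds the sites on both strands.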

Characterization and phylogenetic analyses of Tc4 elements. ( A ) Phylogenetic tree of Tc4 consensus sequences based on DDD catalytic domains identified in the R. varieornatus consensus sequences, highlighted in bold and orange, together with representative sequences extracted from the RepeatPeps library from nematodes (pink) and insects (green). All nodes received maximal support value. ( B ) Alignment of DDD catalytic domains of sequences included in phylogenetic analyses. Residues conserved in more than 80% of the sequences are colored. Arrows highlight catalytic DDD residues. Sequence logos of 5’ ( C ) and 3’ ( D ) ends of Tc4 elements used to curate the R. varieornatus consensus sequences. Black and purple arrows denote terminal inverted repeats (TIRs) and target site duplications (TSDs), respectively. The purple dotted line marks the transposase cut on the CTNAG target site

## Contributions from the course participants

During both editions of the course, participants were free to explore their favorite topics within the scope of the syllabus, and we share two contributions developed by the participants that can be useful for the entire community. First, an additional repeat library of 130 consensus sequences (119 of which are DNA transposons) was produced with the use of REPET for R. varieornatus (Table S4). Second, the participants wrote a guide for the classification of TEs from multi-sequence alignments (File S1), which can be a useful starting point for beginners and complements more extensive guides [ 13 , 14 ].

As shown here and in many other studies, repeat annotation is key to correctly identify and interpret patterns of genome evolution and proper annotation is based on a thorough curation of the repeat libraries [ 8 , 9 , 28 ]. However, it is hard for curation efforts to keep up with the sheer number of genome assemblies released every year as curation done by single laboratories may require months or even years for a single genome. Tools like TE-Aid [ 13 ] and EarlGrey [ 29 ] are rapidly spreading and gaining popularity to facilitate TE curation processes [ 30 , 31 , 32 , 33 ]. Despite these advancements, until fully automatized, reliable tools are developed and there are manual curation training sets for understudied taxa, we emphasize the need to implement manual curation for repeat libraries as well as to find alternative ways to deal with the curation of hundreds of new libraries. Here we presented one such alternative approach, namely a peer-reviewed course sourcing effort designed to be as reproducible and comparable as possible and where the hands-on tutorials were designed to be meaningful for the participants because they dealt with real unexplored data and directly contributed to the scientific community. The two iterations of this course sourcing effort resulted in the successful curation of hundreds of new and diverse TEs. While the repeat libraries presented here were not completely curated and classified, we would like to highlight that TE curation can be considered as a “cumulative” effort of a community. The more people learn how to curate, the more teachers are educated and the faster the process becomes. Therefore, we hope that this experience and teaching framework can be of use for the genome research community and that it can be applicable to other types of data/analyses that need manual curation (e.g., genome assemblies [ 21 , 22 ] and gene annotations).

## Materials and methods

## Genome assemblies

For this study, we used the genome assemblies of two tardigrade species: Hypsibius exemplaris (formerly identified as Hypsibius dujardini; GCA_002082055.1) and Ramazzottius varieornatus (GCA_001949185.1), produced by Yoshida et al. [ 23 ] by sequencing a pool of male and female individuals. The Hypsibius exemplaris genome was assembled using long PacBio and short Illumina reads, whereas the Ramazzottius varieornatus genome was assembled using a combination of Sanger and Illumina reads [ 23 ].

## Raw repetitive element library

To start the de novo characterization of TEs, we ran RepeatModeler on H. exemplaris and RepeatModeler2 on R. varieornatus [ 34 ] using the option -LTR_struct and obtained a library of raw consensus sequences for each of the genomes. RepeatModeler, rather than RepeatModeler2, was used on H. exemplaris because only RepeatModeler was available at the time of the first edition of the course in 2018. RepeatModeler and RepeatModeler2 automatically named the consensus sequences with the prefix “rnd”, which we replaced with the abbreviations of the species names: “hypDuj” for H. exemplaris and “ramVar” for R. varieornatus. Note that the abbreviation “hypDuj” was assigned prior to the scientific name change from H. dujardini to H. exemplaris. Despite this, we have chosen to retain “hypDuj” in the final repeat library for the sake of simplicity.

The two libraries were then compared to find similar sequences belonging either to the same family or subfamily by using, respectively, the 80-80-80 rule [ 35 ] and the 95-80-98 rule [ 36 ]. The rules were applied by masking the library of R. varieornatus with the library of H. exemplaris using RepeatMasker [ 37 ] and by parsing the resulting .out table with awk.
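The two similarity rules can be read as simple thresholds. The following Python sketch is only an illustration and not the awk command used in the study. It assumes the common reading of each rule as minimum percent identity, minimum aligned length in bp, and minimum percent of the consensus covered, and it assumes that each RepeatMasker hit has already been reduced to those three numbers (for example, identity taken as 100 minus the reported divergence). The names `Rule`, `passes`, and `classify_pair` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    min_identity: float   # percent identity of the alignment
    min_length: int       # aligned length in bp
    min_coverage: float   # percent of the consensus covered by the alignment


# Assumed reading of the two rules: identity - length - coverage.
FAMILY_RULE = Rule(80.0, 80, 80.0)      # "80-80-80"
SUBFAMILY_RULE = Rule(95.0, 80, 98.0)   # "95-80-98"


def passes(rule, identity, aligned_bp, consensus_len):
    """Return True if a single hit satisfies all three thresholds of `rule`."""
    coverage = 100.0 * aligned_bp / consensus_len
    return (identity >= rule.min_identity
            and aligned_bp >= rule.min_length
            and coverage >= rule.min_coverage)


def classify_pair(identity, aligned_bp, consensus_len):
    """Label a pair of consensus sequences from one pairwise hit."""
    if passes(SUBFAMILY_RULE, identity, aligned_bp, consensus_len):
        return "same subfamily"
    if passes(FAMILY_RULE, identity, aligned_bp, consensus_len):
        return "same family"
    return "different"
```

The stricter rule is tested first because every pair that satisfies 95-80-98 also satisfies 80-80-80.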

## Manual curation of the consensus sequences

After the generation of the libraries of raw consensus sequences, we proceeded with the collaborative peer-reviewed manual curation step. For example, in the second iteration of the course, the participants were split into ten groups and each group received about 80 consensus sequences to curate.

The curation of the raw consensus sequences followed a “Blast-Extend-Extract” process. The first step of the curation consisted of aligning the raw consensus sequences to the genome of origin using BLAST [ 38 ]. The best 20 BLASTN hits were selected, extended by 2 kb at both ends and aligned to their raw consensus sequence with MAFFT [ 39 ], which produced a multisequence alignment for each consensus sequence ready to be manually curated (script RMDL_curation_pipeline.pl, first published in [ 40 ]).
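The selection and extension steps can be sketched in a few lines of Python. This is a minimal illustration, not the RMDL_curation_pipeline.pl script. It assumes that "best" means highest bit score (the text does not state the ranking criterion), that hits are dictionaries with a hypothetical `bitscore` key, and that coordinates are 1-based and inclusive as in BLAST tabular output.

```python
def top_hits(hits, n=20):
    """Keep the n best hits, ranked here by bit score (an assumption)."""
    return sorted(hits, key=lambda h: h["bitscore"], reverse=True)[:n]


def extend_hit(start, end, contig_length, flank=2000):
    """Extend a hit by `flank` bp on both sides, clamped to the contig.

    BLAST reports minus-strand hits with start > end, so the
    coordinates are ordered before the flanks are added.
    """
    low, high = min(start, end), max(start, end)
    return max(1, low - flank), min(contig_length, high + flank)
```

Clamping matters for hits near contig ends, where a full 2 kb flank is not available and the extracted sequence is simply shorter.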

Each of the multisequence alignments was then inspected to: (1) find the actual boundaries of the repetitive element; (2) build a new consensus sequence with Advanced Consensus Maker ( https://hcv.lanl.gov/content/sequence/CONSENSUS/AdvConExplain.html ); (3) fix ambiguous base and gap calls in the new consensus sequence following the majority rule; (4) find sequence hallmarks to define the repetitive elements as transposable elements (e.g., target site duplication, long terminal repeats, terminal inverted repeats or other motifs). Every new consensus sequence was reported in a common Excel table (Table S1). To quantitatively measure the improvement of the repeat libraries after manual curation, we compared the length of consensus sequences before and after curation.
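The majority rule mentioned in step (3) can be illustrated with a short Python sketch. It is not the algorithm of Advanced Consensus Maker, only a plain per-column majority vote under simplifying assumptions: all aligned sequences have equal length, gaps are written as "-", a column is dropped when the gap is the most frequent character, and ties are resolved in favor of the character seen first in the column. The function name `majority_consensus` is hypothetical.

```python
from collections import Counter


def majority_consensus(aligned_seqs, gap="-"):
    """Build a consensus by taking the most frequent character per column."""
    if not aligned_seqs:
        raise ValueError("the alignment is empty")
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("aligned sequences must have the same length")

    consensus = []
    for column in zip(*aligned_seqs):
        counts = Counter(base.upper() for base in column)
        base, _ = counts.most_common(1)[0]
        if base != gap:          # columns dominated by gaps are skipped
            consensus.append(base)
    return "".join(consensus)
```

Comparing `len()` of the raw and the curated consensus then gives the length change used to measure the improvement after curation.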

In all the figures and tables, the term “curated” indicates that the library mentioned contains manually curated consensus sequences as well as all the consensus sequences that remained uncurated. Finally, we consider each consensus sequence as a proxy for a transposable element subfamily. However, the consensus sequences were not checked for redundancy and not clustered into families and subfamilies using the 80-80-80 or 95-80-98 rules for nomenclature because the focus of the study was on classifying the consensus sequences into superfamilies and orders of transposable elements.

The code used to produce the consensus sequences and their alignments is provided as a tutorial in the GitHub repository https://github.com/ValentinaPeona/TardigraTE .

## Classification

The new consensus sequences were classified using sequence characteristics retrieved by the alignments (e.g., target site duplications, terminal repeats) and homology information retrieved through masking the sequences with Censor [ 41 , 42 ] following the recommendations from [ 35 ] and [ 43 ]. When the information retrieved by the alignments and Censor was not enough to provide a reliable classification of the elements, the sequences were further analyzed for the presence of informative protein domains using the Conserved Domain Database [ 44 , 45 , 46 ].

Since the course participants in general had never curated transposable element alignments before, we decided to implement a peer-review process. For the first course (H. exemplaris), the results of each participant were sent to another participant, who checked the curated alignments and independently retrieved key information for the classification. The independent sequences and classifications were then compared and fixed if necessary. In the second course (R. varieornatus), all sequences were inspected by the same three reviewers, who applied the process described above.

## Comparative analysis of the repetitive content

The genome assemblies of both tardigrade species were masked with RepeatMasker 4.1.10 using four different types of TE libraries: (1) known Arthropoda consensus sequences from Repbase; (2) uncurated raw consensus sequences from the respective species; (3) curated consensus sequences together with the consensus sequences that were not curated from the respective species; (4) curated consensus sequences together with the consensus sequences that were curated from the other species. The RepeatMasker output files were then used to get the percentages of the genomes annotated as TEs and to visualize the landscapes of the accumulation of repeats.
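One way to derive the percentage of a genome annotated as TEs from such output is to merge overlapping annotations before summing their lengths, so that bases covered by two hits are counted once. The Python sketch below illustrates that idea only; it is not the procedure used in the study, and it assumes the annotations have already been parsed into hypothetical `(contig, start, end)` tuples with 1-based inclusive coordinates.

```python
def masked_percentage(intervals, genome_size):
    """Percent of the genome covered by at least one annotation."""
    by_contig = {}
    for contig, start, end in intervals:
        by_contig.setdefault(contig, []).append((start, end))

    covered = 0
    for spans in by_contig.values():
        spans.sort()
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end + 1:       # overlapping or adjacent: merge
                cur_end = max(cur_end, end)
            else:                          # gap: close the current block
                covered += cur_end - cur_start + 1
                cur_start, cur_end = start, end
        covered += cur_end - cur_start + 1
    return 100.0 * covered / genome_size
```

Without the merge step, nested or overlapping annotations would inflate the estimated repeat content.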

Finally, we estimated the number of putative active transposable elements in the two genomes by filtering the RepeatMasker annotation for elements that show at least 10 copies with 0% divergence from their consensus sequences.
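The filter described above amounts to counting, for each consensus sequence, the copies with 0% divergence and keeping those with at least 10. A minimal Python sketch follows, assuming the RepeatMasker annotation has been reduced to hypothetical `(element_name, percent_divergence)` pairs; the function name and the example names are illustrative only.

```python
from collections import Counter


def putatively_active(annotations, min_copies=10, max_divergence=0.0):
    """Return elements with at least `min_copies` copies at or below
    `max_divergence` percent divergence from their consensus."""
    identical_copies = Counter(
        name for name, divergence in annotations
        if divergence <= max_divergence
    )
    return sorted(name for name, count in identical_copies.items()
                  if count >= min_copies)
```

For example, an element with ten identical copies is retained, whereas one with nine identical and five diverged copies is not.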

## Characterization of Tc4 elements

During the manual curation process, participants found types of DNA transposons that are currently considered to have a rather restricted phylogenetic distribution, such as Tc4 elements; therefore, more in-depth analyses were run on these elements. The protein domains of known Tc elements were compared to the Tc4 consensus sequences from the tardigrade species and phylogenetic relationships were established.

Protein homologies of the partially curated repeat libraries were collected using BLASTX (e-value 1e-05) [ 47 ] against a database of TE-related proteins (RepeatPeps library) provided with the RepeatMasker installation. We extracted the amino acid translation of each hit on Tc4 elements based on the coordinates reported in the BLASTX output. The resulting protein sequences were aligned together with all members of the TcMar group present in the RepeatPeps library using MAFFT (L-INS-i mode) [ 48 ], and the alignment was manually inspected to identify and isolate the catalytic DDD domain. The resulting trimmed alignment was used for phylogenetic inference with IQ-TREE 2 [ 49 ], identifying the best-fit evolutionary model with ModelFinder2 and assessing nodal support with 1000 UltraFastBootstrap replicates [ 50 ]. The resulting maximum likelihood tree was mid-point rooted and the Tc4 subtree extracted for visualization purposes. The alignment with all members of the TcMar superfamily and the resulting phylogenetic tree can be found in Files S2 and S3, respectively.
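Extracting the translation of a hit from its coordinates can be sketched as follows. This Python example is an illustration under stated assumptions rather than the code used in the study: query coordinates are taken to be 1-based and inclusive on the nucleotide consensus, a hit on the reverse strand is assumed to be reported with start greater than end, the standard genetic code is used, and codons containing ambiguous bases are translated as "X". The function names are hypothetical.

```python
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W"
               "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR"
               "VVVVAAAADDEEGGGG")
# Standard genetic code, codons enumerated in T, C, A, G order.
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def translate_hit(consensus, qstart, qend):
    """Translate the region of `consensus` covered by one BLASTX hit."""
    if qstart <= qend:
        region = consensus[qstart - 1:qend]
    else:  # reverse-strand hit: coordinates are given in descending order
        region = reverse_complement(consensus[qend - 1:qstart])
    region = region.upper()
    usable = len(region) - len(region) % 3   # ignore a trailing partial codon
    return "".join(CODON_TABLE.get(region[i:i + 3], "X")
                   for i in range(0, usable, 3))
```

Because the hit coordinates already define the reading frame, no frame search is needed; the region is read codon by codon from its first base.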

The DDD segments of Tc4 elements were re-aligned using T-Coffee in expresso mode [ 51 ] to produce conservation scores. A sequence logo of the 5’ and 3’ boundaries of the identified Tc4 elements was produced by extracting all sequences used to curate the four R. varieornatus Tc4 elements and keeping the first 15 bp and 11 bp before and after the terminal inverted repeats (TIRs), respectively.

Participants ran REPET V3.0 [ 52 ] to produce a de novo transposable element library for R. varieornatus in parallel to the one generated by RepeatModeler2. A custom TE library composed of repeats from Repbase and from H. exemplaris was used to aid REPET in the classification process. Only consensus sequences that showed two or more full-length copies in the R. varieornatus genome were retained in the new library. Furthermore, the consensus sequences were scanned for protein domains and for the presence of TIRs or long terminal repeats (LTRs).

## Data availability

Data is provided within the manuscript or supplementary information files.

## Abbreviations

LTR: Long Terminal Repeats

TE: Transposable Element

TIR: Terminal Inverted Repeats

## References

Osmanski AB, Paulat NS, Korstian J, Grimshaw JR, Halsey M, Sullivan KAM et al. Insights into mammalian TE diversity through the curation of 248 genome assemblies. Science [Internet]. 2023;380:eabn1430. https://doi.org/10.1126/science.abn1430 .

Bao W, Kojima KK, Kohany O. Repbase Update, a database of repetitive elements in eukaryotic genomes. Mob DNA [Internet]. 2015;6:11. https://doi.org/10.1186/s13100-015-0041-9 .

Wicker T. The repetitive landscape of the chicken genome. Genome Res [Internet]. 2004;15:126–36. http://genome.cshlp.org/content/15/1/126.abstract .

Hillier LW, Miller W, Birney E, Warren W, Hardison RC, Ponting CP et al. Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution. Nature [Internet]. 2004;432:695–716. https://doi.org/10.1038/nature03154 .

Boman J, Frankl-Vilches C, da Silva dos Santos M, de Oliveira EHC, Gahr M, Suh A. The Genome of Blue-Capped Cordon-Bleu Uncovers Hidden Diversity of LTR Retrotransposons in Zebra Finch. Genes (Basel) [Internet]. 2019;10:301. https://www.mdpi.com/2073-4425/10/4/301 .

Kapusta A, Suh A, Feschotte C. Dynamics of genome size evolution in birds and mammals. Proc Natl Acad Sci U S A [Internet]. 2017;114:E1460–9. http://www.pnas.org/content/114/8/E1460.abstract .

Sproul J, Hotaling S, Heckenhauer J, Powell A, Marshall D, Larracuente AM et al. 600+ insect genomes reveal repetitive element dynamics and highlight biodiversity-scale repeat annotation challenges. Genome Res [Internet]. 2023; http://genome.cshlp.org/content/early/2023/09/22/gr.277387.122.abstract .

Platt RN, Blanco-Berdugo L, Ray DA. Accurate transposable element annotation is vital when analyzing new genome assemblies. Genome Biol Evol [Internet]. 2016;8:403–10. https://doi.org/10.1093/gbe/evw009 .

Peona V, Blom MPK, Xu L, Burri R, Sullivan S, Bunikis I et al. Identifying the causes and consequences of assembly gaps using a multiplatform genome assembly of a bird-of-paradise. Mol Ecol Resour [Internet]. 2021;21:263–86. https://doi.org/10.1111/1755-0998.13252 .

Flynn JM, Hubley R, Goubert C, Rosen J, Clark AG, Feschotte C et al. RepeatModeler2 for automated genomic discovery of transposable element families. Proceedings of the National Academy of Sciences [Internet]. 2020;117:9451–7. https://doi.org/10.1073/pnas.1921046117 .

Zeng L, Kortschak RD, Raison JM, Bertozzi T, Adelson DL. Superior ab initio identification, annotation and characterisation of TEs and segmental duplications from genome assemblies. PLoS One [Internet]. 2018;13:e0193588. https://doi.org/10.1371/journal.pone.0193588 .

Quesneville H, Bergman CM, Andrieu O, Autard D, Nouaud D, Ashburner M et al. Combined Evidence Annotation of Transposable Elements in Genome Sequences. PLoS Comput Biol [Internet]. 2005;1:e22. https://doi.org/10.1371/journal.pcbi.0010022 .

Goubert C, Craig RJ, Bilat AF, Peona V, Vogan AA, Protasio AV. A beginner’s guide to manual curation of transposable elements. Mob DNA [Internet]. 2022;13:7. https://doi.org/10.1186/s13100-021-00259-7 .

Storer JM, Hubley R, Rosen J, Smit AFA. Curation Guidelines for de novo Generated Transposable Element Families. Curr Protoc [Internet]. 2021;1:e154. https://doi.org/10.1002/cpz1.154 .

Elliott TA, Heitkam T, Hubley R, Quesneville H, Suh A, Wheeler TJ et al. TE Hub: A community-oriented space for sharing and connecting tools, data, resources, and methods for transposable element annotation. Mob DNA [Internet]. 2021;12:16. https://doi.org/10.1186/s13100-021-00244-0 .

Leung W, Shaffer CD, Chen EJ, Quisenberry TJ, Ko K, Braverman JM et al. Retrotransposons Are the Major Contributors to the Expansion of the Drosophila ananassae Muller F Element. G3 Genes|Genomes|Genetics [Internet]. 2017;7:2439–60. https://doi.org/10.1534/g3.117.040907 .

Moya ND, Stevens L, Miller IR, Sokol CE, Galindo JL, Bardas AD et al. Novel and improved Caenorhabditis briggsae gene models generated by community curation. BMC Genomics. 2023;24. https://link.springer.com/article/10.1186/s12864-023-09582-0 .

Chang WH, Mashouri P, Lozano AX, Johnstone B, Husić M, Olry A et al. Phenotate: crowdsourcing phenotype annotations as exercises in undergraduate classes. Genetics in Medicine [Internet]. 2020;22:1391–400. https://doi.org/10.1038/s41436-020-0812-7 .

Zhou N, Siegel ZD, Zarecor S, Lee N, Campbell DA, Andorf CM et al. Crowdsourcing image analysis for plant phenomics to generate ground truth data for machine learning. PLoS Comput Biol [Internet]. 2018;14:e1006337. https://doi.org/10.1371/journal.pcbi.1006337 .

Singh M, Bhartiya D, Maini J, Sharma M, Singh AR, Kadarkaraisamy S et al. The Zebrafish GenomeWiki: a crowdsourcing approach to connect the long tail for zebrafish gene annotation. Database [Internet]. 2014;2014:bau011. https://doi.org/10.1093/database/bau011 .

Prost S, Winter S, De Raad J, Coimbra RTF, Wolf M, Nilsson MA et al. Education in the genomics era: Generating high-quality genome assemblies in university courses. Gigascience [Internet]. 2020;9:giaa058. https://doi.org/10.1093/gigascience/giaa058 .

Prost S, Petersen M, Grethlein M, Hahn SJ, Kuschik-Maczollek N, Olesiuk ME et al. Improving the Chromosome-Level Genome Assembly of the Siamese Fighting Fish (Betta splendens) in a University Master’s Course. G3 Genes|Genomes|Genetics [Internet]. 2020;10:2179–83. https://doi.org/10.1534/g3.120.401205 .

Yoshida Y, Koutsovoulos G, Laetsch DR, Stevens L, Kumar S, Horikawa DD et al. Comparative genomics of the tardigrades Hypsibius dujardini and Ramazzottius varieornatus. Tyler-Smith C, editor. PLoS Biol [Internet]. 2017;15:e2002266. https://doi.org/10.1371/journal.pbio.2002266 .

Møbjerg N, Halberg KA, Jørgensen A, Persson D, Bjørn M, Ramløv H et al. Survival in extreme environments – on the current knowledge of adaptations in tardigrades. Acta Physiologica [Internet]. 2011;202:409–20. https://doi.org/10.1111/j.1748-1716.2011.02252.x .

Peter D, Bertolani R, Guidetti R. Actual checklist of Tardigrada species. 2019.

Yuan JY, Finney M, Tsung N, Horvitz HR. Tc4, a Caenorhabditis elegans transposable element with an unusual fold-back structure. Proceedings of the National Academy of Sciences. 1991;88:3334–8.

Giribet G, Edgecombe GD. Current Understanding of Ecdysozoa and its Internal Phylogenetic Relationships. Integr Comp Biol [Internet]. 2017;57:455–66. https://doi.org/10.1093/icb/icx072 .

Peona V, Kutschera VE, Blom MPK, Irestedt M, Suh A. Satellite DNA evolution in Corvoidea inferred from short and long reads. Mol Ecol [Internet]. 2022;0–64. https://doi.org/10.1111/mec.16484 .

Baril T, Galbraith J, Hayward A. Earl Grey: a fully automated user-friendly transposable element annotation and analysis pipeline. Mol Biol Evol [Internet]. 2024;41:msae068. https://academic.oup.com/mbe/article/41/4/msae068/7635926 .

Panta M, Mishra A, Hoque MT, Atallah J. ClassifyTE: a stacking-based prediction of hierarchical classification of transposable elements. Bioinformatics [Internet]. 2021;37:2529–36. https://doi.org/10.1093/bioinformatics/btab146 .

Orozco-Arias S, Lopez-Murillo LH, Piña JS, Valencia-Castrillon E, Tabares-Soto R, Castillo-Ossa L et al. Genomic object detection: An improved approach for transposable elements detection and classification using convolutional neural networks. PLoS One [Internet]. 2023;18:e0291925. https://doi.org/10.1371/journal.pone.0291925 .

Bickmann L, Rodriguez M, Jiang X, Makalowski W. TEclass2: Classification of transposable elements using Transformers. bioRxiv [Internet]. 2023;2023.10.13.562246. http://biorxiv.org/content/early/2023/10/16/2023.10.13.562246.abstract .

Orozco-Arias S, Isaza G, Guyot R, Tabares-Soto R. A systematic review of the application of machine learning in the detection and classification of transposable elements. Nakai K, editor. PeerJ [Internet]. 2019;7:e8311. https://doi.org/10.7717/peerj.8311 .

Flynn JM, Hubley R, Goubert C, Rosen J, Clark AG, Feschotte C, et al. RepeatModeler2 for automated genomic discovery of transposable element families. Proc Natl Acad Sci U S A. 2020;117:9451–7.


Wicker T, Sabot F, Hua-Van A, Bennetzen JL, Capy P, Chalhoub B, et al. A unified classification system for eukaryotic transposable elements. Nat Rev Genet. 2007;8:973–82.

Flutre T, Duprat E, Feuillet C, Quesneville H. Considering transposable element diversification in De Novo Annotation Approaches. PLoS ONE. 2011;6:e16526.

Smit AFA, Hubley R, Green P. RepeatMasker Open-4.0 [Internet]. 2015. http://www.repeatmasker.org .

Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, et al. BLAST+: Architecture and applications. BMC Bioinformatics. 2009;10:421.


Katoh K, Rozewicki J, Yamada KD. MAFFT online service: multiple sequence alignment, interactive sequence choice and visualization. Brief Bioinform. 2018;20:1160–6.

Suh A, Smeds L, Ellegren H. Abundant recent activity of retrovirus-like retrotransposons within and among flycatcher species implies a rich source of structural variation in songbird genomes. Mol Ecol [Internet]. 2018;27:99–111. https://doi.org/10.1111/mec.14439 .

Kapitonov VV, Jurka J. A universal classification of eukaryotic transposable elements implemented in Repbase. Nat Rev Genet. 2008;9:411–2.

Kohany O, Gentles AJ, Hankus L, Jurka J. Annotation, submission and screening of repetitive elements in repbase: RepbaseSubmitter and Censor. BMC Bioinformatics. 2006;7:474.

Feschotte C, Pritham EJ. DNA transposons and the evolution of eukaryotic genomes. Annu Rev Genet. 2007;41:331–68.

Marchler-Bauer A, Lu S, Anderson JB, Chitsaz F, Derbyshire MK, DeWeese-Scott C, et al. CDD: a conserved domain database for the functional annotation of proteins. Nucleic Acids Res. 2011;39:D225–9.

Marchler-Bauer A, Bryant SH. CD-Search: protein domain annotations on the fly. Nucleic Acids Res. 2004;32:W327–31.

Lu S, Wang J, Chitsaz F, Derbyshire MK, Geer RC, Gonzales NR, et al. CDD/SPARCLE: the conserved domain database in 2020. Nucleic Acids Res. 2020;48:D265–8.

Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K et al. BLAST+: Architecture and applications. BMC Bioinformatics [Internet]. 2009;10:421. https://doi.org/10.1186/1471-2105-10-421 .

Katoh K, Rozewicki J, Yamada KD. MAFFT online service: Multiple sequence alignment, interactive sequence choice and visualization. Brief Bioinform [Internet]. 2018;20:1160–6. https://doi.org/10.1093/bib/bbx108 .

Minh BQ, Schmidt HA, Chernomor O, Schrempf D, Woodhams MD, Von Haeseler A, et al. IQ-TREE 2: new models and efficient methods for phylogenetic inference in the genomic era. Mol Biol Evol. 2020;37:1530–4.

Hoang DT, Chernomor O, Von Haeseler A, Minh BQ, Vinh LS. UFBoot2: improving the ultrafast bootstrap approximation. Mol Biol Evol. 2018;35:518–22.

Notredame C, Higgins DG, Heringa J. T-Coffee: a novel method for fast and accurate multiple sequence alignment. J Mol Biol. 2000;302:205–17.

Flutre T, Duprat E, Feuillet C, Quesneville H. Considering Transposable Element Diversification in De Novo Annotation Approaches. PLoS One [Internet]. 2011;6:e16526. https://doi.org/10.1371/journal.pone.0016526 .

## Acknowledgements

Parts of the analyses were performed on resources provided by the Swedish National Infrastructure for Computing (SNIC) through Uppsala Multidisciplinary Center for Advanced Computational Science (UPPMAX), National Academic Infrastructure for Supercomputing in Sweden (NAISS), partially funded by the Swedish Research Council through grant agreement no. 2022-06725 and CSC-IT Finland. We thank the three anonymous reviewers and Irina Arkhipova for their useful and detailed comments on the manuscript.

## Funding

Open access funding provided by Uppsala University. This study was supported by grants from the Swedish Research Council Vetenskapsrådet (2020-04436 to AS; 2022-06195 to VP), the Swedish Research Council Formas (2017-01597 to AS), the Canziani bequest and the ‘Ricerca Fondamentale Orientata’ (RFO) funding from the University of Bologna to JM.


## Author information

Valentina Peona and Jacopo Martelossi contributed equally to this work.

## Authors and Affiliations

Department of Organismal Biology – Systematic Biology, Evolutionary Biology Centre, Uppsala University, Uppsala, SE-752 36, Sweden

Valentina Peona, Yifan Pei & Alexander Suh

Swiss Ornithological Institute Vogelwarte, Sempach, CH-6204, Switzerland

Valentina Peona

Department of Bioinformatics and Genetics, Swedish Natural History Museum, Stockholm, Sweden

Department of Biological Geological and Environmental Science, University of Bologna, Via Selmi 3, Bologna, 40126, Italy

Jacopo Martelossi

New York University Abu Dhabi, Saadiyat Island, United Arab Emirates

Dareen Almojil

Skolkovo Institute of Science and Technology, Moscow, Russia

Julia Bocharkina

Natural History Museum, Oslo University, Oslo, Norway

Ioana Brännström

Department of Ecology and Genetics, Uppsala University, Uppsala, Sweden

Anglia Ruskin University, East Rd, Cambridge, CB1 1PT, UK

University of Arizona, Tucson, AZ, USA

Evolutionary Genetics Department, Leibniz Institute for Zoo and Wildlife Research, 10315, Berlin, Germany

Tomàs Carrasco-Valenzuela

Berlin Center for Genomics in Biodiversity Research, 14195, Berlin, Germany

Reed College, Portland, OR, United States of America

Jon DeVries

Department of Ecology and Evolution, The University of Chicago, Chicago, IL, 60637, USA

Meredith Doellman

Department of Biological Sciences, University of Notre Dame, Notre Dame, IN, 46556, USA

Evolutionary Biology & Ecology, University of Freiburg, Freiburg, Germany

Daniel Elsner

Research Unit Comparative Microbiome Analysis (COMI), Helmholtz Zentrum München, Ingolstädter Landstraße 1, D-85764, Neuherberg, Germany

Pamela Espíndola-Hernández

Royal Botanic Gardens, Kew, Richmond, Surrey, TW9 3AE, UK

Guillermo Friis Montoya

Institute of Evolution and Ecology, University of Tuebingen, Tuebingen, Germany

Bence Gaspar

Institute of Botany, Czech Academy of Sciences, Průhonice, Czech Republic

Danijela Zagorski

Institute of Evolutionary Biology, Faculty of Biology, Biological and Chemical Research Centre, University of Warsaw, Warsaw, Poland

Paweł Hałakuc

Institute of Genetics and Biotechnology, Hungarian University of Agriculture and Life Sciences, Budapest, Hungary

Beti Ivanovska

The Natural History Museum, Cromwell Road, London, SW6 7SJ, UK

Christopher Laumer

Biological and Environmental Science and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia

Robert Lehmann

LOEWE Centre for Translational Biodiversity Genomics (LOEWE-TBG), Senckenberganlage 25, 60325, Frankfurt, Germany

Ljudevit Luka Boštjančić & Maria Anna Nilsson

Department of Genetics, Environment & Evolution, Centre for Biodiversity & Environment Research, University College London, London, UK

Rahia Mashoodh

Department of Ecology, Faculty of Science, Charles University, Prague, Czech Republic

Sofia Mazzoleni

INBIOS-Conservation Genetic Lab, University of Liege, Liege, Belgium

Alice Mouton

Centre for Molecular Biodiversity Research, Leibniz Institute for the Analysis of Biodiversity Change, Adenauerallee 127, 53113, Bonn, Germany

Department of Systematic and Evolutionary Botany, University of Zurich, Zurich, Switzerland

Giacomo Potente

German Cancer Research Center, NGS Core Facility, DKFZ-ZMBH Alliance, 69120, Heidelberg, Germany

Panagiotis Provataris

Departamento de Biodiversidad y Biología Evolutiva, Museo Nacional de Ciencias Naturales (MNCN-CSIC), José Gutiérrez Abascal 2, Madrid, 28006, Spain

José Ramón Pardos-Blas

Department of Biotechnology, National Institute of Technology Durgapur, Durgapur, India

Ravindra Raut

Molecular Ecology Group (MEG), National Research Council of Italy – Water Research Institute (CNR-IRSA), Verbania, Italy

Tomasa Sbaffi

Eurofins Genomics Europe Pharma and Diagnostics Products & Services Sales GmbH, Ebersberg, Germany

Florian Schwarz

Plant Pathology Group, Institute of Integrative Biology, ETH Zurich, Zurich, Switzerland

Jessica Stapley

Tree of Life, Wellcome Sanger Institute, Cambridge, CB10 1SA, UK

Lewis Stevens

Department of Botany, Jagannath Univerity, Dhaka, 1100, Bangladesh

Nusrat Sultana

Institute of Hydrobiology, Biology Centre of the Czech Academy of Sciences, České Budějovice, Czech Republic

Department of Biological and Environmental Science, University of Jyväskylä, P.O. Box 35, Jyväskylä, 40014, Finland

Centogene GmbH, Am Strande 7, 18055, Rostock, Germany

Department of Ecology & Evolutionary Biology, University of California, Los Angeles, Los Angeles, CA, United States of America

Zell- und Molekularbiologie der Pflanzen, Technische Universität Dresden, Dresden, Germany

Abdullah Yusuf

Physalia-courses, 10249, Berlin, Germany

Carlo Pecoraro

School of Biological Sciences, University of East Anglia, Norwich Research Park, Norwich, NR4 7TU, UK

Alexander Suh

Present address: Centre for Molecular Biodiversity Research, Leibniz Institute for the Analysis of Biodiversity Change, Adenauerallee 160, 53113, Bonn, Germany


## Contributions

AS conceived the project and VP contributed to its development. VP and JM analyzed the final data. AS, VP, JM wrote the manuscript, and all authors revised the manuscript. MST, AM, DA, JS, GP provided additional contributions to the teaching material. VP, JM, DA, JB, IB, MB, AC, TCV, JDV, MD, DE, PEH, GFM, BG, DZ, PH, BI, CL, RL, LLB, RM, SM, AM, MAN, YP, GP, PP, JRPB, RR, TS, FS, JS, LS, NS, RS, MST, AU, HY, AY, AS contributed to the curation of the repeat library. CP provided and maintained the computational infrastructure during the courses. Valentina Peona and Jacopo Martelossi contributed equally to this work. All course participants are listed in alphabetical order.

## Corresponding authors

Correspondence to Valentina Peona , Jacopo Martelossi or Alexander Suh .

## Ethics declarations

## Ethics approval and consent to participate

Not applicable.

## Consent for publication

## Competing interests

Carlo Pecoraro is founder of Physalia-courses (http://www.physalia-courses.org/) but had no role in the design of the study.



Peona, V., Martelossi, J., Almojil, D. et al. Teaching transposon classification as a means to crowd source the curation of repeat annotation – a tardigrade perspective. Mobile DNA 15 , 10 (2024). https://doi.org/10.1186/s13100-024-00319-8

Accepted : 09 April 2024

Published : 06 May 2024

DOI : https://doi.org/10.1186/s13100-024-00319-8


• Transposable elements
• Manual curation
• Non-model organism
• Genome assembly

ISSN: 1759-8753

• Open access
• Published: 17 May 2024

## Comparative 3D genome analysis between neural retina and retinal pigment epithelium reveals differential cis-regulatory interactions at retinal disease loci

• Eva D’haene   ORCID: orcid.org/0000-0002-5936-8294 1 , 2   na1 ,
• Víctor López-Soriano   ORCID: orcid.org/0009-0003-4621-7547 1 , 2   na1 ,
• Pedro Manuel Martínez-García   ORCID: orcid.org/0000-0002-2119-2465 3   na1 ,
• Soraya Kalayanamontri   ORCID: orcid.org/0009-0005-7028-8067 3 ,
• Alfredo Dueñas Rey   ORCID: orcid.org/0000-0002-4456-6159 1 , 2 ,
• Ana Sousa-Ortega   ORCID: orcid.org/0000-0001-6457-5397 3 ,
• Silvia Naranjo   ORCID: orcid.org/0000-0002-4529-3332 3 ,
• Stijn Van de Sompele   ORCID: orcid.org/0000-0002-3294-0668 1 , 2 ,
• Lies Vantomme 1 , 2 ,
• Quinten Mahieu 1 , 2 ,
• Sarah Vergult   ORCID: orcid.org/0000-0002-0816-6262 1 , 2 ,
• Ana Neto   ORCID: orcid.org/0000-0002-4670-034X 3 ,
• José Luis Gómez-Skarmeta   ORCID: orcid.org/0000-0001-5125-4332 3   na3 ,
• Juan Ramón Martínez-Morales   ORCID: orcid.org/0000-0002-4650-4293 3   na2 ,
• Miriam Bauwens   ORCID: orcid.org/0000-0003-0402-9006 1 , 2   na2 ,
• Juan Jesús Tena   ORCID: orcid.org/0000-0001-8165-7984 3   na2 &
• Elfride De Baere   ORCID: orcid.org/0000-0002-5609-6895 1 , 2   na2

Genome Biology volume 25, Article number: 123 (2024)


Vision depends on the interplay between photoreceptor cells of the neural retina and the underlying retinal pigment epithelium (RPE). Most genes involved in inherited retinal diseases display specific spatiotemporal expression within these interconnected retinal components through the local recruitment of cis-regulatory elements (CREs) in 3D nuclear space.

To understand the role of differential chromatin architecture in establishing tissue-specific expression at inherited retinal disease loci, we mapped genome-wide chromatin interactions using in situ Hi-C and H3K4me3 HiChIP on neural retina and RPE/choroid from human adult donor eyes. We observed chromatin looping between active promoters and 32,425 and 8060 candidate CREs in the neural retina and RPE/choroid, respectively. A comparative 3D genome analysis between these two retinal tissues revealed that 56% of 290 known inherited retinal disease genes were marked by differential chromatin interactions. One of these was ABCA4, which is implicated in the most common autosomal recessive inherited retinal disease. We zoomed in on retina- and RPE-specific cis-regulatory interactions at the ABCA4 locus using high-resolution UMI-4C. Integration with bulk and single-cell epigenomic datasets and in vivo enhancer assays in zebrafish revealed tissue-specific CREs interacting with ABCA4.

## Conclusions

Through comparative 3D genome mapping, based on genome-wide, promoter-centric, and locus-specific assays of human neural retina and RPE, we have shown that gene regulation at key inherited retinal disease loci is likely mediated by tissue-specific chromatin interactions. These findings not only provide insight into tissue-specific regulatory landscapes at retinal disease loci, but also delineate the search space for non-coding genomic variation underlying unsolved inherited retinal diseases.

## Background

The human retina, the light-sensitive layer of the eye that transmits visual information to the brain, is a highly organized tissue, consisting of a multi-layered neural retina intimately associated with a single layer of retinal pigment epithelium (RPE) and bordered by the choroid, the vascular layer containing blood vessels and connective tissue. Although it is the neural retina that contains the light-sensitive photoreceptor cells, the neural retina as well as the RPE are commonly affected in retinal disease, as the latter plays a crucial role in photoreceptor maintenance and survival [ 1 , 2 ]. Despite the interconnectedness between these retinal components, they are phenotypically, functionally, and molecularly highly distinct. To illustrate the latter, most known retinal disease genes display a cell-type-specific expression pattern, with large groups being specifically expressed in either photoreceptors or the RPE [ 3 ].

This type of tissue- or cell-type-specific gene expression is achieved through a tight transcriptional control via thousands of cis -regulatory elements (CREs) [ 4 , 5 ]. Integrated epigenomic analyses have revealed over 50,000 candidate CREs (cCREs) active in the human adult neural retina or RPE, with the majority displaying tissue-specific accessibility [ 4 ]. Yet, until recently, linking these cCREs to their true retinal target genes was hampered by the lack of relevant tissue-specific chromatin interaction data. Indeed, spatiotemporal communication between CREs and target promoters relies on a chromatin looping mechanism, ensuring close physical proximity in the three-dimensional (3D) nuclear space [ 6 , 7 ]. These 3D chromatin interactions are mostly constrained within self-interacting domains, called topologically associating domains (TADs), which are flanked by insulating boundaries enriched for CTCF binding [ 8 ]. Although TADs are thought to be largely conserved across cell lines and tissues [ 8 , 9 ], there have been examples of cell-type specific 3D structures within complex tissues such as the brain [ 10 , 11 ]. Although a 3D genome map of the human neural retina recently increased our insight into the genetic control of tissue-specific functions [ 12 ], 3D genome structure in the RPE/choroid has not been mapped before, nor has it been explored whether differential chromatin interactions exist within the different components of the retina.

Genetic variation disrupting active CREs and/or 3D genome architecture has been reported in inherited retinal disease (IRD), a group of disorders leading to vision impairment and affecting 2 million people worldwide [ 13 , 14 ]. For instance, duplications within the PRDM13 and IRX1 loci, altering enhancer regions, have been associated with North Carolina Macular Dystrophy (NCMD) (MIM #136550 and MIM #608850), a retinal enhanceropathy affecting macular development [ 15 ]. Structural variants spanning YPEL2 , associated with retinitis pigmentosa 17 (RP17) (MIM #600852), have been shown to induce the formation of new TADs (neo-TADs), resulting in ectopic expression of GDPD1 in photoreceptor cells [ 16 ]. So far only a handful of non-coding sequence variants with a regulatory effect have been reported in IRD, as exemplified by single nucleotide variants (SNVs) in two hotspot regions near PRDM13 [ 15 ]. Yet, the highest number of non-coding sequence variants reported in IRD were identified within the ABCA4 locus, implicated in ABCA4 -associated IRD ( ABCA4 -IRD, MIM #248200) [ 17 , 18 ]. Although most of these non-coding variants influence cis -acting splicing [ 17 , 19 ], functional CREs within the ABCA4 locus may represent targets for hidden genetic variation in ABCA4 -IRD.

The annotation of functional CREs remains challenging, however, considering the tissue and cell-type specificity of gene regulatory mechanisms. Combining chromatin interaction profiling using C-technologies (e.g., Hi-C, 4C) with epigenomic chromatin signatures generated on relevant human tissues represents a powerful approach to identify cCREs that can be associated with a target gene [ 20 ]. Given the increased implementation of whole genome sequencing in genetic testing protocols of rare diseases including IRD [ 21 , 22 ], prioritizing and identifying key functional regions without coding potential could aid in pinpointing and interpreting overlooked variation associated with disease [ 23 ].

Considering the tissue-specificity of gene expression [ 3 ] and chromatin accessibility [ 4 ] in the two major components of the human retina, we aimed to understand the role of differential 3D chromatin interactions in establishing tissue-specific expression patterns at IRD loci in the human neural retina and the RPE. We therefore generated genome-wide chromatin interaction maps by applying in situ Hi-C [ 9 ] and H3K4me3 HiChIP [ 24 ] to the neural retina and RPE/choroid from human adult post-mortem donor eyes and performed a comparative 3D genome analysis between these two retinal tissues. We focused in particular on the impact of tissue-specific chromatin interactions at IRD loci and investigated this in depth for the ABCA4 gene, implicated in the most common autosomal recessive IRD and expressed in both retinal components [ 3 , 4 , 25 ]. Using high-resolution targeted assays (UMI-4C [ 26 ]), (single-cell) epigenomic data integration, and in vivo enhancer assays, we characterized tissue-specific ABCA4 CREs.

## Comparative 3D genome analysis between the neural retina and RPE/choroid reveals differential interactions

As many known retinal disease genes are expressed within specific components and cell types within the human retina [ 3 ], we wanted to explore the role of tissue-specific 3D genomic structures or interactions in establishing these expression patterns in the neural retina and RPE/choroid. We used in situ Hi-C on post-mortem human donor retina to map 3D genomic interactions in the adult neural retina ( n  = 4, four eyes from three donors), as well as the RPE/choroid layer ( n  = 4, four eyes from three donors) (Fig.  1 a). A total of 1.13 billion and 1.34 billion pairwise genomic contacts could be identified in the neural retina and RPE/choroid, respectively.

Comparative Hi-C analysis between human neural retina and RPE/choroid. a Generation of tissue-specific 3D contact matrices using in situ Hi-C on adult human donor neural retina and RPE/choroid samples ( n  = 4) and strategy for comparative 3D genome analysis. b Results of CHESS comparative analysis between the neural retina and RPE/choroid Hi-C contact matrices (z-ssim similarity scores obtained for chromosome 1 using 1-Mb window sizes, z-ssim <  − 1.2, signal/noise (SN) > 2). c Enrichment of retina-enriched genes from the EyeGEx database and RetNet IRD genes compared to Ensembl genes within CHESS differential regions (Fisher’s exact test, p  = 0.000273 and p  = 0.000658 respectively). d Clustered heatmap of genes within CHESS differential windows using GTEx tissue expression data. e Overlap between genes at (differential) Hi-C loop anchors identified in neural retina and RPE/choroid and EyeGEx retina-enriched genes and RetNet IRD genes. f Enrichment of RetNet IRD genes, retina-specific IRD genes, RPE/choroid-specific IRD genes, and retina-enriched genes from the EyeGEx database compared to Ensembl genes at Hi-C loops in neural retina (Fisher’s exact test, p  = 4.312e − 13, p  = 4.875e − 12, p  = 0.2024 and p  = 0.0001 respectively), differential Hi-C loops in the neural retina (Fisher’s exact test, p  = 1.149e − 05, p  = 1.254e − 06, p  = 0.7515 and p  = 3.826e − 13 respectively) and Hi-C loops in RPE/choroid (Fisher’s exact test, p  = 0.4705, p  = 0.2559, p  = 0.0991 and p  = 0.8237 respectively). g Single-cell RNA expression within adult human retina of clusters of genes identified at differential loops in neural retina and RPE/choroid. The figure in panel a was partly created using BioRender

These retinal Hi-C maps were subsequently used to calculate genome-wide diamond insulation scores and determine tissue-specific insulating TAD boundaries (Additional file 1 : Fig. S1a–c, f–h). We identified 3905 and 3785 boundaries in the neural retina and RPE/choroid respectively, with 60–62% of them overlapping or adjacent in both tissues (Additional file 1 : Fig. S1k, Additional files  2  and 3 ). As expected, these boundaries were enriched for CTCF binding and displayed a convergent orientation bias for CTCF motifs (Additional file 1 : Fig. S1d–e, i–j).
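
As an illustration of the insulation-score idea, the following minimal Python sketch computes a simplified score on a toy contact matrix and reports strict local minima as candidate boundaries. It is not the pipeline used in this study: the function names, the window size, and the toy matrix are hypothetical, and real implementations typically also log-normalize the scores and filter boundaries by strength.

```python
def insulation_scores(matrix, window):
    """Mean contact count in the square window linking the `window` bins
    upstream of each bin boundary to the `window` bins downstream of it.

    scores[i] describes the boundary between bin i-1 and bin i; positions
    too close to the matrix edge are left as None.
    """
    n = len(matrix)
    scores = [None] * n
    for i in range(window, n - window + 1):
        total = 0.0
        for a in range(i - window, i):        # upstream bins
            for b in range(i, i + window):    # downstream bins
                total += matrix[a][b]
        scores[i] = total / (window * window)
    return scores


def call_boundaries(scores):
    """Indices whose score is a strict local minimum (toy boundary caller)."""
    boundaries = []
    for i in range(1, len(scores) - 1):
        left, mid, right = scores[i - 1], scores[i], scores[i + 1]
        if None in (left, mid, right):
            continue
        if mid < left and mid < right:
            boundaries.append(i)
    return boundaries


# Hypothetical 8-bin matrix: two 4-bin domains with 10 contacts inside a
# domain and 1 contact between domains.
toy = [[10 if (a < 4) == (b < 4) else 1 for b in range(8)] for a in range(8)]
scores = insulation_scores(toy, window=2)
boundaries = call_boundaries(scores)
```

In this toy case the score drops where the two domains meet, which is the signature that boundary callers look for.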

Next, we performed a comparative analysis of neural retina vs. RPE/choroid 3D genomes (Fig.  1 a). First, we applied the feature-independent CHESS algorithm [ 27 ] with both 1 Mb and 500 kb sliding windows to scan the whole genome for quantitative contact differences within the neural retina and RPE/choroid Hi-C maps (Fig.  1 b, Additional file 1 : Fig. S2–3). Upon merging and reducing overlapping differential windows, we delineated 476 genomic regions displaying differential chromatin interactions (Additional file 4 ). We identified 2034 protein-coding genes within these differential loci and found, despite the large window sizes used for CHESS analysis, that these were significantly enriched for genes with a highly specific expression in the retina (44/242 retina-enriched genes from the EyeGEx database compared to other GTEx tissues, Fisher’s exact test, p  = 0.000273) and known IRD disease genes (49/290 RetNet genes, Fisher’s exact test, p  = 0.000658) (Fig.  1 c). Also, by analyzing GTEx RNA expression data for genes within differential regions, we identified a subcluster of 296 genes with highly specific expression in the retina and associated with functions such as “visual perception” (Fig.  1 d and Additional file 1 : Fig. S4).
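
The enrichment tests reported here can be sketched as a hypergeometric tail probability. The Python example below is a simplified, one-sided version of Fisher's exact test written with the standard library only; the text does not state whether the reported tests were one- or two-sided, and the size of the background gene set is not given, so `ASSUMED_BACKGROUND` is a placeholder rather than a value from this study. The function name is likewise hypothetical.

```python
from math import comb


def enrichment_p_value(n_total, n_category, n_selected, n_overlap):
    """One-sided Fisher's exact test (upper hypergeometric tail).

    n_total    -- size of the background gene set
    n_category -- background genes belonging to the category of interest
    n_selected -- genes located in the regions of interest
    n_overlap  -- selected genes that belong to the category
    Returns P(overlap >= n_overlap) under random sampling without replacement.
    """
    upper = min(n_category, n_selected)
    numerator = sum(
        comb(n_category, k) * comb(n_total - n_category, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return numerator / comb(n_total, n_selected)


# Small worked case: 10 genes, 4 in the category, 5 selected, 3 overlapping.
toy_p = enrichment_p_value(10, 4, 5, 3)

# Counts from the text (44 of 242 retina-enriched genes among 2034 genes in
# differential regions). The background size is an ASSUMPTION for illustration.
ASSUMED_BACKGROUND = 20_000
p_retina_enriched = enrichment_p_value(ASSUMED_BACKGROUND, 242, 2034, 44)
```

Because the background size and sidedness are assumptions, the value computed here should not be expected to reproduce the reported p-values.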

As a second approach to determine tissue-specific interactions, we used the retinal Hi-C maps to determine (differential) chromatin looping in neural retina vs. RPE/choroid. Using HICCUPS [ 9 ] loop calling, 6884 and 2902 chromatin loops were identified in the neural retina and RPE/choroid, respectively (Additional files 5 and 6 ). Of the neural retina loops, 60% (4081/6884) correspond to loops previously identified in the same tissue by Marchal et al. [ 12 ]. Differential loop calling between the neural retina and RPE/choroid resulted in 1292 differential loops, of which 1149 were gained in neural retina and 143 in RPE/choroid (Additional files 7 and 8 ). We identified all genes with transcription start sites (TSSs) within 2 kb of (differential) loop anchors and found an enrichment of retina-enriched genes and known IRD genes at loops in the neural retina (69/242 retina-enriched genes, Fisher’s exact test, p  = 2.097e − 06 and 97/290 RetNet genes, Fisher’s exact test, p  = 4.312e − 13), and at differential loops gained in the neural retina (27/69 retina-enriched genes at retinal loops, Fisher’s exact test, p  = 0.0001444 and 37/97 RetNet genes at retinal loops, Fisher’s exact test, p  = 1.149e − 05) (Fig.  1 e–f). Next, we evaluated whether IRD genes with specific expression in cell types of the neural retina or RPE/choroid would be more strongly associated with tissue-specific loops. Using scRNA-seq data from adult human retina [ 3 ] and by scaling gene expression across all identified cell types (Methods), 74/290 IRD genes (RetNet) were identified as having enriched expression in at least one cell type within the RPE/choroid ( Z -score > 2), while 239/290 IRD genes showed enriched expression in at least one cell type of the neural retina ( Z -score > 2) (Additional file 1 : Fig. S5, Additional file 9 : Table S1).
We found that retina-specific IRD genes were strongly enriched at (differential) Hi-C loops in the neural retina, while RPE/choroid-specific IRD genes were not enriched at these retinal loops (Fig.  1 f). Similarly, at Hi-C loops identified in the RPE/choroid, we observed a 1.7-fold enrichment of only RPE/choroid-specific IRD genes, although this was not significant due to the small number of RPE/choroid loops and therefore small gene sets (Fig.  1 f). Gene Ontology enrichment analysis also indicated an involvement of genes associated with the visual system in chromatin looping in the neural retina (Additional file 1 : Fig. S6), while enriched terms for genes contacted by RPE/choroid loops included epithelium-associated processes (Additional file 1 : Fig. S6). Genes contacted by differential loops in the neural retina showed increased expression in the retina compared to other tissues in the GTEx dataset, while genes at RPE/choroid-specific loops were markedly downregulated in the retina (Additional file 1 : Fig. S7–9). Clustering based on tissue-specific expression and subsequent analysis of retinal scRNA-seq data revealed that subsets of genes at differential chromatin loops displayed specific expression in the most abundant cell types of either the neural retina (photoreceptors) or the RPE/choroid (RPE, fibroblasts, endothelial and immune cells) (Fig.  1 g and Additional file 1 : Fig. S8, 9).
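
The cell-type enrichment criterion ( Z -score > 2 after scaling expression across cell types) can be sketched as follows. This is a minimal Python illustration, not the procedure described in the Methods: the function name and the expression values are hypothetical, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev


def enriched_cell_types(expression, threshold=2.0):
    """Return the cell types in which one gene has a Z-score above `threshold`.

    expression -- dict mapping cell type to the gene's average expression.
    Z-scores are computed across cell types using the population standard
    deviation (an assumption; a sample standard deviation is equally plausible).
    """
    values = list(expression.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # flat expression profile: no cell type stands out
    return [ct for ct, v in expression.items() if (v - mu) / sigma > threshold]


# Hypothetical expression profile of a rod-specific gene across six cell types.
rod_gene = {
    "rod": 12.0,
    "L/M cone": 1.0,
    "RPE": 1.0,
    "fibroblast": 1.0,
    "endothelial": 1.0,
    "Muller cell": 1.0,
}
hits = enriched_cell_types(rod_gene)
```

Note that with n cell types the largest attainable Z-score under this definition is the square root of n − 1, so a threshold of 2 is only meaningful when at least six cell types are compared.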

Taken together, the results from our comparative Hi-C analysis suggest that tissue-specific 3D interactions exist within the adult human retina and could contribute to tissue-specific regulation of genes, including known IRD genes and genes specifically expressed in the retina.

## Mapping cis -regulatory retinal landscapes at high resolution using HiChIP

While Hi-C interaction maps provided a genome-wide view of 3D genome architecture in the human retina and RPE/choroid, the sensitivity to identify chromatin loops at high resolution was limited. To identify cis -regulatory interactions involving active promoters at a higher resolution and with greater sensitivity, we performed HiChIP [ 24 ] for H3K4me3 in both human adult neural retina ( n  = 2, two eyes from one donor) and RPE/choroid ( n  = 2, two eyes from two donors). Visual inspection of HiChIP contact matrices at 5 kb resolution revealed promoter-centered interactions in the form of discrete lines that delineate regulatory landscapes of active genes and were not detectable in the Hi-C heatmaps (Additional file 1 : Fig. S10a). Moreover, our HiChIP-derived ChIP-seq signals recapitulated publicly available H3K4me3 datasets (Marchal et al. [ 12 ], ENCODE) (Additional file 1 : Fig. S10b) and showed the expected enrichment at peaks (Additional file 1 : Fig. S10c). Furthermore, we found a high degree of overlap between HiChIP loops and Hi-C loops involving TSSs with invariant H3K4me3. Of the retinal and RPE/choroid Hi-C loops, 72% and 60%, respectively, were also present in the corresponding HiChIP loop sets, while 75% of retinal Hi-C loops previously identified by Marchal et al. [ 12 ] correspond to neural retina HiChIP loops (Additional file 1 : Fig. S10d). Yet, distances between anchors of HiChIP loops were significantly smaller ( p -value < 2.2e − 16, Wilcoxon rank-sum test; Additional file 1 : Fig. S10e), the median distance being ~ 115 kb compared with ~ 250 kb for Hi-C loops. We further observed that only a small proportion of HiChIP loops cross TAD boundaries (10.7% and 3% in neural retina and RPE/choroid, respectively, compared to ~ 16% and ~ 13% in shuffled boundary controls; Additional file 1 : Fig. S10f), in agreement with preferential intra-domain promoter-enhancer contacts provided by TAD insulation [ 8 , 9 , 28 ].
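
The proportion of loops crossing a TAD boundary can be computed with a simple interval check. The Python sketch below assumes that each loop is summarized by the midpoints of its two anchors and each boundary by a single coordinate; these representations, the function name, and the toy coordinates are hypothetical. The shuffled-boundary control mentioned above is not implemented here.

```python
from bisect import bisect_left, bisect_right


def fraction_crossing(loops, boundaries):
    """Fraction of loops with at least one boundary strictly between anchors.

    loops      -- list of (chrom, anchor1_midpoint, anchor2_midpoint)
    boundaries -- dict mapping chrom to a sorted list of boundary positions
    """
    crossing = 0
    for chrom, a, b in loops:
        left, right = min(a, b), max(a, b)
        positions = boundaries.get(chrom, [])
        # Number of boundary positions p with left < p < right.
        n_between = bisect_left(positions, right) - bisect_right(positions, left)
        if n_between > 0:
            crossing += 1
    return crossing / len(loops)


# Hypothetical boundaries and loops on one chromosome.
toy_boundaries = {"chr1": [100_000, 500_000]}
toy_loops = [
    ("chr1", 120_000, 300_000),  # inside one domain
    ("chr1", 450_000, 600_000),  # spans the boundary at 500 kb
    ("chr1", 20_000, 80_000),    # inside one domain
    ("chr1", 110_000, 90_000),   # spans the boundary at 100 kb
]
crossing_fraction = fraction_crossing(toy_loops, toy_boundaries)
```

A control of the kind described in the text would repeat the same calculation after randomly repositioning the boundaries and compare the two fractions.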

To identify HiChIP contacts specific to either retinal compartment, we performed differential loop calling using FitHiChIP [ 29 ]. To unambiguously assign differential contacts due to changes in 3D structure, only interactions with similar ChIP-seq coverage of H3K4me3 in both tissues were considered. We identified 269,684 loops contacting 16,648 promoters that fulfilled this condition, of which 34,692 (from 6463 genes) and 2204 loops (from 1339 genes) were specific to the neural retina and RPE/choroid, respectively (Fig.  2 a, Additional file 10 ), in line with the unbalanced difference observed in our Hi-C datasets. Differential intensities were confirmed by aggregate peak analysis plots (Fig.  2 b). At retina-specific loops, we found an enrichment in known IRD disease genes and retina-enriched genes from the EyeGEx database (133/249 RetNet genes at retinal HiChIP loops, Fisher’s exact test, p  = 7.713e − 06 and 71/101 retina-enriched genes at retinal HiChIP loops, Fisher’s exact test, p  = 4.382e − 10) (Fig.  2 c). Moreover, we again found a stronger enrichment when only considering IRD genes with specific expression in cell types of the neural retina (119/208 retina-specific IRD genes at retinal HiChIP loops, Fisher’s exact test, p  = 2.042e − 07), while RPE/choroid-specific IRD genes were not enriched at retina-specific HiChIP loops (Fig.  2 c). Conversely, only RPE/choroid-specific IRD genes were slightly enriched (1.3-fold) at RPE/choroid-specific HiChIP loops (not significant), while we also observed a significant depletion of retina-enriched genes from the EyeGEx database (Fisher’s exact test, p  = 0.0046) (Fig.  2 c).
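
The idea of restricting the comparison to loops with similar H3K4me3 coverage in both tissues can be illustrated with a simple fold-change filter. The Python sketch below is only a conceptual stand-in: the dedicated tool applies its own statistical procedure, and the log2 fold-change threshold, the pseudocount, the function names, and the toy coverage values are all assumptions made for illustration.

```python
import math


def similar_coverage(cov_a, cov_b, max_abs_log2fc=1.0, pseudocount=1.0):
    """True if two coverage values differ by at most `max_abs_log2fc` (log2)."""
    log2fc = math.log2((cov_a + pseudocount) / (cov_b + pseudocount))
    return abs(log2fc) <= max_abs_log2fc


def keep_comparable_loops(loops, coverage_retina, coverage_rpe, **kwargs):
    """Keep loops whose two anchors both have similar coverage in both tissues.

    loops           -- list of (anchor1_id, anchor2_id)
    coverage_retina -- dict anchor_id -> H3K4me3 coverage in neural retina
    coverage_rpe    -- dict anchor_id -> H3K4me3 coverage in RPE/choroid
    """
    kept = []
    for anchor1, anchor2 in loops:
        if all(
            similar_coverage(coverage_retina[a], coverage_rpe[a], **kwargs)
            for a in (anchor1, anchor2)
        ):
            kept.append((anchor1, anchor2))
    return kept


# Hypothetical coverage values for three 5-kb bins.
cov_retina = {"bin1": 100, "bin2": 90, "bin3": 10}
cov_rpe = {"bin1": 110, "bin2": 80, "bin3": 200}
comparable = keep_comparable_loops(
    [("bin1", "bin2"), ("bin1", "bin3")], cov_retina, cov_rpe
)
```

Loops that pass such a filter can then be compared between tissues without the contact difference being trivially explained by a difference in ChIP signal.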

Differential promoter looping between human neural retina and RPE/choroid. a Proportion of differential promoter-associated loops (at 5-kb resolution) in human neural retina (red) and RPE/choroid (blue) according to FitHiChIP (FDR < 0.05). b Aggregate peak analysis centered at HiChIP loops specific to neural retina or RPE/choroid and at stable loops. c Enrichment of RetNet IRD genes, retina-specific RetNet genes, RPE/choroid-specific RetNet genes, and retina-enriched genes from the EyeGEx database within genes specifically contacted in the neural retina (right; Fisher’s exact test, p  = 7.713e − 06, p  = 2.042e − 07, p  = 0.8979, and p  = 4.382e − 10, respectively) and RPE/choroid (left; Fisher’s exact test, p  = 0.4856, p  = 0.1584, p  = 0.3647, and p  = 0.0046, respectively). d Top-10 enriched GO Biological Process terms associated with differentially HiChIP-contacted promoters in neural retina and RPE/choroid. e Genomic tracks showing the 3D chromatin configuration of the RHO gene locus. For both tissues, HiChIP contact matrices, differential loops, and HiChIP-derived H3K4me3 ChIP-seq signals are represented from top to bottom. f Virtual 4C contact frequencies (viewpoints indicated by a green line) for all genes within the RHO locus derived from the neural retina and RPE/choroid binned HiChIP counts

Examples of RetNet genes associated with tissue-specific contact gains included ACO2 , CRX , RHO , NRL , and PROM1 (gain in the neural retina), as well as CDH3 and TIMP3 (gain in RPE/choroid) (Additional file 1 : Fig. S11). Gene Ontology analysis further revealed enriched biological processes associated with light perception for genes specifically contacted in retina, while RPE/choroid-contacted genes were involved in extracellular matrix organization (Fig.  2 d). Additionally, the analysis of GTEx tissue expression data and scRNA-seq data for adult human retina indicated a large cluster of 700 + retina-specific genes involved in retina-specific looping, which were primarily expressed in photoreceptors (Additional file 1 : Fig. S12). Expression of genes at RPE/choroid-specific loops was detected across many human tissues, with single-cell data confirming expression of these genes in cell types of the RPE/choroid (Additional file 1 : Fig. S13). This was in line with expectations, as the cell types found within the RPE/choroid are also present in epithelial, connective, and vascular tissues throughout the human body, while the retinal tissue from the EyeGEx database primarily contains neural retina [ 30 ].

Next, we used these stable and retina-/RPE-specific loops to identify interactions between promoters and candidate cis -regulatory elements (cCREs) with activity in the retina or RPE previously identified by Cherry et al. [ 4 ] (Additional file 10 ). Specifically, using HiChIP stable, retina-specific, and RPE-specific loops, we identified 134,374 neural retina loops (stable and retina-specific) connecting 15,819 TSSs to 32,425 retinal cCREs; and 118,461 loops in RPE/choroid (stable and RPE/choroid-specific) connecting 13,190 TSSs to 8060 RPE cCREs.
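
Linking promoters to cCREs through loops amounts to checking, for each loop, whether one anchor contains a TSS and the other a cCRE. The following Python sketch assumes that TSSs and cCREs have already been assigned to the same fixed-size bins as the loop anchors; the function name and all identifiers in the toy data are hypothetical.

```python
def link_tss_to_ccres(loops, tss_bins, ccre_bins):
    """Return (gene, cCRE) pairs connected by at least one loop.

    loops     -- iterable of (bin_a, bin_b) anchor pairs
    tss_bins  -- dict bin -> set of genes with a TSS in that bin
    ccre_bins -- dict bin -> set of cCRE identifiers in that bin
    """
    links = set()
    for a, b in loops:
        # A loop is undirected, so test both orientations.
        for promoter_bin, distal_bin in ((a, b), (b, a)):
            for gene in tss_bins.get(promoter_bin, ()):
                for ccre in ccre_bins.get(distal_bin, ()):
                    links.add((gene, ccre))
    return links


# Hypothetical loops, promoters, and cCREs.
toy_loops = [(10, 25), (40, 10), (50, 60)]
toy_tss = {10: {"geneA"}, 50: {"geneB"}}
toy_ccres = {25: {"cCRE_1"}, 40: {"cCRE_2"}, 70: {"cCRE_3"}}
links = link_tss_to_ccres(toy_loops, toy_tss, toy_ccres)
```

Counting the distinct genes and cCREs in the resulting pairs gives totals analogous to the TSS and cCRE numbers reported above.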

Illustrative of the power of HiChIP to delineate tissue-specific cis -regulatory landscapes was the differential 3D wiring we observed at the RHO locus, where neighboring genes formed mutually exclusive contacts in either retinal compartment (Fig.  2 e). To further inspect changes in chromatin 3D interactions within this locus, we generated virtual 4C contacts from the HiChIP data for every gene promoter in this region. As inferred from the HiChIP heatmaps, RHO / H1-8 and PLXND1 genes showed little contact overlap, with most of their interactions mapping to opposing sides of the locus (Fig.  2 f).
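
A virtual 4C profile is, in essence, the row of the binned contact matrix that corresponds to the viewpoint. The Python sketch below extracts such a profile from a sparse, upper-triangle representation of binned counts; the data layout, function name, and toy counts are hypothetical, and no normalization or smoothing is applied.

```python
def virtual_4c(contacts, viewpoint_bin, n_bins):
    """Contact counts between the viewpoint bin and every bin in the region.

    contacts -- dict {(bin_i, bin_j): count}, each bin pair stored once
    Returns a list of length n_bins indexed by bin number.
    """
    profile = [0] * n_bins
    for (i, j), count in contacts.items():
        if i == viewpoint_bin:
            profile[j] += count
        elif j == viewpoint_bin:
            profile[i] += count
    return profile


# Hypothetical binned counts for a 5-bin region; the viewpoint is bin 2.
toy_contacts = {(0, 2): 5, (2, 3): 7, (1, 4): 9, (2, 2): 20, (2, 4): 1}
profile = virtual_4c(toy_contacts, viewpoint_bin=2, n_bins=5)
```

Comparing the profiles of neighboring promoters in the two tissues is one way to visualize contacts that map to opposite sides of a locus.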

Altogether, these HiChIP data support the outcome of our comparative Hi-C analysis and extend these results by including high-resolution promoter interactions. This enabled us to refine tissue-specific maps of cis -regulatory landscapes in the adult retina and should aid in unraveling the regulatory mechanisms governing retinal disease genes.

## Differential 3D topology and cis -regulatory interactions shape IRD loci

As single-cell RNA sequencing experiments have indicated that many known IRD genes are expressed in a cell-type-specific manner [ 3 ], we used our differential Hi-C and HiChIP interaction data to explore whether tissue-specific interactions at IRD loci could be associated with their specific expression patterns. Considering results from both the Hi-C and HiChIP comparative analyses, 56% of IRD genes (164/290) could be associated with differential 3D interactions (Fig.  3 a, Additional file 9 : S1). Based on their cell-type-specific expression pattern (single-cell expression data was available for 161/164 genes [ 3 ]), we observed two clusters within this subset of IRD genes marked by tissue-specific 3D topology, with the largest cluster predominantly composed of IRD genes specifically expressed in rod and cone photoreceptors, the most abundant cell types in the neural retina, and a small cluster of genes expressed in the RPE or choroidal cell types, including vascular cells, immune cells and fibroblasts (Fig.  3 b, Additional file 1 : Fig. S14).
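
Combining the evidence from the three comparative analyses is a set union with per-gene bookkeeping. The Python sketch below shows this on toy gene sets; the gene identifiers and the function name are hypothetical and do not correspond to the gene lists in Additional file 9.

```python
def collect_evidence(chess_genes, hic_loop_genes, hichip_loop_genes):
    """Map each gene to the sorted list of analyses that support it."""
    sources = (
        ("CHESS", chess_genes),
        ("Hi-C loops", hic_loop_genes),
        ("HiChIP loops", hichip_loop_genes),
    )
    evidence = {}
    for label, genes in sources:
        for gene in genes:
            evidence.setdefault(gene, []).append(label)
    return {gene: sorted(labels) for gene, labels in evidence.items()}


# Hypothetical gene sets out of a panel of eight disease genes.
panel = {"G1", "G2", "G3", "G4", "G5", "G6", "G7", "G8"}
evidence = collect_evidence({"G1", "G2", "G3"}, {"G2", "G3", "G4"}, {"G3", "G5"})

fraction_marked = len(evidence) / len(panel)
supported_by_all = sorted(g for g, labels in evidence.items() if len(labels) == 3)
```

The same bookkeeping identifies the genes supported by all three analyses, analogous to the subset discussed below.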

The impact of differential 3D genomic interactions at retinal disease loci. a Number of inherited retinal disease (IRD) genes associated with differential interactions in neural retina vs. RPE/choroid through Hi-C differential regions (CHESS) or loops and HiChIP differential loops. b Single-cell RNA expression per cell type within the adult human retina of two clusters of IRD genes associated with differential interactions. Cell types: rod, L/M cone, S cone, retinal pigment epithelium (RPE), pericyte (PER), fibroblast (FB), endothelial (END), melanocyte (CM), T-cell, microglia (uG), monocyte (MO), mast cell (MAST), ON bipolar (DBC), rod bipolar (RBC), OFF bipolar (HBC), Müller cell (MC), GABA amacrine (ACB), horizontal cell (HC), GLY amacrine (ACY), astrocyte (AST), ganglion cell (GC). c Differential 3D interactions at the CFH and CRB1 locus. d Differential 3D interactions at the MAK locus. e Single-cell RNA expression of genes within highlighted loci in adult human retina (periphery) averaged per cell type group

The differential Hi-C and HiChIP analyses primarily enabled the identification of IRD genes associated with interaction gains in the neural retina (Fig.  3 a). For many of these loci, including all those identified through the three individual analyses ( CC2D2A , CEP164 , DMD , ELOVL4 , EYS , GNB3 , IMPG1 , LCA5 , PCDH15 , PROM1 , RPGR , SAMD7 , UNC119) , we found increased local interactions in the neural retina to be correlated with their specific expression in the same tissue (Additional file 1 : Fig. S15). In particular, we often observed tissue-specific chromatin looping between genes with similar expression patterns, indicating these might share a regulatory mechanism. For example, UNC119 (~ cone-rod dystrophy and maculopathy, MIM #620342) forms a retina-specific loop with the VTN gene (specifically expressed in cones in the fovea), ELOVL4 (~ Stargardt-like disease, MIM #600110) contacts LCA5 (~ Leber congenital amaurosis, MIM #604537), while SAMD7 (candidate modifier of IRD [ 31 ] and macular dystrophy, MIM #620762) forms retina-specific loops, mediated by retina-specific CTCF binding at the SAMD7 promoter, with both downstream gene GPR160 and upstream gene MYNN (both expressed in photoreceptors) (Additional file 1 : Fig. S15b, d, f). Some IRD loci, such as CC2D2A / PROM1 and IMPG2 , even showed an increase of long-range, inter-TAD contacts with genes displaying a similar expression profile in the neural retina (Additional file 1 : Fig. S15g, k). For other genes, we identified tissue-specific contacts with cCREs. PCDH15 (~ Usher syndrome, MIM #601067) contacts intronic and upstream cCREs through retina-specific loops, both IMPG1 (~ macular dystrophy, MIM #616151; RP, MIM #153870) and EYS (~ RP, MIM #602772) form retina-specific loops with intronic cCREs mediated by retina-specific CTCF binding, while RPGR and DMD (from its retinal promoter) engage in retina-specific looping with upstream cCREs (Additional file 1 : Fig. S15a, c, h, i, fj).

A smaller subset of IRD genes could be associated with interaction gains in the RPE/choroid. Many of these genes displayed specific expression in the RPE or choroidal cell types and could be identified through differential HiChIP chromatin looping (e.g. CDH3 , EFEMP1 , FBLN5 , LRAT , TIMP3 ) or local interaction frequency gains detected through CHESS analysis of the Hi-C data (e.g., AHR , CFH , CWC27 , NR2F1 , PEX7 , VCAN , WFS1 ) (Additional file 1 : Fig. S16).

Interestingly, a few loci displayed specific contact gains in both the neural retina and RPE/choroid. For example, we observed increased interaction between the CFH promoter (~ age-related macular degeneration, MIM #610698) and its upstream region in the RPE/choroid, coinciding with specific expression and increased CTCF binding in the same tissue, while the opposite is true for the nearby CRB1 gene, which displayed increased local interactions and expression in the neural retina (Fig.  3 c, e). This was also the case for the RP-associated MAK locus, which in addition to a retina-specific interaction between the MAK and ELOVL2 genes (both specifically expressed in photoreceptors) also showed an increase of local RPE/choroid-specific interactions at the GCNT2 and TFAP2A genes (both expressed in RPE/choroid) (Fig.  3 d, e).

## 3D interactions define the ABCA4 cis -regulatory landscape in neural retina and RPE/choroid

Next, we investigated the 3D topology and cis -regulatory landscape of an IRD locus in greater detail. We focused on the ABCA4 locus, implicated in the most common autosomal recessive IRD. The ABCA4 gene is mainly expressed in photoreceptor cells within the neural retina [ 32 ], but has also been shown to be expressed in the RPE [ 25 ]. Interestingly, ABCA4 -IRD has been hypothesized to originate from a fovea-specific dysfunction of RPE cells [ 3 , 25 ]. Moreover, its genetic architecture is characterized by a high proportion of non-coding pathogenic variants [ 17 , 18 ]. The retinal Hi-C and HiChIP maps generated here indicated differential chromatin looping and a TAD boundary shift at the ABCA4 locus, suggesting that specific interactions with distinct CREs in neural retina vs. RPE/choroid could be involved in the differential transcriptional regulation of ABCA4 (Additional file 1 : Fig. S17).

Characterization of the ABCA4 cis -regulatory landscape in human retina. a ABCA4 promoter interaction frequencies using UMI-4C in human neural retina and RPE/choroid from retinal donors ( n  = 3, interacting regions (IRs) indicated 1–12). Candidate cis -regulatory elements (cCREs) within IRs were identified using publicly available epigenomic data from human retina: ATAC-seq from bulk retina and scATAC-seq from photoreceptor cells; ChIP-seq for histone marks H3K27ac and H3K4me2, retinal transcription factors (TFs) (CRX, OTX2, and NRL) and the architectural protein CTCF. Epigenomic data for RPE/choroid included bulk ATAC-seq and ChIP-seq targeting H3K27ac and CTCF. All these data were integrated to finely map cCREs. b Close-up of the cCREs including the above-described datasets; retinal TF binding (CRX, OTX2, NRL, RORB, and MEF2D); and sequence motifs (Jaspar Core Pred. TFBS 2022) for TFs expressed in photoreceptors (i.e., MEIS1, NRL, NR2E3, OTX2, CRX, MEIS2, MEF2D, RORB, RXRG, SMAD2 and NEUROD1); and the TFs expressed in RPE (CRX, KLF4, KLF9, LHX2, MEIS1, MEIS2, OTX2, RORB, SMAD2, STAT5B, TEAD1, and TEAD3). c Overview of in vivo enhancer assays using zebrafish stable transgenic lines; dot plot (left) indicating in which tissues GFP + reporter expression was observed (retina, RPE, and lens, white arrows). d Overview of in vivo enhancer assays for the cCRE1–5 synthetic construct through transient transgenesis in zebrafish; bar plots (top) indicating the frequency of GFP + tissues (retina, pineal gland, lens, forebrain, heart, and nosepit) among total GFP + embryos at 1, 2, 3, and 4 days post-fertilization (dpf); example of reporter expression in retina and pineal gland at 3 and 4 dpf

Subsequently, we identified tissue-specific cCREs within these IRs using publicly available epigenomic datasets (Fig.  4 a). Almost all IRs were associated with open chromatin in the neural retina (11/12) [ 4 , 33 ]; and all of them in RPE (12/12) [ 33 ]. In addition, we found histone modifications associated with active enhancers (H3K27ac and H3K4me2) and photoreceptor-specific transcription factors (TFs) (e.g., OTX2, CRX, NRL, RORB, and MEF2D), including their sequence motifs, to be present at most IRs within the neural retina (10/12, Additional file 9 : Table S2). Within the RPE, we identified the presence of H3K27ac within 6 of 12 IRs, in addition to the presence of TF sequence motifs found to be expressed in the RPE (e.g., KLF4, LHX2, OTX2, and TEAD1) (Additional file 9 : Table S2). Of note, IR12 appears to contain a cCRE with RPE-specific activity given the presence of H3K27ac and high frequency of chromatin accessibility (cCRE-RPE, Fig.  4 b), as also reported by Cherry et al. [ 4 ].

## Single-cell dissection of the ABCA4 cis -regulatory network reveals cCREs in photoreceptors and RPE

Given the cellular complexity of the retina, we mined the ABCA4 locus in publicly available scATAC-seq and scRNA-seq datasets derived from human neural retina [ 34 ]. Using these datasets, we could identify the precise cell type in which cCREs within 9/11 IRs are likely active (Additional file 1 : Fig. S20, Additional file 9 : Table S2). As expected, we observed the highest frequency of chromatin accessibility at the ABCA4 TSS among adult rod and cone photoreceptor cells, which correlated with transcriptional activity in these cell types (Additional file 1 : Fig. S20a). Also, most IRs (9/11) were found to be accessible in at least one retinal cell cluster and could be linked to the ABCA4 promoter through co-accessibility analysis, corroborating the UMI-4C interaction profiles (Additional file 1 : Fig. S20b, Additional file 9 : Table S2). Of all IRs, seven were found to be accessible in photoreceptor cells while only one, the ARHGAP29 promoter (IR1), was found to be constitutively accessible. Interestingly, IR8 and IR10 were found to be exclusively accessible in the adult Müller glial cells, in which low ABCA4 expression can be observed (Additional file 1 : Fig. S20).

Overall, upon cell-type-specific epigenetic characterization of the IRs and narrowing down to elements active in photoreceptor cells, we prioritized six cCREs (cCRE1-6), within IR3, IR4, IR5, IR7, IR9, and IR11 respectively, as candidate regulatory elements for ABCA4 expression (Fig.  4 b and Additional file 9 : Table S2). Moreover, the available TF ChIP-seq data and motifs found in the center of these cCREs suggest that CRX, OTX2, NRL, and RORB likely constitute the core TFs necessary for ABCA4 transcriptional regulation in photoreceptor cells (Fig.  4 b and Additional file 9 : Table S2). Note that since some of these TFs (CRX and OTX2) are expressed in the RPE as well, the proposed cCREs may also act as cis -regulators in this cell type.

## In vivo zebrafish enhancer assays characterize ABCA4 cCRE activity

To further evaluate the activity pattern of cCREs with a putative role in ABCA4 regulation, in vivo enhancer assays in zebrafish were performed. We prioritized eight elements for functional assessment, including the ABCA4 promoter, five out of the six cCREs (cCRE2-6) identified above, as well as two cCREs previously identified by Cherry et al. [ 4 ] that had not been tested in vivo before (Cherry1/2) (Fig.  4 c, Additional file 9 : Table S3) [ 9 ]. In total, we generated eight stable transgenic zebrafish lines and assessed GFP fluorescence at 1, 2, and 3 days post-fertilization (dpf) to evaluate enhancer activity. Reporter expression in the eye was observed for the majority of the tested elements (5/8) (Fig.  4 c, Additional file 1 : Fig. S21). Of these, three exhibited reporter expression in the retina (promoter, cCRE6, Cherry2), three in the lens (cCRE4, cCRE5, and Cherry2), and one in the RPE (cCRE4) (Fig.  4 c, Additional file 1 : Fig. S21).

To assess whether cooperativity between several cCREs could improve tissue-specificity, we designed a synthetic construct including core elements of 5 out of the 6 prioritized cCREs (cCRE1–5), since ChIP-seq data [ 4 ] indicated these were bound by a common set of photoreceptor TFs (CRX, NRL, OTX2, RORB, and MEF2D) (Additional file 9 : Table S2). This construct was cloned into the E1b-tol2 vector [ 35 ] and transient eGFP expression was annotated at one, two, three, and four dpf. Remarkably, we observed robust and strong reporter expression in the retina (75/82) and pineal gland (82/82) (Fig.  4 d, Additional file 1 : Fig. S22, Additional file 9 : Table S4). Of note, the pineal gland contains both rod and cone light-sensitive photoreceptor cells and plays important roles in the regulation of circadian rhythms in animal behavior and physiology [ 36 ]. Overall, these results indicate a functional role of the proposed cCREs and suggest a mechanism of enhancer cooperativity to ensure tissue-specific ABCA4 expression.

Through extensive 3D genome mapping, including genome-wide (Hi-C), promoter-centric (HiChIP), and locus-specific (UMI-4C) profiling, we have characterized the 3D chromatin architecture and cis -regulatory interactions in the two major components of the human retina, the neural retina, and the RPE/choroid. A comparative analysis between these two tightly interconnected layers revealed differential 3D chromatin topology and cis -regulatory interactions at loci associated with tissue- and cell-type specific expression and/or retinal disease. Importantly, we found that almost 60% of known IRD genes were marked by a differential 3D genome topology.

Recently Marchal et al. [ 12 ] mapped high-resolution 3D topology of the human retina by Hi-C, and by integrating this with chromatin accessibility, histone marks, and transcriptome data of the human retina provided insight into targets of CREs and into the chromatin architecture of super-enhancers. Here, combining two complementary genome-wide chromatin interaction profiling technologies, in situ Hi-C and H3K4me3 HiChIP, allowed us to investigate multiple aspects of differential 3D topology in the neural retina vs. RPE/choroid. The comparative Hi-C analyses provided a genome-wide view on interaction frequency changes, primarily revealing increased cis -regulatory interactions near genes displaying specific expression in the most abundant cell types of either the neural retina ( i.e. rod and cone photoreceptors) or the RPE/choroid. These interactions appeared to facilitate contact with tissue-specific cCREs or other genes with similar expression profiles. The inclusion of HiChIP analyses greatly increased the sensitivity with which we could detect differential chromatin looping at active promoters. We therefore focused the differential HiChIP analysis on genes that were active in both retinal compartments, revealing differential usage of cCREs for gene regulation in both tissues.

The 3D interaction differences between the two closely related tissues highlighted in this study stress the importance of acquiring tissue-specific interaction data for genes with highly specific expression patterns, as is the case for most retinal disease genes. This type of tissue-specific data is crucial to correctly interpret cis -regulatory landscapes and disease-associated variation, in particular within the non-coding genome. Yet, it is important to note that even chromatin interaction mapping at the tissue level overlooks the underlying cellular complexity, as the resulting interaction maps reflect contact frequencies derived from a mixture of different cell types. In this case, we observed that interaction data from the neural retina primarily reflect contacts derived from by far the most abundant cell types, namely the photoreceptors. This was clearly exemplified by the photoreceptor-specific expression of most genes near differential contacts gained in the neural retina maps. The RPE/choroid layer, on the other hand, is composed of a mixture of epithelial, endothelial, fibroblast, and immune cells, and the resulting interaction maps are therefore expected to reflect an average contact frequency across these different cell types. This might also explain the imbalance we observed in the number of chromatin loops that could be identified in neural retina vs. RPE/choroid Hi-C matrices. Despite similar sequencing coverage and contact numbers, more than twice as many Hi-C loops were identified in the neural retina. We speculate that the punctate signal from cell-type specific loops might be diluted in the RPE/choroid interaction maps due to its heterogeneous composition. An alternative explanation, though, may be a lower degree of cis -regulatory complexity in the RPE/choroid compartment versus neural retina, given that neurons in general are highly complex cell types from a regulatory point of view [ 37 ]. Future interaction mapping at the cell-type level will be required to disentangle this complexity.

To investigate the potential impact of differential 3D chromatin architecture on IRD genes in greater detail, we focused on the ABCA4 locus, which was marked by a shift in TAD boundaries, as well as differential chromatin looping in our comparative analysis. Cherry et al. [ 4 ] previously annotated cCREs of the ABCA4 region in the human retina, based on tissue-specific epigenomic markers, TF binding, and gene expression datasets. Here, integration of chromatin conformation, scATAC-seq, and scRNA-seq datasets revealed six cCREs interacting with ABCA4 and presumably active in photoreceptors. These were located “proximally” (~ 75 kb from the TSS), upstream of the promoter, and within intronic regions, as is expected for tissue-specific enhancers [ 38 , 39 ]. Overall, contact frequencies between the ABCA4 promoter and these proximal cCREs were highly similar in neural retina and RPE/choroid, except one interaction in the RPE/choroid that contained RPE-specific enhancer marks (cCRE-RPE). To functionally validate these cCREs, zebrafish transgenic enhancer assays were performed using stable lines, revealing expression in relevant tissues such as the retina, lens, and RPE. Since this expression pattern was not specific for photoreceptor cells, we tested the cooperativity of five cCREs and demonstrated specific retinal expression, presumably in photoreceptors. The latter emphasizes the importance of the 3D chromatin architecture for the regulation of tissue-specific ABCA4 expression and of the tissue-specific CREs involved [ 40 ].

The number of genetic defects affecting CREs and/or 3D genome architecture reported in Mendelian retinal diseases is slowly increasing [ 13 , 14 ]. A striking example in which the 3D genome topology of patient-derived retinal organoids was used to interpret a non-coding structural variant in IRD was reported only recently [ 16 ]. Relating CREs to their target genes is useful to interpret more subtle variants with a regulatory effect, as reported in NCMD, a retinal enhanceropathy [ 15 ]. We anticipate that multi-omics analyses of functional non-coding regions within retinal disease loci, as illustrated here for the ABCA4 locus, will accelerate our understanding of Mendelian retinal diseases.

In summary, we have shed light on the extent of differential 3D chromatin landscapes in neural retina and RPE/choroid, the two major components of the human retina. Given the growing interest in non-coding variation, both in multifactorial eye diseases implicating the retina, such as age-related macular disease and glaucoma, and in Mendelian retinal diseases, a differential annotation of the 3D topology of the retinal compartments and adequate interpretation of different categories of variants are highly needed. For example, TAD boundaries and chromatin loops within the different retinal compartments, as identified in this study, will make it possible to define biologically relevant search spaces for missing heritability in complex as well as Mendelian retinal diseases such as ABCA4 retinopathy, one of the most frequent IRDs.

## Tissue preparation and nuclei isolation

Post-mortem human neural retina and RPE/choroid mixtures were obtained through the Tissue Bank of Ghent University Hospital and Antwerp University Hospital under ethical approval of the Ethics Committee of Ghent University (2018/1072, B670201837286). Eye globes were provided with a description of time and cause of death, post-mortem circulation time (ranging from 3-18 h), age, and sex (Additional file 9 : Table S5). None of the eight donors had a prior known ophthalmological condition.

The eye globes were dissected on ice, followed by extraction of the neural retina and the RPE/choroid. The resulting tissues were resuspended in 1× PBS supplemented with 10% fetal bovine serum. The samples were processed according to Matelot and Noordermeer [ 41 ] and cross-linking of nuclei was performed using 2% formaldehyde. Finally, the obtained nuclei were aliquoted per 10 million and snap frozen after supernatant removal. Samples were stored at −80 °C.

## Hi-C data analysis

FASTQ files containing raw sequencing data were processed into Hi-C contact matrices containing both raw and normalized counts using the Juicer pipeline (v1.6) [ 42 ] with BWA-MEM mapping (v0.7.17) [ 43 ] to the hg38 reference genome. Paired contacts from individual replicates were merged to create mega contact matrices for each tissue. Insulating boundaries between self-interacting domains were identified based on diamond insulation score minima. We used cooltools (v0.5.2, https://doi.org/10.5281/zenodo.5214125 ) to calculate a genome-wide contact insulation score with a 250-kb window size for SCALE normalized mega Hi-C contact matrices (MAPQ > 30) at 25-kb resolution. Insulating boundaries were determined by applying automated “Li” thresholding (from the scikit-image Python package) on boundary strength. Chromatin loops were identified using HiCCUPS [ 9 ] (as implemented in Juicer v1.6), using SCALE normalized mega Hi-C contact matrices (MAPQ > 30) at 5, 10 and 25 kb resolution (parameters as used by Rao et al. [ 9 ]: -m 512 -r 5000,10000,25000 -k KR -f 0.1,0.1,0.1 -p 4,2,1 -i 7,5,3 -t 0.02,1.5,1.75,2 -d 20000,20000,50000). Differential loops in neural retina vs. RPE/choroid were determined using HiCCUPSDiff (as implemented in Juicer v1.6) with the same parameters and input matrices. Differential 3D features in neural retina vs. RPE/choroid were identified using the CHESS algorithm [ 27 ]. CHESS was run on a per-chromosome basis with SCALE normalized mega Hi-C contact matrices (MAPQ > 30, 25-kb resolution), using sliding windows of 1 Mb and 500 kb with a 100 kb step size. Top differential windows were filtered using z-ssim < −1.2 and signal-to-noise > 2 or 2.5 for the 1 Mb and 500 kb window analysis respectively. Filtered differential windows from both analyses were merged and overlapping windows were collapsed to generate a list of differential regions. We used FAN-C [ 44 ] to plot Hi-C matrices and fold-change matrices for regions of interest. All downstream analyses are described in a separate section below.
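The window filtering and collapsing step can be illustrated with a minimal sketch. It assumes each CHESS window has already been reduced to a record holding its coordinates, z-ssim, and signal-to-noise value. The `Window` type, the function names, and the example values are hypothetical and are not part of CHESS or of the original analysis code.

```python
from typing import NamedTuple

class Window(NamedTuple):
    chrom: str
    start: int   # half-open interval [start, end)
    end: int
    z_ssim: float
    sn: float    # signal-to-noise

def filter_windows(windows, z_max=-1.2, sn_min=2.0):
    """Keep windows with z-ssim below z_max and signal-to-noise above sn_min."""
    return [w for w in windows if w.z_ssim < z_max and w.sn > sn_min]

def merge_overlapping(windows):
    """Collapse overlapping windows per chromosome into (chrom, start, end) regions."""
    regions = []
    for w in sorted(windows, key=lambda w: (w.chrom, w.start, w.end)):
        if regions and regions[-1][0] == w.chrom and w.start < regions[-1][2]:
            chrom, start, end = regions[-1]
            regions[-1] = (chrom, start, max(end, w.end))
        else:
            regions.append((w.chrom, w.start, w.end))
    return regions

# Hypothetical example values, for illustration only.
win_1mb = [
    Window("chr1", 0, 1_000_000, -1.5, 2.3),
    Window("chr1", 100_000, 1_100_000, -1.3, 2.1),
    Window("chr1", 5_000_000, 6_000_000, -0.8, 3.0),
]
win_500kb = [
    Window("chr1", 900_000, 1_400_000, -1.4, 2.6),
    Window("chr1", 3_000_000, 3_500_000, -1.6, 2.2),
]

# Thresholds from the text: signal-to-noise > 2 (1 Mb) or > 2.5 (500 kb).
kept = filter_windows(win_1mb, sn_min=2.0) + filter_windows(win_500kb, sn_min=2.5)
regions = merge_overlapping(kept)
```

In this sketch, windows that merely touch are not merged; whether adjacent windows should be collapsed is a design choice the text does not specify.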

## Generation of HiChIP libraries

HiChIP was performed as previously described [ 24 ] using cross-linked nuclei from two neural retina and RPE/choroid samples (derived from two eyes, obtained from one and two donors respectively) (Additional file 9 : Table S5). After lysis, digestion was performed using 400 U of DpnII restriction enzyme (R0543T-NEB). Next, digestion efficiency was assessed and incorporation master mix (biotin-dATP 0.4 mM, 19524016-Thermo Fisher; dNTP-A mix; and DNA Polymerase I, Large (Klenow) Fragment, M0210-NEB) was added to fill in the restriction fragment overhangs and mark DNA ends with biotin, with rotation for 1 h at 37 °C. Subsequently, ligation master mix was added (10× NEB T4 DNA ligase buffer with 10 mM ATP (B0202-NEB), 10% Triton X-100, BSA (B9000-NEB), T4 DNA ligase (M0202-NEB), and Milli-Q water) and samples were incubated at 16 °C with rotation. Sonication was performed, keeping the samples on ice, using the M220 Focused-ultrasonicator (Covaris) with the following conditions: duty cycle 10%, PIP 75 W, 100 cycles/burst, time 5 min. This yielded DNA fragments of around 300 bp, which were incubated with Dynabeads Protein G (10003D-ThermoFisher) and 6.7 µg of anti-H3K4me3 antibody overnight at 4 °C with rotation. Samples were purified using the DNA Clean and Concentrator columns (D4004-Zymo Research). Up to 150 ng was taken into the biotin capture step, performed using Streptavidin C-1 beads (65002-ThermoFisher). Tagmentation was conducted using the Nextera DNA Library Preparation Kit (FC-121-1030-Illumina) and library amplification was performed using NEBNext® High-Fidelity 2X PCR Master Mix (M0541L-NEB) with Nextera Ad1_noMX and Ad2.X primers. The resulting product was purified with the DNA Clean and Concentrator columns (D4004-Zymo Research).

## HiChIP data analysis

Paired-end reads were aligned to the hg38 reference human genome using the TADbit pipeline [ 45 ] with default settings. Briefly, duplicate reads were removed, Dpn II restriction fragments were assigned to resulting read pairs, valid interactions were retained by removing unligated and self-ligated events and multiresolution interaction matrices were generated. To create 1D signal bedfiles, equivalent to those of ChIP-seq, dangling end read pairs were used and coverage profiles were generated in bedgraph format using the bedtools genomecov tool. Next, we performed bedgraph to bigwig conversions for visualization purposes using the bedGraphToBigWig tool from UCSC Kent Utils. 1D signal bedgraph files were then used to call peaks either with nucleR [ 46 ] or with MACS2 [ 47 ] using the no model and extsize 147 parameters and an FDR ≤ 0.05.

FitHiChIP [ 29 ] was used to identify “peak-to-all” interactions at 5-kb resolution using HiChIP filtered pairs and peaks derived from dangling ends. Loops were called using a genomic distance between 20 kb and 2 Mb, and coverage bias correction was performed to achieve normalization. FitHiChIP loops with q-values smaller than 0.05 that were common to both replicates and involved promoters were kept for further analyses. For differential loop calling between the neural retina and RPE/choroid, we used the script “DiffAnalysisHiChIP” from FitHiChIP with FDR and fold-change thresholds of 0.05 and 1.5, respectively. To avoid the identification of differential loops due to changes in ChIP-seq coverage, only differential loops connecting anchors with similar H3K4me3 intensities were kept (i.e., category ND–ND from the FitHiChIP differential loop calling output). Gene annotation of loop anchors was performed as described in the “ Downstream analyses of Hi-C and HiChIP data ” section below, and only promoter-associated loops were finally retained.
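The replicate-consistency filter described above can be sketched as follows. This is an illustration only, assuming loops have been exported to a mapping from a `(chrom, bin1_start, bin2_start)` key to a q-value. That data layout and the function names are hypothetical and do not reflect the FitHiChIP output format.

```python
def significant(loops, q_max=0.05):
    """Return the keys of loops whose q-value is below q_max."""
    return {key for key, q in loops.items() if q < q_max}

def reproducible_loops(rep1, rep2, q_max=0.05,
                       min_dist=20_000, max_dist=2_000_000):
    """Loops significant in both replicates and within the distance range."""
    shared = significant(rep1, q_max) & significant(rep2, q_max)
    return [
        (chrom, a, b)
        for chrom, a, b in sorted(shared)
        if min_dist <= abs(b - a) <= max_dist
    ]

# Hypothetical q-values for four candidate loops in two replicates.
rep1 = {
    ("chr1", 100_000, 150_000): 0.01,
    ("chr1", 100_000, 105_000): 0.001,
    ("chr1", 200_000, 400_000): 0.2,
    ("chr2", 50_000, 3_000_000): 0.01,
}
rep2 = {
    ("chr1", 100_000, 150_000): 0.03,
    ("chr1", 100_000, 105_000): 0.002,
    ("chr1", 200_000, 400_000): 0.01,
    ("chr2", 50_000, 3_000_000): 0.02,
}
common = reproducible_loops(rep1, rep2)
```

Only the first loop survives here: the second is shorter than 20 kb, the third is not significant in the first replicate, and the fourth exceeds 2 Mb.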

To determine the overlap between Hi-C loops and HiChIP loops identified in both retinal tissues, FitHiChIP [ 29 ] was used to annotate Hi-C loops with H3K4me3 at 5-kb resolution. Hi-C loops were then filtered to only retain those loops with characteristics that resemble those of HiChIP loops included in the differential FitHiChIP analysis, i.e., category ND–ND and 2 kb up- or downstream from a TSS. Subsequently, we performed an overlap between (1) filtered retinal Hi-C loops and stable or retina-specific HiChIP loops and (2) filtered RPE Hi-C loops and stable or RPE-specific HiChIP loops. The same approach was used to perform the overlap between retinal Hi-C loops identified by Marchal et al. [ 12 ] and the set of stable or retina-specific HiChIP loops.

Virtual 4C tracks of the RHO gene locus were generated from HiChIP interaction matrices. First, virtual 4C baits were determined by overlapping of HiChIP 5 kb bins with gene promoters located within a 265-kb locus around RHO (chr3:129395000–129660000). Then, we extracted all interaction counts from each single bait belonging to such locus.
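As a rough sketch of this procedure, the code below selects 5-kb bins that overlap promoters inside the locus and then pulls all contacts involving one bait bin. The locus coordinates and bin size come from the text. The promoter coordinates, the contact dictionary layout, and the function names are hypothetical.

```python
BIN_SIZE = 5_000
LOCUS = ("chr3", 129_395_000, 129_660_000)  # locus around RHO, from the text

def bait_bins(promoters, locus=LOCUS, bin_size=BIN_SIZE):
    """Start coordinates of bins overlapping any promoter inside the locus.

    promoters: iterable of (chrom, start, end) half-open intervals.
    """
    chrom, lo, hi = locus
    bins = set()
    for p_chrom, p_start, p_end in promoters:
        if p_chrom != chrom or p_end <= lo or p_start >= hi:
            continue  # promoter lies outside the locus
        first = (max(p_start, lo) // bin_size) * bin_size
        last = ((min(p_end, hi) - 1) // bin_size) * bin_size
        bins.update(range(first, last + bin_size, bin_size))
    return sorted(bins)

def virtual_4c(contacts, bait):
    """Collect counts of all bin pairs that include the bait bin.

    contacts: dict mapping (bin_a, bin_b) start coordinates to a count.
    Returns {partner_bin: count}.
    """
    track = {}
    for (a, b), n in contacts.items():
        if a == bait:
            track[b] = track.get(b, 0) + n
        elif b == bait:
            track[a] = track.get(a, 0) + n
    return track

# Hypothetical promoter intervals and contact counts.
promoters = [("chr3", 129_528_000, 129_532_000), ("chr3", 200_000_000, 200_004_000)]
baits = bait_bins(promoters)
contacts = {
    (129_500_000, 129_525_000): 4,
    (129_525_000, 129_600_000): 7,
    (129_400_000, 129_410_000): 2,
}
track = virtual_4c(contacts, baits[0])
```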

For the computation of loops crossing the TAD boundaries of Fig_HiChIP_S6, five sets of shuffled TAD boundaries were generated by partitioning the genome into virtual TADs with the same size as experimental ones but randomly positioned within chromosomes.
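One way to build such a null model is sketched below. It assumes the experimental TADs tile a chromosome contiguously, so that permuting the order of their sizes yields randomly positioned boundaries while preserving the size distribution. The exact randomization used in the study is not described in more detail, so both the permutation strategy and the function names are assumptions.

```python
import random

def shuffled_boundaries(tad_sizes, chrom_start=0, seed=None):
    """Permute TAD sizes and return the resulting internal boundary positions."""
    rng = random.Random(seed)
    sizes = list(tad_sizes)
    rng.shuffle(sizes)
    boundaries, pos = [], chrom_start
    for size in sizes[:-1]:  # the end of the last TAD is the chromosome end
        pos += size
        boundaries.append(pos)
    return boundaries

def crossing_fraction(loops, boundaries):
    """Fraction of loops (start, end) that span at least one boundary."""
    if not loops:
        return 0.0
    crossing = sum(any(s < b < e for b in boundaries) for s, e in loops)
    return crossing / len(loops)

# Hypothetical TAD sizes for one chromosome; five shuffled sets as in the text.
tad_sizes = [500_000, 1_000_000, 250_000]
null_sets = [shuffled_boundaries(tad_sizes, seed=s) for s in range(5)]
example_fraction = crossing_fraction(
    [(100, 200), (400_000, 600_000)], [500_000]
)
```

Comparing the crossing fraction obtained with the experimental boundaries against the five shuffled sets gives a sense of how strongly loops are constrained by TADs.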

## Downstream analyses of Hi-C and HiChIP data

Gene sets used for downstream analyses/annotation of Hi-C and HiChIP differential regions, loops, and boundaries, included Ensembl Human genes (GRCh38.p13), filtered for protein-coding, long non-coding RNA and microRNA transcripts, known IRD genes (Additional file 9 : Table S1) and retina-enriched genes from the EyeGEx database (defined as genes having a tenfold or higher expression in the retina than in at least 42 of the 53 GTEx (v7) tissues) [ 30 ]. For annotation purposes, a 2-kb region up- and downstream of the TSS was considered. Gene Ontology enrichment of genes at (differential) 3D features was performed using the “clusterProfiler” package in R (ontology = Biological Process, Benjamini–Hochberg adjustment, q -value < 0.05) [ 48 ]. Fisher’s exact test ( p -value < 0.05) was used to determine enrichment of gene sets of interest at (differential) 3D features.
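For readers unfamiliar with the enrichment test, the following sketch computes a one-sided Fisher's exact p-value from the hypergeometric distribution. The text does not state whether a one- or two-sided test was applied, so the one-sided (over-representation) form is an assumption, as are the function name and the example counts.

```python
from math import comb

def fisher_enrichment_p(k, n, K, N):
    """One-sided Fisher's exact p-value, P(X >= k), for gene-set enrichment.

    N: number of background genes
    K: background genes belonging to the set of interest (e.g. IRD genes)
    n: genes located at the 3D feature (e.g. differential loops)
    k: genes at the 3D feature that belong to the set of interest
    """
    upper = min(n, K)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total

# Hypothetical counts: 20 of 500 genes at a feature are in a 300-gene set,
# against a background of 20,000 genes.
p_value = fisher_enrichment_p(k=20, n=500, K=300, N=20_000)
is_enriched = p_value < 0.05
```

Exact integer arithmetic via `math.comb` avoids floating-point underflow for large gene sets; only the final ratio is converted to a float.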

Tissue-specific expression of genes in differential windows or at differential loops was evaluated using the GTEx dataset (v8) with integrated EyeGEx expression data for retina [ 30 ], as is available through The Human Protein Atlas (v23.0, https://www.proteinatlas.org ) [ 49 ]. Specifically, normalized expression values (normalized transcripts per million (nTPM)) were log2-transformed and converted to gene Z -scores. Clustered heatmaps were generated using the ComplexHeatmap package in R [ 50 ].
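The transformation into gene Z-scores can be sketched as below. The pseudocount added before the log2 transform and the use of the sample standard deviation are assumptions, since the text does not specify them. The function name and expression values are hypothetical.

```python
from math import log2
from statistics import mean, stdev

def gene_z_scores(ntpm_by_tissue, pseudocount=1.0):
    """Log2-transform nTPM values of one gene and convert them to Z-scores.

    ntpm_by_tissue: dict mapping tissue name to nTPM (at least two tissues).
    """
    logged = {t: log2(v + pseudocount) for t, v in ntpm_by_tissue.items()}
    values = list(logged.values())
    mu = mean(values)
    sd = stdev(values)
    if sd == 0:
        return {t: 0.0 for t in logged}  # gene with identical expression everywhere
    return {t: (x - mu) / sd for t, x in logged.items()}

# Hypothetical nTPM values for a single gene.
z = gene_z_scores({"retina": 7.0, "liver": 1.0, "lung": 3.0})
```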

Single-cell RNA-seq data from the human adult peripheral retina was obtained from Cowan et al. [ 3 ]. Specifically, we converted cell-type level, normalized gene expression values (expression normalized to 10,000 transcript counts per cell type) to cell-type level gene Z -scores. Genes with cell-type specific expression in the RPE/choroid were then identified by filtering for genes with a Z -score > 2 in at least one cell-type found in the RPE/choroid layer (“RPE,” “PER,” “FB_01,” “FB_02,” “FB_03,” “END_01,” “END_02,” “END_03,” “CM,” “NK,” “TCell,” “MO_01,” “MO_02,” “MO_03,” “MAST”). Similarly, to identify genes with cell-type-specific expression in the neural retina, we filtered for genes with a Z -score > 2 in at least one cell-type found in the neural retina (all other cell-types excluding the ones mentioned above). Clustered heatmaps were generated as described above.
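The cell-type filter can be expressed compactly, as in the sketch below. The RPE/choroid cell-type labels are taken from the text. The gene names, the neural retina label `ROD`, the Z-scores, and the function name are hypothetical.

```python
RPE_CHOROID_TYPES = {
    "RPE", "PER", "FB_01", "FB_02", "FB_03", "END_01", "END_02", "END_03",
    "CM", "NK", "TCell", "MO_01", "MO_02", "MO_03", "MAST",
}

def layer_specific_genes(z_scores, layer_types, threshold=2.0):
    """Genes with a Z-score above threshold in at least one cell type of the layer.

    z_scores: dict mapping gene -> {cell_type: Z-score}.
    """
    return sorted(
        gene
        for gene, by_type in z_scores.items()
        if any(z > threshold for ct, z in by_type.items() if ct in layer_types)
    )

# Hypothetical Z-scores; "ROD" stands in for a neural retina cell type.
z_scores = {
    "GENE_A": {"RPE": 3.1, "ROD": -0.2},
    "GENE_B": {"RPE": 0.1, "ROD": 2.5},
    "GENE_C": {"RPE": 1.9, "ROD": 1.0},
}
all_types = {ct for by_type in z_scores.values() for ct in by_type}
neural_types = all_types - RPE_CHOROID_TYPES  # all other cell types

rpe_genes = layer_specific_genes(z_scores, RPE_CHOROID_TYPES)
neural_genes = layer_specific_genes(z_scores, neural_types)
```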

## Generation of UMI-4C libraries and data analysis

The generation of the 3C template was performed as previously described [ 26 ]. Briefly, around 5 million cross-linked nuclei were digested overnight using 400 U Dpn II (NEB). After digestion, ligation was performed overnight using 4000 U of T4 DNA ligase (NEB), followed by the addition of proteinase K (BIOzymTC). The efficiency of digestion and ligation was evaluated via agarose gel electrophoresis. Next, samples were de-crosslinked, followed by purification of samples using AMPure XP beads (Agencourt). Subsequently, 4 µg of the 3C template was sheared on a Covaris M220-focused ultrasonicator to get 300 bp DNA fragments. The UMI-4C sequencing library was prepared using the NEBNext Ultra II Library Prep Kit (NEB). Library amplification was performed by nested PCR. In the first PCR, 100 ng of the library was amplified using an upstream (US) forward primer and a universal reverse primer using the KAPA2G Robust HotStart ReadyMix (Roche). The resulting product was amplified using a downstream (DS) forward primer and the same universal reverse primer. Primer sequences can be found in Additional file 9 : Table S6. Libraries were multiplexed in equimolar ratios and sequenced on the Illumina NovaSeq 6000 platform, resulting in 150 bp paired-end reads. These were demultiplexed based on their barcodes and their DS primer using runcutadapt ( https://github.com/marcelm/cutadapt ). UMI-4C data was processed using the R package umi4cpackage 0.0.0.9000 ( https://github.com/tanaylab/umi4cpackage ; https://github.com/tanaylab/umi4cpackage/index.html ) [ 26 ]. Profiles were generated using default parameters, pooling all samples per viewpoint and condition (retina and RPE/choroid), and using a minimum win_cov of 50. All individual samples were interrogated for the ABCA4 promoter region. Reverse UMI-4C experiments were performed using at least 2 different biological replicates (2 different human donors).

## Integration of bulk and single-cell transcriptomic and epigenomic datasets from human donor retina

To predict putative CREs for the ABCA4 locus, an integration of publicly available datasets based on human neural retinal post-mortem material was performed. Data from the following experiments was included: ATAC-seq derived from healthy adult donor retinas [ 4 , 33 ], scATAC-seq from human embryo and adult post-mortem retinas [ 34 ], DNase-seq from ENCODE based on fetal retinas [ 51 ] and ChIP-seq of histone modifications (H3K27ac and H3K4me2), specific retinal transcription factors (CRX, OTX2, NRL, CREB, RORB and MEF2D) and CTCF derived from post-mortem donors with no eye condition [ 4 ]. Equally, bulk ATAC-seq [ 4 , 33 ] and ChIP-seq data for the active enhancer marker H3K27ac [ 4 , 33 ] derived from healthy post-mortem donors were also integrated. A ChIP-seq dataset targeting the CTCF protein derived from primary RPE from ENCODE (ENCSR000DVI) was also included. Additionally, single-nucleus ATAC-seq data [ 34 ] of embryonic (53, 59, 74, 78, 113, and 132 days) and adult (25, 50, and 54 years old) human retinal cells were obtained from GSE183684 and imported into R (v4.0.5). The matrices were processed using the ArchR single-cell analysis package (v1.0.1) [ 52 ] according to Thomas et al. [ 34 ]. After filtering out doublets, the dataset comprised 61,313 cells. Single-nucleus RNA-seq data [ 34 ] for the same tissue types and timepoints were integrated using the unconstrained integration method. Peak calling was performed using the native peak caller “TileMatrix” from ArchR and bigwig files from each annotated cell cluster were extracted and converted to bedgraph files. Peak identification was performed using bdgpeakcall (MACS2 v2.2.7.1) [ 47 ] using default parameters and a value of 0.1 as cutoff.

## Generation of in vivo reporter constructs

Eight elements were selected for functional assessment, including the ABCA4 promoter, five out of the six cCREs (cCRE2–6) prioritized in the study, as well as two previously identified cCREs by Cherry et al. [ 4 ] (Cherry1/2) that had not been tested in vivo before. Human genomic DNA (Roche) was amplified with the Phusion High Fidelity PCR kit (NEB), following the manufacturer’s instructions, using primers designed to span the ATAC-seq signals (Additional file 9 : Table S6). PCR products were purified with Isolate II PCR and Gel Kit (BIOLINE) and cloned into the entry vector pCR®8/GW/TOPO (#250020 Invitrogen, ThermoFisher Scientific) according to the manufacturer’s instructions. The fragments were then recombined into the destination vector for zebrafish transgenesis using Gateway® LR Clonase® II Enzyme mix (#11791020, Invitrogen, ThermoFisher Scientific), following the manufacturer’s instructions. This vector contains the strong midbrain enhancer z48 and the green fluorescent protein (GFP) reporter gene under the control of the gata2 minimal promoter [ 53 ]. Transformation was performed with MultiShot™ FlexPlate Mach1™ T1R (#C8681201, Invitrogen, ThermoFisher Scientific), and cells were grown overnight at 37 °C. Vector selection was performed with 100 μg/ml ampicillin (#624619.1, Normon). Plasmids were purified with NZYMiniprep kit (#MB010, NZYTech) and validated using Sanger sequencing. Final plasmids were purified with phenol/chloroform (#A931I500 and #C/4920/15, Fisher Chemical) and concentration was determined using Qubit (Invitrogen).

## Functional characterization of cCREs using in vivo enhancer assays in zebrafish

All zebrafish lines were generated through Tol2-mediated transgenesis [ 54 ]. Tol2 cDNA was transcribed by Sp6 RNA polymerase (#EP0131, ThermoFisher Scientific) after Tol2-pCS2FA vector linearization with Not I restriction enzyme (#IVGN0016, Anza, Invitrogen, ThermoFisher Scientific). All constructs were microinjected into the yolk of > 200 wild-type zebrafish embryos at the single-cell stage using the Tol2 transposase system for germline integration of the transgene according to Bessa et al. [ 55 ] with minor modifications. As a readout, GFP fluorescence was observed and its localization was annotated at 1, 2, and 3 days post fertilization (dpf) to evaluate enhancer activity, using GFP expression in the midbrain as transgenesis control.

As GFP reporter expression becomes masked by the pigmentation of the eye as the RPE develops, embryos were also treated with PTU to decrease eye pigmentation [ 56 ].

## Availability of data and materials

All datasets generated in this study have been deposited in NCBI’s Gene Expression Omnibus and are accessible through GEO Series accession number GSE236022 ( https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE236022 ).

## References

Wright AF, Chakarova CF, Abd El-Aziz MM, Bhattacharya SS. Photoreceptor degeneration: genetic and mechanistic dissection of a complex trait. Nat Rev Genet. 2010;11:273–84. https://doi.org/10.1038/nrg2717 .

Letelier J, Bovolenta P, Martínez-Morales JR. The pigmented epithelium, a bright partner against photoreceptor degeneration. J Neurogenet. 2017;31:203–15. https://doi.org/10.1080/01677063.2017.1395876 .

Cowan CS, et al. Cell types of the human retina and its organoids at single-cell resolution. Cell. 2020;182:1623-1640.e34.

Cherry TJ, et al. Mapping the cis-regulatory architecture of the human retina reveals noncoding genetic variation in disease. Proc Natl Acad Sci U S A. 2020;117:9001–12.

Moore JE, et al. Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature. 2020;583:699–710.

Robson MI, Ringel AR, Mundlos S. Regulatory landscaping: how enhancer-promoter communication is sculpted in 3D. Mol Cell. 2019;74:1110–22. https://doi.org/10.1016/j.molcel.2019.05.032 .

Oudelaar AM, Higgs DR. The relationship between genome structure and function. Nat Rev Genet. 2021;22:154–68. https://doi.org/10.1038/s41576-020-00303-x .

Dixon JR, et al. Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature. 2012;485:376–80.

Rao SSP, et al. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell. 2014;159:1665–80.

Winick-Ng W, et al. Cell-type specialization is encoded by specific chromatin topologies. Nature. 2021;599:684–91.

Bonev B, et al. Multiscale 3D genome rewiring during mouse neural development. Cell. 2017;171:557-572.e24.

Marchal C, et al. High-resolution genome topology of human retina uncovers super enhancer-promoter interactions at tissue-specific and multifactorial disease loci. Nat Commun. 2022;13:1–16.

Turro E, et al. Whole-genome sequencing of patients with rare diseases in a national health system. Nature. 2020;583:96–102.

Duncan JL, et al. Inherited retinal degenerations: current landscape and knowledge gaps. Transl Vis Sci Technol. 2018;7:6.

Van de Sompele S, et al. Multi-omics approach dissects cis-regulatory mechanisms underlying North Carolina macular dystrophy, a retinal enhanceropathy. Am J Hum Genet. 2022;109:2029–48.

de Bruijn SE, et al. Structural variants create new topological-associated domains and ectopic retinal enhancer-gene contact in dominant retinitis pigmentosa. Am J Hum Genet. 2020;107:802–14.

Cremers FPM, Lee W, Collin RWJ, Allikmets R. Clinical spectrum, genetic complexity and therapeutic approaches for retinal disease caused by ABCA4 mutations. Prog Retin Eye Res. 2020;79:100861.

Khan M, et al. Resolving the dark matter of ABCA4 for 1054 Stargardt disease probands through integrated genomics and transcriptomics. Genet Med. 2020;22:1235–46.

Bauwens M, et al. ABCA4-associated disease as a model for missing heritability in autosomal recessive disorders: novel noncoding splice, cis-regulatory, structural, and recurrent hypomorphic variants. Genet Med. 2019;21:1761–71.

Ellingford JM, et al. Recommendations for clinical interpretation of variants found in non-coding regions of the genome. Genome Med. 2022;14:73.

Ellingford JM, et al. Molecular findings from 537 individuals with inherited retinal disease. J Med Genet. 2016;53:761–7.

Lee H, et al. Clinical exome sequencing for genetic identification of rare mendelian disorders. JAMA. 2014;312:1880–7.

Spielmann M, Mundlos S. Looking beyond the genes: The role of non-coding variants in human disease. Hum Mol Genet. 2016;25:R157–65.

Mumbach MR, et al. HiChIP: Efficient and sensitive analysis of protein-directed genome architecture. Nat Methods. 2016;13:919–22.

Lenis TL, et al. Expression of ABCA4 in the retinal pigment epithelium and its implications for Stargardt macular degeneration. Proc Natl Acad Sci U S A. 2018;115:E11120–7.

Schwartzman O, et al. UMI-4C for quantitative and targeted chromosomal contact profiling. Nat Methods. 2016;13:685–91.

Galan S, et al. CHESS enables quantitative comparison of chromatin contact data and automatic feature extraction. Nat Genet. 2020;52:1247–55.

Nora EP, et al. Spatial partitioning of the regulatory landscape of the X-inactivation centre. Nature. 2012;485:381–5.

Bhattacharyya S, Chandra V, Vijayanand P, Ay F. Identification of significant chromatin contacts from HiChIP data by FitHiChIP. Nat Commun. 2019;10:4221.

Ratnapriya R, et al. Retinal transcriptome and eQTL analyses identify genes associated with age-related macular degeneration. Nat Genet. 2019;51:606–10. https://doi.org/10.1038/s41588-019-0351-9 .

Van Schil K, et al. Autosomal recessive retinitis pigmentosa with homozygous rhodopsin mutation E150K and non-coding cis-regulatory variants in CRX-binding regions of SAMD7. Sci Rep. 2016;6:21307.

Allikmets R, et al. A photoreceptor cell-specific ATP-binding transporter gene (ABCR) is mutated in recessive Stargardt macular dystrophy. Nat Genet. 1997;15:236–46.

Wang J, et al. ATAC-Seq analysis reveals a widespread decrease of chromatin accessibility in age-related macular degeneration. Nat Commun. 2018;9:1–13.

Thomas ED, et al. Cell-specific cis-regulatory elements and mechanisms of non-coding genetic disease in human retina and retinal organoids. Dev Cell. 2022;57:820-836.e6.

Birnbaum RY, et al. Coding exons function as tissue-specific enhancers of nearby genes. Genome Res. 2012;22:1059–68.

Li X, et al. Pineal photoreceptor cells are required for maintaining the circadian rhythms of behavioral visual sensitivity in zebrafish. PLoS ONE. 2012;7:1–12.

Closser M, et al. An expansion of the non-coding genome and its regulatory potential underlies vertebrate neuronal diversity. Neuron. 2022;110:70-85.e6.

Borsari B, et al. Enhancers with tissue-specific activity are enriched in intronic regions. Genome Res. 2021;31:1325–36.

Pachano T, Haro E, Rada-Iglesias A. Enhancer-gene specificity in development and disease. Development. 2022;149:dev186536.

Perry MW, Boettiger AN, Levine M. Multiple enhancers ensure precision of gap gene-expression patterns in the Drosophila embryo. Proc Natl Acad Sci U S A. 2011;108:13570–5.

Matelot M, Noordermeer D. Determination of high-resolution 3D chromatin organization using circular chromosome conformation capture (4C-seq). Methods Mol Biol. 2016;1480:223–41.

Durand NC, et al. Juicer provides a one-click system for analyzing loop-resolution Hi-C experiments. Cell Syst. 2016;3:95–8.

Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25:1754–60.

Kruse K, Hug CB, Vaquerizas JM. FAN-C: a feature-rich framework for the analysis and visualisation of chromosome conformation capture data. Genome Biol. 2020;21:303.

Serra F, et al. Automatic analysis and 3D-modelling of Hi-C data using TADbit reveals structural features of the fly chromatin colors. PLoS Comput Biol. 2017;13:1–17.

Flores O, Orozco M. nucleR: a package for non-parametric nucleosome positioning. Bioinformatics. 2011;27:2149–50.

Zhang Y, et al. Model-based Analysis of ChIP-Seq (MACS). Genome Biol. 2008;9:R137.

Wu T, et al. clusterProfiler 4.0: a universal enrichment tool for interpreting omics data. Innovation. 2021;2:100141.

CAS   PubMed   PubMed Central   Google Scholar

Uhlén M, et al. Tissue-based map of the human proteome. Science. 2015;347:1260419.

Gu Z, Eils R, Schlesner M. Complex heatmaps reveal patterns and correlations in multidimensional genomic data. Bioinformatics. 2016;32:2847–9.

Abascal F, et al. Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature. 2020;583:699–710.

Granja JM, et al. Author Correction: ArchR is a scalable software package for integrative single-cell chromatin accessibility analysis. Nat Genet. 2021;53(3):403–11. https://doi.org/10.1038/s41588-021-00790-6 . Nat Genet 53, 935 (2021).

Gehrke AR, et al. Deep conservation of wrist and digit enhancers in fish. Proc Natl Acad Sci U S A. 2015;112:803–8.

Kawakami K, et al. A transposon-mediated gene trap approach identifies developmentally regulated genes in zebrafish. Dev Cell. 2004;7:133–44.

Bessa J, et al. Zebrafish Enhancer Detection (ZED) vector: A new tool to facilitate transgenesis and the functional analysis of cis-regulatory regions in zebrafish. Dev Dyn. 2009;238:2409–17.

Karlsson J, Von Hofsten J, Olsson PE. Generating transparent zebrafish: A refined method to improve detection of gene expression during embryonic development. Mar Biotechnol. 2001;3:522–7.

## Acknowledgements

We thank the Core Zebrafish Facility Ghent (ZFG) and Dr. Andy Willaert for their expert technical assistance.

## Review history

The review history is available as Additional file 11 .

## Peer review information

Tim Sands was the primary editor of this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

## Funding

This work was supported by the Ghent University Special Research Fund (BOF20/GOA/023) (E.D.B.); H2020 MSCA ITN grant (No. 813490 StarT) (E.D.B., M.B., J.M.M., J.J.T., J.L. G.-S.), EJPRD19-234 Solve-RET (E.D.B., J.M.M., J.J.T., J.L. G.-S.), FWO research project G0A9718N (to E.D.B., M.B.), Foundation John W. Mouton Pro Retina & Marie-Claire Liénaert (to E.D.B., E.D., S.V.), UGent Fund Alzheimer and Neurodegenerative Diseases (to E.D.). E.D.B. is a Senior Clinical Investigator (1802220N) of the Research Foundation-Flanders (FWO); V.L.S., A.D.R., and S.K. are Early Starting Researchers of StarT (grant No. 813490). E.D. is supported by a postdoctoral grant from the Research Foundation Flanders (FWO 12D8523N). P.M.M.G. was funded by a postdoctoral fellowship from Junta de Andalucía (DOC_00397). E.D.B. is a member of ERN-EYE (Framework Partnership Agreement No 739534-ERN-EYE).

## Author information

Eva D’haene, Víctor López-Soriano and Pedro Manuel Martínez-García contributed equally to this work.

Juan Ramón Martínez-Morales, Miriam Bauwens, Juan Jesús Tena and Elfride De Baere contributed equally to this work.

José Luis Gómez-Skarmeta is deceased.

## Authors and Affiliations

Department of Biomolecular Medicine, Ghent University, Ghent, Belgium

Eva D’haene, Víctor López-Soriano, Alfredo Dueñas Rey, Stijn Van de Sompele, Lies Vantomme, Quinten Mahieu, Sarah Vergult, Miriam Bauwens & Elfride De Baere

Center for Medical Genetics, Ghent University Hospital, Ghent, Belgium

Centro Andaluz de Biología del Desarrollo, Consejo Superior de Investigaciones Científicas and Universidad Pablo de Olavide, Sevilla, Spain

Pedro Manuel Martínez-García, Soraya Kalayanamontri, Ana Sousa-Ortega, Silvia Naranjo, Ana Neto, José Luis Gómez-Skarmeta, Juan Ramón Martínez-Morales & Juan Jesús Tena

## Contributions

E.D. performed Hi-C experiments, Hi-C data processing, and downstream analyses. P.M.M.G. performed HiChIP data processing and downstream analyses. V.L.S. performed eye dissections, UMI-4C experiments, and integrated public epigenomic and scRNA-seq datasets. E.D., V.L.S., and P.M.M.G. integrated and interpreted the data and wrote the manuscript. A.D.R. performed eye dissections and analyzed scRNA-seq data. S.V.S. performed UMI-4C optimization. L.V. performed Hi-C experiments. Q.M. performed cloning for transgenesis assays. S.K. and A.N. performed HiChIP experiments. S.K. and S.N. were in charge of transgenesis assays. A.S. was responsible for confocal imaging. S.V. aided in data interpretation and writing of the manuscript. J.L.G.S., J.M.M., M.B., J.J.T., and E.D.B. conceived the project, secured funding, and contributed to data interpretation and the writing of the manuscript. All authors, except the late J.L.G.S., reviewed and approved the final version of the manuscript.

Twitter handles: @elfridedebaere (Elfride De Baere).

## Corresponding authors

Correspondence to Eva D’haene , Juan Ramón Martínez-Morales , Miriam Bauwens , Juan Jesús Tena or Elfride De Baere .

## Ethics declarations

Ethics approval and consent to participate.

Human donor eyes were obtained through the Tissue Bank of Ghent University Hospital and Antwerp University Hospital under the ethical approval of the Ethics Committee of Ghent University (2018/1072, B670201837286). Animal experiments were approved by the Animal Experimentation Ethics Committees at the Pablo de Olavide University and CSIC (license number 02/04/2018/041).

Not applicable.

## Competing interests

The authors declare that they have no competing interests.

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Supplementary Information

Supplementary Figures S1-22 (.pdf).

BED file with Hi-C boundaries in neural retina (25 kb resolution).

BED file with Hi-C boundaries in RPE/choroid (25 kb resolution).

BED file with Hi-C CHESS regions with differential 3D topology between neural retina vs. RPE/choroid.

BEDPE file with Hi-C loops in neural retina.

BEDPE file with Hi-C loops in RPE/choroid.

BEDPE file with differential Hi-C loops gained in neural retina.

BEDPE file with differential Hi-C loops gained in RPE/choroid.

Supplementary Tables S1-6 (.xlsx).

BED file with stable and differential HiChIP loops in neural retina and RPE/choroid, with annotation of interacting cis -regulatory elements (CREs, from [ 4 ]).

Peer review history.

## Rights and permissions

D’haene, E., López-Soriano, V., Martínez-García, P.M. et al. Comparative 3D genome analysis between neural retina and retinal pigment epithelium reveals differential cis -regulatory interactions at retinal disease loci. Genome Biol 25 , 123 (2024). https://doi.org/10.1186/s13059-024-03250-6

Accepted : 17 April 2024

Published : 17 May 2024

DOI : https://doi.org/10.1186/s13059-024-03250-6

• 3D genome structure
• Neural retina
• Retinal pigment epithelium (RPE)
• Inherited retinal disease (IRD)
• Cis -regulatory element (CRE)
• Enhancer assay

## Genome Biology

ISSN: 1474-760X

## Chromosome-level genome assembly and characterization of the Calophaca sinica genome

Jianting Cao, Hui Zhu and Yingqi Gao contributed equally to this work.

Jianting Cao, Hui Zhu, Yingqi Gao, Yue Hu, Xuejiao Li, Jianwei Shi, Luqin Chen, Hao Kang, Dafu Ru, Baoqing Ren, Bingbing Liu, Chromosome-level genome assembly and characterization of the Calophaca sinica genome, DNA Research , Volume 31, Issue 3, June 2024, dsae011, https://doi.org/10.1093/dnares/dsae011

Calophaca sinica is a rare plant endemic to northern China which belongs to the Fabaceae family and possesses rich nutritional value. To support the preservation of the genetic resources of this plant, we have successfully generated a high-quality genome of C. sinica (1.06 Gb). Notably, transposable elements (TEs) constituted ~73% of the genome, with long terminal repeat retrotransposons (LTR-RTs) dominating this group of elements (~54% of the genome). The average intron length of the C. sinica genome was noticeably longer than what has been observed for closely related species. The expansion of LTR-RTs and elongated introns had the largest influence on the enlarged genome size of C. sinica in comparison to other Fabaceae species. The proliferation of TEs could be explained by certain modes of gene duplication, namely, whole genome duplication (WGD) and dispersed duplication (DSD). Gene family expansion, which was found to enhance genes associated with metabolism, genetic maintenance, and environmental stress resistance, was a result of transposed duplicated genes (TRD) and WGD. The presented genomic analysis sheds light on the genetic architecture of C. sinica and provides a starting point for future evolutionary biology, ecology, and functional genomics studies centred around C. sinica and closely related species.

Calophaca sinica Rehd. 1933 (2n = 16) is a perennial upright shrub belonging to the Fabaceae family that depends on insect-mediated cross-pollination for successful reproduction. Of the approximately ten species representing the Calophaca genus, three are indigenous to China, namely, Calophaca sinica , Calophaca chinensis , and Calophaca soongorica . 1 Of these, Calophaca sinica has a narrow range, which is restricted to southern regions of the Yin Mountains in Inner Mongolia and central-southern parts of Shanxi in northern China. Currently, both the habitat range and population size of C. sinica are diminishing, which means that this plant is a rare and endangered species endemic to northern China. 2

Research findings have underscored the exceptional nutritional value of Calophaca sinica seeds, which include an array of vital nutrients such as proteins, carbohydrates, and sugars. 1 Additionally, Calophaca sinica exhibits remarkable traits, such as cold resistance, drought tolerance, and adaptability to nutrient-poor soils, which enable it to survive in challenging environments and significantly contribute to soil and water conservation efforts. 3 Despite both the economic and ecological significance of this species, along with its current vulnerable status, our understanding of Calophaca sinica, especially with respect to detailed genomic information, remains deficient.

This knowledge gap underscores the urgent need for comprehensive genomic studies, which will provide insight that is vital for both the development of conservation strategies for this invaluable species and the identification of agronomic traits and molecular breeding strategies. Therefore, we have assembled and annotated the C. sinica genome using long reads obtained from Oxford Nanopore (ONT) sequencing and short reads obtained from Illumina sequencing. This effort resulted in a genome assembly of 1.06 Gb with a contig N50 of 11.91 Mb. Using the Hi-C data, we anchored 91.62% of the assembled bases to eight pseudo-chromosomes, which improved the scaffold N50 to 136.60 Mb. The genome size, which is noticeably larger than the genomes of related species, may be attributed to the expansion of transposable elements (TEs), which constitute 73% of the genome. Through our analyses, we identified two significant whole-genome duplication (WGD) events that were predominantly responsible for the presence of duplicated genes; we also found cases of dispersed duplication (DSD). Transposed duplicated genes (TRD) and WGD were the primary factors responsible for gene family expansion. The genes that are linked to adaptation, biosynthesis, and genetic maintenance were delineated. The comprehensive insights gained from this study are invaluable for the future conservation and sustainable utilization of this unique plant species.

## 2.1. Sample collection and sequencing

Prior to genome sequencing, fresh leaf tissue was collected from C. sinica (ID: TY200618) plants growing on Tianlong Mountain (37.72152°N, 112.42146°E), Shanxi Province, China; the samples were immediately preserved in liquid nitrogen ( Fig. 1a ). For Oxford Nanopore Technology (ONT) library preparation, high-molecular-weight genomic DNA was prepared using the modified cetyltrimethylammonium bromide (CTAB) method 4 and subsequently purified with the QIAGEN® Genomic kit (QIAGEN, Shanghai, China) according to the manufacturer’s standard protocol. Sequencing was performed on a PromethION sequencer (Oxford Nanopore Technologies, Oxford, UK), while base-calling was carried out using the official tool, Guppy v3.2.2 + 9fe0a78, 5 which converted the electrical signals generated by DNA strands that passed through nanopores into the corresponding nucleotide sequences. Raw reads with an average qscore_template < 7 were filtered out, resulting in a dataset of 90,682,339,206 bp of long reads (Supplementary Table S1 ). In the following step, DNA was extracted from the leaves of the same individual using the CTAB method. Paired-end sequencing libraries were then constructed for PCR amplification. Next, sequencing was carried out in 150 bp paired-end mode on the Illumina HiSeq 4000 platform, adhering to Illumina’s suggested workflow. This resulted in a short-read dataset of 66,667,737,000 bp (Supplementary Table S1 ). The raw reads were filtered using FASTP v0.20.0, 6 producing 59,913,561,246 bp of clean reads. To improve assembly to the chromosome scale, young leaves from the same plant were sequenced via the Illumina NovaSeq 6000 platform (Illumina, San Diego, CA); the resulting 130,655,238,000 bp of Hi-C data were filtered using FASTP v0.20.0 6 (Supplementary Table S1 ), yielding 130,187,994,618 bp of clean reads. Finally, fresh leaf samples from the same individual plant were collected for RNA sequencing. The Qiagen RNeasy Plant Mini kit was used to extract the total RNA from each sample. Subsequently, RNA-seq libraries were constructed using the TruSeq RNA Library Preparation kit and sequenced on the Illumina NovaSeq 6000 platform, resulting in 12,042,424,500 bp of RNA-seq reads (Supplementary Table S1 ).

Figure 1. Genomic features and genome assembly of C. sinica. (a) The genomic landscape of C. sinica. Concentric circles, outer to inner, show (I) chromosome characteristics, (II) GC density, (III) gene density, (IV) distribution of Gypsy retrotransposons, (V) distribution of Copia retrotransposons, (VI) LTR density. (b) Heat map of chromatin contact matrices generated by aligning the Hi-C dataset to the C. sinica genome. (c) Genomic CEGMA assessment results. (d) Genomic BUSCO assessment results.

## 2.2. Estimation of the genome size

To estimate the size and heterozygosity of the C. sinica genome, we performed a 17-mer analysis (a K-mer is a subsequence of length K bp) based on 59.91 Gb of clean Illumina DNA data. Jellyfish v2.3.0 7 was first used to count K-mer frequencies, with the resulting frequency distribution shown in Supplementary Fig. S1 . We then calculated the genome size according to the formula: genome size = K-num/K-depth, where K-num is the total number of K-mers and K-depth is the K-mer depth at the main peak of the distribution. GenomeScope v1.1.1 8 was also applied to the frequency distribution to estimate the heterozygosity of the genome. The final estimated genome size was 1.09 Gb with 0.9% heterozygosity (Supplementary Fig. S2 ; Supplementary Table S2 ).
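The relationship genome size = K-num/K-depth can be illustrated with a minimal sketch. It assumes a K-mer histogram of the kind a K-mer counter can export (multiplicity mapped to the number of distinct K-mers). The function name, the low-depth error cutoff, and the toy histogram are illustrative assumptions, not the authors' procedure or data.

```python
def kmer_genome_size(histogram, min_depth=2):
    """Estimate genome size as K-num / K-depth.

    histogram: dict mapping k-mer multiplicity -> number of distinct k-mers.
    min_depth: multiplicities below this are treated as sequencing errors
               (an assumed, simplistic cutoff).
    """
    kept = {d: n for d, n in histogram.items() if d >= min_depth}
    if not kept:
        raise ValueError("no k-mers left after error filtering")
    # K-num: total number of k-mer occurrences after discarding likely errors.
    k_num = sum(d * n for d, n in kept.items())
    # K-depth: multiplicity at the main peak of the filtered distribution.
    k_depth = max(kept, key=kept.get)
    return k_num / k_depth, k_depth


# Toy histogram (hypothetical values): an error spike at depth 1
# and a main peak at depth 10.
toy_hist = {1: 500, 2: 10, 9: 60, 10: 100, 11: 60}
size, depth = kmer_genome_size(toy_hist)
```

Real distributions from heterozygous genomes usually show an additional peak at about half the main depth, which is one reason model-based tools such as GenomeScope are applied alongside this simple ratio.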

## 2.3. Assembly of the C. sinica genome and chromosome construction

The 90.68 Gb of clean ONT single-molecule long reads, generated by the aforementioned sequencing steps, were used for genome assembly. Since ONT reads may have a high error rate, we first used the NextCorrect module of NextDenovo v2.3.1 ( https://github.com/Nextomics/NextDenovo ) with default parameters to correct the ONT reads and obtained 31.69 Gb (~29.20 × coverage) of consensus sequences (CNS). The NextGraph module was then used to assemble the corrected reads into a preliminary genome assembly. Finally, Racon v1.3.1 (using ONT long reads; https://github.com/isovic/racon.git ) and Nextpolish v1.3.0 9 (using Illumina short reads) were used to correct any errors in the preliminary genome. After obtaining the corrected draft genome, we used the Hi-C data to further enhance assembly of the genome to the chromosome scale. Before mapping the reads, the Hi-C data were processed with HiC-Pro v2.8.1, 10 run under default parameters. Next, bowtie2 v2.3.2 11 was used to map clean Hi-C reads to the corrected draft genome with the parameters ‘--end-to-end --very-sensitive -L 30’ (Supplementary Table S6 ). In the following step, the contigs were further clustered, sorted, and oriented to chromosomes based on the mapping results using LACHESIS 12 with the parameters ‘cluster min re sites = 100, cluster noninformative ratio = 1.4, cluster max link density = 2.5, order min n res in trunk = 60, order min n res in shreds = 60’. Finally, any incorrectly positioned chromosome segments were manually corrected using Juicebox v2.13.07 ( https://github.com/aidenlab/Juicebox ).

After obtaining the chromosome-level genome, we used the following methods for genome assembly quality assessment: (i) BUSCO v4.0.5 13 was used to assess the completeness of genome assembly, with the embryophyta_odb10 database serving as the evaluation background; (ii) Burrows-Wheeler Aligner (BWA) v0.7.12-r1039 14 was used to map the short-read data to the genome to assess the coverage and mapping rate of Illumina short-read data; and (iii) CEGMA v2 15 was used to query the database of 248 core genes to obtain information about the core genes in the obtained genome and thus assess genome completeness.

## 2.4. Repetitive DNA annotation

During repeat sequence annotation, we first used GMATA v2.2, 16 run under default parameters, to search the genome for simple sequence repeats (SSRs) with short repeat units. After the identified SSRs were soft-masked (repeat sequences are marked in lower case), tandem repeat sequences (TRs) within the genome were identified using TRF v4.07b. 17 Simple and tandem repeat sequences were then soft-masked to avoid conflicts between TRs and TEs during the annotation of transposable elements. Next, MITE-hunter ( https://github.com/jburnette/MITE-Hunter ), run under the parameter ‘-n 20, -P 0.2, -c 3’, was used to construct a MITE library for the identification of small MITE transposons within the genome. In addition, long terminal repeat retrotransposons (LTR-RTs) in the C. sinica genome were identified based on LTR sequence features using LTR_Finder v1.0.7 18 and LTRharvest, 19 after which LTR_retriever v2.9.0 20 was applied to integrate the results of the first two LTR software programs, resulting in an LTR repeat sequence library. We then integrated the MITE library and the LTR library into a TE library file (TE.lib) and performed a hard-mask (repeat sequences marked as N) of the genome. After hard-masking, the remaining repeat sequences in the C. sinica genome were identified using RepeatModeler v2.0 21 with default parameters, and a de novo library (Denovo.lib) was constructed. As this library contained a large number of unknown repeat sequences, we proceeded to classify the repeat sequences using TEclass. 22 Finally, TE.lib, Denovo.lib, and the Repbase ( http://www.girinst.org/repbase ) library that was obtained using RepeatMasker v1.331 23 were integrated into a total library, which was input into RepeatMasker, run under the parameters ‘-nolow -no_is -norna’, to complete the search for repeat sequences within the C. sinica genome. After obtaining the repetitive DNA annotation results, we estimated the insertion time ( T ) based on T  =  K /2μ (where K is the divergence rate and μ is the nucleotide substitution rate), with only full-length long terminal repeat retrotransposons included in the calculations (the nucleotide substitution rate used was the LTR_Finder default value of 1.3e-8). The divergence rate ( K ) between 5ʹ LTR and 3ʹ LTR sequences was calculated using DNAdist, a program within PHYLIP v3.698 ( https://phylipweb.github.io/phylip/ ).
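The insertion-time formula T = K/2μ can be worked through in a few lines. This is a minimal sketch under stated assumptions: μ is taken as 1.3e-8 substitutions per site per year as quoted above, the function name is hypothetical, and the divergence value is a made-up example rather than a result from the study.

```python
def ltr_insertion_time(k, mu=1.3e-8):
    """Return the insertion time in years, T = K / (2 * mu).

    k:  divergence between the 5' and 3' LTRs of one element
        (substitutions per site).
    mu: nucleotide substitution rate (assumed per site per year).
    """
    if k < 0:
        raise ValueError("divergence must be non-negative")
    if mu <= 0:
        raise ValueError("substitution rate must be positive")
    # The factor 2 reflects that both LTRs accumulate substitutions
    # independently after insertion.
    return k / (2.0 * mu)


# Hypothetical element whose two LTRs differ at 2.6% of sites.
t_years = ltr_insertion_time(0.026)
t_mya = t_years / 1e6  # about one million years
```

Because the two LTRs of an element are identical at the time of insertion, a smaller K implies a more recent insertion.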

## 2.5. Gene annotation

To improve the reliability of the genome annotation process, repeat sequences in the C. sinica genome were masked before genome annotation. We employed three complementary approaches in the prediction of protein-coding genes: transcriptome-based prediction; homology-based prediction; and ab initio prediction. In transcriptome-based prediction, clean RNA-seq reads, which had been filtered by FASTP v0.20.0, 6 were mapped to the C. sinica genome using STAR v2.7.3a. 24 Transcript assembly was performed using StringTie v1.3.4d, 25 and the assembled transcripts were further processed using the Program to Assemble Spliced Alignments (PASA) v2.3.3 26 to obtain a transcriptome-based gene set. In homology-based prediction, we used GeMoMa v1.6.1 27 to align homologous protein sequences with the genome. This tool can utilize homologous protein sequence information from known species to deduce gene boundaries and exon-intron structures of target species. By comparing these known homologous protein sequences with the C. sinica genome and integrating the prediction results from multiple species, we produced the homologous protein-predicted gene set. Next, we randomly selected 3000 genes from the gene structure prediction model derived from transcriptome-based prediction. This selection was used for model training with AUGUSTUS v3.3.1 28 to generate a gene prediction model for C. sinica . More specifically, the chosen 3000 genes were randomly divided into test and training sets. The initial training was conducted using AUGUSTUS with default parameters. Subsequently, the optimize_augustus.pl script provided by AUGUSTUS was used for iterative training to acquire the optimal training model parameters. A final retraining was conducted to obtain the definitive gene prediction model. Following this, we used AUGUSTUS to predict the ab initio gene set based on the resulting model. 
Finally, the gene prediction results obtained from these three methods were integrated using the EVidenceModeler (EVM) v1.1.1 26 pipeline to obtain an initial integrated gene set for C. sinica gene annotation. TransposonPSI ( http://transposonpsi.sourceforge.net/ ) was then used to align and remove genes with coding errors, resulting in the final consensus gene set. Once the gene prediction results had been obtained, BUSCO was used to evaluate the completeness of gene prediction, and Circos v0.69-8 29 was used to display basic genomic indicators.
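The random selection of 3000 genes and their division into training and test sets, described above for AUGUSTUS training, could be sketched as follows. The split size (300 test genes), the seed, the gene identifiers, and the function name are all assumptions made for illustration; the source does not state the proportions used.

```python
import random


def split_training_genes(gene_ids, n_select=3000, n_test=300, seed=0):
    """Randomly pick n_select genes and split them into train/test sets.

    n_test and seed are hypothetical choices, not values from the paper.
    """
    gene_ids = list(gene_ids)
    if n_select > len(gene_ids):
        raise ValueError("not enough genes to sample from")
    if n_test >= n_select:
        raise ValueError("test set must be smaller than the selection")
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    selected = rng.sample(gene_ids, n_select)  # sampling without replacement
    test_set = selected[:n_test]
    train_set = selected[n_test:]
    return train_set, test_set


# Hypothetical identifiers standing in for transcriptome-derived gene models.
genes = [f"gene{i:05d}" for i in range(10000)]
train_set, test_set = split_training_genes(genes)
```

Keeping the test genes out of every training iteration is what allows the accuracy of the optimized model to be judged on unseen gene structures.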

The identification of non-coding RNA (ncRNA) sequences employed two distinct strategies: database searching and model-based prediction. In database searching, Infernal v1.1.2 30 was used to detect genomic ncRNAs based on searches in the Rfam database. 31 In model-based prediction, tRNAscan-SE v2.0 32 was used to predict transfer RNA (tRNA) sequences, with the eukaryotic parameter selected for model-based predictions. Additionally, RNAmmer v1.2 33 was used to predict ribosomal RNA (rRNA), along with the corresponding subunits.

## 2.6. Functional annotation of protein-coding genes

To identify gene functions, we first used BLASTP v2.7.1 ( E -value < 1 × 10⁻⁵) 34 to annotate the functions of protein-coding genes based on items in the Swiss-Prot, 35 Eukaryotic Orthologous Groups of protein (KOG), 36 and NCBI non-redundant protein (NR; https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/ ) databases. In addition, structural domains and Gene Ontology (GO) annotations of predicted protein-coding genes were obtained using InterProScan v5.32-71.0. 37 Next, for Kyoto Encyclopedia of Genes and Genomes (KEGG) annotation, pathway items for each gene were assigned according to the KEGG database 38 via the bidirectional best hit (BBH) method, which is available on the KAAS website ( https://www.genome.jp/tools/kaas/ ). As a final step, we merged the annotation results from each database into a combined set of annotation results and calculated the final functional annotation rates for all databases.

## 2.7. Phylogeny and evolution of gene families

To determine the phylogenetic position of C. sinica , we collected genome data from 15 sequenced eudicot species (Supplementary Table S17 ), including 11 species of Fabaceae ( Arachis duranensis , Cajanus cajan , Cicer arietinum , Entada phaseoloides , Glycine max , Lotus japonicus , Lupinus albus , Medicago truncatula , Robinia pseudoacacia , Sophora japonica , Vigna unguiculata ), one species of Brassicaceae ( Arabidopsis thaliana ), one species of Salicaceae ( Populus trichocarpa ), and two species of Rosaceae ( Malus domestica , Prunus persica ). Before clustering, gene selection was performed for all of the analysed species, with only the longest transcript of each gene retained for subsequent analysis. OrthoFinder v2.5.4 39 was then run under the parameter ‘-S blast, -M msa, -T fasttree, -A mafft’ to infer orthogroups (gene families) for these species; this resulted in 157 strict single-copy genes. MAFFT v7.45 40 was further used for the multiple sequence alignment of protein sequences within each single-copy gene set. Then, PAL2NAL v14 41 was utilized to convert the protein multiple sequence alignments into nucleotide multiple sequence alignments based on coding sequence (CDS) information, and trimAl v1.2rev59 42 was used to remove gap positions (positions with gaps in 50% or more of the sequences were considered as alignment gaps). In the next step, IQ-TREE v2.1.4-beta, 43 run with the parameter ‘-b 100’, was employed to construct a maximum likelihood (ML) phylogenetic tree for each set of orthologous single-copy genes based on its alignment. Finally, ASTRAL-III v5.6.3 44 was used to infer the phylogenetic relationships among species by combining the multiple gene trees into a species tree.
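The gap-removal rule stated above (drop alignment positions with gaps in 50% or more of the sequences) can be sketched in plain Python. This is an illustrative re-implementation of the rule only, not the trimAl algorithm or its interface; the function name and the toy alignment are assumptions.

```python
def trim_gappy_columns(alignment, max_gap_fraction=0.5, gap_char="-"):
    """Drop columns whose gap fraction is >= max_gap_fraction.

    alignment: list of equal-length aligned sequences (strings).
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have equal length")
    n = len(alignment)
    # Keep a column only if strictly fewer than half of its cells are gaps.
    keep = [
        i for i in range(length)
        if sum(seq[i] == gap_char for seq in alignment) / n < max_gap_fraction
    ]
    return ["".join(seq[i] for i in keep) for seq in alignment]


# Toy alignment (hypothetical): column 4 has gaps in 3 of 4 sequences.
toy = ["ATG-CA", "A-G-CA", "ATGTC-", "AT--CA"]
trimmed = trim_gappy_columns(toy)
```

Columns with only occasional gaps are retained, so individual sequences in the trimmed alignment may still contain gap characters.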

The MCMCTREE pipeline of the PAML v4.9 45 program package was used to estimate the divergence times for the 16 analysed species. We also collected five fossil constraints from the TimeTree database ( http://www.timetree.org/ ) to calibrate the divergence times: Arabidopsis thaliana and Populus trichocarpa [102.0-112.5 million years ago (Ma)]; Populus trichocarpa and Prunus persica [99.0-111.3 Ma]; Prunus persica and Malus domestica [34.4-67.2 Ma]; Lupinus albus and Cajanus cajan [48.6-79.1 Ma]; Glycine max and Vigna unguiculata [19.5-28.8 Ma]. After constructing the phylogenetic relationships between C. sinica and the other plant species, the results from OrthoFinder were input into CAFE v4.2.1 46 to determine the expansion and contraction of gene families. In addition, the functional enrichment of expanded gene families was analysed through the KOBAS website ( http://bioinfo.org/kobas ).

## 2.8. Identification of genome synteny and whole-genome duplication (WGD)

We used WGDI 47 to analyse whole-genome duplication events in the C. sinica genome. The genomes of three legume species were used in the WGD analysis: C. sinica , G. max , and M. truncatula . ColinearScan 48 was first employed to search for collinear genes between and within genomes, and a collinear dot plot was generated to determine the proportions of collinear blocks between the genomes of different species. Then, the YN00 program within the PAML software package was used to calculate the synonymous substitution ( K s) values between collinear gene pairs based on the Nei-Gojobori method. Next, WGDI (with the parameter ‘-kp’) was run to extract collinear blocks into different groups based on the distribution range of K s values for subsequent peak fitting. Finally, WGDI (with the parameter ‘-pf’) was used to fit the K s peaks, with each homologous block represented by its median K s value. In addition, the proportional relationships between collinear blocks were calculated and displayed via JCVI ( https://github.com/tanghaibao/jcvi ).

To further study the evolution of the C. sinica genome, genome-wide genes were identified and categorized into five types of duplicates: tandem duplicated (TD) genes, transposed duplicated (TRD) genes, whole-genome duplicated (WGD) genes, proximal duplicated (PD) genes, and dispersed duplicated (DSD) genes. The last category (DSD) was a catchall class for all duplication modes other than TD, TRD, WGD, and PD. The analysis was performed using DupGen_finder 49 with default parameters. Each duplicated gene was assigned to a unique duplication mode, and M. truncatula was used as the outgroup species.

## 3.1. Chromosome-level genome assembly

To generate the C. sinica (2n = 2x = 16) chromosome-level genome (estimated genome size of ~1.09 Gb; Supplementary Table S2 ; Supplementary Figs. S1 and S2 ), we employed multiple sequencing technologies and assembly strategies (see Methods). A total of 90.68 Gb of Oxford Nanopore long-read data (~83.56 × coverage), 66.67 Gb of Illumina short-read data (~61.43 × coverage), and 130.66 Gb of Hi-C paired-end reads (~120.39 × coverage) were generated for genome assembly (Supplementary Table S1 ). Only filtered ONT long reads were used for the preliminary assembly, which consisted of 279 contigs with a contig N50 of 11.5 Mb (Supplementary Table S3 ). Subsequently, the preliminary assembly draft was polished using both ONT and Illumina clean data, resulting in a contig-level genome assembly of 1.06 Gb, which was slightly smaller than the estimated genome size, with 275 contigs and a contig N50 of 11.9 Mb (Supplementary Table S3 ); these results indicated that the assembly showed high contiguity. Furthermore, to improve the C. sinica genome assembly, we employed Hi-C data (Supplementary Tables S5 and S6 ) to upgrade the assembly to the chromosome level. As a result, 1.03 Gb (97.75%) of contigs were successfully oriented and anchored onto eight chromosomes (Supplementary Table S7 ; Fig. 1b ), which ranged in size from 85.53 Mb to 142.65 Mb with a scaffold N50 of 136.60 Mb (Supplementary Table S8 ).
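The contig and scaffold N50 values quoted above follow the standard definition: the length L such that sequences of length L or longer together contain at least half of the assembled bases. A minimal sketch of that calculation is given below; the function name and the toy contig lengths are illustrative and are not taken from the C. sinica assembly.

```python
def n50(lengths):
    """Return N50: the length L such that sequences of length >= L
    together cover at least half of the total assembly length."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence downwards until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy contig lengths (hypothetical, arbitrary units); total = 300.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
toy_n50 = n50(toy_contigs)
```

Joining contigs into longer scaffolds shifts the halfway point onto longer sequences, which is why anchoring with Hi-C data raises the scaffold N50 far above the contig N50.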

Various methods were employed to validate the quality of the genome assembly. First, Illumina short reads were mapped to the assembled genome, achieving a mapping rate of 99.62% (Supplementary Table S4 ). Additionally, the GC content and sequencing depth were found to follow a Poisson distribution (Supplementary Fig. S3 ), which is indicative of high accuracy and the absence of external contamination. Moreover, a BUSCO evaluation revealed that 98.14% of the conserved genes were complete ( Fig. 1d ; Supplementary Table S9 ). In addition, a CEGMA evaluation demonstrated that 95.56% of eukaryotic ultra-conserved gene families were detected in the assembly ( Fig. 1c ; Supplementary Table S10 ). These results provide clear evidence for the high completeness and accuracy of the genome assembly.

## 3.2. Repetitive DNA and ncRNA annotation

We identified 1,919,930 repetitive elements (total length 768.78 Mb) in the C. sinica genome, which accounted for 72.63% of the genome (Supplementary Table S11 ); this proportion is considerably higher than the proportions reported for the S. japonica (53.13%), 50 R. pseudoacacia (59.47%), 51 and C. arietinum (60%) 52 genomes. Transposable elements dominated the identified repeat sequences, accounting for 69.58% of the genome. More specifically, long terminal repeat retrotransposons represented the most abundant type of TEs, accounting for 77.5% of the repeat sequences. Gypsy (210.48 Mb; 19.89% of the genome) and Copia (173.75 Mb; 16.42% of the genome) were the largest superfamilies among the LTRs. In addition to repeat elements, we also predicted 352 microRNAs (miRNAs), 897 transfer RNAs (tRNAs), 2,248 small nuclear RNAs (snRNAs), and 223 ribosomal RNAs (rRNAs; Supplementary Table S14 ).

## 3.3. C. sinica genome annotation

By integrating ab initio prediction, homology-based prediction, and transcriptome prediction, we annotated a total of 28,977 protein-coding genes in the C. sinica genome (Supplementary Table S12 ), with 99.09% (28,713) of these genes anchored to chromosomes. The average length of these protein-coding genes was 5,676.14 bp, and the average CDS length was 1,228.35 bp (Supplementary Table S13 ). The average number of exons per gene was 5.68, with an average exon length of 216.41 bp; the average intron length was 951.17 bp. Further analysis of the gene structure distribution revealed a similar pattern to what has been observed in other plant species, e.g. Medicago sativa and A. thaliana (Supplementary Fig. S4 ); this provided evidence that supported the reliability of the annotation results. However, the C. sinica genome showed a significantly longer average gene length (5,676.14 bp) when compared to closely related species, a discrepancy which could be attributed to the long average intron length (951.17 bp; Supplementary Table S13 ).
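The contribution of introns to the average gene length can be checked with back-of-the-envelope arithmetic on the averages reported above. The sketch below assumes that a gene with n exons has n - 1 introns and that multiplying averages is an acceptable approximation (strictly, the mean of a product is not the product of the means), so the result is indicative only.

```python
# Averages reported in the text (Supplementary Table S13).
mean_exons_per_gene = 5.68
mean_exon_length_bp = 216.41
mean_intron_length_bp = 951.17
reported_gene_length_bp = 5676.14

# Assumption: a gene with n exons contains n - 1 introns.
mean_introns_per_gene = mean_exons_per_gene - 1

exonic_bp = mean_exons_per_gene * mean_exon_length_bp
intronic_bp = mean_introns_per_gene * mean_intron_length_bp
approx_gene_length_bp = exonic_bp + intronic_bp
intron_share = intronic_bp / approx_gene_length_bp

print(f"approximate gene length: {approx_gene_length_bp:.1f} bp")
print(f"intronic share of gene length: {intron_share:.1%}")
```

Under these assumptions the reconstructed gene length falls within 1% of the reported average, and introns account for roughly four-fifths of it, which is in line with the interpretation that intron length drives the long average gene length.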

In addition, among the 28,977 predicted protein-coding genes, functional annotations retrieved from the Swiss-Prot, KEGG, KOG, GO, and NR databases covered 75.14% (21,772 genes), 38.36% (11,115 genes), 51.90% (15,040 genes), 55.71% (16,144 genes), and 92.87% (26,910 genes) of the predicted genes, respectively (Supplementary Fig. S5 ; Supplementary Table S16 ). A total of 27,145 genes (93.68%) were assigned putative functional annotations (Supplementary Table S16 ). A BUSCO evaluation showed that 95.66% of the highly conserved plant genes (1,544 in total) were completely present in the genome (the percentages of single-copy, duplicated, fragmented, and missing genes were 92.07%, 3.59%, 1.30%, and 3.04%, respectively; Supplementary Table S15 ). These results clearly indicate that the annotated C. sinica gene set is relatively robust and accurate.

## 3.4. Whole-genome duplication

To gain a deeper understanding of the WGD history within the C. sinica genome, we first investigated the synonymous substitution rates ( K s) within the C. sinica genome, and then compared the results to the K s distributions observed in the G. max and M. truncatula genomes ( Fig. 2a ). We found that the K s distribution of the C. sinica genome has two peaks ( Fig. 2a ), which indicates that the C. sinica genome has undergone two rounds of WGD during the course of genomic evolution. In addition to the ancient whole-genome triplication (gamma) event shared by eudicot species ( K s peak = 1.74), C. sinica has also experienced a more recent WGD event ( K s peak = 0.56).
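To illustrate how peaks in a K s distribution can be located, the sketch below applies a simple Gaussian kernel density estimate to synthetic K s values and reports the two highest local maxima. It is not the procedure used in this study: the data are generated deterministically around the two reported peak positions, and the spreads, sample sizes, bandwidth, and helper names (`gaussian_kde`, `top_peaks`, `quantile_sample`) are all assumptions made for illustration.

```python
import math
from statistics import NormalDist


def gaussian_kde(values, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate at each grid point."""
    norm = 1.0 / (len(values) * bandwidth * math.sqrt(2 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)
        for x in grid
    ]


def top_peaks(grid, density, count):
    """Return positions of the `count` highest local maxima, sorted by position."""
    maxima = [
        (density[i], grid[i])
        for i in range(1, len(grid) - 1)
        if density[i - 1] < density[i] > density[i + 1]
    ]
    return sorted(x for _, x in sorted(maxima, reverse=True)[:count])


def quantile_sample(mean, sd, n):
    """Deterministic synthetic sample: n evenly spaced quantiles of a normal."""
    dist = NormalDist(mean, sd)
    return [dist.inv_cdf((i + 0.5) / n) for i in range(n)]


# Synthetic Ks values centred on the two reported peaks (0.56 and 1.74);
# the standard deviations and sample sizes are arbitrary assumptions.
ks_values = quantile_sample(0.56, 0.10, 300) + quantile_sample(1.74, 0.25, 200)

grid = [round(i * 0.02, 2) for i in range(151)]  # Ks from 0.00 to 3.00
density = gaussian_kde(ks_values, grid, bandwidth=0.08)
peaks = top_peaks(grid, density, count=2)
print("Ks peaks:", peaks)
```

With real data, the K s values would come from syntenic paralogous gene pairs, and the choice of bandwidth (or histogram bin width) can shift or merge peaks, so peak positions should be interpreted together with the collinearity evidence.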

Results of the WGD analyses. (a) Frequency distributions of synonymous substitutions ( Ks ) for syntenic blocks among the G. max , M. truncatula, and C. sinica genomes; (b) dot plot of syntenic blocks identified by JCVI in the C. sinica genome; (c) synteny blocks (involving ≥ 5 collinear genes), identified via JCVI, between the G. max , C. sinica, and M. truncatula genomes.

Furthermore, the strong collinearity between different chromosomes within the C. sinica genome (1:1 collinearity ratio; Fig. 2b ) supports the occurrence of a recent WGD. Subsequently, we investigated the collinearity between the genomes of G. max , C. sinica , and M. truncatula . Previous studies have shown that the G. max genome has undergone two recent WGD events, 53 while the M. truncatula genome has only experienced one recent WGD. 54 Our results demonstrated that the C. sinica genome exhibits 2:4 collinearity with the G. max genome and 2:2 collinearity with the M. truncatula genome ( Fig. 2c ; Supplementary Fig. S6 ); these findings agree with the previously described synonymous substitution rate calculations. This 4:2:2 collinearity relationship between the three analysed genomes further supports the occurrence of a recent WGD event in the C. sinica genome.

## 3.5. Phylogenetic and comparative genomic analysis

We selected the genomes of 15 other eudicot species for the phylogenetic analysis, which was performed to determine the phylogeny of C. sinica and better understand its evolutionary history (Supplementary Table S17 ). By clustering gene families, we identified 157 strictly single-copy orthologous genes and constructed phylogenetic trees for C. sinica and other species using both Maximum Likelihood (ML) and aggregation-based methods. We also estimated the divergence times between species. The phylogenetic results showed that C. sinica diverged from a common ancestor it shared with M. truncatula and C. arietinum approximately 30.79 million years ago (Mya; Palaeogene Period; Fig. 3a ). The evolutionary relationships among species were consistent with what was reported in previous studies. 55–57

Evolution of the C. sinica genome and gene families. (a) Phylogenetic tree and divergence times for C. sinica and 15 other species, with A. thaliana as the outgroup; genome statistics for each species are shown on the right; (b) Venn diagram showing the cluster distribution of shared gene families among C. sinica , C. arietinum , M. truncatula , R. pseudoacacia , and L. japonicus; (c) estimated insertion time of long terminal repeat retrotransposons (LTR-RTs) in C. sinica and two related species. The x-axis represents the insertion time of the LTR-RT, which was calculated using a substitution rate of 1.05 × 10⁻⁸ substitutions site⁻¹ year⁻¹. The dots represent full-length LTR-RTs within the genomes, and the broken line shows the average for the analysed data.
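The legend above gives only the substitution rate used to date LTR-RT insertions. A commonly used conversion, assumed here rather than taken from the source, is T = K / (2r), where K is the sequence divergence between the two LTRs of one element and r is the substitution rate. The sketch below applies that formula; the function name and the example divergence values are hypothetical.

```python
SUBSTITUTION_RATE = 1.05e-8  # substitutions per site per year (from the legend)


def ltr_insertion_time_mya(divergence, rate=SUBSTITUTION_RATE):
    """Estimate LTR-RT insertion time in million years ago (Mya).

    Assumes T = K / (2 * r): both LTRs were identical at insertion and
    have since accumulated substitutions independently at rate r.
    """
    if divergence < 0:
        raise ValueError("divergence must be non-negative")
    return divergence / (2 * rate) / 1e6


# Hypothetical LTR divergence values, for illustration only.
for k in (0.005, 0.021, 0.050):
    print(f"K = {k:.3f} -> {ltr_insertion_time_mya(k):.2f} Mya")
```

Under this formula, a divergence of 0.021 between the two LTRs corresponds to an insertion about 1 Mya, so elements with nearly identical LTRs represent the recent insertions discussed in the text.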

We further analysed the phylogenetic outcomes by investigating the expansion and contraction of gene families in the C. sinica genome via the ‘birth-death’ model. The results showed that, relative to the most recent common ancestor shared with M. truncatula and C. arietinum , C. sinica experienced expansions and contractions in 1,697 and 4,323 gene families, respectively. A KEGG pathway enrichment analysis revealed that the expanded gene families were linked to metabolic pathways such as flavonoid biosynthesis (map00941), fatty acid biosynthesis (map00061), and amino acid biosynthesis (map01230), as well as genetic information processing pathways including mismatch repair (map03430), homologous recombination (map03440), and nucleotide excision repair (map03420; Fig. 4a ; Supplementary Table S18 ). Legume plants contain large amounts of flavonoids, which play critical roles in nitrogen fixation during nodulation and plant physiological functions, for instance, protection from UV radiation and drought stress. 58 , 59 The expansion of gene families related to genetic information processing could correspond to the large genome of C. sinica (1.06 Gb), which could necessitate mismatch and DNA repair mechanisms to fix DNA damage and maintain the stability of the genome. 60–62 This hypothesis was consistent with the Gene Ontology (GO) enrichment analysis, which returned terms related to environmental adaptation and genome information processing, e.g. response to heat (GO:0009408), response to water deprivation (GO:0009414), defense response to bacteria (GO:0042742), response to salt stress (GO:0009651), and DNA repair (GO:0006281) (Supplementary Table S19 ).
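Pathway enrichment of this kind is often assessed with a one-sided hypergeometric test, which asks how likely it is to see at least the observed number of pathway genes in a gene set drawn at random from the genome. The text does not state which test was applied, so the sketch below is an assumption about the general approach, and all gene counts in the example are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(total_genes, pathway_genes, selected_genes,
                           selected_in_pathway):
    """One-sided hypergeometric p-value P(X >= selected_in_pathway).

    X is the number of pathway genes found when `selected_genes` genes are
    drawn without replacement from `total_genes`, of which `pathway_genes`
    belong to the pathway.
    """
    upper = min(pathway_genes, selected_genes)
    denominator = comb(total_genes, selected_genes)
    tail = sum(
        comb(pathway_genes, k)
        * comb(total_genes - pathway_genes, selected_genes - k)
        for k in range(selected_in_pathway, upper + 1)
    )
    return tail / denominator


# Hypothetical counts: 1,000 annotated genes, 50 in the pathway,
# 100 genes in expanded families, 15 of which are pathway genes.
p_value = hypergeom_enrichment_p(1000, 50, 100, 15)
print(f"enrichment p-value: {p_value:.3g}")
```

In practice, the p-values for all tested pathways would also be corrected for multiple testing before enriched terms are reported.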

Gene duplication and evolution. (a) KEGG enrichment analysis of the expanded gene family in C. sinica . The colour of the box represents the classification in KEGG terms, and the size of the box illustrates the number of genes representing each KEGG term, (b) Categories and proportions of gene duplications under various replication modes and gene expansions caused by gene duplications (WGD: whole genome duplication; DSD: dispersed duplication; TRD: transposed duplication; TD: tandem duplication; PD: proximal duplication). The outer circle is the number of duplicated genes and the inner circle depicts the overlap between duplicated and expanded family genes, (c) Violin and box plots showing the selective pressures ( K a/ K s ratios) on genes originating from different gene duplication modes, (d) KEGG functional enrichment of genes that showed a degree of overlap between expanded gene families and various modes of gene duplication.

Furthermore, to explore the specific dynamics of C. sinica gene families, we integrated the gene family profiles of C. sinica with four closely-related legume species ( M. truncatula , R. pseudoacacia , L. japonicus , and C. arietinum ). We found that C. sinica shared 12,167 gene families with these legume species and identified 1,781 gene families that are unique to the C. sinica genome and not present in the other four legume species ( Fig. 3b ). A functional enrichment analysis of these unique gene families revealed involvement in plant defense responses (GO:0009414, GO:0009408, GO:0042742, GO:0009651, GO:0009817), mismatch repair (GO:0006298), chromosome maintenance (GO:0007129, GO:0003684, GO:0032875, GO:0006289, GO:0007094), and other important biological processes (Supplementary Table S20 ). Interestingly, the gene families which are unique to C. sinica were associated with similar plant roles as the expanded gene families, i.e. genome maintenance and environmental adaptation; this may help explain how C. sinica has adapted to the challenging conditions present in the mountainous terrain of northern China.

The genome of C. sinica is considerably larger (1.06 Gb) than the genomes of its close legume relatives. When compared to the genomes of C. arietinum , M. truncatula , R. pseudoacacia and L. japonicus , the C. sinica genome is larger by an average of 512 Mb ( Fig. 3a ). To explain why the C. sinica genome is so large, we investigated the proportion of repeat sequences in the C. sinica genome relative to what has been found for closely-related species. The results showed that repeat sequences in the C. sinica genome accounted for 72.63% of the genome, or a total of 769 Mb, which exceeds the entire genome sizes of certain closely related species ( Fig. 3a ). This indicates that the large genome size of C. sinica is primarily a result of the insertion of a substantial number of repeat sequences. To further understand why the C. sinica genome includes such an immense amount of repetitive DNA, we examined the insertion times of LTR-RTs in C. sinica , M. truncatula , and R. pseudoacacia . We found that the C. sinica genome has experienced a large number of recent LTR insertions, and that LTR insertion in the C. sinica genome has generally been a frequent and long-term process ( Fig. 3c ). As shown above, the C. sinica and M. truncatula genomes underwent the same number of WGD events. Therefore, we cannot attribute the fact that the C. sinica genome is ~646 Mb larger than the M. truncatula genome solely to a WGD event. These results suggest that the burst of repeat sequences significantly contributes to the massive scale of the C. sinica genome.

## 3.6. Contribution of gene duplication to the specific adaptation of C. sinica

Gene duplication expands gene families and supplies raw material for the sub- and neo-functionalization of genes; as such, it has long been considered one of the driving forces of plant evolution. 63 , 64 We identified 22,736 duplicated genes in the C. sinica genome and classified them into five categories based on the collinearity of duplicated gene pairs and chromosomal location: 8,379 whole-genome duplicated genes (WGD, 36.85%); 6,382 dispersed duplicated genes (DSD, 28.07%); 4,545 transposed duplicated genes (TRD, 19.99%); 2,136 tandem duplicated genes (TD, 9.39%); and 1,294 proximal duplicated genes (PD, 5.69%; Fig. 4b ). We evaluated selection strength using the K a/ K s (non-synonymous substitution rate/synonymous substitution rate) ratio for duplicated gene pairs. Interestingly, TD and PD gene pairs showed higher K a/ K s ratios than gene pairs from the other duplication modes, which suggests that genes generated by tandem and proximal duplications experienced faster sequence divergence and stronger positive selection in the C. sinica genome than duplicate genes generated by other modes ( Fig. 4c ). In contrast, WGD gene pairs showed lower K a/ K s ratios, which is consistent with stronger purifying selection.

Next, we intersected the identified expanded gene families (EGFs) with the various types of duplicated genes to analyse which gene duplication modes were responsible for gene family expansion ( Fig. 4b ). The overlaps between EGFs and duplicated genes were mainly attributed to TRD (29.62%), followed by WGD (22.91%) and DSD (21.32%). It is noteworthy that although TRD only accounted for 19.99% of all duplicated genes, it was an important mechanism underlying gene family expansion in C. sinica , as it accounted for the highest percentage of expanded genes (29.62%). A KEGG enrichment analysis was performed on the overlapping sections between duplicated genes and EGFs, with the results revealing that TRD and DSD played major roles in the expansion of families related to chromosome information processing and repair, more specifically, mismatch repair (map03430), homologous recombination (map03440), and nucleotide excision repair (map03420; Fig. 4d ; Supplementary Table S21 ). The genes that had arisen through PD and TD were mainly enriched in terms related to environmental responses and defense ( Fig. 4d ), such as flavonoid biosynthesis (map00941), circadian rhythm (map04710), terpenoid biosynthesis (map00902), and endocytosis (map04144). WGD genes were found to be associated with various aspects of plant functioning, including environmental responses, plant defense, protein processing, and biosynthesis. These results suggest that all five gene duplication modes have contributed to the adaptation of C. sinica to a harsh environment. For example, PD, TRD, and WGD genes were enriched in the MAPK signalling pathway (map04016), which, in plants, plays a crucial role in responses to temperature stress, salt stress, drought stress, and nutritional deficiency. 65–68 Our analyses also found that PD and TD genes contribute to flavonoid biosynthesis (map00941), which plays a key role in plant resistance to oxidative stress and ultraviolet damage, in nitrogen fixation during nodulation, and in growth and development. In summary, various modes of gene duplication have caused different degrees of expansion in C. sinica gene families, with a strong correlation to gene families that are connected to the vital pathways that regulate plant metabolism.
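The over-representation of TRD among expanded gene families can be made explicit by comparing each mode's share of all duplicated genes with its share of the overlap with EGFs. The sketch below uses only the counts and percentages reported above for the three modes with stated overlap values; the variable names are illustrative, and the ratio is a descriptive summary rather than a statistical test.

```python
# Duplicated gene counts per mode, as reported in Section 3.6.
duplicated_genes = {"WGD": 8379, "DSD": 6382, "TRD": 4545, "TD": 2136, "PD": 1294}
total_duplicated = sum(duplicated_genes.values())

# Share of each mode among all duplicated genes (%).
share_of_duplicates = {
    mode: round(100 * count / total_duplicated, 2)
    for mode, count in duplicated_genes.items()
}

# Share of the EGF overlap attributed to each mode (%), as reported in the text.
share_of_egf_overlap = {"TRD": 29.62, "WGD": 22.91, "DSD": 21.32}

# Ratio > 1 means the mode is over-represented among expanded families.
fold = {
    mode: share_of_egf_overlap[mode] / share_of_duplicates[mode]
    for mode in share_of_egf_overlap
}

for mode, ratio in sorted(fold.items(), key=lambda item: item[1], reverse=True):
    print(f"{mode}: {share_of_duplicates[mode]}% of duplicates, "
          f"{share_of_egf_overlap[mode]}% of EGF overlap, ratio {ratio:.2f}")
```

Of the three modes compared, only TRD has a ratio above 1, which restates numerically why TRD is highlighted as an important mechanism of gene family expansion.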

The completion of the chromosome-level genome assembly for C. sinica represents a crucial step in deciphering the genomic intricacies of this species. The integration of multiple sequencing technologies and various assembly strategies resulted in a comprehensive genomic resource that lays the groundwork for in-depth studies of the C. sinica genetic landscape. Meticulously planned validation steps, including mapping Illumina short reads, Poisson distribution analysis, and BUSCO and CEGMA evaluations, collectively affirmed the reliability, accuracy, and completeness of the C. sinica genome assembly (Supplementary Fig. S3 ; Supplementary Tables S4 , S9 , and S10 ). These metrics not only underscore the technical robustness of the assembly process but also provide a level of confidence for subsequent analyses and interpretations. The complete C. sinica genome assembly enhances our understanding of the coding and non-coding components of the genome, which is important information for future studies on gene regulation and functional genomics.

The size of the C. sinica genome, reaching 1.06 Gb (Supplementary Tables S3 and S8 ), is remarkably larger than what has been observed for closely related species ( Fig. 3a ), such as Robinia pseudoacacia (682 Mb) 51 and Medicago truncatula (412 Mb). 69 This enlargement was found to be predominantly explained by the proliferation of repetitive elements, which constitute a substantial 72.63% of the genome (Supplementary Table S11 ). The prevalence of these repetitive elements, especially LTR-RTs, highlights a profound impact on shaping the genomic landscape of C. sinica . 70–72 This burst in recent LTR insertions not only underscores the dynamic nature of TE activity but also provides insight into the ongoing genomic evolution of C. sinica . Another noteworthy aspect related to the enlarged genome is the sizeable average gene length (5,676.14 bp), which surpasses what has been measured for closely related species (Supplementary Table S13 ). An elongated intron length, averaging 951.17 bp, was also observed and could significantly contribute to the expanded gene size (Supplementary Table S13 ). While the longer average intron length measured in the C. sinica genome could be hypothesized to play a crucial role in extending overall gene length and, consequently, influence the enlarged genome size, this remains an issue of debate. 73–76 Nevertheless, this unique feature raises intriguing questions about the potential functional implications of such enlarged gene structures.

By integrating the annotated C. sinica genome with previously published genomes representing Fabaceae members, we were able to identify two rounds of WGD events in C. sinica ( Fig. 2a–c ). The observed collinearity within the genome, along with comparisons to other species, strongly supports a recent WGD ( Fig. 2b–c ). However, there is a noticeable paradox that challenges the straightforward attribution of the genome enlargement in C. sinica to WGD. Despite its large genome size, our findings indicated that C. sinica has undergone the same number of WGD events as closely related species, such as M. truncatula and S. japonica 50 ; this phenomenon has been previously discussed in a comprehensive study. Contrary to expectations, polyploidization did not exhibit a significant positive linear correlation with genome size, as demonstrated previously. 77 This intriguing discrepancy prompted a more profound exploration into the mechanisms underlying the unique genomic expansion observed in C. sinica . The complex relationship between polyploidization events and genome size requires nuanced consideration and forces us to re-evaluate conventional assumptions linked with the evolution of the C. sinica genome.

Although polyploidization does not appear to be linearly correlated with genome size, prior studies highlight the significant role of WGD in plant genome evolution and species adaptation. 78–80 WGD events provide plants with additional genetic material to enhance adaptability, promote plant diversification, and foster functional innovation. 81 , 82 In the context of adaptation, the expansion of gene families can lead to the functional diversification of genes, which can be crucial to adapting to a new environment. 83–85 Diversification of gene families through different duplication modes (WGD, DSD, and TRD) has profound implications for the adaptive evolution of C. sinica ( Fig. 4b , d ; Supplementary Tables S18 , S19 , and S21 ). These expansion and contraction dynamics reflect the inherent evolutionary flexibility of the genome to allow for functional innovations. 86 The identification of gene families associated with essential pathways, such as chromosome information processing and repair, highlights the functional consequences of these duplication events ( Fig. 4d ; Supplementary Table S21 ). The enrichment of expanded gene families in processes like mismatch repair, homologous recombination, and nucleotide excision repair suggests the presence of a robust genomic machinery for maintaining genome stability, which is particularly crucial in the face of environmental stresses (Supplementary Tables S18 and S19 ). Moreover, the over-representation of gene families associated with environmental responses and defense in tandem and proximal duplication indicates a potential role in the adaptive strategies of C. sinica (Supplementary Table S21 ).

Furthermore, the intricate connections between gene length, repair mechanisms, and the enlarged genome underscore the complex relationship between genomic architecture and functional adaptations. The evolutionary trajectory of C. sinica , which is marked by genome expansion and elongated gene structures, is intricately linked with the ability of this species to cope with environmental challenges. A deeper investigation of the specific roles of these elongated genes and their regulatory elements promises to illuminate the significance of such genomic features to the adaptation of C. sinica to new environments.

To summarize, the comprehensive genomic analysis presented here significantly enhances our understanding of the C. sinica genome. The high-quality assembly, which is characterized by LTR bursts and enlarged gene structures, prompts questions about the potential functional implications of these genomic features. Enrichment results provide insights into the functional aspects of expanded genes, which were found to be particularly strongly linked to repair mechanisms. The two identified rounds of WGD, and the associated gene family expansion, may be linked to adaptation. This holistic genomic analysis provides a solid foundation for future investigations into the unique adaptations of C. sinica , notably, strong environmental resilience, including the molecular mechanisms governing these traits. The identified genomic elements, pathways, and duplication events offer a rich resource for functional genomics studies and lay the groundwork for understanding the ecological and evolutionary dynamics of C. sinica .

This study was supported by the National Natural Science Foundation of China (grant nos. 41601055, 32371900, and 32001085) and the Research on germplasm resource and propagational technique of Betulaceae (grant no. 201903D221071).

The authors have no conflict of interest to declare.

B.L., B.R., and J.C. designed and supervised the project; L.C. and H.K. prepared the samples; H.Z., D.R., and Y.G. analysed the data; Y.H., X.L., and J.S. helped with the data analysis and examined the results; H.Z., Y.G., and D.R. wrote the first draft; B.L., J.C., and B.R. wrote the final manuscript. All of the authors read and approved the final manuscript.

The data acquired in this Whole Genome Shotgun project have been deposited in the NCBI under project number PRJNA1046645 and the genome accession number is JBAFXK000000000. The genome assembly file and annotation file are also available at Figshare ( https://doi.org/10.6084/m9.figshare.24669363.v2 ). All other data are available from the corresponding authors on reasonable request.

Li , J.P. and Chen , S. 2019 , Analysis of nutrient components in Calophaca sinica seeds in Tianlong Mountain, For. Sci. Technol. , 7 , 60– 63 , doi: 10.13456/j.cnki.lykt.2018.04.19.0001

Chinese Botanical Committee of the Chinese Academy of Sciences . 1993 . Flora Reipublicae Popularis Sinicae: Calophaca Fisch . vol. 42 ( 1 ). Beijing : Science Press : 67 – 71 .

Wu , H.Z. , Han , L.J. , Wu , Y.B. , and Jia , J. 2019 , Evaluation on drought resistance of Calophaca sinica under drought stress , Shanxi For. Sci. Technol. , 48 , 1 – 5 , doi: 10.3969/j.issn.1007-726X.2019.04.001

Allen , G.C. , Flores-Vergara , M.A. , Krasynanski , S. , Kumar , S. , and Thompson , W.F. 2006 , A modified protocol for rapid DNA isolation from plant tissues using cetyltrimethylammonium bromide , Nat. Protoc. , 1 , 2320 – 5 , doi: 10.1038/nprot.2006.384

Wick , R.R. , Judd , L.M. , and Holt , K.E. 2019 , Performance of neural network basecalling tools for Oxford Nanopore sequencing , Genome Biol. , 20 , 129 , doi: 10.1186/s13059-019-1727-y

Chen , S. , Zhou , Y. , Chen , Y. , and Gu , J. 2018 , fastp: an ultra-fast all-in-one FASTQ preprocessor , Bioinformatics , 34 , i884 – 90 , doi: 10.1093/bioinformatics/bty560

Marçais , G. and Kingsford , C. 2011 , A fast, lock-free approach for efficient parallel counting of occurrences of k-mers , Bioinformatics , 27 , 764 – 70 , doi: 10.1093/bioinformatics/btr011

Vurture , G.W. , Sedlazeck , F.J. , Nattestad , M. , et al.  2017 , GenomeScope: fast reference-free genome profiling from short reads , Bioinformatics , 33 , 2202 – 4 , doi: 10.1093/bioinformatics/btx153

Hu , J. , Fan , J. , Sun , Z. , and Liu , S. 2020 , NextPolish: a fast and efficient genome polishing tool for long-read assembly , Bioinformatics , 36 , 2253 – 5 , doi: 10.1093/bioinformatics/btz891

Servant , N. , Varoquaux , N. , Lajoie , B.R. , et al.  2015 , HiC-Pro: an optimized and flexible pipeline for Hi-C data processing , Genome Biol. , 16 , 259 , doi: 10.1186/s13059-015-0831-x

Langmead , B. and Salzberg , S.L. 2012 , Fast gapped-read alignment with Bowtie 2 , Nat. Methods , 9 , 357 – 9 , doi: 10.1038/nmeth.1923

Ji , Q.M. , Xin , J.W. , Chai , Z.X. , et al.  2021 , A chromosome-scale reference genome and genome-wide genetic variations elucidate adaptation in yak , Mol. Ecol. Resour. , 21 , 201 – 11 , doi: 10.1111/1755-0998.13236

Simão , F.A. , Waterhouse , R.M. , Ioannidis , P. , Kriventseva , E.V. , and Zdobnov , E.M. 2015 , BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs , Bioinformatics , 31 , 3210 – 2 , doi: 10.1093/bioinformatics/btv351

Li , H. and Durbin , R. 2009 , Fast and accurate short read alignment with Burrows-Wheeler transform , Bioinformatics , 25 , 1754 – 60 , doi: 10.1093/bioinformatics/btp324

Parra , G. , Bradnam , K. , and Korf , I. 2007 , CEGMA: A pipeline to accurately annotate core genes in eukaryotic genomes , Bioinformatics , 23 , 1061 – 7 , doi: 10.1093/bioinformatics/btm071

Wang , X. and Wang , L. 2016 , GMATA: an integrated software package for genome-scale SSR mining, marker development and viewing , Front. Plant Sci. , 7 , 1350 , doi: 10.3389/fpls.2016.01350

Benson , G. 1999 , Tandem repeats finder: A program to analyze DNA sequences , Nucleic Acids Res. , 27 , 573 – 80 , doi: 10.1093/nar/27.2.573

Xu , Z. and Wang , H. 2007 , LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons , Nucleic Acids Res. , 35 , W265 – 8 , doi: 10.1093/nar/gkm286

Ellinghaus , D. , Kurtz , S. , and Willhoeft , U. 2008 , LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons , BMC Bioinf. , 9 , 18 , doi: 10.1186/1471-2105-9-18

Ou , S. and Jiang , N. 2018 , LTR_retriever: a highly accurate and sensitive program for identification of long terminal repeat retrotransposons , Plant Physiol. , 176 , 1410 – 22 , doi: 10.1104/pp.17.01310

Flynn , J.M. , Hubley , R. , Goubert , C. , et al.  2020 , RepeatModeler2 for automated genomic discovery of transposable element families , Proc. Natl. Acad. Sci. USA , 117 , 9451 – 7 , doi: 10.1073/pnas.1921046117

Abrusán , G. , Grundmann , N. , DeMester , L. , and Makalowski , W. 2009 , TEclass - a tool for automated classification of unknown eukaryotic transposable elements , Bioinformatics , 25 , 1329 – 30 , doi: 10.1093/bioinformatics/btp084

Tarailo-Graovac , M. and Chen , N. 2009 , Using RepeatMasker to identify repetitive elements in genomic sequences , Curr. Protoc. Bioinformatics , 25 , 4.10.1 – 4.10.14 , doi: 10.1002/0471250953.bi0410s25

Dobin , A. , Davis , C.A. , Schlesinger , F. , et al.  2013 , STAR: ultrafast universal RNA-seq aligner , Bioinformatics , 29 , 15 – 21 , doi: 10.1093/bioinformatics/bts635

Kovaka , S. , Zimin , A.V. , Pertea , G.M. , Razaghi , R. , Salzberg , S.L. , and Pertea , M. 2019 , Transcriptome assembly from long-read RNA-seq alignments with StringTie2 , Genome Biol. , 20 , 278 , doi: 10.1186/s13059-019-1910-1

Haas , B.J. , Salzberg , S.L. , Zhu , W. , et al.  2008 , Automated eukaryotic gene structure annotation using EVidenceModeler and the program to assemble spliced alignments , Genome Biol. , 9 , R7 , doi: 10.1186/gb-2008-9-1-r7

Keilwagen , J. , Wenk , M. , Erickson , J.L. , Schattat , M.H. , Grau , J. , and Hartung , F. 2016 , Using intron position conservation for homology-based gene prediction , Nucleic Acids Res. , 44 , e89 , doi: 10.1093/nar/gkw092

Stanke , M. , Diekhans , M. , Baertsch , R. , and Haussler , D. 2008 , Using native and syntenically mapped cDNA alignments to improve de novo gene finding , Bioinformatics , 24 , 637 – 44 , doi: 10.1093/bioinformatics/btn013

Krzywinski , M. , Schein , J. , Birol , I. , et al.  2009 , Circos: An information aesthetic for comparative genomics , Genome Res. , 19 , 1639 – 45 , doi: 10.1101/gr.092759.109

Nawrocki , E.P. and Eddy , S.R. 2013 , Infernal 1.1: 100-fold faster RNA homology searches , Bioinformatics , 29 , 2933 – 5 , doi: 10.1093/bioinformatics/btt509

Griffiths-Jones , S. , Moxon , S. , Marshall , M. , Khanna , A. , Eddy , S.R. , and Bateman , A. 2005 , Rfam: annotating non-coding RNAs in complete genomes , Nucleic Acids Res. , 33 , D121 – 4 , doi: 10.1093/nar/gki081

Lowe , T.M. and Eddy , S.R. 1997 , tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence , Nucleic Acids Res. , 25 , 955 – 64 , doi: 10.1093/nar/25.5.955

Lagesen , K. , Hallin , P. , Rødland , E.A. , Staerfeldt , H.H. , Rognes , T. , and Ussery , D.W. 2007 , RNAmmer: consistent and rapid annotation of ribosomal RNA genes , Nucleic Acids Res. , 35 , 3100 – 8 , doi: 10.1093/nar/gkm160

Camacho , C. , Coulouris , G. , Avagyan , V. , et al.  2009 , BLAST+: architecture and applications , BMC Bioinf. , 10 , 421 , doi: 10.1186/1471-2105-10-421

Bairoch , A. , Apweiler , R. , Wu , C.H. , et al.  2005 , The Universal Protein Resource (UniProt) , Nucleic Acids Res. , 33 , D154 – 9 , doi: 10.1093/nar/gki070

Galperin , M.Y. , Makarova , K.S. , Wolf , Y.I. , and Koonin , E.V. 2015 , Expanded microbial genome coverage and improved protein family annotation in the COG database , Nucleic Acids Res. , 43 , D261 – 9 , doi: 10.1093/nar/gku1223

Zdobnov , E.M. and Apweiler , R. 2001 , InterProScan - an integration platform for the signature-recognition methods in InterPro , Bioinformatics , 17 , 847 – 8 , doi: 10.1093/bioinformatics/17.9.847

Kanehisa , M. and Goto , S. 2000 , KEGG: Kyoto encyclopedia of genes and genomes , Nucleic Acids Res. , 28 , 27 – 30 , doi: 10.1093/nar/28.1.27

## Viruses are doing mysterious things everywhere – AI can help researchers understand what they’re up to in the oceans and in your gut

Viruses are a mysterious and poorly understood force in microbial ecosystems. Researchers know they can infect, kill and manipulate human and bacterial cells in nearly every environment, from the oceans to your gut. But scientists don’t yet have a full picture of how viruses affect their surrounding environments, in large part because of their extraordinary diversity and ability to rapidly evolve.

Communities of microbes are difficult to study in a laboratory setting. Many microbes are challenging to cultivate, and their natural environment has many more features influencing their success or failure than scientists can replicate in a lab.

So systems biologists like me often sequence all the DNA present in a sample (for example, a fecal sample from a patient), separate out the viral DNA sequences, then annotate the sections of the viral genome that code for proteins. These notes on the location, structure and other features of genes help researchers understand the functions viruses might carry out in the environment and help identify different kinds of viruses. Researchers annotate viruses by matching viral sequences in a sample to previously annotated sequences available in public databases of viral genetic sequences.

However, scientists are identifying viral sequences in DNA collected from the environment at a rate that far outpaces our ability to annotate those genes. This means researchers are publishing findings about viruses in microbial ecosystems using unacceptably small fractions of available data.

To improve researchers’ ability to study viruses around the globe, my team and I have developed a novel approach to annotate viral sequences using artificial intelligence. Through protein language models akin to large language models like ChatGPT but specific to proteins, we were able to classify previously unseen viral sequences. This opens the door for researchers to not only learn more about viruses, but also to address biological questions that are difficult to answer with current techniques.

## Annotating viruses with AI

Large language models use relationships between words in large datasets of text to provide potential answers to questions they are not explicitly “taught” the answer to. When you ask a chatbot “What is the capital of France?” for example, the model is not looking up the answer in a table of capital cities. Rather, it is using its training on huge datasets of documents and information to infer the answer: “The capital of France is Paris.”

Similarly, protein language models are AI algorithms that are trained to recognize relationships between billions of protein sequences from environments around the world. Through this training, they may be able to infer something about the essence of viral proteins and their functions.

We wondered whether protein language models could answer this question: “Given all annotated viral genetic sequences, what is this new sequence’s function?”

In our proof of concept, we took pre-trained protein language models, trained neural networks on their representations of previously annotated viral protein sequences, and then used those networks to predict the annotation of new viral protein sequences. Our approach allows us to probe what the model is “seeing” in a particular viral sequence that leads to a particular annotation. This helps identify candidate proteins of interest based either on their specific functions or on how their genome is arranged, winnowing down the search space of vast datasets.
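As a rough illustration of the idea, and not the team's actual pipeline, the sketch below assumes that a protein language model has already turned each protein sequence into a numeric vector (an "embedding"). It then swaps the neural network for a much simpler nearest-centroid classifier. The embedding values, the labels and the function names are all hypothetical.

```python
import math
from collections import defaultdict


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def fit_centroids(embeddings, labels):
    """Average the embeddings of each annotation class into one centroid."""
    groups = defaultdict(list)
    for vector, label in zip(embeddings, labels):
        groups[label].append(vector)
    return {label: mean_vector(vectors) for label, vectors in groups.items()}


def predict(centroids, embedding):
    """Return the annotation whose centroid is most similar to the embedding."""
    return max(centroids, key=lambda label: cosine(centroids[label], embedding))


# Hypothetical 3-dimensional embeddings; real models produce hundreds of dimensions.
train_embeddings = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.1, 0.9, 0.2],
    [0.0, 0.8, 0.3],
]
train_labels = ["capsid", "capsid", "integrase", "integrase"]

centroids = fit_centroids(train_embeddings, train_labels)
print(predict(centroids, [0.2, 0.7, 0.1]))
```

The point of the sketch is only that a new sequence can receive an annotation without an exact database match, because similarity is measured in the model's learned representation rather than letter by letter.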

By identifying more distantly related viral gene functions, protein language models can complement current methods to provide new insights into microbiology. For example, my team and I were able to use our model to discover a previously unrecognized integrase, a type of protein that can move genetic information in and out of cells, in the globally abundant marine picocyanobacteria Prochlorococcus and Synechococcus. Notably, this integrase may be able to move genes in and out of these populations of bacteria in the oceans and enable these microbes to better adapt to changing environments.

Our language model also identified a novel viral capsid protein that is widespread in the global oceans. We produced the first picture of how its genes are arranged, showing that it can contain different sets of genes, which we believe indicates that this virus serves different functions in its environment.

These preliminary findings represent only two of thousands of annotations our approach has provided.

## Analyzing the unknown

Most of the hundreds of thousands of newly discovered viruses remain unclassified. Many viral genetic sequences match protein families with no known function or have never been seen before. Our work shows that similar protein language models could help study the threat and promise of our planet’s many uncharacterized viruses.

While our study focused on viruses in the global oceans, improved annotation of viral proteins is critical for better understanding the role viruses play in health and disease in the human body. We and other researchers have hypothesized that viral activity in the human gut microbiome might be altered when you’re sick. This means that viruses may help identify stress in microbial communities.

However, our approach is also limited because it requires high-quality annotations. Researchers are developing newer protein language models that incorporate other “tasks” as part of their training, particularly predicting protein structures to detect similar proteins, to make them more powerful.

Making all AI tools available via FAIR Data Principles – data that is findable, accessible, interoperable and reusable – can help researchers at large realize the potential of these new ways of annotating protein sequences leading to discoveries that benefit human health.

Libusha Kelly receives funding from the National Institutes of Health.

• Published: 15 May 2024

## Structure and genome editing of type I-B CRISPR-Cas

• Meiling Lu 1 , 2   na1 ,
• Chenlin Yu 1   na1 ,
• Yuwen Zhang 1 ,
• Wenjun Ju 1 ,
• Chenyang Hua 1 ,
• Jinze Mao 3 ,
• Chunyi Hu   ORCID: orcid.org/0000-0003-4509-4840 4 , 5 ,
• Zhenhuang Yang 6 &
• Yibei Xiao   ORCID: orcid.org/0000-0003-4716-5526 2 , 7 , 8

Nature Communications volume 15, Article number: 4126 (2024)

• CRISPR-Cas9 genome editing
• CRISPR-Cas systems
• Cryoelectron microscopy

Type I CRISPR-Cas systems employ the multi-subunit effector Cascade and the helicase-nuclease Cas3 to target and degrade foreign nucleic acids, representing the most abundant RNA-guided adaptive immune systems in prokaryotes. Their ability to cause long fragment deletions has led to increasing interest in eukaryotic genome editing. While the Cascade structures of all other six type I systems have been determined, the structure of the most evolutionarily conserved type I-B Cascade is still missing. Here, we present two cryo-EM structures of the Synechocystis sp. PCC 6714 (Syn) type I-B Cascade, revealing the molecular mechanisms that underlie RNA-directed Cascade assembly, target DNA recognition, and local conformational changes of the effector complex upon R-loop formation. Remarkably, a loop of Cas5 directly intercalated into the major groove of the PAM and facilitated PAM recognition. We further characterized the genome editing profiles of this I-B Cascade-Cas3 in human CD3+ T cells using mRNA-mediated delivery, which led to a unidirectional 4.5 kb deletion in the TRAC locus and achieved an editing efficiency of up to 41.2%. Our study provides the structural basis for understanding target DNA recognition by type I-B Cascade and lays the foundation for harnessing this system for long-range genome editing in human T cells.

## Introduction

The CRISPR-Cas system is an adaptive immune system that defends prokaryotes against the invasion of foreign genetic elements 1, 2, 3. In these systems, CRISPR-associated (Cas) protein(s) assemble with transcribed and processed CRISPR RNAs (crRNAs) to form the effector complex that degrades the complementary invading nucleic acid 4, 5, 6. Based on the composition of the effector, CRISPR-Cas systems are broadly divided into two classes: Class 1 utilizes multi-subunit effector proteins and Class 2 employs only a single effector protein 7. Class 2 CRISPR-Cas systems are well studied and their effectors, including Cas9 and Cas12, have been widely employed for genome editing 8, 9. However, Class 2 systems account for only about 10% of all discovered CRISPR-Cas systems 10; the more prevalent Class 1 CRISPR-Cas systems are a huge reservoir of potential genome manipulation tools that await further exploration.

Type I CRISPR-Cas systems are the most abundant Class 1 systems 7, and are further divided into seven subtypes, I-A through I-G. These systems are characterized by the coordinated action of Cascade (CRISPR-associated complex for antiviral defense), which binds complementary target dsDNA, and a nuclease-helicase subunit, Cas3, for processive DNA degradation 7, 11. The Cascade structures of types I-A 12, I-C 13, 14, 15, I-D 16, 17, I-E 18, 19, 20, I-F 21, 22, 23 and I-G 24 have all been determined to date; only that of type I-B is missing. A typical Cascade contains a large subunit that detects the protospacer adjacent motif (PAM) on the target DNA, a belly composed of two to five copies of small subunits, a backbone for crRNA binding, a crRNA-processing nuclease, and a single-copy subunit positioned at the 5’ end of the crRNA 25. In type I-C 13, 14, 15, I-D 16, 17, 26, I-E 18, 19, 20, and I-F 21, 22, 23 systems, Cas3 recruitment is dependent on R-loop formation upon target DNA recognition by Cascade. By contrast, Cas3 of the I-A 12 and I-G 24 subtypes is a stable component of the effector complex even in the absence of target DNA. The ability of type I systems to cause long fragment deletions has been validated in microbes 27, 28, plants 29, 30, and mammalian cells 31, 32, 33. Cascade or Cas3 fused with other regulatory proteins or modifiers has also been shown to regulate gene expression or induce random mutagenesis 34, 35.

Several studies have suggested potential applications of the type I-B system. For example, endogenous type I-B systems have been redirected for gene deletion/insertion in several native hosts 36, 37, 38, and the reconstructed type I-B interference machinery can scavenge target genes on plasmids 36, 39, 40. Recently, a type I-B CRISPR-associated transposase (CAST) system that uses I-B Cascade to target DNA and recruits a Tn7-like transposase to achieve site-specific gene insertion has been reported 41. However, a structure-based mechanistic understanding of the type I-B CRISPR-Cas3 system and validation of its gene editing capabilities are still lacking.

In this study, we reconstitute a Syn type I-B Cascade and delineate its broader PAM requirement, with 5’-A-Y-G-3’ as the optimal preference. We determine the cryo-EM structures of Syn type I-B Cascade bound to the dsDNA target and show interactions for PAM recognition, NTS stabilization, and conformational changes of the large subunit in two functional states. Uniquely, both Cas5b and Cas8b subunits are involved in PAM recognition. In addition, we introduce the Syn type I-B system into human CD3+ T cells using mRNA delivery, which achieved an editing efficiency of up to 41.2% and a unidirectional 4.5 kb deletion. Our results form the structural basis for understanding target DNA recognition mechanisms by type I-B Cascade, filling the last missing piece of type I Cascade, and show the potential of type I-B CRISPR-Cas3 in large genome fragment deletion in T-cell engineering.

## Reconstitution and PAM determination of type I-B system

Synechocystis sp. PCC 6714 encodes a type I-B CRISPR locus which was subclassified as Myxan based on the properties of its large subunit 42 (Fig. 1A). The initial nomenclature of the large subunit as Cmx8 was updated to Cas8b in 2020 by Makarova et al. 7. Like its counterparts in I-C 13, 14, 15 and I-D 16, 17 systems, the cas8b gene of this system also includes an internal ribosome-binding site near its 3’ terminus, allowing a separate small subunit, Cas11, to be translated 17. To recapitulate this type I-B system, plasmids encoding Cas8b, Cas7b, Cas5b, Cas6b, and Cas11b were co-expressed with the associated CRISPR array in E. coli, and the assembled complex was co-purified through a Strep-tag fused to the N-terminus of Cas8b. Size-exclusion chromatography of the affinity-purified sample indicated successful assembly of a type I-B Cascade, which eluted at a volume corresponding to a mass slightly smaller than 440 kDa (Fig. 1B). SDS-PAGE revealed the presence of Cas8b (70 kDa), Cas7b (35 kDa), Cas5b (26.5 kDa), Cas6b (24.4 kDa), as well as the expected Cas11b at <15 kDa (Fig. 1C). A ~71-nucleotide-long crRNA was co-purified, consistent with the length of a full spacer-repeat crRNA unit (Fig. 1D).

A Schematic of the Synechocystis sp. PCC 6714 type I-B operon (left) and mature crRNA (right). B SEC chromatogram of the recombinant Cascade purified via an N-terminal Strep-tag on Cas8b; ferritin (Mr 440,000) was used as the standard. C Coomassie blue stained SDS-PAGE gel of the Cascade sample collected at the main elution peak in B. D Ethidium bromide stained Urea-PAGE gel of the crRNA isolated from the same sample as in C; three transcribed RNAs of different lengths were used as markers. E Experimental strategy for PAM identification using plasmid libraries and fluorescent-labeled dsDNA to trap the candidates. To avoid non-specific binding, only the DNA hits bound by Cascade at low concentrations (lanes 3–6) were recovered, each serving individually as a template for further PCR amplification. The WebLogo shows the trinucleotide PAM consensus observed in the Sanger sequencing results, with the PAM motif region highlighted in the yellow box.

PAM-dependent recognition forms the basis for distinguishing host DNA from foreign nucleic acids in type I CRISPR immunity 6, 43. To identify the optimal PAM sequence of the Syn type I-B system, pET28a-NNN-protospacer plasmid libraries were constructed to cover the potential PAM sequence variety. All NNN-protospacers were then amplified into 161 bp 6-FAM-labeled dsDNA and incubated with the Syn Cascade for an Electrophoretic Mobility Shift Assay (EMSA). Specific bands, indicative of DNA binding at low Syn Cascade complex concentrations, were excised and subjected to Sanger sequencing to map the PAM preference. The sequencing results of the recovered DNA suggested that the −3 position of the PAM showed a strong bias for adenine (A), while the −2 and −1 positions preferred Y and G, respectively (Fig. 1E). To further validate the nucleotide preferences at the −2 and −1 positions, we analyzed the binding affinity of the 16 ANN-protospacer substrates for Syn Cascade. With the −3 site of the PAM fixed as A, Syn Cascade bound DNA substrates containing a 5’-A-Y-N-3’ PAM more strongly than those containing 5’-A-R-N-3’ (Supplementary Fig. 1A). Comparing the binding affinities of DNA sequences containing 5’-A-Y-N-3’ for the Cascade complex revealed that AYG has the greatest binding capacity. ATA was a close second, with marginally lower affinity. Both ATY and ACM displayed moderate binding affinities, whereas ACT showed the lowest preference within the AYN group (Supplementary Fig. 1B). The 5’-A-Y-G-3’ preference was re-validated by binding assays between Cascade and eight 5’-N-Y-G-3’ protospacers (Supplementary Fig. 1C, D). The results showed that the nucleotide preference for the −3 position in the PAM sequence was indeed adenine. Taken together, our results indicate that the reconstituted Syn type I-B Cascade has a broader PAM requirement, with 5’-A-Y-G-3’ as the best preference.
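The degenerate PAM notation used above (N, Y, R, M) follows the standard IUPAC nucleotide codes. The following minimal Python sketch expands such patterns and sorts a trinucleotide PAM into the qualitative binding tiers described in the text. The function names and the tier labels are illustrative assumptions, not part of the study's analysis, and no quantitative affinities are implied.

```python
from itertools import product

# Standard IUPAC codes needed for the patterns mentioned in the text.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG",   # purine
    "Y": "CT",   # pyrimidine
    "M": "AC",   # amino
    "N": "ACGT", # any base
}


def expand(pattern):
    """List every concrete sequence covered by a degenerate pattern."""
    return ["".join(bases) for bases in product(*(IUPAC[c] for c in pattern))]


def matches(pattern, pam):
    """True if the concrete PAM is covered by the degenerate pattern."""
    pam = pam.upper()
    return len(pattern) == len(pam) and all(
        base in IUPAC[code] for code, base in zip(pattern, pam)
    )


# Qualitative tiers paraphrased from the text; order matters (first match wins).
TIERS = [
    ("AYG", "highest"),
    ("ATA", "close second"),
    ("ATY", "moderate"),
    ("ACM", "moderate"),
    ("ACT", "lowest within AYN"),
    ("ARN", "weaker than AYN"),
]


def binding_tier(pam):
    """Return the hypothetical tier label for a 5'-to-3' trinucleotide PAM."""
    for pattern, tier in TIERS:
        if matches(pattern, pam):
            return tier
    return "non-A at -3 (disfavoured)"


print(len(expand("NNN")), len(expand("ANN")), len(expand("NYG")))
print(expand("AYG"), binding_tier("ATG"), binding_tier("AGG"))
```

The three counts correspond to the full NNN library, the 16 ANN substrates and the eight NYG substrates used in the binding assays.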

## Integral type I-B Cascade-DNA assembly and Cryo-EM structure analysis

To elucidate the detailed mechanism of how the Syn type I-B Cascade recognizes target DNA, we incubated a 59-bp dsDNA target containing an ATG-PAM (Fig. 2A) with Syn Cascade in a 1:3 molar ratio at 25 °C for 1 h; the unbound dsDNA was then removed by SEC purification. Thereafter, cryo-EM was employed to reveal the structural features of the integral complex. Raw micrographs and reference-free 2D class averages clearly showed particles with a “sea horse”-like shape (Supplementary Fig. 2). The final 3D reconstruction reached an overall resolution higher than 3.6 Å, which was sufficient to trace the direction of the main chain and resolve clear side chains (Fig. 2B). Some peripheral regions, including the Cas6b/crRNA 3’-hairpin, were either not well resolved or were too degenerate for modeling. The target strand (TS) is embedded within the complex and hybridized with the crRNA, representing the full R-loop formed state. However, the non-target strand (NTS) was not fully visible, particularly the bulge for Cas3 recruitment (Fig. 2D). We also observed a subset of particles within our cryo-EM dataset that formed a partial R-loop state, with only 5 nt of the TS hybridized to the crRNA, alongside duplex DNA bound to Cas8b (Fig. 2C, E). This yielded an additional structure, resolved at 3.8 Å (Supplementary Fig. 2).

A crRNA and dsDNA sequences used to program type I-B Cascade for structural studies. Residue numbers and color schemes are followed throughout the text. The PAM region is highlighted in the orange box. B Schematics of the cryo-EM density (left) and modeled structure (right) of the full R-loop formation state. C Schematics of the cryo-EM density (left) and modeled structure (right) of the partial R-loop formation state. D Cryo-EM density of nucleic acids in the full R-loop state. E Cryo-EM density of nucleic acids in the partial R-loop state. F Cartoon representation of the Cas7b₇ backbone binding the crRNA. G The finger domain of the Cas7b subunit disrupts the complementary base pairing between crRNA and target DNA strand at intervals of 5 bases.

The stoichiometry of Syn Cascade was Cas8b₁-Cas7b₇-Cas5b₁-Cas6b₁-Cas11b₃. The helical backbone of the complex, composed of seven successive Cas7b subunits, was clearly identified in the density. It shared structural similarities with that of I-A 12, I-C 13, 14, 15, and I-D 16, 17 Cascades, featuring a longer helical backbone. The crRNA bound to the backbone and threaded through the “finger” domains (Fig. 2F). Each Cas7b subunit occupied 6 nucleotides of the crRNA with a recurring periodic pattern of 5 + 1 nt, where the sixth base flipped out in the opposite direction to the other five 44 (Fig. 2G). The Cas5b subunit is located at the top of the complex and recognizes the 5’ handle of crRNA.
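As a back-of-the-envelope check, the stoichiometry can be combined with the subunit masses read from the SDS-PAGE gel to estimate the size of the complex, and the 5 + 1 periodicity can be turned into a list of flipped-out crRNA positions. The sketch below is illustrative only: Cas11b is entered at its 15 kDa upper bound, the average mass of about 0.33 kDa per RNA nucleotide is a generic approximation, and numbering the crRNA from the first nucleotide held by the first Cas7b subunit is an assumption.

```python
# (copies, approximate mass in kDa) per subunit; Cas11b uses the <15 kDa upper bound.
SUBUNITS = {
    "Cas8b": (1, 70.0),
    "Cas7b": (7, 35.0),
    "Cas5b": (1, 26.5),
    "Cas6b": (1, 24.4),
    "Cas11b": (3, 15.0),
}

CRRNA_LENGTH_NT = 71      # from the Urea-PAGE estimate in the text
KDA_PER_NUCLEOTIDE = 0.33  # assumed generic average for RNA


def protein_mass_kda(subunits):
    """Sum of copies x mass over all protein subunits."""
    return sum(copies * mass for copies, mass in subunits.values())


def complex_mass_kda(subunits, rna_length, kda_per_nt):
    """Protein mass plus an approximate crRNA mass."""
    return protein_mass_kda(subunits) + rna_length * kda_per_nt


def flipped_positions(n_cas7, nt_per_subunit=6):
    """1-based crRNA positions of the flipped-out base under a 5 + 1 pattern."""
    return [nt_per_subunit * i for i in range(1, n_cas7 + 1)]


n_cas7 = SUBUNITS["Cas7b"][0]
print(round(protein_mass_kda(SUBUNITS), 1))
print(round(complex_mass_kda(SUBUNITS, CRRNA_LENGTH_NT, KDA_PER_NUCLEOTIDE), 1))
print(flipped_positions(n_cas7))
```

Under these assumptions the protein components sum to roughly 411 kDa at most, which is in line with elution slightly ahead of the 440 kDa ferritin standard, and seven Cas7b subunits cover 42 nt of crRNA with every sixth base flipped out.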

Adjacent to Cas5b is the large subunit Cas8b, which displays very low sequence similarity with the large subunits of any other known type I CRISPR systems. It also has low sequence identity with the large subunits of other subtypes of the I-B system, e.g., only 29.85% similarity with the Cas8 of the I-B CAST system (Supplementary Fig. 3). The Cas8b large subunit comprises an N-terminal domain (NTD) and a helical C-terminal domain (CTD). In contrast to other type I systems, where the NTD of the large subunit displays poor density and cannot be accurately modeled in the partial R-loop state, the Cas8b NTD in the Syn type I-B Cascade structure is well resolved in both full and partial R-loop states. The only exception is the β-sheet consisting of residues 94–117, which is positioned laterally to the main structure (Supplementary Fig. 4) and is missing in the map. This β-sheet is located in a position similar to the recruit loop of Cas8b in the Nla I-C subtype, which may function in Cas3 recruitment 15. The C-terminal portion, identical to the Cas11b small subunit, is similar in size and secondary structure to the α-helical bundle observed in the small Cas11 subunit of most Class 1 effectors, though it exhibits low identity with other Cas11 proteins 16. The C-terminal domain of Cas8b and three Cas11b small subunits together formed the inner “belly” of the integral complex. They provide support for the non-target strand (NTS) within the full R-loop state structure (explained in detail later).

## Cas8b NTD and Cas5b are responsible for PAM recognition

In type I systems, the best-studied PAM recognition typically involves large subunit-mediated DNA minor groove contacts. This recognition involves three components: specific residues on a Gly-rich loop in the large subunit’s NTD that interact with the DNA’s minor groove, a Gln-wedge that inserts itself into the dsDNA path beneath the PAM, and a Lys-finger that forms favorable electrostatic interactions with a pyrimidine in the PAM 45. However, the Gln-wedge may change to an Asn-wedge in type I-C 14 or to a Lys-wedge in type I-F 23, and the Lys-finger is replaced by Asn in type I-C 14 and type I-F 23.

In our I-B system, a loop composed of residues 154–156 (GVP), bridging two helices in the NTD of Syn Cas8b, is proximal to and interacts with the minor groove of the PAM duplex. Notably, a “GNS” loop (residues 101–103) of Syn Cas5b intercalated into the major groove of the PAM, opposite to the “GVP” loop from Cas8b, aiding in PAM recognition (Fig. 3A, C). Within the “GNS” loop, the N102 residue closely interacts with the amino group of A NT-3, forming a hydrogen bond (Fig. 3B). This might explain why Syn type I-B Cascade strongly prefers A at PAM position −3. Similarly, the G101 residue is also adjacent to the major groove, allowing ample room for A T-2. In the type I-D system, the Cas5d subunit also closely contacts the major groove of the target DNA, but it merely serves an accessory role in stabilizing the DNA 15. Compared to the wild type (WT), the G101A and N102A mutants of Syn Cas5b showed significantly reduced DNA binding affinity and preference for the PAM sequence (Supplementary Fig. 5). This suggests that the “GNS” loop of Syn Cas5b plays a pivotal role in PAM recognition.

A Binding pattern of target DNA with Cas8b and Cas5b. B N102 of Cas5b forms an H-bond with A NT-3 (blue dashed lines). C A “GVP” loop (154–156) of Cas8b is proximal to and interacts with the minor groove of the PAM duplex. D The wedge N332 of Cas8b forms two H-bonds with G NT-1 and C T-1, respectively. The adjacent S333 interacts with the ribose of C T-1. H-bonds are depicted using yellow dashed lines. E Schematic of the residues involved in PAM recognition for Syn Cascade. F, G Specific residues in the Cas8b NTD involved in NTS stabilization. The positively charged residues and aromatic residues form non-specific interactions with the NTS backbone and bases, respectively. H-bonds and aromatic interactions are illustrated using blue and black dashed lines, respectively. H Specific residues in Cas11b small subunits involved in NTS stabilization. H-bonds are depicted using blue dashed lines.

Residue P156 of the “GVP” motif in Cas8b is situated close to the minor groove of the PAM sequence (Fig. 3C). Its rigidity may explain why PAM-2 favors Y: a purine nucleotide would cause a steric clash with the “GVP” loop (Supplementary Fig. 6). The insertion of a wedge structure to initiate DNA duplex unwinding was also observed in Syn Cas8b. Within this wedge, N332 establishes two hydrogen bonds (Fig. 3D): one with the N1 of G NT-1 and the other with the N3 of C T-1 , which favors G at PAM-1. The adjacent S333 forms a hydrogen bond with the ribose oxygen of C T-1 (Fig. 3D). These interactions, along with other backbone-stabilizing interactions, aid in binding to dsDNA targets. Taken together, these results suggest that both the “GNS” loop of Cas5b and the “GVP” loop of Cas8b are responsible for AYG-PAM recognition in Syn type I-B Cascade (Fig. 3E).

## NTS stabilization and conformational dynamics during full R-loop formation

In our full R-loop state structure, we modeled 6 bp of PAM-proximal dsDNA and 19 nt of NTS ssDNA. Of these, 9 nt of NTS ssDNA were located directly downstream of the PAM, while the other 10 nt, which constitute the PAM-distal region stabilized by the small subunits, were modeled as “A” (Fig. 2D). Positively charged residues (R88, R159, K244, R245 and R282) within the Cas8b NTD made sequence-independent contacts with the negatively charged NTS backbone. In addition to electrostatic contacts, we identified aromatic residues (Y119 and F281) of the Cas8b NTD that participated in stacking interactions with NTS bases, which stabilized the single-stranded region of the NTS seed (Fig. 3F, G). Beyond the 9 nucleotides modeled downstream of the PAM, the density of the NTS on the surface of the tail of the Cas8b NTD deteriorated, and its direction could not be determined. A similar situation was observed in most other type I systems except for I-C systems 14 , 15 .

In our partial R-loop state structure, only 5 nt of the TS hybridized with the crRNA and 3 nt of the NTS downstream of the PAM were modeled (Fig. 2E). Comparative analysis of the overall structure of Cas8b in the two states showed that its CTD is extended in the full R-loop structure (Fig. 4A), with an RMSD of 6.7 Å. This extension was accompanied by a significant downward and outward displacement of 11.8 Å, 12.0 Å, 13.7 Å, 6.7 Å, and 9.8 Å in Helix1, Helix2, Helix4, Helix5, and Helix6, respectively, while the position of Helix3 was unchanged (Fig. 4B−E). This extension might promote the creation of the ssDNA-binding groove, which is vital for supporting the NTS and recruiting Cas3.
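As a reminder of what an RMSD value such as the one above summarizes, the following minimal sketch computes the root-mean-square deviation between two sets of matched atomic coordinates. It assumes the two structures have already been superposed and that atoms are paired one-to-one; the function name `rmsd` and the toy coordinates are illustrative only, and the text does not state which atoms or software were used for the reported values.

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD between two equally long lists of (x, y, z) points, in the input units."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))

# Toy coordinates (not from the deposited models): every atom is shifted
# by 3 Å along x and 4 Å along y, i.e. a rigid 5 Å displacement.
state_one = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
state_two = [(x + 3.0, y + 4.0, z) for x, y, z in state_one]
toy_rmsd = rmsd(state_one, state_two)
print(toy_rmsd)
```

Because every toy atom moves by the same distance, the RMSD equals that distance; in a real comparison the value averages over atoms that move by different amounts, which is why per-helix displacements can exceed the overall RMSD.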

A Alignment of Cas8b in two states. B – E Conformational changes in Cas8b-CTD of Syn Cascade between full (green) and partial (gray) R-loop state. F Alignment of two Cascade structures with the backbones overlapped. G , H Conformational transition in the belly (Cas8b CTD and Cas11.1-3) between full (cyan) and partial (yellow) R-loop state. Arrows indicate the direction of movements upon full R-loop formation.

Comparison of the “belly” in the two structures reveals a pivoting motion of the extended Cas8b CTD, accompanied by a correlated motion and rotation of the three Cas11b subunits reflected by an RMSD of 3.15 Å (Fig.  4F–H ). This motion leads to a reduction in the spatial distance between the belly and the backbone. Concurrently, the small subunits position the NTS ~ 22 Å above the DNA/crRNA heteroduplex, facilitated by electrostatic interactions between the positively charged residues (R85, K118, K39) and the negatively charged backbone of the NTS (Fig.  3H ). These features suggest that the R-loop formation follows a kinetically favorable mechanism, analogous to that observed in type I-C 14 .

## Type I-B CRISPR-Cas3 mediated genome editing in human cells

CRISPR/Cas9-engineered T cells have shown high efficiency and safety in cancer immunotherapy 46 , 47 , 48 , 49 . We therefore explored the potential use of Syn CRISPR-Cas3 for genome editing in human cells by disrupting the T-cell receptor α constant ( TRAC ) locus in CD3 + T cells. Previous studies showed that the Nla I-C system achieved a 95% editing efficiency using RNP delivery, but resulted in only an 8% lesion of EGFP in HAP1 cells when using mRNA delivery 32 . This suggests that mRNA-mediated delivery might not be optimally effective, but we posited that the efficacy of mRNA delivery could vary depending on the specific CRISPR-Cas system and cell type.

Hence, we examined the editing efficiency of this type I-B Cascade-Cas3 system when delivered as mRNA. Two 71-nt crRNAs containing protospacer G1 or G2, each targeting a 35-bp region in the TRAC locus flanked by a 5’-ATG-3’ PAM, were designed (Fig. 5A). 5’-capped and 3’-polyA-tailed mRNAs for cas3 , cas8b , cas7b , cas5b , cas6b and cas11b of the Syn type I-B system were transcribed in vitro and then electroporated into CD3 + T cells, along with mature crRNA (Fig. 5B). The endogenous TCR-α chain was disrupted by knockout of the TRAC gene, and a specific monoclonal antibody was used to track TCR-αβ expression, which occurs on the T-cell surface only when both TCR-α and -β chains are co-expressed. Editing efficiency was monitored by flow cytometry based on the expression level of TCR. As a negative control, cas mRNAs were delivered with a crRNA targeting a non- TRAC gene, which resulted in a negligible signal above the untreated background. crRNAs containing G1 and G2 generated an average of 35.56% and 36.62% editing, respectively (Fig. 5C, D). Cells with significantly reduced TCR expression were collected by flow cytometry, and their genomic DNA was extracted to further characterize the Syn Cascade-Cas3 genome editing profile. Long-range PCR was performed using the extracted genomic DNA as a template, and PacBio sequencing yielded a total of 73 deletion fragments, all generated by G1. Among them, the 32 fragments with deletions larger than 1 kb are shown in Fig. 5E, and the remainder are listed in Table S5. The deletions were uniformly initiated within a window from +320 nt to +348 nt of the PAM, except for one that started at −85 nt of the PAM. The deletion endpoints varied within the ~4.5 kb PAM-proximal region (Fig. 5E). Taken together, we concluded that Syn CRISPR-Cas3 can be efficiently delivered as mRNA and creates a spectrum of long-range deletions that are unidirectional relative to the target in human T cells.

A Schematic of the TRAC locus, with protospacers for the two TRAC-targeting Cascades shown in blue and corresponding PAM in magenta. B Schematics of cas mRNAs and mature crRNA used in C . The TRAC -targeting CRISPR spacer is shown in green. C mRNAs encoding Syn proteins were electroporated into T cells, along with a TRAC-targeting CRISPR in the form of mature crRNA. Editing efficiencies were evaluated and plotted. Data are shown as mean ± SEM, n  = 5 independent healthy donors. D Representative flow cytometry plots of experiment in C , with percentages of TCR- T cells in the population shown on the top. E 32 long-range deletions (>1 kb) location at the TRAC locus for G1, revealed by PacBio sequencing of the long-range PCR products using the extracted genomic DNA as a template. Deleted genomic regions, G1-protospacer, and the onset of deletion are shown as black lines, blue lines, and red dots, respectively.

The type I CRISPR-Cas system targets invasive genetic elements during the interference stage in a stepwise manner 7 , 11 , which may minimize off-target effects and enable long-range deletions that other CRISPR-Cas systems cannot achieve 50 , 51 , 52 . With the elucidation of the type I-B Cascade structure in this study, now we have a complete picture of all seven type I CRISPR Cascades. As shown in Fig.  6 , the structural and functional features of the Cascade “backbone” in the type I CRISPR-Cas system are similar, albeit with slight differences in the number of subunits and the degree of bending. Coincidentally, the number of copies of the small subunit, including the CTD of the large subunit that comprises the Cascade “belly”, decreases progressively from subtype I-A to I-G. The large subunit of Cascade shares common functions, including facilitating PAM recognition and subsequent dsDNA binding. Additionally, it provides the surface position for the binding of nuclease Cas3 11 , despite the absence or low sequence similarity between them. While the involvement of large subunits in PAM recognition is common and well-recognized in type I systems, the participation of Cas5, originally responsible for binding and stabilizing the 5’ handle of crRNA, in PAM recognition is rare. This involvement has only been observed in the type I-D 17 and type I-B systems. In type I-C 13 and I-G 24 systems, Cas5 or Csb2 can functionally replace Cas6 in crRNA processing and maturation. This characteristic contributes to the streamlined nature of the I-C and I-G systems.

The Cas7 backbone is shown in light gray, Cas5 in orange, Cas6 in pink, Cas11 (SSU in type I-C and Cse2 in type I-E) in green and yellow, Cas8 (Cas10d in type I-D, Cse1 in type I-E) in purple, Cas3 in red, crRNA in dark green, TS in blue, NTS in cyan. A The structure of type I-A Cascade 12 , PDB: 7TR8. B The structure of type I-B Cascade, PDB: 8H67. C The structure of type I-C Cascade 13 , PDB: 7KHA. D The structure of type I-D Cascade 17 , PDB: 7SBA. E The structure of type I-E Cascade 19 , PDB: 5U07. F The structure of type I-F Cascade 21 , PDB: 6B45. G The structure of type I-G Cascade 24 , PDB: 8ANE.

Apart from the structural and functional differences of Cascade among type I systems, there are two different mechanisms for Cas3 recruitment. Types I-B (Supplementary Fig. S7), I-C, I-D, I-E, and I-F exhibit the canonical trans-recruitment mechanism, in which Cas3 recruitment depends on target DNA binding and full R-loop formation 13 , 14 , 15 , 16 , 17 , 18 , 19 , 20 , 21 , 22 , 23 . Type I-A and possibly I-G systems, however, use a different, allosteric-activation mechanism to degrade the substrate 12 , 24 . In type I-A systems, Cascade and Cas3 function as an integral effector complex, and the HD nuclease domain of Cas3 remains autoinhibited until it is activated upon full R-loop formation.

Though most subtypes of type I systems, especially type I-C, I-E, and I-F, have gained immense attention in recent years, the type I-B CRISPR-Cas system remains poorly understood. We implemented I-B Cascade-Cas3-mediated target genome degradation to characterize its gene editing features in eukaryotes. By inducing DNA lesions in CD3 + T cells at two sites of the TRAC locus, we found that Syn I-B CRISPR created target-specific deletions of up to ~4.5 kb that were all unidirectional, extending upstream of the PAM. Although the potential of Syn I-B to generate longer-range deletions has not been validated because of the restricted length of the TRAC locus, the observed editing features of type I-B are consistent with those of type I-E 31 and Nla I-C 32 . In our work, Syn I-B achieved up to 41.2% editing efficiency at the TRAC locus in CD3 + T cells. The genome editing efficiency of the Nla I-C system using mRNA delivery was poor when the crRNA was designed as a multimeric pre-CRISPR transcript 32 . Therefore, we compared the impact of different crRNA designs on editing efficiency (Supplementary Fig. S8) and posited that pre-CRISPR RNA maturation may be the rate-limiting step of Cascade assembly. Indeed, using mature crRNA instead of pre-CRISPR RNA significantly boosted the editing efficiency in this study. In summary, our work demonstrates that the Syn type I-B CRISPR-Cas system is a powerful gene editing tool with robust editing efficiency when delivered in mRNA form, expanding the genome editing toolbox.

## DNA oligonucleotides

All the sequences (HPLC purified) used in this study are listed in Supplementary Table 1 and were purchased from Sangon Biotech (Shanghai) Co., Ltd (Shanghai, China).

## Plasmids construction

The genes cas6b and cas11b were inserted in order into a polycistronic pRSFDuet-1 vector using the BamH I-Hind III and BamH I-Xho I restriction sites, respectively, while the genes cas8b , cas7b and cas5b were cloned into the pCDFDuet-1 vector at BamH I-Xho I restriction sites. To construct the Cas3 expression plasmid, the synthetic cas3 gene was cloned into the pET28a vector using Nde I and Xho I restriction sites. All constructs were verified by Sanger sequencing (Sangon Biotech). The commercially synthesized pUC19-CRISPR array plasmid was co-transformed with pCDF- cas8b - cas7b-cas5b and pRSF- cas6b - cas11b into competent E. coli BL21 (DE3) cells for expression of the Cascade complex.

## Protein purification

E. coli BL21 (DE3) harboring the WT or mutant Cascade complex (Cas8b, Cas5b, Cas7b, Cas6b, Cas11b, and crRNA) was grown in LB broth supplemented with 50 µg/ml kanamycin, ampicillin, and streptogramin at 37 °C until the OD 600 reached 0.6. At this point, expression of the Cascade complex was induced with 0.5 mM isopropylthio-β-D-galactoside (IPTG), and cells were allowed to grow for 12 h at 25 °C. Cells were harvested and resuspended in binding buffer (20 mM Tris-Cl pH 7.5, 500 mM NaCl). Cells were lysed using an ultrasonic cell disruptor, and cell debris was removed by centrifugation at 4 °C and 19,000 × g for 30 min. The clear supernatant was loaded onto a pre-equilibrated 3 ml Strep-Tactin affinity column (IBA Lifesciences). After loading, the column was washed with 10 column volumes (CV) of binding buffer to remove unbound proteins, and the Cascade complex was then eluted in binding buffer containing 5 mM d-Desthiobiotin. The eluate was further purified by size-exclusion chromatography (SEC, Superdex™ 200 Increase 10/300 GL column, Cytiva). Subsequently, the Cas proteins and crRNA in the Cascade complex were analyzed by SDS-PAGE and Urea-PAGE, respectively. After concentration, samples were flash-frozen in liquid nitrogen and stored at −80 °C until use.

E. coli BL21 (DE3) harboring Cas3 was grown in LB broth supplemented with 50 µg/ml kanamycin at 37 °C until the OD 600 reached 0.6. Cas3 expression was induced with 0.5 mM IPTG, and cells were allowed to grow overnight at 25 °C. Cells were harvested and resuspended in binding buffer (20 mM HEPES pH 7.5, 500 mM NaCl, 20 mM imidazole, 5% (v/v) glycerol). The cell lysate was generated by ultrasonication and clarified by centrifugation at 4 °C and 19,000 × g for 30 min. The clarified supernatant was passed through a pre-equilibrated Ni 2+ affinity column (Cytiva). After washing with 10 CV of binding buffer, Cas3 was eluted with an imidazole gradient in buffer composed of 20 mM HEPES (pH 7.5), 500 mM NaCl, 50–500 mM imidazole, and 5% (v/v) glycerol. Fractions were pooled and further purified by SEC (Superdex™ 200 Increase 10/300 GL column, Cytiva). Concentrated samples were flash-frozen in liquid nitrogen and stored at −80 °C until further use.

## PAM library generation

To generate a PAM library, a set of complementary oligonucleotides PF-Mix-PAM and PR-Mix-PAM were commercially synthesized (Sangon Biotech). In each single-strand oligonucleotide, a 3 nt PAM sequence (NNN) was linked to the 5’ end of the protospacer sequence, with BamH I and Xho I cut sites flanking the whole sequence. After 5’-OH phosphorylation and annealing, these oligonucleotides were ligated into pET-28a vector with T4 Fast ligase and then the ligation products were transformed into DH5α. The generated pET28a-NNN-protospacer plasmid library encompassed 64 potential PAM sequences, the coverage of which was ascertained through Sanger sequencing (Sangon Biotech).
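The figure of 64 potential PAM sequences follows directly from a 3-nt randomized position (4 bases at each of 3 positions). A minimal sketch that enumerates them is given below; the function name `enumerate_pams` is illustrative and not part of the original protocol.

```python
from itertools import product

BASES = "ACGT"

def enumerate_pams(length=3):
    """Return every DNA sequence of the given length, in lexicographic order."""
    return ["".join(p) for p in product(BASES, repeat=length)]

pam_library = enumerate_pams()
print(len(pam_library))   # 4 ** 3 possible NNN sequences
print(pam_library[:4])
```

A list like this can serve as a checklist when confirming that every PAM variant is represented in the sequenced library.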

## PAM determination

To ascertain the PAM preferences of the I-B system, the binding affinity between the Cascade complex and all potential PAM sequences was evaluated. Using primers PF-161 and PR-161, 161 bp dsDNA segments were amplified by PCR. A subsequent amplification was carried out using primers PF-6-FAM and PR-161 to produce 5’-FAM-labeled target DNA. The coverage of all 64 PAM sequences was verified by Sanger sequencing to ensure that the DNAs in the library were relatively homogeneous prior to subsequent assays (Supplementary Fig. 9A). The 161 bp 6-FAM-PAM library DNA molecules (320 nM) were then incubated with Cascade complex (0, 10, 20, 40, 100, 200, 400, 800, 2000 nM) in buffer containing 20 mM HEPES (pH 7.5) and 100 mM NaCl at 25 °C for one hour. Incubated samples were electrophoresed on a 2% agarose gel and then visualized in a gel imager. Bands that exhibited specific binding at low concentrations of the Cascade complex were selected, and the DNA extracted from these bands was PCR-amplified using the primers PF-161 and PR-161. The PCR products were then analyzed by Sanger sequencing to determine the PAM preference (Supplementary Fig. 9B).

To determine the PAM preferences at −2 and −1 positions, 16 single-stranded DNAs (59 nt) with A at PAM-3 were synthesized (Sangon Biotech). The oligonucleotides were annealed and subsequently amplified through two rounds of PCR using the primer pairs PF-97/PR-97 and PF-97/PR-CY5, yielding 5’-CY5 labeled 97 bp ANN-PAM DNAs. Each PCR product containing a certain PAM sequence underwent incubation with increasing concentrations of the Cascade complex (0 to 200 nM) at 25°C for one hour, after which the binding affinity was determined through an electrophoretic mobility shift assay. The oligonucleotides and primer sequences for this study are listed in Supplementary Tables  1 and 2 .
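The 16 ANN substrates correspond to fixing A at PAM-3 and varying the other two positions over all four bases. The sketch below expands such IUPAC-style motifs and checks which ANN variants fall within the AYG consensus reported in the Results; the helper names `expand` and `matches` and the reduced IUPAC table are illustrative assumptions, not part of the original analysis.

```python
from itertools import product

# Subset of IUPAC nucleotide codes sufficient for the motifs used here.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "Y": "CT", "R": "AG", "N": "ACGT"}

def expand(motif):
    """List every concrete sequence encoded by an IUPAC motif."""
    return ["".join(p) for p in product(*(IUPAC[symbol] for symbol in motif))]

def matches(seq, motif):
    """True if a concrete sequence is compatible with an IUPAC motif."""
    return len(seq) == len(motif) and all(
        base in IUPAC[symbol] for base, symbol in zip(seq, motif)
    )

ann_pams = expand("ANN")                               # the 16 tested variants
ayg_pams = [p for p in ann_pams if matches(p, "AYG")]  # variants within AYG
print(len(ann_pams), ayg_pams)
```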

## Electrophoretic mobility shift assay

Fluorescently labeled target DNA at a final concentration of 10 nM was incubated with titrations of Cascade complex in a 20 μL total reaction volume containing 20 mM HEPES pH 7.5 and 100 mM NaCl. After one hour of incubation at 25 °C, 10 μL of each sample was loaded onto a 6% acrylamide gel. Electrophoresis was performed in 0.5× TBE buffer at 150 V for 35 min in a cold room. DNA was visualized by fluorescence imaging in a Tanon MINI Space 3000 system, and images were quantified using ImageJ software. The fraction of DNA bound (amount of bound DNA divided by the sum of free and bound DNA) was plotted versus the concentration of Cascade and fit to a one-site specific binding model with Hill slope using GraphPad Prism 8.0.1. Each PAM sequence was tested in at least three independent experiments.
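To make the quantification step concrete, the sketch below computes the fraction bound from band intensities and evaluates the one-site specific binding model with Hill slope, Y = Bmax·X^h / (Kd^h + X^h). The fit shown here is a simple grid search written for illustration; it is a stand-in for, not a reproduction of, the nonlinear regression performed in GraphPad Prism. All function names, grids, and the synthetic data (generated from Kd = 40 nM, h = 1.5) are assumptions.

```python
def fraction_bound(bound, free):
    """Bound signal divided by the sum of free and bound signal."""
    total = bound + free
    if total <= 0:
        raise ValueError("total band intensity must be positive")
    return bound / total

def hill(conc, bmax, kd, h):
    """One-site specific binding with Hill slope."""
    if conc == 0:
        return 0.0
    return bmax * conc ** h / (kd ** h + conc ** h)

def fit_hill(concs, fractions, kd_grid, h_grid, bmax=1.0):
    """Grid search for the (Kd, h) pair minimizing the sum of squared errors."""
    best = None
    for kd in kd_grid:
        for h in h_grid:
            sse = sum((hill(c, bmax, kd, h) - f) ** 2 for c, f in zip(concs, fractions))
            if best is None or sse < best[0]:
                best = (sse, kd, h)
    return best[1], best[2]

# Synthetic, noise-free titration (nM); not measured data.
concs = [0, 10, 20, 40, 100, 200]
fractions = [hill(c, 1.0, 40, 1.5) for c in concs]

kd_fit, h_fit = fit_hill(
    concs,
    fractions,
    kd_grid=range(1, 201),                  # candidate Kd values, 1 to 200 nM
    h_grid=[i / 10 for i in range(5, 31)],  # candidate Hill slopes, 0.5 to 3.0
)
print(kd_fit, h_fit)
```

With real gel data the fractions carry noise, so a proper nonlinear least-squares routine with confidence intervals is preferable to a coarse grid.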

## In vitro assembly of Cascade-DNA complex

The oligonucleotides synPAM-14F and synPAM-14R, which contain the “ATG” PAM sequence and the protospacer, were annealed to generate a double-stranded DNA. The dsDNA was incubated with Cascade at 25 °C for one hour at a molar ratio of 3:1. The sample was centrifuged and loaded onto a Superdex™ 200 Increase 10/300 GL column (Cytiva) to remove excess DNA molecules. SDS-PAGE analysis was also conducted to further confirm the assembly of the Cascade-DNA complex in vitro.

## Cryo-EM data acquisition

Three microliters of 1 mg/ml SEC-purified Cascade-DNA complexes were applied to a gold grid that had been glow discharged for 45 s. After being stained with phosphotungstic acid hydrate, samples were loaded into a field emission transmission electron microscope (Thermo Fisher) for morphological observation. Samples were then concentrated to 4.6 mg/ml and applied to a gold grid (1.2/1.3 300 mesh) that had been glow discharged for 45 s. The grids were blotted at 16 °C and 100% humidity, and plunge-frozen in liquid ethane using the Vitrobot Mark IV (blot time 5 s, wait time 30 s). Cryo-EM images were manually collected on an FEI Titan Krios G3i electron microscope operated at 300 kV and equipped with a K2 Summit electron detector (Gatan). Images were collected in counting mode, with a nominal defocus range of −1.0 to −2.1 μm at a nominal magnification of 130,000×, corresponding to a calibrated pixel size of 1.1 Å/pixel. The total exposure time of each movie stack was 7.5 s, leading to a total accumulated dose of 50 e-/Å 2 , which was fractionated into 40 frames. The data collection parameters are listed in Supplementary Table 4.
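The per-frame quantities implied by these acquisition parameters can be derived with simple arithmetic, sketched below. Only the four input values come from the text; the derived numbers assume the dose was delivered evenly across frames, and the variable names are illustrative.

```python
# Values stated in the data acquisition description.
total_dose = 50.0   # accumulated dose, e-/Å^2
exposure_s = 7.5    # total exposure time per movie stack, s
n_frames = 40       # frames per movie stack
pixel_size = 1.1    # calibrated pixel size, Å/pixel

# Derived quantities, assuming the dose is spread evenly over the frames.
dose_per_frame = total_dose / n_frames                       # e-/Å^2 per frame
frame_time_s = exposure_s / n_frames                         # s per frame
dose_rate_area = total_dose / exposure_s                     # e-/Å^2 per s
dose_rate_pixel = dose_rate_area * pixel_size ** 2           # e-/pixel per s

print(f"{dose_per_frame:.2f} e-/Å^2/frame, {frame_time_s:.4f} s/frame")
print(f"{dose_rate_area:.2f} e-/Å^2/s, {dose_rate_pixel:.2f} e-/pixel/s")
```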

## Cryo-EM data processing and model building

Motion correction, CTF (contrast transfer function) estimation, particle picking, 2D classification, 3D classification, and non-uniform 3D refinement were performed in CryoSPARC (version 4.2). A series of standard refinement procedures, including 2D and 3D classification, was performed to obtain the final maps, as shown in Fig. S2. The initial models of each Cas protein were generated using AlphaFold 53 . The initial models were first docked into the cryo-EM density map in UCSF Chimera 54 and manually rebuilt using Coot 55 . Models were subsequently adjusted in Coot 55 and refined using phenix.real_space_refine 56 . The quality of the structural model was checked using the MolProbity program in Phenix 57 . The detailed refinement statistics are listed in Supplementary Table 4.

## Primary T-cell culture

Healthy human peripheral blood mononuclear cells (PBMCs) were purchased from StemCell Technologies (Cat# 70025) and used according to the manufacturer’s instructions. CD3 + T cells were then further isolated by magnetic negative selection using an EasySep Human T-Cell Isolation Kit (STEMCELL, Cat# 17951). Immediately after isolation, T cells were cultured in Gibco CTS AIM V Medium (Thermo Fisher) and stimulated for 2 days with anti-human CD3/CD28 magnetic dynabeads (Thermo Fisher, Cat# A56992) at the beads-to-cells concentration ratio of 1:1 supplemented with human IL-2 at 200 U/ml (Peprotech). After electroporation, T cells were cultured in media with IL-2 at 100 U/ml. Throughout the culture period, T cells were maintained at an approximate density of 1 million cells per ml of media. Every 2-3 days after electroporation, additional media was added, along with additional fresh IL-2 to bring the final concentration to 100 U/ml.

## RNA synthesis

Synthetic CRISPR RNA (crRNA) was chemically synthesized (GenScript Biotech), resuspended to 160 µM, aliquoted and stored at −80 °C. The first and last 3 bases of the crRNA were chemically modified with 2’ O-Methyl.

## In vitro transcription of mRNAs

The full DNA sequences encoding the six Cas proteins were cloned into an IVT template plasmid carrying a T7 promoter, 5’ and 3’ UTR elements, and a poly(A) tail. Endotoxin-free and linearized plasmid preparation service was provided by GenScript Biotech and used as DNA templates. 5’ capped and 3’ polyadenylated mRNAs were synthesized with mMessage mMachine T7 Ultra kit (Thermo Fisher) using m1Ψ−5’-triphosphate (TriLink N-1081) instead of UTP and contained 120 nucleotide-long poly(A) tails. All mRNAs were purified by cellulose purification and analyzed by agarose gel electrophoresis, then stored at −20 °C.

## mRNA electroporation

mRNA and crRNA were electroporated 48 h after initial T-cell stimulation. De-beaded cells were centrifuged for 10 min at 90 × g, the supernatant was aspirated, and the cells were resuspended in Lonza electroporation buffer P3 using 20 µL buffer per 1 million cells. For optimal editing, T cells were electroporated per well using a Lonza 4D electroporation system with pulse code EH115. Unless otherwise indicated, 2 µL mRNAs (50, 120, 120, 120, 140, 120 ng of cas3 , cas5b , cas6b , cas7b , cas8b , cas11b ) were electroporated, along with 2 µg crRNA. Immediately after electroporation, 80 µL of pre-warmed media was added to each well, and cells were allowed to rest for 10 min at 37 °C in a cell culture incubator while remaining in the electroporation cuvettes. After 10 min, cells were moved to 24-well tissue culture plates.

## Flow cytometry

TCR surface disruption was quantified by flow cytometry 5 days post-electroporation. Transfected primary T cells were dissociated into single cells and analyzed on an Accuri C6 Plus or LSR Fortessa (BD). Surface staining for flow cytometry was performed by pelleting cells and resuspending them in 25 µl of PBS with 2% FBS and APC anti-human TCR α/β (Biolegend, Cat# 306717) for 20 min at 4 °C in the dark. APC human IgG1 isotype control recombinant antibody (Biolegend, Cat# 403505) was used as the isotype control. Cells were washed twice in FACS buffer before resuspension. Flow cytometry data were analyzed with FlowJo v10.7.1.

## DNA lesion analysis by long-range PCR and PacBio sequencing

The genomic DNA of the edited cells was isolated using a Puregene Cell Kit (8 × 10⁸) (Qiagen) per the manufacturer’s instructions. Long-range PCR reactions were carried out with a KOD FX Neo kit (TOYOBO) using gDNA as a template; the corresponding primers are detailed in the Supplementary Tables. PCR products (9.4 kb) were resolved on a 0.8% agarose gel, stained with SYBR Safe (Invitrogen), and visualized using a ChemiDoc MP imager (BioRad). To precisely define Cas3-induced deletions or insertions, the PCR products were analyzed by PacBio sequencing.
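Once deletion boundaries have been called from the sequencing reads, sorting them by size (for example, to separate deletions larger than 1 kb, as done for Fig. 5E) is straightforward. The sketch below assumes each deletion is represented by start and end coordinates in a shared reference frame; the `Deletion` class, the `split_by_length` helper, and the coordinates are all hypothetical and do not reproduce the study's data or pipeline.

```python
from dataclasses import dataclass

@dataclass
class Deletion:
    start: int  # first deleted position in the shared reference frame (nt)
    end: int    # last deleted position in the same frame (nt)

    @property
    def length(self):
        """Deletion size in nucleotides, independent of strand orientation."""
        return abs(self.end - self.start)

def split_by_length(deletions, threshold=1000):
    """Partition deletions into (longer than threshold, threshold or shorter)."""
    long_dels = [d for d in deletions if d.length > threshold]
    short_dels = [d for d in deletions if d.length <= threshold]
    return long_dels, short_dels

# Hypothetical coordinates for illustration only.
calls = [Deletion(330, 2830), Deletion(325, 925), Deletion(348, 4348)]
long_dels, short_dels = split_by_length(calls)
print([d.length for d in long_dels], [d.length for d in short_dels])
```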

## Reporting summary

Further information on research design is available in the  Nature Portfolio Reporting Summary linked to this article.

## Data availability

The cryo-EM density maps of the Syn Cascade-dsDNA complex in the partial and full R-loop states have been deposited in the Electron Microscopy Data Bank under accession numbers EMD-34495 and EMD-35629 , respectively. The associated atomic coordinates have been deposited in the Protein Data Bank under accession codes PDB 8H67 and PDB 8IP0 . All materials and data are available upon request from the corresponding authors Meiling Lu ([email protected]), Zhenhuang Yang ([email protected]), and Yibei Xiao ([email protected]). Source data are provided with this paper.

Makarova, K. S., Grishin, N. V., Shabalina, S. A., Wolf, Y. I. & Koonin, E. V. A putative RNA-interference-based immune system in prokaryotes: computational analysis of the predicted enzymatic machinery, functional analogies with eukaryotic RNAi, and hypothetical mechanisms of action. Biol. Direct 1 , 7 (2006).


Barrangou, R. et al. CRISPR provides acquired resistance against viruses in prokaryotes. Science 315 , 1709–1712 (2007).

Marraffini, L. A. & Sontheimer, E. J. CRISPR interference limits horizontal gene transfer in Staphylococci by targeting DNA. Science 322 , 1843–1845 (2008).

Brouns, S. J. J. et al. Small CRISPR RNAs guide antiviral defense in prokaryotes. Science 321 , 960–964 (2008).

Wiedenheft, B., Jore, M. M., Brouns, S. J. J. & van der Oost, J. Structures of the RNA-guided surveillance complex from a bacterial immune system. Nature 477 , 486–489 (2011).

Westra, E. R. et al. CRISPR immunity relies on the consecutive binding and degradation of negatively supercoiled invader DNA by cascade and Cas3. Mol. Cell 46 , 595–605 (2012).


Makarova, K. S. et al. Evolutionary classification of CRISPR–Cas systems: a burst of class 2 and derived variants. Nat. Rev. Microbiol. 18 , 67–83 (2020).

Doudna, J. A. The promise and challenge of therapeutic genome editing. Nature 578 , 229–236 (2020).

Anzalone, A. V., Koblan, L. W. & Liu, D. R. Genome editing with CRISPR–Cas nucleases, base editors, transposases and prime editors. Nat. Biotechnol. 38 , 824–844 (2020).

Makarova, K. S., Zhang, F. & Koonin, E. V. SnapShot: class 1 CRISPR-Cas systems. Cell 168 , 946 (2017).

Nussenzweig, P. M. & Marraffini, L. A. Molecular mechanisms of CRISPR-Cas immunity in bacteria. Annu. Rev. Genet. 54 , 93–120 (2020).

Hu, C. et al. Allosteric control of type I-A CRISPR-Cas3 complexes and establishment as effective nucleic acid detection and human genome editing tools. Mol. Cell 82 , 2754–2768 (2022).

O'Brien, R. E. et al. Structural basis for assembly of non-canonical small subunits into type I-C Cascade. Nat. Commun. 11 , 5931 (2020).

O'Brien, R. E. et al. Structural snapshots of R-loop formation by a type I-C CRISPR Cascade. Mol. Cell 83 , 746–758 (2023).

Hu, C. et al. Exploiting activation and inactivation mechanisms in type I-C CRISPR-Cas3 for genome-editing applications. Mol. Cell 84 , 463–475.e5 (2024).

McBride, T. M. et al. Diverse CRISPR-Cas complexes require independent translation of small and large subunits from a single gene. Mol. Cell 80 , 971–979 (2020).

Schwartz, E. A. et al. Structural rearrangements allow nucleic acid discrimination by type I-D Cascade. Nat. Commun. 13 , 2829 (2022).

Zhao, H. et al. Crystal structure of the RNA-guided immune surveillance Cascade complex in Escherichia coli . Nature 515 , 147–150 (2014).

Xiao, Y. et al. Structure basis for directional R-loop formation and substrate handover mechanisms in type I CRISPR-Cas system. Cell 170 , 48–60 (2017).

Xiao, Y., Luo, M., Dolan, A. E., Liao, M. & Ke, A. Structure basis for RNA-guided DNA degradation by Cascade and Cas3. Science 361 , eaat0839 (2018).

Guo, T. W. et al. Cryo-EM structures reveal mechanism and inhibition of DNA targeting by a CRISPR-Cas surveillance complex. Cell 171 , 414–426 (2017).

Pausch, P. et al. Structural variation of type I-F CRISPR RNA guided DNA surveillance. Mol. Cell 67 , 622–632 (2017).

Rollins, M. F. et al. Structure reveals a mechanism of CRISPR-RNA-guided nuclease recruitment and anti-CRISPR viral mimicry. Mol. Cell 74 , 132–142 (2019).

Shangguan, Q., Graham, S., Sundaramoorthy, R. & White, M. F. Structure and mechanism of the type I-G CRISPR effector. Nucleic Acids Res. 50 , 11214–11228 (2022).

Jiang, F. & Doudna, J. A. The structural biology of CRISPR-Cas systems. Curr. Opin. Struct. Biol. 30 , 100–111 (2015).

Lin, J. et al. DNA targeting by subtype I-D CRISPR-Cas shows type I and type III features. Nucleic Acids Res. 48 , 10470–10478 (2020).

Csörgő, B. et al. A compact Cascade-Cas3 system for targeted genome engineering. Nat. Methods 17 , 1183–1190 (2020).

Whitford, C. M. et al. CASCADE-Cas3 enables highly efficient genome engineering in Streptomyces species. bioRxiv https://doi.org/10.1101/2023.05.09.539971 (2023).

Li, Y. et al. Targeted large fragment deletion in plants using paired crRNAs with type I CRISPR system. Plant Biotechnol. J. 21 , 2196–2208 (2023).

Wada, N., Osakabe, K. & Osakabe, Y. Type I-D CRISPR system-mediated genome editing in plants. Methods Mol. Biol. 2653 , 21–38 (2023).

Dolan, A. E. et al. Introducing a spectrum of long-range genomic deletions in human embryonic stem cells using type I CRISPR-Cas. Mol. Cell 74 , 936–950 (2019).

Tan, R. et al. Cas11 enables genome engineering in human cells with compact CRISPR-Cas3 systems. Mol. Cell 82 , 852–867 (2022).

Osakabe, K., Wada, N., Murakami, E., Miyashita, N. & Osakabe, Y. Genome editing in mammalian cells using the CRISPR type I-D nuclease. Nucleic Acids Res. 49 , 6347–6363 (2021).

Chen, Y. et al. Repurposing type I-F CRISPR-Cas system as a transcriptional activation tool in human cells. Nat. Commun. 11 , 3136 (2020).

Zimmermann, A. et al. A Cas3-base editing tool for targetable in vivo mutagenesis. Nat. Commun. 14 , 3389 (2023).

Pyne, M. E., Bruder, M. R., Moo-Young, M., Chung, D. A. & Chou, C. P. Harnessing heterologous and endogenous CRISPR-Cas machineries for efficient markerless genome editing in Clostridium. Sci. Rep. 6 , 25666 (2016).

Cheng, F. et al. Harnessing the native type I-B CRISPR-Cas for genome editing in a polyploid archaeon. J. Genet. Genomics 44 , 541–548 (2017).

Zhang, J., Zong, W., Hong, W., Zhang, Z. & Wang, Y. Exploiting endogenous CRISPR-Cas system for multiplex genome editing in Clostridium tyrobutyricum and engineer the strain for high-level butanol production. Metab. Eng. 47 , 49–59 (2018).

Elmore, J. R. et al. Programmable plasmid interference by the CRISPR-Cas system in Thermococcus kodakarensis . RNA Biol. 10 , 828–840 (2013).

Maikova, A. et al. Protospacer-adjacent motif specificity during Clostridioides difficile type I-B CRISPR-Cas interference and adaptation. Mbio 12 , e213621 (2021).

Wang, S., Gabel, C., Siddique, R., Klose, T. & Chang, L. Molecular mechanism for Tn7-like transposon recruitment by a type I-B CRISPR effector. Cell 186 , 4204–4215.e19 (2023).

Makarova, K. S. et al. An updated evolutionary classification of CRISPR‐Cas systems. Nat. Rev. Microbiol. 13 , 722–736 (2015).

Musharova, O. et al. Systematic analysis of Type I‐E Escherichia coli CRISPR‐Cas PAM sequences ability to promote interference and primed adaptation. Mol. Microbiol. 111 , 1558–1570 (2019).

Jore, M. M. et al. Structural basis for CRISPR RNA-guided DNA recognition by Cascade. Nat. Struct. Mol. Biol. 18 , 529–536 (2011).

Hayes, R. P. et al. Structural basis for promiscuous PAM recognition in type I-E Cascade from E. coli . Nature 530 , 499–503 (2016).

Eyquem, J. et al. Targeting a CAR to the TRAC locus with CRISPR/Cas9 enhances tumour rejection. Nature 543 , 113–117 (2017).

Gao, Q. et al. Therapeutic potential of CRISPR/Cas9 gene editing in engineered T‐cell therapy. Cancer Med. 8 , 4254–4264 (2019).

Azangou-Khyavy, M. et al. CRISPR/Cas: from tumor gene editing to T cell-based immunotherapy of cancer. Front. Immunol. 11 , 2062 (2020).

Stadtmauer, E. A. et al. CRISPR-engineered T cells in patients with refractory cancer. Science 367 , eaba7365 (2020).

Rutkauskas, M. et al. Directional R-loop formation by the CRISPR-Cas surveillance complex Cascade provides efficient off-target site rejection. Cell Rep. 10 , 1534–1543 (2015).

Morisaka, H. et al. CRISPR-Cas3 induces broad and unidirectional genome editing in human cells. Nat. Commun. 10 , 5302 (2019).

Cameron, P. et al. Harnessing type I CRISPR–Cas systems for genome engineering in human cells. Nat. Biotechnol. 37 , 1471–1477 (2019).

Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596 , 583–589 (2021).

Pettersen, E. F. et al. UCSF ChimeraX: structure visualization for researchers, educators, and developers. Protein Sci. 30 , 70–82 (2021).

Emsley, P., Lohkamp, B., Scott, W. G. & Cowtan, K. Features and development of Coot. Acta Crystallogr. D. Biol. Crystallogr. 66 , 486–501 (2010).

Liebschner, D. et al. Macromolecular structure determination using X-rays, neutrons and electrons: recent developments in Phenix. Acta Crystallogr. D. 75 , 861–877 (2019).

Williams, C. J. et al. MolProbity: more and better reference data for improved all-atom structure validation. Protein Sci. 27 , 293–315 (2018).

## Acknowledgements

This work is supported by the National Key Research and Development Program of China (2018YFA0902000) (to Y.B.X. and M.L.L.), and the National Natural Science Foundation of China (31970547) (to Y.B.X.). We thank the Instrument Analysis Center (IAC) at Shanghai Jiao Tong University for cryo-EM data collection.

## Author information

These authors contributed equally: Meiling Lu, Chenlin Yu.

## Authors and Affiliations

Department of Biochemistry, School of Life Science and Technology, China Pharmaceutical University, Nanjing, 211198, China

Meiling Lu, Chenlin Yu, Yuwen Zhang, Wenjun Ju, Zhi Ye & Chenyang Hua

State Key Laboratory of Natural Medicines, China Pharmaceutical University, Nanjing, 211198, China

Meiling Lu & Yibei Xiao

Nanjing Foreign Language School, Nanjing, 210008, China

Department of Biological Sciences, Faculty of Science, National University of Singapore, Singapore, 117543, Singapore

Precision Medicine Translational Research Programme (TRP), Department of Biochemistry, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, 117543, Singapore

Institute for Hepatology, National Clinical Research Center for Infectious Disease, Shenzhen Third People’s Hospital, Shenzhen, Guangdong, 518112, China

Zhenhuang Yang

Department of Pharmacology, School of Pharmacy, China Pharmaceutical University, Nanjing, 211198, China

Chongqing Innovation Institute of China Pharmaceutical University, Chongqing, 401135, China

## Contributions

M.L.L., Z.H.Y. and Y.B.X. conceived the project and designed the experiments. M.L.L., C.L.Y., Y.W.Z., W.J.J., Z.Y., C.Y.H., J.Z.M. and Z.H.Y. carried out the experiments. M.L.L., C.L.Y., Y.W.Z., Z.H.Y. and Y.B.X. analyzed the data. M.L.L., C.L.Y., Z.H.Y. and Y.B.X. wrote the manuscript. All authors discussed the results and contributed to the final manuscript.

## Corresponding authors

Correspondence to Meiling Lu, Zhenhuang Yang or Yibei Xiao.

## Ethics declarations

Competing interests.

The authors declare no competing interests.

## Peer review

Peer review information.

Nature Communications thanks the anonymous reviewers for their contribution to the peer review of this work. A peer review file is available.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Lu, M., Yu, C., Zhang, Y. et al. Structure and genome editing of type I-B CRISPR-Cas. Nat Commun 15 , 4126 (2024). https://doi.org/10.1038/s41467-024-48598-2

Accepted: 07 May 2024

Published: 15 May 2024

DOI: https://doi.org/10.1038/s41467-024-48598-2

## New gene delivery vehicle shows promise for human brain gene therapy

Scientists have engineered an adeno-associated virus (AAV) that efficiently crosses the blood-brain barrier in human cell models and delivers genes throughout the brain in humanized mice.

In an important step toward more effective gene therapies for brain diseases, researchers from the Broad Institute of MIT and Harvard have engineered a gene-delivery vehicle that uses a human protein to efficiently cross the blood-brain barrier and deliver a disease-relevant gene to the brain in mice expressing the human protein. Because the vehicle binds to a well-studied protein in the blood-brain barrier, the scientists say it has a good chance at working in patients.

Gene therapy could potentially treat a range of severe genetic brain disorders, which currently have no cures and few treatment options. But FDA-approved forms of the most commonly used vehicle for packaging and delivering these therapies to target cells, adeno-associated viruses (AAVs), aren’t able to efficiently cross the blood-brain barrier at high levels and deliver therapeutic cargo. The enormous challenge of getting therapies past this barrier — a highly selective membrane separating the blood from the brain — has stymied the development of safer and more effective gene therapies for brain diseases for decades.

Now researchers in the lab of Ben Deverman, an institute scientist and senior director of vector engineering at the Broad, have engineered the first published AAV that targets a human protein to reach the brain in humanized mice. The AAV binds to the human transferrin receptor, which is highly expressed in the blood-brain barrier in humans. In a new study published today in Science, the team showed that their AAV, when injected into the bloodstream in mice expressing a humanized transferrin receptor, crossed into the brain at much higher levels than the AAV that is used in an FDA-approved gene therapy for the central nervous system, AAV9. It also reached a large fraction of important types of brain cells, including neurons and astrocytes. The researchers then showed that their AAV could deliver copies of the GBA1 gene, which has been linked to Gaucher’s disease, Lewy body dementia, and Parkinson’s disease, to a large fraction of cells throughout the brain.

The scientists add that their new AAV could be a better option for treating neurodevelopmental disorders caused by mutations in a single gene such as Rett syndrome or SHANK3 deficiency; lysosomal storage diseases like GBA1 deficiency; and neurodegenerative diseases such as Huntington’s disease, prion disease, Friedreich’s ataxia, and single-gene forms of ALS and Parkinson’s disease.

“Since we came to the Broad we’ve been focused on the mission of enabling gene therapies for the central nervous system,” said Deverman, senior author on the study. “If this AAV does what we think it will in humans based on our mouse studies, it will be so much more effective than current options.”

“These AAVs have the potential to change a lot of patients’ lives,” said Ken Chan, a co-first author on the paper and group leader from Deverman’s group who has been working on solving gene delivery to the central nervous system for nearly a decade.

## Mechanism first

For years, researchers developed AAVs for specific applications by preparing massive AAV libraries and testing them in animals to identify top candidates. But even when this approach succeeds, the candidates often don’t work in other species, and the approach doesn’t provide information about how the AAVs reach their targets. This can make it difficult to translate a gene therapy using these AAVs from animals to humans.

To find a delivery vehicle with a greater chance of reaching the brain in people, Deverman’s team switched to a different approach. They used a method they published last year, which involves screening a library of AAVs in a test tube for ones that bind to a specific human protein. Then they test the most promising candidates in cells and mice that have been modified to express the protein.

As their target, the researchers chose human transferrin receptor, which has long been the target of antibody-based therapies that aim to reach the brain. Several of these therapies have shown evidence of reaching the brain in humans.

The team’s screening technique identified an AAV called BI-hTFR1 that binds human transferrin receptor, enters human brain cells, and bypasses a human cell model of the blood-brain barrier.

“We’ve learned a lot from in vivo screens but it has been tough finding AAVs that worked this well across species,” added Qin Huang, a co-first author on the study and a senior research scientist in Deverman’s lab who helped develop the screening method to find AAVs that bind specific protein targets. “Finding one that works using a human receptor is a big step forward.”

## Beyond the dish

To test the AAVs in animals, the researchers used mice in which the mouse gene that encodes the transferrin receptor was replaced with its human equivalent. The team injected the AAVs into the bloodstream of adult mice and found dramatically higher levels of the AAVs in the brain and spinal cord compared to mice without the human transferrin receptor gene, indicating that the receptor was actively ferrying the AAVs across the blood-brain barrier.

The AAVs also showed 40-50 times higher accumulation in brain tissue than AAV9, which is part of an FDA-approved therapy for spinal muscular atrophy in infants but is relatively inefficient at delivering cargo to the adult brain. The new AAVs reached up to 71 percent of neurons and 92 percent of astrocytes in different regions of the brain.

In work led by research scientist Jason Wu, Deverman’s team also used the AAVs to deliver healthy copies of the human GBA1 gene, which is mutated in several neurological conditions. In mice, the new AAVs delivered 30 times more copies of the GBA1 gene than AAV9 and distributed them throughout the brain.

The team said that the new AAVs are ideal for gene therapy because they target a human protein and have production and purification yields similar to those of AAV9 using scalable manufacturing methods. A biotech company co-founded by Deverman, Apertura Gene Therapy, is already developing new therapies using the AAVs to target the central nervous system.

With more development, the scientists think it’s possible to improve the gene-delivery efficiency of their AAVs to the central nervous system, decrease their accumulation in the liver, and avoid inactivation by antibodies in some patients.

Sonia Vallabh and Eric Minikel, two researchers at the Broad who are developing treatments for prion disease, are excited by the potential of the AAVs to deliver brain therapies in humans.

"When we think about gene therapy for a whole-brain disease like prion disease, you need really systemic delivery and broad biodistribution in order to achieve anything," said Minikel. "Naturally occurring AAVs just aren't going to get you anywhere. This engineered capsid opens up a world of possibilities."

This work was supported by Apertura Gene Therapy, the National Institutes of Health Common Fund, the National Institute of Neurological Disorders and Stroke, and the Stanley Center for Psychiatric Research.

Paper cited

Huang Q, Chan KY, et al. An AAV capsid reprogrammed to bind human transferrin receptor mediates brain-wide gene delivery. Science. Online May 16, 2024. DOI: 10.1126/science.adm8386.

## Eukaryotic Genome Annotation Guide

table2asn (the replacement for tbl2asn) uses a simple five-column tab-delimited table of feature locations and qualifiers to generate annotation.

The format of this feature table allows different kinds of features (e.g. gene, coding region, tRNA, repeat_region) and qualifiers (e.g. /product, /note) to be indicated. The validator will check for errors such as internal stops in coding regions.

Guidelines for prokaryotic genome submissions .

If you do not understand any of the instructions presented here or you have questions, please contact us by email at [email protected] prior to creating your submission. This will save us both a lot of time.

Topics covered in this guide:

• Gene features
• transcript_id
• CDS (coding region) features
• Partial coding regions
• Gene fragments
• Transpliced genes
• Split genes on two contigs
• Ribosomal RNA, tRNA and other RNA features
• mRNA features
• Alternatively spliced genes
• Evidence qualifiers
• Database cross references
• Gene Ontology

## Prepare annotation table

The features must be in a simple five-column tab-delimited table, called the feature table. The feature table specifies the location and type of each feature for table2asn (previously tbl2asn) to include in the GenBank submission that is created. The first line of the table contains the following basic information:

>Feature SeqID table_name

The SeqID must be the same as the sequence's SeqID in the FASTA file. The table_name is optional. Subsequent lines of the table list the features. Columns are separated by tabs.

• Column 1: Start location of feature
• Column 2: Stop location of feature
• Column 3: Feature key
• Column 4: Qualifier key
• Column 5: Qualifier value
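The column layout above can be read with a few lines of Python. The following is a minimal sketch, not NCBI's parser: it assumes the header line has the form `>Feature SeqID table_name`, that a feature line fills columns 1 to 3, and that each qualifier line leaves columns 1 to 3 empty and fills columns 4 and 5. The names `Feature` and `parse_feature_table`, and the sample values `contig01`, `Table1` and the coordinates, are hypothetical; `KCS_0001` and `Abc5` are taken from the gene example later in this guide.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    key: str                                          # feature key, e.g. "gene" or "CDS"
    intervals: list = field(default_factory=list)     # (start, stop) strings, "<" / ">" kept
    qualifiers: list = field(default_factory=list)    # (qualifier key, qualifier value) pairs


def parse_feature_table(text):
    """Parse one five-column, tab-delimited feature table into (seq_id, features)."""
    seq_id = None
    features = []
    for raw in text.splitlines():
        if not raw.strip():
            continue                                  # skip blank lines
        if raw.startswith(">Feature"):
            seq_id = raw.split()[1]                   # header: >Feature SeqID [table_name]
            continue
        cols = raw.split("\t")
        cols += [""] * (5 - len(cols))                # pad short lines to five columns
        start, stop, key, qual_key, qual_val = cols[:5]
        if start and stop and key:
            features.append(Feature(key, [(start, stop)]))    # new feature
        elif start and stop:
            features[-1].intervals.append((start, stop))      # further interval of same feature
        elif qual_key:
            features[-1].qualifiers.append((qual_key, qual_val))
    return seq_id, features


# Hypothetical sample table (SeqID, table name and coordinates are made up).
SAMPLE = (
    ">Feature contig01 Table1\n"
    "1\t1575\tgene\n"
    "\t\t\tgene\tAbc5\n"
    "\t\t\tlocus_tag\tKCS_0001\n"
)
seq_id, features = parse_feature_table(SAMPLE)
```

Intervals are kept as strings so that partial markers such as "<" and ">" survive parsing and can be interpreted in a later step.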

Figure 2 shows a sample feature table and illustrates a number of points about the feature table format. The GenBank flatfile corresponding to this table is shown in Figure 3. The allowed features and their qualifiers are listed in the Feature Table documentation.

Features that are on the complementary strand, such as the genes Ngs_3038 and Ngs_11232 and their corresponding features shown in Figure 2 , are indicated by reversing the interval locations.
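As a small illustration of the reversed-interval convention, the sketch below infers the strand from the order of the two location columns. The function name `parse_interval` is hypothetical, and the handling of the "<" and ">" partial markers (simply stripped before comparison) is an assumption made for this example.

```python
def parse_interval(start, stop):
    """Return (low, high, strand) for one feature-table interval given as strings."""
    # Partial markers do not affect the strand, so strip them before comparing.
    s = int(start.lstrip("<>"))
    e = int(stop.lstrip("<>"))
    # Start greater than stop means the feature lies on the complementary strand.
    strand = "+" if s <= e else "-"
    return min(s, e), max(s, e), strand
```

A single-base interval (start equal to stop) carries no strand information in this scheme; the sketch reports it as plus strand.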

Additional requirements, as well as suggestions for various types of annotation, are included in the following sections.

Gene features are always a single interval, and their location should cover the intervals of all the relevant features such as promoters and polyA binding sites.

Gene names should follow the standard nomenclature rules of the particular organism. For example, mouse gene names begin with an uppercase letter, and the remaining letters are lowercase.

Coding regions (CDS) and RNAs, such as tRNAs and rRNAs, must have a corresponding gene feature. However, other features such as repeat_regions and misc_features do not have a corresponding gene or locus_tag.

All genes should be assigned a systematic gene identifier which should receive the locus_tag qualifier on the gene feature in the table. Genes may also have functional names as assigned in the scientific literature. In this example, KCS_0001 is the systematic gene identifier, while Abc5 is the functional gene name.

Table view of gene with both biological name and locus_tag:

Flatfile view:

Table view of gene with only locus_tag:

For consistency the same locus_tag prefix must be used throughout the entire genome. Therefore, all the chromosomes of a genome should have the same locus_tag prefix.

To improve the use of locus_tags we are now requiring that all locus_tag prefixes be registered and that they be unique. We recommend having the BioProject registration process auto-assign a locus_tag prefix, as they are not meant to convey meaning. The locus_tag prefix should be 3-12 alphanumeric characters and the first character may not be a digit. The locus_tag prefix is followed by an underscore and then an alphanumeric identification number that is unique within the given genome. Other than the single underscore used to separate the prefix from the identification number, no special characters can be used in the locus_tag.

The chromosome number can be embedded in the locus_tag, if desired, in the format Prefix_#g#####, where the first # is the chromosome number and ##### is the unique number of the gene. For example, Ajs_4g00123 for a gene on chromosome 4.
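The locus_tag rules above translate naturally into a pattern check. This is a sketch under the rules as stated in this section only (prefix of 3-12 alphanumeric characters not starting with a digit, one underscore, then an alphanumeric identifier); the function names are hypothetical, and registration or uniqueness of the prefix cannot be verified locally.

```python
import re

# Prefix: 3-12 alphanumeric characters, first one a letter; then "_" and an alphanumeric ID.
LOCUS_TAG_RE = re.compile(r"([A-Za-z][A-Za-z0-9]{2,11})_([A-Za-z0-9]+)")
# Optional embedded chromosome number in the ID part: #g#####
CHROMOSOME_ID_RE = re.compile(r"(\d+)g(\d+)")


def check_locus_tag(tag, expected_prefix=None):
    """Return True if tag follows the format rules (and the expected prefix, if given)."""
    m = LOCUS_TAG_RE.fullmatch(tag)
    if m is None:
        return False
    if expected_prefix is not None and m.group(1) != expected_prefix:
        return False                      # the same prefix must be used across the genome
    return True


def chromosome_of(tag):
    """Return the embedded chromosome number, or None if the tag does not embed one."""
    m = LOCUS_TAG_RE.fullmatch(tag)
    if m is None:
        return None
    c = CHROMOSOME_ID_RE.fullmatch(m.group(2))
    return int(c.group(1)) if c else None
```

Passing `expected_prefix` for every gene in a submission is one way to catch a chromosome that was annotated with a different prefix.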

Please register your genome project and proposed locus_tag prefix on the BioProject registration page prior to preparing your submission to GenBank. Each project that is registered here is assigned a project_id, and in the future we intend that the project_id will appear on all entries associated with a particular genome project.

All proteins in a WGS or complete genome must be assigned an identification number by the submitter. We use this number to track proteins when sequences are updated. This number is indicated in the table by the CDS qualifier protein_id, and should have the format gnl|dbname|string, where dbname is a version of your lab name that you think will be unique (e.g. SmithUCSD), and string is the unique protein SeqID assigned by the submitter. This identifier is saved with the record (in ASN.1 format), but it is not visible in the flatfile. We recommend using the locus_tag as the protein SeqID. In this example, the protein_id for ABC5 is gnl|SmithUCSD|KCS_0001.

Since the protein_id is used for internal tracking in our database, it is important that the complete protein_id (dbname + SeqID) not be duplicated by a genome center. Thus, if your genome center is submitting more than one complete genome, please be sure to use unique protein_id's for all of the genomes.

The protein_id is also included as a qualifier on the corresponding mRNA feature, to allow the CDS and mRNA to be paired during processing.

Note that when WGS submissions are processed, the dbname in the protein_id is automatically changed to 'WGS:XXXX', where XXXX is the project's accession number prefix.

After your genome is released into GenBank, the proteins are assigned accession numbers. We will provide a table of the protein SeqIDs and accession numbers for you to use in future updates .

The transcript_id is included as a qualifier for both the CDS and its corresponding mRNA. It has the same format as the protein_id, gnl|dbname|identifier. Because each transcript_id and protein_id must be unique, we suggest adding 'mrna' or 't' to the protein_id identifier as a simple way to create the corresponding (unique) transcript_id. However, you can use whatever naming convention you choose, as long as all of the identifiers are unique.
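The identifier scheme described above can be sketched as follows. The helper names `make_ids` and `find_duplicates` are hypothetical, and attaching the marker as a `mrna.` prefix is only one possible naming convention chosen for this example; any scheme works as long as every identifier is unique.

```python
from collections import Counter


def make_ids(dbname, locus_tag, transcript_marker="mrna"):
    """Build (protein_id, transcript_id) in gnl|dbname|identifier format."""
    protein_id = f"gnl|{dbname}|{locus_tag}"
    # Assumption for this sketch: the marker is prepended with a dot.
    transcript_id = f"gnl|{dbname}|{transcript_marker}.{locus_tag}"
    return protein_id, transcript_id


def find_duplicates(ids):
    """Return the sorted list of identifiers that occur more than once."""
    counts = Counter(ids)
    return sorted(i for i, n in counts.items() if n > 1)
```

Running `find_duplicates` over all protein_id and transcript_id values from every genome submitted by the same center is a simple way to confirm that no identifier is reused.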

All CDS features must have a product qualifier (protein name). NCBI protein naming conventions are adopted from the International Protein Nomenclature Guidelines .

Consistent nomenclature is indispensable for communication, literature searching and data retrieval. Many species-specific communities have established gene nomenclature committees that try to assign consistent and, if possible, meaningful gene symbols. Other scientific communities have established protein nomenclatures for a set of proteins based on sequence similarity and/or function. But there is no established organization involved in the standardization of protein names, nor are there any efforts to establish naming rules that are valid across the largest spectrum of species possible.

Ambiguities regarding gene/protein names are a major problem in the literature and it is even worse in the sequence databases which tend to propagate the confusion. For this reason, we ask that you follow some basic guidelines in naming your proteins. The protein naming guidelines are based on the premise that a good and stable recommended name for a protein is a name that is as neutral as possible.

Guidelines for naming proteins:

• If it exists, use the approved nomenclature.
• Use a concise name, not a description or phrase.
• Ideally, the name should be unique and attributed to all orthologs.
• In cases where the protein name is not known use "hypothetical protein" or "uncharacterized protein" as the product name.
• The protein name should not reflect the protein's subcellular location, its domain structure, its molecular weight or its species of origin. This information can be included in the note.
• For proteins that belong to a multigene family, it is recommended that you choose a coherent nomenclature with numbers to specify the different members of the family.
• When naming proteins which can be grouped into a family based on homology or according to a notion of shared function, the different members should be enumerated with a dash "-" followed by an Arabic number. e.g. "desmoglein-1", "desmoglein-2", etc.
• Proteins of unknown function which contain a defined domain or motif, can be named according to the domain present. The name should be of the following type: "<domain|repeat>-containing protein". e.g. "PAS domain-containing protein 5".
• Protein names may be denoted by the same symbol as the corresponding gene, but in the correct format for the organism. For example, mouse proteins have the same symbol as the gene name, but the protein name has all capital letters.
• Greek letters must be written in full e.g. "alpha", and written entirely in lower case with the exception of "Delta" in the context of steroid/fatty acid metabolism nomenclature. Additionally the Greek letters that are followed by a number should be preceded or followed by a dash "-" e.g. "unicornase alpha-1".
• Use lowercase letters, except when uppercase are required (for example, in acronyms such as DNA or ATP).
• Wherever appropriate, the name should use American spelling conventions.
• Avoid the use of molecular weights in protein names; "unicornase subunit A" is preferred to "unicornase 52 kDa subunit"
• Avoid the term "homolog" in a protein name, as this implies an evolutionary relationship that has generally not been determined.
• Avoid the use of commas in protein names whenever possible.
• Avoid the use of Roman numerals where possible. Use Arabic numbers instead.
• Do not build molecular weights into abbreviations
• Do not use diacritics, such as accents, umlauts. Many computer systems (ours included) can only understand ASCII characters.
• Do not use plurals in a protein name. e.g. "ankyrin repeats-containing protein 8" is wrong.
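Several of the guidelines above are mechanical enough to check automatically before submission. The sketch below is a heuristic linter covering only a subset of the rules (ASCII only, commas, "homolog", molecular weights, Roman numerals, plurals before "-containing"); the patterns and the name `lint_protein_name` are assumptions made for illustration and are not the checks that NCBI's validator performs.

```python
import re

# Each entry pairs a pattern with the guideline it is meant to flag.
CHECKS = [
    (re.compile(r"[^\x00-\x7F]"), "contains non-ASCII characters"),
    (re.compile(r","), "contains a comma"),
    (re.compile(r"\bhomolog(ue)?s?\b", re.IGNORECASE), 'uses the term "homolog"'),
    (re.compile(r"\b\d+(\.\d+)?\s*kDa\b", re.IGNORECASE), "includes a molecular weight"),
    # Single letters such as I, V or X are skipped to avoid flagging subunit names.
    (re.compile(r"\b(II|III|IV|VI|VII|VIII|IX)\b"), "appears to use a Roman numeral"),
    (re.compile(r"\b\w+s-containing\b", re.IGNORECASE), "uses a plural before -containing"),
]


def lint_protein_name(name):
    """Return a list of guideline warnings for a proposed product name."""
    return [message for pattern, message in CHECKS if pattern.search(name)]
```

An empty list means none of these heuristics fired; it does not mean the name satisfies every guideline, since rules such as "use the approved nomenclature" need human judgment.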

Here are some examples of good protein names:

Here are some examples of bad protein names:

Please avoid including notes indicating a specific percentage of similarity to other entries in the database, since the corresponding record that you have pointed to may change and make your current note inaccurate, incorrect and obsolete. Descriptions, notes describing similarity to other proteins, and functional comments must be placed in the appropriate CDS qualifiers such as note, or prot_desc, as they are descriptors of the product. E.C. numbers must be fielded in an EC_number qualifier.

Qualifiers that can be used on the CDS feature are:

Multiple note qualifiers can be included and will be concatenated by table2asn into a single note with semi-colons as separators.

## Bifunctional proteins:

If a protein contains two separate and distinct functions or if it has more than one name, these can be annotated in several ways, as outlined below. Table view:

To annotate a partial coding region, you should use the "<" or ">" in your feature table to designate the feature as either 5' or 3' partial. The coding region should begin at the first nucleotide present in the sequence or exon, and you will indicate where the first complete codon begins in that coding region.

Partial genes within a sequence should begin or end at consensus splice sites.

In the first example below, the "<" designates this coding region as 5' partial and "codon_start 3" tells the software to start translation with the third nucleotide of the CDS. Note that if the codon_start is not specified, then the software assumes a codon_start of 1. The second coding region below is partial at the 3' end so ">" is used to indicate a 3' partial feature. The third example is of a 3' partial coding region on the complementary or minus strand.
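To make the meaning of codon_start concrete, the following sketch splits the nucleotides of a partial coding region into codons, starting at the position that codon_start names. The function name `codons_from` and the example sequences are hypothetical; the default of 1 mirrors the behavior described above when codon_start is not specified.

```python
def codons_from(cds_seq, codon_start=1):
    """Split a CDS nucleotide string into complete codons, honoring codon_start."""
    if codon_start not in (1, 2, 3):
        raise ValueError("codon_start must be 1, 2 or 3")
    usable = cds_seq[codon_start - 1:]          # skip the bases before the first full codon
    full_length = len(usable) - len(usable) % 3  # drop a trailing incomplete codon, if any
    return [usable[i:i + 3] for i in range(0, full_length, 3)]
```

With codon_start 3, the first two bases of the 5' partial coding region are skipped and translation begins at the third nucleotide.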

Here are more examples of formatting partial CDS features .

Include an mRNA feature for each translated CDS. Several things to note are:

• Use the same product name for the mRNA and its corresponding CDS.
• If there is no UTR information, then the mRNA's location will agree with its CDS's location, but the mRNA will be partial at its 5' and 3' ends.
• Extend the gene feature to include the entire mRNA.
• If the mRNA is partial, then make the gene partial.

The first example is a complete CDS whose 5' and 3' UTRs are known.

The second example is a CDS that is partial at the 5' end and lacks any 3' UTR information.

Sometimes a genome will have adjacent or nearby genes that seem to be only part of a protein. In many cases these indicate a possible problem with the sequence and/or annotation. A related issue is the presence of internal stop codons in the conceptual translation of a CDS that looks like it should be a real CDS. These problems may be due to a variety of reasons, including mutations or sequencing artifacts. They can be annotated in a number of ways:

1. Annotate the gene with /pseudo to indicate that there is a problem with the gene. Note that this qualifier does NOT mean that the gene is a pseudogene. (see point 2, below, if it is known that the gene IS a pseudogene) If multiple gene fragments were present initially, then add a single gene feature which covers all of the potential coding regions and add the pseudo qualifier. If known, a note qualifier may be added indicating why this gene is disrupted, for example:

2. If you are sure that the disrupted or error-filled gene is a biological pseudogene, then add the pseudogene qualifier and the appropriate pseudogene type. For example:

3. If the feature is just noting a similarity to genes in the database and is probably not translated, then it should be annotated as a misc_feature without a corresponding gene feature.

Transpliced genes are the exception to the rule for annotating gene feature spans. Transpliced genes are similar to intron containing genes except the two pieces of the gene are found on different regions of the chromosome. These genes are transcribed as two or more separate RNA products that are transpliced into a single mRNA or tRNA. To annotate this using a table, enter the nucleotide spans so that the complementary (minus strand) spans are arranged from high to low and vice versa for the plus strand.

NEW (Sept 2012): Sometimes in incomplete genomes the ends of a gene may be on different contigs. When certain that the two pieces are part of the same gene, annotate these as separate genes with unique locus_tags, plus separate CDS/mRNAs with different protein_id's and transcript_id's. In addition, link the features together with notes that refer to the other part of the gene. However, do not create extremely short features, for example, if one end is only the start methionine or only a few amino acids before the stop codon.

In many cases a gene can be alternatively spliced, yielding alternative transcripts. These transcripts may differ in the coding region and produce different products, or they may differ in the non-translated 5' or 3' UTR and produce the same protein. To annotate alternatively spliced genes, include one mRNA and CDS for each transcript, and include only one gene over all of the features. Give the corresponding mRNA and CDS the same name, and include a note "alternatively spliced" on each. If there are multiple CDS with the same name, then add a note to each mRNA and CDS so that they refer to each other, e.g., "transcript variant A" and "encoded by transcript variant A" for one mRNA/CDS pair. If the CDS have different translations, then they should have different product names. Make sure that all the proteins have unique protein_id's.

## Example 1 (different products):

## Example 2 (same product):

RNA features (rRNA, tRNA, ncRNA) need a corresponding gene feature with a locus_tag qualifier. If the amino acid of a tRNA is unknown, use tRNA-Xxx as the product, as in the example. Many submitters like to label tRNAs with names such as tRNA-Gly1. If you wish to do this, please include "tRNA-Gly1" as a note and not in /gene. The use of /gene is reserved for the actual biological gene symbol, such as "trnG". If a tRNA is a pseudogene, please use the /pseudo qualifier.

Annotate ncRNAs that belong to one of the INSDC ncRNA_class values as an ncRNA feature, with the appropriate value in the required /ncRNA_class qualifier. Regions of an RNA should be annotated as a misc_feature (e.g., leader sequences), or as a misc_binding feature if they bind a known molecule (e.g., riboswitches). If the RFAM identifier is known, it can be included as a db_xref.

## Some rRNA, tRNA and ncRNA examples:

Riboswitches used to be annotated using the misc_binding feature if the bound moiety was known, for example:

New in 2017: annotate riboswitches as regulatory features with the regulatory_class 'riboswitch':

If the bound moiety is unknown or if the sequence is a leader sequence, annotate as a misc_feature, for example:

misc_feature, misc_binding and regulatory features do not have an associated gene feature. If it is desired to tag these features with a locus_tag-like identifier, then include that value in the note, separated from other information by a semicolon and a space.

The International Nucleotide Sequence Database Collaboration (DDBJ, EMBL and GenBank) has adopted a set of new qualifiers to describe the evidence for feature annotation in GenBank records. These are:

• /experimental="text"
• /inference="TYPE:text", where 'TYPE' is from a select list and 'text' is structured text.

These qualifiers replace /evidence=experimental and /evidence=non-experimental, respectively, which are no longer supported.

## Database cross references

A variety of database cross references can be added to a feature. These appear as /db_xref on the features. This qualifier serves as a vehicle for linking of sequence records to other external databases. See the full list of db_xref databases.

Gene Ontology (GO) terms can be included in genome annotation to describe protein function. They are indicated with the qualifiers go_component, go_process and go_function.

The value field is separated by vertical bars '|' into a descriptive string, the GO identifier (leading zeroes are retained), and optionally a PubMed ID and one or more evidence codes. The evidence code is the fourth token, so include blank fields as necessary (e.g., a qualifier with an evidence code but no PubMed ID leaves the third field blank).
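The bar-separated value layout described above can be sketched as a small formatter and parser. This is an illustration under stated assumptions, not an official implementation: the function names `format_go_value` and `parse_go_value` are hypothetical, the example term is illustrative, and the evidence code is passed as one pre-formatted string because the text does not say how multiple codes are delimited.

```python
def format_go_value(description, go_id, pubmed_id="", evidence=""):
    """Join GO fields with '|', keeping the evidence code in fourth place."""
    fields = [description, go_id]  # go_id is a string so leading zeroes survive
    if pubmed_id or evidence:
        fields.append(str(pubmed_id))  # blank when there is no PubMed ID
    if evidence:
        fields.append(evidence)
    return "|".join(fields)


def parse_go_value(value):
    """Split a GO value into its four positional fields (missing ones are '')."""
    parts = value.split("|")
    parts += [""] * (4 - len(parts))
    description, go_id, pubmed_id, evidence = parts[:4]
    return {
        "description": description,
        "go_id": go_id,
        "pubmed_id": pubmed_id,
        "evidence": evidence,
    }


# An evidence code with no PubMed ID leaves the third field blank.
print(format_go_value("cytoplasm", "0005737", evidence="IEA"))
```

Treating the identifier as a string rather than an integer is the design choice that matters here, since converting it to a number would drop the leading zeroes the format requires.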



Last updated: 2024-04-23T18:22:06Z

