LLMs in Biology Applications ============================ As we saw in the previous section, Large Language Models (LLMs) have demonstrated remarkable capabilities in summarizing, processing, and generating human-like text. Their success hinges on the assumption that natural language can be modeled as a system of sequential words/sounds (tokens), and that the precise arrangement of those tokens is what gives rise to context and meaning. If we look elsewhere in nature, we find it is rich with sequential information and processes - DNA sequences, transcriptomes, protein sequences, signal transduction cascades, metabolic pathways, and even small molecules (interpreted as SMILES strings). Do these sequences in nature have their own "vocabulary" conferring some biological importance? **If we treat these sequences as language, can we uncover hidden signals or patterns with LLMs?** In this section, we will look at how LLMs are currently being used in a few different areas in the life sciences. The hands-on examples are driven by current topics found in the literature. By the end of this section, you should be able to: * Pull existing models from Hugging Face using the ``transformers`` library * Apply DNABERT-2 to a DNA sequence classification task * Generate new protein sequences using ProtGPT2 .. note:: The following examples should be done in a Jupyter Notebook on Vista with the ``Day4-pt-251`` kernel. If you are doing this elsewhere, you need to install a few libraries: .. code-block:: console $ pip install --user transformers scikit-learn pandas einops peft Example 1: DNA Sequence Classification -------------------------------------- **DNABERT-2** is an advanced transformer-based foundation model designed for analyzing DNA sequences across multiple species. It builds upon the original DNABERT model by introducing significant improvements in tokenization and architecture, and it is trained on genomes from 135 different species. This allows DNABERT-2 to perform well on a wide range of genomic tasks, including promoter prediction, gene expression analysis, DNA methylation detection, and more. *Given DNA sequences and labels, how would we approach a classification task?* How it Works ^^^^^^^^^^^^ DNABERT and DNABERT-2 are transformer-based architectures modeled after Google's BERT architecture. BERT (Bidirectional Encoder Representations from Transformers) deviates from the traditional transformer architecture in that it only uses the encoder portion of the original model. Traditional transformers include both an encoder and decoder for sequence-to-sequence tasks like translation, but BERT-like models are designed solely for *understanding language*, not generating it. The original DNABERT tokenized k-mer sequences with fixed length as input, and added a CLS token (a tag representing meaning of entire sentence), a SEP token (sentence separator) and MASK tokens (to represent masked k-mers in pre-training). The input is embedded and fed through 12 transformer blocks. The outputs can be interepreted for sentence-level classification or token-level classification. .. figure:: ./images/dnabert_model.png :width: 800 :align: center The original DNABERT model [1]_ (shown above) was improved in DNABERT-2 [2]_ Some of the key innovations of DNABERT-2 over the original DNABERT include: 1. **Byte Pair Encoding (BPE) Tokenization:** Instead of fixed-length k-mer tokenization, which can be computationally inefficient and may not capture variable-length patterns effectively. DNABERT-2 replaces this with BPE, a data-driven compression algorithm that identifies and merges frequently occurring nucleotide sequences. This approach reduces sequence length by approximately five times on average, enhancing computational efficiency and preserving meaningful biological patterns. 2. **Enhanced Model Architecture:** DNABERT-2 improves upon the original model by replacing the original embeddings with Attention with Linear Biases (ALiBi) embeddings, which overcome limitations with input length and allow the model to handle inputs of unlimited length without performance issues. Also, DNABERT-2 incorporates Flash Attention, an optimized mechanism for improved IO to accelerate training and inference and reducing memory requirements. 3. **Genome Understanding Evaluation (GUE) Benchmark:** The original DNABERT was trained on human sequences alone. DNABERT-2 is trained on genomes from 135 different species, enhancing its generalizability across diverse organisms. To assess and compare genome models effectively, the developers introduced the GUE benchmark, a massive amalgam of 36 datasets for 9 different genomic tasks. DNABERT-2 is trained and evaluated on massive amounts of data: * Trained on genomes from **135 species** comprising **>32B bases** * Evaluated with the GUE benchmark for **9 different genome classification tasks** * In total, there were **262B training tokens** and **117M parameters** in the model With these improvements, DNABERT-2 was able to outperform its predecessor on 23 out of 28 datasets from the GUE benchmark in tasks like promoter prediction, gene expression analysis, DNA methylation detection, chromatin state classification, and variant effect prediction. Try it Out ^^^^^^^^^^ Let's perform a simple classification task using DNABERT-2. First, use the ``transformers.pipeline`` method to load the 117M parameter model from `Hugging Face `__: .. code-block:: python >>> import torch >>> from transformers import AutoTokenizer, AutoModel >>> tokenizer = AutoTokenizer.from_pretrained('zhihan1996/DNABERT-2-117M', trust_remote_code=True) >>> model = AutoModel.from_pretrained('zhihan1996/DNABERT-2-117M', trust_remote_code=True) Then, try *tokenizing* a sample DNA sequence using the tokenizer provided by the authors. This fragments the sequence into variable length k-mers and assigns each k-mer a unique token ID: .. code-block:: python >>> dna = 'ACGTAGCATCGGATCTATCTATCGACACTTGGTTATCGATCTACGAGCATCTCGTTAGC' # sample sequence >>> inputs = tokenizer(dna, return_tensors='pt')['input_ids'] >>> print(type(inputs)) >>> print(inputs.shape) >>> print(inputs) .. code-block:: text torch.Size([1, 17]) tensor([[ 1, 5, 194, 32, 757, 1239, 2092, 294, 24, 359, 88, 93, 32, 75, 77, 19, 2]]) The tokenizer gives you back a tensor (in this case, a vector) of token IDs. With DNABERT-2, the first token (``1``) and last token (``2``) are always the same - the CLS and SEP tokens. The rest of the sequence is broken down and given an encoded ID from a look up table. Can we see how the sequence was tokenized? Yes! Let's do a little reverse engineering to see how the author's tokenizer broke down the sequence into k-mers: .. code-block:: python >>> token_ids = inputs[0].tolist() >>> tokens = tokenizer.convert_ids_to_tokens(token_ids) >>> print(tokens) .. code-block:: text ['[CLS]', 'A', 'CGTA', 'GCA', 'TCGGA', 'TCTATCTA', 'TCGACA', 'CTTGG', 'TTA', 'TCGA', 'TCTA', 'CGA', 'GCA', 'TCTC', 'GTTA', 'GC', '[SEP]'] *What are some general observations you can make about the tokenization?* Lets now turn our attention to the model. To see details of the model architecture, you can inspect the ``model`` object that we created above: .. code-block:: python >>> print(model) *Click below to expand the model architecture - notice anything familiar?* .. toggle:: .. code-block:: text BertModel( (embeddings): BertEmbeddings( (word_embeddings): Embedding(4096, 768) (token_type_embeddings): Embedding(2, 768) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) (encoder): BertEncoder( (layer): ModuleList( (0-11): 12 x BertLayer( (attention): BertUnpadAttention( (self): BertUnpadSelfAttention( (dropout): Dropout(p=0.0, inplace=False) (Wqkv): Linear(in_features=768, out_features=2304, bias=True) ) (output): BertSelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (mlp): BertGatedLinearUnitMLP( (gated_layers): Linear(in_features=768, out_features=6144, bias=False) (act): GELU(approximate='none') (wo): Linear(in_features=3072, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) (layernorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) ) ) ) ) (pooler): BertPooler( (dense): Linear(in_features=768, out_features=768, bias=True) (activation): Tanh() ) ) This pre-trained model by itself does not have broad utility. We need to fine tune the model to a specific task (e.g. classification). For this example, we will fine tune it to build a classifier to predict whether a sequence is from a coding region of DNA or not. We have previously downloaded and modified a DNA sequence dataset from `Kaggle `_. The adapted dataset contains two columns: the sequence and a label (1=coding, 0=non-coding). The dataset is in CSV format and can be downloaded directly from this `URL `_. Since the data is available on the web, it is easiest to use pandas to just read it in directly. .. code-block:: python >>> import pandas as pd >>> data = pd.read_csv('https://raw.githubusercontent.com/TACC/life_sciences_ml_at_tacc/refs/heads/main/docs/section5/files/Coding_NonCoding_DNA_Sequences.csv') >>> print(data.shape) >>> print(data.head()) .. code-block:: text (65321, 2) DNA_sequence Target 0 CTCTTGCGGTCGATCTGGTCACGGGTGATGGTGAAGGTTACGTAGT... 1 1 TCGCGGTCCCGAGCCTGATCGTGCGCCGCGCCAACACGACGGTCGA... 1 2 GGCTACGACGTGACCGCGGGGCAGGTGCTCGTGACCAACGGCGGCA... 1 3 CAGGTAGGTGCCACAGTAGTAAGCGGTGATGCAGTTGCCCCTGAAT... 1 4 GAGTTGTCCTGGTAAGATTCTTACCCATGCGAATCACGTCGAAAGG... 1 It appears there are 65,321 rows and 2 columns. How many of each class are included in the dataset? .. code-block:: python >>> data['Target'].value_counts() .. code-block:: text Target 0 45603 1 19718 Name: count, dtype: int64 It's about 70% non-coding and 30% coding DNA. This is a fairly large data set - more data generally is better for training a model. However to save time in this workshop, let's subsample it to just 1000 sequences, checking to make sure that it preserves the ratio: .. code-block:: python >>> data_subset = data.sample(1000) >>> print(data_subset['Target'].value_counts()) .. code-block:: text Target 0 694 1 306 Name: count, dtype: int64 That's pretty close - we have narrowed it down to 1000 sequences, 694 of which are from non-coding regions of DNA, and 306 of which are from coding regions. Next, let's use the tokenizer that we pulled from Hugging Face to slice up those 1000 sequences into short k-mers (of variable length), then encode them into a tensor of token IDs. .. code-block:: python >>> data_subset_list = data_subset.DNA_sequence.values.tolist() >>> inputs = tokenizer(data_subset_list, return_tensors="pt", padding=True)["input_ids"] Then, use the model that we pulled from Hugging Face to get the embeddings, given the token IDs. The embeddings are a mathematical representation of the hidden states of the model. .. warning:: This step may take about 3 or 4 minutes for 1000 sequences. .. code-block:: python >>> embeddings = model(inputs) >>> embedding_data = torch.mean(embeddings[0], dim=1).detach().numpy() Given those embeddings and the known labels (coding or non-coding), we can train a simple classifier to predict the labels of unknown DNA sequences. For this example, we can use a Naive Bayes Classifier, which typically works well with high-dimension data. As we have seen before, we split our embeddings (which we treat as features) and labels into train and test sets. Then, we fit the classifer to the training set and predict the labels of our test set. .. code-block:: python >>> from sklearn.naive_bayes import GaussianNB >>> from sklearn.model_selection import train_test_split >>> X_train, X_test, y_train, y_test = train_test_split(embedding_data, data_subset.Target, test_size=0.3, stratify=data_subset.Target, random_state=1) >>> gnb = GaussianNB() >>> y_pred = gnb.fit(X_train, y_train).predict(X_test) How did we do? .. code-block:: python >>> from sklearn.metrics import classification_report >>> print(f"Performance on TEST\n*******************\n{classification_report(y_test, gnb.predict(X_test))}") >>> print(f"Performance on TRAIN\n********************\n{classification_report(y_train, gnb.predict(X_train))}") .. code-block:: text Performance on TEST ******************* precision recall f1-score support 0 0.83 0.69 0.76 214 1 0.46 0.65 0.54 86 accuracy 0.68 300 macro avg 0.65 0.67 0.65 300 weighted avg 0.72 0.68 0.69 300 Performance on TRAIN ******************** precision recall f1-score support 0 0.86 0.71 0.78 500 1 0.49 0.71 0.58 200 accuracy 0.71 700 macro avg 0.68 0.71 0.68 700 weighted avg 0.76 0.71 0.72 700 Accuracy of 0.68 is not great - but this is using the embeddings alone as our features. Could we get a better outcome if we trained over all ~65K sequences? Can we get a better outcome if we fine tune the original DNABERT-2 model that was previously trained on **32B bases**, and then use their baked-in sequence classifier? .. toggle:: .. note:: You can also give it just one sequence for classification. For example, give it the sequence from above: .. code-block:: python >>> dna = 'ACGTAGCATCGGATCTATCTATCGACACTTGGTTATCGATCTACGAGCATCTCGTTAGC' >>> dna_tokenized = tokenizer(dna, return_tensors='pt')['input_ids'] >>> dna_embedded = model(dna_tokenized) >>> dna_embedded_data = torch.mean(dna_embedded[0], dim=1).detach().numpy() >>> pred=gnb.predict(dna_embedded_data) >>> print(pred) .. code-block:: text [0] Rather than trying to run all of the Python code required for this in the Jupyter notebook, a quick check of the author's `GitHub repo `_ shows they wrote a convenient script to do exactly what we want to do - fine tune their model with our input sequences for a classification task. In order to run their script, we just need to clone their GitHub repo, and prepare three input files: ``train.csv``, ``test.csv``, and ``dev.csv`` (this last one is what they call their validation set). We can prepare those three files by sub-sampling our 1000-sequence subset: .. code-block:: python >>> import os >>> scratch = os.environ.get('SCRATCH') >>> os.mkdir(f'{scratch}/sample_data_for_dnabert2') >>> data_subset.iloc[:600].to_csv(f'{scratch}/sample_data_for_dnabert2/train.csv', index=False) >>> data_subset.iloc[600:750].to_csv(f'{scratch}/sample_data_for_dnabert2/dev.csv', index=False) >>> data_subset.iloc[750:].to_csv(f'{scratch}/sample_data_for_dnabert2/test.csv', index=False) *Does anyone see any potential problems with this approach?* Next, open up a terminal from Jypyter Lab and clone the DNABERT-2 GitHub repo: .. code-block:: console [gh]$ cds # cd to SCRATCH [gh]$ git clone https://github.com/MAGICS-LAB/DNABERT_2 [gh]$ cd DNABERT_2/finetune/ Then, run the training script. The container image was prepared previously by installing the dependencies listed in the author's GitHub repo. .. warning:: This step may take about 15 minutes for 5 epochs. .. code-block:: console [gh]$ module load tacc-apptainer [gh]$ export APPTAINER_CACHEDIR=$SCRATCH/apptainer_cache [gh]$ apptainer exec docker://wjallen/dnabert2:1.0 \ python train.py \ --model_name_or_path zhihan1996/DNABERT-2-117M \ --data_path ${SCRATCH}/sample_data_for_dnabert2/\ --kmer -1 \ --run_name DNABERT2_test \ --model_max_length 100 \ --per_device_train_batch_size 8 \ --per_device_eval_batch_size 16 \ --gradient_accumulation_steps 1 \ --learning_rate 3e-5 \ --num_train_epochs 5 \ --fp16 \ --save_steps 200 \ --output_dir output/ \ --evaluation_strategy steps \ --eval_steps 200 \ --warmup_steps 50 \ --logging_steps 100 \ --overwrite_output_dir True \ --log_level info \ --find_unused_parameters False *What is learning_rate? Is it set to a small number or a big number here? Why is it set the way it is? Any other notable parameters?* If successful, the script outputs the results to the ``output/`` directory, which you can check as follows: .. code-block:: console [gh]$ cat output/results/DNABERT2_test/eval_results.json {"eval_loss": 0.5799916386604309, "eval_accuracy": 0.696, "eval_f1": 0.5968084203378321, "eval_matthews_correlation": 0.2162526926317811, "eval_precision": 0.6257488069854807, "eval_recall": 0.5929735004879513, "eval_runtime": 24.8139, "eval_samples_per_second": 10.075, "eval_steps_per_second": 0.645, "epoch": 5.0 } Slightly better than the Naive Bayes classifier we trained above, but not by much. Congratulations! You've successfully completed a simple classification task using DNABERT-2. Other Notable DNA Sequence Models ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ * `DNABERT-S `_ - Species-specific genomic analyses * `DNAGPT `_ - Variant calling and motif discovery * `HyenaDNA `_ - Long range genomic sequence modeling * `GROVER `_ - DNA sequence context and language rules for the human genome Example 2: Protein Design ------------------------- **ProtGPT2** is a specialized LLM trained by transfer learning of protein sequence information on top of GPT-2. It is designed to generate novel protein sequences following biological principles of naturally-occurring proteins. Sequences generated by ProtGPT2 have typical amino acid compositions, generally are predicted to be globular, and sequence searches show they are distantly related to real proteins, yet they exist in a new and unexplored protein space. How it Works ^^^^^^^^^^^^ ProtGPT2 works by leveraging a GPT-2-like transformer architecture. GPT-2 (Generative Pretrained Transformer 2) also deviates from the traditional transformer, but this architecture only uses the decoder portion of the model. As mentioned above, the full transformer includes both an encoder and decoder for tasks like translation, but GPT-2-like models are designed for generative tasks such as text completion. It uses a unidirectional (left-to-right) attention mechanism, meaning each token pays attention (attention is all you need!) only to previous tokens, which enables text generation. This contrasts with *bidirectional* models like BERT, which use bidirectional attention for deeper ProtGPT2 is trained specifically on protein sequence data. The model learns patterns, motifs, and structural features inherent in protein sequences by analyzing vast datasets of known proteins. This training enables ProtGPT2 to generate sequences that are not only syntactically valid but also biologically meaningful. The process involves: * **Tokenization**: Protein sequences are broken down into smaller units (amino acid tokens) for processing * **Training**: The model is trained on large datasets of protein sequences to predict the next token in a sequence, capturing contextual relationships * **Generation**: Using the learned patterns, the model can generate new protein sequences by sampling from the probability distribution of possible tokens Massive amounts of data go into the training and evaluation: * An autoregressive transformer model with **738 million parameters** * Trained on **~50 million non-annotated protein sequences** spanning all of known protein space * Generated and predicted structural and chemical properties for **10,000 new sequences** The authors report that by finetuning the model on a subset of sequences that a user chooses, the model can be biased toward certain end goals. For example: * Designing proteins with specific properties * Tune or alter biochemical functions of natural proteins * Exploring sequence space for novel enzymes or therapeutic proteins * Understanding sequence-function relationships .. figure:: ./images/protgpt2_space.png :width: 800 :align: center Protein space and example proteins sampled by ProtGPT2 [3]_ Try it Out ^^^^^^^^^^ Use the ``transformers.pipeline`` method to load the ``protgpt2`` model from `Hugging Face `__: .. code-block:: python >>> from transformers import pipeline >>> protgpt2 = pipeline('text-generation', model="nferruz/ProtGPT2") .. code-block:: text config.json: 100% 850/850 [00:00<00:00, 164kB/s] pytorch_model.bin: 100% 3.13G/3.13G [00:39<00:00, 76.7MB/s] model.safetensors: 100% 3.13G/3.13G [00:29<00:00, 107MB/s] vocab.json: 100% 655k/655k [00:00<00:00, 6.85MB/s] merges.txt: 100% 314k/314k [00:00<00:00, 5.05MB/s] tokenizer.json: 100% 1.07M/1.07M [00:00<00:00, 5.44MB/s] special_tokens_map.json: 100% 357/357 [00:00<00:00, 71.7kB/s] The model may take a few minutes to download from the web, but inference only takes a few seconds. Use the default parameters to generate 10 brand new protein sequences: .. code-block:: python >>> sequences = protgpt2("<|endoftext|>", max_length=100, do_sample=True, top_k=950, repetition_penalty=1.2, num_return_sequences=10, eos_token_id=0) A few of the important parameters: * ``"<|endoftext|>"``: the starting token, in this case interpreted as starting anew or the amino acid M * ``max_length=100``: sets the maximum token length returned where each token is around four amino acids * ``do_sample=True``: enables sampling instead of greedy decoding, meaning the model will not just take the next token with the highest likelihood, instead it will randomly sample from the probability distribution of likely * ``top_k=950``: number of most probably next tokens that are considered for sampling at each step * ``repetition_penalty=1.2``: applies a penalty for repeating tokens * ``num_return_sequences=10``: specifies how many sequences to generate * ``eos_token_id=0``: generation ceases (end of sequence / EOS) if token 0 is produced The returned ``sequences`` object is a simple list that contains the generated sequences. Because the model samples a probability distribution, the sequences should be different every time: .. code-block:: python >>> print(type(sequences)) >>> print(len(sequences)) >>> print(sequences[0]) .. code-block:: text 10 {'generated_text': '<|endoftext|>\nMAHTRENQWTAMRTLWFRLACLALVVMAITSCEEEEDDTVTRQFADVTSTLPAGITTVQF\nSNAFAGSVTWMTGEATTGPDITIVITGTGFESVASDNSVILTIGDVVVDVIQWSGTEIKI\nSVPASAVASTAKLEIKNMNGLSLDLPAKIKAAFTSINGGSNPNPSGGTNNIIIAGGPFAN\nGYSNIGQFKVGAPATGDDYALIQGNFLENPETGLFYIQLRRAEDSGQTYDLYFSKDDGTN\nWNSPVNLSGTVSPS'} Alternatively, seed the model with a starting sequence token that you want to build from. For example, you may provide an N-termninal sequence or motif for a certain class of proteins (membrane bound, RNA-binding, etc.) and the model should generate other sequences with similar behavior. .. code-block:: python >>> sequences = protgpt2("<|endoftext|>\nMKTAYIAKQRQISFVKSHFSRQDILDLWIYHTQGYFPDWQNY", max_length=100, do_sample=True, top_k=950, repetition_penalty=1.2, num_return_sequences=1, eos_token_id=0) >>> print(sequences[0]['generated_text']) .. code-block:: text <|endoftext|> MKTAYIAKQRQISFVKSHFSRQDILDLWIYHTQGYFPDWQNYYLEHLNFMLQDLHPGSSL PLILEIGCGSGEFLNYLAQKHQVLGVDINPDEIELAKHINPDANFLVAQAEALPFHKNTF DYVLCMEVIEHLPNPELLINECKRVLKPNGTLLFTTPNFQSLQNRIKLLLGRSPKSQYYG QEQFGHVNFFEVSSIKEIVKRFGLKPVKQKTFFPYIPSLSILHFIMNVFPIGYKFFCYLY FRKEED Congratulations! You've likely generated realistic protein sequences that no one has ever seen before. Perhaps think about plugging them in to `Alphafold3 `_ to model the structure. Other Notable Protein Sequence Models ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ * `ProteinBERT `_ - Prediction of protein-protein interactions * `ESMFold `_ - Protein folding mechanisms and interactions * `ProGen `_ - Generate novel protein sequences given natural language prompts Additional Resources -------------------- * `DNABERT-2 on Hugging Face `_ * `DNABERT-2 on GitHub `_ * `DNABERT-2 Classification Exercise Adapted from Kaggle `_ * `Coding and Non-noncoding Data Adapted from Kaggle `_ * `ProtGPT2 on Hugging Face `_ * `LLM Applications in Bioinformatics Review 1 `_ * `LLM Applications in Bioinformatics Review 2 `_ References ^^^^^^^^^^ .. [1] Ji, Y., Zhou, Z., Liu, H. & Ramana V Davuluri, R.V., DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome. Bioinformatics 37, 2112-2120 (2021). https://doi.org/10.1093/bioinformatics/btab083 .. [2] Zhou, Z., Ji, Y., Li, W., Dutta, P., Davuluri, R.V., and Liu, H. DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome. arXiv, 2306.15006, (2023). https://doi.org/10.48550/arXiv.2306.15006 .. [3] Ferruz, N., Schmidt, S. & Höcker, B. ProtGPT2 is a deep unsupervised language model for protein design. Nat Commun 13, 4348 (2022). https://doi.org/10.1038/s41467-022-32007-7