Step-by-Step Guide to Fine-Tuning BLOOM

Introduction

BLOOM is a powerful tool that can be used for a variety of tasks, including:

- Text generation: BLOOM can generate text in any of the languages it was trained on, including creative text formats such as poems, code, scripts, musical pieces, email, and letters.
- Translation: BLOOM can translate text from one language to another with high accuracy.
- Code generation: BLOOM can generate code in a variety of programming languages, including Python, Java, C++, and JavaScript.
- Question answering: BLOOM can answer your questions in a comprehensive and informative way, even if they are open-ended, challenging, or strange.

BLOOM is still under development, but it has the potential to revolutionize the way we interact with computers. It can be used to create new and innovative applications in a wide range of fields, including education, healthcare, and business.

One of the key benefits of BLOOM is that it is open-source and open-access. This means that anyone can use BLOOM to develop new applications or to explore the capabilities of LLMs. This is a significant step forward in the field of AI, as it democratizes access to LLM technology and makes it possible for more people to benefit from its capabilities.

What Is Fine-Tuning?

Fine-tuning is a technique in machine learning where a pre-trained model is adapted to a new task by training it on a small amount of data that is specific to the new task. This differs from training a model from scratch, which requires a large amount of data and can be time-consuming.

Fine-tuning is a powerful technique for training LLMs on new tasks. BLOOM is a pre-trained LLM, so it can be fine-tuned to perform a variety of tasks, such as generating text, translating languages, and writing creative content.

Why Fine-Tune BLOOM?

There are several reasons why you might want to fine-tune BLOOM:

- To improve the performance of BLOOM on a specific task. For example, you could fine-tune BLOOM to generate more creative texts or to translate languages more accurately.
- To adapt BLOOM to a new domain. For example, you could fine-tune BLOOM to generate text in a specific industry or to translate languages from a specific region.
- To develop a custom LLM that is tailored to your specific needs. For example, you could fine-tune BLOOM to generate text that is specific to your company or to translate languages that are relevant to your research.

Benefits of Fine-Tuning BLOOM

Fine-tuning BLOOM offers a number of benefits, including:

- Improved performance on specific tasks.
- Adaptation to new domains.
- Development of custom LLMs.
- Reduced training time and cost.

Fine-tuning BLOOM can be a relatively quick and easy way to improve the performance of BLOOM on a specific task or to adapt BLOOM to a new domain. This can be a significant advantage over training a model from scratch, which can be time-consuming and expensive.

Requirements

Python Libraries

To fine-tune BLOOM, a user needs the following:

- A notebook backed by a GPU. Fine-tuning BLOOM requires a GPU with at least 4 GB of memory, though this is only enough for basic operations; larger models and batch sizes need more memory. (You can verify your GPU with the snippet below.)
- The Python programming language.
- The Transformers library.
- The BLOOM model and tokenizer.
- A training dataset: a collection of text examples that are relevant to the task.
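As a quick sanity check, you can confirm that a GPU is visible from the notebook and see how much memory it has. A minimal sketch, assuming PyTorch is installed (it ships with most GPU notebook images):

import torch

# Check that a CUDA-capable GPU is available
print("GPU available:", torch.cuda.is_available())

# Report the name and total memory of the first GPU, in gigabytes
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1e9:.1f} GB")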

The following steps can be followed to fine-tune BLOOM:

1. Install the required Python libraries.
2. Download the BLOOM model and tokenizer.
3. Load your training data.
4. Prepare your training data.
5. Define your training arguments.
6. Train the model.
7. Evaluate the model.
8. Save the fine-tuned model.

In addition to the requirements listed above, there are a few other things to keep in mind while fine-tuning BLOOM:

- The size of the training dataset. The larger the training dataset, the better the fine-tuned model will generally perform.
- The quality of the training dataset. It should be high-quality and representative of the task that you want to fine-tune BLOOM for.
- The hyperparameters used for training. Hyperparameters such as the number of training epochs and the learning rate control the training process; it is important to tune them to get the best performance from the fine-tuned model.

Launch Your GPU-Backed Notebook

Head over to [E2E Cloud](https://myaccount.e2enetworks.com) and sign in or register. Once signed in, click on the top-left corner to head to TIR - AI Platform, then click on Create a Notebook.

Create Notebook on TIR

Make sure you select a GPU notebook. Free credits are available, which should easily suffice for this tutorial.

Once the notebook has been launched, follow the next steps.

Packages and Libraries

You will have to install the following libraries:

# Install the required packages
!pip install transformers
!pip install accelerate -U
!pip install datasets

# Import the model, tokenizer, and training utilities
import transformers
from transformers import BloomForCausalLM
from transformers import BloomForTokenClassification
from transformers import BloomTokenizerFast
from transformers import TrainingArguments
from transformers import Trainer
from transformers import AutoTokenizer
from transformers import DataCollatorForLanguageModeling
import torch
from datasets import load_dataset
import random

BLOOM Model and Tokenizer

To fine-tune BLOOM, the user will need to load their training data. The training data should be a collection of text examples that are relevant to the task that the user wants to fine-tune BLOOM for.

For example, if the user wants to fine-tune BLOOM to generate more creative texts, they could use a training dataset of poems, code, scripts, musical pieces, email, and letters. If the user wants to fine-tune BLOOM to translate languages more accurately, they could use a training dataset of parallel text in multiple languages.

Once the user has collected their training data, they need to convert it to a format that can be used by the BLOOM model. The BLOOM model expects the training data to be in a tokenized format. To tokenize the training data, the user can use the BLOOM tokenizer. The BLOOM tokenizer will split the training data into individual tokens, which are the basic units of text that the BLOOM model can understand.

Here is an example of how to tokenize the training data using the BLOOM tokenizer:

from transformers import BloomTokenizerFast

tokenizer = BloomTokenizerFast.from_pretrained("bigscience/bloom-1b7")

# Load the training data, one example per line
training_data = []
with open("training_data.txt", "r") as f:
    for line in f:
        training_data.append(line.strip())

# Tokenize the training data; padding is needed because the examples
# have different lengths and are batched into a single tensor
tokenized_training_data = tokenizer(training_data, padding=True, truncation=True, return_tensors="pt")
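To get a feel for what the tokenizer produces, it helps to inspect a single example. A minimal sketch reusing the tokenizer loaded above (the sample sentence is arbitrary):

example = tokenizer("I am a cat.")
print(example["input_ids"])  # the token IDs the model consumes
print(tokenizer.convert_ids_to_tokens(example["input_ids"]))  # the corresponding text tokens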

Training Data

To prepare the training data for training, the user needs to split the tokenized training data into pairs of input and target sequences. The input sequence is the text that the BLOOM model receives, and the target sequence is the text that the BLOOM model should predict.

For example, if the user is fine-tuning BLOOM to generate text, their training dataset might look like this:

Input sequence: I am a cat.
Target sequence: Meow.

Input sequence: I love to play.
Target sequence: Fun!

Once the user has created a training dataset, they need to split it into training and validation sets. The training set should be about 80% of the total dataset, and the validation set should be about 20% of the total dataset.

To split the dataset, the user can use the following Python code:

import random

# The tokenizer returns a dict of batched tensors; convert it into a
# list of (input_ids, attention_mask) pairs before splitting
tokenized_training_data = list(zip(tokenized_training_data["input_ids"],
                                   tokenized_training_data["attention_mask"]))

# Split the dataset into training (~80%) and validation (~20%) sets
train_dataset = []
val_dataset = []
for input_ids, attention_mask in tokenized_training_data:
    if random.random() < 0.8:
        train_dataset.append((input_ids, attention_mask))
    else:
        val_dataset.append((input_ids, attention_mask))

Example

Suppose tokenized_training_data is a list containing the following pairs of input_ids and attention_mask:

tokenized_training_data = [
    ([1, 2, 3], [1, 1, 1]),
    ([4, 5, 6], [1, 1, 1]),
    ([7, 8, 9], [1, 1, 1]),
    ([10, 11, 12], [1, 1, 1])]

After running the above code, the user might end up with train_dataset and val_dataset similar to the following:

train_dataset: [([1, 2, 3], [1, 1, 1]), ([7, 8, 9], [1, 1, 1])]
val_dataset: [([4, 5, 6], [1, 1, 1]), ([10, 11, 12], [1, 1, 1])]
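In practice, if the training data lives in a Hugging Face datasets.Dataset, its built-in train_test_split method produces the same 80/20 split with less code. A minimal sketch (the sample texts are placeholders):

from datasets import Dataset

ds = Dataset.from_dict({"text": ["I am a cat. Meow.", "I love to play. Fun!",
                                 "Dogs bark.", "Birds sing."]})
split = ds.train_test_split(test_size=0.2, seed=42)
train_ds, val_ds = split["train"], split["test"]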

Training the BLOOM Model

To train the BLOOM model, the user can use the Trainer class from the Transformers library. The Trainer class provides a number of features that make it easy to train and evaluate LLMs, such as:

- Automatic gradient computation
- Distributed training
- Early stopping
- Evaluation metrics

To train the BLOOM model using the Trainer class, the user can use the following Python code:

!pip install transformers
!pip install accelerate -U
!pip install datasets

import transformers
from transformers import BloomForCausalLM
from transformers import BloomForTokenClassification
from transformers import BloomTokenizerFast
from transformers import TrainingArguments
from transformers import Trainer
from transformers import AutoTokenizer
from transformers import DataCollatorForLanguageModeling
import torch
from datasets import load_dataset

Load the BLOOM model and tokenizer

model = BloomForCausalLM.from_pretrained("bigscience/bloom-1b7")
tokenizer = BloomTokenizerFast.from_pretrained("bigscience/bloom-1b7")

Create the training dataset

raw_datasets = load_dataset("glue", "mrpc")


def tokenize_function(example):
    # Tokenize each sentence pair as one text, using the BLOOM tokenizer
    # loaded above so the token IDs match the model's vocabulary
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)


tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

# For causal language modeling, the collator pads each batch and copies
# input_ids into labels (mlm=False) so the Trainer can compute a loss
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
train_dataset = tokenized_datasets["train"]

Create the validation dataset

eval_dataset = tokenized_datasets["validation"]

Define the training arguments

training_args = TrainingArguments(
    output_dir="output",
    num_train_epochs=10,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=8,
    learning_rate=2e-5,
)

Create the trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer,
)

# Train the model
trainer.train()

Evaluation

To evaluate the fine-tuned BLOOM model, the user can use the following Python code:

# Evaluate the model on the validation dataset
metrics = trainer.evaluate(eval_dataset=eval_dataset)
print(metrics)

The evaluation loss gives the user an indication of how well the fine-tuned BLOOM model will perform on new data; to report task metrics such as accuracy, pass a compute_metrics function to the Trainer. If the results are satisfactory, the user can use the fine-tuned model to generate text, translate languages, and write creative content.

Here are some tips for fine-tuning BLOOM:

- Use a large and representative training dataset. The larger and more representative the training dataset is, the better the fine-tuned model will perform.
- Use a suitable learning rate. The learning rate controls how quickly the model learns. A learning rate that is too high can make training unstable, while one that is too low can make the model learn slowly.
- Use early stopping. Early stopping prevents the model from overfitting the training data by halting training when the validation loss stops improving (see the sketch below).
- Experiment with different hyperparameters. The hyperparameters of the training process can have a significant impact on the performance of the fine-tuned model. Experiment with different combinations to find the best one for the user's task.
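As an illustration, early stopping is available through the Trainer's callback mechanism. A minimal sketch, assuming per-epoch evaluation; the patience value of 3 is a placeholder to tune:

from transformers import EarlyStoppingCallback

training_args = TrainingArguments(
    output_dir="output",
    num_train_epochs=10,
    evaluation_strategy="epoch",     # evaluate at the end of every epoch
    save_strategy="epoch",           # required by load_best_model_at_end
    load_best_model_at_end=True,     # restore the best checkpoint when training stops
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = Trainer(
    model,
    training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer,
    # Stop if eval_loss fails to improve for 3 consecutive evaluations
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)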

Saving and Loading

Once the user has trained and evaluated the fine-tuned BLOOM model, they can save it to a directory so that they can use it later. To do this, the user can use the save_pretrained() method of the model (trainer.save_model() is an equivalent shortcut):

# Save the fine-tuned model to a directory
model.save_pretrained("directory")

# Save the tokenizer alongside it so both can be reloaded together
tokenizer.save_pretrained("directory")

Once the user has saved the fine-tuned model, they can load it back into memory using the from_pretrained() method:

# Load the fine-tuned BLOOM model
model = BloomForCausalLM.from_pretrained("directory")

The user can then use the fine-tuned model to generate text, translate languages, and write creative content.
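For instance, a minimal generation sketch with the reloaded model (the prompt text and the "directory" path are placeholders):

# Reload the saved tokenizer and generate with the fine-tuned model
tokenizer = BloomTokenizerFast.from_pretrained("directory")
inputs = tokenizer("Your prompt here", return_tensors="pt")
output_ids = model.generate(inputs["input_ids"], max_length=50)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))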

Example

Let us test BLOOM with various prompts. The required packages are first installed, and necessary libraries are imported.

!pip install transformers
from transformers import BloomForCausalLM
from transformers import BloomForTokenClassification
from transformers import BloomTokenizerFast
import torch

The BLOOM model is then downloaded to the local disk. Here, the 1b7 version of BLOOM is used; the model is about 3.44 GB in size.

tokenizer = BloomTokenizerFast.from_pretrained("bigscience/bloom-1b7", local_files_only=False)
model = BloomForCausalLM.from_pretrained("bigscience/bloom-1b7", local_files_only=False)

Once the pretrained model is downloaded, we can move on to giving it different prompts as inputs.

prompt = ""         # the input text; set per example below
result_length = 50  # maximum length of the generated output, in tokens
inputs = tokenizer(prompt, return_tensors="pt")

# Generate with beam search, blocking repeated bigrams
print(tokenizer.decode(model.generate(inputs["input_ids"],
                    max_length=result_length,
                    num_beams=2,
                    no_repeat_ngram_size=2,
                    early_stopping=True
                    )[0]))

Example 1:

Let us consider generating text related to E2E Cloud. BLOOM has no prior awareness of the company, but it will still generate text based on what it learned during training. The required length is 50 tokens.

prompt = "E2E cloud is a cloud service provider. It is"
result_length = 50
inputs = tokenizer(prompt, return_tensors="pt")

Output:

E2E cloud is a cloud service provider. It is the only cloud provider in the world that provides a complete cloud solution for the entire enterprise. The company is based in Singapore and has offices in Hong Kong, Singapore, and the United States.

From the output, it can be seen that the model has given wrong information, but it is somewhat believable to someone who does not know about the company. The generated text attributes characteristics and locations to the company based on what the model has learned during its training, although these details might not be accurate because the model is not aware of the specific company. This highlights both the flexibility and the limitations of such large-scale language models.

Example 2:

Let us consider generating text related to some generic content, in this case an animal. The prompt given is “Jaguars are wild species, they”, with a required length of 50 tokens.

prompt = "Jaguars are wild species, they"
result_length = 50
inputs = tokenizer(prompt, return_tensors="pt")

Output:

Jaguars are a wild species, they are not easy to domesticate, and they have a tendency to become aggressive. They are also very territorial, so it is important to keep them in a small space.

BLOOM demonstrates its ability to produce factual and contextually relevant information about jaguars, a wild animal species. This suggests that BLOOM is well-equipped to handle generalized queries, synthesizing its vast training data into an accurate and informative output. Overall, this example showcases the model’s strength in producing reliable and coherent text when presented with a generic topic.

Example 3:

Let us consider generating a longer text from a question, this time a technical one. The prompt given is “What is deep learning?”, with a required length of 200 tokens.

prompt = "What is deep learning?"
result_length = 200
inputs = tokenizer(prompt, return_tensors="pt")

Output:

What is deep learning? Deep learning is a machine learning technique that uses a large amount of data to learn a model that can be used to solve a specific problem. Deep neural networks (DNNs) are a type of machine-learning algorithm that learns from data. They are used in a wide range of applications, including computer vision, speech recognition, and natural language processing. The DNN is trained using a set of labeled data, which is called the training set. Once trained, the model is able to generalize well to unseen data. Deep learning has been used for a variety of tasks, such as image classification, text categorization, image captioning, object detection, etc. In this paper, we focus on the task of image segmentation. Image segmentation is the process of identifying and classifying the objects in an image. There are two main types of segmentation algorithms: region-based and object-based. Region-based algorithms segment the image into regions based on certain criteria, while object

BLOOM takes on a technical question about deep learning and provides a comprehensive answer within the 200-token limit. BLOOM effectively demonstrates its capability to not only answer a question but also provide contextual background information, thereby giving the reader a rounded understanding of the subject. This example exemplifies BLOOM’s prowess in generating detailed, informative, and relevant content in response to technical queries.

Conclusion

Here are some example use cases for fine-tuned BLOOM:

- Generating creative text: Fine-tuned BLOOM can be used to generate creative text, such as poems, code, scripts, musical pieces, email, and letters.
- Translating languages: Fine-tuned BLOOM can be used to translate languages more accurately and fluently.
- Answering questions: Fine-tuned BLOOM can be used to answer questions in a more comprehensive and informative way.
- Summarizing text: Fine-tuned BLOOM can be used to summarize text more concisely and accurately.

These are just a few of the many possible use cases for fine-tuned BLOOM. Fine-tuning BLOOM is a powerful way to adapt the model to a specific task and unlock its full potential. By following the tips in this guide, users can fine-tune BLOOM to achieve good performance on a variety of tasks, such as generating creative text, translating languages, answering questions, and summarizing text.

Running BLOOM at scale requires a large amount of RAM and cloud infrastructure. E2E Cloud offers various cloud-based GPU nodes at a nominal cost. If you need to run the BLOOM LLM, consider using [E2E cloud services](https://www.e2enetworks.com/products/). NVIDIA L4 and A100 GPUs are considered good choices for natural language processing. [Compare them](https://www.e2enetworks.com/blog/nvidia-l4-vs-a100-gpus-choosing-the-right-option-for-your-ai-needs) to find which is more suitable for your requirements.