

Llama2 70b on GC

By Yvonne Marron | August 5, 2022

Hi Community, I am hoping you can help me out. I recently read that I can easily use Vertex AI to launch Llama and that it shouldn't be difficult. Has this been others' experience? I was recently told that it would be a long process to fine-tune Llama2 70b, but I keep reading articles about people making it happen in a matter of hours. Any experience with this? Also, how can I figure out the hourly costs to host Llama2 on GC?

I appreciate any feedback you can offer.





Best answer by malamin


Thank you for opening this discussion here; it's really interesting. Fine-tuning Llama2 70b in a matter of hours is still a challenge, but it is becoming increasingly possible with the latest advances in hardware and software. For example, the Hugging Face Transformers ecosystem provides a number of techniques for efficient fine-tuning, such as QLoRA and FlashAttention 2.

One way to fine-tune Llama2 70b in a matter of hours is to use a technique called QLoRA, which stands for Quantized Low-Rank Adaptation. QLoRA quantizes the frozen pre-trained model to low precision and attaches small, low-rank adapters (LoRA) to it. Only these adapters are fine-tuned on the downstream task, which is much faster and far less memory-hungry than fine-tuning the entire model.
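The parameter savings behind LoRA-style adapters can be illustrated with plain matrix shapes. This is a minimal NumPy sketch, not the actual Hugging Face PEFT implementation; the dimensions and rank are illustrative placeholders.

```python
import numpy as np

d, k = 4096, 4096   # shape of one frozen weight matrix in the base model
r = 8               # LoRA rank (deliberately small)

W = np.zeros((d, k))              # frozen pre-trained weight (never updated)
A = np.random.randn(r, k) * 0.01  # trainable low-rank factor
B = np.zeros((d, r))              # trainable low-rank factor (zero-initialized)

x = np.random.randn(k)
# Effective forward pass: W x + B (A x) -- only A and B receive gradients.
y = W @ x + B @ (A @ x)

full_params = W.size
lora_params = A.size + B.size
print(f"full: {full_params:,}  lora: {lora_params:,}  "
      f"ratio: {lora_params / full_params:.4%}")
```

For this single matrix, the adapters hold well under 1% of the parameters of the frozen weight, which is why adapter-only fine-tuning is so much cheaper.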

Another way to fine-tune Llama2 70b in a matter of hours is parameter-efficient fine-tuning (PEFT), which updates only the small subset of the model's parameters that matters most for the downstream task. This can be done with techniques such as prompt tuning and adapter tuning.
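The core mechanic of parameter-efficient fine-tuning is freezing most parameters and updating only a named subset. The toy sketch below uses plain Python lists and made-up parameter names; real frameworks (e.g. PEFT on top of PyTorch) achieve the same effect by toggling `requires_grad` on tensors.

```python
# Toy illustration: freeze everything except a small "adapter" subset,
# then apply an update step only to that subset. Names and sizes are
# invented for illustration.
params = {
    "embed.weight": [0.0] * 1000,
    "layer0.attn.weight": [0.0] * 5000,
    "layer0.adapter.weight": [0.0] * 50,   # small adapter: trainable
    "head.weight": [0.0] * 1000,
}

trainable = {name for name in params if "adapter" in name}

def sgd_step(params, grads, lr=0.1):
    """Update only trainable parameters; frozen ones are left untouched."""
    for name in params:
        if name in trainable:
            params[name] = [w - lr * g for w, g in zip(params[name], grads[name])]

grads = {name: [1.0] * len(v) for name, v in params.items()}
sgd_step(params, grads)

n_total = sum(len(v) for v in params.values())
n_train = sum(len(params[n]) for n in trainable)
print(f"training {n_train} of {n_total} parameters")
```

Because the optimizer only needs state for the trainable subset, memory and compute per step drop roughly in proportion to the number of frozen parameters.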

Here are some tips for fine-tuning Llama2 70b in a matter of hours in Google Cloud Vertex AI:

  • Choose the right hardware. Google Cloud offers a variety of accelerators, including GPUs and Cloud TPUs. GPUs are a good option for training large language models, while Cloud TPUs are specialized accelerators designed for machine learning workloads and can significantly speed up training of a model as large as Llama2 70b.
  • Use multi-worker training. Multi-worker training allows you to train a model on multiple machines simultaneously, which can significantly reduce the training time. Vertex AI makes it easy to set up multi-worker training jobs, and you can choose the number of workers to use based on your budget and the resources you have available.
  • Use a pre-processed dataset. This will save time on data loading and pre-processing.
  • Use a low learning rate. This will help to prevent the model from overfitting the training data.
  • Use a shorter training schedule. You may not need to train the model for as long as you would think in order to achieve good results.
  • Use a distributed training framework. Frameworks such as PyTorch FSDP, TensorFlow's MultiWorkerMirroredStrategy, and Horovod let you train on multiple GPUs or TPUs in parallel, splitting the model and data across devices. This can significantly reduce both the training time and the cost of training a large language model.
  • Use a checkpointing strategy. This will allow you to save the state of the model at regular intervals so that you can resume training from where you left off if the training process fails for any reason. This can save you time and money, as you will not have to start the training process over from the beginning.
  • Use a mixed precision training strategy. This will allow you to use a lower precision for some of the computations, which can improve the performance of the training process without sacrificing too much accuracy. This can save you money, as you will not need to use as many TPUs to train the model.
  • Use pre-trained weights. If you are able to find pre-trained weights for the Llama2 70b model, you can use them to initialize your model. This can significantly reduce the training time and cost.
  • Choose the right TPU pod. Google Cloud offers a variety of TPU pods that you can use to train the Llama2 70b model. Some TPU pods are more cost-effective than others. For example, the Cloud TPU v4 pod is more cost-effective than the Cloud TPU v3 pod.
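For the multi-worker and distributed-training tips above, Vertex AI custom training jobs are configured with worker pool specs. Here is a hedged sketch of such a config; the machine types, accelerator counts, replica counts, and container image URI are placeholders you would adapt to your own project and quota.

```yaml
# Sketch of a Vertex AI CustomJob spec for multi-worker GPU training.
# All machine types, counts, and the image URI below are placeholders.
workerPoolSpecs:
  - machineSpec:
      machineType: a2-highgpu-8g
      acceleratorType: NVIDIA_TESLA_A100
      acceleratorCount: 8
    replicaCount: 1            # chief worker
    containerSpec:
      imageUri: us-docker.pkg.dev/YOUR_PROJECT/train/llama2-ft:latest
  - machineSpec:
      machineType: a2-highgpu-8g
      acceleratorType: NVIDIA_TESLA_A100
      acceleratorCount: 8
    replicaCount: 3            # additional workers
    containerSpec:
      imageUri: us-docker.pkg.dev/YOUR_PROJECT/train/llama2-ft:latest
```

You would submit a config like this with `gcloud ai custom-jobs create --region=us-central1 --display-name=llama2-ft --config=job.yaml`, and your training code inside the container decides how to shard work across the workers (e.g. via FSDP or MultiWorkerMirroredStrategy).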
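The checkpointing tip above can be sketched in a few lines. This is a minimal save/resume loop using JSON in a temp directory purely for illustration; a real training job would save model and optimizer state (e.g. with `torch.save`) to Cloud Storage instead.

```python
import json
import os
import tempfile

def save_checkpoint(path, step, state):
    """Persist the step counter and training state."""
    with open(path, "w") as f:
        json.dump({"step": step, "state": state}, f)

def load_checkpoint(path):
    """Return (next_step, state), or a fresh start if no checkpoint exists."""
    if not os.path.exists(path):
        return 0, {"loss": None}
    with open(path) as f:
        ckpt = json.load(f)
    return ckpt["step"], ckpt["state"]

ckpt_path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
start, state = load_checkpoint(ckpt_path)      # fresh run: starts at step 0
for step in range(start, 10):
    state = {"loss": 1.0 / (step + 1)}         # stand-in for a training step
    if step % 5 == 0:                          # checkpoint every 5 steps
        save_checkpoint(ckpt_path, step + 1, state)

# Simulate a restart: training resumes at the last checkpointed step,
# not from the beginning.
resume_step, _ = load_checkpoint(ckpt_path)
print(f"resuming from step {resume_step}")
```

If the job is preempted or fails, only the steps since the last checkpoint are lost, which is what makes preemptible capacity viable for long runs.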

Please note:

  • Consider using Cloud TPU Pod Sharing. Cloud TPU Pod Sharing allows you to share a Cloud TPU pod with other users. This can help to reduce the cost of training your model.
  • Use Cloud TPU preemptible VMs. Cloud TPU preemptible VMs are Cloud TPU VMs that can be taken away from you if someone else needs them. However, Cloud TPU preemptible VMs are significantly cheaper than Cloud TPU reserved VMs. If you can afford to have your training process interrupted, then Cloud TPU preemptible VMs can be a good way to save money.
  • Use Cloud TPU custom machine learning (CML) images. Cloud TPU CML images are custom machine learning images that you can create and use to train your model. Cloud TPU CML images can help you to improve the performance and cost-effectiveness of your training process.
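To see why preemptible capacity can pay off despite interruptions, a back-of-envelope comparison helps. All rates and percentages below are placeholders, not real Google Cloud prices; check the pricing page for current numbers.

```python
# Back-of-envelope: on-demand vs. preemptible/Spot training cost.
# Every number here is an illustrative PLACEHOLDER, not a real price.
on_demand_rate = 10.0    # $/hour per node (placeholder)
spot_discount = 0.60     # Spot often substantially cheaper (assumption)
train_hours = 20.0       # total compute hours the job needs
rework_overhead = 0.15   # extra hours lost redoing work after preemptions (assumption)

on_demand_cost = on_demand_rate * train_hours
spot_cost = (on_demand_rate * (1 - spot_discount)
             * train_hours * (1 + rework_overhead))

print(f"on-demand: ${on_demand_cost:.2f}  spot: ${spot_cost:.2f}")
```

Even after padding the Spot run for interruption overhead, the discounted rate usually dominates, provided your job checkpoints regularly so preemptions lose little work.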

If you need help training or deploying the Llama2 70b model on Google Cloud TPUs, you can contact the Google Cloud support team. Also, read Google Cloud's blog posts and introductory videos on efficient fine-tuning to learn more about optimizing your model training for faster results.


To figure out the hourly costs to host Llama2 on Google Cloud, you will need to consider the following factors:

  • Model size: Larger models will typically cost more to host than smaller models.
  • Model complexity: More complex models will also typically cost more to host than simpler models.
  • Prediction volume: The number of predictions that you make per hour will also affect your costs.
  • Region: The region where you host your model will also affect your costs.

Once you have considered these factors, you can use the Google Cloud pricing calculator (which covers Vertex AI) to estimate your hourly costs.

Here is an example of how to use the pricing calculator to estimate the hourly costs to host Llama2 on Vertex AI:

Model size: 10 GB
Model complexity: Medium
Prediction volume: 100,000 predictions per hour
Region: US-central1
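As a rough sketch of the arithmetic behind such an estimate: Vertex AI online prediction is billed per node-hour by machine type, so the hourly cost is driven by how many serving nodes your prediction volume requires. The rate and throughput figures below are placeholders, not real prices; use the pricing calculator for actual numbers.

```python
import math

# All figures are illustrative PLACEHOLDERS, not real Google Cloud prices.
node_hour_rate = 3.0             # $/node-hour for the serving machine (assumption)
preds_per_node_hour = 40_000     # sustained throughput per node (assumption)
target_preds_per_hour = 100_000  # prediction volume from the example above

# Nodes needed to sustain the target volume, rounded up to whole nodes.
nodes = math.ceil(target_preds_per_hour / preds_per_node_hour)
hourly_cost = nodes * node_hour_rate
print(f"{nodes} nodes -> ${hourly_cost:.2f}/hour")
```

Note that a 70b-parameter model also constrains which machine types (and how many accelerators) can serve it at all, which in turn sets the floor on the node-hour rate.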


I hope this information helps you make the right decision.
