Develop gen AI applications with LLMs locally, using localllm, Cloud Workstations, and quantized models

VIVEK KUMAR UPADHYAY
8 min read · Feb 8, 2024


Large language models (LLMs) are powerful AI tools that can generate natural language content, such as text, speech, or code, based on a given input or prompt. LLMs have many applications in gen AI, which is the branch of AI that focuses on creating novel and realistic artifacts, such as images, video, music, or product designs. However, developing gen AI applications with LLMs is not an easy task, as it requires a lot of computational resources, especially GPUs, which are expensive and scarce. Moreover, relying on remote servers or cloud-based GPU instances for LLM-based application development can introduce latency, security concerns, and dependency on third-party services.

What if there was a way to run LLMs locally on your CPU and memory, without GPUs, and still get high-quality results? This is where localllm comes in. localllm is a new open-source tool that enables developers to run LLMs locally on Cloud Workstations, Google Cloud’s fully managed development environment. By using a combination of quantized models, Cloud Workstations, and localllm, you can develop gen AI applications on a well-equipped development workstation, leveraging existing processes and workflows. In this guide, we will show you how to use localllm, Cloud Workstations, and quantized models to create gen AI applications locally, without GPUs.

Quantized models

Quantized models are AI models that have been optimized to run on devices with limited computational resources, such as smartphones, laptops, and other edge devices. Instead of the standard 32-bit floating-point numbers, they perform computations with lower-precision data types, such as 8-bit integers. Representing weights and activations with fewer bits shrinks the overall size of the model, so it fits in less memory and storage, and the lower-precision arithmetic runs faster on low-power CPUs. The result is faster inference, which lets AI applications run more smoothly and responsively on local devices.
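To make the idea concrete, here is a minimal sketch of 8-bit affine quantization in plain NumPy. It is purely illustrative: the schemes used by real quantized LLMs (per-channel or grouped scales, for example) are more sophisticated, but the core trade of precision for memory is the same.

import numpy as np

# Illustrative 8-bit affine quantization of a float32 weight tensor.
# This is a toy version of the idea; real quantized LLM formats use
# per-group scales and other refinements.
def quantize_int8(weights):
    w_min, w_max = float(weights.min()), float(weights.max())
    scale = (w_max - w_min) / 255.0              # one step per int8 level
    zero_point = round(-w_min / scale) - 128     # maps w_min to -128
    q = np.clip(np.round(weights / scale + zero_point), -128, 127)
    return q.astype(np.int8), scale, zero_point

def dequantize_int8(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

weights = np.random.randn(4, 4).astype(np.float32)
q, scale, zp = quantize_int8(weights)
restored = dequantize_int8(q, scale, zp)
print("max reconstruction error:", np.abs(weights - restored).max())
print("storage per weight: 4 bytes (float32) -> 1 byte (int8)")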

Quantized models are compatible with localllm, which means you can use them to run LLMs locally on your CPU and memory, without GPUs. There are many quantized models available for different tasks and domains, such as text generation, code generation, image captioning, and more. You can find examples of quantized models that work with localllm on llama-cpp-python's web server. These models are based on popular open LLMs, but they have been quantized to run faster and more efficiently on local devices.
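As a rough illustration of what running one of these quantized models on CPU looks like, the sketch below uses llama-cpp-python directly. The model path is a placeholder for whichever quantized model file you have downloaded, and settings such as n_ctx and n_threads are assumptions you would tune for your own workstation.

# Sketch: CPU-only inference on a locally stored quantized model with
# llama-cpp-python. The model path is a placeholder, not a real file.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/your-quantized-model.gguf",  # placeholder path
    n_ctx=2048,    # context window size (assumption; adjust as needed)
    n_threads=8,   # CPU threads to use (assumption; tune for your machine)
)

output = llm(
    "Q: Explain in one sentence why quantized models run well on CPUs. A:",
    max_tokens=64,
    stop=["Q:"],
)
print(output["choices"][0]["text"])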

Cloud Workstations

Cloud Workstations is Google Cloud's fully managed development environment, providing a flexible, scalable, and cost-effective way to create and run development workstations on Google Cloud. Cloud Workstations are based on Compute Engine VMs that are preconfigured with the software and tools you need for your development workflow. You can access your Cloud Workstations through a browser-based IDE, such as Cloud Shell Editor, or from local code editors, such as VSCode or JetBrains IDEs. You can also connect to your Cloud Workstations through SSH. Cloud Workstations lets you take advantage of Google Cloud's infrastructure, security, and services, while giving you full control and customization over your development environment.

Cloud Workstations are ideal for running localllm and quantized models, as they provide a well-equipped development workstation that can run LLMs locally on CPU and memory, without GPUs. You can create a custom base image for your Cloud Workstation that includes localllm and the quantized models you want to use. You can also configure your Cloud Workstation cluster, machine type, and persistent storage according to your needs. You can start and stop your Cloud Workstations on demand, and only pay for the resources you use. You can also update your Cloud Workstation configuration at any time, and the changes will apply to your Cloud Workstations the next time they start. This way, you can keep your development environment up to date and consistent.

localllm

localllm is a new open-source tool that allows you to run LLMs locally on your Cloud Workstation, using CPU and memory, without GPUs. localllm works with quantized models, which are optimized to run on devices with limited resources. localllm can query the quantized models and generate natural language content, such as text, speech, or code, based on a given input or prompt. localllm can also perform tasks such as text summarization, text translation, text classification, and more. localllm is easy to install, run, and query on your Cloud Workstation, using a simple Python script. localllm can also be integrated with other tools and frameworks, such as TensorFlow, PyTorch, or Hugging Face Transformers, to create more complex and advanced gen AI applications.
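As a sketch of what such a query script might look like, the snippet below assumes a quantized model is already being served on the workstation through llama-cpp-python's OpenAI-compatible web server; the host, port, and endpoint path are assumptions to adjust to however you started the server.

# Sketch: querying a quantized model served locally over HTTP.
# The URL below is an assumption based on llama-cpp-python's
# OpenAI-compatible server defaults; change it to match your setup.
import requests

response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "prompt": "Write a short product description for a reusable water bottle.",
        "max_tokens": 80,
        "temperature": 0.7,
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["choices"][0]["text"])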

localllm enables you to develop gen AI applications locally, without GPUs, and still get high-quality results. By using localllm, you can leverage the power of LLMs to generate natural language content, write code, or solve problems in innovative ways. You can also experiment with different quantized models, inputs, and outputs, and see the results in real time. localllm can help you speed up your development process, enhance your creativity, and reduce your costs.

How localllm can help you develop gen AI applications locally, without GPUs

If you are a developer who wants to create gen AI applications with LLMs, you might face some challenges, such as:

- High costs: GPUs are not cheap, and you might need to pay for cloud-based GPU instances or rent GPU servers to run LLMs remotely.
- Latency: Running LLMs remotely can introduce latency, which can affect the performance and user experience of your gen AI applications.
- Security: Running LLMs remotely can expose your data and models to third-party services, which can pose security risks and privacy issues.
- Dependency: Running LLMs remotely can make you dependent on third-party services, which can limit your control and flexibility over your development environment and workflow.

localllm addresses these challenges by running LLMs locally on your Cloud Workstation, using only CPU and memory. By using localllm, you can:

- Save costs: You can run LLMs locally on your CPU and memory, without GPUs, and only pay for the resources you use on your Cloud Workstation. You can also start and stop your Cloud Workstations on demand, and configure them according to your needs.
- Reduce latency: You can run LLMs locally on your Cloud Workstation, and get faster and more responsive results. You can also see the results in real time, and experiment with different inputs and outputs.
- Enhance security: You can run LLMs locally on your Cloud Workstation, and keep your data and models within your Google Cloud project. You can also use Google Cloud's security features, such as encryption, identity and access management, and audit logging, to protect your development environment.
- Increase flexibility: You can run LLMs locally on your Cloud Workstation, and have full control and customization over your development environment and workflow. You can also update your Cloud Workstation configuration at any time, and the changes will apply to your Cloud Workstations the next time they start.

Use cases and examples

There are many use cases and examples of how localllm can be used to develop gen AI applications locally, without GPUs. Here are some of them:
  • Code generation: You can use localllm to generate code, such as Python, Java, or HTML, based on a given specification, description, or example. For example, you can use localllm to generate a Python function that calculates the factorial of a given number, using the quantized model gpt3-quantized. Here is a possible output:
def factorial(n):
    # A function that calculates the factorial of a given number
    # Input: n, a positive integer
    # Output: the factorial of n, or -1 if n is invalid
    # Check if n is a valid input
    if not isinstance(n, int) or n < 0:
        return -1
    # Initialize the result variable
    result = 1
    # Loop from 1 to n, multiplying the result by each number
    for i in range(1, n + 1):
        result *= i
    # Return the result
    return result
  • Image captioning: You can use localllm to generate captions for images, such as photos, drawings, or diagrams, based on the image content and context. For example, you can use localllm to generate a caption for an image of a dog playing with a ball, using the quantized model t5-quantized. Here is a possible output:
A happy dog is jumping in the air to catch a yellow ball in its mouth.
  • And more: You can use localllm to perform other tasks, such as text summarization, text translation, text classification, and more, using different quantized models and inputs; a summarization sketch follows this list. You can also integrate localllm with other tools and frameworks, such as TensorFlow, PyTorch, or Hugging Face Transformers, to create more complex and advanced gen AI applications. You can find more examples and tutorials on how to use localllm on the GitHub repository.
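As an example of one of those extra tasks, here is a sketch of prompt-based text summarization against the same assumed local endpoint used earlier; the prompt template and parameters are illustrative assumptions, not part of localllm itself.

# Sketch: prompt-based summarization using a locally served quantized model.
# Endpoint and parameters are assumptions carried over from the earlier example.
import requests

article = (
    "Quantized models trade a small amount of numerical precision for large "
    "savings in memory and compute, which makes it practical to run large "
    "language models on an ordinary CPU-only development workstation."
)

prompt = "Summarize the following text in one sentence:\n\n" + article + "\n\nSummary:"

response = requests.post(
    "http://localhost:8000/v1/completions",  # assumed local endpoint
    json={"prompt": prompt, "max_tokens": 48, "temperature": 0.2},
    timeout=60,
)
response.raise_for_status()
print(response.json()["choices"][0]["text"].strip())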

Conclusion

In this guide, we have shown you how to use localllm, Cloud Workstations, and quantized models to develop gen AI applications locally, without GPUs. By using this combination, you can leverage the power of LLMs to generate natural language content, write code, or solve problems in innovative ways, while saving costs, reducing latency, enhancing security, and increasing flexibility. localllm is a new and innovative tool that can help you develop gen AI applications locally, without GPUs, and still get high-quality results.

If you are interested in trying out localllm, you can follow this guide to get started. You can also check out examples of gen AI applications that use localllm, such as text generation, code generation, image captioning, and more. localllm is an open-source project, and we welcome your feedback and contributions; you can join the community on GitHub. And if you are interested in science research and technology news, do follow physicsalert.com.


VIVEK KUMAR UPADHYAY

I am a professional Content Strategist & Business Consultant with expertise in the Artificial Intelligence domain. MD - physicsalert.com.