Leo

Leo的恒河沙

A business consultant active in the Pearl River Delta and Yangtze River Delta / cross-border e-commerce expert / investor / tech geek / cycling enthusiast / king of two Border Collies and a bunch of little stray cats / married. Subscribe for daily updates of articles I have curated for close reading, spanning business, economics, relationships, technology, and more; in short, everything that interests me.

2023-12-24 - Easily Play with Local Large Models Using Ollama - Minority

Easily Play with Local Large Models Using Ollama - Minority#


Introduction#

Back in April, taking LLaMA's open-source release as an opportunity, I tried to map out the democratization movement around large language models (LLMs) and the "alpaca" family that kept appearing in the names of open-source projects, including the important role played by the llama.cpp project.

At that time, the alpaca family was just emerging in the field of open-source large models, and the surrounding ecosystem was thriving. As the year draws to a close, looking back over the past three quarters, with Meta's release of the more capable and more openly licensed Llama 2 in July as a milestone, the open-source community has once again adapted, evolved, and pushed these models into practical use with unstoppable momentum.

Today, LLMs are no longer synonymous with expensive GPUs: inference can run on most consumer-grade computers, a setup commonly referred to as local large models.

Llama 2 Three-Piece Set

Elegance is Not Easy#

As a rule of thumb, inference with a model at 16-bit floating-point precision (FP16) requires roughly 2 GB of GPU memory per billion parameters. Accordingly, Llama 2 7B (7 billion parameters) needs about 14 GB of GPU memory for inference, which clearly exceeds the hardware specifications of an ordinary home computer. For reference, a GeForce RTX 4060 Ti 16GB graphics card sells for over 3,000 yuan.

Model quantization can significantly reduce these memory requirements. With 4-bit quantization, for example, the original FP16 weights are compressed to 4-bit integer precision, shrinking both the model weights and the memory needed for inference to roughly 1/4 to 1/3 of the FP16 footprint; about 4 GB of memory is then enough to start inference with a 7B model (actual memory usage will, of course, grow as the context accumulates).
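
As a quick back-of-the-envelope check, the same arithmetic can be done in the shell; this is just a rough sketch that ignores context and activation overhead (parameters in billions times bytes per weight):

% echo "7 * 2.0" | bc    # FP16, 2 bytes per weight: ~14 GB
% echo "7 * 0.5" | bc    # 4-bit, 0.5 bytes per weight: ~3.5 GB, before context overhead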

At the same time, the llama.cpp project rewrote the inference code in C/C++, avoiding the complex dependencies introduced by PyTorch and offering much broader hardware support, from pure CPU inference to various underlying compute architectures such as Apple Silicon, each with its own inference acceleration. And because the Llama architecture is so widely adopted, llama.cpp's quantization and inference capabilities carry over almost seamlessly to other open-source large language models built on the same architecture, such as Alibaba Cloud's Qwen series and 01.AI's Yi series.

Despite all the benefits llama.cpp brings, when you want to actually try it hands-on, you find that you must obtain the model weights, clone the project code, quantize the model, set environment variables, build the executables, and more, just to ask a single test question from the command line, to say nothing of the dozens of parameters that may need to be tuned by hand.

Thus, for a long time, local large models and applications based on llama.cpp were limited to a small circle of geeks and researchers, with a high entry barrier keeping many ordinary people out.

That was the case until Ollama came along: a concise, easy-to-use runtime framework for local large models. As the ecosystem around Ollama has taken shape, more users can now easily play with large models on their own computers.

Official Website

Quick Start#

Installing Ollama is straightforward. macOS users can simply download the installer from the official website and run it; Windows does not yet have an installer, and the official recommendation is to install from the command line inside WSL 2:

% curl https://ollama.ai/install.sh | sh

Tip: always review curl | sh style installation scripts and be mindful of the risks before running them.

If you are familiar with Docker, you can also directly use its official image.
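
For reference, a minimal sketch of the Docker route, based on the official ollama/ollama image (CPU-only here; the container exposes the same CLI and the same API port used later in this article):

% docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
% docker exec -it ollama ollama run llama2-chinese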

Once ollama --version runs successfully and prints a version number, Ollama is installed, and you can use the pull command to download models from the online model library to play with.
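
For example, assuming the ollama binary is on your PATH:

% ollama --version    # prints the installed version
% ollama list         # lists the models already downloaded locally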

Taking the Chinese fine-tuned Llama2-Chinese 7B model as an example, the following command will download nearly 4GB of the 4-bit quantized model file, requiring at least 8GB of memory for inference, with 16GB recommended for smooth operation.

% ollama pull llama2-chinese

After the download finishes, use the run command to run the model. You can append the message directly after the command, or leave it empty to enter dialogue mode, which offers several built-in commands prefixed with a slash:

# Single input
% ollama run llama2-chinese "Why is the sky blue?"
# Dialogue mode
% ollama run llama2-chinese
>>> /?
Available Commands:
  /set         Set session variables
  /show        Show model information
  /bye         Exit
  /?, /help    Help for a command

Use """ to begin a multi-line message.

>>> Why is the sky blue?

This question is a common controversy. Some scientists believe that the blue color of the sky can be explained by the light color reflected by tiny fragments in fog and clouds, while others believe it is due to the influence of the Earth's own temperature. Currently, there is no widely accepted explanation for this question.

It is worth mentioning that Ollama detects the hardware it is running on and enables GPU acceleration when feasible, so you may want to open Activity Monitor or Task Manager during inference to verify.

At this point, you have experienced an accessible local large model.

Beyond the Command Line#

If you find the command-line interface unfriendly, Ollama has a whole range of peripheral tools to choose from, including web and desktop interfaces, terminal UIs, and various plugins and extensions.

The reason Ollama has built such a rich ecosystem so quickly is that it had a clear positioning from the very beginning: to let more people run large models locally in the simplest and fastest way possible. Ollama is not just a thin wrapper around llama.cpp; it also bundles the numerous parameters together with the corresponding models, so what you get is effectively a concise command-line tool plus a stable server API. This greatly facilitates downstream applications and extensions.
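
As a small illustration of that server API, a single-prompt request can look roughly like this, assuming the default port 11434 and the generate endpoint (the chat endpoint appears later in this article):

curl http://localhost:11434/api/generate -d '{
  "model": "llama2-chinese",
  "prompt": "Why is the sky blue?",
  "stream": false
}'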

Regarding the Ollama GUI, there are many options based on different preferences:

Web Version: Ollama WebUI has the interface closest to ChatGPT's and the richest feature set, and needs to be deployed via Docker;

Ollama WebUI Example, image source from project homepage

Terminal TUI Version: oterm provides comprehensive functionality and shortcut key support, installable via brew or pip;

Oterm Example, image source from project homepage

Raycast Plugin: namely Raycast Ollama, which is also the Ollama front end I use most often. It inherits the strengths of Raycast: after selecting or copying text you can invoke commands directly, which makes for a very smooth experience. As a substitute for Raycast AI (about 8 US dollars per month), Raycast Ollama implements most of Raycast AI's functionality, and as Ollama and open-source models iterate it will support multimodal features that Raycast AI does not, showing great potential.

Raycast Ollama plugin replicating Raycast AI

Additionally, there are interesting applications such as Ollamac, a native macOS app written in Swift, and Obsidian Ollama, which is similar to Notion AI; pick whichever suits your needs.

Advanced Play#

Changing Models#

If you read the answer to "Why is the sky blue?" in the earlier demonstration carefully, you may have noticed something was off: congratulations, you have just caught a large language model "hallucinating". In fact, because of their small parameter counts and quantization losses, models suited to running locally are more prone to hallucination and sometimes answer complete nonsense. About the only remedy is to run a model with more parameters whenever conditions allow.

For example, the Llama2-Chinese model used earlier has 7B parameters in the 4-bit quantized version. If you have 16GB of memory, you might consider running the 13B parameter version.

How do you do that? Ollama organizes models much like Docker organizes images, identifying a specific model version with the model name plus a tag (model:tag); when no tag is given, it defaults to latest, which usually corresponds to the 7B, 4-bit quantized version. To run the 13B version, use the 13b tag:

% ollama run llama2-chinese:13b "Why is the sky blue?"

The sky is blue due to the scattering of sunlight by the atmosphere.

Before sunrise, the sky appears purple or rainbow-colored because sunlight reflects back from the sea level and is scattered into blue, purple, or rainbow colors by carbon dioxide and water molecules in the atmosphere.

After sunrise, the sky turns gray because sunlight is blocked by the atmosphere, and there is no longer enough reflection to give the sky a blue hue.

When we see the sky, its color is formed by the interaction of sunlight with substances in the atmosphere. These substances include water, carbon dioxide, and other gases, as well as tiny ice crystals and dust.

As we see the sky turn to night, it gradually becomes a deeper blue due to the propagation of sunlight in the atmosphere and also because of the structure of the atmosphere.

As you can see, the effect is indeed better. Other optional tags can also be viewed on the corresponding model's tags page.

Additionally, you can switch to other models entirely. Here are a few recommended models from the official model library that handle Chinese relatively well or are otherwise interesting (example commands follow the list):

  • The DeepSeek series, launched by the DeepSeek team, including DeepSeek-Coder trained for code and the general-purpose DeepSeek-LLM;
  • The Yi series, launched by the 01.AI (Zero One) team, with versions supporting a context window of around 200K tokens;
  • If you happen to have ample financial resources, you might try Mixtral-8x7B, the first open-source mixture-of-experts (MoE) model, recently released by the French star startup Mistral; it requires 48GB of memory to run;
  • If hardware is tight, don't be discouraged: Phi-2, fine-tuned by Microsoft for logic and understanding, is only 2.7B parameters, needs just 4GB of memory to run, and responds quickly, though it doesn't understand Chinese very well.
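
Pulling any of these works the same way as before; for example (these are the library names as of this writing, so check each model's page for the exact tags):

% ollama pull deepseek-llm
% ollama run phi "Why is the sky blue?"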

Image Support#

In addition to pure language models, Ollama has also supported vision models since version 0.1.15, which is well worth trying. Just put the path of a local image in the prompt (macOS users can drag the image into the terminal to get its path):

% ollama run llava
>>> What does the text in this image say? /Users/mchiang/Downloads/image.png 
Added image '/Users/mchiang/Downloads/image.png'

The text in this image says "The Ollamas."

LLaVA example, image source from Ollama release announcement

Custom System Prompts#

From experience with ChatGPT, most people already appreciate the importance of the system prompt: a good system prompt effectively steers a large model into the desired state. Ollama offers several ways to customize the system prompt.

First, many Ollama front ends already provide a configuration entry for the system prompt, and it is easiest to use that directly. Under the hood, these front ends talk to the Ollama server via its API, which we can also call ourselves, passing in the system prompt as an option:

curl http://localhost:11434/api/chat -d '{
  "model": "llama2-chinese:13b",
  "messages": [
    {
      "role": "system",
      "content": "Answer simply in the tone of a pirate."
    },
    {
      "role": "user",
      "content": "Why is the sky blue?"
    }
  ],
  "stream": false
}'

The message whose role is system is the system prompt.
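
The response comes back as JSON; if you have jq installed, you can extract just the answer text (in the non-streaming chat response, the text should live under message.content):

curl -s http://localhost:11434/api/chat -d '{
  "model": "llama2-chinese:13b",
  "messages": [
    {"role": "system", "content": "Answer simply in the tone of a pirate."},
    {"role": "user", "content": "Why is the sky blue?"}
  ],
  "stream": false
}' | jq -r '.message.content'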

More Options#

Ollama's Modelfile gives users even more room for customization: the system prompt, dialogue template, inference temperature, context window length, and other parameters can all be set by the user, which suits advanced use.

Before creating one, you can inspect an existing model's Modelfile with the show --modelfile command and use it as a reference:

% ollama show --modelfile llama2-chinese:13b
# Modelfile generated by "ollama show"
# To build a new Modelfile based on this one, replace the FROM line with:
# FROM llama2-chinese:13b

FROM ~/.ollama/models/blobs/sha256:8359bebea988186aa6a947d55d67941fede5044d02e0ab2078f5cc0dcf357831
TEMPLATE """{{ .System }}
Name: {{ .Prompt }}
Assistant:
"""
PARAMETER stop "Name:"
PARAMETER stop "Assistant:"

For example, to customize the system prompt and adjust the inference temperature, construct a Modelfile in the following format:

FROM llama2-chinese:13b

SYSTEM "Answer in the tone of a pirate."
PARAMETER temperature 0.1

Then use the create command to build it; the new model inherits the original model's weight files and any option parameters you did not change:

% ollama create llama2-chinese-pirate -f ~/path/to/ModelFile

Thus, you have your own local model.
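
Running it is the same as with any other local model, for example:

% ollama run llama2-chinese-pirate "Why is the sky blue?"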

Conclusion#

Compared with ordinary application software, Ollama's user experience may still be hard to call "elegant." But compared with the state of things a few months ago, the progress it represents is like going from slash-and-burn farming to modern society: back then you had to spend real money on graphics cards, fiddle with environment configuration, or compile everything yourself just to get something running; now a model can run smoothly on a laptop less than a week after its release (Phi-2 came out just last week). From this perspective, it is no exaggeration to say that Ollama has contributed to the democratization of AI technology.
