Llama.cpp Study Note
llama.cpp Study Environment Setup
Before studying the code of llama.cpp, it helps to set up a test environment for experimenting with it.
Download llama.cpp
First of all, download the latest llama.cpp source from GitHub.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
llama.cpp is an actively developed project; at the time of writing, Release b4053 · ggerganov/llama.cpp is the latest release.
To study the implementation of llama.cpp, GDB is a convenient tool for stepping through the C/C++ code.
# Install necessary tools and libraries
apt install build-essential cmake gdb
# Build llama.cpp with debug flag enabled
cmake -B build -DCMAKE_BUILD_TYPE=Debug
cmake --build build -j
Download and convert LLM model file
Although llama.cpp is a C++ project, it uses a Python script to convert models downloaded from Hugging Face to GGUF. Downloading models from Hugging Face also requires huggingface-cli.
# create a virtual environment
conda create -n llama.cpp -y
conda activate llama.cpp
conda install pytorch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 cpuonly -c pytorch
conda install huggingface_hub
pip install -r requirements.txt
Next, download an LLM model for testing. The commands below use google/gemma-3-1b-it as an example; the model-loading walkthrough later in this note was traced with meta-llama/Llama-3.1-8B-Instruct, but the same steps apply to either model.
huggingface-cli login
huggingface-cli download google/gemma-3-1b-it --exclude "original/*" --local-dir models/gemma-3-1b-it
llama.cpp uses the GGUF file format to store the LLM, so the downloaded Hugging Face model needs to be converted to a GGUF file.
python3 ./convert_hf_to_gguf.py models/gemma-3-1b-it
Code viewer and GDB debug environment
Visual Studio Code is used for reading the code. To enable GDB debugging, refer to https://code.visualstudio.com/docs/editor/debugging. Below is an example launch.json.
{
    // Use IntelliSense to learn about possible attributes.
    // Hover to view descriptions of existing attributes.
    // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
    "version": "0.2.0",
    "configurations": [
        {
            "name": "(gdb) Launch",
            "type": "cppdbg",
            "request": "launch",
            "program": "${workspaceFolder}/build/bin/llama-cli",
            "args": ["-m", "${workspaceFolder}/models/gemma-3-1b-it/gemma-3-1B-it-F16.gguf", "-p", "I believe the meaning of life is", "-n", "128", "-no-cnv", "--chat-template", "gemma"],
            "stopAtEntry": false,
            "cwd": "${workspaceFolder}",
            "environment": [],
            "externalConsole": false,
            "MIMode": "gdb",
            "setupCommands": [
                {
                    "description": "Enable pretty-printing for gdb",
                    "text": "-enable-pretty-printing",
                    "ignoreFailures": true
                },
                {
                    "description": "Set Disassembly Flavor to Intel",
                    "text": "-gdb-set disassembly-flavor intel",
                    "ignoreFailures": true
                }
            ]
        }
    ]
}
After launching the GDB debugger, we should be able to step through llama.cpp. ![[https://github.com/zhiyisun/zhiyisun.github.io/blob/master/_posts/resource/20240927125033.png]]
How does llama.cpp load a model?
In the main function of examples/main/main.cpp, common_init_from_params() is used to load the LLM model file and initialize the model.
// load the model and apply lora adapter, if any
LOG_INF("%s: load the model and apply lora adapter, if any\n", __func__);
common_init_result llama_init = common_init_from_params(params);
There are four parameter structures defined and used in this function:
- **common_params params**: parsed from the command-line arguments.
- **common_init_result iparams**: the return value, which holds the model and the context created in this function.
- **struct llama_model_params mparams**: the model-related parameters derived from params above; used when loading the model from file.
- **struct llama_context_params cparams**: the context-related parameters derived from params above.
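To make the relationship between these four structures concrete, here is a heavily condensed sketch of what common_init_from_params() roughly does. This is not the actual source: error handling, LoRA adapter loading, and many options are omitted, and the helper and field names follow the b4053-era common.h, so they may differ in other versions.
#include "common.h"
#include "llama.h"

common_init_result common_init_from_params(common_params & params) {
    common_init_result iparams;

    // derive model and context parameters from the parsed command-line options
    llama_model_params   mparams = common_model_params_to_llama(params);
    llama_context_params cparams = common_context_params_to_llama(params);

    // load the model from the GGUF file, then create an inference context on top of it
    llama_model   * model = llama_load_model_from_file(params.model.c_str(), mparams);
    llama_context * lctx  = llama_new_context_with_model(model, cparams);

    // hand both back to the caller; examples/main/main.cpp keeps them for the whole run
    iparams.model   = model;
    iparams.context = lctx;
    return iparams;
}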
llama_model_load
The function call chain common_init_from_params() -> llama_load_model_from_file() -> llama_model_load() loads the model from a file. A model can also be loaded from a Hugging Face repository or a custom URL; this study note uses a local model file for the discussion.
This function does the following:
- Load the model file into a llama_model_loader instance ml
llama_model_loader ml(fname, params.use_mmap, params.check_tensors, params.kv_overrides);
- Load the model architecture from the KV pairs; in this case it is LLM_ARCH_LLAMA. The KV pairs come from the llama_model_loader ml above (see the standalone sketch after this list for a quick way to inspect them)
llm_load_arch(ml, model);
- Prepare the llama_model hyperparameters
llm_load_hparams(ml, model);
- Load the vocabulary; in this case it is LLAMA_VOCAB_TYPE_BPE/LLAMA_VOCAB_PRE_TYPE_LLAMA3 (a GPT-2 tokenizer based on byte-level BPE)
llm_load_vocab(ml, model);
- Load the tensors
llm_load_tensors(ml, model, params.n_gpu_layers, params.split_mode, params.main_gpu, params.tensor_split, params.use_mlock, params.progress_callback, params.progress_callback_user_data);
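The architecture, hyperparameters, and vocabulary above are all read from the GGUF KV pairs. The small standalone sketch below inspects those KV pairs using only the public gguf API, without going through llama_model_loader; the header location and the default model path are assumptions (in trees of this era the gguf_* declarations live in ggml.h).
#include <cstdio>
#include "ggml.h" // gguf_* API; newer trees also ship a separate gguf.h

int main(int argc, char ** argv) {
    const char * fname = argc > 1 ? argv[1] : "models/gemma-3-1b-it/gemma-3-1B-it-F16.gguf";

    struct gguf_init_params params = { /*.no_alloc =*/ true, /*.ctx =*/ NULL };
    struct gguf_context * ctx = gguf_init_from_file(fname, params);
    if (!ctx) { fprintf(stderr, "failed to open %s\n", fname); return 1; }

    // "general.architecture" is the KV pair that llm_load_arch() maps to LLM_ARCH_*
    const int arch_id = (int) gguf_find_key(ctx, "general.architecture");
    printf("architecture: %s\n", arch_id >= 0 ? gguf_get_val_str(ctx, arch_id) : "?");
    printf("n_kv        : %lld\n", (long long) gguf_get_n_kv(ctx));
    printf("n_tensors   : %lld\n", (long long) gguf_get_n_tensors(ctx));

    gguf_free(ctx);
    return 0;
}
Running it against the converted Llama-3.1-8B file, for example, reports the 33 KV pairs and 292 tensors mentioned in the gguf_init_from_file section below.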
llama_model_loader
For loading a model, the central struct llama_model_loader is defined and implemented in the file llama.cpp. This struct contains all of the functions related to loading a model: the llama_model_loader constructor, get_arch/weight/tensor_meta/mapping_range, create_tensor, load_all_data, etc.
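To get a feel for what get_weight()/get_tensor_meta() return, the standalone sketch below enumerates every tensor's name, shape, and data offset through the public gguf/ggml API rather than through llama_model_loader itself; the header location and the default model path are assumptions.
#include <cstdio>
#include "ggml.h" // gguf_* and ggml_* APIs

int main(int argc, char ** argv) {
    const char * fname = argc > 1 ? argv[1] : "model.gguf";

    struct ggml_context * meta = NULL; // receives only tensor metadata, no tensor data
    struct gguf_init_params params = { /*.no_alloc =*/ true, /*.ctx =*/ &meta };
    struct gguf_context * ctx = gguf_init_from_file(fname, params);
    if (!ctx) { fprintf(stderr, "failed to open %s\n", fname); return 1; }

    const int n_tensors = (int) gguf_get_n_tensors(ctx);
    for (int i = 0; i < n_tensors; ++i) {
        const char * name = gguf_get_tensor_name(ctx, i);
        const struct ggml_tensor * t = ggml_get_tensor(meta, name);
        printf("%-40s [%lld, %lld, %lld, %lld] offset=%zu\n", name,
               (long long) t->ne[0], (long long) t->ne[1],
               (long long) t->ne[2], (long long) t->ne[3],
               (size_t) gguf_get_tensor_offset(ctx, i));
    }

    gguf_free(ctx);
    ggml_free(meta);
    return 0;
}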
gguf_init_from_file
llama_model_loader() -> gguf_init_from_file() opens the GGUF model file.
- Check that the magic in the model file’s first 4 bytes matches “GGUF”
- Allocate the key data structure struct gguf_context ctx. All of the information read from the model file is written into this ctx.
- Read the header: version, n_tensors, n_kv. For the llama3.1:8B model, there are 33 key-value pairs and 292 tensors.
- Read the key-value pairs
- Read the tensor info entries
- Create tensors. In this step, ggml_init() allocates the mem_buffer, then ggml_new_tensor() is called to create a struct ggml_tensor for each tensor in the mem_buffer.
The file type (ftype) is determined by the tensor type that occurs most often in the file.
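The header fields listed above can also be read directly. Below is a minimal standalone sketch that parses only the GGUF header (magic, version, tensor count, KV count); it assumes the GGUF v2/v3 layout and a little-endian host, and is independent of llama.cpp.
#include <cstdint>
#include <cstdio>
#include <cstring>

int main(int argc, char ** argv) {
    const char * fname = argc > 1 ? argv[1] : "model.gguf";
    FILE * f = fopen(fname, "rb");
    if (!f) { perror("fopen"); return 1; }

    char     magic[4];            // must be "GGUF"
    uint32_t version   = 0;       // 3 for current files
    uint64_t n_tensors = 0;       // number of tensor info entries that follow
    uint64_t n_kv      = 0;       // number of key-value pairs that follow

    bool ok = fread(magic, 1, 4, f) == 4 &&
              fread(&version,   sizeof(version),   1, f) == 1 &&
              fread(&n_tensors, sizeof(n_tensors), 1, f) == 1 &&
              fread(&n_kv,      sizeof(n_kv),      1, f) == 1;
    fclose(f);

    if (!ok || memcmp(magic, "GGUF", 4) != 0) {
        fprintf(stderr, "%s is not a valid GGUF file\n", fname);
        return 1;
    }
    printf("version: %u, n_tensors: %llu, n_kv: %llu\n",
           version, (unsigned long long) n_tensors, (unsigned long long) n_kv);
    return 0;
}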
The GGUF file format is documented at https://github.com/ggerganov/ggml/blob/master/docs/gguf.md. ![[https://github.com/zhiyisun/zhiyisun.github.io/blob/master/_posts/resource/20240927141227.png]]
Model KV pairs: ![[https://github.com/zhiyisun/zhiyisun.github.io/blob/master/_posts/resource/20240927180949.png]] Model tensor metadata: ![[https://github.com/zhiyisun/zhiyisun.github.io/blob/master/_posts/resource/20240927182439.png]]
llm_load_tensors
buft_input, buft_layer, and buft_output are three buffer types. All of them are set to the CPU buffer type in this case (llama_default_buffer_type_cpu(true)).
struct llama_model {
...
layer_buft buft_input;
layer_buft buft_output;
std::vector<layer_buft> buft_layer;
...
};
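In a CPU-only build, llama_default_buffer_type_cpu(true) resolves to the ggml CPU backend buffer type. The small sketch below (assuming the ggml-backend.h header from the build tree) queries that buffer type directly; it is the type all three fields above end up holding in this case.
#include <cstdio>
#include "ggml-backend.h"

int main() {
    // the buffer type used for tensors kept in ordinary host memory
    ggml_backend_buffer_type_t buft = ggml_backend_cpu_buffer_type();

    printf("buffer type: %s\n",  ggml_backend_buft_name(buft));
    printf("alignment  : %zu\n", ggml_backend_buft_get_alignment(buft));
    return 0;
}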
Then the tok_embd, output, and per-layer tensors are created by the function ml.create_tensor().
struct llama_model {
...
struct ggml_tensor * tok_embd;
//output
struct ggml_tensor * output_norm;
struct ggml_tensor * output;
...
//32 Layers
std::vector<llama_layer> layers;
};
struct llama_layer {
// normalization
struct ggml_tensor * attn_norm;
struct ggml_tensor * attn_q_norm;
struct ggml_tensor * attn_q_norm_b;
struct ggml_tensor * attn_k_norm;
struct ggml_tensor * attn_k_norm_b;
struct ggml_tensor * attn_out_norm;
struct ggml_tensor * attn_out_norm_b;
struct ggml_tensor * attn_q_a_norm;
struct ggml_tensor * attn_kv_a_norm;
struct ggml_tensor * attn_sub_norm;
struct ggml_tensor * attn_post_norm;
struct ggml_tensor * ffn_sub_norm;
struct ggml_tensor * attn_norm_cross;
struct ggml_tensor * attn_norm_enc;
// attention
struct ggml_tensor * wq;
struct ggml_tensor * wk;
struct ggml_tensor * wv;
struct ggml_tensor * wo;
// attention bias
struct ggml_tensor * bq;
struct ggml_tensor * bk;
struct ggml_tensor * bv;
struct ggml_tensor * bo;
// relative position bias
struct ggml_tensor * attn_rel_b;
struct ggml_tensor * attn_rel_b_enc;
struct ggml_tensor * attn_rel_b_cross;
// normalization
struct ggml_tensor * ffn_norm;
// ff
struct ggml_tensor * ffn_gate; // w1
struct ggml_tensor * ffn_down; // w2
struct ggml_tensor * ffn_up; // w3
// ff bias
struct ggml_tensor * ffn_gate_b = nullptr;
struct ggml_tensor * ffn_down_b = nullptr; // b2
struct ggml_tensor * ffn_up_b = nullptr; // b3
// long rope factors
struct ggml_tensor * rope_freqs = nullptr;
// bitnet scale
struct ggml_tensor * wq_scale;
struct ggml_tensor * wk_scale;
struct ggml_tensor * wv_scale;
struct ggml_tensor * wo_scale;
struct ggml_tensor * ffn_gate_scale;
struct ggml_tensor * ffn_up_scale;
struct ggml_tensor * ffn_down_scale;
};
Function load_all_data() iterates over the model’s tensors and loads their data from files or memory-mapped files.
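Below is a heavily simplified sketch of those two data paths; the function and parameter names are invented for illustration, and the real load_all_data() additionally validates sizes, reports progress, and can use the mmapped pages in place instead of copying.
#include <cstdint>
#include <cstdio>
#include <cstring>

// copy `size` bytes of one tensor's data, located at byte offset `offs` in the
// model file, into the tensor's destination buffer `dst`
static bool load_tensor_data(void * dst, size_t offs, size_t size,
                             const uint8_t * mmap_base, FILE * f) {
    if (mmap_base) {
        // mmap path: the whole file is already mapped into memory,
        // so the tensor data can simply be copied (or used in place)
        memcpy(dst, mmap_base + offs, size);
        return true;
    }
    // plain file path: seek to the tensor's offset and read its bytes
    if (fseek(f, (long) offs, SEEK_SET) != 0) {
        return false;
    }
    return fread(dst, 1, size, f) == size;
}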
llama_build_graph
The compute graph structure is shown below.
struct ggml_cgraph {
int size;
int n_nodes;
int n_leafs;
struct ggml_tensor ** nodes;
struct ggml_tensor ** grads;
struct ggml_tensor ** leafs;
struct ggml_hash_set visited_hash_set;
enum ggml_cgraph_eval_order order;
};
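To see how nodes and leafs end up in such a graph, the standalone sketch below (independent of llama.cpp, using only the ggml API from the build tree) builds and evaluates a tiny graph computing c = a + b: a and b become leafs and the add operation becomes a node.
#include <cstdio>
#include "ggml.h"

int main() {
    // small arena that holds the tensors, the graph, and the work buffer
    struct ggml_init_params ip = {
        /*.mem_size   =*/ 16 * 1024 * 1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(ip);

    // leafs: two input tensors with data
    struct ggml_tensor * a = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 4);
    struct ggml_tensor * b = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 4);
    for (int i = 0; i < 4; ++i) {
        ggml_set_f32_1d(a, i, 1.0f + i);
        ggml_set_f32_1d(b, i, 10.0f);
    }

    // node: the addition; nothing is computed yet, only the operation is recorded
    struct ggml_tensor * c = ggml_add(ctx, a, b);

    // build the compute graph from the result tensor and evaluate it
    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, c);
    ggml_graph_compute_with_ctx(ctx, gf, /*n_threads =*/ 1);

    for (int i = 0; i < 4; ++i) {
        printf("c[%d] = %.1f\n", i, ggml_get_f32_1d(c, i));
    }

    ggml_free(ctx);
    return 0;
}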
How does llama.cpp do inference?
To Be Continued