llama.cpp Study Environment Setup

Before studying the code of llama.cpp, it is helpful to set up a test environment for experimenting with it.

Download llama.cpp

First of all, clone the latest llama.cpp from GitHub.

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

llama.cpp is an actively developed project. At the time of writing, Release b4053 · ggerganov/llama.cpp is the latest release. To study the implementation of llama.cpp, GDB is a convenient tool for stepping through the C/C++ code.

# Install necessary tools and libraries
apt install build-essential cmake gdb
# Build llama.cpp with debug flag enabled
cmake -B build -DCMAKE_BUILD_TYPE=Debug
cmake --build build -j

Download and convert LLM model file

Although llama.cpp is a C++ project, it uses a Python script to convert models downloaded from Hugging Face to GGUF. Downloading models from Hugging Face also requires huggingface-cli.

# create a virtual environment
conda create -n llama.cpp -y
conda activate llama.cpp
conda install pytorch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 cpuonly -c pytorch
conda install huggingface_hub
pip install -r requirements.txt

As the next step, download an LLM model for testing. The model-loading walkthrough later in this note is based on meta-llama/Llama-3.1-8B-Instruct; the commands below use google/gemma-3-1b-it as the download example.

huggingface-cli login
huggingface-cli download google/gemma-3-1b-it --exclude "original/*"  --local-dir models/gemma-3-1b-it

llama.cpp uses the GGUF file format to store the LLM, so the downloaded Hugging Face model needs to be converted to a GGUF file.

python3 ./convert_hf_to_gguf.py models/gemma-3-1b-it

Code viewer and GDB debug environment

Visual Studio Code is used for reading the code. To enable GDB debugging, refer to https://code.visualstudio.com/docs/editor/debugging. Below is an example launch.json.

{
	// Use IntelliSense to learn about possible attributes.
	// Hover to view descriptions of existing attributes.
	// For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
	"version": "0.2.0",
	"configurations": [
		{
			"name": "(gdb) Launch",
			"type": "cppdbg",
			"request": "launch",
			"program": "${workspaceFolder}/build/bin/llama-cli",
			"args": ["-m", "${workspaceFolder}/models/gemma-3-1b-it/gemma-3-1B-it-F16.gguf", "-p", "'I believe the meaning of life is'", "-n", "128", "-no-cnv", "--chat-template", "gemma"],
			"stopAtEntry": false,
			"cwd": "${workspaceFolder}",
			"environment": [],
			"externalConsole": false,
			"MIMode": "gdb",
			"setupCommands": [
				{
					"description": "Enable pretty-printing for gdb",
					"text": "-enable-pretty-printing",
					"ignoreFailures": true
				},
				{
					"description": "Set Disassembly Flavor to Intel",
					"text": "-gdb-set disassembly-flavor intel",
					"ignoreFailures": true
				}
			]
		}
	]
}

After launching the GDB debugger, we can step through llama.cpp. ![[https://github.com/zhiyisun/zhiyisun.github.io/blob/master/_posts/resource/20240927125033.png]]

How does llama.cpp load a model?

In the main function of examples/main/main.cpp, common_init_from_params() is called to load the LLM model file and initialize the model.

// load the model and apply lora adapter, if any
LOG_INF("%s: load the model and apply lora adapter, if any\n", __func__);
common_init_result llama_init = common_init_from_params(params);

There are four parameter structures defined and used in this function (a simplified sketch follows the list):

  1. common_params params: parsed from the command-line arguments.
  2. common_init_result iparams: the return value, which holds both the model and the context created in this function.
  3. struct llama_model_params mparams: the model-related parameters derived from params; used when loading the model from file.
  4. struct llama_context_params cparams: the context-related parameters derived from params.
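
To see how these four structures relate, below is a heavily simplified sketch of common_init_from_params() paraphrased from the description above. It omits the Hugging Face/URL download path, LoRA handling, and error checks, and the helper names common_model_params_to_llama() / common_context_params_to_llama() are what I believe common.cpp uses at this release; treat it as a sketch, not the exact upstream code.

// paraphrased sketch of common_init_from_params(), not the exact upstream code
common_init_result common_init_from_params(common_params & params) {
    common_init_result iparams;                                   // 2. the return value

    // 3. derive the model parameters from the parsed command-line parameters
    llama_model_params mparams = common_model_params_to_llama(params);
    llama_model * model = llama_load_model_from_file(params.model.c_str(), mparams);

    // 4. derive the context parameters and create a context on top of the model
    llama_context_params cparams = common_context_params_to_llama(params);
    llama_context * lctx = llama_new_context_with_model(model, cparams);

    // 2. bundle the loaded model and the new context into the return value
    iparams.model   = model;
    iparams.context = lctx;
    return iparams;
}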

llama_model_load

The call chain common_init_from_params() –> llama_load_model_from_file() –> llama_model_load() loads the model from a file. A model can also be loaded from a Hugging Face repository or a custom URL, but this note discusses the local-file path only.

This function does the following (a condensed sketch follows the list):

  1. Load the model file into a llama_model_loader ml
    llama_model_loader ml(fname, params.use_mmap, params.check_tensors, params.kv_overrides);
    
  2. Load the model architecture from the KV pairs read by the llama_model_loader ml above. In this case it is LLM_ARCH_LLAMA.
    llm_load_arch(ml, model);
    
  3. Load the hyperparameters of llama_model model
    llm_load_hparams(ml, model);
    
  4. Load the vocabulary. In this case it is LLAMA_VOCAB_TYPE_BPE / LLAMA_VOCAB_PRE_TYPE_LLAMA3 (a GPT-2-style tokenizer based on byte-level BPE)
    llm_load_vocab(ml, model);
    
  5. Create and load the weight tensors
    llm_load_tensors(ml, model, params.n_gpu_layers, params.split_mode, params.main_gpu, params.tensor_split, params.use_mlock, params.progress_callback, params.progress_callback_user_data))
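
Below is a condensed, paraphrased sketch of llama_model_load() showing the order of these five steps (the vocab-only path and some error handling are trimmed; not the exact upstream code):

static int llama_model_load(const std::string & fname, llama_model & model, llama_model_params & params) {
    try {
        // 1. open the GGUF file, read all KV pairs and tensor metadata
        llama_model_loader ml(fname, params.use_mmap, params.check_tensors, params.kv_overrides);

        llm_load_arch   (ml, model);   // 2. general.architecture -> LLM_ARCH_LLAMA
        llm_load_hparams(ml, model);   // 3. fill model.hparams from the KV pairs
        llm_load_vocab  (ml, model);   // 4. build the tokenizer

        // 5. create the ggml tensors and load their data; returns false when the
        //    progress callback cancels the load
        if (!llm_load_tensors(ml, model, params.n_gpu_layers, params.split_mode,
                              params.main_gpu, params.tensor_split, params.use_mlock,
                              params.progress_callback, params.progress_callback_user_data)) {
            return -2;
        }
    } catch (const std::exception & err) {
        LLAMA_LOG_ERROR("%s: error loading model: %s\n", __func__, err.what());
        return -1;
    }
    return 0;
}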
    

llama_model_loader

For loading a model, the central struct llama_model_loader is defined and implemented in llama.cpp. It gathers all the functions related to model loading: the llama_model_loader constructor, get_arch()/get_weight()/get_tensor_meta()/get_mapping_range(), create_tensor(), load_all_data(), and so on. An abridged sketch of its interface is below.
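
The declaration below is limited to the members mentioned above; the signatures are simplified and the real struct carries much more state (the gguf_context, memory mappings, per-tensor weight records, etc.), so treat it as a reading aid rather than the exact definition.

struct llama_model_loader {
    // opens the GGUF file and reads all KV pairs and tensor metadata
    llama_model_loader(const std::string & fname, bool use_mmap, bool check_tensors,
                       const llama_model_kv_override * param_overrides_p);

    // queries against the loaded metadata
    enum llm_arch               get_arch() const;
    const llama_tensor_weight * get_weight(const char * name) const;
    struct ggml_tensor *        get_tensor_meta(const char * name) const;
    void                        get_mapping_range(size_t * first, size_t * last, void ** addr,
                                                  int idx, ggml_context * ctx) const;

    // creates a ggml_tensor in ctx for a named weight and records it for later loading
    struct ggml_tensor * create_tensor(struct ggml_context * ctx, const std::string & name,
                                       const std::vector<int64_t> & ne, int flags = 0);

    // reads or memory-maps the data of every created tensor into its buffer
    bool load_all_data(struct ggml_context * ctx, llama_buf_map & bufs, llama_mlocks * lmlocks,
                       llama_progress_callback progress_callback, void * progress_callback_user_data);
};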

gguf_init_from_file

llama_model_loader() –> gguf_init_from_file() opens the GGUF model file. It does the following (a small standalone header-reading example follows the list):

  1. Check that the magic in the model file's first 4 bytes matches "GGUF".
  2. Allocate the key data structure struct gguf_context ctx. Everything read from the model file is written into this ctx.
  3. Read the header: version, n_tensors, n_kv. The Llama-3.1-8B model has 33 key-value pairs and 292 tensors.
  4. Read the key-value pairs.
  5. Read the tensor infos.
  6. Create tensors. In this step, ggml_init() allocates the mem_buffer, then ggml_new_tensor() is called to create a struct ggml_tensor for each tensor inside that mem_buffer.
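
As a quick sanity check of steps 1 and 3, the header fields can be read with a few lines of standalone C++. This is a minimal sketch based on the GGUF specification linked below, not llama.cpp code; GGUF v2+ stores the two counts as 64-bit integers.

#include <cstdint>
#include <cstdio>

int main(int argc, char ** argv) {
    if (argc < 2) { std::fprintf(stderr, "usage: %s model.gguf\n", argv[0]); return 1; }
    std::FILE * f = std::fopen(argv[1], "rb");
    if (!f) { std::perror("fopen"); return 1; }

    char     magic[4];          // must be "GGUF"
    uint32_t version   = 0;
    uint64_t n_tensors = 0;     // 292 for Llama-3.1-8B
    uint64_t n_kv      = 0;     // 33 for Llama-3.1-8B

    std::fread(magic,      1, sizeof(magic),    f);
    std::fread(&version,   sizeof(version),   1, f);
    std::fread(&n_tensors, sizeof(n_tensors), 1, f);
    std::fread(&n_kv,      sizeof(n_kv),      1, f);

    std::printf("magic=%.4s version=%u n_tensors=%llu n_kv=%llu\n",
                magic, (unsigned) version,
                (unsigned long long) n_tensors, (unsigned long long) n_kv);

    std::fclose(f);
    return 0;
}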

The overall file type (ftype) is determined by the tensor type that occurs most often among the tensors.

The GGUF format is documented at https://github.com/ggerganov/ggml/blob/master/docs/gguf.md ![[https://github.com/zhiyisun/zhiyisun.github.io/blob/master/_posts/resource/20240927141227.png]]

Model KV pairs: ![[https://github.com/zhiyisun/zhiyisun.github.io/blob/master/_posts/resource/20240927180949.png]] Model's metadata: ![[https://github.com/zhiyisun/zhiyisun.github.io/blob/master/_posts/resource/20240927182439.png]]

llm_load_tensors

buft_input, buft_layer, and buft_output are three buffer types. In this case all of them are set to the CPU buffer type (llama_default_buffer_type_cpu(true)).

struct llama_model {
	...
    layer_buft buft_input;
    layer_buft buft_output;
    std::vector<layer_buft> buft_layer;
    ...
}

Then the tok_embd, output, and per-layer tensors are created by ml.create_tensor(); a paraphrased sketch of these calls follows the two struct listings below.

struct llama_model {
	...
    struct ggml_tensor * tok_embd;
  
	//output
    struct ggml_tensor * output_norm;
    struct ggml_tensor * output;
    ...

	//32 Layers
	std::vector<llama_layer> layers; 
}


struct llama_layer {
    // normalization
    struct ggml_tensor * attn_norm;

    struct ggml_tensor * attn_q_norm;
    struct ggml_tensor * attn_q_norm_b;
    struct ggml_tensor * attn_k_norm;
    struct ggml_tensor * attn_k_norm_b;
    struct ggml_tensor * attn_out_norm;
    struct ggml_tensor * attn_out_norm_b;
    struct ggml_tensor * attn_q_a_norm;
    struct ggml_tensor * attn_kv_a_norm;
    struct ggml_tensor * attn_sub_norm;
    struct ggml_tensor * attn_post_norm;
    struct ggml_tensor * ffn_sub_norm;
    struct ggml_tensor * attn_norm_cross;
    struct ggml_tensor * attn_norm_enc;

    // attention
    struct ggml_tensor * wq;
    struct ggml_tensor * wk;
    struct ggml_tensor * wv;
    struct ggml_tensor * wo;

    // attention bias
    struct ggml_tensor * bq;
    struct ggml_tensor * bk;
    struct ggml_tensor * bv;
    struct ggml_tensor * bo;

    // relative position bias
    struct ggml_tensor * attn_rel_b;
    struct ggml_tensor * attn_rel_b_enc;
    struct ggml_tensor * attn_rel_b_cross;

    // normalization
    struct ggml_tensor * ffn_norm;

    // ff
    struct ggml_tensor * ffn_gate; // w1
    struct ggml_tensor * ffn_down; // w2
    struct ggml_tensor * ffn_up;   // w3

    // ff bias
    struct ggml_tensor * ffn_gate_b = nullptr;
    struct ggml_tensor * ffn_down_b = nullptr; // b2
    struct ggml_tensor * ffn_up_b   = nullptr; // b3

    // long rope factors
    struct ggml_tensor * rope_freqs = nullptr;

    // bitnet scale
    struct ggml_tensor * wq_scale;
    struct ggml_tensor * wk_scale;
    struct ggml_tensor * wv_scale;
    struct ggml_tensor * wo_scale;
    struct ggml_tensor * ffn_gate_scale;
    struct ggml_tensor * ffn_up_scale;
    struct ggml_tensor * ffn_down_scale;
};
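
Below is a paraphrased sketch of how llm_load_tensors() fills these fields for LLM_ARCH_LLAMA through ml.create_tensor(). The hyperparameter names (n_embd, n_vocab, n_embd_gqa, n_ff, n_layer) come from llm_load_hparams(), ctx_input/ctx_layer/ctx_split/ctx_output stand for the ggml contexts chosen from the buffer types above, and tn(...) maps a tensor enum to its GGUF name (e.g. "blk.0.attn_q.weight"); the real code handles many more optional tensors.

// paraphrased from the LLM_ARCH_LLAMA branch of llm_load_tensors()
model.tok_embd = ml.create_tensor(ctx_input, tn(LLM_TENSOR_TOKEN_EMBD, "weight"), {n_embd, n_vocab});

model.output_norm = ml.create_tensor(ctx_output, tn(LLM_TENSOR_OUTPUT_NORM, "weight"), {n_embd});
model.output      = ml.create_tensor(ctx_output, tn(LLM_TENSOR_OUTPUT,      "weight"), {n_embd, n_vocab});

for (int i = 0; i < n_layer; ++i) {   // 32 layers for Llama-3.1-8B
    auto & layer = model.layers[i];

    layer.attn_norm = ml.create_tensor(ctx_layer, tn(LLM_TENSOR_ATTN_NORM, "weight", i), {n_embd});
    layer.wq = ml.create_tensor(ctx_split, tn(LLM_TENSOR_ATTN_Q,   "weight", i), {n_embd, n_embd});
    layer.wk = ml.create_tensor(ctx_split, tn(LLM_TENSOR_ATTN_K,   "weight", i), {n_embd, n_embd_gqa});
    layer.wv = ml.create_tensor(ctx_split, tn(LLM_TENSOR_ATTN_V,   "weight", i), {n_embd, n_embd_gqa});
    layer.wo = ml.create_tensor(ctx_split, tn(LLM_TENSOR_ATTN_OUT, "weight", i), {n_embd, n_embd});

    layer.ffn_norm = ml.create_tensor(ctx_layer, tn(LLM_TENSOR_FFN_NORM, "weight", i), {n_embd});
    layer.ffn_gate = ml.create_tensor(ctx_split, tn(LLM_TENSOR_FFN_GATE, "weight", i), {n_embd, n_ff});
    layer.ffn_down = ml.create_tensor(ctx_split, tn(LLM_TENSOR_FFN_DOWN, "weight", i), {n_ff, n_embd});
    layer.ffn_up   = ml.create_tensor(ctx_split, tn(LLM_TENSOR_FFN_UP,   "weight", i), {n_embd, n_ff});
}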

Function load_all_data() iterates over the model’s tensors and loads their data from files or memory-mapped files.
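
A paraphrased sketch of the core loop of load_all_data() is below; use_mmap, mapping, file, and get_weight() are (simplified) members of llama_model_loader, and the real function additionally handles split files, backend buffers, tensor validation, and the progress callback.

// iterate over every tensor created by create_tensor() and fill in its data
for (struct ggml_tensor * cur = ggml_get_first_tensor(ctx); cur != NULL;
     cur = ggml_get_next_tensor(ctx, cur)) {
    const auto * weight = get_weight(ggml_get_name(cur));   // where this tensor lives in the file

    if (use_mmap) {
        // mmap path: the tensor's data pointer aliases the mapped file region
        cur->data = (uint8_t *) mapping->addr + weight->offs;
    } else {
        // read path: seek to the tensor's offset and copy its bytes into the buffer
        file->seek(weight->offs, SEEK_SET);
        file->read_raw(cur->data, ggml_nbytes(cur));
    }
}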

llama_build_graph

llama_build_graph() builds the compute graph that will later be evaluated for inference. The graph is represented by the structure below.

struct ggml_cgraph {
    int size;
    int n_nodes;
    int n_leafs;
  
    struct ggml_tensor ** nodes;
    struct ggml_tensor ** grads;
    struct ggml_tensor ** leafs;
  
    struct ggml_hash_set visited_hash_set;
  
    enum ggml_cgraph_eval_order order;
};
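
To make nodes and leafs concrete, here is a minimal standalone ggml example: the input tensors become leafs, the recorded operations become nodes, and nothing is computed until the graph is evaluated. It assumes the ggml headers from the llama.cpp tree (in newer trees ggml_graph_compute_with_ctx() is declared in ggml-cpu.h rather than ggml.h); llama_build_graph() constructs the full transformer graph in the same lazy way, one ggml op at a time.

#include "ggml.h"   // newer trees: also #include "ggml-cpu.h" for ggml_graph_compute_with_ctx()
#include <cstdio>

int main() {
    // small scratch context; llama.cpp sizes this based on the model
    struct ggml_init_params ip = { /*mem_size*/ 16*1024*1024, /*mem_buffer*/ NULL, /*no_alloc*/ false };
    struct ggml_context * ctx = ggml_init(ip);

    // leafs: input tensors allocated inside the context's mem_buffer
    struct ggml_tensor * a = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 4);
    struct ggml_tensor * b = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 4);
    for (int i = 0; i < 4; ++i) {
        ((float *) a->data)[i] = 2.0f;
        ((float *) b->data)[i] = 3.0f;
    }

    // nodes: operations are only recorded here, nothing is computed yet
    struct ggml_tensor * c = ggml_mul(ctx, a, b);
    struct ggml_tensor * d = ggml_add(ctx, c, a);

    // build the graph backwards from the result tensor, then evaluate it
    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, d);
    ggml_graph_compute_with_ctx(ctx, gf, /*n_threads=*/1);

    std::printf("d[0] = %f (expected 2*3 + 2 = 8)\n", ((float *) d->data)[0]);

    ggml_free(ctx);
    return 0;
}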

How does llama.cpp do inference?

To Be Continued