Ubuntu: llama.cpp

Platform:
Huananzhi X99-BD4
Intel Xeon E5-2660 v3
32 GB of RAM
Crucial MX500 1 TB

Running Ubuntu 22.04 LTS, because a Python version <= 3.11 is required.
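
To check which interpreter will actually be used (Ubuntu 22.04 ships Python 3.10, which satisfies the requirement):

python3 --version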

Requirements

apt install build-essential git python3-venv unzip

A Python version <= 3.11 is required, so if you are running Ubuntu 24, you need to add a repository to get Python 3.11, for example:

add-apt-repository ppa:deadsnakes/ppa
apt install python3.11 python3.11-venv
mkdir py3-venv py_temp
python3.11 -m venv py3-venv/
source py3-venv/bin/activate

Remember to replace python3 with python3.11 in the commands that follow, to force the right version.

In my case (Ubuntu 22.04), only the venv needs to be created:

python3 -m venv /srv/data/py3-venv
source /srv/data/py3-venv/bin/activate

Directory layout

Creating the directory tree:

mkdir /srv/data/py3-venv
mkdir /srv/data/py_temp

Installing llama.cpp

Point the temporary directory somewhere else when installing the Python requirements:

cd /srv/data
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make -j 18
TMPDIR=/srv/data/py_temp/ python3 -m pip install -r requirements.txt

Careful: here make uses 18 threads, which is tuned to this machine's CPU.
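
A more generic variant, if you build on another machine, is to let the shell pick the number of logical CPUs (standard coreutils, nothing llama.cpp-specific):

make -j "$(nproc)"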

To avoid filling up /tmp, which is quite small here, Python has to be told to use a dedicated directory for its temporary files (hence the TMPDIR variable above).
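
If you install more Python packages in this venv later, one option is to export the variable for the whole shell session instead of setting it per command, for example:

export TMPDIR=/srv/data/py_temp
python3 -m pip install -r requirements.txt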

At this point the installation takes about 5 GB, but it peaked at around 10 GB with the temporary files of the Python dependencies:

/dev/mapper/vg_system-lv_data      98G  5,2G   88G   6% /srv/data

Models

https://huggingface.co/TheBloke/Llama-2-13B-GGUF

Downloading a model:

pip3 install 'huggingface-hub>=0.17.1'
mkdir /srv/data/ai_data/llama-13B
root@x99:/srv/data/ai_data/llama-13B# huggingface-cli download TheBloke/Llama-2-13B-GGUF llama-2-13b.Q4_K_M.gguf --local-dir /srv/data/ai_data/llama-13B/
Downloading 'llama-2-13b.Q4_K_M.gguf' to '/srv/data/ai_data/llama-13B/.huggingface/download/llama-2-13b.Q4_K_M.gguf.e6c5f001cf1e9330bda4c2c9098cc1c363f1cb70634f7e047fddba6096969c59.incomplete'
llama-2-13b.Q4_K_M.gguf: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7.87G/7.87G [13:50<00:00, 9.47MB/s]
Download complete. Moving file to models/llama-13B/llama-2-13b.Q4_K_M.gguf
models/llama-13B/llama-2-13b.Q4_K_M.gguf
root@x99:/srv/data/ai_data# ls -Alrth llama-13B/
total 7,4G
drwxr-xr-x 3 root root 4,0K juin  23 14:39 .huggingface
-rw-r--r-- 1 root root 7,4G juin  23 14:53 llama-2-13b.Q4_K_M.gguf

Roughly 8 GB to download.

First test

Running the test:

./llama-cli -m /srv/data/ai_data/llama-13B/llama-2-13b.Q4_K_M.gguf -n 128

....................................................................................................
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000,0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =  3200,00 MiB
llama_new_context_with_model: KV self size  = 3200,00 MiB, K (f16): 1600,00 MiB, V (f16): 1600,00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0,12 MiB
llama_new_context_with_model:        CPU compute buffer size =   368,01 MiB
llama_new_context_with_model: graph nodes  = 1286
llama_new_context_with_model: graph splits = 1

system_info: n_threads = 10 / 20 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
sampling: 
        repeat_last_n = 64, repeat_penalty = 1,000, frequency_penalty = 0,000, presence_penalty = 0,000
        top_k = 40, tfs_z = 1,000, top_p = 0,950, min_p = 0,050, typical_p = 1,000, temp = 0,800
        mirostat = 0, mirostat_lr = 0,100, mirostat_ent = 5,000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 4096, n_batch = 2048, n_predict = 128, n_keep = 1


 the Boss.

"Oh, I'm sorry!" he said. "I didn't mean to tread on your toe. I hope I didn't hurt you much."

"No, thank you," she answered with a smile. "I'm afraid you have hurt my temper a little more than you have my toe."

"Well, I hope you'll forgive me, won't you?" he went on. "I was just going to say I'm not a very good skater, and if you'll let me I'll try to help you to
llama_print_timings:        load time =    2458,50 ms
llama_print_timings:      sample time =       5,12 ms /   128 runs   (    0,04 ms per token, 24975,61 tokens per second)
llama_print_timings: prompt eval time =       0,00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   54993,26 ms /   128 runs   (  429,63 ms per token,     2,33 tokens per second)
llama_print_timings:       total time =   55042,17 ms /   128 tokens
Log end

If no prompt is given, it automatically generates an exchange on its own and then stops.
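
To keep control of the input instead of letting it free-run, llama-cli can be started in interactive mode (here using the -i/--interactive flag; check ./llama-cli --help to confirm the options available in your build):

./llama-cli -m /srv/data/ai_data/llama-13B/llama-2-13b.Q4_K_M.gguf -i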

With a bigger model:

huggingface-cli download TheBloke/Llama-2-13B-GGUF llama-2-13b.Q6_K.gguf --local-dir /srv/data/ai_data/llama-13B/

This one takes about 10 GB of storage.

./llama-cli -m /srv/data/ai_data/llama-13B/llama-2-13b.Q6_K.gguf -n 128

...................................................................................................
llama_new_context_with_model: n_ctx      = 4096 
llama_new_context_with_model: n_batch    = 2048 
llama_new_context_with_model: n_ubatch   = 512                                                 
llama_new_context_with_model: flash_attn = 0                                                   
llama_new_context_with_model: freq_base  = 10000,0                                                                                                                                            
llama_new_context_with_model: freq_scale = 1                                                                                                                                                  
llama_kv_cache_init:        CPU KV buffer size =  3200,00 MiB                                                                                                                                 
llama_new_context_with_model: KV self size  = 3200,00 MiB, K (f16): 1600,00 MiB, V (f16): 1600,00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0,12 MiB
llama_new_context_with_model:        CPU compute buffer size =   368,01 MiB                                                                                                                   
llama_new_context_with_model: graph nodes  = 1286
llama_new_context_with_model: graph splits = 1

system_info: n_threads = 10 / 20 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
sampling: 
        repeat_last_n = 64, repeat_penalty = 1,000, frequency_penalty = 0,000, presence_penalty = 0,000
        top_k = 40, tfs_z = 1,000, top_p = 0,950, min_p = 0,050, typical_p = 1,000, temp = 0,800
        mirostat = 0, mirostat_lr = 0,100, mirostat_ent = 5,000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 4096, n_batch = 2048, n_predict = 128, n_keep = 1


 #include "libc.h"

#define BUFFER_SIZE 10

int main(void)
{
        char *buffer;
        char *next;

        buffer = malloc(BUFFER_SIZE);
        if (!buffer)
                return EXIT_FAILURE;

        next = buffer;
        while (next != buffer + BUFFER_SIZE) {
                *next++ = 'a';
        }

        printf("[%p] -> [%p]\n", buffer, next - 1);
llama_print_timings:        load time =    2655,80 ms
llama_print_timings:      sample time =       5,04 ms /   128 runs   (    0,04 ms per token, 25422,05 tokens per second)
llama_print_timings: prompt eval time =       0,00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   65534,62 ms /   128 runs   (  511,99 ms per token,     1,95 tokens per second)
llama_print_timings:       total time =   65584,95 ms /   128 tokens
Log end

Asking a question

You can also ask it a question:

./llama-cli -m /srv/data/ai_data/llama-13B/llama-2-13b.Q6_K.gguf -n 128 -p "Who is Rogal Dorn?"

....................................................................................................
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000,0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =  3200,00 MiB
llama_new_context_with_model: KV self size  = 3200,00 MiB, K (f16): 1600,00 MiB, V (f16): 1600,00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0,12 MiB
llama_new_context_with_model:        CPU compute buffer size =   368,01 MiB
llama_new_context_with_model: graph nodes  = 1286
llama_new_context_with_model: graph splits = 1

system_info: n_threads = 10 / 20 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
sampling: 
        repeat_last_n = 64, repeat_penalty = 1,000, frequency_penalty = 0,000, presence_penalty = 0,000
        top_k = 40, tfs_z = 1,000, top_p = 0,950, min_p = 0,050, typical_p = 1,000, temp = 0,800
        mirostat = 0, mirostat_lr = 0,100, mirostat_ent = 5,000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 4096, n_batch = 2048, n_predict = 128, n_keep = 1


 Who is Rogal Dorn?
Games Workshop has announced a new book set in the Warhammer 40,000 universe titled Rogue Trader: The Book of Judas Gospel. The book will be a sequel to the previously announced book, The Book of Judas, and will be written by the same author, Aaron Dembski-Bowden.
The Book of Judas Gospel will be the first book in a new trilogy set in the Warhammer 40,000 universe, and will tell the story of Rogal Dorn, the captain of the Rogue Tr
llama_print_timings:        load time =    2633,03 ms
llama_print_timings:      sample time =       5,02 ms /   128 runs   (    0,04 ms per token, 25498,01 tokens per second)
llama_print_timings: prompt eval time =     994,79 ms /     8 tokens (  124,35 ms per token,     8,04 tokens per second)
llama_print_timings:        eval time =   64975,89 ms /   127 runs   (  511,62 ms per token,     1,95 tokens per second)
llama_print_timings:       total time =   66021,06 ms /   135 tokens
Log end

We can see that the end of the answer is truncated.

Removing the limit on the number of tokens to predict:

./llama-cli -m /srv/data/ai_data/llama-13B/llama-2-13b.Q6_K.gguf -n -1 -p "Who is Rogal Dorn?"

....................................................................................................
llama_new_context_with_model: n_ctx      = 4096                                                                                                                                               
llama_new_context_with_model: n_batch    = 2048                                                                                                                                               
llama_new_context_with_model: n_ubatch   = 512                                                                                                                                                
llama_new_context_with_model: flash_attn = 0                                                                                                                                                  
llama_new_context_with_model: freq_base  = 10000,0                                                                                                                                            
llama_new_context_with_model: freq_scale = 1                                                                                                                                                  
llama_kv_cache_init:        CPU KV buffer size =  3200,00 MiB                                                                                                                                 
llama_new_context_with_model: KV self size  = 3200,00 MiB, K (f16): 1600,00 MiB, V (f16): 1600,00 MiB                                                                                         
llama_new_context_with_model:        CPU  output buffer size =     0,12 MiB                                                                                                                   
llama_new_context_with_model:        CPU compute buffer size =   368,01 MiB                                                                                                                   
llama_new_context_with_model: graph nodes  = 1286                                                                                                                                             
llama_new_context_with_model: graph splits = 1                                                                                                                                                

system_info: n_threads = 10 / 20 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
sampling: 
        repeat_last_n = 64, repeat_penalty = 1,000, frequency_penalty = 0,000, presence_penalty = 0,000
        top_k = 40, tfs_z = 1,000, top_p = 0,950, min_p = 0,050, typical_p = 1,000, temp = 0,800
        mirostat = 0, mirostat_lr = 0,100, mirostat_ent = 5,000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 1


 Who is Rogal Dorn?
Rogal Dorn is a Space Marine Captain of the Imperial Fists Chapter.
He is a great warrior and a man of honour and has fought in many battles for the Emperor of Mankind.
He is the father of Ferrus Manus, the Primarch of the Imperial Fists and the older brother of Sanguinius, the Primarch of the Blood Angels.
The Imperial Fists are one of the most respected and feared Chapters in the Imperium. They are known for their ferocity in battle and their unwavering loyalty to the Emperor.
Rogal Dorn is a true hero of the Imperium and a symbol of the power and glory of the Space Marines.
Would you like to learn more about Rogal Dorn? Click on the link below to read his biography. [end of text]

llama_print_timings:        load time =    2612,28 ms
llama_print_timings:      sample time =       7,23 ms /   180 runs   (    0,04 ms per token, 24906,60 tokens per second)
llama_print_timings: prompt eval time =    1015,59 ms /     8 tokens (  126,95 ms per token,     7,88 tokens per second)
llama_print_timings:        eval time =   91910,31 ms /   179 runs   (  513,47 ms per token,     1,95 tokens per second)
llama_print_timings:       total time =   92996,97 ms /   187 tokens
Log end

The option that changes here:

  • -n, --predict N number of tokens to predict (default: -1, -1 = infinity, -2 = until context filled)

This removes the token limit, but be careful: it depends on the model! Models are generally trained with a given context size (a certain number of tokens). Limiting the number of tokens can truncate the answer in CLI mode.
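
The same help text also documents -n -2, which generates until the context is full; combined with -c to set the context size, a variant (not benchmarked here) would be:

./llama-cli -m /srv/data/ai_data/llama-13B/llama-2-13b.Q6_K.gguf -c 4096 -n -2 -p "Who is Rogal Dorn?"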

For example, raising it to 4096 (2048 still truncates the text):

./llama-cli -m /srv/data/ai_data/llama-13B/llama-2-13b.Q6_K.gguf -n 4096 -p "Who is Rogal Dorn?"

....................................................................................................
llama_new_context_with_model: n_ctx      = 4096                                                                                                                                               
llama_new_context_with_model: n_batch    = 2048                                                                                                                                               
llama_new_context_with_model: n_ubatch   = 512                                                 
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000,0                                                                                                                                            
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =  3200,00 MiB
llama_new_context_with_model: KV self size  = 3200,00 MiB, K (f16): 1600,00 MiB, V (f16): 1600,00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0,12 MiB
llama_new_context_with_model:        CPU compute buffer size =   368,01 MiB
llama_new_context_with_model: graph nodes  = 1286
llama_new_context_with_model: graph splits = 1

system_info: n_threads = 10 / 20 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
sampling: 
        repeat_last_n = 64, repeat_penalty = 1,000, frequency_penalty = 0,000, presence_penalty = 0,000
        top_k = 40, tfs_z = 1,000, top_p = 0,950, min_p = 0,050, typical_p = 1,000, temp = 0,800
        mirostat = 0, mirostat_lr = 0,100, mirostat_ent = 5,000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 4096, n_batch = 2048, n_predict = 4096, n_keep = 1


 Who is Rogal Dorn?
Rogal Dorn is a fictional character from Warhammer 40,000. He is the Primarch of the Imperial Fists Space Marine Legion.
The Imperial Fists are one of the first Space Marine Legions formed by the Emperor of Mankind. They are a loyalist legion, meaning they fight on the side of the Emperor and the Imperium. The
y are known for their righteousness and unwavering faith in the Emperor.
Dorn is a legendary figure within the Imperial Fists, he is considered to be the greatest of all the Primarchs and is known as the “Emperor’s Champion”. He is a skilled warrior and strategis
t and is known for his courage in battle. He is also known for his loyalty to the Emperor and his determination to protect Mankind from all threats.
Dorn is the leader of the Imperial Fists and is responsible for their success in battle. He is a powerful force on the battlefield and is respected by his fellow Space Marines. He is also re
sponsible for the training of the Imperial Fists and is known for his strict discipline.
Rogal Dorn is a powerful and respected figure within the Imperial Fists. He is a great leader and is revered by his fellow Space Marines. He is a legendary figure within Warhammer 40,000 and
 is the perfect example of the loyalty and strength of the Imperial Fists.
rogal dorn primarch
Previous Post:What is the Average Age of a Space Marine?
Next Post:Where is the Space Marine Legion? [end of text]

llama_print_timings:        load time =    2614,60 ms
llama_print_timings:      sample time =      14,03 ms /   351 runs   (    0,04 ms per token, 25016,04 tokens per second)
llama_print_timings: prompt eval time =     987,96 ms /     8 tokens (  123,50 ms per token,     8,10 tokens per second)
llama_print_timings:        eval time =  181247,20 ms /   350 runs   (  517,85 ms per token,     1,93 tokens per second)
llama_print_timings:       total time =  182370,59 ms /   358 tokens
Log end

Perplexity test

Downloading the test dataset:

cd /srv/data/ai_data
wget https://huggingface.co/datasets/ggml-org/ci/resolve/main/wikitext-2-raw-v1.zip
unzip wikitext-2-raw-v1.zip

Warning: this test can take a very long time!

./llama-perplexity -t 18 -m /srv/data/ai_data/llama-13B/llama-2-13b.Q6_K.gguf -f /srv/data/ai_data/wikitext-2-raw/wiki.test.raw

perplexity: tokenizing the input ..
perplexity: tokenization took 1065.02 ms
perplexity: calculating perplexity over 655 chunks, n_ctx=512, batch_size=2048, n_seq=4
perplexity: 225.10 seconds per pass - ETA 10 hours 14.33 minutes
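
Given that ETA, it can be worth doing a quick sanity check on a few chunks before committing to the full run; the perplexity tool normally accepts a --chunks limit (an assumption about this build, check ./llama-perplexity --help):

./llama-perplexity -t 18 --chunks 20 -m /srv/data/ai_data/llama-13B/llama-2-13b.Q6_K.gguf -f /srv/data/ai_data/wikitext-2-raw/wiki.test.raw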

By default, the number of threads used tops out at 10; since the CPU has 20, I raised it, which brought the ETA from 11+ hours down to about 10 h 15 (a massive gain...).
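
To see what the CPU actually exposes before tuning -t (plain system tools, nothing specific to llama.cpp):

lscpu | grep -E 'Model name|^CPU\(s\)|Thread\(s\) per core|Core\(s\) per socket'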

Testing with a larger batch size and number of tokens:

./llama-perplexity -t 18 -n 512 -b 8192 -m /srv/data/ai_data/llama-13B/llama-2-13b.Q6_K.gguf -f /srv/data/ai_data/wikitext-2-raw/wiki.test.raw

perplexity: tokenizing the input ..
perplexity: tokenization took 1130.27 ms
perplexity: calculating perplexity over 655 chunks, n_ctx=512, batch_size=8192, n_seq=16
perplexity: 1038.73 seconds per pass - ETA 11 hours 48.72 minutes

More RAM is used, but the overall time is worse, so back to the default parameters.

So we rerun with the default parameters (apart from the thread count):

./llama-perplexity -t 18 -m /srv/data/ai_data/llama-13B/llama-2-13b.Q6_K.gguf -f /srv/data/ai_data/wikitext-2-raw/wiki.test.raw
perplexity: tokenizing the input ..
perplexity: tokenization took 1061.55 ms
perplexity: calculating perplexity over 655 chunks, n_ctx=512, batch_size=2048, n_seq=4                                                                                                       
perplexity: 223.75 seconds per pass - ETA 10 hours 10.63 minutes                                                                                                                              
[1]3.8946,[2]4.2280,[3]4.9494,[4]5.2314,[5]5.4215,[6]5.3746,[7]5.5474,[8]5.6066,[9]5.8758,[10]6.0637,[11]6.2970,[12]6.3962,[13]6.3284,[14]6.3995,[15]6.5792,[16]6.2696,[17]6.1935,[18]6.1725,[19]5.8923,[20]5.8973,[21]5.8120,[22]5.6287,[23]5.5831,[24]5.5010,[25]5.4746,[26]5.3299,[27]5.1525,[28]5.0569,[29]4.9740,[30]4.8310,[31]4.7755,[32]4.7881,[33]4.7399,[34]4.7685,[35]4.7869,[36]4.8169,[37]4.8065,[38]4.8103,[39]4.8357,[40]4.8770,[41]4.9048,[42]4.9394,[43]4.9108,[44]4.9548,[45]4.9710,[46]4.9422,[47]4.9620,[48]4.9472,[49]4.9438,[50]4.9119,[51]4.9206,[52]4.9158,[53]4.9585,....
[643]5.1228,[644]5.1232,[645]5.1225,[646]5.1260,[647]5.1190,[648]5.1193,[649]5.1198,[650]5.1221,[651]5.1256,[652]5.1261,[653]5.1293,[654]5.1243,[655]5.1237,
Final estimate: PPL = 5.1237 +/- 0.02741

llama_print_timings:        load time =    2082.33 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time = 36502134.09 ms / 335360 tokens (  108.84 ms per token,     9.19 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time = 36528817.76 ms / 335361 tokens

Testing with a smaller model:

./llama-perplexity -t 18 -m /srv/data/ai_data/llama-7b/ggml-model-q4_0.gguf -f /srv/data/ai_data/wikitext-2-raw/wiki.test.raw
perplexity: tokenizing the input ..                                       
perplexity: tokenization took 1062.84 ms                                 
perplexity: calculating perplexity over 655 chunks, n_ctx=512, batch_size=2048, n_seq=4                                                                                                       
perplexity: 18.94 seconds per pass - ETA 51.70 minutes                                                                                                                                        
[1]8.1817,[2]9.3140,[3]10.3548,[4]12.5301,[5]12.8669,[6]12.5155,[7]12.9187,[8]12.8957,[9]13.5342,[10]14.1225,[11]14.4933,[12]14.5855,[13]14.5143,[14]15.4664,[15]16.2061,[16]15.3574,[17]14.8600,[18]14.8122,[19]13.9730,[20]13.8956,[21]13.7609,[22]13.5870,[23]13.5537,[24]13.3803,[25]13.4073,[26]13.0164,[27]13.8102,[28]13.6587,[29]13.4204,[30]13.7951,[31]13.7201,[32]13.7531,[33]13.6152,[34]13.7482,[35]13.7886,[36]13.7833,[37]13.7819,[38]14.2423,[39]14.2797,[40]14.6856,[41]14.7064,[42]14.7809,[43]14.5226,[44]14.5142,[45]14.4658,[46]14.3645,[47]14.3359,[48]14.1911,[49]14.1834,...
[652]13.5886,[653]13.5991,[654]13.5887,[655]13.5858,
Final estimate: PPL = 13.5858 +/- 0.10011

llama_print_timings:        load time =     125.60 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time = 3062063.58 ms / 335360 tokens (    9.13 ms per token,   109.52 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time = 3067122.82 ms / 335361 tokens

It is much faster.

Server mode

Default parameters, but with -t 18:

./llama-server -t 18 -m /srv/data/ai_data/llama-13b/llama-2-13b.Q6_K.gguf -c 2048
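
Before sending real requests, the server can be probed quickly; recent builds expose a /health endpoint (if yours does not, the /completion request below works just as well as a smoke test):

curl -s http://localhost:8080/health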

Example request:

curl -s --request POST --url http://localhost:8080/completion --header "Content-Type: application/json" --data '{"prompt": "Building a website can be done in 10 simple steps:","n_predict": 128}' | jq
{
  "content": "\n1. Find a domain name for your website.\n2. Choose a hosting provider and get your website set up.\n3. Install WordPress and add the plugins you need.\n4. Create a design for your website that reflects your brand.\n5. Write content for your website that is relevant, interesting, and engaging.\n6. Create a navigation structure for your website that is easy to use.\n7. Add images and videos to your website to make it more visually appealing.\n8. Optimize your website for search engines so that people can find it easily.\n9",
  "id_slot": 0,
  "stop": true,
  "model": "/srv/data/ai_data/llama-13b/llama-2-13b.Q6_K.gguf",
  "tokens_predicted": 128,
  "tokens_evaluated": 14,
  "generation_settings": {
    "n_ctx": 2048,
    "n_predict": -1,
    "model": "/srv/data/ai_data/llama-13b/llama-2-13b.Q6_K.gguf",
    "seed": 4294967295,
    "temperature": 0.800000011920929,
    "dynatemp_range": 0.0,
    "dynatemp_exponent": 1.0,
    "top_k": 40,
    "top_p": 0.949999988079071,
    "min_p": 0.05000000074505806,
    "tfs_z": 1.0,
    "typical_p": 1.0,
    "repeat_last_n": 64,
    "repeat_penalty": 1.0,
    "presence_penalty": 0.0,
    "frequency_penalty": 0.0,
    "penalty_prompt_tokens": [],
    "use_penalty_prompt_tokens": false,
    "mirostat": 0,
    "mirostat_tau": 5.0,
    "mirostat_eta": 0.10000000149011612,
    "penalize_nl": false,
    "stop": [],
    "n_keep": 0,
    "n_discard": 0,
    "ignore_eos": false,
    "stream": false,
    "logit_bias": [],
    "n_probs": 0,
    "min_keep": 0,
    "grammar": "",
    "samplers": [
      "top_k",
      "tfs_z",
      "typical_p",
      "top_p",
      "min_p",
      "temperature"
    ]
  },
  "prompt": "Building a website can be done in 10 simple steps:",
  "truncated": false,
  "stopped_eos": false,
  "stopped_word": false,
  "stopped_limit": true,
  "stopping_word": "",
  "tokens_cached": 141,
  "timings": {
    "prompt_n": 14,
    "prompt_ms": 1501.448,
    "prompt_per_token_ms": 107.24628571428572,
    "prompt_per_second": 9.32433224460654,
    "predicted_n": 128,
    "predicted_ms": 49247.931,
    "predicted_per_token_ms": 384.7494609375,
    "predicted_per_second": 2.599093959906661
  }
}
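
Since jq is already in the pipeline, the generated text alone can be extracted by keeping only the content field (same request as above):

curl -s --request POST --url http://localhost:8080/completion --header "Content-Type: application/json" --data '{"prompt": "Building a website can be done in 10 simple steps:","n_predict": 128}' | jq -r .content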

In this mode the server only listens locally; its listening IP and port can be changed:

./llama-server --host 192.168.0.149 --port 80 -t 18 -m /srv/data/ai_data/llama-13b/llama-2-13b.Q6_K.gguf -c 4096
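
From another machine on the LAN, the same request then targets the new address (IP and port taken from the command above):

curl -s --request POST --url http://192.168.0.149:80/completion --header "Content-Type: application/json" --data '{"prompt": "Building a website can be done in 10 simple steps:","n_predict": 64}' | jq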

Changing the CPU

Does changing the CPU have a real impact? To find out, I installed an E5-2698 v3 (16 cores / 32 threads); here is the result:

./llama-perplexity -t 30 -m /srv/data/ai_data/llama-7B/llama-2-7b.Q6_K.gguf -f /srv/data/ai_data/wikitext-2-raw/wiki.test.raw

system_info: n_threads = 30 / 32 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
perplexity: tokenizing the input ..                                                                                                                                                           
perplexity: tokenization took 929.871 ms                                                                                                                                                      
perplexity: calculating perplexity over 655 chunks, n_ctx=512, batch_size=2048, n_seq=4                                                                                                       
perplexity: 82.51 seconds per pass - ETA 3 hours 45.18 minutes 

llama_print_timings:        load time =   14628.29 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time = 13410435.73 ms / 335360 tokens (   39.99 ms per token,    25.01 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time = 13428888.06 ms / 335361 tokens

A nice reduction, since it goes from 5 h 10 down to 3 h 45!

Bench

root@x99:/srv/data/llama.cpp# ./llama-bench -m /srv/data/ai_data/llama-13B/llama-2-13b.Q6_K.gguf 
| model                          |       size |     params | backend    | threads |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | ---------------: |
| llama 13B Q6_K                 |   9.95 GiB |    13.02 B | CPU        |      16 |         pp512 |     12.72 ± 0.02 |
| llama 13B Q6_K                 |   9.95 GiB |    13.02 B | CPU        |      16 |         tg128 |      2.73 ± 0.00 |

build: c8ad3595 (3224)

With more threads:

root@x99:/srv/data/llama.cpp# ./llama-bench -t 30 -m /srv/data/ai_data/llama-13B/llama-2-13b.Q6_K.gguf 
| model                          |       size |     params | backend    | threads |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | ---------------: |
| llama 13B Q6_K                 |   9.95 GiB |    13.02 B | CPU        |      30 |         pp512 |     14.11 ± 0.01 |
| llama 13B Q6_K                 |   9.95 GiB |    13.02 B | CPU        |      30 |         tg128 |      2.71 ± 0.00 |

build: c8ad3595 (3224)