Ubuntu: llama.cpp
Platform:
Huananzhi X99-BD4 motherboard
Xeon E5-2660 v3 (10 cores / 20 threads)
32 GB of RAM
Crucial MX500 1 TB SSD
Running Ubuntu 22.04 LTS, since a Python version <= 3.11 is required.
Requirements
apt install build-essential git python3-venv unzip
A Python version <= 3.11 is required, so if you are on Ubuntu 24.04 you have to add a repository to get Python 3.11, for example:
add-apt-repository ppa:deadsnakes/ppa
apt install python3.11 python3.11-venv
mkdir py3-venv py_temp
python3.11 -m venv py3-venv/
source py3-venv/bin/activate
Remember to replace python3 with python3.11 in the commands that follow, to force the right version.
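Once that venv is activated, the interpreter it is pinned to can be checked before installing anything (quick sanity check; inside the venv, python normally resolves to the interpreter used to create it):
python --version                 # should print Python 3.11.x
python -m pip install --upgrade pip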
In my case, simply creating the venv is enough:
python3 -m venv /srv/data/py3-venv
source /srv/data/py3-venv/bin/activate
Directory layout
Creating the directories:
mkdir /srv/data/py3-venv
mkdir /srv/data/py_temp
Installing llama.cpp
Change the temporary directory when installing the requirements:
cd /srv/data
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make -j 18
TMPDIR=/srv/data/py_temp/ python3 -m pip install -r requirements.txt
Careful! Here make uses 18 jobs, which is tuned to this machine's CPU.
To avoid filling up /tmp, which is fairly small, Python has to be told to use a dedicated directory for its temporary files.
At this stage the installation takes 5 GB, but it peaked at 10 GB with the temporary files of the Python dependencies:
/dev/mapper/vg_system-lv_data 98G 5,2G 88G 6% /srv/data
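If you prefer not to prefix every pip call, the same workaround can be applied to the whole shell session by exporting the variable (sketch, same effect as the one-shot TMPDIR above):
export TMPDIR=/srv/data/py_temp
python3 -m pip install -r requirements.txt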
Models
https://huggingface.co/TheBloke/Llama-2-13B-GGUF
Downloading a model:
pip3 install 'huggingface-hub>=0.17.1'
mkdir /srv/data/ai_data/llama-13B
root@x99:/srv/data/ai_data/llama-13B# huggingface-cli download TheBloke/Llama-2-13B-GGUF llama-2-13b.Q4_K_M.gguf --local-dir /srv/data/ai_data/llama-13B/
Downloading 'llama-2-13b.Q4_K_M.gguf' to '/srv/data/ai_data/llama-13B/.huggingface/download/llama-2-13b.Q4_K_M.gguf.e6c5f001cf1e9330bda4c2c9098cc1c363f1cb70634f7e047fddba6096969c59.incomplete'
llama-2-13b.Q4_K_M.gguf: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7.87G/7.87G [13:50<00:00, 9.47MB/s]
Download complete. Moving file to models/llama-13B/llama-2-13b.Q4_K_M.gguf
models/llama-13B/llama-2-13b.Q4_K_M.gguf
root@x99:/srv/data/ai_data# ls -Alrth llama-13B/
total 7,4G
drwxr-xr-x 3 root root 4,0K juin 23 14:39 .huggingface
-rw-r--r-- 1 root root 7,4G juin 23 14:53 llama-2-13b.Q4_K_M.gguf
About 8 GB to download.
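Optionally, the file can be sanity-checked against the SHA-256 published on the model's Hugging Face page (the long hash visible in the .incomplete file name above is normally that same LFS checksum, though I have not verified it here):
sha256sum /srv/data/ai_data/llama-13B/llama-2-13b.Q4_K_M.gguf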
First test
Running the test:
./llama-cli -m /srv/data/ai_data/llama-13B/llama-2-13b.Q4_K_M.gguf -n 128
....................................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000,0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 3200,00 MiB
llama_new_context_with_model: KV self size = 3200,00 MiB, K (f16): 1600,00 MiB, V (f16): 1600,00 MiB
llama_new_context_with_model: CPU output buffer size = 0,12 MiB
llama_new_context_with_model: CPU compute buffer size = 368,01 MiB
llama_new_context_with_model: graph nodes = 1286
llama_new_context_with_model: graph splits = 1
system_info: n_threads = 10 / 20 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling:
repeat_last_n = 64, repeat_penalty = 1,000, frequency_penalty = 0,000, presence_penalty = 0,000
top_k = 40, tfs_z = 1,000, top_p = 0,950, min_p = 0,050, typical_p = 1,000, temp = 0,800
mirostat = 0, mirostat_lr = 0,100, mirostat_ent = 5,000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 4096, n_batch = 2048, n_predict = 128, n_keep = 1
the Boss.
"Oh, I'm sorry!" he said. "I didn't mean to tread on your toe. I hope I didn't hurt you much."
"No, thank you," she answered with a smile. "I'm afraid you have hurt my temper a little more than you have my toe."
"Well, I hope you'll forgive me, won't you?" he went on. "I was just going to say I'm not a very good skater, and if you'll let me I'll try to help you to
llama_print_timings: load time = 2458,50 ms
llama_print_timings: sample time = 5,12 ms / 128 runs ( 0,04 ms per token, 24975,61 tokens per second)
llama_print_timings: prompt eval time = 0,00 ms / 0 tokens ( -nan ms per token, -nan tokens per second)
llama_print_timings: eval time = 54993,26 ms / 128 runs ( 429,63 ms per token, 2,33 tokens per second)
llama_print_timings: total time = 55042,17 ms / 128 tokens
Log end
If no prompt is given, it automatically generates an exchange on its own and then stops.
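llama-cli can also be run interactively instead of producing a one-shot completion; a minimal sketch using its -i/--interactive and --color flags (not timed here):
./llama-cli -m /srv/data/ai_data/llama-13B/llama-2-13b.Q4_K_M.gguf -i --color -n -1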
With a bigger model:
huggingface-cli download TheBloke/Llama-2-13B-GGUF llama-2-13b.Q6_K.gguf --local-dir /srv/data/ai_data/llama-13B/
Which takes about 10 GB of storage.
./llama-cli -m /srv/data/ai_data/llama-13B/llama-2-13b.Q6_K.gguf -n 128
...................................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000,0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 3200,00 MiB
llama_new_context_with_model: KV self size = 3200,00 MiB, K (f16): 1600,00 MiB, V (f16): 1600,00 MiB
llama_new_context_with_model: CPU output buffer size = 0,12 MiB
llama_new_context_with_model: CPU compute buffer size = 368,01 MiB
llama_new_context_with_model: graph nodes = 1286
llama_new_context_with_model: graph splits = 1
system_info: n_threads = 10 / 20 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling:
repeat_last_n = 64, repeat_penalty = 1,000, frequency_penalty = 0,000, presence_penalty = 0,000
top_k = 40, tfs_z = 1,000, top_p = 0,950, min_p = 0,050, typical_p = 1,000, temp = 0,800
mirostat = 0, mirostat_lr = 0,100, mirostat_ent = 5,000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 4096, n_batch = 2048, n_predict = 128, n_keep = 1
#include "libc.h"
#define BUFFER_SIZE 10
int main(void)
{
char *buffer;
char *next;
buffer = malloc(BUFFER_SIZE);
if (!buffer)
return EXIT_FAILURE;
next = buffer;
while (next != buffer + BUFFER_SIZE) {
*next++ = 'a';
}
printf("[%p] -> [%p]\n", buffer, next - 1);
llama_print_timings: load time = 2655,80 ms
llama_print_timings: sample time = 5,04 ms / 128 runs ( 0,04 ms per token, 25422,05 tokens per second)
llama_print_timings: prompt eval time = 0,00 ms / 0 tokens ( -nan ms per token, -nan tokens per second)
llama_print_timings: eval time = 65534,62 ms / 128 runs ( 511,99 ms per token, 1,95 tokens per second)
llama_print_timings: total time = 65584,95 ms / 128 tokens
Log end
Test with a question
It is possible to ask it a question:
./llama-cli -m /srv/data/ai_data/llama-13B/llama-2-13b.Q6_K.gguf -n 128 -p "Who is Rogal Dorn?"
....................................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000,0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 3200,00 MiB
llama_new_context_with_model: KV self size = 3200,00 MiB, K (f16): 1600,00 MiB, V (f16): 1600,00 MiB
llama_new_context_with_model: CPU output buffer size = 0,12 MiB
llama_new_context_with_model: CPU compute buffer size = 368,01 MiB
llama_new_context_with_model: graph nodes = 1286
llama_new_context_with_model: graph splits = 1
system_info: n_threads = 10 / 20 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling:
repeat_last_n = 64, repeat_penalty = 1,000, frequency_penalty = 0,000, presence_penalty = 0,000
top_k = 40, tfs_z = 1,000, top_p = 0,950, min_p = 0,050, typical_p = 1,000, temp = 0,800
mirostat = 0, mirostat_lr = 0,100, mirostat_ent = 5,000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 4096, n_batch = 2048, n_predict = 128, n_keep = 1
Who is Rogal Dorn?
Games Workshop has announced a new book set in the Warhammer 40,000 universe titled Rogue Trader: The Book of Judas Gospel. The book will be a sequel to the previously announced book, The Book of Judas, and will be written by the same author, Aaron Dembski-Bowden.
The Book of Judas Gospel will be the first book in a new trilogy set in the Warhammer 40,000 universe, and will tell the story of Rogal Dorn, the captain of the Rogue Tr
llama_print_timings: load time = 2633,03 ms
llama_print_timings: sample time = 5,02 ms / 128 runs ( 0,04 ms per token, 25498,01 tokens per second)
llama_print_timings: prompt eval time = 994,79 ms / 8 tokens ( 124,35 ms per token, 8,04 tokens per second)
llama_print_timings: eval time = 64975,89 ms / 127 runs ( 511,62 ms per token, 1,95 tokens per second)
llama_print_timings: total time = 66021,06 ms / 135 tokens
Log end
You can see that the end of the answer is truncated.
Removing the limit on the number of tokens to predict:
./llama-cli -m /srv/data/ai_data/llama-13B/llama-2-13b.Q6_K.gguf -n -1 -p "Who is Rogal Dorn?"
....................................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000,0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 3200,00 MiB
llama_new_context_with_model: KV self size = 3200,00 MiB, K (f16): 1600,00 MiB, V (f16): 1600,00 MiB
llama_new_context_with_model: CPU output buffer size = 0,12 MiB
llama_new_context_with_model: CPU compute buffer size = 368,01 MiB
llama_new_context_with_model: graph nodes = 1286
llama_new_context_with_model: graph splits = 1
system_info: n_threads = 10 / 20 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling:
repeat_last_n = 64, repeat_penalty = 1,000, frequency_penalty = 0,000, presence_penalty = 0,000
top_k = 40, tfs_z = 1,000, top_p = 0,950, min_p = 0,050, typical_p = 1,000, temp = 0,800
mirostat = 0, mirostat_lr = 0,100, mirostat_ent = 5,000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 1
Who is Rogal Dorn?
Rogal Dorn is a Space Marine Captain of the Imperial Fists Chapter.
He is a great warrior and a man of honour and has fought in many battles for the Emperor of Mankind.
He is the father of Ferrus Manus, the Primarch of the Imperial Fists and the older brother of Sanguinius, the Primarch of the Blood Angels.
The Imperial Fists are one of the most respected and feared Chapters in the Imperium. They are known for their ferocity in battle and their unwavering loyalty to the Emperor.
Rogal Dorn is a true hero of the Imperium and a symbol of the power and glory of the Space Marines.
Would you like to learn more about Rogal Dorn? Click on the link below to read his biography. [end of text]
llama_print_timings: load time = 2612,28 ms
llama_print_timings: sample time = 7,23 ms / 180 runs ( 0,04 ms per token, 24906,60 tokens per second)
llama_print_timings: prompt eval time = 1015,59 ms / 8 tokens ( 126,95 ms per token, 7,88 tokens per second)
llama_print_timings: eval time = 91910,31 ms / 179 runs ( 513,47 ms per token, 1,95 tokens per second)
llama_print_timings: total time = 92996,97 ms / 187 tokens
Log end
The option that changes here:
- -n, --predict N number of tokens to predict (default: -1, -1 = infinity, -2 = until context filled)
This removes the limit on the number of tokens, but careful, it depends on the model! The models used are generally trained with a given number of tokens (context size), and limiting the number of tokens can truncate the answer in CLI mode.
For example, raising it to 4096 (2048 still truncates the text):
./llama-cli -m /srv/data/ai_data/llama-13B/llama-2-13b.Q6_K.gguf -n 4096 -p "Who is Rogal Dorn?"
....................................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000,0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 3200,00 MiB
llama_new_context_with_model: KV self size = 3200,00 MiB, K (f16): 1600,00 MiB, V (f16): 1600,00 MiB
llama_new_context_with_model: CPU output buffer size = 0,12 MiB
llama_new_context_with_model: CPU compute buffer size = 368,01 MiB
llama_new_context_with_model: graph nodes = 1286
llama_new_context_with_model: graph splits = 1
system_info: n_threads = 10 / 20 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling:
repeat_last_n = 64, repeat_penalty = 1,000, frequency_penalty = 0,000, presence_penalty = 0,000
top_k = 40, tfs_z = 1,000, top_p = 0,950, min_p = 0,050, typical_p = 1,000, temp = 0,800
mirostat = 0, mirostat_lr = 0,100, mirostat_ent = 5,000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 4096, n_batch = 2048, n_predict = 4096, n_keep = 1
Who is Rogal Dorn?
Rogal Dorn is a fictional character from Warhammer 40,000. He is the Primarch of the Imperial Fists Space Marine Legion.
The Imperial Fists are one of the first Space Marine Legions formed by the Emperor of Mankind. They are a loyalist legion, meaning they fight on the side of the Emperor and the Imperium. They are known for their righteousness and unwavering faith in the Emperor.
Dorn is a legendary figure within the Imperial Fists, he is considered to be the greatest of all the Primarchs and is known as the “Emperor’s Champion”. He is a skilled warrior and strategist and is known for his courage in battle. He is also known for his loyalty to the Emperor and his determination to protect Mankind from all threats.
Dorn is the leader of the Imperial Fists and is responsible for their success in battle. He is a powerful force on the battlefield and is respected by his fellow Space Marines. He is also responsible for the training of the Imperial Fists and is known for his strict discipline.
Rogal Dorn is a powerful and respected figure within the Imperial Fists. He is a great leader and is revered by his fellow Space Marines. He is a legendary figure within Warhammer 40,000 and is the perfect example of the loyalty and strength of the Imperial Fists.
rogal dorn primarch
Previous Post:What is the Average Age of a Space Marine?
Next Post:Where is the Space Marine Legion? [end of text]
llama_print_timings: load time = 2614,60 ms
llama_print_timings: sample time = 14,03 ms / 351 runs ( 0,04 ms per token, 25016,04 tokens per second)
llama_print_timings: prompt eval time = 987,96 ms / 8 tokens ( 123,50 ms per token, 8,10 tokens per second)
llama_print_timings: eval time = 181247,20 ms / 350 runs ( 517,85 ms per token, 1,93 tokens per second)
llama_print_timings: total time = 182370,59 ms / 358 tokens
Log end
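Note that these generations still run on the default 10 threads (see n_threads in system_info above). As with the perplexity runs below, the thread count can be forced with -t; a quick sketch, not benchmarked here:
./llama-cli -t 18 -m /srv/data/ai_data/llama-13B/llama-2-13b.Q6_K.gguf -n -1 -p "Who is Rogal Dorn?"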
Perplexity test
Fetching the wikitext-2 test data:
cd /srv/data/ai_data
wget https://huggingface.co/datasets/ggml-org/ci/resolve/main/wikitext-2-raw-v1.zip
unzip wikitext-2-raw-v1.zip
Warning: this test can take a very long time!
./llama-perplexity -t 18 -m /srv/data/ai_data/llama-13B/llama-2-13b.Q6_K.gguf -f /srv/data/ai_data/wikitext-2-raw/wiki.test.raw
perplexity: tokenizing the input ..
perplexity: tokenization took 1065.02 ms
perplexity: calculating perplexity over 655 chunks, n_ctx=512, batch_size=2048, n_seq=4
perplexity: 225.10 seconds per pass - ETA 10 hours 14.33 minutes
By default the number of threads used is capped at 10; since the CPU has 20, I raised it, which brought the ETA from over 11 h down to about 10 h 15 (huge gain...).
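For a quick estimate instead of the full ~10 h run, the number of evaluated chunks can be capped; a sketch assuming the --chunks option of llama-perplexity (the resulting PPL is then only an approximation of the full-corpus value):
./llama-perplexity -t 18 --chunks 50 -m /srv/data/ai_data/llama-13B/llama-2-13b.Q6_K.gguf -f /srv/data/ai_data/wikitext-2-raw/wiki.test.raw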
Test with a larger batch size and number of tokens:
./llama-perplexity -t 18 -n 512 -b 8192 -m /srv/data/ai_data/llama-13B/llama-2-13b.Q6_K.gguf -f /srv/data/ai_data/wikitext-2-raw/wiki.test.raw
perplexity: tokenizing the input ..
perplexity: tokenization took 1130.27 ms
perplexity: calculating perplexity over 655 chunks, n_ctx=512, batch_size=8192, n_seq=16
perplexity: 1038.73 seconds per pass - ETA 11 hours 48.72 minutes
More RAM is used but the final time is worse, so back to the default parameters.
So we rerun with the default parameters (apart from the thread count):
./llama-perplexity -t 18 -m /srv/data/ai_data/llama-13B/llama-2-13b.Q6_K.gguf -f /srv/data/ai_data/wikitext-2-raw/wiki.test.raw
perplexity: tokenizing the input .. perplexity: tokenization took 1061.55 ms
perplexity: calculating perplexity over 655 chunks, n_ctx=512, batch_size=2048, n_seq=4
perplexity: 223.75 seconds per pass - ETA 10 hours 10.63 minutes
[1]3.8946,[2]4.2280,[3]4.9494,[4]5.2314,[5]5.4215,[6]5.3746,[7]5.5474,[8]5.6066,[9]5.8758,[10]6.0637,[11]6.2970,[12]6.3962,[13]6.3284,[14]6.3995,[15]6.5792,[16]6.2696,[17]6.1935,[18]6.1725,[19]5.8923,[20]5.8973,[21]5.8120,[22]5.6287,[23]5.5831,[24]5.5010,[25]5.4746,[26]5.3299,[27]5.1525,[28]5.0569,[29]4.9740,[30]4.8310,[31]4.7755,[32]4.7881,[33]4.7399,[34]4.7685,[35]4.7869,[36]4.8169,[37]4.8065,[38]4.8103,[39]4.8357,[40]4.8770,[41]4.9048,[42]4.9394,[43]4.9108,[44]4.9548,[45]4.9710,[46]4.9422,[47]4.9620,[48]4.9472,[49]4.9438,[50]4.9119,[51]4.9206,[52]4.9158,[53]4.9585,....
]5.1228,[644]5.1232,[645]5.1225,[646]5.1260,[647]5.1190,[648]5.1193,[649]5.1198,[650]5.1221,[651]5.1256,[652]5.1261,[653]5.1293,[654]5.1243,[655]5.1237,
Final estimate: PPL = 5.1237 +/- 0.02741
llama_print_timings: load time = 2082.33 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 36502134.09 ms / 335360 tokens ( 108.84 ms per token, 9.19 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 36528817.76 ms / 335361 tokens
Testing with a smaller model:
./llama-perplexity -t 18 -m /srv/data/ai_data/llama-7b/ggml-model-q4_0.gguf -f /srv/data/ai_data/wikitext-2-raw/wiki.test.raw
perplexity: tokenizing the input ..
perplexity: tokenization took 1062.84 ms
perplexity: calculating perplexity over 655 chunks, n_ctx=512, batch_size=2048, n_seq=4
perplexity: 18.94 seconds per pass - ETA 51.70 minutes
[1]8.1817,[2]9.3140,[3]10.3548,[4]12.5301,[5]12.8669,[6]12.5155,[7]12.9187,[8]12.8957,[9]13.5342,[10]14.1225,[11]14.4933,[12]14.5855,[13]14.5143,[14]15.4664,[15]16.2061,[16]15.3574,[17]14.8600,[18]14.8122,[19]13.9730,[20]13.8956,[21]13.7609,[22]13.5870,[23]13.5537,[24]13.3803,[25]13.4073,[26]13.0164,[27]13.8102,[28]13.6587,[29]13.4204,[30]13.7951,[31]13.7201,[32]13.7531,[33]13.6152,[34]13.7482,[35]13.7886,[36]13.7833,[37]13.7819,[38]14.2423,[39]14.2797,[40]14.6856,[41]14.7064,[42]14.7809,[43]14.5226,[44]14.5142,[45]14.4658,[46]14.3645,[47]14.3359,[48]14.1911,[49]14.1834,...
.5886,[653]13.5991,[654]13.5887,[655]13.5858,
Final estimate: PPL = 13.5858 +/- 0.10011
llama_print_timings: load time = 125.60 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 3062063.58 ms / 335360 tokens ( 9.13 ms per token, 109.52 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 3067122.82 ms / 335361 tokens
It is much faster: roughly 3,067 s total versus 36,529 s for the 13B Q6_K model (about 12x), with prompt eval going from 9.19 to 109.52 tokens per second.
Server mode
Default parameters but with -t 18:
./llama-server -t 18 -m /srv/data/ai_data/llama-13b/llama-2-13b.Q6_K.gguf -c 2048
Example request:
curl -s --request POST --url http://localhost:8080/completion --header "Content-Type: application/json" --data '{"prompt": "Building a website can be done in 10 simple steps:","n_predict": 128}' | jq
{
"content": "\n1. Find a domain name for your website.\n2. Choose a hosting provider and get your website set up.\n3. Install WordPress and add the plugins you need.\n4. Create a design for your website that reflects your brand.\n5. Write content for your website that is relevant, interesting, and engaging.\n6. Create a navigation structure for your website that is easy to use.\n7. Add images and videos to your website to make it more visually appealing.\n8. Optimize your website for search engines so that people can find it easily.\n9",
"id_slot": 0,
"stop": true,
"model": "/srv/data/ai_data/llama-13b/llama-2-13b.Q6_K.gguf",
"tokens_predicted": 128,
"tokens_evaluated": 14,
"generation_settings": {
"n_ctx": 2048,
"n_predict": -1,
"model": "/srv/data/ai_data/llama-13b/llama-2-13b.Q6_K.gguf",
"seed": 4294967295,
"temperature": 0.800000011920929,
"dynatemp_range": 0.0,
"dynatemp_exponent": 1.0,
"top_k": 40,
"top_p": 0.949999988079071,
"min_p": 0.05000000074505806,
"tfs_z": 1.0,
"typical_p": 1.0,
"repeat_last_n": 64,
"repeat_penalty": 1.0,
"presence_penalty": 0.0,
"frequency_penalty": 0.0,
"penalty_prompt_tokens": [],
"use_penalty_prompt_tokens": false,
"mirostat": 0,
"mirostat_tau": 5.0,
"mirostat_eta": 0.10000000149011612,
"penalize_nl": false,
"stop": [],
"n_keep": 0,
"n_discard": 0,
"ignore_eos": false,
"stream": false,
"logit_bias": [],
"n_probs": 0,
"min_keep": 0,
"grammar": "",
"samplers": [
"top_k",
"tfs_z",
"typical_p",
"top_p",
"min_p",
"temperature"
]
},
"prompt": "Building a website can be done in 10 simple steps:",
"truncated": false,
"stopped_eos": false,
"stopped_word": false,
"stopped_limit": true,
"stopping_word": "",
"tokens_cached": 141,
"timings": {
"prompt_n": 14,
"prompt_ms": 1501.448,
"prompt_per_token_ms": 107.24628571428572,
"prompt_per_second": 9.32433224460654,
"predicted_n": 128,
"predicted_ms": 49247.931,
"predicted_per_token_ms": 384.7494609375,
"predicted_per_second": 2.599093959906661
}
}
In this mode the server only listens locally; its listen IP and port can be changed:
./llama-server --host 192.168.0.149 --port 80 -t 18 -m /srv/data/ai_data/llama-13b/llama-2-13b.Q6_K.gguf -c 4096
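With the server bound to that address, the same /completion request as above simply targets the new IP and port (sketch, assuming the host is reachable from the client machine):
curl -s --request POST --url http://192.168.0.149:80/completion --header "Content-Type: application/json" --data '{"prompt": "Building a website can be done in 10 simple steps:","n_predict": 128}' | jq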
CPU swap
Does changing the CPU have a real impact? To test it, I installed an E5-2698 v3, i.e. 16 cores / 32 threads, and here is the result:
./llama-perplexity -t 30 -m /srv/data/ai_data/llama-7B/llama-2-7b.Q6_K.gguf -f /srv/data/ai_data/wikitext-2-raw/wiki.test.raw
system_info: n_threads = 30 / 32 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 929.871 ms
perplexity: calculating perplexity over 655 chunks, n_ctx=512, batch_size=2048, n_seq=4
perplexity: 82.51 seconds per pass - ETA 3 hours 45.18 minutes
llama_print_timings: load time = 14628.29 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 13410435.73 ms / 335360 tokens ( 39.99 ms per token, 25.01 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 13428888.06 ms / 335361 tokens
A nice reduction, since it drops from 5 h 10 to 3 h 45!
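To pick a sensible -t value on a new CPU, the topology can be checked before launching anything (plain Linux tools, nothing specific to llama.cpp):
nproc                                   # number of logical CPUs
lscpu | grep -E 'Thread|Core|Socket'    # cores, threads per core, sockets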
Bench
root@x99:/srv/data/llama.cpp# ./llama-bench -m /srv/data/ai_data/llama-13B/llama-2-13b.Q6_K.gguf
| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | ---------------: |
| llama 13B Q6_K | 9.95 GiB | 13.02 B | CPU | 16 | pp512 | 12.72 ± 0.02 |
| llama 13B Q6_K | 9.95 GiB | 13.02 B | CPU | 16 | tg128 | 2.73 ± 0.00 |
build: c8ad3595 (3224)
With more threads:
root@x99:/srv/data/llama.cpp# ./llama-bench -t 30 -m /srv/data/ai_data/llama-13B/llama-2-13b.Q6_K.gguf
| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | ---------------: |
| llama 13B Q6_K | 9.95 GiB | 13.02 B | CPU | 30 | pp512 | 14.11 ± 0.01 |
| llama 13B Q6_K | 9.95 GiB | 13.02 B | CPU | 30 | tg128 | 2.71 ± 0.00 |
build: c8ad3595 (3224)
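llama-bench can also compare several thread counts in a single run, which is handy after a CPU swap like this one; a sketch assuming the comma-separated parameter syntax of llama-bench:
./llama-bench -t 16,20,30 -m /srv/data/ai_data/llama-13B/llama-2-13b.Q6_K.gguf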