Local LLM Setup Guide (v23)

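1. Overview and Prerequisites

Running a large language model (LLM) locally is a cost-effective and straightforward way to integrate AI features into your own tools. This guide gives practical steps for setting up and tuning a local LLM on a Linux system.

Prerequisites

Operating system: Ubuntu 20.04 or later, or Debian 11 or later
Hardware:
- GPU: NVIDIA RTX 30xx or newer (at least 8GB VRAM)
- CPU: at least 8 cores
- RAM: at least 32GB (64GB or more recommended)
- Storage: at least 100GB free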

System check

# Check the GPU
nvidia-smi

# Check RAM
free -h

# Check the CPU
lscpu
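
The same checks can be scripted. The following is a minimal sketch, assuming a Linux system with /proc/meminfo; the thresholds are the minimums from the prerequisites list, while the function names and the 10% RAM slack are illustrative assumptions, not part of the original guide. It does not check the GPU.

```python
import os
import shutil

# Minimums copied from the prerequisites list in this guide.
MIN_CORES = 8
MIN_RAM_GB = 32
MIN_DISK_GB = 100
# Assumption: MemTotal reads a little below installed RAM, so allow 10% slack.
RAM_SLACK = 0.9


def read_total_ram_gb(meminfo_path="/proc/meminfo"):
    """Return total RAM in GiB from the MemTotal line (reported in kB)."""
    with open(meminfo_path) as f:
        for line in f:
            if line.startswith("MemTotal:"):
                return int(line.split()[1]) / 1024**2
    raise RuntimeError("MemTotal not found in " + meminfo_path)


def check_minimums(cores, ram_gb, disk_gb):
    """Return a list of problems; an empty list means every minimum is met."""
    problems = []
    if cores < MIN_CORES:
        problems.append(f"CPU cores: {cores} < {MIN_CORES}")
    if ram_gb < MIN_RAM_GB * RAM_SLACK:
        problems.append(f"RAM: {ram_gb:.1f} GiB < {MIN_RAM_GB} GiB")
    if disk_gb < MIN_DISK_GB:
        problems.append(f"Free disk: {disk_gb:.1f} GiB < {MIN_DISK_GB} GiB")
    return problems


if __name__ == "__main__":
    # os.cpu_count() reports logical CPUs, which may exceed physical cores.
    cores = os.cpu_count() or 0
    disk_gb = shutil.disk_usage(".").free / 1024**3
    problems = check_minimums(cores, read_total_ram_gb(), disk_gb)
    print("\n".join(problems) if problems else "All minimums met")
```

Run it from the directory where you plan to store models, since the disk check looks at the current directory's filesystem.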

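2. Framework Comparison

Framework	Pros	Cons	Recommended use case
llama.cpp	Quick to install, optimized C++ implementation, minimal dependencies	Manual model download and conversion	Running a single model
Ollama	Easy install, simple API, image-based deployment	Higher memory usage	Development/test environments
vLLM	Highest performance, high throughput on large batches	Complex installation	Production environments
LocalAI	Compatible with multiple APIs, cloud integration	Limited technical support	API-based applications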

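3. Recommended Setup: Installing llama.cpp

llama.cpp is the recommended choice for this guide: it is simple to build, fast, and has few dependencies.

# Prepare build tools
sudo apt update
sudo apt install build-essential git -y

# Download llama.cpp
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

# Compile. This guide follows the Makefile-based layout; recent releases build
# with CMake and rename the binaries (server -> llama-server, quantize -> llama-quantize).
# A plain `make` is CPU-only; GPU offload needs a CUDA-enabled build, and the
# flag that enables it depends on the llama.cpp version.
make clean
make

# Python dependencies for the conversion script (only needed if you convert models yourself)
pip install -r requirements.txt

# Download a model (example: LLaMA-2 7B)
# The URL below is illustrative; replace it with the actual Hugging Face path of the GGUF file you want.
mkdir -p models
wget https://huggingface.co/llamav2-7b/resolve/main/llama-2-7b.gguf -O models/llama-2-7b.gguf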

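4. Model Selection Guide

Use case	Recommended model	Notes
General text generation	LLaMA-2 7B (Q5_K_M)	Balanced performance and accuracy
Fast inference	Mistral 7B (Q4_K_M)	Fast inference speed
High precision	Phi-3 3.8B (Q4_K_M)	Precise answers
Code generation	CodeLLaMA 7B (Q4_K_M)	Programming-related tasks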

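5. Quantization Types

# Quantization types
# Q4_K_M: 4-bit K-quant, smallest of the four with a good speed/accuracy trade-off
# Q5_K_M: 5-bit K-quant, better accuracy than Q4_K_M at a larger size
# Q6_K: 6-bit K-quant, accuracy close to Q8_0
# Q8_0: 8-bit, highest accuracy of the four and the largest file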
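
To make the size trade-off concrete, here is a rough back-of-the-envelope sketch. It uses only the nominal bit widths of the types listed above, so the result is a lower bound: real GGUF files are somewhat larger because each block of weights also stores scale data. The function name and the 7-billion parameter count are illustrative assumptions.

```python
# Nominal bits per weight for the quantization types listed above.
# Assumption: per-block scales and mixed-precision tensors are ignored,
# so these figures underestimate the real file size.
NOMINAL_BITS = {"Q4_K_M": 4, "Q5_K_M": 5, "Q6_K": 6, "Q8_0": 8}


def estimate_size_gib(n_params, quant):
    """Lower-bound model size in GiB: parameters x nominal bits / 8."""
    return n_params * NOMINAL_BITS[quant] / 8 / 2**30


if __name__ == "__main__":
    for quant in NOMINAL_BITS:
        size = estimate_size_gib(7e9, quant)
        print(f"{quant}: at least {size:.1f} GiB for a 7B model")
```

Comparing the estimate against available VRAM gives a first indication of whether a model can be fully offloaded to the GPU.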

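Model conversion example

# Step 1: convert the Hugging Face checkpoint to an f16 GGUF file
# (the convert script does not emit K-quants such as q5_k_m; those come from the quantize tool)
python3 convert-hf-to-gguf.py models/llama-2-7b/ --outtype f16 --outfile models/llama-2-7b-f16.gguf

# Step 2a: Q5_K_M quantization
./quantize models/llama-2-7b-f16.gguf models/llama-2-7b-q5k.gguf Q5_K_M

# Step 2b: Q4_K_M quantization
./quantize models/llama-2-7b-f16.gguf models/llama-2-7b-q4k.gguf Q4_K_M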

6. API Setup and Tool Integration

# Start the llama.cpp API server
./server -m models/llama-2-7b-q5k.gguf -c 2048 --host 0.0.0.0 --port 8080

# Test the API
curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Hello, how are you?",
    "n_predict": 128,
    "temperature": 0.7
  }'

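External tool integration example (Python)

import requests

def llama_completion(prompt, max_tokens=128, temperature=0.7, timeout=120):
    """Send a prompt to the local llama.cpp server and return the generated text."""
    response = requests.post(
        "http://localhost:8080/completion",
        json={
            "prompt": prompt,
            "n_predict": max_tokens,
            "temperature": temperature,
        },
        timeout=timeout,  # avoid hanging forever if the server is down
    )
    response.raise_for_status()  # surface HTTP errors instead of a KeyError below
    return response.json()["content"]

# Usage example
result = llama_completion("How do I parse JSON in Python?")
print(result)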

7. Systemd Service Setup

To keep the server running around the clock, set it up as a systemd service.

# Create the service file
sudo nano /etc/systemd/system/llama.service

# Service file contents
[Unit]
Description=Local LLM Server
After=network.target

[Service]
Type=simple
User=your_user
WorkingDirectory=/home/your_user/llama.cpp
ExecStart=/home/your_user/llama.cpp/server -m /home/your_user/llama.cpp/models/llama-2-7b-q5k.gguf -c 2048 --host 0.0.0.0 --port 8080
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

# Enable and start the service
sudo systemctl daemon-reload
sudo systemctl enable llama.service
sudo systemctl start llama.service
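
Loading a model can take a while after the service starts, so a script that depends on the server may want to wait until the port accepts connections. The following is a minimal sketch of that idea using only a TCP connection attempt; the function name and the default timeout are illustrative assumptions, and a successful connection only shows that the port is open, not that the model has finished loading.

```python
import socket
import time


def wait_for_port(host, port, timeout=60.0, interval=1.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError while nothing is listening.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

For the service defined above, a caller would use something like `wait_for_port("localhost", 8080)` before sending the first request.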

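8. Monitoring and Performance Optimization

Performance monitoring script

#!/bin/bash
# Performance monitoring script (monitor.sh): prints memory, GPU, and CPU usage every 30 seconds
while true; do
    echo "=== Memory Usage ==="
    free -h
    echo "=== GPU Usage ==="
    nvidia-smi
    echo "=== CPU Load ==="
    top -bn1 | grep "Cpu(s)"
    sleep 30
done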
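
If the numbers need to go into a log or an alert rather than a terminal, nvidia-smi's CSV query mode is easier to parse than its default table. This is a minimal sketch, assuming nvidia-smi is on the PATH; the function names are illustrative.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'used, total' MiB pairs (one GPU per line) into a list of tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(x) for x in line.split(","))
        rows.append((used, total))
    return rows


def query_gpu_memory():
    """Run nvidia-smi in CSV query mode and return [(used_mib, total_mib), ...]."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_memory(out)
```

Calling `query_gpu_memory()` in a loop gives the same information as the GPU section of monitor.sh, in a form that can be compared against a threshold.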

Optimization options

# Fast inference (low memory use): small context, short outputs
./server -m models/llama-2-7b-q5k.gguf -c 512 -n 128

# Maximum performance (high memory use)
./server -m models/llama-2-7b-q5k.gguf -c 2048 -n 2048 --threads 8

# GPU memory tuning: offload 30 layers to the GPU
./server -m models/llama-2-7b-q5k.gguf --gpu-layers 30 -c 1024
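
The context size passed with -c affects memory because the server keeps a key/value cache entry for every token in the context. The sketch below estimates that cache under stated assumptions: 16-bit cache values and standard multi-head attention, with the LLaMA-2 7B shape (32 layers, hidden size 4096). Models that use grouped-query attention, such as Mistral 7B, need less than this formula suggests.

```python
def kv_cache_bytes(n_layers, n_embd, n_ctx, bytes_per_value=2):
    """Keys + values: 2 tensors x layers x context tokens x hidden size x element size."""
    return 2 * n_layers * n_ctx * n_embd * bytes_per_value


if __name__ == "__main__":
    # LLaMA-2 7B: 32 layers, hidden size 4096 (assumed f16 cache).
    for n_ctx in (512, 1024, 2048):
        mib = kv_cache_bytes(32, 4096, n_ctx) / 2**20
        print(f"-c {n_ctx}: about {mib:.0f} MiB of KV cache")
```

Under these assumptions the cache grows linearly with context, from 256 MiB at -c 512 to 1024 MiB at -c 2048, on top of the model weights.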

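9. Real-World Performance Benchmarks

Inference performance test

# Start a test server (run it in a separate terminal, or append & to background it)
./server -m models/llama-2-7b-q5k.gguf -c 2048 --port 8081

# Quick load test (ab is in the apache2-utils package).
# /completion expects a POST with a JSON body, so give ab a payload file.
echo '{"prompt": "The capital of France is", "n_predict": 10}' > payload.json
ab -n 10 -c 5 -p payload.json -T application/json http://localhost:8081/completion

# Single-request timing against the same test server
curl -X POST http://localhost:8081/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "The capital of France is", "n_predict": 10}' \
  -w "%{time_total}s\n"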

Recorded inference times (example)

LLaMA-2 7B (Q5_K_M):
- Context length 512: 0.8 s
- Context length 1024: 1.2 s
- Context length 2048: 2.1 s

Mistral 7B (Q4_K_M):
- Context length 512: 0.5 s
- Context length 1024: 0.9 s
- Context length 2048: 1.6 s
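
Figures like these are easier to reproduce when each configuration is timed over several runs instead of a single request. The following is a minimal sketch of such a harness; the function names are illustrative, and the callable you pass in is assumed to send one request to your server.

```python
import statistics
import time


def time_call(fn, runs=5):
    """Call fn() `runs` times and return the wall-clock latency of each call in seconds."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies):
    """Return mean, median, and worst-case latency in seconds."""
    return {
        "mean": statistics.mean(latencies),
        "median": statistics.median(latencies),
        "max": max(latencies),
    }
```

For example, `summarize(time_call(lambda: requests.post(url, json=payload, timeout=120)))` would time five POSTs to the /completion endpoint, where `url` and `payload` are whatever you used in the curl test.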

10. Practical Use Cases

