Pollex architecture diagram showing browser extension, API, and Jetson GPU inference

Pollex

Self-hosted text polishing powered by GPU inference on a Jetson Nano. Fix grammar, coherence, and wording in seconds — your text never leaves your network.

Cloud LLM APIs see everything you type. Every email draft, every Slack message, every document revision — routed through third-party servers, logged, and used for training. You trade privacy for convenience.

Pollex runs a quantized language model on a $99 Jetson Nano with 4GB of RAM sitting on your desk. The Chrome extension sends your text through a Cloudflare Tunnel to a Go API that calls llama.cpp with full CUDA offload. The result comes back in 3-8 seconds, reading like a fluent non-native speaker — professional and clear, never AI-flavored.

Go API

Stdlib net/http, zero frameworks. 7 middleware layers (CORS, auth, rate limit, metrics, request ID, body limit, timeout). Multi-stage Docker image at 24.7MB. 80+ tests with -race, CI on every push.

Chrome Extension

Manifest V3 service worker. Paste text, pick a model, get polished output, copy it. Persistent job recovery across popup opens. 7-entry history. Works from anywhere through Cloudflare Tunnel.

GPU Inference

llama.cpp on Jetson Nano 4GB — 128 Maxwell cores, CUDA 10.2, full GPU offload. Qwen 2.5 1.5B quantized to Q4_0 (~1GB VRAM). Sustained ~4 tok/s. Short text polished in ~3s, medium in ~8s.

Observability

Prometheus metrics with 4 custom collectors. 6 alerting rules tied to SLO targets. Grafana dashboard auto-provisioned on startup. k6 load test scripts for burst, sustained, and soak scenarios.

```
Browser Extension --> Cloudflare Tunnel  --> Go API (:8090) --> llama-server (CUDA) --> Qwen 2.5 1.5B
   Manifest V3       pollex.mlorente.dev     7 middleware       full GPU offload        Q4_0 · ~1GB
```
| Layer | Tech | What it does |
| --- | --- | --- |
| Extension | Chrome Manifest V3 | Paste text, select model, copy result |
| Tunnel | Cloudflare Tunnel | Zero-config HTTPS ingress — Jetson sits behind double NAT |
| API | Go 1.26, stdlib net/http | Routes requests to LLM backends, enforces auth and rate limits |
| Inference | llama.cpp + Qwen 2.5 1.5B Q4_0 | GPU inference on 128 Maxwell cores at ~4 tok/s |
| Monitoring | Prometheus + Alertmanager + Grafana | SLO tracking, burn-rate alerts, auto-provisioned dashboards |

Measured on Jetson Nano 4GB, Qwen 2.5 1.5B Q4_0, full GPU offload (-ngl 999), 128 Maxwell cores.

| Input | Chars | Latency | Throughput |
| --- | --- | --- | --- |
| Short (one sentence) | ~50 | ~3s | ~4 tok/s |
| Medium (paragraph) | ~500 | ~8s | ~4 tok/s |
| Long (full page) | ~1000 | ~16s | ~4 tok/s |

SLO targets (7-day rolling window):

| Metric | Target |
| --- | --- |
| Availability | 99% (error budget: 100.8 min/week) |
| Latency p50 | under 20s |
| Latency p95 | under 60s |
| Error rate (5xx) | under 1% |
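Burn-rate alerts for an error-budget SLO typically take this shape as a Prometheus rule. The metric names and the 14.4 multiplier (a 1h window consuming the 1% budget fast enough to exhaust a week's budget in roughly 12 hours) are illustrative, not Pollex's actual rules:

```yaml
groups:
  - name: pollex-slo
    rules:
      - alert: HighErrorBudgetBurn
        # Hypothetical metric names: fires when the 5xx ratio over 1h
        # burns the 1% weekly error budget at 14.4x the sustainable rate.
        expr: |
          sum(rate(http_requests_total{code=~"5.."}[1h]))
            / sum(rate(http_requests_total[1h])) > 14.4 * 0.01
        for: 5m
        labels:
          severity: page
```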

Load tested with k6 — burst (5 VUs, 25 iterations), sustained (12 req/min for 2 min), and 30-minute soak runs.

Seven middleware layers run in order; these are the limits they enforce:

| Layer | Limit | Response |
| --- | --- | --- |
| API key auth | X-API-Key, constant-time compare | 401 |
| Request body | 64KB max | 413 |
| Text length | 10,000 chars | 400 |
| Rate limit | 10 req/min/IP, sliding window | 429 |
| Timeout | 120s per request | 504 |

Authenticated requests bypass rate limiting — by design, since API key holders are trusted. The system prompt includes injection guardrails: user input is always treated as text to polish, never as instructions.

No GPU, no Jetson, no external services. The mock adapter returns canned responses:

```sh
git clone https://github.com/mlorentedev/pollex.git
cd pollex
make dev   # API on :8090 with mock adapter
```

```sh
curl -s http://localhost:8090/api/health | python3 -m json.tool
```
| | |
| --- | --- |
| Language | Go 1.26, stdlib net/http |
| Dependencies | yaml.v3, prometheus/client_golang |
| Tests | 80+ with subtests, -race clean, go vet clean |
| Docker | 24.7MB (multi-stage Alpine 3.21, non-root) |
| CI/CD | Lint + test + build on push, goreleaser on tags |
| License | MIT |
| Hardware | Jetson Nano 4GB — ARM64, CUDA 10.2, 128 Maxwell cores |