Pollex architecture diagram showing browser extension, API, and Jetson GPU inference

Pollex

Self-hosted text polishing powered by GPU inference on a Jetson Nano. Fix grammar, coherence, and wording in seconds — your text never leaves your network.

Cloud LLM APIs see everything you type. Every email draft, every Slack message, every document revision — routed through third-party servers, logged, and used for training. You trade privacy for convenience.

Pollex runs a quantized language model on a $99 Jetson Nano with 4GB of RAM sitting on your desk. The Chrome extension sends your text through a Cloudflare Tunnel to a Go API that calls llama.cpp with full CUDA offload. The result comes back in 3-8 seconds, reading like a fluent non-native speaker — professional and clear, never AI-flavored.

Go API

Stdlib net/http, zero frameworks. 7 middleware layers (CORS, auth, rate limit, metrics, request ID, body limit, timeout). Multi-stage Docker image at 24.7MB. 80+ tests with -race, CI on every push.

Chrome Extension

Manifest V3 service worker. Paste text, pick a model, get polished output, copy it. Persistent job recovery across popup opens. 7-entry history. Works from anywhere through Cloudflare Tunnel.

GPU Inference

llama.cpp on Jetson Nano 4GB — 128 Maxwell cores, CUDA 10.2, full GPU offload. Qwen 2.5 1.5B quantized to Q4_0 (~1GB VRAM). Sustained ~4 tok/s. Short text polished in ~3s, medium in ~8s.

Observability

Prometheus metrics with 4 custom collectors. 6 alerting rules tied to SLO targets. Grafana dashboard auto-provisioned on startup. k6 load test scripts for burst, sustained, and soak scenarios.

```
Browser Extension --> Cloudflare Tunnel  --> Go API (:8090) --> llama-server (CUDA) --> Qwen 2.5 1.5B
   Manifest V3       pollex.mlorente.dev     7 middleware       full GPU offload        Q4_0 · ~1GB
```
| Layer | Tech | What it does |
| --- | --- | --- |
| Extension | Chrome Manifest V3 | Paste text, select model, copy result |
| Tunnel | Cloudflare Tunnel | Zero-config HTTPS ingress — Jetson sits behind double NAT |
| API | Go 1.26, stdlib net/http | Routes requests to LLM backends, enforces auth and rate limits |
| Inference | llama.cpp + Qwen 2.5 1.5B Q4_0 | GPU inference on 128 Maxwell cores at ~4 tok/s |
| Monitoring | Prometheus + Alertmanager + Grafana | SLO tracking, burn-rate alerts, auto-provisioned dashboards |

Measured on Jetson Nano 4GB, Qwen 2.5 1.5B Q4_0, full GPU offload (-ngl 999), 128 Maxwell cores.

| Input | Chars | Latency | Throughput |
| --- | --- | --- | --- |
| Short (one sentence) | ~50 | ~3s | ~4 tok/s |
| Medium (paragraph) | ~500 | ~8s | ~4 tok/s |
| Long (full page) | ~1000 | ~16s | ~4 tok/s |

SLO targets (7-day rolling window):

| Metric | Target |
| --- | --- |
| Availability | 99% (error budget: 100.8 min/week) |
| Latency p50 | under 20s |
| Latency p95 | under 60s |
| Error rate (5xx) | under 1% |
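Burn-rate alerts for an error-budget SLO typically take this shape as a Prometheus rule. The metric names and the 14.4 multiplier (a 1h window consuming the 1% budget fast enough to exhaust a week's budget in roughly 12 hours) are illustrative, not Pollex's actual rules:

```yaml
groups:
  - name: pollex-slo
    rules:
      - alert: HighErrorBudgetBurn
        # Hypothetical metric names: fires when the 5xx ratio over 1h
        # burns the 1% weekly error budget at 14.4x the sustainable rate.
        expr: |
          sum(rate(http_requests_total{code=~"5.."}[1h]))
            / sum(rate(http_requests_total[1h])) > 14.4 * 0.01
        for: 5m
        labels:
          severity: page
```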

Load tested with k6 — burst (5 VUs, 25 iterations), sustained (12 req/min for 2 min), and 30-minute soak runs.

Seven middleware layers run in order; these are the limits they enforce:

| Layer | Limit | Response |
| --- | --- | --- |
| API key auth | X-API-Key, constant-time compare | 401 |
| Request body | 64KB max | 413 |
| Text length | 10,000 chars | 400 |
| Rate limit | 10 req/min/IP, sliding window | 429 |
| Timeout | 120s per request | 504 |

Authenticated requests bypass rate limiting — by design, since API key holders are trusted. The system prompt includes injection guardrails: user input is always treated as text to polish, never as instructions.

No GPU, no Jetson, no external services. The mock adapter returns canned responses:

```sh
git clone https://github.com/mlorentedev/pollex.git
cd pollex
make dev   # API on :8090 with mock adapter
```

```sh
curl -s http://localhost:8090/api/health | python3 -m json.tool
```
| | |
| --- | --- |
| Language | Go 1.26, stdlib net/http |
| Dependencies | yaml.v3, prometheus/client_golang |
| Tests | 80+ with subtests, -race clean, go vet clean |
| Docker | 24.7MB (multi-stage Alpine 3.21, non-root) |
| CI/CD | Lint + test + build on push, goreleaser on tags |
| License | MIT |
| Hardware | Jetson Nano 4GB — ARM64, CUDA 10.2, 128 Maxwell cores |