HTPBE — Has This PDF Been Edited?
SaaS platform for PDF authenticity verification with a public REST API. Detects edits through 5-layer analysis of metadata, xref structure, digital signatures, and content.
Key Results
- 5-layer analysis completes in ≤9 seconds
- Supports files up to 10 MB (2.2× Vercel's 4.5 MB limit)
- 100% detection confidence for 7+ deterministic markers
- 7 algorithm versions in 5 days with auto-outdated flagging
The Challenge
The PDF verification market is dominated by expensive enterprise solutions ($1–$1.50/document) that offer no self-service API. Banks, lawyers, and HR departments need to verify document authenticity: was this contract modified after signing? Is this invoice original?
There was an additional technical constraint: Vercel's serverless functions limit the request body to 4.5 MB, which makes direct PDF upload through API routes impossible for real-world documents.
I needed to build something that:
- Detects modifications reliably (not just metadata)
- Works instantly via web upload or API
- Distinguishes legitimate updates (LTV) from tampering
- Offers self-service API with tiered pricing
- Handles files larger than Vercel's body size limit
The Solution
I built a 5-layer PDF forensics system based on deterministic rules (no ML black box). The key insight: incremental updates are the most reliable indicator of modification. A legitimate PDF has one xref table, and each edit adds another.
Client-Side Upload to Bypass Vercel Limits
The upload follows a two-stage pattern: the browser uploads directly to Vercel Blob using a client token issued by /api/blob-token (bypassing the serverless body limit), then the server downloads the file from Blob and analyzes it:
// components/Hero/index.tsx, step 1: browser → Vercel Blob
import { upload } from "@vercel/blob/client";

const blob = await upload(filename, file, {
  access: "public",
  handleUploadUrl: "/api/blob-token",
  clientPayload: JSON.stringify({ size: file.size }),
});
// Step 2: server downloads from Blob URL and analyzes
await analyzePdf(blob.url, file.name); // Server Action
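// app/api/blob-token/route.ts (origin-based security)
import { handleUpload, type HandleUploadBody } from "@vercel/blob/client";

export async function POST(req: Request): Promise<Response> {
  // ALLOWED_ORIGINS is defined elsewhere in the project
  const origin = req.headers.get("origin") ?? "";
  const isAllowed = ALLOWED_ORIGINS.some((o) => origin === o);
  if (!isAllowed) return new Response("Forbidden", { status: 403 });
  // handleUpload takes one options object: the parsed JSON body plus the request
  const body = (await req.json()) as HandleUploadBody;
  const result = await handleUpload({
    body,
    request: req,
    onBeforeGenerateToken: async () => ({
      allowedContentTypes: ["application/pdf"],
      maximumSizeInBytes: 10 * 1024 * 1024,
    }),
    // Nothing to do on completion: analysis is triggered by the client in step 2
    onUploadCompleted: async () => {},
  });
  return Response.json(result);
}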
This allows processing files up to 10 MB without infrastructure changes.
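Because the Blob URL arrives from the browser, the server step would reasonably re-check what it downloaded before analyzing it. The following is a minimal sketch of such a guard, not the project's actual code: `assertPdfBytes` and its error messages are hypothetical, and the only value taken from the text is the 10 MB cap.

```typescript
// Hypothetical guard for step 2; names and messages are illustrative only.
const MAX_PDF_BYTES = 10 * 1024 * 1024; // same 10 MB cap as the token route
const PDF_MAGIC = [0x25, 0x50, 0x44, 0x46, 0x2d]; // bytes of "%PDF-"

export function assertPdfBytes(bytes: Uint8Array): void {
  if (bytes.length > MAX_PDF_BYTES) {
    throw new Error(`File exceeds ${MAX_PDF_BYTES} bytes`);
  }
  // The header normally sits at offset 0; tolerate a little leading junk
  // by scanning the first 1024 bytes for the "%PDF-" marker.
  const lastStart = Math.min(bytes.length - PDF_MAGIC.length, 1024);
  for (let i = 0; i <= lastStart; i++) {
    if (PDF_MAGIC.every((b, j) => bytes[i + j] === b)) return;
  }
  throw new Error("Not a PDF: missing %PDF- header");
}
```

Checking the content itself matters here because the declared content type on upload is supplied by the client and says nothing certain about the bytes that were stored.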
5-Layer PDF Forensics with Algorithm Versioning
Analysis is split into independent layers. The critical marker is the xref count: a legitimate PDF has one xref table, and each incremental update adds another. PDF 1.5+ complicates this with xref streams, which use a different syntax (a stream dictionary tagged /Type /XRef instead of the xref keyword):
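// lib/services/pdf-structure.service.ts
private countXrefTables(pdfBuffer: Buffer): number {
  const XREF = Buffer.from('xref\n');
  const XREF_STREAM = Buffer.from('/Type /XRef');
  const LF = 0x0a;
  const CR = 0x0d;
  let count = 0;
  let pos = 0;
  while (pos < pdfBuffer.length) {
    // Classic xref tables: "xref\n" at start of line
    const xrefIdx = pdfBuffer.indexOf(XREF, pos);
    // xref streams (PDF 1.5+): "/Type /XRef" in stream dictionary
    const xrefStreamIdx = pdfBuffer.indexOf(XREF_STREAM, pos);
    const nextIdx = Math.min(
      xrefIdx === -1 ? Infinity : xrefIdx,
      xrefStreamIdx === -1 ? Infinity : xrefStreamIdx
    );
    if (nextIdx === Infinity) break;
    // "startxref\n" also contains "xref\n", so a classic hit only counts
    // when it really starts a line (offset 0 or preceded by a line break)
    const prev = nextIdx > 0 ? pdfBuffer[nextIdx - 1] : LF;
    const isStartxref = nextIdx === xrefIdx && prev !== LF && prev !== CR;
    if (!isStartxref) count++;
    pos = nextIdx + 5;
  }
  return count;
}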
The algorithm is versioned (currently v2.1.3), and the version is saved with each result. When a result is requested via /api/v1/result/{uid}, results produced by an older version are automatically flagged with algorithmOutdated: true.
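The flagging step amounts to comparing the stored version with the current one. Here is a minimal sketch under the assumption that versions follow the `major.minor.patch` pattern seen in v2.1.3; `CURRENT_ALGORITHM_VERSION`, `parseVersion`, and `isAlgorithmOutdated` are hypothetical names, not taken from the codebase.

```typescript
// Hypothetical constant and helpers; only the "2.1.3" value comes from the text.
const CURRENT_ALGORITHM_VERSION = "2.1.3";

function parseVersion(version: string): [number, number, number] {
  const m = /^v?(\d+)\.(\d+)\.(\d+)$/.exec(version);
  if (!m) throw new Error(`Invalid version: ${version}`);
  return [Number(m[1]), Number(m[2]), Number(m[3])];
}

// True when the stored result was produced by an older algorithm version.
export function isAlgorithmOutdated(
  stored: string,
  current: string = CURRENT_ALGORITHM_VERSION
): boolean {
  const a = parseVersion(stored);
  const b = parseVersion(current);
  for (let i = 0; i < 3; i++) {
    // Compare numerically so that 2.10.0 sorts after 2.9.9
    if (a[i] !== b[i]) return a[i] < b[i];
  }
  return false;
}
```

Comparing the parsed numbers rather than the raw strings avoids the usual trap where "2.10.0" sorts before "2.9.9" lexicographically.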
Dynamic Quota System Without Counters
Instead of storing a usage counter in the user row (a denormalized value that is prone to race conditions and drift), the quota is calculated dynamically from the checks table:
// lib/services/quota.service.ts
async checkQuota(userId: string): Promise<QuotaStatus> {
  const user = await db.select().from(users).where(eq(users.id, userId)).get();
  // .get() returns undefined when no row matches
  if (!user) throw new Error(`User not found: ${userId}`);
  const startOfMonth = new Date();
  startOfMonth.setDate(1);
  startOfMonth.setHours(0, 0, 0, 0);
  // Count from checks table: no stale counters, no race conditions
  const result = await db
    .select({ count: sql<number>`cast(count(*) as integer)` })
    .from(checks)
    .innerJoin(apiKeys, eq(checks.apiKeyId, apiKeys.id))
    .where(
      and(
        eq(apiKeys.userId, userId),
        // checkDate is stored as Unix seconds
        gte(checks.checkDate, Math.floor(startOfMonth.getTime() / 1000))
      )
    );
  const used = result[0]?.count ?? 0;
  const limit = user.requestsPerMonth; // null = unlimited (Enterprise)
  return { used, limit, remaining: limit === null ? null : limit - used };
}
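The month boundary above is computed in the server's local time zone, so the reset moment depends on where the function runs. A UTC-based alternative could look like the sketch below, assuming (as the query does) that `checkDate` holds Unix seconds; `monthStartUnixUtc` and `remainingQuota` are hypothetical helper names.

```typescript
// Hypothetical helpers; not part of the quota service shown above.

// Unix timestamp (seconds) of 00:00:00 UTC on the first day of the month of `now`.
export function monthStartUnixUtc(now: Date = new Date()): number {
  return Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), 1) / 1000;
}

// Remaining requests, clamped at zero; null means unlimited.
export function remainingQuota(used: number, limit: number | null): number | null {
  return limit === null ? null : Math.max(0, limit - used);
}
```

Clamping at zero is a presentation choice: if a user ends up slightly over the limit, the API would report 0 remaining rather than a negative number.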
Dual-Environment API Keys (Live/Test)
Keys follow the format htpbe_{live|test}_{43-random-chars}. Test keys only accept mock URLs from a whitelist, so developers can test their integration without consuming quota:
// app/api/v1/analyze/route.ts
const keyEnv = getApiKeyEnvironment(apiKey); // 'live' | 'test'
if (keyEnv === 'test') {
  const TEST_URLS = ['https://htpbe.tech/samples/modified-high.pdf', ...];
  if (!TEST_URLS.includes(file_url)) {
    return Response.json({
      error: 'Test keys only work with official HTPBE sample URLs',
      test_urls: TEST_URLS,
    }, { status: 422 });
  }
}
// Live keys: fetch any public PDF URL, analyze, bill quota
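The key format lends itself to a simple generator and parser. The sketch below is an assumption-laden illustration: the source does not say which alphabet the 43 random characters use, so base64url (where 32 random bytes encode to exactly 43 characters) is a guess, and this `getApiKeyEnvironment` only mirrors the name used above; unlike the original, it returns null for malformed keys.

```typescript
import { randomBytes } from "node:crypto";

type KeyEnvironment = "live" | "test";

// Assumed alphabet: base64url. 32 bytes -> 43 characters without padding.
const KEY_PATTERN = /^htpbe_(live|test)_([A-Za-z0-9_-]{43})$/;

// Hypothetical generator for keys in the documented format.
export function generateApiKey(env: KeyEnvironment): string {
  return `htpbe_${env}_${randomBytes(32).toString("base64url")}`;
}

// Returns the environment encoded in the key prefix, or null if the key
// does not match the expected format.
export function getApiKeyEnvironment(apiKey: string): KeyEnvironment | null {
  const match = KEY_PATTERN.exec(apiKey);
  return match ? (match[1] as KeyEnvironment) : null;
}
```

Encoding the environment in the prefix lets the route decide between test and live behavior before any database lookup.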
LTV Validation (Avoiding False Positives)
Long-Term Validation (LTV) adds timestamps and certificates to signed PDFs. This is a legitimate modification that shouldn't trigger alerts:
// lib/services/pdf-ltv.service.ts
export function analyzeLTV(bytes: Uint8Array): LTVAnalysis {
  const text = new TextDecoder("latin1").decode(bytes);
  // Detect Document Security Store (DSS)
  const hasDSS = /\/Type\s*\/DSS\b/.test(text);
  // Detect Document Timestamp
  const hasDTS =
    /\/Type\s*\/DocTimeStamp\b/.test(text) || /\/SubFilter\s*\/ETSI\.RFC3161\b/.test(text);
  // ETSI PAdES-LTV compliance markers
  const isETSICompliant = hasDSS || hasDTS || /\/SubFilter\s*\/ETSI\.CAdES\.detached\b/.test(text);
  return {
    hasLTV: hasDSS || hasDTS,
    hasDSS,
    hasDTS,
    isETSICompliant,
    ltvDetails: isETSICompliant ? extractLTVDetails(bytes) : null,
  };
}
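How the LTV result feeds into the verdict is not shown here. One plausible way to combine it with the xref count is sketched below; `classifyUpdates`, `LtvFlags`, and the verdict labels are hypothetical, and the real rules may weigh more signals than these two.

```typescript
// Hypothetical combination of two signals; not the project's actual rule set.

// Subset of the LTVAnalysis fields used by this sketch.
interface LtvFlags {
  hasLTV: boolean;
}

type UpdateVerdict =
  | "no-incremental-updates"
  | "possibly-ltv-only"
  | "incremental-updates";

export function classifyUpdates(xrefCount: number, ltv: LtvFlags): UpdateVerdict {
  // A single xref section means the file was never incrementally updated
  if (xrefCount <= 1) return "no-incremental-updates";
  // Extra xref sections alongside LTV markers may be explained by LTV alone
  return ltv.hasLTV ? "possibly-ltv-only" : "incremental-updates";
}
```

The middle verdict is deliberately cautious: LTV markers explain why extra revisions may exist, but they do not prove that no other revision changed the content.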
Results
The platform launched in September 2024:
| Metric | Value |
|---|---|
| Analysis time | ≤9 seconds (within Vercel timeout) |
| Max file size | 10 MB (2.2× Vercel's 4.5 MB limit) |
| Detection markers | 7+ deterministic (iLovePDF, PDF24, QPDF, etc.) |
| Algorithm versions | 7 in 5 days (v2.0.0 → v2.1.3) |
| Database schema | 6 tables with dynamic quota calculation |
| Pricing tiers | $15–$499/month + Enterprise on-premise |
The deterministic approach means results are reproducible, explainable, and fast. The Enterprise option (Docker/Kubernetes) addresses GDPR, HIPAA, and PCI DSS requirements for fintech and legal clients.