HTPBE — Has This PDF Been Edited?

September 25, 2024

SaaS platform for PDF authenticity verification with a public REST API. It detects edits through a 5-layer analysis covering metadata, xref structure, digital signatures, and content.

Tech Stack

Frontend

Next.js 16, React 19, Tailwind CSS, Framer Motion, next-themes, Swiper

Backend

Next.js Server Actions, pdf-lib, Zod 4, NextAuth.js v5

Database

Turso (libSQL edge), Drizzle ORM

Infrastructure

Vercel Serverless, Vercel Blob, Vercel Analytics

Integrations

Resend (magic links), Google OAuth, GitHub OAuth, Stripe

Key Results

  • 5-layer analysis completes in ≤9 seconds
  • Supports files up to 10 MB (2.2× Vercel's 4.5 MB limit)
  • 100% detection confidence for 7+ deterministic markers
  • 7 algorithm versions in 5 days with auto-outdated flagging

The Challenge

The PDF verification market is dominated by expensive enterprise solutions ($1–$1.50/document) with no self-service API option. Banks, lawyers, and HR departments need to verify document authenticity: was this contract modified after signing? Is this invoice original?

An additional technical constraint: Vercel serverless functions cap the request body at 4.5 MB, which rules out direct PDF upload through API routes for many real-world documents.

I needed to build something that:

  1. Detects modifications reliably (not just metadata)
  2. Works instantly via web upload or API
  3. Distinguishes legitimate updates (LTV) from tampering
  4. Offers self-service API with tiered pricing
  5. Handles files larger than Vercel's body size limit

The Solution

I built a 5-layer PDF forensics system with deterministic rules (no ML black box). The key insight: incremental updates are the most reliable indicator of modification. A PDF that was written once has one xref table, and each later edit saved incrementally appends another.

Client-Side Upload to Bypass Vercel Limits

Two-stage upload pattern: the browser uploads directly to Vercel Blob using a client token issued by the server (bypassing the serverless body limit), then the server downloads the blob and analyzes it:

// components/Hero/index.tsx, step 1: browser → Vercel Blob
import { upload } from "@vercel/blob/client";

const blob = await upload(file.name, file, {
  access: "public",
  handleUploadUrl: "/api/blob-token",
  clientPayload: JSON.stringify({ size: file.size }),
});

// Step 2: server downloads from Blob URL and analyzes
await analyzePdf(blob.url, file.name); // Server Action

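// app/api/blob-token/route.ts: origin-based security
import { handleUpload, type HandleUploadBody } from "@vercel/blob/client";

export async function POST(req: Request): Promise<Response> {
  // ALLOWED_ORIGINS is defined elsewhere in the project
  const origin = req.headers.get("origin") ?? "";
  const isAllowed = ALLOWED_ORIGINS.some((o) => origin === o);
  if (!isAllowed) return new Response("Forbidden", { status: 403 });

  // handleUpload takes a single options object: the parsed body plus the request
  const body = (await req.json()) as HandleUploadBody;
  const json = await handleUpload({
    body,
    request: req,
    onBeforeGenerateToken: async () => ({
      allowedContentTypes: ["application/pdf"],
      maximumSizeInBytes: 10 * 1024 * 1024,
    }),
    onUploadCompleted: async () => {
      // Nothing to do here: analysis is triggered by the client in step 2
    },
  });
  return Response.json(json);
}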

This allows processing files up to 10 MB without infrastructure changes.
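
Because the file reaches Blob storage without passing through a serverless function, the server-side step cannot assume the blob really is a PDF within the size limit. The following is a minimal sketch of checks that step might run after downloading; `validatePdfBytes` and `downloadPdf` are hypothetical names used for illustration, not functions from the project.

```typescript
const MAX_PDF_BYTES = 10 * 1024 * 1024; // same 10 MB cap as the upload token

type PdfCheck = { ok: true } | { ok: false; reason: string };

// Hypothetical helper: checks the size and the "%PDF-" magic bytes at offset 0.
function validatePdfBytes(bytes: Uint8Array): PdfCheck {
  if (bytes.length === 0) return { ok: false, reason: "empty file" };
  if (bytes.length > MAX_PDF_BYTES) return { ok: false, reason: "file too large" };
  const magic = [0x25, 0x50, 0x44, 0x46, 0x2d]; // "%PDF-"
  const hasMagic = magic.every((byte, i) => bytes[i] === byte);
  return hasMagic ? { ok: true } : { ok: false, reason: "not a PDF" };
}

// Hypothetical helper: downloads the blob and validates it before analysis.
async function downloadPdf(blobUrl: string): Promise<Uint8Array> {
  const res = await fetch(blobUrl);
  if (!res.ok) throw new Error(`Download failed with status ${res.status}`);
  const bytes = new Uint8Array(await res.arrayBuffer());
  const check = validatePdfBytes(bytes);
  if (!check.ok) throw new Error(`Rejected upload: ${check.reason}`);
  return bytes;
}
```

Re-checking size and content on the server keeps the token's `allowedContentTypes` and `maximumSizeInBytes` from being the only line of defense.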

5-Layer PDF Forensics with Algorithm Versioning

Analysis is split into independent layers. The critical marker is the xref count: a PDF that was written once typically has one cross-reference section, and each incremental update appends another. PDF 1.5+ complicates this with xref streams, which use a different syntax:

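// lib/services/pdf-structure.service.ts
private countXrefTables(pdfBuffer: Buffer): number {
  const XREF = Buffer.from('xref');
  const XREF_STREAM = Buffer.from('/Type /XRef');
  let count = 0;
  let pos = 0;
  while (pos < pdfBuffer.length) {
    // Classic xref tables: "xref" keyword at start of line
    const xrefIdx = pdfBuffer.indexOf(XREF, pos);
    // xref streams (PDF 1.5+): "/Type /XRef" in stream dictionary
    const xrefStreamIdx = pdfBuffer.indexOf(XREF_STREAM, pos);
    const nextIdx = Math.min(
      xrefIdx === -1 ? Infinity : xrefIdx,
      xrefStreamIdx === -1 ? Infinity : xrefStreamIdx
    );
    if (nextIdx === Infinity) break;
    // Enforce "start of line" for the keyword (LF or CR before it) so the
    // "xref" inside "startxref" is not counted as a table
    const prev = nextIdx > 0 ? pdfBuffer[nextIdx - 1] : 0x0a;
    const atLineStart = prev === 0x0a || prev === 0x0d;
    if (nextIdx === xrefStreamIdx || atLineStart) count++;
    pos = nextIdx + XREF.length;
  }
  return count;
}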
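
To make the counting rule concrete, here is a standalone sketch run against two synthetic byte strings. The function name `countXrefSections` and the sample data are illustrative assumptions, and the samples are not valid PDFs (their offsets are made up). Real files are messier: linearized PDFs, for example, legitimately carry two cross-reference sections.

```typescript
// Illustrative sketch, not the project's implementation.
function countXrefSections(pdf: Buffer): number {
  const text = pdf.toString("latin1");
  // Classic table: "xref" alone at the start of a line, so "startxref" is skipped.
  const classic = text.match(/(?:^|[\r\n])xref[ \t]*(?=[\r\n])/g)?.length ?? 0;
  // Cross-reference stream (PDF 1.5+): "/Type /XRef" with optional whitespace.
  const streams = text.match(/\/Type\s*\/XRef\b/g)?.length ?? 0;
  return classic + streams;
}

// A file written once: one xref table, one trailer.
const original = Buffer.from(
  ["%PDF-1.4", "1 0 obj", "<<>>", "endobj", "xref", "0 2",
   "trailer", "<<>>", "startxref", "30", "%%EOF", ""].join("\n"),
  "latin1"
);

// An incremental update appends a new object, xref table, and trailer.
const appended = Buffer.from(
  ["2 0 obj", "<<>>", "endobj", "xref", "2 1",
   "trailer", "<<>>", "startxref", "99", "%%EOF", ""].join("\n"),
  "latin1"
);
const edited = Buffer.concat([original, appended]);

console.log(countXrefSections(original)); // 1
console.log(countXrefSections(edited));   // 2
```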

The algorithm is versioned (currently v2.1.3), and the version is saved with each result. When a result is requested via /api/v1/result/{uid}, results produced by an older version are automatically flagged with algorithmOutdated: true.
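
The flagging step can be sketched as a numeric version comparison. This is a minimal sketch under assumptions: the `StoredResult` shape and the helper names are hypothetical, and only the `algorithmOutdated` field and the v2.1.3 version number come from the text above.

```typescript
const CURRENT_ALGORITHM_VERSION = "2.1.3"; // version named in the text

// Compares "major.minor.patch" strings numerically; returns <0, 0, or >0.
function compareVersions(a: string, b: string): number {
  const pa = a.replace(/^v/, "").split(".").map(Number);
  const pb = b.replace(/^v/, "").split(".").map(Number);
  for (let i = 0; i < 3; i++) {
    const diff = (pa[i] ?? 0) - (pb[i] ?? 0);
    if (diff !== 0) return diff;
  }
  return 0;
}

// Hypothetical shape of a stored analysis result.
interface StoredResult {
  uid: string;
  algorithmVersion: string;
}

// Adds the flag at read time, so stored rows never need rewriting.
function withOutdatedFlag(result: StoredResult, current: string = CURRENT_ALGORITHM_VERSION) {
  return {
    ...result,
    algorithmOutdated: compareVersions(result.algorithmVersion, current) < 0,
  };
}
```

Comparing components as numbers rather than strings matters once a component reaches two digits: a plain string comparison would rank "2.10.0" below "2.9.0".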

Dynamic Quota System Without Counters

Instead of storing a usage counter in the user row (denormalization → race conditions), quota is calculated dynamically from the checks table:

// lib/services/quota.service.ts
async checkQuota(userId: string): Promise<QuotaStatus> {
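  const user = await db.select().from(users).where(eq(users.id, userId)).get();
  if (!user) throw new Error(`User ${userId} not found`);

  // Month boundary in the server's local time zone
  const startOfMonth = new Date();
  startOfMonth.setDate(1);
  startOfMonth.setHours(0, 0, 0, 0);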

  // Count from checks table — no stale counters, no race conditions
  const result = await db
    .select({ count: sql<number>`cast(count(*) as integer)` })
    .from(checks)
    .innerJoin(apiKeys, eq(checks.apiKeyId, apiKeys.id))
    .where(
      and(
        eq(apiKeys.userId, userId),
        gte(checks.checkDate, Math.floor(startOfMonth.getTime() / 1000))
      )
    );

  const used = result[0]?.count ?? 0;
  const limit = user.requestsPerMonth; // null = unlimited (Enterprise)
  return { used, limit, remaining: limit === null ? null : limit - used };
}
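
The two pure pieces of this calculation, the month boundary and the remaining-quota arithmetic, can be isolated and tested without a database. A minimal sketch follows; the helper names are hypothetical, the `QuotaStatus` shape is inferred from the return statement above, and it assumes billing months start at 00:00 UTC, whereas the service code uses server-local time.

```typescript
// Unix seconds for 00:00 UTC on the first day of the month containing `now`.
function startOfMonthUnix(now: Date): number {
  return Math.floor(Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), 1) / 1000);
}

// Shape inferred from the service's return value; null means unlimited.
interface QuotaStatus {
  used: number;
  limit: number | null;
  remaining: number | null;
}

// Clamps at zero so an over-quota account never reports a negative remainder.
function toQuotaStatus(used: number, limit: number | null): QuotaStatus {
  return {
    used,
    limit,
    remaining: limit === null ? null : Math.max(0, limit - used),
  };
}

console.log(toQuotaStatus(40, 100)); // { used: 40, limit: 100, remaining: 60 }
console.log(toQuotaStatus(7, null)); // { used: 7, limit: null, remaining: null }
```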

Dual-Environment API Keys (Live/Test)

Keys follow the format htpbe_{live|test}_{43-random-chars}. Test keys only accept mock URLs from an allowlist, so developers can test their integration without consuming quota:

// app/api/v1/analyze/route.ts
const keyEnv = getApiKeyEnvironment(apiKey); // 'live' | 'test'

if (keyEnv === 'test') {
  const TEST_URLS = ['https://htpbe.tech/samples/modified-high.pdf', ...];
  if (!TEST_URLS.includes(file_url)) {
    return Response.json({
      error: 'Test keys only work with official HTPBE sample URLs',
      test_urls: TEST_URLS,
    }, { status: 422 });
  }
}
// Live keys: fetch any public PDF URL, analyze, bill quota
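
One way to produce and parse keys in this format is sketched below. It assumes the 43 random characters are base64url, since 32 random bytes encode to exactly 43 base64url characters without padding; the source does not state the alphabet. The function names are hypothetical and are not the project's `getApiKeyEnvironment`.

```typescript
import { randomBytes } from "node:crypto";

type KeyEnvironment = "live" | "test";

// 32 random bytes encode to exactly 43 base64url characters (no padding).
function generateApiKey(env: KeyEnvironment): string {
  return `htpbe_${env}_${randomBytes(32).toString("base64url")}`;
}

// Assumed alphabet: base64url (A-Z, a-z, 0-9, "-", "_").
const KEY_PATTERN = /^htpbe_(live|test)_[A-Za-z0-9_-]{43}$/;

// Returns the environment encoded in the key, or null if the key is malformed.
function parseKeyEnvironment(apiKey: string): KeyEnvironment | null {
  const match = KEY_PATTERN.exec(apiKey);
  return match ? (match[1] as KeyEnvironment) : null;
}
```

Encoding the environment in the key prefix lets the route reject malformed keys and pick the test or live path before any database lookup.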

LTV Validation (Avoiding False Positives)

Long-Term Validation (LTV) adds timestamps and certificates to signed PDFs through an incremental update. This is a legitimate modification that should not trigger alerts:

// lib/services/pdf-ltv.service.ts
export function analyzeLTV(bytes: Uint8Array): LTVAnalysis {
  const text = new TextDecoder("latin1").decode(bytes);

  // Detect Document Security Store (DSS)
  const hasDSS = /\/Type\s*\/DSS\b/.test(text);

  // Detect Document Timestamp
  const hasDTS =
    /\/Type\s*\/DocTimeStamp\b/.test(text) || /\/SubFilter\s*\/ETSI\.RFC3161\b/.test(text);

  // ETSI PAdES-LTV compliance markers
  const isETSICompliant = hasDSS || hasDTS || /\/SubFilter\s*\/ETSI\.CAdES\.detached\b/.test(text);

  return {
    hasLTV: hasDSS || hasDTS,
    hasDSS,
    hasDTS,
    isETSICompliant,
    ltvDetails: isETSICompliant ? extractLTVDetails(bytes) : null,
  };
}
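
To show what these patterns match, here is a reduced, self-contained variant of the same checks run against two synthetic object snippets. `detectLtvMarkers` and the sample strings are illustrative assumptions. The `ltvDetails` field is omitted because `extractLTVDetails` is not shown in the source.

```typescript
// Reduced variant of the checks above; same regular expressions, fewer fields.
function detectLtvMarkers(bytes: Uint8Array) {
  const text = new TextDecoder("latin1").decode(bytes);
  const hasDSS = /\/Type\s*\/DSS\b/.test(text);
  const hasDTS =
    /\/Type\s*\/DocTimeStamp\b/.test(text) || /\/SubFilter\s*\/ETSI\.RFC3161\b/.test(text);
  return { hasDSS, hasDTS, hasLTV: hasDSS || hasDTS };
}

const encode = (s: string) => new TextEncoder().encode(s);

// Synthetic incremental update that adds a Document Security Store.
const ltvUpdate = encode("5 0 obj\n<< /Type /DSS /Certs [6 0 R] /OCSPs [7 0 R] >>\nendobj\n");

// Synthetic incremental update that replaces page content instead.
const plainEdit = encode("5 0 obj\n<< /Type /Page /Contents 8 0 R >>\nendobj\n");

console.log(detectLtvMarkers(ltvUpdate).hasLTV); // true
console.log(detectLtvMarkers(plainEdit).hasLTV); // false
```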

Results

The platform launched in September 2024:

Metric               Value
Analysis time        ≤9 seconds (within Vercel timeout)
Max file size        10 MB (2.2× Vercel's 4.5 MB limit)
Detection markers    7+ deterministic (iLovePDF, PDF24, QPDF, etc.)
Algorithm versions   7 in 5 days (v2.0.0 → v2.1.3)
Database schema      6 tables with dynamic quota calculation
Pricing tiers        $15–$499/month + Enterprise on-premise

The deterministic approach means results are reproducible, explainable, and fast. The Enterprise option (on-premise deployment via Docker/Kubernetes) addresses GDPR, HIPAA, and PCI DSS requirements for fintech and legal clients.

Iurii Rogulia