Data Methodology — ParheliaWeb

1 Core Principles

🔒 Source-Anchored Verification

Every extracted field must be traceable to a specific location in the raw source text. If we cannot locate the claim in the original document, it does not enter our database.

🚫 No Hallucination Tolerance

We do not accept inferred, implied, or generated data. Before any field is stored, our extraction pipeline validates it against the source text using pattern matching.
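To make source anchoring concrete, here is a minimal sketch of what locating a claimed value in the raw text could look like. The function name `locate_in_source` and the whitespace-tolerant matching rule are assumptions made for illustration, not a description of the production pipeline.

```python
import re
from typing import Optional, Tuple


def locate_in_source(value: str, source_text: str) -> Optional[Tuple[int, int]]:
    """Return the (start, end) character span of `value` in `source_text`, or None.

    Hypothetical helper: matching is case-insensitive and tolerates differences
    in whitespace (for example, a line break in the middle of the phrase).
    """
    tokens = value.split()
    if not tokens:
        return None
    # Escape each token literally and allow any run of whitespace between tokens.
    pattern = r"\s+".join(re.escape(token) for token in tokens)
    match = re.search(pattern, source_text, flags=re.IGNORECASE)
    return match.span() if match else None


source = "The regulator fined Acme Corp\n$2.5 million on Tuesday."
span = locate_in_source("Acme Corp $2.5 million", source)
if span is None:
    print("rejected: value not found in source")
else:
    print("anchored at", span)
```

Under this sketch, a value that cannot be located returns `None` and would be rejected rather than stored.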

📊 Classification-First Routing

Articles are classified by dataset type before extraction begins. A funding article cannot enter the fines pipeline. This prevents cross-contamination and schema-forced hallucination.

2 The Pipeline

📡 Discovery

Monitored sources include SEC EDGAR filings, regulatory press releases, news aggregators, and official company announcements. Discovery runs in two phases. Phase 1 collects URLs from the monitored sources themselves. Phase 2 adds external search, with source attribution recorded for each result.

🏷️ Classification

Each article is routed to the correct dataset pipeline (Funding, Layoffs, Fines, Acquisitions, IPOs) using keyword and structural analysis. Misrouted articles are quarantined for manual review.
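The routing step can be sketched as a keyword scorer that refuses to guess. The keyword lists, the `classify` function, and the rule "no clear winner means quarantine" are all hypothetical. This sketch covers only the keyword half of the analysis and leaves out structural signals.

```python
from typing import Dict, List

# Hypothetical keyword lists; real lists would be larger and tuned per dataset.
KEYWORDS: Dict[str, List[str]] = {
    "funding": ["series a", "series b", "raised", "funding round"],
    "layoffs": ["layoff", "laid off", "job cuts"],
    "fines": ["fined", "penalty", "settlement with"],
    "acquisitions": ["acquire", "acquisition", "merger"],
    "ipos": ["initial public offering", "s-1 filing", "ipo"],
}


def classify(text: str) -> str:
    """Return a dataset name, or "quarantine" when there is no single clear winner."""
    lowered = text.lower()
    # Score each dataset by how many of its keywords appear in the text.
    scores = {
        name: sum(keyword in lowered for keyword in keywords)
        for name, keywords in KEYWORDS.items()
    }
    best = max(scores.values())
    winners = [name for name, score in scores.items() if score == best]
    if best == 0 or len(winners) > 1:
        return "quarantine"  # assumption: ambiguous articles go to manual review
    return winners[0]


print(classify("Acme raised $10M in a Series A funding round"))
```

Sending ties to quarantine rather than picking a pipeline keeps an article out of any schema it does not clearly belong to.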

🔍 Extraction & Validation

Extraction is regex-first: pattern matching is always attempted before anything else. LLM extraction is used only when regex fails, and every LLM-extracted field is verified against the raw source text before acceptance.
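A minimal sketch of the regex-first flow with a verified LLM fallback follows. The `llm_extract` callable stands in for whatever model call is used and is passed in as a parameter. The amount pattern and the exact-substring verification rule are illustrative assumptions.

```python
import re
from typing import Callable, Optional, Tuple

# Hypothetical pattern for US dollar amounts such as "$2.5 million" or "$1,200".
AMOUNT_RE = re.compile(
    r"\$\d[\d,]*(?:\.\d+)?(?:\s?(?:million|billion))?", re.IGNORECASE
)


def extract_amount(
    source_text: str,
    llm_extract: Callable[[str], Optional[str]],
) -> Tuple[Optional[str], Optional[str]]:
    """Return (value, extraction_method), or (None, None) if nothing verifiable is found."""
    match = AMOUNT_RE.search(source_text)
    if match:
        return match.group(0), "regex"

    # Regex failed: ask the model, then accept its answer only if the exact
    # string appears in the raw source text.
    candidate = llm_extract(source_text)
    if candidate and candidate in source_text:
        return candidate, "llm_verified"
    return None, None


# Usage with a stand-in for the model call:
text = "Acme was fined EUR 40 million by the authority."
print(extract_amount(text, lambda _: "EUR 40 million"))
print(extract_amount(text, lambda _: "EUR 45 million"))
```

In the second call the model's answer does not appear in the text, so the field is dropped instead of stored.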

✓ Human Review

High-confidence records pass automatically. Borderline cases — unusual amounts, ambiguous entities, or first-seen sources — are flagged for manual verification.
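The three borderline cases named above could be expressed as simple flagging rules. In the sketch below, the `Candidate` fields, the amount ceiling, and the reason labels are all hypothetical placeholders. The thresholds actually used are not stated here.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Candidate:
    amount_usd: float
    entity_matches: int  # how many known entities the extracted name could refer to
    source_domain: str


def review_reasons(
    record: Candidate,
    known_domains: Set[str],
    max_usual_amount: float = 1e10,  # assumed ceiling, for illustration only
) -> List[str]:
    """Return the reasons a record needs manual review; an empty list means auto-pass."""
    reasons: List[str] = []
    if record.amount_usd <= 0 or record.amount_usd > max_usual_amount:
        reasons.append("unusual_amount")
    if record.entity_matches != 1:
        reasons.append("ambiguous_entity")
    if record.source_domain not in known_domains:
        reasons.append("first_seen_source")
    return reasons


known = {"sec.gov"}
print(review_reasons(Candidate(5e6, 1, "sec.gov"), known))
print(review_reasons(Candidate(5e12, 3, "newblog.example"), known))
```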

🔄 Continuous Monitoring

Records are periodically re-checked against source URLs. If a source is updated or retracted, our record is updated or flagged accordingly.
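One way to detect that a source has changed is to compare a content fingerprint taken at verification time with one taken at re-check time. The sketch below assumes the page text has already been fetched (`None` meaning the source is gone). The function names and status labels are hypothetical.

```python
import hashlib
from typing import Optional


def fingerprint(text: str) -> str:
    """SHA-256 of the text after collapsing whitespace, so reflowed pages still match."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def recheck(stored_fingerprint: str, current_text: Optional[str]) -> str:
    """Compare the stored fingerprint with the source as it looks now."""
    if current_text is None:
        return "flag_retracted"  # source no longer available
    if fingerprint(current_text) != stored_fingerprint:
        return "flag_updated"  # content changed since last verification
    return "unchanged"


stored = fingerprint("Acme fined $2 million")
print(recheck(stored, "Acme fined $3 million"))
```

A fingerprint only says that something changed. Deciding whether the change affects an extracted field would still require re-running extraction on the new text.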

3 Quality Checks

Every fine amount is cross-referenced against the issuing regulator's official announcement

Funding rounds are verified against at least one of: SEC filing, press release, or reputable publication

Layoff figures require a named source (company statement, WARN filing, or credible report)

Acquisitions must include buyer, seller, and reported value from official disclosure

IPO filings are sourced directly from SEC EDGAR with accession numbers

All records include source_url and verified_at timestamps in API responses
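Some of the checks above lend themselves to mechanical validation. The sketch below covers two of them: required fields for acquisitions, and the shape of an SEC EDGAR accession number (10 digits, 2 digits, 6 digits, separated by hyphens). Field names other than source_url and verified_at are assumptions made for illustration.

```python
import re
from typing import Any, Dict, List

# Hypothetical field names, apart from source_url and verified_at.
REQUIRED_FIELDS: Dict[str, List[str]] = {
    "acquisitions": ["buyer", "seller", "reported_value", "source_url", "verified_at"],
    "ipos": ["accession_number", "source_url", "verified_at"],
}

# EDGAR accession numbers look like 0001234567-24-000123.
ACCESSION_RE = re.compile(r"\d{10}-\d{2}-\d{6}")


def quality_errors(dataset: str, record: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in the record; empty means it passed these checks."""
    errors = [
        f"missing:{field}"
        for field in REQUIRED_FIELDS.get(dataset, [])
        if record.get(field) in (None, "")
    ]
    accession = record.get("accession_number")
    if dataset == "ipos" and accession and not ACCESSION_RE.fullmatch(str(accession)):
        errors.append("malformed:accession_number")
    return errors


print(quality_errors("acquisitions", {"buyer": "Acme", "seller": "Beta"}))
```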

Why This Matters

Automated data pipelines can hallucinate. A funding article misrouted to a fines pipeline can produce entirely fabricated enforcement records, because the extractor will try to fill a fines schema from text that contains no fine. Our classification-first, source-anchored approach is designed so that every record you query is documented, traceable, and verifiable against its source.

4 API Transparency

Every record returned by our APIs includes:

source_url — Link to the primary source document

verified_at — ISO 8601 timestamp of last verification

extraction_method — regex, llm_verified, or manual

confidence_score — 0.0 to 1.0 based on validation layers passed
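As an illustration of how a client might consume these fields, the sketch below checks the provenance fields on a record and shows one possible reading of "validation layers passed" as a simple ratio. The sample record, its URL, the helper names, and the scoring formula are assumptions. The actual weighting is not specified here.

```python
from datetime import datetime
from typing import Any, Dict

ALLOWED_METHODS = {"regex", "llm_verified", "manual"}


def confidence_score(layers_passed: int, layers_total: int) -> float:
    """Assumed formula: fraction of validation layers passed, rounded to 2 decimals."""
    if layers_total <= 0 or not 0 <= layers_passed <= layers_total:
        raise ValueError("layer counts out of range")
    return round(layers_passed / layers_total, 2)


def has_valid_provenance(record: Dict[str, Any]) -> bool:
    """Check that the four provenance fields are present and well-formed."""
    url = record.get("source_url")
    score = record.get("confidence_score")
    try:
        datetime.fromisoformat(record["verified_at"])  # ISO 8601 with numeric offset
    except (KeyError, TypeError, ValueError):
        return False
    return (
        isinstance(url, str)
        and url.startswith(("http://", "https://"))
        and record.get("extraction_method") in ALLOWED_METHODS
        and isinstance(score, (int, float))
        and 0.0 <= score <= 1.0
    )


# Hypothetical record, not real API output.
sample = {
    "source_url": "https://example.com/press-release",
    "verified_at": "2024-05-01T12:00:00+00:00",
    "extraction_method": "llm_verified",
    "confidence_score": confidence_score(3, 4),
}
print(has_valid_provenance(sample))
```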