ParheliaWeb

Data Methodology

How we collect, verify, and maintain data you can trust.

1 Core Principles

🔒 Source-Anchored Verification

Every extracted field must be traceable to a specific location in the raw source text. If we cannot locate the claim in the original document, it does not enter our database.
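As an illustration of what "traceable to a specific location" could mean in practice, the sketch below records the character offsets at which each field value occurs in the raw text and rejects a record if any field cannot be found. The names `Anchor`, `anchor_field`, and `accept_record` are hypothetical, and exact-substring matching and whole-record rejection are simplifying assumptions rather than a description of the production pipeline.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Anchor:
    """Where a field's value was found in the raw source text."""
    start: int  # offset of the first matching character
    end: int    # offset one past the last matching character


def anchor_field(value: str, source_text: str) -> Optional[Anchor]:
    """Return the span of `value` in `source_text`, or None if it is absent.

    Assumption: exact substring matching. A real pipeline might first
    normalize whitespace or number formats.
    """
    if not value:
        return None
    start = source_text.find(value)
    if start == -1:
        return None
    return Anchor(start, start + len(value))


def accept_record(
    fields: dict[str, str], source_text: str
) -> Optional[dict[str, Anchor]]:
    """Accept a record only if every field can be anchored in the source."""
    anchors = {}
    for name, value in fields.items():
        anchor = anchor_field(value, source_text)
        if anchor is None:
            return None  # assumption: one untraceable field rejects the record
        anchors[name] = anchor
    return anchors
```

Storing the offsets alongside each field would let a reviewer jump straight to the supporting passage instead of re-reading the whole document.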

🚫 No Hallucination Tolerance

We do not accept inferred, implied, or generated data. Our extraction pipeline validates each field against the source using pattern matching before any structured storage occurs.

📊 Classification-First Routing

Articles are classified by dataset type before extraction begins. A funding article cannot enter the fines pipeline. This prevents cross-contamination and schema-forced hallucination.

2 The Pipeline

📡

Discovery

Monitored sources include SEC EDGAR filings, regulatory press releases, news aggregators, and official company announcements. Discovery runs in two phases: Phase 1 covers same-source URLs; Phase 2 adds external search, with source attribution.

๐Ÿท๏ธ

Classification

Each article is routed to the correct dataset pipeline (Funding, Layoffs, Fines, Acquisitions, IPOs) using keyword and structural analysis. Misrouted articles are quarantined for manual review.
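A minimal sketch of keyword-based routing with a quarantine outcome follows. The keyword lists, the thresholds, and the `classify` function are all hypothetical; the structural-analysis part of the classifier is not modeled here, and quarantining on a weak or tied score is an assumed stand-in for however misrouting is actually detected.

```python
import re
from typing import Optional

# Hypothetical keyword lists, chosen only for illustration.
KEYWORDS = {
    "funding": ["raised", "series a", "series b", "seed round", "funding round"],
    "layoffs": ["layoffs", "laid off", "job cuts", "workforce reduction"],
    "fines": ["fined", "civil penalty", "settlement with", "enforcement action"],
    "acquisitions": ["acquired", "acquisition of", "to acquire", "merger"],
    "ipos": ["initial public offering", "ipo", "files to go public"],
}


def score(text: str) -> dict[str, int]:
    """Count whole-phrase keyword hits per dataset type."""
    lowered = text.lower()
    return {
        label: sum(
            len(re.findall(r"\b" + re.escape(kw) + r"\b", lowered)) for kw in kws
        )
        for label, kws in KEYWORDS.items()
    }


def classify(text: str, min_hits: int = 1, min_margin: int = 1) -> Optional[str]:
    """Return a dataset label, or None to quarantine the article for review."""
    ranked = sorted(score(text).items(), key=lambda kv: kv[1], reverse=True)
    (best, best_hits), (_, runner_up_hits) = ranked[0], ranked[1]
    # Quarantine when nothing matched or when two datasets are tied.
    if best_hits < min_hits or best_hits - runner_up_hits < min_margin:
        return None
    return best
```

Returning `None` rather than the best guess keeps an ambiguous article out of every pipeline until a person has looked at it.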

🔍

Extraction & Validation

Extraction is regex-first: pattern matching is attempted before anything else. LLM extraction is used only when regex fails, and every LLM-extracted field is verified against the raw source text before acceptance.
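The ordering described above can be sketched for a single field, a dollar amount. The pattern `AMOUNT_RE` is illustrative only, and `llm_extract` is a hypothetical callable standing in for whatever model call the fallback uses; verifying by verbatim substring match is an assumption about how "verified against the raw source text" might be implemented.

```python
import re
from typing import Callable, Optional

# Illustrative pattern for amounts such as "$12 million" or "$3.5 billion".
AMOUNT_RE = re.compile(r"\$\d+(?:\.\d+)?\s(?:million|billion)", re.IGNORECASE)


def extract_amount(
    source_text: str,
    llm_extract: Callable[[str], Optional[str]],
) -> Optional[tuple[str, str]]:
    """Return (value, method), or None if no verifiable amount is found.

    `llm_extract` is a hypothetical fallback: it receives the source text
    and returns a candidate string, or None.
    """
    match = AMOUNT_RE.search(source_text)
    if match:
        return match.group(0), "regex"

    # The fallback is consulted only after the regex has failed.
    candidate = llm_extract(source_text)
    # Accept the candidate only if it appears verbatim in the source.
    if candidate and candidate in source_text:
        return candidate, "llm_verified"
    return None
```

For example, `extract_amount("Acme raised an undisclosed sum.", lambda text: "$50 million")` returns `None`, because the fallback's answer does not occur in the source text.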

✓

Human Review

High-confidence records pass automatically. Borderline cases (unusual amounts, ambiguous entities, or first-seen sources) are flagged for manual verification.
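One way to express the three borderline conditions as explicit checks is sketched below. The `Candidate` fields, the `typical_range` bounds, and the reason strings are hypothetical; in particular, how "unusual" and "ambiguous" are measured is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    source_domain: str
    amount_usd: float
    entity_matches: int  # number of known entities the extracted name resolved to


def review_reasons(
    record: Candidate,
    known_domains: set[str],
    typical_range: tuple[float, float],
) -> list[str]:
    """List the reasons a record needs manual review; empty means auto-pass."""
    reasons = []
    low, high = typical_range
    if not low <= record.amount_usd <= high:
        reasons.append("unusual_amount")
    if record.entity_matches != 1:
        reasons.append("ambiguous_entity")
    if record.source_domain not in known_domains:
        reasons.append("first_seen_source")
    return reasons
```

Returning the reasons, rather than a bare yes/no flag, tells the reviewer what to look at first.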

🔄

Continuous Monitoring

Records are periodically re-checked against source URLs. If a source is updated or retracted, our record is updated or flagged accordingly.
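A re-check of this kind could be built on content fingerprints, as in the sketch below. It assumes the pipeline stores a hash of the source text at extraction time and that fetching is handled elsewhere; `fingerprint`, `recheck`, and the returned status strings are hypothetical names.

```python
import hashlib
from typing import Optional


def fingerprint(text: str) -> str:
    """Hash whitespace-normalized text so trivial reflows are not seen as changes."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def recheck(stored_fingerprint: str, fetched_text: Optional[str]) -> str:
    """Compare a stored fingerprint with freshly fetched source text.

    `fetched_text` is None when the source could not be retrieved,
    for example because the page was removed.
    """
    if fetched_text is None:
        return "flag_retracted"
    if fingerprint(fetched_text) != stored_fingerprint:
        return "flag_updated"
    return "unchanged"
```

A changed fingerprint only says that the source differs; deciding whether the record itself must change would still require re-running extraction on the new text.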

3 Quality Checks

Why This Matters

Automated data pipelines can hallucinate. A funding article misrouted to a fines pipeline can produce entirely fabricated enforcement records. Our classification-first, source-anchored approach ensures that what you query is what actually happened: documented, traceable, and verifiable.

4 API Transparency

Every record returned by our APIs includes: