How we collect, verify, and maintain data you can trust.
Every extracted field must be traceable to a specific location in the raw source text. If we cannot locate the claim in the original document, it does not enter our database.
We do not accept inferred, implied, or generated data. Our extraction pipeline validates each field against the source using pattern matching before any structured storage occurs.
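The source-anchoring rule above can be sketched in a few lines. This is a minimal illustration, not our production validator: the function names `locate_in_source` and `anchored_fields` and the sample article are hypothetical, and a real check would need field-specific normalization (for example, number boundaries).

```python
import re
from typing import Optional, Tuple


def locate_in_source(value: str, source_text: str) -> Optional[Tuple[int, int]]:
    """Return the (start, end) character span of `value` in `source_text`, or None."""
    tokens = value.split()
    if not tokens:
        return None
    # Accept any run of whitespace between tokens so that line wrapping
    # in the raw document does not hide an otherwise exact match.
    pattern = r"\s+".join(re.escape(token) for token in tokens)
    match = re.search(pattern, source_text, flags=re.IGNORECASE)
    return match.span() if match else None


def anchored_fields(fields: dict, source_text: str) -> dict:
    """Keep only the fields whose value can be located in the source text."""
    kept = {}
    for name, value in fields.items():
        span = locate_in_source(str(value), source_text)
        if span is not None:
            kept[name] = {"value": value, "span": span}
    return kept


# Hypothetical article text and candidate fields.
source = "Acme Corp raised $25 million in a\nSeries B round."
candidates = {
    "company": "Acme Corp",
    "amount": "$25 million",
    "lead_investor": "Globex",  # not present in the source, so it is dropped
}
result = anchored_fields(candidates, source)
```

Storing the character span alongside the value is one way to keep each field traceable to a specific location rather than merely "somewhere in the document".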
Articles are classified by dataset type before extraction begins. A funding article cannot enter the fines pipeline. This prevents cross-contamination and schema-forced hallucination.
Discovery monitors SEC EDGAR filings, regulatory press releases, news aggregators, and official company announcements. It runs in two phases: Phase 1 covers same-source URLs, and Phase 2 adds external search with source attribution.
Each article is routed to the correct dataset pipeline (Funding, Layoffs, Fines, Acquisitions, IPOs) using keyword and structural analysis. Misrouted articles are quarantined for manual review.
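As a rough illustration of classification-first routing, the sketch below scores an article against per-dataset keyword lists and quarantines anything without a clear winner. The keyword lists, the `classify` function, and the tie-breaking rule are all assumptions made for this example; the structural analysis mentioned above is not modeled.

```python
from typing import Dict, Tuple

# Hypothetical keyword lists, chosen only for illustration.
DATASET_KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "funding": ("raised", "series a", "series b", "seed round", "funding round"),
    "layoffs": ("layoff", "laid off", "job cuts", "workforce reduction"),
    "fines": ("fined", "civil penalty", "enforcement action"),
    "acquisitions": ("acquire", "acquisition", "merger"),
    "ipos": ("initial public offering", "form s-1"),
}

QUARANTINE = "quarantine"


def classify(article_text: str) -> str:
    """Return the dataset name with the most keyword hits, or QUARANTINE."""
    text = article_text.lower()
    scores = {
        name: sum(text.count(keyword) for keyword in keywords)
        for name, keywords in DATASET_KEYWORDS.items()
    }
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (best, best_score), (_, runner_up_score) = ranked[0], ranked[1]
    # No signal at all, or a tie between two datasets: hold for manual review
    # instead of forcing the article into a pipeline.
    if best_score == 0 or best_score == runner_up_score:
        return QUARANTINE
    return best
```

For example, `classify("Acme raised $25 million in a Series B funding round.")` returns `"funding"`, while an article that mentions both a fine and a merger equally often is quarantined rather than guessed at.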
Extraction is regex-first: pattern matching is tried before anything else. LLM extraction is used only when regex fails, and every LLM-extracted field is verified against the raw source text before acceptance.
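The regex-first flow with a verified fallback might look like the sketch below. The amount pattern, the `extract_amount` function, and the `llm_extract` callable are hypothetical stand-ins; no real LLM client is called, and the verification shown is a plain substring check.

```python
import re
from typing import Callable, Optional

# Illustrative pattern for dollar amounts such as "$25 million" or "$1,250,000".
AMOUNT_PATTERN = re.compile(
    r"\$\d+(?:,\d{3})*(?:\.\d+)?(?:\s(?:million|billion))?",
    re.IGNORECASE,
)


def extract_amount(
    source_text: str,
    llm_extract: Optional[Callable[[str], Optional[str]]] = None,
) -> Optional[dict]:
    """Try regex first; fall back to an LLM only if regex finds nothing."""
    match = AMOUNT_PATTERN.search(source_text)
    if match:
        return {"value": match.group(0), "extraction_method": "regex"}
    if llm_extract is None:
        return None
    candidate = llm_extract(source_text)
    # Accept the LLM answer only if it appears verbatim in the raw source.
    if candidate and candidate in source_text:
        return {"value": candidate, "extraction_method": "llm_verified"}
    return None


# Stub standing in for an LLM call (an assumption made for this example).
def fake_llm(text: str) -> Optional[str]:
    return "twenty-five million dollars"


spelled_out = "The round totaled twenty-five million dollars."
verified = extract_amount(spelled_out, llm_extract=fake_llm)
rejected = extract_amount("No figures were disclosed.", llm_extract=fake_llm)
```

Here `verified` carries the `llm_verified` method because the stub's answer is present in the text, while `rejected` is `None` because the same answer cannot be found in the second source.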
High-confidence records pass automatically. Borderline cases (unusual amounts, ambiguous entities, or first-seen sources) are flagged for manual verification.
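One way to picture this triage step is shown below, under stated assumptions: the score is taken as the fraction of validation layers passed, the 0.9 threshold is invented for the example, and the layer and flag names are hypothetical.

```python
from typing import Dict, List

# Hypothetical cut-off; the real threshold is not stated here.
AUTO_ACCEPT_THRESHOLD = 0.9


def confidence_score(layer_results: Dict[str, bool]) -> float:
    """Fraction of validation layers passed, from 0.0 to 1.0."""
    if not layer_results:
        return 0.0
    return sum(layer_results.values()) / len(layer_results)


def triage(layer_results: Dict[str, bool], borderline_flags: List[str]) -> str:
    """Return 'auto_accept' or 'manual_review' for a candidate record."""
    score = confidence_score(layer_results)
    # Any borderline flag forces review, even when every layer passed.
    if borderline_flags or score < AUTO_ACCEPT_THRESHOLD:
        return "manual_review"
    return "auto_accept"


layers = {"classified": True, "regex_match": True, "source_anchored": True, "entity_resolved": False}
decision = triage(layers, borderline_flags=[])
```

With one of four layers failing, the score is 0.75 and the record goes to manual review. A record that passes every layer is still held back if it carries a flag such as a first-seen source.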
Records are periodically re-checked against source URLs. If a source is updated or retracted, our record is updated or flagged accordingly.
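A periodic re-check can be approximated by comparing a stored fingerprint of the source text with a fresh copy. The sketch below assumes the fetching happens elsewhere and that a retracted source is represented as `None`; the function names and status strings are hypothetical.

```python
import hashlib
from typing import Optional


def fingerprint(text: str) -> str:
    """SHA-256 hex digest of the source text, used to detect changes."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def recheck(stored_fingerprint: str, current_text: Optional[str]) -> str:
    """Compare a freshly fetched source against the stored fingerprint."""
    if current_text is None:
        # The source could not be retrieved: treat it as retracted.
        return "flag_retracted"
    if fingerprint(current_text) != stored_fingerprint:
        return "flag_updated"
    return "unchanged"


stored = fingerprint("Acme was fined $1 million.")
status = recheck(stored, "Acme was fined $2 million.")
```

A byte-level hash flags any edit, including cosmetic ones, so a real re-check would likely normalize the text first or compare only the extracted fields.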
source_url and verified_at timestamps in API responses
Automated data pipelines can hallucinate. A funding article misrouted to a fines pipeline can produce entirely fabricated enforcement records. Our classification-first, source-anchored approach ensures that what you query is what actually happened: documented, traceable, and verifiable.
Every record returned by our APIs includes:
source_url: Link to the primary source document
verified_at: ISO 8601 timestamp of last verification
extraction_method: regex, llm_verified, or manual
confidence_score: 0.0 to 1.0 based on validation layers passed
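For client code, the four fields above could be modeled and sanity-checked as in this sketch. The `Provenance` class, its validation rules, and the example URL are assumptions for illustration, not part of a published SDK.

```python
from dataclasses import asdict, dataclass
from datetime import datetime, timezone

ALLOWED_METHODS = {"regex", "llm_verified", "manual"}


@dataclass(frozen=True)
class Provenance:
    """The provenance fields listed above, with basic validation."""

    source_url: str
    verified_at: str  # ISO 8601 timestamp
    extraction_method: str
    confidence_score: float

    def __post_init__(self) -> None:
        # fromisoformat raises ValueError on malformed input. Before Python 3.11
        # it accepts only a subset of ISO 8601 (for example, no trailing "Z").
        datetime.fromisoformat(self.verified_at)
        if self.extraction_method not in ALLOWED_METHODS:
            raise ValueError(f"unknown extraction_method: {self.extraction_method}")
        if not 0.0 <= self.confidence_score <= 1.0:
            raise ValueError("confidence_score must be between 0.0 and 1.0")


# Placeholder values, not a real API response.
record = Provenance(
    source_url="https://example.com/press-release",
    verified_at=datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc).isoformat(),
    extraction_method="regex",
    confidence_score=1.0,
)
payload = asdict(record)
```

Validating on construction means a malformed timestamp or an out-of-range score fails loudly at the boundary instead of propagating into downstream analysis.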