Fundus

Need to crawl online news? With Fundus, you can crawl millions of pages of online news with just a few lines of code!

Fundus is a library for crawling and parsing online news. With a few lines of code, you can crawl millions of news articles to build a large corpus for text analysis or model training!

Each crawled article is parsed into structured components. This allows you to directly extract the article plain text or other features that you need in your NLP pipeline!

Here is an example code snippet to crawl two articles from US-based publishers:
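
A minimal sketch of such a crawl, assuming Fundus is installed (`pip install fundus`) and network access is available. `Crawler`, `PublisherCollection.us`, and `crawl(max_articles=...)` are Fundus names; the wrapper function `crawl_us_articles` is a hypothetical helper introduced only for this illustration, and it imports Fundus lazily so the definition itself loads without the package.

```python
def crawl_us_articles(max_articles: int = 2) -> list:
    """Crawl up to `max_articles` articles from US-based publishers.

    Hypothetical helper for illustration; requires the `fundus` package
    and network access when called.
    """
    # Imported inside the function so that defining it does not require Fundus.
    from fundus import Crawler, PublisherCollection

    # Initialize the crawler for news publishers based in the US.
    crawler = Crawler(PublisherCollection.us)

    # Collect the parsed articles yielded by the crawler.
    return list(crawler.crawl(max_articles=max_articles))


# Usage (performs real network requests):
# for article in crawl_us_articles(2):
#     print(article)
```

Printing an article should show a short summary of it; which publishers the two articles come from depends on what the crawler reaches first.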

Fundus is at its core rule-based, with bespoke parsers for each supported online news source. Because of this, Fundus extracts plain text more accurately than the other libraries we compared it against, achieving the highest precision and F1-score in this comparative evaluation:

| Scraper | Precision | Recall | F1-Score | Evaluated Version |
|---|---|---|---|---|
| Fundus | 99.89±0.57 | 96.75±12.7 | 97.69±9.75 | 0.4.1 |
| Trafilatura | 93.91±12.89 | 96.85±15.69 | 93.62±16.73 | 1.12.0 |
| news-please | 97.95±10.08 | 91.89±16.15 | 93.39±14.52 | 1.6.13 |
| BTE | 81.09±19.41 | 98.23±8.61 | 87.14±15.48 | / |
| jusText | 86.51±18.92 | 90.23±20.61 | 86.96±19.76 | 3.0.1 |
| BoilerNet | 85.96±18.55 | 91.21±19.15 | 86.52±18.03 | / |
| Boilerpipe | 82.89±20.65 | 82.11±29.99 | 79.90±25.86 | 1.3.0 |
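
A note on reading the scores: the F1 column is not the harmonic mean of the precision and recall columns. This is consistent with each metric being computed per article and then averaged, with the ± value giving the spread. That reading is an assumption here, since this section does not describe the evaluation procedure. The sketch below uses made-up per-article scores (not evaluation data) to show why the two ways of aggregating differ.

```python
from statistics import mean, pstdev


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical per-article (precision, recall) pairs in percent.
# These values are invented for illustration only.
per_article = [(100.0, 100.0), (100.0, 50.0), (80.0, 100.0)]

# F1 computed for each article separately, then aggregated.
f1_scores = [f1(p, r) for p, r in per_article]
mean_of_f1 = mean(f1_scores)
spread_of_f1 = pstdev(f1_scores)  # population standard deviation

# F1 computed once from the averaged precision and recall.
mean_precision = mean(p for p, _ in per_article)
mean_recall = mean(r for _, r in per_article)
f1_of_means = f1(mean_precision, mean_recall)

print(f"mean of per-article F1: {mean_of_f1:.2f} ± {spread_of_f1:.2f}")
print(f"F1 of mean precision and recall: {f1_of_means:.2f}")
```

Under this assumption, a single article with poor recall pulls the averaged F1 down more than it pulls down the F1 of the averaged columns, so the two numbers should not be expected to match.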

Getting Started

Publication

Fundus: A Simple-to-Use News Scraper Optimized for High Quality Extractions. Max Dallabetta, Conrad Dobberstein, Adrian Breiding and Alan Akbik. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), ACL 2024.