Fundus
Need to crawl online news? With Fundus, you can crawl millions of pages of online news with just a few lines of code!
Fundus is a library for crawling and parsing online news. With a few lines of code, you can crawl millions of news articles to build a large corpus for text analysis or model training!
Each crawled article is parsed to identify:
- its title
- its authors
- its main text body
- its paragraph structure
- its images and their captions
- ... and other structured attributes
Here is an example code snippet to crawl two articles from US-based publishers:
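A minimal sketch of such a snippet, assuming the `fundus` package is installed (for example via `pip install fundus`) and that `Crawler`, `PublisherCollection.us`, and `crawl(max_articles=...)` behave as described in the project's README. The wrapper function `crawl_us_articles` is a hypothetical name introduced here for illustration.

```python
def crawl_us_articles(max_articles: int = 2) -> list:
    """Crawl up to `max_articles` articles from US-based publishers."""
    # Imported lazily so this module still loads where fundus is not installed.
    from fundus import Crawler, PublisherCollection

    # Restrict the crawler to the publishers Fundus supports in the US.
    crawler = Crawler(PublisherCollection.us)

    # crawl() yields parsed articles one at a time; collect them into a list.
    return list(crawler.crawl(max_articles=max_articles))
```

Calling `crawl_us_articles(2)` requires network access. Each returned article can be passed to `print` to show its parsed content.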
Fundus is rule-based at its core, with a bespoke parser for each supported online news source. Because of this, Fundus extracts plain text more accurately than other libraries. Check out this comparative evaluation:
| Scraper | Precision | Recall | F1-Score | Evaluated Version |
|---|---|---|---|---|
| Fundus | 99.89±0.57 | 96.75±12.7 | 97.69±9.75 | 0.4.1 |
| Trafilatura | 93.91±12.89 | 96.85±15.69 | 93.62±16.73 | 1.12.0 |
| news-please | 97.95±10.08 | 91.89±16.15 | 93.39±14.52 | 1.6.13 |
| BTE | 81.09±19.41 | 98.23±8.61 | 87.14±15.48 | / |
| jusText | 86.51±18.92 | 90.23±20.61 | 86.96±19.76 | 3.0.1 |
| BoilerNet | 85.96±18.55 | 91.21±19.15 | 86.52±18.03 | / |
| Boilerpipe | 82.89±20.65 | 82.11±29.99 | 79.90±25.86 | 1.3.0 |
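The table does not say how these scores are computed. As a rough illustration only, and not the benchmark's actual metric, the sketch below assumes a simple whitespace-token overlap between an extracted text and a reference text. The helper `overlap_scores` is hypothetical. It shows how boilerplate left in an extraction lowers precision, while missing content lowers recall.

```python
from collections import Counter


def overlap_scores(extracted: str, reference: str) -> tuple[float, float, float]:
    """Return (precision, recall, F1) based on bag-of-words token overlap."""
    ext_tokens = Counter(extracted.split())
    ref_tokens = Counter(reference.split())

    # Multiset intersection: shared tokens, counted with multiplicity.
    overlap = sum((ext_tokens & ref_tokens).values())
    n_ext = sum(ext_tokens.values())
    n_ref = sum(ref_tokens.values())

    # Precision: share of extracted tokens that belong to the reference.
    precision = overlap / n_ext if n_ext else 0.0
    # Recall: share of reference tokens that were extracted.
    recall = overlap / n_ref if n_ref else 0.0
    # F1: harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# "subscribe" is leftover boilerplate; "on the mat" was missed by the extractor.
p, r, f1 = overlap_scores("the cat sat subscribe", "the cat sat on the mat")
```

Under this assumed metric, an extractor that rarely includes boilerplate scores high precision even if it sometimes drops paragraphs, which is a pattern consistent with a precision well above recall.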
Getting Started
- Check out the GitHub page
- Check out our tutorials!
Publication
Fundus: A Simple-to-Use News Scraper Optimized for High Quality Extractions. Max Dallabetta, Conrad Dobberstein, Adrian Breiding and Alan Akbik. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), ACL 2024.