lab_5_scraper package
Submodules
Crawler implementation.
- class lab_5_scraper.scraper.Config(path_to_config: Path)
Bases:
object
Class for unpacking and validating configurations.
- __init__(path_to_config: Path) → None
Initialize an instance of the Config class.
- Parameters:
path_to_config (pathlib.Path) – Path to configuration.
- get_headless_mode() → bool
Retrieve whether to use headless mode.
- Returns:
Whether to use headless mode or not
- Return type:
bool
- get_num_articles() → int
Retrieve total number of articles to scrape.
- Returns:
Total number of articles to scrape
- Return type:
int
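The Config accessors above could be backed by a small loader along the following lines. This is only a sketch under stated assumptions: the reference does not say how the configuration is stored, so the JSON format, the key names `total_articles` and `headless_mode`, and the class name `ConfigSketch` are all hypothetical.

```python
import json
from pathlib import Path


class ConfigSketch:
    """Illustrative stand-in for Config; file format and key names are assumed."""

    def __init__(self, path_to_config: Path) -> None:
        self.path_to_config = path_to_config
        with path_to_config.open(encoding="utf-8") as file:
            raw = json.load(file)
        # Hypothetical key names; the real configuration schema may differ.
        num_articles = raw["total_articles"]
        headless_mode = raw["headless_mode"]
        # bool is a subclass of int in Python, so reject it explicitly.
        if (
            isinstance(num_articles, bool)
            or not isinstance(num_articles, int)
            or num_articles < 1
        ):
            raise ValueError("total_articles must be a positive integer")
        if not isinstance(headless_mode, bool):
            raise ValueError("headless_mode must be a boolean")
        self._num_articles = num_articles
        self._headless_mode = headless_mode

    def get_num_articles(self) -> int:
        """Return the total number of articles to scrape."""
        return self._num_articles

    def get_headless_mode(self) -> bool:
        """Return whether headless mode is requested."""
        return self._headless_mode
```

Validating once in the constructor keeps the getters trivial, so callers can rely on the returned values without re-checking them.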
- class lab_5_scraper.scraper.Crawler(config: Config)
Bases:
object
Crawler implementation.
- __init__(config: Config) → None
Initialize an instance of the Crawler class.
- Parameters:
config (Config) – Configuration
- _extract_url(article_bs: BeautifulSoup) → str
Find and retrieve URL from HTML.
- Parameters:
article_bs (bs4.BeautifulSoup) – BeautifulSoup instance
- Returns:
URL from HTML
- Return type:
str
- class lab_5_scraper.scraper.CrawlerRecursive(config: Config)
Bases:
Crawler
Recursive crawler implementation.
Takes the URL of the title page and recursively finds the requested number of articles.
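The recursive strategy described for CrawlerRecursive can be illustrated with a small depth-first sketch. It is not the class's actual implementation: the function `collect_article_urls`, the `get_links` callback (standing in for fetching a page and extracting its links), and the rule that every discovered link counts as an article are assumptions made for illustration.

```python
from typing import Callable, Iterable, List, Optional, Set


def collect_article_urls(
    page_url: str,
    get_links: Callable[[str], Iterable[str]],
    limit: int,
    found: Optional[List[str]] = None,
    visited: Optional[Set[str]] = None,
) -> List[str]:
    """Collect up to `limit` unique URLs reachable from `page_url`, depth first."""
    found = [] if found is None else found
    visited = set() if visited is None else visited
    # Stop when enough URLs are collected or the page was already expanded.
    if page_url in visited or len(found) >= limit:
        return found
    visited.add(page_url)
    for link in get_links(page_url):
        if len(found) >= limit:
            break
        if link not in visited and link not in found:
            found.append(link)
            # Recurse into the newly found page to look for further links.
            collect_article_urls(link, get_links, limit, found, visited)
    return found


# Hypothetical in-memory site used instead of real HTTP requests.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/c", "/"],
    "/b": ["/a"],
    "/c": [],
}

print(collect_article_urls("/", lambda url: SITE.get(url, []), limit=3))
# prints ['/a', '/c', '/b']
```

The `visited` set prevents infinite loops on pages that link back to each other. For large sites, Python's recursion limit may make an explicit stack or queue a safer choice than direct recursion.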
- class lab_5_scraper.scraper.HTMLParser(full_url: str, article_id: int, config: Config)
Bases:
object
HTMLParser implementation.
- __init__(full_url: str, article_id: int, config: Config) → None
Initialize an instance of the HTMLParser class.
- _fill_article_with_meta_information(article_soup: BeautifulSoup) → None
Find meta information of the article.
- Parameters:
article_soup (bs4.BeautifulSoup) – BeautifulSoup instance
- _fill_article_with_text(article_soup: BeautifulSoup) → None
Find text of the article.
- Parameters:
article_soup (bs4.BeautifulSoup) – BeautifulSoup instance