lab_5_scraper package

Submodules

lab_5_scraper.scraper module

Crawler implementation.

class lab_5_scraper.scraper.Config(path_to_config: Path)

Bases: object

Class for unpacking and validating configurations.

__init__(path_to_config: Path) None

Initialize an instance of the Config class.

Parameters:

path_to_config (pathlib.Path) – Path to configuration.

_extract_config_content() ConfigDTO

Get config values.

Returns:

Config values

Return type:

ConfigDTO

_validate_config_content() None

Ensure configuration parameters are not corrupt.

get_encoding() str

Retrieve encoding to use during parsing.

Returns:

Encoding

Return type:

str

get_headers() dict[str, str]

Retrieve headers to use during requesting.

Returns:

Headers

Return type:

dict[str, str]

get_headless_mode() bool

Retrieve whether to use headless mode.

Returns:

Whether to use headless mode or not

Return type:

bool

get_num_articles() int

Retrieve total number of articles to scrape.

Returns:

Total number of articles to scrape

Return type:

int

get_seed_urls() list[str]

Retrieve seed URLs.

Returns:

Seed URLs

Return type:

list[str]

get_timeout() int

Retrieve number of seconds to wait for response.

Returns:

Number of seconds to wait for response

Return type:

int

get_verify_certificate() bool

Retrieve whether to verify certificate.

Returns:

Whether to verify certificate or not

Return type:

bool
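A minimal usage sketch for Config; the file name scraper_config.json is a placeholder, while the getters are exactly those documented above:

    from pathlib import Path

    from lab_5_scraper.scraper import Config

    # Hypothetical configuration path; point it at your project's config file.
    config = Config(path_to_config=Path("scraper_config.json"))

    # Each getter returns one validated configuration value.
    seed_urls = config.get_seed_urls()        # list[str]
    num_articles = config.get_num_articles()  # int
    headers = config.get_headers()            # dict[str, str]
    encoding = config.get_encoding()          # str
    timeout = config.get_timeout()            # int, seconds
    verify = config.get_verify_certificate()  # bool
    headless = config.get_headless_mode()     # bool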

class lab_5_scraper.scraper.Crawler(config: Config)

Bases: object

Crawler implementation.

__init__(config: Config) None

Initialize an instance of the Crawler class.

Parameters:

config (Config) – Configuration

_extract_url(article_bs: BeautifulSoup) str

Find and retrieve a URL from HTML.

Parameters:

article_bs (bs4.BeautifulSoup) – BeautifulSoup instance

Returns:

URL from HTML

Return type:

str

find_articles() None

Find articles.

get_search_urls() list

Get seed_urls param.

Returns:

seed_urls param

Return type:

list

url_pattern: Pattern | str

URL pattern
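A sketch of typical Crawler use, assuming a Config built as in the previous example. Note that find_articles() returns None, so the collected article URLs must be read from the crawler instance afterwards; the attribute that holds them is not documented here:

    from pathlib import Path

    from lab_5_scraper.scraper import Config, Crawler

    config = Config(path_to_config=Path("scraper_config.json"))  # hypothetical path
    crawler = Crawler(config=config)

    # get_search_urls() exposes the seed_urls parameter from the configuration.
    seeds = crawler.get_search_urls()

    # find_articles() crawls the seed pages and stores found article URLs internally.
    crawler.find_articles()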

class lab_5_scraper.scraper.CrawlerRecursive(config: Config)

Bases: Crawler

Recursive implementation.

Take one URL of the title page and find the requested number of articles recursively.

__init__(config: Config) None

Initialize an instance of the CrawlerRecursive class.

Parameters:

config (Config) – Configuration

find_articles() None

Find the requested number of article URLs.
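CrawlerRecursive is used the same way as the base class; only the traversal strategy differs, starting from a single title-page URL. A sketch, with the configuration path again a placeholder:

    from pathlib import Path

    from lab_5_scraper.scraper import Config, CrawlerRecursive

    config = Config(path_to_config=Path("scraper_config.json"))  # hypothetical path
    recursive_crawler = CrawlerRecursive(config=config)

    # Recursively follows links from the title page until enough articles are found.
    recursive_crawler.find_articles()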

class lab_5_scraper.scraper.HTMLParser(full_url: str, article_id: int, config: Config)

Bases: object

HTMLParser implementation.

__init__(full_url: str, article_id: int, config: Config) None

Initialize an instance of the HTMLParser class.

Parameters:
  • full_url (str) – Site url

  • article_id (int) – Article id

  • config (Config) – Configuration

_fill_article_with_meta_information(article_soup: BeautifulSoup) None

Find meta information of article.

Parameters:

article_soup (bs4.BeautifulSoup) – BeautifulSoup instance

_fill_article_with_text(article_soup: BeautifulSoup) None

Find text of article.

Parameters:

article_soup (bs4.BeautifulSoup) – BeautifulSoup instance

parse() Article | bool | list

Parse each article.

Returns:

Article instance

Return type:

Union[Article, bool, list]

unify_date_format(date_str: str) datetime

Unify date format.

Parameters:

date_str (str) – Date in text format

Returns:

Datetime object

Return type:

datetime.datetime
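A usage sketch for HTMLParser, assuming a Config as above; the URL and article id are placeholders. Per the signature of parse(), an Article is returned on success, while a bool or a list signals a failure mode:

    from pathlib import Path

    from lab_5_scraper.scraper import Config, HTMLParser

    config = Config(path_to_config=Path("scraper_config.json"))  # hypothetical path
    parser = HTMLParser(
        full_url="https://example.com/news/1",  # hypothetical article URL
        article_id=1,
        config=config,
    )

    article = parser.parse()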

lab_5_scraper.scraper.main() None

Entrypoint for scraper module.
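Since main() takes no arguments, running the pipeline programmatically is a one-liner:

    from lab_5_scraper.scraper import main

    main()  # runs the scraper pipeline end to end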

lab_5_scraper.scraper.make_request(url: str, config: Config) Response

Deliver a response from a request with the given configuration.

Parameters:
  • url (str) – Site url

  • config (Config) – Configuration

Returns:

A response from a request

Return type:

requests.models.Response
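A sketch of calling make_request directly; the URL and configuration path are placeholders, and the returned object is a standard requests Response:

    from pathlib import Path

    from lab_5_scraper.scraper import Config, make_request

    config = Config(path_to_config=Path("scraper_config.json"))  # hypothetical path
    response = make_request("https://example.com", config)       # hypothetical URL

    # The result carries the usual requests Response API.
    if response.status_code == 200:
        html = response.text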

lab_5_scraper.scraper.prepare_environment(base_path: Path | str) None

Create the ASSETS_PATH folder, removing any existing folder first.

Parameters:

base_path (Union[pathlib.Path, str]) – Path where articles are stored
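A sketch of preparing the output directory; the path below is a placeholder for wherever scraped articles should be stored:

    from pathlib import Path

    from lab_5_scraper.scraper import prepare_environment

    # Accepts a pathlib.Path or a str; (re)creates the folder for scraped articles.
    prepare_environment(Path("tmp/articles"))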