article package

Submodules

Article implementation.

class core_utils.ctlr.article.article.Article(url: str | None, article_id: int)

Bases: object

Article class implementation.

__init__(url: str | None, article_id: int) None

Initialize an instance of Article.

Parameters:
  • url (str | None) – Site url

  • article_id (int) – Article id

_conllu_info: str

CoNLL-U information

_date_to_text() str

Convert datetime object to text.

Returns:

Date in text format

Return type:

str

date: datetime | None

Article date

get_cleaned_text() str

Get cleaned text.

Returns:

Cleaned text.

Return type:

str

get_conllu_info() str

Get the sentences from ConlluArticle.

Returns:

Sentences from ConlluArticle

Return type:

str

get_conllu_text(include_morphological_tags: bool) str

Get the text in the CoNLL-U format.

Parameters:

include_morphological_tags (bool) – Whether to include morphological tags

Returns:

Text in the CoNLL-U format

Return type:

str

get_file_path(kind: ArtifactType) Path

Get a proper filepath for an Article instance.

Parameters:

kind (ArtifactType) – A variant of a file

Returns:

Path to the Article file of the requested kind

Return type:

pathlib.Path
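The mapping from an ArtifactType member to a file path is not spelled out in this reference. The following is a minimal sketch of one plausible scheme, not the library's implementation: the ASSETS_DIR directory, the sketch_file_path helper, and the "<id>_<kind><extension>" naming are all assumptions made for illustration. Only the enum values are taken from the ArtifactType entry in this document.

```python
from enum import Enum
from pathlib import Path


class ArtifactType(Enum):
    # Values copied from the ArtifactType entry in this reference.
    CLEANED = "cleaned"
    STANZA_CONLLU = "stanza_conllu"
    UDPIPE_CONLLU = "udpipe_conllu"


# Hypothetical storage directory; the real location is not documented here.
ASSETS_DIR = Path("assets")


def sketch_file_path(article_id: int, kind: ArtifactType) -> Path:
    """Build an illustrative artifact path (assumed naming scheme)."""
    # Assumption: CoNLL-U artifacts get a .conllu extension, others .txt.
    extension = ".conllu" if kind.value.endswith("conllu") else ".txt"
    return ASSETS_DIR / f"{article_id}_{kind.value}{extension}"


cleaned_path = sketch_file_path(1, ArtifactType.CLEANED)
conllu_path = sketch_file_path(7, ArtifactType.UDPIPE_CONLLU)
```

Under these assumptions, each artifact kind of the same article gets its own file, so several pipelines can store their output side by side.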

get_meta() dict

Get all meta params.

Returns:

Meta params

Return type:

dict

get_meta_file_path() Path

Get path for requested article’s meta info.

Returns:

Path to requested article’s meta info

Return type:

pathlib.Path

get_pos_freq() dict

Get the POS frequencies.

Returns:

POS frequency

Return type:

dict

get_raw_text() str

Get raw text from the article.

Returns:

Raw text from the article

Return type:

str

get_raw_text_path() Path

Get path for requested raw article.

Returns:

Path to requested raw article

Return type:

pathlib.Path

set_conllu_info(info: str) None

Set the CoNLL-U sentences attribute.

Parameters:

info (str) – CoNLL-U sentences

set_patterns_info(pattern_matches: dict) None

Set the pattern frequencies attribute.

Parameters:

pattern_matches (dict) – Syntactic patterns

set_pos_info(pos_freq: dict) None

Set the POS frequencies attribute.

Parameters:

pos_freq (dict) – POS frequencies
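To illustrate what a POS frequencies dictionary might contain, here is a small sketch that counts UPOS tags in CoNLL-U text. It relies only on the CoNLL-U layout of ten tab-separated fields per token line, with UPOS in the fourth field. The count_upos helper is hypothetical, and the reference does not say how the library itself builds the dictionary passed to set_pos_info.

```python
from collections import Counter


def count_upos(conllu_text: str) -> dict[str, int]:
    """Count UPOS tags (fourth tab-separated field) in CoNLL-U text."""
    counts: Counter[str] = Counter()
    for line in conllu_text.splitlines():
        # Skip blank sentence separators and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        # Skip malformed rows, multiword ranges (1-2) and empty nodes (1.1).
        if len(fields) != 10 or not fields[0].isdigit():
            continue
        counts[fields[3]] += 1
    return dict(counts)


sample = "\n".join([
    "# text = Cats sleep.",
    "1\tCats\tcat\tNOUN\t_\t_\t2\tnsubj\t_\t_",
    "2\tsleep\tsleep\tVERB\t_\t_\t0\troot\t_\t_",
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_",
])
pos_freq = count_upos(sample)
```

A dictionary of this shape, mapping tag names to counts, is the kind of value the pos_freq parameter is assumed to carry.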

class core_utils.ctlr.article.article.ArtifactType(value)

Bases: Enum

Types of artifacts that can be created by text processing pipelines.

CLEANED = 'cleaned'
STANZA_CONLLU = 'stanza_conllu'
UDPIPE_CONLLU = 'udpipe_conllu'

core_utils.ctlr.article.article.date_from_meta(date_txt: str) datetime

Convert text date to datetime object.

Parameters:

date_txt (str) – Date in text format

Returns:

Datetime object

Return type:

datetime.datetime
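The text-to-datetime conversion, together with its inverse described under _date_to_text, can be sketched with the standard library. The date format below is an assumption, since this reference does not state which format the functions expect, and the sketch_ names are hypothetical stand-ins rather than the library's code.

```python
from datetime import datetime

# Assumed date format; the format used by the real functions is not stated here.
DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def sketch_date_from_meta(date_txt: str) -> datetime:
    """Parse a text date into a datetime object."""
    return datetime.strptime(date_txt, DATE_FORMAT)


def sketch_date_to_text(date: datetime) -> str:
    """Format a datetime object back into text."""
    return date.strftime(DATE_FORMAT)


parsed = sketch_date_from_meta("2023-04-01 12:30:00")
as_text = sketch_date_to_text(parsed)
```

Using one shared format constant for both directions keeps the round trip lossless for any date that matches the format.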

core_utils.ctlr.article.article.get_article_id_from_filepath(path: Path) int

Extract the article id from its path.

Parameters:

path (pathlib.Path) – Path to article

Returns:

Article id

Return type:

int
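As an illustration only, the id extraction could look like the sketch below. It assumes that file names begin with the numeric article id followed by an underscore (for example "12_raw.txt"); this naming convention and the sketch_article_id helper are assumptions, not something this reference confirms.

```python
from pathlib import Path


def sketch_article_id(path: Path) -> int:
    """Extract a leading numeric id from a file name such as '12_raw.txt'."""
    # Assumption: the stem starts with the id, followed by "_".
    return int(path.stem.split("_")[0])


raw_id = sketch_article_id(Path("assets") / "12_raw.txt")
conllu_id = sketch_article_id(Path("3_udpipe_conllu.conllu"))
```

A file name that does not start with digits would raise ValueError in this sketch; how the real function handles that case is not documented here.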

core_utils.ctlr.article.article.split_by_sentence(text: str) list[str]

Split the given text by sentence separators.

Parameters:

text (str) – Raw text to split

Returns:

List of sentences

Return type:

list[str]
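The reference does not say which separators are used. A deliberately simple sketch, assuming a sentence ends at ".", "!" or "?" followed by whitespace, is shown below; the pattern and the sketch_split_by_sentence name are hypothetical.

```python
import re

# Assumed boundary: whitespace that directly follows ".", "!" or "?".
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def sketch_split_by_sentence(text: str) -> list[str]:
    """Split text into sentences on the assumed boundary pattern."""
    # Drop empty parts so that empty input yields an empty list.
    return [part for part in SENTENCE_BOUNDARY.split(text.strip()) if part]


sentences = sketch_split_by_sentence("It rained. Did you see? Yes!")
```

This naive pattern also splits after abbreviations such as "e.g.", so a production splitter would need extra rules.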

I/O operations for Article.

core_utils.ctlr.article.io.from_meta(path: Path | str, article: Article | None = None) Article

Load meta.json file into the Article abstraction.

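Parameters:
  • path (Union[pathlib.Path, str]) – Path to meta.json file

  • article (Optional[Article]) – Article instance

Returns: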

Article instance

Return type:

Article

core_utils.ctlr.article.io.from_raw(path: Path | str, article: Article | None = None) Article

Load raw text and create an Article with it.

Parameters:
  • path (Union[pathlib.Path, str]) – Path to article raw text

  • article (Optional[Article]) – Article instance

Returns:

Article instance

Return type:

Article

core_utils.ctlr.article.io.to_cleaned(article: Article) None

Save cleaned text.

Parameters:

article (Article) – Article instance

core_utils.ctlr.article.io.to_meta(article: Article) None

Save metafile.

Parameters:

article (Article) – Article instance

core_utils.ctlr.article.io.to_raw(article: Article) None

Save raw text.

Parameters:

article (Article) – Article instance
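The save and load round trip behind to_meta and from_meta can be sketched with a plain dictionary standing in for an Article. Everything here is an assumption made for illustration: the meta keys, the "1_meta.json" file name, and the sketch_ helpers. The real functions take and return Article instances.

```python
import json
import tempfile
from pathlib import Path


def sketch_to_meta(meta: dict, path: Path) -> None:
    """Write meta params to a JSON file."""
    with path.open("w", encoding="utf-8") as file:
        json.dump(meta, file, ensure_ascii=False, indent=4)


def sketch_from_meta(path: Path) -> dict:
    """Read meta params back from a JSON file."""
    with path.open(encoding="utf-8") as file:
        return json.load(file)


with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name and keys, chosen only for the example.
    meta_path = Path(tmp_dir) / "1_meta.json"
    original = {"id": 1, "url": "https://example.com/1", "title": "Example"}
    sketch_to_meta(original, meta_path)
    restored = sketch_from_meta(meta_path)
```

Writing with ensure_ascii=False keeps non-ASCII titles readable in the saved file, which matters for corpora in languages other than English.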