Short summary of lectures
Lecture 1. Introduction to the technical track
Web scraping as a craft. The place of the technical track in the overall discipline: conceptually and in the assessment formula. Technical track overview. Programming assignment overview.
Lecture 2. Third-party libraries, browser headers
Client-server architecture in the World Wide Web. Types of HTTP methods: `GET`, `POST`, `DELETE`, `PUT`. Request. Response.
Python package manager `pip`. `requirements.txt` as a manifest of project dependencies.
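As a sketch, a project's `requirements.txt` lists one dependency per line (the packages and versions below are illustrative), and `pip install -r requirements.txt` installs them all at once:

```text
# requirements.txt -- one dependency per line, versions are illustrative
requests==2.31.0
beautifulsoup4==4.12.3
```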
The `requests` library for sending requests to a server. `requests.get` to get the HTML code of a page.
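A minimal sketch of fetching a page's HTML with `requests.get` (the URL is a placeholder):

```python
import requests

# Send a GET request and read the HTML of the response body
response = requests.get("https://example.com")
html_code = response.text  # the page's HTML as a string
print(html_code[:200])     # preview the first 200 characters
```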
Making requests with the `requests` API. The idea of mimicking human-made requests.
Tip no. 1: random timeouts between calls.
Tip no. 2: sending requests with headers taken from the browser. Obtaining the headers. The API for sending a request with headers.
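Both tips can be combined in one loop: sleep a random number of seconds between calls and pass browser-like headers to `requests.get`. The header values and URLs below are placeholders; real values are copied from the browser's developer tools:

```python
import random
import time

import requests

# Headers copied from a real browser session (values here are placeholders)
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept-Language": "en-US,en;q=0.9",
}

urls = ["https://example.com/page/1", "https://example.com/page/2"]

for url in urls:
    # Tip no. 2: send the request with browser headers
    response = requests.get(url, headers=headers)
    print(url, len(response.text))
    # Tip no. 1: random timeout between calls to look less like a bot
    time.sleep(random.uniform(1.0, 3.0))
```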
Checking the request status: implicit cast to `bool`, checking the status code, switching on exception raising.
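The three status checks might look like this in code (the URL is again a placeholder):

```python
import requests

response = requests.get("https://example.com")

# 1. Implicit cast to bool: truthy when the status code is below 400
if response:
    print("Request succeeded")

# 2. Explicit check of the status code
if response.status_code == 200:
    print("Got HTTP 200 OK")

# 3. Switch on exception raising: raises requests.HTTPError for 4xx/5xx codes
response.raise_for_status()
```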
Lecture 3. HTML structure. The `bs4` library
Introduction to HTML scraping. Key strategies for finding elements: by `id`, by class, by tag name, by child-parent relations, and by combinations of the aforementioned approaches. Making requests with the `requests` API. Extracting headers from the browser. Making randomized sleeps in code. `bs4`: installation, basic API. Finding elements in an HTML page with `find` and `find_all`.
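A sketch of the basic `bs4` workflow (installed with `pip install beautifulsoup4`); the URL, tag names, ids, and classes below are made up for illustration:

```python
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com")
soup = BeautifulSoup(response.text, "html.parser")

# By tag name: the first <h1> on the page
title = soup.find("h1")

# By id: ids are unique, so find() is enough
header = soup.find(id="header")

# By class: find_all() returns every matching element
articles = soup.find_all("div", class_="article")

# By child-parent relations, combined with the approaches above
for article in articles:
    link = article.find("a")  # first <a> inside this <div>
    if link is not None:
        print(link.get("href"), link.get_text(strip=True))
```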
Lecture 4. Filesystem with `pathlib`. Dates
Paths as a unique description of a file's position. Paths: relative and absolute. `pathlib` as the recommended library for creating, writing, and reading files. Construction of paths with the forward slash. Recommendation to build paths based on the `__file__` global variable. Basic API: `exists`, `glob`, `mkdir`.
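A sketch of this basic `pathlib` API, building paths from `__file__` with the forward-slash operator (the directory and file names are illustrative):

```python
from pathlib import Path

# Directory containing this script, derived from the __file__ global variable
project_dir = Path(__file__).parent

# The / operator joins path components
data_dir = project_dir / "data"
data_dir.mkdir(parents=True, exist_ok=True)  # create the directory if missing

output_file = data_dir / "page.html"
output_file.write_text("<html></html>", encoding="utf-8")

print(output_file.exists())            # True once the file is written
print(list(data_dir.glob("*.html")))   # all .html files in data_dir
```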
Removal of directories: `rmdir` versus `shutil.rmtree`. Removal of files: `unlink`.
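In a sketch (paths are illustrative): `Path.unlink` deletes a single file, `Path.rmdir` removes only an empty directory, while `shutil.rmtree` removes a directory together with its contents:

```python
import shutil
from pathlib import Path

data_dir = Path("data")

# Remove a single file
(data_dir / "page.html").unlink()

# rmdir works only on a directory that is already empty ...
data_dir.rmdir()

# ... while shutil.rmtree removes a directory with all of its contents
shutil.rmtree("old_data", ignore_errors=True)
```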
Date management as one of the most challenging tasks in data analysis. The `datetime` module for processing dates. Basic API: the `datetime` class, the `strftime` method for formatting dates and the `strptime` class method for parsing them from strings.
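A sketch of the corresponding `datetime` calls; the format strings and the date value are illustrative:

```python
from datetime import datetime

# Parse a date from a string with strptime (a classmethod of datetime)
parsed = datetime.strptime("2024-01-31 18:30", "%Y-%m-%d %H:%M")

# Format a datetime object back into a string with strftime
formatted = parsed.strftime("%d.%m.%Y")

print(parsed)      # 2024-01-31 18:30:00
print(formatted)   # 31.01.2024
```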