We were not (and probably won't be) using any worthwhile `requests`
features (besides `raise_for_status()`), and the `timeout` session
parameter propagation vs. adapter plugging "thing" in requests just
annoys me deeply (not that kind of "... Human (TM)").
- skipping the processing of an existing target output file
- skipping the download of an existing target stream file
- resuming the download of an existing target stream temporary file
  using an HTTP range request (sketched below)
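In sketch form, the resume logic amounts to something like the
following; the helper name, the chunk size and the use of `urllib` are
illustrative, not the actual code:

```python
# Illustrative sketch only: resume a partial download via an HTTP Range request.
import os
import urllib.request


def resume_download(url, temp_path, chunk_size=64 * 1024):
    offset = os.path.getsize(temp_path) if os.path.exists(temp_path) else 0
    request = urllib.request.Request(url, headers={"Range": f"bytes={offset}-"})
    with urllib.request.urlopen(request) as response:
        # 206 Partial Content: the server honored the range, append to the file.
        # Anything else: the whole body is being resent, start from scratch.
        mode = "ab" if response.status == 206 else "wb"
        with open(temp_path, mode) as output:
            while chunk := response.read(chunk_size):
                output.write(chunk)
```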
TV series that list episodes through many `collection_subcollection_*`
zones (one per season):
- RC-023217__acquitted.json
- RC-022923__cry-wolf.json
Other collections that list items in a single `collection_videos_*` zone:
- RC-023013__l-incroyable-periple-de-magellan.json
- RC-023242__bandes-de-pirates.json
Significant rewrite after model modification: introducing `*Sources`
objects that encapsulate metadata and fetch information (urls,
protocols). The API (#20) is organized as pipe elements with sources
being what flows through the pipe.
1. fetch program sources
2. fetch rendition sources
3. fetch variant sources
4. fetch targets
5. process (download+mux) targets
Some user selection filters or modifiers could then be applied at any
step of the pipe (see the sketch below). Our `__main__.py` is an
implementation of that scheme.
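Very roughly, and with every name below made up for illustration, the
pipe reads like this:

```python
# Hypothetical sketch: it only shows how sources flow from one pipe element
# to the next; a user selection filter/modifier fits between any two steps.
def run_pipe(url, fetch_program_sources, fetch_rendition_sources,
             fetch_variant_sources, fetch_targets, process_targets,
             select=lambda sources: sources):
    program_sources = fetch_program_sources(url)                          # 1
    rendition_sources = select(fetch_rendition_sources(program_sources))  # 2
    variant_sources = select(fetch_variant_sources(rendition_sources))    # 3
    targets = fetch_targets(variant_sources)                              # 4
    process_targets(targets)                                              # 5: download + mux
```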
Implied modifications include:
- Failing later on unsupported protocols: the check used to live in
  `api` and is now in `hls`. This opens the possibility to filter them
  out and/or support them later.
- Giving up on honoring HTTP ranges for MP4 downloads; they are now
  stream-downloaded in fixed-size chunks instead.
- Cleaning up the `hls` module by moving the main download function to
  `__init__` and the specific (MP4/VTT) download functions to a new
  `download` module.
Side modifications include:
- The progress handler now shows download rates.
- The naming utilities now provide rendition and variant code insertion.
- Parts are now downloaded to working directories, and unnecessary
  re-downloads are skipped on failure.
This was a big change for a single commit... too big of a change maybe.
In order to catch errors related to the assumed JSON schema, regroup all
JSON data access under a context manager (sketched below) that catches
the related errors:
- KeyError
- IndexError
- ValueError
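A minimal sketch of the idea; the exception name and the wrapped
accesses are illustrative, not the actual code:

```python
from contextlib import contextmanager


class UnexpectedJSONError(Exception):
    """The JSON did not match the schema we assumed."""


@contextmanager
def json_access(description):
    # Any schema-related failure surfaces as a single, project-level error.
    try:
        yield
    except (KeyError, IndexError, ValueError) as error:
        raise UnexpectedJSONError(f"{description}: {error!r}") from error


# Usage: every access to the assumed JSON schema goes through the manager.
# with json_access("program metadata"):
#     title = data["attributes"]["metadata"]["title"]
```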
Changes the way the program information is figured out: from URL
parsing to page content parsing.
A massive JSON object is shipped within the HTML of the page; that's
where we get what we need.
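For illustration only, extracting it could look like the sketch below;
the `__NEXT_DATA__` script tag is an assumption about the page layout,
not something this change guarantees:

```python
# Hypothetical sketch: fetch the page and pull the embedded JSON out of it.
import json
import re
import urllib.request


def fetch_page_json(url):
    with urllib.request.urlopen(url) as response:
        html = response.read().decode("utf-8")
    match = re.search(
        r'<script[^>]*id="__NEXT_DATA__"[^>]*>(.*?)</script>', html, re.DOTALL
    )
    if not match:
        raise ValueError("embedded JSON not found in page")
    return json.loads(match.group(1))
```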
Side effects:
- drop `slug` from the program's info
- drop `slug` naming option
- no `Program` / `ProgramMeta` distinction
Includes some JSON samples.
Change/add/rename model's data structures in order to provide a more
useful API #20, introducing new structures:
- `Sources`: summarizing program, renditions and variants found
at a given ArteTV page URL
- `Target`: summarizing all required data for a download
And new functions:
- `fetch_sources()` to build the `Sources` from a URL
- `iter_[renditions|variants]()` to describe the available options for
  the `Sources`
- `select_[renditions|variants]()` to narrow down the desired options
for the `Sources`
- `compile_sources()` to compute such a `Target` from `Sources`
- `download_target()` to download such a `Target`
Finally, this should make the playlist handling #7 easier (I know, I've
said that before).
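Chained together, the intended usage reads roughly like this sketch;
only the function names come from the list above, the arguments and
return shapes are assumptions:

```python
# Hedged sketch: `...` marks arguments whose exact shape is not decided here.
sources = fetch_sources(url)

for rendition in iter_renditions(sources):   # inspect the available renditions
    print(rendition)
for variant in iter_variants(sources):       # inspect the available variants
    print(variant)

sources = select_renditions(sources, ...)    # narrow down the renditions
sources = select_variants(sources, ...)      # narrow down the variants

target = compile_sources(sources, ...)       # everything needed for a download
download_target(target)
```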
Move all error definitions to the `error` module.
In `__init__`:
- Remove imports from global scope
- Import all from `model` module
- Import all from `error` module
Refactor: `fetch_sources()` to take the URL as an argument
Coding style: import definitions from `error` and `model`
Remove the dependency on `webvtt-py`, which was both too much and not
enough for our use case.
Implement a basic WebVTT-to-SRT converter tailored to ArteTV's usage of
WebVTT features.
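The converter boils down to something like the following sketch,
assuming plain cues (an optional cue id, a timing line, then text),
which is roughly what ArteTV serves:

```python
# Minimal sketch of a WebVTT-to-SRT conversion for plain, unstyled cues.
import re

TIMING = re.compile(
    r"(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})\s*-->\s*(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})"
)


def webvtt_to_srt(vtt_text):
    blocks = vtt_text.replace("\r\n", "\n").strip().split("\n\n")
    cues = []
    for block in blocks:
        lines = block.splitlines()
        for index, line in enumerate(lines):
            match = TIMING.search(line)
            if match:
                break
        else:
            continue  # header, NOTE or STYLE block: skip it
        h1, m1, s1, ms1, h2, m2, s2, ms2 = match.groups()
        timing = "{}:{}:{},{} --> {}:{}:{},{}".format(
            h1 or "00", m1, s1, ms1, h2 or "00", m2, s2, ms2
        )
        text = "\n".join(lines[index + 1:])
        cues.append("{}\n{}\n{}".format(len(cues) + 1, timing, text))
    return "\n\n".join(cues) + "\n"
```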
- Rename variables and functions to reflect model names.
- Convert infrastructure data (JSON, M3U8) to model types.
- Change algorithms to produce/consume the `Source` model, in particular
  using generator functions to build a list of `Source`s rather than the
  opaque `rendition => variant => urls` mapping (this will make #7 very
  straightforward).
- Download all master playlists right after the API call, before
  selecting renditions/variants.
Motivation for the last point:
We used to offer rendition selection right after the API call, before
downloading the appropriate master playlist to figure out the available
variants.
The problem with that is that ArteTV's rendition codes (given by the
API) do not necessarily include complete language information when the
language is not French or German; for instance, an original audio track
in Portuguese would show up as `VOEU-` (as in "EUropean"). The actual
mention of Portuguese only shows up in the master playlist.
So, the new implementation actually downloads all master playlists
straight after the API call. This is a bit wasteful, but I figured it
was necessary to provide quality interaction with the user.
Bonus? Now, when we first prompt the user for a rendition choice, we
already know the available variants; maybe we will make use of that
fact in the future...
A bunch of data structures to be used instead of the infrastructure
types, i.e. JSON for the API and M3U8 for HLS.
It should provide a stronger decoupling of the modules and pave the way
for #7 and #8.
The implementation uses `namedtuple`s as they are trivial to test for
equality and natively hashable (they can be used in `set`s or as keys
in `dict`s), which is useful for deduping, for instance (see below).
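For instance (field names made up), this is the property we rely on:

```python
from collections import namedtuple

# Illustrative structure only; the real model fields differ.
Rendition = namedtuple("Rendition", ["code", "language", "label"])

a = Rendition("VOF", "fr", "original French")
b = Rendition("VOF", "fr", "original French")

assert a == b            # equality is by value, no __eq__ to write
assert len({a, b}) == 1  # hashable by value, so a set dedupes them
```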
Creation of a `common.Error` exception whose string representation is
taken from its docstring.
Creation of a `common.UnexpectedError` to serve as a base for exceptions
raised while checking assumptions on requests and responses.
The latter are handled by displaying a message inviting the user to
report the error to us, so we can correct our assumptions.
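In sketch form (class names from this change, docstrings and handler
made up):

```python
class Error(Exception):
    """A generic error of ours."""

    def __str__(self):
        # The user-facing message is simply the (subclass) docstring.
        return self.__doc__


class UnexpectedError(Error):
    """An assumption we made about a request or a response did not hold."""


# A top-level handler can then invite the user to report the problem:
# except UnexpectedError as error:
#     print(error)
#     print("Please report this so we can fix our assumptions.")
```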
- versions => renditions
- resolutions => variants
- ranges and/or chunks => segments
- version index => master playlist
- other index => media playlist url
For now, the CLI has not been updated with this terminology, only the
code.