support for collections #28

Merged
Barbagus merged 7 commits from collections into stable 2023-01-24 19:26:06 +00:00
Collaborator

Set up the work to address #7. The goal is to be able to download a collection of related videos the way they are grouped on the ArteTV website. The best example of that is series. A few remarks first:

  1. A collection URL is indistinguishable from a program URL. Yes, it looks like collection IDs and program IDs have different patterns (RC-023013 vs 105612-000-A), but we already rely on the URL shape to start with; relying then on the ID shape... feels like too many assumptions. These assumptions are just more opportunities/places in our code that may break the day ArteTV changes their inner mechanisms a bit. Who's to say there are no other ID patterns that we just never came across so far?

  2. Collection IDs do return a valid ConfigPlayer API object, but it is the one of the first program in the collection. This is an indicator that we may be dealing with a collection URL. So again, another dependency on ArteTV inner mechanisms.

  3. For every ConfigPlayer object, an associated Playlist object can be fetched from the API. However, despite what we might think, that object only references the previous, current and next programs for that particular program. Using that strategy leads us to make two API calls for each program in a collection.

  4. If we fetch a Playlist API object with a collection ID, bingo! We get the entire list of programs within that collection... HOWEVER, if our collection is actually a collection of collections, nothing useful is returned. Collections of collections are rather common: all fictional series I found are collections containing one collection per season. (See the sketch below.)

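As a rough illustration of remark 4, here is a sketch of the Playlist lookup. The endpoint shape and the `items`/`providerId` field names are guesses at ArteTV's undocumented internals, not verified details:

```python
import requests

# Hypothetical sketch of remark 4. The endpoint shape and the field names
# ("items", "providerId") are assumptions about ArteTV's undocumented API.
API = "https://api.arte.tv/api/player/v2"

def list_collection_programs(collection_id: str, lang: str = "fr") -> list[str]:
    resp = requests.get(f"{API}/playlist/{lang}/{collection_id}", timeout=10)
    resp.raise_for_status()
    items = resp.json()["data"]["attributes"].get("items", [])
    # Flat collection: one entry per program. Collection of collections
    # (e.g. one sub-collection per season): nothing useful comes back.
    return [item["providerId"] for item in items]
```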
I feel like we are in a "if you gotta do something wrong, do it right" situation.

The approach I would like to go for is to actually download and inspect the HTML code for the given URL. So no URL parsing, no collection ID or program ID guessing, no "let's assume this is a program ID and see if it fails" strategy, etc. As of today, the HTML code of the pages does contain a massive chunk of JSON data. I imagine this is the data the page is actually built/hydrated from. It does contain all the info we need. If they ever change some implementation, that will be where we need to maintain. Feels like a single point of assumption/failure.

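To make that concrete, here is a minimal sketch of the page-inspection approach, assuming the hydration data sits in a single JSON `<script>` tag (the tag id used here is an assumption, not a verified detail of ArteTV's pages):

```python
import json
import re

import requests

# Sketch only: the "__NEXT_DATA__" id is an assumed marker for the page's
# embedded hydration JSON; the real tag on ArteTV pages may differ.
SCRIPT_RE = re.compile(
    r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
    re.DOTALL,
)

def fetch_page_data(url: str) -> dict:
    html = requests.get(url, timeout=10).text
    match = SCRIPT_RE.search(html)
    if match is None:
        raise ValueError(f"no embedded JSON data found at {url}")
    return json.loads(match.group(1))
```

From that single object we can tell programs from collections (and collections of collections) without guessing at ID shapes.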
Author
Collaborator

Another remark: going forward with this makes me reconsider db0a954497's last point.

We used to offer rendition choosing right after the API call, before we
download the appropriate master playlist to figure out the available
variants.

The problem with that is that ArteTV's codes for the renditions (given
by the API) do not necessarily include complete language information
(if it is not French or German); for instance, an original audio track
in Portuguese would show as VOEU- (as in "EUropean"). The actual
mention of Portuguese would only show up in the master playlist.

So, the new implementation actually downloads all master playlists
straight after the API call. This is a bit wasteful, but I figured it
was necessary to provide quality interaction with the user.

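For reference, the language information in question lives in the master playlist's `EXT-X-MEDIA` tags; here is a minimal sketch of pulling it out (standard HLS attribute syntax, not delarte's actual parser):

```python
import re

# Extract (LANGUAGE, NAME) pairs from the EXT-X-MEDIA audio tags of an HLS
# master playlist; this is where the real original language shows up.
MEDIA_RE = re.compile(r"#EXT-X-MEDIA:(.*)")
ATTR_RE = re.compile(r'([A-Z0-9-]+)=("[^"]*"|[^,]*)')

def audio_languages(master_playlist: str) -> list[tuple[str, str]]:
    tracks = []
    for line in master_playlist.splitlines():
        match = MEDIA_RE.match(line)
        if not match:
            continue
        attrs = {k: v.strip('"') for k, v in ATTR_RE.findall(match.group(1))}
        if attrs.get("TYPE") == "AUDIO":
            tracks.append((attrs.get("LANGUAGE", "?"), attrs.get("NAME", "?")))
    return tracks
```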
In the case of a 20-episode series offering 7 renditions (audio/subtitle configurations), this implementation would lead to 20 + 7*20 = 160 API calls before we even start. This is not reasonable and outweighs the benefit of knowing the actual original language (when neither German nor French).

Don't get me wrong, it is very annoying for ArteTV not to provide that information, even as a web user...

However, if we use the rendition information from the PlayerConfig API object, then we do not need to come up with rendition codes and labels ourselves like we do now, which is probably a nest of bugs to come.

Author
Collaborator

> The approach I would like to go for is to actually download and inspect the HTML code for the given URL. So no URL parsing, no collection ID or program ID guessing, no "let's assume this is a program ID and see if it fails" strategy, etc. As of today, the HTML code of the pages does contain a massive chunk of JSON data. I imagine this is the data the page is actually built/hydrated from. It does contain all the info we need. If they ever change some implementation, that will be where we need to maintain. Feels like a single point of assumption/failure.

Also, with that approach, we will be able to fail more comprehensibly for the user:

  • either the page exists or it doesn't
  • either we can read it or we can't
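
In code, those two failure modes might look like this (the exception names and the tag check are hypothetical, echoing the sketch above):

```python
import requests

# Hypothetical sketch: distinct, user-reportable failures for "the page
# does not exist" vs "the page exists but we cannot read it".
class PageNotFound(Exception):
    """The page does not exist (HTTP 404)."""

class PageNotReadable(Exception):
    """The page exists but the embedded JSON could not be located."""

def fetch_page(url: str) -> str:
    resp = requests.get(url, timeout=10)
    if resp.status_code == 404:
        raise PageNotFound(url)
    resp.raise_for_status()
    if "__NEXT_DATA__" not in resp.text:  # marker assumed, as sketched above
        raise PageNotReadable(url)
    return resp.text
```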
Barbagus added 3 commits 2023-01-24 07:26:23 +00:00
639a8063a5 Get program information from page content
Changes the way the program information is figured out: from URL
parsing to page content parsing. A massive JSON object is shipped
within the HTML of the page; that's where we get what we need.

Side effects:
 - drop `slug` from the program's info
 - drop `slug` naming option
 - no `Program` / `ProgramMeta` distinction

Includes some JSON samples.
ed5ba06a98 Implement a "schema guard" for `api` module
In order to catch errors related to the assumed JSON schema, regroup
all JSON data access under a context manager that catches the related
errors:
- KeyError
- IndexError
- ValueError
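A minimal sketch of what such a guard might look like (the exception name is hypothetical, not necessarily what the commit uses):

```python
from contextlib import contextmanager

class UnexpectedAPIResponse(Exception):
    """Raised when the JSON we got does not match the schema we assumed."""

@contextmanager
def schema_guard():
    # Funnel every JSON access through one place so a schema change on
    # ArteTV's side surfaces as a single, explicit error.
    try:
        yield
    except (KeyError, IndexError, ValueError) as error:
        raise UnexpectedAPIResponse("unexpected JSON schema") from error

# Usage:
#     with schema_guard():
#         title = data["attributes"]["metadata"]["title"]
```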
Barbagus added 1 commit 2023-01-24 08:30:13 +00:00
56c1e8468a Split program/rendition/variant/target operations
Significant rewrite after model modification: introducing `*Sources`
objects that encapsulate metadata and fetch information (URLs,
protocols). The API (#20) is organized as pipe elements, with sources
being what flows through the pipe.
    1. fetch program sources
    2. fetch rendition sources
    3. fetch variant sources
    4. fetch targets
    5. process (download+mux) targets
Some user selection filters or modifiers could then be applied at any
step of the pipe. Our `__main__.py` is an implementation of that scheme
(see the sketch below).

Implied modifications include:
 - Later failure on unsupported protocols, used to be in `api`, now in
   `hls`. This offers the possibility to filter and/or support them
   later.
 - Give up honoring HTTP ranges for mp4 downloads; stream-download
   them in fixed chunks instead.
 - Cleaning up of the `hls` module, moving the main download function
   to `__init__` and specific (mp4/vtt) download functions to a new
   `download` module.

On the side modifications include:
 - The progress handler showing download rates.
 - The naming utilities providing rendition and variant code insertion.
 - Download parts to working directories and skip unnecessary
   re-downloads on failure.

This was a big change for a single commit... too big of a change maybe.
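To picture the pipe, here is a toy end-to-end sketch; every step is a stub and the names/signatures are assumptions, not delarte's actual API:

```python
# Toy sketch of the pipe scheme; each step stands in for the real delarte
# operation of the same number in the commit message.

def fetch_program_sources(url: str) -> list[dict]:
    return [{"id": "105612-000-A", "title": "Episode 1"}]    # 1. stub

def fetch_rendition_sources(programs: list[dict]) -> list[dict]:
    return [p | {"rendition": "VOF-STA"} for p in programs]  # 2. stub

def fetch_variant_sources(renditions: list[dict]) -> list[dict]:
    return [r | {"variant": "1080p"} for r in renditions]    # 3. stub

def fetch_targets(variants: list[dict]) -> list[dict]:
    return [v | {"url": "https://example.invalid/master.m3u8"}
            for v in variants]                               # 4. stub

def process_targets(targets: list[dict]) -> None:
    for target in targets:                                   # 5. stub
        print("would download and mux", target)

# Selection filters slot between any two steps, e.g. keep one rendition:
programs = fetch_program_sources("https://www.arte.tv/fr/videos/...")
renditions = [r for r in fetch_rendition_sources(programs)
              if r["rendition"] == "VOF-STA"]
process_targets(fetch_targets(fetch_variant_sources(renditions)))
```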
Barbagus added 1 commit 2023-01-24 09:52:47 +00:00
3ca02e8e42 Include collection www/json samples
TV series that list episodes through many `collection_subcollection_*`
zones (one per season):
 - RC-023217__acquitted.json
 - RC-022923__cry-wolf.json

Other collections that list their items in one `collection_videos_*` zone:
 - RC-023013__l-incroyable-periple-de-magellan.json
 - RC-023242__bandes-de-pirates.json
Barbagus added 1 commit 2023-01-24 19:00:00 +00:00
Barbagus added 1 commit 2023-01-24 19:25:21 +00:00
Barbagus changed title from WIP: support for collections to support for collections 2023-01-24 19:25:52 +00:00
Barbagus merged commit 57da060e73 into stable 2023-01-24 19:26:06 +00:00
Barbagus deleted branch collections 2023-01-24 19:26:06 +00:00