WIP: Refactor object model #30

Barbagus · 2023-02-14T09:04:23Z

Barbagus commented

2023-02-14 09:04:23 +00:00

The _PRV_ object model (`Program`, `Rendition`, `Variant`) implementation is a bit annoying:

1. For aesthetic reasons, the intermediate URLs (or _sources_) are not embedded in the objects themselves because they are _consumed_ along the way: the URL pointing to the file containing rendition information is "replaced" by that rendition information once we have it. The downside is that we end up passing `(objects, sources)` tuples and/or parameters around, which makes the code less readable.
2. The three structures are inherently interconnected and actually form a tree (_programs_ have _renditions_, _renditions_ have _variants_), except that the code does not reflect this. Again, we end up using `(Program[, Rendition[, Variant]])` tuples and/or parameters, which makes the code less readable.
3. By definition, a _rendition_ is supposed to be a chosen combination of audio/subtitles languages and types. In our implementation it is not, mainly because the _rendition_ information we get from the API does not state this explicitly; it only implies it through the code (VOF-STF, VO-ST[ITA], ...). Conversely, a _variant_ should be a chosen _quality_ for the tracks, and that is not obvious in the model definition (in fact, our `Variant` looks a bit too much like what a _rendition_ should be, and our `Rendition` does not carry much information).

Addressing those points one by one:

1. This could be addressed with a few more _named tuples_, but then we would have to come up with clever names, and in the end it just adds a lot of code. As the amount of data we are dealing with is tiny, we could simply grow our data structures to include the `sources`.
2. Same as above, with the tree structure included.
3. Limit the scope of the model to the top-level API functions only, and let each module (`www`, `api`, `hls`...) have its "own model" based on the actual data it handles.
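
To make the first point concrete, here is a minimal sketch contrasting the two styles. All names in it (`ProgramTuple`, `renditions_url`, the example URL) are made up for illustration and are not taken from the actual code base.

```python
from dataclasses import dataclass
from typing import NamedTuple


class ProgramTuple(NamedTuple):
    """Model without the source: the URL has to travel alongside it."""

    id: str
    title: str


@dataclass(frozen=True)
class Program:
    """Grown model: the source URL is simply one more field."""

    id: str
    title: str
    renditions_url: str  # hypothetical name for the intermediate URL


def describe_pair(pair: tuple[ProgramTuple, str]) -> str:
    # Callers must remember what the second element of the tuple means.
    program, source = pair
    return f"{program.title} <- {source}"


def describe(program: Program) -> str:
    # The object is self-describing, there is no tuple to unpack.
    return f"{program.title} <- {program.renditions_url}"


pair = (ProgramTuple("p1", "Some program"), "https://example.invalid/p1.json")
grown = Program("p1", "Some program", "https://example.invalid/p1.json")
```

Both functions produce the same text; the difference is only in how much the caller has to know about the shape of what it passes around.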


Barbagus commented

2023-02-14 09:05:52 +00:00

About the tree structure (point 2), I see two approaches:

1. The parent has a list of children: `Program` has a list of `Rendition`, `Rendition` a list of `Variant`.
2. Each child embeds its parent: `Variant` has a `Rendition`, `Rendition` has a `Program`.

Although the second option makes for bigger objects and duplicates the parent for each child, it is the one I favor:

1. When we load a `Program`, we do not have rendition information yet, so we would have to initialize an empty list of `Rendition` and only populate it later. This makes the intermediate `Program` a false representation: it does have some `Rendition`s, we just don't know about them yet.
2. The main objective of the script is to prune the tree in order to select exactly one `Rendition` per `Program` and exactly one `Variant` per `Rendition`. So, again, there is little benefit in actually dealing with lists of children.
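
As a rough sketch of the second option, assuming frozen dataclasses and invented field names (`audio_language`, `height`) that are not the real model:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Program:
    id: str
    title: str


@dataclass(frozen=True)
class Rendition:
    program: Program  # the child embeds its parent
    audio_language: str  # hypothetical field


@dataclass(frozen=True)
class Variant:
    rendition: Rendition  # again, the child embeds its parent
    height: int  # hypothetical stand-in for "quality"


def best_variant(variants: list[Variant]) -> Variant:
    """Prune to exactly one variant: here, the highest resolution."""
    return max(variants, key=lambda v: v.height)


program = Program("p1", "Some program")
rendition = Rendition(program, "fra")
variants = [Variant(rendition, 720), Variant(rendition, 1080)]

chosen = best_variant(variants)
# The full path back to the program is reachable from the leaf alone.
print(chosen.rendition.program.title, chosen.height)
```

Pruning then boils down to filtering a flat list of leaves, and a `Program` never exists in a half-populated state. Note that the parent is referenced, not copied, so the duplication cost is one reference per child.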


Barbagus commented

2023-02-14 09:37:27 +00:00

About the semantics of the model (point 3):

`Variant` should carry the information about audio/subtitles languages and types (audio description, hearing impaired, etc.). These can be inferred from the rendition codes (VOF-STF, VO-ST[ITA], ...). Then we do not need to carry the code itself (which, by the way, has a horrific format) and can leave it to the human interface to reconstruct an identifier based on the `Variant` data.
The main problem is that the code doesn't _always_ tell us the audio language. But we can temporarily use the `und` language code, which stands for _undetermined_. We can then set it to the proper language when building the `Variant` object from the HLS _program index_.

`Rendition` should then carry the quality information (only applicable to the video track for now). Also, we should drop the `code` field and, likewise, leave it to the human interface to construct it as needed.
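
The `und` placeholder idea could look something like the sketch below. `AudioTrack` and `resolve_language()` are hypothetical names used only for illustration; how the language is actually read from the HLS index is left out.

```python
from dataclasses import dataclass, replace

UNDETERMINED = "und"  # ISO 639-2 code for an undetermined language


@dataclass(frozen=True)
class AudioTrack:
    """Hypothetical track descriptor inferred from a rendition code."""

    language: str = UNDETERMINED
    audio_description: bool = False


def resolve_language(track: AudioTrack, index_language: str) -> AudioTrack:
    """Fill in the language once the HLS index tells us what it is."""
    if track.language != UNDETERMINED:
        # The code already told us the language, keep it.
        return track
    return replace(track, language=index_language)


track = AudioTrack()  # the code did not tell us the audio language
resolved = resolve_language(track, "ita")
```

A track whose language was already known from the code is returned unchanged, so the index only ever fills gaps.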


Barbagus commented

2023-02-14 09:40:02 +00:00

About the _codes_ (renditions and variants): I think it makes sense to leave that responsibility to the human interface. We may need different types of codes, for example:

- for the selection process
- to include in the output file name
- ...
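
For instance, the human interface could derive both kinds of codes from the same model data. This is only a sketch: `Tracks`, `selection_code()` and `file_name_code()`, as well as the formats they produce, are invented here and are not the formats used by the script.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tracks:
    """Hypothetical model data: languages only, no pre-built code."""

    audio_language: str
    subtitles_language: Optional[str] = None


def selection_code(tracks: Tracks) -> str:
    """Short code a user could type to pick a rendition."""
    if tracks.subtitles_language is None:
        return tracks.audio_language
    return f"{tracks.audio_language}+{tracks.subtitles_language}"


def file_name_code(tracks: Tracks) -> str:
    """File-name friendly form of the same information."""
    return selection_code(tracks).replace("+", "-sub-").upper()


tracks = Tracks("fra", "ita")
print(selection_code(tracks), file_name_code(tracks))
```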


Barbagus commented

2023-02-16 06:37:21 +00:00

Okay, the more I think about it, the more I think points 1 and 2 are secondary and may be addressed by a better split between our 3 "zones":

- `www`, `api` and `hls` modules (their data)
- `__init__` top-level module API (our data)
- The client code in `__main__`

So I'll first address that split and point 3 (semantics), and then see if 1 and 2 are still relevant.


Barbagus added 4 commits 2023-02-20 06:15:18 +00:00

bdc8b7b246 Refactor `www` module

Split functionalities in smaller parts
- fetch the html code `fetch_page_content()`
- extract JSON data from html code `extract_page_data()`
- read the program info from page data `read_page_data()`
Move that "pipeline" in `__init__.py`

4ffc32eb61 Fix invalid doc strings

58b0ba30a3 Refactor `api` module and `Rendition*` model

Split `api` functionalities in smaller parts
- fetch API JSON object `fetch_api_object()`
- read the config object `read_config_player_object()`
Move that "pipeline" in `__init__.py`

Remove the `code` field from `Rendition` and replace it with some track
rendition models that are built by parsing the `code` from ArteTV. Also move
the `protocol` from the `RenditionSource` to the `Rendition` model itself...
who knows how we might handle it in the future.

c8aab4c5a3 Refactor `hls` module and `Variant*` model

Split `hls` functionalities in smaller parts
- fetch M3U8 object `fetch_index()`
- read the indexes `read_*_index_object()`
Move that "pipeline" in `__init__::load_variant_sources()`

Remove the `code` field from `Variant` and replace it with a video quality
descriptor (resolution and frame rate).

Barbagus added 1 commit 2023-02-20 06:19:59 +00:00

10fd8a9675 Organize imports and fix docstyle

This pull request is marked as a work in progress.
