Compare commits

59 Commits

Author SHA1 Message Date
Barbagus 23e2183c93 Merge pull request 'move to `urllib3` instead of `requests`' (#29) from urllib3 into stable
Reviewed-on: fcode/delarte#29
2023-02-14 08:11:20 +00:00
Barbagus 477edc4910 Implement a `raise_for_status()` on `HTTPError` 2023-02-13 18:44:32 +01:00
Barbagus a108135141 Use `urllib3` instead of `requests`
We were not (and probably won't be) using any worthwhile `requests`
features (besides `raise_for_status()`) and the `timeout` session
parameter propagation vs adapter plugging "thing" in requests just
annoys me deeply (not that kind of "... Human (TM)")
2023-02-13 09:35:33 +01:00
Barbagus f90179e7c3 Fix changes in pages embedded data structure 2023-02-13 08:09:00 +01:00
Barbagus b4eed73a83 Add debug feedback on module exceptions 2023-02-13 08:03:52 +01:00
Barbagus f36d45fb5e Enable interrupt/resume of MP4 streams
- skipping the processing of an existing target output file
- skipping the download of an existing target stream file
- resume the download of an existing target stream temporary file
  using a HTTP range request
2023-01-25 08:53:25 +01:00
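A minimal sketch of the resume step described in the commit above, not the project's actual code. It assumes the partial download sits in a temporary file and that the server honours an open-ended `Range` header; `resume_headers` and the file name are hypothetical.

```python
import os
import tempfile


def resume_headers(temp_path):
    """Return request headers that resume a download from the bytes already on disk."""
    try:
        offset = os.path.getsize(temp_path)
    except FileNotFoundError:
        offset = 0
    # An open-ended range asks the server for everything from `offset` onwards.
    return {"Range": f"bytes={offset}-"} if offset else {}


with tempfile.TemporaryDirectory() as workdir:
    partial = os.path.join(workdir, "video.mp4.tmp")  # hypothetical temporary file
    fresh_headers = resume_headers(partial)  # nothing on disk yet: plain request
    with open(partial, "wb") as f:
        f.write(b"\x00" * 1024)  # pretend 1024 bytes were already downloaded
    resumed_headers = resume_headers(partial)
```

A client would send these headers with the stream request and append the response body to the temporary file, renaming it once the download completes.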
Barbagus 57da060e73 Merge pull request 'support for collections' (#28) from collections into stable
Reviewed-on: fcode/delarte#28
2023-01-24 19:26:05 +00:00
Barbagus 6b24b15f57 Update README according to implementation 2023-01-24 20:24:59 +01:00
Barbagus e23cd73664 Implement collections 2023-01-24 19:59:39 +01:00
Barbagus 3ca02e8e42 Include collection www/json samples
TV series that list episodes through many `collection_subcollection_*`
zones (one per season):
 - RC-023217__acquitted.json
 - RC-022923__cry-wolf.json

Other collections that list items in one `collection_videos_*` zone:
 - RC-023013__l-incroyable-periple-de-magellan.json
 - RC-023242__bandes-de-pirates.json
2023-01-24 10:15:50 +01:00
Barbagus 56c1e8468a Split program/rendition/variant/target operations
Significant rewrite after model modification: introducing `*Sources`
objects that encapsulate metadata and fetch information (urls,
protocols). The API (#20) is organized as pipe elements with sources
being what flows through the pipe.
    1. fetch program sources
    2. fetch rendition sources
    3. fetch variant sources
    4. fetch targets
    5. process (download+mux) targets
Some user selection filter or modifiers could then be applied at any
step of the pipe. Our __main__.py is an implementation of that scheme.

Implied modifications include:
 - Later failure on unsupported protocols, used to be in `api`, now in
   `hls`. This offers the possibility to filter and/or support them
   later.
 - Give up honoring the http ranges for mp4 download, stream-download
   them by fixed chunk instead.
 - Cleaning up of the `hls` module moving the main download function to
   __init__ and specific (mp4/vtt) download functions to a new
   `download` module.

On the side modifications include:
 - The progress handler showing downloading rates.
 - The naming utilities providing rendition and variant code insertion.
 - Download parts to working directories and skip unnecessary
   re-downloads on failure.

This was a big change for a single commit... too big of a change maybe.
2023-01-24 08:27:37 +01:00
Barbagus ed5ba06a98 Implement a "schema guard" for `api` module
In order to catch errors related to assumed JSON schema, regroup all
JSON data access under a context manager that catches the related errors:
- KeyError
- IndexError
- ValueError
2023-01-16 21:12:55 +01:00
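The "schema guard" idea from the commit above could look roughly like this. It is a sketch under assumptions: `schema_guard`, `UnexpectedSchemaError` and `read_title` are hypothetical names, and only the three exception types listed in the commit message are caught.

```python
import contextlib


class UnexpectedSchemaError(Exception):
    """The JSON data did not have the assumed structure (hypothetical name)."""


@contextlib.contextmanager
def schema_guard():
    """Turn lookup errors raised inside the block into a single error type."""
    try:
        yield
    except (KeyError, IndexError, ValueError) as e:
        raise UnexpectedSchemaError(repr(e)) from e


def read_title(payload):
    # All access to the assumed schema happens inside the guard.
    with schema_guard():
        return payload["data"]["attributes"]["metadata"]["title"]


title = read_title({"data": {"attributes": {"metadata": {"title": "Clint Eastwood"}}}})

try:
    read_title({"data": {}})
    guarded = False
except UnexpectedSchemaError:
    guarded = True
```

Grouping the accesses this way means callers only have to handle one exception type when the remote schema changes.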
Barbagus fcadd531c4 Reorganize imports in files 2023-01-14 20:46:16 +01:00
Barbagus 639a8063a5 Get program information from page content
Changes the way the program information is figured out. From URL parsing
to page content parsing.
A massive JSON object is shipped within the HTML of the page, that's
where we get what we need from.

Side effects:
 - drop `slug` from the program's info
 - drop `slug` naming option
 - no `Program` / `ProgramMeta` distinction

Includes some JSON samples.
2023-01-14 19:51:02 +01:00
Barbagus ba2dd96b36 Merge pull request 'output file naming #8' (#27) from naming into stable
Reviewed-on: fcode/delarte#27
2023-01-11 17:12:54 +00:00
Barbagus cd24696367 Fix space issue in sequence counter 2023-01-11 18:10:52 +01:00
Barbagus ecba66d27a Implement basic naming options 2023-01-11 09:08:32 +01:00
Barbagus d4616f6298 Update README 2023-01-09 19:48:59 +01:00
Barbagus 4667dbfca1 Refactor models and API
Change/add/rename model's data structures in order to provide a more
useful API #20, introducing new structures:
- `Sources`: summarizing program, renditions and variants found
  at a given ArteTV page URL
- `Target`: summarizing all required data for a download

And new functions:
- `fetch_sources()` to build the `Sources` from a URL
- `iter_[renditions|variants]()` to describe the available options for the
  `Sources`
- `select_[renditions|variants]()` to narrow down the desired options
  for the `Sources`
- `compile_sources` to compute such a `Target` from `Sources`
- `download_target` to download such a `Target`

Finally, this should make the playlist handling #7 easier (I know, I've
said that before)
2023-01-09 19:30:46 +01:00
Barbagus b13d4186b0 Add content-type check for HLS responses 2023-01-09 05:07:04 +01:00
Barbagus 5674b4aa0d Fix terminology and harmful language #12
Master playlists become program indexes
Media playlists become track indexes
2023-01-08 20:40:49 +01:00
Barbagus 81913a6f24 Cleanup package API #20
Move all error definitions to `error` module
In `__init__`
  - Remove imports from global scope
  - Import all from `model` module
  - Import all from `error` module
Refactor: `fetch_sources()` to take the URL as argument
Coding style: import definitions from `error` and `model`
2023-01-08 20:04:18 +01:00
Barbagus aa6a6e4a30 Remove obsolete tests 2023-01-08 20:02:54 +01:00
Barbagus eac65aaa1c Fix renditions audio/subtitles objects
Due to faulty syntax the `provides_accessibility` field was None/True
instead of False/True
2023-01-07 12:28:34 +01:00
Barbagus 87f833d655 Add `docopt-ng` to dependencies in README 2023-01-06 10:06:29 +01:00
Barbagus 914f711670 Merge pull request 'Fix #24 and #25' (#26) from vtt2srt into stable
Reviewed-on: fcode/delarte#26
2023-01-06 00:24:56 +00:00
Barbagus 96f411cca0 Fix #24 and #25
Remove dependency to `webvtt-py` which was both too much and not enough
for our use case.
Implement a basic WebVTT to SRT converter according to ArteTV's usage of
WebVTT features.
2023-01-06 01:17:55 +01:00
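A rough sketch of what a basic WebVTT to SRT conversion involves, not the converter from the commit above. It assumes cues are separated by blank lines and use full `hh:mm:ss.ttt` timestamps; cue settings after the timing are dropped and styling tags are left untouched. `vtt_to_srt` is a hypothetical name.

```python
import re

# WebVTT timing line; anything after the second timestamp (cue settings) is ignored.
TIMING = re.compile(r"(\d{2}:\d{2}:\d{2})\.(\d{3}) --> (\d{2}:\d{2}:\d{2})\.(\d{3})")


def vtt_to_srt(vtt_text):
    """Convert WebVTT text to SRT: number the cues and use ',' for milliseconds."""
    cues = []
    for block in vtt_text.strip().split("\n\n"):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            m = TIMING.match(line)
            if m:
                timing = f"{m[1]},{m[2]} --> {m[3]},{m[4]}"
                cues.append((timing, lines[i + 1:]))
                break  # blocks without a timing line (header, notes) are skipped
    numbered = [
        "\n".join([str(n), timing, *text])
        for n, (timing, text) in enumerate(cues, start=1)
    ]
    return "\n\n".join(numbered) + "\n"


SAMPLE_VTT = (
    "WEBVTT\n\n"
    "00:00:01.000 --> 00:00:02.500\nHello\n\n"
    "00:00:03.000 --> 00:00:04.000 line:80%\nWorld\n"
)
srt = vtt_to_srt(SAMPLE_VTT)
```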
Barbagus 8d216215dd Merge pull request 'docopt-ng' (#22) from docopt-ng into stable
Reviewed-on: fcode/delarte#22
2023-01-03 08:45:46 +00:00
Barbagus 831d62d1fd Update README 2022-12-29 11:14:23 +01:00
Barbagus 464cf85680 Rename command line argument holder 2022-12-29 11:09:28 +01:00
Barbagus 381cbd7a36 Fix bug in version label building 2022-12-29 11:00:48 +01:00
Barbagus 4eac1fa86d Fix bug in version label building 2022-12-29 10:57:15 +01:00
Barbagus b057bab44b Implement CLI parsing using docopt-ng library 2022-12-29 10:54:45 +01:00
Barbagus 3ec2961a85 Merge pull request 'refactoring' (#21) from barbadev2 into stable
Reviewed-on: fcode/delarte#21
2022-12-29 07:58:48 +00:00
Barbagus e4cba27bdd Update README to reflect changes 2022-12-29 08:49:45 +01:00
Barbagus e1bed8b1be Provide programmatic access #20 2022-12-29 08:49:45 +01:00
Barbagus 07ef013ce3 Rename error handling
- move errors in a `error` module
- rename the module base error from `Error` to `ModuleError`
- fix some error handling in `__main__`
2022-12-29 08:49:45 +01:00
Barbagus db0a954497 Refactor code to use the model types
- Rename variables and function to reflect model names.
- Convert infrastructure data (JSON, M3U8) to model types.
- Change algorithms to produce/consume `Source` model, in particular
  using generator functions to build a list of `Source`s rather than the
  opaque `rendition => variant => urls` mapping (this will make #7 very
  straightforward).
- Download all master playlists after API call before selecting
  rendition/variants.

Motivation for the last point:

We used to offer rendition choosing right after the API call, before we
download the appropriate master playlist to figure out the available
variants.

The problem with that is that ArteTV's codes for the renditions (given
by the API) do not necessarily include complete language information
(if it is not French or German), for instance an original audio track in
Portuguese would show as `VOEU-` (as in "EUropean"). The actual mention
of the Portuguese would only show up in the master playlist.

So, the new implementation actually downloads all master playlists
straight after the API call. This is a bit wasteful, but I figured it
was necessary to provide quality interaction with the user.

Bonus? Now when we first prompt the user for rendition choice, we
already know the available variants, maybe we will make use of that
fact in the future...
2022-12-29 08:43:20 +01:00
Barbagus 4fa5e1953e Create the data model types
A bunch of data structures to be used instead of the types used by the
infrastructures, i.e. JSON for API and M3U8 for the HLS.

It should provide a stronger decoupling of the modules and pave the way
for #7 and #8.

Implementation uses `namedtuple`s as they are transparent to test for
equality and are natively hashable (can be used in `set`s or as keys to
`dict`s) which is useful for deduping for instance.
2022-12-27 07:55:36 +01:00
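The two `namedtuple` properties mentioned in the commit above (value equality and hashability) can be illustrated as follows. The `Rendition` type and its fields are made up for the example and are not the project's model.

```python
from collections import namedtuple

# Hypothetical model type, only for illustration.
Rendition = namedtuple("Rendition", ["code", "label"])

found = [
    Rendition("VOF-STF", "Français"),
    Rendition("VA-STA", "Allemand"),
    Rendition("VOF-STF", "Français"),  # same values as the first entry
]

# Equality compares field values, so duplicates collapse in a set.
unique = set(found)

# Hashable instances can also serve as dictionary keys.
urls_by_rendition = {rendition: [] for rendition in unique}
```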
Barbagus 305d8ab679 Refactor website URL parsing
Lighter implementation and using `target_id` instead of `program_id`,
preparing for #7
2022-12-27 07:52:35 +01:00
Barbagus 4c518993ef Change error handling
Creation of a `common.Error` exception whose string representation is
taken from its docstring.

Creation of a `common.UnexpectedError` to serve as base for exceptions
raised while checking assumptions on requests and responses.

The latter are handled by displaying a message inviting user to submit
the error to us, so we can correct our assumptions.
2022-12-22 17:43:42 +01:00
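One way an exception can take its string representation from its docstring, as the commit above describes. This is a sketch, not the project's code: `Error` and `UnexpectedError` mirror the names in the message, `InvalidUrl` is hypothetical, and every subclass is assumed to carry its own docstring (a class without one has `__doc__` set to `None`).

```python
class Error(Exception):
    """General error."""

    def __str__(self):
        # The docstring of the concrete class doubles as the user-facing message.
        return self.__doc__.strip()


class UnexpectedError(Error):
    """An unexpected error occurred."""


class InvalidUrl(Error):
    """Invalid ArteTV URL."""


try:
    raise InvalidUrl()
except Error as e:
    message = str(e)
```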
Barbagus 88ffe31a94 Use `requests` library instead of `urllib`
Enables by default:
- gzip compression
- request pooling
2022-12-20 23:46:44 +01:00
Barbagus 458d4cbb6d Add sample files 2022-12-20 10:11:18 +01:00
Barbagus 1eb4d8557d Spell check 2022-12-20 09:48:57 +01:00
Rémi TAUVEL b938dc38c6 Merge branch 'WIP--CLI-argumentsv2#1' into stable 2022-12-19 00:33:02 +01:00
Rémi TAUVEL 28bd775817 📄 📝 docstring and licence at top of test package init module 2022-12-19 00:32:23 +01:00
Rémi TAUVEL 196f88aebb Merge branch 'stable' of git.afpy.org:fcode/delarte into stable 2022-12-19 00:28:51 +01:00
Barbagus dacf9533d6 Fix HLS protocol terminology in the code #12
- versions => renditions
- resolutions => variants
- ranges and/or chunks => segments
- version index => master playlist
- other index => media playlist url

For now, the CLI has not been updated with this terminology, only the
code.
2022-12-18 16:27:04 +01:00
Rémi TAUVEL 52420213cd 📝 add more doc for CLI help string 2022-12-18 15:41:10 +01:00
Rémi TAUVEL e6741594b6 📄 add licence comments top 2022-12-18 15:41:10 +01:00
Rémi TAUVEL 87f2e55a6f 💡 french translating docstrings 2022-12-18 15:41:10 +01:00
Rémi TAUVEL bcf0ba98ad 🐛 💡 fixed bad help sentence for resolution argument 2022-12-18 15:41:10 +01:00
Rémi TAUVEL beb0d99c1a 🚸 remove flags from script prototype
🩹 naming: not "languages", "version"
2022-12-18 15:41:10 +01:00
Rémi TAUVEL aab1308698 🚸 add documentation for user to arguments parser with -h flag 2022-12-18 15:41:10 +01:00
Rémi TAUVEL d39db7a501 📝 change readme doc on usage 2022-12-18 15:41:10 +01:00
Rémi TAUVEL 7d6f132999 🚚 rename modules and cli parser Class 2022-12-18 15:41:10 +01:00
Rémi TAUVEL 00f06ea5ba ♻️ wrapped parser functions in a Parser object 2022-12-18 15:41:10 +01:00
Rémi TAUVEL 8997dc46ec add tests for cli parser behaviour 2022-12-18 15:41:10 +01:00
Rémi TAUVEL 8720a8d47d 🩹 use argparse library for parsing CLI arguments 2022-12-18 15:41:10 +01:00
22 changed files with 10243 additions and 668 deletions

279
README.md
@ -7,9 +7,9 @@
💡 What is it ?
---------------
This is a toy/research project whose only goal is to familiarize with some of the technologies involved in multi-lingual video streaming. Using this program may violate usage policy of ArteTV website and we do not recommend using it for other purpose then studying the code.
This is a toy/research project whose primary goal is to familiarize with some of the technologies involved in multi-lingual video streaming. Using this program may violate usage policy of ArteTV website and we do not recommend using it for any purpose other than studying the code.
ArteTV is a is a European public service channel dedicated to culture. Available programms are usually available with multiple audio and subtitiles languages.
ArteTV is a European public service channel dedicated to culture. Programmes are usually available with multiple audio and subtitle languages.
🚀 Quick start
---------------
@ -27,7 +27,7 @@ $ git clone https://git.afpy.org/fcode/delarte.git
$ cd delarte
```
Optionally create a virtual environement
Optionally create a virtual environnement
```
$ python3 -m venv .venv
$ source .venv/Scripts/activate
@ -48,247 +48,100 @@ Now you can run the script
$ python3 -m delarte --help
or
$ delarte --help
ArteTV dowloader.
delarte - ArteTV downloader.
usage: delarte [-h|--help] - print this message
or: delarte program_page_url - show available versions
or: delarte program_page_url version - show available resolutions
or: delarte program_page_url version resolution - download the given video
Usage:
delarte (-h | --help)
delarte --version
delarte [options] URL
delarte [options] URL RENDITION
delarte [options] URL RENDITION VARIANT
Download a video from ArteTV streaming service. Omit RENDITION and/or
VARIANT to print the list of available values.
Arguments:
URL the URL from ArteTV website
RENDITION the rendition code [audio/subtitles language combination]
VARIANT the variant code [video quality version]
Options:
-h --help print this message
--version print current version of the program
--debug on error, print debugging information
--name-use-id use the program ID
--name-use-slug use the URL slug
--name-sep=<sep> field separator [default: - ]
--name-seq-pfx=<pfx> sequence counter prefix [default: - ]
--name-seq-no-pad disable sequence zero-padding
--name-add-rendition add rendition code
--name-add-variant add variant code
```
🔧 How it works
----------------
### 🏗️ The streaming infrastructure
## 🏗️ The streaming infrastructure
Every video program have a _program identifier_ visible in their web page URL:
We support both _single program pages_ and _program collection pages_. Every page is shipped with some embedded JSON data (we do not keep samples as the structure seems to change regularly). From that we extract metadata for each program. In particular, we extract a _site language_ and a _program ID_. These enable us to query the config API.
```
https://www.arte.tv/es/videos/110139-000-A/fromental-halevy-la-tempesta/
https://www.arte.tv/fr/videos/100204-001-A/esprit-d-hiver-1-3/
https://www.arte.tv/en/videos/104001-000-A/clint-eastwood/
```
### The _config_ API
That _program identifier_ enables us to query an API for the program's information.
This API returns a `ConfigPlayer` JSON object, a sample of which can be found [here](https://git.afpy.org/fcode/delarte/src/branch/stable/samples/api/). A list of available audio/subtitles combinations in `$.data.attributes.streams`. In our code such a combination is referred to as a _rendition_. Every such _rendition_ has a reference to a _program index_ file in `.streams[i].url`
##### The _config_ API
### The _program index_ file
For the last example the API call is as such:
As defined in [HTTP Live Streaming](https://www.rfc-editor.org/rfc/rfc8216) (sample files can be found [here](https://git.afpy.org/fcode/delarte/src/branch/stable/samples/hls/)). This file shows a list of video _variant_ URIs (one per video resolution). Each of them has
- exactly one video _track index_ reference
- exactly one audio _track index_ reference
- at most one subtitles _track index_ reference
```
https://api.arte.tv/api/player/v2/config/en/104001-000-A
```
Audio and subtitles tracks reference also include:
- a two-letter `language` code attribute (`mul` is used for audio multiple language)
- a free form `name` attribute that is used to detect an audio _original version_
- a coded `characteristics` that is used to detect accessibility tracks (audio or textual description)
The response is a JSON object:
### The video and audio _track index_ file
```json
{
"data": {
"id": "104001-000-A_en",
"type": "ConfigPlayer",
"attributes": {
"metadata": {
"providerId": "104001-000-A",
"language": "en",
"title": "Clint Eastwood",
"subtitle": "The Last Legend",
"description": "70 years of career in front of and behind the camera and still active at 90, Clint Eastwood is a Hollywood legend. A look back at his unique career through a portrait that explores the complexity of the Eastwood myth.",
"duration": { "seconds": 4652 },
...
},
"streams": [
{
"url": "https://.../104001-000-A_VOF-STE%5BANG%5D_XQ.m3u8",
"versions": [
{
"label": "English (Subtitles)",
"shortLabel": "OGsub-ANG",
"eStat": {
"ml5": "VOF-STE[ANG]"
}
}
],
...
},
{
"url": "https://.../104001-000-A_VOF-STF_XQ.m3u8",
"versions": [
{
"label": "French (Original)",
"shortLabel": "FR",
"eStat": {
"ml5": "VOF-STF"
}
}
],
...
},
{
"url": "https://.../104001-000-A_VOF-STMF_XQ.m3u8",
"versions": [
{
"label": "Original french version - closed captioning (FR)",
"shortLabel": "ccFR",
"eStat": {
"ml5": "VOF-STMF"
}
}
],
...
},
{
"url": "https://.../104001-000-A_VA-STA_XQ.m3u8",
"versions": [
{
"label": "German (Dubbed)",
"shortLabel": "DE",
"eStat": {
"ml5": "VA-STA"
}
}
],
...
},
{
"url": "https://.../104001-000-A_VA-STMA_XQ.m3u8",
"versions": [
{
"label": "German closed captioning ",
"shortLabel": "ccDE",
"eStat": {
"ml5": "VA-STMA"
}
}
],
...
}
],
...
}
}
}
```
Information about the program is detailed in `data.attributes.metadata` and a list of available audio/subtitles combinations in `data.attributes.streams`. In our code such a combination is refered to as a _rendition_ (or _version_ in the CLI).
As defined in [HTTP Live Streaming](https://www.rfc-editor.org/rfc/rfc8216) (sample files can be found [here](https://git.afpy.org/fcode/delarte/src/branch/stable/samples/hls/)). This file is basically a list of _segments_ (http ranges) the client is supposed to download in sequence.
Every such _rendition_ has a reference to a _master playlist_ file in `.streams[i].url` and description of the audio/subtitle combination in `.streams[i].versions[0]`.
### The subtitles _track index_ file
We are using `.streams[i].versions[0].eStat.ml5` as our _rendition_ key:
As defined in [HTTP Live Streaming](https://www.rfc-editor.org/rfc/rfc8216) (sample files can be found [here](https://git.afpy.org/fcode/delarte/src/branch/stable/samples/hls/)). This file references the actual file containing the subtitles [VTT](https://developer.mozilla.org/en-US/docs/Web/API/WebVTT_API) data.
- `VOF-STE[ANG]` English (Subtitles)
- `VOF-STF` French (Original)
- `VOF-STMF` Original french version - closed captioning (FR)
- `VA-STA` German (Dubbed)
- `VA-STMA` German closed captioning
- ...
## ⚙The process
#### The _master playlist_
1. Fetch _program sources_ from the page pointed by the given URL
2. Fetch _rendition sources_ from _config API_
3. Filter _renditions_
4. Fetch _variant sources_ from _HLS_ _program index_ files.
5. Filter _variants_
6. Fetch final target information and figure out output naming
7. Download data streams (convert VTT subtitles to formatted SRT subtitles) and mux them with FFMPEG
As defined in [HTTP Live Streaming](https://www.rfc-editor.org/rfc/rfc8216), for example:
## 📽️ FFMPEG
```
#EXTM3U
...
#EXT-X-STREAM-INF:BANDWIDTH=2335200,AVERAGE-BANDWIDTH=1123304,VIDEO-RANGE=SDR,CODECS="avc1.4d401e,mp4a.40.2",RESOLUTION=768x432,FRAME-RATE=25.000,AUDIO="program_audio_0",SUBTITLES="subs"
medias/104001-000-A_v432.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=4534432,AVERAGE-BANDWIDTH=2124680,VIDEO-RANGE=SDR,CODECS="avc1.4d0028,mp4a.40.2",RESOLUTION=1920x1080,FRAME-RATE=25.000,AUDIO="program_audio_0",SUBTITLES="subs"
medias/104001-000-A_v1080.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=4153392,AVERAGE-BANDWIDTH=1917840,VIDEO-RANGE=SDR,CODECS="avc1.4d401f,mp4a.40.2",RESOLUTION=1280x720,FRAME-RATE=25.000,AUDIO="program_audio_0",SUBTITLES="subs"
medias/104001-000-A_v720.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=1445432,AVERAGE-BANDWIDTH=726160,VIDEO-RANGE=SDR,CODECS="avc1.4d401e,mp4a.40.2",RESOLUTION=640x360,FRAME-RATE=25.000,AUDIO="program_audio_0",SUBTITLES="subs"
medias/104001-000-A_v360.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=815120,AVERAGE-BANDWIDTH=429104,VIDEO-RANGE=SDR,CODECS="avc1.42e00d,mp4a.40.2",RESOLUTION=384x216,FRAME-RATE=25.000,AUDIO="program_audio_0",SUBTITLES="subs"
medias/104001-000-A_v216.m3u8
...
#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="program_audio_0",LANGUAGE="fr",NAME="VOF",AUTOSELECT=YES,DEFAULT=YES,URI="medias/104001-000-A_aud_VOF.m3u8"
#EXT-X-MEDIA:TYPE=SUBTITLES,GROUP-ID="subs",NAME="English",DEFAULT=YES,AUTOSELECT=YES,FORCED=NO,LANGUAGE="en",URI="medias/104001-000-A_st_VO-ANG.m3u8"
...
```
The multiplexing (_muxing_) of the video file is handled by [ffmpeg](https://ffmpeg.org/). The script expects [ffmpeg](https://ffmpeg.org/) to be installed in the environment and will call it as a subprocess.
This file show the a list of video _variants_ URIs (one per video resolution). Each of them has
- exactly one video _media playlist_ reference
- exactly one audio _media playlist_ reference
- at most one subtitles _media playlist_ reference
##### The video and audio _media playlist_
As defined in [HTTP Live Streaming](https://www.rfc-editor.org/rfc/rfc8216), for example:
```
#EXTM3U
#EXT-X-TARGETDURATION:6
#EXT-X-VERSION:7
#EXT-X-MEDIA-SEQUENCE:1
#EXT-X-INDEPENDENT-SEGMENTS
#EXT-X-PLAYLIST-TYPE:VOD
#EXT-X-MAP:URI="104001-000-A_v1080.mp4",BYTERANGE="28792@0"
#EXTINF:6.000,
#EXT-X-BYTERANGE:1734621@28792
104001-000-A_v1080.mp4
#EXTINF:6.000,
#EXT-X-BYTERANGE:1575303@1763413
104001-000-A_v1080.mp4
#EXTINF:6.000,
#EXT-X-BYTERANGE:1603739@3338716
104001-000-A_v1080.mp4
#EXTINF:6.000,
#EXT-X-BYTERANGE:1333835@4942455
104001-000-A_v1080.mp4
...
```
This file shows the list of _segments_ the server expect to serve.
##### The subtitles _media playlist_
As defined in [HTTP Live Streaming](https://www.rfc-editor.org/rfc/rfc8216), for example:
```
#EXTM3U
#EXT-X-VERSION:7
#EXT-X-TARGETDURATION:4650
#EXT-X-MEDIA-SEQUENCE:1
#EXT-X-PLAYLIST-TYPE:VOD
#EXTINF:4650,
104001-000-A_st_VO-ANG.vtt
#EXT-X-ENDLIST
```
This file shows the file containing the subtitles data.
### ⚙The process
1. Get the _config_ API object for the _program identifier_.
- Select a _rendition_.
2. Get the _master playlist_.
- Select a _variant_.
3. Download audio, video and subtitles media content.
- convert `VTT` subtitles to `SRT`
4. Figure out the _output filename_ from _metadata_.
5. Feed the all the media to `ffmpeg` for _muxing_
### 📽️ FFMPEG
The multiplexing (_muxing_) the video file is handled by [ffmpeg](https://ffmpeg.org/). The script expects [ffmpeg](https://ffmpeg.org/) to be installed in the environement and will call it as a subprocess.
#### Why not use FFMPEG direcly with the HLS _master playlist_ URL ?
### Why not use FFMPEG directly with the HLS _program index_ URL ?
So we can be more granular about _renditions_ and _variants_ that we want.
#### Why not use `VTT` subtitles direcly ?
### Why not use `VTT` subtitles directly ?
Because it fails 😒.
Because FFMPEG does not support styles in WebVTT 😒.
#### Why not use FFMPEG direcly with the _media playalist_ URLs and let it do the download ?
### Why not use FFMPEG directly with the _track index_ URLs and let it do the download ?
Because some programs would randomly fail 😒. Probably due to invalid _segmentation_ on the server.
### 📌 Dependences
## 📌 Dependencies
- [m3u8](https://pypi.org/project/m3u8/) to parse playlists.
- [webvtt-py](https://pypi.org/project/webvtt-py/) to load `vtt` subtitles files.
- [m3u8](https://pypi.org/project/m3u8/) to parse indexes.
- [urllib3](https://pypi.org/project/urllib3/) to handle HTTP traffic.
- [docopt-ng](https://pypi.org/project/docopt-ng/) to parse command line.
### 🤝 Help
## 🤝 Help
For sure ! The more the merrier.

@ -4,14 +4,15 @@ build-backend = "flit_core.buildapi"
[project]
name = "delarte"
authors = [{name = "Barbagus", email = "barbagus@proton.me"}]
authors = [{name = "Barbagus", email = "barbagus42@proton.me"}]
readme = "README.md"
license = {file = "LICENSE.md"}
classifiers = ["License :: OSI Approved :: GNU Affero General Public License v3"]
dynamic = ["version", "description"]
dependencies = [
"m3u8",
"webvtt-py",
"urllib3",
"docopt-ng"
]
[project.urls]
@ -21,7 +22,6 @@ Home = "https://git.afpy.org/fcode/delarte.git"
dev = [
"black",
"pydocstyle",
"toml"
]
[project.scripts]

@ -0,0 +1,285 @@
{
"data": {
"id": "105612-000-A_fr",
"type": "ConfigPlayer",
"attributes": {
"provider": "arte",
"metadata": {
"providerId": "105612-000-A",
"language": "fr",
"title": "\"E.T.\", un blockbuster intime",
"subtitle": null,
"description": "1982. Un film accomplit le triple exploit de donner naissance à un personnage emblématique de la pop culture, de révolutionner le cinéma de science-fiction et démouvoir aux larmes le monde entier. Retour sur le paradoxal \"E.T., lextra-terrestre\", à la fois blockbuster et oeuvre intime, sans doute la plus personnelle de Steven Spielberg. ",
"images": [
{
"caption": null,
"url": "https://api-cdn.arte.tv/img/v2/image/bUzZ7kxNEJCRDK6Cb3TB79/940x530"
}
],
"link": {
"url": "https://www.arte.tv/fr/videos/105612-000-A/e-t-un-blockbuster-intime/",
"deeplink": "arte://program/105612-000-A",
"videoOnDemand": null
},
"config": {
"url": "https://api.arte.tv/api/player/v2/config/fr/105612-000-A",
"replay": "https://api.arte.tv/api/player/v2/config/fr/105612-000-A",
"playlist": "https://api.arte.tv/api/player/v2/playlist/fr/105612-000-A"
},
"duration": {
"seconds": 3150
},
"episodic": false
},
"live": false,
"chapters": null,
"rights": {
"begin": "2022-12-09T04:00:00+00:00",
"end": "2023-01-15T04:00:00+00:00"
},
"streams": [
{
"url": "https://arte-cmafhls.akamaized.net/am/cmaf/105000/105600/105612-000-A/221213164204/105612-000-A_VOF-STF_XQ.m3u8",
"versions": [
{
"label": "Français",
"shortLabel": "VOF",
"eStat": {
"ml5": "VOF-STF"
}
}
],
"mainQuality": {
"code": "XQ",
"label": "720p"
},
"slot": 1,
"protocol": "HLS_NG",
"segments": [],
"externalId": null
},
{
"url": "https://arte-cmafhls.akamaized.net/am/cmaf/105000/105600/105612-000-A/221213164204/105612-000-A_VOF-STMF_XQ.m3u8",
"versions": [
{
"label": "Français (sourds et malentendants)",
"shortLabel": "ST mal",
"eStat": {
"ml5": "VOF-STMF"
}
}
],
"mainQuality": {
"code": "XQ",
"label": "720p"
},
"slot": 2,
"protocol": "HLS_NG",
"segments": [],
"externalId": null
},
{
"url": "https://arte-cmafhls.akamaized.net/am/cmaf/105000/105600/105612-000-A/221213164204/105612-000-A_VA-STA_XQ.m3u8",
"versions": [
{
"label": "Allemand",
"shortLabel": "VA",
"eStat": {
"ml5": "VA-STA"
}
}
],
"mainQuality": {
"code": "XQ",
"label": "720p"
},
"slot": 3,
"protocol": "HLS_NG",
"segments": [],
"externalId": null
},
{
"url": "https://arte-cmafhls.akamaized.net/am/cmaf/105000/105600/105612-000-A/221213164204/105612-000-A_VA-STMA_XQ.m3u8",
"versions": [
{
"label": "Allemand (sourds et malentendants)",
"shortLabel": "ST mal DE",
"eStat": {
"ml5": "VA-STMA"
}
}
],
"mainQuality": {
"code": "XQ",
"label": "720p"
},
"slot": 4,
"protocol": "HLS_NG",
"segments": [],
"externalId": null
},
{
"url": "https://arte-cmafhls.akamaized.net/am/cmaf/105000/105600/105612-000-A/221213164204/105612-000-A_VOEU-STE%5BANG%5D_XQ.m3u8",
"versions": [
{
"label": "ST Anglais",
"shortLabel": "VOST-ANG",
"eStat": {
"ml5": "VOEU-STE[ANG]"
}
}
],
"mainQuality": {
"code": "XQ",
"label": "720p"
},
"slot": 5,
"protocol": "HLS_NG",
"segments": [],
"externalId": null
},
{
"url": "https://arte-cmafhls.akamaized.net/am/cmaf/105000/105600/105612-000-A/221213164204/105612-000-A_VOEU-STE%5BESP%5D_XQ.m3u8",
"versions": [
{
"label": "ST Espagnol",
"shortLabel": "VOST-ESP",
"eStat": {
"ml5": "VOEU-STE[ESP]"
}
}
],
"mainQuality": {
"code": "XQ",
"label": "720p"
},
"slot": 6,
"protocol": "HLS_NG",
"segments": [],
"externalId": null
},
{
"url": "https://arte-cmafhls.akamaized.net/am/cmaf/105000/105600/105612-000-A/221213164204/105612-000-A_VOEU-STE%5BPOL%5D_XQ.m3u8",
"versions": [
{
"label": "ST Polonais",
"shortLabel": "VOST-POL",
"eStat": {
"ml5": "VOEU-STE[POL]"
}
}
],
"mainQuality": {
"code": "XQ",
"label": "720p"
},
"slot": 7,
"protocol": "HLS_NG",
"segments": [],
"externalId": null
},
{
"url": "https://arte-cmafhls.akamaized.net/am/cmaf/105000/105600/105612-000-A/221213164204/105612-000-A_VOEU-STE%5BITA%5D_XQ.m3u8",
"versions": [
{
"label": "ST Italien",
"shortLabel": "VOST-ITA",
"eStat": {
"ml5": "VOEU-STE[ITA]"
}
}
],
"mainQuality": {
"code": "XQ",
"label": "720p"
},
"slot": 8,
"protocol": "HLS_NG",
"segments": [],
"externalId": null
}
],
"stat": {
"eStat": {
"level1": "CPO_culture-et-pop",
"level2": "PROGRAMME_ANTENNE",
"level3": "fr",
"level4": "POP_culture-pop",
"level5": "105612-000-A",
"mediaChannel": "850",
"mediaContentId": "105612-000-A",
"mediaDiffMode": "TVOD",
"newLevel1": "SHOW",
"newLevel11": "613_culture-pop",
"newLevel2": "auto",
"newLevel3": "-",
"newLevel4": "-",
"streamDuration": 3150,
"streamGenre": "a",
"streamName": "\"E.T.\", un blockbuster intime",
"serial": 266066213484,
"prerollSerial": 213013217336
},
"arte": {
"tablet": {
"WEB": "https://www.arte.tv/pa/api/multimedia/v1/105612-000/A/fr/ARTE_NEXT/TABLET/WEB/arte.gif"
},
"desktop": {
"WEB": "https://www.arte.tv/pa/api/multimedia/v1/105612-000/A/fr/ARTE_NEXT/DESKTOP/WEB/arte.gif"
},
"mobile": {
"WEB": "https://www.arte.tv/pa/api/multimedia/v1/105612-000/A/fr/ARTE_NEXT/MOBILE/WEB/arte.gif"
}
},
"agf": {
"type": "content",
"assetid": "105612-000-A",
"program": "613_culture-pop",
"title": "nach-hause-telefonieren",
"length": 3150,
"nol_c2": "p2,N",
"nol_c5": "p5,https://www.arte.tv/fr/videos/105612-000-A/e-t-un-blockbuster-intime/",
"nol_c7": "p7,105612-000-A",
"nol_c8": "p8,3150",
"nol_c9": "p9,nach-hause-telefonieren",
"nol_c10": "p10,ARTE",
"nol_c12": "p12,Content",
"nol_c15": "p15,105612-000-A",
"nol_c18": "p18,N"
},
"push": {
"programId": "105612-000-A",
"category": "CPO_culture-et-pop",
"subcategory": "POP_culture-pop",
"genre": "1_documentaires-et-reportages"
}
},
"ads": {
"smart": {
"url": "https://www14.smartadserver.com/ac?siteid=307555&pgid=1115590&fmtid=81409&ab=1&tgt=cat%3DCPO_POP%3Blang%3Dfr%3Bplatform%3DARTE_NEXT&oc=1&out=vast4&ps=1&pb=0&visit=S&vcn=s&ctid=105612-000-A&ctd=3150&lang=fr&ctt=broadcast&ctc=CPO_POP&ctk=RC-022371"
}
},
"restriction": {
"enablePreroll": true,
"geoblocking": {
"code": "SAT",
"restrictedArea": false,
"inclusion": [],
"exclusion": [],
"userGeoblockingZone": [
"DE_FR",
"EUR_DE_FR",
"SAT",
"ALL"
],
"userCountryCode": "FR"
},
"ageRestriction": "NONE",
"allowEmbed": true,
"enableMyArte": true
},
"stickers": [],
"autoplay": true
}
}
}
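As a sketch of how such a `ConfigPlayer` object can be read, the snippet below lists the renditions of a trimmed copy of the sample above. It assumes `.streams[i].versions[0].eStat.ml5` serves as the rendition key, as the earlier README text stated; `iter_renditions` is a hypothetical helper, not the package API.

```python
import json

# Trimmed copy of the sample above: two streams, unrelated fields dropped.
CONFIG = json.loads("""
{
  "data": {
    "attributes": {
      "streams": [
        {
          "url": "https://arte-cmafhls.akamaized.net/am/cmaf/105000/105600/105612-000-A/221213164204/105612-000-A_VOF-STF_XQ.m3u8",
          "versions": [{"label": "Français", "shortLabel": "VOF", "eStat": {"ml5": "VOF-STF"}}],
          "protocol": "HLS_NG"
        },
        {
          "url": "https://arte-cmafhls.akamaized.net/am/cmaf/105000/105600/105612-000-A/221213164204/105612-000-A_VA-STA_XQ.m3u8",
          "versions": [{"label": "Allemand", "shortLabel": "VA", "eStat": {"ml5": "VA-STA"}}],
          "protocol": "HLS_NG"
        }
      ]
    }
  }
}
""")


def iter_renditions(config):
    """Yield (rendition key, label, program index URL) for each stream."""
    for stream in config["data"]["attributes"]["streams"]:
        version = stream["versions"][0]
        yield version["eStat"]["ml5"], version["label"], stream["url"]


renditions = list(iter_renditions(CONFIG))
```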

File diff suppressed because it is too large

@ -0,0 +1,29 @@
#EXTM3U
#EXT-X-VERSION:7
#EXT-X-INDEPENDENT-SEGMENTS
#EXT-X-STREAM-INF:BANDWIDTH=2369840,AVERAGE-BANDWIDTH=1168160,VIDEO-RANGE=SDR,CODECS="avc1.4d401e,mp4a.40.2",RESOLUTION=768x432,FRAME-RATE=25.000,AUDIO="program_audio_0",SUBTITLES="subs"
medias/105612-000-A_v432.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=4720688,AVERAGE-BANDWIDTH=2164360,VIDEO-RANGE=SDR,CODECS="avc1.4d0028,mp4a.40.2",RESOLUTION=1920x1080,FRAME-RATE=25.000,AUDIO="program_audio_0",SUBTITLES="subs"
medias/105612-000-A_v1080.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=4067496,AVERAGE-BANDWIDTH=1921696,VIDEO-RANGE=SDR,CODECS="avc1.4d401f,mp4a.40.2",RESOLUTION=1280x720,FRAME-RATE=25.000,AUDIO="program_audio_0",SUBTITLES="subs"
medias/105612-000-A_v720.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=1443248,AVERAGE-BANDWIDTH=729696,VIDEO-RANGE=SDR,CODECS="avc1.4d401e,mp4a.40.2",RESOLUTION=640x360,FRAME-RATE=25.000,AUDIO="program_audio_0",SUBTITLES="subs"
medias/105612-000-A_v360.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=819168,AVERAGE-BANDWIDTH=430848,VIDEO-RANGE=SDR,CODECS="avc1.42e00d,mp4a.40.2",RESOLUTION=384x216,FRAME-RATE=25.000,AUDIO="program_audio_0",SUBTITLES="subs"
medias/105612-000-A_v216.m3u8
#EXT-X-I-FRAME-STREAM-INF:BANDWIDTH=670672,AVERAGE-BANDWIDTH=158304,VIDEO-RANGE=SDR,CODECS="avc1.4d401e",RESOLUTION=768x432,URI="medias/105612-000-A_v432_iframe_index.m3u8"
#EXT-X-I-FRAME-STREAM-INF:BANDWIDTH=1255560,AVERAGE-BANDWIDTH=266544,VIDEO-RANGE=SDR,CODECS="avc1.4d0028",RESOLUTION=1920x1080,URI="medias/105612-000-A_v1080_iframe_index.m3u8"
#EXT-X-I-FRAME-STREAM-INF:BANDWIDTH=1096696,AVERAGE-BANDWIDTH=250848,VIDEO-RANGE=SDR,CODECS="avc1.4d401f",RESOLUTION=1280x720,URI="medias/105612-000-A_v720_iframe_index.m3u8"
#EXT-X-I-FRAME-STREAM-INF:BANDWIDTH=458864,AVERAGE-BANDWIDTH=103496,VIDEO-RANGE=SDR,CODECS="avc1.4d401e",RESOLUTION=640x360,URI="medias/105612-000-A_v360_iframe_index.m3u8"
#EXT-X-I-FRAME-STREAM-INF:BANDWIDTH=130136,AVERAGE-BANDWIDTH=42200,VIDEO-RANGE=SDR,CODECS="avc1.42e00d",RESOLUTION=384x216,URI="medias/105612-000-A_v216_iframe_index.m3u8"
#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="program_audio_0",LANGUAGE="de",NAME="VA",AUTOSELECT=YES,DEFAULT=YES,URI="medias/105612-000-A_aud_VA.m3u8"
#EXT-X-MEDIA:TYPE=SUBTITLES,GROUP-ID="subs",NAME="Deutsch",DEFAULT=YES,AUTOSELECT=YES,FORCED=NO,LANGUAGE="de",URI="medias/105612-000-A_st_VA-ALL.m3u8"
#SPRITES: medias/105612-000-A_SPR.vtt
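
The sample above is an HLS master playlist: each `#EXT-X-STREAM-INF` tag carries a comma-separated attribute list and is followed by the URI of the variant it describes. As a rough illustration of how such a file can be read, here is a minimal standard-library sketch. The project itself relies on the `m3u8` package, so the helper names `parse_attributes` and `iter_stream_variants` are assumptions made for this example only, and the sample text is an abridged copy of the playlist above.

```python
import re

# KEY=value pairs; a value is either a quoted string (which may contain
# commas, as CODECS does) or a bare token that ends at the next comma.
_ATTR = re.compile(r'([A-Z0-9-]+)=("[^"]*"|[^,]*)')


def parse_attributes(tag_line):
    """Return the attribute list of an HLS tag line as a dict."""
    _, _, attribute_list = tag_line.partition(":")
    return {key: value.strip('"') for key, value in _ATTR.findall(attribute_list)}


def iter_stream_variants(playlist_text):
    """Yield (code, average_bandwidth, uri) for each variant stream."""
    lines = [line.strip() for line in playlist_text.splitlines() if line.strip()]
    # A variant URI is the line that immediately follows its STREAM-INF tag.
    for tag, uri in zip(lines, lines[1:]):
        if tag.startswith("#EXT-X-STREAM-INF:"):
            attrs = parse_attributes(tag)
            _width, height = attrs["RESOLUTION"].split("x")
            yield f"{height}p", int(attrs["AVERAGE-BANDWIDTH"]), uri


SAMPLE = """\
#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=2369840,AVERAGE-BANDWIDTH=1168160,CODECS="avc1.4d401e,mp4a.40.2",RESOLUTION=768x432,FRAME-RATE=25.000
medias/105612-000-A_v432.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=4720688,AVERAGE-BANDWIDTH=2164360,CODECS="avc1.4d0028,mp4a.40.2",RESOLUTION=1920x1080,FRAME-RATE=25.000
medias/105612-000-A_v1080.m3u8
"""

# Highest resolution first, mirroring how variants are listed to the user.
variants = sorted(
    iter_stream_variants(SAMPLE), key=lambda v: int(v[0][:-1]), reverse=True
)
```

The `<height>p` code mirrors the variant code built in `hls.py` from `stream_info.resolution[1]`.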


@ -0,0 +1,29 @@
#EXTM3U
#EXT-X-VERSION:7
#EXT-X-INDEPENDENT-SEGMENTS
#EXT-X-STREAM-INF:BANDWIDTH=2369840,AVERAGE-BANDWIDTH=1168160,VIDEO-RANGE=SDR,CODECS="avc1.4d401e,mp4a.40.2",RESOLUTION=768x432,FRAME-RATE=25.000,AUDIO="program_audio_0",SUBTITLES="subs"
medias/105612-000-A_v432.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=4720688,AVERAGE-BANDWIDTH=2164360,VIDEO-RANGE=SDR,CODECS="avc1.4d0028,mp4a.40.2",RESOLUTION=1920x1080,FRAME-RATE=25.000,AUDIO="program_audio_0",SUBTITLES="subs"
medias/105612-000-A_v1080.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=4067496,AVERAGE-BANDWIDTH=1921696,VIDEO-RANGE=SDR,CODECS="avc1.4d401f,mp4a.40.2",RESOLUTION=1280x720,FRAME-RATE=25.000,AUDIO="program_audio_0",SUBTITLES="subs"
medias/105612-000-A_v720.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=1443248,AVERAGE-BANDWIDTH=729696,VIDEO-RANGE=SDR,CODECS="avc1.4d401e,mp4a.40.2",RESOLUTION=640x360,FRAME-RATE=25.000,AUDIO="program_audio_0",SUBTITLES="subs"
medias/105612-000-A_v360.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=819168,AVERAGE-BANDWIDTH=430848,VIDEO-RANGE=SDR,CODECS="avc1.42e00d,mp4a.40.2",RESOLUTION=384x216,FRAME-RATE=25.000,AUDIO="program_audio_0",SUBTITLES="subs"
medias/105612-000-A_v216.m3u8
#EXT-X-I-FRAME-STREAM-INF:BANDWIDTH=670672,AVERAGE-BANDWIDTH=158304,VIDEO-RANGE=SDR,CODECS="avc1.4d401e",RESOLUTION=768x432,URI="medias/105612-000-A_v432_iframe_index.m3u8"
#EXT-X-I-FRAME-STREAM-INF:BANDWIDTH=1255560,AVERAGE-BANDWIDTH=266544,VIDEO-RANGE=SDR,CODECS="avc1.4d0028",RESOLUTION=1920x1080,URI="medias/105612-000-A_v1080_iframe_index.m3u8"
#EXT-X-I-FRAME-STREAM-INF:BANDWIDTH=1096696,AVERAGE-BANDWIDTH=250848,VIDEO-RANGE=SDR,CODECS="avc1.4d401f",RESOLUTION=1280x720,URI="medias/105612-000-A_v720_iframe_index.m3u8"
#EXT-X-I-FRAME-STREAM-INF:BANDWIDTH=458864,AVERAGE-BANDWIDTH=103496,VIDEO-RANGE=SDR,CODECS="avc1.4d401e",RESOLUTION=640x360,URI="medias/105612-000-A_v360_iframe_index.m3u8"
#EXT-X-I-FRAME-STREAM-INF:BANDWIDTH=130136,AVERAGE-BANDWIDTH=42200,VIDEO-RANGE=SDR,CODECS="avc1.42e00d",RESOLUTION=384x216,URI="medias/105612-000-A_v216_iframe_index.m3u8"
#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="program_audio_0",LANGUAGE="fr",NAME="VOF",AUTOSELECT=YES,DEFAULT=YES,URI="medias/105612-000-A_aud_VOF.m3u8"
#EXT-X-MEDIA:TYPE=SUBTITLES,GROUP-ID="subs",NAME="Français (ST Sourds/Mal)",DEFAULT=YES,AUTOSELECT=YES,FORCED=NO,LANGUAGE="fr",CHARACTERISTICS="public.accessibility.transcribes-spoken-dialog,public.accessibility.describes-music-and-sound",URI="medias/105612-000-A_st_VF-MAL.m3u8"
#SPRITES: medias/105612-000-A_SPR.vtt


@ -0,0 +1,8 @@
#EXTM3U
#EXT-X-VERSION:7
#EXT-X-TARGETDURATION:3149
#EXT-X-MEDIA-SEQUENCE:1
#EXT-X-PLAYLIST-TYPE:VOD
#EXTINF:3149,
105612-000-A_st_VA-ALL.vtt
#EXT-X-ENDLIST

File diff suppressed because it is too large

3355
samples/vtt/captions.vtt Normal file

File diff suppressed because it is too large

2216
samples/vtt/subtitles.vtt Normal file

File diff suppressed because it is too large


@ -1,6 +1,174 @@
# Licence: GNU AGPL v3: http://www.gnu.org/licenses/
# This file is part of [`delarte`](https://git.afpy.org/fcode/delarte.git)
# License: GNU AGPL v3: http://www.gnu.org/licenses/
# This file is part of `delarte` (https://git.afpy.org/fcode/delarte.git)
"""delarte - ArteTV downloader."""
__version__ = "0.1"
from .error import *
from .model import *
def fetch_program_sources(url, http):
"""Fetch program sources listed on given ArteTV page."""
from .www import iter_programs
return [
ProgramSource(
program,
player_config_url,
)
for program, player_config_url in iter_programs(url, http)
]
def fetch_rendition_sources(program_sources, http):
"""Fetch renditions for given programs."""
from itertools import groupby
from .api import iter_renditions
sources = [
RenditionSource(
program,
rendition,
protocol,
program_index_url,
)
for program, player_config_url in program_sources
for rendition, protocol, program_index_url in iter_renditions(
program.id,
player_config_url,
http,
)
]
descriptors = list({(s.rendition.code, s.rendition.label) for s in sources})
descriptors.sort()
for code, group in groupby(descriptors, key=lambda t: t[0]):
labels_for_code = [t[1] for t in group]
if len(labels_for_code) != 1:
raise UnexpectedError("MULTIPLE_RENDITION_LABELS", code, labels_for_code)
return sources
def fetch_variant_sources(renditions_sources, http):
"""Fetch variants for given renditions."""
from itertools import groupby
from .hls import iter_variants
sources = [
VariantSource(
program,
rendition,
variant,
VariantSource.VideoMedia(*video),
VariantSource.AudioMedia(*audio),
VariantSource.SubtitlesMedia(*subtitles) if subtitles else None,
)
for program, rendition, protocol, program_index_url in renditions_sources
for variant, video, audio, subtitles in iter_variants(
protocol, program_index_url, http
)
]
descriptors = list(
{(s.variant.code, s.video_media.track.frame_rate) for s in sources}
)
descriptors.sort()
for code, group in groupby(descriptors, key=lambda t: t[0]):
frame_rates_for_code = [t[1] for t in group]
if len(frame_rates_for_code) != 1:
raise UnexpectedError(
"MULTIPLE_VARIANT_FRAME_RATES", code, frame_rates_for_code
)
return sources
def fetch_targets(variant_sources, http, **naming_options):
"""Compile download targets for given variants."""
from .hls import fetch_mp4_media, fetch_vtt_media
from .naming import file_name_builder
build_file_name = file_name_builder(**naming_options)
targets = [
Target(
Target.VideoInput(
video_media.track,
fetch_mp4_media(video_media.track_index_url, http),
),
Target.AudioInput(
audio_media.track,
fetch_mp4_media(audio_media.track_index_url, http),
),
(
Target.SubtitlesInput(
subtitles_media.track,
fetch_vtt_media(subtitles_media.track_index_url, http),
)
if subtitles_media
else None
),
(program.title, program.subtitle) if program.subtitle else program.title,
build_file_name(program, rendition, variant),
)
for program, rendition, variant, video_media, audio_media, subtitles_media in variant_sources
]
return targets
def download_targets(targets, http, on_progress):
"""Download given targets."""
import os
from .download import download_mp4_media, download_vtt_media
from .muxing import mux_target
for target in targets:
output_path = f"{target.output}.mkv"
if os.path.isfile(output_path):
print(f"Skipping {output_path!r}")
continue
video_path = target.output + ".video.mp4"
audio_path = target.output + ".audio.mp4"
subtitles_path = target.output + ".srt"
download_mp4_media(target.video_input.url, video_path, http, on_progress)
download_mp4_media(target.audio_input.url, audio_path, http, on_progress)
if target.subtitles_input:
download_vtt_media(
target.subtitles_input.url, subtitles_path, http, on_progress
)
mux_target(
target._replace(
video_input=target.video_input._replace(url=video_path),
audio_input=target.audio_input._replace(url=audio_path),
subtitles_input=(
target.subtitles_input._replace(url=subtitles_path)
if target.subtitles_input
else None
),
),
on_progress,
)
if os.path.isfile(subtitles_path):
os.unlink(subtitles_path)
if os.path.isfile(audio_path):
os.unlink(audio_path)
if os.path.isfile(video_path):
os.unlink(video_path)
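
Both `fetch_rendition_sources` and `fetch_variant_sources` end with the same sanity check: collect the distinct (code, attribute) pairs, sort them, group by code, and fail if any code maps to more than one attribute. The following self-contained sketch isolates that pattern. The `Rendition` tuple here is a hypothetical stand-in for the model class, and `find_ambiguous_codes` is a name made up for the example; it returns the offending codes instead of raising.

```python
from collections import namedtuple
from itertools import groupby

# Hypothetical stand-in for the project's model class.
Rendition = namedtuple("Rendition", "code label")


def find_ambiguous_codes(renditions):
    """Return {code: [labels]} for codes that map to more than one label."""
    # The set removes exact duplicates; sorting is required because
    # groupby() only groups *adjacent* items sharing the same key.
    descriptors = sorted({(r.code, r.label) for r in renditions})
    ambiguous = {}
    for code, group in groupby(descriptors, key=lambda t: t[0]):
        labels = [label for _, label in group]
        if len(labels) != 1:
            ambiguous[code] = labels
    return ambiguous


consistent = [
    Rendition("VOF", "Français"),
    Rendition("VOF", "Français"),
    Rendition("VA", "Deutsch"),
]
conflicting = consistent + [Rendition("VA", "Allemand")]
```

With `consistent` the result is empty; with `conflicting` the code `VA` is reported with both labels, which is the situation the module turns into an `UnexpectedError`.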


@ -1,113 +1,199 @@
# Licence: GNU AGPL v3: http://www.gnu.org/licenses/
# This file is part of [`delarte`](https://git.afpy.org/fcode/delarte.git)
# License: GNU AGPL v3: http://www.gnu.org/licenses/
# This file is part of `delarte` (https://git.afpy.org/fcode/delarte.git)
"""delarte - ArteTV dowloader.
"""delarte - ArteTV downloader.
usage: delarte [-h|--help] - print this message
or: delarte program_page_url - show available versions
or: delarte program_page_url version - show available resolutions
or: delarte program_page_url version resolution - download the given video
Usage:
delarte (-h | --help)
delarte --version
delarte [options] URL
delarte [options] URL RENDITION
delarte [options] URL RENDITION VARIANT
Download a video from ArteTV streaming service. Omit RENDITION and/or
VARIANT to print the list of available values.
Arguments:
URL the URL from ArteTV website
RENDITION the rendition code [audio/subtitles language combination]
VARIANT the variant code [video quality version]
Options:
-h --help print this message
--version print current version of the program
--debug on error, print debugging information
--name-use-id use the program ID
--name-sep=<sep> field separator [default: - ]
--name-seq-pfx=<pfx> sequence counter prefix [default: - ]
--name-seq-no-pad disable sequence zero-padding
--name-add-rendition add rendition code
--name-add-variant add variant code
"""
import itertools
import sys
import time
from . import api
from . import hls
from . import muxing
from . import naming
from . import www
import docopt
import urllib3
from . import (
ModuleError,
UnexpectedError,
HTTPError,
__version__,
download_targets,
fetch_program_sources,
fetch_rendition_sources,
fetch_targets,
fetch_variant_sources,
)
def _fail(message, code=1):
print(message, file=sys.stderr)
return code
class Abort(ModuleError):
"""Aborted."""
def _print_available_renditions(config, f):
print(f"Available versions:", file=f)
for code, label in api.iter_renditions(config):
print(f"\t{code} - {label}", file=f)
class Fail(UnexpectedError):
"""Unexpected error."""
def _print_available_variants(version_index, f):
print(f"Available resolutions:", file=f)
for code, label in hls.iter_variants(version_index):
print(f"\t{code} - {label}", file=f)
def _create_progress():
# create a progress handler for input downloads
state = {}
def create_progress():
"""Create a progress handler for input downloads."""
state = {
"last_update_time": 0,
"last_channel": None,
}
def progress(channel, current, total):
def on_progress(file, current, total):
now = time.time()
if current == total:
print(f"\rDownloading {channel}: 100.0%")
state["last_update_time"] = now
elif channel != state["last_channel"]:
print(f"Dowloading {channel}: 0.0%", end="")
state["last_update_time"] = now
state["last_channel"] = channel
elif now - state["last_update_time"] > 1:
if current == 0:
print(f"Downloading {file!r}: 0.0%", end="")
state["start_time"] = now
state["last_time"] = now
state["last_count"] = 0
elif current == total:
elapsed_time = now - state["start_time"]
rate = int(total / elapsed_time) if elapsed_time else "NaN"
print(f"\rDownloading {file!r}: 100.0% [{rate}]")
state.clear()
elif now - state["last_time"] > 1:
elapsed_time1 = now - state["start_time"]
elapsed_time2 = now - state["last_time"]
progress = int(1000.0 * current / total) / 10.0
rate1 = int(current / elapsed_time1) if elapsed_time1 else "NaN"
rate2 = (
int((current - state["last_count"]) / elapsed_time2)
if elapsed_time2
else "NaN"
)
print(
f"\rDownloading {channel}: {int(1000.0 * current / total) / 10.0}%",
f"\rDownloading {file!r}: {progress}% [{rate1}, {rate2}]",
end="",
)
state["last_update_time"] = now
state["last_time"] = now
state["last_count"] = current
return progress
return on_progress
def _select_rendition_sources(rendition_code, rendition_sources):
if rendition_code:
filtered = [s for s in rendition_sources if s.rendition.code == rendition_code]
if filtered:
return filtered
print(
f"{rendition_code!r} is not a valid rendition code. Available values are:"
)
else:
print("Available renditions:")
key = lambda s: (s.rendition.label, s.rendition.code)
rendition_sources.sort(key=key)
for (label, code), _ in itertools.groupby(rendition_sources, key=key):
print(f"{code:>12} : {label}")
raise Abort()
def _select_variant_sources(variant_code, variant_sources):
if variant_code:
filtered = [s for s in variant_sources if s.variant.code == variant_code]
if filtered:
return filtered
print(f"{variant_code!r} is not a valid variant code. Available values are:")
else:
print("Available variants:")
variant_sources.sort(key=lambda s: s.video_media.track.height, reverse=True)
for code, _ in itertools.groupby(variant_sources, key=lambda s: s.variant.code):
print(f"{code:>12}")
raise Abort()
def main():
"""CLI command."""
args = sys.argv[1:]
if not args or args[0] == "-h" or args[0] == "--help":
print(__doc__)
return 0
args = docopt.docopt(__doc__, sys.argv[1:], version=__version__)
http = urllib3.PoolManager(timeout=5)
try:
www_lang, program_id = www.parse_url(args.pop(0))
except ValueError as e:
return _fail(f"Invalid url: {e}")
program_sources = fetch_program_sources(args["URL"], http)
try:
config = api.load_config(www_lang, program_id)
except ValueError:
return _fail("Invalid program")
rendition_sources = _select_rendition_sources(
args["RENDITION"],
fetch_rendition_sources(program_sources, http),
)
if not args:
_print_available_renditions(config, sys.stdout)
return 0
variant_sources = _select_variant_sources(
args["VARIANT"],
fetch_variant_sources(rendition_sources, http),
)
master_playlist_url = api.select_rendition(config, args.pop(0))
if master_playlist_url is None:
_fail("Invalid version")
_print_available_renditions(config, sys.stderr)
targets = fetch_targets(
variant_sources,
http,
**{
k[7:].replace("-", "_"): v
for k, v in args.items()
if k.startswith("--name-")
},
)
download_targets(targets, http, _create_progress())
except UnexpectedError as e:
if args["--debug"]:
raise e
print(str(e))
print()
print(
"This program is the result of browser/server traffic analysis and involves\n"
"some level of trying and guessing. This error might mean that we did not try\n"
"enough or that we guessed poorly."
)
print("")
print("Please consider submitting the issue to us so we may fix it.")
print("")
print("Issue tracker: https://git.afpy.org/fcode/delarte/issues")
print(f"Title: {e.args[0]}")
print("Body:")
print(f" {repr(e)}")
return 1
master_playlist = hls.load_master_playlist(master_playlist_url)
except ModuleError as e:
if args["--debug"]:
raise e
print(str(e))
return 1
if not args:
_print_available_variants(master_playlist, sys.stdout)
return 0
remote_inputs = hls.select_variant(master_playlist, args.pop(0))
if remote_inputs is None:
_fail("Invalid resolution")
_print_available_variants(master_playlist, sys.stderr)
return 0
file_base_name = naming.build_file_base_name(config)
progress = create_progress()
with hls.download_inputs(remote_inputs, progress) as temp_inputs:
muxing.mux(temp_inputs, file_base_name, progress)
except HTTPError as e:
if args["--debug"]:
raise e
print("Network error.")
return 1
if __name__ == "__main__":
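
The call to `fetch_targets` in `main()` turns every `--name-*` command-line option into a keyword argument by stripping the prefix and replacing dashes with underscores. A small sketch of that mapping, using a hand-written dict in place of the one `docopt` would return (the helper name `naming_options` and the sample values are assumptions for illustration):

```python
def naming_options(args):
    """Extract `--name-*` entries from a docopt-style dict as keyword arguments."""
    prefix = "--name-"
    return {
        key[len(prefix):].replace("-", "_"): value
        for key, value in args.items()
        if key.startswith(prefix)
    }


# Hand-written stand-in for the dict returned by docopt.
args = {
    "--debug": False,
    "--name-use-id": True,
    "--name-sep": " - ",
    "--name-seq-no-pad": False,
    "URL": "https://www.arte.tv/",
}

options = naming_options(args)
```

Using `len(prefix)` instead of the literal `7` keeps the slice correct if the prefix ever changes; the behavior is otherwise the same as the comprehension in `main()`.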


@ -1,58 +1,71 @@
# Licence: GNU AGPL v3: http://www.gnu.org/licenses/
# This file is part of [`delarte`](https://git.afpy.org/fcode/delarte.git)
# License: GNU AGPL v3: http://www.gnu.org/licenses/
# This file is part of `delarte` (https://git.afpy.org/fcode/delarte.git)
"""Provide ArteTV JSON API utilities."""
import json
from http import HTTPStatus
from urllib.request import urlopen
from .error import UnexpectedAPIResponse, HTTPError
from .model import Rendition
MIME_TYPE = "application/vnd.api+json; charset=utf-8"
def load_api_data(url):
"""Retrieve the root node (infamous "data") of an API call response."""
http_response = urlopen(url)
def _fetch_api_object(http, url, object_type):
# Fetch an API object.
if http_response.status != HTTPStatus.OK:
raise RuntimeError("API request failed")
r = http.request("GET", url)
HTTPError.raise_for_status(r)
if (
http_response.getheader("Content-Type")
!= "application/vnd.api+json; charset=utf-8"
):
raise ValueError("API response not supported")
mime_type = r.getheader("content-type")
if mime_type != MIME_TYPE:
raise UnexpectedAPIResponse("MIME_TYPE", url, MIME_TYPE, mime_type)
return json.load(http_response)["data"]
obj = json.loads(r.data.decode("utf-8"))
try:
data_type = obj["data"]["type"]
if data_type != object_type:
raise UnexpectedAPIResponse("OBJECT_TYPE", url, object_type, data_type)
return obj["data"]["attributes"]
except (KeyError, IndexError, ValueError) as e:
raise UnexpectedAPIResponse("SCHEMA", url) from e
def load_config(lang, program_id):
"""Retrieve a program config from API."""
url = f"https://api.arte.tv/api/player/v2/config/{lang}/{program_id}"
config = load_api_data(url)
def iter_renditions(program_id, player_config_url, http):
"""Iterate over renditions for the given program."""
obj = _fetch_api_object(http, player_config_url, "ConfigPlayer")
if config["type"] != "ConfigPlayer":
raise ValueError("Invalid API response")
codes = set()
try:
provider_id = obj["metadata"]["providerId"]
if provider_id != program_id:
raise UnexpectedAPIResponse(
"PROVIDER_ID_MISMATCH", player_config_url, provider_id
)
if config["attributes"]["metadata"]["providerId"] != program_id:
raise ValueError("Invalid API response")
for s in obj["streams"]:
code = s["versions"][0]["eStat"]["ml5"]
return config
if code in codes:
raise UnexpectedAPIResponse(
"DUPLICATE_RENDITION_CODE", player_config_url, code
)
codes.add(code)
yield (
Rendition(
code,
s["versions"][0]["label"],
),
s["protocol"],
s["url"],
)
def iter_renditions(config):
"""Return a rendition (code, label) iterator."""
for stream in config["attributes"]["streams"]:
yield (
# rendition code
stream["versions"][0]["eStat"]["ml5"],
# rendition full name
stream["versions"][0]["label"],
)
except (KeyError, IndexError, ValueError) as e:
raise UnexpectedAPIResponse("SCHEMA", player_config_url) from e
def select_rendition(config, rendition_code):
"""Return the master playlist index url for the given rendition code."""
for stream in config["attributes"]["streams"]:
if stream["versions"][0]["eStat"]["ml5"] == rendition_code:
return stream["url"]
return None
if not codes:
raise UnexpectedAPIResponse("NO_RENDITIONS", player_config_url)
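
`_fetch_api_object` expects a JSON:API style envelope: a top-level `data` object whose `type` names the kind of object and whose `attributes` hold the payload. The sketch below reproduces only the envelope check, without any network access. `SchemaError` is a hypothetical stand-in for `UnexpectedAPIResponse`, and `extract_attributes` is a name invented for this example.

```python
import json


class SchemaError(Exception):
    """Hypothetical stand-in for the module's UnexpectedAPIResponse."""


def extract_attributes(payload, object_type):
    """Return `data.attributes` after checking that `data.type` matches."""
    obj = json.loads(payload)
    try:
        data = obj["data"]
        data_type = data["type"]
        attributes = data["attributes"]
    except (KeyError, TypeError) as e:
        # Missing keys or a non-dict node both mean the schema changed.
        raise SchemaError("SCHEMA") from e
    if data_type != object_type:
        raise SchemaError("OBJECT_TYPE", object_type, data_type)
    return attributes


GOOD = '{"data": {"type": "ConfigPlayer", "attributes": {"streams": []}}}'
WRONG_TYPE = '{"data": {"type": "Other", "attributes": {}}}'
NO_DATA = '{"errors": []}'
```

Keeping the type comparison outside the `try` block makes it obvious that only lookup failures are reported as a schema problem.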

61
src/delarte/download.py Normal file

@ -0,0 +1,61 @@
# License: GNU AGPL v3: http://www.gnu.org/licenses/
# This file is part of `delarte` (https://git.afpy.org/fcode/delarte.git)
"""Provide download utilities."""
import os
from . import subtitles
from .error import HTTPError
_CHUNK = 64 * 1024
def download_mp4_media(url, file_name, http, on_progress):
"""Download an MP4 (video or audio) to given file."""
on_progress(file_name, 0, 0)
if os.path.isfile(file_name):
on_progress(file_name, 1, 1)
return
temp_file = f"{file_name}.tmp"
with open(temp_file, "ab") as f:
r = http.request(
"GET",
url,
headers={"Range": f"bytes={f.tell()}-"},
preload_content=False,
)
HTTPError.raise_for_status(r)
_, total = r.getheader("content-range").split("/")
total = int(total)
for content in r.stream(_CHUNK, True):
f.write(content)
on_progress(file_name, f.tell(), total)
r.release_conn()
os.rename(temp_file, file_name)
def download_vtt_media(url, file_name, http, on_progress):
"""Download a VTT and SRT-convert it to given file."""
on_progress(file_name, 0, 0)
if os.path.isfile(file_name):
on_progress(file_name, 1, 1)
return
temp_file = f"{file_name}.tmp"
with open(temp_file, "w", encoding="utf-8") as f:
r = http.request("GET", url)
HTTPError.raise_for_status(r)
subtitles.convert(r.data.decode("utf-8"), f)
on_progress(file_name, f.tell(), f.tell())
os.rename(temp_file, file_name)
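
`download_mp4_media` resumes an interrupted download by opening the temporary file in append mode and asking the server only for the bytes that are still missing (`Range: bytes=<size>-`); the total size is then read from the part of the `Content-Range` response header after the slash. The two helpers below sketch that header handling in isolation. Their names are assumptions for this example, and the parser deliberately rejects responses without a known total (such as `bytes */2048`), a case the module does not handle explicitly.

```python
import re

# "bytes <start>-<end>/<total>", the form a server uses in a 206 response.
_CONTENT_RANGE = re.compile(r"bytes (\d+)-(\d+)/(\d+)")


def resume_header(bytes_on_disk):
    """Build the request header asking for everything after `bytes_on_disk`."""
    return {"Range": f"bytes={bytes_on_disk}-"}


def parse_content_range(value):
    """Return (start, end, total) from a Content-Range header value."""
    match = _CONTENT_RANGE.fullmatch(value.strip())
    if not match:
        raise ValueError(f"unsupported Content-Range: {value!r}")
    start, end, total = (int(group) for group in match.groups())
    return start, end, total


header = resume_header(1024)
start, end, total = parse_content_range("bytes 1024-2047/2048")
```

A fresh download is simply the case where the temporary file is empty, so the header becomes `bytes=0-` and the same code path serves both situations.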

73
src/delarte/error.py Normal file

@ -0,0 +1,73 @@
# License: GNU AGPL v3: http://www.gnu.org/licenses/
# This file is part of `delarte` (https://git.afpy.org/fcode/delarte.git)
"""Provide common error classes."""
class ModuleError(Exception):
"""Module error."""
def __str__(self):
"""Use the class definition docstring as a string representation."""
return self.__doc__
def __repr__(self):
"""Use the class qualified name and constructor arguments."""
return f"{self.__class__.__qualname__}{self.args!r}"
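
The split between `__str__` and `__repr__` is what lets the command line print a short user-facing message while still offering a precise string for bug reports. A stand-alone sketch of the idea follows; the classes are trimmed copies made for this example, the URL is a placeholder, and `__repr__` is written here with `__qualname__` so the output shows the bare class name.

```python
class ModuleError(Exception):
    """Module error."""

    def __str__(self):
        # The user-facing message is the docstring of the concrete class.
        return self.__doc__

    def __repr__(self):
        # The developer-facing form shows the class name and raw arguments.
        return f"{self.__class__.__qualname__}{self.args!r}"


class UnexpectedError(ModuleError):
    """An error to report to developers."""


class UnexpectedAPIResponse(UnexpectedError):
    """Unexpected response from ArteTV."""


error = UnexpectedAPIResponse("MIME_TYPE", "https://example.invalid/config")
message = str(error)
report = repr(error)
```

Because the message comes from the docstring, adding a new error type only requires a subclass with a one-line description.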
class ExpectedError(ModuleError):
"""A feature limitation to submit as an enhancement to developers."""
class UnexpectedError(ModuleError):
"""An error to report to developers."""
class HTTPError(Exception):
"""A wrapper around a failed HTTP response."""
@classmethod
def raise_for_status(cls, r):
if not 200 <= r.status < 300:
raise cls(r)
#
# www
#
class PageNotFound(ModuleError):
"""Page not found at ArteTV."""
class PageNotSupported(ExpectedError):
"""The page you are trying to download from is not (yet) supported."""
class InvalidPage(UnexpectedError):
"""Invalid ArteTV page."""
#
# api
#
class UnexpectedAPIResponse(UnexpectedError):
"""Unexpected response from ArteTV."""
#
# hls
#
class UnexpectedHLSResponse(UnexpectedError):
"""Unexpected response from ArteTV."""
class UnsupportedHLSProtocol(ModuleError):
"""Program type not supported."""
#
# subtitles
#
class WebVTTError(UnexpectedError):
"""Unexpected WebVTT data."""


@ -1,338 +1,192 @@
# Licence: GNU AGPL v3: http://www.gnu.org/licenses/
# This file is part of [`delarte`](https://git.afpy.org/fcode/delarte.git)
# License: GNU AGPL v3: http://www.gnu.org/licenses/
# This file is part of `delarte` (https://git.afpy.org/fcode/delarte.git)
"""Provide HLS protocol utilities."""
# For terminology, from HLS protocol RFC8216
# 2. Overview
#
# A multimedia presentation is specified by a Uniform Resource
# Identifier (URI) [RFC3986] to a Playlist.
#
# A Playlist is either a Media Playlist or a Master Playlist. Both are
# UTF-8 text files containing URIs and descriptive tags.
#
# A Media Playlist contains a list of Media Segments, which, when
# played sequentially, will play the multimedia presentation.
#
# Here is an example of a Media Playlist:
#
# #EXTM3U
# #EXT-X-TARGETDURATION:10
#
# #EXTINF:9.009,
# http://media.example.com/first.ts
# #EXTINF:9.009,
# http://media.example.com/second.ts
# #EXTINF:3.003,
# http://media.example.com/third.ts
#
# The first line is the format identifier tag #EXTM3U. The line
# containing #EXT-X-TARGETDURATION says that all Media Segments will be
# 10 seconds long or less. Then, three Media Segments are declared.
# The first and second are 9.009 seconds long; the third is 3.003
# seconds.
#
# To play this Playlist, the client first downloads it and then
# downloads and plays each Media Segment declared within it. The
# client reloads the Playlist as described in this document to discover
# any added segments. Data SHOULD be carried over HTTP [RFC7230], but,
# in general, a URI can specify any protocol that can reliably transfer
# the specified resource on demand.
#
# A more complex presentation can be described by a Master Playlist. A
# Master Playlist provides a set of Variant Streams, each of which
# describes a different version of the same content.
#
# A Variant Stream includes a Media Playlist that specifies media
# encoded at a particular bit rate, in a particular format, and at a
# particular resolution for media containing video.
#
# A Variant Stream can also specify a set of Renditions. Renditions
# are alternate versions of the content, such as audio produced in
# different languages or video recorded from different camera angles.
#
# Clients should switch between different Variant Streams to adapt to
# network conditions. Clients should choose Renditions based on user
# preferences.
import contextlib
import io
import os
import re
from http import HTTPStatus
from http.client import HTTPConnection, HTTPSConnection
from tempfile import NamedTemporaryFile
from urllib.parse import urlparse
from urllib.request import urlopen
import m3u8
import webvtt
from .error import UnexpectedHLSResponse, UnsupportedHLSProtocol, HTTPError
from .model import AudioTrack, SubtitlesTrack, Variant, VideoTrack
#
# WARNING !
#
# This module does not aim for a full implementation of HLS, only the
# subset usefull for the actual observed usage of ArteTV.
# subset useful for the actual observed usage of ArteTV.
#
# - URIs are relative file paths
# - Master playlists have at least one variant
# - Program indexes have at least one variant
# - Every variant is of different resolution
# - Every variant has exactly one audio medium
# - Every variant has at most one subtitles medium
# - Audio and video media playlists segments are incrmental ranges of the same file
# - Subtitles media playlists have only one segment
# - Audio and video indexes segments are incremental ranges of
# the same file
# - Subtitles indexes have only one segment
MIME_TYPE = "application/x-mpegURL"
def _make_resolution_code(variant):
# resolution code (1080p, 720p, ...)
return f"{variant.stream_info.resolution[1]}p"
def _fetch_index(http, url):
# Fetch a M3U8 playlist
r = http.request("GET", url)
HTTPError.raise_for_status(r)
mime_type = r.getheader("content-type")
if mime_type != MIME_TYPE:
raise UnexpectedHLSResponse("MIME_TYPE", url, MIME_TYPE, mime_type)
return m3u8.loads(r.data.decode("utf-8"), url)
def _is_relative_file_path(uri):
try:
url = urlparse(uri)
return url.path == uri and not uri.startswith("/")
except ValueError:
return False
def iter_variants(protocol, program_index_url, http):
"""Iterate over variants for the given rendition."""
if protocol != "HLS_NG":
raise UnsupportedHLSProtocol(protocol, program_index_url)
program_index = _fetch_index(http, program_index_url)
def load_master_playlist(url):
"""Download and return a master playlist."""
master_playlist = m3u8.load(url)
audio_media = None
subtitles_media = None
if not master_playlist.playlists:
raise ValueError("Unexpected missing playlists")
resolution_codes = set()
for variant in master_playlist.playlists:
resolution_code = _make_resolution_code(variant)
if resolution_code in resolution_codes:
raise ValueError("Unexpected duplicate resolution")
resolution_codes.add(resolution_code)
audio_media = False
subtitles_media = False
for m in variant.media:
if not _is_relative_file_path(m.uri):
raise ValueError("Invalid relative file name")
if m.type == "AUDIO":
for media in program_index.media:
match media.type:
case "AUDIO":
if audio_media:
raise ValueError("Unexpected multiple audio tracks")
audio_media = True
elif m.type == "SUBTITLES":
raise UnexpectedHLSResponse(
"MULTIPLE_AUDIO_MEDIA", program_index_url
)
audio_media = media
case "SUBTITLES":
if subtitles_media:
raise ValueError("Unexpected multiple subtitles tracks")
subtitles_media = True
raise UnexpectedHLSResponse(
"MULTIPLE_SUBTITLES_MEDIA", program_index_url
)
subtitles_media = media
if not audio_media:
raise ValueError("Unexpected missing audio track")
if not audio_media:
raise UnexpectedHLSResponse("NO_AUDIO_MEDIA", program_index_url)
return master_playlist
audio = (
AudioTrack(
audio_media.name,
audio_media.language,
audio_media.name.startswith("VO"),
(
audio_media.characteristics is not None
and ("public.accessibility" in audio_media.characteristics)
),
),
audio_media.absolute_uri,
)
subtitles = (
(
SubtitlesTrack(
subtitles_media.name,
subtitles_media.language,
(
subtitles_media.characteristics is not None
and ("public.accessibility" in subtitles_media.characteristics)
),
),
subtitles_media.absolute_uri,
)
if subtitles_media
else None
)
codes = set()
for video_media in program_index.playlists:
stream_info = video_media.stream_info
if stream_info.audio != audio_media.group_id:
raise UnexpectedHLSResponse(
"INVALID_AUDIO_MEDIA", program_index_url, stream_info.audio
)
if subtitles_media:
if stream_info.subtitles != subtitles_media.group_id:
raise UnexpectedHLSResponse(
"INVALID_SUBTITLES_MEDIA", program_index_url, stream_info.subtitles
)
elif stream_info.subtitles:
raise UnexpectedHLSResponse(
"INVALID_SUBTITLES_MEDIA", program_index_url, stream_info.subtitles
)
code = f"{stream_info.resolution[1]}p"
if code in codes:
raise UnexpectedHLSResponse(
"DUPLICATE_STREAM_CODE", program_index_url, code
)
codes.add(code)
def iter_variants(master_playlist):
"""Iterate over variants."""
for variant in sorted(
master_playlist.playlists,
key=lambda v: v.stream_info.resolution[1],
reverse=True,
):
yield (
_make_resolution_code(variant),
f"{variant.stream_info.resolution[0]} x {variant.stream_info.resolution[1]}",
Variant(
code,
stream_info.average_bandwidth,
),
(
VideoTrack(
stream_info.resolution[0],
stream_info.resolution[1],
stream_info.frame_rate,
),
video_media.absolute_uri,
),
audio,
subtitles,
)
def select_variant(master_playlist, resolution_code):
"""Return the stream information for a given resolution code."""
for variant in master_playlist.playlists:
code = _make_resolution_code(variant)
if code != resolution_code:
continue
audio_track = None
for m in variant.media:
if m.type == "AUDIO":
audio_track = (m.language, variant.base_uri + m.uri)
break
subtitles_track = None
for m in variant.media:
if m.type == "SUBTITLES":
subtitles_track = (m.language, variant.base_uri + m.uri)
break
return (
variant.base_uri + variant.uri,
audio_track,
subtitles_track,
)
return None
if not codes:
raise UnexpectedHLSResponse("NO_VARIANTS", program_index_url)
def _parse_byterange(obj):
# Parse a M3U8 `byterange` (count@offset) into http range (range_start, rang_end)
def _convert_byterange(obj):
# Convert a M3U8 `byterange` (1) to an `http range` (2).
# 1. "count@offset"
# 2. (start, end)
count, offset = [int(v) for v in obj.byterange.split("@")]
return offset, offset + count - 1
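
An M3U8 byte range is written `count@offset`, while an HTTP range is an inclusive `(start, end)` pair, hence the `- 1` in the conversion. `fetch_mp4_media` relies on this to verify that the initialization section and all segments form one gap-free run over a single file, which is what allows the whole track to be fetched with one request. A minimal sketch of both steps, working on plain strings instead of `m3u8` objects (the function names are assumptions for the example):

```python
def convert_byterange(byterange):
    """Convert an M3U8 'count@offset' string to an inclusive (start, end) pair."""
    count, offset = (int(v) for v in byterange.split("@"))
    return offset, offset + count - 1


def is_contiguous(byteranges):
    """Check that the ranges start at 0 and each one begins where the last ended."""
    next_start = 0
    for byterange in byteranges:
        start, end = convert_byterange(byterange)
        if start != next_start:
            return False
        next_start = end + 1
    return True


first = convert_byterange("1000@0")
ok = is_contiguous(["1000@0", "500@1000", "250@1500"])
gap = is_contiguous(["1000@0", "500@1200"])
```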
def _load_av_segments(media_playlist_url):
media_playlist = m3u8.load(media_playlist_url)
def fetch_mp4_media(track_index_url, http):
"""Fetch an audio or video media."""
track_index = _fetch_index(http, track_index_url)
file_name = media_playlist.segment_map[0].uri
range_start, range_end = _parse_byterange(media_playlist.segment_map[0])
if range_start != 0:
raise ValueError("Invalid a/v index: does not start at 0")
chunks = [(range_start, range_end)]
total = range_end + 1
file_name = track_index.segment_map[0].uri
start, end = _convert_byterange(track_index.segment_map[0])
if start != 0:
raise UnexpectedHLSResponse("INVALID_AV_INDEX_FRAGMENT_START", track_index_url)
for segment in media_playlist.segments:
# ranges = [(start, end)]
next_start = end + 1
for segment in track_index.segments:
if segment.uri != file_name:
raise ValueError("Invalid a/v index: multiple file names")
raise UnexpectedHLSResponse("MULTIPLE_AV_INDEX_FILES", track_index_url)
range_start, range_end = _parse_byterange(segment)
if range_start != total:
raise ValueError(
f"Invalid a/v index: discontious ranges ({range_start} != {total})"
start, end = _convert_byterange(segment)
if start != next_start:
raise UnexpectedHLSResponse(
"DISCONTINUOUS_AV_INDEX_FRAGMENT", track_index_url
)
chunks.append((range_start, range_end))
total = range_end + 1
# ranges.append((start, end))
next_start = end + 1
return urlparse(media_playlist.segment_map[0].absolute_uri), chunks
return track_index.segment_map[0].absolute_uri
def _download_av_stream(media_playlist_url, progress):
# Download an audio or video stream to temporary directory
url, ranges = _load_av_segments(media_playlist_url)
total = ranges[-1][1]
Connector = HTTPSConnection if url.scheme == "https" else HTTPConnection
connection = Connector(url.hostname)
connection.connect()
with (
NamedTemporaryFile(
mode="w+b", delete=False, prefix="delarte.", suffix=".mp4"
) as f,
contextlib.closing(connection) as c,
):
for range_start, range_end in ranges:
c.request(
"GET",
url.path,
headers={
"Accept": "*/*",
"Accept-Language": "fr,en;q=0.7,en-US;q=0.3",
"Accept-Encoding": "gzip, deflate, br, identity",
"Range": f"bytes={range_start}-{range_end}",
"Origin": "https://www.arte.tv",
"Connection": "keep-alive",
"Referer": "https://www.arte.tv/",
"Sec-Fetch-Dest": "empty",
"Sec-Fetch-Mode": "cors",
"Sec-Fetch-Site": "cross-site",
"Sec-GPC": "1",
"DNT": "1",
},
)
r = c.getresponse()
if r.status != 206:
raise ValueError(f"Invalid response status {r.status}")
content = r.read()
if len(content) != range_end - range_start + 1:
raise ValueError("Invalid range length")
f.write(content)
progress(range_end, total)
return f.name
def _download_subtitles_input(index_url, progress):
# Return a temporary file name where VTT subtitle has been downloaded/converted to SRT
subtitles_index = m3u8.load(index_url)
urls = [subtitles_index.base_uri + "/" + f for f in subtitles_index.files]
def fetch_vtt_media(track_index_url, http):
"""Fetch a subtitles media."""
track_index = _fetch_index(http, track_index_url)
urls = [s.absolute_uri for s in track_index.segments]
if not urls:
raise ValueError("No subtitle files")
raise UnexpectedHLSResponse("NO_S_INDEX_FILES", track_index_url)
if len(urls) > 1:
raise ValueError("Multiple subtitle files")
raise UnexpectedHLSResponse("MULTIPLE_S_INDEX_FILES", track_index_url)
progress(0, 2)
http_response = urlopen(urls[0])
if http_response.status != HTTPStatus.OK:
raise RuntimeError("Subtitle request failed")
buffer = io.StringIO(http_response.read().decode("utf8"))
progress(1, 2)
with NamedTemporaryFile(
"w", delete=False, prefix="delarte.", suffix=".srt", encoding="utf8"
) as f:
i = 1
for caption in webvtt.read_buffer(buffer):
print(i, file=f)
print(
re.sub(r"\.", ",", caption.start)
+ " --> "
+ re.sub(r"\.", ",", caption.end),
file=f,
)
print(caption.text + "\n", file=f)
i += 1
progress(2, 2)
return f.name
@contextlib.contextmanager
def download_inputs(remote_inputs, progress):
"""Download inputs in temporary files."""
# It is implemented as a context manager that will delete temporary files on exit.
video_index_url, audio_track, subtitles_track = remote_inputs
video_filename = None
audio_filename = None
subtitles_filename = None
try:
video_filename = _download_av_stream(
video_index_url, lambda i, n: progress("video", i, n)
)
(audio_lang, audio_index_url) = audio_track
audio_filename = _download_av_stream(
audio_index_url, lambda i, n: progress("audio", i, n)
)
if subtitles_track:
(subtitles_lang, subtitles_index_url) = subtitles_track
subtitles_filename = _download_subtitles_input(
subtitles_index_url, lambda i, n: progress("subtitles", i, n)
)
yield (
video_filename,
(audio_lang, audio_filename),
(subtitles_lang, subtitles_filename),
)
else:
yield (video_filename, (audio_lang, audio_filename), None)
finally:
if video_filename and os.path.isfile(video_filename):
os.unlink(video_filename)
if audio_filename and os.path.isfile(audio_filename):
os.unlink(audio_filename)
if subtitles_filename and os.path.isfile(subtitles_filename):
os.unlink(subtitles_filename)
return urls[0]
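
The byte-range handling in the index parsing above can be sketched in isolation. This is a minimal illustration, not the project's code: `parse_byterange` and `check_contiguous` are hypothetical names, and plain `ValueError` stands in for `UnexpectedHLSResponse`. It assumes each HLS byte range is a `count@offset` string and that fragments must follow each other without gaps, starting at offset 0.

```python
def parse_byterange(value):
    """Turn a "count@offset" byte range into an inclusive (start, end) pair."""
    count, offset = (int(v) for v in value.split("@"))
    return offset, offset + count - 1


def check_contiguous(byteranges):
    """Return the inclusive ranges, raising ValueError on any gap or overlap."""
    ranges = []
    next_start = 0  # the first fragment is expected to start at byte 0
    for value in byteranges:
        start, end = parse_byterange(value)
        if start != next_start:
            raise ValueError(f"discontinuous range: {start} != {next_start}")
        ranges.append((start, end))
        next_start = end + 1
    return ranges


print(check_contiguous(["100@0", "50@100", "25@150"]))
```

When every fragment is contiguous and lives in the same file, the whole media can be fetched from a single URL, which is presumably why the new `fetch_mp4_media` only returns the `absolute_uri` of the segment map.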

137
src/delarte/model.py Normal file

@ -0,0 +1,137 @@
# License: GNU AGPL v3: http://www.gnu.org/licenses/
# This file is part of `delarte` (https://git.afpy.org/fcode/delarte.git)
"""Provide data model types."""
from typing import NamedTuple, Optional
#
# Metadata objects
#
class Program(NamedTuple):
"""A program metadata."""
id: str
language: str
title: str
subtitle: str
class Rendition(NamedTuple):
"""A program rendition metadata."""
code: str
label: str
class Variant(NamedTuple):
"""A program variant metadata."""
code: str
average_bandwidth: int
#
# Track objects
#
class VideoTrack(NamedTuple):
"""A video track."""
width: int
height: int
frame_rate: float
class AudioTrack(NamedTuple):
"""An audio track."""
name: str
language: str
original: bool
visual_impaired: bool
class SubtitlesTrack(NamedTuple):
"""A subtitles track."""
name: str
language: str
hearing_impaired: bool
#
# Source objects
#
class ProgramSource(NamedTuple):
"""A program source item."""
program: Program
player_config_url: str
class RenditionSource(NamedTuple):
"""A rendition source item."""
program: Program
rendition: Rendition
protocol: str
program_index_url: str
class VariantSource(NamedTuple):
"""A variant source item."""
class VideoMedia(NamedTuple):
"""A video media."""
track: VideoTrack
track_index_url: str
class AudioMedia(NamedTuple):
"""An audio media."""
track: AudioTrack
track_index_url: str
class SubtitlesMedia(NamedTuple):
"""A subtitles media."""
track: SubtitlesTrack
track_index_url: str
program: Program
rendition: Rendition
variant: Variant
video_media: VideoMedia
audio_media: AudioMedia
subtitles_media: Optional[SubtitlesMedia]
class Target(NamedTuple):
"""A download target item."""
class VideoInput(NamedTuple):
"""A video input."""
track: VideoTrack
url: str
class AudioInput(NamedTuple):
"""An audio input."""
track: AudioTrack
url: str
class SubtitlesInput(NamedTuple):
"""A subtitles input."""
track: SubtitlesTrack
url: str
video_input: VideoInput
audio_input: AudioInput
subtitles_input: Optional[SubtitlesInput]
title: str | tuple[str, str]
output: str

@ -1,37 +1,74 @@
# Licence: GNU AGPL v3: http://www.gnu.org/licenses/
# This file is part of [`delarte`](https://git.afpy.org/fcode/delarte.git)
# License: GNU AGPL v3: http://www.gnu.org/licenses/
# This file is part of `delarte` (https://git.afpy.org/fcode/delarte.git)
"""Provide media muxing utilities."""
"""Provide target muxing utilities."""
import subprocess
def mux(inputs, file_base_name, progress):
"""Build FFMPEG args."""
video_input, audio_track, subtitles_track = inputs
audio_lang, audio_input = audio_track
if subtitles_track:
subtitles_lang, subtitles_input = subtitles_track
def mux_target(target, _progress):
"""Multiplexes target into a single file."""
cmd = ["ffmpeg", "-hide_banner"]
cmd.extend(["-i", video_input])
cmd.extend(["-i", audio_input])
if subtitles_track:
cmd.extend(["-i", subtitles_input])
# inputs
cmd.extend(["-i", target.video_input.url])
cmd.extend(["-i", target.audio_input.url])
if target.subtitles_input:
cmd.extend(["-i", target.subtitles_input.url])
# codecs
cmd.extend(["-c:v", "copy"])
cmd.extend(["-c:a", "copy"])
if subtitles_track:
if target.subtitles_input:
cmd.extend(["-c:s", "copy"])
cmd.extend(["-bsf:a", "aac_adtstoasc"])
cmd.extend(["-metadata:s:a:0", f"language={audio_lang}"])
if subtitles_track:
cmd.extend(["-metadata:s:s:0", f"language={subtitles_lang}"])
cmd.extend(["-disposition:s:0", "default"])
# stream metadata & disposition
# cmd.extend(["-metadata:s:v:0", f"name={target.video.name!r}"])
# cmd.extend(["-metadata:s:v:0", f"language={target.video.language!r}"])
cmd.append(f"{file_base_name}.mkv")
cmd.extend(["-metadata:s:a:0", f"name={target.audio_input.track.name}"])
cmd.extend(["-metadata:s:a:0", f"language={target.audio_input.track.language}"])
a_disposition = "default"
if target.audio_input.track.original:
a_disposition += "+original"
else:
a_disposition += "-original"
if target.audio_input.track.visual_impaired:
a_disposition += "+visual_impaired"
else:
a_disposition += "-visual_impaired"
cmd.extend(["-disposition:a:0", a_disposition])
if target.subtitles_input:
cmd.extend(["-metadata:s:s:0", f"name={target.subtitles_input.track.name}"])
cmd.extend(
["-metadata:s:s:0", f"language={target.subtitles_input.track.language}"]
)
s_disposition = "default"
if target.subtitles_input.track.hearing_impaired:
s_disposition += "+hearing_impaired+descriptions"
else:
s_disposition += "-hearing_impaired-descriptions"
cmd.extend(["-disposition:s:0", s_disposition])
# file metadata
if isinstance(target.title, tuple):
cmd.extend(["-metadata", f"title={target.title[0]}"])
cmd.extend(["-metadata", f"subtitle={target.title[1]}"])
else:
cmd.extend(["-metadata", f"title={target.title}"])
# output
cmd.append(f"{target.output}.mkv")
print(cmd)
subprocess.run(cmd)
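
The disposition strings built in `mux_target` follow one pattern: start from `default`, then append `+flag` or `-flag` for each boolean track attribute. A small sketch of that pattern follows; the helper name `disposition` is hypothetical, and whether a given ffmpeg build accepts the `-flag` form is an assumption carried over from the code above, not something verified here.

```python
def disposition(**flags):
    """Build a disposition value: "default" plus +name or -name per keyword."""
    value = "default"
    for name, enabled in flags.items():
        value += ("+" if enabled else "-") + name
    return value


# Mirrors the audio branch above for an original-language, non-described track.
audio = disposition(original=True, visual_impaired=False)
cmd = ["-disposition:a:0", audio]
print(cmd)
```

Keyword arguments keep their call order, so the flags appear in the same order as in the hand-written branches.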

@ -1,9 +1,49 @@
# Licence: GNU AGPL v3: http://www.gnu.org/licenses/
# This file is part of [`delarte`](https://git.afpy.org/fcode/delarte.git)
# License: GNU AGPL v3: http://www.gnu.org/licenses/
# This file is part of `delarte` (https://git.afpy.org/fcode/delarte.git)
"""Provide contexted based file naming utility."""
"""Provide context-based file naming utility."""
import re
def build_file_base_name(config):
"""Create a base file name from config metadata."""
return config["attributes"]["metadata"]["title"].replace("/", "-")
def file_name_builder(
*,
use_id=False,
sep=" - ",
seq_pfx=" - ",
seq_no_pad=False,
add_rendition=False,
add_variant=False
):
"""Create a file namer."""
def sub_sequence_counter(match):
index = match[1]
if not seq_no_pad:
index = (len(match[2]) - len(index)) * "0" + index
return seq_pfx + index
def replace_sequence_counter(s: str) -> str:
return re.sub(r"\s+\((\d+)/(\d+)\)", sub_sequence_counter, s)
def build_file_name(program, rendition, variant):
"""Create a file name."""
if use_id:
return program.id
fields = [replace_sequence_counter(program.title)]
if program.subtitle:
fields.append(replace_sequence_counter(program.subtitle))
if add_rendition:
fields.append(rendition.code)
if add_variant:
fields.append(variant.code)
name = sep.join(fields)
name = re.sub(r'[/:<>"\\|?*]', "", name)
return name
return build_file_name
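
To see what the two regular expressions in `file_name_builder` do, here is a standalone sketch. The function names and sample titles are illustrative only; `zfill` is used as an equivalent of the manual zero padding above.

```python
import re


def replace_sequence_counter(s, seq_pfx=" - ", pad=True):
    """Rewrite a trailing " (n/total)" counter as seq_pfx + n, zero padded."""

    def sub(match):
        index = match[1]
        if pad:
            # pad the index to the width of the total, e.g. 3 of 10 becomes "03"
            index = index.zfill(len(match[2]))
        return seq_pfx + index

    return re.sub(r"\s+\((\d+)/(\d+)\)", sub, s)


def sanitize(name):
    """Drop characters that are unsafe in file names on common file systems."""
    return re.sub(r'[/:<>"\\|?*]', "", name)


print(replace_sequence_counter("Acquitted (3/10)"))
print(sanitize("Who? What: Where/When"))
```

Padding to the width of the total keeps episodes sorted correctly when files are listed alphabetically.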

53
src/delarte/subtitles.py Normal file

@ -0,0 +1,53 @@
# License: GNU AGPL v3: http://www.gnu.org/licenses/
# This file is part of `delarte` (https://git.afpy.org/fcode/delarte.git)
"""Provide WebVTT to SRT subtitles conversion."""
import re
from .error import WebVTTError
RE_CUE_START = r"^((?:\d\d:)\d\d:\d\d)\.(\d\d\d) --> ((?:\d\d:)\d\d:\d\d)\.(\d\d\d)"
RE_STYLED_CUE = r"^<c\.(\w+)\.bg_(?:\w+)>(.*)</c>$"
def convert(input, output):
"""Convert input ArteTV's WebVTT string data and write it on output file."""
# This is a very (very) simple implementation based on what has actually
# been seen on ArteTV and is not at all a generic WebVTT solution.
blocks = []
block = []
for line in input.splitlines():
if not line and block:
blocks.append(block)
block = []
else:
block.append(line)
if block:
blocks.append(block)
block = []
if not blocks:
raise WebVTTError("INVALID_DATA")
header = blocks.pop(0)
if not (len(header) == 1 and header[0].startswith("WEBVTT")):
raise WebVTTError("INVALID_HEADER")
counter = 1
for block in blocks:
if m := re.match(RE_CUE_START, block.pop(0)):
print(f"{counter}", file=output)
print(f"{m[1]},{m[2]} --> {m[3]},{m[4]}", file=output)
for line in block:
if m := re.match(RE_STYLED_CUE, line):
print(f'<font color="{m[1]}">{m[2]}</font>', file=output)
else:
print(line, file=output)
print("", file=output)
counter += 1
if counter == 1:
raise WebVTTError("EMPTY_DATA")
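
The per-cue conversion in `convert` can be exercised on a single block. The sketch below reuses the two patterns from the module unchanged; the helper name `cue_to_srt` and the sample cue are made up for illustration, and like the original it only handles the cue shapes described in the comments above.

```python
import io
import re

RE_CUE_START = r"^((?:\d\d:)\d\d:\d\d)\.(\d\d\d) --> ((?:\d\d:)\d\d:\d\d)\.(\d\d\d)"
RE_STYLED_CUE = r"^<c\.(\w+)\.bg_(?:\w+)>(.*)</c>$"


def cue_to_srt(counter, block, output):
    """Write one WebVTT cue block as an SRT entry; return False if not a cue."""
    m = re.match(RE_CUE_START, block[0])
    if not m:
        return False
    print(counter, file=output)
    # SRT uses a comma before the milliseconds and drops cue settings
    print(f"{m[1]},{m[2]} --> {m[3]},{m[4]}", file=output)
    for line in block[1:]:
        if s := re.match(RE_STYLED_CUE, line):
            print(f'<font color="{s[1]}">{s[2]}</font>', file=output)
        else:
            print(line, file=output)
    print("", file=output)
    return True


out = io.StringIO()
cue = ["00:00:01.000 --> 00:00:03.500 line:90%", "<c.yellow.bg_black>Bonjour</c>"]
cue_to_srt(1, cue, out)
print(out.getvalue())
```

Note that the timing pattern requires the hours field, which also guarantees that the emitted timestamps are valid SRT.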

@ -1,29 +1,134 @@
# Licence: GNU AGPL v3: http://www.gnu.org/licenses/
# This file is part of [`delarte`](https://git.afpy.org/fcode/delarte.git)
# License: GNU AGPL v3: http://www.gnu.org/licenses/
# This file is part of `delarte` (https://git.afpy.org/fcode/delarte.git)
"""Provide ArteTV website utilities."""
from urllib.parse import urlparse
import json
LANGUAGES = ["fr", "de", "en", "es", "pl", "it"]
from .error import InvalidPage, PageNotFound, PageNotSupported, HTTPError
from .model import Program
_DATA_MARK = '<script id="__NEXT_DATA__" type="application/json">'
def parse_url(program_page_url):
"""Parse ArteTV web URL into UI language and program ID."""
url = urlparse(program_page_url)
if url.hostname != "www.arte.tv":
raise ValueError("not an ArteTV url")
def _process_programs_page(page_value):
language = page_value["language"]
program_page_path = url.path.split("/")[1:]
zone_found = False
program_found = False
lang = program_page_path.pop(0)
for zone in page_value["zones"]:
if zone["code"].startswith("program_content_"):
if zone_found:
raise InvalidPage("PROGRAMS_CONTENT_ZONES_COUNT")
zone_found = True
else:
continue
if lang not in LANGUAGES:
raise ValueError(f"invalid url language code: {lang}")
for data_item in zone["content"]["data"]:
if data_item["type"] == "program":
if program_found:
raise InvalidPage("PROGRAMS_CONTENT_PROGRAM_COUNT")
program_found = True
else:
raise InvalidPage("PROGRAMS_CONTENT_PROGRAM_TYPE")
if program_page_path.pop(0) != "videos":
raise ValueError("invalid ArteTV url")
yield (
Program(
data_item["programId"],
language,
data_item["title"],
data_item["subtitle"],
),
data_item["player"]["config"],
)
program_id = program_page_path.pop(0)
if not zone_found:
raise InvalidPage("PROGRAMS_CONTENT_ZONES_COUNT")
return lang, program_id
if not program_found:
raise InvalidPage("PROGRAMS_CONTENT_PROGRAM_COUNT")
def _process_collections_page(page_value):
language = page_value["language"]
main_zone_found = False
sub_zone_found = False
program_found = False
for zone in page_value["zones"]:
if zone["code"].startswith("collection_videos_"):
if main_zone_found:
raise InvalidPage("COLLECTIONS_MAIN_ZONE_COUNT")
if program_found:
raise InvalidPage("COLLECTIONS_MIXED_ZONES")
main_zone_found = True
elif zone["code"].startswith("collection_subcollection_"):
if program_found and not sub_zone_found:
raise InvalidPage("COLLECTIONS_MIXED_ZONES")
sub_zone_found = True
else:
continue
for data_item in zone["content"]["data"]:
if (_ := data_item["type"]) == "teaser":
program_found = True
else:
raise InvalidPage("COLLECTIONS_INVALID_CONTENT_DATA_ITEM", _)
yield (
Program(
data_item["programId"],
language,
data_item["title"],
data_item["subtitle"],
),
f"https://api.arte.tv/api/player/v2/config/{language}/{data_item['programId']}",
)
if not main_zone_found:
raise InvalidPage("COLLECTIONS_MAIN_ZONE_COUNT")
if not program_found:
raise InvalidPage("COLLECTIONS_PROGRAMS_COUNT")
def iter_programs(page_url, http):
"""Iterate over programs listed on given ArteTV page."""
r = http.request("GET", page_url)
# special handling of 404
if r.status == 404:
raise PageNotFound(page_url)
HTTPError.raise_for_status(r)
# no HTML parsing required, we just find the mark
html = r.data.decode("utf-8")
start = html.find(_DATA_MARK)
if start < 0:
raise InvalidPage("DATA_MARK_NOT_FOUND", page_url)
start += len(_DATA_MARK)
end = html.index("</script>", start)
try:
next_js_data = json.loads(html[start:end].strip())
except json.JSONDecodeError:
raise InvalidPage("INVALID_JSON_DATA", page_url)
try:
page_value = next_js_data["props"]["pageProps"]["props"]["page"]["value"]
match page_value["type"]:
case "program":
yield from _process_programs_page(page_value)
case "collection":
yield from _process_collections_page(page_value)
case _:
raise PageNotSupported(page_url, page_value)
except (KeyError, IndexError, ValueError) as e:
raise InvalidPage("SCHEMA", page_url) from e
except InvalidPage as e:
raise InvalidPage(e.args[0], page_url) from e
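
The data extraction step of `iter_programs` works without an HTML parser: it locates the `__NEXT_DATA__` script tag and decodes what follows as JSON. A self-contained sketch of that step is shown here; `extract_next_data` is a hypothetical name, the sample HTML is fabricated for the example, and `ValueError` replaces `InvalidPage`. Unlike the code above, it uses `find` for the closing tag so a missing `</script>` is reported explicitly.

```python
import json

DATA_MARK = '<script id="__NEXT_DATA__" type="application/json">'


def extract_next_data(html):
    """Return the JSON object embedded after the __NEXT_DATA__ script tag."""
    start = html.find(DATA_MARK)
    if start < 0:
        raise ValueError("data mark not found")
    start += len(DATA_MARK)
    end = html.find("</script>", start)
    if end < 0:
        raise ValueError("unterminated data script")
    return json.loads(html[start:end].strip())


sample = (
    "<html><body>"
    + DATA_MARK
    + '{"props": {"pageProps": {"props": {"page": {"value": {"type": "program"}}}}}}'
    + "</script></body></html>"
)
page_value = extract_next_data(sample)["props"]["pageProps"]["props"]["page"]["value"]
print(page_value["type"])
```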

@ -0,0 +1,4 @@
# Licence: GNU AGPL v3: http://www.gnu.org/licenses/
# This file is part of [`delarte`](https://git.afpy.org/fcode/delarte.git)
"""Test package."""