`delarte`
=========
🎬 ArteTV downloader
💡 What is it ?
---------------
This is a toy/research project whose only goal is to become familiar with some of the technologies involved in multilingual video streaming. Using this program may violate the usage policy of the ArteTV website, and we do not recommend using it for any purpose other than studying the code.
ArteTV is a European public service channel dedicated to culture. Programs are usually available with multiple audio and subtitle languages.
🚀 Quick start
---------------
Install [FFMPEG](https://ffmpeg.org/download.html) binaries and ensure they are in your `PATH`
```
$ ffmpeg -version
ffmpeg version N-109344-g1bebcd43e1-20221202 Copyright (c) 2000-2022 the FFmpeg developers
built with gcc 12.2.0 (crosstool-NG 1.25.0.90_cf9beb1)
```
Clone this repository
```
$ git clone https://git.afpy.org/fcode/delarte.git
$ cd delarte
```
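Optionally create a virtual environment (the activation script is `.venv/Scripts/activate` on Windows and `.venv/bin/activate` on Linux and macOS)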
```
$ python3 -m venv .venv
$ source .venv/Scripts/activate
```
Install in edit mode
```
$ pip install -e .
```
Or install in edit mode with `dev` dependencies if you intend to contribute.
```
$ pip install -e .[dev]
```
Now you can run the script
```
$ python3 -m delarte --help
or
$ delarte --help
ArteTV dowloader.
usage: delarte [-h|--help] - print this message
or: delarte program_page_url - show available versions
or: delarte program_page_url version - show available resolutions
or: delarte program_page_url version resolution - download the given video
```
🔧 How it works
----------------
### 🏗️ The streaming infrastructure
Every video program has a _program identifier_ visible in its web page URL:
```
https://www.arte.tv/es/videos/110139-000-A/fromental-halevy-la-tempesta/
https://www.arte.tv/fr/videos/100204-001-A/esprit-d-hiver-1-3/
https://www.arte.tv/en/videos/104001-000-A/clint-eastwood/
```
That _program identifier_ enables us to query an API for the program's information.
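As an illustration, the language code and the identifier can be pulled out of a page URL with a regular expression. This is a minimal sketch and not necessarily how `delarte` does it: the function name `parse_program_url` and the pattern are assumptions inferred only from the three example URLs above.

```python
import re

# Assumed URL shape, inferred from the examples above:
#   https://www.arte.tv/<language>/videos/<program identifier>/<slug>/
_PROGRAM_URL = re.compile(
    r"^https://www\.arte\.tv/(?P<lang>[a-z]{2})/videos/"
    r"(?P<program_id>[0-9]+-[0-9]+-[A-Z])(?:/|$)"
)


def parse_program_url(url):
    """Return (language, program_id), or raise ValueError if the URL does not match."""
    match = _PROGRAM_URL.match(url)
    if match is None:
        raise ValueError(f"not a recognized program page URL: {url}")
    return match["lang"], match["program_id"]


print(parse_program_url("https://www.arte.tv/en/videos/104001-000-A/clint-eastwood/"))
```

Raising on an unexpected URL keeps the failure close to the user input, rather than letting a malformed identifier reach the API call.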
#### The _config_ API
For the last example, the API call is:
```
https://api.arte.tv/api/player/v2/config/en/104001-000-A
```
The response is a JSON object:
```json
{
  "data": {
    "id": "104001-000-A_en",
    "type": "ConfigPlayer",
    "attributes": {
      "metadata": {
        "providerId": "104001-000-A",
        "language": "en",
        "title": "Clint Eastwood",
        "subtitle": "The Last Legend",
        "description": "70 years of career in front of and behind the camera and still active at 90, Clint Eastwood is a Hollywood legend. A look back at his unique career through a portrait that explores the complexity of the Eastwood myth.",
        "duration": { "seconds": 4652 },
        ...
      },
      "streams": [
        {
          "url": "https://.../104001-000-A_VOF-STE%5BANG%5D_XQ.m3u8",
          "versions": [
            {
              "label": "English (Subtitles)",
              "shortLabel": "OGsub-ANG",
              "eStat": {
                "ml5": "VOF-STE[ANG]"
              }
            }
          ],
          ...
        },
        {
          "url": "https://.../104001-000-A_VOF-STF_XQ.m3u8",
          "versions": [
            {
              "label": "French (Original)",
              "shortLabel": "FR",
              "eStat": {
                "ml5": "VOF-STF"
              }
            }
          ],
          ...
        },
        {
          "url": "https://.../104001-000-A_VOF-STMF_XQ.m3u8",
          "versions": [
            {
              "label": "Original french version - closed captioning (FR)",
              "shortLabel": "ccFR",
              "eStat": {
                "ml5": "VOF-STMF"
              }
            }
          ],
          ...
        },
        {
          "url": "https://.../104001-000-A_VA-STA_XQ.m3u8",
          "versions": [
            {
              "label": "German (Dubbed)",
              "shortLabel": "DE",
              "eStat": {
                "ml5": "VA-STA"
              }
            }
          ],
          ...
        },
        {
          "url": "https://.../104001-000-A_VA-STMA_XQ.m3u8",
          "versions": [
            {
              "label": "German closed captioning ",
              "shortLabel": "ccDE",
              "eStat": {
                "ml5": "VA-STMA"
              }
            }
          ],
          ...
        }
      ],
      ...
    }
  }
}
```
Information about the program is detailed in `data.attributes.metadata`, and the list of available audio/subtitle combinations is in `data.attributes.streams`. In our code such a combination is referred to as a _rendition_ (or _version_ in the CLI).
Every _rendition_ has a reference to a _master playlist_ file in `.streams[i].url` and a description of the audio/subtitle combination in `.streams[i].versions[0]`.
We are using `.streams[i].versions[0].eStat.ml5` as our _rendition_ key:
- `VOF-STE[ANG]` English (Subtitles)
- `VOF-STF` French (Original)
- `VOF-STMF` Original french version - closed captioning (FR)
- `VA-STA` German (Dubbed)
- `VA-STMA` German closed captioning
- ...
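The lookup described above can be sketched as follows. This is an illustrative example rather than the project's actual code: `config_url` and `list_renditions` are hypothetical names, the URL pattern is taken from the single example call above, and the sample dictionary is a trimmed copy of the response (its elided `https://.../` URLs are left elided).

```python
def config_url(language, program_id):
    """Build the config API URL, following the pattern of the example call above."""
    return f"https://api.arte.tv/api/player/v2/config/{language}/{program_id}"


def list_renditions(config):
    """Map each rendition key (eStat.ml5) to its (label, master playlist URL)."""
    renditions = {}
    for stream in config["data"]["attributes"]["streams"]:
        version = stream["versions"][0]
        renditions[version["eStat"]["ml5"]] = (version["label"], stream["url"])
    return renditions


# Trimmed-down sample with the same shape as the response shown above.
sample = {
    "data": {"attributes": {"streams": [
        {"url": "https://.../104001-000-A_VOF-STF_XQ.m3u8",
         "versions": [{"label": "French (Original)", "shortLabel": "FR",
                       "eStat": {"ml5": "VOF-STF"}}]},
        {"url": "https://.../104001-000-A_VA-STA_XQ.m3u8",
         "versions": [{"label": "German (Dubbed)", "shortLabel": "DE",
                       "eStat": {"ml5": "VA-STA"}}]},
    ]}}
}

print(config_url("en", "104001-000-A"))
for key, (label, url) in list_renditions(sample).items():
    print(f"{key}: {label}")
```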
#### The _master playlist_
As defined in [HTTP Live Streaming](https://www.rfc-editor.org/rfc/rfc8216), for example:
```
#EXTM3U
...
#EXT-X-STREAM-INF:BANDWIDTH=2335200,AVERAGE-BANDWIDTH=1123304,VIDEO-RANGE=SDR,CODECS="avc1.4d401e,mp4a.40.2",RESOLUTION=768x432,FRAME-RATE=25.000,AUDIO="program_audio_0",SUBTITLES="subs"
medias/104001-000-A_v432.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=4534432,AVERAGE-BANDWIDTH=2124680,VIDEO-RANGE=SDR,CODECS="avc1.4d0028,mp4a.40.2",RESOLUTION=1920x1080,FRAME-RATE=25.000,AUDIO="program_audio_0",SUBTITLES="subs"
medias/104001-000-A_v1080.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=4153392,AVERAGE-BANDWIDTH=1917840,VIDEO-RANGE=SDR,CODECS="avc1.4d401f,mp4a.40.2",RESOLUTION=1280x720,FRAME-RATE=25.000,AUDIO="program_audio_0",SUBTITLES="subs"
medias/104001-000-A_v720.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=1445432,AVERAGE-BANDWIDTH=726160,VIDEO-RANGE=SDR,CODECS="avc1.4d401e,mp4a.40.2",RESOLUTION=640x360,FRAME-RATE=25.000,AUDIO="program_audio_0",SUBTITLES="subs"
medias/104001-000-A_v360.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=815120,AVERAGE-BANDWIDTH=429104,VIDEO-RANGE=SDR,CODECS="avc1.42e00d,mp4a.40.2",RESOLUTION=384x216,FRAME-RATE=25.000,AUDIO="program_audio_0",SUBTITLES="subs"
medias/104001-000-A_v216.m3u8
...
#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="program_audio_0",LANGUAGE="fr",NAME="VOF",AUTOSELECT=YES,DEFAULT=YES,URI="medias/104001-000-A_aud_VOF.m3u8"
#EXT-X-MEDIA:TYPE=SUBTITLES,GROUP-ID="subs",NAME="English",DEFAULT=YES,AUTOSELECT=YES,FORCED=NO,LANGUAGE="en",URI="medias/104001-000-A_st_VO-ANG.m3u8"
...
```
This file shows a list of video _variant_ URIs (one per video resolution). Each of them has:
- exactly one video _media playlist_ reference
- exactly one audio _media playlist_ reference
- at most one subtitles _media playlist_ reference
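The project parses playlists with the `m3u8` package. Purely to illustrate the structure of the file, here is a standard-library sketch that lists the variants of a master playlist. The helper names `parse_attributes` and `list_variants` are hypothetical, and the sample is a shortened version of the playlist above.

```python
import re

# Matches KEY=value pairs where the value is either quoted or runs to the next comma.
_ATTRIBUTE = re.compile(r'([A-Z0-9-]+)=("[^"]*"|[^,]*)')


def parse_attributes(text):
    """Parse an HLS attribute list into a dict, stripping surrounding quotes."""
    return {key: value.strip('"') for key, value in _ATTRIBUTE.findall(text)}


def list_variants(master_playlist):
    """Return one dict per #EXT-X-STREAM-INF entry: resolution, group ids and URI."""
    variants = []
    lines = [line.strip() for line in master_playlist.splitlines() if line.strip()]
    for index, line in enumerate(lines):
        if line.startswith("#EXT-X-STREAM-INF:") and index + 1 < len(lines):
            attributes = parse_attributes(line.split(":", 1)[1])
            variants.append({
                "resolution": attributes.get("RESOLUTION"),
                "audio_group": attributes.get("AUDIO"),
                "subtitles_group": attributes.get("SUBTITLES"),
                "uri": lines[index + 1],  # the URI is the line right after the tag
            })
    return variants


sample = """#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=2335200,CODECS="avc1.4d401e,mp4a.40.2",RESOLUTION=768x432,AUDIO="program_audio_0",SUBTITLES="subs"
medias/104001-000-A_v432.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=4534432,CODECS="avc1.4d0028,mp4a.40.2",RESOLUTION=1920x1080,AUDIO="program_audio_0",SUBTITLES="subs"
medias/104001-000-A_v1080.m3u8
"""

for variant in list_variants(sample):
    print(variant["resolution"], variant["uri"])
```

The `AUDIO` and `SUBTITLES` values are group identifiers. They point to the `#EXT-X-MEDIA` entries with the matching `GROUP-ID`, which is where the audio and subtitles playlist URIs live.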
##### The video and audio _media playlist_
As defined in [HTTP Live Streaming](https://www.rfc-editor.org/rfc/rfc8216), for example:
```
#EXTM3U
#EXT-X-TARGETDURATION:6
#EXT-X-VERSION:7
#EXT-X-MEDIA-SEQUENCE:1
#EXT-X-INDEPENDENT-SEGMENTS
#EXT-X-PLAYLIST-TYPE:VOD
#EXT-X-MAP:URI="104001-000-A_v1080.mp4",BYTERANGE="28792@0"
#EXTINF:6.000,
#EXT-X-BYTERANGE:1734621@28792
104001-000-A_v1080.mp4
#EXTINF:6.000,
#EXT-X-BYTERANGE:1575303@1763413
104001-000-A_v1080.mp4
#EXTINF:6.000,
#EXT-X-BYTERANGE:1603739@3338716
104001-000-A_v1080.mp4
#EXTINF:6.000,
#EXT-X-BYTERANGE:1333835@4942455
104001-000-A_v1080.mp4
...
```
This file lists the _segments_ the server expects to serve. In this example every segment is a byte range of the same `.mp4` file.
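To make the byte-range notation concrete, the sketch below turns each `length@offset` pair into an HTTP `Range` header value. It is a simplified illustration, not the project's downloader: the function names are hypothetical, and it assumes every `BYTERANGE` carries an explicit offset, as in the example above (the HLS specification allows the offset to be omitted).

```python
def list_byte_ranges(media_playlist):
    """Return (uri, offset, length) for the init section and for each segment."""
    ranges = []
    pending = None  # (length, offset) announced by the last #EXT-X-BYTERANGE tag
    for line in media_playlist.splitlines():
        line = line.strip()
        if line.startswith("#EXT-X-MAP:"):
            # e.g. #EXT-X-MAP:URI="file.mp4",BYTERANGE="28792@0"
            uri = line.split('URI="', 1)[1].split('"', 1)[0]
            byte_range = line.split('BYTERANGE="', 1)[1].split('"', 1)[0]
            length, offset = byte_range.split("@")
            ranges.append((uri, int(offset), int(length)))
        elif line.startswith("#EXT-X-BYTERANGE:"):
            length, offset = line.split(":", 1)[1].split("@")
            pending = (int(length), int(offset))
        elif line and not line.startswith("#") and pending is not None:
            ranges.append((line, pending[1], pending[0]))
            pending = None
    return ranges


def range_header(offset, length):
    """HTTP Range header value; the end position is inclusive."""
    return f"bytes={offset}-{offset + length - 1}"


sample = """#EXTM3U
#EXT-X-MAP:URI="104001-000-A_v1080.mp4",BYTERANGE="28792@0"
#EXTINF:6.000,
#EXT-X-BYTERANGE:1734621@28792
104001-000-A_v1080.mp4
#EXTINF:6.000,
#EXT-X-BYTERANGE:1575303@1763413
104001-000-A_v1080.mp4
"""

for uri, offset, length in list_byte_ranges(sample):
    print(uri, range_header(offset, length))
```

Each range ends exactly one byte before the next one starts, so in this example the segments are contiguous slices of a single file.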
##### The subtitles _media playlist_
As defined in [HTTP Live Streaming](https://www.rfc-editor.org/rfc/rfc8216), for example:
```
#EXTM3U
#EXT-X-VERSION:7
#EXT-X-TARGETDURATION:4650
#EXT-X-MEDIA-SEQUENCE:1
#EXT-X-PLAYLIST-TYPE:VOD
#EXTINF:4650,
104001-000-A_st_VO-ANG.vtt
#EXT-X-ENDLIST
```
This playlist references the single file containing the subtitle data.
### ⚙️ The process
1. Get the _config_ API object for the _program identifier_.
   - Select a _rendition_.
2. Get the _master playlist_.
   - Select a _variant_.
3. Download audio, video and subtitles media content.
   - Convert `VTT` subtitles to `SRT`.
4. Figure out the _output filename_ from _metadata_.
5. Feed all the media to `ffmpeg` for _muxing_.
### 📽️ FFMPEG
Multiplexing (_muxing_) the video file is handled by [ffmpeg](https://ffmpeg.org/). The script expects `ffmpeg` to be installed in the environment and calls it as a subprocess.
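As a rough idea of what such a call can look like, the sketch below builds an `ffmpeg` command line that copies already-downloaded streams into one file without re-encoding. The exact options `delarte` passes are not documented here, so the helper names, the file names and the choice of a Matroska (`.mkv`) output, which accepts `SRT` subtitles as-is, are all assumptions.

```python
import subprocess


def mux_command(video_path, audio_path, subtitles_path, output_path):
    """Build an ffmpeg command that copies the input streams into one container."""
    command = ["ffmpeg", "-i", video_path, "-i", audio_path]
    if subtitles_path is not None:
        command += ["-i", subtitles_path]
    # Take video from input 0 and audio from input 1, without re-encoding.
    command += ["-map", "0:v", "-map", "1:a", "-c:v", "copy", "-c:a", "copy"]
    if subtitles_path is not None:
        # Subtitles come from input 2.
        command += ["-map", "2:s", "-c:s", "copy"]
    command.append(output_path)
    return command


def mux(video_path, audio_path, subtitles_path, output_path):
    """Run ffmpeg as a subprocess; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(
        mux_command(video_path, audio_path, subtitles_path, output_path),
        check=True,
    )


# Hypothetical file names, for illustration only.
print(" ".join(mux_command("video.mp4", "audio.mp4", "subtitles.srt", "output.mkv")))
```

Passing the arguments as a list avoids shell quoting issues with titles that contain spaces or special characters.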
#### Why not use FFMPEG directly with the HLS _master playlist_ URL ?
So we can be more granular about _renditions_ and _variants_ that we want.
#### Why not use `VTT` subtitles directly ?
Because it fails 😒.
#### Why not use FFMPEG directly with the _media playlist_ URLs and let it do the download ?
Because some programs would randomly fail 😒, probably due to invalid _segmentation_ on the server.
### 📌 Dependencies
- [m3u8](https://pypi.org/project/m3u8/) to parse playlists.
- [webvtt-py](https://pypi.org/project/webvtt-py/) to load `vtt` subtitle files.
### 🤝 Help
For sure ! The more the merrier.