`delarte`
=========
🎬 ArteTV downloader
💡 What is it ?
---------------
This is a toy/research project whose only goal is to become familiar with some of the technologies involved in multi-lingual video streaming. Using this program may violate the usage policy of the ArteTV website and we do not recommend using it for any purpose other than studying the code.
ArteTV is a European public service channel dedicated to culture. Programs are usually available with multiple audio and subtitles languages.
🚀 Quick start
---------------
Install the [FFMPEG](https://ffmpeg.org/download.html) binaries and ensure they are in your `PATH`
```
$ ffmpeg -version
ffmpeg version N-109344-g1bebcd43e1-20221202 Copyright (c) 2000-2022 the FFmpeg developers
built with gcc 12.2.0 (crosstool-NG 1.25.0.90_cf9beb1)
```
Clone this repository
```
$ git clone git@gitlab.com:Barbagus/delarte.git
$ cd delarte
```
Optionally create a virtual environment
```
$ python3 -m venv .venv
$ source .venv/bin/activate  # on Windows (Git Bash): source .venv/Scripts/activate
```
Install in editable mode
```
$ pip install -e .[dev]
```
Now you can run the script
```
$ python3 -m delarte --help
or
$ delarte --help
ArteTV dowloader.
usage: delarte [-h|--help] - print this message
or: delarte program_page_url - show available versions
or: delarte program_page_url version - show available resolutions
or: delarte program_page_url version resolution - download the given video
```
🔧 How it works
----------------
### 🏗️ The streaming infrastructure
Every video program has a _program identifier_ visible in its web page URL:
```
https://www.arte.tv/es/videos/110139-000-A/fromental-halevy-la-tempesta/
https://www.arte.tv/fr/videos/100204-001-A/esprit-d-hiver-1-3/
https://www.arte.tv/en/videos/104001-000-A/clint-eastwood/
```
That _program identifier_ enables us to query an API for the program's information.
##### The _config_ API
For the last example, the API call is:
```
https://api.arte.tv/api/player/v2/config/en/104001-000-A
```
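As a minimal sketch (not necessarily how `delarte` does it), the identifier and language can be pulled out of the page URL and turned into the API URL. The function names and the identifier pattern (six digits, three digits, one letter) are assumptions inferred from the three example URLs above.

```python
import re
from urllib.parse import urlparse

# Assumed path shape: /<language>/videos/<program identifier>/<slug>/
# The identifier pattern is inferred from the example URLs, not from any spec.
_PATH_RE = re.compile(r"^/(?P<lang>[a-z]{2})/videos/(?P<program_id>\d{6}-\d{3}-[A-Z])/")


def parse_program_url(url):
    """Return (language, program_id) extracted from a program page URL."""
    match = _PATH_RE.match(urlparse(url).path)
    if match is None:
        raise ValueError(f"not a recognised program page URL: {url}")
    return match["lang"], match["program_id"]


def config_api_url(language, program_id):
    """Build the config API URL for a program (hypothetical helper)."""
    return f"https://api.arte.tv/api/player/v2/config/{language}/{program_id}"


lang, program_id = parse_program_url(
    "https://www.arte.tv/en/videos/104001-000-A/clint-eastwood/"
)
print(config_api_url(lang, program_id))
```

Fetching that URL is left to whichever HTTP client you prefer.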
The response is a JSON object:
```json
{
  "data": {
    "id": "104001-000-A_en",
    "type": "ConfigPlayer",
    "attributes": {
      "metadata": {
        "providerId": "104001-000-A",
        "language": "en",
        "title": "Clint Eastwood",
        "subtitle": "The Last Legend",
        "description": "70 years of career in front of and behind the camera and still active at 90, Clint Eastwood is a Hollywood legend. A look back at his unique career through a portrait that explores the complexity of the Eastwood myth.",
        "duration": { "seconds": 4652 },
        ...
      },
      "streams": [
        {
          "url": "https://.../104001-000-A_VOF-STE%5BANG%5D_XQ.m3u8",
          "versions": [
            {
              "label": "English (Subtitles)",
              "shortLabel": "OGsub-ANG",
              "eStat": {
                "ml5": "VOF-STE[ANG]"
              }
            }
          ],
          ...
        },
        {
          "url": "https://.../104001-000-A_VOF-STF_XQ.m3u8",
          "versions": [
            {
              "label": "French (Original)",
              "shortLabel": "FR",
              "eStat": {
                "ml5": "VOF-STF"
              }
            }
          ],
          ...
        },
        {
          "url": "https://.../104001-000-A_VOF-STMF_XQ.m3u8",
          "versions": [
            {
              "label": "Original french version - closed captioning (FR)",
              "shortLabel": "ccFR",
              "eStat": {
                "ml5": "VOF-STMF"
              }
            }
          ],
          ...
        },
        {
          "url": "https://.../104001-000-A_VA-STA_XQ.m3u8",
          "versions": [
            {
              "label": "German (Dubbed)",
              "shortLabel": "DE",
              "eStat": {
                "ml5": "VA-STA"
              }
            }
          ],
          ...
        },
        {
          "url": "https://.../104001-000-A_VA-STMA_XQ.m3u8",
          "versions": [
            {
              "label": "German closed captioning ",
              "shortLabel": "ccDE",
              "eStat": {
                "ml5": "VA-STMA"
              }
            }
          ],
          ...
        }
      ],
      ...
    }
  }
}
```
Information about the program is detailed in `data.attributes.metadata` and a list of available audio/subtitles combinations in `data.attributes.streams`. In our code such a combination is referred to as a _version_.
Every such _version_ has a reference to a _version index_ file in `.streams[i].url` and description of the audio/subtitle combination in `.streams[i].versions[0]`.
We are using `.streams[i].versions[0].eStat.ml5` as our _version codes_:
- `VOF-STE[ANG]` English (Subtitles)
- `VOF-STF` French (Original)
- `VOF-STMF` Original french version - closed captioning (FR)
- `VA-STA` German (Dubbed)
- `VA-STMA` German closed captioning
- ...
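The lookup described above could be sketched as follows. `list_versions` is a hypothetical helper name, and the sample dictionary is a trimmed-down stand-in with the same shape as the response shown earlier (the elided `https://.../` URLs are kept as they appear there).

```python
def list_versions(config):
    """Map each version code to its (label, version index URL)."""
    versions = {}
    for stream in config["data"]["attributes"]["streams"]:
        version = stream["versions"][0]
        versions[version["eStat"]["ml5"]] = (version["label"], stream["url"])
    return versions


# Trimmed-down sample: only the keys the function reads are present.
sample_config = {
    "data": {
        "attributes": {
            "streams": [
                {
                    "url": "https://.../104001-000-A_VOF-STF_XQ.m3u8",
                    "versions": [
                        {"label": "French (Original)", "eStat": {"ml5": "VOF-STF"}}
                    ],
                },
                {
                    "url": "https://.../104001-000-A_VA-STA_XQ.m3u8",
                    "versions": [
                        {"label": "German (Dubbed)", "eStat": {"ml5": "VA-STA"}}
                    ],
                },
            ]
        }
    }
}

for code, (label, url) in list_versions(sample_config).items():
    print(f"{code:10} {label}")
```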
##### The _version index_ file
The file is in [HTTP Live Streaming](https://www.rfc-editor.org/rfc/rfc8216) `.m3u8` format:
```
#EXTM3U
...
#EXT-X-STREAM-INF:BANDWIDTH=2335200,AVERAGE-BANDWIDTH=1123304,VIDEO-RANGE=SDR,CODECS="avc1.4d401e,mp4a.40.2",RESOLUTION=768x432,FRAME-RATE=25.000,AUDIO="program_audio_0",SUBTITLES="subs"
medias/104001-000-A_v432.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=4534432,AVERAGE-BANDWIDTH=2124680,VIDEO-RANGE=SDR,CODECS="avc1.4d0028,mp4a.40.2",RESOLUTION=1920x1080,FRAME-RATE=25.000,AUDIO="program_audio_0",SUBTITLES="subs"
medias/104001-000-A_v1080.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=4153392,AVERAGE-BANDWIDTH=1917840,VIDEO-RANGE=SDR,CODECS="avc1.4d401f,mp4a.40.2",RESOLUTION=1280x720,FRAME-RATE=25.000,AUDIO="program_audio_0",SUBTITLES="subs"
medias/104001-000-A_v720.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=1445432,AVERAGE-BANDWIDTH=726160,VIDEO-RANGE=SDR,CODECS="avc1.4d401e,mp4a.40.2",RESOLUTION=640x360,FRAME-RATE=25.000,AUDIO="program_audio_0",SUBTITLES="subs"
medias/104001-000-A_v360.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=815120,AVERAGE-BANDWIDTH=429104,VIDEO-RANGE=SDR,CODECS="avc1.42e00d,mp4a.40.2",RESOLUTION=384x216,FRAME-RATE=25.000,AUDIO="program_audio_0",SUBTITLES="subs"
medias/104001-000-A_v216.m3u8
...
#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="program_audio_0",LANGUAGE="fr",NAME="VOF",AUTOSELECT=YES,DEFAULT=YES,URI="medias/104001-000-A_aud_VOF.m3u8"
#EXT-X-MEDIA:TYPE=SUBTITLES,GROUP-ID="subs",NAME="English",DEFAULT=YES,AUTOSELECT=YES,FORCED=NO,LANGUAGE="en",URI="medias/104001-000-A_st_VO-ANG.m3u8"
...
```
This can be parsed with the [m3u8](https://pypi.org/project/m3u8/) library.
This file shows a list of _video index_ URIs (one per video resolution). Each of them is linked to exactly one _audio index_ file and at most one _subtitles index_ file.
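To make the linking between the three kinds of index concrete, here is a standard-library-only sketch that reads the `#EXT-X-STREAM-INF` and `#EXT-X-MEDIA` tags by hand. It is a simplification for illustration (the project itself relies on the `m3u8` library), and the function names are made up for this example. It assumes each `AUDIO`/`SUBTITLES` group holds a single entry, as in the sample above.

```python
import re

# KEY=VALUE pairs; VALUE is either a quoted string or runs up to the next comma.
_ATTR_RE = re.compile(r'([A-Z0-9-]+)=("[^"]*"|[^,]*)')


def parse_attributes(line):
    """Parse the attribute list that follows the first ':' of a tag line."""
    _, _, attrs = line.partition(":")
    return {key: value.strip('"') for key, value in _ATTR_RE.findall(attrs)}


def parse_version_index(text):
    """Return (videos, media): videos by resolution, media by (type, group id)."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    videos, media = {}, {}
    for i, line in enumerate(lines):
        if line.startswith("#EXT-X-STREAM-INF:"):
            attrs = parse_attributes(line)
            attrs["URI"] = lines[i + 1]  # the URI sits on the following line
            videos[attrs["RESOLUTION"]] = attrs
        elif line.startswith("#EXT-X-MEDIA:"):
            attrs = parse_attributes(line)
            media[(attrs["TYPE"], attrs["GROUP-ID"])] = attrs
    return videos, media


def select_streams(videos, media, resolution):
    """Return (video URI, audio URI, subtitles URI or None) for a resolution."""
    video = videos[resolution]
    audio = media[("AUDIO", video["AUDIO"])]
    subtitles = media.get(("SUBTITLES", video.get("SUBTITLES")))
    return video["URI"], audio["URI"], subtitles["URI"] if subtitles else None
```

The returned URIs are relative to the _version index_ URL, so they would still need to be resolved against it (for instance with `urllib.parse.urljoin`).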
##### The _video index_ files
The file is also in [HTTP Live Streaming](https://www.rfc-editor.org/rfc/rfc8216) `.m3u8` format:
```
#EXTM3U
#EXT-X-TARGETDURATION:6
#EXT-X-VERSION:7
#EXT-X-MEDIA-SEQUENCE:1
#EXT-X-INDEPENDENT-SEGMENTS
#EXT-X-PLAYLIST-TYPE:VOD
#EXT-X-MAP:URI="104001-000-A_v1080.mp4",BYTERANGE="28792@0"
#EXTINF:6.000,
#EXT-X-BYTERANGE:1734621@28792
104001-000-A_v1080.mp4
#EXTINF:6.000,
#EXT-X-BYTERANGE:1575303@1763413
104001-000-A_v1080.mp4
#EXTINF:6.000,
#EXT-X-BYTERANGE:1603739@3338716
104001-000-A_v1080.mp4
#EXTINF:6.000,
#EXT-X-BYTERANGE:1333835@4942455
104001-000-A_v1080.mp4
...
```
This file shows the list of _video chunks_ the server expects to serve.
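Each chunk is a byte range of a single `.mp4` file, written as `length@offset`. The following sketch lists those ranges and turns one into an HTTP `Range` header value. The function names are hypothetical, and the parser only handles ranges with an explicit offset, as in the sample above.

```python
def parse_segments(text):
    """Return a list of (uri, offset, length) for each chunk in a media playlist."""
    segments = []
    byterange = None  # (offset, length) from the latest #EXT-X-BYTERANGE tag
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("#EXT-X-BYTERANGE:"):
            length, _, offset = line.split(":", 1)[1].partition("@")
            byterange = (int(offset), int(length))
        elif line and not line.startswith("#"):
            # A non-tag line is the URI of the chunk described by the tags before it.
            if byterange is not None:
                segments.append((line, *byterange))
            byterange = None
    return segments


def range_header(offset, length):
    """HTTP Range header value covering `length` bytes starting at `offset`."""
    return f"bytes={offset}-{offset + length - 1}"


playlist = """#EXTM3U
#EXT-X-MAP:URI="104001-000-A_v1080.mp4",BYTERANGE="28792@0"
#EXTINF:6.000,
#EXT-X-BYTERANGE:1734621@28792
104001-000-A_v1080.mp4
#EXTINF:6.000,
#EXT-X-BYTERANGE:1575303@1763413
104001-000-A_v1080.mp4
"""
for uri, offset, length in parse_segments(playlist):
    print(uri, range_header(offset, length))
```

In the sample, every chunk starts exactly where the previous one ends (28792 + 1734621 = 1763413), so the chunks tile the file without gaps.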
##### The _audio index_ file
Similarly to the _video index_ file, it shows the list of _audio chunks_ the server expects to serve:
```
#EXTM3U
#EXT-X-TARGETDURATION:6
#EXT-X-VERSION:7
#EXT-X-MEDIA-SEQUENCE:1
#EXT-X-INDEPENDENT-SEGMENTS
#EXT-X-PLAYLIST-TYPE:VOD
#EXT-X-MAP:URI="104001-000-A_aud_VOF.mp4",BYTERANGE="28752@0"
#EXTINF:5.991,
#EXT-X-BYTERANGE:82445@28752
104001-000-A_aud_VOF.mp4
#EXTINF:5.991,
#EXT-X-BYTERANGE:99299@111197
104001-000-A_aud_VOF.mp4
#EXTINF:5.991,
#EXT-X-BYTERANGE:101640@210496
104001-000-A_aud_VOF.mp4
#EXTINF:5.991,
#EXT-X-BYTERANGE:102047@312136
104001-000-A_aud_VOF.mp4
...
```
##### The _subtitles index_ file
The file is also in [HTTP Live Streaming](https://www.rfc-editor.org/rfc/rfc8216) `.m3u8` format:
```
#EXTM3U
#EXT-X-VERSION:7
#EXT-X-TARGETDURATION:4650
#EXT-X-MEDIA-SEQUENCE:1
#EXT-X-PLAYLIST-TYPE:VOD
#EXTINF:4650,
104001-000-A_st_VO-ANG.vtt
#EXT-X-ENDLIST
```
This file shows the file(s) containing the subtitles data.
### ⚙️ The process
1. Get the _config_ API object for the _program identifier_
   - Figure out the _output filename_ from _metadata_.
   - Select a _version_.
2. Get the _version index_ file
   - Select a resolution _video index_ along with its _audio index_ and _subtitles index_
3. Get the subtitles in `vtt` format and convert them to `srt`
4. Feed the _video index_, _audio index_ and `srt` file to `ffmpeg`
### 📽️ FFMPEG
The actual build of the video file is handled by [ffmpeg](https://ffmpeg.org/). The script expects [ffmpeg](https://ffmpeg.org/) to be installed in the environment and will call it as a subprocess.
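For illustration only, a call of that kind might be assembled like this. The exact arguments `delarte` passes are not listed here, so the ones below are an assumption: three inputs, every stream copied without re-encoding, and a Matroska (`.mkv`) output, which can hold `srt` subtitles as-is. `build_ffmpeg_command` is a hypothetical helper.

```python
def build_ffmpeg_command(video_index_url, audio_index_url, srt_path, output_path):
    """Assemble an ffmpeg argument list that muxes the three inputs together."""
    return [
        "ffmpeg",
        "-i", video_index_url,  # input 0: video index (.m3u8)
        "-i", audio_index_url,  # input 1: audio index (.m3u8)
        "-i", srt_path,         # input 2: converted subtitles
        "-c:v", "copy",         # copy the video stream, no re-encoding
        "-c:a", "copy",         # copy the audio stream, no re-encoding
        "-c:s", "copy",         # keep the subtitles as they are
        output_path,
    ]


command = build_ffmpeg_command(
    "https://.../medias/104001-000-A_v1080.m3u8",
    "https://.../medias/104001-000-A_aud_VOF.m3u8",
    "subtitles.srt",
    "output.mkv",
)
print(" ".join(command))
```

The list can then be handed to `subprocess.run(command, check=True)`. Since each input carries a single stream, no `-map` argument is needed here.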
##### Why not use FFMPEG directly with the _version index_ URL ?
So we can select the video resolution and not rely on stream mapping arguments in `ffmpeg`.
##### Why not use VTT subtitles directly ?
Because it fails 😒.
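The `vtt` to `srt` conversion itself is small. Below is a standard-library-only sketch of the idea, not the project's actual code (which loads the file with `webvtt-py`). It numbers the cues, rewrites timestamps with a comma as the decimal separator, and drops the header, cue identifiers and cue settings. Inline WebVTT markup in the cue text is left untouched, which is a known limitation of this sketch.

```python
import re

# A cue timing line: "[hh:]mm:ss.ttt --> [hh:]mm:ss.ttt", optionally followed by settings.
_TIMING_RE = re.compile(
    r"(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})\s+-->\s+(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})"
)


def vtt_to_srt(vtt_text):
    """Convert simple WebVTT text to SubRip text."""
    blocks = re.split(r"\n\s*\n", vtt_text.replace("\r\n", "\n").strip())
    cues = []
    for block in blocks:
        lines = block.split("\n")
        # Locate the timing line; blocks without one (header, NOTE, STYLE) are skipped.
        for i, line in enumerate(lines):
            match = _TIMING_RE.match(line)
            if match:
                break
        else:
            continue
        h1, m1, s1, ms1, h2, m2, s2, ms2 = match.groups()
        timing = (
            f"{int(h1 or 0):02}:{m1}:{s1},{ms1} --> "
            f"{int(h2 or 0):02}:{m2}:{s2},{ms2}"
        )
        text = "\n".join(lines[i + 1:])
        cues.append(f"{len(cues) + 1}\n{timing}\n{text}\n")
    return "\n".join(cues)


example = "WEBVTT\n\n00:01.000 --> 00:04.500 line:90%\nHello\nworld\n"
print(vtt_to_srt(example))
```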
### 📌 Dependencies
- [m3u8](https://pypi.org/project/m3u8/) to parse index files.
- [webvtt-py](https://pypi.org/project/webvtt-py/) to load `vtt` subtitles files.
### 🤝 Help
For sure ! The more the merrier.