`delarte`
=========

🎬 ArteTV downloader

💡 What is it?
---------------

This is a toy/research project whose primary goal is to become familiar with some of the technologies involved in multilingual video streaming. Using this program may violate the usage policy of the ArteTV website, and we do not recommend using it for any purpose other than studying the code.

ArteTV is a European public service channel dedicated to culture. Programmes are usually available with multiple audio and subtitles languages.

🚀 Quick start
---------------

Install the [FFmpeg](https://ffmpeg.org/download.html) binaries and make sure they are in your `PATH`:
```
$ ffmpeg -version
ffmpeg version N-109344-g1bebcd43e1-20221202 Copyright (c) 2000-2022 the FFmpeg developers
built with gcc 12.2.0 (crosstool-NG 1.25.0.90_cf9beb1)
```

Clone this repository:
```
$ git clone https://git.afpy.org/fcode/delarte.git
$ cd delarte
```

Optionally create a virtual environment:
```
$ python3 -m venv .venv
$ source .venv/Scripts/activate  # Windows (Git Bash); on Linux/macOS use .venv/bin/activate
```
Install in editable mode:
```
$ pip install -e .
```
Or install in editable mode with the `dev` dependencies if you intend to contribute:
```
$ pip install -e .[dev]
```
Now you can run the script:
```
$ python3 -m delarte --help
or
$ delarte --help
delarte - ArteTV downloader.

Usage:
  delarte (-h | --help)
  delarte --version
  delarte [options] URL
  delarte [options] URL RENDITION
  delarte [options] URL RENDITION VARIANT

Download a video from ArteTV streaming service. Omit RENDITION and/or
VARIANT to print the list of available values.

Arguments:
  URL         the URL from ArteTV website
  RENDITION   the rendition code [audio/subtitles language combination]
  VARIANT     the variant code [video quality version]

Options:
  -h --help   print this message
  --version   print current version of the program
  --debug     on error, print debugging information
```
🔧 How it works
----------------
## 🏗️ The streaming infrastructure

Every video program has a _program identifier_ visible in its web page URL:
```
https://www.arte.tv/es/videos/110139-000-A/fromental-halevy-la-tempesta/
https://www.arte.tv/fr/videos/100204-001-A/esprit-d-hiver-1-3/
https://www.arte.tv/en/videos/104001-000-A/clint-eastwood/
```
That _program identifier_ enables us to query an API for the program's information.
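
As an illustration, the identifier can be pulled out of a page URL with a regular expression. This is only a minimal sketch: the pattern and the function name `extract_program_id` are assumptions inferred from the three example URLs above, not necessarily what the project's code does.

```python
import re
from urllib.parse import urlparse

# Assumed URL shape, inferred from the examples: /<language>/videos/<program id>/<slug>/
_PATH_RE = re.compile(
    r"^/(?P<lang>[a-z]{2})/videos/(?P<program_id>\d{6}-\d{3}-[A-Z])(?:/|$)"
)


def extract_program_id(url: str) -> tuple[str, str]:
    """Return (site language, program identifier) from an ArteTV page URL."""
    match = _PATH_RE.match(urlparse(url).path)
    if match is None:
        raise ValueError(f"not a recognised ArteTV program URL: {url}")
    return match["lang"], match["program_id"]


print(extract_program_id("https://www.arte.tv/en/videos/104001-000-A/clint-eastwood/"))
```

The site language is returned as well because it is part of the API URL used in the next step.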
### The _config_ API

For the last example, the API call looks like this:
```
https://api.arte.tv/api/player/v2/config/en/104001-000-A
```

The response is a JSON object, a sample of which can be found [here](https://git.afpy.org/fcode/delarte/src/branch/stable/samples/api/config-105612-000-A.json).

Information about the program is detailed in `$.data.attributes.metadata`, and the list of available audio/subtitles combinations is in `$.data.attributes.streams`. In our code such a combination is referred to as a _rendition_ (or _version_ in the CLI).

Every such _rendition_ has a reference to a _master playlist_ file in `.streams[i].url`.
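
The lookup described above can be sketched as follows. The function names are hypothetical, and the `sample` object is a hand-written stand-in (with placeholder URLs) that contains only the keys named in this section; a real response holds much more.

```python
import json

CONFIG_API = "https://api.arte.tv/api/player/v2/config"


def config_url(site_language: str, program_id: str) -> str:
    """Build the config API URL for a program."""
    return f"{CONFIG_API}/{site_language}/{program_id}"


def master_playlist_urls(config: dict) -> list[str]:
    """Collect the master playlist URL of every rendition in a config object."""
    streams = config["data"]["attributes"]["streams"]
    return [stream["url"] for stream in streams]


# Hand-written stand-in for a real response, NOT actual API output.
sample = json.loads("""
{"data": {"attributes": {
    "metadata": {},
    "streams": [{"url": "https://example.invalid/master-1.m3u8"},
                {"url": "https://example.invalid/master-2.m3u8"}]
}}}
""")

print(config_url("en", "104001-000-A"))
print(master_playlist_urls(sample))
```

In the real program the JSON would come from an HTTP request to the URL built by `config_url` rather than from an inline string.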
### The _master playlist_ file

This file is defined in [HTTP Live Streaming](https://www.rfc-editor.org/rfc/rfc8216) (sample files can be found [here](https://git.afpy.org/fcode/delarte/src/branch/stable/samples/hls/master-105612-000-A_VOF-STMF_XQ.m3u8) and [here](https://git.afpy.org/fcode/delarte/src/branch/stable/samples/hls/master-105612-000-A_VA-STA_XQ.m3u8)). It lists the URIs of the video _variants_ (one per video resolution). Each of them has:
- exactly one video _media playlist_ reference
- exactly one audio _media playlist_ reference
- at most one subtitles _media playlist_ reference

Audio and subtitles track references also include:
- a two-letter `language` code attribute (`mul` is used for multi-language audio)
- a free-form `name` attribute that is used to detect an audio _original version_
- a coded `characteristics` attribute that is used to detect accessibility tracks (audio or textual description)
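
To make the structure concrete, here is a rough sketch that reads the variants and the audio/subtitles tracks out of a master playlist using only the standard library. The project itself relies on the `m3u8` package for this; the helper names and the sample playlist below are invented for illustration and are much simpler than a real ArteTV playlist.

```python
import re

# One KEY=value pair of an HLS attribute list; values may be quoted.
_ATTR_RE = re.compile(r'([A-Z0-9-]+)=("[^"]*"|[^,]*)')


def parse_attributes(text: str) -> dict[str, str]:
    """Parse an HLS attribute list (KEY=value,KEY="quoted value",...)."""
    return {key: value.strip('"') for key, value in _ATTR_RE.findall(text)}


def parse_master_playlist(content: str) -> tuple[list[dict], list[dict]]:
    """Return (variants, media tracks) found in a master playlist."""
    variants, tracks = [], []
    lines = [line.strip() for line in content.splitlines() if line.strip()]
    for index, line in enumerate(lines):
        if line.startswith("#EXT-X-MEDIA:"):
            tracks.append(parse_attributes(line.split(":", 1)[1]))
        elif line.startswith("#EXT-X-STREAM-INF:"):
            variant = parse_attributes(line.split(":", 1)[1])
            variant["URI"] = lines[index + 1]  # the variant URI is on the next line
            variants.append(variant)
    return variants, tracks


# Invented sample, not an actual ArteTV playlist.
SAMPLE = """\
#EXTM3U
#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="audio",NAME="Example audio",LANGUAGE="fr",URI="audio-fr.m3u8"
#EXT-X-MEDIA:TYPE=SUBTITLES,GROUP-ID="subs",NAME="Example subtitles",LANGUAGE="de",URI="subs-de.m3u8"
#EXT-X-STREAM-INF:BANDWIDTH=2000000,RESOLUTION=1280x720,AUDIO="audio",SUBTITLES="subs"
video-720.m3u8
"""

variants, tracks = parse_master_playlist(SAMPLE)
for variant in variants:
    print(variant["RESOLUTION"], variant["URI"])
for track in tracks:
    print(track["TYPE"], track["LANGUAGE"], track["NAME"], track["URI"])
```

The `AUDIO` and `SUBTITLES` attributes of a variant name the `GROUP-ID` of the tracks that go with it, which is how a variant is tied to its audio and subtitles references.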
### The video and audio _media playlist_ file

This file is defined in [HTTP Live Streaming](https://www.rfc-editor.org/rfc/rfc8216) (sample files can be found [here](https://git.afpy.org/fcode/delarte/src/branch/stable/samples/hls/audio-105612-000-A_aud_VA.m3u8) and [here](https://git.afpy.org/fcode/delarte/src/branch/stable/samples/hls/video-105612-000-A_v1080.m3u8)). It is basically a list of _segments_ (HTTP ranges) the client is supposed to download in sequence.
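
A simplified sketch of how such a list of ranges can be read is shown below. It assumes the segments are declared with `#EXT-X-BYTERANGE` tags, as RFC 8216 allows, and that consecutive ranges refer to the same resource. The function name and the sample playlist are made up for the example.

```python
def byte_ranges(content: str) -> list[tuple[str, int, int]]:
    """Return (uri, first byte, last byte) for each segment of a media playlist."""
    segments = []
    pending = None   # (length, offset or None) from the latest #EXT-X-BYTERANGE
    next_offset = 0  # where a range starts when the tag gives no offset
    for line in (raw.strip() for raw in content.splitlines()):
        if line.startswith("#EXT-X-BYTERANGE:"):
            length, _, offset = line.split(":", 1)[1].partition("@")
            pending = (int(length), int(offset) if offset else None)
        elif line and not line.startswith("#") and pending is not None:
            length, offset = pending
            start = next_offset if offset is None else offset
            segments.append((line, start, start + length - 1))
            next_offset = start + length
            pending = None
    return segments


# Invented sample, not an actual ArteTV playlist.
SAMPLE = """\
#EXTM3U
#EXT-X-VERSION:7
#EXT-X-TARGETDURATION:6
#EXTINF:6.000,
#EXT-X-BYTERANGE:1000@0
video.mp4
#EXTINF:6.000,
#EXT-X-BYTERANGE:500
video.mp4
#EXT-X-ENDLIST
"""

for uri, first, last in byte_ranges(SAMPLE):
    # Each segment maps to one HTTP request with this Range header.
    print(uri, f"Range: bytes={first}-{last}")
```

Downloading the segments in order and concatenating the bytes yields the complete media file.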
### The subtitles _media playlist_ file

This file is defined in [HTTP Live Streaming](https://www.rfc-editor.org/rfc/rfc8216) (a sample file can be found [here](https://git.afpy.org/fcode/delarte/src/branch/stable/samples/hls/subtitles-105612-000-A_st_VA-ALL.m3u8)). It references the actual file containing the subtitles [VTT](https://developer.mozilla.org/en-US/docs/Web/API/WebVTT_API) data.
## ⚙️ The process

1. Figure out the available _sources_ by:
   - fetching the _config_ API object for the _program identifier_
   - fetching all referenced _master playlists_
2. Select the desired _source_ based on the _rendition_ and _variant_ codes.
3. Figure out the _output filename_ from the _source_ details.
4. Download the video, audio and subtitles media content.
   - convert `VTT` subtitles to `SRT`
5. Feed all the media to `ffmpeg` for multiplexing (or _muxing_).
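
The `VTT` to `SRT` conversion in step 4 can be approximated with the sketch below. It is not the project's converter: it only renumbers the cues, rewrites the timestamps and drops the header, cue settings and cue identifiers. Inline styling tags in the cue text are left untouched, and `vtt_to_srt` is a hypothetical name.

```python
import re

# A WebVTT cue timing line; the hours part is optional in WebVTT.
_CUE_TIMING = re.compile(
    r"(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})\s+-->\s+(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})"
)


def vtt_to_srt(vtt: str) -> str:
    """Convert WebVTT cues to SRT cues (numbered, comma as decimal separator)."""
    blocks = re.split(r"\n{2,}", vtt.replace("\r\n", "\n").strip())
    cues = []
    for block in blocks:
        lines = block.split("\n")
        # Find the timing line; anything before it is a cue identifier.
        for index, line in enumerate(lines):
            match = _CUE_TIMING.match(line)
            if match:
                break
        else:
            continue  # header, NOTE, STYLE or REGION block: no timing line
        h1, m1, s1, ms1, h2, m2, s2, ms2 = match.groups()
        timing = (
            f"{int(h1 or 0):02}:{m1}:{s1},{ms1} --> "
            f"{int(h2 or 0):02}:{m2}:{s2},{ms2}"
        )
        text = "\n".join(lines[index + 1:])
        cues.append(f"{len(cues) + 1}\n{timing}\n{text}")
    return "\n\n".join(cues) + "\n"


SAMPLE = (
    "WEBVTT\n\n"
    "00:00:01.000 --> 00:00:02.500 line:90%\nHello\n\n"
    "01:05.250 --> 01:07.000\nSecond cue\nsecond line\n"
)
print(vtt_to_srt(SAMPLE))
```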
## 📽️ FFmpeg

Multiplexing (_muxing_) the video file is handled by [ffmpeg](https://ffmpeg.org/). The script expects `ffmpeg` to be installed in the environment and calls it as a subprocess.
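
A subprocess call of this kind might look like the sketch below. The exact arguments used by the script are not documented here, so the command line (stream copy of every input into a single output file) and the names `build_mux_command` and `mux` are assumptions made for illustration.

```python
import shutil
import subprocess
from typing import Optional


def build_mux_command(
    video: str, audio: str, subtitles: Optional[str], output: str
) -> list[str]:
    """Build an ffmpeg command that muxes the inputs without re-encoding."""
    inputs = [video, audio] + ([subtitles] if subtitles else [])
    command = ["ffmpeg", "-hide_banner"]
    for path in inputs:
        command += ["-i", path]
    for index in range(len(inputs)):
        command += ["-map", str(index)]  # keep every stream of every input
    command += ["-c", "copy", output]  # stream copy: no re-encoding
    return command


def mux(video: str, audio: str, subtitles: Optional[str], output: str) -> None:
    """Run ffmpeg as a subprocess; raise if it is missing or fails."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg was not found in PATH")
    subprocess.run(build_mux_command(video, audio, subtitles, output), check=True)


print(" ".join(build_mux_command("video.mp4", "audio.mp4", "subs.srt", "out.mkv")))
```

Keeping the command construction separate from the call makes it possible to inspect or log the command without actually running `ffmpeg`.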

### Why not use FFmpeg directly with the HLS _master playlist_ URL?

So we can be more granular about the _renditions_ and _variants_ that we want.

### Why not use `VTT` subtitles directly?

Because FFmpeg does not support styles in WebVTT 😒.

### Why not use FFmpeg directly with the _media playlist_ URLs and let it do the download?

Because some programs would randomly fail 😒, probably due to invalid _segmentation_ on the server.
## 📌 Dependencies
- [m3u8](https://pypi.org/project/m3u8/) to parse playlists.
- [requests](https://pypi.org/project/requests/) to handle HTTP traffic.
## 🤝 Help

For sure! The more the merrier.