`delarte`
=========

🎬 ArteTV downloader

💡 What is it ?
---------------
This is a toy/research project whose primary goal is to get familiar with some of the technologies involved in multilingual video streaming. Using this program may violate the usage policy of the ArteTV website, and we do not recommend using it for any purpose other than studying the code.

ArteTV is a European public service channel dedicated to culture. Programmes are usually available with multiple audio and subtitle languages.

🚀 Quick start
---------------
Install the [FFMPEG](https://ffmpeg.org/download.html) binaries and ensure `ffmpeg` is in your `PATH`
```
$ ffmpeg -version
ffmpeg version N-109344-g1bebcd43e1-20221202 Copyright (c) 2000-2022 the FFmpeg developers
built with gcc 12.2.0 (crosstool-NG 1.25.0.90_cf9beb1)
```
Clone this repository
```
$ git clone https://git.afpy.org/fcode/delarte.git
$ cd delarte
```
Optionally create a virtual environment
```
$ python3 -m venv .venv
$ source .venv/Scripts/activate  # on Linux/macOS: source .venv/bin/activate
```
Install in editable mode
```
$ pip install -e .
```
Or install in editable mode with the `dev` dependencies if you intend to contribute.
```
$ pip install -e .[dev]
```
Now you can run the script
```
$ python3 -m delarte --help
or
$ delarte --help
delarte - ArteTV downloader.
Usage:
delarte (-h | --help)
delarte --version
delarte [options] URL
delarte [options] URL RENDITION
delarte [options] URL RENDITION VARIANT
Download a video from ArteTV streaming service. Omit RENDITION and/or
VARIANT to print the list of available values.
Arguments:
URL the URL from ArteTV website
RENDITION the rendition code [audio/subtitles language combination]
VARIANT the variant code [video quality version]
Options:
-h --help print this message
--version print current version of the program
--debug on error, print debugging information
--name-use-id use the program ID
--name-use-slug use the URL slug
--name-sep=<sep> field separator [default: - ]
--name-seq-pfx=<pfx> sequence counter prefix [default: - ]
--name-seq-no-pad disable sequence zero-padding
--name-add-rendition add rendition code
--name-add-variant add variant code
```
🔧 How it works
----------------
## 🏗️ The streaming infrastructure
We support both _single program pages_ and _program collection pages_. Every page ships with some embedded JSON data; examples of such data can be found [here](https://git.afpy.org/fcode/delarte/src/branch/stable/samples/www/). From that data we extract metadata for each program, in particular a _site language_ and a _program ID_. These enable us to query the _config_ API.

### The _config_ API
This API returns a `ConfigPlayer` JSON object, a sample of which can be found [here](https://git.afpy.org/fcode/delarte/src/branch/stable/samples/api/). It lists the available audio/subtitles combinations in `$.data.attributes.streams`. In our code such a combination is referred to as a _rendition_. Every _rendition_ has a reference to a _program index_ file in `.streams[i].url`.

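As a minimal sketch of how the _program index_ URLs could be read from such an object, assuming only the two paths named above (`$.data.attributes.streams` and each stream's `url`). The sample data, its URLs and the function name are hypothetical and are not taken from the delarte code:

```python
def iter_program_index_urls(config_player):
    """Yield the program index URL of every rendition in a ConfigPlayer object."""
    streams = config_player["data"]["attributes"]["streams"]
    for stream in streams:
        yield stream["url"]


# Hypothetical, heavily trimmed ConfigPlayer object (real ones carry more fields).
sample = {
    "data": {
        "attributes": {
            "streams": [
                {"url": "https://example.invalid/fr/master.m3u8"},
                {"url": "https://example.invalid/de/master.m3u8"},
            ]
        }
    }
}

for url in iter_program_index_urls(sample):
    print(url)
```
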
### The _program index_ file
This file is defined in [HTTP Live Streaming](https://www.rfc-editor.org/rfc/rfc8216) (sample files can be found [here](https://git.afpy.org/fcode/delarte/src/branch/stable/samples/hls/)). It lists the video _variant_ URIs (one per video resolution). Each of them has:
- exactly one video _track index_ reference
- exactly one audio _track index_ reference
- at most one subtitles _track index_ reference

Audio and subtitles track references also include:
- a two-letter `language` code attribute (`mul` is used for multi-language audio)
- a free-form `name` attribute that is used to detect an audio _original version_
- a coded `characteristics` attribute that is used to detect accessibility tracks (audio or textual description)

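The structure described above can be illustrated with a small standard-library parser. This is only a sketch: delarte itself relies on the `m3u8` package (see Dependencies), and the sample playlist, its names and its URIs are made up. The tag and attribute names (`EXT-X-MEDIA`, `EXT-X-STREAM-INF`, `LANGUAGE`, `NAME`, `CHARACTERISTICS`) come from RFC 8216.

```python
import re

# Matches KEY=value or KEY="quoted value" pairs of an HLS attribute list.
ATTRIBUTE = re.compile(r'([A-Z0-9-]+)=("[^"]*"|[^,]*)')


def parse_attributes(text):
    """Turn 'A=1,B="x,y"' into {'A': '1', 'B': 'x,y'}."""
    return {key: value.strip('"') for key, value in ATTRIBUTE.findall(text)}


def parse_program_index(playlist):
    """Return (variants, tracks) found in a well-formed program index file."""
    variants, tracks = [], []
    lines = [line.strip() for line in playlist.splitlines() if line.strip()]
    for i, line in enumerate(lines):
        if line.startswith("#EXT-X-MEDIA:"):
            # Audio or subtitles track reference.
            tracks.append(parse_attributes(line.split(":", 1)[1]))
        elif line.startswith("#EXT-X-STREAM-INF:"):
            variant = parse_attributes(line.split(":", 1)[1])
            variant["URI"] = lines[i + 1]  # the variant URI is on the next line
            variants.append(variant)
    return variants, tracks


# Hypothetical program index, for illustration only.
SAMPLE = """#EXTM3U
#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="audio",LANGUAGE="de",NAME="Deutsch (Original)",URI="audio_de.m3u8"
#EXT-X-MEDIA:TYPE=SUBTITLES,GROUP-ID="subs",LANGUAGE="de",NAME="Deutsch",CHARACTERISTICS="public.accessibility.transcribes-spoken-dialog",URI="subs_de.m3u8"
#EXT-X-STREAM-INF:BANDWIDTH=2000000,RESOLUTION=1280x720,AUDIO="audio",SUBTITLES="subs"
video_720.m3u8
"""

variants, tracks = parse_program_index(SAMPLE)
for track in tracks:
    accessibility = "CHARACTERISTICS" in track
    print(track["TYPE"], track["LANGUAGE"], track["NAME"], accessibility)
for variant in variants:
    print(variant["RESOLUTION"], variant["URI"])
```
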
### The video and audio _track index_ file
This file is also defined in [HTTP Live Streaming](https://www.rfc-editor.org/rfc/rfc8216) (sample files can be found [here](https://git.afpy.org/fcode/delarte/src/branch/stable/samples/hls/)). It is basically a list of _segments_ (HTTP ranges) that the client is supposed to download in sequence.

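To make the notion of segments as HTTP ranges concrete, here is a sketch that turns the `EXT-X-BYTERANGE` tags of RFC 8216 into `Range` header values. It is a simplification, not the delarte implementation: it assumes consecutive segments point to the same resource and it ignores `EXT-X-MAP`. The sample playlist is made up.

```python
def iter_http_ranges(track_index):
    """Yield (uri, Range header value) for each byte-range segment."""
    next_offset = 0
    pending = None  # (length, start) read from the last EXT-X-BYTERANGE tag
    for line in track_index.splitlines():
        line = line.strip()
        if line.startswith("#EXT-X-BYTERANGE:"):
            length, _, offset = line.split(":", 1)[1].partition("@")
            # Without an explicit offset, the range starts right after the previous one.
            start = int(offset) if offset else next_offset
            pending = (int(length), start)
        elif line and not line.startswith("#") and pending:
            length, start = pending
            yield line, f"bytes={start}-{start + length - 1}"
            next_offset = start + length
            pending = None


# Hypothetical track index, for illustration only.
SAMPLE_TRACK_INDEX = """#EXTM3U
#EXT-X-VERSION:4
#EXT-X-TARGETDURATION:6
#EXTINF:6.0,
#EXT-X-BYTERANGE:1000@0
video.mp4
#EXTINF:6.0,
#EXT-X-BYTERANGE:500
video.mp4
#EXT-X-ENDLIST
"""

for uri, range_header in iter_http_ranges(SAMPLE_TRACK_INDEX):
    print(uri, range_header)
```

Each yielded value can be sent as the `Range` header of a GET request on the segment URI.
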
### The subtitles _track index_ file
This file is also defined in [HTTP Live Streaming](https://www.rfc-editor.org/rfc/rfc8216) (sample files can be found [here](https://git.afpy.org/fcode/delarte/src/branch/stable/samples/hls/)). It references the actual file containing the subtitles as [VTT](https://developer.mozilla.org/en-US/docs/Web/API/WebVTT_API) data.

## ⚙️ The process
1. Fetch _program sources_ from the page pointed to by the given URL
2. Fetch _rendition sources_ from the _config API_
3. Filter _renditions_
4. Fetch _variant sources_ from _HLS_ _program index_ files.
5. Filter _variants_
6. Fetch final target information and figure out output naming
7. Download data streams (convert VTT subtitles to formatted SRT subtitles) and mux them with FFMPEG
## 📽️ FFMPEG
The multiplexing (_muxing_) of the video file is handled by [ffmpeg](https://ffmpeg.org/). The script expects `ffmpeg` to be installed in the environment and calls it as a subprocess.

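A minimal sketch of such a subprocess call, assuming the streams have already been downloaded to local files. The function names, file names and the choice of a Matroska output are illustrative assumptions; the exact options delarte passes to `ffmpeg` are not shown here.

```python
import subprocess


def build_mux_command(video, audio, subtitles, output):
    """Build an ffmpeg command line that muxes the inputs without re-encoding."""
    command = ["ffmpeg", "-i", video, "-i", audio]
    if subtitles:
        command += ["-i", subtitles]
    # "-c copy" copies every selected stream as is instead of re-encoding it.
    command += ["-c", "copy", output]
    return command


def mux(video, audio, subtitles, output):
    """Run ffmpeg as a subprocess; raises CalledProcessError on failure."""
    subprocess.run(build_mux_command(video, audio, subtitles, output), check=True)


# Only print the command here; calling mux() requires ffmpeg and real input files.
print(build_mux_command("video.mp4", "audio.mp4", "subtitles.srt", "program.mkv"))
```
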
### Why not use FFMPEG directly with the HLS _program index_ URL ?
So that we can be more selective about the _renditions_ and _variants_ we want.

### Why not use `VTT` subtitles directly ?
Because FFMPEG does not support styles in WebVTT 😒.

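For illustration, the timing part of a VTT to SRT conversion can be sketched as follows. This is a simplified stand-in and not the delarte converter: it renumbers cues and rewrites timestamps, but it drops cue settings and does not translate any styling. The sample input is made up.

```python
import re

# WebVTT cue timing line, e.g. "00:00:01.000 --> 00:00:04.000 line:90%".
# Hours are optional in WebVTT but mandatory in SRT.
TIMING = re.compile(
    r"(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})\s+-->\s+(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})"
)


def vtt_to_srt(vtt):
    """Convert WebVTT text to SRT, keeping cue text but dropping cue settings."""
    blocks = re.split(r"\n\s*\n", vtt.strip().replace("\r\n", "\n"))
    cues = []
    for block in blocks:
        lines = block.split("\n")
        for i, line in enumerate(lines):
            match = TIMING.match(line)
            if match:
                h1, m1, s1, ms1, h2, m2, s2, ms2 = match.groups()
                start = f"{int(h1 or 0):02d}:{m1}:{s1},{ms1}"
                end = f"{int(h2 or 0):02d}:{m2}:{s2},{ms2}"
                text = "\n".join(lines[i + 1:])
                cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}")
                break  # blocks without a timing line (header, notes) are skipped
    return "\n\n".join(cues) + "\n"


# Hypothetical subtitles, for illustration only.
SAMPLE_VTT = """WEBVTT

00:00:01.000 --> 00:00:04.000 line:90%
Hello

cue-2
01:02.500 --> 01:05.000
World
"""

print(vtt_to_srt(SAMPLE_VTT))
```
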
### Why not use FFMPEG directly with the _track index_ URLs and let it do the download ?
Because some programs would randomly fail 😒, probably due to invalid _segmentation_ on the server.

## 📌 Dependencies
- [m3u8](https://pypi.org/project/m3u8/) to parse indexes.
- [requests](https://pypi.org/project/requests/) to handle HTTP traffic.
- [docopt-ng](https://pypi.org/project/docopt-ng/) to parse command line.
## 🤝 Help
For sure ! The more the merrier.