delarte

🎬 ArteTV downloader

💡 What is it ?

This is a toy/research project whose primary goal is to become familiar with some of the technologies involved in multi-lingual video streaming. Using this program may violate the usage policy of the ArteTV website, and we do not recommend using it for any purpose other than studying the code.

ArteTV is a European public service channel dedicated to culture. Programmes are usually available with multiple audio and subtitles languages.

🚀 Quick start

Install the FFMPEG binaries and ensure they are in your PATH

$ ffmpeg -version
ffmpeg version N-109344-g1bebcd43e1-20221202 Copyright (c) 2000-2022 the FFmpeg developers
built with gcc 12.2.0 (crosstool-NG 1.25.0.90_cf9beb1)

Clone this repository

$ git clone https://git.afpy.org/fcode/delarte.git
$ cd delarte

Optionally create a virtual environment (on Linux and macOS, the activation script is .venv/bin/activate instead)

$ python3 -m venv .venv
$ source .venv/Scripts/activate

Install in editable mode

$ pip install -e .

Or install in editable mode with dev dependencies if you intend to contribute.

$ pip install -e .[dev]

Now you can run the script

$ python3 -m delarte --help
or
$ delarte --help
delarte - ArteTV downloader.

Usage:
  delarte (-h | --help)
  delarte --version
  delarte [options] URL
  delarte [options] URL RENDITION
  delarte [options] URL RENDITION VARIANT

Download a video from ArteTV streaming service. Omit RENDITION and/or
VARIANT to print the list of available values.

Arguments:
  URL         the URL from ArteTV website
  RENDITION   the rendition code [audio/subtitles language combination]
  VARIANT     the variant code [video quality version]

Options:
  -h --help              print this message
  --version              print current version of the program
  --debug                on error, print debugging information
  --name-use-id          use the program ID
  --name-use-slug        use the URL slug
  --name-sep=<sep>       field separator [default:  - ]
  --name-seq-pfx=<pfx>   sequence counter prefix [default:  - ]
  --name-seq-no-pad      disable sequence zero-padding
  --name-add-rendition   add rendition code
  --name-add-variant     add variant code

🔧 How it works

🏗️ The streaming infrastructure

We support both single program pages and program collection pages. Every page ships with some embedded JSON data (we do not keep samples, as the structure seems to change regularly). From that data we extract metadata for each program, in particular a site language and a program ID. These enable us to query the config API.
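As an illustration of pulling embedded JSON out of a page, here is a minimal sketch using only the standard library. It assumes the data sits in a script element with type application/json, which is a guess for the sake of the example: the real page structure is not documented here and changes regularly. The class name, function name and sample fields are all hypothetical.

```python
import json
from html.parser import HTMLParser


class EmbeddedJsonParser(HTMLParser):
    """Collect the content of <script type="application/json"> elements.

    The script type is an assumption made for this sketch.
    """

    def __init__(self):
        super().__init__()
        self._chunks = None  # text chunks while inside a matching script
        self.documents = []  # decoded JSON payloads, in page order

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/json":
            self._chunks = []

    def handle_data(self, data):
        if self._chunks is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._chunks is not None:
            self.documents.append(json.loads("".join(self._chunks)))
            self._chunks = None


def extract_embedded_json(html):
    """Return every JSON document embedded in the given HTML text."""
    parser = EmbeddedJsonParser()
    parser.feed(html)
    parser.close()
    return parser.documents


# Made-up page; the field names are placeholders, not the real structure.
SAMPLE_PAGE = """<html><body>
<script type="application/json">{"language": "fr", "programId": "000000-000-A"}</script>
</body></html>"""

documents = extract_embedded_json(SAMPLE_PAGE)
```

Keeping the extraction separate from the interpretation of the JSON makes it easier to adapt when the page structure changes.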

The config API

This API returns a ConfigPlayer JSON object, a sample of which can be found here. It lists the available audio/subtitles combinations in $.data.attributes.streams. In our code, such a combination is referred to as a rendition. Every rendition has a reference to a program index file in .streams[i].url.
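To make the two paths above concrete, the following sketch walks a ConfigPlayer-like dictionary and collects the program index URL of each rendition. Only the data.attributes.streams path and the url key come from the description above; the function name and the sample values are hypothetical.

```python
def list_program_index_urls(config):
    """Return the program index URL of every stream in a ConfigPlayer-like dict."""
    streams = config["data"]["attributes"]["streams"]
    return [stream["url"] for stream in streams]


# Made-up object reduced to the fields described above.
SAMPLE_CONFIG = {
    "data": {
        "attributes": {
            "streams": [
                {"url": "https://example.invalid/fr/index.m3u8"},
                {"url": "https://example.invalid/de/index.m3u8"},
            ]
        }
    }
}

urls = list_program_index_urls(SAMPLE_CONFIG)
```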

The program index file

As defined in HTTP Live Streaming (sample files can be found here). This file lists the video variant URIs (one per video resolution). Each of them has

  • exactly one video track index reference
  • exactly one audio track index reference
  • at most one subtitles track index reference

Audio and subtitles track references also include:

  • a two-letter language code attribute (mul is used for multi-language audio)
  • a free form name attribute that is used to detect an audio original version
  • a coded characteristics attribute that is used to detect accessibility tracks (audio or textual description)
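The structure described above maps onto two HLS tags: EXT-X-STREAM-INF for video variants and EXT-X-MEDIA for audio and subtitles tracks. Here is a minimal sketch of reading them, not the project's actual parser. The function names and the sample playlist are made up for illustration, and the sketch assumes a well-formed playlist.

```python
import re

# One KEY=VALUE pair of an HLS attribute list; quoted values may contain commas.
ATTRIBUTE = re.compile(r'([A-Z0-9-]+)=("[^"]*"|[^,]*)')


def parse_attributes(text):
    """Turn an HLS attribute list into a dict, dropping surrounding quotes."""
    return {key: value.strip('"') for key, value in ATTRIBUTE.findall(text)}


def parse_program_index(playlist):
    """Return (media, variants) attribute dicts from a master playlist."""
    media, variants = [], []
    lines = [line.strip() for line in playlist.splitlines() if line.strip()]
    for i, line in enumerate(lines):
        if line.startswith("#EXT-X-MEDIA:"):
            media.append(parse_attributes(line.split(":", 1)[1]))
        elif line.startswith("#EXT-X-STREAM-INF:"):
            attributes = parse_attributes(line.split(":", 1)[1])
            attributes["URI"] = lines[i + 1]  # the URI follows the tag line
            variants.append(attributes)
    return media, variants


# Made-up playlist for illustration only.
SAMPLE_INDEX = """#EXTM3U
#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="audio",NAME="Français",LANGUAGE="fr",URI="audio_fr.m3u8"
#EXT-X-MEDIA:TYPE=SUBTITLES,GROUP-ID="subs",NAME="Deutsch",LANGUAGE="de",CHARACTERISTICS="public.accessibility.transcribes-spoken-dialog",URI="subs_de.m3u8"
#EXT-X-STREAM-INF:BANDWIDTH=2000000,RESOLUTION=1280x720,CODECS="avc1.4d401f,mp4a.40.2",AUDIO="audio",SUBTITLES="subs"
video_720.m3u8
"""

media, variants = parse_program_index(SAMPLE_INDEX)
```

The LANGUAGE, NAME and CHARACTERISTICS keys of each media entry correspond to the three attributes listed above.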

The video and audio track index file

As defined in HTTP Live Streaming (sample files can be found here). This file is basically a list of segments (HTTP ranges) that the client is supposed to download in sequence.
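In HLS, a segment that is a byte range of a larger resource is announced with an EXT-X-BYTERANGE tag of the form length[@offset] placed before the segment URI. The sketch below, under the assumption that the track index uses this tag, turns such a playlist into a list of byte ranges and shows the matching HTTP Range header. Function names and the sample playlist are hypothetical.

```python
def parse_track_index(playlist):
    """Return (uri, first_byte, last_byte) for each segment of a media playlist."""
    segments = []
    pending = None  # (length, offset or None) from the last EXT-X-BYTERANGE tag
    # Simplification: when "@offset" is omitted, continue after the previous
    # range seen for the same URI.
    next_offset = {}
    for line in playlist.splitlines():
        line = line.strip()
        if line.startswith("#EXT-X-BYTERANGE:"):
            length, _, offset = line.split(":", 1)[1].partition("@")
            pending = (int(length), int(offset) if offset else None)
        elif line and not line.startswith("#"):
            if pending is None:
                segments.append((line, None, None))  # the whole resource
            else:
                length, offset = pending
                start = next_offset.get(line, 0) if offset is None else offset
                segments.append((line, start, start + length - 1))
                next_offset[line] = start + length
                pending = None
    return segments


def range_header(first_byte, last_byte):
    """Build the HTTP header requesting an inclusive byte range."""
    return {"Range": f"bytes={first_byte}-{last_byte}"}


# Made-up playlist for illustration only.
SAMPLE_TRACK = """#EXTM3U
#EXT-X-TARGETDURATION:6
#EXTINF:6.0,
#EXT-X-BYTERANGE:1000@0
video.mp4
#EXTINF:6.0,
#EXT-X-BYTERANGE:500
video.mp4
#EXT-X-ENDLIST
"""

segments = parse_track_index(SAMPLE_TRACK)
```

Downloading the segments in list order and appending the bytes to a single file yields the complete track.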

The subtitles track index file

As defined in HTTP Live Streaming (sample files can be found here). This file references the actual file containing the subtitles VTT data.

⚙️ The process

  1. Fetch program sources from the page pointed to by the given URL
  2. Fetch rendition sources from config API
  3. Filter renditions
  4. Fetch variant sources from HLS program index files.
  5. Filter variants
  6. Fetch final target information and figure out output naming
  7. Download data streams (convert VTT subtitles to formatted SRT subtitles) and mux them with FFMPEG

📽️ FFMPEG

The multiplexing (muxing) of the video file is handled by ffmpeg. The script expects ffmpeg to be installed in the environment and will call it as a subprocess.
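A minimal sketch of calling ffmpeg as a subprocess to mux already downloaded tracks could look like the following. It is not the command the project actually runs: the function names and file names are hypothetical, and the sketch assumes a Matroska (.mkv) output so that all streams, including SRT subtitles, can be copied without re-encoding.

```python
import shutil
import subprocess


def build_mux_command(video, audio, subtitles, output):
    """Build an ffmpeg command that copies every input stream into one file."""
    inputs = [video, audio] if subtitles is None else [video, audio, subtitles]
    command = ["ffmpeg"]
    for path in inputs:
        command += ["-i", path]
    for index in range(len(inputs)):
        command += ["-map", str(index)]  # keep all streams of each input
    command += ["-c", "copy", output]  # no re-encoding
    return command


def mux(video, audio, subtitles, output):
    """Run ffmpeg, failing early when it is not installed."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg was not found in PATH")
    subprocess.run(build_mux_command(video, audio, subtitles, output), check=True)


command = build_mux_command("video.mp4", "audio.mp4", "subtitles.srt", "program.mkv")
```

Passing the command as a list rather than a shell string avoids quoting problems with file names that contain spaces.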

Why not use FFMPEG directly with the HLS program index URL ?

So we can be more granular about renditions and variants that we want.

Why not use VTT subtitles directly ?

Because FFMPEG does not support styles in WebVTT 😒.
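To give an idea of what a VTT to SRT conversion involves, here is a deliberately simplified sketch. It only renumbers the cues and rewrites the timestamps (SRT uses a comma before the milliseconds and always includes hours); cue text is passed through untouched, so it does not perform the style conversion the project does. The function names and the sample are hypothetical.

```python
import re

# A WebVTT timing line; the hours part is optional in WebVTT.
TIMING = re.compile(
    r"(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})\s+-->\s+(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})"
)


def _srt_time(hours, minutes, seconds, millis):
    """Format one timestamp the SRT way: HH:MM:SS,mmm."""
    return f"{int(hours or 0):02d}:{minutes}:{seconds},{millis}"


def vtt_to_srt(vtt):
    """Convert WebVTT text to SRT text, keeping the cue text as-is."""
    blocks = re.split(r"\n\s*\n", vtt.replace("\r\n", "\n").strip())
    cues = []
    for block in blocks:
        lines = block.split("\n")
        # Blocks without a timing line (header, NOTE, STYLE) are skipped.
        for i, line in enumerate(lines):
            match = TIMING.match(line)
            if match:
                break
        else:
            continue
        start = _srt_time(*match.groups()[:4])
        end = _srt_time(*match.groups()[4:])
        text = "\n".join(lines[i + 1:])
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)


# Made-up subtitles for illustration only.
SAMPLE_VTT = """WEBVTT

1
00:00:01.000 --> 00:00:04.000 line:90%
Bonjour

00:05.500 --> 00:07.250
<i>Hello</i>
"""

srt = vtt_to_srt(SAMPLE_VTT)
```

Cue settings such as line:90% are dropped because SRT has no equivalent for them.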

Why not use FFMPEG directly with the track index URLs and let it do the download ?

Because some programs would randomly fail 😒. Probably due to invalid segmentation on the server.

📌 Dependencies

🤝 Help

For sure ! The more the merrier.