Getting up to speed.

Introduction

I work in operations for DTube, which you may have heard of.

I've wanted to write for some time about the operational challenges we face at DTube and the methods by which we overcome them.

I'd also like to begin opening a dialogue regarding incidents as they occur, to give people a greater understanding of what's going on behind the scenes: why DTube has been down and what has been done to stop it happening again. But all that's for another post.

Ancient History

During early 2017 I came across a project called IPFS. For some reason it was fascinating and I wanted to get involved. I saw an opportunity to create a service that would store files in IPFS so that should a user's machine go offline, their hard drive fail, their entire house get washed away in a flood and so on, their content would remain online and accessible even if nobody was actively peering it.

I developed a platform that would accept bitcoin in return for pinning content.
It garnered a little interest at first and then appeared to fall off into obscurity. As my $300 trial of Google's cloud service started to come to an end, I felt it was nearly time to take the service offline and write it up as one of my many failures.
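
For anyone unfamiliar with pinning, here's a minimal sketch of the core operation using the standard go-ipfs CLI (the CID below is a placeholder, not a real hash):

    # Fetch a user's content from the network and pin it, so it stays
    # available from our node even if the original uploader goes offline.
    ipfs pin add QmUsersVideoCid...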

In mid-2017 I received an email from @hemindanger asking if the IPFS daemon had crashed on my server (it had; it did quite often).
We quickly began working together to enable uploads directly into the IPFS daemon for one of his projects.
It was initially just an nginx server which acted as a reverse proxy into the IPFS HTTP API.

About a month later DTube(.video) launched.

Recent History

Following DTube's immense success we were forced to make a series of modifications to the file upload process.
Initially, files were uploaded directly to an IPFS HTTP gateway via an NGINX reverse proxy (so we could terminate SSL and lock functionality down to just uploads).
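
As a rough sketch of that original upload path, assuming the daemon's HTTP API on its default port (the exact endpoint and options our proxy exposed may have differed):

    # Push a file straight into a local IPFS daemon over its HTTP API;
    # the reverse proxy simply exposed this kind of endpoint behind SSL.
    curl -X POST -F file=@video.mp4 "http://127.0.0.1:5001/api/v0/add"
    # The JSON response contains the content hash (CID) the file is served under.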

Then it became apparent that users were uploading MKVs, WMVs, PDFs and so on, so @SuperKoala developed the IPFS-Uploader we use today.
It acts as a shim between nginx and IPFS, re-encoding video into the correct format and outputting the file at 240p and 480p.
This means that while source video files may not work 100% of the time, the other versions should.
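
To illustrate the kind of re-encode involved (the actual flags IPFS-Uploader uses will differ; this is just a hand-rolled sketch producing H.264/AAC MP4s):

    # One ffmpeg pass per target resolution; scale=-2:N keeps the aspect
    # ratio while forcing an even width, which H.264 requires.
    ffmpeg -i source.mkv -vf scale=-2:240 -c:v libx264 -c:a aac out_240.mp4
    ffmpeg -i source.mkv -vf scale=-2:480 -c:v libx264 -c:a aac out_480.mp4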

We brought all the uploading services "in-house" (11 servers at our peak); I became a member of the DTube team and began renting dedicated hardware with large CPUs and even larger HDDs.

Around this time we were regularly running into issues where a server would reach 100% disk usage or a disk would fail.
A solution was to implement a large "saver" server for long-term storage, and to regularly take down an encoder, wipe its contents and reinstall.
This was chosen over simply rm'ing the IPFS cache, as the cache is made up of many small 256 KB files; on an 8 TB drive that works out to 31.25 million files, which means a quick format is much faster than deleting each file individually.

Due to my lack of interest in constantly reformatting machines and manually reinstalling our software on them, I began to look at Ansible as a solution.
This served as a method for rapid deployment of the software, and at its peak we could take a Debian 9 install from factory fresh to production ready in 7 minutes. The scripts used can be found here.
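
To give a flavour, a deployment run looked something like the line below (the playbook and inventory names are placeholders rather than our actual file layout):

    # Take a freshly formatted host from bare OS to production in one command.
    ansible-playbook -i inventory/production site.yml --limit new-encoder-01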

We also began to look at GPU encoding, which would enable parallelisation of encoding streams and allow us to encode in 720p, something that had previously been too computationally expensive.
The software worked fine but our rented hardware kept failing & IPFS file propagation was heavily affected by the distance between hosts (our GPU servers were in New York and the rest of our infrastructure in France). It became impractical for us to maintain GPU encoding, which resulted in 720p being dropped.
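
For context, a GPU encode of that kind looks roughly like this, assuming an NVIDIA card and an ffmpeg build with NVENC support (our exact settings weren't preserved):

    # Offload H.264 encoding to the GPU's NVENC block; several of these
    # processes can run in parallel on a single card.
    ffmpeg -i source.mkv -vf scale=-2:720 -c:v h264_nvenc -c:a aac out_720.mp4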

Following the reversion to CPU-only encoding, DTube lost the ability to parallelise encoding & each dedicated server had to be managed individually, which resulted in a large amount of work to monitor and maintain each host.

Today

Astute users will have noticed that their uploads have been going to cluster.d.tube over the past few weeks & that their videos are being served from video.dtube.top.

This is due to a new way in which DTube accepts and stores its content.

Docker

Our IPFS-Uploader program and ffmpeg are now wrapped up in Docker containers which run on a number of hosts. This gives us the ability to encode two or more files at the same time without making changes to the IPFS-Uploader application.
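
Under Swarm (more on that below) this parallelism is just a replica count. A minimal sketch, with an illustrative service and image name rather than our actual ones:

    # Run two uploader/encoder containers, spread across the swarm's nodes.
    docker service create --name uploader --replicas 2 dtube/ipfs-uploader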

It also gives us the ability to include new applications inside the current infrastructure without having to worry about constant rebuilds or separate dedicated hardware which would also need to be managed. For example, we are currently running our video-serving nodes through this infrastructure.

Finally, we chose Docker as it allows us (through the use of Swarm) to scale.

Enrolling a new dedicated server is now a very simple process: upon receiving a factory-fresh Ubuntu host we install the latest generic kernel, add the Docker repository and install Docker. Then a single command can bring the host into the cluster. From there we need only change the number of containers we'd like, and the containers are distributed across the new nodes.
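
That single command is Swarm's join, and scaling is one more command from a manager node (the token, address and service name below are placeholders):

    # Join the freshly installed host to the existing swarm.
    docker swarm join --token SWMTKN-1-<token> <manager-ip>:2377

    # From a manager, raise the replica count; Swarm schedules the extra
    # containers onto the new node.
    docker service scale uploader=6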

Ceph

We use Ceph as a large file-storage platform, providing a resilient shared drive across all of our nodes.
This gives our containers a set of shared directories from which to run applications, write IPFS caches & write full video files.
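
One common way to expose Ceph as such a shared drive is CephFS; a rough sketch of the mount (monitor address, key and mount point are placeholders, and our actual setup may differ):

    # Mount the Ceph filesystem on a node; containers then bind-mount
    # directories from under /mnt/ceph.
    mount -t ceph <mon-ip>:6789:/ /mnt/ceph -o name=admin,secret=<key>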

As with Docker, we can (a little less easily) enroll new dedicated servers in the cluster and scale.

Ceph has been tested with 30 PiB of data (30 × 1024 TiB) at CERN; we're currently running it with a storage capacity of 65 TiB, of which 28 TiB has been used at the time of writing.

What does this mean?

We've had a rocky month; following the implementation of this service there have been a number of issues I've had to address. However, I feel that DTube currently offers the best service it has since its inception:

  • We can scale to meet the needs of our users
  • We can scale quickly
  • Video encoding is faster
  • Videos are accessible more quickly

To wrap up, this has been a fantastic quarter for me personally. I've learned much more than I expected, I'm very excited about the future & I'm very much looking forward to continuing to help the project grow.