Welcome to the WebVision Video Challenge


as part of the Workshop on Visual Understanding by Learning from Web Data




June 14th - 19th, 2020

Seattle, WA

in conjunction with CVPR 2020


Collecting data for large-scale action classification is becoming more and more time-consuming. This puts a natural limit on the size of current benchmarks. Additionally, techniques developed and fine-tuned on such data do not naturally transfer to applications in the wild. To address this problem, we want to move a few steps away from the usual action classification and explore the problem of learning actions from real-life videos without human supervision.

We are therefore happy to bring you the first WebVision Video Challenge! The idea of this challenge is to learn action classes in videos from subtitles only. To this end, we collected 20,000 YouTube videos dealing with various egg recipes such as pancake, omelet, or egg roll. The challenge is to learn the actions mentioned in those videos, e.g. "crack egg" or "add butter", without any human-generated labels. To make it easy for everyone to bring their ideas to life, the challenge has two tracks, one based on all available video data and one based on features only. So everyone can get involved, whether or not you have a large GPU cluster. We also provide a short tutorial and baseline code to get started in about one hour. Just check it out!

The WebVision Video Challenge is part of the Workshop on Visual Understanding by Learning from Web Data. The workshop aims to advance the learning of state-of-the-art visual models from webly supervised data. We want to transfer this idea to the case of learning action representations from video subtitles without any human supervision.


Getting Started

Challenge details

General rules

Frequently asked questions



The data for this challenge is based on the MiningYouTube dataset (you can find all details here: https://github.com/hildekuehne/Weak_YouTube_dataset). The dataset comprises ~20,000 YouTube videos that explain various egg recipes, namely fried egg, scrambled egg, pancake, omelet, and egg roll.


We provide three different data modalities: full videos (that is, YouTube IDs; you need to download the videos yourself), pre-extracted action clips (~5-10 sec per clip), and precomputed features for the action clips.

a) Full videos: ~20,000 video indexes of the full YouTube videos with the respective subtitles (subtitles downloaded in 2017/2018). You have to download the videos yourself or contact webvisionworkshop AT gmail.com.

b) Action clips: ~350,000 pre-extracted video clips with tentative labels (careful, not all of them are correct!) and subtitles.

c) Features: ~350,000 data files (packed in 350 HDF5 files) with precomputed RGB and flow features (https://github.com/yjxiong/temporal-segment-networks) for each frame, based on a Kinetics-pretrained backbone. Features are computed from the pre-extracted video clips with tentative labels (again, careful, not all of them are correct!) and subtitles. Working with precomputed features allows faster development and testing. New ideas on the mining and/or concept learning side in particular should be easy to test.

Each HDF5 file comprises the features, tentative labels, subtitles, and respective filenames. You can access the contents of an HDF5 file as follows:

import h5py

# file_in: path to one of the packed training files
with h5py.File(file_in, 'r') as hf:
    temp_feat = hf['data'][()]            # features
    temp_labels = hf['labels'][()]        # tentative labels
    temp_subtitles = hf['subtitles'][()]  # subtitles
    temp_filenames = hf['filenames'][()]  # respective filenames

2) Validation: The data additionally includes a set of ~5,000 video clip IDs with class labels and a human annotation stating whether this class label is present in the video. This set can be used for training or validation.

3) Test: The test data (and validation data for the challenge) is available under:

4) Challenge data: We will have two tracks for this challenge, one based on the original videos and one based on precomputed features only.

a) Video Track: For the original video track, you are free to use the full videos or the pre-extracted video clips:

b) Feature Track: For the feature track, you are only allowed to use the precomputed features (listed under c) Features) from the 350k clips.

Getting started

We provide vanilla benchmark code that works on features only, to give you an idea of the task and the data. The following example has been tested under Python 3.6.

1) Download data

To get started with the code:

Your folder structure should look like this:

| +---train
| | +---packed_numpy_new_flow_rgb
| | | file_0000.h5f
| | | ...
| |
| \---test
| +---features
| | blahblah.npy
| | ...
| +---groundTruth
| | blahblah.txt
| | ...
| +---transcripts
| | blahblah.txt
| | ...

2) Checkout code

Checkout the challenge repository: https://github.com/qinenergy/webvision-2020-public

3) Run the videolearning example

The root folder for the videolearning example is <checkout_path>/webvision-2020-public/videolearning

a) Install the requirements from requirements.txt in your preferred environment.

pip install -r <checkout_path>/webvision-2020-public/videolearning/requirements.txt

b) Adjust paths in the config file

Open the config file under <checkout_path>/webvision-2020-public/videolearning/config

Replace all occurrences of /my/data/folder/ with the path where you stored the data.

c) Training
The training is run by the function



The dataset needs ~150 GB of memory plus some overhead if you want to balance the training data. As not all systems provide this much memory, we provide a full and a sparse data loader:

Full dataloader
If you have enough memory, please set the "sparse" flag in the config file to 0 and you can start the training.

Sparse dataloader
If you don't have enough memory, please set the "sparse" flag in the config file to 1 and set "sparse_num_frames" to the number of frames you want to load per file in each epoch. Depending on the fraction of each file you load, you also need to adapt the number of epochs. For example, each HDF5 file has ~150k frames, so if you load 15k frames per file, you need 10 epochs until all data has been processed once, and you need to multiply the number of epochs by 10 to get the same amount of training.
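The epoch scaling described above can be worked through as a small calculation. This is only a sketch: the function name `epochs_for_sparse_loading` and its parameters are hypothetical and not part of the challenge code, and it assumes that frames are not drawn twice before every frame of a file has been seen.

```python
import math

def epochs_for_sparse_loading(frames_per_file, sparse_num_frames, full_epochs):
    """Return (passes_per_sweep, scaled_epochs) for a sparse data loader.

    passes_per_sweep: epochs needed until every frame of a file was seen once.
    scaled_epochs: sparse epochs that correspond to `full_epochs` full epochs.
    """
    passes_per_sweep = math.ceil(frames_per_file / sparse_num_frames)
    return passes_per_sweep, passes_per_sweep * full_epochs

# Numbers from the text: ~150k frames per file, 15k frames loaded per epoch.
passes, epochs = epochs_for_sparse_loading(150_000, 15_000, full_epochs=5)
print(passes, epochs)  # 10 50
```

So a schedule of 5 epochs with the full loader would correspond to roughly 50 epochs with "sparse_num_frames" set to 15000.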

Balancing training data
As the original distribution of the training data is highly imbalanced, we recommend balancing the training data (downsample the most frequent classes, upsample the least frequent classes). You can do this by setting the flag "balance_training" in the config file to 1. The implementation can be found in the respective dataset classes ./datasets/train_ds_full.py and ./datasets/train_ds_sparse.py and can be modified for different ratios.
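To illustrate the idea of balancing, here is a minimal NumPy sketch that draws the same number of samples for every class. It is not the implementation in train_ds_full.py or train_ds_sparse.py; the function `balanced_indices` and the parameter `per_class` are assumptions made for this example.

```python
import numpy as np

def balanced_indices(labels, per_class, seed=0):
    """Return sample indices with exactly `per_class` entries for each class.

    Classes with more samples are downsampled (without replacement),
    classes with fewer samples are upsampled (with replacement).
    """
    rng = np.random.RandomState(seed)
    labels = np.asarray(labels)
    chosen = []
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        replace = len(idx) < per_class
        chosen.append(rng.choice(idx, size=per_class, replace=replace))
    return np.concatenate(chosen)

# Toy labels: class 0 is frequent, class 1 is rare.
labels = np.array([0] * 8 + [1] * 2 + [2] * 4)
idx = balanced_indices(labels, per_class=4)
print(np.bincount(labels[idx]))  # [4 4 4]
```

Choosing `per_class` between the smallest and largest class count trades off how much data is discarded against how often rare samples are repeated.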

d) Computing output probabilities
The test function expects a score in the range [0, 1] for each class for each frame of the video. You can compute output probabilities with the function:



Use softmax?
The function uses softmax to map the output of the last linear layer to [0, 1]. If you don't want that, please comment out line 49 in mp_save_probs_webvisionTest.py.
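For reference, a numerically stable softmax over the class axis can be sketched in NumPy as follows. This is a generic illustration, not the code in mp_save_probs_webvisionTest.py, and the toy logits are made up.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax over the class axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

# Toy logits for 2 frames and 3 classes (illustrative values only).
logits = np.array([[2.0, 1.0, 0.1], [0.0, 0.0, 0.0]])
probs = softmax(logits)
print(probs.sum(axis=1))  # every row sums to 1
```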

Use conditional probabilities
The function can further make use of the class prior from training to compute the conditional probabilities of the class scores (which might be a good idea if the training data is not balanced). You can turn this function on or off by setting the parameter "get_cond_probs" in the config file.
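One common way to use a class prior, assumed here for illustration, is to divide the posterior p(c|x) by the prior p(c), which by Bayes' rule is proportional to the class-conditional likelihood p(x|c). The sketch below shows this idea; the challenge code may compute it differently, and the function names are hypothetical.

```python
import numpy as np

def class_prior(train_labels, num_classes, eps=1e-8):
    """Estimate p(c) as the relative frequency of each class in training."""
    counts = np.bincount(train_labels, minlength=num_classes).astype(float)
    return np.maximum(counts / counts.sum(), eps)

def scaled_likelihood(posteriors, prior):
    """Divide p(c|x) by p(c), which is proportional to p(x|c).

    Rows are renormalised so that the scores stay within [0, 1].
    """
    scores = posteriors / prior
    return scores / scores.sum(axis=1, keepdims=True)

train_labels = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2, 2])  # imbalanced toy labels
prior = class_prior(train_labels, num_classes=3)          # [0.6, 0.2, 0.2]
posteriors = np.array([[0.6, 0.2, 0.2]])                  # identical to the prior
print(scaled_likelihood(posteriors, prior))               # uniform, 1/3 per class
```

A frame whose posterior merely reproduces the prior carries no evidence for any class, which is why it maps to a uniform score.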

e) Testing

For testing please run the function:


We measure accuracy as intersection over union (Jaccard IoU): we first compute the framewise IoU of all classes and take the mean over all present classes as the IoU for each video. The final score is the mean over all video IoUs.
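The metric can be sketched as follows. This is an illustration, not the evaluation server's routine; in particular, it assumes that "present classes" means the classes that occur in the ground truth of a video, and the function names are hypothetical.

```python
import numpy as np

def video_iou(pred, gt):
    """Mean Jaccard IoU over the classes present in the ground truth.

    pred, gt: 1-D integer arrays with one class index per frame.
    """
    ious = []
    for cls in np.unique(gt):
        intersection = np.sum((pred == cls) & (gt == cls))
        union = np.sum((pred == cls) | (gt == cls))
        ious.append(intersection / union)
    return float(np.mean(ious))

def dataset_iou(videos):
    """Mean of the per-video IoUs; `videos` is a list of (pred, gt) pairs."""
    return float(np.mean([video_iou(p, g) for p, g in videos]))

gt = np.array([0, 0, 0, 1, 1, 1])
pred = np.array([0, 0, 1, 1, 1, 1])
print(video_iou(pred, gt))  # class 0: 2/3, class 1: 3/4, mean 17/24
```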

The testing routine provided here is run "as is" on the evaluation server.


Some remarks about the testing routine:

Annotation by natural language can be very inconsistent and even contradictory. For example, we have three different classes "whisk_egg", "beat_egg", and "mix_egg", which obviously all refer to the same action, and we have other classes such as "add_pepper", which can refer to bell pepper as well as to ground pepper.

It is therefore difficult to assess the classification by just comparing the max-score label to the annotated one, as nobody knows whether the annotator was more of a "whisk_egg", "beat_egg", or "mix_egg" type of person.

We therefore decided to use the task of video alignment to test the quality of your classifier. Alignment means that the transcripts of the actions (i.e. the action labels in the right order) are already given and the task is to find the right boundaries for the given actions in the video. We know from previous work on weak learning for video sequences (see e.g. https://ieeexplore.ieee.org/document/8585084, https://arxiv.org/abs/1610.02237) that this task is usually a good surrogate for the overall classification accuracy. In this case it helps to avoid any language inconsistencies, as it aligns the output to the correct action labels only and ignores the rest. It is therefore not so important which score was given to "mix_egg" or "beat_egg", as only the scores of the class "whisk_egg" would be considered (if this was the annotation).
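To make the alignment task concrete, here is a minimal dynamic-programming (Viterbi-style) sketch that assigns every frame to one transcript entry while preserving the transcript order. It is a simplified illustration under stated assumptions (every action gets at least one frame, no length model), not the testing routine run on the evaluation server; the function name `align` is hypothetical.

```python
import numpy as np

def align(probs, transcript, eps=1e-10):
    """Assign each frame to one transcript entry, keeping the given order.

    probs: array (num_frames, num_classes) with per-frame class scores.
    transcript: ordered list of class indices, len(transcript) <= num_frames.
    Returns an array (num_frames,) with the aligned class index per frame.
    """
    log_p = np.log(probs[:, transcript] + eps)  # only transcript classes matter
    T, N = log_p.shape
    score = np.full((T, N), -np.inf)
    entered = np.zeros((T, N), dtype=bool)      # True if frame t starts action n
    score[0, 0] = log_p[0, 0]
    for t in range(1, T):
        for n in range(min(t + 1, N)):
            stay = score[t - 1, n]
            advance = score[t - 1, n - 1] if n > 0 else -np.inf
            entered[t, n] = advance > stay
            score[t, n] = max(stay, advance) + log_p[t, n]
    # Backtrack from the last frame, which must belong to the last action.
    n = N - 1
    path = np.empty(T, dtype=int)
    for t in range(T - 1, -1, -1):
        path[t] = transcript[n]
        if entered[t, n]:
            n -= 1
    return path

probs = np.array([[0.9, 0.1, 0.0],
                  [0.8, 0.2, 0.0],
                  [0.1, 0.2, 0.7],
                  [0.1, 0.1, 0.8]])
print(align(probs, [0, 2]))  # [0 0 2 2]
```

Note that the scores of class 1 never influence the result, since only the columns of the classes in the transcript are used.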

f) Prepare challenge submission

Prepare data:

  • Under /my/data/folder/, create a folder /challenge, then download and unpack the complete feature files listed under Data 4b) (features.tar).
  • Your folder structure should look like this:

| +---train
| | +---packed_numpy_new_flow_rgb
| | | file_0000.h5f
| | | ...
| |
| +---test
| | +---features
| | | blahblah.npy
| | | ...
| | +---groundTruth
| | | blahblah.txt
| | | ...
| | +---transcripts
| | | blahblah.txt
| | | ...
| \---challenge
| +---features
| | file_0000.npy
| | ...

Run code:

  • Change the folder specified under "test_features" and "out_probs" in the config file to the challenge directory
  • Run step d), function mp_save_probs_webvisionTest.py, with the challenge test data
  • You should find the output probabilities in the folder specified under "out_probs" in the config file.
  • Zip all 50 numpy files into one zip file without any directories and submit it to the CodaLab challenge website. The submission file should look like this: dummy_submission.zip
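The last step can be done with Python's standard zipfile module. The following is a minimal sketch; the helper name `zip_flat` is hypothetical, and the folder you pass in should be the one configured under "out_probs".

```python
import os
import zipfile

def zip_flat(prob_dir, zip_path):
    """Write every .npy file in `prob_dir` to `zip_path` without directories."""
    names = sorted(f for f in os.listdir(prob_dir) if f.endswith(".npy"))
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in names:
            # arcname=name stores only the file name, not the folder path.
            zf.write(os.path.join(prob_dir, name), arcname=name)
    return names
```

Calling `zip_flat(out_probs_dir, "submission.zip")`, where `out_probs_dir` is a placeholder for your output folder, returns the list of archived file names, so you can check that it contains exactly 50 entries before uploading.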

Challenge details

We run two tracks in this challenge. For both tracks, you need to submit a .zip file with 50 numpy files that have the same filenames as the original video or image files (file_0000.npy, file_0001.npy, etc.). Each numpy array needs to have the shape (num_frames, num_classes) with num_classes = 513. You can find the class mapping in the file <checkout_path>/webvision-2020-public/videolearning/src/mapping_without_meta.py. Files generated by the function mp_save_probs_webvisionTest.py (see Getting started, d)) are already in the right format.
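If you produce the arrays with your own code, a quick sanity check of the expected format may be useful before submitting. The sketch below checks the shape and the [0, 1] score range stated above; the helper `check_prediction` is hypothetical and not part of the challenge repository.

```python
import numpy as np

NUM_CLASSES = 513  # from the challenge description

def check_prediction(array, num_classes=NUM_CLASSES):
    """Return a list of problems found in one per-video score array."""
    problems = []
    if array.ndim != 2 or array.shape[1] != num_classes:
        problems.append("expected shape (num_frames, %d), got %s"
                        % (num_classes, array.shape))
    elif array.min() < 0.0 or array.max() > 1.0:
        problems.append("scores must lie in [0, 1]")
    return problems

good = np.full((120, NUM_CLASSES), 1.0 / NUM_CLASSES)
bad = np.zeros((120, 512))
print(check_prediction(good))  # []
print(check_prediction(bad))   # one shape problem
```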

Video track

Submission: CodaLab Webvision - Video track

For the video track you are allowed to use any video/image data provided in the dataset. This can be the full videos or the video clips. You can train any CNN architecture and make use of the provided subtitles.

Feature track

Submission: CodaLab Webvision - Video track - FEATURES ONLY!

For the feature track, you are only allowed to use the features provided under Data 1c) as well as any subtitle information of the clips or the full videos.

For both tracks, you are allowed to use additional text data and knowledge sources that are publicly available. For details, please check out the General rules and the FAQ section.

General rules

Training data: You are allowed to use the yes/no validation data listed in the 'val_yes_no.txt' file (here:) for validation and/or training. It's only a few clips per class, so the assumption is that it will not get you all the way, but any new ideas are welcome.

Subtitles: You are only allowed to use the original subtitles or the generated labels from the baseline. Please do not download new subtitles! They can change over time, and we would no longer be able to compare your methods to others.

Validation and testing: You can use the test set of the original dataset as a validation set. You are not allowed to include data from the test set as additional training data!

As a rule of thumb, please keep everything reproducible!

Frequently Asked Questions

Can I use pretrained models for action classification?
Yes, but:

  • Either: The model or the data that you use have to be freely available for everyone (e.g. the TSN or I3D models).
  • Or: You can only use ONE single PUBLIC dataset such as Kinetics, ActivityNet, etc.
  • In both cases, you need to submit your results to the video track, NOT the feature track.
  • You need to indicate what dataset was used for pretraining or which backbone you used.
  • You are not allowed to mix various datasets for better pretraining.
  • You are not allowed to use any datasets or models for pretraining that are not publicly available.

Can I use other/more videos without human annotations?
No. For fairness, we restrict the challenge to the videos provided with the MiningYouTube dataset. You are allowed to use one (!) additional public action dataset for pretraining. You are not allowed to crawl for more videos yourself.

Can I use the text data (subtitles) in the MiningYouTube dataset?
Yes, and we encourage you to do so. You are not allowed to download updated subtitles for the videos!

Can I use external text data, or models pretrained with external text data, with or without human annotation?
Yes, but any data or model has to be publicly available, e.g. WordNet, Knowledge Graph, etc. can be used. Models trained on external text data, such as Word2Vec or BERT models, are also allowed, as long as the data is available or made available. You need to explicitly state in your final submission which text datasets and models were used.

Can I crawl text data according to MiningYouTube concepts by myself, and use it as training data?
Yes, as long as the data is publicly and legally available, so that people can reproduce the results. If you crawl text data yourself, please clearly state this in your submission and make the data available to the public before the final submission deadline. A URL should be provided in the method description part of your submission.


If you have any questions, please drop an email to webvisionworkshop AT gmail.com or kuehne AT ibm.com