Welcome to the WebVision Video Challenge
as part of THE 4TH WORKSHOP ON VISUAL UNDERSTANDING BY LEARNING FROM WEB DATA 2020
June 14th - 19th, 2020, Seattle, WA, in conjunction with CVPR 2020
Collecting data for large-scale action classification is becoming more and more time consuming. This puts a natural limit on the size of current benchmarks. Additionally, techniques developed and fine-tuned on such data do not naturally transfer to applications in the wild. To address this problem, we want to move a few steps away from the usual action classification task and explore the problem of learning actions from real-life videos without human supervision. To this end, we are happy to bring you the first WebVision Video Challenge! The idea of this challenge is to learn action classes in videos from subtitles only. We collected 20,000 YouTube videos dealing with various egg recipes such as pancake, omelet, or egg roll. The challenge is to learn the actions mentioned in those videos, e.g. "crack egg" or "add butter", without any human-generated labels. To make it easy for everyone to bring their ideas to life, we will have two tracks for this challenge, one based on all available video data and one based on features only. So everyone, no matter whether you have a large GPU cluster or not, can get involved. We also provide a short tutorial and baseline code to get started within an hour. Just check it out! The WebVision Video Challenge is part of the Workshop on Visual Understanding by Learning from Web Data. The workshop aims at promoting the advance of learning state-of-the-art visual models from webly supervised data. We want to transfer this idea to the case of learning action representations from video subtitles without any human supervision.

Data

The data for this challenge is based on the MiningYouTube dataset (you can find all details here: https://github.com/hildekuehne/Weak_YouTube_dataset). The dataset comprises ~20,000 YouTube videos that explain various egg recipes, namely fried egg, scrambled egg, pancake, omelet, and egg roll.

1) Training

We provide three different data modalities: full videos (resp. YouTube IDs; you need to download the videos yourself), pre-extracted action clips (~5-10 sec per clip), and pre-computed features for the action clips.

a) Full videos: ~20,000 video indexes of the full YouTube videos with the respective subtitles (subtitles downloaded in 2017/2018); you would have to download the videos or contact webvisionworkshop AT gmail.com
b) Action clips: ~350,000 pre-extracted video clips with tentative labels (careful, not all of them are correct!) and subtitles
c) Features: ~350,000 data files (packed in 350 hdf5 files) with pre-computed RGB and flow features (https://github.com/yjxiong/temporal-segment-networks) for each frame, based on a Kinetics-pretrained backbone. The features are computed from the pre-extracted video clips with tentative labels (again, careful, not all of them are correct!) and subtitles. Working with pre-computed features allows faster development and testing; especially new ideas on the mining and/or concept learning side should be easy to test.
Each hdf5 file comprises the features, tentative labels, subtitles, and respective filenames. You can access the contents of an hdf5 file as follows:
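A minimal sketch of inspecting one of the feature files with h5py is given below. Note that the key names ("features", "labels", "subtitles", "filenames") and the file name are assumptions for illustration; please list the actual keys of your downloaded files first.

import h5py

# Sketch only: file name and key names are assumptions, check your download first.
with h5py.File("/my/data/folder/features/features_000.hdf5", "r") as f:
    print(list(f.keys()))              # show what the file actually contains

    feats = f["features"][:]           # per-frame RGB/flow features
    labels = f["labels"][:]            # tentative (noisy!) clip labels
    subs = f["subtitles"][:]           # subtitle strings per clip
    names = f["filenames"][:]          # clip identifiers

print(feats.shape, labels.shape)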
2) Validation: The dataset additionally has a set of ~5,000 video clip IDs with class labels and a human annotation of whether the class label is actually present in the clip, which can be used for training or validation.
3) Test: The test data (and validation data for the challenge) is available under:
4) Challenge data: We will have two tracks for this challenge, one based on the original videos and one based on pre-computed features only.

a) Video Track: For the original video track, you are free to use the full videos or the pre-extracted video clips:
b) Feature Track: For the feature track, you are only allowed to use the pre-extracted features (listed under 1c) Features) from the 350k clips.
Getting started

We provide vanilla benchmark code that works on features only to give an idea of the task and the data. The following example has been tested under Python 3.6.

1) Download data

To get started with the code:
Your folder structure should look like this:
2) Check out the code

Check out the challenge repository: https://github.com/qinenergy/webvision-2020-public

3) Run the videolearning example

The root folder for the videolearning example is <checkout_path>/webvision-2020-public/videolearning

a) Install the requirements.txt in your preferred environment.
b) Adjust paths in the config file

Open the config file and replace all occurrences of /my/data/folder/ with the path where you stored the data.

c) Training
Details: The dataset needs ~150 GB of memory plus some overhead in case you want to balance the training data. As not all systems provide this much memory, we provide both a full and a sparse data loader (see the sketch after this list):
Full dataloader
Sparse dataloader
Balancing training data
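To illustrate the difference, here is a minimal sketch (not the repository's actual loader classes, and assuming the hdf5 key names used in the example above): a full loader reads every hdf5 file into memory once, while a sparse loader only keeps an index and reads single items on demand.

import glob
import h5py
import numpy as np

class FullLoader:
    """Reads all features into memory once (~150 GB for the full set)."""
    def __init__(self, pattern="/my/data/folder/features/*.hdf5"):
        feats, labels = [], []
        for path in sorted(glob.glob(pattern)):
            with h5py.File(path, "r") as f:
                feats.append(f["features"][:])
                labels.append(f["labels"][:])
        self.feats = np.concatenate(feats)
        self.labels = np.concatenate(labels)

    def __getitem__(self, idx):
        return self.feats[idx], self.labels[idx]

class SparseLoader:
    """Keeps only an index of (file, row) pairs and reads items on demand."""
    def __init__(self, pattern="/my/data/folder/features/*.hdf5"):
        self.index = []                          # one (path, row) entry per sample
        for path in sorted(glob.glob(pattern)):
            with h5py.File(path, "r") as f:
                self.index += [(path, i) for i in range(len(f["labels"]))]

    def __getitem__(self, idx):
        path, row = self.index[idx]
        with h5py.File(path, "r") as f:          # only the requested row is read from disk
            return f["features"][row], f["labels"][row]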
d) Computing output probabilities

Details:
Use softmax?
Use conditional probabilities
A short sketch of both options is given below.
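A small sketch of the two options, assuming raw per-frame class scores of shape (num_frames, num_classes). The reading of "conditional probabilities" as posteriors divided by class priors follows common practice in weakly supervised video learning and may differ from the exact variant used in the baseline.

import numpy as np

def softmax(scores):
    """Row-wise softmax over class scores, shape (num_frames, num_classes)."""
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

def conditional_probs(posteriors, class_priors, eps=1e-8):
    """Divide frame-wise posteriors p(class | frame) by the class priors,
    yielding a score proportional to p(frame | class)."""
    return posteriors / (class_priors[None, :] + eps)

# Example usage with placeholder scores for 513 classes:
raw = np.random.randn(100, 513)
post = softmax(raw)
priors = np.full(513, 1.0 / 513)       # e.g. estimated from the mined labels
cond = conditional_probs(post, priors)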
e) Testing

For testing, please run the provided testing routine.

We measure accuracy as intersection over union (Jaccard IoU): we first compute the framewise IoU for all classes and take the mean over all classes present in a video as the IoU of that video. The final score is the mean over all video IoUs (see the sketch below). The testing routine provided here is run "as is" on the evaluation server.

Details: Some remarks about the testing routine: Annotation by natural language can be very inconsistent and even contradictory. For example, we have three different classes "whisk_egg", "beat_egg", and "mix_egg", which obviously all refer to the same action, and we have other classes such as "add_pepper" which can refer to bell pepper as well as to ground pepper powder. It is therefore difficult to assess the classification by just comparing the max-score label to the annotated one, as nobody knows whether the annotator was more of a "whisk_egg", "beat_egg", or "mix_egg" type of person. We therefore decided to use the task of video alignment to test the quality of your classifier. Alignment means that the transcripts of the actions (i.e. the action labels in the right order) are already given and the task is to find the right boundaries for the given actions in the video. We know from previous work on weak learning for video sequences (see e.g. https://ieeexplore.ieee.org/document/8585084, https://arxiv.org/abs/1610.02237) that this task is usually a good surrogate for the overall classification accuracy. In this case it also helps to avoid language inconsistencies, as it aligns the output to the correct action labels only and ignores the rest. It is therefore not so important which score was given to "mix_egg" or "beat_egg", as only the scores of the class "whisk_egg" would be considered (if this was the annotation).
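For reference, a compact sketch of the scoring described above; the official numbers are produced by the evaluation server's own routine.

import numpy as np

def video_iou(gt_labels, pred_labels):
    """Frame-wise Jaccard IoU per class, averaged over the classes that occur
    in the ground truth of this video."""
    gt_labels = np.asarray(gt_labels)
    pred_labels = np.asarray(pred_labels)
    ious = []
    for c in np.unique(gt_labels):                 # only classes present in the video
        gt_c = gt_labels == c
        pr_c = pred_labels == c
        union = np.logical_or(gt_c, pr_c).sum()
        inter = np.logical_and(gt_c, pr_c).sum()
        ious.append(inter / union if union > 0 else 0.0)
    return float(np.mean(ious))

def final_score(all_gt, all_pred):
    """Mean over the per-video IoUs."""
    return float(np.mean([video_iou(g, p) for g, p in zip(all_gt, all_pred)]))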
f) Prepare challenge submission

Prepare data:

Run code:
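As a sanity check, here is a minimal sketch of what a submission amounts to, following the format described under Challenge details below. The file names and probability values are placeholders; in practice, the files produced by mp_save_probs_webvisionTest.py are already in the expected format.

import zipfile
import numpy as np

NUM_CLASSES = 513   # see the class mapping under Challenge details

with zipfile.ZipFile("submission.zip", "w") as zf:
    for i in range(50):
        num_frames = 1000                                   # placeholder: frames of test video i
        probs = np.random.rand(num_frames, NUM_CLASSES)     # placeholder for your model output
        fname = f"file_{i:04d}.npy"                         # use the original file names here
        np.save(fname, probs.astype(np.float32))            # shape (num_frames, num_classes)
        zf.write(fname)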
Challenge details

We run two tracks in this challenge. For both tracks, you need to submit a .zip file with 50 numpy files with the same filenames as the original video or image files (file_0000.npy, file_0001.npy, ... etc.). Each numpy array needs to have the shape (num_frames, num_classes) with num_classes = 513. You can find the mapping for the classes in the file <checkout_path>/webvision-2020-public/videolearning/src/mapping_without_meta.py. Files generated by the function mp_save_probs_webvisionTest.py (see Getting started, d)) are already in the right format.

Video track

Submission: CodaLab Webvision - Video track

For the video track you are allowed to use any video/image data provided in the dataset. This can be the full videos or the video clips. You can train any CNN architecture and make use of the provided subtitles.

Feature track

Submission: CodaLab Webvision - Video track - FEATURES ONLY!

For the feature track, you are only allowed to use the features provided under Data 1c) as well as any subtitle information of the clips or the full videos.

For both tracks, you are allowed to use additional text data and knowledge sources that are publicly available. For details, please check out the General rules and the FAQ section.

General rules

Training data: You are allowed to use the yes/no validation data listed in the 'val_yes_no.txt' file for validation and/or training. It is only a few clips per class, so the assumption is that it will not get you all the way, but any new ideas are welcome.

Subtitles: You are only allowed to use the original subtitles or the generated labels from the baseline. Please do not download new subtitles, as they can change over time and we would then no longer be able to compare your method to others.

Validation and testing: You can use the test set of the original dataset as a validation set. It is not allowed to include data from the test set as additional training data! As a rule of thumb, please keep everything reproducible!

Frequently Asked Questions
Can I crawl text data according to MiningYouTube concepts by myself, and use it as training data?
Contact
If you have any questions, please drop an email to webvisionworkshop AT gmail.com or kuehne AT ibm.com