Building the best video annotation tool for computer vision.
Annotating video to build datasets for computer vision differs from annotating images in a number of critical respects. In the past, data scientists were forced to break video up, frame by frame, into images in order to use their existing image annotation tools. Today an emerging class of video annotation tools is making it far easier to create video datasets.
Treating video as a first-class citizen requires a video labeling tool built from the start around two fundamental aspects of video annotation: 1) videos need to be treated as n-dimensional tensors that can be accessed randomly; and 2) the compute, memory, and storage demands of video datasets are an order of magnitude higher than those of image datasets.
In this article we explain some of the unique design choices we made to make Lodestar an optimal video annotation platform.
Random access of video pixel tensors
Images and video, when used in a dataset, are described as tensors. A tensor is a mathematical data structure: a way to organize data, together with a set of algebraic rules that define operations on it. A number, say a floating point value, is a 0-dimensional tensor. A list of numbers, say the output values of a neural network, is a 1-dimensional tensor. A grayscale image is often represented as a 2-dimensional tensor. A color image is often represented as a 3-dimensional tensor. An ordered sequence of images, such as a video (ordered by time) or CT scan imagery (ordered by depth), can be represented as a 4-dimensional tensor.
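To make the dimensionality concrete, here is a minimal sketch in plain Python, using nested lists in place of a real tensor library; the values and shapes are purely illustrative:

```python
def shape(t):
    """Recursively derive the shape of a nested-list 'tensor'."""
    s = []
    while isinstance(t, list):
        s.append(len(t))
        t = t[0]
    return tuple(s)

scalar = 0.5                      # 0-D: a single float
logits = [0.1, 0.7, 0.2]          # 1-D: e.g. network output values

gray = [[0, 128, 255],            # 2-D: (height, width)
        [64, 32, 16]]

rgb = [[[0, 0, 0]] * 3] * 2       # 3-D: (height, width, channels)
video = [rgb] * 4                 # 4-D: (time, height, width, channels)
```

The video here is a tiny 4-frame stand-in; a real one would have height and width in the thousands and time in the tens of thousands of frames.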
In video datasets, accessing 4-dimensional data randomly is a new frontier in computer vision, and we focused intensely on building this capability into Lodestar. The productivity gain for data scientists over tools designed primarily for static images can be dramatic. As an example, random on-demand tensor access allowed an in-house data scientist to create synthetic tasks for self-supervised learning in an hour. Without it, the work would have taken weeks, with most of the engineering time spent building tensor accessors rather than actually creating the self-supervised task.
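As one illustration of the kind of synthetic task random access enables, the sketch below samples a short run of consecutive frame indices from a hypothetical project and shuffles it, the raw material of a frame-ordering self-supervision task. The file names and frame counts are invented, not part of any real dataset:

```python
import random

# Hypothetical per-video frame counts in a project
frame_counts = {"drone_a.mp4": 108_000, "drone_b.mp4": 54_000}

def sample_clip(rng, clip_len=4):
    """Pick a random video and a random run of consecutive frames,
    then shuffle it; a model can be trained to recover the order."""
    video = rng.choice(sorted(frame_counts))
    start = rng.randrange(frame_counts[video] - clip_len)
    frames = list(range(start, start + clip_len))
    shuffled = rng.sample(frames, k=clip_len)
    return video, shuffled, frames

rng = random.Random(0)
video, shuffled, target = sample_clip(rng)
```

Each sample touches an arbitrary point in an arbitrary video, which is exactly the access pattern that breaks tools built around sequential decoding.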
Entity persistence and object tracking
Video’s time dimension presents opportunities and challenges for annotation. Understanding entity persistence, for example, is an opportunity to describe how an object behaves over time. Data scientists can use Lodestar’s tools to specify the need to see an object over “n” consecutive frames and use this filter to understand how fast objects are moving. They might also use it to debug or tune the dataset: objects that persist for 1-5 seconds may be treated differently than those that persist for 30 seconds.
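A persistence filter of the kind described above might look like the following sketch, where each track maps a hypothetical object id to the sorted list of frames it appears in:

```python
def longest_run(frames):
    """Length of the longest run of consecutive frame indices."""
    best = run = 1
    for a, b in zip(frames, frames[1:]):
        run = run + 1 if b == a + 1 else 1
        best = max(best, run)
    return best

def persistent_tracks(tracks, n):
    """Keep only objects visible for at least n consecutive frames."""
    return {oid for oid, frames in tracks.items()
            if longest_run(frames) >= n}

tracks = {"car_1": [10, 11, 12, 13, 14], "bird_7": [3, 9, 20]}
survivors = persistent_tracks(tracks, 5)
```

At 30 frames per second, requiring n = 150 would select only objects that persist for at least 5 seconds.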
Entity persistence also presents a challenge in dataset creation because the same object exists in multiple frames, with very little change between frames. Manually labeling every object in every frame of the video would be exorbitantly expensive and add little value to the dataset.
Conversely, video often contains long stretches of time with no relevant objects. Unlike image datasets, video is usually captured continuously regardless of the appearance of the desired objects. Again, manually reviewing every frame to find objects adds expense and time and no value to the dataset.
Without video-specific labeling tools, data scientists were forced to address these issues by creating an image from each frame of the video and curating the resulting image set, removing frames with no objects and making judgments about which image of an object in a time series would be best for training the model. This usually resulted in multiple rounds of adjusting the dataset and re-labeling.
Lodestar overcomes these challenges and allows you to work directly on the full video dataset, using a number of techniques to optimize in the time dimension.
Lodestar is a continuous-training, real-time active learning system. During labeling, we train a custom AI model, search the video for objects to label, and rank the frames containing them so that labelers review the ones that confuse the model most. This requires random access to different frames across different videos in a dataset, which is essential for rapid exploration when searching for relevant frames to annotate. As labelers add more confirmed labels to the dataset, the model is continually retrained, becoming more accurate and more effective at finding and ranking objects.
This is a straightforward solution to the need for curation in sparse video datasets, as the model is finding relevant objects and suggesting frames to label, ignoring frames with no objects.
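Lodestar's actual scoring isn't spelled out here, but uncertainty sampling is one common way an active learner ranks frames; this sketch surfaces the frames whose top prediction confidence sits nearest 0.5, i.e. the ones the model is least sure about. The frame ids and scores are illustrative:

```python
def rank_frames(confidences, k):
    """Return the k frame ids whose top-prediction confidence is
    closest to 0.5, i.e. the frames that most confuse the model."""
    return sorted(confidences, key=lambda f: abs(confidences[f] - 0.5))[:k]

# Hypothetical per-frame top-prediction confidences
scores = {"f1": 0.95, "f2": 0.52, "f3": 0.10}
queue = rank_frames(scores, 2)
```

Frames with confidently detected objects (or confidently empty frames) fall to the bottom of the queue, which is what lets the system skip the long object-free stretches.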
As noted above, entity persistence is a challenge for labeling video, and active learning (combined with predictive labeling) addresses the issue by examining whether the object, as defined by the model’s predicted localization, persists over a number of frames (the default is five but that is configurable). If the model doesn’t see the bounding box persist when it expects to, it loses confidence that it has found a legitimate object, and sets the rank for the frame higher so the human labelers can review and eliminate this confusion. Again, this technique obviates the need for data scientists to review a series of frames showing an object over time and manually choose one or more for labeling. The active learning system will auto-curate the frames and prioritize confusing sets for human intervention.
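One way to detect a persistence break is to compare each predicted bounding box against its predecessor with intersection-over-union (IoU); the sketch below flags the first frame where overlap drops below a threshold. The boxes and threshold are illustrative, not Lodestar's internals:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)

    def area(r):
        return (r[2] - r[0]) * (r[3] - r[1])

    return inter / (area(a) + area(b) - inter)

def persistence_break(boxes, min_iou=0.5):
    """Index of the first frame whose box no longer overlaps its
    predecessor enough; None if the track persists cleanly."""
    for i, (a, b) in enumerate(zip(boxes, boxes[1:]), start=1):
        if iou(a, b) < min_iou:
            return i
    return None

track = [(0, 0, 10, 10), (1, 0, 11, 10), (50, 50, 60, 60)]
```

A break like the one at the third box here would raise the frame's rank so a human can decide whether the object disappeared or the model lost it.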
Building on real-time active learning, the system will automatically localize objects the custom AI model finds in the ranked frames. This greatly accelerates labeling; as the model continues to improve, the labelers are making fewer and smaller adjustments to the bounding boxes and confirming whole frames in seconds. Features like confidence filtering make the work even quicker. The active learning system is queuing frames that most need labeling, and the continuous training system is ensuring that the system’s confidence scoring is always up to date. Again, this relies on the ability to randomly access the tensor to identify specific frames across multiple videos in the dataset.
Predictive labeling and entity persistence create a challenge specific to video that had to be addressed as well: machine-generated labels across an entire, long video can quickly run into the tens of millions. As anyone who has used a predictive learning system knows, an inference model that is not well-tuned to the current dataset will generate spurious bounding boxes. If we machine-generate labels across the entire video immediately after training an initial model, we'll quickly run into resource limitations. Since we continuously retrain our custom AI inference model, it's to our advantage to stay in step with the active learning system and the human labelers, and dynamically generate predictions ahead of their work on the ranked frames. In doing so, we not only produce better predictions on each frame the labelers work on, we make it possible to scale to tens of hours of video and tens or hundreds of millions of objects labeled and tracked on a single node.
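That "generate just ahead of the labelers" idea can be sketched as a lazy generator: inference runs only on the next few ranked frames rather than the whole video. The predict callable here is a stand-in for a real model:

```python
def predict_ahead(ranked_frames, predict, lookahead=50):
    """Lazily run inference only on the frames labelers will see
    next, bounding the number of machine-generated labels in flight."""
    for frame in ranked_frames[:lookahead]:
        yield frame, predict(frame)

def dummy_predict(frame):
    """Stand-in for the real inference call."""
    return [("car", 0.9)]

pending = list(predict_ahead([7, 3, 12], dummy_predict, lookahead=2))
```

Because predictions are generated from the latest retrained model at the moment they are needed, stale boxes from an earlier, weaker model never accumulate.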
Video annotation at scale
Video is not merely a set of images, but it's useful to think about the scale involved in annotating video datasets. One hour of 1920x1080 video at just 30 frames per second comprises 108,000 frames and 223,948,800,000 pixels. Providing random access to any pixel in that video is no trivial task. And we didn't stop there: a Lodestar project can contain over ten hours of video.
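The arithmetic behind those numbers is worth spelling out:

```python
width, height = 1920, 1080
fps, seconds = 30, 3600

frames = fps * seconds            # frames in one hour of video
pixels = frames * width * height  # total pixels to index
```

At ten hours per project, the pixel count crosses two trillion, which is why the storage and access design below matters.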
Working with hours of video requires some heavy lifting on our part so you don't have to do it yourself. Most systems limit you to a few minutes, or perhaps an hour, of video because they haven't done the work to make labeling hours of video practical; that means you have to spend hours pre-processing your video before you even load it into the platform. We wanted our customers to simply drag and drop hours of video into a project and get to work, letting the system handle navigation, performance, and curation of the frames that need labeling.
Ingestion, storage, and random access to the pixels
First, we store video as video. We don't require you to break it up into images, nor do we do that on the back end. Storing video as a first-class data type saves storage space, but more importantly, your video is always the single source of truth. We extract frames on the fly for labelers to work on, but all metadata is associated frame by frame with the video, not with images created from the video. When video is ingested, we automatically process it so we know exactly how many frames it has, and store the metadata for each video in a database. This provides random access to every pixel in every frame with constant access times regardless of the length of the video.
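A minimal sketch of such a per-video metadata store, using SQLite with an invented schema (Lodestar's actual storage layer isn't described here):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE videos (
    name TEXT PRIMARY KEY,
    fps REAL,
    frame_count INTEGER)""")
db.execute("INSERT INTO videos VALUES ('inspection.mp4', 30.0, 108000)")

# A single indexed lookup: constant time regardless of video length
fps, frame_count = db.execute(
    "SELECT fps, frame_count FROM videos WHERE name = ?",
    ("inspection.mp4",)).fetchone()
```

Because the lookup is keyed rather than scanned, a ten-hour video costs no more to address than a ten-second one.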
We also automatically normalize the video in any given project. By default we set the resolution of the project to that of the first video ingested, and subsequent videos are transcoded to match. Other than that, we don't alter the pixels in the video. We recognize that there are optimizations that could improve a video's value for training a model, but those are best left to data scientists to experiment with using any of the many open source and commercial tools available. So we focused on making it easy to run those experiments. You can make modified copies of your video, to extract a single color channel or change the contrast, for example, and copy all of your existing annotations over to the new project. Since we keep frame-by-frame metadata references, everything stays right where it belongs in the new project.
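Because annotations reference (video, frame index) pairs rather than derived image files, copying them onto a transformed copy of a video is a pure metadata operation. A sketch, with invented structures:

```python
annotations = [
    {"video": "src.mp4", "frame": 42, "box": (10, 10, 50, 50), "label": "car"},
    {"video": "other.mp4", "frame": 7, "box": (0, 0, 5, 5), "label": "bird"},
]

def copy_annotations(annotations, old, new):
    """Re-point one video's annotations at a transformed copy;
    frame indices and boxes carry over unchanged."""
    return [dict(a, video=new) for a in annotations if a["video"] == old]

copied = copy_annotations(annotations, "src.mp4", "src_gray.mp4")
```

Nothing about the boxes changes because the transformed video has the same frame grid as the original.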
Natural video UI with random frame access
During processing we set and store in the database keyframe-style markers at regular intervals (125 frames by default but configurable). This helps power a video-native user experience.
Anyone who’s watched a YouTube video will find the Lodestar navigational UI familiar. You can play the video, fast forward or rewind, skip ahead or back, and randomly access any frame in the video fluidly. Behind the scenes, because the video is stored natively with stored metadata, we can stream the video to the UI, and randomly access any frame by referring to the nearest key frame and counting the offset to the frame the user is accessing. This gives us millisecond access to any frame in the video.
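Keyframe-plus-offset addressing reduces to simple integer arithmetic; a sketch, using the 125-frame default interval mentioned above:

```python
KEYFRAME_INTERVAL = 125  # default marker spacing, configurable

def locate(frame):
    """Map an absolute frame index to the nearest earlier keyframe
    and the offset to decode forward from it."""
    key = (frame // KEYFRAME_INTERVAL) * KEYFRAME_INTERVAL
    return key, frame - key
```

The decoder never has to read more than one interval's worth of frames to reach any target, no matter how long the video is.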
Scaling to millions of labels per project
Creating a system that can handle many hours of video was a challenge, but one we know was important to achieve. Real-world business problems don’t come in neat little packages. Drones will fly for hours inspecting infrastructure or crops. Cars will drive for hours collecting street scenes. Manufacturing floor and supermarket cameras will capture hours of relevant data for training models. We wanted our customers to take that video, drop it into a project, and start solving their computer vision problems. We didn’t want you to have to spend days or weeks messing around with the video before you started to get value from it.
One of our first challenges in creating a system that could scale across multiple nodes was to build our own tools for transferring tensors between the two relevant environments: C++ for the video processing and Python for the data science. For image-only systems these tools already exist; the Python bindings to C++ are already there. For video, we had to write our own.
We also had to develop a method to deal with the density of objects in many types of video, such as satellite footage. If we were requiring users to create image sets from their video, we could have further asked them to cut each image up into smaller blocks that were easier to annotate. Instead, we developed a sliding window technique that lets labelers work on virtual sub-frames, so they can more easily label frames containing thousands of objects without any additional pre-processing. Other touches in the UI, like easy zoom in and out, make quick work of annotating dense video frames.
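A sliding-window scheme like the one described can be sketched as a tiling generator. The window size and overlap below are illustrative defaults, with the overlap ensuring that objects on tile borders appear whole in at least one window:

```python
def windows(width, height, win=512, overlap=64):
    """Yield (x, y, w, h) virtual sub-frames covering a full frame."""
    step = win - overlap
    for y in range(0, height, step):
        for x in range(0, width, step):
            yield x, y, min(win, width - x), min(win, height - y)

tiles = list(windows(1920, 1080))
```

A 1920x1080 frame tiles into a 5x3 grid of sub-frames under these defaults, each small enough to annotate comfortably.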
Using inference-on-demand and a database-backed metadata store allows us to dramatically increase the number of predicted objects in a video. We can scale to tens of millions of objects on a single node, and access time to any of those objects remains constant regardless of the length of video in a project.
Lodestar is video-first
All of this is just the start, but it’s an important foundation for a video-first platform. We believe that video will become the primary source and destination for computer vision applications, and our platform is focused on enabling data scientists and their annotation teams to work more easily and productively with higher-dimensional tensors and video data. We have a lot of exciting features planned to help make the most of video and other 4+ dimensional data, and we’re looking forward to seeing some amazing applications from our customers.