A self-driving vehicle has to detect objects, track them over time, and predict where they will be in the future in order to plan a safe maneuver. These tasks are typically handled by models trained independently of one another, which could lead to disaster should any one task fail.
Researchers at the University of Toronto’s department of computer science and Uber’s Advanced Technologies Group (ATG) in Toronto have developed an algorithm that jointly reasons about all these tasks—the first algorithm to bring them all together. Importantly, their solution takes as little as 30 milliseconds per frame.
We try to optimize as a whole so we can correct mistakes between each of the tasks themselves. When done jointly, uncertainty can be propagated and computation shared.—Wenjie Luo, U Toronto PhD student in computer science
Luo and Bin Yang, a PhD student in computer science, along with their graduate supervisor, Raquel Urtasun, an associate professor of computer science and head of Uber ATG Toronto, presented their paper, “Fast and Furious: Real Time End-to-End 3D Detection, Tracking and Motion Forecasting with a Single Convolutional Net”, at the Computer Vision and Pattern Recognition (CVPR) conference in Salt Lake City last week.
The Fast and Furious network takes multiple frames as input and performs detection, tracking and motion forecasting.
To start, Uber collected a large-scale dataset of several North American cities using roof-mounted lidar scanners. The dataset includes more than a million frames, collected from 6,500 different scenes.
Urtasun says the output of the lidar is a point cloud in three-dimensional space that needs to be understood by an artificial intelligence (AI) system. This data is unstructured, and thus differs considerably from the structured data typically fed into AI systems, such as images.
If the task is detecting objects, you can try to detect objects everywhere but there’s too much free space, so a lot of computation is done for nothing. In bird’s eye view, the objects we try to recognize sit on the ground and thus it’s very efficient to reason about where things are.—Raquel Urtasun
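As a rough illustration of the bird's eye view idea Urtasun describes, the sketch below flattens a lidar point cloud onto a 2-D ground-plane grid and marks which cells contain at least one point. It is a minimal example and not the researchers' code: the function name `points_to_bev`, the region bounds and the cell size are all assumptions chosen for illustration.

```python
import numpy as np

def points_to_bev(points, x_range=(0.0, 8.0), y_range=(-4.0, 4.0), cell=1.0):
    """Rasterize an (N, 3) array of lidar points into a 2-D occupancy grid.

    Hypothetical helper: the ranges (in metres) and cell size are
    illustrative assumptions, not values from the paper.
    """
    points = np.asarray(points, dtype=float)
    n_rows = int(round((x_range[1] - x_range[0]) / cell))
    n_cols = int(round((y_range[1] - y_range[0]) / cell))
    grid = np.zeros((n_rows, n_cols), dtype=np.uint8)

    # Keep only points that fall inside the region of interest.
    x, y = points[:, 0], points[:, 1]
    keep = (x >= x_range[0]) & (x < x_range[1]) & (y >= y_range[0]) & (y < y_range[1])

    # Convert metric coordinates to integer cell indices (height is ignored here).
    rows = ((x[keep] - x_range[0]) / cell).astype(int)
    cols = ((y[keep] - y_range[0]) / cell).astype(int)

    # Guard against floating-point rounding at the upper edge.
    rows = np.minimum(rows, n_rows - 1)
    cols = np.minimum(cols, n_cols - 1)

    grid[rows, cols] = 1  # mark occupied cells
    return grid
```

Because every point collapses onto the ground plane, the result is an ordinary image-like grid, which is the kind of structured input standard convolutional networks handle well.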
To deal with large amounts of unstructured data, PhD student Shenlong Wang and researchers from Uber ATG developed a special AI tool.
A picture is a 2-D grid. A 3-D model is a bunch of 3-D meshes. But here, what we capture [with lidar] is just a bunch of points, and they are scattered in that space, which for traditional AI is very difficult to deal with.—Bin Yang
Images are rectangular grids made up of tiny pixels, so algorithms designed for grid-like structures work well on them. Lidar data, by contrast, has no regular structure, which makes it difficult for AI systems to learn from.
Their results on processing scattered points directly are not limited to self-driving; they apply to any domain with unstructured data, including chemistry and social networks.
Urtasun’s lab presented nine papers at CVPR. Mengye Ren, a PhD student in computer science; Andrei Pokrovsky, a staff software engineer at Uber ATG; Yang; and Urtasun also sought faster computation and developed SBNet: Sparse Blocks Network for Fast Inference.
We want the network to be as fast as possible so that it can detect and make decisions in real time, based on the current situation. For example, humans look at certain regions we feel are important to perceive, so we apply this to self-driving.—Mengye Ren
To speed up the whole computation, says Ren, they devised a sparse computation that concentrates on the regions deemed important. As a result, their algorithm proved up to 10 times faster than existing methods.
The researchers released the SBNet code because it is broadly useful for speeding up processing on small devices, including smartphones.
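The idea of computing only on important regions can be illustrated with a small sketch. The example below splits a 2-D feature map into square tiles, uses a mask to find the tiles that contain anything of interest, and runs an operation on those tiles alone. This is not the released SBNet code or its API: the names `active_blocks` and `sparse_apply` are hypothetical, and the sketch assumes the map dimensions are exact multiples of the tile size.

```python
import numpy as np

def active_blocks(mask, block):
    """Return (row, col) indices of tiles that contain any nonzero mask value.

    Assumes mask height and width are exact multiples of `block`.
    """
    h, w = mask.shape
    tiles = mask.reshape(h // block, block, w // block, block)
    return np.argwhere(tiles.any(axis=(1, 3)))

def sparse_apply(feature, mask, block, fn):
    """Apply `fn` only to active tiles of `feature`; inactive tiles stay zero."""
    out = np.zeros_like(feature)
    for r, c in active_blocks(mask, block):
        rs, cs = r * block, c * block
        # Gather the tile, compute on it, and scatter the result back.
        out[rs:rs + block, cs:cs + block] = fn(feature[rs:rs + block, cs:cs + block])
    return out
```

When only a small fraction of tiles is active, most of the map is never touched, which is where the savings come from. A real convolution would also need neighboring pixels at tile borders, which this sketch ignores.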
Urtasun says the overall impact of her group’s research has increased significantly now that their algorithms are implemented in Uber’s self-driving fleet rather than residing solely in academic papers.