Augmented Reality stands to be one of the most significant technology developments in recent memory, combining the physical world with the digital one. Currently, the most popular AR experiences are delivered through smartphones. While these experiences are rendered seamlessly on a smartphone's display, several layers of technology work together to let a camera-based AR system function.

For example, localization and mapping frameworks allow your smartphone to determine its precise position relative to other objects in the real world. The phone's cameras, in combination with data from its motion sensors, allow the AR software stack to identify key elements of an environment. The motion sensors (gyroscope and accelerometer) track the device's movement through physical space as position data (along the x, y, and z axes) together with orientation (pitch, yaw, and roll), known collectively as Six Degrees of Freedom (6DoF).

A real-world example is NASA's Mars Curiosity rover, which relies on a Visual Inertial Odometry (VIO) system built from the same camera and gyroscope/accelerometer combination; this hardware is essentially no different from what most modern smartphones contain. Together, these sensors determine the device's position within a 3D space, in which digital content can then be placed on top of the device's camera feed. The map generated from this 3D space is known as a "point cloud": essentially a collection of the distinguishing feature "points" the system has identified in space. Besides enabling real-time tracking of objects and animations, the point cloud also allows a device to resume tracking almost immediately if the user drops the phone or moves it too quickly.
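As a rough illustration of the point-cloud idea, the sketch below stores the map as a plain array of 3D points and reduces relocalization to a nearest-neighbour distance check. The function name and tolerance are invented for the example; real VIO pipelines match visual feature descriptors and solve for the full 6DoF pose.

```python
import numpy as np

# A point cloud is a collection of 3D feature points the tracker has mapped.
# Here it is an (N, 3) array; real systems also store a visual descriptor per point.
point_cloud = np.array([
    [0.10, 0.02, -0.50],
    [0.35, -0.10, -0.80],
    [-0.20, 0.15, -0.65],
])

def resume_tracking(observed_points: np.ndarray, cloud: np.ndarray, tol: float = 0.05) -> bool:
    """Toy relocalization check: do newly observed points line up with the stored map?

    This only checks whether every observed point lies near some mapped point;
    it stands in for the descriptor matching a real system would perform.
    """
    for p in observed_points:
        distances = np.linalg.norm(cloud - p, axis=1)
        if distances.min() > tol:
            return False
    return True

# After the phone is dropped, freshly detected feature points are compared to the map
observed = point_cloud + np.random.normal(scale=0.01, size=point_cloud.shape)
print(resume_tracking(observed, point_cloud))  # likely True: tracking can resume
```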

AR software stacks can also use Convolutional Neural Networks (CNNs) to perform a set of tasks known as image segmentation. This process involves analyzing images semantically: segmenting individual objects and separating the foreground from the background of a scene. It is accomplished by a neural network "trained" to recognize patterns across a set of images containing the same kind of object. For example, after being trained on many unique images of cats, the network could draw a bounding box around a cat in a new scene. From there, the network can recognize patterns that allow it to perform object detection and classification. These processes work together to enable AR experiences such as context-based scenes and occlusion. Serving AR scenes based on context allows for targeted experiences when a user is in a certain location or is viewing a predefined physical object with their smartphone. Occlusion allows for lifelike displays in which digital objects can "hide" behind or in front of physical objects in the real world.
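As one way this could look in practice, the sketch below runs a single camera frame through a pretrained semantic segmentation network (torchvision's DeepLabV3 is used purely as a stand-in; the article does not name a specific model) and turns the per-pixel "cat" predictions into a mask that a renderer could use for occlusion. The file name and class index are assumptions, and the snippet presumes a recent torchvision release.

```python
import torch
from torchvision import transforms
from torchvision.models.segmentation import deeplabv3_resnet50
from PIL import Image

# Load a pretrained semantic segmentation network (stand-in for whatever a real AR stack uses)
model = deeplabv3_resnet50(weights="DEFAULT").eval()

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

frame = Image.open("camera_frame.jpg").convert("RGB")  # hypothetical single frame from the camera feed
batch = preprocess(frame).unsqueeze(0)

with torch.no_grad():
    scores = model(batch)["out"][0]   # (num_classes, H, W) per-pixel class scores
labels = scores.argmax(0)             # per-pixel class labels

CAT_CLASS = 8                         # "cat" in the VOC-style label set this model predicts
cat_mask = (labels == CAT_CLASS)      # True where the network sees a cat

# An AR renderer could use cat_mask for occlusion: pixels inside the mask belong to
# the physical cat, so virtual objects placed "behind" it are simply not drawn there.
print(f"cat pixels in frame: {int(cat_mask.sum())}")
```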

Combining mobile hardware with computer vision creates richer mobile AR experiences, and further advancements in these fields will continue to open up new use cases for mixed reality at large.

For further reading on Medium: "Reality Check: The marvel of computer vision technology in today's camera-based AR systems" by Alex Chuang.