Make sure to check out this post on neural module networks for context!
Questions that need to be broken down into subproblems are difficult for a vanilla neural network to answer. Researchers are exploring modularity, the practice of making rigid neural networks more flexible so they can solve more complex problems. This post explores how the BAIR Lab, Google, and DeepMind have approached this problem, and the possibilities for recreating some of their work.
BAIR Lab approach
As shown in the blog, a basic neural network trained to identify images has trouble identifying the color of the green cylinder when asked, "What color is the thing with the same size as the blue cylinder?" To answer this question, you must first compare the size of the blue cylinder with the sizes of all the other objects, then identify the color of the object that matches in size. As Andreas describes, solving these kinds of problems requires a model that can dynamically determine and solve the subproblems within the main question. The blog goes on to describe how neural module networks can chain blueprints together to tackle complex tasks. When two or more questions share a substep, the corresponding sections of the network architectures can share weights. For example, counting the number of objects the same size as the sphere and finding the color of the object the same size as the cylinder both require the model to compare sizes, so the compare-sizes parts of both models can share weights. The goal is to chain together many different neural modules to answer different questions without having to train a massive neural network for each specific task (e.g., identifying dogs).
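The weight-sharing idea above can be sketched in a few lines. This is a minimal numpy toy, not the BAIR Lab's actual architecture: the `Module` class, the dimensions, and the `classify_color`/`count_objects` heads are all made up for illustration. The point is that one `compare_size` module instance is reused in two different layouts, so both questions train (and use) the same weights for that substep.

```python
import numpy as np

rng = np.random.default_rng(0)

class Module:
    """Toy stand-in for one neural module: a single linear layer + ReLU."""
    def __init__(self, in_dim, out_dim):
        self.W = rng.standard_normal((in_dim, out_dim)) * 0.1

    def __call__(self, x):
        return np.maximum(x @ self.W, 0.0)

# One shared module for the common substep, plus two task-specific heads.
compare_size = Module(8, 8)     # shared "compare sizes" step
classify_color = Module(8, 4)   # hypothetical 4-way color head
count_objects = Module(8, 1)    # hypothetical counting head

def answer_color(scene_features):
    # Layout: compare_size -> classify_color
    return classify_color(compare_size(scene_features))

def answer_count(scene_features):
    # Layout: compare_size -> count_objects (same shared instance/weights)
    return count_objects(compare_size(scene_features))

x = rng.standard_normal(8)
print(answer_color(x).shape, answer_count(x).shape)
```

Because both layouts route through the same `compare_size` object, a gradient update to its weights from either question would improve the shared substep for both.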
Google approach
Google has a massive amount of data at its disposal as it indexes much of the internet. In their paper, their approach to a more flexible network involves creating a large neural network that blends building blocks from different types of neural networks, such as convolutional layers (from CNNs) and attention mechanisms (common in sequence models). They also experiment with feeding the network training data from different domains: ImageNet, the COCO image-captioning set, and various speech and language corpora. Their goal was to create a monolithic model composed of what they call modality nets, sub-networks specific to a data domain (e.g., images are handled by an image modality net within the larger network). The modules Google's model builds are more general purpose than the BAIR Lab's neural modules, since tasks from the same domain (e.g., images) share a modality net, whereas the BAIR Lab's modules are more task specific (e.g., detect the color blue, detect the height of an object).
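The modality-net idea can be sketched as encoders that project each domain's input into one shared representation, processed by a single shared core. This is a minimal numpy sketch of the general shape, not Google's model: the dimensions, the linear encoders, and the `tanh` core are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 16  # width of the shared representation (made-up)

# Modality nets: project domain-specific inputs into the shared space.
W_img = rng.standard_normal((64, D)) * 0.1   # toy 8x8 "image" -> D
W_txt = rng.standard_normal((32, D)) * 0.1   # toy bag-of-words -> D

# Shared core: one body processes every modality's encoding.
W_core = rng.standard_normal((D, D)) * 0.1

def encode_image(img):
    """Image modality net: flatten pixels and project into the shared space."""
    return img.reshape(-1) @ W_img

def encode_text(bow):
    """Text modality net: project a bag-of-words vector into the shared space."""
    return bow @ W_txt

def shared_core(h):
    """Shared body applied to any modality's encoding."""
    return np.tanh(h @ W_core)

img_out = shared_core(encode_image(rng.standard_normal((8, 8))))
txt_out = shared_core(encode_text(rng.standard_normal(32)))
print(img_out.shape, txt_out.shape)  # both (16,)
```

Once every input lives in the same `D`-dimensional space, all tasks from one domain share its modality net, and every task shares the core.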
DeepMind approach
Just this month, DeepMind released state-of-the-art results using a neural network module to tackle complex questions. Their approach uses a Relation Network (RN), a specialized neural network module designed for relational reasoning: the ability to compute relations between objects is built into the structure of the RN architecture. They paired the RN with a convolutional neural network and trained it on Sort-of-CLEVR, a simplified version of the CLEVR dataset the BAIR Lab worked with, which contains 2D images of squares and circles in 6 different colors.
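The core of the Relation Network is the form RN(O) = f(sum over pairs of g(o_i, o_j)): a small function g is applied to every pair of objects, the results are summed, and a second function f maps the sum to an answer. Below is a minimal numpy sketch of that structure, with single linear layers standing in for the paper's MLPs; the object and weight dimensions are made up.

```python
import numpy as np

rng = np.random.default_rng(2)

def g_theta(oi, oj, Wg):
    """Pairwise relation function: concatenate the two objects, linear + ReLU."""
    return np.maximum(np.concatenate([oi, oj]) @ Wg, 0.0)

def relation_network(objects, Wg, Wf):
    """RN(O) = f_phi( sum over all object pairs of g_theta(o_i, o_j) )."""
    n = len(objects)
    total = sum(g_theta(objects[i], objects[j], Wg)
                for i in range(n) for j in range(n))
    return total @ Wf  # f_phi as a single linear map (sketch)

objs = rng.standard_normal((5, 4))        # 5 objects, 4 features each
Wg = rng.standard_normal((8, 16)) * 0.1   # g_theta weights (4+4 -> 16)
Wf = rng.standard_normal((16, 2)) * 0.1   # f_phi weights (16 -> 2-way answer)
out = relation_network(objs, Wg, Wf)
print(out.shape)  # (2,)
```

Because the pairwise outputs are summed, the result is invariant to the order of the objects, which is what makes relational reasoning a structural property of the module rather than something it has to learn.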
All are interesting approaches to advancing the types of questions neural networks can answer. The BAIR Lab approach seems the most feasible to recreate for someone without access to the processing power and variety of data that Google fortunately has. They also use an open dataset that you can download here.