Congratulations to Intel on their acquisition of Nervana. This photo is from the last board meeting at our offices; the Nervana founders — from right to left: Naveen Rao, Amir Khosrowshahi and Arjun Bansal — pondered where on the wall they may fall during M&A negotiations.

We are now free to share some of our perspectives on the company and its mission to accelerate the future with custom chips for deep learning.

I’ll share a recap of the Nervana story, from an investor’s perspective, and try to explain why machine learning is of fundamental importance to every business over time. In short, I think the application of iterative algorithms (e.g., machine learning, directed evolution, generative design) to build complex systems is the most powerful advance in engineering since the Scientific Method. Machine learning allows us to build software solutions that exceed human understanding, and shows us how AI can innervate every industry.

By crude analogy, Nervana is recapitulating the evolutionary history of the human brain within computing — moving from the logical constructs of the reptilian brain to the cortical constructs of the human brain, with massive arrays of distributed memory and iterative learning algorithms.

Not surprisingly, the founders integrated experiences in neuroscience, distributed computing, and networking — a delightful mélange for tackling cognitive computing. Ali Partovi, an advisor to Nervana, introduced us to the company.

We were impressed with the founding team and we had a prepared mind to share their enthusiasm for the future of deep learning. Part of that prepared mind dates back to 1989, when I started a PhD in EE focusing on how to accelerate neural networks by mapping them to parallel processing computers. Fast forward 25 years, and the nomenclature has shifted to machine learning and the deep learning subset, and I chose it as the top tech trend of 2013 at the Churchill Club VC debate (video). We were also seeing the powerful application of deep learning and directed evolution across our portfolio, from molecular design to image recognition to cancer research to autonomous driving.

All of these companies were deploying these simulated neural networks on traditional compute clusters. Some were realizing huge advantages by porting their code to GPUs; these specialized processors originally designed for rapid rendering of computer graphics have many more computational cores than a traditional CPU, a baby step toward a cortical architecture. I first saw them being used for cortical simulations in 2007. But by the time of Nervana’s founding in 2014, some (e.g., Microsoft’s and Google’s search teams) were exploring FPGA chips for their even finer-grained arrays of customizable logic blocks. Custom silicon that could scale beyond any of these approaches seemed like the natural next step. Here is a page from Nervana’s original business plan (Fig. 1 in comments below).

The march to specialized silicon, from CPU to GPU to FPGA to ASIC, had played out similarly for Bitcoin miners, with each step toward specialized silicon obsoleting the predecessors. When we spoke to Amazon, Google, Baidu, and Microsoft in our due diligence, we found a much broader application of deep learning within these companies than we could have imagined prior, from product positioning to supply chain management.

Machine learning is central to almost everything that Google does. And through that lens, their acquisitions and new product strategies make sense; they are not traditional product line extensions, but a process expansion of machine learning (more on that later). They are not just playing games of Go for the fun of it. Recently, Google switched their core search algorithms to deep learning, and they used DeepMind to cut data center cooling costs by a whopping 40%.

The advances in deep learning are domain independent. Google can hire and acquire talent and delight in their passionate pursuit of game playing or robotics. These efforts help Google build a better brain. The brain can learn many things. It is like a newborn human; it has the capacity to learn any of the languages of the world, but based on training exposure, it will only learn a few. Similarly, a synthetic neural network can learn many things.

Google can let the Brain team find cats on the Internet and play a great game of Go. The process advances they make in building a better brain (or in this case, a better learning machine) can then be turned to ad matching, a task that does not inspire the best and the brightest to come work for Google.

The domain independence of deep learning has profound implications on labor markets and business strategy. The locus of learning shifts from end products to the process of their creation. Artifact engineering becomes more like parenting than programming. But more on that later; back to the Nervana story.

Our investment thesis for the Series A revolved around some universal tenets: a great group of people pursuing a product vision unlike anything we had seen before. The semiconductor sector was not crowded with investor interest. AI was not yet on many venture firms’ lists of sectors of interest. We also shared with the team that we could envision secondary benefits from discovering the customers. Learning about the cutting edge of deep learning applications and the startups exploring the frontiers of the unknown held a certain appeal for me. And sure enough, there were patterns in customer interest, from an early flurry in medical imaging of all kinds to a recent explosion of interest in the automotive sector after Tesla’s Autopilot feature went live. The auto industry collectively rushed to catch up.

Soon after we led the Series A on August 8, 2014, I found myself moderating a deep learning panel at Stanford with Nervana CEO Naveen Rao.

I opened with an introduction to deep learning and why it has exploded in the past four years (video primer). I ended with some common patterns in the power and inscrutability of artifacts built with iterative algorithms. We see this in biology, cellular automata, genetic programming, machine learning and neural networks.

There is no mathematical shortcut for the decomposition of a neural network or genetic program, no way to “reverse evolve” with the ease that we can reverse engineer the artifacts of purposeful design.

The beauty of compounding iterative algorithms — evolution, fractals, organic growth, art — derives from their irreducibility. (More from my Google Tech Talk and MIT Tech Review)

Year 1. 2015
Nervana adds remarkable engineering talent, a key strategy of the first mover. One of the engineers figures out how to rework the undocumented firmware of NVIDIA GPUs so that they run deep learning algorithms faster than off-the-shelf GPUs or anything else Facebook could find. Matt Ocko preempted the second venture round of the company, and he brought the collective learning of the Data Collective to the board.

Year 2. 2016 Happy 2nd Birthday Nervana!
The company is heads down on chip development. They share some technical details (flexpoint arithmetic optimized for matrix multiplies and 32GB of stacked 3D memory on chip) that give them 55 trillion operations per second on their forthcoming chip, along with multiple high-speed interconnects (as typically seen in the networking industry) for ganging a matrix of chips together into unprecedented compute fabrics. 10x made manifest. See Fig. 2 below.

And then Intel came knocking.
With the most advanced production fab in the world and a healthy desire to regain the mantle of leading the future of Moore’s Law, the combination was hard to resist. Intel vice president Jason Waxman told Recode that the shift to artificial intelligence could dwarf the move to cloud computing. “I firmly believe this is not only the next wave but something that will dwarf the last wave.” But we had to put on our wizard hats to negotiate with giants.

The deep learning and AI sectors have heated up in labor markets to relatively unprecedented levels. Large companies have recently been paying $6–10 million per engineer for talent acquisitions, and $4–5M per head for pre-product startups still in academia. The master’s students in a certain Stanford lab averaged $500K/yr for their first job offer at graduation. We witnessed an academic turn down a million dollar signing bonus because they got a better offer.

Why so hot?
The deep learning techniques, while relatively easy to learn, are quite foreign to traditional engineering modalities. It takes a different mindset and a relaxation of the presumption of control. The practitioners are like magi, sequestered from the rest of a typical engineering process. The artifacts of their creation are isolated blocks of functionality defined by their interfaces. They are like blocks of magic handed to other parts of a traditional organization. (This carries over to the customers too; just about any product that you experience in the next five years that seems like magic will almost certainly be built by these algorithms).

And remember that these “brain builders” could join any industry. They can ply their trade in any domain. When we were building the deep learning team at Human Longevity Inc. (HLI), we hired the engineering lead from Google’s Translate team. Franz Och pioneered Google’s better-than-human translation service not by studying linguistics, grammar, or even speaking the languages being translated. He focused on building the brain that could learn the job from countless documents already translated by humans (UN transcripts in particular). When he came to HLI, he cared about the mission, but knew nothing about cancer and the genome. The learning machines can find the complex patterns across the genome. In short, the deep learning expertise is fungible, and a burgeoning number of companies are hiring and competing across industry lines.

And it is an ever-widening set of industries undergoing transformation, from automotive to agriculture, healthcare to financial services. We saw this explosion in the Nervana customer pipeline. And we see it across the DFJ portfolio, especially in our newer investments. Here are some examples:

• Learning chemistry and drug discovery: Here is a visualization of the search space of candidates for a treatment for Ebola; it generated the lead molecule for animal trials. Atomwise summarizes: “When we examine different neurons on the network we see something new: AtomNet has learned to recognize essential chemical groups like hydrogen bonding, aromaticity, and single-bonded carbons. Critically, no human ever taught AtomNet the building blocks of organic chemistry. AtomNet discovered them itself by studying vast quantities of target and ligand data. The patterns it independently observed are so foundational that medicinal chemists often think about them, and they are studied in academic courses. Put simply, AtomNet is teaching itself college chemistry.”

• Designing new microbial life for better materials: Zymergen uses machine learning to predict the combination of genetic modifications that will optimize product yield for their customers. They are amassing one of the largest data sets about microbial design and performance, which enables them to train machine learning algorithms that make search predictions with increasing precision. Genomatica had great success in pathway optimization using directed evolution, a physical variant of an iterative optimization algorithm.

• Discovery and change detection in satellite imagery: Planet and Mapbox. Planet is now producing so much imagery that humans can’t actually look at each picture it takes. Soon, they will image every meter of the Earth every day. From a few training examples, a convolutional neural net can find similar examples globally — like all new housing starts, all depleted reservoirs, all current deforestation, or car counts for all retail parking lots.

• Automated driving & robotics: Tesla, Zoox, SpaceX, Rethink Robotics, etc.

• Visual classification: From e-commerce to drones to security cameras and more. Imagen is using deep learning to radically improve medical image analysis, starting with radiology.

• Cybersecurity: When protecting endpoint computing & IOT devices from the most advanced cyberthreats, AI-powered Cylance is proving to be a far superior and adaptive approach versus older signature-based antivirus solutions.

• Financial risk assessment: Avant and Prosper use machine learning to improve credit verification and merge traditional and non-traditional data sources during the underwriting process.

• And now for something completely different: quantum computing. For a wormhole peek into the near future, our quantum computing company, D-Wave Systems, powered a 100,000,000x speedup in a demonstration benchmark for Google, a company that has used D-Wave quantum computers for over a decade now on machine learning applications.

So where will this take us?
Neural networks had their early success in speech recognition in the 90’s. In 2012, the deep learning variant dominated the ImageNet competitions, and visual processing can now be better done by machine than human in many domains (like pathology, radiology and other medical image classification tasks). DARPA has research programs to do better than a dog’s nose in olfaction.

We are starting the development of our artificial brains in the sensory cortex, much like an infant coming into the world. Even within these systems, like vision, the deep learning network starts with similar low level constructs (like edge-detection) as foundations for higher level constructs like facial forms, and ultimately, finding cats on the internet with self-taught learning.

But the artificial brains need not limit themselves to the human senses. With the internet of things, we are creating a sensory nervous system on the planet, with countless sensors and data collection proliferating across the planet. All of this “big data” would be a big headache but for machine learning to find patterns in it all and make it actionable. So, not only are we transcending human intelligence with multitudes of dedicated intelligences, we are transcending our sensory perception.

And it need not stop there. It is precisely by these iterative algorithms that human intelligence arose from primitive antecedents. While biological evolution was slow, it provides an existence proof of the process, now vastly accelerated in the artificial domain. It shifts the debate from the realm of the possible to the likely timeline ahead.

Let me end with the closing chapter in Danny Hillis’ CS book The Pattern on the Stone: “We will not engineer an artificial intelligence; rather we will set up the right conditions under which an intelligence can emerge. The greatest achievement of our technology may well be creation of tools that allow us to go beyond engineering — that allow us to create more than we can understand.”

—–
Here is some early press:
Xconomy (most in-depth), MIT Tech Review, Re/Code, Forbes, WSJ, Fortune.

10 responses to “Intelligence Inside: the future of AI and computation”

  1. Fig 1. From the original business plan…

    Fig 2. Mad scribbles…

    Board meetings got especially playful in the Playground =)

    Moore’s Law over 120 years, from CPU to GPU to….

    A face of the box…

    Calling all magi. The last negotiation call with the board… Photo by Nervana

  2. "The greatest achievement of our technology may well be creation of tools that allow us to go beyond engineering — that allow us to create more than we can understand.” — Danny Hillis


  3. Love that Calvin & Hillis! I ended my blog post with that Hillis quote.

  4. Meanwhile, a Hooli moment… and Intel actually bought two DFJ learning chip investments at the same time. The second, Movidius, was finally announced after some delays

  5. A fascinating summary of the tie between the power of neural networks and the peculiar physics of our universe. "Evolution has somehow settled on a brain structure that is ideally suited to teasing apart the complexity of the universe." http://www.technologyreview.com/s/602344/the-extraordinary-link-...

  6. Congrats to Nervana CEO Naveen Rao on his promotion to run all of AI at Intel. news

  7. P.S. a nice summary from the original Lin & Tegmark paper: "The exceptional simplicity of physics-based functions hinges on properties such as symmetry, locality, compositionality and polynomial log-probability, and we explore how these properties translate into exceptionally simple neural networks approximating both natural phenomena such as images and abstract representations thereof such as drawings. We further argue that when the statistical process generating the data is of a certain hierarchical form prevalent in physics and machine-learning, a deep neural network can be more efficient than a shallow one. Various “no-flattening theorems” show when these efficient deep networks cannot be accurately approximated by shallow ones without efficiency loss"

    This last point reminds me of something I wrote in 2006 in the MIT Tech Review: "Stephen Wolfram’s theory of computational equivalence suggests that simple, formulaic shortcuts for understanding evolution (and neural networks) may never be discovered. We can only run the iterative algorithm forward to see the results, and the various computational steps cannot be skipped. Thus, if we evolve a complex system, it is a black box defined by its interfaces. We cannot easily apply our design intuition to the improvement of its inner workings. We can’t even partition its subsystems without a serious effort at reverse-engineering."

  8. More thoughts on “Why does deep and cheap learning work so well?”

    The mystery of why they work so well may be resolved by seeing the resonant homology across the information-accumulating substrate of our universe, from the base simplicity of our physics to the constrained nature of the evolved and grown artifacts all around us. The data in our natural world is the product of a hierarchy of iterative algorithms, and the computational simplification embedded within a deep learning network is also a hierarchy of iteration. Since neural networks are symbolic abstractions of how the human cortex works, perhaps it should not be a surprise that the brain has evolved structures that are computationally tuned to tease apart the complexity of our world.

    Does anyone know about other explorations into these topics?

    Back to quotes from the paper:
    Neural networks perform a combinatorial swindle, replacing exponentiation by multiplication: if there are say n = 10^6 inputs taking v = 256 values each, this swindle cuts the number of parameters from v^n to v×n times some constant factor. We will show that the success of this swindle depends fundamentally on physics: although neural networks only work well for an exponentially tiny fraction of all possible inputs, the laws of physics are such that the data sets we care about for machine learning (natural images, sounds, drawings, text, etc.) are also drawn from an exponentially tiny fraction of all imaginable data sets. Moreover, we will see that these two tiny subsets are remarkably similar, enabling deep learning to work well in practice.
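
To get a feel for the scale of that swindle, here is a quick back-of-the-envelope sketch in Python. It reads the quoted n as 10^6 (one million inputs, roughly a megapixel image) with v = 256 values each. The helper name `swindle_sizes` is made up for illustration, and the "constant factor" mentioned in the quote is simply ignored.

```python
import math


def swindle_sizes(n: int, v: int) -> tuple[int, int]:
    """Return (number of decimal digits in v**n, the value of v*n)."""
    # v**n is far too large to write out, so count its decimal digits
    # with logarithms instead of computing the number itself.
    digits_in_exponential = math.floor(n * math.log10(v)) + 1
    return digits_in_exponential, v * n


digits, linear = swindle_sizes(n=10**6, v=256)
print(f"v^n is a number with about {digits:,} decimal digits")
print(f"v*n = {linear:,}")
```

The first count is only the number of digits needed to write v^n down, while the second is the parameter count itself. That gap is what "replacing exponentiation by multiplication" buys.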

    Increasing the depth of a neural network can provide polynomial or exponential efficiency gains even though it adds nothing in terms of expressivity.

    Both physics and machine learning tend to favor Hamiltonians that are polynomials — indeed, often ones that are sparse, symmetric and low-order.

    1. Low polynomial order
    For reasons that are still not fully understood, our universe can be accurately described by polynomial Hamiltonians of low order d. At a fundamental level, the Hamiltonian of the standard model of particle physics has d = 4. There are many approximations of this quartic Hamiltonian that are accurate in specific regimes, for example the Maxwell equations governing electromagnetism, the Navier-Stokes equations governing fluid dynamics, the Alfvén equations governing magnetohydrodynamics and various Ising models governing magnetization — all of these approximations have Hamiltonians that are polynomials in the field variables, of degree d ranging from 2 to 4.

    2. Locality
    One of the deepest principles of physics is locality: that things directly affect only what is in their immediate vicinity. When physical systems are simulated on a computer by discretizing space onto a rectangular lattice, locality manifests itself by allowing only nearest-neighbor interaction.

    3. Symmetry
    Whenever the Hamiltonian obeys some symmetry (is invariant under some transformation), the number of independent parameters required to describe it is further reduced. For instance, many probability distributions in both physics and machine learning are invariant under translation and rotation.
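
A rough way to see how locality and symmetry shrink the parameter count is to count the weights of a single linear map on a square lattice. This is a simplification of my own for illustration, not the paper's calculation: the function name, the 5-site neighborhood (a site plus its four nearest neighbors) and the wrap-around boundaries are all assumptions.

```python
def parameter_counts(side: int, neighborhood: int = 5) -> dict[str, int]:
    """Count weights for a linear map on a side x side lattice.

    Assumes wrap-around boundaries so every site has the same number
    of neighbors (an illustrative simplification).
    """
    sites = side * side
    return {
        # No structure: every site may couple to every other site.
        "dense": sites * sites,
        # Locality: each site couples only to itself and its nearest neighbors.
        "local": sites * neighborhood,
        # Locality plus translation symmetry: the same few weights are
        # reused at every site, as in a convolution.
        "local_and_symmetric": neighborhood,
    }


counts = parameter_counts(side=1000)
for name, count in counts.items():
    print(f"{name:>20}: {count:,}")
```

The last row is the weight sharing that convolutional networks exploit: one small filter reused everywhere on the lattice.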

    Why Deep?
    What properties of real-world probability distributions cause efficiency to further improve when networks are made deeper? This question has been extensively studied from a mathematical point of view, but mathematics alone cannot fully answer it, because part of the answer involves physics. We will argue that the answer involves the hierarchical/compositional structure of generative processes together with inability to efficiently “flatten” neural networks reflecting this structure.

    A. Hierarchical processes
    One of the most striking features of the physical world is its hierarchical structure. Spatially, it is an object hierarchy: elementary particles form atoms which in turn form molecules, cells, organisms, planets, solar systems, galaxies, etc. Causally, complex structures are frequently created through a distinct sequence of simpler steps.

    We can write the combined effect of the entire generative process as a matrix product.

    If a given data set is generated by a (classical) statistical physics process, it must be described by an equation in the form of [a matrix product], since dynamics in classical physics is fundamentally Markovian: classical equations of motion are always first order differential equations in the Hamiltonian formalism. This technically covers essentially all data of interest in the machine learning community, although the fundamental Markovian nature of the generative process of the data may be an inefficient description.
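
The matrix-product picture can be made concrete with a toy Markov chain. This is a minimal sketch in plain Python with made-up numbers: the two 2-state transition matrices and the starting distribution are hypothetical, chosen only to show that applying the generative steps one after another gives the same result as multiplying the step matrices first.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def matvec(m, x):
    """Apply matrix m (list of rows) to vector x."""
    return [sum(m[i][k] * x[k] for k in range(len(x))) for i in range(len(m))]


# Column j holds P(next state | current state j), so each column sums to 1.
step1 = [[0.9, 0.2],
         [0.1, 0.8]]
step2 = [[0.5, 0.3],
         [0.5, 0.7]]

p0 = [1.0, 0.0]  # start in state 0 with certainty

# Run the generative process one step at a time...
sequential = matvec(step2, matvec(step1, p0))

# ...or collapse the whole process into a single matrix product first.
combined = matmul(step2, step1)
at_once = matvec(combined, p0)

print("step by step:  ", sequential)
print("matrix product:", at_once)
```

Both routes give the same final distribution, which is all the "combined effect as a matrix product" statement requires of a Markovian process.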

    Summary
    The success of shallow neural networks hinges on symmetry, locality, and polynomial log-probability in data from or inspired by the natural world, which favors sparse low-order polynomial Hamiltonians that can be efficiently approximated. Whereas previous universality theorems guarantee that there exists a neural network that approximates any smooth function to within an error ε, they cannot guarantee that the size of the neural network does not grow to infinity with shrinking ε or that the activation function σ does not become pathological. We show constructively that given a multivariate polynomial and any generic non-linearity, a neural network with a fixed size and a generic smooth activation function can indeed approximate the polynomial highly efficiently.

    The success of deep learning depends on the ubiquity of hierarchical and compositional generative processes in physics and other machine-learning applications.
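
The constructive claim in the summary can be tried numerically. Here is a minimal sketch of a four-neuron network that approximates the product of two numbers. It assumes a softplus activation, which is my choice for the example; the Taylor argument only needs a smooth σ with σ''(0) ≠ 0. Expanding around zero, σ(a+b) + σ(−a−b) − σ(a−b) − σ(−a+b) = 4σ''(0)·ab plus higher-order terms, so scaling the inputs down and the output back up recovers the product. The function names and the default `scale` are illustrative.

```python
import math


def softplus(x: float) -> float:
    """Smooth activation ln(1 + e^x); its second derivative at 0 is 1/4."""
    return math.log1p(math.exp(x))


SOFTPLUS_CURVATURE_AT_ZERO = 0.25  # sigma''(0) for softplus


def approx_multiply(u: float, v: float, scale: float = 0.01) -> float:
    """Approximate u*v with four softplus neurons and a linear readout."""
    # Shrink the inputs so the activation stays in its near-quadratic regime.
    a, b = scale * u, scale * v
    # Four hidden neurons; constant and odd-order terms cancel in this sum.
    hidden = (softplus(a + b) + softplus(-a - b)
              - softplus(a - b) - softplus(-a + b))
    # Undo the input scaling and the curvature factor.
    return hidden / (4 * SOFTPLUS_CURVATURE_AT_ZERO * scale ** 2)


for u, v in [(3.0, -2.0), (0.5, 0.25), (1.5, 4.0)]:
    print(f"{u} * {v} = {u * v:+.4f}, network gives {approx_multiply(u, v):+.4f}")
```

The network size stays fixed at four hidden neurons however tight the tolerance. Shrinking `scale` reduces the truncation error, until floating-point round-off starts to dominate.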

  9. Wow!!! "We show constructively that given a multivariate polynomial and any generic non-linearity, a neural network with a fixed size and a generic smooth activation function can indeed approximate the polynomial highly efficiently."

  10. Update: Atomwise just raised their $123M Series B. TechCrunch news.
