Tesla now holds the mantle of Moore’s Law, with the D1 chip introduced last night for the DOJO supercomputer (video, news summary).

NOTE: it’s a semi-log graph, so a straight line is an exponential; each y-axis tick is 100x. This graph covers a 10,000,000,000,000,000,000x improvement in computation/$.
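A quick back-of-the-envelope check on those headline figures (a sketch using only the rounded numbers quoted above): an overall improvement of ~10^19x over roughly 122 years implies an average doubling time of just under two years.

```python
import math

# Sanity check on the chart's headline figures (rounded assumptions, not chart data):
total_improvement = 1e19   # ~10,000,000,000,000,000,000x improvement in computation/$
years = 122                # approximate span of the chart

doublings = math.log2(total_improvement)   # ~63 doublings over the whole span
avg_doubling_years = years / doublings     # ~1.9 years per doubling, on average

print(f"{doublings:.0f} doublings -> one doubling every {avg_doubling_years:.1f} years")
```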

DOJO’s dominance should not be a surprise, as Intel ceded leadership to NVIDIA a decade ago, and further handoffs were inevitable. The computational frontier has shifted across many technology substrates over the past 120 years, most recently from the CPU to the GPU to ASICs optimized for neural networks (the majority of new compute cycles). The ASIC approach is being pursued by scores of new companies; Google TPUs have now been added to the chart by popular request (see note below for methodology), as well as the Mythic analog M.2.

Of all of the depictions of Moore’s Law, this is the one, originally by Ray Kurzweil, I find to be most useful, as it captures what customers actually value — computation per $ spent.

Humanity’s capacity to compute has compounded for as long as we can measure it, exogenous to the economy, and starting long before Intel co-founder Gordon Moore noticed a refraction of the longer-term trend in the belly of the fledgling semiconductor industry in 1965.

Why the transition within the integrated circuit era? Intel lost to NVIDIA for neural networks because the fine-grained parallel compute architecture of a GPU maps better to the needs of deep learning. There is a poetic beauty to the computational similarity of a processor optimized for graphics processing and the computational needs of a sensory cortex, as commonly seen in neural networks today. A custom chip (like the Tesla D1 ASIC) optimized for neural networks extends that trend to its inevitable future in the digital domain. Further advances are possible in analog in-memory compute, an even closer biomimicry of the human cortex. The best business planning assumption is that Moore’s Law, as depicted here, will continue for the next 20 years as it has for the past 120.
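To make the analog in-memory idea concrete, here is a minimal numerical sketch of a crossbar-style compute-in-memory array (an illustration of the general principle only, not a model of the D1 or any particular product): weights live in the memory array as conductances, inputs arrive as voltages, and Ohm's and Kirchhoff's laws deliver every multiply-accumulate of a matrix-vector product in a single analog step.

```python
import numpy as np

# Conceptual model of analog compute-in-memory (illustrative only):
# weights are stored as conductances G, inputs are applied as voltages V, and each
# column's output current I[j] = sum_i V[i] * G[i, j] -- the physics performs the
# multiply-accumulates in parallel, rather than a digital core doing them serially.
rng = np.random.default_rng(0)

G = rng.uniform(1e-6, 1e-5, size=(256, 64))   # conductance matrix (256 rows x 64 columns)
V = rng.uniform(0.0, 0.5, size=256)           # input voltages on the rows

I_out = V @ G   # column currents: an entire matrix-vector product in "one step"
print(I_out.shape)
```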

For those unfamiliar with this chart, here is a more detailed description:

Moore’s Law is both a prediction and an abstraction

Moore’s Law is commonly reported as a doubling of transistor density every 18 months. But this is not something the co-founder of Intel, Gordon Moore, has ever said. It is a nice blending of his two predictions; in 1965, he predicted an annual doubling of transistor counts in the most cost effective chip and revised it in 1975 to every 24 months. With a little hand waving, most reports attribute 18 months to Moore’s Law, but there is quite a bit of variability. The popular perception of Moore’s Law is that computer chips are compounding in their complexity at near constant per unit cost. This is one of the many abstractions of Moore’s Law, and it relates to the compounding of transistor density in two dimensions. Others relate to speed (the signals have less distance to travel) and computational power (speed x density).
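To see how much that variability matters, a bit of simple arithmetic (my own illustration, not Moore's): the assumed doubling period compounds dramatically over a decade.

```python
# How much the assumed doubling period matters over one decade (simple arithmetic):
for doubling_months in (12, 18, 24):
    doublings = 120 / doubling_months            # number of doublings in 10 years
    print(f"{doubling_months}-month doubling -> {2 ** doublings:,.0f}x in a decade")
```

The spread is roughly 1,000x versus 100x versus 32x over ten years, which is why "18 months" is at best a convenient blend of the two predictions.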

Unless you work for a chip company and focus on fab-yield optimization, you do not care about transistor counts. Integrated circuit customers do not buy transistors. Consumers of technology purchase computational speed and data storage density. When recast in these terms, Moore’s Law is no longer a transistor-centric metric, and this abstraction allows for longer-term analysis.

What Moore observed in the belly of the early IC industry was a derivative metric, a refracted signal, from a longer-term trend, a trend that begs various philosophical questions and predicts mind-bending futures.

Ray Kurzweil’s abstraction of Moore’s Law shows computational power on a logarithmic scale, and finds a double exponential curve that holds over 120 years! A straight line would represent a geometrically compounding curve of progress.

Through five paradigm shifts – such as electro-mechanical calculators and vacuum tube computers – the computational power that $1000 buys has doubled every two years. For the past 35 years, it has been doubling every year.

Each dot is the frontier of computational price performance of the day. One machine was used in the 1890 Census; one cracked the Nazi Enigma cipher in World War II; one predicted Eisenhower’s win in the 1956 Presidential election. Many of them can be seen in the Computer History Museum.

Each dot represents a human drama. Prior to Moore’s first paper in 1965, none of them even knew they were on a predictive curve. Each dot represents an attempt to build the best computer with the tools of the day. Of course, we use these computers to make better design software and manufacturing control algorithms. And so the progress continues.

Notice that the pace of innovation is exogenous to the economy. The Great Depression and the World Wars and various recessions do not introduce a meaningful change in the long-term trajectory of Moore’s Law. Certainly, the adoption rates, revenue, profits and economic fates of the computer companies behind the various dots on the graph may go through wild oscillations, but the long-term trend emerges nevertheless.

Any one technology, such as the CMOS transistor, follows an elongated S-shaped curve of slow progress during initial development, upward progress during a rapid adoption phase, and then slower growth from market saturation over time. But a more generalized capability, such as computation, storage, or bandwidth, tends to follow a pure exponential – bridging across a variety of technologies and their cascade of S-curves.
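A minimal numerical sketch of that bridging effect (illustrative parameters of my choosing, not fitted to the chart): stack a cascade of logistic S-curves whose ceilings grow geometrically, and the frontier across them tracks a straight line on a semi-log plot.

```python
import numpy as np

# Illustrative only: a cascade of logistic S-curves whose ceilings grow geometrically.
# Each "technology" saturates, but the best-available capability keeps compounding.
years = np.arange(1900, 2021)

def s_curve(t, midpoint, ceiling, steepness=0.4):
    """Logistic curve: slow start, rapid adoption, saturation at `ceiling`."""
    return ceiling / (1.0 + np.exp(-steepness * (t - midpoint)))

# One new paradigm every 20 years, each with a 100x higher ceiling than the last.
paradigms = [s_curve(years, midpoint=1910 + 20 * i, ceiling=100.0 ** i) for i in range(6)]

frontier = np.maximum.reduce(paradigms)   # best available capability at each point in time

# On a semi-log plot this frontier is roughly a straight line, i.e. an exponential.
for y in range(1900, 2021, 20):
    print(f"{y}: frontier ~ 10^{np.log10(frontier[y - 1900]):.1f}")
```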

In the modern era of accelerating change in the tech industry, it is hard to find even five-year trends with any predictive value, let alone trends that span the centuries. I would go further and assert that this is the most important graph ever conceived.

Why is this the most important graph in human history?

A large and growing set of industries depends on continued exponential cost declines in computational power and storage density. Moore’s Law drives electronics, communications and computers and has become a primary driver in drug discovery, biotech and bioinformatics, medical imaging and diagnostics. As Moore’s Law crosses critical thresholds, a formerly lab science of trial-and-error experimentation becomes a simulation science, and the pace of progress accelerates dramatically, creating opportunities for new entrants in new industries. Boeing used to rely on wind tunnels to test novel aircraft designs. Once CFD modeling became powerful enough, design moved to the rapid pace of iterative simulation, and the nearby wind tunnels of NASA Ames lie fallow. The engineer can iterate rapidly while simply sitting at their desk.

Every industry on our planet is going to become an information business. Consider agriculture. If you ask a farmer in 20 years’ time about how they compete, it will depend on how they use information, from satellite imagery driving robotic field optimization to the code in their seeds. It will have nothing to do with workmanship or labor. That will eventually percolate through every industry as IT innervates the economy.

Non-linear shifts in the marketplace are also essential for entrepreneurship and meaningful change. Technology’s exponential pace of progress has been the primary juggernaut of perpetual market disruption, spawning wave after wave of opportunities for new companies. Without disruption, entrepreneurs would not exist.

Moore’s Law is not just exogenous to the economy; it is why we have economic growth and an accelerating pace of progress. At Future Ventures, we see it in the growing diversity and global impact of the entrepreneurial ideas we encounter each year. The industries impacted by the current wave of tech entrepreneurs are more diverse, and an order of magnitude larger, than those of the ’90s — from automobiles and aerospace to energy and chemicals.

At the cutting edge of computational capture is biology; we are actively reengineering the information systems of biology and creating synthetic microbes whose DNA is manufactured from bare computer code and an organic chemistry printer. But what to build? So far, we largely copy large tracts of code from nature. But the question spans across all the complex systems that we might wish to build, from cities to designer microbes, to computer intelligence.

Reengineering engineering

As these systems transcend human comprehension, we will shift from traditional engineering to evolutionary algorithms and iterative learning algorithms like deep learning and machine learning. As we design for evolvability, the locus of learning shifts from the artifacts themselves to the process that created them. There is no mathematical shortcut for the decomposition of a neural network or genetic program, no way to “reverse evolve” with the ease that we can reverse engineer the artifacts of purposeful design. The beauty of compounding iterative algorithms (evolution, fractals, organic growth, art) derives from their irreducibility. And it empowers us to design complex systems that exceed human understanding.
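A toy sketch of that shift in the locus of learning (purely illustrative, with a trivial stand-in objective): we author only the fitness test and the variation process, and the "design" itself is grown by selection.

```python
import random

# Toy sketch of "designing for evolvability" (illustration only): we author a fitness
# test and a mutation process; the artifact -- a bit-string "design" -- is grown, not
# written by hand.
TARGET = [1] * 32                       # stand-in for the behavior we want

def fitness(genome):
    return sum(g == t for g, t in zip(genome, TARGET))

def mutate(genome, rate=0.03):
    return [1 - g if random.random() < rate else g for g in genome]

random.seed(0)
population = [[random.randint(0, 1) for _ in range(32)] for _ in range(50)]

for generation in range(300):
    population.sort(key=fitness, reverse=True)
    if fitness(population[0]) == len(TARGET):
        break
    survivors = population[:10]                                    # selection
    population = survivors + [mutate(random.choice(survivors)) for _ in range(40)]  # keep elites, add variation

best = max(population, key=fitness)
print(f"generation {generation}: best fitness {fitness(best)}/{len(TARGET)}")
```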

Why does progress perpetually accelerate?

All new technologies are combinations of technologies that already exist. Innovation does not occur in a vacuum; it is a combination of ideas from before. In any academic field, the advances of today are built on a large edifice of history. This is why major innovations tend to be ‘ripe’ and tend to be discovered at nearly the same time by multiple people. The compounding of ideas is the foundation of progress, something that was not so evident to the casual observer before the age of science. Science tuned the process parameters for innovation, and became the best method for a culture to learn.

From this conceptual base comes the origin of economic growth and accelerating technological change, as the combinatorial explosion of possible idea pairings grows exponentially as new ideas come into the mix (on the order of 2^n possible groupings, per Reed’s Law). It explains the innovative power of urbanization and networked globalization. And it explains why interdisciplinary ideas are so powerfully disruptive; it is like the differential immunity of epidemiology, whereby islands of cognitive isolation (e.g., academic disciplines) are vulnerable to disruptive memes hopping across, much like South America was to smallpox from Cortés and the Conquistadors. If disruption is what you seek, cognitive island-hopping is a good place to start, mining the interstices between academic disciplines.
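The arithmetic behind that combinatorial explosion (a simple illustration, not a claim about any particular dataset): pairwise combinations of n ideas grow roughly as n², but possible groupings of any size grow as 2^n per Reed’s Law.

```python
import math

# Growth of the idea-combination space with the number of ideas n (illustration only):
for n in (10, 20, 40, 80):
    pairs = math.comb(n, 2)       # pairwise combinations, ~n^2 / 2
    groupings = 2 ** n            # all possible subsets of ideas, per Reed's Law
    print(f"n={n:3d}  pairs={pairs:5d}  groupings={groupings:.1e}")
```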

It is the combinatorial explosion of possible innovation-pairings that creates economic growth, and it’s about to go into overdrive. In recent years, we have begun to see the global innovation effects of a new factor: the internet. People can exchange ideas like never before. Long ago, people were not communicating across continents; ideas were partitioned, and so the success of nations and regions pivoted on their own innovations. Richard Dawkins states that in biology it is genes which really matter, and we as people are just vessels for the conveyance of genes. It’s the same with ideas, or “memes”. We are the vessels that hold and communicate ideas, and now that pool of ideas percolates on a global basis more rapidly than ever before.

In the next 6 years, three billion minds will come online for the first time to join this global conversation (via inexpensive smartphones in the developing world). This rapid influx of three billion people to the global economy is unprecedented in human history, and so too will be the pace of idea-pairings and progress.

We live in interesting times, at the cusp of the frontiers of the unknown and breathtaking advances. But, it should always feel that way, engendering a perpetual sense of future shock.

16 responses to “122 Years of Moore’s Law + Tesla AI Update”

  1. P.S. Moore’s original prediction, from 1965… Note his Y-axis wording and 12-month doubling: It was the dotted line that was the profound prediction. Carver Mead called it "Moore’s Law" (Gordon was way too modest to declare an eponymous law).

    And for those unfamiliar with semi-log graphs, here is the exact same data as above for 1900-2010 with a standard linear y-axis scale: It makes everything feel like a recent phenomenon, post-2000. But if you extended it to 2020, then everything would look flat until 2010. At each point, 10 years of progress = 2^10, or just over 1000x of progress, and history looks flat. There is no "knee in the curve" or sudden "hockey stick" when plotted on semi-log charts. It is misleading to look at the linear chart above and think something changed in the year 2000… or 2010… or any given year.
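    Here is a minimal plotting sketch of that effect (the data series is a stand-in exponential doubling every year, not the chart’s actual data): the same curve shows a dramatic "recent" take-off on a linear axis and a featureless straight line on a log axis.

```python
import numpy as np
import matplotlib.pyplot as plt

# Same stand-in series (doubling every year), plotted on linear and log y-axes.
years = np.arange(1900, 2021)
compute = 2.0 ** (years - 1900)

fig, (ax_lin, ax_log) = plt.subplots(1, 2, figsize=(10, 4))
ax_lin.plot(years, compute)
ax_lin.set_title("linear y-axis: everything looks post-2000")
ax_log.semilogy(years, compute)
ax_log.set_title("log y-axis: one straight line, no knee")
for ax in (ax_lin, ax_log):
    ax.set_xlabel("year")
plt.tight_layout()
plt.show()
```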

    P.P.S. The longer explanation above was for a more detailed article I wrote for the Computer History Museum on the past and future of Moore’s Law. And more on the NVIDIA era and thoughts from their CEO, Jen-Hsun Huang, carrying the torch for Moore’s Law. P.S. It was about a decade ago that I first marveled at the use of NVIDIA GPUs for neural simulations ("Brainstorm"). Evolved Machines CEO Paul Rhodes reflects: "The GPU impact on compute is beautiful. We had CUDA 0.1, and one of the first cards (it was called the G-80), handed to us in a bag, pre-commercial availability. I will never forget the 80x speedup we got on our neuron simulations when we ported to that platform. It is so native for neural computation."

  2. Interesting write-up explaining the graphic. Nice to see it visualized from 1900 (pre-Moore’s Law 1965) to the projection at 2025. In the next 4 years will we see a doubling? One wonders if there is a point of leveling off and, perhaps, decline as external influences and limitations (physical) come into play.

  3. [https://www.flickr.com/photos/miocene] No. Design aspirations for the Analytical Engine started in 1837, and construction was 1890-1910. Similarly, Dojo is 2022 in the chart above.

  4. [https://www.flickr.com/photos/26274943@N02] Intel keeps saying it will come to an end, but that is because they are very focused on their product lines.

    It has been accelerating, on a double exponential. And on top of that, the advances in software and algorithms are accelerating further still in machine intelligence: 60 Years of AI Compute >> Moore's Law.

  5. Wonderful essay.
    It is hard to imagine life 100 years from now, unless our destruction of the environment outpaces the development of our knowledge, which maybe it already has.

  6. [https://www.flickr.com/photos/jurvetson] Steve, can we (humans) keep up with the acceleration? Amazing to see the graphics. Thanks.

  7. Great post and graphics. Thanks!!!

    After Jim Keller worked with Elon and team (presumably on various things that led to DOJO), he went to Tenstorrent, where he was an early investor. Any thoughts on their general-purpose neural net TPU? BTW, his recent interviews with Lex Fridman were a tour de force on a host of incredibly important topics for tech dev, management, chip architecture, etc. – must watch!

    Some paraphrases from that set of interviews (forgive any errors, these were taken on the fly during the video a few months ago):

    1) "The actual current cadence of Moore’s law is 0.6 every 2 years (shrink factor for a constant chip area)…good but not 0.5 (2x/2yrs). Performance of computers trend is up 2x every 3 or 4 years. How small can a switching device be? Current scale is 1000x1000x1000 (1 billion) atoms. We run into quantum effects at 2 to the 10 (1024) atoms…so, ~10x10x10 (1,000) atoms = a million times smaller than today’s scale is possible(!) The next 10 or 20 years of shrinking will happen…and in addition…the quantum folks are working on USING quantum effects, not running into them.

    2) "Every 10x in compute power enables a new kind of computation >> scalar, vector, matrix, spatial/topological computations >> what’s next?

    3) Tenstorrent is building a "computing platform for AI and Software 2.0"; 1st product = Tensor Processing Unit (TPU) board with up to two "Grayskull AI Processors" starting at ~$1K per PCIe Gen 4, 16-lane board (tenstorrent.com/grayskull/) >> Each TPU contains 120 Tensix cores, each capable of 3 trillion ops per second (TOPS); peak rate of all cores = 368 TOPS drawing just 65W of power. Architecture specialized to run floating point calculations…tensor algebras simple or complex…matrix multiplications, convolutions, data manipulations, data movements…on matrices/graphs built from sparsely accessed petabyte data sets distributed across a large number of computers…

    More info on the Tenstorrent Tensix architecture with some early/impressive pre-production benchmarks: http://www.linleygroup.com/mpr/article.php?id=12287

  8. Magnificent distillation of our current state of art!

  9. ps: One GREAT thing about all the AI dev at Tesla…their self-driving car tech can certainly be used to create networks of automated AI traffic signals that would massively increase throughput of congested streets (and save massive amounts of energy). Variable Speed Limits (controlled by AI Traffic Signal Networks) are the 2nd key to making our roads much more efficient (decreasing traffic jams, wasted drive time, wasted energy). More cars? Increase the speed! Very simple, but hard without AI networks managing Digital Road Signs. Tesla should be building an internal "AI Traffic Signaling" division (or Future should be funding a start-up focused on this)!!!

  10. Several people on Twitter complained that Google TPUs are not represented. I left them off because Google does not share cost or chip level performance for comparison purposes (nor do they ever need to). If someone knows these numbers or can point me to them, I will update.

    I did the best I could with the available data (and updated the main chart above). Google is not quoting specific performance numbers that I can find for TPU v4, just that it’s ~2x TPU v3 (and v3 was just an estimate as a 2x of v2). The Wikipedia TPU entry has it as an unknown. I assumed 180 TOPS. This makes them 1/2 Tesla D1 at the chip level. Also, to get an exaflop of performance, Google takes 1,000 more TPU v4 chips than DOJO requires D1 chips (HPCwire).

    [1/29/22 update — I got the detailed TPU numbers from Google, and I was close, actually overestimating their performance/$ by a bit. I’ll adjust their dot downward when the numbers become public]

  11. Really great article. However, I’d like to push back on the suggested metric, essentially TOPS/$. I don’t believe an exponential growth in TOPS/$ necessarily suggests an exponential growth in intelligence. I think, like the original casting of Moore’s Law, it will be a red herring pushed by those who benefit from its narrative (e.g., NVIDIA).

    Indisputably, this metric has been accurate so far. This trend can be seen in charting (AI model, corresponding hardware) since AlexNet in 2012. However, on a 10+ year time scale, the proposed metric suggests that scaling “standard” deep learning models alone will get us to human-level intelligence. I believe the future will be more complex than that. For example, see this excellent review analyzing the computational limits of standard deep learning. Specifically, based on current trends, they project that it will cost more than $100B to train a model that achieves < 5% error on the ImageNet benchmark. It’s even worse when we consider other modalities.

    Currently, the universe of practically realizable neural networks is limited to the subset of models that can be run on digital systems (GPUs, TPUs, etc.)—a big constraint considering the brain’s neural network is actually a fully analog stochastic dynamical system!

    Numerous gains are possible when starting with both algorithms and hardware that more closely resemble the brain. Many of these gains are directly quantifiable via standard performance metrics (TOPS, TOPS/W, TOPS/$, etc.). For example, matrix-vector multiplication is realizable in just one time step in analog CIM. Matrix inversion is too! However… many of them are not. Sparsity, attention, and causality are all examples of algorithmic techniques that are demonstrably critical to intelligence, yet—when implemented in the proper hardware (i.e., not digital)—the number of operations goes down, not up.

    I believe a suitable performance spec in our field has to be task-dependent. In my opinion, cost-to-solution fits the bill, but I’m open to other suggestions. For now, each intelligence modality requires its own plot, which isn’t great, but in the world of ASICs, I believe this nuance is uniquely critical. For example, there are fantastic applications of AI for highly efficient speech recognition (e.g., Syntiant’s NDP120) but no great solutions for recommendation. Long term, I think it is crucial for the scientific and industrial communities to agree upon a suite of benchmarks that comprise said solution (like MLPerf but better). Then, one can beautifully plot hardware solutions in a united fashion.

  12. [https://www.flickr.com/photos/193774108@N04] I agree with most of what you say about the future, but I am not sure I see the value of a task-dependent metric. The beauty of the abstraction plotted here is its universal application to computation, and hence the longevity of its value. Also, so you know, NVIDIA had nothing to do with this chart. I added their data points without their knowledge (and I have never been an investor in NVIDIA; I just saw how they took the mantle from Intel and shared it in this format).

    I have invested in the next generation of analog in-memory compute that better mimics the human cortex. It has a 6-10x TOPS/W advantage over digital today, on a 40nm node! 100x and more should be possible on a modern node. Mythic: Mythic Intelligence at the Edge. And yes, they implement an 8-bit multiply and accumulate in a single transistor! (another reminder why "transistor counts" are such a broken version of Moore’s Law).

    As for scaling to human+ levels of intelligence, I think we will build a superhuman AGI before we understand our own brain well enough to radically improve it or upload it to a silicon substrate. The complex creations of iterative algorithms (like evolution and deep learning) are inherently inscrutable. It is easier to push evolution forward than to reverse engineer the products of evolution.

    We are in the middle of a sea change in how the vanguard of engineering will be done. Building complex systems that exceed human understanding is more like parenting than programming. The locus of learning shifts from end products to the process of their creation. An ever-growing percentage of software will be grown and an ever-growing percentage of compute will run on infrastructure that resembles the brain (massively parallel, fine grained architectures with in-memory compute and a growing focus on the memory and interconnect elements). This is the path to AGI, IMHO.

    I’ve been working with a neural plasticity company for 14 years now (Posit Science). One of my concerns with uploading is the extreme plasticity of the sensory cortex and the recruitment of neighboring regions in the face of external changes (like phantom limb pain in amputees). Cut and paste of brain state to a foreign substrate may require a deep understanding of the analog domain, where structural topology and functional spike train variation is immense (there are over 300 types of neurons in neocortex that are structurally and electrically different. And each neuron has ~200 ion channels from a pool of 20-40 variations). Furthermore, our mostly 2D silicon substrates lack the interconnect density for a direct map of the synaptic fan-out of the brain. Without a deep understanding of what elements can be ignored or abstracted, a simulation of brain function explodes in combinatorial complexity.

    Going back a decade, in talks about AI futures, I was fond of advising to “augment early and often.” I worry that people want to believe in extreme augmentation and uploading, not because it is likely, but because it offers a mental model for “humanity” maintaining the mantle of supremacy, perpetually perched at the pinnacle of evolution. The idea that evolution will eventually progress way beyond us is hard to internalize. We seek transcendence, as the antidote for obsolescence.

    My 2006 musings on these topics.

  13. I appreciate the response!

    I strongly agree with your assertion that “designing” the brain is not the way to go. I believe this is something that most individuals and groups in the academic field of neuromorphic computing get wrong. I agree with your philosophical reasoning behind it, but I also think this bottom-up approach is the easiest to grok, albeit the hardest to engineer. I believe we will learn more about neuroscience in the coming decades by closely studying circuit theory (the intersection between physics and deep learning) than through biological experimentation. I agree with the challenges you mention surrounding chip-to-chip (brain-to-brain, if you will) variation. For this reason, I think, long term, it is crucial for hardware to support on-chip training, where it implicitly accounts for all of these variations.

    I mostly agree with adopting iterative algorithms over explicitly designed complex systems, but my outlook varies slightly. I think it is crucial for those pursuing AGI to acknowledge the obvious: deep learning has worked really well so far. I think the few organizations that can afford it should continue their brute-force approach toward achieving AGI—more layers, wider layers, more data, paired with relatively minor algorithmic and architectural improvements. This approach is what OpenAI did moving from GPT-2 to GPT-3. It is virtually the same model, only ~100x bigger and trained on more data. While I think several constraints of deep learning prohibit it from being suitable for true AGI (e.g., I don’t think causality will magically emerge), it’d be foolish to count it out completely. We can consider this the fully iterative approach.

    As a side note: I’m even more skeptical that this brute-force approach will ever yield scalable AGI (i.e., similar energy efficiency as the brain). That would require many, many orders of magnitude improvements in energy efficiency.

    If this approach doesn’t work, then, by definition, we have to change the design. That said, I agree with your criticism of this approach. Thinking we need to reconstruct every last ion is an unwarranted rejection of the field’s prior successes. This approach is not the way to go. IMO, the logical next step from brute-force deep learning in terms of design is to replicate a few more of the brain’s core mathematical and physical (but not biochemical!) aspects. Of course, selecting these design choices is tricky.

    Here are a few properties of the brain that have been shown to be advantageous to modern AI:

    (1) Fully analog: the brain is a fully analog neural network. All operations are performed by physics and not by digital simulations. Fully analog systems often exhibit power consumption in the milliwatt range and inference latency in the nanosecond range. Historically, fully analog neural networks were untrainable, but recent work has changed that (e.g., equilibrium propagation).

    (2) Sparse: past the obvious advantage of bringing computations down, there are deep technical advantages surrounding continual learning and training stability (critical for reinforcement learning).

    (3) Higher-order optimization: Given that standard stochastic gradient descent, a first-order learning algorithm, is so data inefficient compared to the brain, it is not unreasonable to suggest that the brain is using higher-order methods for learning. As mentioned above, analog circuits implicitly solve these complex systems (i.e., matrix inversion) in nanoseconds.

    Regarding Mythic, I am very familiar with their work and have been following them since 2017. They have already overcome many complex engineering and product challenges, and I have a tremendous amount of respect for them. However, I do see several long-term prohibitive limitations in their core technology. Specifically, layer-wise conversions between digital and analog result in an enormous number of ADCs, which are notoriously large and power-hungry. This precise fact has kept many analog/mixed-signal approaches from scaling networks to their digital counterparts’ sizes. Also, the digital computations in the data path serve as a latency bottleneck, driving inference into the millisecond range even though each MAC only takes a few nanoseconds. Furthermore, several architectural details are suboptimal (e.g., cannot fully reap the benefits of 3D memory scaling for CIM while using dense crossbar arrays). That said, their M1000-series is thoroughly impressive and appears to be outperforming many (maybe all?) competitors in their form factor.

  14. [https://www.flickr.com/photos/65131864@N00/] Agree about massive potential benefit if not just Tesla but other vehicles and infrastructure could function as a network. Not the current model as I understand it, but if cars, for example could communicate real time changes in road conditions to other nearby vehicles there would be real benefit. No need to be quite so autonomous of each other!

  15. Gordon Moore, R.I.P. He taught the law, and the law won.

    From the WSJ today: “One thing I’ve learned—once you’ve made a successful prediction, avoid making another one,” Mr. Moore joked at a 2015 event to celebrate 50 years of Moore’s Law.
