Advanced CPU Designs: Crash Course Computer Science #9

Hi, I’m Carrie Anne and welcome to CrashCourse
Computer Science! As we’ve discussed throughout the series,
computers have come a long way from mechanical devices capable of maybe one calculation per second, to CPUs running at kilohertz and megahertz speeds. The device you’re watching this video on
right now is almost certainly running at gigahertz speeds – that’s billions of instructions
executed every second. Which, trust me, is a lot of computation! In the early days of electronic computing,
processors were typically made faster by improving the switching time of the transistors inside
the chip – the ones that make up all the logic gates, ALUs and other stuff we’ve talked
about over the past few episodes. But just making transistors faster and more
efficient only went so far, so processor designers have developed various techniques to boost
performance, allowing them not only to run simple instructions fast, but also to perform much more
sophisticated operations.

INTRO

Last episode, we created a small program for
our CPU that allowed us to divide two numbers. We did this by doing many subtractions in
a row… so, for example, 16 divided by 4 could be broken down into the smaller problem
of 16 minus 4, minus 4, minus 4, minus 4. When we hit zero, or a negative number, we
knew that we were done. But this approach gobbles up a lot of clock
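The repeated subtraction above can be sketched in a few lines of Python. This is a toy illustration of the idea only, with a made-up function name, not how a hardware divider actually works:

```python
def divide_by_subtraction(dividend, divisor):
    """Divide two positive integers the slow way: subtract until we'd go negative."""
    quotient = 0
    remainder = dividend
    while remainder >= divisor:   # hitting zero or a negative number means we're done
        remainder -= divisor      # 16 - 4 - 4 - 4 - 4 ...
        quotient += 1
    return quotient

print(divide_by_subtraction(16, 4))  # prints 4, after four subtractions
```

Each loop iteration costs at least one instruction, which is why a single hardware divide instruction is such a win.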
cycles, and isn’t particularly efficient. So most computer processors today have divide
as one of the instructions that the ALU can perform in hardware. Of course, this extra circuitry makes the
ALU bigger and more complicated to design, but also more capable – a complexity-for-speed
tradeoff that has been made many times in computing history. For instance, modern computer processors now
have special circuits for things like graphics operations, decoding compressed video, and
encrypting files – all of which are operations that would take many many many clock cycles
to perform with standard operations. You may have even heard of processors with
MMX, 3DNow!, or SSE. These are processors with additional, fancy
circuits that allow them to execute additional, fancy instructions – for things like gaming
and encryption. These extensions to the instruction set have
grown, and grown over time, and once people have written programs to take advantage of
them, it’s hard to remove them. So instruction sets tend to keep getting larger
and larger, keeping all the old opcodes around for backwards compatibility. The Intel 4004, the first truly integrated
CPU, had 46 instructions – which was enough to build a fully functional computer. But a modern computer processor has thousands
of different instructions, which utilize all sorts of clever and complex internal circuitry. Now, high clock speeds and fancy instruction
sets lead to another problem – getting data in and out of the CPU quickly enough. It’s like having a powerful steam locomotive,
but no way to shovel in coal fast enough. In this case, the bottleneck is RAM. RAM is typically a memory module that lies
outside the CPU. This means that data has to be transmitted
to and from RAM along sets of data wires, called a bus. This bus might only be a few centimeters long,
and remember those electrical signals are traveling near the speed of light, but when
you are operating at gigahertz speeds – where each clock cycle lasts around a billionth of a second – even this small
delay starts to become problematic. It also takes time for RAM itself to look up
the address, retrieve the data, and configure itself for output. So a “load from RAM” instruction might take
dozens of clock cycles to complete, and during this time the processor is just sitting there
idly waiting for the data. One solution is to put a little piece of RAM
right on the CPU — called a cache. There isn’t a lot of space on a processor’s
chip, so most caches are just kilobytes or maybe megabytes in size, where RAM is usually
gigabytes. Having a cache speeds things up in a clever
way. When the CPU requests a memory location from
RAM, the RAM can transmit not just one single value, but a whole block of data. This takes only a little bit more time than
transmitting a single value, but it allows this data block to be saved into the cache. This tends to be really useful because computer
data is often arranged and processed sequentially. For example, let’s say the processor is totalling
up daily sales for a restaurant. It starts by fetching the first transaction
from RAM at memory location 100. The RAM, instead of sending back just that
one value, sends a block of data, from memory location 100 through 200, which are then all
copied into the cache. Now, when the processor requests the next
transaction to add to its running total, the value at address 101, the cache will say “Oh,
I’ve already got that value right here, so I can give it to you right away!” And there’s no need to go all the way to
RAM. Because the cache is so close to the processor,
it can typically provide the data in a single clock cycle — no waiting required. This speeds things up tremendously over having
to go back and forth to RAM every single time. When data requested from RAM is already stored
in the cache like this, it’s called a cache hit, and if the data requested isn’t in the cache,
so you have to go to RAM, it’s called a cache miss. The cache can also be used like a scratch
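The block-loading behaviour can be mimicked in a short Python sketch. The dictionary standing in for RAM, the 100-address block size, and the `read` helper are all invented for illustration; real caches work on fixed-size cache lines:

```python
BLOCK_SIZE = 100                   # illustrative, matching the 100-through-200 example

ram = {addr: f"transaction at {addr}" for addr in range(1000)}  # pretend main memory
cache = {}                         # address -> value, our little on-chip copy

def read(addr):
    if addr in cache:
        return "cache hit", cache[addr]      # no trip to RAM needed
    # Cache miss: fetch the whole surrounding block, not just one value
    block_start = (addr // BLOCK_SIZE) * BLOCK_SIZE
    for a in range(block_start, block_start + BLOCK_SIZE):
        cache[a] = ram[a]
    return "cache miss", cache[addr]

print(read(100)[0])   # "cache miss" - the first access fetches addresses 100-199
print(read(101)[0])   # "cache hit"  - the neighbour came along in the block
```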
space, storing intermediate values when performing a longer, or more complicated calculation. Continuing our restaurant example, let’s
say the processor has finished totalling up all of the sales for the day, and wants to
store the result in memory address 150. Like before, instead of going back all the
way to RAM to save that value, it can be stored in the cached copy, which is faster to save to,
and also faster to access later if more calculations are needed. But this introduces an interesting problem
— the cache’s copy of the data is now different to the real version stored in RAM. This mismatch has to be recorded, so that
at some point everything can get synced up. For this purpose, the cache has a special
flag for each block of memory it stores, called the dirty bit — which might just be the best
term computer scientists have ever invented. Most often this synchronization happens when
the cache is full, but a new block of memory is being requested by the processor. Before the cache erases the old block to free
up space, it checks its dirty bit, and if it’s dirty, the old block of data is written
back to RAM before loading in the new block. Another trick to boost CPU performance is
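A write-back cache with one dirty bit per block might be sketched like this. The class name, the two-block capacity, and the evict-oldest policy are all invented for illustration; real caches are far more sophisticated:

```python
class TinyCache:
    """Toy write-back cache: each cached block carries a dirty bit."""

    def __init__(self, ram, capacity=2):
        self.ram = ram            # backing store: block_id -> data
        self.capacity = capacity
        self.blocks = {}          # block_id -> (data, dirty_bit)

    def read(self, block_id):
        self._load(block_id)
        return self.blocks[block_id][0]

    def write(self, block_id, data):
        self._load(block_id)
        self.blocks[block_id] = (data, True)   # cached copy changed; RAM is now stale

    def _load(self, block_id):
        if block_id in self.blocks:
            return
        if len(self.blocks) >= self.capacity:  # full: evict the oldest block
            evict_id, (data, dirty) = next(iter(self.blocks.items()))
            if dirty:                          # sync it back to RAM first
                self.ram[evict_id] = data
            del self.blocks[evict_id]
        self.blocks[block_id] = (self.ram[block_id], False)

ram = {0: "old total", 1: "a", 2: "b"}
cache = TinyCache(ram)
cache.write(0, "new total")   # fast: only the cached copy changes
print(ram[0])                 # prints "old total" - RAM hasn't heard yet
cache.read(1)
cache.read(2)                 # cache is full now, so the dirty block gets evicted
print(ram[0])                 # prints "new total" - written back on eviction
```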
called instruction pipelining. Imagine you have to wash an entire hotel’s
worth of sheets, but you’ve only got one washing machine and one dryer. One option is to do it all sequentially: put
a batch of sheets in the washer and wait 30 minutes for it to finish. Then take the wet sheets out and put them
in the dryer and wait another 30 minutes for that to finish. This allows you to do one batch of sheets
every hour. Side note: if you have a dryer that can dry
a load of laundry in 30 minutes, please tell me the brand and model in the comments, because
I’m living with 90 minute dry times, minimum. But, even with this magic clothes dryer, you
can speed things up even more if you parallelize your operation. As before, you start off putting one batch
of sheets in the washer. You wait 30 minutes for it to finish. Then you take the wet sheets out and put them
in the dryer. But this time, instead of just waiting 30
minutes for the dryer to finish, you simultaneously start another load in the washing machine. Now you’ve got both machines going at once. Wait 30 minutes, and one batch is now done,
one batch is half done, and another is ready to go in. This effectively doubles your throughput. Processor designs can apply the same idea. In episode 7, our example processor performed
the fetch-decode-execute cycle sequentially and in a continuous loop: Fetch-decode-execute,
fetch-decode-execute, fetch-decode-execute, and so on. This meant our design required three clock
cycles to execute one instruction. But each of these stages uses a different
part of the CPU, meaning there is an opportunity to parallelize! While one instruction is getting executed,
the next instruction could be getting decoded, and the instruction beyond that fetched from
memory. All of these separate processes can overlap
so that all parts of the CPU are active at any given time. In this pipelined design, an instruction is
executed every single clock cycle, which triples the throughput. But just like with caching, this can lead to
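The throughput claim can be checked with a bit of arithmetic. This is a back-of-the-envelope sketch assuming an idealized three-stage pipeline with no stalls:

```python
def cycles_sequential(n_instructions, stages=3):
    """Fetch-decode-execute runs to completion before the next instruction starts."""
    return n_instructions * stages

def cycles_pipelined(n_instructions, stages=3):
    """The first instruction fills the pipeline; after that, one finishes per cycle."""
    return stages + (n_instructions - 1)

print(cycles_sequential(100))   # 300 cycles
print(cycles_pipelined(100))    # 102 cycles - roughly a 3x speedup
```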
some tricky problems. A big hazard is a dependency in the instructions. For example, you might fetch something that
the currently executing instruction is just about to modify, which means you’ll end
up with the old value in the pipeline. To compensate for this, pipelined processors
have to look ahead for data dependencies, and if necessary, stall their pipelines to
avoid problems. High end processors, like those found in laptops
and smartphones, go one step further and can dynamically reorder instructions with dependencies
in order to minimize stalls and keep the pipeline moving, which is called out-of-order execution. As you might imagine, the circuits that figure
this all out are incredibly complicated. Nonetheless, pipelining is tremendously effective
and almost all processors implement it today. Another big hazard is conditional jump instructions
— we talked about one example, a JUMP NEGATIVE, last episode. These instructions can change the execution
flow of a program depending on a value. A simple pipelined processor will perform
a long stall when it sees a jump instruction, waiting for the value to be finalized. Only once the jump outcome is known does
the processor start refilling its pipeline. But, this can produce long delays, so high-end
processors have some tricks to deal with this problem too. Imagine an upcoming jump instruction as a
fork in a road – a branch. Advanced CPUs guess which way they are going
to go, and start filling their pipeline with instructions based off that guess – a technique
called speculative execution. When the jump instruction is finally resolved,
if the CPU guessed correctly, then the pipeline is already full of the correct instructions
and it can motor along without delay. However, if the CPU guessed wrong, it has
to discard all its speculative results and perform a pipeline flush – sort of like when
you miss a turn and have to do a u-turn to get back on route, and stop your GPS’s insistent
shouting. To minimize the effects of these flushes,
CPU manufacturers have developed sophisticated ways to guess which way branches will go,
called branch prediction. Instead of being a 50/50 guess, today’s
processors can often guess with over 90% accuracy! In an ideal case, pipelining lets you complete
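The video doesn’t name a specific prediction algorithm, but one classic textbook scheme, the 2-bit saturating counter, already does well on loop-like branches. A toy Python sketch:

```python
def predict_branches(outcomes):
    """Predict each branch with a 2-bit saturating counter; return accuracy.

    Counter values 0-1 predict 'not taken', 2-3 predict 'taken'.
    """
    counter = 2
    correct = 0
    for taken in outcomes:
        prediction = counter >= 2
        if prediction == taken:
            correct += 1
        # nudge the counter toward what actually happened, clamped to 0..3
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
    return correct / len(outcomes)

# A loop branch: taken 9 times, then falls through - a typical easy case
print(predict_branches([True] * 9 + [False]))  # prints 0.9
```

The two bits of hysteresis mean a single surprise (like a loop exit) doesn’t flip the prediction for the next run of the loop.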
one instruction every single clock cycle, but then superscalar processors came along
which can execute more than one instruction per clock cycle. Even in a pipelined design, whole areas of the processor might be totally idle during the execute phase. For example, while executing an instruction
that fetches a value from memory, the ALU is just going to be sitting there, not doing
a thing. So why not fetch-and-decode several instructions
at once, and whenever possible, execute instructions that require different parts of the CPU all
at the same time!? But we can take this
one step further and add duplicate circuitry for popular instructions. For example, many processors will have four,
eight or more identical ALUs, so they can execute many mathematical instructions all
in parallel! Ok, the techniques we’ve discussed so far
primarily optimize the execution throughput of a single stream of instructions, but another
way to increase performance is to run several streams of instructions at once with multi-core
processors. You might have heard of dual core or quad
core processors. This means there are multiple independent
processing units inside of a single CPU chip. In many ways, this is very much like having
multiple separate CPUs, but because they’re tightly integrated, they can share some resources,
like cache, allowing the cores to work together on shared computations. But, when more cores just isn’t enough,
you can build computers with multiple independent CPUs! High end computers, like the servers streaming
this video from YouTube’s datacenter, often need the extra horsepower to keep it silky
smooth for the hundreds of people watching simultaneously. Two- and four-processor configurations are
the most common right now, but every now and again even that much processing power isn’t
enough. So we humans get extra ambitious and build
ourselves a supercomputer! If you’re looking to do some really monster
calculations – like simulating the formation of the universe – you’ll need some pretty
serious compute power. A few extra processors in a desktop computer
just isn’t going to cut it. You’re going to need a lot of processors. No.. no… even more than that. A lot more! When this video was made, the world’s fastest
computer was located in The National Supercomputing Center in Wuxi, China. The Sunway TaihuLight contains a brain-melting
40,960 CPUs, each with 256 cores! That’s over ten million cores in total… and
each one of those cores runs at 1.45 gigahertz. In total, this machine can process 93 Quadrillion
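The TaihuLight numbers can be sanity-checked with quick arithmetic. The inference at the end, that each core must finish several floating-point operations per cycle (i.e. it has vector units), is mine, not from the video:

```python
cpus = 40_960
cores_per_cpu = 256
clock_hz = 1.45e9                     # 1.45 gigahertz per core
flops = 93e15                         # 93 quadrillion operations per second

total_cores = cpus * cores_per_cpu
print(total_cores)                    # 10485760 - "over ten million cores"

cycles_per_second = total_cores * clock_hz
print(f"{cycles_per_second:.2e}")     # about 1.52e+16 core-cycles per second

print(round(flops / cycles_per_second, 1))  # about 6 FLOPs per core per cycle
```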
— that’s 93 million-billions — floating point math operations per second, known as
FLOPS. And trust me, that’s a lot of FLOPS!! No word on whether it can run Crysis at max
settings, but I suspect it might. So long story short, not only have computer
processors gotten a lot faster over the years, but also a lot more sophisticated, employing
all sorts of clever tricks to squeeze out more and more computation per clock cycle. Our job is to wield that incredible processing
power to do cool and useful things. That’s the essence of programming, which
we’ll start discussing next episode. See you next week.

