Lecture 1: Introduction
• Thinking about Machine Structures
• Great Ideas in Computer Architecture
• What you need to know about this class
• Everything is a Number
Lecture 2: C I
• Compile vs. Interpret
• Number representation in different bases
Binary representations: 1. Unsigned Integers; 2. Two's Complement (invert all bits, then add 1); 3. One's Complement (invert all bits); 4. Sign Magnitude (only the most significant bit encodes the sign)
C Preprocessor (CPP) Macros
#define MAG(x, y) (sqrt( (x)*(x) + (y)*(y)))
Parenthesize every macro parameter (and the whole body); otherwise operator precedence can break the expansion, as sketched below.
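A minimal sketch of why the parentheses matter; MAG_BAD is a hypothetical unparenthesized variant, not from the lecture:

#include <stdio.h>
#include <math.h>

#define MAG(x, y)     (sqrt( (x)*(x) + (y)*(y) ))  /* fully parenthesized */
#define MAG_BAD(x, y) (sqrt( x*x + y*y ))          /* hypothetical: no parentheses */

int main(void)
{
    printf("%f\n", MAG(1+2, 0));     /* expands to sqrt((1+2)*(1+2) + 0*0) = 3.0 */
    printf("%f\n", MAG_BAD(1+2, 0)); /* expands to sqrt(1+2*1+2 + 0*0) = sqrt(5) */
    return 0;
}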
Lecture 3: C II
• Pointers
• Pointers & Arrays
• C Memory Management
• C Bugs
Pointer categories
- Null pointer: its value is NULL (0)
- Dangling pointer: points to memory that has already been freed
- Wild pointer: never initialized (never assigned)
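A minimal C sketch of the three categories (illustrative only; dereferencing any of them is undefined behavior):

#include <stdlib.h>

int main(void)
{
    int *null_ptr = NULL;                    /* null pointer: value is NULL (0) */
    int *wild_ptr;                           /* wild pointer: never initialized */
    int *dangling_ptr = malloc(sizeof(int));
    free(dangling_ptr);                      /* dangling: its memory was freed */
    return 0;
}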
C Memory Management
Static memory vs. dynamic memory
Static memory persists for the whole run of the program and is never freed; its size is fixed once allocated, and the allocation happens at compile/link time.
Dynamic memory is allocated and freed while the program runs, as the program needs it, and its size can vary.
Heap vs. stack
The heap is allocated by calling malloc (or new) and released by calling free (or delete). The stack is a linear structure; the heap is managed as a linked structure.
Memory allocation in C
Global variables and static variables are stored in the static memory region.
Local variables defined inside a function are stored on the stack; their memory is reclaimed when the function returns.
Memory obtained from malloc (or new) lives on the heap; when it is no longer needed you must explicitly call free (or delete), otherwise you leak memory.
#include <stdlib.h>
#include <string.h>

int a = 0;                   // initialized global: static (data) region
char *p1;                    // uninitialized global: static (bss) region
int main()
{
    int b;                   // stack
    char s[] = "abc";        // stack (the array and its contents)
    char *p2;                // stack
    char *p3 = "123456";     // "123456\0" lives in static (constant) memory; p3 itself is on the stack
                             // p3[0] -> static
    static int c = 0;        // static (initialized) region
    p1 = (char *)malloc(10); // heap
    p2 = (char *)malloc(20); // heap
                             // &p2 -> stack
    strcpy(p1, "123456");    // "123456\0" is in the constant region; the compiler may
                             // merge it with the "123456" that p3 points to
    return 0;
}
Lecture 4: RISC-V I
Big Endian vs. Little Endian
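A small self-check sketch: the first byte in memory of a multi-byte value reveals the machine's byte order (names are illustrative):

#include <stdio.h>

int main(void)
{
    unsigned int x = 0x01020304;
    unsigned char *p = (unsigned char *)&x;  /* look at the lowest-addressed byte */
    /* little endian stores the least significant byte (0x04) first */
    printf("%s endian\n", (p[0] == 0x04) ? "little" : "big");
    return 0;
}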
Lectures 5-7: RISC-V II, III, IV
Introduces how to write RISC-V assembly; see the green reference card for the details.
Lecture 8: RISC-V V
Floating-point representation
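A sketch of pulling apart the IEEE 754 single-precision fields in C; the example value is made up:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
    float f = -6.25f;                         /* = -1.5625 * 2^2 */
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);           /* reinterpret the 32 bits */
    uint32_t sign = bits >> 31;               /* 1 sign bit */
    uint32_t expo = (bits >> 23) & 0xFF;      /* 8 exponent bits, bias 127 */
    uint32_t frac = bits & 0x7FFFFF;          /* 23 fraction bits */
    printf("sign=%u exp=%u (unbiased %d) frac=0x%06X\n",
           sign, expo, (int)expo - 127, frac);
    return 0;
}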
Lecture 9: Call
Interpretation
• The Python interpreter is just a program that reads a Python program and performs the functions of that Python program.
- Interpretation vs. Translation
Compiler
Input: High-Level Language Code
Output: Assembly Language Code
Pseudo-instructions: instructions that the assembler understands but that are not real machine instructions
Assembler
Input: Assembly Language Code
Output: Object Code, information tables
Reads and Uses Directives
Replace Pseudo-instructions
Produce Machine Language
Creates Object File
Assembler Directives
• Give directions to assembler, but do not produce machine instructions
Producing Machine Language
• Simple Case
• What about Branches?
– PC-Relative (e.g., beq/bne and jal)
– So once pseudo-instructions are replaced by real ones, we know by how many instructions to branch
• “Forward Reference” problem: solved by taking two passes over the program (see the sketch below)
First pass remembers position of labels
Second pass uses label positions to generate code
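A toy two-pass sketch in C; the four-"instruction" program and all names are illustrative, not the real assembler:

#include <stdio.h>
#include <string.h>

const char *prog[] = { "beq done", "addi", "done:", "sub" };
enum { N = 4 };

int main(void)
{
    char labels[8][16]; int label_pos[8], nlabels = 0, pc = 0;
    /* pass 1: record the instruction index of every label */
    for (int i = 0; i < N; i++) {
        size_t len = strlen(prog[i]);
        if (prog[i][len-1] == ':') {                 /* a label, not an instruction */
            strncpy(labels[nlabels], prog[i], len-1);
            labels[nlabels][len-1] = '\0';
            label_pos[nlabels++] = pc;               /* marks the next instruction */
        } else pc++;
    }
    /* pass 2: all labels are now known, so forward references can be resolved */
    pc = 0;
    for (int i = 0; i < N; i++) {
        if (prog[i][strlen(prog[i])-1] == ':') continue;
        if (strncmp(prog[i], "beq ", 4) == 0)
            for (int j = 0; j < nlabels; j++)
                if (strcmp(prog[i]+4, labels[j]) == 0)
                    printf("pc %d: branch offset %d\n", pc, label_pos[j] - pc);
        pc++;
    }
    return 0;
}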
Symbol Table
List of “items” in this file that may be used by other files
– Labels: function calling
– Data: anything in the .data section; variables which may be accessed across files
Relocation Table
• List of “items” whose address this file needs
Assembly steps:
Instructions and Labels have addresses
Create relocation table and symbol table
Generate object (.o) file; output binary representation for:
• text segment (instructions)
• data segment (data)
• symbol and relocation tables
Linker
• Input: Object code files, information tables
• Output: Executable code
• Combines several object (.o) files into a single executable (“linking”)
• Enable separate compilation of files
– Changes to one file do not require recompilation of the whole program
• Linux source > 20 M lines of code!
– Old name “Link Editor” from editing the “links” in jump and link instructions
• To resolve references:
– search for reference (data or label) in all “user” symbol tables
– if not found, search library files
– once absolute address is determined, fill in the machine code appropriately
Loader
• Input: Executable Code
• Output: (program is run)
• Executable files are stored on disk
• When one is run, loader’s job is to load it into
memory and start it running
• In reality, loader is the operating system (OS)
– loading is one of the OS tasks
Statically vs. dynamically linked libraries
Key steps at each stage
Compiler
Uses a lexer to process the input into tokens.
Responsible for loop unrolling.
Produces an executable binary from a collection of C source files.
Checks if there are any syntax errors in the C source files.
Resolves #include directives.
Assembler
Prepares the relocation table.
Produces an executable binary from a collection of C source files.
Translates symbolic register names to register numbers.
Linker
Prepares the virtual address space for the static section.
Fills in the final value for the immediate in jump instructions; note that the label being jumped to may exist in a different file than the jump instruction. Incorporates statically-linked libraries.
Produces an executable file containing text and data (plus header).
Produces an executable binary from a collection of C source files.
Relocation
Loader
Copies instructions and data from executable file into the address space.
Runtime symbol resolution
Lecture 10: SDS
- CMOS
Boolean simplification identities
Combining: AB + AB' = A
  Two terms merge into one, eliminating B and B'.
Absorption: A + AB = A
  The shorter term absorbs the longer one.
Consensus: AB + A'C + BC = AB + A'C
  This extends to: AB + A'C + BCD = AB + A'C
Factor elimination: A + A'B = A + B
  The shorter term cancels the complemented literal in the longer term.
  Another way to see it: write A as A(1 + B), so A + A'B = A + AB + A'B = A + (A + A')B = A + B.
Term duplication: based on A + A = A
  Any term may be written an extra time, or multiplied by (A + A'), to enable further simplification.
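These identities can be checked exhaustively. A small C truth-table verification of the consensus identity (illustrative):

#include <stdio.h>

int main(void)
{
    /* verify AB + A'C + BC == AB + A'C over all 8 input combinations */
    for (int a = 0; a <= 1; a++)
        for (int b = 0; b <= 1; b++)
            for (int c = 0; c <= 1; c++) {
                int lhs = (a & b) | (!a & c) | (b & c);
                int rhs = (a & b) | (!a & c);
                if (lhs != rhs) { printf("counterexample found\n"); return 1; }
            }
    printf("AB + A'C + BC == AB + A'C holds for all inputs\n");
    return 0;
}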
Lecture 11: FSM
SDS timing calculations
min clock period = clk-to-q delay + max CL delay + setup time
longest tolerable setup time = min over all input paths of (clock period − input-to-register delay); the path delay includes the register clk-to-q delay
longest tolerable hold time = min output-path delay (including the register clk-to-q delay, i.e., the delay from the register to the output port)
hold constraint: clk-to-q + best-case (shortest) CL delay > hold time
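A worked example with made-up numbers: if clk-to-q = 30 ps, the longest CL path = 45 ps, and setup = 20 ps, then the min clock period = 30 + 45 + 20 = 95 ps. If the shortest CL path is 5 ps, the hold constraint tolerates a hold time of at most 30 + 5 = 35 ps.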
Lecture 12: CPU Control & Datapath
Complete Datapath
Lecture 12: Pipelining
• Pipelining
• Hazards
– Structural
– Data
• R-type instructions
• Load
– Control
Single-cycle performance without pipelining: $t_{clk}=t_{PC,clk-q}+t_{IMEMread}+t_{RFread}+t_{mux}+t_{ALU}+t_{DMEMread}+t_{mux}+t_{RFsetup}$
How to get a speedup of 5 through pipelining (ideal assumptions):
No hazards
No pipeline register delay
All pipeline stages have equal delay
The five pipeline stages:
IF: $t_{clk-q}+t_{MEMread}+t_{Regsetup}$
ID: $t_{clk-q}+t_{RFread}+t_{Regsetup}$
EX: $t_{clk-q}+t_{mux}+t_{ALU}+t_{mux}+t_{Regsetup}$
MEM: $t_{clk-q}+t_{MEMread}+t_{mux}+t_{Regsetup}$
WB: $t_{clk-q}+t_{RFsetup}$
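With pipelining, the clock period is set by the slowest stage, $t_{clk,pipelined}=\max(t_{IF},t_{ID},t_{EX},t_{MEM},t_{WB})$, which is why the ideal 5× speedup requires all five stage delays to be equal.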
Structural hazards occur when more than one instruction needs to use the same datapath resource at the same time. Structural hazards can always be resolved by adding more hardware.
Data hazard solutions:
- stalling
- forwarding
Control hazard solutions:
- prediction
Lecture 13: Superscalar
• Processor Performance - Overview
• Complex Pipelines
• Static Multiple Issues (VLIW)
• Dynamic Multiple Issues (Superscalar)
“Iron Law” of processor performance:
$\frac{\text{Time}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Cycle}}$
CPI: cycles per instruction
Static Multiple Issue
• aka.: Very Long Instruction Word (VLIW)
• Compiler bundles instructions together
• Compiler takes care of hazards
• The CPU executes the bundled instructions at the same time
A fetch buffer sits between stages 1 and 2
Lectures 14-16: Caches
Cache Lecture I
– Caches Introduction
– Principle of Locality
• Temporal Locality (locality in time)
– If a memory location is referenced, then it will tend to be referenced again soon
• Spatial Locality (locality in space)
– If a memory location is referenced, the locations with nearby addresses will tend to be referenced soon
– Simple Cache
– Direct Mapped & Set-Associative Caches
• Cache Lecture II
– Stores to Caches
– Cache Performance
– Cache Misses
– Cache Configurations
– Cache Examples
| Change | Hit time | Miss rate | Miss penalty |
|---|---|---|---|
| Increase associativity | increases | decreases | unchanged |
| Increase number of entries | increases | decreases | unchanged |
| Increase block size | unchanged | decreases | increases |
Matrix multiplication: the fastest loop ordering is jki
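A sketch of the jki ordering, assuming the column-major matrix layout used in the lab (names are illustrative):

/* with jki ordering, the innermost i loop walks down one column of A and C
   with stride 1, so the memory accesses are sequential and cache-friendly */
void matmul_jki(int n, double *A, double *B, double *C)
{
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++) {
            double b = B[k + j*n];            /* reused across the whole i loop */
            for (int i = 0; i < n; i++)
                C[i + j*n] += A[i + k*n] * b;
        }
}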
Lecture 19: DLP
Flynn Taxonomy
Single-Instruction/Single-Data Stream (SISD)
Single-Instruction/Multiple-Data Stream (SIMD, “sim-dee”)
Multiple-Instruction/Multiple-Data Streams (MIMD, “mim-dee”)
Multiple-Instruction/Single-Data Stream (MISD)
Simplified SIMD examples from lab:
int sum_naive( int n, int *a )
{
int sum = 0;
for( int i = 0; i < n; i++ )
sum += a[i];
return sum;
}
int sum_unrolled( int n, int *a )
{
int sum = 0;
/* do the body of the work in a faster unrolled loop */
for( int i = 0; i < n/4*4; i += 4 )
{
sum += a[i+0];
sum += a[i+1];
sum += a[i+2];
sum += a[i+3];
}
/* handle the small tail in a usual way */
for( int i = n/4*4; i < n; i++ )
sum += a[i];
return sum;
}
#include <emmintrin.h>  /* SSE2 intrinsics */

int sum_vectorized( int n, int *a )
{
    __m128i tp = _mm_setzero_si128();                /* four running partial sums */
    for( int i = 0; i < n/4*4; i += 4 )
        tp = _mm_add_epi32( tp, _mm_loadu_si128( (__m128i *)(a+i) ) );
    int z[4], ans;
    _mm_storeu_si128( (__m128i *)z, tp );            /* spill the partial sums */
    ans = z[0] + z[1] + z[2] + z[3];
    for( int i = n/4*4; i < n; i++ )                 /* scalar tail */
        ans += a[i];
    return ans;
}
int sum_vectorized_unrolled( int n, int *a )
{
    __m128i tp = _mm_setzero_si128();
    for( int i = 0; i < n/16*16; i += 16 ) {         /* four vector adds per iteration */
        tp = _mm_add_epi32( tp, _mm_loadu_si128( (__m128i *)(a+i) ) );
        tp = _mm_add_epi32( tp, _mm_loadu_si128( (__m128i *)(a+i+4) ) );
        tp = _mm_add_epi32( tp, _mm_loadu_si128( (__m128i *)(a+i+8) ) );
        tp = _mm_add_epi32( tp, _mm_loadu_si128( (__m128i *)(a+i+12) ) );
    }
    int z[4], ans;
    _mm_storeu_si128( (__m128i *)z, tp );
    ans = z[0] + z[1] + z[2] + z[3];
    for( int i = n/16*16; i < n; i++ )               /* scalar tail */
        ans += a[i];
    return ans;
}
Amdahl’s Law
- Loop unrolling
• Amdahl’s Law: serial sections limit speedup (formula at the end of this summary)
• Flynn Taxonomy
• Intel SSE SIMD Instructions
– Exploit data-level parallelism in loops
– One instruction fetch that operates on multiple
operands simultaneously
– 128-bit XMM registers
• SSE Instructions in C
– Embed the SSE machine instructions directly into C
programs through use of intrinsics
– Achieve efficiency beyond that of optimizing compiler
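The formula behind the Amdahl’s Law bullet above: $\text{Speedup}=\dfrac{1}{(1-F)+F/S}$, where $F$ is the fraction of execution time that is sped up and $S$ is the speedup of that fraction. Even as $S\to\infty$, the overall speedup is capped at $1/(1-F)$.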
Lecture 20: TLP Thread-Level Parallelism
Hardware Multithreading
Operating System Threads
Hyper-threading (simplified)
OpenMP
OpenMP problem: data races
OpenMP locks (too slow)
By default, OpenMP's for directive divides the index range into contiguous chunks, one per thread, which helps spatial locality and cache behavior.
OpenMP examples from lab:
// EDIT THIS FUNCTION PART 1
double dotp_manual_optimized (double *x, double *y, int arr_size)
{
double global_sum = 0.0;
#pragma omp parallel
{
int curr_id = omp_get_thread_num();
int num_threads = omp_get_num_threads();
int start = curr_id * (arr_size / num_threads);
int end = start + (arr_size / num_threads);
double thread_sum = 0.0;
for (int i = start; i < end; i++)
thread_sum += x[i] * y[i];
if (curr_id == num_threads - 1 && end < arr_size){
for (int i = end; i < arr_size; i++)
thread_sum += x[i] * y[i];
}
#pragma omp critical
global_sum += thread_sum;
}
return global_sum;
}
// EDIT THIS FUNCTION PART 2
double dotp_reduction_optimized (double *x, double *y, int arr_size)
{
double global_sum = 0.0;
#pragma omp parallel
{
#pragma omp for reduction (+ : global_sum)
for (int i = 0; i < arr_size; i++)
global_sum += x[i] * y[i];
}
return global_sum;
}
void v_add_for(double* x, double* y, double* z)
{
    #pragma omp parallel
    {
        #pragma omp for
        for(int i=0; i<ARRAY_SIZE; i++)
            z[i] = x[i] + y[i];
    }
}
void v_add_optimized_adjacent (double *x, double *y, double *z)
{
#pragma omp parallel
{
int curr_id = omp_get_thread_num();
int num_threads = omp_get_num_threads();
for(int i = curr_id; i < ARRAY_SIZE; i += num_threads)
z[i] = x[i] + y[i];
}
}
// Edit this function (Method 2)
void v_add_optimized_chunks (double *x, double *y, double *z)
{
#pragma omp parallel
{
int curr_id = omp_get_thread_num();
int num_threads = omp_get_num_threads();
int start = curr_id * (ARRAY_SIZE / num_threads);
int end = start + (ARRAY_SIZE / num_threads);
for (int i = start; i < end; i++)
z[i] = x[i] + y[i];
if (curr_id == num_threads - 1 && end < ARRAY_SIZE){
for (int i = end; i < ARRAY_SIZE; i++)
z[i] = x[i] + y[i];
}
}
}
Lecture 22: OS
• OS Boot Sequence and Operation
• Devices and I/O, interrupt and traps
I/O example (polling): mouse/processor
Interrupt-driven I/O: disk
Polling compared with interrupts
Polling: the CPU periodically queries devices to determine if they need attention
Interrupts: each device signals the CPU when it wants to be serviced
Interrupts are generally better since they utilize the CPU more efficiently
Traps/Interrupts/Exceptions
• Application, Multiprogramming/time-sharing
Lecture 24: VM
- What do we need Virtual Memory for?
Reason 1: Adding Disks to Hierarchy
Reason 2: Simplifying Memory for Apps
Reason 3: Protection Between Processes
Advantages of pages:
- can run programs larger than DRAM
- no external memory fragmentation
TLB
On an access: TLB check → page-table check → (on a page fault) allocate new memory → update the page table → update the TLB
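A minimal C sketch of that flow; the structure, the 4 KiB page size, and the flat page table are assumptions for illustration:

#include <stdint.h>
#include <stdbool.h>

#define PAGE_BITS 12                              /* assumed 4 KiB pages */

typedef struct { uint32_t vpn, ppn; bool valid; } TLBEntry;

uint32_t translate(uint32_t vaddr, TLBEntry tlb[], int tlb_size,
                   uint32_t page_table[])
{
    uint32_t vpn    = vaddr >> PAGE_BITS;
    uint32_t offset = vaddr & ((1u << PAGE_BITS) - 1);
    for (int i = 0; i < tlb_size; i++)            /* 1. TLB check */
        if (tlb[i].valid && tlb[i].vpn == vpn)
            return (tlb[i].ppn << PAGE_BITS) | offset;
    uint32_t ppn = page_table[vpn];               /* 2. page-table check */
    /* 3. on a page fault the OS would allocate a page and update the page
       table; 4. the TLB would then be refilled (both omitted here) */
    return (ppn << PAGE_BITS) | offset;
}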
Some true-or-false statements
- In a bare system without virtual memory, a program can modify any part of the memory. T
- Both the base-and-bound memory system and the paged memory system can run programs that need more memory than DRAM. F (only paging can)
- The TLB should be flushed after a context switch or when any new memory is allocated. F
- Increasing page table size can always lead to reduction in page faults. F
Lecture 24: FPGA
FPGA applications/devices
• Communication devices
Wired and wireless routers and switches
• Automotive applications
Braking systems, traction control, airbag release systems, and cruise-control
applications
• Aerospace applications
Flight-control systems, engine controllers, auto-pilots and passenger in-flight
entertainment systems
• Defense systems
Radar systems, fighter aircraft flight-control systems, radio systems, and missile guidance systems
Embedded System Design (same application domains as listed above)
Some statements:
Embedded Systems are designed to perform predefined functions on certain platforms.
Embedded Systems usually provide limited functions and have fewer components than computer systems.
An ASIC cannot be changed once designed, but it is suitable for high-volume mass production.
FPGA vs others
Lecture 25: WSC
Warehouse Scale Computing
• Request-level Parallelism
e.g. Web search
• Data-level Parallelism
Static web servers use request-level parallelism, while MapReduce uses data-level parallelism.
Inefficient load balancing in a warehouse-scale computer may lead to higher energy consumption.
– MapReduce
– Hadoop, Spark
Two parallelism strategies in WSC: request-level and data-level
Impact on WSC software
• Latency, bandwidth → performance
– Independent data set within an array
– Locality of access within server or rack
• High failure rate → reliability, availability
– Preventing failures is expensive
– Cope with failures gracefully
• Varying workloads → scalability, availability
– Scale up and down gracefully
- MapReduce
Warehouse-Scale Computers (WSCs)
– New class of computers
– Scalability, energy efficiency, high failure rate
• Cloud Computing
– Benefits of WSC computing for third parties
– “Elastic” pay as you go resource allocation
• Request-Level Parallelism
– High request volume, each largely independent of other
– Use replication for better request throughput, availability
• MapReduce Data Parallelism
– Map: Divide large data set into pieces for independent parallel processing
– Reduce: Combine and process intermediate results to obtain final result
– Hadoop, Spark
Lecture 26: Advanced Caches
Advanced Caches:
- MRU is LRU
Reduce the size of the LLC (last-level cache)
The LLC is not monolithic
Lecture 27: I/O: DMA, Disks, Networking
• Direct Memory Access (DMA)
• Disks
• Networking
Shared vs. Switch-Based Networks
Software Protocol to Send and Receive
PIO/DMA
Two ways to move data between disk and memory: PIO mode and DMA mode.
In PIO mode the CPU controls the data transfer between disk and memory: data is read and written through I/O-port instructions executed by the CPU.
In DMA mode the CPU does not take part in the whole transfer; it only issues the command. DMA uses a controller and channels: the CPU simply instructs the DMA controller, which carries out the transfer and reports back to the CPU when it is done, greatly reducing CPU load. The difference from PIO is that DMA does not lean heavily on the CPU and thus saves system resources; the raw transfer speeds of the two modes do not differ dramatically. DMA further splits into Single-Word DMA and Multi-Word DMA, the latter reaching a maximum transfer rate of only 16.6 MB/s.
DMA requests have higher priority than program interrupts; the two differ mainly in how much they disturb the CPU. An interrupt request not only stops the CPU but also makes it execute an interrupt service routine, which involves saving and restoring the breakpoint and processor state as well as performing the CPU-device transfer, so the CPU pays a large cost. A DMA request merely pauses the CPU briefly: no breakpoint or state handling is needed, and the DMA controller manages the device-memory transfer without CPU intervention, borrowing only a little CPU (bus) time. Response time also differs: an interrupt request is usually serviced at the end of the current instruction, whereas a DMA request, for efficiency, can be granted at various points within an instruction's execution, i.e., essentially immediately. DMA is implemented mainly in hardware: the fast device and memory exchange data over the system bus without CPU control. DMA is one of the main ways the I/O system exchanges data with the host, alongside programmed polling and interrupts.
• I/O gives computers their 5 senses
• I/O speed range is 100-million to one
• Polling vs. Interrupts
• DMA to avoid wasting CPU time on data transfers
• Disks for persistent storage, replaced by flash
• Networks: computer-to-computer I/O
– Protocol suites allow networking of heterogeneous components. Abstraction!!!
Lecture 28: Dependability and RAID
Dependability via Redundancy
- Parity / Hamming Error Correction Code
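A minimal Hamming(7,4)-style syndrome check in C (even parity, 1-indexed bit positions; names are illustrative):

#include <stdio.h>

/* bit i of 'code' holds codeword position i+1; parity bits sit at positions 1, 2, 4 */
int syndrome(unsigned code)
{
    int s = 0;
    for (int p = 0; p < 3; p++) {                 /* parity groups 1, 2, 4 */
        int parity = 0;
        for (int pos = 1; pos <= 7; pos++)
            if (pos & (1 << p))
                parity ^= (code >> (pos - 1)) & 1;
        if (parity) s |= 1 << p;                  /* syndrome = error position */
    }
    return s;
}

int main(void)
{
    unsigned code = 0;                            /* all-zero valid codeword */
    code ^= 1u << (5 - 1);                        /* flip the bit at position 5 */
    printf("error at position %d\n", syndrome(code));   /* prints 5 */
    return 0;
}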
RAID
1. What is a RAID disk array?
RAID stands for Redundant Array of Independent Disks. Simply put, RAID combines multiple independent physical disks in different ways into one disk group (a logical disk), providing higher storage performance than a single disk plus data-backup capability. The data is split into many segments that are stored on the individual disks. Using parity checking, the array can still read all the data when any single disk fails, and during reconstruction the data is recomputed and written onto the new disk.
RAID functions
Improve I/O capability through parallel disk reads and writes
Improve durability through disk-redundancy algorithms
Ways to implement RAID
External disk array: adapter capability provided through an expansion card
Internal RAID: a RAID controller integrated on the motherboard, configured in the BIOS before the OS is installed
Software RAID: implemented by the OS
2. RAID levels
2.1 RAID 0
RAID 0 stripes data contiguously in units of bits or bytes and reads/writes in parallel across multiple disks, giving a very high data transfer rate, but it has no data redundancy.
RAID 0 only improves performance and provides no guarantee of data reliability; the failure of one disk affects all of the data.
RAID 0 must not be used where data safety requirements are high.
2.2 RAID 1
Achieves data redundancy through disk mirroring: pairs of independent disks hold mutual backups of the data.
When the original data is busy, reads can be served directly from the mirror copy, so RAID 1 can improve read performance.
RAID 1 has the highest cost per unit of capacity of any RAID level, but it provides very high data safety and availability. When a disk fails, the system automatically switches reads and writes to the mirror disk, with no need to reconstruct the failed data.
2.3 RAID 5
An array of N (N >= 3) disks; each stripe holds N-1 data blocks plus 1 parity block, and these N blocks are rotated evenly across the N disks.
All N disks read and write simultaneously, so read performance is high, but the parity mechanism makes write performance comparatively low.
Disk utilization: (N-1)/N
High reliability: one disk may fail without affecting any of the data.
(Compare RAID 3, which employs byte-level striping with a single dedicated parity disk and is inefficient at detecting errors.)
2.4 RAID 6
An array of N (N >= 4) disks; disk utilization (N-2)/N.
Compared with RAID 5, RAID 6 adds a second independent parity block.
The two independent parity systems use different algorithms, so data remains usable even if two disks fail at the same time.
RAID 6 suffers a larger "write penalty" than RAID 5, so its write performance is worse.
2.5 RAID 10
RAID 10 is really RAID 1+0.
N (even, N >= 4) disks are mirrored in pairs, and the pairs are then combined into a RAID 0.
Disk utilization: N/2
N/2 disks can write simultaneously; N disks can read simultaneously.
High performance and high reliability.
| RAID level | Disks required | Fault tolerance | Read/write performance |
|---|---|---|---|
| RAID 0 | at least 1 | none | high read and write |
| RAID 1 | N (even) | yes | high read, lower write |
| RAID 5 | N >= 3 | yes (at most one failed disk) | high read and write |
| RAID 10 | N >= 4 (even) | yes (at most one failed disk per mirror pair) | high read and write |
Lecture 29: Security
Heartbleed
Flush+Reload
Flash memory can be used as a disk cache to a magnetic hard disk.
Meltdown
covers: Virtual Memory; Protection Levels; Instruction Pipelining; Out-of-order Execution; Speculative Execution; CPU Caching.
Keywords: cache, timing, speculative execution, memory paging, OS pages, data of another process (4 of those keywords are enough).
We want to read from OS pages that hold data of other processes. Those pages are mapped into our virtual memory space so that the OS can work with them quickly, but our process is forbidden from reading them. If we try to read them, the OS checks (with an “if”) whether we are allowed and returns with a page fault. But speculative execution will nevertheless cause those pages to be read (the process just doesn't see the architectural result).
Spectre
• Every part of a computer can be vulnerable
– Be vigilant and attentive in designing and
programming
• Security and privacy have different presences
– More than DoS, DDoS, virus, Trojan, ransomware,
spyware, and phishing emails
• The challenges for a computer architect
– To rule out any possibility of vulnerabilities
– To achieve both high performance and security