Lecture 1: Introduction
• Thinking about Machine Structures
• Great Ideas in Computer Architecture
• What you need to know about this class
• Everything is a Number
Lecture 2: C I
• Compile vs. Interpret
• Number representation in different bases
Binary representations: 1. Unsigned Integers; 2. Two's Complement (invert all bits, then add 1); 3. One's Complement (invert all bits); 4. Sign Magnitude (only the most significant bit encodes the sign)
C Preprocessor (CPP) Macros
#define MAG(x, y) (sqrt( (x)*(x) + (y)*(y)))
Parenthesize every macro parameter (and the whole body); otherwise operator precedence can break the expansion, as sketched below.
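A minimal sketch of why the parentheses matter; MAG_BAD is a hypothetical unparenthesized variant, not from the lecture:

#include <stdio.h>
#include <math.h>

#define MAG(x, y)     (sqrt( (x)*(x) + (y)*(y) ))  /* fully parenthesized */
#define MAG_BAD(x, y) (sqrt( x*x + y*y ))          /* hypothetical: no parentheses */

int main(void)
{
    printf("%f\n", MAG(1+2, 0));     /* expands to sqrt((1+2)*(1+2) + 0*0) = 3.0 */
    printf("%f\n", MAG_BAD(1+2, 0)); /* expands to sqrt(1+2*1+2 + 0*0) = sqrt(5) */
    return 0;
}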
Lecture 3: C II
• Pointers
• Pointers & Arrays
• C Memory Management
• C Bugs
Pointer categories
- Null pointer: its value is NULL (0)
- Dangling pointer: points to memory that has already been freed
- Wild pointer: never initialized (never assigned)
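A minimal C sketch of the three categories (illustrative only; dereferencing any of them is undefined behavior):

#include <stdlib.h>

int main(void)
{
    int *null_ptr = NULL;                    /* null pointer: value is NULL (0) */
    int *wild_ptr;                           /* wild pointer: never initialized */
    int *dangling_ptr = malloc(sizeof(int));
    free(dangling_ptr);                      /* dangling: its memory was freed */
    return 0;
}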
C Memory Management
Static memory vs. dynamic memory
Static memory persists for the whole run of the program and is never freed; its size is fixed once allocated, and the allocation happens at compile/link time.
Dynamic memory is allocated and freed while the program runs, as the program needs it, and its size can vary.
Heap vs. stack
The heap is allocated by calling malloc (or new) and released by calling free (or delete). The stack is a linear structure; the heap is managed as a linked structure.
Memory allocation in C
Global variables and static variables are stored in the static memory region.
Local variables defined inside a function are stored on the stack; their memory is reclaimed when the function returns.
Memory obtained from malloc (or new) lives on the heap; when it is no longer needed you must explicitly call free (or delete), otherwise you leak memory.
#include <stdlib.h>
#include <string.h>

int a = 0;                   // initialized global: static (data) region
char *p1;                    // uninitialized global: static (bss) region
int main()
{
    int b;                   // stack
    char s[] = "abc";        // stack (the array and its contents)
    char *p2;                // stack
    char *p3 = "123456";     // "123456\0" lives in static (constant) memory; p3 itself is on the stack
                             // p3[0] -> static
    static int c = 0;        // static (initialized) region
    p1 = (char *)malloc(10); // heap
    p2 = (char *)malloc(20); // heap
                             // &p2 -> stack
    strcpy(p1, "123456");    // "123456\0" is in the constant region; the compiler may
                             // merge it with the "123456" that p3 points to
    return 0;
}
Lecture 4: RISC-V I
Big Endian vs. Little Endian
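A small self-check sketch: the first byte in memory of a multi-byte value reveals the machine's byte order (names are illustrative):

#include <stdio.h>

int main(void)
{
    unsigned int x = 0x01020304;
    unsigned char *p = (unsigned char *)&x;  /* look at the lowest-addressed byte */
    /* little endian stores the least significant byte (0x04) first */
    printf("%s endian\n", (p[0] == 0x04) ? "little" : "big");
    return 0;
}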
Lectures 5-7: RISC-V II, III, IV
Introduces how to write RISC-V assembly; see the green reference card for the details.
Lecture 8: RISC-V V
Floating-point representation
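A sketch of pulling apart the IEEE 754 single-precision fields in C; the example value is made up:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
    float f = -6.25f;                         /* = -1.5625 * 2^2 */
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);           /* reinterpret the 32 bits */
    uint32_t sign = bits >> 31;               /* 1 sign bit */
    uint32_t expo = (bits >> 23) & 0xFF;      /* 8 exponent bits, bias 127 */
    uint32_t frac = bits & 0x7FFFFF;          /* 23 fraction bits */
    printf("sign=%u exp=%u (unbiased %d) frac=0x%06X\n",
           sign, expo, (int)expo - 127, frac);
    return 0;
}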
Lecture 9: Call
Interpretation
• The Python interpreter is just a program that reads a Python program and performs the functions of that Python program.
- Interpretation vs. Translation
Compiler
Input: High-Level Language Code
Output: Assembly Language Code
Pseudo-instructions: instructions that the assembler understands but that are not real machine instructions
Assembler
Input: Assembly Language Code
Output: Object Code, information tables
Reads and Uses Directives
Replace Pseudo-instructions
Produce Machine Language
Creates Object File
Assembler Directives
• Give directions to assembler, but do not produce machine instructions
Producing Machine Language
• Simple Case
• What about Branches?
– PC-Relative (e.g., beq/bne and jal)
– So once pseudo-instructions are replaced by real ones, we know by how many instructions to branch
• “Forward Reference” problem: solved by taking two passes over the program (see the sketch below)
First pass remembers position of labels
Second pass uses label positions to generate code
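A toy two-pass sketch in C; the four-"instruction" program and all names are illustrative, not the real assembler:

#include <stdio.h>
#include <string.h>

const char *prog[] = { "beq done", "addi", "done:", "sub" };
enum { N = 4 };

int main(void)
{
    char labels[8][16]; int label_pos[8], nlabels = 0, pc = 0;
    /* pass 1: record the instruction index of every label */
    for (int i = 0; i < N; i++) {
        size_t len = strlen(prog[i]);
        if (prog[i][len-1] == ':') {                 /* a label, not an instruction */
            strncpy(labels[nlabels], prog[i], len-1);
            labels[nlabels][len-1] = '\0';
            label_pos[nlabels++] = pc;               /* marks the next instruction */
        } else pc++;
    }
    /* pass 2: all labels are now known, so forward references can be resolved */
    pc = 0;
    for (int i = 0; i < N; i++) {
        if (prog[i][strlen(prog[i])-1] == ':') continue;
        if (strncmp(prog[i], "beq ", 4) == 0)
            for (int j = 0; j < nlabels; j++)
                if (strcmp(prog[i]+4, labels[j]) == 0)
                    printf("pc %d: branch offset %d\n", pc, label_pos[j] - pc);
        pc++;
    }
    return 0;
}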
Symbol Table
List of “items” in this file that may be used by other files
– Labels: function calling
– Data: anything in the .data section; variables which may be accessed across files
Relocation Table
• List of “items” whose address this file needs
Assembly steps:
Instructions and Labels have addresses
Create relocation table and symbol table
Generate object (.o) file; output binary representation for:
• text segment (instructions)
• data segment (data)
• symbol and relocation tables
Linker
• Input: Object code files, information tables
• Output: Executable code
• Combines several object (.o) files into a single executable (“linking”)
• Enable separate compilation of files
– Changes to one file do not require recompilation of the whole program
• Linux source > 20 M lines of code!
– Old name “Link Editor” from editing the “links” in jump and link instructions
• To resolve references:
– search for reference (data or label) in all “user” symbol tables
– if not found, search library files
– once absolute address is determined, fill in the machine code appropriately
Loader
• Input: Executable Code
• Output: (program is run)
• Executable files are stored on disk
• When one is run, loader’s job is to load it into
memory and start it running
• In reality, loader is the operating system (OS)
– loading is one of the OS tasks
Statically vs. dynamically linked libraries
Key steps at each stage
Compiler
Uses a lexer to process the input into tokens.
Responsible for loop unrolling.
Produces an executable binary from a collection of C source files.
Checks if there are any syntax errors in the C source files.
Resolves #include directives.
Assembler
Prepares the relocation table.
Produces an executable binary from a collection of C source files.
Translates symbolic register names to register numbers.
Linker
Prepares the virtual address space for the static section.
Fills in the final value for the immediate in jump instructions; note that the label being jumped to may exist in a different file than the jump instruction. Incorporates statically-linked libraries.
Produces an executable file containing text and data (plus header).
Produces an executable binary from a collection of C source files.
Relocation
Loader
Copies instructions and data from executable file into the address space.
Runtime symbol resolution
Lecture 10: SDS
- CMOS
Boolean simplification identities
Combining: AB + AB' = A
  Two terms merge into one, eliminating B and B'.
Absorption: A + AB = A
  The shorter term absorbs the longer one.
Consensus: AB + A'C + BC = AB + A'C
  This extends to: AB + A'C + BCD = AB + A'C
Factor elimination: A + A'B = A + B
  The shorter term cancels the complemented literal in the longer term.
  Another way to see it: write A as A(1 + B), so A + A'B = A + AB + A'B = A + (A + A')B = A + B.
Term duplication: based on A + A = A
  Any term may be written an extra time, or multiplied by (A + A'), to enable further simplification.
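These identities can be checked exhaustively. A small C truth-table verification of the consensus identity (illustrative):

#include <stdio.h>

int main(void)
{
    /* verify AB + A'C + BC == AB + A'C over all 8 input combinations */
    for (int a = 0; a <= 1; a++)
        for (int b = 0; b <= 1; b++)
            for (int c = 0; c <= 1; c++) {
                int lhs = (a & b) | (!a & c) | (b & c);
                int rhs = (a & b) | (!a & c);
                if (lhs != rhs) { printf("counterexample found\n"); return 1; }
            }
    printf("AB + A'C + BC == AB + A'C holds for all inputs\n");
    return 0;
}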
Lecture 11: FSM
SDS timing calculations
min clock period = clk-to-q delay + max CL delay + setup time
longest tolerable setup time = min over all input paths of (clock period − input-to-register delay); the path delay includes the register clk-to-q delay
longest tolerable hold time = min output-path delay (including the register clk-to-q delay, i.e., the delay from the register to the output port)
hold constraint: clk-to-q + best-case (shortest) CL delay > hold time
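A worked example with made-up numbers: if clk-to-q = 30 ps, the longest CL path = 45 ps, and setup = 20 ps, then the min clock period = 30 + 45 + 20 = 95 ps. If the shortest CL path is 5 ps, the hold constraint tolerates a hold time of at most 30 + 5 = 35 ps.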
Lecture 12: CPU Control & Datapath
Complete Datapath
Lecture 12: Pipelining
• Pipelining
• Hazards
– Structural
– Data
• R-type instructions
• Load
– Control
Single-cycle performance without pipelining: $t_{clk}=t_{PC,clk-q}+t_{IMEMread}+t_{RFread}+t_{mux}+t_{ALU}+t_{DMEMread}+t_{mux}+t_{RFsetup}$
How to get a speedup of 5 through pipelining (ideal assumptions):
No hazards
No pipeline register delay
All pipeline stages have equal delay
The five pipeline stages:
IF: $t_{clk-q}+t_{MEMread}+t_{Regsetup}$
ID: $t_{clk-q}+t_{RFread}+t_{Regsetup}$
EX: $t_{clk-q}+t_{mux}+t_{ALU}+t_{mux}+t_{Regsetup}$
MEM: $t_{clk-q}+t_{MEMread}+t_{mux}+t_{Regsetup}$
WB: $t_{clk-q}+t_{RFsetup}$
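With pipelining, the clock period is set by the slowest stage, $t_{clk,pipelined}=\max(t_{IF},t_{ID},t_{EX},t_{MEM},t_{WB})$, which is why the ideal 5× speedup requires all five stage delays to be equal.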
Structural hazards occur when more than one instruction needs to use the same datapath resource at the same time. Structural hazards can always be resolved by adding more hardware.
Data hazard solutions:
- stalling
- forwarding
Control hazard solutions:
- prediction
Lecture 13: Superscalar
• Processor Performance - Overview
• Complex Pipelines
• Static Multiple Issues (VLIW)
• Dynamic Multiple Issues (Superscalar)
“Iron Law” of processor performance:
$\frac{\text{Time}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Cycle}}$
CPI: cycles per instruction
Static Multiple Issue
• aka.: Very Long Instruction Word (VLIW)
• Compiler bundles instructions together
• Compiler takes care of hazards
• The CPU executes the bundled instructions at the same time
A fetch buffer sits between stages 1 and 2
Lectures 14-16: Caches
Cache Lecture I
– Caches Introduction
– Principle of Locality
• Temporal Locality (locality in time)
– If a memory location is referenced, then it will tend to be referenced again soon
• Spatial Locality (locality in space)
– If a memory location is referenced, the locations with nearby addresses will tend to be referenced soon
– Simple Cache
– Direct Mapped & Set-Associative Caches
• Cache Lecture II
– Stores to Caches
– Cache Performance
– Cache Misses
– Cache Configurations
– Cache Examples
| Change | Hit time | Miss rate | Miss penalty |
|---|---|---|---|
| Increase associativity | increases | decreases | unchanged |
| Increase number of entries | increases | decreases | unchanged |
| Increase block size | unchanged | decreases | increases |
Matrix multiplication: the fastest loop ordering is jki
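A sketch of the jki ordering, assuming the column-major matrix layout used in the lab (names are illustrative):

/* with jki ordering, the innermost i loop walks down one column of A and C
   with stride 1, so the memory accesses are sequential and cache-friendly */
void matmul_jki(int n, double *A, double *B, double *C)
{
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++) {
            double b = B[k + j*n];            /* reused across the whole i loop */
            for (int i = 0; i < n; i++)
                C[i + j*n] += A[i + k*n] * b;
        }
}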
Lecture 19: DLP
Flynn Taxonomy
Single-Instruction/Single-Data Stream (SISD)
Single-Instruction/Multiple-Data Stream (SIMD, “sim-dee”)
Multiple-Instruction/Multiple-Data Streams (MIMD, “mim-dee”)
Multiple-Instruction/Single-Data Stream (MISD)
Simplified SIMD examples from lab:
int sum_naive( int n, int *a )
{
int sum = 0;
for( int i = 0; i < n; i++ )
sum += a[i];
return sum;
}
int sum_unrolled( int n, int *a )
{
int sum = 0;
/* do the body of the work in a faster unrolled loop */
for( int i = 0; i < n/4*4; i += 4 )
{
sum += a[i+0];
sum += a[i+1];
sum += a[i+2];
sum += a[i+3];
}
/* handle the small tail in a usual way */
for( int i = n/4*4; i < n; i++ )
sum += a[i];
return sum;
}
#include <emmintrin.h>  /* SSE2 intrinsics */

int sum_vectorized( int n, int *a )
{
    __m128i tp = _mm_setzero_si128();                /* four running partial sums */
    for( int i = 0; i < n/4*4; i += 4 )
        tp = _mm_add_epi32( tp, _mm_loadu_si128( (__m128i *)(a+i) ) );
    int z[4], ans;
    _mm_storeu_si128( (__m128i *)z, tp );            /* spill the partial sums */
    ans = z[0] + z[1] + z[2] + z[3];
    for( int i = n/4*4; i < n; i++ )                 /* scalar tail */
        ans += a[i];
    return ans;
}
int sum_vectorized_unrolled( int n, int *a )
{
    __m128i tp = _mm_setzero_si128();
    for( int i = 0; i < n/16*16; i += 16 ) {         /* four vector adds per iteration */
        tp = _mm_add_epi32( tp, _mm_loadu_si128( (__m128i *)(a+i) ) );
        tp = _mm_add_epi32( tp, _mm_loadu_si128( (__m128i *)(a+i+4) ) );
        tp = _mm_add_epi32( tp, _mm_loadu_si128( (__m128i *)(a+i+8) ) );
        tp = _mm_add_epi32( tp, _mm_loadu_si128( (__m128i *)(a+i+12) ) );
    }
    int z[4], ans;
    _mm_storeu_si128( (__m128i *)z, tp );
    ans = z[0] + z[1] + z[2] + z[3];
    for( int i = n/16*16; i < n; i++ )               /* scalar tail */
        ans += a[i];
    return ans;
}
Amdahl’s Law
- Loop unrolling
• Amdahl’s Law: serial sections limit speedup (formula at the end of this summary)
• Flynn Taxonomy
• Intel SSE SIMD Instructions
– Exploit data-level parallelism in loops
– One instruction fetch that operates on multiple
operands simultaneously
– 128-bit XMM registers
• SSE Instructions in C
– Embed the SSE machine instructions directly into C
programs through use of intrinsics
– Achieve efficiency beyond that of optimizing compiler
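The formula behind the Amdahl’s Law bullet above: $\text{Speedup}=\dfrac{1}{(1-F)+F/S}$, where $F$ is the fraction of execution time that is sped up and $S$ is the speedup of that fraction. Even as $S\to\infty$, the overall speedup is capped at $1/(1-F)$.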
Lecture 20: TLP Thread-Level Parallelism
Hardware Multithreading
Operating System Threads
Hyper-threading (simplified)
OpenMP
OpenMP problem: data races
OpenMP locks (too slow)
By default, OpenMP's for directive divides the index range into contiguous chunks, one per thread, which helps spatial locality and cache behavior.
OpenMP examples from lab:
// EDIT THIS FUNCTION PART 1
double dotp_manual_optimized (double *x, double *y, int arr_size)
{
double global_sum = 0.0;
#pragma omp parallel
{
int curr_id = omp_get_thread_num();
int num_threads = omp_get_num_threads();
int start = curr_id * (arr_size / num_threads);
int end = start + (arr_size / num_threads);
double thread_sum = 0.0;
for (int i = start; i < end; i++)
thread_sum += x[i] * y[i];
if (curr_id == num_threads - 1 && end < arr_size){
for (int i = end; i < arr_size; i++)
thread_sum += x[i] * y[i];
}
#pragma omp critical
global_sum += thread_sum;
}
return global_sum;
}
// EDIT THIS FUNCTION PART 2
double dotp_reduction_optimized (double *x, double *y, int arr_size)
{
double global_sum = 0.0;
#pragma omp parallel
{
#pragma omp for reduction (+ : global_sum)
for (int i = 0; i < arr_size; i++)
global_sum += x[i] * y[i];
}
return global_sum;
}
void v_add_for(double* x, double* y, double* z)
{
    #pragma omp parallel
    {
        #pragma omp for
        for(int i=0; i<ARRAY_SIZE; i++)
            z[i] = x[i] + y[i];
    }
}
void v_add_optimized_adjacent (double *x, double *y, double *z)
{
#pragma omp parallel
{
int curr_id = omp_get_thread_num();
int num_threads = omp_get_num_threads();
for(int i = curr_id; i < ARRAY_SIZE; i += num_threads)
z[i] = x[i] + y[i];
}
}
// Edit this function (Method 2)
void v_add_optimized_chunks (double *x, double *y, double *z)
{
#pragma omp parallel
{
int curr_id = omp_get_thread_num();
int num_threads = omp_get_num_threads();
int start = curr_id * (ARRAY_SIZE / num_threads);
int end = start + (ARRAY_SIZE / num_threads);
for (int i = start; i < end; i++)
z[i] = x[i] + y[i];
if (curr_id == num_threads - 1 && end < ARRAY_SIZE){
for (int i = end; i < ARRAY_SIZE; i++)
z[i] = x[i] + y[i];
}
}
}
Lecture 22: OS
• OS Boot Sequence and Operation
• Devices and I/O, interrupt and traps
I/O example (polling): mouse/processor
Interrupt-driven I/O: disk
Polling compared with interrupts
Polling: the CPU periodically queries devices to determine if they need attention
Interrupts: each device signals the CPU when it wants to be serviced
Interrupts are generally better since they utilize the CPU more efficiently
Traps/Interrupts/Exceptions
• Application, Multiprogramming/time-sharing
Lecture 24: VM
- What do we need Virtual Memory for?
Reason 1: Adding Disks to Hierarchy
Reason 2: Simplifying Memory for Apps
Reason 3: Protection Between Processes
Advantages of pages:
- can run programs larger than DRAM
- no external memory fragmentation
TLB
On an access: TLB check → page-table check → (on a page fault) allocate new memory → update the page table → update the TLB
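A minimal C sketch of that flow; the structure, the 4 KiB page size, and the flat page table are assumptions for illustration:

#include <stdint.h>
#include <stdbool.h>

#define PAGE_BITS 12                              /* assumed 4 KiB pages */

typedef struct { uint32_t vpn, ppn; bool valid; } TLBEntry;

uint32_t translate(uint32_t vaddr, TLBEntry tlb[], int tlb_size,
                   uint32_t page_table[])
{
    uint32_t vpn    = vaddr >> PAGE_BITS;
    uint32_t offset = vaddr & ((1u << PAGE_BITS) - 1);
    for (int i = 0; i < tlb_size; i++)            /* 1. TLB check */
        if (tlb[i].valid && tlb[i].vpn == vpn)
            return (tlb[i].ppn << PAGE_BITS) | offset;
    uint32_t ppn = page_table[vpn];               /* 2. page-table check */
    /* 3. on a page fault the OS would allocate a page and update the page
       table; 4. the TLB would then be refilled (both omitted here) */
    return (ppn << PAGE_BITS) | offset;
}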
Some true-or-false statements
- In a bare system without virtual memory, a program can modify any part of the memory. T
- Both the base-and-bound memory system and the paged memory system can run programs that need more memory than DRAM. F (only paging can)
- The TLB should be flushed after a context switch or when any new memory is allocated. F
- Increasing page table size can always lead to reduction in page faults. F
Lecture 24: FPGA
FPGA applications/devices
• Communication devices
Wired and wireless routers and switches
• Automotive applications
Braking systems, traction control, airbag release systems, and cruise-control
applications
• Aerospace applications
Flight-control systems, engine controllers, auto-pilots and passenger in-flight
entertainment systems
• Defense systems
Radar systems, fighter aircraft flight-control systems, radio systems, and missile guidance systems
Embedded System Design (same application domains as listed above)
Some statements:
Embedded Systems are designed to perform predefined functions on certain platforms.
Embedded Systems usually provide limited functions and have fewer components than computer systems.
An ASIC cannot be changed once designed, but it is suitable for high-volume mass production.
FPGA vs others
Lecture 25: WSC
Warehouse Scale Computing
• Request-level Parallelism
e.g. Web search
• Data-level Parallelism
Static web servers use request-level parallelism, while MapReduce uses data-level parallelism.
Inefficient load balancing in a warehouse-scale computer may lead to higher energy consumption.
– MapReduce
– Hadoop, Spark
Two parallelism strategies in WSC: request-level and data-level
Impact on WSC software
• Latency, bandwidth → performance
– Independent data set within an array
– Locality of access within server or rack
• High failure rate → reliability, availability
– Preventing failures is expensive
– Cope with failures gracefully
• Varying workloads → scalability, availability
– Scale up and down gracefully
- MapReduce
Warehouse-Scale Computers (WSCs)
– New class of computers
– Scalability, energy efficiency, high failure rate
• Cloud Computing
– Benefits of WSC computing for third parties
– “Elastic” pay as you go resource allocation
• Request-Level Parallelism
– High request volume, each largely independent of other
– Use replication for better request throughput, availability
• MapReduce Data Parallelism
– Map: Divide large data set into pieces for independent parallel processing
– Reduce: Combine and process intermediate results to obtain final result
– Hadoop, Spark
Lecture 26: Advanced Caches
Advanced Caches:
- MRU is LRU
Reduce the size of the LLC (last-level cache)
The LLC is not monolithic
Lecture 27: I/O: DMA, Disks, Networking
• Direct Memory Access (DMA)
• Disks
• Networking
Shared vs. Switch-Based Networks
Software Protocol to Send and Receive
PIO/DMA
Two ways to move data between disk and memory: PIO mode and DMA mode.
In PIO mode the CPU controls the data transfer between disk and memory: data is read and written through I/O-port instructions executed by the CPU.
In DMA mode the CPU does not take part in the whole transfer; it only issues the command. DMA uses a controller and channels: the CPU simply instructs the DMA controller, which carries out the transfer and reports back to the CPU when it is done, greatly reducing CPU load. The difference from PIO is that DMA does not lean heavily on the CPU and thus saves system resources; the raw transfer speeds of the two modes do not differ dramatically. DMA further splits into Single-Word DMA and Multi-Word DMA, the latter reaching a maximum transfer rate of only 16.6 MB/s.
DMA requests have higher priority than program interrupts; the two differ mainly in how much they disturb the CPU. An interrupt request not only stops the CPU but also makes it execute an interrupt service routine, which involves saving and restoring the breakpoint and processor state as well as performing the CPU-device transfer, so the CPU pays a large cost. A DMA request merely pauses the CPU briefly: no breakpoint or state handling is needed, and the DMA controller manages the device-memory transfer without CPU intervention, borrowing only a little CPU (bus) time. Response time also differs: an interrupt request is usually serviced at the end of the current instruction, whereas a DMA request, for efficiency, can be granted at various points within an instruction's execution, i.e., essentially immediately. DMA is implemented mainly in hardware: the fast device and memory exchange data over the system bus without CPU control. DMA is one of the main ways the I/O system exchanges data with the host, alongside programmed polling and interrupts.
• I/O gives computers their 5 senses
• I/O speed range is 100-million to one
• Polling vs. Interrupts
• DMA to avoid wasting CPU time on data transfers
• Disks for persistent storage, replaced by flash
• Networks: computer-to-computer I/O
– Protocol suites allow networking of heterogeneous components. Abstraction!!!
Lecture 28: Dependability and RAID
Dependability via Redundancy
- Parity / Hamming Error Correction Code
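A minimal Hamming(7,4)-style syndrome check in C (even parity, 1-indexed bit positions; names are illustrative):

#include <stdio.h>

/* bit i of 'code' holds codeword position i+1; parity bits sit at positions 1, 2, 4 */
int syndrome(unsigned code)
{
    int s = 0;
    for (int p = 0; p < 3; p++) {                 /* parity groups 1, 2, 4 */
        int parity = 0;
        for (int pos = 1; pos <= 7; pos++)
            if (pos & (1 << p))
                parity ^= (code >> (pos - 1)) & 1;
        if (parity) s |= 1 << p;                  /* syndrome = error position */
    }
    return s;
}

int main(void)
{
    unsigned code = 0;                            /* all-zero valid codeword */
    code ^= 1u << (5 - 1);                        /* flip the bit at position 5 */
    printf("error at position %d\n", syndrome(code));   /* prints 5 */
    return 0;
}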
RAID
1. What is a RAID disk array?
RAID stands for Redundant Array of Independent Disks. Simply put, RAID combines multiple independent physical disks in different ways into one disk group (a logical disk), providing higher storage performance than a single disk plus data-backup capability. The data is split into many segments that are stored on the individual disks. Using parity checking, the array can still read all the data when any single disk fails, and during reconstruction the data is recomputed and written onto the new disk.
RAID functions
Improve I/O capability through parallel disk reads and writes
Improve durability through disk-redundancy algorithms
Ways to implement RAID
External disk array: adapter capability provided through an expansion card
Internal RAID: a RAID controller integrated on the motherboard, configured in the BIOS before the OS is installed
Software RAID: implemented by the OS
2. RAID levels
2.1 RAID 0
RAID 0 stripes data contiguously in units of bits or bytes and reads/writes in parallel across multiple disks, giving a very high data transfer rate, but it has no data redundancy.
RAID 0 only improves performance and provides no guarantee of data reliability; the failure of one disk affects all of the data.
RAID 0 must not be used where data safety requirements are high.
2.2 RAID 1
Achieves data redundancy through disk mirroring: pairs of independent disks hold mutual backups of the data.
When the original data is busy, reads can be served directly from the mirror copy, so RAID 1 can improve read performance.
RAID 1 has the highest cost per unit of capacity of any RAID level, but it provides very high data safety and availability. When a disk fails, the system automatically switches reads and writes to the mirror disk, with no need to reconstruct the failed data.
2.3 RAID 5
An array of N (N >= 3) disks; each stripe holds N-1 data blocks plus 1 parity block, and these N blocks are rotated evenly across the N disks.
All N disks read and write simultaneously, so read performance is high, but the parity mechanism makes write performance comparatively low.
Disk utilization: (N-1)/N
High reliability: one disk may fail without affecting any of the data.
(Compare RAID 3, which employs byte-level striping with a single dedicated parity disk and is inefficient at detecting errors.)
2.4 RAID 6
An array of N (N >= 4) disks; disk utilization (N-2)/N.
Compared with RAID 5, RAID 6 adds a second independent parity block.
The two independent parity systems use different algorithms, so data remains usable even if two disks fail at the same time.
RAID 6 suffers a larger "write penalty" than RAID 5, so its write performance is worse.
2.5 RAID 10
RAID 10 is really RAID 1+0.
N (even, N >= 4) disks are mirrored in pairs, and the pairs are then combined into a RAID 0.
Disk utilization: N/2
N/2 disks can write simultaneously; N disks can read simultaneously.
High performance and high reliability.
| RAID level | Disks required | Fault tolerance | Read/write performance |
|---|---|---|---|
| RAID 0 | at least 1 | none | high read and write |
| RAID 1 | N (even) | yes | high read, lower write |
| RAID 5 | N >= 3 | yes (at most one failed disk) | high read and write |
| RAID 10 | N >= 4 (even) | yes (at most one failed disk per mirror pair) | high read and write |
Lecture 29: Security
Heartbleed
Flush+Reload
Flash memory can be used as a disk cache to a magnetic hard disk.
Meltdown
covers: Virtual Memory; Protection Levels; Instruction Pipelining; Out-of-order Execution; Speculative Execution; CPU Caching.
Keywords: cache, timing, speculative execution, memory paging, OS pages, data of another process (4 of those keywords are enough).
We want to read from OS pages that hold data of other processes. Those pages are mapped into our virtual memory space so that the OS can work with them quickly, but our process is forbidden from reading them. If we try to read them, the OS checks (with an “if”) whether we are allowed and returns with a page fault. But speculative execution will nevertheless cause those pages to be read (the process just doesn't see the architectural result).
Spectre
• Every part of a computer can be vulnerable
– Be vigilant and attentive in designing and
programming
• Security and privacy have different presences
– More than DoS, DDoS, virus, Trojan, ransomware,
spyware, and phishing emails
• The challenges for a computer architect
– To rule out any possibility of vulnerabilities
– To achieve both high performance and security