Floating-Point Register

Floating point

Larry D. Pyeatt, William Ughetta, in ARM 64-Bit Assembly Language, 2020

9.5 Data move instructions

With the addition of all of the FP registers, there are many more possibilities for how data can be moved. There are many more registers, and FP registers may be 32 or 64 bits. This results in several combinations for moving data among all of the registers. The FP instruction set includes instructions for moving data between two FP registers, between FP and integer registers, and between the various system registers.

9.5.1 Moving between data registers

The most basic movement instruction involving FP registers simply moves data between two floating point registers, or moves data between an FP register and an integer register. The instruction is:

fmov

Move Between Data Registers.

9.5.1.1 Syntax

The two registers specified must be the same size.

The notation Vn.D[1] refers to the top 64 bits of register Vn.
9.5.1.2 Operations
Name Result Description
fmov Fd ← Fn Move Fn to Fd
9.5.1.3 Examples
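
As a minimal sketch of the two register-to-register forms (an illustration assuming a GCC toolchain on an AArch64 host, not the book's own listing):

#include <stdio.h>

int main(void)
{
  double a = 3.5, b;
  long bits;
  /* fmov Dd, Dn: copy between two 64-bit FP registers */
  __asm__("fmov %d0, %d1" : "=w"(b) : "w"(a));
  /* fmov Xd, Dn: copy the raw bits of an FP register to an integer register */
  __asm__("fmov %0, %d1" : "=r"(bits) : "w"(a));
  printf("b = %f, bits = 0x%016lx\n", b, bits);  /* b = 3.500000, bits = 0x400c000000000000 */
  return 0;
}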

9.5.2 Floating point move immediate

The FP/NEON instruction set provides an instruction for moving an immediate value into a register, but there are some restrictions on what the immediate value can be. The instruction is:

fmov

Floating Point Move Immediate.

9.5.2.1 Syntax

The floating point constant, fpimm, may be specified as a decimal number such as 1.0.

The floating point value must be expressible as ±(n ÷ 16) × 2^r, where n and r are integers such that 16 ≤ n ≤ 31 and −3 ≤ r ≤ 4.

The floating point number will be stored as a normalized binary floating point encoding with 1 sign bit, 4 bits of fraction, and a 3-bit exponent (see Chapter 8, Section 8.7).
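
As a worked illustration (a sketch assuming a standard C toolchain with -lm; not part of the original text), a brute-force check of the constraint shows that 1.0 is expressible (n = 16, r = 0) while 0.0 is not, since n ≥ 16 forces a nonzero magnitude:

#include <math.h>
#include <stdbool.h>
#include <stdio.h>

/* Brute force: a value fits the fmov immediate form iff it equals
   +/-(n/16) * 2^r for integers 16 <= n <= 31 and -3 <= r <= 4. */
static bool is_fmov_immediate(double x)
{
  x = fabs(x);
  for (int r = -3; r <= 4; r++)
    for (int n = 16; n <= 31; n++)
      if (x == ldexp(n / 16.0, r))  /* ldexp(m, r) computes m * 2^r */
        return true;
  return false;
}

int main(void)
{
  printf("%d %d %d\n",
         is_fmov_immediate(1.0),   /* 1: n = 16, r = 0 */
         is_fmov_immediate(3.25),  /* 1: n = 26, r = 1 */
         is_fmov_immediate(0.0));  /* 0: not encodable */
  return 0;
}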

Note that this encoding does not include the value 0.0; however, this value may be loaded by other means, such as an fmov from the integer zero register (fmov Dd, xzr).
9.5.2.2 Operations
Name Result Description
fmov Fd ← fpimm Move Immediate Data to Fd
9.5.2.3 Examples
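
As a minimal sketch (again assuming GCC inline assembly on an AArch64 host; an illustration, not the book's listing):

#include <stdio.h>

static double load_half(void)
{
  double d;
  /* 0.5 = (16/16) * 2^-1, so it is a legal fpimm */
  __asm__("fmov %d0, #0.5" : "=w"(d));
  return d;
}

static double load_zero(void)
{
  double d;
  /* 0.0 is not a legal fpimm; move it from the integer zero register instead */
  __asm__("fmov %d0, xzr" : "=w"(d));
  return d;
}

int main(void)
{
  printf("%f %f\n", load_half(), load_zero());
  return 0;
}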

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B978012819221400016X

Embedded Software in Real-Time Signal Processing Systems: Design Technologies

GERT GOOSSENS, ... MEMBER, IEEE, in Readings in Hardware/Software Co-Design, 2002

2 Data Routing

The above-mentioned extension of graph coloring toward heterogeneous register structures has been applied to general-purpose processors, which typically have a few register classes (e.g., floating-point registers, fixed-point registers, and address registers). DSP and ASIP architectures often have a strongly heterogeneous register structure with many special-purpose registers.

In this context, more specialized register allocation techniques have been developed, often referred to as data routing techniques. To transfer data between functional units via intermediate registers, specific routes may have to be followed. The selection of the most appropriate route is nontrivial. In some cases indirect routes may have to be followed, requiring the insertion of extra register-transfer operations. Therefore an efficient mechanism for phase coupling between register allocation and scheduling becomes essential [73].

As an illustration, Fig. 12 shows a number of alternative solutions for the multiplication operand of the symmetrical FIR filter application, implemented on the ADSP-21xx processor (see Fig. 8).

Fig. 12. Three alternative register allocations for the multiplication operand in the symmetrical FIR filter. The route followed is indicated in bold: (a) storage in AR, (b) storage in AR followed by MX, and (c) spilling to data memory DM. The last two alternatives require the insertion of extra register transfers.

Several techniques have been presented for data routing in compilers for embedded processors. A first approach is to determine the required data routes during the execution of the scheduling algorithm. This approach was first applied in the Bulldog compiler for VLIW machines [18], and subsequently adapted in compilers for embedded processors like the RL compiler [48] and CBC [74]. In order to prevent a combinatorial explosion of the problem, these methods only contain local, greedy search techniques to determine data routes. The approach typically lacks the ability to identify good candidate values for spilling to memory.

A global data routing technique has been proposed in the Chess compiler [75]. This method supports many different schemes to route values between functional units. It starts from an unordered description, but may introduce a partial ordering of operations to reduce the number of overlapping live ranges. The algorithm is based on branch-and-bound searches to insert new data moves, to introduce partial orderings, and to select candidate values for spilling. Phase coupling with scheduling is supported by the use of probabilistic scheduling estimators during the register allocation process.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9781558607026500399

Architecture

Sarah L. Harris, David Harris, in Digital Design and Computer Architecture, 2022

6.6.4 Floating-Point Instructions

The RISC-V architecture defines optional floating-point extensions called RVF, RVD, and RVQ for operating on single-, double-, and quad-precision floating-point numbers, respectively. RVF/D/Q define 32 floating-point registers, f0 to f31, with a width of 32, 64, or 128 bits, respectively. When a processor implements multiple floating-point extensions, it uses the lower part of the floating-point register for lower-precision instructions. f0 to f31 are separate from the program (also called integer) registers, x0 to x31. As with program registers, floating-point registers are reserved for certain purposes by convention, as given in Table 6.7.

Table 6.7. RISC-V floating-point register set

Name Register Number Use
ft0–7 f0–7 Temporary variables
fs0–1 f8–9 Saved variables
fa0–1 f10–11 Function arguments/Return values
fa2–7 f12–17 Function arguments
fs2–11 f18–27 Saved variables
ft8–11 f28–31 Temporary variables

Table B.3 in Appendix B lists all of the floating-point instructions. Computation and comparison instructions use the same mnemonics for all precisions, with .s, .d, or .q appended at the end to indicate precision. For example, fadd.s, fadd.d, and fadd.q perform single-, double-, and quad-precision addition, respectively. Other floating-point instructions include fsub, fmul, fdiv, fsqrt, fmadd (multiply-add), and fmin. Memory accesses use separate instructions for each precision. Loads are flw, fld, and flq, and stores are fsw, fsd, and fsq.

Floating-point instructions use R-, I-, and S-type formats, as well as a new format, the R4-type instruction format (see Figure B.1 in Appendix B). This format is needed for multiply-add instructions, which use four register operands; for example, fmadd.s ft0, ft1, ft2, ft3 computes ft0 = (ft1 × ft2) + ft3. Code Example 6.31 modifies Code Example 6.21 to operate on an array of single-precision floating-point scores. The changes are in bold.

Code Example 6.31

Using a for Loop to Access an Array of Floats

High-Level Code

int i;

float scores[200];

for (i = 0; i < 200; i = i + 1)

  scores[i] = scores[i] + 10;

RISC-V Assembly Code

# s0 = scores base address, s1 = i
  addi   s1, zero, 0    # i = 0
  addi   t2, zero, 200  # t2 = 200
  addi   t3, zero, 10   # t3 = 10
  fcvt.s.w ft0, t3      # ft0 = 10.0
for:
  bge    s1, t2, done   # if i >= 200 then done
  slli   t3, s1, 2      # t3 = i * 4
  add    t3, t3, s0     # address of scores[i]
  flw    ft1, 0(t3)     # ft1 = scores[i]
  fadd.s ft1, ft1, ft0  # ft1 = scores[i] + 10
  fsw    ft1, 0(t3)     # scores[i] = ft1
  addi   s1, s1, 1      # i = i + 1
  j      for            # repeat
done:

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780128200643000064

Operating Systems Overview

Peter Barry, Patrick Crowley, in Modern Embedded Computing, 2012

Task Context

Each task or thread has a context store; the context store keeps all the task-specific data for the task. The kernel scheduler will save and restore the task state on a context switch. The task's context is stored in a Task Control Block in VxWorks; the equivalent in Linux is the struct task_struct.

The Task Control Block in VxWorks contains the following elements, which are saved and restored on each context switch (a sketch of a possible layout follows the list):

The task program/instruction counter.

Virtual memory context for tasks within a process if enabled.

CPU registers for the task.

Non-core CPU registers, such as SSE registers/floating-point registers, are saved/restored based on use of the registers by a thread. It is prudent for an RTOS to minimize the data it must save and restore for each context switch to minimize the context switch times.

Task program stack storage.

I/O assignments for standard input/output and error. As in Linux, a task's/process's output is directed to the standard console for input and output, but the file handles can be redirected to a file.

A delay timer, to postpone the task's availability to run.

A time slice timer (more on that later in the scheduling section).

Kernel structures.

Signal handlers (for C library signals such as divide by zero).

Task environment variables.

Errno—the C library error number set by some C library functions such as strtod().

Debugging and performance monitoring values.
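
As a rough sketch of what such a context store might hold (the field names are hypothetical and do not reproduce the real VxWorks TCB or the Linux struct task_struct):

#include <stdint.h>

typedef struct task_context {
  uintptr_t pc;             /* task program/instruction counter             */
  uintptr_t sp;             /* task stack pointer                           */
  uintptr_t regs[16];       /* core CPU registers                           */
  uint8_t   fp_state[512];  /* FP/SSE state, saved only if the task used it */
  int       fp_used;        /* flag enabling lazy FP save/restore           */
  void     *stack_base;     /* task program stack storage                   */
  int       std_fds[3];     /* I/O assignments: stdin, stdout, stderr       */
  long      delay_ticks;    /* delay timer                                  */
  long      timeslice;      /* time slice timer                             */
  int       err_no;         /* per-task C library errno                     */
} task_context_t;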

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780123914903000072

Architecture

David Money Harris, Sarah L. Harris, in Digital Design and Computer Architecture (Second Edition), 2013

6.7.4 Floating-Point Instructions

The MIPS architecture defines an optional floating-point coprocessor, known as coprocessor 1. In early MIPS implementations, the floating-point coprocessor was a separate chip that users could purchase if they needed fast floating-point math. In most recent MIPS implementations, the floating-point coprocessor is built in alongside the main processor.

MIPS defines thirty-two 32-bit floating-point registers, $f0–$f31. These are separate from the ordinary registers used so far. MIPS supports both single- and double-precision IEEE floating-point arithmetic. Double-precision (64-bit) numbers are stored in pairs of 32-bit registers, so only the 16 even-numbered registers ($f0, $f2, $f4, …, $f30) are used to specify double-precision operations. By convention, certain registers are reserved for certain purposes, as given in Table 6.8.

Table 6.8. MIPS floating-point register set

Name Number Use
$fv0–$fv1 0, 2 function return value
$ft0–$ft3 4, 6, 8, 10 temporary variables
$fa0–$fa1 12, 14 function arguments
$ft4–$ft5 16, 18 temporary variables
$fs0–$fs5 20, 22, 24, 26, 28, 30 saved variables

Floating-point instructions all have an opcode of 17 (10001₂). They require both a funct field and a cop (coprocessor) field to indicate the type of instruction. Hence, MIPS defines the F-type instruction format for floating-point instructions, shown in Figure 6.35. Floating-point instructions come in both single- and double-precision flavors. cop = 16 (10000₂) for single-precision instructions or 17 (10001₂) for double-precision instructions. Like R-type instructions, F-type instructions have two source operands, fs and ft, and one destination, fd.

Figure 6.35. F-type machine instruction format

Instruction precision is indicated by .s and .d in the mnemonic. Floating-point arithmetic instructions include addition (add.s, add.d), subtraction (sub.s, sub.d), multiplication (mul.s, mul.d), and division (div.s, div.d) as well as negation (neg.s, neg.d) and absolute value (abs.s, abs.d).

Floating-point branches have two parts. First, a compare instruction is used to set or clear the floating-point condition flag (fpcond). Then, a conditional branch checks the value of the flag. The compare instructions include equality (c.seq.s/c.seq.d), less than (c.lt.s/c.lt.d), and less than or equal to (c.le.s/c.le.d). The conditional branch instructions are bc1f and bc1t, which branch if fpcond is FALSE or TRUE, respectively. For example, c.lt.s $f0, $f2 followed by bc1t label branches to label if $f0 < $f2. Inequality, greater than or equal to, and greater than comparisons are performed with seq, lt, and le, followed by bc1f.

Floating-point registers are loaded and stored from memory using lwc1 and swc1. These instructions move 32 bits, so two are necessary to handle a double-precision number.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780123944245000069

Device architectures

David Kaeli, ... Dong Ping Zhang, in Heterogeneous Computing with OpenCL 2.0, 2015

Server CPUs

Intel's Itanium architecture and its more successful successors (the latest being the Itanium 9500) represent an interesting endeavor to make a mainstream server processor based on VLIW techniques [6]. The Itanium architecture includes a large number of registers (128 integer and 128 floating-point registers). It uses a VLIW approach known as EPIC, in which instructions are stored in 128-bit, three-instruction bundles. The CPU fetches four instruction bundles per cycle from its L1 cache and can hence execute 12 instructions per clock cycle. The processor is designed to be efficiently combined into multicore and multisocket servers.

The goal of EPIC is to move the problem of exploiting parallelism from runtime to compile time. It does this by feeding back information from execution traces into the compiler. It is the job of the compiler to package instructions into the VLIW/EPIC packets, and as a result, performance on the architecture is highly dependent on compiler capability. To assist with this, numerous execution masks, dependence flags between bundles, prefetch instructions, speculative loads, and rotating register files are built into the architecture. To improve the throughput of the processor, the latest Itanium microarchitectures have included SMT, with the Itanium 9500 supporting independent front-end and back-end pipeline execution.

The SPARC T-series family (Figure 2.9), originally from Sun and under continuing development at Oracle, takes a throughput computing multithreaded approach to server workloads [7]. Workloads on many servers, particularly transactional and Web workloads, are often heavily multithreaded, with a large number of lightweight integer threads using the memory system. The UltraSPARC Tx and later SPARC Tx CPUs are designed to efficiently execute a large number of threads to maximize overall work throughput with minimal power consumption. Each of the cores is designed to be simple and efficient, with no out-of-order execution logic, until the SPARC T4. Within a core, the focus on thread-level parallelism is immediately apparent, as it can interleave operations from eight threads with only a dual issue pipeline. This design shows a clear preference for latency hiding and simplicity of logic compared with the mainstream x86 designs. The simpler design of the SPARC cores allows up to 16 cores per processor in the SPARC T5.

Figure 2.9. The Niagara 2 CPU from Sun/Oracle. The design intends to make a high level of threading efficient. Note its relative similarity to the GPU design seen in Figure 2.8. Given enough threads, we can cover all memory access time with useful compute, without extracting instruction-level parallelism (ILP) through complicated hardware techniques.

To support many active threads, the SPARC architecture requires multiple sets of registers, but as a trade-off requires less speculative register storage than a superscalar design. In addition, coprocessors allow acceleration of cryptographic operations, and an on-chip Ethernet controller improves network throughput.

As mentioned previously, the latest generations, the SPARC T4 and T5, back off slightly from the earlier multithreading design. Each CPU core supports out-of-order execution and can switch to a single-thread mode where a single thread can use all of the resources that previously had to be dedicated to multiple threads. In this sense, these SPARC architectures are becoming closer to other modern SMT designs such as those from Intel.

Server chips, in general, try to maximize parallelism at the cost of some single-threaded performance. As opposed to desktop chips, more area is devoted to supporting quick transitions between thread contexts. When wide-issue logic is present, as in the Itanium processors, it relies on help from the compiler to recognize instruction-level parallelism.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780128014141000028

Multicore and data-level optimization

Jason D. Bakos, in Embedded Systems, 2016

2.10.1 ARM11 VFP short vector instructions

The ARMv6 VFP instruction set offers SIMD instructions through a feature called short vector instructions, in which the programmer can specify a vector width and stride field through the floating-point status and control register (FPSCR). Setting the FPSCR will cause all the thread's subsequently issued floating-point instructions to perform the number of operations and access the registers using a stride as defined in the FPSCR. Note that VFP short vector instructions are not supported by ARMv7 processors. Attempting to change the vector width or stride on a NEON-equipped processor will trigger an invalid instruction exception.

The 32 floating-point VFP registers are arranged in four banks of eight registers each (four registers each if using double precision). Each bank can be used as a short vector when performing short vector instructions. The first bank, registers s0-s7 (or d0-d3), will be used as scalars in a short vector instruction when specified as the second input operand. For instance, when the vector width is 8, the fadds s16,s8,s0 instruction will add each element of the vector held in registers s8-s15 to the scalar held in s0 and store the result vector in registers s16-s23.

The fmrx and fmxr instructions allow the programmer to read and write the FPSCR register. The latency of the fmrx instruction is two cycles and the latency of the fmxr instruction is four cycles. The vector width is stored in FPSCR bits 18:16 and is encoded such that values 0 through 7 specify lengths 1-8.

When writing to the FPSCR register you must be careful to modify only the bits you intend to change and leave the others alone. To do this, you must first read the existing value using the fmrx instruction, change bits 18:16, and then write the value back using the fmxr instruction.

Be sure to change the length back to its default value of one after the kernel, since the compiler will not do this automatically, and any compiler-generated floating-point code can potentially be adversely affected by the change to the FPSCR.

You can use the following function to change the length field in the FPSCR:

void set_fpscr_reg (unsigned char len) {
  unsigned int fpscr;
  /* read the current FPSCR value */
  asm("fmrx %[val], fpscr\n\t" : [val]"=r"(fpscr));
  len = len - 1;                    /* lengths 1-8 are encoded as 0-7 */
  fpscr = fpscr & ~(0x7<<16);       /* clear bits 18:16 */
  fpscr = fpscr | ((len&0x7)<<16);  /* insert the new length field */
  /* write the modified value back */
  asm("fmxr fpscr, %[val]\n\t" : : [val]"r"(fpscr));
}

To maximize the benefit of the short vector instructions, target the maximum vector size of 8 by unrolling the outer loop by 8. In the original assembly implementation, each fmacs instruction is followed by a dependent fmacs instruction two instructions later. To fully cover the 8-cycle latency of all the fmacs instructions, use each fmacs instruction to perform its operations for 8 loop iterations.

In other words, unroll the outer loop to calculate eight polynomial values on each iteration and use short vector instructions of length 8 for each instruction. Since the fmacs instruction adds the value in its Fd register, the code requires the ability to load copies of each coefficient into each of the four Fd registers. To make this easier, re-write your coefficient array so each coefficient is replicated eight times:

float coeff[64] = {1.2,1.2,1.2,1.2,1.2,1.2,1.2,1.2,
                   1.4,1.4,1.4,1.4,1.4,1.4,1.4,1.4,…
                   2.6,2.6,2.6,2.6,2.6,2.6,2.6,2.6};

Change the short vector length to 8 and unroll the outer loop by 8, then change the iteration step in the outer loop to 4:

set_fpscr_reg (8);

for (i=0;i<N/4;i+=8) {

Now load the first coefficient into a scalar register and eight values of the x array into vector register s15:s8:

asm("flds s0, %[mem]\n\t" : : [mem]"m" (coeff[0]) : "s0");

asm("fldmias %[mem],{s8,s9,s10,s11,s12,s13,s14,s15}\n\t" : :

[mem]"r"(&x[i]) : "s8", "s9", "s10", "s11", "s12", "s13", "s14", "s15");

Next load eight copies of the second coefficient into vector register s23:s16 and perform our first fmacs by multiplying the x vector by the first coefficient and adding the result to the second coefficient, leaving the running sum in vector register s23:s16:

asm("fldmias %[mem],{s16,s17,s18,s19,s20,s21,s22,s23}\n\t": :

  [mem]"r"(&coeff[8]) :

  "s16", "s17", "s18", "s19", "s20", "s21", "s22", "s23");

asm("fmacs s16, s8, s0\n\t" : : :

  "s16", "s17", "s18", "s19", "s20", "s21", "s22", "s23");

Now repeat this procedure, but now swapping the vector registers s23:s16 with s31:s24:

asm("fldmias %[mem],{s24,s25,s26,s27,s28,s29,s30,s31}\n\t": :

  [mem]"r"(&coeff[16]) :

  "s24", "s25", "s26", "s27", "s28", "s29", "s30", "s31");

asm("fmacs s24, s8, s16\n\t" : : :

  "s24", "s25", "s26", "s27", "s28", "s29", "s30", "s31");

Now repeat these last two steps two more times. End with the following code:

asm("fldmias %[mem],{s16,s17,s18,s19,s20,s21,s22,s23}\n\t": :

  [mem]"r"(&coeff[56]) :

  "s16", "s17", "s18", "s19", "s20", "s21", "s22", "s23");

asm("fmacs s16, s8, s24\n\t" : : :

  "s16", "s17", "s18", "s19", "s20", "s21", "s22", "s23");

asm("fstmias %[mem],{s16,s17,s18,s19,s20,s21,s22,s23}\n\t" : :

  [mem]"r" (&d[i]));

Be sure to reset the short vector length to 1 after the outer loop:

set_fpscr_reg (1);

Table 2.4 shows the resulting performance improvement on the Raspberry Pi relative to the software pipelined implementation. The use of scheduled SIMD instructions provides a 37% performance improvement over software pipelining. This optimization increases CPI because each eight-way SIMD instruction requires eight cycles to issue, but comes with a larger relative decrease in instructions per flop (the product of the CPI slowdown and the instructions-per-flop speedup gives a total speedup of 1.36).

Table 2.4. Performance Improvement from Short Vector Instructions Versus Software Pipelining

Platform Raspberry Pi
CPU ARM11
Throughput/efficiency 1.37 speedup
55.2% efficiency
CPI 0.43 speedup (slowdown)
Cache miss rate 1.89 speedup
Instructions per flop 3.17 speedup

Another benefit of this optimization is the reduction in cache miss rate due to the SIMD load and store instructions.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B978012800342800002X

Management of Cache Contents

Bruce Jacob, ... David T. Wang, in Memory Systems, 2008

3.3.1 Combined Approaches to Partitioning

Several examples of partitioning revolve around the PlayDoh architecture from Hewlett-Packard Labs.

HPL-PD, PlayDoh v1.ane — Full general Compages

One content-management mechanism in which the hardware and software cooperate in interesting ways is the HPL PlayDoh architecture, renamed the HPL-PD architecture, embodied in the EPIC line of processors [Kathail et al. 2000]. Two facets of the memory system are exposed to the programmer and compiler through instruction-set hooks: (1) the memory-system structure and (2) the memory disambiguation scheme.

The HPL-PD architecture exposes its view or definition of the memory organization, shown in Figure 3.36, to the programmer and compiler. The instruction-set architecture is aware of four components in the memory system: the L1 and L2 caches, an L1 streaming or data-prefetch cache (which sits next to the L1 cache), and main memory. The exact organization of each structure is not exposed to the architecture. As with other mechanisms that have placed separately managed buffers adjacent to the L1 cache, the explicit goal of the streaming/prefetch cache is to partition data into disjoint sets: (1) data that exhibits temporal locality and should reside in the L1 cache, and (2) everything else (e.g., data that exhibits only spatial locality), which should reside in the streaming cache.

FIGURE 3.36. The memory system defined by the HPL-PD architecture. Each component in the memory system is shown with the assembly-code instruction modifier used by a load or store instruction to specify that component. The L1 cache is called C1, the streaming or prefetch cache is V1, the L2 cache is C2, and the main memory is C3.

To manage data movement in this hierarchy, the instruction set provides several modifiers for the standard set of load and store instructions.

Load instructions have two modifiers:

1.

A latency and source cache specifier hints to the hardware where the data is expected to be found (i.e., the L1 cache, the streaming cache, the L2 cache, main memory) and also specifies to the hardware the compiler's assumed latency for scheduling this particular load instruction. In machine implementations that require rigid timing (e.g., traditional VLIW), the hardware must stall if the data is not available with this latency; in machine implementations that have dynamic scheduling around cache misses (e.g., a superscalar implementation of the architecture), the hardware can ignore the value.

2.

A target cache specifier indicates to hardware where the load data should be placed within the memory system (i.e., place it in the L1 cache, place it in the streaming cache, bring it no higher than the L2 cache, or leave it in main memory). Note that all loads specify a target register, but the target register may be r0, a read-only bit-bucket in both the general-purpose and floating-point register files, providing a de facto form of non-binding prefetch. Presumably the processor core communicates the binding/non-binding status to the memory system to avoid useless bus activity.

Store instructions have one modifier:

1.

The target cache specifier, like that for load instructions, indicates to the hardware the highest component in the memory system in which the store data should be retained. A store instruction's ultimate target is main memory, and the instruction can leave a copy in the cache system if the compiler recognizes that the value will be reused soon, or can specify main memory as the highest level if the compiler expects no immediate reuse for the data.

Abraham's Profile-Directed Partitioning

Abraham describes a compiler mechanism to exploit the PlayDoh facility [Abraham et al. 1993]. At first glance, the authors note that it seems to offer too few choices to be of much use: a compiler can only distinguish between short-latency loads (expected to be found in L1), long-latency loads (expected in L2), and very long-latency loads (in main memory). A simple cache-performance analysis of a blocked matrix multiply shows that all loads have relatively low miss rates, which would suggest using the expectation of short latencies to schedule all load instructions.

However, the authors show that by loop peeling one can do much better. Loop peeling is a relatively simple compiler transformation that extracts a specific iteration of a loop and moves it outside the loop body. This increases code size (the loop body is replicated), but it opens up new possibilities for scheduling. In particular, keeping in mind the facilities offered by the HPL-PD instruction set, many loops display the following behavior: the first iteration of the loop makes (perhaps numerous) data references that miss the cache; the main body of the loop enjoys reasonable cache hit rates; and the last iteration of the loop has high hit rates, but it represents the last time the data will be used.

The HPL-PD transformation of the loop peels off the first and last iterations (a generic sketch follows the list):

The first iteration of the loop uses load instructions that specify main memory as the likely source cache; the store instructions target the L1 cache.

The body of the loop uses load instructions that specify the L1 cache as the likely source; the store instructions also target the L1 cache.

The last iteration of the loop uses load instructions that specify the L1 cache as the likely source; the store instructions target main memory.
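
A generic sketch of the peeling itself (plain C, with the intended cache specifiers noted in comments; this is illustrative, not HPL-PD code):

/* Peel the first and last iterations so that a compiler could attach
   different source/target cache specifiers to each region. */
void scale(float *a, int n, float k)
{
  if (n < 3) {                      /* degenerate cases: nothing to peel */
    for (int i = 0; i < n; i++)
      a[i] *= k;
    return;
  }
  a[0] *= k;                        /* first iteration: loads likely miss
                                       (source: main memory); stores target L1 */
  for (int i = 1; i < n - 1; i++)   /* body: loads likely hit (source: L1);
                                       stores target L1 */
    a[i] *= k;
  a[n - 1] *= k;                    /* last iteration: last use of the data;
                                       stores can target main memory */
}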

The authors note that such a transformation is easily automated for regular codes, but irregular codes present a difficult challenge. The focus of the Abraham et al. study is to quantify the predictability of memory access in irregular applications. The study finds that, in most programs, a very small number of load instructions cause the majority of cache misses. This is encouraging because if those instructions can be identified at compile time, they can be optimized by hand or perhaps by a compiler.

Hardware/Software Memory Disambiguation

The HPL-PD's memory disambiguation scheme comes from the memory conflict buffer in William Chen's Ph.D. thesis [1993]. The hardware provides to the software a mechanism that can detect and patch up memory conflicts, provided that the software identifies loads that are risky and then follows each up with an explicit invocation of a hardware check. The compiler/programmer can exploit the scheme to speculatively issue loads ahead of when it is safe to issue them, or it can ignore the scheme. The scheme by definition requires the cooperation of software and hardware to reap any benefits. The point of the scheme is to enable the compiler to improve its scheduling of code for which compile-time analysis of pointer addresses is not possible. For example, the following code uses pointer addresses in registers a1, a2, a3, and a4 that cannot be guaranteed to be conflict free:

The code has the following conservative schedule (assuming 2-cycle load latencies—equivalent to a 1-cycle load-use penalty, as in separate EX and MEM pipeline stages in an in-order pipe—and 1-cycle latencies for all else):

A better schedule would be the following, which moves the second load instruction ahead of the first store:

If we assume two memory ports, the following schedule is slightly better:

However, the compiler cannot guarantee the safety of this code, because it cannot guarantee that a3 and a2 will contain different values at run time. Chen's solution, used in HPL-PD, is for the compiler to inform the hardware that a particular load is risky. This allows the hardware to make note of that load and to compare its run-time address to stores that follow it. The scheme also relies upon the compiler to perform a post-verification that can patch up errors if it turns out that there was indeed a conflict created by aggressively scheduling the load ahead of the store.

The scheme centers around the LDS log, a record of speculatively issued load instructions that maintains in each of its entries the target register of the load and the memory address that the load uses. There are two types of instructions that the compiler uses to manage the log's state, and store instructions affect its state implicitly (a toy software model of the log follows the list):

1.

LDS instructions are load-speculative instructions that explicitly allocate a new entry in the log (recall that an entry contains the target register and memory address). On executing an LDS instruction, the hardware creates a new entry and invalidates any old entries that have the same target register.

2.

Store instructions modify the log implicitly. On executing a store, the hardware checks the log for a live entry with the same memory address and deletes any entries that match.

3.

LDV instructions are load-verification instructions that must be placed conservatively in the code (after a potentially aliasing store instruction). They check to see if there was a conflict between the speculative load and the store. On executing an LDV instruction, the hardware checks the log for a valid entry with the matching target register. If an entry exists, the instruction can be treated as a NOP; if no entry matches, the LDV is treated as a load instruction (it computes a memory address, fetches the datum from memory, and places it into the target register).
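
The hardware does this associatively, but as a toy software model of the log's semantics (an illustrative sketch, not part of the original text):

#include <stdbool.h>
#include <stdint.h>

#define LOG_SIZE 8

/* Each entry pairs a load's target register with the address it used. */
struct lds_entry { bool valid; int target_reg; uintptr_t addr; };
static struct lds_entry lds_log[LOG_SIZE];

/* LDS: allocate an entry, invalidating old entries with the same target. */
void lds(int target_reg, uintptr_t addr)
{
  int free_slot = -1;
  for (int i = 0; i < LOG_SIZE; i++) {
    if (lds_log[i].valid && lds_log[i].target_reg == target_reg)
      lds_log[i].valid = false;
    if (!lds_log[i].valid)
      free_slot = i;
  }
  if (free_slot >= 0)
    lds_log[free_slot] = (struct lds_entry){ true, target_reg, addr };
}

/* Store: implicitly delete any live entry whose address matches. */
void store_update(uintptr_t addr)
{
  for (int i = 0; i < LOG_SIZE; i++)
    if (lds_log[i].valid && lds_log[i].addr == addr)
      lds_log[i].valid = false;
}

/* LDV: false means the entry survived (treat the LDV as a NOP);
   true means a conflicting store occurred and the load must be redone. */
bool ldv(int target_reg)
{
  for (int i = 0; i < LOG_SIZE; i++)
    if (lds_log[i].valid && lds_log[i].target_reg == target_reg)
      return false;
  return true;
}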

The example code becomes the following, where the second LD instruction is replaced by an LDS/LDV pair:

The compiler can schedule the LDS instruction aggressively, keeping the matching LDV instruction in the conservative spot behind the store instruction (note that in HPL-PD, memory operations are prioritized left to right, so the LDV operation is technically "behind" the ST).

If we assume two memory ports, there is not much to be gained, because the LDV must be scheduled to happen after the potentially aliasing ST (store) instruction, which would yield effectively the same schedule as above. To address this type of issue (as well as many similar scenarios) the architecture also provides a BRDV instruction, a post-verification instruction similar to LDV that, instead of loading data, branches to a specified location on detection of a memory conflict. This instruction is used in conjunction with compiler-generated patch-up code to handle more complex scenarios. For instance, the following could be used for implementations with a single memory port:

The following can be used with multiple memory ports:

where the patch-up code is given as follows:

Using the BRDV instruction, the compiler can achieve optimal scheduling.

There are a number of issues that the HPL-PD mechanism must handle. For example, the hardware must ensure that no virtual-address aliases can cause problems (e.g., different virtual addresses that map to the same physical address, if the operating system supports this). The hardware must also handle partial overwrites, for example, a write instruction that writes a single byte to a four-byte word that was previously read speculatively (the addresses would not necessarily match). The compiler must ensure that every LDS is followed by a matching LDV that uses the same target register and address register (for obvious reasons), and the compiler also must ensure that no intervening operations disturb the log or the target register. The LDV instruction must block until complete to achieve effectively single-cycle latencies.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780123797513500059

EXCEPTION AND INTERRUPT HANDLING

ANDREW N. SLOSS, ... CHRIS WRIGHT, in ARM System Developer's Guide, 2004

9.3.2 NESTED INTERRUPT HANDLER

A nested interrupt handler allows for another interrupt to occur within the currently called handler. This is accomplished by reenabling the interrupts before the handler has fully serviced the current interrupt.

For a real-time system this feature increases the complexity of the system but also improves its performance. The additional complexity introduces the possibility of subtle timing issues that can cause a system failure, and these subtle problems can be extremely hard to resolve. A nested interrupt method is designed carefully so as to avoid these types of issues. This is achieved by protecting the context restoration from interruption, so that the next interrupt will not fill the stack (cause stack overflow) or corrupt any of the registers.

The first goal of any nested interrupt handler is to respond to interrupts quickly, so the handler neither waits for asynchronous exceptions nor forces them to wait for the handler. The second goal is that execution of regular synchronous code is not delayed while servicing the various interrupts.

The increase in complexity means that the designers have to balance efficiency with safety, by using a defensive coding style that assumes problems will occur. The handler has to check the stack and protect against register corruption where possible.

Figure 9.9 shows a nested interrupt handler. As can be seen from the diagram, the handler is quite a bit more complicated than the simple nonnested interrupt handler described in Section 9.3.1.

Figure 9.9. Nested interrupt handler.

The nested interrupt handler entry code is identical to the simple nonnested interrupt handler, except that on exit, the handler tests a flag that is updated by the ISR. The flag indicates whether further processing is required. If further processing is not required, then the interrupt service routine is complete and the handler can exit. If further processing is required, the handler may take several actions: reenabling interrupts and/or performing a context switch.

Reenabling interrupts involves switching out of IRQ mode to either SVC or system mode. Interrupts cannot simply be reenabled when in IRQ mode because this would lead to possible link register r14_irq corruption, particularly if an interrupt occurred after the execution of a BL instruction. This problem will be discussed in more detail in Section 9.3.3.

Performing a context switch involves flattening (emptying) the IRQ stack because the handler does not perform a context switch while there is data on the IRQ stack. All registers saved on the IRQ stack must be transferred to the task's stack, typically the SVC stack. The remaining registers must then be saved on the task stack. They are transferred to a reserved block of memory on the stack called a stack frame.

EXAMPLE 9.9

This nested interrupt handler example is based on the flow diagram in Figure 9.9. The rest of this section will walk through the handler and describe in detail the various stages.

This example uses a stack frame structure. All registers are saved onto the frame except for the stack register r13. The order of the registers is unimportant except that FRAME_LR and FRAME_PC should be the last two registers in the frame because we will return with a single instruction:

There may be other registers that are required to be saved onto the stack frame, depending upon the operating system or application being used. For example:

Registers r13_usr and r14_usr are saved when there is a requirement by the operating system to support both user and SVC modes.

Floating-point registers are saved when the system uses hardware floating point.

There are a number of defines declared in this example. These defines map various cpsr/spsr changes to a particular label (for example, the I_Bit).

A set of defines is also declared that maps the various frame register references to frame pointer offsets. This is useful when the interrupts are reenabled and registers have to be stored into the stack frame. In this example we store the stack frame on the SVC stack.
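
As a rough sketch (the offsets follow Table 9.7; the names are illustrative, not the book's exact listing):

/* Frame register offsets from the frame pointer. */
#define FRAME_R0   0x00
#define FRAME_R1   0x04
#define FRAME_R2   0x08
#define FRAME_R3   0x0c
#define FRAME_R4   0x10
#define FRAME_R5   0x14
#define FRAME_R6   0x18
#define FRAME_R7   0x1c
#define FRAME_R8   0x20
#define FRAME_R9   0x24
#define FRAME_R10  0x28
#define FRAME_R11  0x2c
#define FRAME_R12  0x30
#define FRAME_PSR  0x34
#define FRAME_LR   0x38
#define FRAME_PC   0x3c
#define I_Bit      0x80   /* IRQ disable bit in the cpsr/spsr */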

The entry point for this example handler uses the same code as for the simple nonnested interrupt handler. The link register r14 is first modified so that it points to the correct return address, and then the context plus the link register r14 are saved onto the IRQ stack.

An interrupt service routine then services the interrupt. When servicing is complete or partially complete, control is passed back to the handler. The handler then calls a routine called read_RescheduleFlag, which determines whether further processing is required. It returns a nonzero value in register r0 if no further processing is required; otherwise it returns a zero. Note we have not included the source for read_RescheduleFlag because it is implementation specific.

The return flag in register r0 is then tested. If the register is not equal to zero, the handler restores context and returns control back to the suspended task.

Register r0 is set to zero, indicating that further processing is required. The first operation is to save the spsr, so a copy of the spsr_irq is moved into register r2. The spsr can then be stored in the stack frame by the handler later on in the code.

The IRQ stack address pointed to by register r13_irq is copied into register r0 for later use. The next step is to flatten (empty) the IRQ stack. This is done by adding 6 * 4 bytes to the top of the stack because the stack grows downwards and an ADD instruction can be used to set the stack.

The handler does not need to worry about the data on the IRQ stack being corrupted by another nested interrupt because interrupts are still disabled and the handler will not reenable the interrupts until the data on the IRQ stack has been recovered.

The handler then switches to SVC mode; interrupts are still disabled. The cpsr is copied into register r1 and modified to set the processor mode to SVC. Register r1 is then written back into the cpsr, and the current mode changes to SVC mode. A copy of the new cpsr is left in register r1 for later use.

The next stage is to create a stack frame by extending the stack by the stack frame size. Registers r4 to r11 can be saved onto the stack frame, which will free up enough registers to allow us to recover the remaining registers from the IRQ stack still pointed to by register r0.

At this stage the stack frame will contain the information shown in Table 9.7. The only registers that are not in the frame are the registers that are stored upon entry to the IRQ handler.

Table 9.7. SVC stack frame.

Label Offset Register
FRAME_R0 +0
FRAME_R1 +4
FRAME_R2 +8
FRAME_R3 +12
FRAME_R4 +16 r4
FRAME_R5 +20 r5
FRAME_R6 +24 r6
FRAME_R7 +28 r7
FRAME_R8 +32 r8
FRAME_R9 +36 r9
FRAME_R10 +40 r10
FRAME_R11 +44 r11
FRAME_R12 +48
FRAME_PSR +52
FRAME_LR +56
FRAME_PC +60

Table 9.8 shows the registers in SVC mode that correspond to the existing IRQ registers. The handler can now retrieve all the data from the IRQ stack, and it is safe to reenable interrupts.

Table 9.8. Data retrieved from the IRQ stack.

Registers (SVC) Retrieved IRQ registers
r4 r0
r5 r1
r6 r2
r7 r3
r8 r12
r9 r14 (return address)

IRQ exceptions are reenabled, and the handler has saved all the important registers. The handler can now complete the stack frame. Table 9.9 shows a completed stack frame that can be used either for a context switch or to handle a nested interrupt.

Table 9.9. Complete stack frame.

Label Offset Register
FRAME_R0 +0 r0
FRAME_R1 +4 r1
FRAME_R2 +8 r2
FRAME_R3 +12 r3
FRAME_R4 +16 r4
FRAME_R5 +20 r5
FRAME_R6 +24 r6
FRAME_R7 +28 r7
FRAME_R8 +32 r8
FRAME_R9 +36 r9
FRAME_R10 +40 r10
FRAME_R11 +44 r11
FRAME_R12 +48 r12
FRAME_PSR +52 spsr_irq
FRAME_LR +56 r14
FRAME_PC +60 r14_irq

At this stage the remainder of the interrupt servicing may be handled. A context switch may be performed by saving the current value of register r13 in the current task's control block and loading a new value for register r13 from the new task's control block.

It is now possible to return to the interrupted task/handler, or to another task if a context switch occurred.

SUMMARY

Nested Interrupt Handler

Handles multiple interrupts without a priority assignment.

Medium to high interrupt latency.

Advantage—can enable interrupts before the servicing of an individual interrupt is complete, reducing interrupt latency.

Disadvantage—does not handle prioritization of interrupts, so lower priority interrupts can block higher priority interrupts.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9781558608740500101

Hardware and Application Profiling Tools

Tomislav Janjusic, Krishna Kavi, in Advances in Computers, 2014

3.3 Multiple-Component Simulators

Medium-complexity simulators model multiple components and the interactions among the components, including a complete CPU with in-order or out-of-order execution pipelines, branch prediction and speculation, and a memory subsystem. A prime example of such a system is the widely used SimpleScalar tool set [8]. It is aimed at architecture research, although some academics deem SimpleScalar to be invaluable for teaching computer architecture courses. An extension known as ML-RSIM [10] is an execution-driven computer system simulator simulating several subcomponents including an OS kernel. Another extension is M-Sim [12], which extends SimpleScalar to model multithreaded architectures based on simultaneous multithreading (SMT).

3.3.1 SimpleScalar

SimpleScalar is a set of tools for computer architecture research and education. Developed in 1995 as part of the Wisconsin Multiscalar project, it has since sparked many extensions and variants of the original tool. It runs precompiled binaries for the SimpleScalar architecture. This also implies that SimpleScalar is not an FS simulator but rather a user-space single-application simulator. SimpleScalar is capable of emulating the Alpha, portable instruction set architecture (PISA) (MIPS-like instructions), ARM, and x86 instruction sets. The simulator interface consists of the SimpleScalar ISA and POSIX system call emulations.

The available tools that come with SimpleScalar include sim-fast, sim-safe, sim-profile, sim-cache, sim-bpred, and sim-outorder:

sim-fast is a fast functional simulator that ignores any microarchitectural pipelines.

sim-safe is an instruction interpreter that checks for memory alignment; this is a good way to check for application bugs.

sim-profile is an instruction interpreter and profiler. It can be used to measure application dynamic instruction counts and profiles of code and data segments.

sim-cache is a memory simulator. This tool can simulate multiple levels of cache hierarchies.

sim-bpred is a branch predictor simulator. It is intended to simulate different branch prediction schemes and measure misprediction rates.

sim-outorder is a detailed architectural simulator. It models a superscalar pipelined architecture with out-of-order execution of instructions, branch prediction, and speculative execution of instructions.

3.3.2 M-Sim

M-Sim is a multithreaded extension to SimpleScalar that models detailed individual key pipeline stages. M-Sim runs precompiled Alpha binaries and works on most systems that also run SimpleScalar. It extends SimpleScalar by providing a cycle-accurate model for thread context pipeline stages (reorder buffer, separate issue queue, and separate arithmetic and floating-point registers). M-Sim models a single SMT-capable core (and not multicore systems), which means that some processor structures are shared while others remain private to each thread; details can be found in Ref. [12].

The look and feel of M-Sim is similar to SimpleScalar. The user runs the simulator as a stand-alone simulation that takes precompiled binaries compatible with M-Sim, which currently supports only the Alpha AXP ISA.

3.3.3 ML-RSIM

This is an execution-driven computer system simulator that combines detailed models of modern computer hardware, including I/O subsystems, with a fully functional OS kernel. ML-RSIM's environment is based on RSIM, an execution-driven simulator for instruction-level parallelism (ILP) in shared memory multiprocessors and uniprocessor systems. It extends RSIM with additional features including I/O subsystem support and an OS. The goal behind ML-RSIM is to provide detailed hardware timing models so that users are able to explore OS and application interactions. ML-RSIM is capable of simulating OS code and memory-mapped access to I/O devices; thus, it is a suitable simulator for I/O-intensive interactions.

ML-RSIM implements the SPARC V8 instruction set. It includes cache and TLB models, and exception handling capabilities. The cache hierarchy is modeled as a two-level structure with support for cache coherency protocols. Load and store instructions to the I/O subsystem are handled through an uncached buffer with support for store instruction combining. The memory controller supports the MESI (modify, exclusive, shared, invalidate) snooping protocol with accurate modeling of queuing delays, bank contention, and dynamic random access memory (DRAM) timing. The I/O subsystem consists of a peripheral component interconnect (PCI) bridge, a real-time clock, and a number of small computer system interface (SCSI) adapters with hard disks. Unlike other FS simulators, ML-RSIM includes a detailed timing-accurate representation of various hardware components. ML-RSIM does not model any particular system or device, rather it implements detailed general device prototypes that can be used to assemble a range of real machines.

ML-RSIM uses a detailed representation of an OS kernel, the Lamix kernel. The kernel is Unix-compatible, specifically designed to run on ML-RSIM, and implements core kernel functionalities, primarily derived from NetBSD. Applications linked for Lamix can (in most cases) run on Solaris. With a few exceptions, Lamix supports most of the major kernel functionalities such as signal handling, dynamic process termination, and virtual memory management.

3.3.4 ABSS

An augmentation-based SPARC simulator, or ABSS for short, is a multiprocessor simulator based on AugMINT, an augmented MIPS interpreter. The ABSS simulator can be either trace-driven or program-driven. We have described examples of trace-driven simulators, including DineroIV, where only some abstracted features of an application (i.e., instruction or data address traces) are simulated. Program-driven simulators, on the other hand, simulate the execution of an actual application (e.g., a benchmark). Program-driven simulations can be either interpretive simulations or execution-driven simulations. In interpretive simulations, the instructions are interpreted by the simulator one at a time, while in execution-driven simulations, the instructions are actually run on real hardware. ABSS is an execution-driven simulator that executes the SPARC ISA.

ABSS consists of several components: a thread module, an augmenter, cycle-accurate libraries, memory system simulators, and the benchmark. Upon execution, the augmenter instruments the application and the cycle-accurate libraries. The thread module, libraries, the memory system simulator, and the benchmark are linked into a single executable. The augmenter then models each processor as a separate thread, and in the event of an interruption (context switch) that the memory system must handle, the execution pauses and the thread module handles the request, usually saving registers and reloading new ones. The goal behind ABSS is to allow the user to simulate timing-accurate SPARC multiprocessors.

3.3.5 HASE

HASE, a hierarchical architecture design and simulation environment, and SimJava are educational tools used to design, test, and explore computer architecture components. Through abstraction, they facilitate the study of hardware and software designs on multiple levels. HASE offers a GUI for students trying to understand complex system interactions. The motivation for developing HASE was to develop a tool for rapid and flexible development of new architectural ideas.

HASE is based on SIM++, a discrete-event simulation language. SIM++ describes the basic components, and the user can link the components. HASE will then produce the initial code set that forms the basis of the desired simulator. Since HASE is hierarchical, new components can be built as interconnected modules to core entities.

HASE offers a variety of simulation models intended for use in teaching and educational laboratory experiments. Each model must be used with HASE, a Java-based simulation environment. The simulator then produces a trace file that is later used as input into the graphic environment to represent the interior workings of an architectural component. The following are a few of the models available through HASE:

Simple pipelined processor based on MIPS

Processor with scoreboards (used for instruction scheduling)

Processor with prediction

Single instruction, multiple data (SIMD) array processors

A two-level cache model

Cache coherency protocols (snooping and directory)

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780124202320000039