added regX as input to alu component. added conditional register copy in bitwise instructions.
fully 8x vectorized all relevant instructions. 8-bit vector parameter is currently working as line enable 8x, not vector count 256x. added alu circuit that contains most of the instruction operations other than nop, jmp, ldi and mem.
added two integer instructions, clock incrementing counter with set value and continuous clock random value source at fixed start seed.
added floating point isinf, isnan, sin, tan, cos, asin, atan, acos, log, power and sqrt as instructions. changed all register/ram single-line ram components to octo-line ram components for future 8x vector instructions.
removed all bit index operations, changed bit index instruction parameter to 8-bit vector number, to have max 256x vector instructions duplicates running in array mode at once, default 8x vector instructions at once.
changed ram to dual port ram. every instruction is now 1 cycle, including ram load/store. simplified circuit in general. rearranged regX, regY and regZ dual port ram single-line modules to test octo-line array input/output mode. simplified mux risc core to single circuit running only single core with direct ram loading and full 64k x 64bit registry and 128MB unified ram.


RISC core-gate instruction set architecture (64-bit variation of RISC-V):
MISC compute chip contains 64k cores, total of 32GB register nvsram and 8TB memory nvsram.
Each core contains 64k local 64-bit ram registers. load/store instructions can address global memory.
Each core contains 24-bit addressed 128MB ram, including rom, ram, touch-display ram, and nand nvram.
Every instruction uses/operates on full 64-bit register values always, and runs in 1 cycle.
Instruction high bits can contain specific simple variations of instructions, and vector duplicates.
Each 64-bit instruction is formed from 16-bit [regX regY regZ insT] parameters.
insT parameter is formed from 8-4-4-bit [vecN insV insO] parameters.
Estimated logic transistors per core is 200k making 64k cores about 12.8 billion.
Estimated ram transistors per core is 8million 512KB and 256billion total 32GB.
Estimated compute 64-bit teraops at 5GHz x 8-vector per core is 40gops and 2560tops total.
Opcode | Instruction | Name | Description
----------------------------------------------------------------------------------------------------
any | ## | Any Raw Data | direct data line 64-bit value
0 | nopYZ | No Operation | no operation sleep constant regYZ cycles
[] empty line or white space line
// comment line
1 | jmpXY | Jump Destination | jump to regX if regY is not zero
jmpcXY insV=0 jump to regX if regY is not zero
jmpXY insV=1 unconditional jump to regX
2 | ldiXYZ | Load 32-bit Uint | load regX with constant regYZ
3 | memXY | Memory Double | store/load[insV] regX at memory[regY]
memrXY insV=0 load
memwXY insV=1 store
4 | cmpXY | Compare to Zero | clear regX to 0, set to 1 if regY comp[insV]
cmpeXY insV=0 integer equal to
cmplXY insV=1 integer less than
cmpefXY insV=2 float equal to
cmplfXY insV=3 float less than
5 | intXYZ | ALU Int Operation | store integer op[insV] regY regZ to regX
addXYZ insV=0 integer add
addoXYZ insV=1 integer add overflow bit
subXYZ insV=2 integer subtract
subbXYZ insV=3 integer subtract borrow bit
mulXYZ insV=4 integer multiply
muloXYZ insV=5 integer multiply overflow
divXYZ insV=6 integer divide
divrXYZ insV=7 integer divide remainder
negXYZ insV=8 integer negate
clkXYZ insV=9 integer clock counter
rndXYZ insV=A integer clock random
6 | bitXYZ | ALU Bit Operation | store bitwise op[insV] regY regZ to regX
shlXYZ insV=0 bitwise shift left regZ bits
shrXYZ insV=1 bitwise shift right regZ bits
sharXYZ insV=2 bitwise shift arithmetic right regZ bits
rotlXYZ insV=3 bitwise rotate left regZ bits
rotrXYZ insV=4 bitwise rotate right regZ bits
copyXYZ insV=5 bitwise copy
notXYZ insV=6 bitwise not
orXYZ insV=7 bitwise or
andXYZ insV=8 bitwise and
nandXYZ insV=9 bitwise nand
norXYZ insV=A bitwise nor
xorXYZ insV=B bitwise xor
xnorXYZ insV=C bitwise xnor
copycXYZ insV=D bitwise conditional copy if regZ is not zero
7 | flpXYZ | ALU Flp Operation | store float op[insV] regY regZ to regX
addfXYZ insV=0 float add
subfXYZ insV=1 float subtract
mulfXYZ insV=2 float multiply
divfXYZ insV=3 float divide
negfXYZ insV=4 float negate
itfXYZ insV=5 integer to float
ftinXYZ insV=6 float to integer nearest
ftidXYZ insV=7 float to integer round down
ftiuXYZ insV=8 float to integer round up
ftitXYZ insV=9 float to integer truncate
finfXYZ insV=A float is infinity
fnanXYZ insV=B float is not-a-number
8 | flpaXYZ | ALU FlpA Operation | store advanced float op[insV] regY regZ to regX
fsinXYZ insV=0 float sine
ftanXYZ insV=1 float tangent
fcosXYZ insV=2 float cosine
fasinXYZ insV=3 float arcsine
fatanXYZ insV=4 float arctangent
facosXYZ insV=5 float arccosine
flogXYZ insV=6 float logarithm
fpowXYZ insV=7 float power
fsqrtXYZ insV=8 float square root
Example looping test assembly code source and binary:
source listing | binary | explanation
----------------------------------------------------------------------------------------------------
[] | 0000000000000000 | empty line
// empty line | 0000000000000000 | comment line
nop 00000200 | 0000000002000000 | no operation sleep 512+1 cycles
ldi 0000 00000001 ff | 000000000001ff02 | load registers 0-7 with 0x1, current fibonacci num
ldi 0008 00000001 ff | 000800000001ff02 | load registers 8-15 with 0x1, previous fibonacci num
ldi 0010 00000000 ff | 001000000000ff02 | load registers 16-23 with 0x0, previous+ fibonacci num
ldi 0018 00000000 | 0018000000000002 | load register 24 with value 0x0, for loop index from 0
ldi 0019 00000020 | 0019000000200002 | load register 25 with value 0x20, for loop less than 32
ldi 001a 00000018 | 001a000000180002 | load register 26 with value 0x18, ram store start index
ldi 001b 00000001 | 001b000000010002 | load register 27 with value 0x1, constant 0x1 add
ldi 001c 00000008 | 001c000000080002 | load register 28 with value 0x8, constant 0x8 add
ldi 001d 0000000C | 001d0000000C0002 | load register 29 with value 0xC constant jump address
ldi 0020 00000000 | 0020000000000002 | load register 32 with value 0x0 constant jump address
copy 0010 0008 0000 ff | 001000080000ff56 | copy registers 8-15 to register 16-23
copy 0008 0000 0000 ff | 000800000000ff56 | copy registers 0-7 to register 8-15
add 0000 0008 0010 ff | 000000080010ff05 | store add of registers 8-15 and 16-23 to register 0-7
memw 0000 001a 0000 ff | 0000001a0000ff13 | store registers 0-7 to register 26 memory location 0-7
add 001a 001a 001c | 001a001a001c0005 | store add of register 26 and register 28 to register 26
add 0018 0018 001b | 00180018001b0005 | store add of register 24 and register 27 to register 24
sub 001e 0018 0019 | 001e001800190025 | store sub of register 24 and register 25 to register 30
cmpl 001f 001e | 001f001e00000014 | clear register 31, set if register 30 int less than 0
jmpc 001d 001f | 001d001f00000001 | jump to register 29 if register 31 is not zero
jmpu 0020 | 0020000000000011 | unconditional jump to register 32
## A123456789ABCDEF | a123456789abcdef | custom data segment with any instruction or data
Example looping test assembly to c-code approximate:
while(true) { // infinite while loop
register<0> long fib1[8] = 0x1; // init fib1 with registers array 0-7 to long integer value 1
register<8> long fib2[8] = 0x1; // init fib2 with registers array 8-15 to long integer value 1
register<16> long fib3[8] = 0x0; // init fib3 with registers array 16-23 to long integer value 0
register<24> long i = 0; // init loop i with register 24 long integer value 0
register<25> long imax = 32; // init loop imax with register 25 long integer value 32
register<26> long *mem = 0x18; // init mem with register 26 long integer pointer at 0x18
for (;i<imax;i++) { // for loop long integer i index value from 0 to 31
fib3{} = fib2{}; // copy array of old fib2 values to fib3 vectorized
fib2{} = fib1{}; // copy array of old fib1 values to fib2 vectorized
fib1{} = fib2{} + fib3{}; // calculate array of new fib1 adding fib2 and fib3 vectorized
mem{} = fib1{}; // store array of fib1 values to mem location index vectorized
mem += 8; // move memory pointer 8 indexes forward
} // for loop close
} // infinite while loop close
gate pipeline compute implemented as risc-v compute cores grid routing network. each core has their own 16-bit x 64bit = 512KB internal compute cache, but all middle level memory based caches are replaced with grid routing network for much higher bandwidth. each gate core has 2x 64-bit wide input and 1x 64-bit wide output.

gate pipeline compute architecture based on pre-computed nor-memory stored 8-bit values fpga architecture.

Logic gate pipeline compute.

Logic circuit/gate assembler.

MISC Compute Chip
