integer multiplications on IA32 architecture.
wkwai
post Aug 6 2003, 14:24
Post #1


MPEG4 AAC developer


Group: Developer
Posts: 398
Joined: 1-June 03
Member No.: 6943



Hi,


I am used to assembly language programming on the Pentium processor generation (166-200 MHz MMX). I noticed that for operations like int16 and int32 multiplications / divisions, it used to take as long as 20 clock cycles to complete the an instruction execution. However, I noticed that on a Celeron processor (using the VTune 7.0 evaluation kit from Intel's website) it takes only 1 clock cycle to execute. Could anyone verify this? In the past, we would use a combination of shift and add operations to implement integer multiplications / divisions.
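
For readers who have not seen the shift-and-add approach mentioned here, a minimal C sketch follows. The function name `shift_add_mul` is made up for illustration; hand-written assembly of that era would usually unroll this for a known constant rather than loop over the bits.

```c
#include <stdint.h>

/* Multiply a by b using only shifts and adds.
   The result is the product modulo 2^32, like a 32-bit imul. */
uint32_t shift_add_mul(uint32_t a, uint32_t b)
{
    uint32_t result = 0;
    while (b != 0) {
        if (b & 1u)      /* lowest multiplier bit set: add the shifted multiplicand */
            result += a;
        a <<= 1;         /* next power-of-two multiple of the multiplicand */
        b >>= 1;         /* move on to the next multiplier bit */
    }
    return result;
}
```

The loop runs once per significant bit of `b`, which is the same kind of data dependence that makes multiply timings vary with the operand on some older CPUs.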


wkwai
NumLOCK
post Aug 6 2003, 15:06
Post #2


Neutrino G-RSA developer


Group: Developer
Posts: 852
Joined: 8-May 02
From: Geneva
Member No.: 2002



Hi,

QUOTE
I noticed that for operations like int16 and int32 multiplications / divisions, it used to take as long as 20 clock cycles to complete the an instruction execution.

What is the 'an' instruction ?

edit: ok, if I ignore the 'an': 20 cycles seems way out of line. Ensure your mul instruction doesn't fetch its argument from memory.

QUOTE
However, I noticed that on a Celeron processor (using the VTune 7.0 evaluation kit from Intel's website) it takes only 1 clock cycle to execute. Could anyone verify this?

If you mean 1 cycle latency for 'mul' or 'imul' 32x32bit instruction, it is impossible.
Any x86-compatible processor to date will need at the very least 2 cycles (IIRC) because of the high frequency. I think the fastest one was the K6, with 2 cycle latency and 3-cycle execution time for mul/imul.

edit: I think on K6, the 32 lowest bits were available in 2 cycles, and the higher 32 bits were available 1 cycle later.
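
The "lower 32 bits" and "higher 32 bits" refer to the two halves of the full 64-bit product that a 32x32-bit `mul` leaves in EDX:EAX. A small C sketch of that split, with the helper name `mul32_wide` being purely illustrative:

```c
#include <stdint.h>

/* Full 32x32 -> 64-bit unsigned product, split the way `mul` returns it. */
void mul32_wide(uint32_t a, uint32_t b, uint32_t *lo, uint32_t *hi)
{
    uint64_t product = (uint64_t)a * b;  /* widen first so no bits are lost */
    *lo = (uint32_t)product;             /* the half that ends up in EAX */
    *hi = (uint32_t)(product >> 32);     /* the half that ends up in EDX */
}
```

Code that only needs the low half (ordinary `int` multiplication in C, for example) never has to wait for the upper half.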

QUOTE
In the past, we would use a combination of shift and add operations to implement integer multiplications / divisions.

Yeah.
Nowadays, it's a bit different though: thanks to improved multiplication circuitry it's usually worth using special instructions only for:

- result = n*2^k => shl reg, k
- result = 3*n+k => lea reg, [reg+2*reg+k]
- result = 5*n+k => lea reg, [reg+4*reg+k]

In most other cases the multiply will be faster. Plus (depending on your program) you'll avoid saturating the AGU (address generation unit). Also while the mul runs, you can do something else.
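
As a rough illustration of the three special cases listed above, here they are written in C. The function names are hypothetical, and whether a compiler actually emits `shl` or `lea` for them (rather than `imul`) depends on the compiler and the target CPU, so treat this as a sketch of the arithmetic only.

```c
#include <stdint.h>

/* result = n * 2^k, the shape that maps to "shl reg, k" (requires k < 32) */
uint32_t times_pow2(uint32_t n, unsigned k)
{
    return n << k;
}

/* result = 3*n + k, the shape that maps to "lea reg, [reg+2*reg+k]" */
uint32_t times3_plus(uint32_t n, uint32_t k)
{
    return n + 2u * n + k;
}

/* result = 5*n + k, the shape that maps to "lea reg, [reg+4*reg+k]" */
uint32_t times5_plus(uint32_t n, uint32_t k)
{
    return n + 4u * n + k;
}
```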

Regards

This post has been edited by NumLOCK: Aug 6 2003, 15:17


--------------------
Try Leeloo Chat at http://leeloo.webhop.net
wkwai
post Aug 8 2003, 11:25
Post #3


MPEG4 AAC developer


Group: Developer
Posts: 398
Joined: 1-June 03
Member No.: 6943



QUOTE (NumLOCK @ Aug 6 2003, 06:06 AM)
20 cycles seems way out of line. Ensure your mul instruction doesn't fetch its argument from memory.

I think so. In fact, there are also penalties for mixing byte and int16 operands with int32 instructions on old Pentium processors. By using the MMX instructions for integer multiplications, the speed-up is about a factor of 100.

However, on a Celeron system, MMX multiplication instructions only speed things up by about a factor of 4.
I think the Celeron, Pentium II and Pentium III are all based on a different architecture.

wkwai
Gabriel
post Aug 8 2003, 13:41
Post #4


LAME developer


Group: Developer
Posts: 2950
Joined: 1-October 01
From: Nanterre, France
Member No.: 138



P6 architecture: pentium pro, pII, PIII, older celerons
Netburst architecture: p4, newer celerons
NumLOCK
post Aug 8 2003, 16:23
Post #5


Neutrino G-RSA developer


Group: Developer
Posts: 852
Joined: 8-May 02
From: Geneva
Member No.: 2002



QUOTE (wkwai @ Aug 8 2003, 11:25 AM)
I think so. In fact, there are also penalties for mixing byte and int16 operands with int32 instructions on old Pentium processors. By using the MMX instructions for integer multiplications, the speed-up is about a factor of 100.

However, on a Celeron system, MMX multiplication instructions only speed things up by about a factor of 4.
I think the Celeron, Pentium II and Pentium III are all based on a different architecture.

wkwai

I think it's too bad that MMX doesn't support 32-bit multiplications.
Also the inability for MMX instructions to interoperate with x86 registers is a big design flaw in their architecture. If you want to mix both types of instructions, you have to use useless "MOVD" instructions which prevent many opportunities to optimize.
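
To make the 16-bit limitation concrete: the MMX multiplies PMULLW and PMULHW work on four signed 16-bit lanes and return the low or the high 16 bits of each 32-bit product. The following is a plain-C scalar model of that behaviour, not MMX code; the function names are invented for this sketch.

```c
#include <stdint.h>

/* Scalar model of PMULLW: low 16 bits of each signed 16x16 product. */
void pmullw_model(const int16_t a[4], const int16_t b[4], uint16_t out[4])
{
    for (int i = 0; i < 4; i++) {
        int32_t p = (int32_t)a[i] * b[i];            /* exact 32-bit product */
        out[i] = (uint16_t)((uint32_t)p & 0xFFFFu);  /* keep bits 0..15 */
    }
}

/* Scalar model of PMULHW: high 16 bits of each signed 16x16 product. */
void pmulhw_model(const int16_t a[4], const int16_t b[4], uint16_t out[4])
{
    for (int i = 0; i < 4; i++) {
        int32_t p = (int32_t)a[i] * b[i];            /* exact 32-bit product */
        out[i] = (uint16_t)((uint32_t)p >> 16);      /* keep bits 16..31 */
    }
}
```

Getting a full 32-bit result per lane therefore takes both instructions plus an interleave step, and a 32x32-bit multiply has to be assembled from several 16-bit pieces.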

Have you seen how well Motorola's Altivec is designed, for instance ? You can do several 64x64bit multiplies in parallel...

Well after all, Intel is Intel... and stays Intel.

Edit: To be completely impartial, I must admit that MMX still proved useful for me, in several 24-bit graphics routines.

By the way, I loved their funny PCKUNMLL and PSKCNNNLXGLCBB mnemonics.

This post has been edited by NumLOCK: Aug 8 2003, 16:34


--------------------
Try Leeloo Chat at http://leeloo.webhop.net
wkwai
post Aug 10 2003, 07:45
Post #6


MPEG4 AAC developer


Group: Developer
Posts: 398
Joined: 1-June 03
Member No.: 6943



QUOTE (NumLOCK @ Aug 8 2003, 07:23 AM)
By the way, I loved their funny PCKUNMLL and PSKCNNNLXGLCBB mnemonics.


I think those instructions do not exist on the Celeron and PII systems. For PIII and above, the MMX instructions actually work on 128-bit registers. That is what I noticed from the latest Intel programmer's guide.

I wonder how much performance gain a 64-bit processor has over the IA32 architecture. It seems to me that most of the internal floating point operations of the IA32 architecture are already 64-bit operations?

When using a floating point instruction in IA32, such as fmul, would the instruction load the data 32 bits at a time or 64 bits?
Diocletian
post Aug 10 2003, 10:40
Post #7





Group: Members
Posts: 45
Joined: 11-October 02
Member No.: 3517



QUOTE (wkwai @ Aug 10 2003, 12:15 PM)
QUOTE (NumLOCK @ Aug 8 2003, 07:23 AM)
By the way, I loved their funny PCKUNMLL and PSKCNNNLXGLCBB mnemonics.


I think those instructions do not exist on the Celeron and PII systems. For PIII and above, the MMX instructions actually work on 128-bit registers. That is what I noticed from the latest Intel programmer's guide.

I wonder how much performance gain a 64-bit processor has over the IA32 architecture. It seems to me that most of the internal floating point operations of the IA32 architecture are already 64-bit operations?

When using a floating point instruction in IA32, such as fmul, would the instruction load the data 32 bits at a time or 64 bits?

The 64-bit FMUL instructions have nothing to do with the 64-bit IMUL instructions on IA64
or x86-64. The main advantage of a 64-bit CPU is that it can work with more memory, or with
more fragmented memory:
- you can work with more than 1.5 GB of memory per process
- you don't have to care about virtual address space fragmentation
- you can map files to memory
- you can build sparse structures in memory, so that a lot of the work is done
in hardware rather than in software
-----------------------------------

The rules about optimization which you find in books and in brains are typically 10 years old or
older; they are COMPLETELY out of date and can often deoptimize programs.

Evaluating the speed of current CPUs is easier than evaluating the speed of 10-year-old CPUs,
because in modern CPUs decoding and execution are almost completely decoupled.
This was not the case for CPUs like the Pentium, Pentium MMX and AMD K5, where
predicting the calculation speed was a pain.

On a modern CPU, the execution time of code and data which is completely in the L1 cache can
be characterized by two parameters:

- Latency (the time from the input to the output register)
- Throughput (the average time per instruction when executing multiple independent instructions)

Latency/Throughput is typically an integer which can be interpreted as the number of execution
pipelines. The execution time of the mul32 instruction:

- i386: Depending on the number of significant bits in the second operand: 6...37 clocks
- i486: Depending on the number of significant bits in the second operand: 9...40 clocks
- Pentium/Pentium MMX: 11 clocks (fixed)
- K6: 2 clocks, a 3rd clock for the upper 32 bits
- Athlon: 5 clocks (throughput: 2.5 clocks)
  but: operand in memory: 4 clocks (throughput: 2 clocks)
- Pentium II: 4 clocks (throughput: 4 clocks) (?)
- Pentium 4: 14 clocks (throughput: 5.67 clocks)
  operand in memory: 18 clocks (throughput: 6 clocks)
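
The difference between the latency and throughput figures can be seen in how code is structured. Below is a hedged C sketch with made-up function names: the first loop forms one long dependency chain, so each multiply has to wait for the previous one; the second keeps two independent chains, which a pipelined multiplier may be able to overlap. An optimizing compiler is free to restructure either loop, so this only illustrates the idea.

```c
#include <stddef.h>
#include <stdint.h>

/* One accumulator: every multiply depends on the previous result,
   so the loop is limited by multiply latency. */
uint32_t product_chain(const uint32_t *v, size_t n)
{
    uint32_t acc = 1;
    for (size_t i = 0; i < n; i++)
        acc *= v[i];
    return acc;
}

/* Two accumulators: the two multiplies in each iteration are independent,
   so the loop can get closer to the throughput figure. */
uint32_t product_split(const uint32_t *v, size_t n)
{
    uint32_t acc0 = 1, acc1 = 1;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        acc0 *= v[i];
        acc1 *= v[i + 1];
    }
    if (i < n)              /* odd element count: fold in the last value */
        acc0 *= v[i];
    return acc0 * acc1;     /* same product modulo 2^32 */
}
```

Both functions return the same value because unsigned multiplication modulo 2^32 is associative and commutative.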

Pentium 4 is much slower than the Pentium II/III or the K6. Even shl doesn't help, because
it is also very slow:

- shl reg,n: 4 clocks

Fast indeed is:

- add reg1, reg2: 0.5 clocks
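
Taking the figures above at face value, a short left shift on the Pentium 4 could be replaced by a few additions, since doubling a value is the same as adding it to itself. A minimal C sketch of that equivalence follows; `shl_by_adds` is a hypothetical name, and a compiler may well turn the loop back into a shift, so any real gain would have to be measured.

```c
#include <stdint.h>

/* Compute x << k using k additions (x + x doubles x each time). */
uint32_t shl_by_adds(uint32_t x, unsigned k)
{
    while (k--)
        x += x;
    return x;
}
```

With 4 clocks for `shl` against 0.5 clocks per `add`, this would only be attractive for small shift counts.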

MMX on the Pentium 4 is also slower than on the Pentium MMX/II/III, because there's only
ONE MMX pipeline instead of two. The Pentium 4 is clock speed optimized, not speed optimized. A lot of these changes are to allow high clock speeds. In the first P4 stepping
there were additional serious penalties for misaligned memory accesses which dropped the
speed down to Pentium MMX times.


--------------------
Diocletian

Time Travel Agency
Book a journey to the Diocletian Palace. Not today!
NumLOCK
post Aug 10 2003, 11:23
Post #8


Neutrino G-RSA developer


Group: Developer
Posts: 852
Joined: 8-May 02
From: Geneva
Member No.: 2002



QUOTE (wkwai @ Aug 10 2003, 07:45 AM)
I think those instructions do not exist on the Celeron and PII systems. For PIII and above, the MMX instructions actually work on 128-bit registers. That is what I noticed from the latest Intel programmer's guide.

They don't really exist, I was joking about their habit for strange mnemonics.

I think the 128-bit version would be best called "MMX2".

QUOTE
I wonder how much performance gain a 64-bit processor has over the IA32 architecture. It seems to me that most of the internal floating point operations of the IA32 architecture are already 64-bit operations?

You're right, there would be little performance gain switching to 64 bits. The real advantage is the addressing.

For x86, much more useful changes would be:
- an extension to raise the number of registers (8 regs is ridiculous)
- the possibility to use 3-operand instructions, like on most sane architectures (e.g. ADDL source1, source2, destination).

QUOTE
When using a floating point instruction in IA32, such as fmul, would the instruction load the data 32 bits at a time or 64 bits?

Since most instructions see memory through 32-byte cache lines, the load will be done in one clock (assuming your 64-bit operand is duly aligned).
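
The alignment condition can be checked with a little address arithmetic. The sketch below tests whether an operand of a given size crosses a cache-line boundary; the function name `crosses_line` is invented here, and the 32-byte line size in the example is taken from the statement above rather than from any particular CPU manual.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the byte range [addr, addr + size) spans two cache lines.
   `line` must be a power of two and `size` must be at least 1. */
int crosses_line(uintptr_t addr, size_t size, size_t line)
{
    uintptr_t mask  = ~((uintptr_t)line - 1);
    uintptr_t first = addr & mask;               /* line holding the first byte */
    uintptr_t last  = (addr + size - 1) & mask;  /* line holding the last byte */
    return first != last;
}
```

For example, an 8-byte double at offset 24 of a 32-byte line fits entirely inside it, while one at offset 28 straddles two lines. An 8-byte-aligned double can never straddle a line whose size is a multiple of 8.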


--------------------
Try Leeloo Chat at http://leeloo.webhop.net
wkwai
post Aug 11 2003, 13:35
Post #9


MPEG4 AAC developer


Group: Developer
Posts: 398
Joined: 1-June 03
Member No.: 6943



Thanks. Another question: I have a Celeron 650 MHz. I thought that the Celeron processor is almost identical to the PII? But someone just said that the latest versions of the Celeron processors are based on the new P4 architecture? I am wondering if mine would support SSE2 instructions.

As for int64 operations, I think there are very limited applications apart from scientific and engineering purposes / memory-intensive applications. Most of the intensive computational requirements are usually for 8- to 16-bit audio-visual data. I hardly use long int types in my programming.

I thought so, the IA32 architecture is already a "hybrid 32-64 bits" processor.


wkwai
Lefungus
post Aug 11 2003, 14:13
Post #10





Group: Members
Posts: 86
Joined: 10-November 02
Member No.: 3745



Celerons based on the P4 = Celerons above 1.6 GHz, so no, your Celeron is just a PII with less L2 cache.


--------------------
It's a 'Jump to Conclusions Mat'. You see, you have this mat, with different CONCLUSIONS written on it that you could JUMP TO.
Audible!
post Aug 16 2003, 23:23
Post #11





Group: Members
Posts: 523
Joined: 28-June 03
From: CA, USA
Member No.: 7426



QUOTE
Another question: I have a Celeron 650 MHz. I thought that the Celeron processor is almost identical to the PII? But someone just said that the latest versions of the Celeron processors are based on the new P4 architecture? I am wondering if mine would support SSE2 instructions.


The very first Celerons (PPGA, not FCPGA) were PIIs with less L2 cache.
Starting at the 533 MHz clock rate (and going to about 1.4 GHz), the Celerons used the PIII architecture with less L2 cache, meaning SSE (1, not 2).
This is the type of Celeron you have.

After 1.4 GHz or so, the Celeron moved to the NetBurst architecture (PIV with less cache, SSE2).

edit: note that there were 500 and 533 MHz PPGA AND FCPGA Celerons, the former being quite easy to spot because of the heat spreader. For more information visit sandpile.org
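
Rather than guessing from the model name, a program can ask the CPU directly: CPUID leaf 1 reports MMX, SSE and SSE2 as bits 23, 25 and 26 of EDX. The sketch below only decodes an EDX value that has already been read; the enum and function names are made up for illustration. How the value is obtained depends on the toolchain (with GCC or Clang on x86, `__get_cpuid` from `<cpuid.h>` is one option, which is an assumption about your compiler).

```c
#include <stdint.h>

/* Feature bit positions in EDX after executing CPUID with EAX = 1. */
enum {
    CPUID_EDX_MMX  = 23,
    CPUID_EDX_SSE  = 25,
    CPUID_EDX_SSE2 = 26
};

/* Returns 1 if the given feature bit is set in the EDX value, else 0. */
int edx_has(uint32_t edx, int bit)
{
    return (int)((edx >> bit) & 1u);
}
```

On a PIII-class Celeron as described above, one would expect the MMX and SSE bits to be set and the SSE2 bit to be clear.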

This post has been edited by Audible!: Aug 16 2003, 23:24
CiTay
post Aug 17 2003, 01:03
Post #12


Administrator


Group: Admin
Posts: 2378
Joined: 22-September 01
Member No.: 3



QUOTE (Audible! @ Aug 17 2003, 12:23 AM)
Starting at the 533 MHz clock rate (and going to about 1.4 GHz), the Celerons used the PIII architecture with less L2 cache, meaning SSE (1, not 2).

To make things completely confusing, there were two types of PIII Celerons, the Coppermine- and the Tualatin-based ones.

You can see the various models on this roadmap, including some future CPUs up to Q4/2004...