lib/lzmadecode: Increase decoding speed by 30%

When CONFIG_SSE is enabled, use the "prefetchnta" instruction to load
the next chunk of data into the CPU cache. This only works when the
input stream is covered by an MTRR. If the input stream is read from
the SPI ROM MMIO area, prefetching keeps the SPI controller busy
fetching new data, which is placed into the CPU cache automatically.
The CPU then spends less time in I/O wait and decompression is
faster.

When the input stream is not cacheable, the prefetch instruction has
no effect.
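
The prefetch pattern described above can be sketched in isolation. This is a minimal illustration, not the code from lzmadecode.c: the helper name sum_with_prefetch, the CACHE_LINE_SIZE constant, and the use of the GCC/Clang builtin __builtin_prefetch (in place of the inline "prefetchnta" assembly) are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64 /* assumed cache-line size in bytes */

/*
 * Hypothetical helper, not part of lzmadecode.c: sums the input bytes and
 * issues a prefetch hint each time the read pointer reaches a cache-line
 * boundary. The number of hints issued is stored in *hints.
 */
static uint32_t sum_with_prefetch(const uint8_t *buf, size_t len, size_t *hints)
{
	const uint8_t *p = buf;
	const uint8_t *lim = buf + len;
	uint32_t sum = 0;

	assert(buf != NULL && hints != NULL);
	*hints = 0;

	while (p < lim) {
		sum += *p++;
		/*
		 * Act only when p is cache-line aligned and at least two full
		 * lines remain, so the prefetched line (p + 64 .. p + 127)
		 * lies entirely inside the buffer.
		 */
		if (!((uintptr_t)p & (CACHE_LINE_SIZE - 1)) &&
		    (size_t)(lim - p) >= 2 * CACHE_LINE_SIZE) {
			/* rw = 0 (read), locality = 0 (non-temporal hint) */
			__builtin_prefetch(p + CACHE_LINE_SIZE, 0, 0);
			(*hints)++;
		}
	}
	return sum;
}
```

The hint is issued once per cache line rather than once per byte, so the extra cost in the read loop is a single alignment test per byte. A prefetch is only a hint: the result of the loop is the same whether or not the hardware acts on it.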

The SPI interface on the tested device runs at 100Mbit/s, and the
Sandy Bridge mobile CPU has a fair amount of work to do decompressing
the LZMA stream. That gives the SPI controller enough time to preload
data into the cache.

The payload of 1100213 bytes is now read in 164msec, resulting in an
input bandwidth of 53Mbit/s.
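
The bandwidth figure, and the 5usec per cache line mentioned in the source comment, follow from simple unit arithmetic. A minimal sketch, with the hypothetical helper names throughput_mbit_s and transfer_time_usec chosen only for this example:

```c
#include <assert.h>
#include <stdint.h>

/* Average throughput in Mbit/s for `bytes` moved in `usec` microseconds. */
static double throughput_mbit_s(uint64_t bytes, uint64_t usec)
{
	assert(usec != 0);
	/* Bits per microsecond is the same quantity as Mbit/s. */
	return (double)bytes * 8.0 / (double)usec;
}

/* Time in microseconds to move `bytes` over a link running at `mbit_s` Mbit/s. */
static double transfer_time_usec(uint64_t bytes, double mbit_s)
{
	assert(mbit_s > 0.0);
	return (double)bytes * 8.0 / mbit_s;
}
```

With the payload size and the 164,868 usec measured for the decompression step, throughput_mbit_s(1100213, 164868) evaluates to about 53.4, and transfer_time_usec(64, 100.0) evaluates to 5.12 for one 64-byte cache line at 100Mbit/s.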

TEST=Booted on Lenovo X220 and used cbmem -t:
Before:
  16:finished LZMA decompress (ignore for x86)   1,218,418 (210,054)
After:
  16:finished LZMA decompress (ignore for x86)   1,170,949 (164,868)

Boots 46msec faster than before; LZMA decompression is about 30% faster.
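
The percentage depends on which baseline is used, so it can help to compute both from the two deltas reported by cbmem. A minimal sketch; the helper names percent_time_saved and percent_rate_gain are hypothetical:

```c
#include <assert.h>

/* Percentage of the original time saved: (before - after) / before. */
static double percent_time_saved(double before_usec, double after_usec)
{
	assert(before_usec > 0.0);
	return (before_usec - after_usec) / before_usec * 100.0;
}

/* Percentage increase in decoding rate: before / after - 1. */
static double percent_rate_gain(double before_usec, double after_usec)
{
	assert(after_usec > 0.0);
	return (before_usec / after_usec - 1.0) * 100.0;
}
```

For the 210,054 usec and 164,868 usec deltas above, percent_time_saved gives roughly 21.5 and percent_rate_gain gives roughly 27.4.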

Change-Id: I3b2ed7fe0883f271553ecd1ab4191e4848ad0299
Signed-off-by: Patrick Rudolph <patrick.rudolph@9elements.com>
Reviewed-on: https://review.coreboot.org/c/coreboot/+/88813
Tested-by: build bot (Jenkins) <no-reply@coreboot.org>
Reviewed-by: Angel Pons <th3fanbus@gmail.com>
Patrick Rudolph 2025-08-17 20:01:32 +02:00 committed by Matt DeVillier
commit 159afbc5d5

@@ -25,6 +25,17 @@
#define __lzma_attribute_Ofast__
#endif
/* When the input stream is covered by an MTRR the "prefetch" instruction
 * will load the next chunk of data into the CPU cache ahead of time.
 * On a 100MBit/s SPI interface this reduces the time spent in I/O wait
 * by 5usec for every cache-line (64bytes) prefetched.
 */
#if CONFIG(SSE)
#define __lzma_prefetch(x) {asm volatile("prefetchnta %0" :: "m" (x));}
#else
#define __lzma_prefetch(x)
#endif
#include "lzmadecode.h"
#include <types.h>
@@ -68,6 +79,11 @@
        RC_TEST; \
        Range <<= 8; \
        Code = (Code << 8) | RC_READ_BYTE; \
        if (!((uintptr_t)Buffer & 63)) { \
            if ((BufferLim - Buffer) >= 128) { \
                __lzma_prefetch(Buffer[64]); \
            } \
        } \
    }
#define IfBit0(p) \