Chinese Loongson 3A6000 processor of its own LoongArch architecture: performance testing

26.08.2024 07:05

This CPU test is probably one of the most unusual on our site in recent years. We are used to desktops and laptops based on the x86 architecture, but there are other computing architectures, such as ARM, used in solutions from Huawei and Qualcomm, which recently launched new models, including those running Windows. The MIPS architecture, although not as widespread as x86, has also been popular in the past.

Today, we are seeing a growing interest in processors with alternative architectures. ARM, for example, has reached the level of power required for modern desktops and powerful laptops. Moreover, Chinese companies have made significant progress in the development of microelectronics, including general-purpose and graphics processors. One of the most interesting Chinese processors is the Loongson 3A6000, which we will look at in this test.

China is seeking independence in chip manufacturing, driven by the growing importance of computing technology for the economy and the need to reduce dependence on foreign solutions. Against the backdrop of US trade restrictions and sanctions that prohibit the sale of advanced technologies and chips to Chinese companies, China is actively developing its own microelectronics industry.

In this context, Chinese companies such as Loongson play a key role. Loongson, for example, is a leading manufacturer of general-purpose processors in China. We have already mentioned the Chinese-developed Moore Threads graphics processors, and now it is time to look at general-purpose processors that are used in laptops, desktops and servers.

US trade restrictions have themselves given a significant push to Chinese microelectronics. By slowing access to the latest Western technologies, they forced China to seek alternatives and invest in domestic development. Chinese fabs have not yet reached the level of world leaders such as Taiwan's TSMC, but progress is being made, and the semiconductor war may last for decades.

Loongson Science and Technology, a company backed by the Chinese Academy of Sciences, was founded over a decade ago and has released several series of processors. In 2022, it introduced the Godson 3C5000 and 3C6000 server models with 16 cores, and later the 32-core 3D6000, which combines two 3C6000 dies on a single substrate. The recently released 3A6000 shows a significant improvement in performance per clock over the previous 3A5000, which makes it a more serious competitor to mainstream processors.

Desktop PCs required a more powerful processor, and the Loongson 3A6000 became the company's highest-performing solution to date. In early August 2023, the Chinese manufacturer announced the start of production of the new-generation quad-core Loongson 3A6000, and in November it was officially unveiled. The processor uses the 64-bit LoongArch architecture, which grew out of the company's earlier MIPS64-based designs, and includes a hardware TPM encryption module and a dual-channel memory controller with DDR4-3200 support.

At the time of the launch of Loongson 3A6000, more than 50 of the company's partners announced products based on it, including computers, laptops, boards, storage devices, and network security equipment. On Chinese trading platforms, you can buy Loongson 3A6000 both as part of a ready-made system and installed on a motherboard in two versions: one with a soldered processor without the ability to overclock, the other from Asus with some overclocking capabilities, which we will discuss in more detail in the second part of the review.

The manufacturer claims that the Loongson 3A6000 performs on par with the Core i3-10100, a quad-core 10th-generation Intel Core processor. Judging by the tests, the 3A6000 does deliver performance per clock (IPC) that is competitive with comparable processors from Intel and AMD. In the single-threaded SPEC CPU 2006 tests, the 3A6000 scored 43.1 points in SPECint and 54.6 points in SPECfp. These results are close to those of previous generations of Intel Core processors running at the same 2.5 GHz clock. In the SPEC CPU 2017 tests, the 3A6000 also looks good despite its clock speed deficit.

The Loongson 3A6000 still lags behind Intel and AMD processors, despite expectations that it would reach the level of 11th-generation Core and Zen 3 in performance per clock. At 2.5 GHz, the 3A6000 is slightly ahead of the Core i3-10100 at 3.6 GHz in some tests, but such tests are few. Compared with the previous 3A5000, single-threaded performance has increased by 60%, and the multi-threaded gain is even more impressive, largely because the starting point was low. If the Chinese engineers really have achieved an IPC comparable to modern Intel and AMD designs, that is a significant achievement. However, IPC is only part of the picture, and clock speed also plays a key role. The 3A6000's maximum frequency of 2.5 GHz is far below the turbo frequencies of competing processors, which limits its competitiveness. The Core i3-10100, which boosts up to 4.3 GHz, is often faster.
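The interplay between IPC and clock speed can be illustrated with a rough first-order model in which throughput is simply IPC multiplied by frequency. The sketch below rests on that simplifying assumption (it ignores memory stalls and real boost behaviour); the function name is ours, and the frequencies are the ones quoted in the text.

```python
def required_ipc_ratio(own_ghz: float, rival_ghz: float) -> float:
    """IPC advantage needed to match a rival, assuming performance = IPC x clock."""
    if own_ghz <= 0 or rival_ghz <= 0:
        raise ValueError("frequencies must be positive")
    return rival_ghz / own_ghz


# Frequencies from the text: 3A6000 at 2.5 GHz, Core i3-10100 at 3.6 GHz and up to 4.3 GHz.
for label, rival_ghz in (("3.6 GHz", 3.6), ("4.3 GHz turbo", 4.3)):
    ratio = required_ipc_ratio(2.5, rival_ghz)
    print(f"To match a rival at {label}, a 2.5 GHz core needs {ratio:.2f}x the IPC")
```

Under this model, a 2.5 GHz core would need roughly 1.7 times the IPC of a core boosting to 4.3 GHz just to break even, which is why a competitive IPC alone does not close the gap.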

The frequency limitation is due not only to architectural features, but also to the Chinese semiconductor industry's lag in process technology. Despite these difficulties, Loongson continues to develop its technologies and strives to improve the performance of its processors.

Architectural features

The Loongson 3A6000 processor is an upgraded version of the previous 3A5000 model. Unlike the 3A5000, which has four cores and four threads, operates at 2.5 GHz, consumes up to 35 watts and supports DDR4-3200 memory, the new 3A6000 offers multi-threading support and increased maximum power consumption, but maintains the same operating frequency, which is quite modest by today's standards.

Like its predecessor, the 3A5000, the 3A6000 is manufactured on a 12 nm process. The die area of the 3A6000 is approximately 116 mm² (11.6 × 10 mm), smaller than the 142 mm² of the 3A5000. However, the new processor has twice the L1 cache, while the L2 cache remains at 256 KB per core and the shared L3 cache at 16 MB (4 MB per core).

The processor we are reviewing is based on LoongArch, the latest generation of the Godson architecture, introduced in 2021. Before that, the company used the MIPS architecture, starting with unofficial solutions and then acquiring licenses for MIPS32 and MIPS64 from MIPS Technologies. The first Loongson processor with a 32-bit MIPS32 architecture appeared in the early 2000s, and 64-bit models followed with architectural extensions and support for x86 binary translation.

Architectural extensions included native instruction sets, virtualization, x86 and ARM binary translation acceleration, and vector extensions for 128-bit SIMD. In 2021, the LoongArch architecture was introduced with the Loongson 3 5000-series processors. It is based on MIPS64 but was adapted by Chinese engineers, who added their own instructions. As a result, the base architecture includes 128-bit and 256-bit vector instructions (LSX and LASX), virtualization instructions (LVZ), and binary translation extensions (LBT).

The 3A6000 uses new LA664 cores in place of the LA464 cores of the 3A5000. These cores have a deeper pipeline and support simultaneous multi-threading (SMT), which improves performance in multi-threaded workloads. As a result, the quad-core 3A6000 supports eight threads and demonstrates a 32% increase in single-threaded performance and an 84% increase in multi-threaded performance compared to the 3A5000.

Each LA664 core has 64 KB of instruction cache and 64 KB of L1 data cache, as well as 256 KB of L2 cache. The total L3 cache for all cores is 16 MB.

Architecture of computing cores

The 3A6000's LA664 core features out-of-order execution and advanced instruction reordering, making it competitive with similar cores from previous generations of Intel and AMD. While the LA664 is based on the previous LA464 core found in the 3A5000, it has improved capabilities and can execute more instructions at once.

The 3A6000's branch predictor has been significantly improved over the 3A5000, offering performance close to that of Intel and AMD processors from several generations ago. While Loongson doesn't quite reach the level of Zen 3 and newer solutions, the improvements to the branch predictor contribute to a noticeable performance boost over the previous model.

Benchmarks show that the 3A6000's branch predictor performs at least as well as Zen 1's in tasks like data compression, and is probably close to Zen 2's. Zen 4 remains vastly superior, but the 3A6000's improvement over the 3A5000 is clear. While the branch predictor determines the direction of execution, the instruction cache supplies the core with instructions. The 3A6000 has increased its L1 instruction cache to 64 KB, compared with the 32 KB found in competing AMD and Intel designs. This cache feeds the decoder, which is now more powerful than the 3A5000's.

During renaming and allocation, an out-of-order core assigns each instruction slots in queues and buffers that track its state. Larger structures let the core look further ahead in the instruction stream, hiding latency and exposing more instruction-level parallelism. The 3A6000 also enlarges the register files and memory queues by a quarter or more, which corrects the shortcomings of the branch buffer present in the LA464.

Large out-of-order buffers are important for improving single-threaded performance, but a well-tuned multi-threading technology (SMT), which distributes CPU resources among multiple threads, is also important. The 3A6000 implements a conservative version of SMT, with static resource sharing — register files, load queues, and store queues — a logical choice for the company's first processor to support SMT.

INT and FP execution units

Compared to the previous model, the integer execution units in the 3A6000 are almost unchanged, except for a larger scheduler, which makes use of the ALUs more efficient. The 3A6000 retains four ALU pipelines, two of which handle branches and two of which handle integer multiplications. This organization is reminiscent of AMD's Zen 2, but with two pipelines for integer multiplication where AMD has one, although AMD has a more capable scheduler.

The 3A5000 already supported 256-bit vector operations through the LASX extension, but with only two pipelines. The 3A6000 brings a major upgrade: there are now four pipelines, each able to handle 256-bit packed additions, which significantly increases floating-point throughput. For comparison, x86 processors can typically perform only two 256-bit packed additions per cycle.

However, peak throughput for FMA (fused multiply-add, with a single rounding) remains unchanged: both the LA664 and the LA464 can perform only one FMA operation per cycle, which is half that of Zen 2 or Skylake. Alongside the additional pipelines, scheduler capacity has grown by 50%, which should improve floating-point performance. These improvements make the 3A6000 a much more capable processor for vector and floating-point workloads.
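To put the FMA limit into numbers, the following sketch estimates theoretical peak floating-point throughput. It assumes that each FMA operates on a full 256-bit vector and counts a fused multiply-add as two floating-point operations; the function and its parameters are illustrative, not taken from Loongson documentation.

```python
def peak_gflops(fma_per_cycle: int, vector_bits: int, element_bits: int,
                ghz: float, cores: int) -> float:
    """Theoretical peak GFLOPS, counting one FMA lane as 2 floating-point ops."""
    lanes = vector_bits // element_bits          # elements per vector register
    flops_per_cycle = fma_per_cycle * lanes * 2  # multiply + add per lane
    return flops_per_cycle * ghz * cores


# One 256-bit FMA per cycle (as described for LA664), double precision, 4 cores at 2.5 GHz.
one_fma = peak_gflops(1, 256, 64, 2.5, 4)
# Same clock and core count with two FMAs per cycle, for comparison only.
two_fma = peak_gflops(2, 256, 64, 2.5, 4)
print(f"1 FMA/cycle: {one_fma:.0f} GFLOPS, 2 FMA/cycle: {two_fma:.0f} GFLOPS")
```

Under these assumptions the chip peaks at 80 double-precision GFLOPS, and a design with two FMA units would reach twice that at the same clock, before any difference in frequency is considered.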

Cache and Memory Subsystem

A well-designed cache and memory hierarchy is critical to the efficient operation of a modern high-performance processor. While the 3A6000 retains the cache hierarchy of its predecessor, improvements have been made to reduce latency and simplify access to cached data. For example, the latency to access data from the L1 cache has been reduced from four to three cycles, which is especially important given the relatively low frequency of Loongson compared to higher-frequency Western processors.

Current-generation processors use a second-level (L2) cache to reduce the impact of L1 cache misses and high L3 cache latency. The 3A6000 uses a 256 KB L2 cache, just like older Intel architectures. Newer AMD and Intel processors have significantly larger L2 caches: up to 1 MB in Zen 4 and up to 2 MB in Raptor Lake. The 3A6000 keeps the L2 size of its predecessor, but L2 latency has dropped from 14 to 12 cycles. The 16 MB L3 cache shared by all four cores is unchanged, but L3 latency has also been reduced by a couple of cycles, possibly due to the improved L2 cache.

The 3A6000's DDR4 memory controller is a significant improvement over the 3A5000's, with memory access latency reduced from 144 ns to 104 ns. However, because Loongson's clock speed is lower than that of modern AMD and Intel processors, the same cycle counts translate into longer absolute latencies. As a result, despite its improved instruction reordering and higher IPC, the 3A6000 has higher absolute latency at every cache level than even older designs such as Zen 2.
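The relationship between cycle counts and absolute latency is simple arithmetic, and a small sketch makes it concrete. The Loongson cycle counts and clock come from the text; the higher-clocked comparison core is a hypothetical example, not a measurement of any specific product.

```python
def cycles_to_ns(cycles: int, ghz: float) -> float:
    """Convert a latency in core cycles to nanoseconds at a given clock."""
    return cycles / ghz


# 3A6000 figures from the text: L1 = 3 cycles, L2 = 12 cycles, 2.5 GHz.
print(f"3A6000 L1: {cycles_to_ns(3, 2.5):.2f} ns")
print(f"3A6000 L2: {cycles_to_ns(12, 2.5):.2f} ns")

# Hypothetical core with a slower L1 in cycles (4) but a 4.0 GHz clock.
print(f"Hypothetical 4.0 GHz core, 4-cycle L1: {cycles_to_ns(4, 4.0):.2f} ns")
```

Even with one extra cycle of L1 latency, the hypothetical 4.0 GHz core returns data sooner in wall-clock time (1.0 ns against 1.2 ns), which is the effect described above.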

Cache and memory bandwidth

Memory bandwidth is also critical to performance, especially in multi-threaded applications. The 3A6000 retains many elements of the memory hierarchy from its predecessor, but with some improvements. The 3A5000 had L1 cache bandwidth comparable to Skylake and Zen 2 processors, while the 3A6000 has doubled it for writes. The L1 data cache now provides two 256-bit accesses per cycle, which provides excellent throughput even at a relatively low processor clock rate. This compares favorably with the Golden Cove core, which has similar L1 bandwidth.

The 3A6000's 256 KB L2 cache remains almost identical to its predecessor's, with read and write bandwidth of 21-22 bytes per cycle, which is lower than in modern Western processors, especially Intel's, which reach 64 bytes per cycle. However, L3 cache bandwidth has increased by a third to 18 bytes per cycle, which allows the Loongson 3A6000 to compete with older Intel processors, although AMD's L3 cache is even better.
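Per-cycle figures only become comparable across processors once the clock is factored in. The sketch below converts the bytes-per-cycle values quoted above into GB/s at the 3A6000's 2.5 GHz clock; the helper name is ours, and the L1 figure assumes two 256-bit (32-byte) accesses per cycle as stated in the text.

```python
def bytes_per_cycle_to_gbs(bytes_per_cycle: float, ghz: float) -> float:
    """Bandwidth in GB/s: bytes per cycle times 1e9 cycles per second per GHz."""
    return bytes_per_cycle * ghz


CLOCK_GHZ = 2.5  # 3A6000 clock from the text

levels = {
    "L1 (2 x 32 B/cycle)": 2 * 32,
    "L2 (about 21.5 B/cycle)": 21.5,
    "L3 (18 B/cycle)": 18,
}
for name, bpc in levels.items():
    print(f"{name}: {bytes_per_cycle_to_gbs(bpc, CLOCK_GHZ):.1f} GB/s")
```

The same function applied to a competitor's per-cycle figure and clock shows how a higher frequency widens the absolute gap even where the per-cycle numbers are close.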

The DDR4 memory controller in the 3A6000 has been significantly improved compared to the 3A5000. Although support for DDR4-3200 is declared, we achieved stable operation at this speed only in single-channel mode. In dual-channel mode, the 3A6000 ran at DDR4-1800/DDR4-1866, although some tests showed that DDR4-2133 may be possible. This may depend on compatibility with specific memory modules, and the BIOS Setup offers no memory settings. In general, by modern standards, the memory performance of the 3A6000 remains mediocre.

However, the 3A6000 shows about 38% higher single-threaded performance compared to the 3A5000, and in multi-threaded tasks the difference is even greater thanks to SMT support, which the 3A5000 lacks. SMT provides a performance increase of 20-30% compared to the non-SMT variant, although the gains of competitors can reach 40%. Despite this, in Loongson's tests, the 3A6000 shows competitive results compared to older solutions from AMD and Intel, although this depends on specific tests and software.

At the launch event, Loongson presented three models: LS3A6000-HV for server and high-performance systems, LS3A6000-LL for desktop PCs, and LS3A6000M for mobile devices. All models have the same physical specifications: a 35x35mm FCBGA package with 1190 contacts. The clock speed ranges from 2.0 to 2.5 GHz depending on the model, and the power consumption ranges from 30 to 80 watts.

The Loongson 3A6000 processor memory controller supports two DDR4-3200 channels and ECC error correction. For input/output, the processor uses a HyperTransport 3.0 controller with a maximum speed of 6.4 Gbps, which is compatible with the HT 1.0 and HT 3.0 standards operating at frequencies of 200-800 MHz and 1000-3200 MHz, respectively. For power consumption management, the functions of dynamic shutdown of the main module clock signal, dynamic frequency conversion of the main clock signal and dynamic regulation of the main domain voltage are provided. The 3A6000 processor only works in a single-processor configuration, and the physical width of the bus address is limited to 44 bits.

The processor comes with a 7A2000 chipset, which includes an LG110 graphics core. It supports resolutions up to 1920×1080 at 120Hz or up to 4K at 30Hz via HDMI and VGA, as well as OpenGL 2.1 and OpenGL ES 2.0. The chipset provides 32 PCIe 3.0 lanes, four SATA600 ports, four USB 3.0, and eight USB 2.0. These features provide ample expansion and connectivity for entry-level PCs.
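The two maximum display modes look unrelated, but a quick calculation suggests they reflect the same pixel-rate ceiling. The sketch below assumes that "4K" means 3840×2160 and counts only active pixels, ignoring blanking intervals; the function name is ours.

```python
def pixel_rate(width: int, height: int, refresh_hz: int) -> int:
    """Active pixels per second, ignoring blanking intervals."""
    return width * height * refresh_hz


full_hd_120 = pixel_rate(1920, 1080, 120)
uhd_30 = pixel_rate(3840, 2160, 30)  # assumption: "4K" is 3840x2160

print(f"1920x1080 @ 120 Hz: {full_hd_120:,} pixels/s")
print(f"3840x2160 @ 30 Hz:  {uhd_30:,} pixels/s")
print("Same pixel rate:", full_hd_120 == uhd_30)
```

Four times the pixels at a quarter of the refresh rate gives an identical pixel rate, so both limits are consistent with a single bandwidth cap in the display output.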

Performance testing

Test systems and conditions

The testing used both a ready-made Chinese PC based on the Loongson 3A6000 processor with the specified hardware, and a motherboard from Asus with a soldered processor, a complete air cooler and additional components.

  • Processor: Loongson 3A6000 (4 cores/8 threads, 2.5 GHz)
  • Cooling system: small-sized air coolers
  • System boards:
    • PNXC PN-L530A (7A2000 chipset)
    • Asus XC-LS3A6M (7A2000 chipset)
  • RAM:
    • 16GB (2x8GB) DDR4-3200 CL16
    • 16GB (1×16GB) DDR4-3200 CL22
  • Video cards: integrated into the LG110 chipset and external AMD Radeon RX 480 (8 GB)
  • Storage: Kimtigo TP3000 512GB SSD

The Chinese processor supports DDR4-3200 memory, just like its predecessor, the quad-core, quad-thread Loongson 3A5000. Dual-channel operation is possible, but our DDR4 memory kits only ran at DDR4-1800/1866, which limits the performance gains from dual-channel operation. We also tested with dual-channel memory to see how it would impact the system. For reference, the PNXC pre-built system has one module running in single-channel mode at full DDR4-3200 speed.

Unfortunately, the BIOS Setup offers no settings for memory frequency or timings. The motherboard sets all parameters automatically, with no way to select or configure XMP profiles, which reflects the limitations of a relatively new platform. Alongside the Loongson graphics core integrated into the chipset, we used an AMD Radeon RX 480 graphics card for additional tests; the reasons for choosing this model will be explained in the practical part of the article.

Choosing competitors for the Loongson 3A6000 was difficult, since we did not have models such as the Core i3-10100 in stock. For comparison, we therefore used the most modest available Intel and AMD systems: a Core i3-12100 with DDR5 memory, and a Ryzen 7 1700 configured to emulate a quad-core Ryzen 5 1500X. We disabled half of the Ryzen's cores while keeping the full 16 MB of L3 cache, although Infinity Fabric bandwidth remains limited. We also set the power limit accordingly to achieve a near-perfect match with the Ryzen 5 1500X.

For the Ryzen and Core processors, we used standard test motherboards and typical memory with settings from the XMP profiles, and power limits were set according to the processor specifications. We also tested the Core i3-12100 at a constant frequency of 2.5 GHz, disabling all overclocking technologies such as Turbo Boost and Thermal Velocity Boost, and setting a lower power limit. Similarly, the simulated Ryzen 5 1500X was set to a constant frequency of 2.5 GHz with overclocking technologies such as Precision Boost Overdrive disabled (see screenshot).

As a result, we got comparable processors in which all cores operate at 2.5 GHz, like the Chinese processor. With the Core and Ryzen running at the same frequency, we can evaluate the architecture of the Chinese processor and find out how it compares with modern, though not the newest, solutions from Intel and AMD clock for clock. This lets us compare IPC, that is, performance per clock, or the number of instructions executed per clock.
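When clocks cannot be locked to the same value, a per-clock comparison can still be approximated by dividing each score by its frequency. The sketch below shows that normalization; the scores in the example are placeholder values chosen only to demonstrate the arithmetic and are not measurements from this review.

```python
def per_clock_score(score: float, ghz: float) -> float:
    """Benchmark score divided by clock frequency (a rough IPC proxy)."""
    if ghz <= 0:
        raise ValueError("frequency must be positive")
    return score / ghz


def relative_ipc(score_a: float, ghz_a: float, score_b: float, ghz_b: float) -> float:
    """Per-clock performance of processor A relative to processor B."""
    return per_clock_score(score_a, ghz_a) / per_clock_score(score_b, ghz_b)


# Placeholder scores, NOT measured data: A scores 1000 at 2.5 GHz, B scores 1500 at 3.5 GHz.
ratio = relative_ipc(1000, 2.5, 1500, 3.5)
print(f"A delivers {ratio:.2f}x the per-clock performance of B")
```

When both processors run at the same frequency, as in the 2.5 GHz configurations used here, the ratio reduces to the plain ratio of scores, which is why locking the clocks is the cleaner method.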

As for the software, at the moment, two full-fledged operating systems with official support for the LoongArch architecture have been released: Loongnix and UOS. There are also test builds of the Russian ALT Linux, and support may expand in the future. The PC provided to us was pre-installed with the Linux-based UOS operating system, but in our tests we also used Loongnix on a second system based on an Asus board. To ensure equal testing conditions, Ryzen 5 and Core i3 were launched under the x86-compatible version of UOS. Please note that other versions of Linux and/or Windows may show better results for x86 processors.

The choice of test software was quite difficult. Because there are no Windows versions for Loongson and little familiar software for Linux, we decided to use the Phoronix Test Suite and run as many of its tests as are compatible with the LoongArch64 architecture.

This process was not easy, since many tests either do not support this architecture or depend on libraries and optimizations that are unique to x86-64 and cannot be built on other CPUs. Even when the code has no obvious architectural dependencies, the build may fail, and tests that do build may run incorrectly, produce no results, or exit with errors. As a result, we used only those tests from the suite that ran successfully on a system with a Loongson processor. Some of them may rely on binary translation of x86 code, which makes it harder to tell how they actually run.

Synthetic tests

Memory and caching system performance

First, let's evaluate the efficiency of the memory controller and data caching system designed by the Chinese engineers. Unfortunately, the testing conditions for different processors varied, since Intel supports DDR5 memory, while AMD and Loongson work only with DDR4. The Chinese processor was unable to function in dual-channel DDR4-3200 mode with the modules we had, although it worked without problems with a single DDR4-3200 module. However, the difference in performance between single-channel DDR4-3200 and dual-channel DDR4-1866 was less than expected.

The first test was to check the bandwidth of the cache memory and RAM using CacheBench from the LLCbench package. This test measures the bandwidth for reading, writing, and mixed mode data operations. The results show that the main impact on performance is exerted by the bandwidth of the caches, not the RAM, since the difference between single-channel and dual-channel modes in Loongson is insignificant.

The Intel processor has a clear advantage in all modes, which is expected: it is the newest and, using DDR5-5200, delivered twice the cache and memory performance of the Chinese Loongson. The Ryzen 5 1500X was almost one and a half times faster than the Loongson in this test, which we attribute to its dual-channel memory operation.

However, the difference between the processors decreases when the frequency is reduced to 2.5 GHz. In this case, Loongson not only caught up with the Ryzen 5 1500X, but also beat it, demonstrating better cache performance than the older Zen 1 at the same core count. The Core i3-12100, scaled to 2.5 GHz, stayed on par with the full-speed Zen 1. Overall, Loongson's results so far don't look too bad.

The second benchmark, Memory Bandwidth (MBW), is a simple memory bandwidth test that measures the speed of data copy operations. We used two data sizes: 128 MB and 4 GB. In this test, the memory bandwidth is more important than the cache, so the dual-channel mode, even with DDR4-1866, shows an advantage over single-channel DDR4-3200, despite the lower frequency in the former case.
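The advantage of the slower dual-channel configuration follows from the theoretical peak bandwidth of each setup. The sketch below uses the standard estimate of transfer rate times bus width times channel count, assuming a 64-bit (8-byte) data bus per DDR4 channel; the function name is ours.

```python
def peak_dram_gbs(mega_transfers_per_s: int, channels: int, bus_bytes: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (64-bit bus per channel assumed)."""
    return mega_transfers_per_s * 1e6 * bus_bytes * channels / 1e9


configs = {
    "single-channel DDR4-3200": (3200, 1),
    "dual-channel DDR4-1866": (1866, 2),
    "dual-channel DDR4-3200": (3200, 2),
}
for name, (rate, channels) in configs.items():
    print(f"{name}: {peak_dram_gbs(rate, channels):.1f} GB/s")
```

On paper, dual-channel DDR4-1866 (about 29.9 GB/s) already exceeds single-channel DDR4-3200 (25.6 GB/s), while dual-channel DDR4-3200 would double the latter, which is the headroom the platform currently leaves unused.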

Compared to its competitors, the Core i3-12100 was again the fastest, which is explained by the use of DDR5 memory. The Ryzen 5 1500X also demonstrated better performance compared to the Chinese processor, especially in the first copy mode. However, with a fixed data block size, Loongson in dual-channel mode even showed higher speed than the Zen 1 family processors, which is a good result considering the use of DDR4-1866.

RAMspeed is a RAM performance test where we use two modes of average data transfer rate: integer and floating point. In this test, RAM bandwidth is key, and the difference between single- and dual-channel modes was noticeable, with the latter option being superior despite the reduced memory frequency.

However, even with this advantage, the Loongson 3A6000 was unable to come close to the results of older Western processors. The Core i3-12100 and Ryzen 5 1500X showed significantly better performance compared to the Chinese processor, even when their operating frequency was adjusted to 2.5 GHz, which is the maximum for Loongson. In this test, the 3A6000 was two to three times slower than its competitors.

The last test in this section is called Stream. This is a popular benchmark for evaluating RAM bandwidth that offers four different measurement methods. Here, too, the results differed depending on the number of memory channels. Unfortunately, the Chinese processor's results were unimpressive. In one of the modes it came close to the Ryzen 5 1500X, but in the others it lagged significantly behind both AMD and the more powerful Intel processor with DDR5 memory, which was the fastest in this comparison. Overall, the Chinese processor's memory controller has not yet reached the efficiency of even older Western processors.
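The four measurement methods in Stream are its Copy, Scale, Add and Triad kernels. The following is a minimal pure-Python sketch of their structure, written for illustration only: the real benchmark is compiled C or Fortran with much larger arrays, and interpreter overhead here dominates, so the printed rates say nothing about actual memory bandwidth.

```python
import time
from array import array


def run_stream(n: int = 100_000, q: float = 3.0):
    """Run the four STREAM-style kernels once; return MB/s per kernel and final values."""
    a = array("d", [1.0]) * n
    b = array("d", [2.0]) * n
    c = array("d", [0.0]) * n

    def copy():   # c = a
        for i in range(n):
            c[i] = a[i]

    def scale():  # b = q * c
        for i in range(n):
            b[i] = q * c[i]

    def add():    # c = a + b
        for i in range(n):
            c[i] = a[i] + b[i]

    def triad():  # a = b + q * c
        for i in range(n):
            a[i] = b[i] + q * c[i]

    # Bytes touched per element with 8-byte doubles: two arrays for Copy and Scale,
    # three arrays for Add and Triad.
    kernels = [("Copy", copy, 16), ("Scale", scale, 16), ("Add", add, 24), ("Triad", triad, 24)]
    rates = {}
    for name, kernel, bytes_per_elem in kernels:
        start = time.perf_counter()
        kernel()
        elapsed = max(time.perf_counter() - start, 1e-9)
        rates[name] = bytes_per_elem * n / elapsed / 1e6
    return rates, (a[0], b[0], c[0])


rates, final_values = run_stream()
for name, mbs in rates.items():
    print(f"{name}: {mbs:.1f} MB/s (interpreter-bound, illustrative only)")
```

Each kernel streams through arrays far larger than the caches in the real benchmark, so the reported rate reflects sustained memory bandwidth rather than compute speed.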

Synthetic and general tests

Synthetic benchmarks from different packages can be useful for assessing the low-level characteristics of a processor in specific tasks, although some of them also offer some versatility. This group of tests demonstrates the relative performance of a processor in various applications and scenarios.

Core-Latency is a test that measures the latencies between all combinations of processor cores, providing minimum, average, and maximum values. The results are especially interesting in chiplet architectures or multi-processor systems, where the latencies between cores can vary greatly.

The test results show that latencies are almost the same in dual-channel and single-channel DDR4 modes. The Loongson 3A6000 die is clearly monolithic: the differences in latency between its cores are minimal, almost the same as on the monolithic Core i3-12100, although slightly worse. However, when the competitors' frequency is reduced to 2.5 GHz, Loongson begins to show an advantage. The Ryzen 5 1500X clearly lags behind because its cores are split between two core complexes (CCXs): latencies between active cores in different CCXs are significantly higher, especially the average and maximum values.

EEMBC CoreMark is a set of synthetic benchmarks designed to measure the performance of processors and microcontrollers, replacing the old Dhrystone test. It includes data search and sorting algorithms, matrix operations, checksum calculation, and other tasks. The results are presented as the number of repetitions per second, which makes it convenient to compare different systems.

This test depends only weakly on memory speed, and in it the Loongson 3A6000 is not too far behind the Ryzen 5 1500X even at the latter's full frequency of 3.5 GHz. When the AMD processor is slowed to 2.5 GHz to match the Chinese processor, Loongson proves more efficient than the Zen 1 design. The Core i3-12100, by contrast, outperforms the 3A6000 by a significant margin even at 2.5 GHz, well below its nominal 3.3 GHz.

Swet is a synthetic benchmark for evaluating the performance of CPUs and RAM, including multi-processor and multi-core systems. The results are expressed in operations per second. Despite the stated impact of RAM speed, no difference was noticed between dual-channel and single-channel modes on the Loongson 3A6000.

Compared to the other processors, the Chinese CPU shows weak results: it is half as fast as the Ryzen 5 1500X at full frequency and almost four times slower than the Core i3-12100 at its nominal frequency. Even with the Western processors slowed to 2.5 GHz, Loongson does not reach the Ryzen, let alone the Core i3. This test is probably poorly suited to the Chinese processor, either because it is insufficiently optimized for its architecture or because it runs through a binary translator, which makes such results unattractive for the manufacturer's advertising materials.

HardInfo is a system and hardware information and monitoring tool that includes several performance tests covering various tasks, from ray tracing to cryptography. The results are presented as execution time or points.

Here, the Chinese processor shows impressive results. In the ray tracing test, the Loongson 3A6000 showed a result better than the Ryzen 5 1500X at full frequency, and comparable to the Core i3-12100 at the nominal frequency. In the N-Queens task, known for its complexity, Loongson was the fastest, while the Core i3-12100 showed the worst result, which may be due to the peculiarities of the specific implementation.
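For readers unfamiliar with the N-Queens task, the sketch below counts the ways to place N queens on an N×N board so that none attack each other, using plain recursive backtracking. It illustrates the kind of branch-heavy integer workload involved; HardInfo's own implementation may differ in algorithm and board size.

```python
def count_n_queens(n: int) -> int:
    """Count all placements of n non-attacking queens on an n x n board."""

    def place(row: int, cols: frozenset, diag1: frozenset, diag2: frozenset) -> int:
        if row == n:
            return 1  # every row holds a queen: one complete solution
        total = 0
        for col in range(n):
            d1, d2 = row - col, row + col  # identifiers of the two diagonals
            if col in cols or d1 in diag1 or d2 in diag2:
                continue  # square is attacked by an earlier queen
            total += place(row + 1, cols | {col}, diag1 | {d1}, diag2 | {d2})
        return total

    return place(0, frozenset(), frozenset(), frozenset())


print("Solutions for 8 queens:", count_n_queens(8))
```

The search is dominated by data-dependent branches and small integer operations, so results depend heavily on branch prediction and on how a particular implementation is compiled.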

In the CryptoHash, Zlib and Fibonacci tests, Loongson shows good results, comparable to the Ryzen 5 at its normal frequency and the Core i3 at 2.5 GHz. In the FFT (fast Fourier transform) test, the Chinese processor also does well, matching the Ryzen 5 1500X at 3.5 GHz, although it is inferior to the Core i3-12100 even at 2.5 GHz. Overall, the Loongson 3A6000 is close in performance to the Ryzen 5 1500X at its normal frequencies, but the Core i3-12100 is still clearly faster.

The next test is a Java version of the SciMark 2.0 scientific computing benchmark, which includes algorithms such as Monte Carlo integration, fast Fourier transform, Jacobi successive over-relaxation, sparse matrix multiplication, and LU matrix decomposition. The tests show some impact from the increased memory bandwidth of dual-channel mode, although not for all algorithms.

The Loongson 3A6000 performed well, especially when compared to the AMD and Intel processors slowed down to 2.5 GHz. This indicates a good IPC, although it still falls short of the full-speed Ryzen 5 1500X and Core i3-12100. The results vary by subtest, with the Loongson reaching Ryzen levels at full clock in some.

In the Jacobi successive overrelaxation method and in sparse matrix multiplication operations, the Chinese processor was inferior even to the slowed-down Ryzen 5 1500X. However, in the fast Fourier transform test, it showed better results than Ryzen at full frequency, and in the LU matrix decomposition, it became the fastest of the three tested processors. This shows that much depends on the specific task and optimization. In general, in terms of IPC, the Chinese processor is close to representatives of the Zen 1 family.
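As an illustration of one of these kernels, here is a minimal sketch of successive over-relaxation on a 2D grid: each interior point is replaced by a weighted mix of its four neighbours and its own previous value. The grid size, relaxation factor and iteration count are illustrative choices and are not the parameters used by the benchmark.

```python
def sor_sweeps(grid, omega: float = 1.25, iterations: int = 10):
    """Apply in-place SOR sweeps to the interior of a 2D grid (list of lists)."""
    rows, cols = len(grid), len(grid[0])
    neighbour_weight = omega / 4.0
    self_weight = 1.0 - omega
    for _ in range(iterations):
        for i in range(1, rows - 1):
            for j in range(1, cols - 1):
                neighbours = grid[i - 1][j] + grid[i + 1][j] + grid[i][j - 1] + grid[i][j + 1]
                grid[i][j] = neighbour_weight * neighbours + self_weight * grid[i][j]
    return grid


# Example: a 6x6 plate with the top edge held at 1.0 and everything else starting at 0.0.
plate = [[1.0] * 6] + [[0.0] * 6 for _ in range(5)]
sor_sweeps(plate, omega=1.25, iterations=50)
print("Value just below the hot edge:", round(plate[1][2], 4))
```

The inner loop reads four neighbours for every update, which makes the kernel sensitive to cache and memory behaviour, consistent with its dependence on the memory subsystem noted above.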

Benchmark Stress-NG

Stress-NG is a specialized utility for complex hardware load testing with many different tests. Since the package includes so many tests, we decided to present the results in a convenient tabular format, including only those tests that completed successfully on all systems.

This test also includes results for Loongson running a different operating system, Loongnix, in dual-channel memory mode. As you can see from the table, the results vary, and in some tests the difference can reach 1.5 times. You can analyze the table yourself, but we will note a few key points by comparing the Loongson 3A6000 with Western processors.

First of all, it is worth noting that Loongson is often not inferior to the Ryzen 5 1500X, operating at a frequency reduced to 2.5 GHz, and is close to the Core i3-12100 in the same mode, although on average it is still slightly inferior to the more modern Intel processor. The greatest losses are observed in tasks that require active use of matrix calculations and specialized SIMD instructions, which indicates that the tests may not be optimized for the Loongson instruction set.

Rendering

Rendering tests are among the most challenging for modern processors due to the multi-threaded nature of the ray tracing workload. Under these conditions, processors strive to maintain the highest possible clock rate, consume maximum power, and run very hot. Manufacturers often use rendering tests to compare the performance of their processors with competitors, since such workloads are better handled by processors with more cores and threads.

We have put the results of four rendering benchmarks on one chart:

  • AOBench: a lightweight renderer using ambient occlusion, resolution 2048x2048 pixels.
  • C-Ray: a multi-threaded ray tracer for testing floating point calculations.
  • POV-Ray: a Persistence of Vision ray tracer.
  • Smallpt: a Monte Carlo path tracing global illumination renderer using OpenMP multithreading.

Memory bandwidth has almost no effect on the results: rendering speed is nearly identical regardless of the number of memory channels. The results of the Loongson 3A6000 in the rendering tests were quite impressive. Not only was it practically on par with its competitors at the same frequency of 2.5 GHz, which points to high IPC, but in some tests it also came close to the Ryzen 5 1500X at its nominal frequency of 3.5 GHz.

The Core i3-12100, as expected, was significantly ahead of both competitors at a frequency of 3.3 GHz. Even when the frequency was reduced to 2.5 GHz, it remained faster than the Chinese processor and was only slightly inferior to AMD. The results of the Loongson 3A6000 can be considered successful, especially considering that its IPC is not much inferior even to more modern processors. Expectations were lower, and there are still many different tests ahead.

Working with media data

In this section, we look at several media processing tests, including photos and videos. These tests cover practical tasks such as encoding audio and video data into various formats, as well as more specialized tasks such as speech synthesis. Since many users perform these tasks frequently, the results in this section are of real practical importance.

First, we tested audio compression in various formats: APE, FLAC, and WavPack. All of these formats are designed for lossless audio compression, and the RAM bandwidth did not have a noticeable impact on the results.

Unfortunately, the Loongson 3A6000 did not perform well in the audio encoding tests. In all three formats, the Chinese processor was inferior to both AMD and Intel, and the lag remained even with their frequency reduced to 2.5 GHz. The gap to the Western processors at the same frequency reached two to three times, and sometimes even four times! This is most likely due to the lack of optimizations for the less common LoongArch architecture. Admittedly, audio encoding is not the most demanding task and still completes relatively quickly, so let's look at other audio processing tests.

The chart shows the results of two tests related to speech synthesis and audio processing. The first, Google SynthMark, is a cross-platform tool for measuring real-time CPU performance in audio processing; it includes a polyphonic synthesizer and evaluates latency, jitter, and computational throughput. The second, eSpeak, measures the time needed to synthesize speech from the book "The Outline of Science" using the improved eSpeak-NG engine, with the audio output in WAV format.

Single-channel and dual-channel memory modes produced the same results, so we can go straight to comparing the processors. In the first test, the Loongson 3A6000 showed good results: its performance was comparable to the Core i3-12100 at a reduced frequency of 2.5 GHz and almost reached the level of the Ryzen 5 1500X at the nominal frequency, which is an excellent result.

In the second speech synthesis test, the results were less impressive, but still decent: the Loongson 3A6000 was faster than the Ryzen 5 1500X slowed down to 2.5 GHz and slightly slower than the Core i3-12100 at the same frequency. This confirms that the Loongson's per-clock performance is good, but that better results require further optimization and the use of specialized instructions. At the nominal frequencies of the Western processors, the Core i3-12100 significantly outperforms the Chinese CPU.

Dav1d is a high-speed software decoder for AV1 video. We tested it by decoding two videos, one in Full HD and one in 4K resolution. Interestingly, the results vary depending on the memory operating mode: dual-channel DDR4-1866 showed slightly better performance than single-channel DDR4-3200.

When decoding AV1 video, the same problems appeared as when encoding audio data: a lack of optimization and no use of specialized instructions. As a result, the decoding speed of the Loongson 3A6000 was half that of the Core i3-12100 operating at a reduced frequency of 2.5 GHz, and one and a half times lower than that of the Ryzen 5 1500X at the same frequency.

When compared with Intel and AMD processors at nominal frequencies, the difference becomes even more noticeable: the Loongson 3A6000 is 2 to 3.5 times slower than its competitors, which makes the Chinese processor uncompetitive in this benchmark. For an average user who does not need to decode several 4K videos at the same time, 40 FPS may be quite sufficient, but the result still leaves the Loongson 3A6000 noticeably behind.

The next test focuses on software encoding of video data to the H.265 format using the popular x265 encoder. We tested two resolutions: Full HD and 4K. To ensure high encoding performance, SIMD instructions such as SSE, AVX, AVX2 and AVX-512 for x86-compatible processors, and LSX and LASX for Loongson are usually used.
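Whether an encoder can use these extensions depends first of all on what the CPU and kernel report. The sketch below parses text in the style of Linux's /proc/cpuinfo and picks out the SIMD extensions named above. It rests on assumptions: that the kernel lists capabilities on a line keyed "flags" (typical for x86) or "Features" (typical for ARM and LoongArch), and that the extensions appear under the lowercase names in our SIMD_FLAGS set, which is our own selection. It says nothing about whether a given program actually has code paths for them.

```python
# Names we look for; this selection is ours and is not exhaustive.
SIMD_FLAGS = {"sse", "sse2", "avx", "avx2", "avx512f", "lsx", "lasx"}


def simd_features(cpuinfo_text: str) -> set:
    """Return the known SIMD extension names found in cpuinfo-style text."""
    found = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # Assumed key names: "flags" on x86, "Features" on ARM/LoongArch.
        if sep and key.strip().lower() in ("flags", "features"):
            found |= SIMD_FLAGS & set(value.split())
    return found


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as handle:
            print(sorted(simd_features(handle.read())))
    except OSError:
        print("/proc/cpuinfo is not available on this system")
```

Even when LSX and LASX are reported, the encoder still needs hand-written or compiler-generated code for them, which is the part that is often missing for LoongArch.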

The frame rates in this test are low, so every frame per second counts. Although video encoding and decoding is often handled by the graphics processing unit (GPU) in modern systems, in the absence of such support the task falls to the CPU. Memory speed does not affect the results; the key factors are processing power and the quality of optimization for the specific architecture.

However, in this test, the Loongson 3A6000 again demonstrates weak results, probably due to the lack of optimizations for the LoongArch architecture, including specialized instructions. As a result, the Chinese processor is significantly inferior to the Core i3-12100 and Ryzen 5 1500X — the difference is up to 5-6 times compared to the Intel processor, even at a reduced frequency of 2.5 GHz. When compared to AMD and Intel processors at nominal frequencies, the difference becomes even more significant, reaching 5-10 times. Interestingly, among competitors, the Intel processor shows better performance than Ryzen, but this does not help Loongson, which is still far behind in video encoding.

Let's look at another demanding test: VVenC (Fraunhofer Versatile Video Encoder), a fast and efficient H.266/VVC video encoder. It uses SIMD Everywhere (SIMDe), a library that provides portable implementations of SIMD intrinsics on platforms that do not support them natively. Unfortunately, the Chinese manufacturer's products are not on the list of supported platforms. While x86 processors use all types of SSE and AVX instructions, and ARM platforms use SIMD-accelerated Neon operations, Loongson again has performance problems.

Indeed, as the complexity of the task increases, it is difficult to achieve good results without SIMD instructions. The difference between code optimized for x86 processors and what runs on LoongArch becomes obvious. The Loongson 3A6000 does complete the test, but its performance is far below that of the Western processors. Compared to the Core i3-12100 at its nominal frequency, the gap is not tenfold but much larger, and the Ryzen 5 1500X also shows better results: even at a reduced frequency of 2.5 GHz, the AMD processor is more than 5 times faster than Loongson. The Chinese company still has a lot of work to do on optimizing various software products for its processors to avoid such significant failures.

Image processing

This section of tests partially overlaps with the previous one, but we decided to highlight it separately. It focuses exclusively on working with static 2D images, including their processing, compression and decompression in various tasks.

G’MIC is an open-source digital imaging platform that offers a variety of algorithms and functions for image transformation and processing. It supports multithreading and can use OpenMP to speed up calculations by distributing the load across multiple cores.

The test results are presented in seconds required to complete each of the three tasks. There is a clear impact of memory bandwidth: dual-channel mode, even at a lower frequency, provides better results in image processing. This allows the Loongson processor to approach the results of the Ryzen 5 1500X in the first two tests, operating at a frequency reduced to 2.5 GHz. However, the Core i3-12100, even in slow mode, demonstrates significantly better performance. When comparing all processors at their nominal frequencies, the Chinese processor turns out to be an outsider in the first two tests.

Interestingly, the third subtest stands out: here the Ryzen processor was significantly slower than the other two. Loongson outperformed the older AMD quad-core processor even when the latter ran at its nominal frequency. At nominal frequency the Intel processor was faster than Loongson, but at 2.5 GHz the Chinese processor outperformed it, which is a very good result for Loongson.

The next test — RSVG/librsvg — evaluates the performance when working with vector graphics in SVG format. The benchmark measures the time required to convert vector graphics to PNG format (rasterization). This is a typical task, often encountered when browsing modern websites, where in practice you have to process many small images.

In the vector graphics rasterization test, the Loongson 3A6000 processor showed decent results, placing between the Ryzen 5 1500X and the Core i3-12100 at the same frequency of 2.5 GHz. It slightly outperformed the AMD solution and slightly lost to the Intel processor. However, the higher frequencies of Western processors, especially the Core i3 at 3.3 GHz, provide a significant advantage — in real conditions, the Core i3 processor becomes twice as fast, and the Ryzen 5 also slightly outperforms the Chinese processor. Nevertheless, for Loongson, the result can be considered quite successful.

Let's take a look at another universal image processing test — RawTherapee. This is a cross-platform application for cataloging and processing RAW files from digital cameras, similar to Adobe Lightroom and Aperture, but open source. This benchmark measures the time it takes to process and convert RAW files, which is a common task for professional photographers.

Unfortunately, the Loongson processor again demonstrates weak results in this test, apparently due to the lack of optimization for its instruction set architecture. The RAW conversion speed in RawTherapee was significantly lower than that of the Ryzen 5 1500X operating at a frequency reduced to 2.5 GHz. The Core i3-12100, operating at its nominal frequency, was more than twice as fast as the Chinese processor. This can become a real problem when processing a large number of photos, where the difference in speed adds up.

Let's move on to image compression and decompression. tjbench is a benchmark for evaluating the performance of decompressing JPEG files using the libjpeg-turbo library, which is optimized with the SIMD instructions of modern processors. Although the library seems to support SIMD instructions from Loongson, the effectiveness of this support is questionable.

The results are again disappointing for the Chinese processor: it lags behind the AMD processor slowed down to 2.5 GHz and even more so behind the Core i3-12100. The Ryzen 5 1500X at the nominal frequency is twice as fast as the Loongson 3A6000, and the Intel processor completes the task 3.3 times faster. Despite the fact that batch conversion of a large number of JPEG files is rare, the results show that the Chinese processor exhibits significant performance problems in this task, which may be due to insufficient software optimization for its architecture.

Let's move on to the image encoding test, which requires more computing resources. The OpenJPEG test uses a large 717 MB panoramic TIFF file, which is converted to JPEG 2000 format. The conversion time results are presented in milliseconds.

Here we see much more positive results for the Loongson 3A6000. Perhaps the newer version of the software is better optimized for its instruction set, so the Chinese processor almost catches up with the Ryzen 5 1500X at the nominal frequency of 3.5 GHz, and even outperforms it at equal frequencies. The Core i3-12100 is still faster in nominal mode, but when its frequency is reduced to 2.5 GHz, the Loongson again shows better results. This means that in terms of IPC in this test, the Chinese processor surpasses older AMD and Intel models, which was unexpected, but let's look at the results of other compression formats.

The next test is about image compression, using Google's libwebp library to transcode a 6000x4000 JPEG file to WebP format using the cwebp utility. Performance is measured in megapixels per second.
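The megapixels-per-second metric is simply the pixel count of the source image divided by the encoding time. A minimal sketch of the arithmetic for the 6000x4000 test image follows; the 2.0-second timing is a made-up example, not a measured result.

```python
def megapixels_per_second(width: int, height: int, seconds: float) -> float:
    """Convert an encode time for a width x height image into MP/s."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return (width * height) / 1_000_000 / seconds


if __name__ == "__main__":
    # 6000 x 4000 pixels is 24 megapixels; 2.0 s is a hypothetical encode time.
    print(megapixels_per_second(6000, 4000, 2.0))
```

For this hypothetical timing the script prints 12.0, that is, 24 megapixels encoded in two seconds.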

The results are again disappointing: it seems that the software optimization for the hardware capabilities of Loongson leaves much to be desired. The Chinese processor shows weak results in this test. It is inferior to the Core i3-12100 and Ryzen 5 1500X not only in their full modes, but also when they are reduced to 2.5 GHz, where it is at best half as fast.

Admittedly, in the lossless compression subtest the results are close to those of the AMD and Intel processors at the same frequency, but this is little consolation given the significant lag in the first two subtests. The Ryzen 5 at its nominal frequency turned out to be 2.8 times faster than the Chinese processor, and the Core i3 four times faster. This is a sad result, and it may get worse in later tests.

Another similar test uses Google's libwebp2 library to encode images in WebP2 format. This format, which is still in development, supports 10-bit HDR, more efficient lossy and improved lossless compression, and full multithreading.

The test results are again disappointing: the Loongson 3A6000 processor demonstrates comparatively low performance. However, the gap here is slightly smaller. At nominal frequencies, the Core i3-12100 is 2.5-2.7 times faster, and the Ryzen 5 1500X is less than twice as fast. This is still a significant gap: even with the competitors at 2.5 GHz, the Loongson only approaches the old AMD processor, which leaves much to be desired.

The last test in this section is Etcpak, which is positioned as "the fastest ETC compressor on the planet." The tool is designed to compress textures into ETC and S3 formats as quickly as possible and supports both single-threaded and multi-threaded modes; the test uses a texture with a resolution of 8Kx8K.

The test results were extremely unexpected: the Loongson 3A6000 processor showed a very low texture compression speed — an order of magnitude slower compared to AMD and Intel processors. The difference in performance is approximately 13 times compared to the Ryzen 5 1500X and up to 22 times compared to the Core i3-12100. This highlights the lack of optimization for the specific computing architecture of Loongson. We hope that such unoptimized programs will be encountered by users of Chinese processors as rarely as possible, but this case should be taken into account when evaluating performance.

Cryptographic tests

The next important section of CPU testing is cryptographic tasks. Modern CPUs are capable of encrypting large amounts of data in real time, and many of them support special instructions for the most common encryption algorithms, such as AES.

Aircrack-ng is a suite of tools for detecting WiFi networks, intercepting traffic, and testing the strength of WEP and WPA/WPA2 encryption keys. In such tests, what matters is the number of computing cores and a high-performance architecture, not cache size or memory speed.

The results for the Loongson 3A6000 show that the processor is more than three times slower than the Ryzen 5 1500X and more than eight times slower than the Core i3-12100 at their nominal frequencies. Interestingly, the Intel Core i3-12100 turned out to be more than twice as fast as the Ryzen 5 1500X. Even with the frequency reduced to 2.5 GHz, the Loongson 3A6000 remains 2.5 times slower than the slowed down Ryzen. Perhaps other tests in this section will show more varied results.

Bork is a cross-platform file encryption utility written in Java. The test measures the time it takes to encrypt a sample file. Hardware encryption acceleration on CPUs that support it does not seem to be used in this case.

The Loongson 3A6000 showed more satisfactory results in this test. It is likely that the benchmark is not perfectly optimized for any particular architecture. The Chinese processor demonstrated a result on par with the Core i3-12100 at 2.5 GHz and was slightly faster than the Ryzen 5 1500X at its full frequency. In nominal mode, the Intel Core i3-12100 is still faster, but the Loongson showed decent results in this test.

Crypto++ is an open-source library for C++ designed to work with various cryptographic algorithms. It supports many algorithms and, for x86 processors, the use of the AES-NI extension. For Loongson, no additional optimizations were probably made, so the Chinese processor will most likely be among the laggards. The overall result for all supported algorithms was used in the testing.

Unfortunately, the lack of optimization had a negative effect. Loongson 3A6000 was among the laggards: although the results are not as bad as in the first test, the Chinese CPU is almost twice as slow as the Ryzen 5 1500X at 2.5 GHz and slightly slower than the Core i3-12100. In the nominal mode, Intel and AMD processors show an advantage of about 2.5 and 4 times, respectively. As a result, Loongson again failed to demonstrate good results in cryptographic testing.

The last test uses OpenSSL, an open-source cryptographic library best known for its implementation of the SSL/TLS protocols used by HTTPS. The library supports most popular hashing and encryption algorithms and cryptographic standards. We measured performance in two subtests: RSA4096, reported in signatures per second, and SHA512, reported in MB/s.
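To make the MB/s figure concrete, here is a minimal sketch of how hashing throughput can be measured: feed a fixed amount of data through SHA-512 and divide the volume by the elapsed time. It uses Python's hashlib rather than the harness from our tests, so absolute numbers are not comparable with the charts; the function name and the buffer sizes are our own choices.

```python
import hashlib
import time


def sha512_throughput_mb_s(total_mb: int = 64, chunk_kb: int = 64) -> float:
    """Hash total_mb of zero bytes in chunk_kb pieces and return MB/s."""
    chunk = bytes(chunk_kb * 1024)
    rounds = (total_mb * 1024) // chunk_kb
    digest = hashlib.sha512()
    start = time.perf_counter()
    for _ in range(rounds):
        digest.update(chunk)
    # Guard against a zero reading from a coarse timer.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return rounds * len(chunk) / (1024 * 1024) / elapsed


if __name__ == "__main__":
    print(f"{sha512_throughput_mb_s():.1f} MB/s")
```

Because the buffer is reused and fits in cache, this kind of measurement mostly reflects the core's integer throughput, which matches the observation that memory mode matters little here.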

The results for the Loongson 3A6000 in this test were better than in the previous ones. In the first subtest, the Chinese processor was still lagging behind, but the difference was reduced: it was 2 and 3.8 times slower than the Ryzen 5 1500X and Core i3-12100, respectively, at nominal frequencies. When the Ryzen 5 frequency was reduced to 2.5 GHz, the difference decreased to one and a half times.

In the second subtest, the results for Loongson 3A6000 were significantly better: the performance was at the level of the Ryzen 5 1500X slowed down to 2.5 GHz, which indicates a comparable IPC level. However, the Core i3-12100 again showed superiority, with a difference in the nominal mode of more than two times. Despite this, the OpenSSL test can be considered relatively successful for the Chinese processor, especially against the background of other tests.

Compression and decompression

Compression and decompression of data in archives are familiar to most users, as are popular archivers. We have conducted tests using several of them, including the most common ones on Unix/Linux systems.

Gzip is a popular lossless compression format used in Unix systems, based on the Deflate algorithm (LZ77 combined with Huffman coding). The test measures the compression time of two copies of the Linux 4.13 kernel source code. Memory bandwidth has no effect here: single-channel and dual-channel modes show the same compression time.
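The measurement itself is simple to reproduce in spirit: compress a block of data and record the wall-clock time. The sketch below does this with Python's gzip module on synthetic, source-like text; it is not the benchmark we ran (which compresses kernel sources with the gzip utility), and the function name and sample data are our own.

```python
import gzip
import time


def time_gzip(data: bytes, level: int = 6):
    """Compress data with gzip; return (seconds, compressed/original size ratio)."""
    if not data:
        raise ValueError("data must not be empty")
    start = time.perf_counter()
    packed = gzip.compress(data, compresslevel=level)
    elapsed = time.perf_counter() - start
    return elapsed, len(packed) / len(data)


if __name__ == "__main__":
    # Synthetic, highly repetitive sample standing in for source code.
    sample = b"static int answer(void) { return 42; }\n" * 50_000
    seconds, ratio = time_gzip(sample)
    print(f"{seconds:.3f} s, ratio {ratio:.4f}")
```

Gzip compression is single-threaded, so a test like this reflects per-core performance rather than core count.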

The Loongson 3A6000 showed good results in this test, especially when compared to AMD and Intel processors at 2.5 GHz. In this mode, the Chinese processor was slightly faster than the Ryzen 5 1500X and close to the Core i3-12100, showing good IPC. However, at nominal frequencies the competitors pull ahead: the Intel processor is twice as fast, while the Ryzen 5 1500X is only 23% faster than the Chinese CPU. Overall, this is a good result for Loongson.

7-zip is a popular archiver known for its efficient and resource-intensive compression method. 7-zip tests are cross-platform, which allows you to compare results on different operating systems. Using dual-channel DDR4 memory has a positive effect on compression speed, although this dependence is less pronounced during decompression.

The results of the Loongson 3A6000 in this test were quite notable. In compression, Loongson is about half as fast as the Core i3-12100, both at full frequency and slowed down to 2.5 GHz, but it is close to the Ryzen 5 1500X and, in terms of IPC, ahead of the AMD processor.

Interestingly, when decompressing, the Loongson 3A6000 performed better than AMD and Intel at 2.5 GHz, which indicates a slightly higher IPC. Under normal operating conditions the Core i3 and Ryzen 5 remain faster, but not by much: only 30% and 12%, respectively.

The next compression test uses the LZ4 algorithm, which provides a lower compression ratio than gzip but is significantly faster at both compression and decompression. We tested compression level 9.

In terms of compression speed, the Loongson 3A6000 outperformed the Core i3-12100 and Ryzen 5 1500X when their frequencies were reduced to 2.5 GHz, which indicates a good IPC level for the Chinese processor. However, due to its lower operating frequency, it is inferior to its competitors at their nominal frequencies, although the difference with the Ryzen 5 is relatively small. The Core i3-12100 at full frequency was 60% faster.

Data decompression is faster and does not depend on RAM bandwidth, unlike compression. In this test, the Loongson lost to its competitors even with equal frequencies of all processors. At the nominal frequency, the Core i3 unpacks the file twice as fast, and the Ryzen 5 is almost one and a half times faster than the Chinese processor.

The next compression test uses the Zstd (Zstandard) algorithm, which combines LZ77 dictionary compression with efficient ANS entropy coding, which plays a role similar to Huffman coding. Compression level 19 in long mode was selected for testing.

In this format, the Loongson 3A6000 demonstrated compression performance between that of the Core i3-12100 and the Ryzen 5 1500X at 2.5 GHz, which indicates good IPC for the Chinese processor. However, thanks to its higher operating frequency, the Intel processor at nominal frequency was one and a half times faster. When unpacking, the results are similar: at the same frequency, Loongson sits between the slowed-down AMD and Intel processors, but at nominal frequencies both are again ahead of it, the Ryzen by 25% and the Core i3 by twice as much.

The bzip2 compression format, based on the Burrows-Wheeler transform, provides efficient compression, but with a higher CPU load and slower speed compared to gzip and zip. The benchmark uses Parallel BZIP2, a multi-threaded implementation, and measures the time to compress and decompress the FreeBSD-13.0-RELEASE-amd64-memstick.img file.

In this test, the performance of the Loongson 3A6000 leaves much to be desired. At 2.5 GHz, the Chinese processor loses to both competitors. At the nominal frequency, the Core i3-12100 compresses the file more than twice as fast, and the Ryzen 5 1500X is 50% faster.

The situation looks better when decompressing. The Loongson 3A6000 outperforms the Ryzen 5 1500X at 2.5 GHz and comes close to it at the nominal frequency. At 2.5 GHz the Intel processor is even slightly slower than the Chinese CPU, although at its nominal frequency the Core i3 is 64% faster. Thus, Loongson does noticeably better in decompression than in compression.

The last test in this section measures the time to unpack a .tar.xz archive containing the installation files for the Mozilla Firefox 84.0 web browser. Memory bandwidth has a small impact on the results, about 10%, so we compare the Loongson's dual-channel performance with its competitors.
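To show what is being timed, here is a small self-contained sketch that builds a .tar.xz archive in memory and then measures how long it takes to decompress and read every member. It is only an illustration of the measurement: the real test unpacks the Firefox archive to disk, while this version uses tiny synthetic files and never touches the file system. The helper names are our own.

```python
import io
import tarfile
import time


def make_tar_xz(files: dict) -> bytes:
    """Pack {name: bytes} into an in-memory .tar.xz archive."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:xz") as tar:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def time_unpack(archive: bytes):
    """Decompress and read every member; return (seconds, total bytes read)."""
    start = time.perf_counter()
    total = 0
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:xz") as tar:
        for member in tar:
            stream = tar.extractfile(member)
            if stream is not None:  # directories and links have no data stream
                total += len(stream.read())
    return time.perf_counter() - start, total


if __name__ == "__main__":
    demo = make_tar_xz({"readme.txt": b"hello\n" * 10_000, "blob.bin": bytes(100_000)})
    print(time_unpack(demo))
```

XZ decompression is a single-threaded, branch-heavy workload, so it is a reasonable probe of per-core performance and cache behavior.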

The Loongson 3A6000 shows significantly better unpacking performance than the Ryzen 5 1500X and slightly better than the Core i3-12100 when all processors are running at 2.5 GHz, which is the Loongson's nominal frequency. In this test, the Chinese CPU shows better IPC results than both AMD and Intel processors, even though they are older. Interestingly, the Loongson reaches the level of the Ryzen at its nominal frequency of 3.5 GHz. However, the Core i3-12100 is ahead of both competitors by 1.5 times, thanks to its more modern architecture and support for DDR5 memory.

Compilation and development

Although this section is not the most extensive and perhaps not the most popular among our readers, it is still of interest. Software developers, although few in number, will certainly be interested in new solutions. We will consider how the Chinese Loongson processor and LoongArch architecture cope with code compilation, application assembly and other tasks related to software development.

Build2 is a cross-platform toolkit for building C/C++ code. The first test in this section measures the time it takes to install Build2 from source code. Interestingly, memory bandwidth had no effect on the results, and dual-channel DDR4 memory mode did not bring any improvements.

Unfortunately, Loongson 3A6000 cannot yet boast of high results. Even when the competitors' frequency is reduced to 2.5 GHz, the Chinese CPU loses to both competitors. Loongson's IPC is clearly inferior. When operating in nominal modes, the Core i3-12100 and Ryzen 5 1500X processors show a noticeable advantage: Intel is more than twice as fast, and AMD beats the Chinese CPU by almost one and a half times.

PyBench is a test that evaluates the overall performance of a system using Python by measuring the execution time of various functions such as BuiltinFunctionCalls and NestedForLoops. The overall result helps determine the average Python performance on the platform.
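The idea behind such microbenchmarks can be sketched with the standard timeit module. The two functions below are our own loose analogues of the NestedForLoops and BuiltinFunctionCalls cases, not PyBench's actual code, and the loop sizes are arbitrary.

```python
import timeit


def nested_for_loops(n: int = 50) -> int:
    """Count the iterations of two nested loops."""
    count = 0
    for _ in range(n):
        for _ in range(n):
            count += 1
    return count


def builtin_function_calls(n: int = 1000) -> int:
    """Call a few built-in functions in a tight loop."""
    total = 0
    for i in range(n):
        total += len(str(abs(-i)))
    return total


def best_time_ms(func, repeat: int = 5, number: int = 20) -> float:
    """Best average time per call in milliseconds over several repeats."""
    runs = timeit.repeat(func, repeat=repeat, number=number)
    return min(runs) / number * 1000.0


if __name__ == "__main__":
    for case in (nested_for_loops, builtin_function_calls):
        print(f"{case.__name__}: {best_time_ms(case):.3f} ms")
```

Taking the minimum over several repeats reduces the influence of background activity, which matters when comparing machines with very different operating system setups.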

In this test, the Loongson 3A6000 shows improved results compared to previous tests, although not without reservations. In terms of speed, it is on par with the Ryzen 5 1500X slowed down to 2.5 GHz, which puts it on the level of the Zen 1 architecture, itself already quite outdated. The more modern Core i3-12100 significantly outperforms the Loongson, even at 2.5 GHz, winning by 77%, and in nominal mode the difference reaches three times. The AMD processor at its nominal frequency is 40% faster than the Loongson.

This short section is rounded off by two compilation time tests: for examples from the C++ Eigen linear algebra library and the Erlang programming language. The tests measure the compilation time of the specified projects in seconds, and although memory bandwidth has some effect, it is insignificant and can be ignored.

In terms of compilation speed for both projects, the Loongson processor is comparable to a Ryzen 5 1500X slowed down to 2.5 GHz, which indicates that the Chinese engineers have reached the level of the Zen 1 architecture. However, there are two problems: this architecture is already outdated, and even it allows the AMD processor to operate at a significantly higher frequency, making it 25%-30% faster. The Core i3-12100, in turn, demonstrates a significant advantage, being 1.9-2.4 times faster in nominal mode and 20%-50% faster at 2.5 GHz. This emphasizes that the Chinese processor has yet to catch up with modern performance levels.

High Performance Computing

This section of the test does raise a lot of questions. On the one hand, high-performance computing places serious demands on processors, but on the other hand, it is unlikely that anyone will use an entry-level processor for such tasks. However, the Loongson 3A6000 test results can give an idea of the performance of server CPUs based on the same architecture with a higher core count. We evaluate not only the capabilities of a specific desktop processor, but also the potential of the LoongArch computing architecture as a whole.

The first test in this section, Algebraic Multi-Grid (AMG), evaluates the performance of a parallel algebraic multigrid solver for linear systems on unstructured grids. The test result shows the final computation speed, where a higher value means better performance. As expected, the results depend on memory bandwidth: the dual-channel mode clearly wins, despite the reduced memory frequency. Unfortunately, even in dual-channel mode, the Loongson 3A6000 demonstrates less than half the speed of the Ryzen 5 1500X slowed down to 2.5 GHz. The IPC of the Chinese processor in this test leaves much to be desired. The Ryzen 5 at 3.5 GHz and the Core i3-12100 show similar results, indicating that performance in this test largely depends on the memory subsystem. In this respect, the Intel processor is three times faster than the Loongson.

The High Performance Conjugate Gradient (HPCG) benchmark solves a system of linear algebraic equations with a large sparse square matrix using the conjugate gradient method with a Gauss-Seidel preconditioner. The algorithm is implemented using MPI and OpenMP, which supports multi-core CPUs.
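For reference, the core of the conjugate gradient method fits in a few lines. The sketch below is a plain, unpreconditioned version on a small dense matrix in pure Python; HPCG itself works on a large sparse matrix, adds the Gauss-Seidel preconditioner, and runs in parallel, none of which is reproduced here. The function name and tolerances are our own choices.

```python
def conjugate_gradient(a, b, tol=1e-10, max_iter=1000):
    """Solve a*x = b for a symmetric positive-definite matrix a (list of rows)."""
    n = len(b)
    x = [0.0] * n
    r = [float(v) for v in b]  # residual b - a*x, with x starting at zero
    p = list(r)                # first search direction
    rs_old = sum(v * v for v in r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        ap = [sum(a[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs_old / sum(p[i] * ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * ap[i] for i in range(n)]
        rs_new = sum(v * v for v in r)
        # New direction: current residual plus a multiple of the old direction.
        p = [r[i] + (rs_new / rs_old) * p[i] for i in range(n)]
        rs_old = rs_new
    return x


if __name__ == "__main__":
    print(conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0]))
```

Each iteration is dominated by the matrix-vector product, which on a large sparse matrix streams data from memory with little reuse. That is consistent with the strong dependence on memory bandwidth seen in this test.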

Here, the influence of memory bandwidth is also noticeable, and the dual-channel mode provides a significant increase in system performance. The performance of AMD and Intel processors at different clock rates remains close, which emphasizes the importance of memory bandwidth. In this test, the Loongson 3A6000 demonstrates results that are 3.3 times lower than the performance of the Ryzen 5 1500X and more than four times lower than the Core i3-12100.

In less demanding tests, Loongson may show better results. The Himeno benchmark is a linear Poisson pressure equation solver that uses the Jacobi method and measures performance in megaflops. Although memory bandwidth has a clear impact, it is not the determining factor, and the results of AMD and Intel processors at different frequencies differ significantly. In this test, the Loongson 3A6000 demonstrated results almost on par with the Ryzen 5 1500X, even beating it when the latter ran at 2.5 GHz. This shows that the Chinese architecture copes well with such tasks compared to Zen 1. However, compared to the newer Core i3-12100 with DDR5 memory, the Chinese processor is significantly slower: more than two times slower than its competitor at the nominal frequency.
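The Jacobi method behind this benchmark is easy to show on a much smaller problem. The sketch below solves the one-dimensional Poisson equation -u'' = f on the interval from 0 to 1 with fixed boundary values; Himeno solves a three-dimensional pressure equation with a larger stencil, so this is a deliberate simplification, and the function name and grid layout are our own.

```python
def jacobi_poisson_1d(f, left, right, sweeps):
    """Jacobi sweeps for -u'' = f on (0, 1) with u(0) = left and u(1) = right.

    f holds the right-hand side at the interior grid points.
    """
    n = len(f)
    h = 1.0 / (n + 1)  # grid spacing
    u = [0.0] * n
    for _ in range(sweeps):
        new = [0.0] * n
        for i in range(n):
            west = u[i - 1] if i > 0 else left
            east = u[i + 1] if i < n - 1 else right
            # Each point becomes the average of its neighbours plus the source term.
            new[i] = 0.5 * (west + east + h * h * f[i])
        u = new
    return u


if __name__ == "__main__":
    # With f = 0 the exact solution is a straight line between the boundary values.
    print(jacobi_poisson_1d([0.0] * 9, 0.0, 1.0, sweeps=2000))
```

Every sweep reads the whole grid and performs only a few floating-point operations per point, so on large grids the method is limited as much by memory traffic as by arithmetic.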

The Mocassin (Monte Carlo Simulations of Ionised Nebulae) test models ionised nebulae using the Monte Carlo method and includes two variants — a more complex and a simpler one. The solution time is measured in seconds, and the influence of memory bandwidth is present here, but not as significant as in previous tests.

The Loongson 3A6000 showed excellent results in this test. In the first, more complex variant, the Chinese processor was faster than the Ryzen 5 1500X operating at 3.5 GHz, and in the second, less complex variant, it outperformed the Ryzen 5 1500X at 2.5 GHz. This shows that Loongson has good IPC compared to the Zen 1 architecture. Moreover, in the Dust2D subtest, the Core i3-12100 at 2.5 GHz was slightly slower than the Loongson 3A6000. However, when running at its native frequency of 3.3 GHz, the Core i3-12100 significantly outperforms Loongson, demonstrating twice the performance in the second subtest.

NAS Parallel Benchmarks (NPB) is a test developed by NASA to evaluate high-performance computing systems, including tasks of varying complexity and size. We considered two of the proposed options, and the results are expressed in millions of operations per second. A clear impact of memory bandwidth is observed only in the first subtest.

In the 3D Fourier transform (3D FFT) variant, the Loongson 3A6000 showed very low performance, which indicates insufficient optimization for the Chinese architecture. In this test, it was four times slower than the Ryzen 5 1500X and more than six times slower than the Core i3-12100 at their native frequencies. However, in the second test, the Chinese processor did better, showing speeds almost at the level of the Ryzen 5 at 2.5 GHz. This indicates that the Loongson has good IPC compared to Zen 1, although only in this case. In this test, the Core i3-12100 turned out to be more than one and a half times faster at a reduced frequency and more than 2.6 times faster at its nominal frequency.

Parboil is a benchmark suite from the University of Illinois for evaluating the performance of computing architectures that supports OpenMP, OpenCL, and CUDA. In this case, we used OpenMP to run four subtests. The results are measured by task execution time.

Memory bandwidth does not affect all subtests, but it has a significant impact in the last two — dual-channel memory mode significantly increases Loongson's performance. However, this is not enough to compare with competitors from AMD and Intel: in most tests, the Loongson 3A6000 demonstrates performance 2.5-3 times lower than the Ryzen 5 1500X and Core i3-12100. Even reducing the frequency of these processors to 2.5 GHz does not bring their performance closer to the speed of the Chinese CPU.

However, in the second subtest — MRI Gridding — Loongson was faster than the Core i3-12100 at its full operating frequency of 3.3 GHz and almost twice as fast as the Ryzen 5 1500X at its nominal frequency. This may be due to the specifics of how the test is performed on Loongson, but it may also be that the algorithm is particularly well optimized for the Chinese processor.

Rodinia is a suite of benchmarks for accelerating computations using CUDA, OpenMP, and OpenCL. In this case, we only used OpenMP, as there is no GPU acceleration capability. The results of the four subtests are measured in execution time in seconds — less time means better performance. Memory bandwidth affects the results in the first two subtests.

Compared to the Core i3-12100 and Ryzen 5 1500X, the Loongson 3A6000 shows the following results: in the first and second subtests, the Chinese CPU was slightly faster than the Ryzen at 2.5 GHz, but less than half as fast as the Core i3 at its full operating frequency. This indicates that, despite good IPC, Loongson does not have enough frequency to be competitive.

In the third subtest, Loongson was 5 times slower than the Core i3-12100 at full frequency and more than 3 times slower than the Ryzen 5 1500X at the nominal frequency. In the last subtest, the gap increased to 11 and 6 times, respectively. This significant lag indicates serious problems with software optimization for the new LoongArch architecture.

Molecular dynamics

These tests also relate to high-performance computing, which we already discussed in the previous section (including some computational fluid dynamics). However, there are enough of them that we decided to give them their own subsection.

CloverLeaf is a fluid dynamics test using the Lagrange-Euler method and OpenMP for multi-threaded processors. We tested the basic clover_bm calculation, and the test execution time is given in seconds. Memory bandwidth has a significant impact on the result, and adding a second DDR4 module significantly improved the performance of Loongson, despite the less than optimal memory controller.

The pattern continues for the Loongson 3A6000: in computationally heavy tasks, the Chinese processor falls short, which we attribute to insufficient software optimization for its architecture. As a result, Loongson turned out to be 75% slower than the Ryzen 5 1500X slowed down to 2.5 GHz, while the full-speed AMD chip is twice as fast. The Core i3-12100 outperforms the Chinese processor by a factor of 3.3.

Dolfyn is a computational fluid dynamics (CFD) benchmark that measures the execution time of demo programs. Memory bandwidth has a negligible effect here, while CPU frequency has a noticeable impact on the results.

The Loongson 3A6000 does quite well in this test: at 2.5 GHz, it delivers performance on par with the Ryzen 5 1500X, and IPC figures are comparable to Zen 1. However, the maximum frequency still plays a role, and in nominal mode, the Ryzen is 40% faster. The Core i3-12100, on the other hand, is 2.3 times faster than the Loongson 3A6000 at full frequency, highlighting the need to improve the performance of the Chinese processor.
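The reasoning behind the IPC comparison can be checked with simple arithmetic. The sketch below assumes the Ryzen 5 1500X runs at 3.5 GHz in nominal mode (the figure quoted earlier in the text, boost behaviour ignored) and that performance in a compute-bound test scales linearly with clock frequency. Both helper functions are hypothetical names introduced for illustration.

```python
def expected_speedup(freq_high_ghz, freq_low_ghz):
    """Speedup expected from clock frequency alone (linear scaling assumed)."""
    return freq_high_ghz / freq_low_ghz


def per_clock_ratio(score_a, freq_a_ghz, score_b, freq_b_ghz):
    """Ratio of per-GHz performance of chip A to chip B (a rough IPC proxy)."""
    return (score_a / freq_a_ghz) / (score_b / freq_b_ghz)


if __name__ == "__main__":
    # Ryzen 5 1500X: nominal 3.5 GHz versus the reduced 2.5 GHz setting
    print(f"expected Ryzen gain from frequency: {expected_speedup(3.5, 2.5):.2f}x")

    # Normalised scores: Loongson = 1.0 at 2.5 GHz, full-speed Ryzen = 1.4
    ratio = per_clock_ratio(1.0, 2.5, 1.4, 3.5)
    print(f"Loongson per-clock performance relative to Ryzen: {ratio:.2f}")
```

A frequency ratio of 3.5 / 2.5 = 1.4 matches the 40% advantage of the full-speed Ryzen, which is consistent with this test being limited by clock speed rather than memory, and with the two cores doing a similar amount of work per clock.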

Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS) is a molecular dynamics package used for complex calculations. The tests used the MPI interface, and the Rhodopsin Protein model was chosen for our tests. The impact of memory bandwidth in this benchmark is negligible.

The Loongson 3A6000 showed very good results in this test, beating the Ryzen 5 1500X at equal frequencies and almost catching up with the Core i3-12100 at 2.5 GHz. This shows a good IPC indicator of the Chinese processor, despite the fact that we are comparing it with older generations of AMD and Intel processors. Even with the full-speed Ryzen 5 1500X, the Loongson 3A6000 is almost on par, but the Core i3-12100 is still significantly ahead, twice as fast as the Chinese processor due to its higher frequency.

Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics (LULESH) is a 3D unstructured Lagrangian hydrodynamic simulation for solving the Sedov problem. The impact of memory bandwidth on performance in this test is minimal and can be ignored.

The Loongson 3A6000 again showed relatively good results: it is slightly ahead of the Ryzen 5 1500X at 2.5 GHz, but behind the Core i3-12100, which is 37% faster in the same mode. The main problem with Loongson is its insufficient operating frequency, since the Ryzen and the Core i3 are significantly faster at full frequency, by 25% and by a factor of 2.2, respectively. The Core i3-12100, which is newer and supports DDR5 memory, is significantly ahead of both Ryzen and Loongson. In terms of IPC, however, the Chinese processor shows quite good results.

Pennant is a fluid dynamics application for unstructured 2D meshes. It includes two subtests, the results of which are expressed in seconds. In this test, memory bandwidth has a noticeable impact on performance, with dual-channel mode significantly improving the results in both subtests.

However, even with this improvement, the Loongson 3A6000 did not reach the performance level of the Ryzen 5 1500X, which was slowed down to 2.5 GHz; the Chinese processor was always lagging behind. The full-speed Ryzen and Core i3-12100 variants demonstrated performance that was 1.5 and 2-3 times higher than the Loongson, respectively. In this test, Loongson again found itself among the laggards, and it is difficult to say whether this is due to a lack of optimization or other architectural issues.

Let's take a look at the last benchmark in this section, Incompact3d. This is a high-performance Fortran-MPI code for solving the Navier-Stokes equations for incompressible fluids. We used the base case with 129 cells per direction, and the results are given in seconds spent on the calculation. The impact of memory bandwidth is significant here, and dual-channel memory mode brings a noticeable improvement, even at a lower frequency.

Unfortunately, the Loongson 3A6000 did not perform very well in this test. It ran at less than half the speed of the full-speed Core i3-12100 and was more than 30% slower than the Ryzen 5 1500X at its nominal frequency. At the same time, the 3A6000 was only 17% slower than the Ryzen 5 at 2.5 GHz, which indicates relatively good IPC. However, it should be kept in mind that Zen 1 processors were released seven years ago and the fifth generation will soon be presented, so Chinese engineers have a lot of work to do to catch up with current standards.

Machine learning

We couldn't ignore the current topic of resource-intensive computing in machine learning. In this section, there will be two tests that are interesting, despite the fact that more efficient graphics processors are often used for such tasks. However, general-purpose CPUs also find their application in this area.

NumPy (Numerical Python) is an open-source library for Python that provides support for multidimensional arrays and high-level mathematical functions for working with them.

The results of testing Loongson 3A6000 using NumPy are not very encouraging: the Chinese processor even lost to Ryzen 5 1500X at a reduced frequency of 2.5 GHz. Although the difference is not critical, it is noticeable. The full-speed Ryzen 5 is 1.7 times faster than Loongson, and the Core i3-12100 at 3.3 GHz is almost three times faster. Thus, the Intel processor demonstrates a noticeable advantage in matrix calculations compared to AMD and Loongson.

TNN is a high-performance, open-source, cross-platform deep learning framework from Tencent that scales well from mobile devices to powerful GPU servers. We used only two of the four available models for testing.

Loongson performed quite well in this test, approaching the Ryzen 5 1500X in performance at a reduced frequency of 2.5 GHz. This indicates that in terms of the number of instructions executed per clock, the Chinese processor is close to Zen 1. However, the Core i3-12100, as in other tests, significantly outperforms Loongson — both at the nominal frequency and when reduced to the level of the Chinese processor. However, the 2.5x gap from Intel does not seem too serious compared to the results in other tests.

Energy consumption

Our look at the power consumption of the Loongson 3A6000 is more of a general overview, as this is not a high-end processor with high power draw. The 3A6000-HV model used in the tests has a rated peak power consumption of 80 W, but in practice such values were not reached.

For high-performance Intel and AMD models, TDP values are often lower than the actual peak power consumption because of frequency and voltage boost technologies that allow the nominal consumption to be exceeded temporarily. In the case of simpler processors, such as the Loongson 3A6000, the TDP value is usually not reached, and much depends on frequency behaviour, temperature characteristics and other parameters. Motherboard manufacturers can also raise the power consumption and voltage limits to improve performance.

Practical tests of the Loongson 3A6000 showed that its frequency consistently matches the declared parameters, neither exceeding nor falling below 2.5 GHz under any load. This is different from modern AMD and Intel processors, which can reach higher frequencies in single-thread mode and reduce the frequency under full load on all cores. For Loongson, the situation is simpler and more predictable.

We compared the power consumption of systems with the specified processors in three scenarios: idle, when watching high-resolution video, and in maximum consumption mode with resource-intensive applications for mathematical calculations. For the video test, we used the built-in media player with a 1920x1080 video in H.264 format, which can load both the video card and the central processor. Consumption differs between the two supported operating systems, so data for both are provided.

The power consumption comparison is only carried out between systems based on Loongson and Core i3-12100, since the Ryzen 5 1500X does not have an integrated video core, and using an external video card leads to a significant increase in the overall system consumption.

In idle mode, a PC with a Loongson processor consumes slightly less than an Intel-based system, although the difference is insignificant. This indicates good efficiency of the Chinese processor in the most economical mode. In the maximum power consumption mode, when running scientific calculations, the results between the two Linux systems showed that UOS was slightly more economical. The final consumption of 74-77 W for the Loongson-based system was significantly lower than the 86 W for the Intel-based PC, although the Intel processor provides significantly higher performance.

The video playback mode was of the greatest interest. Due to differences in the hardware support capabilities of the Loongson processor, the UOS system demonstrated significantly better performance in video decoding — remember, we used the media player pre-installed with the OS, without additional software. While the Loongnix system performs most of the decoding in software, loading the processor's computing cores, the player in UOS uses dedicated units to process video data. As a result, the Loongnix-based system consumed up to 64 W, while UOS consumed only 46 W, and the system with the Core i3-12100 — 60 W. This indicates that the Chinese processor needs high-quality software support to achieve high energy efficiency.

In general, the Loongson 3A6000 processor consumes slightly less energy than the Core i3-12100. However, the Intel processor usually provides significantly greater performance — often 1.5-2 times or more. Thus, the Loongson 3A6000 is not particularly energy efficient. Perhaps, when compared to the Core i3-10100, the Chinese processor would show slightly lower power consumption with comparable performance, but the difference would hardly be significant.
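The efficiency conclusion follows from combining the two sets of numbers given above. The sketch below is a rough estimate only: it uses whole-system power under full load (74-77 W for Loongson, 86 W for the Intel PC) and takes the "1.5-2 times" performance advantage of the Core i3-12100 as an assumed range rather than a measured value. The function name `relative_efficiency` is ours.

```python
def relative_efficiency(perf_ratio, power_a_w, power_b_w):
    """Performance per watt of system A relative to system B.

    perf_ratio is performance of A divided by performance of B.
    """
    return perf_ratio * power_b_w / power_a_w


if __name__ == "__main__":
    intel_power_w = 86.0
    # Best case for Loongson: Intel only 1.5x faster, Loongson system at 74 W
    best = relative_efficiency(1 / 1.5, 74.0, intel_power_w)
    # Worst case in the quoted range: Intel 2x faster, Loongson system at 77 W
    worst = relative_efficiency(1 / 2.0, 77.0, intel_power_w)
    print(f"Loongson perf/W relative to Core i3-12100: {worst:.2f} to {best:.2f}")
```

Under these assumptions the Loongson-based system delivers roughly 56% to 77% of the performance per watt of the Intel system, so its lower power draw does not make up for its lower speed.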

As for temperatures, you should not expect significant heating even in a system with simple, minimalistic cooling. In practice, the Loongson 3A6000 heated up to no more than 60 degrees, according to the built-in sensor and monitoring system. This temperature level is not a cause for concern, and overheating can only occur in the absence of a cooling system. Even a standard air cooler copes well with cooling the Loongson 3A6000.

Conclusions

Based on the benchmark results, it can be said that the Chinese company Loongson has done a great job creating a pretty good processor. Although the Loongson 3A6000 is not a perfect product and does not reach the level of modern AMD and Intel processors, the Chinese engineers managed to make significant progress. For example, the branch predictor in the 3A6000 is at the level of Zen 2, and the SMT technology is also close to this level. The DDR4 memory controller has been improved compared to the previous 3A5000 model, although it is still far from AMD and Intel solutions that support DDR5. These improvements have led to a significant increase in performance, and the LA664 core in the 3A6000 demonstrates comparable results to the Ryzen 5 1500X at a frequency of 2.5 GHz.

However, there is still room for improvement, and further hardware and software improvements can be expected in the future. The Loongson 3A6000 is an important step for China to reduce its dependence on Western microelectronics. For most basic tasks, such as working with a browser and office applications, the 3A6000's performance is quite sufficient. The main problem remains software support, especially for software not optimized for the LoongArch architecture.

Although the Loongson 3A6000 shows good results in some tests, it still lags behind even older processors such as the Ryzen 5 1500X, not to mention modern AMD and Intel. Comparisons with the latest generation of processors show a significant lag in both performance and energy efficiency. Problems also arise with software that does not always effectively use SIMD instruction sets on the Chinese processor, which can negatively affect performance in real-world tasks.

In addition, the Loongson 3A6000 does not boast outstanding energy efficiency, and its power consumption is comparable to the Core i3-10100 and other competitors. Although the Chinese company has the potential to improve IPC and increase frequencies, they still have a lot of work to do to reach the level of modern Western processors. This will require not only an improvement in the architecture, but also an improvement in manufacturing technology.

The current 12 nm process technology used for the 3A6000 limits the capabilities of the processor, although modern solutions from AMD and Intel are not perfect either. If Loongson wants to become a significant player in the global market, it will have a lot of work to do to compete with leading manufacturers such as Intel and AMD. It is important to note that Chinese companies are facing restrictions in semiconductor manufacturing due to US sanctions and must adapt to domestic manufacturing capabilities. This may slow down their progress, but interest in what they can achieve in the current environment remains high.

The slide shows not only server processors with more cores and threads, but also future desktop models based on new cores with higher frequencies and updated integrated graphics cores. What is interesting is that the Chinese company Loongson plans to move to a 7nm process technology this year. Although new products are unlikely to appear until next year, the move to 7nm could provide a performance increase of 20-30% compared to current models.

Due to US sanctions, the processors can only be manufactured in China at the SMIC fab, which limits performance and efficiency compared to TSMC. However, the move to 7nm would be a significant step up from the current 12nm process and would represent an advanced achievement for the Chinese semiconductor industry.

If Loongson's next models do indeed use the 7nm process technology, it will allow for higher clock speeds and more cores on a single die, which will significantly improve the competitiveness of these processors. Although the Loongson 3A6000 is already suitable for many tasks, the company still has a lot to do to reach the level of AMD and Intel processors that were relevant 2-3 years ago, not to mention more modern models.

The Loongson 3A6000 demonstrates the potential of the Chinese manufacturer, and it is important that progress continues, and Chinese processors develop in a competitive environment. This can open up not only the large Chinese market for Loongson, but also the opportunity to enter international markets, including Russia.

We remind you that this is only the first part of the material about the Loongson 3A6000 processor and systems based on it. In the second article, we will consider the practical use of a PC based on a Chinese processor with its own architecture. We will test both a ready-made PNXC computer and a PC assembled on an Asus motherboard. We will also discuss the operating systems available for Loongson and examine in detail the subtleties and possible disadvantages of using these processors for users accustomed to PCs based on x86-compatible processors and the Windows operating system.