Precision Issue in matmul_clamp_f16_f16_f16p Example

I'm trying to use matmul_clamp_f16_f16_f16p in an LLM application, and the matmul_clamp_f16_f16_f16p example appears to have precision-related issues. To reproduce this, I modified the existing example by replacing the fill_matrix function with the following:

void fill_uniform_random(size_t num_rows, size_t num_cols, float16_t* dst, size_t seed) {
  std::srand(seed);

  // Populate the array with random values ranging from -1 to 1
  for (size_t i = 0; i < num_rows * num_cols; i++) {
    dst[i] = (float16_t)((double)std::rand() / RAND_MAX) * 2 - 1;
  }
}

I also set M, N, and K to 1024. With this configuration, matmul_clamp_f16_f16_f16p does not produce the correct output.
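For anyone trying to reproduce this, below is a minimal sketch of how the mismatch could be quantified. It is not taken from the example: ref_matmul_f32 and count_mismatches are hypothetical helper names, the matrices are assumed to be row-major (LHS is M x K, RHS is K x N), and the kernel output is assumed to have been converted to float before the comparison. The idea is to compute a float32 reference and count the elements that fall outside a combined absolute and relative tolerance.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: plain float32 reference matmul, dst = lhs * rhs.
// Assumes row-major storage: lhs is M x K, rhs is K x N, dst is M x N.
void ref_matmul_f32(size_t m, size_t n, size_t k, const float* lhs, const float* rhs, float* dst) {
  for (size_t row = 0; row < m; row++) {
    for (size_t col = 0; col < n; col++) {
      float acc = 0.0f;
      for (size_t i = 0; i < k; i++) {
        acc += lhs[row * k + i] * rhs[i * n + col];
      }
      dst[row * n + col] = acc;
    }
  }
}

// Hypothetical helper: counts elements of `actual` that differ from `expected`
// by more than abs_tol + rel_tol * |expected|. A NaN in `actual` is counted
// as a mismatch because the comparison below is false for NaN.
size_t count_mismatches(const std::vector<float>& actual, const std::vector<float>& expected,
                        float abs_tol, float rel_tol) {
  size_t mismatches = 0;
  for (size_t i = 0; i < expected.size(); i++) {
    const float err = std::fabs(actual[i] - expected[i]);
    if (!(err <= abs_tol + rel_tol * std::fabs(expected[i]))) {
      mismatches++;
    }
  }
  return mismatches;
}
```

The tolerance values are a judgment call. Half precision has a unit roundoff of 2^-11 (about 4.9e-4), and if the kernel accumulates in half precision (an assumption, I have not verified this), the rounding error of each output element can grow with K, so a tolerance that passes for small K may be too strict at K = 1024.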
