Hi Daisuke,

On Oct 28, 2020, at 4:37 AM, d.hatayama@fujitsu.com wrote:

/*
+ * Find virtual (VA) and physical (PA) addresses of kernel start
+ *
+ * va:
+ *   Actual address of the kernel start (_stext) placed
+ *   randomly by kaslr feature. To be more accurate,
+ *   VA = _stext(from vmlinux) + kaslr_offset
+ *
+ * pa:
+ *   Physical address where the kernel is placed.
+ *
+ * In nokaslr case, VA = _stext (from vmlinux)
+ * In kaslr case, virtual address of the kernel placement goes
+ * in this range: ffffffff80000000..ffffffff9fffffff, or
+ * __START_KERNEL_map..+512MB
+ *
+ * https://www.kernel.org/doc/Documentation/x86/x86_64/mm.txt
+ *
+ * Randomized VA will be the first valid page starting from
+ * ffffffff80000000 (__START_KERNEL_map). The page table entry of
+ * this page will contain the PA of the kernel start.

I hadn't come up with this natural idea; it's better in that the
IDTR is unnecessary.

+ *
+ * NOTES:
+ * 1. This method does not support the PTI (Page Table Isolation)
+ * case, where CR3 points to the isolated page table.

calc_kaslr_offset() already deals with PTI here:

       if (st->pti_init_vmlinux || st->kaiser_init_vmlinux)
               pgd = cr3 & ~(CR3_PCID_MASK|PTI_USER_PGTABLE_MASK);
       else
               pgd = cr3 & ~CR3_PCID_MASK;

Thus it's OK to assume that CR3 points at the kernel counterpart.
Good point, thanks!
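
For the record, here is a minimal sketch (kernel_pgd_from_cr3() is a made-up
name; the constants mirror the ones in the snippet above) of why that masking
always lands on the kernel copy: with PTI, the kernel and user PGDs occupy two
consecutive physical pages, and bit 12 of CR3 selects the user copy.

       /*
        * Sketch only, not part of the patch: clearing the PCID bits
        * (11:0) and, under PTI, bit 12 always recovers the physical
        * address of the kernel PGD.
        */
       static uint64_t
       kernel_pgd_from_cr3(uint64_t cr3, int pti_enabled)
       {
               uint64_t mask = 0xFFFull;               /* CR3_PCID_MASK */

               if (pti_enabled)
                       mask |= 1ull << 12;             /* PTI_USER_PGTABLE_MASK */

               return cr3 & ~mask;
       }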


+ * 2. 4-level paging support only, as the caller (calc_kaslr_offset)
+ * does not support 5-level paging.

According to mm.txt, the address range for kernel text appears to be
the same with 5-level paging. What is the reason not to cover 5-level
paging in this patch? Is there something that cannot be assumed with
5-level paging?
There are no technical challenges; I just do not have an Ice Lake machine to test it on.
Do you know how I can get a dump/vmlinux with 5-level paging enabled?
Once 5-level paging support is done, this method can be used as the default, as there
are no limitations.
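
For reference, the extension itself looks mechanical: with 4-level paging,
pgd_index(__START_KERNEL_map) is 511 and pud_index is 510, and per mm.txt the
kernel text range keeps the same top-entry property with 5 levels. A rough
sketch of the extra walk level (p4d_index(), PTRS_PER_P4D, FILL_P4D() and the
p4d buffer are my assumptions, by analogy with the existing pgd/pud helpers):

       p4d_idx = p4d_index(__START_KERNEL_map);

       for (; pgd_idx < PTRS_PER_PGD; pgd_idx++) {
               pgd_pte = ULONG(machdep->pgd + pgd_idx * sizeof(uint64_t));
               if (!(pgd_pte & _PAGE_PRESENT))
                       continue;

               /* hypothetical FILL_P4D, analogous to FILL_PGD/FILL_PUD */
               FILL_P4D(pgd_pte & PHYSICAL_PAGE_MASK, PHYSADDR, PAGESIZE());
               for (; p4d_idx < PTRS_PER_P4D; p4d_idx++) {
                       p4d_pte = ULONG(machdep->machspec->p4d +
                                       p4d_idx * sizeof(uint64_t));
                       if (p4d_pte & _PAGE_PRESENT)
                               goto found_p4d;
               }
               p4d_idx = 0;    /* restart the p4d scan for later pgd entries */
       }
       return FALSE;
found_p4d:
       /* continue with the pud/pmd/pte walk exactly as in the 4-level case */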


+ */
+static int
+find_kernel_start(ulong *va, ulong *pa)
+{
+       int i, pgd_idx, pud_idx, pmd_idx, pte_idx;
+       uint64_t pgd_pte, pud_pte, pmd_pte, pte;
+
+       pgd_idx = pgd_index(__START_KERNEL_map);
+       pud_idx = pud_index(__START_KERNEL_map);
+       pmd_idx = pmd_index(__START_KERNEL_map);
+       pte_idx = pte_index(__START_KERNEL_map);
+
+       for (; pgd_idx < PTRS_PER_PGD; pgd_idx++) {
+               pgd_pte = ULONG(machdep->pgd + pgd_idx * sizeof(uint64_t));

machdep->pgd is not guaranteed to be aligned to PAGE_SIZE.
This could refer to the pgd for userland that resides in the next page.
I guess it's necessary to get the 1st pgd entry of the page machdep->pgd belongs to.
Like this?

   pgd_pte = ULONG((machdep->pgd & PHYSICAL_PAGE_MASK) + pgd_idx * sizeof(uint64_t));

As I understand it, machdep->pgd is a buffer: a cached copy of some pgd table from the dump.
machdep->pgd does not have to be aligned in memory.
We just need to read at the offset "pgd_idx * sizeof(uint64_t)" to get our pgd_pte.
I think "& PHYSICAL_PAGE_MASK" is not needed here.
Let me know if I'm wrong.
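
For context, a minimal illustration (not the patch itself): crash's ULONG()
macro is simply a dereference at a byte offset into the buffer, so the buffer
pointer itself needs no masking; PHYSICAL_PAGE_MASK belongs on the physical
address that fills the buffer.

       /*
        * ULONG(ADDR) expands to *((ulong *)((char *)(ADDR))), so this
        * reads entry pgd_idx out of the cached page regardless of where
        * the buffer itself sits in memory.
        */
       pgd_pte = ULONG(machdep->pgd + pgd_idx * sizeof(uint64_t));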

But I'm going to introduce a pgd prefetch inside find_kernel_start() so that it does not
depend on a prefetch done by the caller. The caller must then provide the top pgd
physical address:

@@ -350,7 +350,7 @@ quit:
  * does not support 5-level paging.
  */
 static int
-find_kernel_start(ulong *va, ulong *pa)
+find_kernel_start(uint64_t pgd, ulong *va, ulong *pa)
 {
        int i, pgd_idx, pud_idx, pmd_idx, pte_idx;
        uint64_t pgd_pte, pud_pte, pmd_pte, pte;
@@ -361,6 +358,7 @@ find_kernel_start(ulong *va, ulong *pa)
        pmd_idx = pmd_index(__START_KERNEL_map);
        pte_idx = pte_index(__START_KERNEL_map);

+       FILL_PGD(pgd & PHYSICAL_PAGE_MASK, PHYSADDR, PAGESIZE());
        for (; pgd_idx < PTRS_PER_PGD; pgd_idx++) {
                pgd_pte = ULONG(machdep->pgd + pgd_idx * sizeof(uint64_t));
                if (pgd_pte & _PAGE_PRESENT)
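
The call site in calc_kaslr_offset() would then look roughly like this (a
sketch, assuming find_kernel_start() returns TRUE on success; st->_stext_vmlinux
is the unrandomized _stext from the vmlinux, and the phys_base line follows from
PA = VA - __START_KERNEL_map + phys_base for the kernel text mapping):

       if (st->pti_init_vmlinux || st->kaiser_init_vmlinux)
               pgd = cr3 & ~(CR3_PCID_MASK|PTI_USER_PGTABLE_MASK);
       else
               pgd = cr3 & ~CR3_PCID_MASK;

       if (find_kernel_start(pgd, &va, &pa)) {
               *kaslr_offset = va - st->_stext_vmlinux;
               *phys_base = pa - (va - __START_KERNEL_map);
       }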

Thanks for the review. Again, I will wait for 5-level paging dump/machine availability
and send the improved patch.
Do you want me to switch to this method as the default (to use it before the IDTR method)?

Regards,
—Alexey