[coreboot] Hardware diagnostic

Discussion:

yanvasilij yan

2018-10-26 05:59:07 UTC

Hello!

I have two Intel Atom E3800 based boards. The first one is an older version,
which works properly, and coreboot starts on it. The board has the following
main blocks:

- 3 WGI210IT Ethernet controllers connected directly to the SoC PCIe
bus;

- 2 SD cards, 1 eMMC;

- USB-host;

- Serial legacy port for console;

- 8 MB SPI NOR flash memory for BIOS.

The second one is a modernized version: we added an 89HPES5T5 PCIe I/O
expansion switch, and the WGI210IT based Ethernet ports are now connected to
the switch. We also slightly reworked the power-up circuit of the E3800 SoC.
coreboot doesn't work properly on this board. I suppose this is due to a
hardware bug: there may be a mistake in the circuit design, or there could be
a problem with the manufacturing of the board. So now I'm trying to figure out
what exactly is wrong.

Booting stops when payload loading starts. For the boot log of this board,
see the attached “not_working_board_log.txt”. For the log of the working
board, see “working_board_log.txt”.

nico_h in the IRC chat noticed that a strange device with vid/did
PCI: 00:00.0 [8086/0000] appears on the non-working board. He suggested that
it could be the 8086/0f00 device, and advised me to change the coreboot
sources accordingly:

<nico_h> yanvasilij: in fsp_baytrail/northcluster.c you can duplicate the
`struct pci_driver` at the end, and set `.device = 0x0000,` in the copy

I changed fsp_baytrail/northcluster.c this way:

…

static const struct pci_driver nc_driver __pci_driver = {
	.ops = &nc_ops,
	.vendor = PCI_VENDOR_ID_INTEL,
	.device = 0x0000,
};

…

And it didn't help.
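
As background for the suggestion above, here is a minimal, self-contained sketch of why the device ID matters: a driver only binds to a device whose vendor/device pair matches a driver entry. This is a simplified model, not coreboot's actual code; `struct pci_driver_model`, `drivers` and `find_driver` are hypothetical names. The second table entry plays the role of the duplicated struct with `.device = 0x0000`.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for coreboot's struct pci_driver. */
struct pci_driver_model {
	const char *name;
	uint16_t vendor;
	uint16_t device;
};

#define VENDOR_INTEL 0x8086

/* Two entries: the usual ID and a copy that matches the odd 0x0000 ID. */
static const struct pci_driver_model drivers[] = {
	{ "northcluster", VENDOR_INTEL, 0x0f00 },
	{ "northcluster (copy)", VENDOR_INTEL, 0x0000 },
};

/* Return the first driver whose vendor/device pair matches, or NULL. */
const struct pci_driver_model *find_driver(uint16_t vendor, uint16_t device)
{
	for (size_t i = 0; i < sizeof(drivers) / sizeof(drivers[0]); i++) {
		if (drivers[i].vendor == vendor && drivers[i].device == device)
			return &drivers[i];
	}
	return NULL;
}
```

Without the second entry, a lookup for 8086:0000 finds no driver in this model, so none of that driver's operations would run for the device.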

My .config file is attached. Can somebody advise me on a diagnostic
approach to figure out what is wrong?

Regards,

Vasily

Naresh G. Solanki

2018-10-26 19:16:17 UTC

I suggest you check:
1. the power-on sequence,
2. the stability of every power/voltage rail during the boot process,
3. over/undershoot,
4. the frequency stability of every crystal oscillator during the boot process.
5. Also try to print the DID early, during bootblock and romstage.

Provide your observations.

Regards,
Naresh G Solanki

Peter Stuge

2018-10-27 22:25:11 UTC

Post by yanvasilij yan
I have two Intel Atom E3800 based boards. The first one is an older version,
which works properly,

Post by yanvasilij yan
The second one is modernized version, we added 89HPES5T5 PCIe I/O
expansion switch. And WGI210IT based Ethernet ports are connected
to switch.

So the good news is that your log clearly shows the PCIe switch
working correctly and both NICs behind it reachable by software.

Post by yanvasilij yan
Further we rebuild a bit a power up circuit of the E3800 SOC.

This is quite possibly the root cause, but I wouldn't exclude any
other possibilities.

Post by yanvasilij yan
Launching stops when starts payload loading. The launch log of this board
see in attached “not_working_board_log.txt”.

The log is very clear about why: (down near the end)

--8<--
SELF segment doesn't target RAM: 0x00800000, 4259840 bytes
-->8--

Looking at the coreboot table a little further up, we see:
--8<--
Writing coreboot table at 0x3add3000
0. 0000000000000000-0000000000000fff: CONFIGURATION TABLES
1. 0000000000e00000-0000000000e39fff: RAMSTAGE
2. 000000003ad9e000-000000003adfffff: CONFIGURATION TABLES
3. 00000000feb00000-00000000fec00fff: RESERVED
4. 00000000fed01000-00000000fed01fff: RESERVED
5. 00000000fed03000-00000000fed03fff: RESERVED
6. 00000000fed05000-00000000fed05fff: RESERVED
7. 00000000fed08000-00000000fed08fff: RESERVED
8. 00000000fed0c000-00000000fed0ffff: RESERVED
9. 00000000fed1c000-00000000fed1cfff: RESERVED
10. 00000000fef00000-00000000feffffff: RESERVED
-->8--

Compare that with your working board:
--8<--
Writing coreboot table at 0x3add3000
0. 0000000000000000-0000000000000fff: CONFIGURATION TABLES
1. 0000000000001000-000000000009ffff: RAM
2. 00000000000a0000-00000000000fffff: RESERVED
3. 0000000000100000-0000000000dfffff: RAM
4. 0000000000e00000-0000000000e39fff: RAMSTAGE
5. 0000000000e3a000-000000003ad9dfff: RAM
6. 000000003ad9e000-000000003adfffff: CONFIGURATION TABLES
7. 000000003ae00000-000000003fffffff: RESERVED
8. 00000000e0000000-00000000efffffff: RESERVED
9. 00000000feb00000-00000000fec00fff: RESERVED
10. 00000000fed01000-00000000fed01fff: RESERVED
11. 00000000fed03000-00000000fed03fff: RESERVED
12. 00000000fed05000-00000000fed05fff: RESERVED
13. 00000000fed08000-00000000fed08fff: RESERVED
14. 00000000fed0c000-00000000fed0ffff: RESERVED
15. 00000000fed1c000-00000000fed1cfff: RESERVED
16. 00000000fee00000-00000000fee00fff: RESERVED
17. 00000000fef00000-00000000feffffff: RESERVED
-->8--

The new board ends up with no RAM regions in the coreboot table.

That results in the payload loader not finding RAM where the payload is
to be loaded, so boot stops.
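
To make that check concrete, here is a minimal sketch of testing whether a payload segment lies inside a RAM range of a memory map like the tables above. It is an illustration under stated assumptions, not coreboot's implementation; `struct mem_range`, `enum mem_type` and `segment_targets_ram` are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum mem_type { MEM_RAM, MEM_RESERVED, MEM_TABLES, MEM_RAMSTAGE };

/* One memory map entry; 'end' is inclusive, as in the log output. */
struct mem_range {
	uint64_t start;
	uint64_t end;
	enum mem_type type;
};

/* True if [addr, addr + size) lies entirely inside a single RAM range. */
bool segment_targets_ram(const struct mem_range *map, size_t count,
			 uint64_t addr, uint64_t size)
{
	if (size == 0)
		return false;

	uint64_t last = addr + size - 1;

	for (size_t i = 0; i < count; i++) {
		if (map[i].type == MEM_RAM &&
		    addr >= map[i].start && last <= map[i].end)
			return true;
	}
	return false;
}
```

With the working board's map, the segment at 0x00800000 of 4259840 bytes ends at 0x00c0ffff and fits inside the RAM range 0x00100000-0x00dfffff. With the non-working board's map there is no RAM entry at all, so the same check fails.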

Why are there no RAM regions? I don't know.

Looking near the beginning of the log about FSP memory init:
--8<--
Memory Down Data Existed : Enabled
- Speed (0: 800, 1: 1066, 2: 1333, 3: 1600): 1
- Type (0: DDR3, 1: DDR3L) : 1
- DIMM0 : Enabled
- DIMM1 : Disabled
- Width : x16
- Density : 2Gbit
- BudWidth : 64bit
- Rank # : 1
- tCL : 0B
- tRPtRCD : 0B
- tWR : 0C
- tWTR : 06
- tRRD : 06
- tRTP : 06
- tFAW : 14
Using 1066 MHz DDR3 settings.
1 GB Minnowboard Max detected.
romstage_main_continue status: 0 hob_list_ptr: 3ae20000
FSP Status: 0x0
PM1_STS = 0x1 PM1_CNT = 0x0 GEN_PMCON1 = 0x1001808
romstage_main_continue: prev_sleep_state = S0
Baytrail Chip Variant: Bay Trail-I (ISG/embedded)
MRC v0.102
1 channels of DDR3 @ 1066MHz
-->8--

It appears OK - but do check that those numbers actually match the
DRAM chips assembled on the board. Are DRAM parts identical between
old and new?
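
One quick sanity check on those numbers is the capacity they imply. The sketch below is plain arithmetic, assuming one channel built from identical devices; `channel_capacity_mib` is a hypothetical helper, not an FSP or coreboot function.

```c
#include <stdint.h>

/*
 * Capacity in MiB of one memory channel built from identical DRAM devices.
 * Devices per rank = bus width / device width; each device contributes
 * density_mbit megabits, and 8 megabits make one MiB.
 */
uint32_t channel_capacity_mib(uint32_t bus_width_bits,
			      uint32_t device_width_bits,
			      uint32_t density_mbit, uint32_t ranks)
{
	uint32_t devices_per_rank = bus_width_bits / device_width_bits;

	return devices_per_rank * density_mbit / 8 * ranks;
}
```

For the logged values (64-bit bus, x16 devices, 2 Gbit density, 1 rank) this gives four devices and 1024 MiB, which agrees with the "1 GB" line in the log. If the parts assembled on the new board differ, the result would not match the physical memory.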

Were there *any* hardware changes between SoC and RAM?

That's worth checking, but..

Post by yanvasilij yan
nico_h in the IRC chat noticed that in non-working board appears a strange
device with vid/did PCI: 00:00.0 [8086/0000].

The 0000 is a HUGE red sign, screaming to be thoroughly investigated.

This also hints that the power up changes may be the problem.

It's VERY unlikely that Intel has suddenly released a variant of this
particular SoC with PCI DID=0000 when it used to be DID=0f00. In fact
it's really unlikely that 0000 would be used in correct operation at all.

Very likely, on the other hand, is that the SoC isn't being powered on
correctly and ends up in some half-initialized state, with the memory
controller not working. Some part of coreboot seems to notice (hence no
RAM regions in the coreboot table), but clearly that isn't causing a
fatal error, which I think is a bug. Oh well.

If you go through every single powerup hardware change together with
a hardware engineer, starting with the previous circuit and manually
applying one change at a time, maybe you can find one or even more
changes causing that device ID symptom. It depends on how many changes
you have there, but with a good hardware engineer you could perhaps go
through them all in a couple days, which would be really fast results
for a problem like this.

Maybe even simpler, hack this into some early part of the code, maybe
even console_init() works, if pci_early is available there.

	while (1) {
		/* Read the 16-bit device ID (config offset 0x02) of 00:00.0. */
		u16 did = pci_read_config16(PCI_DEV(0, 0, 0), PCI_DEVICE_ID);

		if (did == 0x0f00)
			printk(BIOS_INFO, "PASS\n");
		else
			printk(BIOS_INFO, "FAIL: want 0f00 is %04x\n", did);
	}

Then hardware engineering can do analysis on their own. But make sure
to confirm that your test is reliable, using the hardware you have
(old and new) before you give a flash image to them.
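
A note on the register layout such a test depends on: in PCI configuration space the vendor ID sits at offset 0x00 and the device ID at offset 0x02, so a 32-bit read at offset 0x00 returns both, with the device ID in the upper 16 bits. The sketch below splits such a dword; the helper names are hypothetical and not part of coreboot.

```c
#include <stdint.h>

/* Vendor ID: low 16 bits of the dword at config offset 0x00. */
uint16_t pci_id_vendor(uint32_t id_dword)
{
	return (uint16_t)(id_dword & 0xffff);
}

/* Device ID: high 16 bits of the same dword (config offset 0x02). */
uint16_t pci_id_device(uint32_t id_dword)
{
	return (uint16_t)(id_dword >> 16);
}

/* Nonzero if the dword identifies the expected vendor/device pair. */
int pci_id_matches(uint32_t id_dword, uint16_t vendor, uint16_t device)
{
	return pci_id_vendor(id_dword) == vendor &&
	       pci_id_device(id_dword) == device;
}
```

A healthy 8086:0f00 device reads back as 0x0f008086, while the 8086:0000 symptom reads back as 0x00008086. Comparing the whole dword against 0x0f00 would therefore never match.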

Oh, and test on multiple new boards, a single unit in a new batch isn't
representative. New PCB; potentially the process has to be tuned. I don't
know how early in bringup you are.

Good luck and have fun! :)

//Peter


Nico Huber

2018-10-27 23:18:08 UTC

Post by Peter Stuge
Why are there no RAM regions? I don't know.

Quite simple, because the code that adds them is tied to 8086:0f00.

Post by Peter Stuge

Post by yanvasilij yan
nico_h in the IRC chat noticed that in non-working board appears a strange
device with vid/did PCI: 00:00.0 [8086/0000].

The 0000 is a HUGE red sign, screaming to be thoroughly investigated.
This also hints that the power up changes may be the problem.

I agree, but note that in the datasheet (or at least the EDS), the DID is
not read-only. A strap (that I couldn't find more about) is supposed to
set the initial value, and a message is mentioned (to be sent to
something I didn't look into) that may change it.

So alternatively to diagnosing the hardware changes, you could also
follow the bread crumbs in the documentation.

Nico

--
coreboot mailing list: ***@coreboot.org
https://mail.coreboot.org/mailman/listinfo/coreboot

yanvasilij yan

2018-11-06 11:51:49 UTC

Many thanks to all!

Sorry for the late response!

You were right, the problem was in the power-up sequence. After fixing it,
everything works correctly. The 0x8086/0x0000 device disappeared, and the
payload loaded successfully.

Peter Stuge, thank you for detailed comments!

Nico Huber, thank you for active participation!
