Leaderboard

In this new leaderboard, the hyperparameter search spaces are larger, and all hyperparameters are selected according to average performance over 3 runs. All reported results are obtained from 10 runs; values in parentheses are standard deviations. -(-) denotes abnormal results caused by under-fitting.
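To make the `mean(std)` notation used in the tables concrete, here is a minimal sketch of how one entry could be produced from per-run scores. The helper name `summarize_runs` and the example scores are hypothetical, and the use of the sample standard deviation is an assumption: the leaderboard does not state which estimator was used.

```python
import statistics


def summarize_runs(scores):
    """Format per-run scores as 'mean(std)' with two decimals."""
    mean = statistics.mean(scores)
    # Sample standard deviation (n - 1 denominator); this choice is an
    # assumption, the leaderboard does not say which estimator it uses.
    std = statistics.stdev(scores)
    return f"{mean:.2f}({std:.2f})"


# Placeholder scores for 10 runs, for illustration only;
# these are not taken from the leaderboard.
example_scores = [70.0, 72.0, 68.0, 71.0, 69.0, 70.0, 73.0, 67.0, 70.0, 70.0]
print(summarize_runs(example_scores))
```

If the population standard deviation were intended instead, `statistics.pstdev` would be the drop-in replacement.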

Graph-level

| GOOD-HIV | scaffold covariate | scaffold concept | size covariate | size concept |
|----------|--------------------|------------------|----------------|--------------|
| ERM      | 69.55(2.39)        | 72.48(1.26)      | 59.19(2.29)    | 61.91(2.29)  |
| IRM      | 70.17(2.78)        | 71.78(1.37)      | 59.94(1.59)    | -(-)         |
| VREx     | 69.34(3.54)        | 72.21(1.42)      | 58.49(2.28)    | 61.21(2.00)  |
| GroupDRO | 68.15(2.84)        | 71.48(1.27)      | 57.75(2.86)    | 59.77(1.95)  |
| Coral    | 70.69(2.25)        | 72.96(1.06)      | 59.39(2.90)    | 60.29(2.50)  |
| DANN     | 69.43(2.42)        | 71.70(0.90)      | 62.38(2.65)    | 65.15(3.13)  |
| Mixup    | 70.65(1.86)        | 71.89(1.73)      | 59.11(3.11)    | 62.80(2.43)  |
| DIR      | 68.44(2.51)        | 71.40(1.48)      | 57.67(3.75)    | 74.39(1.45)  |
| GSAT     | 70.07(1.76)        | 72.51(0.97)      | 60.73(2.39)    | 56.96(1.76)  |

| GOOD-Motif | basis covariate | basis concept | size covariate | size concept |
|------------|-----------------|---------------|----------------|--------------|
| ERM        | 63.80(10.36)    | 81.31(0.69)   | 53.46(4.08)    | 70.83(0.79)  |
| IRM        | 59.93(11.46)    | 80.37(0.80)   | 53.68(4.11)    | 70.15(0.64)  |
| VREx       | 66.53(4.04)     | 81.34(0.75)   | 54.47(3.42)    | 70.58(1.16)  |
| GroupDRO   | 61.96(8.27)     | 81.00(0.60)   | 51.69(2.22)    | 70.35(0.40)  |
| Coral      | 66.23(9.01)     | 81.47(0.49)   | 53.71(2.75)    | 70.52(0.59)  |
| DANN       | 51.54(7.28)     | 81.43(0.60)   | 51.86(2.44)    | 70.74(0.65)  |
| Mixup      | 69.67(5.86)     | 77.64(0.58)   | 51.31(2.56)    | 68.21(0.89)  |
| DIR        | 39.99(5.50)     | 82.96(4.47)   | 44.83(4.00)    | 54.96(9.32)  |
| GSAT       | 55.13(5.41)     | 75.30(1.57)   | 60.76(5.94)    | 59.00(3.42)  |

| GOOD-CMNIST | color covariate | color concept |
|-------------|-----------------|---------------|
| ERM         | 27.82(3.24)     | 42.90(0.67)   |
| IRM         | 29.04(2.10)     | 42.73(0.71)   |
| VREx        | 27.65(2.31)     | 43.22(0.64)   |
| GroupDRO    | 29.23(2.12)     | 43.33(0.67)   |
| Coral       | 29.47(3.15)     | 42.98(0.59)   |
| DANN        | 28.77(1.49)     | 42.84(0.61)   |
| Mixup       | 28.30(1.74)     | 40.70(0.56)   |
| DIR         | 26.20(4.48)     | 28.71(4.66)   |
| GSAT        | 35.62(5.52)     | 47.58(1.15)   |