Commit 0f19b136 authored by Zirui Wang, committed by GitHub

Update metric descriptions in leaderboard

parent e7c18c26
......@@ -138,6 +138,14 @@ This python script will automatically match the `scores-<model_name>-<mode>_<spl
We release full results on the validation set (i.e., generated responses, grading done by LLMs, and the aggregated stats) for all models we tested in our [HuggingFace Repo](https://huggingface.co/datasets/princeton-nlp/CharXiv/tree/main/existing_evaluations). If you are interested in doing fine-grained analysis on these results or calculating customized metrics, feel free to use them.
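As a rough illustration of what a customized metric could look like, here is a minimal sketch that groups graded answers by category and reports accuracy per group. The record layout (a `category` label and a 0/1 `score` per question) and the function name `accuracy_by_group` are assumptions made for this example, not the actual schema of the released files, so adapt the keys to whatever the downloaded files contain.

```python
from collections import defaultdict


def accuracy_by_group(records, group_key="category", score_key="score"):
    """Return accuracy (in percent) per group for a list of graded records.

    Hypothetical layout: each record is a dict holding a group label and a
    0/1 score. This is an assumption, not the released file schema.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        group = record[group_key]
        totals[group] += 1
        correct[group] += record[score_key]
    return {group: 100.0 * correct[group] / totals[group] for group in totals}


# Made-up records, only to show the call; they are not real results.
example_records = [
    {"category": "TC", "score": 1},
    {"category": "TC", "score": 0},
    {"category": "NG", "score": 1},
]
result = accuracy_by_group(example_records)
print(result)
```

The same function can group by any other label carried in the records (for example, a chart subject or question type) by changing `group_key`.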
## 🏆 Leaderboard
| Reasoning | Descriptive |
|------------------------|-------------------------------|
| TC = Text-in-Chart | INEX = Information Extraction |
| TG = Text-in-General | ENUM = Enumeration |
| NC = Number-in-Chart | PATT = Pattern Recognition |
| NG = Number-in-General | CNTG = Counting |
| | COMP = Compositionality |
<table><tbody><tr><th>Metadata</th><th></th><th></th><th>Reasoning</th><th></th><th></th><th></th><th></th><th>Descriptive</th><th></th><th></th><th></th><th></th><th></th></tr><tr><td>Model</td><td>Weight</td><td>Size [V/L] (B)</td><td>Overall</td><td>TC</td><td>TG</td><td>NC</td><td>NG</td><td>Overall</td><td>INEX</td><td>ENUM</td><td>PATT</td><td>CNTG</td><td>COMP</td></tr><tr><td>Human</td><td>N/A</td><td>Unknown</td><td>80.50</td><td>77.27</td><td>77.78</td><td>84.91</td><td>83.41</td><td>92.10</td><td>91.40</td><td>91.20</td><td>95.63</td><td>93.38</td><td>92.86</td></tr><tr><td>Claude 3.5 Sonnet</td><td>Proprietary</td><td>Unknown</td><td>60.20</td><td>61.14</td><td>78.79</td><td>63.79</td><td>46.72</td><td>84.30</td><td>82.62</td><td>88.86</td><td>90.61</td><td>90.08</td><td>48.66</td></tr><tr><td>GPT-4o</td><td>Proprietary</td><td>Unknown</td><td>47.10</td><td>50.00</td><td>61.62</td><td>47.84</td><td>34.50</td><td>84.45</td><td>82.44</td><td>89.18</td><td>90.17</td><td>85.50</td><td>59.82</td></tr><tr><td>Gemini 1.5 Pro</td><td>Proprietary</td><td>Unknown</td><td>43.30</td><td>45.68</td><td>56.57</td><td>45.69</td><td>30.57</td><td>71.97</td><td>81.79</td><td>64.73</td><td>79.48</td><td>76.34</td><td>15.18</td></tr><tr><td>InternVL Chat V2.0 Pro</td><td>Proprietary</td><td>Unknown</td><td>39.80</td><td>40.00</td><td>60.61</td><td>44.40</td><td>25.76</td><td>76.83</td><td>77.11</td><td>84.67</td><td>77.07</td><td>78.88</td><td>27.23</td></tr><tr><td>InternVL Chat V2.0 76B</td><td>Open</td><td>5.9 / 70</td><td>38.90</td><td>40.00</td><td>59.60</td><td>42.67</td><td>24.02</td><td>75.17</td><td>77.11</td><td>78.69</td><td>76.20</td><td>79.13</td><td>32.14</td></tr><tr><td>GPT-4V</td><td>Proprietary</td><td>Unknown</td><td>37.10</td><td>38.18</td><td>57.58</td><td>37.93</td><td>25.33</td><td>79.92</td><td>78.29</td><td>85.79</td><td>88.21</td><td>80.92</td><td>41.07</td></tr><tr><td>GPT-4o 
Mini</td><td>Proprietary</td><td>Unknown</td><td>34.10</td><td>35.23</td><td>47.47</td><td>32.33</td><td>27.95</td><td>74.92</td><td>74.91</td><td>82.81</td><td>69.21</td><td>79.13</td><td>35.71</td></tr><tr><td>Gemini 1.5 Flash</td><td>Proprietary</td><td>Unknown</td><td>33.90</td><td>36.36</td><td>54.55</td><td>30.60</td><td>23.58</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td></tr><tr><td>InternVL Chat V2.0 26B</td><td>Open</td><td>5.9 / 20</td><td>33.40</td><td>33.18</td><td>51.52</td><td>41.81</td><td>17.47</td><td>62.40</td><td>71.35</td><td>61.02</td><td>55.90</td><td>67.94</td><td>6.25</td></tr><tr><td>Claude 3 Sonnet</td><td>Proprietary</td><td>Unknown</td><td>32.20</td><td>31.59</td><td>50.51</td><td>31.47</td><td>26.20</td><td>73.65</td><td>75.74</td><td>81.92</td><td>76.64</td><td>72.26</td><td>8.48</td></tr><tr><td>Claude 3 Haiku</td><td>Proprietary</td><td>Unknown</td><td>31.80</td><td>29.77</td><td>45.45</td><td>34.48</td><td>27.07</td><td>65.08</td><td>69.87</td><td>69.98</td><td>64.85</td><td>61.83</td><td>8.04</td></tr><tr><td>Phi-3 Vision</td><td>Open</td><td>0.3 / 4</td><td>31.60</td><td>31.36</td><td>46.46</td><td>35.78</td><td>21.40</td><td>60.48</td><td>67.62</td><td>61.18</td><td>54.59</td><td>65.39</td><td>6.25</td></tr><tr><td>Claude 3 Opus</td><td>Proprietary</td><td>Unknown</td><td>30.20</td><td>26.36</td><td>50.51</td><td>33.62</td><td>25.33</td><td>71.55</td><td>75.62</td><td>73.69</td><td>73.58</td><td>70.48</td><td>26.79</td></tr><tr><td>InternVL Chat V1.5</td><td>Open</td><td>5.9 / 20</td><td>29.20</td><td>30.00</td><td>45.45</td><td>32.33</td><td>17.47</td><td>58.50</td><td>69.63</td><td>52.95</td><td>53.06</td><td>64.63</td><td>5.80</td></tr><tr><td>Reka Core</td><td>Proprietary</td><td>Unknown</td><td>28.90</td><td>27.50</td><td>41.41</td><td>28.45</td><td>26.64</td><td>55.60</td><td>58.90</td><td>50.52</td><td>65.72</td><td>71.25</td><td>10.71</td></tr><tr><td>Ovis 1.5 Gemma2 9B</td><td>Open</td><td>0.4 / 
9</td><td>28.40</td><td>26.14</td><td>44.44</td><td>33.19</td><td>20.96</td><td>62.60</td><td>64.29</td><td>71.75</td><td>56.33</td><td>66.16</td><td>5.80</td></tr><tr><td>Ovis 1.5 Llama3 8B</td><td>Open</td><td>0.4 / 8</td><td>28.20</td><td>27.27</td><td>49.49</td><td>31.03</td><td>17.90</td><td>60.15</td><td>61.39</td><td>68.93</td><td>56.33</td><td>61.83</td><td>7.14</td></tr><tr><td>Cambrian 34B</td><td>Open</td><td>1.9 / 34</td><td>27.30</td><td>24.55</td><td>44.44</td><td>27.59</td><td>24.89</td><td>59.73</td><td>59.31</td><td>70.94</td><td>53.28</td><td>64.63</td><td>5.36</td></tr><tr><td>Reka Flash</td><td>Proprietary</td><td>Unknown</td><td>26.60</td><td>26.59</td><td>39.39</td><td>30.60</td><td>17.03</td><td>56.45</td><td>61.39</td><td>48.59</td><td>69.87</td><td>72.52</td><td>7.14</td></tr><tr><td>Mini Gemini HD Yi 34B</td><td>Open</td><td>0.5 / 34</td><td>25.00</td><td>26.59</td><td>43.43</td><td>27.16</td><td>11.79</td><td>52.68</td><td>53.86</td><td>55.04</td><td>65.50</td><td>53.94</td><td>2.23</td></tr><tr><td>InternLM XComposer2 4KHD</td><td>Open</td><td>0.3 / 7</td><td>25.00</td><td>23.86</td><td>43.43</td><td>29.31</td><td>14.85</td><td>54.65</td><td>61.09</td><td>54.08</td><td>51.53</td><td>59.80</td><td>6.70</td></tr><tr><td>MiniCPM-V2.5</td><td>Open</td><td>0.4 / 8</td><td>24.90</td><td>25.23</td><td>43.43</td><td>25.43</td><td>15.72</td><td>59.27</td><td>62.28</td><td>61.90</td><td>56.77</td><td>68.96</td><td>10.27</td></tr><tr><td>Qwen VL Max</td><td>Proprietary</td><td>Unknown</td><td>24.70</td><td>26.14</td><td>41.41</td><td>24.57</td><td>14.85</td><td>41.48</td><td>50.42</td><td>28.41</td><td>53.71</td><td>51.15</td><td>4.46</td></tr><tr><td>VILA 1.5 40B</td><td>Open</td><td>5.9 / 34</td><td>24.00</td><td>21.59</td><td>41.41</td><td>25.00</td><td>20.09</td><td>38.67</td><td>42.88</td><td>29.62</td><td>51.31</td><td>50.89</td><td>9.82</td></tr><tr><td>Reka 
Edge</td><td>Proprietary</td><td>Unknown</td><td>23.50</td><td>20.23</td><td>32.32</td><td>30.60</td><td>18.78</td><td>33.65</td><td>36.65</td><td>28.49</td><td>34.72</td><td>52.16</td><td>4.91</td></tr><tr><td>Gemini 1.0 Pro</td><td>Proprietary</td><td>Unknown</td><td>22.80</td><td>20.91</td><td>48.48</td><td>18.10</td><td>20.09</td><td>54.37</td><td>67.97</td><td>39.23</td><td>60.48</td><td>62.60</td><td>8.93</td></tr><tr><td>LLaVA 1.6 Yi 34B</td><td>Open</td><td>0.3 / 34</td><td>22.50</td><td>20.45</td><td>37.37</td><td>23.71</td><td>18.78</td><td>51.05</td><td>46.38</td><td>63.44</td><td>56.11</td><td>51.91</td><td>5.80</td></tr><tr><td>Mini Gemini HD Llama3 8B</td><td>Open</td><td>0.5 / 8</td><td>19.00</td><td>19.77</td><td>36.36</td><td>21.12</td><td>7.86</td><td>44.42</td><td>49.41</td><td>39.23</td><td>51.09</td><td>55.98</td><td>1.79</td></tr><tr><td>InternLM XComposer2</td><td>Open</td><td>0.3 / 7</td><td>18.70</td><td>16.14</td><td>38.38</td><td>21.98</td><td>11.79</td><td>38.75</td><td>34.10</td><td>43.58</td><td>46.72</td><td>52.93</td><td>5.80</td></tr><tr><td>MiniCPM-V2</td><td>Open</td><td>0.4 / 2.4</td><td>18.50</td><td>17.95</td><td>33.33</td><td>19.40</td><td>12.23</td><td>35.77</td><td>39.74</td><td>36.56</td><td>26.42</td><td>44.53</td><td>5.36</td></tr><tr><td>IDEFICS 2</td><td>Open</td><td>0.4 / 7</td><td>18.20</td><td>15.45</td><td>35.35</td><td>17.24</td><td>17.03</td><td>32.77</td><td>36.12</td><td>27.28</td><td>40.83</td><td>43.26</td><td>3.12</td></tr><tr><td>IDEFICS 2 Chatty</td><td>Open</td><td>0.4 / 7</td><td>17.80</td><td>15.45</td><td>34.34</td><td>19.83</td><td>13.10</td><td>41.55</td><td>34.88</td><td>54.56</td><td>45.63</td><td>44.27</td><td>6.70</td></tr><tr><td>MoAI</td><td>Open</td><td>0.3 / 7</td><td>17.50</td><td>9.32</td><td>36.36</td><td>21.12</td><td>21.40</td><td>28.70</td><td>31.20</td><td>21.23</td><td>39.96</td><td>40.46</td><td>7.59</td></tr><tr><td>DeepSeek VL</td><td>Open</td><td>0.5 / 
7</td><td>17.10</td><td>16.36</td><td>32.32</td><td>19.83</td><td>9.17</td><td>45.80</td><td>49.11</td><td>45.20</td><td>42.79</td><td>60.31</td><td>4.91</td></tr><tr><td>SPHINX V2</td><td>Open</td><td>1.9 / 13</td><td>16.10</td><td>13.86</td><td>28.28</td><td>17.67</td><td>13.54</td><td>30.25</td><td>35.59</td><td>24.37</td><td>41.05</td><td>29.52</td><td>1.79</td></tr><tr><td>Qwen VL Plus</td><td>Proprietary</td><td>Unknown</td><td>16.00</td><td>15.45</td><td>45.45</td><td>12.07</td><td>8.30</td><td>28.93</td><td>33.33</td><td>17.92</td><td>32.10</td><td>56.23</td><td>2.23</td></tr><tr><td>LLaVA 1.6 Mistral 7B</td><td>Open</td><td>0.3 / 7</td><td>13.90</td><td>11.36</td><td>32.32</td><td>16.81</td><td>7.86</td><td>35.40</td><td>34.70</td><td>33.98</td><td>48.91</td><td>42.49</td><td>8.48</td></tr><tr><td>ChartGemma</td><td>Open</td><td>0.4 / 2</td><td>12.50</td><td>11.59</td><td>24.24</td><td>16.81</td><td>4.80</td><td>21.30</td><td>27.58</td><td>18.97</td><td>14.19</td><td>19.59</td><td>4.46</td></tr><tr><td>Random (GPT-4o)</td><td>N/A</td><td>Unknown</td><td>10.80</td><td>4.32</td><td>39.39</td><td>5.60</td><td>16.16</td><td>19.85</td><td>21.65</td><td>16.71</td><td>23.80</td><td>25.70</td><td>5.36</td></tr></tbody></table>
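
If you want to work with the leaderboard numbers programmatically, the table above can be read with Python's standard library. The following is a minimal sketch, assuming the table keeps its current layout (one group-header row, one column-name row, then one row per model, with `-` marking missing scores). `TableParser`, `parse_leaderboard`, and `SAMPLE_HTML` are names introduced for this example; the sample is a three-model excerpt of the table.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse whitespace so names wrapped across lines stay intact.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def to_float(cell):
    """Return the score as a float, or None for a missing entry such as '-'."""
    try:
        return float(cell)
    except ValueError:
        return None


def parse_leaderboard(html):
    """Turn the leaderboard table into a list of dicts keyed by 'Group/Column'."""
    parser = TableParser()
    parser.feed(html)
    group_row, name_row, *body = parser.rows
    keys, group = [], ""
    for header, name in zip(group_row, name_row):
        group = header or group  # forward-fill the empty group header cells
        keys.append(f"{group}/{name}")
    records = []
    for row in body:
        # The first three columns are metadata; the rest are scores.
        values = row[:3] + [to_float(cell) for cell in row[3:]]
        records.append(dict(zip(keys, values)))
    return records


# Excerpt of the table above (header rows plus three models).
SAMPLE_HTML = """
<table><tbody>
<tr><th>Metadata</th><th></th><th></th><th>Reasoning</th><th></th><th></th><th></th><th></th><th>Descriptive</th><th></th><th></th><th></th><th></th><th></th></tr>
<tr><td>Model</td><td>Weight</td><td>Size [V/L] (B)</td><td>Overall</td><td>TC</td><td>TG</td><td>NC</td><td>NG</td><td>Overall</td><td>INEX</td><td>ENUM</td><td>PATT</td><td>CNTG</td><td>COMP</td></tr>
<tr><td>Human</td><td>N/A</td><td>Unknown</td><td>80.50</td><td>77.27</td><td>77.78</td><td>84.91</td><td>83.41</td><td>92.10</td><td>91.40</td><td>91.20</td><td>95.63</td><td>93.38</td><td>92.86</td></tr>
<tr><td>GPT-4o</td><td>Proprietary</td><td>Unknown</td><td>47.10</td><td>50.00</td><td>61.62</td><td>47.84</td><td>34.50</td><td>84.45</td><td>82.44</td><td>89.18</td><td>90.17</td><td>85.50</td><td>59.82</td></tr>
<tr><td>Gemini 1.5 Flash</td><td>Proprietary</td><td>Unknown</td><td>33.90</td><td>36.36</td><td>54.55</td><td>30.60</td><td>23.58</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td></tr>
</tbody></table>
"""

records = parse_leaderboard(SAMPLE_HTML)

# Rank models by overall reasoning score, skipping rows without a score.
ranked = sorted(
    (r for r in records if r["Reasoning/Overall"] is not None),
    key=lambda r: r["Reasoning/Overall"],
    reverse=True,
)
for r in ranked:
    print(r["Metadata/Model"], r["Reasoning/Overall"])
```

Keys are prefixed with the group name because `Overall` appears under both Reasoning and Descriptive; missing scores are returned as `None` so they can be filtered out before sorting or averaging.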
......