Reading Notes | Towards Understanding Chain-of-Thought Prompting – An Empirical Study of What Matters

[Semantic Scholar] – [Code] – [Tweet] – [Video] – [Website] – [Slide] – [Poster]

Change Logs:

  • 2023-10-20: First draft. The paper appeared at ACL 2023, where it received a best paper honorable mention.

Method

  • The experiments in this paper were done on text-davinci-002 with greedy decoding at temperature 0 (see the decoding sketch at the end of this section). The datasets they work with are quite small because of the manual annotation effort required.
  • The paper focuses on QA and arithmetic reasoning tasks; the authors introduce two concepts:

    • Bridging Objects: the key and symbolic elements the model must traverse to reach the final answer (e.g., numbers and equations in arithmetic reasoning, or entities in factual QA).
    • Language Template: the complementary textual parts of the rationale that contextualize and connect the bridging objects.
  • The authors define intermediate recall, precision, and F1 scores over bridging objects. It is likely that they only score generations that satisfy the predefined template when computing these metrics; a sketch of the F1 computation appears at the end of this section.
  • Observations:

    • The correctness of the reasoning in the CoT demonstrations is not important: prompts with invalid reasoning retain most of CoT's performance.
    • What does matter is that the demonstrated rationales are (1) relevant to the query and (2) follow the correct order of reasoning steps.
  • Additional Observations:

    • CoT does not make LLMs better; it elicits abilities the LLMs already learned during pre-training. For example, the conclusions drawn on text-davinci-002 do not apply to Flan-PaLM, because Flan-PaLM has been fine-tuned on the two tasks.

      Given limited resources and the ability to fine-tune the model, we should add more data to pre-training or instruction tuning to improve the model rather than focusing on task-specific prompt-engineering tricks.
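
As a concrete reference for the decoding setup mentioned above, here is a minimal sketch using the legacy openai Python SDK (pre-1.0) Completion endpoint that served text-davinci-002; the helper name and the max_tokens budget are assumptions, not details from the paper.

```python
import openai  # legacy SDK (< 1.0); text-davinci-002 has since been deprecated

def greedy_complete(prompt: str) -> str:
    """Greedy decoding: temperature=0 makes the completion deterministic."""
    response = openai.Completion.create(
        model="text-davinci-002",
        prompt=prompt,
        temperature=0,   # greedy decoding, as in the paper
        max_tokens=256,  # assumed budget; the paper's exact value is not in these notes
    )
    return response["choices"][0]["text"]
```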

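A minimal sketch of the intermediate F1 over bridging objects, assuming the predicted and gold objects have already been extracted from template-conforming generations; the multiset exact-match rule below is an assumption, not the authors' released implementation.

```python
from collections import Counter

def bridging_object_f1(predicted: list[str], gold: list[str]) -> float:
    """F1 between predicted and gold bridging objects, matched as multisets."""
    overlap = sum((Counter(predicted) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

# Example from arithmetic reasoning, where bridging objects are numbers and equations.
gold = ["16", "16 / 2 = 8", "8"]
pred = ["16", "16 / 2 = 8", "9"]
print(f"{bridging_object_f1(pred, gold):.2f}")  # 0.67
```
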
Experiment

Additional Notes

Reference