[Semantic Scholar] – [Code] – [Tweet] – [Video] – [Website] – [Slide] – [Poster]
Change Logs:
- 2023-10-20: First draft. The paper appeared at ACL 2023, where it received a best paper honorable mention.
Method
- The experiments in this paper were done on text-davinci-002 with greedy decoding (temperature 0). The datasets they work on are quite small because of the manual effort required.
- The paper focuses on QA and arithmetic reasoning tasks; the authors introduce two concepts:
  - Bridging Objects
  - Language Template
- The authors define intermediate F1 scores for bridging objects. It is likely that they compute these metrics only on generations that satisfy the predefined template.
- Observations:
  - The correctness of the reasoning in the CoT demonstrations is not important.
  - What matters is that the rationale is (1) relevant to the query and (2) follows the correct order of reasoning steps.
- Additional Observations:
  - CoT does not make LLMs better; it unlocks abilities the LLMs already learned during pre-training. For example, the conclusions drawn on text-davinci-002 do not apply to Flan-PaLM, because Flan-PaLM has been fine-tuned on the two tasks. Given limited resources and the ability to fine-tune the model, we should add more data to pre-training or instruction tuning to improve the model rather than focus on specific prompt engineering tricks.
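To make the intermediate F1 over bridging objects mentioned under Method more concrete, here is a minimal sketch. It assumes that bridging objects in an arithmetic rationale are simply the numbers that appear in it, and that the score is the usual harmonic mean of precision and recall over the set overlap with a reference rationale. The names `extract_numbers` and `intermediate_f1` are hypothetical, and the paper's actual extraction and matching rules may differ.

```python
import re
from typing import Set


def extract_numbers(rationale: str) -> Set[str]:
    """Hypothetical extractor: treat every number in the text as a bridging object."""
    return set(re.findall(r"\d+(?:\.\d+)?", rationale))


def intermediate_f1(predicted: Set[str], gold: Set[str]) -> float:
    """Set-overlap F1 between predicted and gold bridging objects (an assumed definition)."""
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


# Made-up rationales for illustration only.
gold_rationale = "Leah had 32 chocolates and her sister had 42. 32 + 42 = 74. They ate 35, so 74 - 35 = 39."
pred_rationale = "32 + 42 = 74. 74 - 35 = 38."

score = intermediate_f1(extract_numbers(pred_rationale), extract_numbers(gold_rationale))
print(round(score, 2))
```

In this made-up example, four of the five predicted numbers also appear in the reference rationale, so precision and recall are both 0.8. For QA tasks the extractor would need to pick out entities rather than numbers, which this sketch does not attempt.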