ACL 2023 Tutorial: Retrieval-based Language Models and Applications
Akari Asai, Sewon Min, Zexuan Zhong, Danqi Chen https://acl2023-retrieval-lm.github.io/
This page summarizes some of the key points from the tutorial.
Speakers' note: this tutorial is forward-looking. Compared with training new LLMs, we still understand far less about how to best develop retrieval-based LMs. The tutorial mainly shares:
A categorization of existing research, with key insights
Our views on current challenges and open problems
1. Introduction
1. What are retrieval-based language models (LMs)?
Retrieval-based LMs = Retrieval + LMs: the language model retrieves from an external datastore (at least during inference)

Such models are also called semiparametric or non-parametric models
2. The age of large language models (LLMs): a brief overview of the characteristics of current LLMs
Transformers-based, fully parametric
Trained on next-token prediction tasks (+ RLHF)
Model size ↑, data size ↑
3. Retrieval for knowledge-intensive NLP tasks
Representative tasks: open-domain QA, fact-checking, entity linking...
These tasks have driven a large amount of research on better dense-retrieval algorithms, e.g., DPR, ColBERT, ANCE, Contriever, ...
4. Why retrieval-based LMs?
LLMs can't memorize all (long-tail) knowledge in their parameters: the parameters have only limited capacity for memorizing knowledge
LLMs' knowledge is easily outdated and hard to update: existing knowledge-editing methods are still not scalable (an open research direction), whereas a datastore can be easily updated and expanded, even without retraining the model
LLMs' output is challenging to interpret and verify: tracing knowledge back to its source in the retrieval results gives better interpretability and control (generating text with citations, like the new Bing)
LLMs are shown to easily leak private training data: private data can instead be kept in the datastore and handled individually, rather than being trained into the model parameters
LLMs are large and expensive to train and run: with a datastore that is searched at inference time, model size and cost can be reduced. Long-term goal: can we possibly reduce the training and inference costs, and scale down the size of LLMs?
2. Definition & Preliminaries
1. A retrieval-based LM: definition. A language model (LM) that uses an external datastore at test time
2. A language model (LM): Categories

A question here: why have decoder-only models become the dominant architecture for today's LLMs?
Reference blog posts:
https://kexue.fm/archives/9529
https://www.zhihu.com/question/588325646
Main viewpoint: any NLP task can be decomposed into an "input" part and an "output" part. Call the model that processes the input the Encoder and the model that generates the output the Decoder; then every task can be understood from an "Encoder-Decoder" perspective, and the differences between models lie in the attention patterns of the Encoder and Decoder, and in whether they share parameters. For example:
Model | Encoder attention | Decoder attention | Shared parameters
GPT | unidirectional | unidirectional | yes
UniLM | bidirectional | unidirectional | yes
T5 | bidirectional | unidirectional | no
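The attention patterns in the table can be made concrete as mask matrices. A toy NumPy sketch (my addition, not from the tutorial): a unidirectional (causal) mask lets position i attend only to positions ≤ i, a bidirectional mask allows all positions, and a UniLM-style mask mixes the two within one stack.

```python
import numpy as np

def causal_mask(n):
    """Unidirectional (decoder-style) attention: token i sees tokens <= i."""
    return np.tril(np.ones((n, n), dtype=bool))

def bidirectional_mask(n):
    """Bidirectional (encoder-style) attention: every token sees all tokens."""
    return np.ones((n, n), dtype=bool)

def unilm_mask(n_src, n_tgt):
    """UniLM-style seq2seq mask in a single stack: bidirectional over the
    n_src input tokens, causal over the n_tgt output tokens."""
    n = n_src + n_tgt
    m = causal_mask(n)
    m[:n_src, :n_src] = True  # input tokens attend bidirectionally to input
    return m
```

Under this view GPT uses `causal_mask` everywhere, while T5 applies `bidirectional_mask` in a separate encoder and `causal_mask` in the decoder.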
Further reading:
Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth
3. A language model (LM): Prompting
I.e., using different prompts to get the LLM to complete different tasks
4. A language model (LM): Often evaluated with
Evaluation metrics: 1. Perplexity 2. Downstream accuracy (zero-shot or few-shot in-context learning, or fine-tuning); covered in more detail later in the tutorial
A question: why use perplexity as the primary metric in this tutorial?
"When comparing parametric language models, perplexity (PPL) is frequently used. Whether improvements in perplexity transfer to downstream applications is still a research question, but existing studies show that perplexity correlates well with downstream tasks (especially generation tasks), and perplexity usually gives very stable results that can be evaluated on large-scale evaluation data (the evaluation data is unlabeled, whereas downstream tasks can suffer from prompt sensitivity and a lack of large labeled data, leading to unstable results)."
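For reference, a minimal sketch (my addition, not from the tutorial) of how perplexity is computed from the per-token probabilities a model assigns:

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability that the
    model assigned to each observed token (lower is better)."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model that assigns probability 1/4 to every token has perplexity ~4:
# it is "as confused" as a uniform choice among 4 options.
ppl = perplexity([0.25, 0.25, 0.25])
```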
5. Inference: Datastore

6. Inference: Index
Goal: find a small subset of elements in the datastore that are most similar to the query
sim: a similarity score between two pieces of text. Below are some examples of similarity scores

Index: given a query, return the k elements with the largest sim via fast nearest-neighbor search (itself a research direction: how to search faster and more accurately)
Related software: FAISS, Distributed FAISS, SCaNN, etc.
Reference: Faiss
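At toy scale, the index operation (return the k datastore items with the largest sim) is just a brute-force top-k over inner products; libraries like FAISS provide the same interface with fast, often approximate, nearest-neighbor search. A NumPy sketch with made-up vectors:

```python
import numpy as np

def topk_inner_product(datastore, query, k):
    """Exact MIPS: indices and scores of the k datastore vectors with the
    largest inner product <query, d>. FAISS/ScaNN approximate this quickly
    at scale."""
    scores = datastore @ query
    idx = np.argsort(-scores)[:k]
    return idx, scores[idx]

datastore = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
idx, scores = topk_inner_product(datastore, np.array([1.0, 0.2]), k=2)
```

With Faiss, the same exact search is roughly `index = faiss.IndexFlatIP(d)`, `index.add(vectors)`, `index.search(queries, k)` on float32 arrays (check the Faiss docs for details).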
3. Retrieval-based LM: Architecture
1. Categorization of retrieval-based LMs

2. Roadmap
Recent research, organized by what to retrieve, how to use the retrieval results, and when to retrieve, is summarized in the roadmap below:

REALM (Guu et al. 2020) -- 10 Feb 2020
This section introduces the first architecture, REALM (Retrieval-Augmented Language Model Pre-Training).

Some reading notes from Zhihu:
Motivation: pre-trained language models can learn a large amount of world knowledge from unsupervised text corpora. However, this knowledge is stored in the parameters, which has two drawbacks: 1. the knowledge is implicit, so it is hard to interpret what knowledge the model stores and uses; 2. the amount of knowledge a model can learn correlates with model size (parameter count), so learning more knowledge requires scaling up the model.
Pre-training procedure: 1. sample a passage from the pre-training corpus and mask some tokens ("the [MASK] at the top of the pyramid"); 2. use a retrieval model to retrieve, from an external knowledge corpus (e.g., Wikipedia documents), documents that help recover the masked token ("The pyramidion on top allows for less material higher up the pyramid"); 3. use the information in the sample x together with the retrieved document z to jointly predict the masked token ("pyramidion").
Model structure: both pre-training and fine-tuning are modeled as a retrieve-then-predict process. The authors treat the retrieved document z as a latent variable and model the final task objective p(y|x) as the marginal probability over all candidate documents z: p(y|x) = Σ_z p(y|z, x) p(z|x).
Two components: the neural knowledge retriever, which models p(z|x), and the knowledge-augmented encoder, which models p(y|z, x).
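The retrieve-then-predict objective can be sketched numerically (toy made-up numbers, not REALM's actual model): marginalize the encoder's prediction over the retriever's document distribution.

```python
import numpy as np

# p(z|x): the retriever's distribution over 3 candidate documents
p_z_given_x = np.array([0.7, 0.2, 0.1])
# p(y|z,x): the encoder's probability of the correct answer given each doc
p_y_given_zx = np.array([0.9, 0.5, 0.1])

# Marginalizing over the latent document z:
# p(y|x) = sum_z p(y|z,x) * p(z|x) = 0.63 + 0.10 + 0.01
p_y_given_x = float(np.sum(p_y_given_zx * p_z_given_x))
```

Because p(z|x) appears inside the sum, the gradient of the task loss flows into the retriever as well; this is what lets REALM train retrieval end-to-end.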


In the pre-training stage the task is MLM; in the fine-tuning stage the task is open-domain QA.
Training details: to cope with the large retrieval corpus, the pre-training stage uses Maximum Inner Product Search (MIPS, i.e., kNN in inner-product space) to find the top-k most relevant documents. To avoid the severe cost of continually refreshing the MIPS index, it is rebuilt only every several hundred steps; the index is used only to select the top-k documents, while the gradients at every training step are still computed with the latest retriever parameters. In the fine-tuning stage the MIPS index is built once (with the pre-trained retriever parameters) and never updated; the authors argue that by then the retriever has already learned good document-relevance representations, but note that iteratively refreshing the MIPS index during fine-tuning as well might work even better.
Tricks: 1. Salient span masking (SSM): during MLM pre-training, mask salient entities/dates rather than random tokens; 2. Null document: some MLM examples need no external document support; 3. Avoiding information leakage: when the MLM training corpus overlaps with the retrieval corpus, avoid retrieving the original text of the sample x itself; 4. Retriever initialization (the cold-start problem): with a randomly initialized retriever, the retrieved documents would most likely be completely irrelevant and the model would get no meaningful gradient; to avoid this, the authors warm-start the retriever with the Inverse Cloze Test (ICT) task.
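The stale-index trick can be sketched as a loop (toy NumPy stand-ins for the retriever and corpus; REALM's real implementation is asynchronous and distributed): the MIPS index used to pick top-k candidates is rebuilt only every `refresh_every` steps, while the scores the loss sees are always recomputed with the current retriever parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(100, 8))      # raw documents (fixed)
retriever = np.eye(8)                 # toy retriever parameters

def embed_docs(params):
    return docs @ params

refresh_every = 10
index = embed_docs(retriever)         # stale snapshot used for top-k search

for step in range(30):
    if step % refresh_every == 0:
        index = embed_docs(retriever)  # periodic, expensive rebuild
    query = rng.normal(size=8)
    # 1) candidate selection with the STALE index (cheap MIPS / top-k)
    top_k = np.argsort(-(index @ query))[:5]
    # 2) scores used in the loss come from the CURRENT retriever params,
    #    so each step's gradient is fresh even though the index is stale
    fresh_scores = embed_docs(retriever)[top_k] @ query
    # (a real trainer would backprop through fresh_scores here)
    retriever += 0.001 * rng.normal(size=(8, 8))  # stand-in for an update
```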
Summary of related work: REALM and subsequent work
REALM (Guu et al 2020): MLM followed by fine-tuning, focusing on open-domain QA
DPR (Karpukhin et al 2020): Pipeline training instead of joint training, focusing on open-domain QA (no explicit language modeling)
RAG (Lewis et al 2020): โGenerativeโ instead of โmasked language modelingโ, focusing on open-domain QA & knowledge intensive tasks (no explicit language modeling)
Atlas (Izacard et al. 2022): Combine RAG with retrieval-based language model pre-training based on the encoder-decoder architecture (more to come in Section 4), focusing on open-domain QA & knowledge intensive tasks
Papers that follow this approach focusing on LM perplexity have come out quite recently: Ram et al. 2023, "In-Context Retrieval-Augmented Language Models"; Shi et al. 2023, "REPLUG: Retrieval-Augmented Black-Box Language Models"
Retrieval-in-context LM
Related papers:
In-Context Retrieval-Augmented Language Models
Some experimental conclusions from the paper above: 1. Retrieval helps LMs of all sizes 2. A shorter prefix (more recent tokens) as a query helps 3. Retrieving more frequently helps (but consumes more inference time)
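Retrieval-in-context requires no change to the LM itself: the retrieved chunks are simply prepended to the prefix before calling the frozen model. A sketch with hypothetical strings:

```python
def build_prompt(retrieved_chunks, prefix):
    """Prepend retrieved passages to the LM prefix; the frozen LM then
    generates conditioned on both."""
    context = "\n\n".join(retrieved_chunks)
    return f"{context}\n\n{prefix}"

prompt = build_prompt(
    ["The pyramidion is the capstone at the top of a pyramid."],
    "Q: What sits at the top of a pyramid?\nA:",
)
```

Conclusion 3 above (retrieving more frequently helps) corresponds to rebuilding this prompt every stride of n generated tokens, using the most recent tokens as the new query.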
REPLUG: Retrieval-Augmented Black-Box Language Models

"Incorporation in the 'intermediate layer' instead of the 'input' layer → designed for many chunks, frequently, more efficiently"
RETRO (Retrieval-Enhanced Transformer) -- improving language models through explicit memory at unprecedented scale
Incorporates retrieval in intermediate layers rather than the input layer + scales up the datastore
Related notes:
https://zhuanlan.zhihu.com/p/475346411
https://www.cnblogs.com/Matrix_Yao/p/16480698.html

Motivation: as model parameters ↑ and training data ↑, a series of problems arise, such as the training data becoming hard to understand and increased model bias. To address this, the DeepMind team developed an efficient pre-trained model with internet-scale retrieval. With RETRO, the model is not limited to the data seen during training; it can access the entire training set through the retrieval mechanism. Compared with a standard Transformer with the same number of parameters, this brings a clear performance improvement.
Dataset: the MassiveText dataset (from the Gopher paper)
Proposes a way to avoid test-set leakage: since retrieval can directly access the training set, preventing data leakage is important, so the authors propose an evaluation that measures how close test documents are to the training set (see "Deduplicating Training Data Makes Language Models Better").
Model structure: the RETRO architecture consists of an encoder stack (processing the retrieved neighbours) and a decoder stack (processing the input). The encoder stack is built from standard Transformer encoder blocks; the decoder stack contains both standard Transformer decoder blocks and RETRO decoder blocks (ATTN + Chunked cross attention (CCA) + FFNN (feed-forward neural network)).

Brief pipeline:

Comparison:

Thought: besides splitting retrieved text into chunks, how else could the data in the datastore be processed?
Proposes kNN-LMs: combining the k nearest neighbours of semantically encoded feature vectors with an ordinary language model significantly improves language-model performance
"A different way of using retrieval, where the LM outputs a nonparametric distribution over every token in the data."
"Can be seen as an incorporation in the 'output' layer"
Motivation: a language model uses the chain rule to assign a probability to a sentence and mainly has to solve two problems: (1) obtain a context representation; (2) use the context representation to predict the next token. Both are usually handled by a single autoregressive model, and a common problem with autoregressive language modeling is the difficulty of fully capturing long-range dependencies. Hence this paper proposes combining the language model with the k nearest neighbours of the context representation to better capture the semantic relations between contexts.
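The output-layer combination can be sketched as an interpolation of two vocabulary distributions (toy numbers; λ is a tuned hyperparameter, and p_kNN comes from a softmax over the negative distances of the k retrieved datastore keys, aggregated by the next-token each key's value stores):

```python
import numpy as np

def knn_lm_interpolate(p_lm, p_knn, lam):
    """kNN-LM combination: p(y|x) = lam * p_kNN(y|x) + (1-lam) * p_LM(y|x)."""
    return lam * p_knn + (1 - lam) * p_lm

p_lm = np.array([0.5, 0.3, 0.2])   # parametric LM's next-token distribution
p_knn = np.array([0.0, 1.0, 0.0])  # all retrieved neighbours store token 1
p = knn_lm_interpolate(p_lm, p_knn, lam=0.25)  # argmax flips to token 1
```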
Model structure:

For the concrete pipeline, see the slides; the figures there explain it clearly
Experimental results:


Can use in-domain datastore even if parameters were not trained in-domain
Comparison summary:
Advantages of kNN-LM: finer granularity, so it handles rare patterns & repeated data better; can be very efficient (since kNN search is fast)
Disadvantages: no cross-attention between the input and the retrieval results; the datastore takes a lot of storage

Thought: in "when to retrieve", could "every n tokens" and "every token" be made adaptive?
Adaptive retrieval for efficiency
Two categories: adaptive retrieval of text chunks (following retrieve-in-context); adaptive retrieval of tokens (following kNN-LM)
Motivation and overview: most existing retrieval-augmented LMs use a retrieve-and-generate setup that retrieves information once based on the query. However, in the more general scenario of generating long texts, continually gathering information throughout the generation process is essential. Past efforts to retrieve multiple times while generating mostly retrieve documents at fixed intervals, using the previous context as the query. This work provides a generalized view of active retrieval-augmented generation: methods that actively decide when and what to retrieve during generation. It proposes Forward-Looking Active REtrieval augmented generation (FLARE), a generic method that iteratively uses a prediction of the upcoming sentence to anticipate future content and, if it contains low-confidence tokens, uses it as a query to retrieve relevant documents and regenerate the sentence.
Starting from long-form generation tasks: a single retrieval cannot meet the need. Much as humans gradually gather information while writing papers or books, long-form generation with LMs requires gathering multiple pieces of knowledge throughout the generation process. The approach adopted here is to generate a temporary next sentence, use it as the query to retrieve relevant documents, and then regenerate the next sentence based on the retrieved documents.
FLARE iteratively generates a temporary next sentence; if it contains low-probability tokens, that sentence is used as a query to retrieve relevant documents and the next sentence is regenerated; this repeats until generation finishes.
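FLARE's control flow can be sketched as a loop (hypothetical `generate` and `retrieve` callables, not the paper's actual API; `theta` is the confidence threshold):

```python
def flare_generate(question, generate, retrieve, theta=0.6, max_sents=10):
    """Active retrieval sketch: draft the next sentence without retrieval;
    if any of its tokens falls below the confidence threshold, retrieve
    with the draft as the query and regenerate that sentence."""
    answer = []
    for _ in range(max_sents):
        draft, token_probs = generate(question, answer, docs=None)
        if draft is None:                 # the model chose to stop
            break
        if min(token_probs) < theta:      # low-confidence draft
            docs = retrieve(draft)        # temporary sentence as the query
            draft, _ = generate(question, answer, docs=docs)
        answer.append(draft)
    return " ".join(answer)
```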
Thought: what counts as a low-probability token, and how should the threshold be set?

See the slides for the detailed pipeline
Adaptive retrieval of tokens -- judge necessity -- Efficient Nearest Neighbor Language Models

Adaptive retrieval of tokens Use local info -- RETOMATON -- Neuro-Symbolic Language Modeling with Automaton-augmented Retrieval

Summary:

Thought: what else beyond text chunks and tokens could be retrieved?
The Entities-as-Experts model
Introduces a new model, Entities as Experts (EAE), that can access distinct memories of the entities mentioned in a piece of text. Unlike previous efforts to inject entity-specific knowledge into sequence models, this model learns entity representations directly from text, jointly with all its other parameters.
As the figure shows, a traditional Transformer needs to build an internal representation of Charles Darwin from the words "Charles" and "Darwin", both of which can also refer to different entities, such as the Charles River or the city of Darwin. By contrast, EAE can access a dedicated representation of "Charles Darwin", a distilled memory of all the contexts in which this entity has been mentioned.

From one vector per entity to one vector per entity mention -- Mention Memory: incorporating textual knowledge into Transformers through entity mention attention
Abstract (translated):
Natural language understanding tasks such as open-domain question answering often require retrieving and assimilating factual information from multiple sources. We propose to address this by integrating a semi-parametric representation of a large text corpus into a Transformer model as a source of factual knowledge.
Specifically, our method represents knowledge with "mention memory", a table of dense vector representations of every entity mention in the corpus. The proposed model, TOME, is a Transformer that accesses this information through internal memory layers, in which each entity mention in the input passage attends to the mention memory. This approach enables synthesis of, and reasoning over, many disparate sources of information within a single Transformer model. In experiments using a memory of 150 million Wikipedia mentions, TOME achieves strong performance on multiple open-domain knowledge-intensive tasks, including the claim-verification benchmarks HoVer and FEVER and several entity-based QA benchmarks. We also show that the model learns to attend to informative mentions without any direct supervision. Finally, we show that the model can generalize to new, unseen entities by updating the memory without retraining.

Summary:

Advantages: effective for entity-centric tasks & space-efficient
Disadvantages: requires extra entity detection
All the models above retrieve external text; are there other options?
Retrieval for long-range LM
Language models typically need to be trained or fine-tuned, updating their weights, in order to acquire new knowledge. We instead envision language models that can simply read and memorize new data at inference time, acquiring new knowledge immediately. In this work, we extend language models with the ability to memorize the internal representations of past inputs. We demonstrate that an approximate kNN lookup into a non-differentiable memory of recent (key, value) pairs improves language modeling across various benchmarks and tasks, including generic webtext (C4), math papers (arXiv), books (PG-19), code (Github), as well as formal theorems (Isabelle). We show that performance steadily improves as the memory size is increased up to 262K tokens. On benchmarks including code and mathematics, we find that the model is capable of making use of newly defined functions and theorems during test time. (Retrieval here is kNN-based.)
Attention over long sequences is also useful as a form of rapid learning. Facts and information stored as weight matrices must be trained slowly over many training steps. With attention, however, a model can simply memorize facts (e.g., function definitions) by storing them as (key, value) pairs in long-term memory, and then retrieve those facts later by creating queries that attend to them. In this case, attention acts as a form of information retrieval, allowing the model to look up facts it has seen previously.
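A toy single-head sketch of attention as retrieval over a cached (key, value) memory (no learned projections; the real Memorizing Transformer combines this with local attention via a learned gate):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def knn_memory_attention(query, mem_keys, mem_values, k=2):
    """Find the k cached keys most similar to the query, then attend over
    just those neighbours: approximate attention over a long-term memory."""
    sims = mem_keys @ query
    nn = np.argsort(-sims)[:k]       # the (approximate) kNN lookup
    weights = softmax(sims[nn])
    return weights @ mem_values[nn]

mem_keys = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
mem_values = np.array([[1.0], [2.0], [3.0]])
out = knn_memory_attention(np.array([1.0, 0.0]), mem_keys, mem_values)
```

The output is a weighted mix of the values attached to the two best-matching keys (indices 0 and 2 here), so memory entry 1 never influences the result.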

"Extending the Transformer to access the (key, value) pairs of previously-seen subsequences."
Bertsch et al. 2023. Unlimiformer: Long-Range Transformers with Unlimited Length Input
Appendix: supplementary concepts
Gradient backpropagation
This is just gradient descent plus backpropagation; reference: https://atcold.github.io/pytorch-Deep-Learning/zh/week02/02-1/
Gradient reversal
Used for domain adaptation; reference: https://zhuanlan.zhihu.com/p/75470256