LLM Research Papers: The 2024 List

Dec 08, 2024

It’s been a very eventful and exciting year in AI research. This is especially true if you are interested in LLMs.

I had big plans for this December edition and was planning to publish a new article with a discussion of all my research highlights from 2024. I still plan to do so, but due to an accident and serious injury, I am currently unable to work at a computer and finish the draft. But I hope to recover in the upcoming weeks and be back on my feet soon.

In the meantime, I want to share my running bookmark list of many fascinating (mostly LLM-related) papers I stumbled upon in 2024. It’s just a list, but maybe it will come in handy for those who are interested in finding some gems to read for the holidays.

And if you are interested in more code-heavy reading and tinkering, my Build a Large Language Model (From Scratch) book is out on Amazon as of last month.

In addition, I added a lot of bonus materials to the GitHub repository.

Bonus materials in the GitHub repository (stars highlight my personal favorites)

Thanks for your understanding and support, and I hope to make a full recovery soon and be back with the Research Highlights 2024 article in a few weeks!

January 2024

1 Jan, Astraios: Parameter-Efficient Instruction Tuning Code Large Language Models, https://arxiv.org/abs/2401.00788
2 Jan, A Comprehensive Study of Knowledge Editing for Large Language Models, https://arxiv.org/abs/2401.01286
2 Jan, LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning, https://arxiv.org/abs/2401.01325
2 Jan, Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models, https://arxiv.org/abs/2401.01335
2 Jan, LLaMA Beyond English: An Empirical Study on Language Capability Transfer, https://arxiv.org/abs/2401.01055
3 Jan, A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity, https://arxiv.org/abs/2401.01967
4 Jan, LLaMA Pro: Progressive LLaMA with Block Expansion, https://arxiv.org/abs/2401.02415
4 Jan, LLM Augmented LLMs: Expanding Capabilities through Composition, https://arxiv.org/abs/2401.02412
4 Jan, Blending Is All You Need: Cheaper, Better Alternative to Trillion-Parameters LLM, https://arxiv.org/abs/2401.02994
5 Jan, DeepSeek LLM: Scaling Open-Source Language Models with Longtermism, https://arxiv.org/abs/2401.02954
5 Jan, Denoising Vision Transformers, https://arxiv.org/abs/2401.02957
7 Jan, Soaring from 4K to 400K: Extending LLM’s Context with Activation Beacon, https://arxiv.org/abs/2401.03462
8 Jan, Mixtral of Experts, https://arxiv.org/abs/2401.04088
8 Jan, MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts, https://arxiv.org/abs/2401.04081
8 Jan, A Minimaximalist Approach to Reinforcement Learning from Human Feedback, https://arxiv.org/abs/2401.04056
9 Jan, RoSA: Accurate Parameter-Efficient Fine-Tuning via Robust Adaptation, https://arxiv.org/abs/2401.04679
10 Jan, Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training, https://arxiv.org/abs/2401.05566
11 Jan, Transformers are Multi-State RNNs, https://arxiv.org/abs/2401.06104
11 Jan, A Closer Look at AUROC and AUPRC under Class Imbalance, https://arxiv.org/abs/2401.06091
12 Jan, An Experimental Design Framework for Label-Efficient Supervised Finetuning of Large Language Models, https://arxiv.org/abs/2401.06692
16 Jan, Tuning Language Models by Proxy, https://arxiv.org/abs/2401.08565
16 Jan, Scalable Pre-training of Large Autoregressive Image Models, https://arxiv.org/abs/2401.08541
16 Jan, Code Generation with AlphaCodium: From Prompt Engineering to Flow Engineering, https://arxiv.org/abs/2401.08500
16 Jan, RAG vs Fine-tuning: Pipelines, Tradeoffs, and a Case Study on Agriculture, https://arxiv.org/abs/2401.08406
17 Jan, ReFT: Reasoning with Reinforced Fine-Tuning, https://arxiv.org/abs/2401.08967
18 Jan, DiffusionGPT: LLM-Driven Text-to-Image Generation System, https://arxiv.org/abs/2401.10061
18 Jan, Self-Rewarding Language Models, https://arxiv.org/abs/2401.10020
18 Jan, VMamba: Visual State Space Model, https://arxiv.org/abs/2401.10166
19 Jan, Knowledge Fusion of Large Language Models, https://arxiv.org/abs/2401.10491
22 Jan, SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities, https://arxiv.org/abs/2401.12168
22 Jan, WARM: On the Benefits of Weight Averaged Reward Models, https://arxiv.org/abs/2401.12187
22 Jan, Spotting LLMs With Binoculars: Zero-Shot Detection of Machine-Generated Text, https://arxiv.org/abs/2401.12070
24 Jan, MambaByte: Token-free Selective State Space Model, https://arxiv.org/abs/2401.13660
24 Jan, SpacTor-T5: Pre-training T5 Models with Span Corruption and Replaced Token Detection, https://arxiv.org/abs/2401.13160
25 Jan, Rethinking Patch Dependence for Masked Autoencoders, https://arxiv.org/abs/2401.14391
25 Jan, Pix2gestalt: Amodal Segmentation by Synthesizing Wholes, https://arxiv.org/abs/2401.14398
25 Jan, Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities, https://arxiv.org/abs/2401.14405
26 Jan, EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty, https://arxiv.org/abs/2401.15077
29 Jan, MoE-LLaVA: Mixture of Experts for Large Vision-Language Models, https://arxiv.org/abs/2401.15947
29 Jan, Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling, https://arxiv.org/abs/2401.16380
31 Jan, KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization, https://arxiv.org/abs/2401.18079

February 2024

1 Feb, Efficient Exploration for LLMs, https://arxiv.org/abs/2402.00396
1 Feb, OLMo: Accelerating the Science of Language Models, https://arxiv.org/abs/2402.00838
1 Feb, Tiny Titans: Can Smaller Large Language Models Punch Above Their Weight in the Real World for Meeting Summarization?, https://arxiv.org/abs/2402.00841
1 Feb, Repeat After Me: Transformers are Better than State Space Models at Copying, https://arxiv.org/abs/2402.01032
2 Feb, LiPO: Listwise Preference Optimization through Learning-to-Rank, https://arxiv.org/abs/2402.01878
2 Feb, FindingEmo: An Image Dataset for Emotion Recognition in the Wild, https://arxiv.org/abs/2402.01355
3 Feb, More Agents Is All You Need, https://arxiv.org/abs/2402.05120
5 Feb, DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models, https://arxiv.org/abs/2402.03300
6 Feb, MobileVLM V2: Faster and Stronger Baseline for Vision Language Model, https://arxiv.org/abs/2402.03766
6 Feb, A Phase Transition Between Positional and Semantic Learning in a Solvable Model of Dot-Product Attention, https://arxiv.org/abs/2402.03902
6 Feb, Scaling Laws for Downstream Task Performance of Large Language Models, https://arxiv.org/abs/2402.04177
6 Feb, MOMENT: A Family of Open Time-series Foundation Models, https://arxiv.org/abs/2402.03885
6 Feb, Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models, https://arxiv.org/abs/2402.03749
6 Feb, Self-Discover: Large Language Models Self-Compose Reasoning Structures, https://arxiv.org/abs/2402.03620
7 Feb, Grandmaster-Level Chess Without Search, https://arxiv.org/abs/2402.04494
7 Feb, Direct Language Model Alignment from Online AI Feedback, https://arxiv.org/abs/2402.04792
8 Feb, Buffer Overflow in Mixture of Experts, https://arxiv.org/abs/2402.05526
9 Feb, The Boundary of Neural Network Trainability is Fractal, https://arxiv.org/abs/2402.06184
11 Feb, ODIN: Disentangled Reward Mitigates Hacking in RLHF, https://arxiv.org/abs/2402.07319
12 Feb, Policy Improvement using Language Feedback Models, https://arxiv.org/abs/2402.07876
12 Feb, Scaling Laws for Fine-Grained Mixture of Experts, https://arxiv.org/abs/2402.07871
12 Feb, Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model, https://arxiv.org/abs/2402.07827
12 Feb, Step-On-Feet Tuning: Scaling Self-Alignment of LLMs via Bootstrapping, https://arxiv.org/abs/2402.07610
12 Feb, Suppressing Pink Elephants with Direct Principle Feedback, https://arxiv.org/abs/2402.07896
13 Feb, World Model on Million-Length Video And Language With RingAttention, https://arxiv.org/abs/2402.08268
13 Feb, Mixtures of Experts Unlock Parameter Scaling for Deep RL, https://arxiv.org/abs/2402.08609
14 Feb, DoRA: Weight-Decomposed Low-Rank Adaptation, https://arxiv.org/abs/2402.09353
14 Feb, Transformers Can Achieve Length Generalization But Not Robustly, https://arxiv.org/abs/2402.09371
15 Feb, BASE TTS: Lessons From Building a Billion-Parameter Text-to-Speech Model on 100K Hours of Data, https://arxiv.org/abs/2402.08093
15 Feb, Recovering the Pre-Fine-Tuning Weights of Generative Models, https://arxiv.org/abs/2402.10208
15 Feb, Generative Representational Instruction Tuning, https://arxiv.org/abs/2402.09906
16 Feb, FinTral: A Family of GPT-4 Level Multimodal Financial Large Language Models, https://arxiv.org/abs/2402.10986
17 Feb, OneBit: Towards Extremely Low-bit Large Language Models, https://arxiv.org/abs/2402.11295
18 Feb, LongAgent: Scaling Language Models to 128k Context through Multi-Agent Collaboration, https://arxiv.org/abs/2402.11550
19 Feb, Reformatted Alignment, https://arxiv.org/abs/2402.12219
19 Feb, AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling, https://arxiv.org/abs/2402.12226
19 Feb, Towards Cross-Tokenizer Distillation: the Universal Logit Distillation Loss for LLMs, https://arxiv.org/abs/2402.12030
19 Feb, LoRA+: Efficient Low Rank Adaptation of Large Models, https://arxiv.org/abs/2402.12354
20 Feb, Neural Network Diffusion, https://arxiv.org/abs/2402.13144
21 Feb, YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information, https://arxiv.org/abs/2402.13616
21 Feb, LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens, https://arxiv.org/abs/2402.13753
21 Feb, Large Language Models for Data Annotation: A Survey, https://arxiv.org/abs/2402.13446
22 Feb, TinyLLaVA: A Framework of Small-scale Large Multimodal Models, https://arxiv.org/abs/2402.14289
22 Feb, Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs, https://arxiv.org/abs/2402.14740
23 Feb, Genie: Generative Interactive Environments, https://arxiv.org/abs/2402.15391
26 Feb, CARTE: Pretraining and Transfer for Tabular Learning, https://arxiv.org/abs/2402.16785
27 Feb, The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits, https://arxiv.org/abs/2402.17764
27 Feb, Sora Generates Videos with Stunning Geometrical Consistency, https://arxiv.org/abs/2402.17403
27 Feb, When Scaling Meets LLM Finetuning: The Effect of Data, Model and Finetuning Method, https://arxiv.org/abs/2402.17193
29 Feb, Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models, https://arxiv.org/abs/2402.19427

March 2024

1 Mar, Learning and Leveraging World Models in Visual Representation Learning, https://arxiv.org/abs/2403.00504
3 Mar, Improving LLM Code Generation with Grammar Augmentation, https://arxiv.org/abs/2403.01632
3 Mar, The Hidden Attention of Mamba Models, https://arxiv.org/abs/2403.01590
4 Mar, Training-Free Pretrained Model Merging, https://arxiv.org/abs/2403.01753
4 Mar, Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures, https://arxiv.org/abs/2403.02308
5 Mar, The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning, https://arxiv.org/abs/2403.03218
5 Mar, Evolution Transformer: In-Context Evolutionary Optimization, https://arxiv.org/abs/2403.02985
5 Mar, Enhancing Vision-Language Pre-training with Rich Supervisions, https://arxiv.org/abs/2403.03346
5 Mar, Scaling Rectified Flow Transformers for High-Resolution Image Synthesis, https://arxiv.org/abs/2403.03206
5 Mar, Design2Code: How Far Are We From Automating Front-End Engineering?, https://arxiv.org/abs/2403.03163
6 Mar, ShortGPT: Layers in Large Language Models are More Redundant Than You Expect, https://arxiv.org/abs/2403.03853
6 Mar, Backtracing: Retrieving the Cause of the Query, https://arxiv.org/abs/2403.03956
6 Mar, Learning to Decode Collaboratively with Multiple Language Models, https://arxiv.org/abs/2403.03870
6 Mar, SaulLM-7B: A pioneering Large Language Model for Law, https://arxiv.org/abs/2403.03883
6 Mar, Are Language Models Puzzle Prodigies? Algorithmic Puzzles Unveil Serious Challenges in Multimodal Reasoning, https://arxiv.org/abs/2403.03864
6 Mar, 3D Diffusion Policy, https://arxiv.org/abs/2403.03954
6 Mar, MedMamba: Vision Mamba for Medical Image Classification, https://arxiv.org/abs/2403.03849
6 Mar, GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection, https://arxiv.org/abs/2403.03507
6 Mar, Stop Regressing: Training Value Functions via Classification for Scalable Deep RL, https://arxiv.org/abs/2403.03950
7 Mar, How Far Are We from Intelligent Visual Deductive Reasoning?, https://arxiv.org/abs/2403.04732
7 Mar, Common 7B Language Models Already Possess Strong Math Capabilities, https://arxiv.org/abs/2403.04706
8 Mar, Gemini 1.5: Unlocking Multimodal Understanding Across Millions of Tokens of Context, https://arxiv.org/abs/2403.05530
8 Mar, Is Cosine-Similarity of Embeddings Really About Similarity?, https://arxiv.org/abs/2403.05440
8 Mar, LLM4Decompile: Decompiling Binary Code with Large Language Models, https://arxiv.org/abs/2403.05286
9 Mar, Algorithmic Progress in Language Models, https://arxiv.org/abs/2403.05812
11 Mar, Stealing Part of a Production Language Model, https://arxiv.org/abs/2403.06634
12 Mar, Chronos: Learning the Language of Time Series, https://arxiv.org/abs/2403.07815
13 Mar, Simple and Scalable Strategies to Continually Pre-train Large Language Models, https://arxiv.org/abs/2403.08763
13 Mar, Language Models Scale Reliably With Over-Training and on Downstream Tasks, https://arxiv.org/abs/2403.08540
14 Mar, BurstAttention: An Efficient Distributed Attention Framework for Extremely Long Sequences, https://arxiv.org/abs/2403.09347
14 Mar, LocalMamba: Visual State Space Model with Windowed Selective Scan, https://arxiv.org/abs/2403.09338
14 Mar, GiT: Towards Generalist Vision Transformer through Universal Language Interface, https://arxiv.org/abs/2403.09394
14 Mar, MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training, https://arxiv.org/abs/2403.09611
15 Mar, RAFT: Adapting Language Model to Domain Specific RAG, https://arxiv.org/abs/2403.10131
18 Mar, TnT-LLM: Text Mining at Scale with Large Language Models, https://arxiv.org/abs/2403.12173
18 Mar, Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression, https://arxiv.org/abs/2403.15447
19 Mar, PERL: Parameter Efficient Reinforcement Learning from Human Feedback, https://arxiv.org/abs/2403.10704
20 Mar, RewardBench: Evaluating Reward Models for Language Modeling, https://arxiv.org/abs/2403.13787
20 Mar, LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models, https://arxiv.org/abs/2403.13372
21 Mar, RakutenAI-7B: Extending Large Language Models for Japanese, https://arxiv.org/abs/2403.15484
22 Mar, SiMBA: Simplified Mamba-Based Architecture for Vision and Multivariate Time Series, https://arxiv.org/abs/2403.15360
22 Mar, Can Large Language Models Explore In-Context?, https://arxiv.org/abs/2403.15371
22 Mar, LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement, https://arxiv.org/abs/2403.15042
25 Mar, LLM Agent Operating System, https://arxiv.org/abs/2403.16971
26 Mar, The Unreasonable Ineffectiveness of the Deeper Layers, https://arxiv.org/abs/2403.17887
26 Mar, LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning, https://arxiv.org/abs/2403.17919
26 Mar, Mechanistic Design and Scaling of Hybrid Architectures, https://arxiv.org/abs/2403.17844
27 Mar, BioMedLM: A 2.7B Parameter Language Model Trained On Biomedical Text, https://arxiv.org/abs/2403.18421
27 Mar, ViTAR: Vision Transformer with Any Resolution, https://arxiv.org/abs/2403.18361
27 Mar, Long-form Factuality in Large Language Models, https://arxiv.org/abs/2403.18802
27 Mar, Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models, https://arxiv.org/abs/2403.18814
28 Mar, MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions, https://arxiv.org/abs/2403.19651
28 Mar, Model Stock: All We Need Is Just a Few Fine-Tuned Models, https://arxiv.org/abs/2403.19522

April 2024

1 Apr, Do Language Models Plan Ahead for Future Tokens?, https://arxiv.org/abs/2404.00859
1 Apr, Bigger is not Always Better: Scaling Properties of Latent Diffusion Models, https://arxiv.org/abs/2404.01367
1 Apr, The Fine Line: Navigating Large Language Model Pretraining with Down-streaming Capability Analysis, https://arxiv.org/abs/2404.01204
1 Apr, Diffusion-RWKV: Scaling RWKV-Like Architectures for Diffusion Models, https://arxiv.org/abs/2404.04478
2 Apr, Mixture-of-Depths: Dynamically Allocating Compute in Transformer-Based Language Models, https://arxiv.org/abs/2404.02258
2 Apr, Long-context LLMs Struggle with Long In-context Learning, https://arxiv.org/abs/2404.02060
2 Apr, Emergent Abilities in Reduced-Scale Generative Language Models, https://arxiv.org/abs/2404.02204
2 Apr, Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks, https://arxiv.org/abs/2404.02151
3 Apr, On the Scalability of Diffusion-based Text-to-Image Generation, https://arxiv.org/abs/2404.02883
3 Apr, BAdam: A Memory Efficient Full Parameter Training Method for Large Language Models, https://arxiv.org/abs/2404.02827
3 Apr, Cross-Attention Makes Inference Cumbersome in Text-to-Image Diffusion Models, https://arxiv.org/abs/2404.02747
4 Apr, Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences, https://arxiv.org/abs/2404.03715
4 Apr, Training LLMs over Neurally Compressed Text, https://arxiv.org/abs/2404.03626
4 Apr, CantTalkAboutThis: Aligning Language Models to Stay on Topic in Dialogues, https://arxiv.org/abs/2404.03820
5 Apr, ReFT: Representation Finetuning for Language Models, https://arxiv.org/abs/2404.03592
5 Apr, Verifiable by Design: Aligning Language Models to Quote from Pre-Training Data, https://arxiv.org/abs/2404.03862
5 Apr, Sigma: Siamese Mamba Network for Multi-Modal Semantic Segmentation, https://arxiv.org/abs/2404.04256
8 Apr, AutoCodeRover: Autonomous Program Improvement, https://arxiv.org/abs/2404.05427
8 Apr, Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence, https://arxiv.org/abs/2404.05892
8 Apr, CodecLM: Aligning Language Models with Tailored Synthetic Data, https://arxiv.org/abs/2404.05875
9 Apr, MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies, https://arxiv.org/abs/2404.06395
9 Apr, Elephants Never Forget: Memorization and Learning of Tabular Data in Large Language Models, https://arxiv.org/abs/2404.06209
9 Apr, LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders, https://arxiv.org/abs/2404.05961
10 Apr, Adapting LLaMA Decoder to Vision Transformer, https://arxiv.org/abs/2404.06773
10 Apr, Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention, https://arxiv.org/abs/2404.07143
11 Apr, LLoCO: Learning Long Contexts Offline, https://arxiv.org/abs/2404.07979
11 Apr, JetMoE: Reaching Llama2 Performance with 0.1M Dollars, https://arxiv.org/abs/2404.07413
11 Apr, Best Practices and Lessons Learned on Synthetic Data for Language Models, https://arxiv.org/abs/2404.07503
11 Apr, Rho-1: Not All Tokens Are What You Need, https://arxiv.org/abs/2404.07965
12 Apr, Pre-training Small Base LMs with Fewer Tokens, https://arxiv.org/abs/2404.08634
12 Apr, Dataset Reset Policy Optimization for RLHF, https://arxiv.org/abs/2404.08495
13 Apr, LLM In-Context Recall is Prompt Dependent, https://arxiv.org/abs/2404.08865
15 Apr, State Space Model for New-Generation Network Alternative to Transformers: A Survey, https://arxiv.org/abs/2404.09516
15 Apr, Chinchilla Scaling: A Replication Attempt, https://arxiv.org/abs/2404.10102
15 Apr, Learn Your Reference Model for Real Good Alignment, https://arxiv.org/abs/2404.09656
16 Apr, Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study, https://arxiv.org/abs/2404.10719
16 Apr, Scaling (Down) CLIP: A Comprehensive Analysis of Data, Architecture, and Training Strategies, https://arxiv.org/abs/2404.08197
16 Apr, How Faithful Are RAG Models? Quantifying the Tug-of-War Between RAG and LLMs' Internal Prior, https://arxiv.org/abs/2404.10198
17 Apr, A Survey on Retrieval-Augmented Text Generation for Large Language Models, https://arxiv.org/abs/2404.10981
18 Apr, When LLMs are Unfit Use FastFit: Fast and Effective Text Classification with Many Classes, https://arxiv.org/abs/2404.12365
18 Apr, Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing, https://arxiv.org/abs/2404.12253
18 Apr, OpenBezoar: Small, Cost-Effective and Open Models Trained on Mixes of Instruction Data, https://arxiv.org/abs/2404.12195
19 Apr, The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions, https://arxiv.org/abs/2404.13208
22 Apr, How Good Are Low-bit Quantized LLaMA3 Models? An Empirical Study, https://arxiv.org/abs/2404.14047
22 Apr, Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone, https://arxiv.org/abs/2404.14219
22 Apr, OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework, https://arxiv.org/abs/2404.14619
22 Apr, A Survey on Self-Evolution of Large Language Models, https://arxiv.org/abs/2404.14387
23 Apr, Multi-Head Mixture-of-Experts, https://arxiv.org/abs/2404.15045
23 Apr, NExT: Teaching Large Language Models to Reason about Code Execution, https://arxiv.org/abs/2404.14662
23 Apr, Graph Machine Learning in the Era of Large Language Models (LLMs), https://arxiv.org/abs/2404.14928
24 Apr, Retrieval Head Mechanistically Explains Long-Context Factuality, https://arxiv.org/abs/2404.15574
25 Apr, Layer Skip: Enabling Early Exit Inference and Self-Speculative Decoding, https://arxiv.org/abs/2404.16710
25 Apr, Make Your LLM Fully Utilize the Context, https://arxiv.org/abs/2404.16811
28 Apr, LoRA Land: 310 Fine-tuned LLMs that Rival GPT-4, A Technical Report, https://arxiv.org/abs/2405.00732
30 Apr, Better & Faster Large Language Models via Multi-token Prediction, https://arxiv.org/abs/2404.19737
30 Apr, RAG and RAU: A Survey on Retrieval-Augmented Language Model in Natural Language Processing, https://arxiv.org/abs/2404.19543
30 Apr, A Primer on the Inner Workings of Transformer-based Language Models, https://arxiv.org/abs/2405.00208
30 Apr, When to Retrieve: Teaching LLMs to Utilize Information Retrieval Effectively, https://arxiv.org/abs/2404.19705
30 Apr, KAN: Kolmogorov–Arnold Networks, https://arxiv.org/abs/2404.19756

May 2024

1 May, Is Bigger Edit Batch Size Always Better? An Empirical Study on Model Editing with Llama-3, https://arxiv.org/abs/2405.00664
1 May, Self-Play Preference Optimization for Language Model Alignment, https://arxiv.org/abs/2405.00675
1 May, A Careful Examination of Large Language Model Performance on Grade School Arithmetic, https://arxiv.org/abs/2405.00332
2 May, Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models, https://arxiv.org/abs/2405.01535
3 May, What Matters When Building Vision-Language Models?, https://arxiv.org/abs/2405.02246
5 May, Is Flash Attention Stable?, https://arxiv.org/abs/2405.02803
7 May, vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention, https://arxiv.org/abs/2405.04437
7 May, xLSTM: Extended Long Short-Term Memory, https://arxiv.org/abs/2405.04517
8 May, You Only Cache Once: Decoder-Decoder Architectures for Language Models, https://arxiv.org/abs/2405.05254
8 May, DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model, https://arxiv.org/abs/2405.04434
8 May, Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models, https://arxiv.org/abs/2405.05417
9 May, Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?, https://arxiv.org/abs/2405.05904
10 May, Value Augmented Sampling for Language Model Alignment and Personalization, https://arxiv.org/abs/2405.06639
12 May, PHUDGE: Phi-3 as Scalable Judge, https://arxiv.org/abs/2405.08029
13 May, RLHF Workflow: From Reward Modeling to Online RLHF, https://arxiv.org/abs/2405.07863
15 May, LoRA Learns Less and Forgets Less, https://arxiv.org/abs/2405.09673
15 May, Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model, https://arxiv.org/abs/2405.09215
16 May, Chameleon: Mixed-Modal Early-Fusion Foundation Models, https://arxiv.org/abs/2405.09818
17 May, Towards Modular LLMs by Building and Reusing a Library of LoRAs, https://arxiv.org/abs/2405.11157
19 May, SLAB: Efficient Transformers with Simplified Linear Attention and Progressive Re-parameterized Batch Normalization, https://arxiv.org/abs/2405.11582
20 May, MoRA: High-Rank Updating for Parameter-Efficient Fine-Tuning, https://arxiv.org/abs/2405.12130
22 May, Attention as an RNN, https://arxiv.org/abs/2405.13956
22 May, Dense Connector for MLLMs, https://arxiv.org/abs/2405.13800
23 May, AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability, https://arxiv.org/abs/2405.14129
23 May, SimPO: Simple Preference Optimization with a Reference-Free Reward, https://arxiv.org/abs/2405.14734
23 May, Instruction Tuning With Loss Over Instructions, https://arxiv.org/abs/2405.14394
24 May, The Road Less Scheduled, https://arxiv.org/abs/2405.15682
26 May, Stacking Your Transformers: A Closer Look at Model Growth for Efficient LLM Pre-Training, https://arxiv.org/abs/2405.15319
26 May, gzip Predicts Data-dependent Scaling Laws, https://arxiv.org/abs/2405.16684
27 May, Trans-LoRA: Towards Data-free Transferable Parameter Efficient Finetuning, https://arxiv.org/abs/2405.17258
28 May, VeLoRA: Memory Efficient Training using Rank-1 Sub-Token Projections, https://arxiv.org/abs/2405.17991
28 May, LLaMA-NAS: Efficient Neural Architecture Search for Large Language Models, https://arxiv.org/abs/2405.18377
29 May, Contextual Position Encoding: Learning to Count What's Important, https://arxiv.org/abs/2405.18719

June 2024

2 Jun, Show, Don't Tell: Aligning Language Models with Demonstrated Feedback, https://arxiv.org/abs/2406.00888
3 Jun, Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models, https://arxiv.org/abs/2406.06563
3 Jun, OLoRA: Orthonormal Low-Rank Adaptation of Large Language Models, https://arxiv.org/abs/2406.01775
3 Jun, The Geometry of Categorical and Hierarchical Concepts in Large Language Models, https://arxiv.org/abs/2406.01506
3 Jun, Towards Scalable Automated Alignment of LLMs: A Survey, https://arxiv.org/abs/2406.01252
4 Jun, Scalable MatMul-free Language Modeling, https://arxiv.org/abs/2406.02528
4 Jun, Block Transformer: Global-to-Local Language Modeling for Fast Inference, https://arxiv.org/abs/2406.02657
6 Jun, Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models, https://arxiv.org/abs/2406.04271
6 Jun, The Prompt Report: A Systematic Survey of Prompting Techniques, https://arxiv.org/abs/2406.06608
6 Jun, Transformers Need Glasses! Information Over-Squashing in Language Tasks, https://arxiv.org/abs/2406.04267
6 Jun, Are We Done with MMLU?, https://arxiv.org/abs/2406.04127
6 Jun, Step-aware Preference Optimization: Aligning Preference with Denoising Performance at Each Step, https://arxiv.org/abs/2406.04314
7 Jun, Boosting Large-scale Parallel Training Efficiency with C4: A Communication-Driven Approach, https://arxiv.org/abs/2406.04594
7 Jun, CRAG -- Comprehensive RAG Benchmark, https://arxiv.org/abs/2406.04744
7 Jun, WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild, https://arxiv.org/abs/2406.04770
7 Jun, Mixture-of-Agents Enhances Large Language Model Capabilities, https://arxiv.org/abs/2406.04692
7 Jun, BERTs are Generative In-Context Learners, https://arxiv.org/abs/2406.04823
7 Jun, 3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination, https://arxiv.org/abs/2406.05132
8 Jun, Creativity Has Left the Chat: The Price of Debiasing Language Models, https://arxiv.org/abs/2406.05587
10 Jun, Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation, https://arxiv.org/abs/2406.06525
10 Jun, Margin-aware Preference Optimization for Aligning Diffusion Models Without Reference, https://arxiv.org/abs/2406.06424
10 Jun, Husky: A Unified, Open-Source Language Agent for Multi-Step Reasoning, https://arxiv.org/abs/2406.06469
10 Jun, Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters, https://arxiv.org/abs/2406.05955
10 Jun, Self-Tuning: Instructing LLMs to Effectively Acquire New Knowledge through Self-Teaching, https://arxiv.org/abs/2406.06326
11 Jun, An Image is Worth 32 Tokens for Reconstruction and Generation, https://arxiv.org/abs/2406.07550
11 Jun, TextGrad: Automatic "Differentiation" via Text, https://arxiv.org/abs/2406.07496
11 Jun, Simple and Effective Masked Diffusion Language Models, https://arxiv.org/abs/2406.07524
11 Jun, Never Miss A Beat: An Efficient Recipe for Context Window Extension of Large Language Models with Consistent "Middle" Enhancement, https://arxiv.org/abs/2406.07138
11 Jun, Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling, https://arxiv.org/abs/2406.07522
12 Jun, Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing, https://arxiv.org/abs/2406.08464
12 Jun, What If We Recaption Billions of Web Images with LLaMA-3?, https://arxiv.org/abs/2406.08478
12 Jun, Large Language Model Unlearning via Embedding-Corrupted Prompts, https://arxiv.org/abs/2406.07933
12 Jun, Large Language Models Must Be Taught to Know What They Don't Know, https://arxiv.org/abs/2406.08391
12 Jun, An Empirical Study of Mamba-based Language Models, https://arxiv.org/abs/2406.07887
12 Jun, Discovering Preference Optimization Algorithms with and for Large Language Models, https://arxiv.org/abs/2406.08414
13 Jun, Transformers Meet Neural Algorithmic Reasoners, https://arxiv.org/abs/2406.09308
13 Jun, MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding, https://arxiv.org/abs/2406.09297
13 Jun, An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels, https://arxiv.org/abs/2406.09415
13 Jun, FouRA: Fourier Low Rank Adaptation, https://arxiv.org/abs/2406.08798
14 Jun, Bootstrapping Language Models with DPO Implicit Rewards, https://arxiv.org/abs/2406.09760
14 Jun, Be like a Goldfish, Don't Memorize! Mitigating Memorization in Generative LLMs, https://arxiv.org/abs/2406.10209
14 Jun, Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs, https://arxiv.org/abs/2406.10216
16 Jun, THEANINE: Revisiting Memory Management in Long-term Conversations with Timeline-augmented Response Generation, https://arxiv.org/abs/2406.10996
17 Jun, Task Me Anything, https://arxiv.org/abs/2406.11775
17 Jun, How Do Large Language Models Acquire Factual Knowledge During Pretraining?, https://arxiv.org/abs/2406.11813
17 Jun, mDPO: Conditional Preference Optimization for Multimodal Large Language Models, https://arxiv.org/abs/2406.11839
17 Jun, Nemotron-4 340B Technical Report, https://arxiv.org/abs/2406.11704
17 Jun, DataComp-LM: In Search of the Next Generation of Training Sets for Language Models, https://arxiv.org/abs/2406.11794
17 Jun, Tokenization Falling Short: The Curse of Tokenization, https://arxiv.org/abs/2406.11687
17 Jun, DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence, https://arxiv.org/abs/2406.11931
17 Jun, Unveiling Encoder-Free Vision-Language Models, https://arxiv.org/abs/2406.11832
17 Jun, Iterative Length-Regularized Direct Preference Optimization: A Case Study on Improving 7B Language Models to GPT-4 Level, https://arxiv.org/abs/2406.11817
17 Jun, HARE: HumAn pRiors, a key to small language model Efficiency, https://arxiv.org/abs/2406.11410
17 Jun, Measuring memorization in RLHF for code completion, https://arxiv.org/abs/2406.11715
17 Jun, Self-MoE: Towards Compositional Large Language Models with Self-Specialized Experts, https://arxiv.org/abs/2406.12034
18 Jun, From RAGs to Rich Parameters: Probing How Language Models Utilize External Knowledge Over Parametric Information for Factual Queries, https://arxiv.org/abs/2406.12824
18 Jun, Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges, https://arxiv.org/abs/2406.12624
19 Jun, Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?, https://arxiv.org/abs/2406.13121
20 Jun, Instruction Pre-Training: Language Models are Supervised Multitask Learners, https://arxiv.org/abs/2406.14491
20 Jun, Can LLMs Learn by Teaching? A Preliminary Study, https://arxiv.org/abs/2406.14629
21 Jun, A Tale of Trust and Accuracy: Base vs. Instruct LLMs in RAG Systems, https://arxiv.org/abs/2406.14972
21 Jun, LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs, https://arxiv.org/abs/2406.15319
21 Jun, MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression, https://arxiv.org/abs/2406.14909
21 Jun, Efficient Continual Pre-training by Mitigating the Stability Gap, https://arxiv.org/abs/2406.14833
24 Jun, Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers, https://arxiv.org/abs/2406.16747
24 Jun, WARP: On the Benefits of Weight Averaged Rewarded Policies, https://arxiv.org/abs/2406.16768
24 Jun, Adam-mini: Use Fewer Learning Rates To Gain More, https://arxiv.org/abs/2406.16793
25 Jun, The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale, https://arxiv.org/abs/2406.17557
25 Jun, LongIns: A Challenging Long-context Instruction-based Exam for LLMs, https://arxiv.org/abs/2406.17588
25 Jun, Following Length Constraints in Instructions, https://arxiv.org/abs/2406.17744
26 Jun, A Closer Look into Mixture-of-Experts in Large Language Models, https://arxiv.org/abs/2406.18219
26 Jun, RouteLLM: Learning to Route LLMs with Preference Data, https://arxiv.org/abs/2406.18665
26 Jun, Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs, https://arxiv.org/abs/2406.18629
27 Jun, Dataset Size Recovery from LoRA Weights, https://arxiv.org/abs/2406.19395
27 Jun, From Artificial Needles to Real Haystacks: Improving Retrieval Capabilities in LLMs by Finetuning on Synthetic Data, https://arxiv.org/abs/2406.19292
27 Jun, Changing Answer Order Can Decrease MMLU Accuracy, https://arxiv.org/abs/2406.19470
28 Jun, Direct Preference Knowledge Distillation for Large Language Models, https://arxiv.org/abs/2406.19774
28 Jun, LLM Critics Help Catch LLM Bugs, https://arxiv.org/abs/2407.00215
28 Jun, Scaling Synthetic Data Creation with 1,000,000,000 Personas, https://arxiv.org/abs/2406.20094

July 2024

1 Jul, LLM See, LLM Do: Guiding Data Generation to Target Non-Differentiable Objectives, https://arxiv.org/abs/2407.01490
1 Jul, Searching for Best Practices in Retrieval-Augmented Generation, https://arxiv.org/abs/2407.01219
1 Jul, Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models, https://arxiv.org/abs/2407.01906
1 Jul, Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion, https://arxiv.org/abs/2407.01392
1 Jul, Eliminating Position Bias of Language Models: A Mechanistic Approach, https://arxiv.org/abs/2407.01100
2 Jul, MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention, https://arxiv.org/abs/2407.02490
2 Jul, TokenPacker: Efficient Visual Projector for Multimodal LLM, https://arxiv.org/abs/2407.02392
2 Jul, Reasoning in Large Language Models: A Geometric Perspective, https://arxiv.org/abs/2407.02678
2 Jul, RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs, https://arxiv.org/abs/2407.02485
3 Jul, AgentInstruct: Toward Generative Teaching with Agentic Flows, https://arxiv.org/abs/2407.03502
3 Jul, HEMM: Holistic Evaluation of Multimodal Foundation Models, https://arxiv.org/abs/2407.03418
4 Jul, Mixture of A Million Experts, https://arxiv.org/abs/2407.04153
5 Jul, Learning to (Learn at Test Time): RNNs with Expressive Hidden States, https://arxiv.org/abs/2407.04620
9 Jul, Vision Language Models Are Blind, https://arxiv.org/abs/2407.06581
9 Jul, Self-Recognition in Language Models, https://arxiv.org/abs/2407.06946
10 Jul, Inference Performance Optimization for Large Language Models on CPUs, https://arxiv.org/abs/2407.07304
11 Jul, Gradient Boosting Reinforcement Learning, https://arxiv.org/abs/2407.08250
11 Jul, FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision, https://arxiv.org/abs/2407.08608
12 Jul, SpreadsheetLLM: Encoding Spreadsheets for Large Language Models, https://arxiv.org/abs/2407.09025
12 Jul, New Desiderata for Direct Preference Optimization, https://arxiv.org/abs/2407.09072
12 Jul, Context Embeddings for Efficient Answer Generation in RAG, https://arxiv.org/abs/2407.09252
15 Jul, Qwen2 Technical Report, https://arxiv.org/abs/2407.10671
15 Jul, The Good, The Bad, and The Greedy: Evaluation of LLMs Should Not Ignore Non-Determinism, https://arxiv.org/abs/2407.10457
15 Jul, From GaLore to WeLore: How Low-Rank Weights Non-uniformly Emerge from Low-Rank Gradients, https://arxiv.org/abs/2407.11239
16 Jul, GoldFinch: High Performance RWKV/Transformer Hybrid with Linear Pre-Fill and Extreme KV-Cache Compression, https://arxiv.org/abs/2407.12077
16 Jul, Scaling Diffusion Transformers to 16 Billion Parameters, https://arxiv.org/abs/2407.11633
16 Jul, NeedleBench: Can LLMs Do Retrieval and Reasoning in 1 Million Context Window?, https://arxiv.org/abs/2407.11963
17 Jul, Patch-Level Training for Large Language Models, https://arxiv.org/abs/2407.12665
17 Jul, LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models, https://arxiv.org/abs/2407.12772
17 Jul, A Survey of Prompt Engineering Methods in Large Language Models for Different NLP Tasks, https://arxiv.org/abs/2407.12994
17 Jul, Spectra: A Comprehensive Study of Ternary, Quantized, and FP16 Language Models, https://arxiv.org/abs/2407.12327
18 Jul, Attention Overflow: Language Model Input Blur during Long-Context Missing Items Recommendation, https://arxiv.org/abs/2407.13481
18 Jul, Weak-to-Strong Reasoning, https://arxiv.org/abs/2407.13647
18 Jul, Understanding Reference Policies in Direct Preference Optimization, https://arxiv.org/abs/2407.13709
18 Jul, Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies, https://arxiv.org/abs/2407.13623
19 Jul, BOND: Aligning LLMs with Best-of-N Distillation, https://arxiv.org/abs/2407.14622
19 Jul, Compact Language Models via Pruning and Knowledge Distillation, https://arxiv.org/abs/2407.14679
19 Jul, LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference, https://arxiv.org/abs/2407.14057
22 Jul, Mini-Sequence Transformer: Optimizing Intermediate Memory for Long Sequences Training, https://arxiv.org/abs/2407.15892
22 Jul, DDK: Distilling Domain Knowledge for Efficient Large Language Models, https://arxiv.org/abs/2407.16154
23 Jul, Generation Constraint Scaling Can Mitigate Hallucination, https://arxiv.org/abs/2407.16908
23 Jul, Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach, https://arxiv.org/abs/2407.16833
23 Jul, Course-Correction: Safety Alignment Using Synthetic Preferences, https://arxiv.org/abs/2407.16637
26 Jul, Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?, https://arxiv.org/abs/2407.16607
28 Jul, Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge, https://arxiv.org/abs/2407.19594
29 Jul, Improving Retrieval Augmented Language Model with Self-Reasoning, https://arxiv.org/abs/2407.19813
29 Jul, Apple Intelligence Foundation Language Models, https://arxiv.org/abs/2407.21075
30 Jul, ThinK: Thinner Key Cache by Query-Driven Pruning, https://arxiv.org/abs/2407.21018
31 Jul, The Llama 3 Herd of Models, https://arxiv.org/abs/2407.21783
31 Jul, Gemma 2: Improving Open Language Models at a Practical Size, https://arxiv.org/abs/2408.00118

August 2024

1 Aug, SAM 2: Segment Anything in Images and Videos, https://arxiv.org/abs/2408.00714
2 Aug, POA: Pre-training Once for Models of All Sizes, https://arxiv.org/abs/2408.01031
2 Aug, RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework, https://arxiv.org/abs/2408.01262
2 Aug, A Survey of Mamba, https://arxiv.org/abs/2408.01129
3 Aug, MiniCPM-V: A GPT-4V Level MLLM on Your Phone, https://arxiv.org/abs/2408.01800
5 Aug, RAG Foundry: A Framework for Enhancing LLMs for Retrieval Augmented Generation, https://arxiv.org/abs/2408.02545
5 Aug, Self-Taught Evaluators, https://arxiv.org/abs/2408.02666
5 Aug, BioMamba: A Pre-trained Biomedical Language Representation Model Leveraging Mamba, https://arxiv.org/abs/2408.02600
7 Aug, EXAONE 3.0 7.8B Instruction Tuned Language Model, https://arxiv.org/abs/2408.03541
7 Aug, 1.5-Pints Technical Report: Pretraining in Days, Not Months -- Your Language Model Thrives on Quality Data, https://arxiv.org/abs/2408.03506
8 Aug, Conversational Prompt Engineering, https://arxiv.org/abs/2408.04560
8 Aug, Trans-Tokenization and Cross-lingual Vocabulary Transfers: Language Adaptation of LLMs for Low-Resource NLP, https://arxiv.org/abs/2408.04303
12 Aug, The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery, https://arxiv.org/abs/2408.06292
15 Aug, Hermes 3 Technical Report, https://arxiv.org/abs/2408.11857
19 Aug, Customizing Language Models with Instance-wise LoRA for Sequential Recommendation, https://arxiv.org/abs/2408.10159
20 Aug, Enhancing Robustness in Large Language Models: Prompting for Mitigating the Impact of Irrelevant Information, https://arxiv.org/abs/2408.10615
20 Aug, To Code, or Not To Code? Exploring Impact of Code in Pre-training, https://arxiv.org/abs/2408.10914
21 Aug, LLM Pruning and Distillation in Practice: The Minitron Approach, https://arxiv.org/abs/2408.11796
22 Aug, Jamba-1.5: Hybrid Transformer-Mamba Models at Scale, https://arxiv.org/abs/2408.12570
22 Aug, Controllable Text Generation for Large Language Models: A Survey, https://arxiv.org/abs/2408.12599
23 Aug, Multi-Layer Transformers Gradient Can be Approximated in Almost Linear Time, https://arxiv.org/abs/2408.13233
26 Aug, A Practitioner's Guide to Continual Multimodal Pretraining, https://arxiv.org/abs/2408.14471
26 Aug, Building and better understanding vision-language models: insights and future directions, https://arxiv.org/abs/2408.12637
26 Aug, CURLoRA: Stable LLM Continual Fine-Tuning and Catastrophic Forgetting Mitigation, https://arxiv.org/abs/2408.14572
27 Aug, The Mamba in the Llama: Distilling and Accelerating Hybrid Models, https://arxiv.org/abs/2408.15237
28 Aug, ReMamba: Equip Mamba with Effective Long-Sequence Modeling, https://arxiv.org/abs/2408.15496
29 Aug, Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling, https://arxiv.org/abs/2408.16737
31 Aug, LongRecipe: Recipe for Efficient Long Context Generalization in Large Language Models, https://arxiv.org/abs/2409.00509

September 2024

3 Sep, OLMoE: Open Mixture-of-Experts Language Models, https://arxiv.org/abs/2409.02060
3 Sep, In Defense of RAG in the Era of Long-Context Language Models, https://arxiv.org/abs/2409.01666
5 Sep, Attention Heads of Large Language Models: A Survey, https://arxiv.org/abs/2409.03752
5 Sep, LongCite: Enabling LLMs to Generate Fine-grained Citations in Long-context QA, https://arxiv.org/abs/2409.02897
5 Sep, How Do Your Code LLMs Perform? Empowering Code Instruction Tuning with High-Quality Data, https://arxiv.org/abs/2409.03810
6 Sep, Theory, Analysis, and Best Practices for Sigmoid Self-Attention, https://arxiv.org/abs/2409.04431
10 Sep, LLaMA-Omni: Seamless Speech Interaction with Large Language Models, https://arxiv.org/abs/2409.06666
10 Sep, What is the Role of Small Models in the LLM Era: A Survey, https://arxiv.org/abs/2409.06857
11 Sep, Policy Filtration in RLHF to Fine-Tune LLM for Code Generation, https://arxiv.org/abs/2409.06957
16 Sep, RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval, https://arxiv.org/abs/2409.10516
18 Sep, Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement, https://arxiv.org/abs/2409.12122
18 Sep, Qwen2.5-Coder Technical Report, https://arxiv.org/abs/2409.12186
21 Sep, Instruction Following without Instruction Tuning, https://arxiv.org/abs/2409.14254
30 Sep, Is Preference Alignment Always the Best Option to Enhance LLM-Based Translation? An Empirical Analysis, https://arxiv.org/abs/2409.20059
30 Sep, The Perfect Blend: Redefining RLHF with Mixture of Judges, https://arxiv.org/abs/2409.20370 (New paper by Meta on how they did RLHF for Llama 3)

October 2024

1 Oct, Addition is All You Need for Energy-efficient Language Models, https://arxiv.org/abs/2410.00907
2 Oct, Quantifying Generalization Complexity for Large Language Models, https://arxiv.org/abs/2410.01769
2 Oct, When a language model is optimized for reasoning, does it still show embers of autoregression? An analysis of OpenAI o1, https://arxiv.org/abs/2410.01792
2 Oct, Were RNNs All We Needed?, https://arxiv.org/abs/2410.01201
3 Oct, Selective Attention Improves Transformer, https://arxiv.org/abs/2410.02703
3 Oct, LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations, https://arxiv.org/abs/2410.02707
3 Oct, LLaVA-Critic: Learning to Evaluate Multimodal Models, https://arxiv.org/abs/2410.02712
7 Oct, Differential Transformer, https://arxiv.org/abs/2410.05258
7 Oct, GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models, https://arxiv.org/abs/2410.05229
8 Oct, ARIA: An Open Multimodal Native Mixture-of-Experts Model, https://arxiv.org/abs/2410.05993
8 Oct, O1 Replication Journey: A Strategic Progress Report -- Part 1, https://arxiv.org/abs/2410.18982
8 Oct, Long-Context LLMs Meet RAG: Overcoming Challenges for Long Inputs in RAG, https://arxiv.org/abs/2410.05983
9 Oct, From Generalist to Specialist: Adapting Vision Language Models via Task-Specific Visual Instruction Tuning, https://arxiv.org/abs/2410.06456
10 Oct, KV Prediction for Improved Time to First Token, https://arxiv.org/abs/2410.08391
11 Oct, Baichuan-Omni Technical Report, https://arxiv.org/abs/2410.08565
13 Oct, MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models, https://arxiv.org/abs/2410.10139
13 Oct, LOKI: A Comprehensive Synthetic Data Detection Benchmark using Large Multimodal Models, https://arxiv.org/abs/2410.09732
15 Oct, AFlow: Automating Agentic Workflow Generation, https://arxiv.org/abs/2410.10762
15 Oct, Toward General Instruction-Following Alignment for Retrieval-Augmented Generation, https://arxiv.org/abs/2410.09584
21 Oct, Pre-training Distillation for Large Language Models: A Design Space Exploration, https://arxiv.org/abs/2410.16215
23 Oct, MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models, https://arxiv.org/abs/2410.17637
23 Oct, Scalable Ranked Preference Optimization for Text-to-Image Generation, https://arxiv.org/abs/2410.18013
23 Oct, Scaling Diffusion Language Models via Adaptation from Autoregressive Models, https://arxiv.org/abs/2410.17891
24 Oct, Hybrid Preferences: Learning to Route Instances for Human vs. AI Feedback, https://arxiv.org/abs/2410.19133
25 Oct, Counting Ability of Large Language Models and Impact of Tokenization, https://arxiv.org/abs/2410.19730
25 Oct, A Survey of Small Language Models, https://arxiv.org/abs/2410.20011
26 Oct, Accelerating Direct Preference Optimization with Prefix Sharing, https://arxiv.org/abs/2410.20305
27 Oct, Mind Your Step (by Step): Chain-of-Thought can Reduce Performance on Tasks where Thinking Makes Humans Worse, https://arxiv.org/abs/2410.21333
28 Oct, LongReward: Improving Long-context Large Language Models with AI Feedback, https://arxiv.org/abs/2410.21252
28 Oct, ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference, https://arxiv.org/abs/2410.21465
29 Oct, Beyond Text: Optimizing RAG with Multimodal Inputs for Industrial Applications, https://arxiv.org/abs/2410.21943
30 Oct, CORAL: Benchmarking Multi-turn Conversational Retrieval-Augmentation Generation, https://arxiv.org/abs/2410.23090
31 Oct, What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective, https://arxiv.org/abs/2410.23743
31 Oct, GPT or BERT: why not both?, https://arxiv.org/abs/2410.24159
31 Oct, Language Models can Self-Lengthen to Generate Long Texts, https://arxiv.org/abs/2410.23933

November 2024

1 Nov, Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations, https://arxiv.org/abs/2411.00640
1 Nov, Adapting While Learning: Grounding LLMs for Scientific Problems with Intelligent Tool Usage Adaptation, https://arxiv.org/abs/2411.00412
1 Nov, Multi-expert Prompting Improves Reliability, Safety, and Usefulness of Large Language Models, https://arxiv.org/abs/2411.00492
3 Nov, Sample-Efficient Alignment for LLMs, https://arxiv.org/abs/2411.01493
4 Nov, A Comprehensive Survey of Small Language Models in the Era of Large Language Models: Techniques, Enhancements, Applications, Collaboration with LLMs, and Trustworthiness, https://arxiv.org/abs/2411.03350
4 Nov, "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization, https://arxiv.org/abs/2411.02355
4 Nov, Parameter-Efficient Fine-Tuning of Large Language Models for Unit Test Generation: An Empirical Study, https://arxiv.org/abs/2411.02462
5 Nov, HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems, https://arxiv.org/abs/2411.02959
6 Nov, Both Text and Images Leaked! A Systematic Analysis of Multimodal LLM Data Contamination, https://arxiv.org/abs/2411.03823
6 Nov, Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding, https://arxiv.org/abs/2411.04282
6 Nov, Number Cookbook: Number Understanding of Language Models and How to Improve It, https://arxiv.org/abs/2411.03766
7 Nov, Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models, https://arxiv.org/abs/2411.04996
7 Nov, BitNet a4.8: 4-bit Activations for 1-bit LLMs, https://arxiv.org/abs/2411.04965
7 Nov, Scaling Laws for Precision, https://arxiv.org/abs/2411.04330
8 Nov, Energy Efficient Protein Language Models: Leveraging Small Language Models with LoRA for Controllable Protein Generation, https://arxiv.org/abs/2411.05966
8 Nov, Balancing Pipeline Parallelism with Vocabulary Parallelism, https://arxiv.org/abs/2411.05288
11 Nov, Toward Optimal Search and Retrieval for RAG, https://arxiv.org/abs/2411.07396
12 Nov, Large Language Models Can Self-Improve in Long-context Reasoning, https://arxiv.org/abs/2411.08147
12 Nov, Stronger Models are NOT Stronger Teachers for Instruction Tuning, https://arxiv.org/abs/2411.07133
12 Nov, Direct Preference Optimization Using Sparse Feature-Level Constraints, https://arxiv.org/abs/2411.07618
13 Nov, Cut Your Losses in Large-Vocabulary Language Models, https://arxiv.org/abs/2411.09009
15 Nov, Does Prompt Formatting Have Any Impact on LLM Performance?, https://arxiv.org/abs/2411.10541
17 Nov, SymDPO: Boosting In-Context Learning of Large Multimodal Models with Symbol Demonstration Direct Preference Optimization, https://arxiv.org/abs/2411.11909
17 Nov, SageAttention2 Technical Report: Accurate 4 Bit Attention for Plug-and-play Inference Acceleration, https://arxiv.org/abs/2411.10958
18 Nov, Bi-Mamba: Towards Accurate 1-Bit State Space Models, https://arxiv.org/abs/2411.11843
19 Nov, RedPajama: an Open Dataset for Training Large Language Models, https://arxiv.org/abs/2411.12372
20 Nov, Hymba: A Hybrid-head Architecture for Small Language Models, https://arxiv.org/abs/2411.13676
20 Nov, Loss-to-Loss Prediction: Scaling Laws for All Datasets, https://arxiv.org/abs/2411.12925
21 Nov, When Precision Meets Position: BFloat16 Breaks Down RoPE in Long-Context Training, https://arxiv.org/abs/2411.13476
21 Nov, Multimodal Autoregressive Pre-training of Large Vision Encoders, https://arxiv.org/abs/2411.14402
21 Nov, Natural Language Reinforcement Learning, https://arxiv.org/abs/2411.14251
22 Nov, Large Multi-modal Models Can Interpret Features in Large Multi-modal Models, https://arxiv.org/abs/2411.14982
22 Nov, TÜLU 3: Pushing Frontiers in Open Language Model Post-Training, https://arxiv.org/abs/2411.15124
23 Nov, MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs, https://arxiv.org/abs/2411.15296
24 Nov, LLMs Do Not Think Step-by-step In Implicit Reasoning, https://arxiv.org/abs/2411.15862
25 Nov, O1 Replication Journey -- Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson?, https://arxiv.org/abs/2411.16489
26 Nov, Star Attention: Efficient LLM Inference over Long Sequences, https://arxiv.org/abs/2411.17116
27 Nov, Low-Bit Quantization Favors Undertrained LLMs: Scaling Laws for Quantized LLMs with 100T Training Tokens, https://arxiv.org/abs/2411.17691
27 Nov, Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration, https://arxiv.org/abs/2411.17686
29 Nov, Reverse Thinking Makes LLMs Stronger Reasoners, https://arxiv.org/abs/2411.19865
29 Nov, Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's Reasoning Capability, https://arxiv.org/abs/2411.19943

December 2024

2 Dec, Switti: Designing Scale-Wise Transformers for Text-to-Image Synthesis, https://arxiv.org/abs/2412.01819
2 Dec, X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models, https://arxiv.org/abs/2412.01824
2 Dec, Free Process Rewards without Process Labels, https://arxiv.org/abs/2412.01981
3 Dec, Scaling Image Tokenizers with Grouped Spherical Quantization, https://arxiv.org/abs/2412.02632
3 Dec, RARE: Retrieval-Augmented Reasoning Enhancement for Large Language Models, https://arxiv.org/abs/2412.02830
4 Dec, Perception Tokens Enhance Visual Reasoning in Multimodal Language Models, https://arxiv.org/abs/2412.03548
4 Dec, Evaluating Language Models as Synthetic Data Generators, https://arxiv.org/abs/2412.03679
4 Dec, Best-of-N Jailbreaking, https://arxiv.org/abs/2412.03556
4 Dec, PaliGemma 2: A Family of Versatile VLMs for Transfer, https://arxiv.org/abs/2412.03555
5 Dec, VisionZip: Longer is Better but Not Necessary in Vision Language Models, https://arxiv.org/abs/2412.04467
5 Dec, Evaluating and Aligning CodeLLMs on Human Preference, https://arxiv.org/abs/2412.05210
6 Dec, MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale, https://arxiv.org/abs/2412.05237
6 Dec, Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling, https://arxiv.org/abs/2412.05271
7 Dec, LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods, https://arxiv.org/abs/2412.05579
8 Dec, Does RLHF Scale? Exploring the Impacts From Data, Model, and Method, https://arxiv.org/abs/2412.06000
9 Dec, Unraveling the Complexity of Memory in RL Agents: An Approach for Classification and Evaluation, https://arxiv.org/abs/2412.06531
9 Dec, Training Large Language Models to Reason in a Continuous Latent Space, https://arxiv.org/abs/2412.06769
9 Dec, AutoReason: Automatic Few-Shot Reasoning Decomposition, https://arxiv.org/abs/2412.06975
11 Dec, Large Concept Models: Language Modeling in a Sentence Representation Space, https://arxiv.org/abs/2412.08821
12 Dec, Phi-4 Technical Report, https://arxiv.org/abs/2412.08905
13 Dec, Byte Latent Transformer: Patches Scale Better Than Tokens, https://arxiv.org/abs/2412.09871
13 Dec, SCBench: A KV Cache-Centric Analysis of Long-Context Methods, https://arxiv.org/abs/2412.10319
13 Dec, Cultural Evolution of Cooperation among LLM Agents, https://arxiv.org/abs/2412.10270
13 Dec, DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding, https://arxiv.org/abs/2412.10302
16 Dec, No More Adam: Learning Rate Scaling at Initialization is All You Need, https://arxiv.org/abs/2412.11768
16 Dec, Precise Length Control in Large Language Models, https://arxiv.org/abs/2412.11937
16 Dec, The Open Source Advantage in Large Language Models (LLMs), https://arxiv.org/abs/2412.12004
16 Dec, A Survey of Mathematical Reasoning in the Era of Multimodal Large Language Model: Benchmark, Method & Challenges, https://arxiv.org/abs/2412.11936
17 Dec, Are Your LLMs Capable of Stable Reasoning?, https://arxiv.org/abs/2412.13147
18 Dec, LLM Post-Training Recipes, Improving Reasoning in LLMs, https://arxiv.org/abs/2412.14135
18 Dec, Hansel: Output Length Controlling Framework for Large Language Models, https://arxiv.org/abs/2412.14033
18 Dec, Mind Your Theory: Theory of Mind Goes Deeper Than Reasoning, https://arxiv.org/abs/2412.13631
18 Dec, Alignment Faking in Large Language Models, https://arxiv.org/abs/2412.14093
18 Dec, SCOPE: Optimizing Key-Value Cache Compression in Long-Context Generation, https://arxiv.org/abs/2412.13649
19 Dec, LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-Context Multitasks, https://arxiv.org/abs/2412.15204
20 Dec, Offline Reinforcement Learning for LLM Multi-Step Reasoning, https://arxiv.org/abs/2412.16145
24 Dec, Mulberry: Empowering MLLM with O1-like Reasoning and Reflection via Collective Monte Carlo Tree Search, https://arxiv.org/abs/2412.18319
31 Dec, Titans: Learning to Memorize at Test Time, https://arxiv.org/abs/2501.00663

This magazine is a personal passion project. For those who wish to support me, please consider purchasing a copy of my Build a Large Language Model (From Scratch) book. (I am confident that you'll get lots out of this book as it explains how LLMs work at a level of detail that is not found anywhere else.)

*Build a Large Language Model (From Scratch) now available on Amazon*

If you read the book and have a few minutes to spare, I'd really appreciate a brief review. It helps us authors a lot!

Alternatively, I also recently enabled the paid subscription option on Substack to support this magazine directly.
