Another month, another round of interesting research papers ranging from large language modeling to computer vision.
One recent focus is on refining Large Language Models (LLMs). For instance, models like Platypus and the Reinforced Self-Training (ReST) method are among the latest attempts to improve alignment with human preferences.
In addition, Self-Alignment with Instruction Backtranslation and OctoPack leverage existing knowledge structures, be they instructions or code, to improve model performance.
A recurring theme is the endeavor to make models more efficient and accessible. There's LongLoRA, which introduces another sparse attention mechanism for LLMs, as well as simpler, more fundamental ideas, like replacing softmax with ReLU in vision transformers to boost computational efficiency and pave the way for better parallelization.
But this is only a snapshot of what has happened this month. I hope you find something useful in the 22 research highlights below.
Large Language Models
RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback, (1 Sep 2023, https://arxiv.org/abs/2309.00267)
The recent reinforcement learning with AI feedback (RLAIF) study shows that the ratings for the reward model training in RLHF don't necessarily have to be provided by humans but can be generated by an LLM (here: PaLM 2). Human evaluators prefer the RLAIF model over the traditional RLHF model only about half of the time, meaning they have no clear preference for either model. It's also worth noting that both RLHF and RLAIF strongly outperform models trained purely via supervised instruction finetuning.
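To make the idea more concrete, here's a minimal sketch of AI preference labeling: an off-the-shelf LLM is asked which of two responses is better, and its answer serves as the label for reward model training. The `query_llm` callable and the prompt template are hypothetical placeholders, not the setup from the paper.

```python
def ai_preference_label(context: str, response_a: str, response_b: str,
                        query_llm) -> int:
    """Ask an LLM which response is better; returns 0 for A, 1 for B.

    query_llm: any callable that sends a prompt to an LLM and returns its
    text completion (a hypothetical stand-in for a real API).
    """
    prompt = (
        "Given the context and two candidate responses, answer with the "
        "single letter of the better response (A or B).\n\n"
        f"Context: {context}\n"
        f"Response A: {response_a}\n"
        f"Response B: {response_b}\n"
        "Preferred response:"
    )
    answer = query_llm(prompt).strip().upper()
    return 0 if answer.startswith("A") else 1

# The resulting (context, response_a, response_b, label) tuples are then
# used to train a reward model, exactly as in standard RLHF.
```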
Reinforced Self-Training (ReST) for Language Modeling (17 Aug 2023, https://arxiv.org/abs/2308.08998)
ReST is an alternative to reinforcement learning with human feedback (RLHF) for aligning LLMs with human preferences. ReST samples outputs from the current model to create a dataset and then iteratively finetunes the model on increasingly high-reward subsets of those samples. According to the authors, ReST is more efficient than standard online RLHF methods (like RLHF with proximal policy optimization, PPO) because it generates its training dataset offline, but a comprehensive comparison to standard RLHF PPO methods as used in InstructGPT or Llama 2 is missing.
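The core grow/improve loop is simple enough to sketch in a few lines. The following is a compact, simplified rendering based on the paper's description; `policy`, `reward_model`, and `finetune` are hypothetical placeholders for your model, scorer, and training routine.

```python
def rest_step(policy, reward_model, finetune, prompts,
              thresholds=(0.0, 0.5, 0.7, 0.9), samples_per_prompt=8):
    # Grow: sample a large dataset from the current policy (done offline).
    samples = [(p, policy.generate(p))
               for p in prompts for _ in range(samples_per_prompt)]
    scored = [(p, r, reward_model.score(p, r)) for p, r in samples]

    # Improve: repeatedly finetune on increasingly high-reward subsets.
    for tau in thresholds:
        subset = [(p, r) for p, r, score in scored if score >= tau]
        policy = finetune(policy, subset)
    return policy
```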
Platypus: Quick, Cheap, and Powerful Refinement of LLMs (14 Aug 2023, https://arxiv.org/abs/2308.07317)
Platypus is a new open-source LLM suite at the top of the HF LLM leaderboard (as of this writing). Its two pillars of success are curating the underlying datasets (removing similar and duplicate questions) and finetuning and merging Low-Rank Adaptation (LoRA) modules. Specifically, what's interesting is that the authors focus on the non-attention modules as opposed to the attention modules when applying LoRA.
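For illustration, here's roughly what targeting the feed-forward (non-attention) modules looks like with the Hugging Face peft library. The module names below are the Llama-style MLP layer names, and the hyperparameters are illustrative, not taken from the paper.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Apply LoRA to the feed-forward modules instead of the attention modules.
lora_config = LoraConfig(
    r=16,                    # illustrative rank, not from the paper
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["gate_proj", "up_proj", "down_proj"],  # Llama MLP layers
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```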
Self-Alignment with Instruction Backtranslation (11 Aug 2023, https://arxiv.org/abs/2308.06259)
Instead of collecting datasets for instruction finetuning written by humans or using an LLM to generate instruction-response pairs (i.e., distillation), this paper proposes a third approach. Here, starting with a pretrained LLM, the researchers generate instructions for unlabeled text, such as websites. LLMs finetuned with this so-called instruction backtranslation approach outperform LLMs trained on distillation data, such as Alpaca.
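Conceptually, the backtranslation step can be sketched as a single prompt: the model is asked to predict the instruction for which an unlabeled document would be a good response. The prompt wording and the `query_llm` callable below are illustrative placeholders, not the exact setup from the paper.

```python
def generate_instruction(document: str, query_llm) -> str:
    """Use an LLM to 'backtranslate' a document into a plausible instruction."""
    prompt = (
        "Below is a response to some instruction. Write the instruction "
        "that this text most plausibly answers.\n\n"
        f"Response:\n{document}\n\nInstruction:"
    )
    return query_llm(prompt).strip()

# The resulting (instruction, document) pairs are then filtered for quality
# (the paper's self-curation step) before instruction finetuning.
```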
Efficient Benchmarking of Language Models (31 Aug 2023, https://arxiv.org/abs/2308.11696)
If you have run LLM benchmarks such as HELM or Evaluation Harness, you know that benchmarking can be very expensive and take many hours. To improve LLM evaluation efficiency, researchers propose Flash-HELM, which produces similar model rankings as HELM at a much-reduced cost. For instance, models that rank poorly in early evaluation rounds are given less weight and evaluated on fewer examples and prompts.
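The idea resembles a tournament: all models start on a small sample, and only the top-ranked ones advance to the larger (and more expensive) evaluation rounds. The sketch below is a simplified reading of that scheme; `evaluate` is a hypothetical function returning a benchmark score for a model on n examples.

```python
def flash_eval(models, evaluate, round_sizes=(32, 128, 512), keep_frac=0.5):
    ranked = list(models)
    for n_examples in round_sizes:
        scores = {m: evaluate(m, n_examples) for m in ranked}
        ranked = sorted(ranked, key=scores.get, reverse=True)
        # Models that rank poorly are dropped before the expensive rounds.
        ranked = ranked[: max(1, int(len(ranked) * keep_frac))]
    return ranked
```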
LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models (21 Sep 2023, https://arxiv.org/abs/2309.12307)
LongLoRA is an efficient method for finetuning LLMs that expands their context sizes without the usual high computational costs associated with longer contexts. The approach uses sparse local attention, which allows for computation savings during finetuning. One can then still use regular, dense attention during inference time.
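The paper calls this mechanism shifted sparse attention (S2-Attn): tokens attend only within local groups, and half of the heads operate on a half-group-shifted sequence so that information can still flow across group boundaries. Below is a simplified PyTorch sketch of that idea; shapes and details are reduced relative to the paper.

```python
import torch

def s2_attention(q, k, v, group_size):
    # q, k, v: (batch, heads, seq_len, head_dim); seq_len % group_size == 0
    q, k, v = q.clone(), k.clone(), v.clone()
    b, h, n, d = q.shape
    half = h // 2
    # Shift the second half of the heads by half a group size.
    for t in (q, k, v):
        t[:, half:] = torch.roll(t[:, half:], shifts=-group_size // 2, dims=2)
    # Attention is computed independently within each local group (sparse).
    g = n // group_size
    q, k, v = (t.reshape(b, h, g, group_size, d) for t in (q, k, v))
    attn = torch.softmax(q @ k.transpose(-2, -1) / d**0.5, dim=-1)
    out = (attn @ v).reshape(b, h, n, d)
    # Undo the shift for the second half of the heads.
    out[:, half:] = torch.roll(out[:, half:], shifts=group_size // 2, dims=2)
    return out
```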
Textbooks Are All You Need II: phi-1.5 Technical Report (11 Sep 2023, https://arxiv.org/abs/2309.05463)
This paper proposes a new "small" 1.3B parameter open-source LLM with really good performance. It doesn't break any benchmark records but compares favorably to 5x larger LLMs like the popular 7B parameter Llama 2 model. The training followed the Textbooks Are All You Need procedure using ~20B tokens of textbook-quality synthetic text on top of 7B tokens of code from The Stack, StackOverflow, and others.
Pretraining on the Test Set Is All You Need (19 Sep 2023, https://arxiv.org/abs/2309.08632)
In this satirical paper, the author trains a small 1M parameter LLM that outperforms all other models, including the 1.3B phi-1.5 model. This is achieved by training the model on all downstream academic benchmarks. It appears to be a subtle criticism underlining how easily benchmarks can be "cheated" intentionally or unintentionally (due to data contamination).
GPT Can Solve Mathematical Problems Without a Calculator (6 Sep 2023, https://arxiv.org/abs/2309.03241)
LLMs struggle with reliable arithmetic on large numbers, as seen when comparing ChatGPT's calculations to standard calculators (e.g., try 16342391 * 12325234243). In this proof-of-concept paper, researchers show that LLMs can be finetuned to improve math performance without external APIs. However, for practical purposes, using a calculator or granting LLMs access to a calculator API remains, of course, preferable.
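As a point of reference, exact arithmetic on numbers of this size is trivial for conventional tools; Python's arbitrary-precision integers get it right instantly, whereas LLMs typically fumble the middle or trailing digits:

```python
# Exact big-integer arithmetic, something LLMs typically get wrong:
print(16342391 * 12325234243)  # 201423797165695013
```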
RAIN: Your Language Models Can Align Themselves without Finetuning (13 Sep 2023, https://arxiv.org/abs/2309.07124)
This paper explores how we can align LLMs with human preferences without extra data or updating the model parameters. The proposed Rewindable Auto-regressive INference (RAIN) mechanism is centered around self-evaluation and rewinding an LLM's response. Tested on Llama 30B, this method maintained the model's helpfulness while improving its harmlessness rate from 82% to 97%.
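At a very high level, the mechanism can be thought of as generate, self-evaluate, and rewind. The sketch below compresses this into a whole-response loop; the actual method searches and rewinds over partial generations, and `generate_candidate` and `self_evaluate` are hypothetical stand-ins for the model's own sampling and self-scoring.

```python
def rain_generate(prompt, generate_candidate, self_evaluate,
                  threshold=0.9, max_rewinds=10):
    for _ in range(max_rewinds):
        response = generate_candidate(prompt)
        # The model scores its own draft, e.g., for harmlessness.
        if self_evaluate(prompt, response) >= threshold:
            return response
        # Otherwise rewind and generate a new candidate.
    return response  # fall back to the last draft
```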
Are Emergent Abilities in Large Language Models Just In-Context Learning? (4 Sep 2023, https://arxiv.org/abs/2309.01809)
A common perception is that LLMs develop emergent capabilities not seen in their smaller counterparts as their size increases. In this new comprehensive study of 18 models with up to 175 billion parameters across 22 tasks, researchers found that these abilities mainly stem from in-context learning, not reasoning capabilities. This should somewhat reassure people concerned about LLMs developing emergent capabilities as they grow in size.
Language Modeling Is Compression (19 Sep 2023, https://arxiv.org/abs/2309.10668)
Predictive deep learning models are usually trained to minimize the cross-entropy loss, which essentially means that a model's predictions require the fewest bits to encode the true labels of the training data. This paper further investigates how and why compression and prediction are equivalent. Interestingly, the authors found that LLMs are competitive compressors not only for text but also for modalities they have never been trained on, for example, image data.
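The connection is easy to make concrete: under arithmetic coding, encoding a token to which the model assigns probability p costs about -log2(p) bits, so minimizing cross-entropy is literally minimizing the compressed size of the data. A tiny illustration with made-up probabilities:

```python
import math

# Model probabilities assigned to the true next tokens (made-up numbers):
token_probs = [0.5, 0.25, 0.9, 0.1]

# Arithmetic coding cost: -log2(p) bits per token.
bits = sum(-math.log2(p) for p in token_probs)
print(f"{bits:.2f} bits to encode {len(token_probs)} tokens")  # ~6.47 bits
```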
Struc-Bench: Are Large Language Models Really Good at Generating Complex Structured Data? (16 Sep 2023, https://arxiv.org/abs/2309.08963)
This paper evaluates LLMs on complex, structured outputs such as HTML and LaTeX tables. In addition, the authors propose a specific chain-of-thought format to generate instructions from target outputs. Lastly, the authors find that structure-aware finetuning methods (via standard instruction-finetuning on the generated datasets) improve LLMs on tasks requiring structured output generation.
Code Llama: Open Foundation Models for Code (24 Aug 2023, https://arxiv.org/abs/2308.12950)
Code Llama is the latest collection of LLMs from Meta AI, based on the popular Llama 2 base LLM released earlier this year. The Code Llama models, which come in sizes of 7B, 13B, and 34B parameters, have been trained and finetuned on code. They were trained with context sizes of 16k tokens (four times larger than Llama 2) and perform well with context sizes of up to 100k tokens during inference.
OctoPack: Instruction Tuning Code Large Language Models (14 Aug 2023, https://arxiv.org/abs/2308.07124)
Researchers trained LLMs on 4 terabytes of Git commit data (covering 350 programming languages), which they released as CommitPack. Here, they leveraged the natural structure of Git commit messages (for example, "Change to sin() function with noise") to finetune an LLM such that it can modify code accordingly. The performance of the resulting LLM is comparable to other popular code LLMs, such as WizardCoder, on code fixing, explanation, and synthesis tasks.
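To make the data format concrete, a single commit naturally yields an instruction-tuning triplet: the commit message serves as the instruction, and the pre- and post-commit code serve as input and target output. The example below is hypothetical and only illustrates the structure:

```python
# One hypothetical CommitPack-style training example:
example = {
    "instruction": "Change to sin() function with noise",  # commit message
    "input": "def f(x):\n    return x",                    # code before commit
    "output": (                                            # code after commit
        "import math, random\n"
        "def f(x):\n"
        "    return math.sin(x) + random.gauss(0, 0.1)"
    ),
}
```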
Computer Vision and Multimodal Models
Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities (24 Aug 2023, https://arxiv.org/abs/2308.12966)
This series of Qwen-VL models is another entry in the multimodal AI field. The proposed models are large-scale vision-language transformers designed to understand both text and images. Qwen-VL and Qwen-VL-Chat are designed for tasks such as image captioning, question answering, and visual localization, and they support English, Chinese, and multilingual conversation.
EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE (23 Aug 2023, https://arxiv.org/abs/2308.11971)
Traditionally, building vision-language models involves training on multiple specialized pre-training tasks that are designed for each modality. This paper proposes a unified multimodal Transformer that is pretrained using a single task. To accomplish this, the proposed EVE model combines vision and language in a shared transformer network, utilizing modality-aware sparse Mixture-of-Experts (MoE) modules to process modality-specific information.
ImageBind-LLM: Multi-modality Instruction Tuning (7 Sep 2023, https://arxiv.org/abs/2309.03905)
ImageBind-LLM is a multi-modal instruction tuning method for LLMs using ImageBind. The method and resulting model are capable of understanding diverse modalities, including image, audio, 3D point clouds, and video, through a new image-text alignment training method. Here, a learnable Bind network aligns the embedding spaces of LLaMA and ImageBind’s image encoder, and a training-free cache model addresses the discrepancy between training and inference modalities.
Replacing Softmax with ReLU in Vision Transformers (15 Sep 2023, https://arxiv.org/abs/2309.08586)
Vision (and language) transformers usually use a softmax operation in self-attention, which is hard to parallelize. In this short paper, researchers show that one can replace the softmax function with a point-wise ReLU function without incurring a noticeable accuracy loss (or gain). This opens up new avenues for parallelization.
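Concretely, the paper divides the pointwise ReLU activations by the sequence length instead of normalizing each row with a softmax, which removes the cross-token normalization step. A minimal PyTorch sketch:

```python
import torch
import torch.nn.functional as F

def relu_attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, head_dim)
    seq_len, head_dim = q.shape[-2], q.shape[-1]
    scores = q @ k.transpose(-2, -1) / head_dim**0.5
    # Pointwise ReLU scaled by 1/seq_len replaces the row-wise softmax.
    weights = F.relu(scores) / seq_len
    return weights @ v
```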
General and Other Methods
Explaining Grokking Through Circuit Efficiency (5 Sep 2023, https://arxiv.org/abs/2309.02390)
This paper offers further explanations and insights about grokking, a phenomenon where neural networks transition from perfect training accuracy with poor generalization to perfect generalization. The authors propose that this occurs when a task has both a memorizing and generalizing solution, with the latter being slower to learn but more efficient. Interestingly, with larger datasets, it appears that memorizing becomes less efficient while generalizing remains consistent, leading to grokking.
Curriculum Learning with Adam: The Devil Is in the Wrong Details (23 Aug 2023, https://arxiv.org/abs/2308.12202)
Curriculum learning is a training strategy where models are taught progressively using tasks of increasing complexity, similar to how human education is structured from easy to hard topics. Recent curriculum methods in NLP are found to be brittle, often due to their interaction with the Adam optimization algorithm, leading to suboptimal parameter choices. Moreover, this paper found that none of the curriculum methods outperformed solely using Adam with well-selected hyperparameters.
Generating and Imputing Tabular Data via Diffusion and Flow-based Gradient-Boosted Trees (18 Sep 2023, https://arxiv.org/abs/2309.09968)
This paper introduces a new method for generating and imputing mixed-type (categorical and continuous) tabular data using score-based diffusion and conditional flow matching, leveraging XGBoost instead of deep neural networks. The approach produces highly realistic synthetic data, even when trained on datasets with missing values. Furthermore, it can be trained using CPUs (without requiring a GPU).
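In spirit, the method fits a separate gradient-boosted tree at each noise level of a flow-matching interpolation to predict the velocity between data and noise. The sketch below is a heavily simplified rendering of that idea for continuous features only; the hyperparameters are illustrative, and the conditional generation and imputation machinery from the paper are omitted.

```python
import numpy as np
from xgboost import XGBRegressor

def fit_flow_regressors(X, noise_levels=np.linspace(0.05, 0.95, 10), seed=0):
    """Fit one XGBoost model per noise level and output column (CPU-only)."""
    rng = np.random.default_rng(seed)
    models = {}
    for t in noise_levels:
        Z = rng.standard_normal(X.shape)
        X_t = (1 - t) * X + t * Z  # interpolant between data and noise
        target = Z - X             # flow-matching velocity target
        models[t] = [
            XGBRegressor(n_estimators=100).fit(X_t, target[:, j])
            for j in range(X.shape[1])
        ]
    return models
```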
This magazine is a personal passion project that does not offer direct compensation. However, for those who wish to support me, please consider purchasing a copy of one of my books. If you find them insightful and beneficial, please feel free to recommend them to your friends and colleagues.
Your support means a great deal! Thank you!