Jul 6, 2024

Understanding LLMs & Summarization

Fundamentals of LLMs

Huge Dataset

LLMs require vast amounts of text data, encompassing a range of sources from the internet, including online books, news articles, scientific papers, Wikipedia, and social media posts. This extensive dataset enables the models to learn from a diverse array of language patterns, styles, and contexts, which is crucial for their ability to generalize and understand language in various applications.

Immense Number of Parameters

LLMs have a large number of parameters, which are the internal variables that represent the knowledge learned by the model during training and determine its behavior. These parameters allow the model to capture complex language patterns and nuances, making it possible to perform sophisticated language tasks such as translation, summarization, and question answering.

Attention is All You Need (Transformer)

Previous Model: Recurrent Neural Networks (RNNs) and Long Short-Term Memory Networks (LSTMs)

These models processed text strictly sequentially, one token at a time, carrying context forward in a hidden state. Because that state tends to degrade over long spans (the vanishing-gradient problem), they had difficulty capturing long-distance dependencies in text, where the relationship between words is spread out over many intervening words. Consequently, they often struggled with tasks requiring an understanding of context over longer sequences.

Transformer Model

Self-attention mechanisms allow all parts of the input to be processed simultaneously. This parallel processing capability is a game-changer, as it enables the model to consider the entire sequence of input data at every step, making it particularly effective at understanding context and drawing relationships between distant elements in a text sequence. By leveraging self-attention, Transformers can capture more complex dependencies and provide more accurate predictions.
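As a rough sketch of what this looks like in code, the scaled dot-product attention at the heart of the Transformer can be written in a few lines of NumPy. The sequence length, widths, and random projection matrices below are purely illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over an entire sequence at once."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # every position scores every other position
    weights = softmax(scores, axis=-1)       # each row is an attention distribution
    return weights @ V                       # context-mixed token representations

# Toy sizes: 4 tokens, model width 8, head width 4 (purely illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (4, 4)
```

Note that the whole sequence is handled in a single matrix product, which is what makes the computation parallelizable.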

LLMs Training Through Next-Token Prediction

Self-Attention Mechanism

Global Context Awareness: The self-attention mechanism enables each word in a sentence to attend to all other words, regardless of their position in the sequence. This allows the model to capture global context and dependencies that may not be immediately adjacent. As a result, the model can generate more coherent and contextually appropriate responses.

Dynamic Weights: The attention scores are dynamically computed, allowing the model to focus more on certain words that are contextually relevant to the prediction task at hand. This dynamic weighting ensures that the model can adapt to different contexts and emphasize important information as needed.

Multi-Head Attention: The Transformer employs multi-head attention, where the input is processed through several attention layers (or "heads") in parallel. Each head learns different aspects or representations of the input data. This multi-faceted approach enhances the model's ability to understand and generate complex language patterns.
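For a concrete (if simplified) picture of multi-head attention, PyTorch ships a ready-made layer that runs several heads in parallel over the same embedded sequence; the dimensions here are arbitrary example values:

```python
import torch
import torch.nn as nn

# One multi-head self-attention layer: 8 heads over a 64-dimensional embedding.
attn = nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)

# A batch of 2 sequences, 10 tokens each, already embedded.
x = torch.randn(2, 10, 64)

# Self-attention: the same tensor serves as query, key, and value.
out, weights = attn(x, x, x)
print(out.shape)      # torch.Size([2, 10, 64])
print(weights.shape)  # torch.Size([2, 10, 10]) -- averaged over heads by default
```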

Basic Structure & Cleaning

Tokenization

Text Segmentation: The text is segmented into meaningful units called tokens, which may be words, subwords, or characters, depending on the tokenization scheme. This segmentation is essential for the model to process the text effectively and capture the nuances of language.

Vocabulary Construction: A vocabulary is constructed from all training data, assigning a unique index to each unique token. This vocabulary serves as the foundation for the model's understanding of language, enabling it to map text inputs to numerical representations.
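A deliberately simplified, word-level sketch of segmentation and vocabulary construction looks like the following. Real LLMs use subword schemes such as BPE or WordPiece, but the token-to-index mapping works the same way:

```python
# Toy corpus; real training data is vastly larger and more diverse.
corpus = [
    "large language models learn from text",
    "models map text to numbers",
]

# Text segmentation: split each document into tokens (whitespace split here).
tokenized = [doc.split() for doc in corpus]

# Vocabulary construction: assign a unique index to every distinct token.
vocab = {"<unk>": 0}
for doc in tokenized:
    for token in doc:
        vocab.setdefault(token, len(vocab))

# Map a new sentence to token IDs, falling back to <unk> for unseen words.
sentence = "language models learn quickly"
ids = [vocab.get(tok, vocab["<unk>"]) for tok in sentence.split()]
print(vocab)
print(ids)
```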

Embedding

Vector Representation: Each token is converted into a fixed-size dense vector (its embedding). These vector representations allow the model to process and analyze the text in a structured and meaningful way.

Positional Encoding: Positional encoding is added to provide the model with information about the relative or absolute position of the tokens in the sequence. This encoding helps the model understand the structure and flow of the text, which is crucial for tasks that require an understanding of sequence and context.
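One common choice, used in the original Transformer paper, is a fixed sinusoidal encoding added to the token embeddings. The sketch below uses random placeholder embeddings just to show the shapes involved:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Fixed sinusoidal positional encodings from the original Transformer paper."""
    positions = np.arange(seq_len)[:, None]   # (seq_len, 1)
    dims = np.arange(d_model)[None, :]        # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])     # even dimensions use sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])     # odd dimensions use cosine
    return pe

# Token embeddings (random placeholders here) plus positional information.
seq_len, d_model = 6, 16
token_embeddings = np.random.normal(size=(seq_len, d_model))
inputs = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
print(inputs.shape)  # (6, 16)
```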

Encoder-Decoder Architecture (Lots of Layers)

Encoder: The encoder component of the Transformer is responsible for reading the input sequence and producing a continuous representation of the input. This representation captures the essential information and context needed for the subsequent decoding process.

Decoder: The decoder takes the encoder's output and generates the target sequence one token at a time, attending both to the encoder representation and to its own previously generated tokens. This iterative process allows the model to produce coherent and contextually relevant text.
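PyTorch's built-in nn.Transformer module wires these two components together. The following minimal sketch, with untrained weights and arbitrary sizes, only illustrates how the encoder output and the partially generated target feed the decoder:

```python
import torch
import torch.nn as nn

# A small encoder-decoder Transformer (dimensions are illustrative, not tuned).
model = nn.Transformer(
    d_model=128, nhead=8,
    num_encoder_layers=2, num_decoder_layers=2,
    batch_first=True,
)

src = torch.randn(2, 12, 128)  # encoder input: batch of 2, 12 source tokens (embedded)
tgt = torch.randn(2, 7, 128)   # decoder input: 7 target tokens generated so far

# The encoder produces a representation of src; the decoder attends to it
# (and to its own previous outputs) to produce the target-side representation.
out = model(src, tgt)
print(out.shape)  # torch.Size([2, 7, 128])
```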

LLM Models

Encoder Model: BERT

BERT (Bidirectional Encoder Representations from Transformers) is an encoder-only model used for tasks like sentiment analysis and extractive summarization. By processing text bidirectionally, BERT can capture context from both preceding and following words, making it highly effective at understanding language.
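Because BERT is pre-trained with masked-token prediction, a quick way to see its bidirectional use of context is the fill-mask task. A minimal sketch with the Hugging Face transformers library, assuming the bert-base-uncased checkpoint can be downloaded:

```python
from transformers import pipeline

# BERT fills in the blank using context from both sides of the mask.
fill = pipeline("fill-mask", model="bert-base-uncased")
for candidate in fill("The movie was absolutely [MASK]."):
    print(candidate["token_str"], round(candidate["score"], 3))
```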

Decoder Model: GPT

GPT (Generative Pre-trained Transformer) is a decoder model used for predicting the next word in a sequence. By focusing on autoregressive text generation, GPT excels at producing coherent and contextually appropriate language, making it ideal for tasks such as text completion and conversation generation.
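A minimal autoregressive generation sketch, again assuming the transformers library and the gpt2 checkpoint are available:

```python
from transformers import pipeline

# GPT-style decoders generate text autoregressively: each new token is
# predicted from all the tokens produced so far.
generate = pipeline("text-generation", model="gpt2")
result = generate("Large language models are", max_new_tokens=20)
print(result[0]["generated_text"])
```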

Encoder-Decoder Model: T5 and BART

Encoder-decoder models like T5 (Text-to-Text Transfer Transformer) and BART (Bidirectional and Auto-Regressive Transformers) combine the strengths of both architectures. They are designed to perform a wide range of language tasks by converting them into a text-to-text format, enabling versatile and powerful language processing capabilities.
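A short summarization sketch using BART through the transformers pipeline (assuming the facebook/bart-large-cnn checkpoint is available); the input text is a made-up example:

```python
from transformers import pipeline

# Encoder-decoder models suit summarization well: the encoder reads the
# article, the decoder writes the summary.
summarize = pipeline("summarization", model="facebook/bart-large-cnn")
article = (
    "Large language models are trained on vast text corpora and can perform "
    "tasks such as translation, question answering, and summarization. "
    "Transformer-based architectures made this possible by processing whole "
    "sequences in parallel with self-attention."
)
print(summarize(article, max_length=40, min_length=10)[0]["summary_text"])
```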

Pre-training and Instruction Tuning

Training Process

Forward Propagation: Input data is passed through the model to generate predictions, and a loss function (e.g., cross-entropy loss) measures the error between the predictions and the target tokens.

Backpropagation: Gradients are computed based on the loss function, and the model's weights are updated through backpropagation. This process iteratively adjusts the model's parameters to minimize the error and improve performance.
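The following self-contained PyTorch sketch shows one forward/backward step of next-token prediction on random token IDs. Real pre-training additionally uses causal masking, huge batches, and many iterations:

```python
import torch
import torch.nn as nn

vocab_size, d_model, seq_len, batch = 1000, 64, 16, 8

# A toy next-token predictor: embed tokens, run one Transformer encoder layer
# (causal masking omitted for brevity), project back to vocabulary logits.
embed = nn.Embedding(vocab_size, d_model)
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
head = nn.Linear(d_model, vocab_size)
params = list(embed.parameters()) + list(layer.parameters()) + list(head.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, vocab_size, (batch, seq_len))
inputs, targets = tokens[:, :-1], tokens[:, 1:]   # predict the next token

# Forward propagation: compute logits and the cross-entropy loss.
logits = head(layer(embed(inputs)))
loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))

# Backpropagation: compute gradients and update the weights.
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(float(loss))
```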

Fundamentals of Text Summarization

Extractive Summarization

Extractive summarization involves selecting and directly extracting parts of the original text to form a summary. It is based on the idea that the important information is already present in the text and only needs to be identified and concatenated into a coherent summary.

Sentence Scoring: Each sentence is assigned a score, typically based on features such as sentence length, position in the document, and the presence of key phrases; the highest-scoring sentences are selected for inclusion in the summary.
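As an illustration of the scoring idea (far simpler than what production systems use), the sketch below scores sentences by the average frequency of the words they contain and keeps the top ones in their original order:

```python
import re
from collections import Counter

def extractive_summary(text, num_sentences=2):
    """Score sentences by average word frequency and keep the top few."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    # Keep the highest-scoring sentences, but preserve their original order.
    ranked = sorted(sentences, key=score, reverse=True)[:num_sentences]
    return " ".join(s for s in sentences if s in ranked)

doc = (
    "Transformers process text in parallel. Self-attention lets every token "
    "attend to every other token. This makes transformers effective at "
    "capturing long-range context. They also scale well to large datasets."
)
print(extractive_summary(doc))
```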

Extractive summarization is data-driven, easier to implement, and often gives better results for certain types of text, such as news articles and technical documents.

Abstractive Summarization

Abstractive summarization goes beyond selecting text from the original document; it involves understanding the text at a deeper level and generating new sentences that convey the same meaning in a more condensed form. It is more complex and mimics how humans summarize information.

Rather than copying sentences from the source, the model generates new phrasing and makes inferences to fill in gaps or to deduce broader implications not explicitly stated in the text. This generative process allows for more flexibility and creativity in the summaries produced.

This is challenging for algorithms because it involves semantic representation, inference, and natural language generation. Despite these challenges, advances in NLP are continually improving the effectiveness of abstractive summarization models.

References

1. Jonas Gehring, Michael Auli, et al. "Convolutional Sequence to Sequence Learning." arXiv:1705.03122.

2. Ashish Vaswani, Noam Shazeer, et al. "Attention Is All You Need." arXiv:1706.03762.

3. Al-Rfou et al. "Character-Level Language Modeling with Deeper Self-Attention." 2018.

4. Xiaoyu Yin, Dagmar Gromann, and Sebastian Rudolph. "Neural Machine Translating from Natural Language to SPARQL."

5. "Text Summarization." Devopedia (devopedia.org).