Machine learning: neural networks and deep learning
From multi-layer perceptrons to the latest Transformer architectures, this FAQ section explores the main deep learning models and training techniques. It is a clear, comprehensive guide combining theory, advantages and limitations, implementation methodology, and optimization strategies. It is aimed at students, developers, and practitioners who want to master the fundamental tools of deep learning, apply them in real-world settings, and stay up to date on the most effective practices.
- What are the main neural network architectures (MLP, CNN, RNN, Transformer)?
- How does training a neural network work (backpropagation, optimizers, learning rate)?
- What are dropout, batch normalization, and other regularization methods?
- How to choose and apply activation functions (ReLU, leaky ReLU, GELU)?
- Which techniques help avoid overfitting on small datasets (data augmentation, transfer learning)?
- How to use transfer learning and fine-tuning with pre-trained models?
- How do Transformers work and what are their variants (BERT, GPT, ViT)?
- What is self-supervised learning and how is it applied (SimCLR, MoCo)?
What are the main neural network architectures (MLP, CNN, RNN, Transformer)?
The main neural network architectures represent distinct computational paradigms, each optimized for specific data types and tasks, evolving from simple fully-connected connectivity toward specialized structures that exploit appropriate inductive biases.
Multi-Layer Perceptron (MLP): The fundamental architecture, built from dense layers in which every neuron is connected to all neurons of the previous layer. Formally, each layer computes y = σ(Wx + b), where W is the weight matrix, x the input, b the bias, and σ the activation function. MLPs are excellent for tabular data, engineered features, and problems where spatial or temporal relationships are not critical. Limitations include a large number of parameters for high-dimensional inputs and the inability to capture spatial or temporal patterns intrinsically.
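As a concrete illustration, here is a minimal PyTorch sketch of such a fully-connected network for tabular data; the layer widths and the 20-feature/3-class dimensions are hypothetical.

```python
import torch
import torch.nn as nn

# Minimal MLP for tabular data: each Linear layer computes y = Wx + b,
# followed by a non-linearity (ReLU here).
mlp = nn.Sequential(
    nn.Linear(20, 64),   # 20 input features (hypothetical)
    nn.ReLU(),
    nn.Linear(64, 64),
    nn.ReLU(),
    nn.Linear(64, 3),    # 3 output classes (hypothetical)
)

x = torch.randn(8, 20)   # batch of 8 samples
logits = mlp(x)
print(logits.shape)      # torch.Size([8, 3])
```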
Convolutional Neural Networks (CNN): Exploit convolutional layers that apply learnable filters across the spatial dimensions, preserving locality and translation equivariance. Core components include convolutional layers that detect local patterns via sliding kernels, pooling layers (max/average) for dimensionality reduction and translation invariance, and activation functions for non-linearity. Influential architectures include LeNet, AlexNet, VGG, ResNet (with skip connections), DenseNet, and EfficientNet. Applications range from image classification to object detection, medical imaging, and signal processing.
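A minimal sketch of this structure in PyTorch, assuming 32x32 RGB inputs and 10 output classes (both hypothetical):

```python
import torch
import torch.nn as nn

# Toy CNN: convolutions detect local patterns, pooling reduces spatial size.
cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # 3-channel RGB input
    nn.ReLU(),
    nn.MaxPool2d(2),                             # halves spatial resolution
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),                     # global average pooling
    nn.Flatten(),
    nn.Linear(32, 10),                           # 10 classes (hypothetical)
)

x = torch.randn(4, 3, 32, 32)  # batch of 4 RGB images, 32x32
print(cnn(x).shape)            # torch.Size([4, 10])
```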
Recurrent Neural Networks (RNN): Process sequential data by maintaining a hidden state h_t that carries information across time steps: h_t = σ(W_hh h_{t-1} + W_xh x_t + b_h). Vanilla RNNs suffer from the vanishing gradient problem on long sequences. LSTM (Long Short-Term Memory) introduces gating mechanisms (forget, input, and output gates) for selective information flow. GRU (Gated Recurrent Unit) simplifies the LSTM by combining the forget and input gates. Bidirectional variants process sequences in both directions. Applications include language modeling, machine translation, time series forecasting, and speech recognition.
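A minimal PyTorch sketch of an LSTM processing a batch of sequences; the sequence length and feature sizes are hypothetical.

```python
import torch
import torch.nn as nn

# LSTM over a batch of sequences: the hidden state carries information across steps.
lstm = nn.LSTM(input_size=16, hidden_size=32, num_layers=1, batch_first=True)

x = torch.randn(4, 50, 16)        # 4 sequences, 50 time steps, 16 features
output, (h_n, c_n) = lstm(x)      # output: per-step hidden states
print(output.shape, h_n.shape)    # torch.Size([4, 50, 32]) torch.Size([1, 4, 32])
```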
Transformer Architecture: Revolutionizes sequence modeling by eliminating recurrence in favor of self-attention mechanisms. The core innovation is multi-head attention, which computes Attention(Q, K, V) = softmax(QK^T/√d_k)V, where Q, K, V are learned projections of the input. Positional encodings inject sequence-order information. The encoder-decoder structure uses layer normalization, residual connections, and feed-forward networks. Parallelizable training, the ability to capture long-range dependencies, and transferability have made Transformers dominant in NLP, with growing adoption in computer vision and multimodal tasks.
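A short sketch of scaled dot-product attention implemented directly from the formula above; the tensor shapes are hypothetical.

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    weights = torch.softmax(scores, dim=-1)   # attention weights per query
    return weights @ V

# 2 sequences, 5 tokens, 8-dimensional projections (hypothetical sizes)
Q = torch.randn(2, 5, 8)
K = torch.randn(2, 5, 8)
V = torch.randn(2, 5, 8)
print(scaled_dot_product_attention(Q, K, V).shape)  # torch.Size([2, 5, 8])
```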
How does training a neural network work (backpropagation, optimizers, learning rate)?
Neural network training combines forward propagation to compute predictions, loss calculation to quantify errors, and backward propagation to compute gradients, followed by parameter updates through optimization algorithms.
Backpropagation Algorithm: Applies the chain rule of differential calculus to compute the gradients of the loss function with respect to the network parameters. The forward pass computes activations layer by layer: a^l = σ(W^l a^{l-1} + b^l). The loss L is computed by comparing predictions with the ground truth. The backward pass propagates error signals: δ^l = (W^{l+1})^T δ^{l+1} ⊙ σ'(z^l), where ⊙ denotes the element-wise product. The gradients are ∇W^l = δ^l (a^{l-1})^T and ∇b^l = δ^l. Modern automatic differentiation frameworks (PyTorch, TensorFlow) automate this process via computational graphs.
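A minimal sketch of the forward/loss/backward cycle using PyTorch's automatic differentiation; the layer and batch sizes are hypothetical.

```python
import torch
import torch.nn as nn

# Forward pass, loss computation, backward pass with automatic differentiation.
model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()

x = torch.randn(32, 10)
y = torch.randn(32, 1)

pred = model(x)            # forward pass
loss = loss_fn(pred, y)    # loss computation
loss.backward()            # backward pass: gradients via the chain rule

print(model.weight.grad.shape)  # torch.Size([1, 10])
```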
Optimization Algorithms: Stochastic Gradient Descent (SGD) updates the parameters as θ_{t+1} = θ_t - η∇L(θ_t), where η is the learning rate. Momentum accumulates an exponentially decaying moving average of past gradients: v_t = βv_{t-1} + η∇L, θ_{t+1} = θ_t - v_t. Adam (Adaptive Moment Estimation) combines momentum with adaptive learning rates: it maintains moving averages of the gradients (m_t) and of the squared gradients (v_t), then computes bias-corrected estimates for the parameter updates. AdamW decouples weight decay from the gradient update for better regularization.
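A sketch of the momentum update written out by hand on a toy loss, with a pointer to the equivalent built-in optimizer; the learning rate and momentum values are illustrative.

```python
import torch

# Manual SGD-with-momentum update on a single tensor of parameters.
theta = torch.randn(5, requires_grad=True)
velocity = torch.zeros_like(theta)
lr, beta = 0.1, 0.9

loss = (theta ** 2).sum()          # toy loss
loss.backward()

with torch.no_grad():
    velocity = beta * velocity + lr * theta.grad   # v_t = beta*v_{t-1} + eta*grad
    theta -= velocity                              # theta_{t+1} = theta_t - v_t
    theta.grad.zero_()

# In practice the built-in optimizers are used instead, e.g. decoupled weight decay:
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
```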
Learning Rate Management: The learning rate is a critical hyperparameter that controls the step size in the optimization landscape. Large learning rates can cause oscillations or divergence; small rates lead to slow convergence or entrapment in local minima. Learning rate scheduling strategies include step decay (reduce by a factor every few epochs), exponential decay, cosine annealing (smooth cyclic reduction), and polynomial decay. Warmup strategies start with a small learning rate and gradually increase it to stabilize early training. Adaptive methods such as ReduceLROnPlateau adjust the rate based on validation metrics.
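A minimal sketch of learning rate scheduling with cosine annealing; the model, the toy loss, and the schedule length are placeholders.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Cosine annealing: smoothly decays the learning rate over T_max steps.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for step in range(100):
    optimizer.zero_grad()
    loss = model(torch.randn(16, 10)).pow(2).mean()  # toy loss
    loss.backward()
    optimizer.step()
    scheduler.step()                                  # update the learning rate

print(optimizer.param_groups[0]["lr"])  # close to zero after the full schedule
```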
Advanced Training Techniques: Gradient clipping prevents exploding gradients by limiting the gradient norm. Mixed precision training uses FP16 for the forward/backward passes and FP32 for the parameter updates, accelerating training on modern GPUs. Gradient accumulation simulates larger batch sizes by aggregating gradients across multiple mini-batches. Learning rate finding techniques (cyclical learning rates, the one-cycle policy) optimize learning rate selection. Second-order methods (L-BFGS, natural gradients) use curvature information but are computationally intensive for large networks.
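A short sketch of gradient clipping inserted between the backward pass and the optimizer step; the sizes and the clipping threshold are illustrative.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x, y = torch.randn(32, 10), torch.randn(32, 1)

optimizer.zero_grad()
loss = nn.functional.mse_loss(model(x), y)
loss.backward()

# Gradient clipping: rescales gradients so their global norm stays <= 1.0.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```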
What are dropout, batch normalization, and other regularization methods?
Regularization comprises techniques for preventing overfitting, improving generalization, and stabilizing training dynamics, addressing the fundamental bias-variance tradeoff in machine learning.
Dropout Regularization: During training, a fraction p of the neurons is randomly set to zero, forcing the network not to rely on specific neurons and encouraging distributed representations. Mathematically, during the forward pass: r ~ Bernoulli(1-p), ỹ = r ⊙ y, where y are the activations and r is a binary mask. During inference, the weights are scaled by (1-p) to compensate for the expected reduction. Variants include DropConnect (dropping connections instead of neurons), Spatial Dropout (dropping entire feature maps in CNNs), and Scheduled Dropout (varying the dropout rate during training).
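A minimal sketch of dropout behaviour in PyTorch. Note that PyTorch implements inverted dropout: surviving activations are scaled by 1/(1-p) during training, so no rescaling is needed at inference; this is an equivalent alternative to the classic formulation described above.

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)   # drop probability p = 0.5
x = torch.ones(10)

drop.train()
print(drop(x))   # roughly half the entries zeroed, survivors scaled by 1/(1-p)

drop.eval()
print(drop(x))   # identity at inference time
```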
Batch Normalization: Normalizes the inputs of each layer to have zero mean and unit variance across the batch dimension: BN(x) = γ(x-μ)/σ + β, where μ, σ are batch statistics and γ, β are learnable parameters. Benefits include faster convergence, tolerance to higher learning rates, reduced sensitivity to weight initialization, and an implicit regularization effect. During inference, moving averages of the training statistics are used. Alternatives include Layer Normalization (normalizes across features), Group Normalization (normalizes within groups of channels), and Instance Normalization (per-sample normalization).
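A minimal sketch showing batch statistics being normalized during training; the feature count and batch size are hypothetical.

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(num_features=8)   # gamma and beta are learnable

x = torch.randn(32, 8) * 5 + 3        # batch with non-zero mean and large variance
bn.train()
y = bn(x)
print(y.mean(dim=0))   # approximately 0 per feature
print(y.std(dim=0))    # approximately 1 per feature

bn.eval()              # at inference, running statistics are used instead
```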
Weight-based Regularization: L1 regularization adds λ∑|w_i| to the loss, encouraging sparsity and automatic feature selection. L2 regularization (weight decay) adds λ∑w_i², penalizing large weights and encouraging smoother decision boundaries. Elastic Net combines L1 and L2. Modern implementations often apply weight decay directly in the optimizer rather than in the loss function, for better theoretical properties.
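A sketch contrasting an explicit L2 penalty added to the loss with decoupled weight decay in AdamW; the lambda values are illustrative.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)

# L2 penalty added explicitly to the loss (lambda = 1e-4, hypothetical value)
l2_lambda = 1e-4
l2_penalty = sum(p.pow(2).sum() for p in model.parameters())
loss = nn.functional.mse_loss(model(torch.randn(16, 10)), torch.randn(16, 1))
loss = loss + l2_lambda * l2_penalty

# Decoupled weight decay applied directly in the optimizer (AdamW)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
```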
Advanced Regularization Techniques: Early stopping monitors validation performance and stops training when it degrades, preventing overfitting automatically. Data augmentation artificially expands the training set with semantically meaningful transformations. Label smoothing replaces hard targets with soft distributions, reducing overconfidence. Spectral normalization constrains the spectral norm of weight matrices for improved stability in GANs. Gradient penalty methods regularize gradient magnitudes to enforce Lipschitz constraints.
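A short sketch of two of these techniques, label smoothing and early stopping; the validation losses and the patience value are made up for illustration.

```python
import torch
import torch.nn as nn

# Label smoothing: hard one-hot targets become soft distributions.
loss_fn = nn.CrossEntropyLoss(label_smoothing=0.1)
logits = torch.randn(4, 3)
targets = torch.tensor([0, 2, 1, 0])
print(loss_fn(logits, targets))

# Early-stopping sketch on a made-up sequence of validation losses.
val_losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.63, 0.64]   # hypothetical values
best, patience, wait, stop_epoch = float("inf"), 2, 0, None
for epoch, v in enumerate(val_losses):
    if v < best:
        best, wait = v, 0
    else:
        wait += 1
        if wait >= patience:
            stop_epoch = epoch
            break
print(stop_epoch)  # training would stop at epoch 4
```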
How to choose and apply activation functions (ReLU, leaky ReLU, GELU)?
Activation functions introduce the non-linearity essential for learning complex patterns; the choice impacts gradient flow, computational efficiency, and model expressiveness.
ReLU (Rectified Linear Unit): f(x) = max(0, x) has become the standard activation for hidden layers. Advantages include computational simplicity, gradient preservation for positive inputs (alleviating vanishing gradients), and sparsity promotion (many neurons output zero). The dying ReLU problem occurs when neurons permanently output zero, typically due to large negative biases or excessive learning rates. Solutions include careful initialization, learning rate tuning, and alternative activations.
Leaky ReLU and Variants: f(x) = max(αx, x) with a small positive α (typically 0.01) allows a small gradient flow for negative inputs, mitigating the dying ReLU problem. Parametric ReLU (PReLU) learns α during training. The Exponential Linear Unit (ELU), f(x) = x if x > 0 and α(e^x - 1) otherwise, provides a smooth negative part with near-zero-mean activations. Swish, f(x) = x·σ(βx), combines sigmoid smoothness with ReLU-like properties.
GELU (Gaussian Error Linear Unit): f(x) = x·P(X ≤ x) where X ~ N(0,1), approximated as x·Φ(x) or x·σ(1.702x). It provides a smooth, probabilistically motivated activation that weights inputs by their percentile in the Gaussian distribution. It is particularly effective in Transformer architectures and large language models, offering better gradient properties than ReLU while preserving the characteristics of the input distribution.
Selection Guidelines: Use ReLU as the default choice in most scenarios, particularly in CNNs and standard MLPs. Switch to Leaky ReLU or ELU when experiencing dying neurons or when gradient flow is critical. Use GELU for Transformer-based models and applications requiring smooth activation landscapes. Use tanh or sigmoid in output layers when bounded outputs are required. Consider Swish or Mish for state-of-the-art performance at a higher computational cost. Always initialize weights appropriately (He initialization for ReLU, Xavier for tanh/sigmoid) to complement the activation choice.
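A brief comparison of the three activations discussed above, plus the matching He initialization; the shapes are arbitrary.

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-3, 3, 7)

print(F.relu(x))                             # max(0, x)
print(F.leaky_relu(x, negative_slope=0.01))  # max(0.01*x, x)
print(F.gelu(x))                             # x * Phi(x), smooth around zero

# He initialization, typically paired with ReLU-family activations:
layer = torch.nn.Linear(128, 64)
torch.nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")
```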
Which techniques help avoid overfitting on small datasets (data augmentation, transfer learning)?
With small datasets, traditional deep learning approaches are prone to severe overfitting, requiring specialized strategies to improve generalization and maximize the utility of the available data.
Data Augmentation Strategies: Artificially expand the dataset through label-preserving transformations. Computer vision: rotations, translations, scaling, flipping, color jittering, random crops, cutout, mixup. Natural language: synonym replacement, back-translation, random insertion/deletion, paraphrasing with language models. Advanced techniques: AutoAugment (learning optimal augmentation policies), RandAugment (a simplified parameter space), adversarial augmentation. Domain-specific: time warping for time series, spectral augmentation for audio, molecular fingerprint perturbations for chemistry.
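A minimal sketch of an image augmentation pipeline with torchvision transforms; the specific transforms and parameter values are illustrative, not a recommended recipe.

```python
import torchvision.transforms as T

# Typical label-preserving image augmentations, applied to each PIL image
# inside a Dataset before converting it to a tensor.
augment = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    T.ToTensor(),
])
```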
Transfer Learning Approaches: Leverage models pre-trained on large datasets to initialize networks with meaningful representations. Feature extraction: freeze the pre-trained layers and train only the classifier head; appropriate when the target domain is similar to the pre-training domain. Fine-tuning: selectively unfreeze and retrain layers with smaller learning rates; effective for domain adaptation. Progressive unfreezing: gradually unfreeze layers during training. Discriminative fine-tuning: use different learning rates for different layers, with lower rates for earlier layers.
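A minimal sketch of the feature-extraction strategy, assuming a recent torchvision with the weights enum API; the 5-class head is hypothetical.

```python
import torch.nn as nn
import torchvision.models as models

# Feature extraction: freeze a pre-trained backbone, train only a new head.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for param in backbone.parameters():
    param.requires_grad = False                        # freeze pre-trained layers

backbone.fc = nn.Linear(backbone.fc.in_features, 5)    # 5 target classes (hypothetical)
# Only the parameters of backbone.fc receive gradients during training.
```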
Architectural and Training Modifications: Smaller model architectures reduce the overfitting risk through a reduced parameter count. Aggressive regularization: higher dropout rates, stronger weight decay, early stopping with patience. Cross-validation strategies: k-fold CV for robust performance estimation, stratified sampling to maintain class distributions. Ensemble methods: combine multiple models trained with different initializations, augmentations, or hyperparameters. Gradual unfreezing in transfer learning scenarios.
Advanced Techniques: Few-shot learning methods: prototypical networks, model-agnostic meta-learning (MAML), relation networks. Self-supervised pre-training on domain-specific data, even without labels. Knowledge distillation from larger models trained on related tasks. Synthetic data generation using GANs or diffusion models to augment real data. Multi-task learning when related tasks are available. Curriculum learning, starting with easier examples and gradually increasing complexity.
How to use transfer learning and fine-tuning with pre-trained models?
Transfer learning exploits knowledge learned by models trained on large-scale datasets, adapting it to new tasks with limited data, dramatically reducing training time and improving performance on small datasets.
Transfer Learning Strategies: The choice depends on the relationship between source and target domains and on the size of the target dataset. High similarity, small dataset: feature extraction, freezing most layers. High similarity, large dataset: fine-tune most layers with smaller learning rates. Low similarity, small dataset: fine-tune only the top layers or train a linear classifier. Low similarity, large dataset: use the pre-trained weights as initialization and train normally. Very different domains: may require training from scratch or finding better-matched pre-trained models.
Fine-tuning Best Practices: Layer-wise learning rates: use smaller rates for earlier layers (0.1x the base rate), higher rates for later layers (1x the base rate), and the highest rates for newly added layers (10x the base rate). Gradual unfreezing: start by training only the classifier, then progressively unfreeze layers from top to bottom. Discriminative fine-tuning: different learning rates per layer group. Warm restarts: periodically reset the learning rate to escape local minima. Careful data preprocessing: match the pre-training statistics (normalization, input size).
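A sketch of layer-wise learning rates via optimizer parameter groups; only three groups are shown for brevity (a real setup would cover every parameter of the model), and the base rate is hypothetical.

```python
import torch
import torchvision.models as models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Discriminative fine-tuning: smaller learning rate for early pre-trained layers,
# larger learning rate for the newly added classifier head.
base_lr = 1e-3   # hypothetical base rate
optimizer = torch.optim.AdamW([
    {"params": model.layer1.parameters(), "lr": base_lr * 0.1},  # early layers
    {"params": model.layer4.parameters(), "lr": base_lr},        # later layers
    {"params": model.fc.parameters(),     "lr": base_lr * 10},   # new head
])
```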
Domain-Specific Considerations: Computer vision: ImageNet pre-trained models (ResNet, EfficientNet, Vision Transformers) transfer to diverse visual tasks; consider matching the input resolution and color channel conventions. Natural language processing: BERT, GPT, and RoBERTa provide strong linguistic representations, with task-specific heads for classification, sequence labeling, and generation. Speech: wav2vec and Whisper models for audio tasks. Multimodal: CLIP and BLIP models bridge the vision and language domains.
Advanced Transfer Learning: Multi-task learning: jointly train on multiple related tasks sharing the lower layers. Domain adaptation techniques when a distribution shift exists between source and target. Few-shot fine-tuning: adapter layers and LoRA (Low-Rank Adaptation) for parameter-efficient fine-tuning. Prompt tuning: optimize input prompts instead of model parameters for large language models. Intermediate task transfer: sequential transfer through multiple related tasks. Meta-learning approaches: learn a good initialization for fast adaptation to new tasks.
How do Transformers work and what are their variants (BERT, GPT, ViT)?
Transformers revolutionized deep learning through self-attention mechanisms that model relationships between all positions in a sequence simultaneously, eliminating recurrence and enabling parallel processing.
Transformer Architecture: The core building block is multi-head attention: Attention(Q, K, V) = softmax(QK^T/√d_k)V, where Q, K, V are learned linear projections. Multi-head attention runs several attention heads in parallel and concatenates their results. Positional encodings inject sequence-order information using sinusoidal functions or learned embeddings. Encoder blocks consist of multi-head attention plus a feed-forward network, with residual connections and layer normalization. Decoder blocks add masked self-attention for autoregressive generation.
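A minimal sketch of a Transformer encoder stack using PyTorch's built-in blocks; the model dimensions are hypothetical.

```python
import torch
import torch.nn as nn

# One Transformer encoder block: multi-head self-attention + feed-forward network,
# with residual connections and layer normalization handled internally.
encoder_layer = nn.TransformerEncoderLayer(
    d_model=64, nhead=8, dim_feedforward=256, batch_first=True
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)

x = torch.randn(2, 10, 64)   # 2 sequences, 10 tokens, 64-dim embeddings
print(encoder(x).shape)      # torch.Size([2, 10, 64])
```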
BERT (Bidirectional Encoder Representations from Transformers): Uses only the encoder stack, with bidirectional self-attention. It is pre-trained using Masked Language Modeling (predicting masked tokens) and Next Sentence Prediction (classifying sentence pairs). It produces contextualized word embeddings that take the full sentence context into account and is fine-tuned for downstream tasks by adding task-specific heads. Variants: RoBERTa (optimized training), DeBERTa (disentangled attention), ALBERT (parameter sharing), DistilBERT (knowledge distillation).
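A short sketch of extracting contextual embeddings from a pre-trained BERT encoder, assuming the Hugging Face transformers library is installed and the checkpoint can be downloaded.

```python
from transformers import AutoModel, AutoTokenizer

# Contextual embeddings from a pre-trained BERT encoder.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Transfer learning with BERT.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)   # (1, num_tokens, 768)
```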
GPT (Generative Pre-trained Transformer): Uses a decoder-only architecture with causal (masked) self-attention for autoregressive generation, pre-trained on a language modeling objective (predicting the next token). GPT-1/2 established the transfer learning paradigm. GPT-3/4 demonstrated few-shot learning capabilities through in-context learning without parameter updates. ChatGPT/InstructGPT add instruction following through reinforcement learning from human feedback (RLHF).
Vision Transformer (ViT): Adapts the Transformer architecture to computer vision by dividing images into fixed-size patches (typically 16x16), linearly projecting the patches into embeddings, and adding positional encodings. It uses a standard Transformer encoder with a [CLS] token for classification and requires large datasets or strong pre-training for competitive performance. Variants: DeiT (knowledge distillation), Swin Transformer (hierarchical processing with shifted windows), PVT (Pyramid Vision Transformer), CoAtNet (combining convolution and attention).
Multimodal and Specialized Transformers: CLIP jointly trains text and image encoders with contrastive learning. BLIP unifies understanding and generation for vision-language tasks. DALL-E generates images from text descriptions. Flamingo handles interleaved text-image inputs. T5 (Text-to-Text Transfer Transformer) frames all NLP tasks as text generation. Switch Transformer and GLaM use mixture-of-experts for scaling efficiency. Perceiver handles arbitrary input modalities through cross-attention.
What is self-supervised learning and how is it applied (SimCLR, MoCo)?
Self-supervised learning creates supervisory signals from the data itself, without human annotations, enabling representation learning on large unlabeled datasets through carefully designed pretext tasks.
Contrastive Learning Paradigm: The core idea is to learn representations that pull similar examples together and push dissimilar examples apart in the embedding space. Formally, for an anchor sample x, positive samples x+ (augmented versions) should have similar representations, while negative samples x- should be dissimilar. InfoNCE loss: L = -log[exp(sim(z, z+)/τ) / Σ_i exp(sim(z, z_i)/τ)], where sim is a similarity function and τ is a temperature parameter.
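A minimal sketch of the InfoNCE loss for a single anchor, written directly from the formula above; the embedding dimension and temperature are hypothetical.

```python
import torch
import torch.nn.functional as F

def info_nce(z_anchor, z_positive, z_negatives, temperature=0.1):
    """InfoNCE for one anchor: -log exp(sim+/t) / sum_i exp(sim_i/t)."""
    z_anchor = F.normalize(z_anchor, dim=-1)
    z_positive = F.normalize(z_positive, dim=-1)
    z_negatives = F.normalize(z_negatives, dim=-1)

    pos = (z_anchor * z_positive).sum(-1, keepdim=True) / temperature  # (1, 1)
    neg = z_anchor @ z_negatives.t() / temperature                     # (1, N)
    logits = torch.cat([pos, neg], dim=-1)
    # The positive sits at index 0, so this is cross-entropy against label 0.
    return F.cross_entropy(logits, torch.zeros(1, dtype=torch.long))

z = torch.randn(1, 128)
z_pos = z + 0.05 * torch.randn(1, 128)   # augmented view of the same sample
z_neg = torch.randn(16, 128)             # other samples acting as negatives
print(info_nce(z, z_pos, z_neg))
```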
SimCLR (Simple Contrastive Learning of Representations): Applies random data augmentations (cropping, color distortion, Gaussian blur) to create positive pairs and uses large batch sizes (4096+) to provide enough negative samples. Key components: the data augmentation pipeline, a base encoder (typically a ResNet), a projection head (an MLP mapping representations to a lower-dimensional space), and the contrastive loss. It showed that strong augmentation strategies and large batch sizes are critical to its success.
MoCo (Momentum Contrast): Addresses the batch size limitation through a momentum-updated queue of negative samples. It maintains a dynamic dictionary as a queue of encoded samples, updating the key encoder with momentum: θ_k ← mθ_k + (1-m)θ_q, where m is the momentum coefficient (typically 0.999). This provides consistent negative samples across batches without requiring large batch sizes. MoCo v2/v3 incorporate improvements from SimCLR (projection head, stronger augmentations).
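A minimal sketch of the momentum (EMA) update of the key encoder; a single linear layer stands in for the real encoder.

```python
import copy
import torch
import torch.nn as nn

# Momentum update of the key encoder from the query encoder.
encoder_q = nn.Linear(32, 16)             # query encoder (trained by gradients)
encoder_k = copy.deepcopy(encoder_q)      # key encoder (updated only by momentum)
m = 0.999                                 # momentum coefficient

with torch.no_grad():
    for p_k, p_q in zip(encoder_k.parameters(), encoder_q.parameters()):
        p_k.mul_(m).add_(p_q, alpha=1 - m)   # theta_k <- m*theta_k + (1-m)*theta_q
```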
Alternative Self-Supervised Approaches: BYOL (Bootstrap Your Own Latent) eliminates negative samples using asymmetric networks with stop-gradient operations. SwAV uses clustering assignments instead of contrastive learning. DINO applies self-distillation with Vision Transformers. MAE (Masked Autoencoder) masks random patches and learns to reconstruct them. BEiT adapts BERT's masking strategy to vision. SimSiam simplifies the architecture by removing the momentum encoder and negative pairs.
Applications and Transfer Learning: Self-supervised pre-training creates versatile representations transferable to diverse downstream tasks. The linear evaluation protocol trains only a linear classifier on frozen features to assess representation quality. Fine-tuning shows substantial improvements over supervised baselines, particularly with limited labeled data. Applications span computer vision, NLP, speech, robotics, and scientific domains where large unlabeled datasets are available but annotation is expensive.
FAQ
What advantages do CNNs offer over MLPs for image processing?
CNNs exploit convolution and pooling to detect local patterns and reduce dimensionality while preserving spatial information, whereas MLPs would require far too many parameters and cannot capture local relationships efficiently.
How do LSTM and GRU solve the vanishing gradient problem in RNNs?
They introduce gating mechanisms to control the flow of information over time: LSTM uses forget, input, and output gates, while GRU combines the forget and input gates, allowing the model to retain relevant information across long sequences.
Why has the Transformer become dominant over RNNs?
The Transformer eliminates recurrence and uses self-attention mechanisms that allow parallelization, capture long-range dependencies, and transfer easily across tasks, improving efficiency and performance on NLP and vision tasks.
How does optimization with Adam differ from classic SGD?
Adam combines momentum with adaptive learning rates by computing moving averages of the gradients and of the squared gradients, accelerating convergence compared to standard SGD, which uses a fixed learning rate and a simple weight update.
How does the learning rate affect neural network training?
A learning rate that is too high can cause oscillations or divergence; one that is too low slows convergence. Scheduling strategies, warmup, and adaptive methods help stabilize training and improve final performance.
What is the purpose of dropout during training?
Dropout randomly deactivates a fraction of the neurons during training so that the model does not depend too heavily on individual neurons, improving generalization and reducing the risk of overfitting.
What is batch normalization and what benefits does it bring?
It normalizes the inputs of each layer to zero mean and unit variance, stabilizing training, accelerating convergence, and allowing higher learning rates, with an implicit regularizing effect.
When is Leaky ReLU preferable to ReLU?
Leaky ReLU allows a small gradient for negative inputs, preventing neurons from getting stuck at zero (dying ReLU); it is useful in deep networks or when the data contains significant negative values.
Why is GELU preferred in Transformers?
GELU applies a probabilistically weighted activation, making it smoother and more stable than ReLU and improving gradient flow and performance in large models such as Transformer-based architectures.
Which advanced techniques help prevent overfitting in deep learning models?
Early stopping, data augmentation, label smoothing, spectral normalization, and gradient penalties regulate training or modify the data to improve generalization and reduce overfitting, yielding more robust models.