The Hub for Open-Source AI
Hugging Face has become the go-to platform for sharing, discovering, and using pre-trained machine learning models. Its Transformers library lets you run state-of-the-art NLP, vision, and audio models in just a few lines of code.
Installation and Setup
pip install transformers datasets torch
Quick Start: Text Classification
The pipeline API is the fastest way to get started. It abstracts away tokenization, model loading, and inference into a single function call.
from transformers import pipeline
# Create a pipeline with a pre-trained model
classifier = pipeline("sentiment-analysis")
result = classifier([
    "I absolutely love this new update!",
    "This is the worst experience I've ever had.",
    "The weather is okay, nothing special."
])
for item in result:
    print(f"{item['label']}: {item['score']:.4f}")
# Example output (exact scores vary with the model version):
# POSITIVE: 0.9998
# NEGATIVE: 0.9987
# POSITIVE: 0.5842
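To make the abstraction concrete, here is a rough sketch of the three steps the pipeline bundles together: tokenization, model inference, and post-processing. The function name `classify` and the helper `softmax` are illustrative, and the checkpoint name is an assumption about which model the sentiment pipeline loads by default; the real pipeline handles batching, devices, and many edge cases that this sketch skips.

```python
import math


def softmax(logits):
    """Convert raw model scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(texts, model_name="distilbert-base-uncased-finetuned-sst-2-english"):
    """Approximate what pipeline("sentiment-analysis") does, step by step.

    The default model_name is an assumption, not something the pipeline guarantees.
    """
    # Imported inside the function so the softmax helper works on its own.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model.eval()

    # 1. Tokenization: text -> padded tensors of token ids
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

    # 2. Inference: token ids -> one row of logits per input text
    with torch.no_grad():
        logits = model(**inputs).logits

    # 3. Post-processing: logits -> label and probability of the best class
    results = []
    for row in logits.tolist():
        probs = softmax(row)
        best = max(range(len(probs)), key=lambda i: probs[i])
        results.append({"label": model.config.id2label[best], "score": probs[best]})
    return results
```

Calling `classify(["I absolutely love this new update!"])` should return a list of dictionaries in the same shape as the pipeline output, which is why the one-line version is usually the better starting point.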
Named Entity Recognition
Extract structured information from unstructured text:
from transformers import pipeline
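# aggregation_strategy merges word pieces into whole entities (it replaces the deprecated grouped_entities flag)
ner = pipeline("ner", aggregation_strategy="simple")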
text = "Elon Musk founded SpaceX in 2002. He is also the CEO of Tesla."
entities = ner(text)
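for entity in entities:
    print(f" {entity['word']}: {entity['entity_group']} ({entity['score']:.4f})")
# Example output (scores omitted). The default English model is trained on
# CoNLL-2003, so it uses the labels PER, ORG, LOC, and MISC and does not tag dates:
# Elon Musk: PER
# SpaceX: ORG
# Tesla: ORG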
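The entity list is often easier to use once it is grouped by type. The sketch below assumes each item is a dictionary with `word`, `entity_group`, and `score` keys, as in the loop above; the helper name `group_by_type`, the `min_score` threshold, and the sample data are illustrative and hand-written, not real model output.

```python
from collections import defaultdict


def group_by_type(entities, min_score=0.0):
    """Map each entity type to the distinct entity strings found for it."""
    grouped = defaultdict(list)
    for entity in entities:
        if entity["score"] < min_score:
            continue  # drop low-confidence predictions
        words = grouped[entity["entity_group"]]
        if entity["word"] not in words:
            words.append(entity["word"])
    return dict(grouped)


# Hand-written sample in the shape the pipeline returns (not real model output)
sample = [
    {"entity_group": "PER", "word": "Elon Musk", "score": 0.99},
    {"entity_group": "ORG", "word": "SpaceX", "score": 0.98},
    {"entity_group": "ORG", "word": "Tesla", "score": 0.97},
    {"entity_group": "ORG", "word": "Tesla", "score": 0.40},
]
print(group_by_type(sample, min_score=0.5))
# {'PER': ['Elon Musk'], 'ORG': ['SpaceX', 'Tesla']}
```

A threshold like `min_score` is a simple way to trade recall for precision; the right value depends on the model and the text, so treat 0.5 as a placeholder.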
Fine-Tuning a Model
Pre-trained models are powerful, but fine-tuning on your own data unlocks domain-specific performance.
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset
# Load a pre-trained model and tokenizer
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
# Prepare your dataset (each CSV needs a "text" column and an integer "label" column)
dataset = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=128)
tokenized_datasets = dataset.map(tokenize_function, batched=True)
# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="./logs",
    eval_strategy="epoch",  # named evaluation_strategy in transformers releases before 4.41
)
# Train
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
)
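trainer.train()
# Save the model and tokenizer to the same directory so it can be reloaded later
trainer.save_model("./fine-tuned-model")
tokenizer.save_pretrained("./fine-tuned-model")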
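Without a metric function, the per-epoch evaluation reports only the loss, which makes it hard to tell whether fine-tuning actually helped. One possible addition is a small accuracy function like the sketch below. It assumes a classification setup in which the `Trainer` hands over a pair of logits and integer labels; the body is plain Python for clarity rather than speed.

```python
def compute_metrics(eval_pred):
    """Return accuracy from the (logits, labels) pair produced during evaluation."""
    logits, labels = eval_pred
    correct = 0
    total = 0
    for row, label in zip(logits, labels):
        # Predicted class = index of the largest logit in this row
        predicted = max(range(len(row)), key=lambda i: row[i])
        correct += int(predicted == int(label))
        total += 1
    return {"accuracy": correct / total if total else 0.0}
```

Passing `compute_metrics=compute_metrics` when constructing the `Trainer` should make the accuracy appear in each evaluation log next to the loss. For large evaluation sets, a vectorized version (for example with NumPy's `argmax`) would be faster.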
The Hugging Face Hub
The Hugging Face Hub hosts over 900,000 models across NLP, computer vision, audio, and multimodal tasks. You can:
- Search for models by task, language, or license
- Upload your own models and datasets
- Run inference directly in the browser
- Collaborate with the community
from huggingface_hub import login, list_models
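# Log in with your access token (never commit a real token to source control)
login(token="your_huggingface_token")
# Fetch only the five most-downloaded text-classification models;
# without limit, the iterator pages through every matching model on the Hub
models = list_models(task="text-classification", sort="downloads", direction=-1, limit=5)
for model_info in models:
    print(model_info.id)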
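Hard-coding a token in a script makes it easy to leak. A common alternative, sketched here, is to read it from an environment variable. The helper name `login_from_env` is hypothetical; `HF_TOKEN` is the variable name that `huggingface_hub` itself looks for, so in many setups exporting it is enough and no explicit login call is needed.

```python
import os


def login_from_env(var_name="HF_TOKEN"):
    """Log in to the Hub with a token taken from an environment variable."""
    token = os.environ.get(var_name)
    if not token:
        raise RuntimeError(
            f"Set the {var_name} environment variable to your access token."
        )
    # Imported lazily so a missing token is reported before the library is needed
    from huggingface_hub import login

    login(token=token)
```

Failing early with a clear message is a deliberate choice here: a missing token otherwise tends to surface later as a less obvious authorization error.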
Beyond NLP: Vision and Audio
Hugging Face supports much more than text:
# Image classification
from transformers import pipeline
image_classifier = pipeline("image-classification")
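result = image_classifier("dog.jpg")
# Each prediction is a dict with a label and a confidence score
for prediction in result:
    print(f"{prediction['label']}: {prediction['score']:.4f}")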
# Speech recognition
from transformers import pipeline
transcriber = pipeline("automatic-speech-recognition")
result = transcriber("audio.wav")
print(result["text"])
Conclusion
Hugging Face has democratized access to cutting-edge AI. Whether you're a beginner trying your first inference or a researcher fine-tuning a large model, the platform provides the tools and community to get started quickly. Explore the Hub, experiment with different models, and contribute back to the open-source AI ecosystem.