written by Eric J. Ma on 2024-10-25 | tags: data science coding documentation readability distribution best practices cognition tool making
This blog post, which is my pyOpenSci Fall Training keynote, explores the importance of creating clean, distributable, and well-documented data science code, emphasizing the human dimension of coding practices. In it, I discuss key concepts such as readability, cognitive load, and the toolmaker's mindset, and provide practical insights on how to make code more accessible and impactful for both the creator and other users. I also touch on the role of AI in coding and documentation.
Read on... (3795 words, approximately 19 minutes reading time)
written by Eric J. Ma on 2024-09-23 | tags: coding email htmx sqlite fastapi cloudmailin digital ocean deployment web development ai coding
In this blog post, I share my experience recreating Dan Ariely's Shortwhale using a tech stack that includes HTMX, SQLite, FastAPI, CloudMailin, and DigitalOcean. I highlight the transformative role of AI-assisted coding with Cursor, which allowed me to build core functionality in under two hours. The project, now live, was an experiment and a learning opportunity, emphasizing the speed and ease of AI-assisted web development. Curious about how AI can accelerate your web projects?
Read on... (503 words, approximately 3 minutes reading time)
written by Eric J. Ma on 2024-09-19 | tags: pixi aws codeartifacts package management devops python configuration
In this blog post, I share my experience integrating Pixi with AWS CodeArtifact, detailing the steps needed to configure Pixi for internal package publishing at work. I discuss the installation of pipx and keyrings.codeartifact, editing keyring configurations, and setting up Pixi's global configuration. The guide aims to help others overcome similar integration challenges (obviously without revealing company-specific details). Curious about how these configurations can streamline your development process?
written by Eric J. Ma on 2024-09-15 | tags: github secrets environment-variables gh-cli automation devops productivity security til
Today, I learned that we can easily sync our local .env file with GitHub secrets using the GitHub CLI (gh). This method is much faster and less error-prone than manually entering secrets through the web interface. Curious to see how it works?
Read on... (206 words, approximately 2 minutes reading time)
written by Eric J. Ma on 2024-09-14 | tags: coding productivity ai cursor developertools ide automation programming efficiency innovation
In this post, I explore how Cursor, an AI-powered IDE, transformed my coding workflow and supercharged my productivity. Learn about its standout features and why it's become my secret weapon for efficient development and writing. Are you ready to revolutionize your coding experience?
Read on... (885 words, approximately 5 minutes reading time)
written by Eric J. Ma on 2024-09-06 | tags: evaluations pytest documentation automation testing validation changes criteria staleness
In this blog post, I explore the process of writing evaluations for LLM systems using pytest, aiming to move beyond subjective assessments to more structured testing. I detail the creation of specific tests to assess if LLMs can accurately determine documentation staleness, using various models and criteria. The challenges and insights gained from setting up these evaluations reveal the complexities involved in ensuring that LLMs perform as expected. Could this method enhance the reliability of your LLM evaluations?
Read on... (1739 words, approximately 9 minutes reading time)
written by Eric J. Ma on 2024-08-31 | tags: structured generation llamabot python documentation llm pydantic software development testing structuredbot technology
In this blog post, I discuss the latest updates to LlamaBot, focusing on the StructuredBot feature introduced by Elliot Salisbury. StructuredBot leverages the JSON mode of LLMs for structured outputs, significantly simplifying the process of generating reliable and type-safe outputs without manual string parsing. I illustrate its application in an automated documentation checker and writer, enhancing productivity by integrating LLM-based and traditional programming methods. Curious about how StructuredBot can streamline your documentation process?
Read on... (1272 words, approximately 7 minutes reading time)
written by Eric J. Ma on 2024-08-25 | tags: esm3 neural network multi-modality model training data tokenization model architecture vector embedding machine learning protein modeling journal club
In this blog post, I explore the ESM3 model, focusing on its handling of missing modalities in multi-modality training. I dissect the model's architecture, input and output configurations, and the strategic use of default values for absent data. By examining the source code and conducting a toy example, I illustrate how embeddings are calculated and how they shift in vector space when modalities are missing. This deep dive reveals the model's elegant design and its potential for multi-modality integration. Has this piqued your curiosity yet?
Read on... (2206 words, approximately 12 minutes reading time)
written by Eric J. Ma on 2024-08-16 | tags: pixi tooling software development data science environment management containerization gpu packaging docker reproducibility testing
Post SciPy 2024, I had a chance to try out pixi, a new environment manager from the prefix.dev team. I went cold turkey on my laptop, removing ~/anaconda, and haven't looked back. In this (very long) blog post, I detail my experience switching from mamba to pixi, the ways that pixi makes it easier to manage environments, how pixi helps with onboarding onto a project and supports containerization, GPU access, and seamless integration with Docker, and how it facilitates publishing to PyPI and running tests. The switch has streamlined my workflow significantly. Was this enough to get you curious about how pixi can optimize your development process too?
written by Eric J. Ma on 2024-08-09 | tags: protein engineering language models sequence generation bioinformatics protein sequence evals protein structure computational biology machine learning
In part 3 of the series on protein language models, I explore the critical phase of evaluating protein sequences generated by language models, emphasizing the importance of practical, bioinformatics-based evals to narrow down candidates for lab testing. I cover both sequence-based and structure-based evals, highlighting their roles in filtering and ranking sequences to prioritize for experimental validation. Additionally, I offer insights on fostering collaboration between computational and laboratory teams to enhance protein design efforts. How can these evals and collaborations accelerate protein engineering?
Read on... (1815 words, approximately 10 minutes reading time)